Bug 18766

Summary: Inefficient code with -mfpmath=387,sse
Product: gcc
Reporter: Wolfgang Bangerth <bangerth>
Component: target
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED
Severity: minor
CC: gcc-bugs
Priority: P2
Keywords: missed-optimization, ra, ssemmx
Version: 4.0.0
Target Milestone: 4.4.0
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2006-01-21 03:08:44

Description Wolfgang Bangerth 2004-12-01 20:54:15 UTC
This is spinoff #1 of PR 17619: 
 
Take this simple piece of code: 
--------------------- 
float a[2],b[2];  
  
float foobar () {  
  return a[0] * b[0] 
    + a[1] * b[1];  
}  
--------------------- 
 
Compiled with  
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 -mfpmath=387 
we get this code: 
--------------------- 
	pushl	%ebp 
	movl	%esp, %ebp 
	flds	b 
	fmuls	a 
	flds	b+4 
	fmuls	a+4 
	faddp	%st, %st(1) 
	popl	%ebp 
	ret 
----------------------------- 
That's certainly optimal. 
 
On the other hand, if we let the compiler use sse registers as well (though 
we do not force it, we simply want the most efficient code), the code 
we get with flags 
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 -mfpmath=387,sse 
looks like this: 
----------------------------- 
	pushl	%ebp 
	movl	%esp, %ebp 
	subl	$4, %esp 
	flds	b 
	fmuls	a 
	movss	b+4, %xmm0 
	mulss	a+4, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	leave 
	ret 
--------------------------- 
The code is almost equivalent, except that the second product makes an extra
round trip through the stack: it is computed in %xmm0, stored to -4(%ebp), and
reloaded with flds, because the system ABI returns values in st(0).
 
In essence, the compiler should just generate the first code sequence, 
even if given the flag -mfpmath=387,sse. 
 
W.
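
For anyone who wants to compare the two code sequences locally, here is a minimal self-contained variant of the test case above. This is only a sketch: the helper set_inputs(), the file names, and the -S/-o options in the comment are assumptions added for illustration and are not part of the original report. On an x86-64 host, -m32 would presumably be needed as well.

```c
/* Test case from this report, unchanged apart from the (void) prototype.
 *
 * Assumed commands for dumping the assembly of both variants (file names
 * and the -S/-o options are illustrative, not taken from the report):
 *   gcc -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 \
 *       -mfpmath=387 -S t.c -o t-387.s
 *   gcc -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 \
 *       -mfpmath=387,sse -S t.c -o t-mixed.s
 */
float a[2], b[2];

float foobar(void) {
  return a[0] * b[0]
    + a[1] * b[1];
}

/* Hypothetical helper (not in the report): fill the arrays with known
 * values so the function can also be executed and checked. */
void set_inputs(float a0, float a1, float b0, float b1) {
  a[0] = a0; a[1] = a1;
  b[0] = b0; b[1] = b1;
}
```

Diffing the two .s files should show whether the extra movss/flds pair through the stack is still present with the compiler version at hand.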
Comment 1 Andrew Pinski 2004-12-06 21:34:55 UTC
Confirmed.
Comment 2 Andrew Pinski 2005-10-24 03:08:44 UTC
What is happening is that the register allocator selects the return position for the last add, which is an x87 register, so the add is done in x87 instead of SSE. That causes the rest to go bonkers.
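
To make the point above concrete, here is the same computation split into named temporaries. This is only an illustrative sketch: the names foobar_steps, t0, t1 and sum are made up, and the comments map each step to the -mfpmath=387,sse listing in the description rather than to what any particular compiler version will emit for this exact source.

```c
float a[2], b[2];

/* Same arithmetic as foobar(), one operation per statement. */
float foobar_steps(void) {
  float t0 = a[0] * b[0];  /* x87 in the mixed listing: flds b / fmuls a */
  float t1 = a[1] * b[1];  /* SSE in the mixed listing: movss / mulss into %xmm0 */
  float sum = t0 + t1;     /* tied to the return register st(0), so done in x87 (faddp) */
  return sum;              /* t1 therefore has to move %xmm0 -> stack -> st(0) first */
}
```

Because only the final add is tied to st(0), the choice of register class for t1 is what decides whether the extra store and reload appear.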
Comment 3 Uroš Bizjak 2008-08-03 16:59:43 UTC
GNU C (GCC) version 4.4.0 20080803 (experimental) is now much smarter; several rewrites of the math ops now result in:

foobar:
	pushl	%ebp
	movl	%esp, %ebp
	flds	a
	fmuls	b
	flds	a+4
	fmuls	b+4
	faddp	%st, %st(1)
	popl	%ebp
	ret

So, fixed for 4.4.