18766 – Inefficient code with -mfpmath=387,sse

Summary: Inefficient code with -mfpmath=387,sse

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	4.0.0

Importance:	P2 minor
Target Milestone:	4.4.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra, ssemmx

Depends on:
Blocks:

Reported:	2004-12-01 20:54 UTC by Wolfgang Bangerth
Modified:	2008-08-03 16:59 UTC
CC List:	1 user

See Also:
Host:
Target:	i?86--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2006-01-21 03:08:44

Description Wolfgang Bangerth 2004-12-01 20:54:15 UTC

This is spinoff #1 of PR 17619: 
 
Take this simple piece of code: 
--------------------- 
float a[2],b[2];  
  
float foobar () {  
  return a[0] * b[0] 
    + a[1] * b[1];  
}  
--------------------- 
 
Compiled with  
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 -mfpmath=387 
we get this code: 
--------------------- 
	pushl	%ebp 
	movl	%esp, %ebp 
	flds	b 
	fmuls	a 
	flds	b+4 
	fmuls	a+4 
	faddp	%st, %st(1) 
	popl	%ebp 
	ret 
----------------------------- 
That's certainly optimal. 
 
On the other hand, if we let the compiler use sse registers as well (though 
we do not force it, we simply want the most efficient code), the code 
we get with flags 
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 -mfpmath=387,sse 
looks like this: 
----------------------------- 
	pushl	%ebp 
	movl	%esp, %ebp 
	subl	$4, %esp 
	flds	b 
	fmuls	a 
	movss	b+4, %xmm0 
	mulss	a+4, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	leave 
	ret 
--------------------------- 
The two sequences are almost equivalent, except that the second one 
stores the SSE product to the stack and reloads it with flds, because 
the system ABI requires return values to be passed through st(0). 
 
In essence, the compiler should just generate the first code sequence, 
even if given the flag -mfpmath=387,sse. 
 
W.
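
For anyone who wants to reproduce the comparison, here is a minimal sketch of the test case as a standalone file. The initial values are an assumption added only so the result can be checked (the original test case leaves both arrays zero-initialized), and the file name and the -m32 flag in the comment are likewise not part of the original report.

```c
/* Reduced test case from this report. The initializers are illustrative
 * (an assumption, not in the original): they are exactly representable
 * in float, so the expected result 1.5*4.0 + 2.0*0.25 = 6.5 is exact. */
float a[2] = { 1.5f, 2.0f };
float b[2] = { 4.0f, 0.25f };

/* Two-element dot product; on i386 the float result is returned in st(0). */
float foobar(void)
{
    return a[0] * b[0]
         + a[1] * b[1];
}

/*
 * Comparing the generated assembly. The file name pr18766.c is
 * hypothetical, and -m32 is assumed to be needed only on an x86-64 host
 * with 32-bit support installed:
 *
 *   FLAGS="-O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4"
 *   gcc -m32 -S $FLAGS -mfpmath=387     pr18766.c -o x87.s
 *   gcc -m32 -S $FLAGS -mfpmath=387,sse pr18766.c -o mixed.s
 *   diff x87.s mixed.s
 */
```

The globals are deliberately left non-const and externally visible so that the compiler cannot fold the whole expression into a constant.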

Comment 1 Andrew Pinski 2004-12-06 21:34:55 UTC

Confirmed.

Comment 2 Andrew Pinski 2005-10-24 03:08:44 UTC

What is happening is that the register allocator selects the return position, which is an x87 register, for the last add. The add is therefore done in x87 instead of SSE, which causes the rest to go bonkers: the SSE product has to be stored to memory and reloaded into the x87 stack.
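
One way to probe this explanation would be to compare the original function against a variant that stores the sum to memory instead of returning it, so that the final add is not tied to the st(0) return register. The sketch below is only a suggested experiment: the names dot_returned, dot_stored and result are hypothetical, the initial values are illustrative, and no claim is made here about what any particular GCC version emits for either function.

```c
/* Illustrative inputs (an assumption, not from the report); all values
 * are exactly representable, so both functions compute exactly 6.5. */
float a[2] = { 1.5f, 2.0f };
float b[2] = { 4.0f, 0.25f };

/* Destination for the store-only variant (hypothetical name). */
float result;

/* Same shape as the test case in the description: the sum is the return
 * value, which the i386 ABI places in st(0). */
float dot_returned(void)
{
    return a[0] * b[0] + a[1] * b[1];
}

/* Variant with no floating-point return value: the sum only has to end
 * up in memory, so the last add is not constrained by the return
 * register. */
void dot_stored(void)
{
    result = a[0] * b[0] + a[1] * b[1];
}
```

Compiling both with the -mfpmath=387,sse flags from the description and comparing the assembly of the two functions should show whether the return-register constraint is what pulls the final add into x87.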

Comment 3 Uroš Bizjak 2008-08-03 16:59:43 UTC

GNU C (GCC) version 4.4.0 20080803 (experimental) is now much smarter; after several rewrites of the math ops, it now generates:

foobar:
	pushl	%ebp
	movl	%esp, %ebp
	flds	a
	fmuls	b
	flds	a+4
	fmuls	b+4
	faddp	%st, %st(1)
	popl	%ebp
	ret

So, fixed for 4.4.