Bug 17619 - Non-optimal code for -mfpmath=387,sse
Summary: Non-optimal code for -mfpmath=387,sse
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.0.0
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-09-22 19:19 UTC by Wolfgang Bangerth
Modified: 2004-12-01 20:59 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2004-09-22 21:15:57


Description Wolfgang Bangerth 2004-09-22 19:19:17 UTC
I know that -mfpmath=387,sse is not considered production quality. 
Nevertheless, I thought I might give it a try. So here's some  
example code that computes the scalar product between two 
vectors of length 4: 
-------------------------------- 
struct X { float array[4]; }; 
 
X a,b; 
 
float foobar () { 
  float s = 0; 
  for (unsigned int d=0; d<4; ++d) 
    s += a.array[d] * b.array[d]; 
  return s; 
} 
-------------------------- 
In the following, I will always use compile flags  
  -O3 -funroll-loops -msse3 -mtune=pentium4 -march=pentium4 
in addition to whatever setting for -mfpmath is described. 
 
With -mfpmath=387 we get this (reasonable) piece of code: 
_Z6foobarv: 
	pushl	%ebp 
	movl	%esp, %ebp 
	flds	b 
	fmuls	a 
	fadds	.LC0 
	flds	b+4 
	fmuls	a+4 
	faddp	%st, %st(1) 
	flds	b+8 
	fmuls	a+8 
	faddp	%st, %st(1) 
	flds	b+12 
	fmuls	a+12 
	faddp	%st, %st(1) 
	popl	%ebp 
	ret 
Here, we load each pair of vector elements, multiply them, and add 
the product to the accumulator. The only thing that's non-optimal is that 
the initial addition to zero in "fadds	.LC0" could be avoided (.LC0 
is a label to a zero floating-point constant). 
 
If one tries to compile with -mfpmath=sse, one gets very similar 
code, with the exception that multiplications and additions are 
performed in xmm? registers. 
 
However, here comes the catch: I thought that if I specify -mfpmath=387,sse, 
it should produce code at least as good as before. But I get this: 
_Z6foobarv: 
	pushl	%ebp 
	movl	%esp, %ebp 
	subl	$4, %esp 
	flds	b 
	fmuls	a 
	fadds	.LC0 
	movss	b+4, %xmm0 
	mulss	a+4, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	movss	b+8, %xmm0 
	mulss	a+8, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	movss	b+12, %xmm0 
	mulss	a+12, %xmm0 
	movss	%xmm0, -4(%ebp) 
	flds	-4(%ebp) 
	faddp	%st, %st(1) 
	leave 
	ret 
That is decidedly not optimal: we compute each product in an xmm 
register, but then store it to the stack, reload it into an st(?) 
register, and accumulate it there. Surely the whole thing 
can be done without these stack round-trips and be more efficient. In 
particular, using just -mfpmath=sse shows that this is possible. 
 
W.
Comment 1 Wolfgang Bangerth 2004-09-22 19:33:53 UTC
I should add that the code produced by 3.3.4 and 3.4.2 is significantly 
different, though it also shows the basic problem of moves to and from 
the stack. 
 
W. 
Comment 2 Andrew Pinski 2004-09-22 21:15:57 UTC
This is wrong:
The only thing that's nonoptimal is that the initial addition to zero in "fadds  .LC0" could be avoided 
(LC0 is a label to a zero floating point number).
You cannot do this transformation except with -ffast-math.

Other than that confirmed.
Comment 3 Wolfgang Bangerth 2004-09-22 21:22:19 UTC
> You cannot do this transformation except with -ffast-math. 
 
What do you mean by that? Certainly the addition of a zero floating-point constant 
can be avoided even without -ffast-math (or other unsafe math options). If there 
were an overflow or similar during this operation, it would already have triggered the 
relevant exceptions in the multiplication that computed the second addend. 
 
However, I don't want to dwell on this point -- the fact that we have unnecessary stack 
moves is what bothers me. 
 
W. 
Comment 4 Wolfgang Bangerth 2004-09-22 21:25:30 UTC
However, Andrew is right in that the zero addition vanishes when using 
-ffast-math. I'll open another bug report for this. 
 
W. 
Comment 5 Wolfgang Bangerth 2004-09-22 21:35:11 UTC
That new PR is now PR 17622. 
Comment 6 uros 2004-12-01 14:07:24 UTC
With "GCC: (GNU) 4.0.0 20041201 (experimental)", the following code is produced
(without -ffast-math):

_Z6foobarv:
.LFB2:
        pushl   %ebp
.LCFI0:
        movl %esp, %ebp
.LCFI1:
        subl $4, %esp
.LCFI2:
        flds b+12
        fmuls   a+12
        movss   b, %xmm1
        mulss   a, %xmm1
        addss   .LC0, %xmm1
        movss   b+4, %xmm0
        mulss   a+4, %xmm0
        addss   %xmm0, %xmm1
        movss   b+8, %xmm0
        mulss   a+8, %xmm0
        addss   %xmm0, %xmm1
        movss   %xmm1, -4(%ebp)
        flds -4(%ebp)
        faddp   %st, %st(1)
        leave
        ret

Please note that we must return the result in an x87 FP register, so the final flds
is needed in any case. I think this code is optimal.

Should we close this bug?

Uros.
Comment 7 Andrew Pinski 2004-12-01 14:27:08 UTC
Actually, the most optimal code would be:
_Z6foobarv:
.LFB2:  
        pushl   %ebp
.LCFI0: 
        movl    %esp, %ebp
.LCFI1: 
        subl    $24, %esp
.LCFI2: 
        movaps  a, %xmm0
        mulps   b, %xmm0
        movaps  %xmm0, -24(%ebp)
        fldz
        fadds   -24(%ebp)
        fadds   -20(%ebp)
        fadds   -16(%ebp)
        fadds   -12(%ebp)
        leave
        ret

But to do that we need the tree vectorizer to become better and also split the loop into two.
Comment 8 uros 2004-12-01 16:02:32 UTC
If the loop is split manually and a, b and c are put inside the foobar()
function [otherwise the vectorizer complains about an unaligned load]:

--cut here--
struct X
{
  float array[4];
};

float foobar()
{
  X a, b, c;

  float s = 0;
  for (unsigned int d = 0; d < 4; ++d)
    c.array[d] = a.array[d] * b.array[d];

  for (unsigned int d = 0; d < 4; ++d)
    s += c.array[d];

  return s;
}
--cut here--

Compiling this example with the right set of options: -O2 -march=pentium4
-ftree-vectorize -mfpmath=sse,387 -funroll-loops -fomit-frame-pointer
-ffast-math, this wonderful piece of code is produced:

_Z6foobarv:
.LFB2:
        subl    $60, %esp
.LCFI0:
        movaps  32(%esp), %xmm0
        mulps   16(%esp), %xmm0
        movaps  %xmm0, (%esp)
        flds    4(%esp)
        fadds   (%esp)
        fadds   8(%esp)
        fadds   12(%esp)
        addl    $60, %esp
        ret

I don't know why the vectorizer doesn't like the original testcase.

Uros.
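For comparison, the multiply-then-horizontal-sum shape of the code above can be written directly with SSE intrinsics (a sketch added for illustration; foobar_intrinsics and its parameters are not from the report):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct X { float array[4]; };

// Mirrors the generated code: one packed multiply of all four element
// pairs, then a scalar horizontal sum of the four products.
float foobar_intrinsics(const X &a, const X &b) {
  __m128 prod = _mm_mul_ps(_mm_loadu_ps(a.array), _mm_loadu_ps(b.array));
  float tmp[4];
  _mm_storeu_ps(tmp, prod);  // spill the vector, like the movaps above
  return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```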
Comment 9 Wolfgang Bangerth 2004-12-01 20:49:25 UTC
In reply to comment #6: 
 
> Please note, that we should return the result in fp reg, so final flds is 
> needed in any case. I think, this code is optimal. 
 
Almost, or at least I believe so. If we assume that all the operations  
with xmm registers cost the same as with the floating-point stack, then 
the result of -mfpmath=387,sse requires one stack push and pop more than 
the result of -mfpmath=387. The compiler should recognize this and then 
simply not use the SSE registers at all. 
 
I will open a new PR for this, and another one for the vectorization issue. 
 
Thanks for now 
 Wolfgang 
Comment 10 Wolfgang Bangerth 2004-12-01 20:59:05 UTC
The two spinoffs are PR 18766 and PR 18767. 
W.