Bug 19780

Summary: Floating point computation far slower for -mfpmath=sse
Product: gcc
Component: rtl-optimization
Reporter: Uroš Bizjak <ubizjak>
Assignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED WONTFIX
Severity: enhancement
Priority: P2
Keywords: missed-optimization, ra
Version: 4.0.0
Target Milestone: ---
Target: i686-*-*
CC: amacleod, bonzini, dnovillo, gcc-bugs, hjl.tools, joey.ye, rguenth, weiliang.lin, xuepeng.guo
Last reconfirmed: 2006-01-15 20:36:24

Description Uroš Bizjak 2005-02-03 16:24:20 UTC
The testcase from PR 8126 runs ~20% slower when compiled with -mfpmath=sse:

--cut here--
#include <stdio.h>

typedef float real;

int
main (int argc, char *argv[])
{
  int i;

  real v1x, v1y, v1z;
  real v2x, v2y, v2z;
  real v3x, v3y, v3z;

  printf ("Start?\n");

  v1x = 1.;
  v1y = 0.;
  v1z = 0.;

  v2x = 0.;
  v2y = 1.;
  v2z = 0.;

  for (i = 0; i < 100000000; i++)
    {
      v3x = v1y * v2z - v1z * v2y;
      v3y = v1z * v2x - v1x * v2z;
      v3z = v1x * v2y - v1y * v2x;

      v1x = v2x;
      v1y = v2y;
      v1z = v2z;

      v2x = v3x;
      v2y = v3y;
      v2z = v3z;
    }

  printf ("Stop!\n");
  printf ("Result = %f, %f, %f\n", v3x, v3y, v3z);

  return 0;
}
--cut here--

gcc -O3 -march=pentium4
real    0m0.603s
user    0m0.602s
sys     0m0.002s

gcc -O3 -march=pentium4 -mfpmath=sse
real    0m0.726s
user    0m0.727s
sys     0m0.000s
Comment 1 Uroš Bizjak 2005-02-03 16:40:45 UTC
The first thing to notice is this:

	...
	mulss	%xmm7, %xmm1
	movss	-12(%ebp), %xmm0
	mulss	%xmm4, %xmm0
	subss	%xmm0, %xmm1
	movss	-12(%ebp), %xmm0
	mulss	%xmm5, %xmm0
	mulss	%xmm6, %xmm3
	...

Memory access is expensive, but in the -mfpmath=387 case we get equivalent code.
Comment 2 Andrew Pinski 2005-09-29 04:05:34 UTC
Confirmed.  This is weird, and it is an RA issue.  I don't understand why the RA is spilling to the stack,
as there are enough SSE registers to hold the 6 values.
Comment 3 Andrew Pinski 2005-09-29 04:06:51 UTC
Oh, and this looks very related to the two-operand instruction issue.
PPC gives optimal code:
L2:
        fmul f0,f6,f9
        fmul f13,f7,f10
        fmul f12,f8,f11
        fmsub f29,f8,f10,f0
        fmsub f30,f6,f11,f13
        fmsub f31,f7,f9,f12
        fmr f6,f10
        fmr f7,f11
        fmr f8,f9
        fmr f10,f31
        fmr f11,f29
        fmr f9,f30
        bdnz L2
Comment 4 Paolo Bonzini 2006-08-11 10:22:27 UTC
Except that PPC uses 12 registers f0 f6 f7 f8 f9 f10 f11 f12 f13 f29 f30 f31.  Not that we can blame GCC for using 12, but it is not a fair comparison. :-)

In fact, 8 registers are enough, but it is quite tricky to make do with them.
The problem is that v3[xyz] is live across multiple BBs, which makes the register allocator's task considerably harder.  Even if we change v3[xyz] in the printf to v2[xyz], cfg-cleanup (between vrp1 and dce2) substitutes v3[xyz] back and, in doing so, extends its lifetime.
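
To illustrate, this is a sketch of the source change being referred to (not part of the original report); printing v2[xyz] is equivalent because, at loop exit, v2[xyz] holds the last computed v3[xyz]:

  printf ("Stop!\n");
  /* Sketch: print v2[xyz] instead of v3[xyz]; the values are identical
     at loop exit because of the v2 = v3 copies in the loop body.  */
  printf ("Result = %f, %f, %f\n", v2x, v2y, v2z);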

(Since it's all about having short lifetimes, CCing amacleod@gcc.gnu.org)

BTW, here is the optimal code (if it works...):

ENTER basic block: v1[xyz], v2[xyz] are live (6 registers)

      v3x = v1y * v2z - v1z * v2y;

v3x is now live, and it takes 2 registers to compute this statement.  Here we hit a maximum of 8 live registers.  After the statement 7 registers are live.

      v3y = v1z * v2x - v1x * v2z;

v1z dies here, so we need only one additional register for this statement.  We again hit a maximum of 8 live registers.  At the end of the statement, 7 registers are again live (7, minus 1 for v1z, which dies, plus 1 for v3y).

      v3z = v1x * v2y - v1y * v2x;

Likewise, v1x and v1y die here, so we need only 7 registers and, at the end of the statement, 6 registers remain live.

Optimal code would be like this (%xmm0..2 = v1[xyz], %xmm3..5 = v2[xyz])

v3x = v1y * v2z - v1z * v2y
      movss %xmm1, %xmm6
      mulss %xmm5, %xmm6 ;; v1y * v2z in %xmm6
      movss %xmm2, %xmm7
      mulss %xmm4, %xmm7 ;; v1z * v2y in %xmm7
      subss %xmm7, %xmm6 ;; v3x in %xmm6

v3y = v1z * v2x - v1x * v2z
      mulss %xmm3, %xmm2 ;; v1z dies, v1z * v2x in %xmm2
      movss %xmm0, %xmm7
      mulss %xmm5, %xmm7 ;; v1x * v2z in %xmm7
      subss %xmm7, %xmm2 ;; v3y in %xmm2

v3z = v1x * v2y - v1y * v2x
      mulss %xmm4, %xmm0 ;; v1x dies, v1x * v2y in %xmm0
      mulss %xmm3, %xmm1 ;; v1y dies, v1y * v2x in %xmm1
      subss %xmm1, %xmm0 ;; v3z in %xmm0

Note now how we should reorder the final moves to obtain optimal code!

      movss %xmm0, %xmm7 ;; save v3z... alternatively, do it before the subss

      movss %xmm3, %xmm0 ;; v1x = v2x
      movss %xmm6, %xmm3 ;; v2x = v3x (in %xmm6)
      movss %xmm4, %xmm1 ;; v1y = v2y
      movss %xmm2, %xmm4 ;; v2y = v3y (in %xmm2)
      movss %xmm5, %xmm2 ;; v1z = v2z
      movss %xmm7, %xmm5 ;; v2z = v3z (saved in %xmm7)

(Note that doing the reordering manually does not help...) :-(  Out of curiosity, can somebody check out yara-branch to see how it fares?
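
For reference, a sketch of what the manual reordering might look like at the source level, assuming it means interleaving the copies at the end of the loop body in the pairwise order used by the moves above (this interpretation is an assumption, not something stated in the report):

      /* Sketch: interleaved copies; move each v2 component into v1 and
         immediately install the corresponding v3 component into v2.  */
      v1x = v2x;  v2x = v3x;
      v1y = v2y;  v2y = v3y;
      v1z = v2z;  v2z = v3z;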


---

By comparison, the x87 case is easier, because we never need more than 8 registers and fxch makes it much simpler to write the compensation code:

v3x = v1y * v2z - v1z * v2y
                            ;; v1x v1y v1z v2x v2y v2z
       fld %st(1)           ;; v1y v1x v1y v1z v2x v2y v2z
       fmul %st(6), %st(0)  ;; v1y*v2z v1x v1y v1z v2x v2y v2z
       fld %st(3)           ;; v1z v1y*v2z v1x v1y v1z v2x v2y v2z
       fmul %st(6), %st(0)  ;; v1z*v2y v1y*v2z v1x v1y v1z v2x v2y v2z
       fsubp %st(0), %st(1) ;; v3x v1x v1y v1z v2x v2y v2z

v3y = v1z * v2x - v1x * v2z
       fld %st(4)           ;; v2x v3x v1x v1y v1z v2x v2y v2z
       fmulp %st(0), %st(4) ;; v3x v1x v1y v1z*v2x v2x v2y v2z
       fld %st(1)           ;; v1x v3x v1x v1y v1z*v2x v2x v2y v2z
       fmul %st(7), %st(0)  ;; v1x*v2z v3x v1x v1y v1z*v2x v2x v2y v2z
       fsubp %st(0), %st(4) ;; v3x v1x v1y v3y v2x v2y v2z

v3z = v1x * v2y - v1y * v2x
       fld %st(5)           ;; v2y v3x v1x v1y v3y v2x v2y v2z
       fmulp %st(0), %st(2) ;; v3x v1x*v2y v1y v3y v2x v2y v2z
       fld %st(4)           ;; v2x v3x v1x*v2y v1y v3y v2x v2y v2z
       fmul %st(3), %st(0)  ;; v1y*v2x v3x v1x*v2y v1y v3y v2x v2y v2z
       fsubp %st(0), %st(2) ;; v3x v3z v1y v3y v2x v2y v2z
       fstp %st(2)          ;; v3z v3x v3y v2x v2y v2z

       fxch %st(5)          ;; v2z v3x v3y v2x v2y v3z
       fxch %st(2)          ;; v3y v3x v2z v2x v2y v3z
       fxch %st(4)          ;; v2y v3x v2z v2x v3y v3z
       fxch %st(1)          ;; v3x v2y v2z v2x v3y v3z
       fxch %st(3)          ;; v2x v2y v2z v3x v3y v3z

(well, the fxch instructions should be scheduled, but it is still possible to do this without spilling).

Paolo
Comment 5 Richard Biener 2006-10-24 13:28:04 UTC
With more registers (x86_64) the stack moves are gone, but: (!)

rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -O2 -o t t.c -mfpmath=387
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> /usr/bin/time ./t
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000
5.31user 0.00system 0:05.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -O2 -o t t.c        
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> /usr/bin/time ./t
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000
9.96user 0.05system 0:10.06elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps

That is almost twice as fast with 387 math as with SSE math on x86_64!

The inner loop is

.L7:
        movaps  %xmm3, %xmm6
        movaps  %xmm1, %xmm5
        movaps  %xmm0, %xmm4
.L2:
        movaps  %xmm2, %xmm3
        mulss   %xmm6, %xmm2
        movaps  %xmm7, %xmm0
        addl    $1, %eax
        mulss   %xmm4, %xmm3
        movaps  %xmm7, %xmm1
        mulss   %xmm5, %xmm0
        cmpl    $1000000000, %eax
        mulss   %xmm6, %xmm1
        movaps  %xmm4, %xmm7
        subss   %xmm0, %xmm3
        movaps  %xmm8, %xmm0
        mulss   %xmm4, %xmm0
        subss   %xmm0, %xmm1
        movaps  %xmm8, %xmm0
        movaps  %xmm6, %xmm8
        mulss   %xmm5, %xmm0
        subss   %xmm2, %xmm0
        movaps  %xmm5, %xmm2
        jne     .L7

vs.

.L7:
        fxch    %st(3)
        fxch    %st(2)
.L2:
        fld     %st(2)
        addl    $1, %eax
        cmpl    $1000000000, %eax
        fmul    %st(1), %st
        flds    76(%rsp)
        fmul    %st(5), %st
        fsubrp  %st, %st(1)
        flds    76(%rsp)
        fmul    %st(3), %st
        flds    72(%rsp)
        fmul    %st(3), %st
        fsubrp  %st, %st(1)
        flds    72(%rsp)
        fmul    %st(6), %st
        fxch    %st(5)
        fmul    %st(4), %st
        fsubrp  %st, %st(5)
        fxch    %st(2)
        fstps   76(%rsp)
        fxch    %st(2)
        fstps   72(%rsp)
        jne     .L7

(testing done on AMD Athlon fam 15 model 35 stepping 2)
Comment 6 Uroš Bizjak 2006-10-25 12:04:57 UTC
(In reply to comment #5)
> With more registers (x86_64) the stack moves are gone, but: (!)

> (testing done on AMD Athlon fam 15 model 35 stepping 2)

On Xeon 3.6, SSE is now faster:

gcc -O2 -march=pentium4 -mfpmath=387 pr19780.c 
time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.805s
user    0m0.804s
sys     0m0.000s

gcc -O2 -march=pentium4 -mfpmath=sse pr19780.c 
time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.707s
user    0m0.704s
sys     0m0.004s

vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.60GHz
stepping        : 10
cpu MHz         : 3600.970
cache size      : 2048 KB

The question now is: why is the Athlon so slow with SFmode SSE?
Comment 7 Uroš Bizjak 2006-10-25 12:18:21 UTC
(In reply to comment #6)

> On Xeon 3.6, SSE is now faster:

... but for -ffast-math:

SSE: user    0m0.756s
x87: user    0m0.612s

Yes, x87 is faster for -ffast-math by some 20%.
Comment 8 Paolo Bonzini 2007-04-03 12:43:08 UTC
What's the generated code for -ffast-math? In principle, I don't see a reason why it should make any difference...
Comment 9 Uroš Bizjak 2007-04-03 13:32:38 UTC
(In reply to comment #8)
> what's the generated code for -ffast-math? in principle i don't see a reason
> why it should make any difference...

While trying to answer your question, I played a bit with the compile flags, and things got really strange:

[uros@localhost test]$ gcc -O2 -mfpmath=387 pr19780.c 
[uros@localhost test]$ time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m1.211s
user    0m1.212s
sys     0m0.004s
[uros@localhost test]$ gcc -O2 -mfpmath=387 -msse pr19780.c 
[uros@localhost test]$ time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.555s
user    0m0.552s
sys     0m0.004s

Note that -msse alone should have no effect on the calculations. The difference between the asm dumps is:

--- pr19780.s   2007-04-03 14:28:14.000000000 +0200
+++ pr19780.s_  2007-04-03 14:28:01.000000000 +0200
@@ -17,69 +17,61 @@
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
-       subl    $84, %esp
+       subl    $100, %esp
        movl    $.LC0, (%esp)
        call    puts
        xorl    %eax, %eax
-       fldz
        fld1
        fsts    -16(%ebp)
+       fldz
+       fsts    -12(%ebp)
+       fld     %st(0)
        fld     %st(1)
-       fld     %st(2)
-       fld     %st(3)
        jmp     .L2
        .p2align 4,,7
 .L7:
-       fstp    %st(5)
-       fstp    %st(0)
-       fxch    %st(1)
-       fxch    %st(2)
-       fxch    %st(3)
-       fxch    %st(4)
        fxch    %st(3)
+       fxch    %st(2)
 .L2:
-       fld     %st(1)
+       fld     %st(2)
        addl    $1, %eax
-       fmul    %st(3), %st
+       fmul    %st(1), %st
        cmpl    $100000000, %eax
-       fstps   -12(%ebp)
+       flds    -12(%ebp)
+       fmul    %st(5), %st
+       fsubrp  %st, %st(1)
+       flds    -12(%ebp)
+       fmul    %st(3), %st
        flds    -16(%ebp)
-       fmul    %st(1), %st
-       fsubrs  -12(%ebp)
-       fstps   -12(%ebp)
-       fmul    %st(4), %st
-       fld     %st(3)
        fmul    %st(3), %st
        fsubrp  %st, %st(1)
        flds    -16(%ebp)
-       fmulp   %st, %st(4)
-       fxch    %st(1)
+       fmul    %st(6), %st
+       fxch    %st(5)
        fmul    %st(4), %st
-       fsubrp  %st, %st(3)
-       flds    -16(%ebp)
-       fld     %st(3)
+       fsubrp  %st, %st(5)
        fxch    %st(2)
-       fsts    -16(%ebp)
-       flds    -12(%ebp)
+       fstps   -12(%ebp)
+       fxch    %st(2)
+       fstps   -16(%ebp)
        jne     .L7
-       fstp    %st(0)
-       fstp    %st(5)
-       fstp    %st(0)
-       fstp    %st(0)
-       fstp    %st(0)
+       fstp    %st(3)
+       fxch    %st(1)
        movl    $.LC3, (%esp)
        fstps   -40(%ebp)
+       fxch    %st(1)
        fstps   -56(%ebp)
+       fstps   -72(%ebp)
        call    puts
        flds    -40(%ebp)
        fstpl   20(%esp)
        flds    -56(%ebp)
        fstpl   12(%esp)
-       flds    -12(%ebp)
+       flds    -72(%ebp)
        fstpl   4(%esp)
        movl    $.LC4, (%esp)
        call    printf
-       addl    $84, %esp
+       addl    $100, %esp
        xorl    %eax, %eax
        popl    %ecx
        popl    %ebp

where (+++) is with -msse.
Comment 10 Paolo Bonzini 2007-04-03 13:36:44 UTC
I would look at the lreg output, which contains the results of regclass.
Comment 11 Uroš Bizjak 2007-04-05 10:58:15 UTC
(In reply to comment #10)
> I would look at the lreg output, which contains the results of regclass.

No, the difference is due to the SSA pass, which generates:

  # v1z_10 = PHI <v1z_13(2), v1z_32(3)>
  # v1y_9 = PHI <v1y_12(2), v1y_31(3)>
  # v1x_8 = PHI <v1x_11(2), v1x_30(3)>
  # i_7 = PHI <i_17(2), i_36(3)>
  # v3z_6 = PHI <v3z_18(D)(2), v3z_29(3)>
  # v3y_5 = PHI <v3y_19(D)(2), v3y_26(3)>
  # v3x_4 = PHI <v3x_20(D)(2), v3x_23(3)>
  # v2z_3 = PHI <v2z_16(2), v2z_35(3)>
  # v2y_2 = PHI <v2y_15(2), v2y_34(3)>
  # v2x_1 = PHI <v2x_14(2), v2x_33(3)>

without -msse and

  # v3z_10 = PHI <v3z_18(D)(2), v3z_29(3)>
  # v3y_9 = PHI <v3y_19(D)(2), v3y_26(3)>
  # v3x_8 = PHI <v3x_20(D)(2), v3x_23(3)>
  # v2z_7 = PHI <v2z_16(2), v2z_35(3)>
  # v2y_6 = PHI <v2y_15(2), v2y_34(3)>
  # v2x_5 = PHI <v2x_14(2), v2x_33(3)>
  # v1z_4 = PHI <v1z_13(2), v1z_32(3)>
  # v1y_3 = PHI <v1y_12(2), v1y_31(3)>
  # v1x_2 = PHI <v1x_11(2), v1x_30(3)>
  # i_1 = PHI <i_17(2), i_36(3)>

with the -msse compile flag. Note the different variable suffixes, which create a different sort order. This is (IMO) due to the fact that -msse enables lots of additional __builtin functions (these can be seen in the 001.tu dump). Since we don't have an x87 scheduler, the results become quite unpredictable and depend on the -msseX settings. It just _happens_ that the second form better suits the stack nature of the x87.

So, why does the SSA pass have to interfere with the computation dataflow? This interference makes things worse and effectively takes away the user's control over the flow of data.
Comment 12 Uroš Bizjak 2007-04-05 11:00:57 UTC
(In reply to comment #11)

> with -msse compile flag. Note different variable suffixes that create different
> sort order. This is (IMO) due to fact that -msse enables lots of additional
> __builtin functions (these can be seen in 001.tu dump).

I forgot to add that -ffast-math simply enables more builtins, and again a different sort order is introduced.
Comment 13 Paolo Bonzini 2007-04-05 11:01:33 UTC
So this is an unstable sorting.  Adding dnovillo.
Comment 14 Diego Novillo 2007-04-05 12:49:41 UTC
(In reply to comment #11)

> So, why does SSA pass have to interfere with computation dataflow? This
> interferece makes things worse and effectively takes away user's control on the
> flow of data.
> 

Huh?  How is it relevant whether PHIs are in different order?  Conceptually, the ordering of PHI nodes in a basic block is completely irrelevant.  Some pass is getting confused when it shouldn't.  Transformations should not depend on how PHI nodes are emitted in a block as all PHI nodes are always evaluated in parallel.
Comment 15 Paolo Bonzini 2007-04-05 13:03:55 UTC
Transformations do not, but out-of-SSA could.  Is there a way to ensure a stable ordering of PHI functions, unlike what Uros's dumps suggest?
Comment 16 Diego Novillo 2007-04-05 13:15:41 UTC
Subject: Re: Floating point computation far slower for -mfpmath=sse

bonzini at gnu dot org wrote on 04/05/07 08:03:

> Is there a way to ensure ordering of PHI functions unlike what Uros's
> dumps suggest?

No.

I also don't see how PHI ordering would affect out-of-ssa.  It just
emits copies.  If the ordering of those copies is affecting things like
register pressure, then RA should be looked at.
Comment 17 Andrew Macleod 2007-04-05 14:23:46 UTC
Is the output from .optimized different (once the SSA version numbers have been stripped)?  Those PHIs should be irrelevant; the question is whether the different versioning has any effect.

The only way I can think of that out-of-ssa could produce different results is if it had to choose between two same-cost coalesces and the versioning resulted in them being in different places in the coalesce list.  Check the .optimized output; if the code is equivalent, the problem is after that stage.
Comment 18 Uroš Bizjak 2007-04-05 16:39:06 UTC
(In reply to comment #17)
> Is the output from .optimized different?  (once the ssa versions numbers have
> been stripped).   Those PHIs should be irrelevant, the question is whether the
> different versioning has any effect.    
> 
> The only way I can think that out-of-ssa could produce different results is if
> it had to choose between two same-cost coalesces, and the versioning resulted
> in them being in different places in the coalesce list.  Check the .optimized
> output and if the code is equivalent, the problem is after that stage.

They are _not_ equivalent. We have:

--cut here--
<bb 2>:
  __builtin_puts (&"Start?"[0]);
  v2x = 0.0;
  v2y = 1.0e+0;
  v2z = 0.0;
  i = 0;
  v1x = 1.0e+0;
  v1y = 0.0;
  v1z = 0.0;

<L0>:;
  v3x = v1y * v2z - v1z * v2y;
  v3y = v1z * v2x - v1x * v2z;
  v3z = v1x * v2y - v1y * v2x;
  i = i + 1;
  v1z = v2z;
  v1y = v2y;
  v1x = v2x;
  v2z = v3z;
  v2y = v3y;
  v2x = v3x;
  if (i != 100000000) goto <L0>; else goto <L2>;

<L2>:;
  __builtin_puts (&"Stop!"[0]);
  printf (&"Result = %f, %f, %f\n"[0], (double) v3x, (double) v3y, (double) v3z);
  return 0;
--cut here--

=====VS=====

--cut here--
<bb 2>:
  __builtin_puts (&"Start?"[0]);
  i = 0;
  v1x = 1.0e+0;
  v1y = 0.0;
  v1z = 0.0;
  v2x.43 = 0.0;
  v2y.44 = 1.0e+0;
  v2z.45 = 0.0;

<L0>:;
  v3x = v1y * v2z.45 - v1z * v2y.44;
  v3y = v1z * v2x.43 - v1x * v2z.45;
  v3z = v1x * v2y.44 - v1y * v2x.43;
  i = i + 1;
  v2z = v3z;
  v2y = v3y;
  v2x = v3x;
  v1z = v2z.45;
  v1y = v2y.44;
  v1x = v2x.43;
  if (i != 100000000) goto <L8>; else goto <L2>;

<L8>:;
  v2x.43 = v2x;
  v2y.44 = v2y;
  v2z.45 = v2z;
  goto <bb 3> (<L0>);

<L2>:;
  __builtin_puts (&"Stop!"[0]);
  printf (&"Result = %f, %f, %f\n"[0], (double) v3x, (double) v3y, (double) v3z);
  return 0;
--cut here--
Comment 19 Andrew Macleod 2007-04-05 17:24:23 UTC
What are you using for a compiler? I'm using a mainline from mid-March, and with it, my .optimized files diff exactly the same, and I get the aforementioned time differences in the executables.
(sse.c and sse-bad.c are the same, just different names to get different output files)

2007-03-13/gcc> diff sse.c sse-bad.c

2007-03-13/gcc>./xgcc -B./ sse.c -fdump-tree-optimized -O3 -march=pentium4 -o sse

2007-03-13/gcc>./xgcc -B./ sse-bad.c -fdump-tree-optimized -O3 -march=pentium4 -mfpmath=sse -o sse-bad

2007-03-13/gcc>ls -l sse*optimized

-rw-rw-r--  1 amacleod amacleod 864 Apr  5 12:16 sse-bad.c.116t.optimized
-rw-rw-r--  1 amacleod amacleod 864 Apr  5 12:16 sse.c.116t.optimized

2007-03-13/gcc>diff sse.c.116t.optimized sse-bad.c.116t.optimized

2007-03-13/gcc>time ./sse

Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.630s
user    0m0.572s
sys     0m0.000s

2007-03-13/gcc>time ./sse-bad

Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.883s
user    0m0.780s
sys     0m0.000s


Is this just with earlier compilers? What version are you using?  It at least seems to indicate that the problem isn't before out-of-ssa, since the timing difference is still there with identical .optimized output.
Comment 20 Uroš Bizjak 2007-04-05 19:39:52 UTC
(In reply to comment #19)
> what are you using for a compiler? Im using a mainline from mid march, and 

gcc version 4.3.0 20070404 (experimental) on i686-pc-linux-gnu

> with it, my .optimized files diff exactly the same, and I get the
> aforementioned time differences in the executables.

This is because -march=pentium4 enables all the SSE builtins in both cases.

> (sse.c and sse-bad.c are same, just different names to get different output
> files)
> 
> 2007-03-13/gcc> diff sse.c sse-bad.c
> 
> 2007-03-13/gcc>./xgcc -B./ sse.c -fdump-tree-optimized -O3 -march=pentium4 -o
> sse
> 
> 2007-03-13/gcc>./xgcc -B./ sse-bad.c -fdump-tree-optimized -O3 -march=pentium4
> -mfpmath=sse -o sse-bad

This is the known effect of SFmode SSE being slower than SFmode x87. But again, you have enabled the SSE(2) builtins via -march=pentium4.

Please try to compile using only "-O2" and "-O2 -msse". x87 math will be used in both cases, but .optimized will show the difference. You can also try to compile with and without -ffast-math.

IMO it is not acceptable for tree dumps to depend on target compile flags in any way...
Comment 21 Uroš Bizjak 2007-04-06 07:37:27 UTC
Strange things happen.

I have fully removed the gcc build directory and bootstrapped gcc from scratch. To my surprise, the difference between -msse and no -msse is now gone, and the optimized dumps are now the same. For reference, the compiler identifies itself as "gcc version 4.3.0 20070406 (experimental)".

Regarding this bug - SSE performance vs. x87 performance is clearly dependent on the target processor. There is nothing gcc can do, and even without memory accesses, SSE is slower than x87 on some targets (ref: Comment #5).

Let's close this bug as WONTFIX, as there is nothing to fix in gcc.