| Summary: | Floating point computation far slower for -mfpmath=sse | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Uroš Bizjak <ubizjak> |
| Component: | rtl-optimization | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | RESOLVED WONTFIX | | |
| Severity: | enhancement | CC: | amacleod, bonzini, dnovillo, gcc-bugs, hjl.tools, joey.ye, rguenth, weiliang.lin, xuepeng.guo |
| Priority: | P2 | Keywords: | missed-optimization, ra |
| Version: | 4.0.0 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | i686-*-* |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2006-01-15 20:36:24 |
Description
Uroš Bizjak
2005-02-03 16:24:20 UTC
First thing to see is this:

```
...
        mulss   %xmm7, %xmm1
        movss   -12(%ebp), %xmm0
        mulss   %xmm4, %xmm0
        subss   %xmm0, %xmm1
        movss   -12(%ebp), %xmm0
        mulss   %xmm5, %xmm0
        mulss   %xmm6, %xmm3
...
```

Memory access is expensive, but in the -mfpmath=387 case we get equivalent code.

---

Confirmed. This is weird, and it is an RA issue. I don't understand why the RA spills to the stack, as there are enough SSE registers to hold the 6 values. Oh, and this looks closely related to the two-operand instructions issue.

---

PPC gives optimal code:

```
L2:
        fmul    f0,f6,f9
        fmul    f13,f7,f10
        fmul    f12,f8,f11
        fmsub   f29,f8,f10,f0
        fmsub   f30,f6,f11,f13
        fmsub   f31,f7,f9,f12
        fmr     f6,f10
        fmr     f7,f11
        fmr     f8,f9
        fmr     f10,f31
        fmr     f11,f29
        fmr     f9,f30
        bdnz    L2
```

---

Except that PPC uses 12 registers: f0 f6 f7 f8 f9 f10 f11 f12 f13 f29 f30 f31. Not that we can blame GCC for using 12, but it is not a fair comparison. :-) In fact, 8 registers are enough, but it is quite tricky to get there.

The problem is that v3[xyz] is live across multiple BBs, which makes the register allocator's task much harder. Even if we change v3[xyz] in the printf to v2[xyz], cfg-cleanup (between vrp1 and dce2) replaces it and, in doing so, extends the lifetime of v3[xyz]. (Since it is all about having short lifetimes, CCing amacleod@gcc.gnu.org.)

BTW, here is the optimal code (if it works...):

ENTER basic block: v1[xyz], v2[xyz] are live (6 registers).

    v3x = v1y * v2z - v1z * v2y;

v3x is now live, and it takes 2 registers to compute this statement. Here we hit a maximum of 8 live registers; after the statement, 7 registers are live.

    v3y = v1z * v2x - v1x * v2z;

v1z dies here, so we need only one additional register for this statement. We again hit a maximum of 8 live registers. At the end of the statement, 7 registers are again live (7, minus v1z which dies, plus 1 for v3y).

    v3z = v1x * v2y - v1y * v2x;

Likewise, v1x and v1y die, so we need 7 registers and, at the end of the statement, 6 registers are live.

Optimal code would look like this (%xmm0..2 = v1[xyz], %xmm3..5 = v2[xyz]):

```
v3x = v1y * v2z - v1z * v2y
        movss   %xmm1, %xmm6
        mulss   %xmm5, %xmm6    ;; v1y * v2z in %xmm6
        movss   %xmm2, %xmm7
        mulss   %xmm4, %xmm7    ;; v1z * v2y in %xmm7
        subss   %xmm7, %xmm6    ;; v3x in %xmm6

v3y = v1z * v2x - v1x * v2z
        mulss   %xmm3, %xmm2    ;; v1z dies, v1z * v2x in %xmm2
        movss   %xmm0, %xmm7
        mulss   %xmm5, %xmm7    ;; v1x * v2z in %xmm7
        subss   %xmm7, %xmm2    ;; v3y in %xmm2

v3z = v1x * v2y - v1y * v2x
        mulss   %xmm4, %xmm0    ;; v1x dies, v1x * v2y in %xmm0
        mulss   %xmm3, %xmm1    ;; v1y dies, v1y * v2x in %xmm1
        subss   %xmm1, %xmm0    ;; v3z in %xmm0
```

Note now how we have to reorder the final moves to obtain optimal code:

```
        movss   %xmm0, %xmm7    ;; save v3z... alternatively, do it before the subss
        movss   %xmm3, %xmm0    ;; v1x = v2x
        movss   %xmm6, %xmm3    ;; v2x = v3x (in %xmm6)
        movss   %xmm4, %xmm1    ;; v1y = v2y
        movss   %xmm2, %xmm4    ;; v2y = v3y (in %xmm2)
        movss   %xmm5, %xmm2    ;; v1z = v2z
        movss   %xmm7, %xmm5    ;; v2z = v3z (saved in %xmm7)
```

(Note that doing the reordering manually does not help...) :-(

Out of curiosity, can somebody check out yara-branch to see how it fares?
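(For reference: the testcase itself, pr19780.c / t.c, is not quoted anywhere in this thread. The following is a minimal sketch reconstructed from the .optimized dump and the program output shown in later comments; the iteration count and initial values are taken from that dump, everything else is an assumption.)

```c
/*
 * Hedged reconstruction of the testcase (the original pr19780.c / t.c is
 * not attached here).  Structure, iteration count and initial values
 * follow the .optimized dump quoted later in the thread.
 */
#include <stdio.h>

int main(void)
{
    float v1x = 1.0f, v1y = 0.0f, v1z = 0.0f;
    float v2x = 0.0f, v2y = 1.0f, v2z = 0.0f;
    float v3x = 0.0f, v3y = 0.0f, v3z = 0.0f;
    int i;

    puts("Start?");
    for (i = 0; i < 100000000; i++) {
        /* cross product: v3 = v1 x v2 */
        v3x = v1y * v2z - v1z * v2y;
        v3y = v1z * v2x - v1x * v2z;
        v3z = v1x * v2y - v1y * v2x;
        /* rotate the vectors: v1 <- v2, v2 <- v3 */
        v1x = v2x; v1y = v2y; v1z = v2z;
        v2x = v3x; v2y = v3y; v2z = v3z;
    }
    puts("Stop!");
    printf("Result = %f, %f, %f\n", (double) v3x, (double) v3y, (double) v3z);
    return 0;
}
```

Compiled with -O2 -mfpmath=sse versus -O2 -mfpmath=387, this is the loop whose register allocation is discussed above.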
---

By comparison, the x87 is relatively easier, because there are never more than 8 registers and fxch makes it much easier to write the compensation code:

```
v3x = v1y * v2z - v1z * v2y     ;; v1x v1y v1z v2x v2y v2z
        fld     %st(1)          ;; v1y v1x v1y v1z v2x v2y v2z
        fmul    %st(6), %st(0)  ;; v1y*v2z v1x v1y v1z v2x v2y v2z
        fld     %st(3)          ;; v1z v1y*v2z v1x v1y v1z v2x v2y v2z
        fmul    %st(6), %st(0)  ;; v1z*v2y v1y*v2z v1x v1y v1z v2x v2y v2z
        fsubp   %st(0), %st(1)  ;; v3x v1x v1y v1z v2x v2y v2z

v3y = v1z * v2x - v1x * v2z
        fld     %st(4)          ;; v2x v3x v1x v1y v1z v2x v2y v2z
        fmulp   %st(0), %st(4)  ;; v3x v1x v1y v1z*v2x v2x v2y v2z
        fld     %st(1)          ;; v1x v3x v1x v1y v1z*v2x v2x v2y v2z
        fmul    %st(7), %st(0)  ;; v1x*v2z v3x v1x v1y v1z*v2x v2x v2y v2z
        fsubp   %st(0), %st(4)  ;; v3x v1x v1y v3y v2x v2y v2z

v3z = v1x * v2y - v1y * v2x
        fld     %st(5)          ;; v2y v3x v1x v1y v3y v2x v2y v2z
        fmulp   %st(0), %st(2)  ;; v3x v1x*v2y v1y v3y v2x v2y v2z
        fld     %st(4)          ;; v2x v3x v1x*v2y v1y v3y v2x v2y v2z
        fmul    %st(3), %st(0)  ;; v1y*v2x v3x v1x*v2y v1y v3y v2x v2y v2z
        fsubp   %st(0), %st(2)  ;; v3x v3z v1y v3y v2x v2y v2z
        fstp    %st(2)          ;; v3z v3x v3y v2x v2y v2z
        fxch    %st(5)          ;; v2z v3x v3y v2x v2y v3z
        fxch    %st(2)          ;; v3y v3x v2z v2x v2y v3z
        fxch    %st(4)          ;; v2y v3x v2z v2x v3y v3z
        fxch    %st(1)          ;; v3x v2y v2z v2x v3y v3z
        fxch    %st(3)          ;; v2x v2y v2z v3x v3y v3z
```

(Well, the fxch should be scheduled, but it is still possible to do it without spilling.)

Paolo

---

With more registers (x86_64) the stack moves are gone, but: (!)

```
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -O2 -o t t.c -mfpmath=387
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> /usr/bin/time ./t
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000
5.31user 0.00system 0:05.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> ./xgcc -B. -O2 -o t t.c
rguenther@murzim:/abuild/rguenther/trunk-g/gcc> /usr/bin/time ./t
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000
9.96user 0.05system 0:10.06elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+135minor)pagefaults 0swaps
```

That is, 387 math is almost twice as fast as SSE math on x86_64! The inner loop is

```
.L7:
        movaps  %xmm3, %xmm6
        movaps  %xmm1, %xmm5
        movaps  %xmm0, %xmm4
.L2:
        movaps  %xmm2, %xmm3
        mulss   %xmm6, %xmm2
        movaps  %xmm7, %xmm0
        addl    $1, %eax
        mulss   %xmm4, %xmm3
        movaps  %xmm7, %xmm1
        mulss   %xmm5, %xmm0
        cmpl    $1000000000, %eax
        mulss   %xmm6, %xmm1
        movaps  %xmm4, %xmm7
        subss   %xmm0, %xmm3
        movaps  %xmm8, %xmm0
        mulss   %xmm4, %xmm0
        subss   %xmm0, %xmm1
        movaps  %xmm8, %xmm0
        movaps  %xmm6, %xmm8
        mulss   %xmm5, %xmm0
        subss   %xmm2, %xmm0
        movaps  %xmm5, %xmm2
        jne     .L7
```

vs.

```
.L7:
        fxch    %st(3)
        fxch    %st(2)
.L2:
        fld     %st(2)
        addl    $1, %eax
        cmpl    $1000000000, %eax
        fmul    %st(1), %st
        flds    76(%rsp)
        fmul    %st(5), %st
        fsubrp  %st, %st(1)
        flds    76(%rsp)
        fmul    %st(3), %st
        flds    72(%rsp)
        fmul    %st(3), %st
        fsubrp  %st, %st(1)
        flds    72(%rsp)
        fmul    %st(6), %st
        fxch    %st(5)
        fmul    %st(4), %st
        fsubrp  %st, %st(5)
        fxch    %st(2)
        fstps   76(%rsp)
        fxch    %st(2)
        fstps   72(%rsp)
        jne     .L7
```

(testing done on AMD Athlon fam 15 model 35 stepping 2)

---

(In reply to comment #5)
> With more registers (x86_64) the stack moves are gone, but: (!)
> (testing done on AMD Athlon fam 15 model 35 stepping 2)

On a Xeon 3.6, SSE is now faster:

```
gcc -O2 -march=pentium4 -mfpmath=387 pr19780.c
time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.805s
user    0m0.804s
sys     0m0.000s

gcc -O2 -march=pentium4 -mfpmath=sse pr19780.c
time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.707s
user    0m0.704s
sys     0m0.004s
```

```
vendor_id  : GenuineIntel
cpu family : 15
model      : 4
model name : Intel(R) Xeon(TM) CPU 3.60GHz
stepping   : 10
cpu MHz    : 3600.970
cache size : 2048 KB
```

The question now is: why is the Athlon so slow with SFmode SSE?

---

(In reply to comment #6)
> On Xeon 3.6, SSE is now faster:

... but for -ffast-math:

```
SSE: user 0m0.756s
x87: user 0m0.612s
```

Yes, x87 is faster for -ffast-math by some 20%.

---

What's the generated code for -ffast-math? In principle I don't see a reason why it should make any difference...

---

(In reply to comment #8)
> what's the generated code for -ffast-math? in principle i don't see a reason
> why it should make any difference...

Trying to answer your question, I have played a bit with compile flags, and things are getting really strange:

```
[uros@localhost test]$ gcc -O2 -mfpmath=387 pr19780.c
[uros@localhost test]$ time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m1.211s
user    0m1.212s
sys     0m0.004s

[uros@localhost test]$ gcc -O2 -mfpmath=387 -msse pr19780.c
[uros@localhost test]$ time ./a.out
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.555s
user    0m0.552s
sys     0m0.004s
```

Note that -msse should have no effect on the calculations. The difference between the asm dumps is:

```
--- pr19780.s   2007-04-03 14:28:14.000000000 +0200
+++ pr19780.s_  2007-04-03 14:28:01.000000000 +0200
@@ -17,69 +17,61 @@
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
-       subl    $84, %esp
+       subl    $100, %esp
        movl    $.LC0, (%esp)
        call    puts
        xorl    %eax, %eax
-       fldz
        fld1
        fsts    -16(%ebp)
+       fldz
+       fsts    -12(%ebp)
+       fld     %st(0)
        fld     %st(1)
-       fld     %st(2)
-       fld     %st(3)
        jmp     .L2
        .p2align 4,,7
 .L7:
-       fstp    %st(5)
-       fstp    %st(0)
-       fxch    %st(1)
-       fxch    %st(2)
-       fxch    %st(3)
-       fxch    %st(4)
        fxch    %st(3)
+       fxch    %st(2)
 .L2:
-       fld     %st(1)
+       fld     %st(2)
        addl    $1, %eax
-       fmul    %st(3), %st
+       fmul    %st(1), %st
        cmpl    $100000000, %eax
-       fstps   -12(%ebp)
+       flds    -12(%ebp)
+       fmul    %st(5), %st
+       fsubrp  %st, %st(1)
+       flds    -12(%ebp)
+       fmul    %st(3), %st
        flds    -16(%ebp)
-       fmul    %st(1), %st
-       fsubrs  -12(%ebp)
-       fstps   -12(%ebp)
-       fmul    %st(4), %st
-       fld     %st(3)
        fmul    %st(3), %st
        fsubrp  %st, %st(1)
        flds    -16(%ebp)
-       fmulp   %st, %st(4)
-       fxch    %st(1)
+       fmul    %st(6), %st
+       fxch    %st(5)
        fmul    %st(4), %st
-       fsubrp  %st, %st(3)
-       flds    -16(%ebp)
-       fld     %st(3)
+       fsubrp  %st, %st(5)
        fxch    %st(2)
-       fsts    -16(%ebp)
-       flds    -12(%ebp)
+       fstps   -12(%ebp)
+       fxch    %st(2)
+       fstps   -16(%ebp)
        jne     .L7
-       fstp    %st(0)
-       fstp    %st(5)
-       fstp    %st(0)
-       fstp    %st(0)
-       fstp    %st(0)
+       fstp    %st(3)
+       fxch    %st(1)
        movl    $.LC3, (%esp)
        fstps   -40(%ebp)
+       fxch    %st(1)
        fstps   -56(%ebp)
+       fstps   -72(%ebp)
        call    puts
        flds    -40(%ebp)
        fstpl   20(%esp)
        flds    -56(%ebp)
        fstpl   12(%esp)
-       flds    -12(%ebp)
+       flds    -72(%ebp)
        fstpl   4(%esp)
        movl    $.LC4, (%esp)
        call    printf
-       addl    $84, %esp
+       addl    $100, %esp
        xorl    %eax, %eax
        popl    %ecx
        popl    %ebp
```

where (+++) is with -msse.

---

I would look at the lreg output, which contains the results of regclass.

---

(In reply to comment #10)
> I would look at the lreg output, which contains the results of regclass.
No, the difference is due to the SSA pass, which generates

```
  # v1z_10 = PHI <v1z_13(2), v1z_32(3)>
  # v1y_9 = PHI <v1y_12(2), v1y_31(3)>
  # v1x_8 = PHI <v1x_11(2), v1x_30(3)>
  # i_7 = PHI <i_17(2), i_36(3)>
  # v3z_6 = PHI <v3z_18(D)(2), v3z_29(3)>
  # v3y_5 = PHI <v3y_19(D)(2), v3y_26(3)>
  # v3x_4 = PHI <v3x_20(D)(2), v3x_23(3)>
  # v2z_3 = PHI <v2z_16(2), v2z_35(3)>
  # v2y_2 = PHI <v2y_15(2), v2y_34(3)>
  # v2x_1 = PHI <v2x_14(2), v2x_33(3)>
```

without -msse, and

```
  # v3z_10 = PHI <v3z_18(D)(2), v3z_29(3)>
  # v3y_9 = PHI <v3y_19(D)(2), v3y_26(3)>
  # v3x_8 = PHI <v3x_20(D)(2), v3x_23(3)>
  # v2z_7 = PHI <v2z_16(2), v2z_35(3)>
  # v2y_6 = PHI <v2y_15(2), v2y_34(3)>
  # v2x_5 = PHI <v2x_14(2), v2x_33(3)>
  # v1z_4 = PHI <v1z_13(2), v1z_32(3)>
  # v1y_3 = PHI <v1y_12(2), v1y_31(3)>
  # v1x_2 = PHI <v1x_11(2), v1x_30(3)>
  # i_1 = PHI <i_17(2), i_36(3)>
```

with the -msse compile flag. Note the different SSA version suffixes, which produce a different sort order. This is (IMO) due to the fact that -msse enables lots of additional __builtin functions (these can be seen in the 001.tu dump). Since we don't have an x87 scheduler, the results become quite unpredictable and depend on the -msseX settings. It just _happens_ that the second form better suits the stack nature of the x87.

So why does the SSA pass have to interfere with the computation dataflow? This interference makes things worse and effectively takes away the user's control over the flow of data.

---

(In reply to comment #11)
> with -msse compile flag. Note different variable suffixes that create different
> sort order. This is (IMO) due to fact that -msse enables lots of additional
> __builtin functions (these can be seen in 001.tu dump).

I forgot to add that -ffast-math simply enables more builtins, and again a different sort order is introduced. So this is an unstable sorting. Adding dnovillo.

---

(In reply to comment #11)
> So, why does SSA pass have to interfere with computation dataflow? This
> interference makes things worse and effectively takes away user's control on the
> flow of data.

Huh? How is it relevant whether the PHIs are in a different order? Conceptually, the ordering of PHI nodes in a basic block is completely irrelevant. Some pass is getting confused when it shouldn't be. Transformations should not depend on how PHI nodes are emitted in a block, as all PHI nodes are always evaluated in parallel.

---

Transformations do not, but out-of-SSA could. Is there a way to ensure ordering of PHI functions unlike what Uros's dumps suggest?

---

Subject: Re: Floating point computation far slower for -mfpmath=sse
bonzini at gnu dot org wrote on 04/05/07 08:03:
> Is there a way to ensure ordering of PHI functions unlike what Uros's
> dumps suggest?
No.
I also don't see how PHI ordering would affect out-of-ssa. It just
emits copies. If the ordering of those copies is affecting things like
register pressure, then RA should be looked at.
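As a hedged illustration of "it just emits copies" (plain C, not GCC internals; the function names and the tx/ty/tz temporaries are invented for this sketch): the loop-carried PHIs in the testcase describe one parallel assignment per iteration, and out-of-SSA has to pick some sequential order for the resulting copies.

```c
/*
 * Illustrative C only -- not GCC internals.  The loop-carried PHIs describe
 * the parallel assignment
 *     (v1x..v1z, v2x..v2z) <- (v2x..v2z, v3x..v3z)
 * Both serialisations below are correct, but (b) keeps three extra values
 * live across the whole copy sequence.
 */
struct vecs { float v1x, v1y, v1z, v2x, v2y, v2z, v3x, v3y, v3z; };

/* (a) read each v2 component before it is overwritten: no temporaries */
void rotate_short(struct vecs *s)
{
    s->v1x = s->v2x; s->v1y = s->v2y; s->v1z = s->v2z;
    s->v2x = s->v3x; s->v2y = s->v3y; s->v2z = s->v3z;
}

/* (b) copy v2 into temporaries first: same result, longer lifetimes */
void rotate_temps(struct vecs *s)
{
    float tx = s->v2x, ty = s->v2y, tz = s->v2z;  /* hypothetical temporaries */
    s->v2x = s->v3x; s->v2y = s->v3y; s->v2z = s->v3z;
    s->v1x = tx;     s->v1y = ty;     s->v1z = tz;
}
```

Form (b) is roughly what the v2x.43 / v2y.44 / v2z.45 temporaries in the second .optimized dump below amount to: identical semantics, but the extra copies stay live longer, and that is what the register allocator then has to cope with.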
---

Is the output from .optimized different (once the SSA version numbers have been stripped)? Those PHIs should be irrelevant; the question is whether the different versioning has any effect.

The only way I can think that out-of-SSA could produce different results is if it had to choose between two same-cost coalesces, and the versioning resulted in them being in different places in the coalesce list. Check the .optimized output, and if the code is equivalent, the problem is after that stage.

---

(In reply to comment #17)
> Is the output from .optimized different (once the SSA version numbers have
> been stripped)? Those PHIs should be irrelevant; the question is whether the
> different versioning has any effect.
>
> The only way I can think that out-of-SSA could produce different results is if
> it had to choose between two same-cost coalesces, and the versioning resulted
> in them being in different places in the coalesce list. Check the .optimized
> output, and if the code is equivalent, the problem is after that stage.

They are _not_ equivalent. We have:

```
<bb 2>:
  __builtin_puts (&"Start?"[0]);
  v2x = 0.0;
  v2y = 1.0e+0;
  v2z = 0.0;
  i = 0;
  v1x = 1.0e+0;
  v1y = 0.0;
  v1z = 0.0;

<L0>:;
  v3x = v1y * v2z - v1z * v2y;
  v3y = v1z * v2x - v1x * v2z;
  v3z = v1x * v2y - v1y * v2x;
  i = i + 1;
  v1z = v2z;
  v1y = v2y;
  v1x = v2x;
  v2z = v3z;
  v2y = v3y;
  v2x = v3x;
  if (i != 100000000) goto <L0>; else goto <L2>;

<L2>:;
  __builtin_puts (&"Stop!"[0]);
  printf (&"Result = %f, %f, %f\n"[0], (double) v3x, (double) v3y, (double) v3z);
  return 0;
```

versus

```
<bb 2>:
  __builtin_puts (&"Start?"[0]);
  i = 0;
  v1x = 1.0e+0;
  v1y = 0.0;
  v1z = 0.0;
  v2x.43 = 0.0;
  v2y.44 = 1.0e+0;
  v2z.45 = 0.0;

<L0>:;
  v3x = v1y * v2z.45 - v1z * v2y.44;
  v3y = v1z * v2x.43 - v1x * v2z.45;
  v3z = v1x * v2y.44 - v1y * v2x.43;
  i = i + 1;
  v2z = v3z;
  v2y = v3y;
  v2x = v3x;
  v1z = v2z.45;
  v1y = v2y.44;
  v1x = v2x.43;
  if (i != 100000000) goto <L8>; else goto <L2>;

<L8>:;
  v2x.43 = v2x;
  v2y.44 = v2y;
  v2z.45 = v2z;
  goto <bb 3> (<L0>);

<L2>:;
  __builtin_puts (&"Stop!"[0]);
  printf (&"Result = %f, %f, %f\n"[0], (double) v3x, (double) v3y, (double) v3z);
  return 0;
```

---

What are you using for a compiler? I'm using a mainline from mid-March, and with it my .optimized files diff exactly the same, and I get the aforementioned time differences in the executables. (sse.c and sse-bad.c are the same, just different names to get different output files.)

```
2007-03-13/gcc> diff sse.c sse-bad.c
2007-03-13/gcc> ./xgcc -B./ sse.c -fdump-tree-optimized -O3 -march=pentium4 -o sse
2007-03-13/gcc> ./xgcc -B./ sse-bad.c -fdump-tree-optimized -O3 -march=pentium4 -mfpmath=sse -o sse-bad
2007-03-13/gcc> ls -l sse*optimized
-rw-rw-r-- 1 amacleod amacleod 864 Apr  5 12:16 sse-bad.c.116t.optimized
-rw-rw-r-- 1 amacleod amacleod 864 Apr  5 12:16 sse.c.116t.optimized
2007-03-13/gcc> diff sse.c.116t.optimized sse-bad.c.116t.optimized
2007-03-13/gcc> time ./sse
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.630s
user    0m0.572s
sys     0m0.000s
2007-03-13/gcc> time ./sse-bad
Start?
Stop!
Result = 0.000000, 0.000000, 1.000000

real    0m0.883s
user    0m0.780s
sys     0m0.000s
```

Is this just with earlier compilers? What version are you using? It at least seems to indicate that the problem isn't before out-of-SSA, since the timing difference is still there with identical outputs from .optimized.

---

(In reply to comment #19)
> What are you using for a compiler?
> I'm using a mainline from mid-March, and with it my .optimized files diff
> exactly the same, and I get the aforementioned time differences in the
> executables.

gcc version 4.3.0 20070404 (experimental) on i686-pc-linux-gnu.

This is because -march=pentium4 enables all SSE builtins in both cases.

> (sse.c and sse-bad.c are the same, just different names to get different
> output files.)
>
> 2007-03-13/gcc> diff sse.c sse-bad.c
> 2007-03-13/gcc> ./xgcc -B./ sse.c -fdump-tree-optimized -O3 -march=pentium4 -o sse
> 2007-03-13/gcc> ./xgcc -B./ sse-bad.c -fdump-tree-optimized -O3 -march=pentium4 -mfpmath=sse -o sse-bad

This is a known effect of SFmode SSE being slower than SFmode x87. But again, you have enabled sse(2) builtins due to -march=pentium4. Please try to compile using only "-O2" and "-O2 -msse". x87 math will be used in both cases, but .optimized will show the difference. You can also try to compile with and without -ffast-math. IMO it is not acceptable for tree dumps to depend on a target compile flag in any way...

---

Strange things happen. I have completely removed the gcc build directory and bootstrapped gcc from scratch. To my surprise, the difference with and without -msse is now gone, and the optimized dumps are now the same. For reference, the compiler has the ident "gcc version 4.3.0 20070406 (experimental)".

Regarding this bug: SSE performance vs. x87 performance is clearly target-processor dependent. There is nothing gcc can do, and even without memory accesses, SSE is slower than x87 on some targets (ref: Comment #5). Let's close this bug as WONTFIX, as there is nothing to fix in gcc.