Hi guys. My name is Clint Whaley, and I'm the developer of ATLAS, an open source linear algebra package: http://directory.fsf.org/atlas.html My users are asking me to support gcc 4, but right now its x87 fp performance is much worse than gcc 3's. Depending on the machine and code being run, it appears to be between 10-50% worse. Here is a tarfile that allows you to reproduce the problem on any machine: http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz I have timed under a Pentium-D (gcc 4 gets 85% of gcc 3's performance on the example code) and an Athlon-64 X2 (gcc 4 gets 60% of gcc 3's performance). This is a typical kernel from ATLAS, not the worst . . . By looking at the assembly (the provided makefile will generate it with "make assall"), the differences seem fairly minor. From what I can tell, it mostly comes down to gcc 4 using an fmull from memory rather than loading the operands to the fp stack first. I know that SSE is the preferred target these days, but the x87 (when optimized right) kills the single precision SSE unit in scalar mode due to the expense of the scalar vector load, and the x87 unit is slightly faster even in double precision (in scalar mode). Gcc cannot yet auto-vectorize any ATLAS kernels. Any help much appreciated, Clint
Do you have a small testcase which shows the problem?
Created attachment 11541 [details] Makefile and source to demonstrate performance problem
This is fully a target issue.
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Andrew,

Thanks for the reply. For the small case demonstrating the problem, I included it in the original message: http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz and have uploaded it as an attachment. I am not sure what you mean by "fully a target issue". Perhaps I have submitted this performance bug to the wrong area of gcc? Note that it is not limited to one machine: the gcc 4 code is inferior to gcc 3 on both AMD and Intel. I chose the two newest machines I have access to, but I believe it is true for older machines as well . . . Any clarification appreciated,
Clint

> ------- Comment #3 from pinskia at gcc dot gnu dot org 2006-05-31 00:41 -------
> This is fully a target issue.
> (Component changed from rtl-optimization to target.)
(In reply to comment #4)
> and have uploaded it as an attachment. I am not sure what you mean by
> "fully a target issue". Perhaps I have submitted to the wrong area of
> gcc performance bug? Note that it is not limited to one machine: the
> gcc 4 code is inferior to gcc 3 on both AMD and Intel. I chose the
> two newest machines I have access to, but I believe it is true for
> older machines as well . . .

It only affects x86/x86_64 (really just the x87 and its stack-machine register file). It truly looks like a register allocator (RA) issue. There are no issues like this on, say, PowerPC.
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Yes, I agree it is an x86/x86_64 issue. I have not yet scoped the performance of any of the other architectures with gcc 4 vs. 3: since 90% of my users use an x86 of some sort, I can't switch to gcc 4 support until the x86 performance is reasonable. It seems x87 performance always drops with any big gcc change (bugzilla 4991 is a similar performance drop between 2.x and 3.0, though the issues are not exactly the same), probably because its oddball two-operand assembly and x87 register stack don't map well to the saner ISAs, which all compiler guys strongly prefer :) Thanks, Clint
IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure luck. Looking into 3.x RTL, these things can be observed. The instruction that multiplies pA0 and rB0 is described as:

__.20.combine:
(insn 75 73 76 2 (set (reg:DF 84)
        (mult:DF (mem:DF (reg/v/f:DI 70 [ pA0 ]) [0 S8 A64])
            (reg/v:DF 78 [ rB0 ]))) 551 {*fop_df_comm_nosse}
    (insn_list 65 (nil))
    (nil))

At this point, the first input operand does not satisfy the operand constraint, so the register allocator pushes the memory operand into a register:

__.25.greg:
(insn 703 73 75 2 (set (reg:DF 8 st [84])
        (mem:DF (reg/v/f:DI 0 ax [orig:70 pA0 ] [70]) [0 S8 A64])) 96 {*movdf_integer}
    (nil)
    (nil))
(insn 75 703 76 2 (set (reg:DF 8 st [84])
        (mult:DF (reg:DF 8 st [84])
            (reg/v:DF 9 st(1) [orig:78 rB0 ] [78]))) 551 {*fop_df_comm_nosse}
    (insn_list 65 (nil))
    (nil))

This RTL produces the following asm sequence:

	fldl (%rax)        #* pA0
	fmul %st(1), %st   #

In the 4.x case, we have:

__.127r.combine:
(insn 60 58 61 4 (set (reg:DF 207)
        (mult:DF (reg/v:DF 187 [ rB0 ])
            (mem:DF (plus:DI (reg/v/f:DI 178 [ pA0.161 ])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591 {*fop_df_comm_i387}
    (nil)
    (nil))

This instruction almost satisfies the operand constraint, and the register allocator produces:

__.138r.greg:
(insn 470 58 60 5 (set (reg:DF 12 st(4) [207])
        (reg/v:DF 8 st [orig:187 rB0 ] [187])) 94 {*movdf_integer}
    (nil)
    (nil))
(insn 60 470 61 5 (set (reg:DF 12 st(4) [207])
        (mult:DF (reg:DF 12 st(4) [207])
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591 {*fop_df_comm_i387}
    (nil)
    (nil))

Stack handling then fixes this RTL to:

__.151r.stack:
(insn 470 58 60 4 (set (reg:DF 8 st)
        (reg:DF 8 st)) 94 {*movdf_integer}
    (nil)
    (nil))
(insn 60 470 61 4 (set (reg:DF 8 st)
        (mult:DF (reg:DF 8 st)
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591 {*fop_df_comm_i387}
    (nil)
    (nil))

From your measurement, it looks like instead of:

	fld %st(0)         #
	fmull (%rax)       #* pA0.161

it is faster to emit:

	fldl (%rax)        #* pA0
	fmul %st(1), %st   #,
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

>IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure luck.

As far as understanding from first principles goes, performance on a modern x86 (which is busy doing OOE, register renaming, CISC/RISC translation, operand fusion and fission, etc.) is *always* a blind accident, IMHO :) I've hand-tuned code for the x87 for a *long* time (and written my own compilation framework), and it has been my experience that only by trying different schedules, instruction selections, etc. can you get decent performing code. gcc actually does an amazing job of x87 performance when it's working right, and I always figured it had to be empirically tweaked to get that level of performance. The fact that x87 performance always drops off at major releases (a return to first principles over discovered best cases) seems to confirm this . . .

So, I agree with you that the difference does not seem to have some big plan behind it, but I want to stress that it is nonetheless critical: it happens to all x87 codes on every x86 machine I have so far tried (Pentium-D, Athlon 64 X2, and P4e), and it happens no matter what optimized code I feed gcc 4. Note that ATLAS is not a static library, but rather uses a code generator to tune matrix multiplication. What this means is that ATLAS tries thousands of different source implementations in trying to find the one that will run fastest on the given architecture/compiler (the code generator does things like tiling, register blocking, unroll & jam, software pipelining, and unrolling, all at the ANSI C source level, in an attempt to find the combo that the compiler/arch likes). On no x86 architecture I've installed on can gcc 4 compete with gcc 3. Thus, out of literally thousands of implementations on each platform, gcc 4 cannot find one that competes with gcc 3's best case.
I cannot, of course, send you thousands of codes and say "see all of these are inferior", but they are, and the case I sent is not the worst. For instance, for single precision gemm on the Athlon 64, the kernel tuned for gcc 4 (best case of thousands taken) runs at 56.7% of the performance of the gcc 3-tuned kernel. Nor does using SSE fix things: gcc 4 is still far slower using SSE than gcc 3 using the x87 on all platforms, and for single precision, the gap is worse than between x87 implementations! Thanks, Clint
The benchmark run on a Pentium4 3.2G/800MHz FSB (32bit):

vendor_id  : GenuineIntel
cpu family : 15
model      : 2
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping   : 9
cpu MHz    : 3191.917
cache size : 512 KB

shows even more interesting results, comparing gcc version 3.4.6 vs. gcc version 4.2.0 20060601 (experimental).

With -fomit-frame-pointer -O -msse2 -mfpmath=sse:

GCC 3.x performance:
./xmm_gcc
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.162     2664.87

GCC 4.x performance:
./xmm_gc4
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.164     2633.13

and with -fomit-frame-pointer -O -mfpmath=387:

GCC 3.x performance:
./xmm_gcc
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.160     2697.37

GCC 4.x performance:
./xmm_gc4
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.164     2633.15

There is a small performance drop on gcc-4.x, but nothing critical.

I can confirm that the code indeed runs >50% slower on a 64bit Athlon. Perhaps the problem is in the order of instructions (Software Optimization Guide for AMD Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the guide's example of how things should be done, and the gcc-4.2 code looks similar to the example of how things should _NOT_ be done.

BTW: Did you try to run the benchmark on an AMD target with -march=k8? The effects of this flag are devastating on a Pentium4 CPU. With -O -msse2 -mfpmath=sse -march=k8:

GCC 3.x performance:
./xmm_gcc
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.836      516.79

GCC 4.x performance:
./xmm_gc4
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.287     1504.66
Created attachment 11571 [details] Same benchmark, but with single precision timing included

Here's the same benchmark, but it can time single as well as double precision, in case you want to play with the SSE code.
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

OK, I originally replied a couple of hours ago, but that is not appearing on bugzilla for some reason, so I'll try again, this time CCing myself so I don't have to retype everything :)

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>
>There is a small performance drop on gcc-4.x, but nothing critical.
>
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

First, thanks for looking into this! As to your point, yes, I am aware that gcc4-sse can get almost the same performance as gcc3-x87 (though not quite), and in fact can do so on the Athlon 64 as well, **but only for double precision**. To get SSE within a few percent of x87 on the AMD machine, you use a different kernel (remember, I'm sending you one example out of many), and throw the following flags:

   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
   -ftree-vectorize -fargument-noalias-global

(note this does not vectorize the code, but I throw the flag in the hope that future versions will :)

Note that my bug report concentrates on "x87 performance"! There are reasons to use the x87 even if scalar SSE is competitive performance-wise, as the x87 unit produces much superior accuracy. However, even if we were to take the tack (and gcc may be doing this for all I know) that once scalar SSE can compete performance-wise, the x87 unit will no longer be supported, we must also examine single precision performance. For single precision, I have never gotten any scalar SSE kernel to come even close to the gcc3-x87 numbers.
I believe (w/o having proved it) that this is probably due to the cost of the scalar load: double precision can use the low-overhead movlpd instruction, but single must use movss, which is **much** slower than fld, and so any kernel using scalar SSE blows chunks. ATLAS's best-case gcc4-sse kernel gets roughly half of the gcc3-x87 performance on an Athlon-64, and something like 80% on a P4e (note that Intel machines have half the theoretical peak for x87 [AMD: 2 flops/cycle, Intel: 1 flop/cycle]: getting a large % of peak gets easier the lower your peak is!). I originally submitted a double precision kernel because that showed the x87 performance problem, and allowed me to reuse the infrastructure I created for an earlier bug report (bugzilla 4991). I have just uploaded an example attachment that can time both single and double precision performance, if you want to confirm for yourself that SSE is not competitive for single precision. Thanks, Clint
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>There is a small performance drop on gcc-4.x, but nothing critical.
>I can confirm, that code indeed runs >50% slower on 64bit athlon. Perhaps the
>problem is in the order of instructions (Software Optimization Guide for AMD
>Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the example, how
>things should be, and gcc-4.2 code looks similar to the example, how things
>should _NOT_ be.

Thanks for looking into this! However, I am indeed aware that by using SSE2 you can get the double precision results fairly close to the x87 on most platforms. In fact, you can get gcc 4.1-sse within a few % of gcc 3-x87 on the Athlon 64 as well, by changing the kernel you feed gcc, and giving it these flags:

   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
   -ftree-vectorize -fargument-noalias-global

(this doesn't make it vectorize, but I throw the flag in hope for the future :)

Now, sometimes you want to use the x87 unit because of its superior precision, but the real problem with the approach of "ignore the x87 performance and just use SSE" comes in single precision. The performance of the best single precision kernel found by ATLAS using gcc4.1-sse is roughly half that of using the x87 unit on an Athlon-64, and 80% on a P4e (one reason they are closer on the P4e is that the P4e's x87 peak is 1/2 that of the Athlon [AMD machines can do 2 flops/cycle using the x87, whereas Intel machines can do only 1], so there's not as large a gap between excellent and not-so-excellent kernels). My guess (and it's only a guess) for the reason scalar double precision SSE can compete while single cannot comes down to the cost of doing scalar loads and stores.
In double, you can use movlpd instead of movsd for a low-overhead vector load, but in single you must use movss, and since movss is much more expensive than fld, scalar SSE always blows in comparison to x87 . . . So, that's why my error report concentrated on "x87 performance". I submitted in double precision because I had a preexisting Makefile/source demonstrating the performance problem from a prior bug report (bugzilla 4991). I think we should not blow off the x87 performance even if SSE *were* competitive, because there are times when the x87 is better. However, in single precision, scalar SSE is not competitive, at least on the platforms I have tried. If you guys are planning on deprecating the x87 unit once SSE is competitive on modern machines, I can certainly rework the tarfile so I can send you a single precision benchmark, so you can see the sse/x87 performance gap yourself. Let me know if you want this, as I'll need to do a bit of extra work. Thanks, Clint
Subject: Re: gcc 4 produces worse x87 code on all platforms than gcc 3

Guys, Just got access to a CoreDuo machine, and tested things there. I had to do some hand-translation of the assemblies, as I didn't have access to the gnu compiler there, so there's the possibility of error, but it looked to me like the Core likes the gcc 4 x87 code stream better than the gcc 3 one, so I think you'll want to select amongst them according to -march . . . Core is a PIII-based architecture, so when I have a moment I'll try to find a PIII that's still running to see if PIIIs in general like that code stream, while P4s and Athlons like the gcc 3 way of things . . . Thanks, Clint
OK, I got access to some older machines, and it appears that Core is the only architecture that likes gcc 4's code. More precisely, I have confirmed that the following architectures run significantly slower using gcc 4 than gcc 3: Pentium-D, P4e, Pentium III, PentiumPRO, Athlon-64 X2, Opteron. Any help appreciated, Clint
Hi, Can someone tell me if anyone is looking into this problem with the hopes of fixing it? I just noticed that despite the posted code demonstrating the problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D, Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned to look at it . . . The reason I ask is that I am preparing the next stable release of ATLAS, and I'm getting close to having to make a decision on what compilers I will support. If someone is working feverishly in the background, I will be sure to wait for it, in the hopes that there'll be a fix that will allow me to use gcc 4, which I think will be what most of my users want. If this problem is not being looked into, I should not delay the ATLAS release for it, and just require my users to install gcc 3 in order to get decent performance. I realize you guys are busy, and fp performance is probably not your main concern, so hopefully this message sounds more like a request for info on what is going on, than a bitch about help that I'm getting for free :) Thanks, Clint
Don't hold your breath.
OK, thanks for the reply. I will assume gcc 4 won't be fixed in the near future. My guess is this will make icc the easier compiler for my users, which I kind of hate, and which is why I worked as much as I did on this report . . . I hope you will consider adding the mmbench4s.tar.gz attachment above (the one that runs both single and double precision) to the gcc regression tests. Notice that it caught this problem between 3 and 4, as well as a similar fp performance drop between gcc 2 and 3 (bugzilla 4991). The kernel here is typical of those used in ATLAS, which is used by hundreds of thousands of people worldwide. I believe these kernels are also typical of pretty much any register-blocked fp code, so having them in the regression tests may help other open source fp packages (e.g., FFTW) as well. Notice that closed-source alternatives that ship binaries do not face this challenge, so having compiler performance drop between releases gives them an advantage, and can drive HPC users (where performance dictates everything) to proprietary solutions. Thanks, Clint
Unfortunately we don't have infrastructure for performance regression tests. BTW, did you check what happens if you do not unroll the innermost loop manually, but let -funroll-loops do it? For me the performance is the same (but I may have screwed up removing the unrolling).
Thanks for the info. I'm sorry to hear that no performance regression tests are done, but I guess it kind of explains why these problems reoccur :)

As to not unrolling, the fully unrolled case is almost always commandingly better whenever I've looked at it. After your note, I just tried on my P4, using ATLAS's P4 kernel, and I get (ku is inner loop unrolling, and nb=40, so ku=40 is fully unrolled):

   GCC 4 ku=1  : 1.65 Gflop
   GCC 4 ku=40 : 1.84 Gflop
   GCC 3 ku=1  : 1.90 Gflop
   GCC 3 ku=40 : 2.19 Gflop

This is throwing the -funroll-loops flag. BTW, gcc 4 w/o -funroll-loops (ku=1) is indeed slower, at roughly 1.54 . . .

Anyway, I've never found the performance of gcc ku=1 competitive with ku=<fully unrolled> on any machine. Even in assembly, I have to fully unroll the inner loop to get near peak on all Intel machines. On the Opteron, you can get within 5% or so with a rolled loop in assembly, but I've not gotten a C code to do that. I think the gcc unrolling probably defaults to something like 4 or 8 (a guess from performance, not verified): unrolling all the way (the loop is over a compile-time constant) is the way to go . . .

When you said competitive, did you mean that gcc 4 ku=1 was competitive with gcc 4 ku=40, or with gcc 3 ku=1? If the latter, I find it hard to believe unless you use SSE for gcc 4 and something unexpected happens. Even so, if you are using SSE, try it with the single precision kernel, where SSE cannot compete with the x87 unit (even the broken one in gcc 4). Thanks, Clint
(In reply to comment #15)
> Can someone tell me if anyone is looking into this problem with the hopes of
> fixing it? I just noticed that despite the posted code demonstrating the
> problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
> Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
> to look at it . . .

Hm, I tried your single precision testcase (SSE) on:

processor  : 0
vendor_id  : GenuineIntel
cpu family : 15
model      : 2
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping   : 9
cpu MHz    : 3191.917
cache size : 512 KB

And the results are a bit surprising (this is the exact output of your test):

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -DTYPE=float -c mmbench.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c sgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x single performance:"
GCC 3.x single performance:
./xsmm_gcc
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.141     3072.00
echo "GCC 4.x single performance:"
GCC 4.x single performance:
./xsmm_gc4
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.141     3072.00

where "gcc (GCC) 3.4.6" was tested against "gcc version 4.2.0 20060608 (experimental)".

FYI: there is another pathological testcase (PR target/19780), where SSE code is 30% slower on AMD64, despite the fact that for SSE 16 xmm registers were available and _no_ memory was accessed in the loop.
> The reason I ask is that I am preparing the next stable release of ATLAS, and
> I'm getting close to having to make a decision on what compilers I will support.
> If someone is working feverishly in the background, I will be sure to wait
> for it, in the hopes that there'll be a fix that will allow me to use
> gcc 4, which I think will be what most of my users want. If this problem
> is not being looked into, I should not delay the ATLAS release for it, and
> just require my users to install gcc 3 in order to get decent performance.
>
> I realize you guys are busy, and fp performance is probably not your main
> concern, so hopefully this message sounds more like a request for info on what
> is going on, than a bitch about help that I'm getting for free :)

Without any other information available, I can only speculate that perhaps the gcc4 code does not fully utilize the multiple FP pipelines in the processors you listed.
Uros,

Thanks for the reply; I think some confusion has set in (see below) :)

>And the results are a bit surprising (this is the exact output of your test):

Note that you are running the opposite of my test case: SSE vs SSE rather than x87 vs x87. This whole bug report is about x87 performance. You can get more detail on why I want x87 in my messages above, particularly comment #11, but single precision is indeed the place where SSE cannot compete with the x87 unit. To see it, put the flags back the way I had them in the attachment, and you'll see that gcc 3 is much faster. Also, you should find in single precision that the x87 unit soundly beats the SSE unit (unlike double precision, where gcc 3's x87 code is only slightly faster than the best SSE code). I think the x87 will win even using gcc 4 for both compilations, even though gcc 4's x87 support is crippled by its new register allocation scheme.

So, let me say what I think is going on here, and you can correct me if I've gotten it wrong. I think in this last timing you believe you've found an exception to the problem, but have forgotten we want to look at the x87 (which is the fastest method in this case anyway). Try it with my original flags (essentially, throw '-mfpmath=387' instead of the sse flags), and you should see that this gives far better performance using gcc 3 than any use of scalar sse. I think even gcc 4 will be better using its de-optimized x87 code, because x87 is inherently better than scalar sse on these platforms. Of all the machines I've tested, there is only one that likes gcc 4's new x87 register-usage pattern, and that is the CoreDuo. The issue is in x87 register usage: gcc 4 saves a register, and does the FMUL from memory rather than first loading the value to the fp stack, and on at least the PentiumPRO, Pentium III, Pentium 4e, Pentium-D, Athlon-64 X2 and Opteron, that drops your x87 (which is your best) performance significantly.
Note that given gcc 3's register usage, I think a simple peephole step can transform it to gcc 4's, if you want to maintain that usage for CoreDuo. Unfortunately, going the other way requires an additional register, and the load plays with your stack operands, so it is easier to keep gcc 3's way as the default, and peephole to gcc 4's when on a machine that likes that usage (currently, only the Core). Thanks, Clint
(In reply to comment #21)
> Note that you are running the opposite of my test case: SSE vs SSE rather than
> x87 vs x87. This whole bug report is about x87 performance. You can get more
> detail on why I want x87 in my messages above, particularly comment #11, but
> single precision is indeed the place where SSE cannot compete with the x87
> unit. To see it, put the flags back the way I had them in the attachment, and
> you'll see that gcc 3 is much faster. Also, you should find in single

Hm, these are x87 results:

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -DTYPE=float -c mmbench.c
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c sgemm_atlas.c
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x single performance:"
GCC 3.x single performance:
./xsmm_gcc
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.141     3072.00
echo "GCC 4.x single performance:"
GCC 4.x single performance:
./xsmm_gc4
ALGORITHM    NB   REPS       TIME     MFLOPS
=========  ===== =====  ==========  ==========
atlasmm      60   1000      0.143     3029.92
Uros,

OK, I made the stupid assumption that the P4 would behave like the P4e; should've known better :) I got access to a Pentium 4 (family=15, model=2), and indeed I can repeat the several surprising things you report:
 (1) SSE does as well as x87 on this platform
 (2) The difference between gcc 3 & 4 x87 performance is extremely minor
 (3) The code is amazingly optimal (roughly 95-96% of peak!)

The significance of (3) is that it tells us we are not in the bad case where the kernel in question gets such crappy performance that all codes look alike. This performance was so good that I ran a tester to verify that we were still getting the right answer, and indeed we are :)

On this platform, I didn't install the compilers myself (the system had Red Hat 4.0.2-8 and 3.3.6 installed), so I scoped the assembly, and indeed they have the fmul difference that causes problems on the other x87 machines, so it is really true that the Pentium 4 handles either instruction stream almost as well (not sure the 2% is significant; 2% is less than clock resolution, though in my timings anytime there is a difference, gcc 4 always loses). Here is the machine breakdown as measured now:

   LIKES GCC 4    DOESN'T CARE    LIKES GCC 3
   ===========    ============    ===========
   CoreDuo        Pentium 4       PentiumPRO
                                  Pentium III
                                  Pentium 4e
                                  Pentium-D
                                  Athlon-64 X2
                                  Opteron

The only machine we are missing that I can think of is the K7 (i.e. the original Athlon, not the Athlon-64). I don't presently have access to a K7, but I can probably find someone on the developer list who could run the test if you like. The other thing that would be of interest is for each machine to chart the % performance lost/gained. Here, though, we want two numbers: % lost on the simple benchmark code (which is easy to repeat), and % lost with the ATLAS code generator (which compares each compiler's best case out of thousands to the other's). I will undertake to get the first (quick to run) number for the machines, so we have some quantitative results to look at . . .
The ATLAS comparison is probably more important, but takes so long that maybe I'll post it only for the most problematic platforms (i.e., if the arch shows a big drop gcc3 v. gcc4, see if the drop is that big when we ask ATLAS to auto-adapt to gcc4). Thanks, Clint
Guys,

OK, here is a table summarizing the performance you can see using mmbench4s.tar.gz. I believe this covers a strong majority of the x86 architectures in use today (there are some specialty processors such as the Pentium-M, Turion, Efficeon, etc. missing, but I don't think they are a big % of the market). In this table, I report the following for each machine and data precision:

 % Clock: % of clock rate achieved by the best compiled version of gemm_atlas.c
          (rated in mflop). Note, theoretical peak for Intel machines is
          1 flop/clock, and is 2 flops/clock for AMD, which would correspond
          to 100% and 200% respectively.
 gcc4/3 : (gcc 4 x87 performance) / (gcc 3 x87 performance), so < 1 indicates
          slowdown, > 1 indicates speedup

NOTES:
 (1) Pentium 4 is a model=2, while Pentium 4E is model=3.
 (2) PPRO, PIII & P4e get bad % clock for double: this is because the static
     blocking factor in the benchmark (nb=60) exceeds the cache, which makes
     the gcc 4 #s look better than they are.
 (3) In general, the % peak achieved by this kernel is large enough that I
     think it is truly indicative of the computational efficiency of the
     generated code.

                   double           single
                --------------   ---------------
MACHINES        %CLOCK  gcc4/3   %CLOCK  gcc4/3
============    ======  ======   ======  ======
PentiumPRO        67.5    0.77     78.5    0.71
PentiumIII        47.6    0.95     81.4    0.69
Pentium 4         93.8    0.92     95.7    1.00
Pentium4e         72.8    0.75     80.4    0.80
Pentium-D         86.7    0.83     94.1    0.91
CoreDuo           85.8    1.01     94.9    1.11
Athlon-K7        137.8    0.62    139.1    0.63
Athlon-64 X2     160.0    0.58    165.5    0.60
Opteron          164.6    0.57    164.6    0.61

The CoreDuo numbers above were generated by me on an OS X machine, where I hand-translated Linux assembly to run, since I could not compile stock gccs. I have a request out for results from a guy who has Linux/CoreDuo, and when I get those I will update the results if necessary. At that time, I will also post an attachment with all the raw timing runs that I generated the table from. Thanks, Clint
Pure luck or not, this is a regression.
Created attachment 11773 [details] raw runs table is generated from As promised, here is the raw data I built the table out of, including a new run from the Linux/CoreDuo user, which does not materially change the table.
Created attachment 11777 [details]
An integer loop

I changed the loop from double to long long. The 64-bit code generated by gcc 4.0 is 10% slower than gcc 3.4 on Nocona:

/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -o xmm_gcc mmbench.o gemm_atlas.o
rm -f *.o
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -o xmm_gc4 mmbench.o gemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"
GCC 3.x performance:
./xmm_gcc
ALGORITHM    NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
atlasmm       60    250       0.381      283.51
echo "GCC 4.x performance:"
GCC 4.x performance:
./xmm_gc4
ALGORITHM    NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
atlasmm       60    250       0.389      277.68

gnu-16:pts/2[5]> make ~/bugs/gcc/27827/loop
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xmm_gcc mmbench.o gemm_atlas.o
rm -f *.o
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xmm_gc4 mmbench.o gemm_atlas.o
rm -f *.o
echo "GCC 3.x performance:"
GCC 3.x performance:
./xmm_gcc
ALGORITHM    NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
atlasmm       60   1000       0.172     2512.01
echo "GCC 4.x performance:"
GCC 4.x performance:
./xmm_gc4
ALGORITHM    NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
atlasmm       60   1000       0.193     2238.68

So the problem may also be loop-related.
Guys,

If you are looking for the reason that the new code might be slower, my feeling from the benchmark data is that it involves hiding the cost of the loads. Notice that, except for the cases where the double exceeds the cache, the single precision gcc 4 code always gets a greater percentage of gcc 3's numbers than double does for each platform. This is the opposite of what you would expect if the problem were purely computational, but exactly what you would expect if the problem is due to memory costs (since single has half the memory cost).

If I were forced to take a WAG as to what's going on, I would guess it has to do with the extra dependencies in the new code sequence confusing Tomasulo's algorithm or register renaming. I haven't worked it out in detail, but scope the two competing code sequences:

      gcc 3                    gcc 4
  ==================       ==================
  fldl   32(%edx)          fldl   32(%edx)
  fldl   32(%eax)          fld    %st(0)
  fmul   %st(1),%st        fmull  32(%eax)
  faddp  %st,%st(6)        faddp  %st, %st(2)

Note that in gcc 3, both loads are independent, and can be moved past each other and arbitrarily early in the instruction stream. The fmull would need to be broken into two instructions before a similar freedom occurs. I'm not sure how the fp stack handling is done in hardware, but the fact that you've replaced two independent loads with 3 forced-order instructions cannot be beneficial. At the same time, it is difficult for me to see how the new sequence can be better. We've got the same number of loads, the same number of instructions, the same register use (I think), with a forced ordering and loads you cannot advance (critical in load-happy 8-register land). I originally thought that the gcc 4 stream used one less register, but it appears to copy the edx operand twice to stack, so I'm no longer sure it has even that advantage?

Just my guess,
Clint
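To make the dependency argument concrete, here is a hypothetical C simplification of the kind of register-blocked multiply-accumulate such a kernel compiles down from (the function name and the 4-element size are illustrative, not the actual ATLAS code): each product chain is independent, so its two loads can be issued early and out of order, which is exactly the freedom the gcc 3 two-load sequence preserves and the gcc 4 fld %st(0)/fmull sequence gives up.

```c
#include <assert.h>

/* Hypothetical 4-element simplification of a register-blocked
   multiply-accumulate (not the actual ATLAS kernel).  Each c_i chain
   is independent, so the loads of a[i] and b[i] can be hoisted and
   reordered freely by the hardware. */
static double dot4(const double *a, const double *b)
{
    double c0 = a[0] * b[0];
    double c1 = a[1] * b[1];
    double c2 = a[2] * b[2];
    double c3 = a[3] * b[3];
    /* balanced final reduction keeps the add chains short too */
    return (c0 + c1) + (c2 + c3);
}
```
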
Guys, The integer and fp differences do not appear to be strongly related. In particular, on my P4e, gcc 4's integer code is actually faster than gcc 3's. Further, if you look at the assemblies of the integer code, it does not have the extra dependencies that gcc 4's x87 code has. In integer, both gcc 3 and 4 explicitly do all loads to registers. I haven't scoped it in detail, but the main difference appears to be in scheduling, with gcc 3 performing a bunch of loads, then a bunch of computations, and gcc 4 intermixing them more. So, we'd need a new series of runs to see which integer schedule is better, but the integer code should not be studied to solve the x87 problem. Thanks, Clint
Can you try this patch? My only i686 machine is neutral to this problem. I'm a bit worried about the Core Duo thing, but my hope is that other changes between GCC 3 and GCC 4 improved performance on all machines, and Core Duo is the only processor that does not see the performance loss introduced by "fld %st".

I'm currently bootstrapping and regtesting the patch; a minimal testcase is here:

/* { dg-do compile } */
/* { dg-options "-O2" } */
double a, b;
double f(double c)
{
  double x = a * b;
  return x + c * a;
}
/* { dg-final { scan-assembler-not "fld\[ \t\]*%st" } } */
/* { dg-final { scan-assembler "fmul\[ \t\]*%st" } } */

Without patch:

	fldl	a
	fld	%st(0)
	fmull	b
	fxch	%st(1)
	fmull	4(%esp)
	faddp	%st, %st(1)
	ret

With patch:

	fldl	a
	fldl	4(%esp)
	fmul	%st(1), %st
	fxch	%st(1)
	fmull	b
	faddp	%st, %st(1)
	ret

Index: i386.md
===================================================================
--- i386.md	(revision 115412)
+++ i386.md	(working copy)
@@ -18757,6 +18757,32 @@
   [(set_attr "type" "sseadd")
    (set_attr "mode" "DF")])
 
+;; Make two stack loads independent:
+;;	fld aa			fld aa
+;;	fld %st(0)	->	fld bb
+;;	fmul bb			fmul %st(1), %st
+;;
+;; Actually we only match the last two instructions for simplicity.
+(define_peephole2
+  [(set (match_operand 0 "fp_register_operand" "")
+	(match_operand 1 "fp_register_operand" ""))
+   (set (match_dup 0)
+	(match_operator 2 "binary_fp_operator"
+	   [(match_dup 0)
+	    (match_operand 3 "memory_operand" "")]))]
+  "REGNO (operands[0]) != REGNO (operands[1])"
+  [(set (match_dup 0) (match_dup 3))
+   (set (match_dup 0) (match_dup 4))]
+
+  ;; The % modifier is not operational anymore in peephole2's, so we have to
+  ;; swap the operands manually in the case of addition and multiplication.
+  "if (COMMUTATIVE_ARITH_P (operands[2]))
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+				   operands[0], operands[1]);
+   else
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+				   operands[1], operands[0]);")
+
 ;; Conditional addition patterns
 (define_expand "addqicc"
   [(match_operand:QI 0 "register_operand" "")
Paolo,

Thanks for the update. I attempted to apply this patch, but apparently I failed, as it made absolutely no difference. I mean, not only did it not change performance, but if you diff the assembly, you get only 4 lines different (version numbers and use of ffreep rather than fstp). Here is what I did:

>    59  10:29  cd gcc-4.1.1/
>    60  10:30  pushd gcc/config/i386/
>    62  10:30  patch < ~/x87patch
>    64  10:31  cd ../../..
>    67  10:31  mkdir MyObj
>    68  10:31  cd MyObj/
>    71  10:32  ../configure --prefix=/home/whaley/local/gcc4.1.1p1 --enable-languages=c,fortran
>    72  10:32  make
>    73  10:58  make install

I did this on my P4e (IA32) and Athlon64 X2 (x86-64) machines. I did have to hand-edit the patch, due to line breaks in mouse-copying from the webpage (it wouldn't apply until I did that), so maybe that is the problem.

Can you grab the mmbench4s.tar.gz attachment, point its Makefile at your modified compiler, tell it "make assall", and see if the generated dmm_4.s and smm_4.s are different from what you get with stock 4.1.1? If so, post them as attachments, and I can probably hack the benchmark to load the assembly, as I did on the Core. Assuming they are different, maybe you can check that this is the only patch I need to make? If it is, is there something wrong with the way I applied it? If not, maybe you should post the patch file as an attachment so we can rule out copying error . . .

Thanks,
Clint
It works for me.

GCC 4.x   double  60  1000  0.208  2076.79
GCC patch double  60  1000  0.168  2571.28
GCC 4.x   single  60  1000  0.188  2297.74
GCC patch single  60  1000  0.152  2841.94

Assembly changes are as follows: < is without my patch, > is with it.

21,22c21,22
< 	fld	%st(0)
< 	fmuls	(%eax)
---
> 	flds	(%eax)
> 	fmul	%st(1), %st
25,26c25,26
< 	fld	%st(2)
< 	fmuls	240(%eax)
---
> 	flds	240(%eax)
> 	fmul	%st(3), %st
28,29c28,29
< 	fld	%st(3)
< 	fmuls	480(%eax)
---
> 	flds	480(%eax)
> 	fmul	%st(4), %st
Paolo, Can you post the assembly and the patch as attachments? If necessary, I can hack the benchmark to call the assembly routines on a couple of platforms. Also, did you see what I did wrong in applying the patch? Thanks, Clint
Created attachment 12019 [details]
MMBENCH4s.tar.gz + assembly without and with patch

I don't know what was wrong, but you can now fetch the patch yourself from http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00113.html

Anyway, here's your .tar.gz now including the .s files (and the Makefile points to my gcc's). ?mm_3.s is the unpatched GCC 4.2, ?mm_4.s is the patched one.
Created attachment 12020 [details]
new Makefile targets

OK, this is the same benchmark again, now creating a MMBENCHS directory. In addition to the ability to make single & double, it also has the ability to build executables from assembly files (see the "asgexe" target of the Makefile).
Paolo,

Thanks for working on this. We are making progress, but I have some mixed results. I timed the assemblies you provided directly. I added a target "asgexe" that builds the same benchmark, assuming assembly source instead of C, to make this more reproducible.

I ran on the Athlon-64X2, where your new assembly ran *faster* than gcc 3 for double precision. However, you still lost for single precision. I believe the reason is that you still have more fmuls/fmull (fmul from memory) than does gcc 3:

>animal>fgrep -i fmuls smm_4.s | wc
>    240     480    4051
>animal>fgrep -i fmuls smm_asg.s | wc
>     60     120    1020
>animal>fgrep -i fmuls smm_3.s | wc
>      0       0       0
>animal>fgrep -i fmull dmm_4.s | wc
>    100     200    1739
>animal>fgrep -i fmull dmm_asg.s | wc
>     20      40     360
>animal>fgrep -i fmuls dmm_3.s | wc
>      0       0       0

I haven't really scoped out the dmm diff, but in single precision anyway, these dreaded fmuls are in the inner loop, and this is probably why you are still losing. I'm guessing your peephole is missing some cases, and for some reason is missing more under single. Any ideas? As for your assembly actually beating gcc 3 for double, my guess is that it is some other optimization that gcc 4 has, and you will win by even more once the final fmull are removed . . .

On the P4e, your double precision code is faster than stock gcc 4, but still slower than gcc 3; again, I suspect the remaining fmull. Then comes the thing I cannot explain at all: your single precision results are horrible. gcc 3 gets 1991 MFLOPS, gcc 4 gets 1664, and the assembly you sent gets 34! No chance the mixed fld/fmuls is causing stack overflow, I guess? I think that might account for such a catastrophic drop . . . That's about the only WAG I've got for this behavior.

Anyway, I think the first order of business may be to get your peephole grabbing all the cases, see if that makes you win everywhere on Athlon and makes single precision P4e better, and we can go from there . . .
If you do that, attach the assemblies again, and I'll redo timings. Also, if you could attach (not put in comment) the patch, it'd be nice to get the compiler, so I could test x86-64 code on Athlon, etc. Thanks, Clint
I don't see how the last fmul[sl] can be removed without increasing code size. The only way to fix it would be to change the machine description to say that "this processor does not like FP operations with a memory operand". With a peephole, this is as good as we can get it. The last fmul is not coupled with a "fld %st" because it consumes the stack entry. See in comment #30, where there is still a "fmull b". Can you please try re-running the tests? It takes skill^W^W seems quite weird to have a 100x slow-down, also because my tests were run on a similar Prescott (P4e). It also would be interesting to re-run your code generator on a compiler built from svn trunk. If it can provide higher performance, you'd be satisfied I guess even if it comes from a different kernel. Also, I strongly believe that you should implement vectorization, or at least find out *why* GCC does not vectorize your code. It may be simply that it does not have any guarantee on the alignment.
Paolo,

Thanks for all the help. I'm not sure I understand everything perfectly, though, so there are some questions below . . .

> I don't see how the last fmul[sl] can be removed without increasing code size.

Since the flags are asking for performance, not size optimization, this should only be an argument if the fmul[s,l]'s are performance-neutral. A lot of performance optimizations increase code size, after all . . . Obviously, having no fmul[sl] is possible, since gcc 3 achieves it. However, I can see that the peephole phase might not be able to change the register usage.

> Can you please try re-running the tests? It takes skill^W^W

Yes, I found the results confusing as well, which is why I reran them 50 times before posting. I also posted the tarfile (with Makefile and assemblies) that built them, so that my mistakes could be caught by someone with more skill. Just as a check, maybe you can confirm the .s you posted is the right one? I can't find the loads of the matrix C anywhere in its assembly, and I can find them in the double version . . . Anyway, I like your suggestion (below) of getting the compiler so we won't have to worry about assemblies, so that's probably the way to go. On this front, is there some reason you cannot post the patch(es) as attachments, just to rule out copy problems, as I've asked in the last several messages? Note there's no need if I can grab your stuff from SVN, as below . . .

> because my tests were run on a similar Prescott (P4e)

You didn't post the gcc 3 performance numbers. What were those like? If you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big deal. If gcc 3 is still winning, on the other hand . . .

> It also would be interesting to re-run your code generator on a compiler built from svn trunk.

Are your changes on a branch I could check out? If so, give me the commands to get that branch, as we are scoping assemblies only because of the patching problem.
Having a full compiler would indeed enable more detailed investigations, including loosing the full code generator on the improved compiler.

> Also, I strongly believe that you should implement vectorization,

ATLAS implements vectorization, by writing the entire GEMM kernel in assembly and directly using SSE. However, there are cases where generated C code must be called, and that's where gcc comes in . . .

> or at least find out *why* GCC does not vectorize your code. It may be simply that it does not have any guarantee on the alignment.

I'm all for this. info gcc says that without a guarantee of alignment, loops are duplicated, with an if selecting between the vector and scalar loops; is this not accurate? I spent a day trying to get gcc to vectorize any of the generator's loops, and did not succeed (can you make it vectorize the provided benchmark code?). I also tried various unrollings of the inner loop, particularly no unrolling and unroll=2 (the vector length). I was unable to truly decipher the warning messages explaining the lack of vectorization, and I would truly welcome some help in fixing this.

This is a separate issue from the x87 code, and this tracker item is already fairly complex :) I'm assuming that if I attempted to open a bug report of "gcc will not vectorize ATLAS's generated code" it would be closed pretty quickly. Maybe you can recommend how to approach this, or open another report that we can exchange info on? I would truly appreciate the opportunity to get some feedback from gcc authors to help guide me to solving this problem.

Thanks for all the info,
Clint
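The loop duplication described in the manual amounts to runtime alignment versioning; a schematic sketch of the idea in C (the function name and the 16-byte test are illustrative, not GCC's actual generated code):

```c
#include <stdint.h>

/* Schematic of alignment-based loop versioning: the loop body is
   duplicated and a runtime test selects which copy runs.  This is an
   illustration of the concept, not what GCC emits. */
static void scale2(float *x, const float *y, int n)
{
    int i;
    if ((((uintptr_t)x | (uintptr_t)y) & 15) == 0) {
        /* both pointers 16-byte aligned: candidate for the vector loop */
        for (i = 0; i < n; i++)
            x[i] = 2.0f * y[i];
    } else {
        /* scalar fallback loop */
        for (i = 0; i < n; i++)
            x[i] = 2.0f * y[i];
    }
}
```

The cost of the extra branch is paid once per loop invocation, which is why the vectorizer prefers a compile-time alignment guarantee when it can get one.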
Paolo, OK, never mind about all the questions on assembly/patches/SVN/gcc3 perf: I checked out the main branch, and vi'd the patched file, and I see that your patch is there. I am presently building the SVN gcc on several machines, and will be posting results/issues as they come in . . . I would still be very interested in advice on approaching the vectorization problem as discussed at the end of the mail. Thanks, Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

>> I don't see how the last fmul[sl] can be removed without increasing code size.
>
> However, I can see that the peephole phase might not be able to change
> the register usage.

Actually, the peephole phase may not change the register usage, but it could use a scratch register if available. But it would be much more controversial (even if backed by your hard numbers on ATLAS) to state that splitting fmul[sl] into fld[sl]+fmul is always beneficial, unless there is some manual telling us exactly that... for example, it would be a different story if it could give higher scheduling freedom (stuff like VectorPath vs. DirectPath on Athlons), and if we could figure out on which platforms it improves performance.

> On this front, is there some reason you cannot post the patch(es) as
> attachments, just to rule out copy problems, as I've asked in last several
> messages? Note there's no need if I can grab your stuff from SVN, as below . . .

You already found out about this :-P Unfortunately I mistyped the PR number when I committed the patch; I meant the commit to appear in the audit trail, so that you'd have seen that I had committed it.

>> because my tests were run on a similar Prescott (P4e)
>
> You didn't post the gcc 3 performance numbers. What were those like? If
> you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
> deal. If gcc 3 is still winning, on the other hand . . .

I don't have GCC 3 on that machine.

Paolo
Paolo,

> Actually, the peephole phase may not change the register usage, but it
> could peruse a scratch register if available. But it would be much more
> controversial (even if backed by your hard numbers on ATLAS) to state
> that splitting fmul[sl] to fld[sl]+fmul is always beneficial, unless

We'll have to see how this is in x87 code. I have experience with it in SSE, where doing it is fully a target issue. For instance, the P4E likes you to avoid the explicit load at the end, where the Hammer prefers the explicit load. If I recall right, there is a *slight* advantage on the Intel to the from-memory instruction, but I can't remember how much difference doing the separate load/use made on the AMD. We should get some idea by comparing gcc 3 vs. your patched compiler on the various platforms, though other gcc 3/4 changes will cloud the picture somewhat . . .

If this kind of machine difference in optimality holds true for x87 as well, I assume a new peephole phase that looks for the scratch register could be called if the appropriate -march were thrown?

Speaking of -march issues, when I get a compiler build that gens your new code, I will pull the assembly trick to try it on the CoreDuo as well. If the new code is worse, you can probably not call your present peephole if that -march is thrown?

Thanks,
Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> We should get some idea by comparing gcc3 vs. your patched compiler on the
> various platforms, though other gcc3/4 changes will cloud the picture
> somewhat . . .

That's why you should compare 4.2 before and after my patch, instead.

> If this kind of machine difference in optimality holds true for x87 as well,
> I assume a new peephole phase that looks for the scratch register could be
> called if the appropriate -march were thrown?

Or you can disable the fmul[sl] instructions altogether.

> Speaking of -march issues, when I get a compiler build that gens your new
> code, I will pull the assembly trick to try it on the CoreDuo as well. If the
> new code is worse, you can probably not call your present peephole if that
> -march is thrown?

I'd find it very strange. It is more likely that the Core Duo has a more powerful scheduler (maybe the micro-op fusion thing?) that does not dislike fmul[sl].
> I'm all for this. info gcc says that w/o a guarantee of alignment, loops are
> duped, with an if selecting between vector and scalar loops, is this not
> accurate?

Yes.

> I spent a day trying to get gcc to vectorize any of the generator's loops,
> and did not succeed (can you make it vectorize the provided benchmark code?).

The aggressive unrolling in the provided example seems to be the first obstacle to vectorizing the code.

> I also tried various unrollings of the inner loop, particularly no unrolling
> and unroll=2 (vector length). I was unable to truly decipher the warning
> messages explaining the lack of vectorization, and I would truly welcome some
> help in fixing this.

I'd be happy to help decipher the vectorizer's dump file. Please send the un-unrolled version and the dump file generated by -fdump-tree-vect-details, and I'll see if I can help.
Guys,

OK, the mystery of why my hand-patched gcc didn't work is now cleared up. My first clue was that neither did the SVN-built gcc! It turns out your peephole opt is only done if I throw the flag -O3 rather than -O, which is what my tarfile used. Any reason it's done only at the high levels, since it makes such a performance difference? FYI, in gcc 3, -O gets better performance than -O3, which is why those are my default flags. However, it appears that gcc 4 gets very nice performance with -O3. It's fairly common for -O to give better performance than -O3 (since the ATLAS code is already aggressively optimized, gcc's max optimization often de-optimizes an optimal code), so turning this on at the default level, or being able to turn it off and on manually, would be ideal . . .

> That's why you should compare 4.2 before and after my patch, instead.

Yeah, except 4.2 w/o your patch has horrible performance. Our goal is not to beat horrible performance, but rather to get good performance! Gcc 3 provides a measure of good performance. However, I take your point that it'd be nice to see the new stuff put a headlock on the crap performance, so I include that below as well :)

Here's some initial data. I report MFLOPS achieved by the kernel as compiled by: gcc3 (usually gcc 3.2 or 3.4.3), gccS (current SVN gcc), and gcc4 (usually gcc 4.1.1). I will try to get more data later, but this is pretty suggestive, IMHO.

                      DOUBLE           SINGLE
            PEAK  gcc3/gccS/gcc4  gcc3/gccS/gcc4
            ====  ==============  ==============
Pentium-D : 2800  2359/2417/2067  2685/2684/2362
Ath64-X2  : 5600  3677/3585/2102  3680/3914/2207
Opteron   : 3200  2590/2517/1507  2625/2800/1580

So, it appears to me we are seeing the same pattern I previously saw in my hand-tuned SSE code: Intel likes the new pattern of doing the last load as part of the FMUL instruction, but AMD is hampered by it. Note that gccS is the best compiler for both single & double on the Intel.
On both AMD machines, however, it wins only for single, where the cost of the load is lower. It loses to gcc 3 for double, where load performance more completely determines matmul performance. This is consistent with the view that gcc 4 does some other optimizations better than gcc 3, and so if we got the fldl removed, gcc 4 would win for all precisions . . .

Don't get me wrong, your patch has already removed the emergency: in the worst case so far you are less than 3% slower. However, I suspect that if we added the optional (for AMD chips only) peephole step to get rid of all possible fmul[s,l], then we'd win for double, and win even more for single on AMD chips . . . So, any chance of an AMD-only or flag-controlled peephole step to get rid of the last fmul[s,l]?

> Or you can disable the fmul[sl] instructions altogether.

As I mentioned, my own hand-tuning has indicated that the final fmul[sl] is good for Intel netburst archs, but bad for AMD hammer archs.

I'll see about posting some vectorization data ASAP. Can someone create a new bug report so that the two threads of inquiry don't get mixed up, or do you want to just intermix them here?

Thanks,
Clint

P.S.: I tried to run this on the Core by hand-translating gccS-genned assembly to OS X assembly. The double precision gccS runs at the same speed as Apple's gcc. However, the single precision is an order of magnitude slower, as I experienced this morning on the P4E. This is almost certainly an error in my makefile, but damned if I can find it.
Guys,

OK, with Dorit's -fdump-tree-vect-details, I made a little progress on vectorization. In order to get vectorization to work, I had to add the flag '-funsafe-math-optimizations'. I will try to create a tarfile with everything tomorrow so you guys can see all the output, but is it normal to need to throw this to get vectorization? SSE is IEEE compliant (unless you turn it off), and ATLAS needs to stay IEEE, so I can't turn on unsafe-math-opt in general . . .

With these flags, gcc can vectorize the kernel if I do no unrolling at all. I have not yet run the full search with these flags, but I've done quite a few hand-called cases, and the performance is lower than either the x87 (best) or scalar SSE for double on both the P4E and Ath64X2. For single precision, there is a modest speedup over the x87 code on both systems, but the total is *way* below my assembly SSE kernels. I just quickly glanced at the code, and I see that it never uses "movapd" from memory, which is a key to getting decent performance. ATLAS ensures that the input matrices (A & B) are 16-byte aligned. Is there any pragma/flag/etc. I can set that says "pointer X points to data that is 16-byte aligned"?

Thanks,
Clint
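For what it's worth, one alignment guarantee GCC does let you state is on objects the compiler itself allocates, via the aligned variable attribute; a minimal sketch (the array name and size here are made up for illustration, and this does not answer the harder question above, which is about arbitrary incoming pointers):

```c
#include <stdint.h>

/* Sketch: GCC's aligned attribute guarantees 16-byte alignment of
   objects the compiler can see.  The array name and size are
   illustrative; this does NOT assert alignment of a pointer argument,
   which is what the vectorizer would need for the kernel's operands. */
static float A[60 * 60] __attribute__((aligned(16)));

static int is_16byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```
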
In the x86/x86-64 world one can be almost sure that a load+execute instruction pair will execute (marginally to noticeably) faster than a move+load-and-execute instruction pair, as the more complex instructions are harder for on-chip scheduling (they retire later).

Perhaps we can move such a transformation somewhere more generic, perhaps to post-reload copyprop?

Honza
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginaly to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
                                  ^^^ retirement filling up the scheduler easily.

> Perhaps we can move such a transformation somewhere more generically perhaps to
> post-reload copyprop?
>
> Honza
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginaly to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).

Yes, so far so good, and this part has already been committed. But does a *single* load-and-execute instruction execute faster than the two instructions in a load+execute sequence?
Paolo,

> Yes, so far so good and this part has already been committed. But does
> a *single* load-and-execute instruction execute faster than the two
> instructions in a load+execute sequence?

As I said, in my hand-tuned SSE assembly experience, which is faster depends on the architecture. In particular, netburst and Core do well with the final fmul[ls], and other archs do not. My guess is that netburst and Core probably crack this single instruction in two during decode, which allows the implicit load to be advanced, but with less instruction load. I think other architectures do not split the instruction during decode, which means that Tomasulo's algorithm cannot advance the load due to dependencies, which makes the separate instructions faster, even in the face of the extra instruction.

If you can give me a patch that makes gcc call a new peephole opt getting rid of the final fmul[sl] only when a certain flag is thrown, I will see if I can't post timings across a variety of architectures using both ways, so we can see whether my SSE experience holds true for x87, and how strong the performance benefit is for various architectures. This will allow us to evaluate how important getting this choice right is, what the default state should be, and how we should vary it according to architecture.

My own theoretical guess is that if you *have* to pick one behavior, surely separate instructions are better: on systems with the cracking, the extra instruction at worst eats up some memory and a bit of decode bandwidth, which on most machines is not critical. On the other hand, having a non-advancable load is pretty bad news on systems w/o the cracking ability. The proposed timings could demonstrate the accuracy of this guess.

As I mentioned, and I *think* Jan echoed, for the case you have already fixed, the peephole's way should be the default, even at low optimization: there's no extra instruction in this peephole, it is better everywhere we've timed, and I see no way in theory for the first sequence to be better.
Thanks, Clint
Guys,

I've been scoping this a little closer on the Athlon64X2. I have found that the patched gcc can achieve as much as 93% of theoretical peak (5218 Mflop on a 2800MHz Athlon64X2!) for in-cache matmul when the code generator is allowed to go to town. That at least ties the best I've ever seen for an x86 chip, and what it means is that on this architecture, the x87 unit can be coaxed into beating the SSE unit *even when the SSE instructions are fully vectorized* (for double precision only, of course: vector single precision SSE has twice the theoretical peak of x87). This also means that ATLAS should get a real speed boost when the new gcc is released, and other fp packages have the potential to do so as well.

So, with this motivation, I edited the genned assembly, and made the following change by hand in ~30 different places in the kernel assembly:

> #ifdef FMULL
>    fmull 1440(%rcx)
> #else
>    fldl 1440(%rcx)
>    fmulp %st,%st(1)
> #endif

To my surprise, on this arch, using the fldl/fmulp pair caused a performance drop. So, either my SSE experience does not necessarily translate to x87, or the Opteron (where I did the SSE tuning) is subtly different from the Athlon64X2, or my memory of the tuning is faulty. Just as a check, Paolo: is this the peephole you would do?

Anyway, doing this by hand is too burdensome to make widespread timings feasible, so if you'd like to see that, I'll need a gcc patch to do it automatically . . .

Cheers,
Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> I've been scoping this a little closer on the Athlon64X2. I have found that
> the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
> 2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
> go to town.

Not unexpected. The code was so tightly tuned for GCC 3, and so big were the changes between GCC 3 and 4, that you were comparing sort of apples to oranges. It could be interesting to see which different optimizations are performed by your code generator for GCC 3 vs. GCC 4.

>> fmull 1440(%rcx)
>> #else
>> fldl 1440(%rcx)
>> fmulp %st,%st(1)
>> #endif

In some sense, this is the peephole I would rather *not* do. But the answer is yes. :-)

So, do you now agree that the bug would be fixed if the patch that is in GCC 4.2 were backported to GCC 4.1 (so that your users can use that)? And do you still see the abysmal x87 single-precision FP performance?

Thanks!
Paolo,

> In some sense, this is the peephole I would rather *not* do. But the answer is yes. :-)

Ahh, got it :)

> So, do you now agree that the bug would be fixed if the patch that is in GCC 4.2 was backported to GCC 4.1 (so that your users can use that)?

Well, much as I might like to deny it, yes, I must agree the bug is fixed :) I think there might still be more performance to get, and initial timings show that 4 may be slower than 3 on some systems. However, it will also clearly be faster than 3 on some (so far, most) systems, and so far, it is competitive everywhere, so not even I can call that a performance bug :) And yes, getting it into the next gcc release would be very helpful for ATLAS.

> And do you still see the abysmal x87 single-precision FP performance?

No, the problems were the same for both precisions. I haven't retimed all the systems, but here are the numbers I do have for the benchmark:

                      DOUBLE           SINGLE
            PEAK  gcc3/gccS/gcc4  gcc3/gccS/gcc4
            ====  ==============  ==============
Pentium-D : 2800  2359/2417/2067  2685/2684/2362
Ath64-X2  : 5600  3681/4011/2102  3716/4256/2207
Opteron   : 3200  2590/2517/1507  2625/2800/1580
P4E       : 2800  1767/1754/1480  1914/1954/1609
PentiumIII:  500   239/ 238/ 225   407/ 393/ 283

As you can see, on the benchmark, the single precision numbers are better than the double now. I cannot get single precision to run at quite the impressive 93% of peak that double reaches when exercising the code generator on the Ath64-X2, but it gets a respectable 85% of peak (at these levels of performance, it takes only very minor differences to drop from 93 to 85, so that's not that unexpected: I am still investigating this).

Thanks for all the help,
Clint
Created attachment 12047 [details] benchmark with vectorizable kernel
Dorit,

OK, I've posted a new tarfile with a safe kernel code where the loop is not unrolled, so that the vectorizer has a chance. With this kernel I can make it vectorize code, but only if I throw the -funsafe-math-optimizations flag. This kernel doesn't use a lot of registers, so it should work for both x86-32 and x86-64 archs.

I would expect the vectorized code to beat the x87 in both precisions on the P4E (vector SSE has two and four times the peak of x87, respectively), and to beat the x87 code in single precision on the Ath64 (twice the peak). So far, vectorization is never a win on the P4E, but I can make single precision win on the Ath64. On both platforms, examining the assembly confirms that there are loops in there that use the vector instructions. Once I understand better what's going on, maybe I can improve this . . .

Here are some questions I need to figure out:
(1) Why do I have to throw the -funsafe-math-optimizations flag to enable this?
    -- I see where the .vect file warns of it, but it refers to an SSA line, so I'm not sure what's going on.
    -- ATLAS cannot throw this flag, because it enables non-IEEE fp arithmetic, and ATLAS must maintain IEEE compliance. SSE itself does *not* require ruining IEEE compliance.
    -- Let me know if there is some way in the code that I can avoid this prob
    -- If it cannot be avoided, is there a way to make this optimization controlled by a flag that does not mean a loss of IEEE compliance?
(2) Is there any pragma or assertion, etc., that I can put in the code to notify the compiler that certain pointers point to 16-byte aligned data?
    -- Only the output array (C) is possibly misaligned in ATLAS

Thanks,
Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> Here's some questions I need to figure out:
> (1) Why do I have to throw the -funsafe-math-optimizations flag to
> enable this?
> -- I see where the .vect file warns of it, but it refers to an SSA line,
> so I'm not sure what's going on.

This flag is needed in order to allow vectorization of reduction (summation, in your case) of floating-point data. This is because vectorization of a reduction changes the order of the computation, which may result in different behavior. Instead of summing this way:

  ((((((a0+a1)+a2)+a3)+a4)+a5)+a6)+a7

we sum this way:

  (((a0+a2)+a4)+a6) + (((a1+a3)+a5)+a7)

> (2) Is there any pragma or assertion, etc, that I can put in the code to
> notify the compiler that certain pointers point to 16-byte aligned data?
> -- Only the output array (C) is possibly misaligned in ATLAS

Not really, I'm afraid - there is something that's not entirely supported in gcc yet - see details in PR20794.

dorit
Dorit,

>This flag is needed in order to allow vectorization of reduction (summation
>in your case) of floating-point data.

OK, but this is a baaaad flag to require. From the computational scientist's point of view, there is a *vast* difference between reordering (which many aggressive optimizations imply) and failing to have IEEE compliance. Almost no computational scientist will use non-IEEE code (because you have essentially no idea if your answer is correct), but almost all will allow reordering. So, it is really important to separate the non-IEEE optimizations from the IEEE compliant ones. If vectorization requires me to throw a flag that says it causes non-IEEE arithmetic, I can't use it, and neither can anyone other than, AFAIK, some graphics guys.

IEEE is the "contract" between the user and the computer that bounds how much error there can be, and allows the programmer to know if a given algorithm will produce a usable result. Non-IEEE is therefore the death-knell for having any theoretical or a priori understanding of accuracy. So, while reordering and non-IEEE may both seem unsafe, a reordering just gives different results, which are still known to be within normal fp error, while non-IEEE means there is no contract with the programmer at all, and indeed the answer may be arbitrarily bad. Further, behavior under exceptional conditions is not maintained, and so the answer may actually be undetectably nonsensical, not merely inaccurate. Having an oddly colored pixel doesn't hurt the graphics guy, but sending a satellite into the atmosphere, or registering cancer on a clean MRI, is rather more serious . . .

So, mixing the two transformation types in one flag means that vectorization is unusable to what must be the majority of its audience. Maybe I should open this as another bug report: "flag mixes normal and catastrophic optimizations"?

>Not really, I'm afraid - there is something that's not entirely supported
>in gcc yet - see details in PR20794

Hmm.
I'd tried the __attribute__, but I must have mistyped it, because it didn't work before on pointers. However, it just did in the MMBENCHV tarfile. Yet the code still didn't use aligned loads to access the vectors (using multiple movlpd/movhpd instead) . . . Even scarier, having the attribute calls does not change the generated assembly at all. Does the vectorization phase get this alignment info passed to it? Aligned loads can be as much as twice as fast as unaligned, and if you have to choose amongst loops in the midst of a deep loop nest, these factors can actually make vectorization a loser . . .

Thanks,
Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> ------- Comment #56 from whaley at cs dot utsa dot edu 2006-08-09 21:33 -------
> Dorit,
>
> >This flag is needed in order to allow vectorization of reduction (summation
> >in your case) of floating-point data.
>
> OK, but this is a baaaad flag to require. From the computational scientist's
> point of view, there is a *vast* difference between reordering (which many
> aggressive optimizations imply) and failing to have IEEE compliance. Almost no
> computational scientist will use non-IEEE code (because you have essentially no
> idea if your answer is correct), but almost all will allow reordering. So, it
> is really important to separate the non-IEEE optimizations from the IEEE
> compliant ones.

Except for the fact that IEEE-compliant fp does not allow reordering at all, except in some small cases. For example, (a + b) + (-a) is not the same as (a + (-a)) + b, so reordering will invalidate IEEE fp for large a and small b. Yes, maybe we should split out the option for unsafe fp reordering, but that is a different issue.

-- Pinski
Andrew,

>Except for the fact that IEEE-compliant fp does not allow reordering at all,
>except in some small cases. For example, (a + b) + (-a) is not the same as
>(a + (-a)) + b, so reordering will invalidate IEEE fp for large a and small b.
>Yes, maybe we should split out the option for unsafe fp reordering, but that
>is a different issue.

Thanks for the response, but I believe you are conflating two issues (as is this flag, which is why this is bad news). Different answers to the question "what is this sum" do not ruin IEEE compliance. I am referring to IEEE 754, which is a standard set of rules for storage and arithmetic for floating point (fp) on modern hardware. I am unaware of there being any rules on compilation. I.e., whether reorderings are allowed is beyond the standard. Rather, it is a set of rules that specifies, for floating point operations (flops), how rounding must be done, how overflow/underflow must be handled, etc. Perhaps there is another IEEE standard concerning compilation that you are referring to?

Now of course, floating point arithmetic in general (and IEEE-compliant fp in particular) is not associative, so indeed (a+b+c) != (c+b+a). However, both sequences are valid answers to "what are these 3 things summed up", and both are IEEE compliant if each addition is compliant. What non-IEEE means is that the individual flops are no longer IEEE compliant. This means that overflow may not be handled, or exceptional conditions may cause unknown results (e.g., divide by zero), and indeed we have no way at all of knowing what an fp add even means. An example of a non-IEEE optimization is using 3DNow! vectorization, because 3DNow! does not follow the IEEE standard (for instance, it handles overflow only by saturation, which violates the standard). SSE (unless you turn IEEE compliance off manually) is IEEE compliant, and this is why you see computational guys like myself using it, and not using 3DNow!.
To a computational scientist, non-IEEE is catastrophic, and "may change the answer" is not. "May change the answer" in this case simply means that I've got a different ordering, which is also a valid IEEE fp answer, and indeed may be a "better" answer than the original ordering (depending on the data; no way to know this w/o looking at the data). Non-IEEE means that I have no way of knowing what kind of rounding was done, how the flop was done, if underflow (or gradual overflow!) occurred, etc. It is for this reason that optimizations which are non-IEEE are a killer for computational scientists, and reorders are no big deal. In the first case you have no idea what has happened with the data, and in the second you have an IEEE-compliant answer, which has known properties.

It has been my experience that most compiler people (and I have some experience there, as I got my PhD in compilation) are more concerned with integer work, and thus are not experts on fp computation. I've done fp computational work for the majority of my research for the last decade, so I thought I might be able to provide useful input to bridge the camps, so to speak. In this case, I think that by lumping "cause different IEEE-compliant answers" in with "use non-IEEE arithmetic" you are preventing all serious fp users from utilizing the optimizations. Since vectorization is of great importance on modern machines, this is bad news.

Obviously, I may be wrong in what I say, but if reordering makes something non-IEEE, I'm going to have some students mad at me for teaching them the wrong stuff :) Has this made my point any clearer, or do you still think I am wrong? If I'm wrong, maybe you can point to the part of the IEEE standard that discusses orderings violating the standard (as opposed to the well-known fact that all implemented fp arithmetic is non-associative)?
After you do this, I'll have to dig up my copy of the thing, which I don't think I've seen in the last 2 years (but I did scope some of the books that cover it, and didn't find anything about compilation).

Thanks,
Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> Thanks for the response, but I believe you are conflating two issues (as is
> this flag, which is why this is bad news). Different answers to the question
> "what is this sum" does not ruin IEEE compliance. I am referring to IEEE 754,
> which is a standard set of rules for storage and arithmetic for floating point
> (fp) on modern hardware.

You are also confusing -funsafe-math-optimizations with -ffast-math. The latter is a "catch-all" flag that compiles as if there were no FP traps, infinities, NaNs, and so on. The former instead enables "unsafe" optimizations but not "catastrophic" optimizations -- if you consider meaningless results on badly conditioned matrices to not be catastrophic...

A more or less complete list of things enabled by -funsafe-math-optimizations includes:

Reassociation:
- reassociation of operations, not only for the vectorizer's sake but also in the unroller (see around line 1600 of loop-unroll.c)
- other simplifications, like a/(b*c) for a/b/c
- expansion of pow (a, b) into multiplications if b is an integer

Compile-time evaluation:
- more aggressive compile-time evaluation of floating-point expressions (e.g. cabs)
- less accurate modeling of overflow in compile-time expressions, for formats such as 106-bit mantissa long doubles

Math identities:
- expansion of cabs to sqrt (a*a + b*b)
- simplifications involving transcendental functions, e.g. exp (0.5*x) for sqrt (exp (x)), or x for tan(atan(x))
- moving terms to the other side of a comparison, e.g. a > 4 for a + 4 > 8, or x > -1 for 1 - x < 2
- assuming in-domain arguments of sqrt, log, etc., e.g. x for sqrt(x)*sqrt(x)
- in turn, this enables removing math functions from comparisons, e.g. x > 4 for sqrt (x) > 2

Optimization:
- strength reduction of a/b to a*(1/b), both as loop invariants and in code like vector normalization
- eliminating recursion for "accumulator"-like functions, i.e. f (n) = n + f(n-1)

Back-end operation:
- using x87 builtins for transcendental functions

There may be bugs, but in general these optimizations are safe for infinities and NaNs, but not for signed zeros or (as I said) for very badly conditioned data.

> I am unaware of there being any rules on compilation.

Rules are determined by the language standards. I believe that C mandates no reassociation; Fortran allows reassociation unless explicit parentheses are present in the source, but this is not (yet) implemented by GCC.

Paolo
Paolo,

Thanks for the explanation of what -funsafe is presently doing.

>You are also confusing -funsafe-math-optimizations with -ffast-math.

No, what I'm doing is reading the man page (the closest thing to a contract between gcc and me on what it is doing with my code):

| -funsafe-math-optimizations
|     Allow optimizations for floating-point arithmetic that (a) assume
|     that arguments and results are valid and (b) may violate IEEE or
|     ANSI standards.

The (b) in this statement prevents me, as a library provider who *must* be able to reassure my users that I have done nothing to violate the IEEE fp standard, from using this flag (don't get me wrong, there are plenty of violations of the standard that occur in hardware, but typically in well-understood ways by the scientists of those platforms, and in the less important parts of the standard). I can't even use it after verifying that no optimization has hurt the present code, because an optimization that violates IEEE could be added at a later date, or used on a system that I'm not testing on (e.g., on some systems it could cause 3DNow! vectorization).

>Rules are determined by the language standards. I believe that C
>mandates no reassociation; Fortran allows reassociation unless explicit
>parentheses are present in the source, but this is not (yet) implemented
>by GCC.

My precise point. There are *lots* of C rules that an fp guy could give a crap about (for certain types of fp kernels), but IEEE is pretty much inviolate. Since this flag conflates language violations (don't care) with IEEE (catastrophic), I can't use it. I cannot stress enough just how important IEEE is: it is the only contract that tells us what it means to do a flop, and gives us any way of understanding what our answer will be. Making vectorization depend on a flag that says it is allowed to violate IEEE is therefore a killer for me (and most knowledgeable fp guys).
This is ironic, since vectorization of sums (as in GEMM) is usually implemented as scalar expansion on the accumulators, and this not only produces an IEEE-compliant answer, but it is *more* accurate for almost all data. Thanks, Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

> Making vectorization depend on a flag that says it is allowed to violate IEEE
> is therefore a killer for me (and most knowledgable fp guys). This is ironic,
> since vectorization of sums (as in GEMM) is usually implemented as scalar
> expansion on the accumulators

In the case of GCC, it performs the transformation that Dorit explained. It may not produce an IEEE-compliant answer if there are zeros and you expect to see a particular sign for the zero.

> and this not only produces an IEEE-compliant answer

The IEEE standard mandates particular rules for performing operations on infinities, NaNs, signed zeros, denormals, ... The C standard, by mandating no reassociation, ensures that you don't mess with NaNs, infinities, and signed zeros. As soon as you perform reassociation, there is *no way* you can be sure that you get IEEE-compliant math: +Inf + (1 / +0) = Inf, but +Inf + (1 / -0) = NaN.

> but it is *more* accurate for almost all data.

http://citeseer.ist.psu.edu/589698.html is an example of a paper that shows FP code that avoids accuracy problems. Any kind of reassociation will break that code and lower its accuracy. That's why reassociation is an "unsafe" math optimization.

If you want a -freassociate-fp-math flag, open an enhancement PR and somebody might be more than happy to separate reassociation from the other effects of -funsafe-math-optimizations. (Independent of this, you should also open a separate PR for ATLAS vectorization, because that would not be a regression and would not be on x87.) :-)

Paolo
Paolo,

>The IEEE standard mandates particular rules for performing operations on
>infinities, NaNs, signed zeros, denormals, ... The C standard, by
>mandating no reassociation, ensures that you don't mess with NaNs,
>infinities, and signed zeros. As soon as you perform reassociation,
>there is *no way* you can be sure that you get IEEE-compliant math.

No, again this is a conflation of the issues. You have IEEE-compliant math, but the differing orderings provide different summations of those values. It is an ANSI/ISO C rule being violated, not an IEEE one. Each individual operation is IEEE, and therefore both results are IEEE-compliant, but since the C rule requiring order has been broken, some codes will break. However, they break not because of a violation of IEEE, but because of a violation of ANSI/ISO C. I can certify whether my code can take this violation of ANSI/ISO C by examining my code. I cannot certify that my code works w/o IEEE by examining it, since that means a+b is now essentially undefined.

>http://citeseer.ist.psu.edu/589698.html is an example of a paper that
>shows FP code that avoids accuracy problems. Any kind of reassociation
>will break that code, and lower its accuracy. That's why reassociation
>is an "unsafe" math optimization.

Please note I never argued that it is safe. Violating the C usage rules is always unsafe. However, as explained above, I can certify my code for reordering by examination, but nothing helps an IEEE violation. My problem is lumping IEEE violations (such as 3DNow! vectorization, or turning on non-IEEE mode in SSE) in with C violations.

>If you want a -freassociate-fp math, open an enhancement PR and somebody

Ah, you mean like I asked about at the end of the 2nd paragraph of Comment #56?

>might be more than happy to separate reassociation from the other
>effects of -funsafe-math-optimizations.

What I'm arguing for is not lumping violations of ISO/ANSI C in with IEEE violations, but you are right that this would fix my particular case.
From what I see, -funsafe ought to be redefined as violating ANSI/ISO alone, and not mention IEEE at all. >(Independent of this, you should also open a separate PR for ATLAS >vectorization, because that would not be a regression and would not be >on x87) :-) You mean like I pleaded for in the last paragraph of Comment #38, but reluctantly shoved in here because that's what people seemed to want? :) Thanks, Clint
Subject: Re: [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

>> If you want a -freassociate-fp math, open an enhancement PR and somebody
>
> Ah, you mean like I asked about at the end of the 2nd paragraph of Comment #56?

>> (Independent of this, you should also open a separate PR for ATLAS
>> vectorization, because that would not be a regression and would not be
>> on x87) :-)
>
> You mean like I pleaded for in the last paragraph of Comment #38

Be bold. Don't ask, just open PRs if you feel an issue is separate. Go ahead now if you wish. Having them closed or marked as duplicates is not a problem, and it is much easier to track than cluttering an existing PR.

All these issues with ATLAS will not be visible to somebody looking for "known to fail" bug fixes in 4.2.0, because the original problem is now fixed in that version, and will soon be fixed in 4.1.1 too.
Slightly offtopic, but to put some numbers to comment #8 and comment #11, equivalent SSE code now reaches only 50% of x87 single performance and 60% of x87 double performance on AMD x86_64:

                                      ALGORITHM    NB   REPS      TIME    MFLOPS
                                      =========  ====  =====  ========  ========
[float]  -O2 -mfpmath=sse -march=k8:  atlasmm      60   1000     0.273   1582.66
[float]  -O2 -mfpmath=387 -march=k8:  atlasmm      60   1000     0.138   3130.91
[double] -O2 -mfpmath=sse -march=k8:  atlasmm      60   1000     0.252   1714.54
[double] -O2 -mfpmath=387 -march=k8:  atlasmm      60   1000     0.152   2842.55

This effect was first observed in PR19780.
Subject: Bug 27827

Author: bonzini
Date: Fri Aug 11 13:25:58 2006
New Revision: 116082

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116082

Log:
2006-08-11  Paolo Bonzini  <bonzini@gnu.org>

        PR target/27827
        * config/i386/i386.md: Add peephole2 to avoid "fld %st"
        instructions.

testsuite:
2006-08-11  Paolo Bonzini  <bonzini@gnu.org>

        PR target/27827
        * gcc.target/i386/pr27827.c: New testcase.

Added:
    branches/gcc-4_1-branch/gcc/testsuite/gcc.target/i386/pr27827.c
      - copied unchanged from r115969, trunk/gcc/testsuite/gcc.target/i386/pr27827.c
Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/testsuite/ChangeLog
(on bugzilla because I had problems sending mail to you)

> Just got your most recent update. From what I can tell, you have applied
> your patch to the 4.1 series, so that the next 4.1 release will have the fix?

Yes.

> So, my question is that I notice the comment says:
> * config/i386/i386.md: Add peephole2 to avoid "fld %st" instructions.
>
> Which, if it's what we've been doing, should be something like:
> * config/i386/i386.md: Add peephole2 to substitute "fld" for memory-source "fmul"

No, what my patch does is exactly replace "fld reg + fmul mem" with "fld mem + fmul reg,reg". Maybe the ChangeLog is not completely descriptive, but the PR number is there and will make things clear enough.

> BTW, is it going to remain the case that you must use at least -O2 to get
> this peephole invoked?

You can add -fpeephole2.
Uros,

>Slightly offtopic, but to put some numbers to comment #8 and comment #11,
>equivalent SSE code now reaches only 50% of x87 single performance and 60% of
>x87 double performance on AMD x86_64

FYI, you *may* get slightly better single precision SSE performance with these flags:

   -fomit-frame-pointer -march=athlon64 -O2 -mfpmath=sse \
   -msse -msse2 -msse3 -fargument-noalias-global

Also, when ATLAS is allowed to exercise the code generator to find the best kernel, gcc 4's double precision SSE could be made to almost tie gcc 3's x87 performance (gcc 3's double x87 performance is roughly 92% of the patched gcc 4's on this platform). However, single precision SSE, even allowing the code generator to go crazy, could only achieve about 2/3 of double *SSE* performance, and single precision performance is actually greater on x87 . . .

You can find some details at:
https://sourceforge.net/mailarchive/forum.php?thread_id=10026092&forum_id=426

Cheers,
Clint
(In reply to comment #23)

I read the discussion with a lot of interest - so here are the data for a Pentium-M:

GCC 3.x double performance:
./xdmm_gcc
ALGORITHM    NB   REPS      TIME    MFLOPS
=========  ====  =====  ========  ========
atlasmm      60   1000     0.281   1537.37

GCC 4.x double performance:
./xdmm_gc4
ALGORITHM    NB   REPS      TIME    MFLOPS
=========  ====  =====  ========  ========
atlasmm      60   1000     0.265   1630.19

GCC 3.x single performance:
./xsmm_gcc
ALGORITHM    NB   REPS      TIME    MFLOPS
=========  ====  =====  ========  ========
atlasmm      60   1000     0.281   1537.37

GCC 4.x single performance:
./xsmm_gc4
ALGORITHM    NB   REPS      TIME    MFLOPS
=========  ====  =====  ========  ========
atlasmm      60   1000     0.266   1624.06

> Here is the machine breakdown as measured now:
>
> LIKES GCC 4     DOESN'T CARE   LIKES GCC 3
> ===========     ============   ===========
> CoreDuo         Pentium 4      PentiumPRO
> Pentium III
> Pentium 4e
> Pentium D
> Athlon-64 X2
> Opteron

So I guess the first column gets another entry: Pentium M
The linked-to patch is already on the trunk.
Fixed; the 4.0 branch has now been closed.