With these compile options:

  -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
  -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp

and with this compiler:

euler-44% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2 --with-mpfr=/pkgs/gmp-4.2.2
Thread model: posix
gcc version 4.3.0 20071026 (experimental) [trunk revision 129664] (GCC)

With the following routine compiled with gcc-4.2.2 you get:

(time (direct-fft-recursive-4 a table))
    366 ms real time
    366 ms cpu time (366 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

while with today's mainline you get:

(time (direct-fft-recursive-4 a table))
    448 ms real time
    448 ms cpu time (448 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

I've isolated that one routine and I'll add it at the end of an attachment; unfortunately there are a lot of declarations and global data that are difficult to winnow. There is really only one main loop in the routine, the one that begins at ___L19_direct_2d_fft_2d_recursive_2d_4. This loop was scheduled in 102 cycles (sched2) on 4.2.2 and in 134 cycles on mainline.
Created attachment 14418 [details] .i file for fft routine
Can you attach assembler files? What happens if you use -O2? Why do you need -fno-strict-aliasing? Does -fno-ivopts help?
Created attachment 14423 [details] Assembly from 4.2.2
Created attachment 14424 [details] assembly from 4.3.0 I had to remove the "static" from the declaration of direct-fft-recursive to get assembly. (In the larger file the address of direct-fft-recursive is eventually put into an array.)
Created attachment 14425 [details] assembly after replacing -O1 with -O2
Created attachment 14426 [details] assembly after replacing -O1 with -O2
Time with -O2 instead of -O1:

with 4.2.2:
(time (direct-fft-recursive-4 a table))
    426 ms real time
    426 ms cpu time (425 user, 1 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

with 4.3.0:
(time (direct-fft-recursive-4 a table))
    433 ms real time
    433 ms cpu time (433 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

With -O1 -fno-ivopts:

with 4.2.2:
(time (direct-fft-recursive-4 a table))
    374 ms real time
    374 ms cpu time (374 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

with 4.3.0:
(time (direct-fft-recursive-4 a table))
    443 ms real time
    443 ms cpu time (443 user, 0 system)
    no collections
    64 bytes allocated
    1 minor fault
    no major faults

Why -fno-strict-aliasing: I don't need it for this particular routine, but the rest of the file is part of a bignum library that accesses the bignum digits as arrays of either 8-, 32-, or 64-bit unsigned ints, and it hasn't been rewritten to use unions of arrays. (This is part of the runtime system of a Scheme implementation, and there are other places that just cast pointers to achieve low-level things.)
Subject: Re: 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code On Oct 28, 2007, at 8:05 AM, rguenth at gcc dot gnu dot org wrote: > ------- Comment #2 from rguenth at gcc dot gnu dot org 2007-10-28 > 12:05 ------- > Can you attach assembler files? What happens if you use -O2? Why > do you need > -fno-strict-aliasing? Does -fno-ivopts help? I think I've answered your questions in the attachments and comments to the PR. Brad
The main difference I see is that 4.2 avoids re-use of %eax as index register:

.L34:
        movq    %r11, %rdi
        addq    8(%r10), %rdi
        movq    8(%r10), %rsi
        movq    8(%r10), %rdx
        movq    40(%r10), %rax
        leaq    4(%r11), %rbx
        addq    %rdi, %rsi
        leaq    4(%rdi), %r9
        movq    %rdi, -8(%r10)
        addq    %rsi, %rdx
        leaq    4(%rsi), %r8
        movq    %rsi, -24(%r10)
        leaq    4(%rdx), %rcx
        movq    %r9, -16(%r10)
        movq    %rdx, -40(%r10)
        movq    %r8, -32(%r10)
        addq    $7, %rax
        movq    %rcx, -48(%r10)
        movsd   (%rax,%rcx,2), %xmm12
        leaq    (%rbx,%rbx), %rcx
        movsd   (%rax,%rdx,2), %xmm3
        leaq    (%rax,%r11,2), %rdx
        addq    $8, %r11
        movsd   (%rax,%r8,2), %xmm14
        cmpq    %r11, %r13
        movsd   (%rax,%rsi,2), %xmm13
        movsd   (%rax,%r9,2), %xmm11
        movsd   (%rax,%rdi,2), %xmm10
        movsd   (%rax,%rcx), %xmm8
        ...

while 4.3 always re-loads %rax as index:

.L26:
        leaq    4(%rdi), %rdx
        movq    %rdi, %rax
        movq    %rdx, -8(%rsp)
        addq    (%r8), %rax
        movq    %rax, (%r9)
        addq    $4, %rax
        movq    %rax, (%rbp)
        movq    (%r9), %rax
        addq    (%r8), %rax
        movq    %rax, (%r10)
        addq    $4, %rax
        movq    %rax, (%rbx)
        movq    (%r10), %rax
        addq    (%r8), %rax
        movq    %rax, (%r11)
        movq    -64(%rsp), %rcx
        addq    $4, %rax
        movq    %rax, (%rcx)
        movq    (%rsi), %rdx
        movq    -8(%rsp), %rcx
        addq    $7, %rdx
        movsd   (%rdx,%rax,2), %xmm13
        movq    (%r11), %rax
        addq    %rcx, %rcx
        movsd   (%rdx,%rcx), %xmm8
        movsd   (%rdx,%rax,2), %xmm3
        movq    (%rbx), %rax
        movsd   (%rdx,%rax,2), %xmm14
        movq    (%r10), %rax
        movsd   (%rdx,%rax,2), %xmm12
        movq    (%rbp), %rax
        movsd   (%rdx,%rax,2), %xmm11
        movq    (%r9), %rax
        movsd   (%rdx,%rax,2), %xmm10
        movq    (%r12), %rax
        leaq    (%rdx,%rdi,2), %rdx
        ...

The root cause still needs to be investigated.
So, confirmed.
I suspected that the slowdown had nothing to do with computed gotos, so I regenerated the C code using a switch instead of the computed gotos and got the following.

For that same copy of mainline, gcc version 4.3.0 20071026 (experimental) [trunk revision 129664] (GCC):

(time (direct-fft-recursive-4 a table))
    470 ms real time
    470 ms cpu time (470 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

For 4.2.2:

(time (direct-fft-recursive-4 a table))
    384 ms real time
    384 ms cpu time (383 user, 1 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

So that's almost exactly the same slowdown as with computed gotos. I changed the subject line to use 22% instead of 33% (I don't know how I got 33% before; perhaps I just mistyped it) and removed the phrase "with computed gotos". I'll include the new .i and .s files as attachments.
Created attachment 14534 [details] .i file using a switch instead of computed gotos This is the generated code with a switch instead of computed gotos.
Created attachment 14535 [details] 4.2.2 assembly for code using switch.
Created attachment 14536 [details] 4.3.0 assembly for code using a switch
I've marked this P1 because I'd like to see us start to explain these kinds of dramatic performance changes. If we can explain the issue coherently, we may well decide that it's not important to fix it, but I think we ought to force ourselves to figure out what's going on.
One suspect is fwprop. Can anyone confirm?
Subject: Re: [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code On Nov 30, 2007, at 12:39 AM, bonzini at gnu dot org wrote: > One suspect is fwprop. Anyone can confirm? How does one turn off fwprop? It doesn't seem to like "-fno-fwprop".
It would be -fno-forward-propagate, but what I meant is that the changes *connected to* fwprop could be the culprit. One has to look at dumps to understand if this is the case. It would be possible, maybe, to put an asm around the problematic basic block, so that one could plot the number of instructions in that basic block over time?
Subject: Re: [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code

On Nov 30, 2007, at 9:58 AM, bonzini at gnu dot org wrote:

> -fno-forward-propagate

I don't know how to debug this, that's clear enough, but adding -fno-forward-propagate as an option doesn't change the code at all.
Can we have updated measurements please? Also I don't think this bug should be P1.
The assembler is identical to that in the third attachment, and the time is basically the same (other things were going on at the same time):

(time (direct-fft-recursive-4 a table))
    465 ms real time
    466 ms cpu time (466 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

euler-86% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2 --with-mpfr=/pkgs/gmp-4.2.2 --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.3.0 20080109 (experimental) [trunk revision 131427] (GCC)
I'm downgrading this to P2.
It is not possible to create an executable from direct.i. My compilation fails:

(.text+0x20): undefined reference to `main'
/tmp/cc0VOLHm.o: In function `___H_direct_2d_fft_2d_recursive_2d_4':
_num.c:(.text+0xf1): undefined reference to `___gstate'
_num.c:(.text+0x18e): undefined reference to `___gstate'
_num.c:(.text+0x1c7): undefined reference to `___gstate'
_num.c:(.text+0x27b): undefined reference to `___gstate'
_num.c:(.text+0x2e0): undefined reference to `___gstate'
/tmp/cc0VOLHm.o:_num.c:(.text+0x6f0): more undefined references to `___gstate' follow

Could you attach the source that can be used to create the executable? Or perhaps detailed instructions on how to create one from the sources you already posted.
Subject: Re: [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code On Jan 21, 2008, at 2:21 PM, ubizjak at gmail dot com wrote: > It is not possible to create an executable from direct.i. That's correct, sorry. > Could you attach the source that can be used to create the executable? Here are instructions on how to build and test a modified version of Gambit, from which I derived direct.i. Download the file http://www.math.purdue.edu/~lucier/gcc/test-files/bugzilla/33928/ gambc-v4_1_2.tgz Build it with the following commands: > tar zxf gambc-v4_1_2.tgz > cd gambc-v4_1_2 > ./configure CC='/pkgs/gcc-mainline/bin/gcc -save-temps' > make -j If you want to recompile the source after reconfiguring, do > make mostlyclean not 'make clean', unfortunately. Then test it with > gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a > a)))' The output ends with something like > (time (##bignum.make (##fixnum.quotient result-length > (##fixnum.quotient ##bignum.adigit-width ##bignum.fdigit-width)) #f > #f)) > 4 ms real time > 5 ms cpu time (3 user, 2 system) > no collections > 3962448 bytes allocated > 968 minor faults > no major faults > (time (##make-f64vector (##fixnum.* two^n 2))) > 5 ms real time > 5 ms cpu time (1 user, 4 system) > 1 collection accounting for 5 ms real time (1 user, 4 system) > 33554464 bytes allocated > 59 minor faults > no major faults > (time (make-w (##fixnum.- log-two^n 1))) > 30 ms real time > 31 ms cpu time (17 user, 14 system) > no collections > 16810144 bytes allocated > 4097 minor faults > no major faults > (time (make-w-rac log-two^n)) > 28 ms real time > 28 ms cpu time (16 user, 12 system) > no collections > 16826272 bytes allocated > 4097 minor faults > no major faults > (time (bignum->f64vector-rac x a)) > 45 ms real time > 45 ms cpu time (20 user, 25 system) > no collections > -16 bytes allocated > 8192 minor faults > no major faults > (time (componentwise-rac-multiply a rac-table)) > 26 ms real 
time > 26 ms cpu time (26 user, 0 system) > no collections > -16 bytes allocated > no minor faults > no major faults > (time (direct-fft-recursive-4 a table)) > 445 ms real time > 445 ms cpu time (445 user, 0 system) > no collections > 64 bytes allocated > no minor faults > no major faults > (time (componentwise-complex-multiply a a)) > 24 ms real time > 24 ms cpu time (24 user, 0 system) > no collections > -16 bytes allocated > no minor faults > no major faults > (time (inverse-fft-recursive-4 a table)) > 418 ms real time > 418 ms cpu time (418 user, 0 system) > no collections > 64 bytes allocated > no minor faults > no major faults > (time (componentwise-rac-multiply-conjugate a rac-table)) > 26 ms real time > 26 ms cpu time (26 user, 0 system) > no collections > -16 bytes allocated > no minor faults > no major faults > (time (bignum<-f64vector-rac a result result-length)) > 108 ms real time > 108 ms cpu time (108 user, 0 system) > no collections > 112 bytes allocated > no minor faults > no major faults > (time (* a a)) > 1170 ms real time > 1170 ms cpu time (1105 user, 65 system) > 1 collection accounting for 5 ms real time (1 user, 4 system) > 71266896 bytes allocated > 17413 minor faults > no major faults The time for the routine in direct.i is the time reported for direct- fft-recursive-4: > (time (direct-fft-recursive-4 a table)) > 445 ms real time > 445 ms cpu time (445 user, 0 system) > no collections > 64 bytes allocated > no minor faults > no major faults The name of the routine in the .i and .s files is ___H_direct_2d_fft_2d_recursive_2d_4. By the way, ___H_inverse_2d_fft_2d_recursive_2d_4 is a similar routine implementing the inverse fft, which, for some reason, goes faster than the direct (forward) fft. Brad
Created attachment 14996 [details] Much shorter testcase. This testcase was used to track down problems with fre pass. Stay tuned for an analysis.
Really I bet FRE is doing its job and the RA can't do its.
As already noted by Richi in comment #9, the difference is in the usage of %rax. gcc-4.2 generates:

        ...
        addq    $7, %rax
        leaq    (%rax,%rbp,2), %r10
        leaq    (%rax,%rdx,2), %rdx
        leaq    (%rax,%rdi,2), %rdi
        movq    (%rcx), %rsi
        movq    (%r13), %rcx
        leaq    (%rax,%r9,2), %r9
        leaq    (%rax,%r8,2), %r8
        leaq    (%rax,%r14,2), %r11
        addq    $8, %rbp
        movsd   (%rdx), %xmm3
        leaq    (%rax,%rsi,2), %rsi
        leaq    (%rax,%rcx,2), %rcx
        ...
        movsd   %xmm7, (%rcx)
        subsd   %xmm1, %xmm10
        addsd   %xmm1, %xmm0
        movsd   %xmm8, (%rsi)
        movsd   %xmm0, (%rdi)
        movapd  %xmm12, %xmm0
        subsd   %xmm3, %xmm12
        addsd   %xmm3, %xmm0
        movsd   %xmm0, (%r8)
        movsd   %xmm10, (%r9)
        movsd   %xmm12, (%rdx)
        jg      .L26

where gcc-4.3 limps along with:

        ...
        leaq    7(%rax), %r9
        movq    %rbx, -64(%rsp)
        movq    -56(%rsp), %rcx
        addq    %r10, %r10
        movsd   7(%rax,%rdx), %xmm3
        movsd   (%r9,%rbx,2), %xmm8
        movq    (%r11), %rbx
        movsd   7(%rax,%r10), %xmm5
        addq    %r8, %r8
        addq    %rdi, %rdi
        movsd   7(%rax,%r8), %xmm12
        movsd   15(%rbx), %xmm2
        leaq    (%r9,%rbp,2), %r9
        movsd   7(%rbx), %xmm1
        ...
        movsd   %xmm0, 7(%rax,%r9,2)
        movapd  %xmm10, %xmm0
        movsd   %xmm7, 7(%rax,%rcx)
        subsd   %xmm1, %xmm10
        addsd   %xmm1, %xmm0
        movsd   %xmm8, 7(%rax,%rsi)
        movsd   %xmm0, 7(%rax,%rdi)
        movapd  %xmm12, %xmm0
        subsd   %xmm3, %xmm12
        addsd   %xmm3, %xmm0
        movsd   %xmm0, 7(%rax,%r8)
        movsd   %xmm10, 7(%rax,%r10)
        movsd   %xmm12, 7(%rax,%rdx)
        jg      .L17

The difference is in the offsetted addresses. Looking at the tree dumps, it is obvious that the problem is in the FRE pass.
At the end of the loop (line 685+ in _.034.fre) gcc-4.2 transforms every sequence of:

  D.2013_432 = ___fp_256 + 40B;
  D.2014_433 = *D.2013_432;
  D.2068_434 = (long int *) D.2014_433;
  D.2069_435 = D.2068_434 + 7B;
  D.2070_436 = (long int) D.2069_435;
  D.2094_437 = ___r3_35 << 1;
  D.2095_438 = D.2070_436 + D.2094_437;
  D.2096_439 = (double *) D.2095_438;
  *D.2096_439 = ___F64V53_431;
  D.2013_440 = ___fp_256 + 40B;
  D.2014_441 = *D.2013_440;
  D.2068_442 = (long int *) D.2014_441;
  D.2069_443 = D.2068_442 + 7B;
  D.2070_444 = (long int) D.2069_443;
  D.2091_445 = ___r4_257 << 1;
  D.2092_446 = D.2070_444 + D.2091_445;
  D.2093_447 = (double *) D.2092_446;
  *D.2093_447 = ___F64V52_430;
  D.2013_448 = ___fp_256 + 40B;
  D.2014_449 = *D.2013_448;
  D.2068_450 = (long int *) D.2014_449;
  D.2069_451 = D.2068_450 + 7B;
  D.2070_452 = (long int) D.2069_451;
  ...

into:

  D.2013_432 = D.2013_286;
  D.2014_433 = D.2014_287;
  D.2068_434 = D.2068_288;
  D.2069_435 = D.2069_289;
  D.2070_436 = D.2070_290;
  D.2094_437 = D.2094_366;
  D.2095_438 = D.2095_367;
  D.2096_439 = D.2096_368;
  *D.2096_439 = ___F64V53_431;
  D.2013_440 = D.2013_286;
  D.2014_441 = D.2014_287;
  D.2068_442 = D.2068_288;
  D.2069_443 = D.2069_289;
  D.2070_444 = D.2070_290;
  D.2091_445 = D.2091_357;
  D.2092_446 = D.2092_358;
  D.2093_447 = D.2093_359;
  *D.2093_447 = ___F64V52_430;
  D.2013_448 = D.2013_286;
  D.2014_449 = D.2014_287;
  D.2068_450 = D.2068_288;
  D.2069_451 = D.2069_289;
  D.2070_452 = D.2070_290;
  D.1994_453 = D.1994_258;
  D.2040_454 = D.2040_347;
  D.2041_455 = D.2041_348;
  D.2089_456 = D.2089_349;
  D.2090_457 = D.2090_350;
  ...
and this is optimized in further passes into:

  *D.2096 = ___F64V32 + ___F64V45;
  *D.2093 = ___F64V31 + ___F64V42;
  *D.2090 = ___F64V32 - ___F64V45;
  *D.2088 = ___F64V31 - ___F64V42;
  *D.2084 = ___F64V28 + ___F64V39;
  *D.2081 = ___F64V27 + ___F64V36;
  *D.2077 = ___F64V28 - ___F64V39;
  *D.2074 = ___F64V27 - ___F64V36;

However, for some reason gcc-4.3 transforms only _some_ instructions (line 708+ in _.085t.fre dump), creating:

  D.1683_428 = D.1683_282;
  D.1684_429 = D.1684_283;
  D.1738_430 = D.1738_284;
  D.1739_431 = D.1739_285;
  D.1740_432 = D.1740_286;
  D.1764_433 = D.1764_362;
  D.1765_434 = D.1765_363;
  D.1766_435 = D.1766_364;
  *D.1766_435 = ___F64V53_427;
  D.1683_436 = D.1683_282;
  D.1684_437 = *D.1683_436;
  D.1738_438 = (long unsigned int) D.1684_437;
  D.1739_439 = D.1738_438 + 7;
  D.1740_440 = (long int) D.1739_439;
  D.1761_441 = D.1761_353;
  D.1762_442 = D.1740_440 + D.1761_441;
  D.1763_443 = (double *) D.1762_442;
  *D.1763_443 = ___F64V52_426;
  D.1683_444 = D.1683_282;
  D.1684_445 = *D.1683_444;
  D.1738_446 = (long unsigned int) D.1684_445;
  D.1739_447 = D.1738_446 + 7;
  D.1740_448 = (long int) D.1739_447;
  ...

which leaves us with:

  *D.1766 = ___F64V32 + ___F64V45;
  *(double *) (D.1761 + (long int) ((long unsigned int) *pretmp.33 + 7)) = ___F64V31 + ___F64V42;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*temp.65 << 1)) = ___F64V32 - ___F64V45;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*D.1685 << 1)) = ___F64V31 - ___F64V42;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*temp.61 << 1)) = ___F64V28 + ___F64V39;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*pretmp.152 << 1)) = ___F64V27 + ___F64V36;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*pretmp.147 << 1)) = ___F64V28 - ___F64V39;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*___fp.47 << 1)) = ___F64V27 - ___F64V36;

and creates suboptimal asm as above.
This is an alias partitioning problem, with --param max-aliased-vops=10000 I see the sequence optimized by FRE. Or, with the alias-oracle patch for FRE --param max-fields-for-field-sensitive=1 does the job as well.
target independent
Please note that for the original testcase (direct.i), even '-O2 --param max-aliased-vops=100000' doesn't generate expected code.
Created attachment 14997 [details] asm with alias-oracle enabled FRE This is the asm produced from direct.i with -O2 --param max-fields-for-field-sensitive=1 (SFTs disabled, which is the goal for 4.4) with the (ok, a modified) alias-oracle patch for FRE applied.
I've decided to test the current IRA branch on this problem. I used the build instructions in comment 24. With -fno-ira I get the same results as with 4.3.0 (no surprise there). With -fira I get the time:

(time (direct-fft-recursive-4 a table))
    422 ms real time
    421 ms cpu time (421 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

which is an improvement, and the code at the beginning of the loop is:

.L7262:
        movq    %rdx, %rcx
        addq    (%rsi), %rcx
        leaq    4(%rdx), %r15
        movq    %rcx, (%rbx)
        addq    $4, %rcx
        movq    %rcx, (%rbp)
        movq    (%rbx), %rcx
        addq    (%rsi), %rcx
        movq    %rcx, (%rdi)
        addq    $4, %rcx
        movq    %rcx, (%r8)
        movq    (%rdi), %rcx
        addq    (%rsi), %rcx
        leaq    4(%rcx), %r10
        movq    %rcx, (%r9)
        movq    %r10, (%r13)
        movq    (%rax), %rcx
        addq    $7, %rcx
        movsd   (%rcx,%r10,2), %xmm4
        movq    (%r9), %r10
        leaq    (%rcx,%rdx,2), %r11
        addq    $8, %rdx
        movsd   (%r11), %xmm11
        movsd   (%rcx,%r10,2), %xmm5
        movq    (%r8), %r10
        movsd   (%rcx,%r10,2), %xmm6
        movq    (%rdi), %r10
        movsd   (%rcx,%r10,2), %xmm7
        movq    (%rbp), %r10
        movsd   (%rcx,%r10,2), %xmm8
        movq    (%rbx), %r10
        movapd  %xmm8, %xmm14
        movsd   (%rcx,%r10,2), %xmm9
        leaq    (%r15,%r15), %r10
        movsd   (%rcx,%r10), %xmm10
        movq    (%r12), %rcx
        movapd  %xmm9, %xmm15
        movsd   15(%rcx), %xmm1
        movsd   7(%rcx), %xmm2
        movapd  %xmm1, %xmm13
        movsd   31(%rcx), %xmm3
        movapd  %xmm2, %xmm12

which is also an improvement, but it is still nowhere near the result for 4.2.2. So, whatever is causing this problem, it appears the new register allocator isn't going to fix it.

The code generated by today's mainline (136210) isn't better than 4.3.0's; the time is:

(time (direct-fft-recursive-4 a table))
    469 ms real time
    469 ms cpu time (469 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

and the code is essentially the same as for 4.3.0.
4.3.1 is being released, adjusting target milestone.
Problem still exists with:

euler-18% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2/ --with-mpfr=/pkgs/gmp-4.2.2/ --prefix=/pkgs/gcc-mainline --enable-languages=c --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.4.0 20080708 (experimental) [trunk revision 137644] (GCC)

Just checking whether recent changes happened to fix it.
4.3.2 is released, changing milestones to 4.3.3.
I don't really understand the status of this bug. Before 4.3.0 it was P1, and Mark said he'd "like to see us start to explain these kinds of dramatic performance changes." There was quite a bit of detective work that ended with "for some reason gcc-4.3 transforms only _some_ instructions (line 708+ in _.085t.fre dump) ...". Richard opined that it was an "alias partitioning problem", but Uros noted that for the original code, as opposed to the reduced testcase, expanding some parameter to its maximum still doesn't fix the problem. So (a) we don't know what the current code is doing wrong, and (b) we don't know why 4.2 got it right. So I don't think Mark got what he wanted; now it's P2, and each release the target milestone for fixing it gets pushed back. I've been testing mainline on this bug sporadically, especially when an entry in gcc-patches mentions some words that also appear in this PR, to see if it's fixed. I'm a bit concerned that the target of 4.3.* is becoming increasingly out of reach, as changes committed to that branch seem to be more and more conservative because it's a release branch. I don't think the code for this bug is terribly atypical of machine-generated code; it would be nice to be able to remove this performance regression. Unfortunately, I'm in no position to do so.
We have to admit that this bug is unlikely to get fixed in the 4.3 series. It still lacks proper analysis, as unfortunately that done on the shorter testcase was not valid. Analysis takes time, and honestly at this point I rather spend time fixing wrong-code or ice-on-valid bugs.
OK, but I was moved to write because Jakub's latest 4.4 status report requests Please concentrate now on fixing bugs, especially the performance regressions. and this is a definite 4.3/4.4 performance regression from 4.2. (How many of the P1 PRs are performance regressions?)
I may have narrowed down the problem a bit. With this compiler (revision 118491):

pythagoras-277% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061105 (experimental)

one gets (on a faster machine than previous reports):

(time (direct-fft-recursive-4 a table))
    133 ms real time
    140 ms cpu time (140 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

With this compiler (revision 118474):

pythagoras-24% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061104 (experimental)

one gets:

(time (direct-fft-recursive-4 a table))
    116 ms real time
    108 ms cpu time (108 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

and you see the typical problem with assembly code from direct.i with the later compiler. Paolo may have been right about fwprop; this patch was installed that day:

Author: bonzini
Date: Sat Nov  4 08:36:45 2006
New Revision: 118475

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=118475

Log:
2006-11-03  Paolo Bonzini  <bonzini@gnu.org>
            Steven Bosscher  <stevenb.gcc@gmail.com>

        * fwprop.c: New file.
        * Makefile.in: Add fwprop.o.
        * tree-pass.h (pass_rtl_fwprop, pass_rtl_fwprop_with_addr): New.
        * passes.c (init_optimization_passes): Schedule forward propagation.
        * rtlanal.c (loc_mentioned_in_p): Support NULL value of the second
        parameter.
        * timevar.def (TV_FWPROP): New.
        * common.opt (-fforward-propagate): New.
        * opts.c (decode_options): Enable forward propagation at -O2.
        * gcse.c (one_cprop_pass): Do not run local cprop unless touching
        jumps.
        * cse.c (fold_rtx_subreg, fold_rtx_mem, fold_rtx_mem_1,
        find_best_addr, canon_for_address, table_size): Remove.
        (new_basic_block, insert, remove_from_table): Remove references
        to table_size.
        (fold_rtx): Process SUBREGs and MEMs with equiv_constant, make
        simplification loop more straightforward by not calling fold_rtx
        recursively.
        (equiv_constant): Move here a small part of fold_rtx_subreg,
        do not call fold_rtx.  Call avoid_constant_pool_reference
        to process MEMs.
        * recog.c (canonicalize_change_group): New.
        * recog.h (canonicalize_change_group): New.
        * doc/invoke.texi (Optimization Options): Document fwprop.
        * doc/passes.texi (RTL passes): Document fwprop.

Added:
    trunk/gcc/fwprop.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/common.opt
    trunk/gcc/cse.c
    trunk/gcc/doc/invoke.texi
    trunk/gcc/doc/passes.texi
    trunk/gcc/gcse.c
    trunk/gcc/opts.c
    trunk/gcc/passes.c
    trunk/gcc/recog.c
    trunk/gcc/recog.h
    trunk/gcc/rtlanal.c
    trunk/gcc/timevar.def
    trunk/gcc/tree-pass.h
IIUC this is a typical case in which CSE was fixing something that earlier passes messed up. Unfortunately fwprop does (better) what CSE was meant to do, but does not do what I assumed was already done before CSE. If the problem is aliasing/FRE, then I think Richi is the one who could fix it for good in the tree passes. If there is more to it, however, I can take a look at why fwprop is generating the ugly code.
There's not much to be done for aliasing - everything points to global memory and thus aliases. There may be some opportunities for offset-based disambiguations via pointers, but I didn't investigate in detail. Whoever wants someone to work on specific details needs to provide way shorter testcases ;)
Just a comment that -fforward-propagate isn't enabled at -O1 (the main optimization option in the test), while the CSE code it replaced was enabled at -O1. This is presumably why adding -fno-forward-propagate to the command line in the test a year ago didn't affect the generated code. Adding -fno-forward-propagate to the command line of the test case with revision 118475 of gcc changes the generated code, but doesn't improve the problem code in the main loop. I updated the title to report the performance hit on an Intel(R) Xeon(R) CPU X5460 @ 3.16GHz, as reported by /proc/cpuinfo.
GCC 4.3.3 is being released, adjusting target milestone.
A simplified (local, noncascading) fwprop not using UD chains would not be hard to do... Basically, at -O1 use FOR_EACH_BB/FOR_EACH_BB_INSN instead of walking the uses, keep a (regno, insn) map of pseudos (cleared at the beginning of every basic block), and use that info instead of UD chains in use_killed_between...
Subject: Re: [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by r118475 On Fri, 2009-02-13 at 16:05 +0000, bonzini at gnu dot org wrote: > ------- Comment #44 from bonzini at gnu dot org 2009-02-13 16:05 ------- > A simplified (local, noncascading) fwprop not using UD chains would not be hard > to do... Basically, at -O1 use FOR_EACH_BB/FOR_EACH_BB_INSN instead of walking > the uses, keep a (regno, insn) map of pseudos (cleared at the beginning of > every basic block), and use that info instead of UD chains in > use_killed_between... As noted in comment 42, enabling FWPROP on this test case does not fix the performance problem.
Regarding your comment in bug 26854:

> address calculations are no longer optimized as much as they were before

Sometimes, actually, they are optimized better. It depends on the case.

In comment #42, also, you talked about -O1, where fwprop is not enabled. So I'm failing to understand whether the problem is at the tree or RTL level for this bug. My comment was related to something said in PR39517, i.e. that chains are very expensive and a reason why fwprop should not be enabled at -O1.

Following up on my comment, alternatively, fwprop could compute its own dataflow instead of using UD chains, since by design it only cares about uses with a single definition. This looks much better. You would use something like df_chain_create_bb and df_chain_create_bb_process_use, with code like the following (cf. df_chain_create_bb_process_use):

  /* Do not want to go through this for an uninitialized var.  */
  int count = DF_DEFS_COUNT (regno);
  if (count)
    {
      if (top_flag == (DF_REF_FLAGS (use) & DF_REF_AT_TOP))
        {
          unsigned int first_index = DF_DEFS_BEGIN (uregno);
          unsigned int last_index = first_index + count - 1;

          /* Uninitialized?  Exit.  */
          bmp_iter_set_init (&bi, local_rd, first_index, &def_index);
          if (!bmp_iter_set (&bi, &def_index) || def_index > last_index)
            continue;

          /* 2 or more defs for this use, exit.  */
          bmp_iter_next (&bi, &def_index);
          if (!bmp_iter_set (&bi, &def_index) || def_index > last_index)
            SET_BIT (can_fwprop, DF_REF_ID (use));
        }
    }

With this change there would be no reason not to run fwprop at -O1.
Subject: Re: [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by r118475 On Fri, 2009-02-13 at 16:32 +0000, bonzini at gnu dot org wrote: > > > ------- Comment #46 from bonzini at gnu dot org 2009-02-13 16:32 ------- > Regarding your comment in bug 26854: > > > address calculations are no longer optimized as much as they > > were before > > Sometimes, actually, they are optimized better. It depends on the case. Yes. I don't see why the optimizations in CSE, which were relatively cheap and which were effective for this case, needed to be disabled when FWPROP was added without, evidently, understanding why FWPROP does not do what CSE was already doing. > In comment #42, also, you talked about -O1, where fwprop is not enabled. So > I'm failing to understand if the problem is at the tree or RTL level for this > bug. When I add -fforward-propagate to the command line, then the assembly code changes in some ways, but the performance problem remains the same. Brad
Subject: Re: [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by r118475 > Yes. I don't see why the optimizations in CSE, which were relatively > cheap and which were effective for this case, needed to be disabled when > FWPROP was added without, evidently, understanding why FWPROP does not > do what CSE was already doing. Just to mention it, fwprop saved 3% of compile time. That's not "cheap". It was also tested with SPEC and Nullstone on several architectures.
With 4.4.0 and with mainline this code now runs in 280 ms instead of the 156 ms with 4.2.4. Since 280/156 = 1.79, I changed the subject line (the slowdown is now not completely caused by r118475). I guess I'll post the assembly code generated by 4.4.0 in the next attachment.

Timings (best of three runs) for the last (time (direct-fft-recursive-4 a table)) from

gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a a)))'

With gcc-4.1.2:
    188 ms cpu time (188 user, 0 system)
With gcc-4.2.4:
    156 ms cpu time (152 user, 4 system)
With gcc-4.3.3:
    180 ms cpu time (180 user, 0 system)
With gcc-4.4.0:
    280 ms cpu time (280 user, 0 system)
With 4.5.0 20090423 (experimental) [trunk revision 146634]:
    280 ms cpu time (280 user, 0 system)
Created attachment 17685 [details] direct.s generated by 4.4.0
Forgot to mention, the main loop starts at .L2947. This is on model name : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz Brad
I narrowed down the new performance regression to code added some time around March 12, 2009. I changed the subject line of this PR back to reflect only the performance regression caused by the code added 2006-11-03, and added a new PR, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39914, to cover the effects of the March 2009 code.
I posted a possible fix to gcc-patches with the subject line Possible fix for 30% performance regression in PR 33928 Here's the assembly for the main loop after the changes I proposed: .L4230: movq %r11, %rdi addq 8(%r10), %rdi movq 8(%r10), %rsi movq 8(%r10), %rdx movq 40(%r10), %rax leaq 4(%r11), %rbx addq %rdi, %rsi leaq 4(%rdi), %r9 movq %rdi, -8(%r10) addq %rsi, %rdx leaq 4(%rsi), %r8 movq %rsi, -24(%r10) leaq 4(%rdx), %rcx movq %r9, -16(%r10) movq %rdx, -40(%r10) movq %r8, -32(%r10) addq $7, %rax movq %rcx, -48(%r10) movsd (%rax,%rcx,2), %xmm12 leaq (%rbx,%rbx), %rcx movsd (%rax,%rdx,2), %xmm3 leaq (%rax,%r11,2), %rdx addq $8, %r11 movsd (%rax,%r8,2), %xmm14 cmpq %r11, %r13 movsd (%rax,%rsi,2), %xmm13 movsd (%rax,%r9,2), %xmm11 movsd (%rax,%rdi,2), %xmm10 movsd (%rax,%rcx), %xmm8 movq 24(%r10), %rax movsd (%rdx), %xmm7 movsd 15(%rax), %xmm2 movsd 7(%rax), %xmm1 movapd %xmm2, %xmm0 movsd 31(%rax), %xmm9 movapd %xmm1, %xmm6 mulsd %xmm3, %xmm0 movapd %xmm1, %xmm4 mulsd %xmm12, %xmm6 mulsd %xmm3, %xmm4 movapd %xmm1, %xmm3 mulsd %xmm13, %xmm1 mulsd %xmm14, %xmm3 addsd %xmm0, %xmm6 movapd %xmm2, %xmm0 movsd 23(%rax), %xmm5 mulsd %xmm12, %xmm0 movapd %xmm7, %xmm12 subsd %xmm0, %xmm4 movapd %xmm2, %xmm0 mulsd %xmm14, %xmm2 movapd %xmm8, %xmm14 mulsd %xmm13, %xmm0 movapd %xmm11, %xmm13 addsd %xmm6, %xmm11 subsd %xmm6, %xmm13 subsd %xmm2, %xmm1 movapd %xmm10, %xmm2 addsd %xmm0, %xmm3 movapd %xmm5, %xmm0 subsd %xmm4, %xmm2 addsd %xmm4, %xmm10 subsd %xmm1, %xmm12 addsd %xmm1, %xmm7 movapd %xmm9, %xmm1 subsd %xmm3, %xmm14 mulsd %xmm2, %xmm0 xorpd .LC5(%rip), %xmm1 addsd %xmm3, %xmm8 movapd %xmm1, %xmm3 mulsd %xmm2, %xmm1 movapd %xmm5, %xmm2 mulsd %xmm13, %xmm3 mulsd %xmm11, %xmm2 addsd %xmm0, %xmm3 movapd %xmm5, %xmm0 mulsd %xmm10, %xmm5 mulsd %xmm13, %xmm0 subsd %xmm0, %xmm1 movapd %xmm9, %xmm0 mulsd %xmm11, %xmm9 mulsd %xmm10, %xmm0 subsd %xmm9, %xmm5 addsd %xmm0, %xmm2 movapd %xmm7, %xmm0 addsd %xmm5, %xmm0 subsd %xmm5, %xmm7 movsd %xmm0, (%rdx) movapd %xmm8, %xmm0 
movq 40(%r10), %rax subsd %xmm2, %xmm8 addsd %xmm2, %xmm0 movsd %xmm0, 7(%rcx,%rax) movq -8(%r10), %rdx movq 40(%r10), %rax movapd %xmm12, %xmm0 subsd %xmm1, %xmm12 movsd %xmm7, 7(%rax,%rdx,2) movq -16(%r10), %rdx movq 40(%r10), %rax addsd %xmm1, %xmm0 movsd %xmm8, 7(%rax,%rdx,2) movq -24(%r10), %rdx movq 40(%r10), %rax movsd %xmm0, 7(%rax,%rdx,2) movapd %xmm14, %xmm0 movq -32(%r10), %rdx movq 40(%r10), %rax subsd %xmm3, %xmm14 addsd %xmm3, %xmm0 movsd %xmm0, 7(%rax,%rdx,2) movq -40(%r10), %rdx movq 40(%r10), %rax movsd %xmm12, 7(%rax,%rdx,2) movq -48(%r10), %rdx movq 40(%r10), %rax movsd %xmm14, 7(%rax,%rdx,2) jg .L4230 movq %rbx, %r13 .L4228:
Created attachment 17805 [details] svn diff of cse.c to fix the performance regression This partially reverts r118475 and adds code to call find_best_address for MEMs in fold_rtx.
Created attachment 17807 [details] svn diff of cse.c to "fix" the performance regression (updated)
Created attachment 17808 [details] usable testcase

Ok, I managed to make a reasonably readable source file (un-include the stdlib files, remove unused Gambit stuff and ___ prefixes, simplify some expressions), find the heavy loops, annotate them with asm statements (see comment #18, 2007-11-30), and measure the length of the loops:

                4.2    4.5    4.5 + patch
LOOP 1          ~190   ~230   ~190
INNER LOOP 1.1  ~120   ~130   ~120
LOOP 2            33     36     31

I am thus obsoleting (almost) everything that was posted and is not relevant anymore. Let's start from scratch with the new testcase.
Why do you need any #include lines at all in the reduced testcase? Compiles just fine even without them...
Uhm, it's better to run unpatched 4.5 with -O1 -fforward-propagate to get a fair comparison. Also, I was counting the loop headers, which are not part of the hot code.

                4.2 -O1   4.5 -O1 -ffw-prop   4.5 + patch -O1
LOOP 1          181       201                 180
INNER LOOP 1.1  117       118                 113
LOOP 2           27        27                  26

This shows that you should compare running the code (you can use direct.i) with 4.2/-O1 and 4.5/-O1 -fforward-propagate. This is very important; otherwise you're comparing apples to oranges.

fwprop is creating too high register pressure by creating offsets like these in the loop header:

	leaq	-8(%r12), %rsi
	leaq	8(%r12), %r10
	leaq	-16(%r12), %r9
	leaq	-24(%r12), %rbx
	leaq	-32(%r12), %rbp
	leaq	-40(%r12), %rdi
	leaq	-48(%r12), %r11
	leaq	40(%r12), %rdx

Then, the additional register pressure is causing the bad scheduling we have in the fast assembly outputs:

	movq	(%rdx), %rax
	movsd	(%rax,%r15,2), %xmm7
	movq	(%rdi), %r15
	movsd	(%rax,%r15,2), %xmm10
	movq	(%rbp), %r15
	movsd	(%rax,%r15,2), %xmm5
	movq	(%rbx), %r15
	movsd	(%rax,%r15,2), %xmm6
	movq	(%r9), %r15
	movsd	(%rax,%r15,2), %xmm15
	movq	(%rsi), %r15
	movsd	(%rax,%r15,2), %xmm11
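The register-pressure effect described above can be made concrete with a toy live-range counter. This is a hypothetical sketch, not GCC code: each value is modeled by its (definition, last use) positions in a linear schedule, and we report the maximum number of simultaneously live values, comparing a schedule where all eight address offsets are hoisted into the loop header against one where each offset is recomputed just before its single use.

```python
def max_pressure(ranges):
    """ranges: list of (def_pos, last_use_pos) for each value.
    Returns the maximum number of values live at any program point."""
    events = []
    for d, u in ranges:
        events.append((d, 1))        # value becomes live at its definition
        events.append((u + 1, -1))   # and dies just after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Eight offsets hoisted to the loop header (position 0), each used once
# late in the loop body: all eight stay live across the whole body.
hoisted = [(0, 12 + i) for i in range(8)]
# The same offsets rematerialized one instruction before their single use:
# at most two are ever live at once.
remat = [(11 + i, 12 + i) for i in range(8)]
```

This is only a model of the trade-off the comment describes: hoisting saves the recomputation but makes every hoisted value live across the loop, which on x86-64's sixteen integer registers quickly forces spills like the `movq -8(%rax), %rbx` reloads seen in the slow loops.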
Created attachment 17809 [details] usable testcase Without includes as Jakub suggested.
Actually, those are created by -fmove-loop-invariants. With -O1 -fforward-propagate -fno-move-loop-invariants I get:

                4.5 -O1 -ffw-prop -fno-move-loop-inv
LOOP 1          183
INNER LOOP 1.1  116
LOOP 2           25

You should be able to get performance close to 4.2, or better, with the options "-O1 -fforward-propagate -fno-move-loop-invariants -fschedule-insns2". If you do, this means two things:

1) The bug is in the register pressure estimation of -fmove-loop-invariants, and is merely exposed by the fwprop patch.

2) Maybe you should start from -O2 and go backwards, eliminating optimizations that do not help you or cause high compilation time, instead of using -O1.
Also see PR39871, maybe that's related (though on ARM).
No, totally unrelated to PR39871
Was the patch in comment 55 meant for me to bootstrap and test with today's mainline? It crashes at the gcc_assert in

/* Subroutine of canon_reg.  Pass *XLOC through canon_reg, and validate
   the result if necessary.  INSN is as for canon_reg.  */

static void
validate_canon_reg (rtx *xloc, rtx insn)
{
  if (*xloc)
    {
      rtx new_rtx = canon_reg (*xloc, insn);

      /* If replacing pseudo with hard reg or vice versa, ensure the
         insn remains valid.  Likewise if the insn has MATCH_DUPs.  */
      gcc_assert (insn && new_rtx);
      validate_change (insn, xloc, new_rtx, 1);
    }
}

when building libgcc:

/tmp/lucier/gcc/objdirs/mainline/./gcc/xgcc -B/tmp/lucier/gcc/objdirs/mainline/./gcc/ -B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/bin/ -B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/lib/ -isystem /pkgs/gcc-mainline/x86_64-unknown-linux-gnu/include -isystem /pkgs/gcc-mainline/x86_64-unknown-linux-gnu/sys-include -g -O2 -m32 -O2 -g -O2 -DIN_GCC -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wcast-qual -Wold-style-definition -isystem ./include -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -I. -I. -I../../.././gcc -I../../../../../mainline/libgcc -I../../../../../mainline/libgcc/. -I../../../../../mainline/libgcc/../gcc -I../../../../../mainline/libgcc/../include -I../../../../../mainline/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o _moddi3.o -MT _moddi3.o -MD -MP -MF _moddi3.dep -DL_moddi3 -c ../../../../../mainline/libgcc/../gcc/libgcc2.c \
  -fexceptions -fnon-call-exceptions -fvisibility=hidden -DHIDE_EXPORTS

../../../../../mainline/libgcc/../gcc/libgcc2.c: In function '…':
../../../../../mainline/libgcc/../gcc/libgcc2.c:1121: internal compiler error: in validate_canon_reg, at cse.c:2730
In answer to comment 60, here's the command line where I added -fforward-propagate -fno-move-loop-invariants: /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c here's the compiler: /pkgs/gcc-mainline/bin/gcc -v Using built-in specs. Target: x86_64-unknown-linux-gnu Configured with: /tmp/lucier/gcc/mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline --enable-languages=c Thread model: posix gcc version 4.5.0 20090506 (experimental) [trunk revision 147199] (GCC) and the runtime didn't change (substantially) 132 ms cpu time (132 user, 0 system) and the loop looks pretty much just as bad (it's 117 instructions long, by my count): .L2752: movq %rcx, %rdx addq 8(%rax), %rdx leaq 4(%rcx), %rdi movq %rdx, -8(%rax) leaq 4(%rdx), %rbx addq 8(%rax), %rdx movq %rbx, -16(%rax) movq %rdx, -24(%rax) leaq 4(%rdx), %rbx addq 8(%rax), %rdx movq %rbx, -32(%rax) movq %rdx, -40(%rax) leaq 4(%rdx), %rbx movq 40(%rax), %rdx movq %rbx, -48(%rax) movsd 7(%rdx,%rbx,2), %xmm9 movq -40(%rax), %rbx leaq 7(%rdx,%rcx,2), %r8 addq $8, %rcx movsd (%r8), %xmm4 cmpq %rcx, %r13 movsd 7(%rdx,%rbx,2), %xmm11 movq -32(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm5 movq -24(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm7 movq -16(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm14 movq -8(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm6 leaq (%rdi,%rdi), %rbx movsd 7(%rbx,%rdx), %xmm8 movq 24(%rax), %rdx movapd %xmm6, %xmm13 movsd 15(%rdx), %xmm1 movsd 7(%rdx), %xmm2 movapd %xmm1, %xmm10 movsd 31(%rdx), %xmm3 movapd %xmm2, %xmm12 mulsd %xmm11, %xmm10 mulsd %xmm9, %xmm12 mulsd %xmm2, %xmm11 mulsd %xmm1, %xmm9 movsd 
23(%rdx), %xmm0 addsd %xmm12, %xmm10 movapd %xmm2, %xmm12 mulsd %xmm7, %xmm2 subsd %xmm9, %xmm11 movapd %xmm1, %xmm9 mulsd %xmm5, %xmm12 mulsd %xmm5, %xmm1 movapd %xmm8, %xmm5 mulsd %xmm7, %xmm9 movapd %xmm4, %xmm7 subsd %xmm11, %xmm13 addsd %xmm6, %xmm11 movsd .LC5(%rip), %xmm6 subsd %xmm1, %xmm2 movapd %xmm0, %xmm1 addsd %xmm12, %xmm9 movapd %xmm14, %xmm12 xorpd %xmm3, %xmm6 subsd %xmm10, %xmm12 mulsd %xmm13, %xmm1 subsd %xmm2, %xmm7 addsd %xmm4, %xmm2 movapd %xmm6, %xmm4 addsd %xmm14, %xmm10 mulsd %xmm13, %xmm6 mulsd %xmm12, %xmm4 subsd %xmm9, %xmm5 mulsd %xmm0, %xmm12 addsd %xmm8, %xmm9 movapd %xmm0, %xmm8 mulsd %xmm11, %xmm0 addsd %xmm1, %xmm4 movapd %xmm3, %xmm1 mulsd %xmm10, %xmm3 subsd %xmm12, %xmm6 mulsd %xmm11, %xmm1 mulsd %xmm10, %xmm8 subsd %xmm3, %xmm0 addsd %xmm1, %xmm8 movapd %xmm2, %xmm1 addsd %xmm0, %xmm1 subsd %xmm0, %xmm2 movapd %xmm7, %xmm0 subsd %xmm6, %xmm7 addsd %xmm6, %xmm0 movsd %xmm1, (%r8) movapd %xmm9, %xmm1 movq 40(%rax), %rdx subsd %xmm8, %xmm9 addsd %xmm8, %xmm1 movsd %xmm1, 7(%rbx,%rdx) movq -8(%rax), %rbx movq 40(%rax), %rdx movsd %xmm2, 7(%rdx,%rbx,2) movq -16(%rax), %rbx movq 40(%rax), %rdx movsd %xmm9, 7(%rdx,%rbx,2) movq -24(%rax), %rbx movq 40(%rax), %rdx movsd %xmm0, 7(%rdx,%rbx,2) movapd %xmm5, %xmm0 movq -32(%rax), %rbx movq 40(%rax), %rdx subsd %xmm4, %xmm5 addsd %xmm4, %xmm0 movsd %xmm0, 7(%rdx,%rbx,2) movq -40(%rax), %rbx movq 40(%rax), %rdx movsd %xmm7, 7(%rdx,%rbx,2) movq -48(%rax), %rbx movq 40(%rax), %rdx movsd %xmm5, 7(%rdx,%rbx,2) jg .L2752 movq %rdi, %r13 .L2751:
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

lucier at math dot purdue dot edu wrote:
> ------- Comment #64 from lucier at math dot purdue dot edu 2009-05-06 20:43 -------
> In answer to comment 60, here's the command line where I added
> -fforward-propagate -fno-move-loop-invariants:

Hmm, can you try adding -frename-registers *or* -fweb (i.e. together they get no benefit) too?

> and the loop looks pretty much just as bad (it's 117 instructions long, by my
> count):

116 actually: the movq here is outside the loop (that's how I made all the instruction counts).

> movsd %xmm5, 7(%rdx,%rbx,2)
> jg .L2752
> movq %rdi, %r13
> .L2751:
Adding -frename-registers gives a significant speedup (sometimes as fast as 4.1.2 on this shared machine, i.e., it somtimes hits 108 ms instead of 132-140ms), the command line with -fforward-propagate -fno-move-loop-invariants -frename-registers is /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c and the loop is .L2752: movq %rcx, %r12 addq 8(%rax), %r12 leaq 4(%rcx), %rdi movq %r12, -8(%rax) leaq 4(%r12), %r8 addq 8(%rax), %r12 movq %r8, -16(%rax) movq -8(%rax), %r8 movq -16(%rax), %rdx movq %r12, -24(%rax) leaq 4(%r12), %rbx addq 8(%rax), %r12 movq -24(%rax), %r9 movq %rbx, -32(%rax) movq 24(%rax), %rbx movq -32(%rax), %r10 leaq 4(%r12), %r11 movq %r12, -40(%rax) movq 40(%rax), %r12 movq -40(%rax), %r14 movq %r11, -48(%rax) movsd 15(%rbx), %xmm1 movsd 7(%rbx), %xmm2 movsd 7(%r12,%r11,2), %xmm9 movapd %xmm1, %xmm3 movsd 7(%r12,%r14,2), %xmm11 leaq 7(%r12,%rcx,2), %r11 movapd %xmm2, %xmm10 leaq (%rdi,%rdi), %r14 mulsd %xmm11, %xmm3 movapd %xmm2, %xmm12 mulsd %xmm9, %xmm10 addq $8, %rcx mulsd %xmm1, %xmm9 cmpq %rcx, %r13 mulsd %xmm2, %xmm11 movsd 7(%r12,%r10,2), %xmm5 movsd 7(%r12,%r9,2), %xmm7 addsd %xmm10, %xmm3 movsd 7(%r12,%r8,2), %xmm6 subsd %xmm9, %xmm11 mulsd %xmm7, %xmm2 movapd %xmm1, %xmm9 mulsd %xmm5, %xmm1 movapd %xmm6, %xmm13 movsd 7(%r12,%rdx,2), %xmm14 mulsd %xmm5, %xmm12 mulsd %xmm7, %xmm9 subsd %xmm11, %xmm13 movsd 31(%rbx), %xmm0 addsd %xmm6, %xmm11 movsd .LC5(%rip), %xmm6 subsd %xmm1, %xmm2 movsd (%r11), %xmm4 movapd %xmm14, %xmm10 xorpd %xmm0, %xmm6 addsd %xmm12, %xmm9 movsd 7(%r14,%r12), %xmm8 subsd %xmm3, %xmm10 movapd %xmm4, 
%xmm7 addsd %xmm14, %xmm3 movsd 23(%rbx), %xmm15 subsd %xmm2, %xmm7 movapd %xmm8, %xmm5 addsd %xmm4, %xmm2 movapd %xmm6, %xmm4 subsd %xmm9, %xmm5 movapd %xmm15, %xmm14 addsd %xmm8, %xmm9 mulsd %xmm10, %xmm4 movapd %xmm15, %xmm8 mulsd %xmm15, %xmm10 movapd %xmm0, %xmm12 mulsd %xmm11, %xmm15 mulsd %xmm3, %xmm0 movapd %xmm7, %xmm1 mulsd %xmm13, %xmm6 mulsd %xmm3, %xmm8 movapd %xmm9, %xmm3 mulsd %xmm11, %xmm12 subsd %xmm0, %xmm15 mulsd %xmm13, %xmm14 subsd %xmm10, %xmm6 movapd %xmm2, %xmm10 movapd %xmm5, %xmm0 addsd %xmm12, %xmm8 addsd %xmm15, %xmm10 subsd %xmm15, %xmm2 addsd %xmm14, %xmm4 addsd %xmm8, %xmm3 movsd %xmm10, (%r11) movq 40(%rax), %r10 subsd %xmm8, %xmm9 addsd %xmm6, %xmm1 addsd %xmm4, %xmm0 movsd %xmm3, 7(%r14,%r10) movq -8(%rax), %r9 movq 40(%rax), %rdx subsd %xmm6, %xmm7 subsd %xmm4, %xmm5 movsd %xmm2, 7(%rdx,%r9,2) movq -16(%rax), %r8 movq 40(%rax), %r12 movsd %xmm9, 7(%r12,%r8,2) movq -24(%rax), %rbx movq 40(%rax), %r11 movsd %xmm1, 7(%r11,%rbx,2) movq -32(%rax), %r14 movq 40(%rax), %r10 movsd %xmm0, 7(%r10,%r14,2) movq -40(%rax), %r9 movq 40(%rax), %rdx movsd %xmm7, 7(%rdx,%r9,2) movq -48(%rax), %r8 movq 40(%rax), %r12 movsd %xmm5, 7(%r12,%r8,2) jg .L2752 Adding -fforward-propagate -fno-move-loop-invariants -fweb instead of -fforward-propagate -fno-move-loop-invariants -frename-registers, so the compile line is /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. 
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate -fno-move-loop-invariants -fweb -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c the time is not so good (consistently 128ms) and the loop is .L2752: movq %rcx, %rdx addq 8(%rax), %rdx leaq 4(%rcx), %rdi movq %rdx, -8(%rax) leaq 4(%rdx), %rbx addq 8(%rax), %rdx movq %rbx, -16(%rax) movq %rdx, -24(%rax) leaq 4(%rdx), %rbx addq 8(%rax), %rdx movq %rbx, -32(%rax) movq %rdx, -40(%rax) leaq 4(%rdx), %rbx movq 40(%rax), %rdx movq %rbx, -48(%rax) movsd 7(%rdx,%rbx,2), %xmm9 movq -40(%rax), %rbx leaq 7(%rdx,%rcx,2), %r8 addq $8, %rcx movsd (%r8), %xmm4 cmpq %rcx, %r13 movsd 7(%rdx,%rbx,2), %xmm11 movq -32(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm5 movq -24(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm7 movq -16(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm14 movq -8(%rax), %rbx movsd 7(%rdx,%rbx,2), %xmm6 leaq (%rdi,%rdi), %rbx movsd 7(%rbx,%rdx), %xmm8 movq 24(%rax), %rdx movapd %xmm6, %xmm13 movsd 15(%rdx), %xmm1 movsd 7(%rdx), %xmm2 movapd %xmm1, %xmm10 movsd 31(%rdx), %xmm3 movapd %xmm2, %xmm12 mulsd %xmm11, %xmm10 mulsd %xmm9, %xmm12 mulsd %xmm2, %xmm11 mulsd %xmm1, %xmm9 movsd 23(%rdx), %xmm0 addsd %xmm12, %xmm10 movapd %xmm2, %xmm12 mulsd %xmm7, %xmm2 subsd %xmm9, %xmm11 movapd %xmm1, %xmm9 mulsd %xmm5, %xmm12 mulsd %xmm5, %xmm1 movapd %xmm8, %xmm5 mulsd %xmm7, %xmm9 movapd %xmm4, %xmm7 subsd %xmm11, %xmm13 addsd %xmm6, %xmm11 movsd .LC5(%rip), %xmm6 subsd %xmm1, %xmm2 movapd %xmm0, %xmm1 addsd %xmm12, %xmm9 movapd %xmm14, %xmm12 xorpd %xmm3, %xmm6 subsd %xmm10, %xmm12 mulsd %xmm13, %xmm1 subsd %xmm2, %xmm7 addsd %xmm4, %xmm2 movapd %xmm6, %xmm4 addsd %xmm14, %xmm10 mulsd %xmm13, %xmm6 mulsd %xmm12, %xmm4 subsd %xmm9, %xmm5 mulsd %xmm0, %xmm12 addsd %xmm8, %xmm9 
movapd %xmm0, %xmm8 mulsd %xmm11, %xmm0 addsd %xmm1, %xmm4 movapd %xmm3, %xmm1 mulsd %xmm10, %xmm3 subsd %xmm12, %xmm6 mulsd %xmm11, %xmm1 mulsd %xmm10, %xmm8 subsd %xmm3, %xmm0 addsd %xmm1, %xmm8 movapd %xmm2, %xmm1 addsd %xmm0, %xmm1 subsd %xmm0, %xmm2 movapd %xmm7, %xmm0 subsd %xmm6, %xmm7 addsd %xmm6, %xmm0 movsd %xmm1, (%r8) movapd %xmm9, %xmm1 movq 40(%rax), %rdx subsd %xmm8, %xmm9 addsd %xmm8, %xmm1 movsd %xmm1, 7(%rbx,%rdx) movq -8(%rax), %rbx movq 40(%rax), %rdx movsd %xmm2, 7(%rdx,%rbx,2) movq -16(%rax), %rbx movq 40(%rax), %rdx movsd %xmm9, 7(%rdx,%rbx,2) movq -24(%rax), %rbx movq 40(%rax), %rdx movsd %xmm0, 7(%rdx,%rbx,2) movapd %xmm5, %xmm0 movq -32(%rax), %rbx movq 40(%rax), %rdx subsd %xmm4, %xmm5 addsd %xmm4, %xmm0 movsd %xmm0, 7(%rdx,%rbx,2) movq -40(%rax), %rbx movq 40(%rax), %rdx movsd %xmm7, 7(%rdx,%rbx,2) movq -48(%rax), %rbx movq 40(%rax), %rdx movsd %xmm5, 7(%rdx,%rbx,2) jg .L2752 And I still count 117 instructions in the loop in comment 64 (whether that matters, I don't know).
I'm thinking of enabling -frename-registers on x86: since x86 does not enable the first scheduling pass, live ranges are shorter, and the register allocator may reuse the same register over and over, leaving schedule-insns2 no freedom. This would leave only the bug in RTL loop invariant motion.

Brad, you are the one who regularly produces "insane" testcases; can you measure the slowdown from -O1 to -O1 -frename-registers? It is a local pass, so it should not be that much, but I'd rather check before (I'll check on a bootstrap instead).
Be careful with -frename-registers: it is quadratic in the size of a basic block, so for Bradley's test cases it will certainly give a slow-down. I have tried a rewrite of -frename-registers, but I keep running into trouble with the INDEX_REGS and BASE_REGS non-classes. Paolo, we could look at this stuff together if you want my help.
Well, adding -frename-registers by itself to -O1 and not -fforward-propagate and -fno-move-loop-invariants doesn't help (loop is given below, along with complete compile options), the time is 140 ms cpu time (140 user, 0 system) and adding -frename-registers and -fno-move-loop-invariants without -fforward-propagate doesn't help (loop is again given below), it gets 140 ms cpu time (140 user, 0 system) Adding all three gives a very consistent time this morning of 120 ms cpu time (120 user, 0 system) so which is the same as the 4.2.4 time without any of these options (this morning). But -fforward-propagate is not a viable option in general for this type of code; here are some times for the testcase from PR 31957 with various options on a 2.something GHz Xeon server: pythagoras-45% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report >& rename-report 252.987u 9.592s 4:23.20 99.7% 0+0k 0+0io 0pf+0w pythagoras-46% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & no-rename-report 249.875u 10.544s 4:21.73 99.4% 0+0k 0+0io 0pf+0w pythagoras-47% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. 
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & rename-no-move-loop-invariants-report 246.663u 10.484s 4:18.30 99.5% 0+0k 0+0io 0pf+0w pythagoras-48% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -fforward-propagate -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & rename-no-move-loop-invariants-forward-propagate-report 357.830u 28.417s 6:27.81 99.5% 0+0k 0+0io 11pf+0w With -fforward-propagate the memory required went up to at least 21GB. I'll attach the time reports for the various options, but the compiler wasn't configured to provide detailed memory reports. Brad Loop with -frename-registers /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. 
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c movq %rdx, %r12 addq (%r11), %r12 leaq 4(%rdx), %r14 movq %r12, (%rsi) addq $4, %r12 movq %r12, (%r10) movq (%r11), %rcx addq (%rsi), %rcx movq %rcx, (%rbx) addq $4, %rcx movq %rcx, (%r9) movq (%r11), %r13 addq (%rbx), %r13 movq %r13, (%r8) addq $4, %r13 movq %r13, (%r15) movq (%rax), %rcx movq (%r8), %r12 addq $7, %rcx movsd (%rcx,%r12,2), %xmm10 movq (%rbx), %r12 movsd (%rcx,%r13,2), %xmm13 movq (%r9), %r13 movsd (%rcx,%r12,2), %xmm6 movq (%rsi), %r12 movsd (%rcx,%r13,2), %xmm5 movq (%r10), %r13 movsd (%rcx,%r12,2), %xmm9 leaq (%r14,%r14), %r12 movsd (%rcx,%r13,2), %xmm11 leaq (%rcx,%rdx,2), %r13 movsd (%rcx,%r12), %xmm3 movq 24(%rdi), %rcx movsd (%r13), %xmm4 addq $8, %rdx movsd 15(%rcx), %xmm14 movsd 7(%rcx), %xmm15 movapd %xmm14, %xmm8 movapd %xmm14, %xmm7 movapd %xmm15, %xmm12 mulsd %xmm10, %xmm8 mulsd %xmm13, %xmm12 mulsd %xmm15, %xmm10 mulsd %xmm14, %xmm13 movsd 31(%rcx), %xmm2 addsd %xmm8, %xmm12 movapd %xmm15, %xmm8 mulsd %xmm6, %xmm7 mulsd %xmm5, %xmm14 subsd %xmm13, %xmm10 mulsd %xmm5, %xmm8 movapd %xmm2, %xmm13 mulsd %xmm6, %xmm15 movapd %xmm4, %xmm6 xorpd .LC5(%rip), %xmm13 movapd %xmm3, %xmm5 addsd %xmm7, %xmm8 movapd %xmm11, %xmm7 subsd %xmm14, %xmm15 movapd %xmm9, %xmm14 movsd 23(%rcx), %xmm0 subsd %xmm12, %xmm7 subsd %xmm10, %xmm14 movapd %xmm13, %xmm1 addsd %xmm11, %xmm12 movapd %xmm2, %xmm11 subsd %xmm15, %xmm6 addsd %xmm4, %xmm15 movapd %xmm0, %xmm4 mulsd %xmm7, %xmm1 addsd %xmm9, %xmm10 mulsd %xmm14, %xmm4 subsd %xmm8, %xmm5 mulsd %xmm0, %xmm7 addsd %xmm3, %xmm8 mulsd %xmm13, %xmm14 movapd %xmm15, %xmm9 mulsd %xmm10, %xmm11 mulsd %xmm0, %xmm10 addsd %xmm1, %xmm4 movapd %xmm8, 
%xmm3 movapd %xmm5, %xmm1 subsd %xmm7, %xmm14 movapd %xmm0, %xmm7 mulsd %xmm12, %xmm7 addsd %xmm4, %xmm1 mulsd %xmm2, %xmm12 movapd %xmm6, %xmm2 subsd %xmm14, %xmm6 addsd %xmm14, %xmm2 addsd %xmm11, %xmm7 subsd %xmm12, %xmm10 subsd %xmm4, %xmm5 addsd %xmm7, %xmm3 addsd %xmm10, %xmm9 subsd %xmm10, %xmm15 subsd %xmm7, %xmm8 movsd %xmm9, (%r13) movq (%rax), %rcx movsd %xmm3, 7(%r12,%rcx) movq (%rsi), %r13 movq (%rax), %rcx movsd %xmm15, 7(%rcx,%r13,2) movq (%r10), %r12 movq (%rax), %r13 movsd %xmm8, 7(%r13,%r12,2) movq (%rbx), %rcx movq (%rax), %r13 movsd %xmm2, 7(%r13,%rcx,2) movq (%r9), %r12 movq (%rax), %rcx movsd %xmm1, 7(%rcx,%r12,2) movq (%r8), %r13 movq (%rax), %rcx movsd %xmm6, 7(%rcx,%r13,2) movq (%r15), %r12 movq (%rax), %r13 movsd %xmm5, 7(%r13,%r12,2) cmpq %rdx, -104(%rsp) jg .L2941 Loop with -frename-registers -fno-move-loop-invariants /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c .L2755: leaq 8(%rax), %rdx movq %rcx, %r13 leaq -16(%rax), %r9 leaq -8(%rax), %r10 leaq -24(%rax), %r8 leaq -32(%rax), %rdi addq (%rdx), %r13 leaq 4(%rcx), %r14 leaq 4(%r13), %rsi movq %r13, (%r10) movq %rsi, (%r9) addq (%rdx), %r13 leaq -40(%rax), %rsi leaq 4(%r13), %r11 movq %r13, (%r8) movq %r11, (%rdi) addq (%rdx), %r13 leaq -48(%rax), %r11 leaq 40(%rax), %rdx movq %r13, (%rsi) addq $4, %r13 movq %r13, (%r11) movq (%rdx), %rbx movq (%rsi), %r12 addq $7, %rbx movsd (%rbx,%r12,2), %xmm11 movq (%r8), %r12 movsd (%rbx,%r13,2), %xmm9 movq (%rdi), %r13 movsd (%rbx,%r12,2), %xmm7 movq (%r10), %r12 movsd (%rbx,%r13,2), %xmm5 movq (%r9), %r13 movsd (%rbx,%r12,2), %xmm6 leaq 
(%r14,%r14), %r12 movsd (%rbx,%r13,2), %xmm14 leaq (%rbx,%rcx,2), %r13 movsd (%rbx,%r12), %xmm8 movq 24(%rax), %rbx movapd %xmm6, %xmm13 addq $8, %rcx movsd (%r13), %xmm4 cmpq %rcx, %r15 movsd 15(%rbx), %xmm1 movsd 7(%rbx), %xmm2 movapd %xmm1, %xmm3 movsd 31(%rbx), %xmm0 movapd %xmm2, %xmm10 mulsd %xmm11, %xmm3 movapd %xmm2, %xmm12 mulsd %xmm9, %xmm10 mulsd %xmm2, %xmm11 mulsd %xmm1, %xmm9 mulsd %xmm7, %xmm2 addsd %xmm10, %xmm3 mulsd %xmm5, %xmm12 movapd %xmm14, %xmm10 movsd 23(%rbx), %xmm15 subsd %xmm9, %xmm11 movapd %xmm1, %xmm9 mulsd %xmm5, %xmm1 movapd %xmm8, %xmm5 mulsd %xmm7, %xmm9 subsd %xmm3, %xmm10 movapd %xmm4, %xmm7 subsd %xmm11, %xmm13 addsd %xmm6, %xmm11 movsd .LC5(%rip), %xmm6 subsd %xmm1, %xmm2 xorpd %xmm0, %xmm6 addsd %xmm14, %xmm3 addsd %xmm12, %xmm9 movapd %xmm15, %xmm14 movapd %xmm0, %xmm12 subsd %xmm2, %xmm7 mulsd %xmm13, %xmm14 addsd %xmm4, %xmm2 movapd %xmm6, %xmm4 subsd %xmm9, %xmm5 mulsd %xmm3, %xmm0 addsd %xmm8, %xmm9 mulsd %xmm10, %xmm4 movapd %xmm15, %xmm8 mulsd %xmm15, %xmm10 mulsd %xmm11, %xmm15 movapd %xmm7, %xmm1 mulsd %xmm13, %xmm6 mulsd %xmm3, %xmm8 movapd %xmm9, %xmm3 mulsd %xmm11, %xmm12 addsd %xmm14, %xmm4 subsd %xmm0, %xmm15 movapd %xmm5, %xmm0 subsd %xmm10, %xmm6 movapd %xmm2, %xmm10 addsd %xmm12, %xmm8 addsd %xmm15, %xmm10 subsd %xmm15, %xmm2 addsd %xmm6, %xmm1 addsd %xmm8, %xmm3 movsd %xmm10, (%r13) movq (%rdx), %rbx subsd %xmm8, %xmm9 addsd %xmm4, %xmm0 subsd %xmm6, %xmm7 movsd %xmm3, 7(%r12,%rbx) movq (%r10), %r10 movq (%rdx), %r13 subsd %xmm4, %xmm5 movsd %xmm2, 7(%r13,%r10,2) movq (%r9), %rbx movq (%rdx), %r12 movsd %xmm9, 7(%r12,%rbx,2) movq (%r8), %r13 movq (%rdx), %r10 movsd %xmm1, 7(%r10,%r13,2) movq (%rdi), %r9 movq (%rdx), %rbx movsd %xmm0, 7(%rbx,%r9,2) movq (%rsi), %rsi movq (%rdx), %r8 movsd %xmm7, 7(%r8,%rsi,2) movq (%r11), %rdi movq (%rdx), %r12 movsd %xmm5, 7(%r12,%rdi,2) jg .L2755
Created attachment 17819 [details] time report related to comment 69, time for PR 31957 with no options
Created attachment 17820 [details] time for 31957, with rename-registers
Created attachment 17821 [details] time for 31957, with rename-registers no-move-loop-invariants
Created attachment 17822 [details] time for 31957, with rename-registers no-move-loop-invariants forward-propagate
Ok. One step at a time. :-) To recap, here is the situation:

- the CSE optimization you mention was *not* removed; it was moved to fwprop, so it does not run at -O1.
- once this was done, the way to go is to tune the new optimizations, not to reintroduce the old ones.
- for example, fwprop in turn triggered a bad choice in loop invariant motion, for which a patch has been posted. This patch will remove the need for -fno-move-loop-invariants on this testcase (this is a deficiency in LIM that is not specific to machine-generated code; OTOH, the presence of many fp[N] accesses helps trigger it).
- that scheduling is necessary now and not in 4.2.x is probably just a matter of luck.
- why renaming registers is necessary now and not in 4.2.x is still a mystery; but there is an explanation as to why it helps (it prolongs live ranges, something that on non-x86 archs is done by the pre-regalloc scheduling).
- at least we have a set of options providing good performance on this testcase, and guidance towards better tuning of the various problematic optimizations.

To conclude, nobody is underestimating the significance of this PR; it's just a matter of priorities. Near the end of the release cycle, you tend to look at PRs with small testcases to minimize the time spent understanding the code; near the beginning, you hope that new features magically fix the PRs, and you concentrate on wrong-code bugs and so on. Complex P2s such as this one unfortunately tend to stay in limbo.
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

On May 7, 2009, at 12:21 PM, bonzini at gnu dot org wrote:
> ------- Comment #74 from bonzini at gnu dot org 2009-05-07 16:21 -------
> Ok. One step at a time. :-) To recap, here is the situation:
>
> - that scheduling is necessary now and not in 4.2.x, probably is just a matter
> of luck

If you mean -fschedule-insns2, it has always been part of the options list.

> - at least we have a set of options providing good performance on this
> testcase, and guidance towards better tuning of the various problematic
> optimizations

OK, but -fforward-propagate is not viable in general for these machine-generated codes.

Brad
It should be possible to modify fwprop to avoid excessive memory usage (doing its own dataflow, basically, instead of using UD chains)
Re. comment #75: just the fact that an option is enabled in both releases doesn't mean the pass behind it is doing the same thing in both releases. What the scheduler does depends heavily on the code you feed it. Sometimes it is pure (good or bad) luck that changes the behavior of a pass in the compiler. The interactions between all the pieces are just very complicated (which is why, IMHO, retargetable-compiler engineering is so difficult: controlling the pipeline is undoable).

Re. comment #76: sad as it may be, I think this is the best short-term solution. Alternatively, we could rework fwprop to work on regions and use the partial-CFG dataflow stuff, similar to what the RTL loop optimizers (like loop-invariant) do. To be honest, I'd much prefer the latter, but the DIY fwprop thing is probably easier in the short term.
Subject: Bug 33928
Author: bonzini
Date: Fri May 8 06:51:12 2009
New Revision: 147270

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147270
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	* loop-invariant.c (struct use): Add addr_use_p.
	(struct def): Add n_addr_uses.
	(struct invariant): Add cheap_address.
	(create_new_invariant): Set cheap_address.
	(record_use): Accept df_ref.  Set addr_use_p and update n_addr_uses.
	(record_uses): Pass df_ref to record_use.
	(get_inv_cost): Do not add inv->cost to comp_cost for
	cheap addresses used only as such.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/loop-invariant.c
I'm cobbling up the DIY dataflow patch and it is all but ugly, actually.
Subject: Bug 33928
Author: bonzini
Date: Fri May 8 07:51:46 2009
New Revision: 147274

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147274
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	* loop-invariant.c (record_use): Fix && vs. || mishap.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/loop-invariant.c
Created attachment 17825 [details] speed up fwprop and enable it at -O1

Here is a patch I'm bootstrapping to remove fwprop's usage of UD chains. It does not affect the assembly output at all; it just changes the data structure that is used. compiler.i is probably too big for me, but I tried slatex.i, and with this patch fwprop was ~2% of compilation time.
Hm, looking at the time reports, the patch will save about 30-40% of the fwprop execution time and should fix the memory hog problem, but it will still leave the ~70 seconds needed to compute reaching definitions. I guess it's a step forward for -O2 but borderline for -O1.
Subject: Bug 33928

Author: bonzini
Date: Fri May 8 12:22:30 2009
New Revision: 147282

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147282
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	PR 26854
	* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
	process_uses, build_single_def_use_links): New.
	(update_df): Update use_def_ref.
	(forward_propagate_into): Use get_def_for_use instead of use-def chains.
	(fwprop_init): Call build_single_def_use_links and let it initialize
	dataflow.
	(fwprop_done): Free use_def_ref.
	(fwprop_addr): Eliminate duplicate call to df_set_flags.
	* df-problems.c (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	(df_rd_bb_local_compute_process_def): Update head comment.
	(df_chain_create_bb): Use the new RD simulation functions.
	* df.h (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	* opts.c (decode_options): Enable fwprop at -O1.
	* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-problems.c
    trunk/gcc/df.h
    trunk/gcc/doc/invoke.texi
    trunk/gcc/fwprop.c
    trunk/gcc/opts.c
Ok, I am working on a patch to add a multiple-definitions DF problem and use that together with a domwalk to find the single definitions (instead of reaching definitions, which is the remaining slow part). The new problem has a bitvector sized by the number of registers rather than the number of defs (i.e., sized like the bitvectors for liveness), which means it will be fast. It is defined as follows:

MDkill(B) = regs that have a def in B
MDinit(B) = (union of MDkill(P) for every P such that B \in DomFrontier(P)) \cap LRin(B)
MDin(B)   = MDinit(B) \cup (union of MDout(P) for every predecessor P of B)
MDout(B)  = MDin(B) - MDkill(B)
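Reading MDinit as (union ...) masked by LRin(B), the equations above can be solved with a standard iterative fixpoint. Here is a toy sketch (hypothetical, not GCC's implementation) using Python ints as per-register bitvectors, with block names, dominance frontiers, and liveness supplied as inputs:

```python
# Toy fixpoint solver for the "multiple definitions" problem described
# above.  A set bit in md_in[B] means that register may have more than
# one definition reaching the entry of block B.

def solve_md(blocks, preds, kill, dom_frontier, live_in):
    # MDinit(B) = (union of kill(P) for P with B in DF(P)) & live_in(B)
    init = {b: 0 for b in blocks}
    for p in blocks:
        for b in dom_frontier[p]:
            init[b] |= kill[p]
    for b in blocks:
        init[b] &= live_in[b]

    md_in = {b: 0 for b in blocks}
    md_out = {b: 0 for b in blocks}
    changed = True
    while changed:                      # iterate to a fixpoint
        changed = False
        for b in blocks:
            new_in = init[b]
            for p in preds[b]:          # MDin(B) = MDinit(B) | union MDout(P)
                new_in |= md_out[p]
            new_out = new_in & ~kill[b] # MDout(B) = MDin(B) - MDkill(B)
            if new_in != md_in[b] or new_out != md_out[b]:
                md_in[b], md_out[b] = new_in, new_out
                changed = True
    return md_in, md_out
```

On a diamond CFG where both arms define the same register, that register shows up as multiply-defined at the join block's entry, which is exactly the information the domwalk needs to skip non-single definitions.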
Created attachment 17878 [details] Large test file for testing time and memory usage This is the file compiler.i used in the previous tests.
Created attachment 17879 [details] Time and memory report for compiler.i

This is the time and memory report after the hack from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39301#c8 to make the statistic fields HOST_WIDEST_INTs. Some interesting lines:

fwprop.c:178 (build_single_def_use_links)        8  8438189160      82240           0  1027496
df-problems.c:311 (df_rd_alloc)             155420  8433928200 8433870880  8433870880        0
df-problems.c:593 (df_rd_transfer_functio   909666 40718919320 6755812320  6755736840  2025096
Total                                     13171390 61130398320
The compiler options for the previous report: /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers -fno-move-loop-invariants -fforward-propagate -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report -fmem-report > & rename-no-move-loop-invariants-forward-propagate-report-new
Created attachment 17963 [details] patch I'm testing

Here is a patch I'm testing that completes the rewrite of fwprop's dataflow. This should make it much faster and less memory hungry. It should also keep the generated code fast (with -frename-registers of course); if not, that's a bug in the patch.
Created attachment 17964 [details] correct version oops, the previous one didn't work at -O1 even though it bootstrapped :-)
Yo, with the patch the time to compile compiler.i with the given options is 331s on my machine (with a checking compiler). fwprop takes only 1% (including computation of the new dataflow problem). I'd estimate around 250s with your non-checking build. I'll split the patch and post it tomorrow.
Created attachment 17968 [details] time and memory report for compiler.i after Paolo's patch

The patch cut the total bitmap memory used compiling compiler.i from >60GB to 3GB; maximum memory (just from top) was 1631MB.
In the meanwhile something caused "tree incremental SSA" to jump up from 10s to 26s. Sob.
I would say that was the new SRA.
(In reply to comment #92)
> In the meanwhile something caused "tree incremental SSA" to jump up from 10s
> to 26s. Sob.

(In reply to comment #93)
> I would say that was the new SRA.

OK, I'll try to investigate. Which of the various attachments to this bug is the one to look at?

Martin
The test case is compiler.i.gz
Sorry, the gcc options are in comment 87 (the -fforward-propagate is now redundant), and without Paolo's recently proposed patch it requires about 9GB of memory to compile.
Brad, could you try to time compiler.i with and without -ftime-report to see how much of the "tree stmt walking" timevar is just accounting overhead?
I don't quite understand how you would like me to configure and run the test. First, I've applied your patches to speed up computing DF to my tree; do you want them included in the test, or should I use a pristine mainline? Second, when configuring mainline, should I include, or not include

1. --enable-gather-detailed-mem-stats
2. --enable-checking=release

After that, I think you just want to run two compiles with and without -ftime-report, is that right? (Nothing about -fmem-report.)
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

> First, I've applied your patches to speed up computing DF to my tree; do you
> want them included in the test, or should I use a pristine mainline?

It doesn't matter, but yes, use them.

> Second, when configuring mainline, should I include, or not include
>
> 1. --enable-gather-detailed-mem-stats
> 2. --enable-checking=release

Again it shouldn't matter, but use only --enable-checking=release.

> After that, I think you just want to run two compiles with and without
> -ftime-report, is that right? (Nothing about -fmem-report.)

Yes, and the output of -ftime-report is not needed. Just the "time ./cc1 ..." output for the two. Thanks!
Just as a reminder for after the fwprop patches are committed: the problem in CFG cleanup is that the iterative fixing of dominators in remove_edge_and_dominated_blocks is very expensive. We should probably make sure no dominator info is live during some key cfgcleanup passes.
Time for cleanup. This bug is fixed on mainline, and likely WONTFIX on 4.3/4.4 (though it could in principle be fixed by backporting the fwprop patches to 4.4). I'll add some pointers to PR26854 for the attachments related to compile-time problems.
Subject: Re: [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by r118475

On Mon, 2009-06-15 at 16:20 +0000, paolo dot bonzini at gmail dot com wrote:
> Yes, and the output of -ftime-report is not needed. Just the "time
> ./cc1 ..." output for the two. Thanks!

The two commands:

time /pkgs/gcc-mainline/bin/gcc -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -c compiler.i
261.424u 1.184s 4:22.76 99.9% 0+0k 0+28456io 0pf+0w

time /pkgs/gcc-mainline/bin/gcc -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -c compiler.i -ftime-report
263.424u 4.900s 4:28.68 99.8% 0+0k 0+28480io 0pf+0w
Regarding comment #101 ...

With

heine:~/programs/gcc/objdirs/gsc-fft-tests/gambc-v4_1_2> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline --enable-languages=c --disable-multilib --enable-checking=release
Thread model: posix
gcc version 4.5.0 20090608 (experimental) [trunk revision 148276] (GCC)

(and including Paolo's patch to speed up DF), the routine in direct.c takes

168 ms cpu time (168 user, 0 system)

As reported here http://www.math.purdue.edu/~lucier/bugzilla/9/ with gcc-4.2.4, this routine takes 156 ms on the same machine. Comment #9 gives the code that 4.2.4 generates at the start of the main loop; the start of the main loop with the version of 4.5.0 I gave above is:

.L2938:
	movq	%rcx, %rdx
	addq	8(%rax), %rdx
	leaq	4(%rcx), %rbx
	movq	%rdx, -8(%rax)
	leaq	4(%rdx), %rdi
	addq	8(%rax), %rdx
	movq	%rdi, -16(%rax)
	movq	%rdx, -24(%rax)
	leaq	4(%rdx), %rdi
	addq	8(%rax), %rdx
	movq	%rdi, -32(%rax)
	movq	%rdx, -40(%rax)
	leaq	4(%rdx), %rdi
	movq	40(%rax), %rdx
	movq	%rdi, -48(%rax)
	movsd	7(%rdx,%rdi,2), %xmm7
	movq	-40(%rax), %rdi
	leaq	7(%rdx,%rcx,2), %r8
	addq	$8, %rcx
	movsd	(%r8), %xmm4
	cmpq	%rcx, %r13
	movsd	7(%rdx,%rdi,2), %xmm10
	movq	-32(%rax), %rdi
	movsd	7(%rdx,%rdi,2), %xmm5
	movq	-24(%rax), %rdi
	movsd	7(%rdx,%rdi,2), %xmm6
	movq	-16(%rax), %rdi
	movsd	7(%rdx,%rdi,2), %xmm13
	movq	-8(%rax), %rdi
	movsd	7(%rdx,%rdi,2), %xmm11
	leaq	(%rbx,%rbx), %rdi
	movsd	7(%rdi,%rdx), %xmm9
	movq	24(%rax), %rdx
	movapd	%xmm11, %xmm14
	movsd	15(%rdx), %xmm1
	movsd	7(%rdx), %xmm2
	movapd	%xmm1, %xmm8
	movsd	31(%rdx), %xmm3
	movapd	%xmm2, %xmm12
	mulsd	%xmm10, %xmm8
	mulsd	%xmm7, %xmm12
	mulsd	%xmm2, %xmm10
	mulsd	%xmm1, %xmm7
	movsd	23(%rdx), %xmm0

So, to my mind, this is still a 4.5 regression, as there is still a slow-down and the code is still much less optimized by 4.5.0 than by 4.2.4.
168/156 ~ 1.08, so if you want to change the Summary of this bug to 8% regression, or some other things, that's fine, but I've changed this PR back to being a 4.5 regression. I was not really thrilled when Richard marked PR 39157 as a duplicate of this PR. To my mind, there are three more or less independent things---run time of Gambit-generated code, compile time of the code, and the space required to compile the code. This PR is about run time; PR 39157 was about space needed by the compiler; PR 26854 is about compile time. They seem to have all been mushed together.
I understood that with -frename-registers the regression is fixed. As I said, without a pre-regalloc scheduling pass and without register renaming, the scheduling quality you get is more or less random.
Marking PR39157 as a duplicate of PR26854 is not quite accurate (only the fwprop part is a duplicate, because we were getting large compile times from building large data structures; the CFG cleanup part is not exactly a duplicate), but I don't think it's important, because we have a patch for the fwprop issue anyway.
This machine has 4ms ticks, so we're getting down to a few ticks' difference with a benchmark of this size. It's 156ms with 4.2.4, 168ms with 4.5.0, and 164ms when -frename-registers is added to the command line. It's not just scheduling; there are more memory accesses with 4.5.0. With a problem roughly 10 times as large, the times are

4.2.4: 2912ms
4.5.0: 3204ms
4.5.0: 3120ms (adding -frename-registers)

So there's a 7% difference with -frename-registers.
GCC 4.3.4 is being released, adjusting target milestone.
direct.c contains a direct FFT; I've compiled the direct and inverse FFT and run it on arrays with 2^23 double-precision complex elements, with

heine:~/programs/gcc/objdirs/bench-mainline-on-fft> /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release --prefix=/pkgs/gcc-mainline --enable-languages=c,c++ -enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC)

The compile options were

/pkgs/gcc-mainline/bin/gcc -save-temps -c -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared -fschedule-insns

and the same without -fschedule-insns. The runtime for direct+inverse FFT with instruction scheduling was 1.264 seconds, and the time for direct+inverse FFT without -fschedule-insns was 1.444 seconds, which is a 14% speedup for that one compiler option. This is on a 2.33GHz Core 2 quad machine. I'll attach the inner loops of direct.c with and without -fschedule-insns. I haven't been able to compile the complete Gambit runtime with -fschedule-insns on either x86-64 or ppc64; I've filed PR41164 and PR41176 for those two different failures.
Created attachment 18432 [details] inner loop of direct.c with -fschedule-insns
Created attachment 18433 [details] inner loop of direct.c without -fschedule-insns
I can compile Gambit 4.1.2 with -fschedule-insns except for the function noted in PR41164. On

model name : Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz

with

gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC)

the times with -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    144 ms cpu time (144 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    136 ms cpu time (136 user, 0 system)

and the times without -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    168 ms cpu time (168 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    172 ms cpu time (172 user, 0 system)

That's a pretty big improvement.
Subject: Bug 33928

Author: bergner
Date: Sat Oct 3 01:39:14 2009
New Revision: 152430

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=152430
Log:
Backport from mainline.

2009-08-30  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (get_reg_use_in): Delete.
	(free_load_extend): New function.
	(forward_propagate_subreg): Use it.

2009-08-23  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (try_fwprop_subst): Allow multiple sets.
	(get_reg_use_in): New function.
	(forward_propagate_subreg): Propagate through subreg of zero_extend
	or sign_extend.

2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	PR 26854
	* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
	process_uses, build_single_def_use_links): New.
	(update_df): Update use_def_ref.
	(forward_propagate_into): Use get_def_for_use instead of use-def chains.
	(fwprop_init): Call build_single_def_use_links and let it initialize
	dataflow.
	(fwprop_done): Free use_def_ref.
	(fwprop_addr): Eliminate duplicate call to df_set_flags.
	* df-problems.c (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	(df_rd_bb_local_compute_process_def): Update head comment.
	(df_chain_create_bb): Use the new RD simulation functions.
	* df.h (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	* opts.c (decode_options): Enable fwprop at -O1.
	* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_3-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_3-branch/gcc/REVISION
    branches/ibm/gcc-4_3-branch/gcc/df-problems.c
    branches/ibm/gcc-4_3-branch/gcc/df.h
    branches/ibm/gcc-4_3-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_3-branch/gcc/fwprop.c
    branches/ibm/gcc-4_3-branch/gcc/opts.c
Subject: Bug 33928

Author: bergner
Date: Thu Apr 29 14:34:35 2010
New Revision: 158902

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158902
Log:
Backport from mainline.

2009-08-30  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (get_reg_use_in): Delete.
	(free_load_extend): New function.
	(forward_propagate_subreg): Use it.

2009-08-23  Alan Modra  <amodra@bigpond.net.au>

	PR target/41081
	* fwprop.c (try_fwprop_subst): Allow multiple sets.
	(get_reg_use_in): New function.
	(forward_propagate_subreg): Propagate through subreg of zero_extend
	or sign_extend.

2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

	PR rtl-optimization/33928
	PR 26854
	* fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
	process_uses, build_single_def_use_links): New.
	(update_df): Update use_def_ref.
	(forward_propagate_into): Use get_def_for_use instead of use-def chains.
	(fwprop_init): Call build_single_def_use_links and let it initialize
	dataflow.
	(fwprop_done): Free use_def_ref.
	(fwprop_addr): Eliminate duplicate call to df_set_flags.
	* df-problems.c (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	(df_rd_bb_local_compute_process_def): Update head comment.
	(df_chain_create_bb): Use the new RD simulation functions.
	* df.h (df_rd_simulate_artificial_defs_at_top,
	df_rd_simulate_one_insn): New.
	* opts.c (decode_options): Enable fwprop at -O1.
	* doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_4-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_4-branch/gcc/df-problems.c
    branches/ibm/gcc-4_4-branch/gcc/df.h
    branches/ibm/gcc-4_4-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_4-branch/gcc/fwprop.c
    branches/ibm/gcc-4_4-branch/gcc/opts.c
GCC 4.3.5 is being released, adjusting target milestone.
Hm, there doesn't seem to be a runtime testcase attached to this bug, so I can't produce numbers for the upcoming 4.6 release. Brad, can you do so if you have time? Thanks.

Btw, how difficult is it to set up continuous performance testing of Gambit? Is Gambit reasonably self-contained (no external dependencies, commandline-driven)? I'm considering adding it to http://gcc.opensuse.org/c++bench/. I probably can get it built but would appreciate hints on how to set up an automated performance test.
On Fri, 2011-03-04 at 11:59 +0000, rguenth at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
>
> --- Comment #115 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-03-04 11:58:13 UTC ---
> Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
> can't produce numbers for the upcoming 4.6 release. Brad, can you do so
> if you have time?

I'll work on it.

I just went through all the comments in this bug report to remind me of the issues, of which there seem to be two. The first is the runtime performance of the direct FFT in direct.c, as discussed, e.g., in comment 103

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928#c103

and the second is the compile-time performance.

I presume you want to know about the performance of the FFT code. This is a very specific benchmark (one routine) and would not be indicative of general

> Btw, how difficult is it to setup a continuous performance testing of Gambit?
> Is Gambit reasonably self-contained (no external dependenices,
> commandline-driven)? I'm considering to add it to
> http://gcc.opensuse.org/c++bench/
> I probably can get it built but would appreciate hints on how to setup an
> automated performance test.

It's completely self-contained and very portable. Benchmarking could be automated. It has a benchmark suite that measures runtime and compile-time performance of a number of programs, most small, but some larger (so compilation used to take quite a few GB of memory and several minutes or more of CPU time; these are not benchmarked by default; would you want to run these as extreme tests of the compiler?).

I'll talk with Marc Feeley, the author of Gambit, about how to automate the benchmarks; it will probably require just "make bench" with various options if desired.

Brad
On Fri, 4 Mar 2011, lucier at math dot purdue.edu wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
>
> --- Comment #116 from lucier at math dot purdue.edu 2011-03-04 16:09:13 UTC ---
> On Fri, 2011-03-04 at 11:59 +0000, rguenth at gcc dot gnu.org wrote:
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928
> >
> > --- Comment #115 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-03-04 11:58:13 UTC ---
> > Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
> > can't produce numbers for the upcoming 4.6 release. Brad, can you do so
> > if you have time?
>
> I'll work on it.

Thanks.

> I just went through all the comments in this bug report to remind me of
> the issues, of which there seem to be two. The first is the runtime
> performance of the direct FFT in direct.c, as discussed, e.g., in
> comment 103
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928#c103
>
> and the second is the compile-time performance.
>
> I presume you want to know about the performance of the FFT code. This
> is a very specific benchmark (one routine) and would not be indicative
> of general

Yes, I want to know about runtime performance.

> > Btw, how difficult is it to setup a continuous performance testing of Gambit?
> > Is Gambit reasonably self-contained (no external dependenices,
> > commandline-driven)? I'm considering to add it to
> > http://gcc.opensuse.org/c++bench/
> > I probably can get it built but would appreciate hints on how to setup an
> > automated performance test.
>
> It's completely self-contained and very portable. Benchmarking could be
> automated. It has a benchmark suite that measures runtime and
> compile-time performance of a number of programs, most small, but some
> larger (so compilation used to take quite a few GB of memory and several
> minutes or more of CPU time; these are not benchmarked by default; would
> you want to run these as extreme tests of the compiler?).
> I'll talk with Marc Feeley, the author of Gambit, about how to automate
> the benchmarks; it will probably require just "make bench" with various
> options if desired.

Ah, so it's not Gambit from TAMU (the game theory software) then ;)

Richard.
On Fri, 2011-03-04 at 11:59 +0000, rguenth at gcc dot gnu.org wrote:
> Hm, there doesn't seem to be a runtime testcase attached to this bug, so I
> can't produce numbers for the upcoming 4.6 release. Brad, can you do so
> if you have time?

At http://www.math.purdue.edu/~lucier/bugzilla/14/ is a Readme file and a tarball; I think it should be easy to script a runtime test for this PR from the instructions in the Readme file. Later we'll devise a "make bench" for general Gambit benchmarking.

Brad
It's nearly impossible to examine the assembly code responsible for the FFT in the package I set up in the previous comment. If you want a runtime benchmark for this PR where you can easily examine the code I'll have to do more work.
At http://www.math.purdue.edu/~lucier/bugzilla/15/ I've put a tarfile and instructions that allow one to build Gambit-C in a way that splits out the FFT code into its own C function, so the assembly code can be more easily examined. Brad
I'm inclined to close this as "Fixed" for 4.6.0. I've taken the file mentioned in the previous comment and followed the instructions in the readme. The times for a forward FFT of 2^25 complex doubles on a 2.4GHz Intel Core i5 on x86_64-apple-darwin10.7.0 are as follows.

With the usual compiler options of

-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp

4.5.2: 2433 ms cpu time (2427 user, 6 system)
4.6.0: 2158 ms cpu time (2154 user, 4 system)

Adding -fschedule-insns -march=native to the above:

4.5.2: 2067 ms cpu time (2060 user, 7 system)
4.6.0: 2016 ms cpu time (2012 user, 4 system)

The assembly for the main loop looks much better.
Just to be clear, the command to do the test is gsi/gsi -e '(define a (expt 3 100000000))(set! *bench-bignum-fft* #t)(define b (* a a))'
Fixed for GCC 4.6.