According to http://www.suse.de/~gcctest/c++bench/polyhedron/, the run time of "NF" increased from 37.70s to 41.57s (+10%) between 071031 and 071101 on an AMD Opteron system. I obtained the same result at http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/, also on AMD64. Interestingly, while x86-64 (-m64) gets slower, with -m32 I see no change (apart from noise) - neither for x87 nor for SSE(3).

To boil it down to version numbers, I believe the change must have happened between the following versions:

Fast: 4.3.0 20071031 (experimental) [trunk revision 129791]
Slow: 4.3.0 20071031 (experimental) [trunk revision 129797]

Essentially only the following check-ins are possible culprits:

- PR tree-optimization/32377 (compute_overlap_steps_for_affine_univar):
  Make it work also for unknown number of iterations. [Sebastian Pop]
- PR middle-end/33779: fold-const.c (extract_muldiv_1): Make sure to not
  introduce new undefined integer overflow. [Richard Guenther]
- PR fortran/33897: decl.c (gfc_match_entry): Do not make ENTRY name global
  for contained procedures. parse.c (gfc_fixup_sibling_symbols): Fix code
  for determining whether a procedure is external. [Paul Thomas]

The Polyhedron test case is available from:
http://www.polyhedron.co.uk/MFL6VW74649
Confirmed.
Well, this bug needs proper analysis and a testcase, but yes, I also see this slowdown.
(In reply to comment #2)
> Well, this bug needs proper analysis and a testcase, but yes, I also see
> this slowdown.

Richi, the only difference in generated code comes from backing out your patch [1].

[1] http://gcc.gnu.org/viewcvs?view=rev&revision=129796

The other suspected patches have no effect on the code generated with "gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3".

I will check execution times, but since I have a Core2, perhaps there will be no slowdown. Can you benchmark on your target with [1] backed out?
Confirmed also on Core2.

Benchmarked with the patch:

22 Apr 2008 18:47:04 gfortran - Compile nf
command=gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3 nf.s -o nf
22 Apr 2008 18:47:04 gfortran - Execute nf
nf  Run # 1   21.46340   21.46340 - Error=100.0000%
nf  Run # 2   21.45860   21.46100 - Error=  0.0112%
Geometric Mean Execution Time = 21.46 seconds

Benchmarked without the patch:

22 Apr 2008 18:44:51 gfortran - Compile nf
command=gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3 nf.s -o nf
22 Apr 2008 18:44:51 gfortran - Execute nf
nf  Run # 1   19.46120   19.46120 - Error=100.0000%
nf  Run # 2   19.46200   19.46160 - Error=  0.0021%
Geometric Mean Execution Time = 19.46 seconds
(In reply to comment #3)
> [1] http://gcc.gnu.org/viewcvs?view=rev&revision=129796

It was a correctness fix, which usually will slow down generated code. :)
So you have to look at the difference to make sure that the code generated before was actually producing no overflows.

Thanks,
Andrew Pinski
Indeed. It would be interesting to analyze what optimization the folding enabled and see if that can be recovered somehow.
Created attachment 15527 [details]
x86_64 asm dump of the trisolve procedure (generated without the patch)

All the difference is in the trisolve procedure (attached). Performance is 10% better if trisolve in the dump is substituted with the attached function. I'm using -O2 -funroll-loops.

BTW: There are two loops in this asm (.L3 and .L5). In the current asm, the suspicious parts are:

        movsd   16(%r9), %xmm6
        mulsd   16(%r8), %xmm6

and

        mulsd   -16(%rdx), %xmm0
        mulsd   -16(%r11), %xmm0

That is, loads from different addresses that are not present in the non-patched asm.
The problem is indeed in trisolve:

      subroutine trisolve(x,i1,i2)
        integer :: i1 , i2
        real(dpkind),dimension(i2)::x
        integer :: i

        x(i1) = gi(i1)* x(i1)
        do i = i1+1 , i2
           x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
        enddo
        do i = i2-1 , i1 , -1
           x(i) = x(i) - gi(i)*au1(i)*x(i+1)
        enddo
      end subroutine trisolve

Please note the two very tight loops that calculate x(i) from the value x(i-1), where x(i-1) is the result of the previous step.

The .127t.optimized tree dump for the first loop (the second loop is the same, only going from the last to the first element) in the non-regressed case shows:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] - MEM[base: ivtmp.300] * MEM[base: ivtmp.297, offset: 0x0fffffffffffffff8]));
  ivtmp.295 = ivtmp.295 + D.3347;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652)
    goto <bb 5>;
  else
    goto <bb 4>;

This code results in:

.L3:
        movsd   (%r9), %xmm10
        addl    $4, %edx
        movsd   (%rcx), %xmm9
X+>     mulsd   -8(%rcx), %xmm10
        movsd   8(%rcx), %xmm7
        movsd   16(%rcx), %xmm5
        movsd   24(%rcx), %xmm3
        subsd   %xmm10, %xmm9
        mulsd   (%rax), %xmm9
        addq    %r10, %rax
1->     movsd   %xmm9, (%rcx)
        movsd   8(%r9), %xmm8
1+>     mulsd   %xmm9, %xmm8
        subsd   %xmm8, %xmm7
        mulsd   (%rax), %xmm7
        addq    %r10, %rax
2->     movsd   %xmm7, 8(%rcx)
        movsd   16(%r9), %xmm6
2+>     mulsd   %xmm7, %xmm6
        subsd   %xmm6, %xmm5
        mulsd   (%rax), %xmm5
        addq    %r10, %rax
3->     movsd   %xmm5, 16(%rcx)
        movsd   24(%r9), %xmm4
        addq    $32, %r9
3+>     mulsd   %xmm5, %xmm4
        subsd   %xmm4, %xmm3
        mulsd   (%rax), %xmm3
        addq    %r10, %rax
X->     movsd   %xmm3, 24(%rcx)
        addq    $32, %rcx
        cmpl    %ebp, %edx
        jne     .L3

In the code above, it can be seen how the unrolled iterations are linked together: the result from the previous iteration (marked with N->) enters the next iteration (marked with N+>). BTW: The optimizer could also link X-> and X+>, but this is probably too much...

The patched gcc regressed in this area:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] - MEM[base: ivtmp.300] * MEM[base: ivtmp.302]));
  ivtmp.295 = ivtmp.295 + D.3349;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.302 = ivtmp.302 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652)
    goto <bb 5>;
  else
    goto <bb 4>;

This code results in:

.L3:
        movsd   (%r9), %xmm10
        addl    $4, %edx
        movsd   (%rcx), %xmm9
X+>     mulsd   (%r8), %xmm10
        movsd   8(%rcx), %xmm7
        movsd   16(%rcx), %xmm5
        movsd   24(%rcx), %xmm3
        subsd   %xmm10, %xmm9
        mulsd   (%rax), %xmm9
        addq    %rbx, %rax
1->     movsd   %xmm9, (%rcx)
        movsd   8(%r9), %xmm8
1+>     mulsd   8(%r8), %xmm8
        subsd   %xmm8, %xmm7
        mulsd   (%rax), %xmm7
        addq    %rbx, %rax
2->     movsd   %xmm7, 8(%rcx)
        movsd   16(%r9), %xmm6
2+>     mulsd   16(%r8), %xmm6
        subsd   %xmm6, %xmm5
        mulsd   (%rax), %xmm5
        addq    %rbx, %rax
3->     movsd   %xmm5, 16(%rcx)
        movsd   24(%r9), %xmm4
        addq    $32, %r9
3+>     mulsd   24(%r8), %xmm4
        addq    $32, %r8
        subsd   %xmm4, %xmm3
        mulsd   (%rax), %xmm3
        addq    %rbx, %rax
X->     movsd   %xmm3, 24(%rcx)
        addq    $32, %rcx
        cmpl    %r12d, %edx
        jne     .L3

In the code above, the links are broken: in the ".+>" cases, gcc reloads from memory the same value that is otherwise available in the register marked with ".->".
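For reference, here is a minimal C sketch of the first loop (not from the benchmark sources; it assumes 0-based copies of the Fortran arrays, sized at least i2+1 to mirror the 1-based indexing). Written with an explicit scalar, the loop-carried dependence is visible at the source level, which is exactly the register link the non-regressed compiler reconstructs:

/* Minimal C sketch of the forward trisolve loop.  The scalar 'prev'
   carries x[i-1] across iterations: the value stored to x[i] is exactly
   the x[i-1] read by the next iteration, so no reload from memory is
   needed -- the link the patched compiler loses. */
void trisolve_fwd(double *x, const double *gi, const double *au1,
                  int i1, int i2)
{
    double prev = x[i1];                        /* lives in a register */
    for (int i = i1 + 1; i <= i2; i++) {
        prev = gi[i] * (x[i] - au1[i - 1] * prev);
        x[i] = prev;                            /* store once, never reload */
    }
}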
Not hoisting the load from x(i) is a missed PRE opportunity. Complete testcase for the second loop:

subroutine trisolve2(x,i1,i2,nxyz)
  integer :: nxyz
  real,dimension(nxyz):: au1
  real,allocatable,dimension(:) :: gi
  integer :: i1 , i2
  real,dimension(i2)::x
  integer :: i

  allocate(gi(nxyz))

  do i = i1+1 , i2
     x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
  enddo
end subroutine trisolve2
(In reply to comment #9)
> Not hoisting the load from x(i) is a missed PRE opportunity. Complete
> testcase for the second loop:

This is actually the first loop. Just for reference: the "-O2 -funroll-loops" flags are needed.
GCC 4.3.3 is being released, adjusting target milestone.
On the trunk, using "-O2" or "-O3" can produce slower code. I built gcc version 4.4.0 20090126 [trunk revision 143680] for i386-redhat-linux and was dismayed to find that libmudflap had a few FAILs:

Results for 4.4.0 20090126 (experimental) [trunk revision 143680] (GCC) testsuite on i386-redhat-linux-gnu
http://gcc.gnu.org/ml/gcc-testresults/2009-01/msg02853.html

The file "libmudflap.cth/pass40-frag.c" fails with NO optimization due to:

6100 6200
WARNING: program timed out.
FAIL: libmudflap.cth/pass40-frag.c output pattern test

Since it is only completing 62% of its task, the timeout needs an appropriate increase.

The file "libmudflap.cth/pass40-frag.c" fails with "-O2" due to:

4100 4200 4300 4400 4500 4600 4700 4800 4900
WARNING: program timed out.
FAIL: libmudflap.cth/pass40-frag.c (-O2) output pattern test

With "-O2" it only completes 49% (slower than default).

The file "libmudflap.cth/pass40-frag.c" fails with "-O3" due to:

5100 5200 5300 5400 5500 5600 5700 5800
WARNING: program timed out.
FAIL: libmudflap.cth/pass40-frag.c (-O3) output pattern test

With "-O3" it only completes 58% (slower than default, but faster than "-O2").

Rob
Predictive commoning does exactly what you want.
(In reply to comment #13)
> Predictive commoning does exactly what you want.

It is not effective for the testcase in comment #9. The dumps for the innermost loop are the same for -O2 -funroll-loops [-fpredictive-commoning]:

.L6:
        movss   (%rsi), %xmm9
        addl    $4, %r8d
        mulss   (%rcx), %xmm9
        movss   (%rdx), %xmm8
        movss   4(%rdx), %xmm6
        movss   8(%rdx), %xmm4
        movss   12(%rdx), %xmm2
        subss   %xmm9, %xmm8
        mulss   0(%rbp), %xmm8
        movss   %xmm8, (%rdx)
        movss   4(%rsi), %xmm7
        mulss   4(%rcx), %xmm7
        subss   %xmm7, %xmm6
        mulss   4(%rbp), %xmm6
        movss   %xmm6, 4(%rdx)
        movss   8(%rsi), %xmm5
        mulss   8(%rcx), %xmm5
        subss   %xmm5, %xmm4
        mulss   8(%rbp), %xmm4
        movss   %xmm4, 8(%rdx)
        movss   12(%rsi), %xmm3
        addq    $16, %rsi
        mulss   12(%rcx), %xmm3
        addq    $16, %rcx
        subss   %xmm3, %xmm2
        mulss   12(%rbp), %xmm2
        addq    $16, %rbp
        movss   %xmm2, 12(%rdx)
        addq    $16, %rdx
        cmpl    %r9d, %r8d
        jne     .L6
(In reply to comment #14)
> (In reply to comment #13)
> > Predictive commoning does exactly what you want.

Predictive commoning failed: no suitable chains.
With -m32 the dump says "Executing predictive commoning without unrolling." instead.

One of the cases SCEV is confused about pointer-plus offsets being sizetype:

(Data Ref:
  stmt: (*x_58(D))[D.1627_54] = D.1638_71;
  ref: (*x_58(D))[D.1627_54];
  base_object: (*x_58(D))[0];
  Access function 0: {(integer(kind=8)) i_43 + -1, +, 1}_1
  Access function 1: 0B

vs.

(Data Ref:
  stmt: D.1634_67 = (*x_58(D))[D.1632_62];
  ref: (*x_58(D))[D.1632_62];
  base_object: (*x_58(D))[0];
  Access function 0: {(integer(kind=8)) (i_43 + -1) + -1, +, 1}_1
  Access function 1: 0B
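Transcribed to C (an illustrative sketch, not the dump's exact representation), the two access functions above reduce to the following index computations; because the widening cast sits in different places, dependence analysis cannot prove the load index is always exactly one below the store index, so no reuse chain is formed:

/* Index expressions as SCEV sees them; 'i' stands for the 32-bit
   Fortran loop counter (names are hypothetical). */
long index_for_store(int i) { return (long)i - 1; }        /* x(i)   */
long index_for_load(int i)  { return (long)(i - 1) - 1; }  /* x(i-1) */
/* index_for_load(i) == index_for_store(i) - 1 holds only if i - 1
   does not wrap, which is exactly what the analyzer must be allowed
   to assume. */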
(In reply to comment #16)
> One of the cases SCEV is confused about pointer-plus offsets being sizetype:

Do we have a solution for this problem...?
Subject: Re: [4.3/4.4/4.5 Regression] 10% performance regression since Nov 1 on Polyhedron's "NF" on AMD64

On Fri, 3 Jul 2009, ubizjak at gmail dot com wrote:

> ------- Comment #17 from ubizjak at gmail dot com  2009-07-03 08:46 -------
> (In reply to comment #16)
>
> > One of the cases SCEV is confused about pointer-plus offsets being sizetype:
>
> Do we have a solution for this problem...?

My hope is that no-undefined-overflow will somehow magically solve these problems ... otherwise no, there is unfortunately no way out here.

Richard.
In fact, in this case we have the C equivalent of

  int i;
  long j = (long)(i - 1);

vs.

  long j = (long)i - 1;

which I believe are equivalent if overflow is undefined (or i - 1 does not wrap). It is just that fold obviously considers (long)i - 1 to be more expensive than (long)(i - 1) and thus does not transform the latter into the former (and it cannot transform (long)i - 1 into (long)(i - 1): even if (long)i - 1 does not overflow, there is no guarantee that i - 1 does not). We should be able to do the former transformation during SCEV analysis, though.

I have a patch which results in (-O3 -ffast-math -funroll-loops):

.L6:
        mulss   (%rcx), %xmm0
        movss   (%rdx), %xmm5
        movss   4(%rdx), %xmm4
        addl    $4, %ebp
        subss   %xmm0, %xmm5
        movss   8(%rdx), %xmm0
        mulss   (%rsi), %xmm5
        movss   %xmm5, (%rdx)
        mulss   4(%rcx), %xmm5
        subss   %xmm5, %xmm4
        mulss   4(%rsi), %xmm4
        movss   %xmm4, 4(%rdx)
        movss   8(%rcx), %xmm3
        mulss   %xmm4, %xmm3
        subss   %xmm3, %xmm0
        mulss   8(%rsi), %xmm0
        movss   %xmm0, 8(%rdx)
        movss   12(%rcx), %xmm2
        addq    $16, %rcx
        mulss   %xmm0, %xmm2
        movss   12(%rdx), %xmm0
        subss   %xmm2, %xmm0
        mulss   12(%rsi), %xmm0
        addq    $16, %rsi
        movss   %xmm0, 12(%rdx)
        addq    $16, %rdx
        cmpl    %r8d, %ebp
        jne     .L6
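For illustration, a small self-contained C program (an addition here, assuming LP64, i.e. 64-bit long) showing that the two forms diverge exactly when i - 1 wraps; since signed overflow is undefined in C, the wrapping variant is emulated with unsigned arithmetic:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int i = INT_MIN;
    /* (long)(i - 1) under wraparound semantics: subtract in 32 bits
       first, then widen.  The unsigned detour keeps it well-defined. */
    long a = (long)(int)((unsigned)i - 1u);   /* 2147483647 */
    /* (long)i - 1: widen first, then subtract -- never overflows
       for a 64-bit long. */
    long b = (long)i - 1;                     /* -2147483649 */
    printf("(long)(i - 1) = %ld, (long)i - 1 = %ld\n", a, b);
    return 0;
}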
Before:

  Time for setup 0.139, Time per iteration 0.271, Total Time  6.649
  Time for setup 0.136, Time per iteration 0.265, Total Time 10.210
  Time for setup 0.134, Time per iteration 0.265, Total Time  7.276
  Time for setup 0.134, Time per iteration 0.260, Total Time 11.572

After:

  Time for setup 0.114, Time per iteration 0.238, Total Time  5.834
  Time for setup 0.111, Time per iteration 0.233, Total Time  8.948
  Time for setup 0.110, Time per iteration 0.237, Total Time  6.504
  Time for setup 0.112, Time per iteration 0.235, Total Time 10.454

which seems to exactly recover this regression.
Created attachment 18133 [details]
patch
Subject: Bug 34163

Author: rguenth
Date: Fri Jul  3 14:11:14 2009
New Revision: 149207

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=149207

Log:
2009-07-03  Richard Guenther  <rguenther@suse.de>

	PR middle-end/34163
	* tree-chrec.c (chrec_convert_1): Fold (T2)(t +- x) to
	(T2)t +- (T2)x if t +- x is known to not overflow and
	the conversion widens the operation.
	* Makefile.in (tree-chrec.o): Add $(FLAGS_H) dependency.

	* gfortran.dg/pr34163.f90: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr34163.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-chrec.c
Fixed on the trunk.
GCC 4.3.4 is being released, adjusting target milestone.
GCC 4.3.5 is being released, adjusting target milestone.
4.3 branch is being closed, moving to 4.4.7 target.
Fixed in 4.5+, 4.4 is no longer supported.