GCC trunk rev. 154141 seems to handle less efficiently a convolution code than previous stable releases, it was also spotted in revision 153048. Here are some average timings on an Intel E5320 clocked at 1.86 GHz with 4 MB of L2 cache, Debian GNU/Linux with a 2.6.26 kernel. * with -O2 -march=native GCC 4.3.2 8239 ms GCC-4.4.2 8102 ms GCC-snapshot-20091105 9347 ms GCC-trunk-r154141 9343 ms * with -O2 GCC 4.3.2 8128 ms GCC-4.4.2 8158 ms GCC-snapshot-20091105 9824 ms GCC-trunk-r154141 9828 ms * with -O1 GCC 4.3.2 20926 ms GCC-4.4.2 8277 ms GCC-snapshot-20091105 9369 ms GCC-trunk-r154141 9375 ms * with -O0 GCC 4.3.2 34061 ms GCC-4.4.2 34241 ms GCC-snapshot-20091105 34903 ms GCC-trunk-r154141 34910 ms GCC compiled with : configure --prefix=/export/home/nicolas/gcc/trunk-install --enable-languages=c --disable-multilib --disable-bootstrap --enable-checking=release I haven't been able to track down the origin of the performance difference. Note that data are not initialized in the attached code, as the slowdown is observed wether they are or not. ---BEGIN code--- #define N 1024*512 #define M 512 #define ITER 16 double in[N]; double H[M]; double vH[N]; int main ( int argc, char **argv ) { int i, j, k; for ( i=0; i<ITER; ++i ) for ( j=0; j<N; ++j ) for ( k=0; (k<M)&&(k<=j); ++k ) vH[j] += H[k]*in[j-k]; return (int) vH[argc]; } ---END code---
Created attachment 19010 [details] Source file with a convolution loop pattern.
It looks like it is induction variable related.
Using integer instead of double, the performance difference is even more noticeable : * with -O1 GCC 4.4.2 7475 ms GCC-trunk-r154672 9390 ms
It seems that this regression first appeared with revision 151080 * with -O1 GCC-4.4.2 7.4 s GCC-trunk-r151078 7.4 s GCC-trunk-r151079 7.4 s GCC-trunk-r151080 9.4 s GCC-trunk-r151081 9.4 s GCC-trunk-r151082 9.4 s Changelog for revision 151080 http://gcc.gnu.org/ml/gcc-patches/2009-08/msg01336.html 009-08-25 Michael Matz <matz@suse.de> * expr.h (jumpifnot_1, jumpif_1, do_jump_1): Declare. * dojump.c (do_jump_by_parts_greater): Take two operands instead of full expression. (do_jump_by_parts_equality, do_compare_and_jump): Ditto. (jumpifnot_1, jumpif_1): New wrappers for do_jump_1. (do_jump): Split out code for simple binary comparisons into ... (do_jump_1): ... this, taking the individual operands and code. Change callers to helper function above accordingly. * expr.c (expand_expr_real_1): Use jumpifnot_1 for simple binary comparisons.
Hmpf, something fishy is going on, as this patch should have been only refactoring without influence on the generated code. I'll look at it.
Subject: Bug 42027 Author: matz Date: Sun Dec 13 21:51:34 2009 New Revision: 155196 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155196 Log: PR tree-optimization/42027 * dojump.c (do_jump <TRUTH_AND_EXPR, TRUTH_OR_EXPR>): Go to TRUTH_ANDIF_EXPR resp. TRUTH_ORIF_EXPR expander, instead of falling through. Modified: trunk/gcc/ChangeLog trunk/gcc/dojump.c
Fixed.
I am confused, a performance regression is still noticeable: * Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1 flag GCC-4.4.2 7364 ms GCC-trunk-r155286 9515 ms * Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1 flag GCC-4.4.1 5960 ms GCC-trunk-r155286 7355 ms Here is a diff on the assembly generated for the Intel E5320: $ diff 442/convol.s r155286/convol.s 11c11 < subl $8, %esp --- > subl $12, %esp 13d12 < movl $H, %esi 17c16 < imull (%esi,%eax,4), %ebx --- > imull H(,%eax,4), %ebx 22c21 < jg .L10 --- > setle %bl 24,25c23,25 < jle .L3 < .L10: --- > setle -21(%ebp) > testb %bl, -21(%ebp) > jne .L3 28c28 < .L6: --- > .L5: 31,32c31,32 < je .L5 < .L8: --- > je .L4 > .L7: 34c34 < js .L6 --- > js .L5 40c40 < .L5: --- > .L4: 43c43 < je .L7 --- > je .L6 46,47c46,47 < jmp .L8 < .L7: --- > jmp .L7 > .L6: 50c50 < addl $8, %esp --- > addl $12, %esp 60c60 < .ident "GCC: (GNU) 4.4.2" --- > .ident "GCC: (GNU) 4.5.0 20091216 (experimental)"
Here is a unified diff which focuses on the inner-loop exit conditions. --- 442/convol.s +++ r155286/convol.s .L3: movl (%edx), %ebx - imull (%esi,%eax,4), %ebx + imull H(,%eax,4), %ebx addl %ebx, %ecx addl $1, %eax subl $4, %edx cmpl $511, %eax - jg .L10 + setle %bl cmpl %edi, %eax - jle .L3 -.L10: + setle -21(%ebp) + testb %bl, -21(%ebp) + jne .L3 movl -16(%ebp), %eax movl %ecx, vH(,%eax,4) -.L6: +.L5: L3 corresponds to the inner loop body.
Which is the good version, the one with less or the one with more branches?
The fastest is the variant with more jumps (442/convol.s in the diff) generated by GCC-4.4.2. In the one jump variant (r155286/convol.s in the diff), I guess it is the computing of both conditions before jumping which slows down the program. Looking between revisions 151079 and 151080 (which is when the regression first appeared), the RTL expand dump shows a difference regarding branch prediction. I do not know if it is relevant: --- r151079/convol.c.135r.expand 2009-12-01 14:10:55.000000000 +0100 +++ r151080/convol.c.135r.expand 2009-12-01 14:11:03.000000000 +0100 @@ -5,7 +5,6 @@ ;; Generating RTL for gimple basic block 2 ;; Generating RTL for gimple basic block 3 -Failed to add probability note ;; Generating RTL for gimple basic block 4 @@ -20,13 +19,6 @@ ;; Generating RTL for gimple basic block 9 ;; Generating RTL for gimple basic block 10 -Purged non-fallthru edges from bb 14 -Predictions for insn 61 bb 3 - no prediction heuristics: 50.0% - combined heuristics: 50.0% -Predictions for insn 65 bb 13 - no prediction heuristics: 50.0% - combined heuristics: 50.0% I am about to attach the whole diff file.
Created attachment 19321 [details] Diff of the RTL expand dump between revisions 151079 and 151080
It's indeed expand that makes the difference when expanding ;; if (k <= 511 && k <= j != 0) probably due to the way TER works now.
That's exactly what I fixed with my last patch. If this still results in a difference it's caused by difference in cheapness of branches. I'll poke at it.
Hmm, no, it's still my fault. Something must have gone wrong with my brain when I thought the bug was fixed, no idea how that happened. The patch wasn't complete.
Created attachment 19332 [details] Real fix Now, before I blow it again, would you be so kind to test this patch (on top of some recent trunk, doesn't have to be the newest one, you don't need to bootstrap) if it fixes the performance problem. For me it does now, I swear :-)
(In reply to comment #16) > Created an attachment (id=19332) [edit] > Real fix > > Now, before I blow it again, would you be so kind to test this patch (on top > of some recent trunk, doesn't have to be the newest one, you don't need to > bootstrap) if it fixes the performance problem. For me it does now, I swear > :-) > Tested with trunk revision 155304, the regression is gone. * Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1 flag GCC-4.4.2 7364 ms GCC-trunk-r155286 7360 ms * Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1 flag GCC-4.4.1 5968 ms GCC-trunk-r155286 5963 ms
(In reply to comment #17) > (In reply to comment #16) > > Created an attachment (id=19332) [edit] > > Real fix > > > > Now, before I blow it again, would you be so kind to test this patch (on top > > of some recent trunk, doesn't have to be the newest one, you don't need to > > bootstrap) if it fixes the performance problem. For me it does now, I swear > > :-) > > > > Tested with trunk revision 155304, the regression is gone. > > * Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1 > flag > GCC-4.4.2 7364 ms > GCC-trunk-r155286 7360 ms > > * Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1 > flag > GCC-4.4.1 5968 ms > GCC-trunk-r155286 5963 ms > Oups, copy-pasted the GCC versions for the timings. The correct versions are: * Intel Xeon E5320 (x86_64 arch but gcc machine is i686-pc-linux-gnu), with -O1 flag GCC-4.4.2 7364 ms GCC-trunk-r155304-patched 7360 ms * Intel Xeon 5160 (x86_64 arch and gcc machine is x86_64-linux-gnu), with -O1 flag GCC-4.4.1 5968 ms GCC-trunk-r155304-patched 5963 ms
Subject: Bug 42027 Author: matz Date: Sun Dec 20 01:15:46 2009 New Revision: 155367 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155367 Log: PR tree-optimization/42027 * cfgexpand.c (expand_gimple_cond): Use jumpy sequence for &, &&, | and || if jumps are cheap. Modified: trunk/gcc/ChangeLog trunk/gcc/cfgexpand.c