Created attachment 22897
test case

When compiling the attached testcase on a machine with a Core 2 Duo E8500 CPU and 64bit Linux using

  gcc -O2 -fomit-frame-pointer testcase.i -lm

the results with gcc 4.5.1 are

  Testing map analysis accuracy.
  lmax=2047, 0 iterations, spin=0
  Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)
  iteration 0:
  wall time for alm2map: 8.811477s
  wall time for map2alm: 9.204556s
  component 0: rms 1.390734e-13, maxerr 1.582512e-12

However, with current trunk one obtains

  Testing map analysis accuracy.
  lmax=2047, 0 iterations, spin=0
  Testing ECP grid (4096 rings, 4096 pixels/ring, 16777216 pixels)
  iteration 0:
  wall time for alm2map: 9.518667s
  wall time for map2alm: 9.780509s
  component 0: rms 1.390734e-13, maxerr 1.582512e-12

The numerical result is identical, but the code generated by the more recent compiler is noticeably slower.

Reducing the test case is unfortunately not trivial; the computational hot spots are located in pshtd_inner_loop() and Ylmgen_recalc_Ylm_sse2().

Please let me know if I can provide further information.
Created attachment 22904
shorter test case

More compact test case; the hot spot is marked with "CRITICAL LOOP".
Compile with "gcc -O2 -fomit-frame-pointer testcase2.c -lm" and test using "time ./a.out".
The only difference in the hot loop is the usage of two regs in the address:

4.5.1:

.L142:
        movapd  %xmm0, (%rcx)
        mulpd   %xmm6, %xmm2
        addq    $32, %rbx
        movapd  %xmm1, %xmm6
        mulpd   %xmm0, %xmm6
        movsd   (%rax), %xmm1
        movsd   8(%rax), %xmm3
        unpcklpd        %xmm1, %xmm1
        subpd   %xmm2, %xmm6
        unpcklpd        %xmm3, %xmm3
        mulpd   %xmm9, %xmm1
        mulpd   %xmm0, %xmm3
        movapd  %xmm6, 16(%rcx)
        addq    $32, %rcx
        movapd  %xmm1, %xmm0
        movsd   16(%rax), %xmm1
        mulpd   %xmm6, %xmm0
        unpcklpd        %xmm1, %xmm1
        movsd   24(%rax), %xmm2
        addq    $32, %rax
        cmpq    %rsi, %rbx
        unpcklpd        %xmm2, %xmm2
        subpd   %xmm3, %xmm0
        mulpd   %xmm9, %xmm1
        jne     .L142

4.6:

.L167:
        movapd  %xmm0, %xmm10
.L143:
        mulpd   %xmm2, %xmm6
        movapd  %xmm3, %xmm2
        movapd  %xmm10, (%rsi,%rcx)
        mulpd   %xmm10, %xmm2
        movsd   (%rdx), %xmm0
        movsd   8(%rdx), %xmm1
        subpd   %xmm6, %xmm2
        unpcklpd        %xmm0, %xmm0
        unpcklpd        %xmm1, %xmm1
        mulpd   %xmm11, %xmm0
        movapd  %xmm2, 16(%rsi,%rcx)
        mulpd   %xmm10, %xmm1
        addq    $32, %rcx
        mulpd   %xmm2, %xmm0
        movsd   16(%rdx), %xmm3
        movsd   24(%rdx), %xmm6
        addq    $32, %rdx
        cmpq    %rdi, %rcx
        unpcklpd        %xmm3, %xmm3
        unpcklpd        %xmm6, %xmm6
        subpd   %xmm1, %xmm0
        mulpd   %xmm11, %xmm3
        jne     .L167

Given the comment above ix86_address_cost:

/* Return cost of the memory address x.
   For i386, it is better to use a complex address than let gcc copy
   the address into a reg and make a new pseudo. But not if the address
   requires to two regs - that would mean more pseudos with longer
   lifetimes. */

this could be the reason for slowdown.
> this could be the reason for slowdown.

Hm, not really.

But, by changing the generated assembly around loop entry:

$ diff -u testcase2.s testcase2_.s
--- testcase2.s 2011-01-05 20:21:01.492919294 +0100
+++ testcase2_.s 2011-01-05 20:22:23.616577277 +0100
@@ -1678,6 +1678,7 @@
  addq %r15, %rdx
  addq $1, %rdi
  salq $5, %rdi
+ movapd %xmm10, %xmm0
  jmp .L143
  .p2align 4,,10
  .p2align 3
@@ -1687,7 +1688,7 @@
  mulpd %xmm2, %xmm6
  movapd %xmm3, %xmm2
  movapd %xmm10, (%rsi,%rcx)
- mulpd %xmm10, %xmm2
+ mulpd %xmm0, %xmm2
  movsd (%rdx), %xmm0
  movsd 8(%rdx), %xmm1
  subpd %xmm6, %xmm2

The slowdown is magically fixed:

$ gcc -lm testcase2_.s
$ time ./a.out

real 0m4.041s
user 0m4.034s
sys 0m0.003s

versus:

$ gcc -lm testcase2.s
$ time ./a.out

real 0m4.239s
user 0m4.234s
sys 0m0.001s

The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction. The functionality of the test didn't change due to existing "movapd %xmm0, %xmm10" at the top of the loop and added extra "movapd %xmm10, %xmm0" before the loop.

This all happens on SnB, the code is generated with -O2 only.

H.J., any ideas?
Applying the same medicine to original test gets us from:

  wall time for map2alm: 6.908527s

to

  wall time for map2alm: 6.703142s

where 4.5.1 wins with:

  wall time for map2alm: 6.651740s
(In reply to comment #3)
> > this could be the reason for slowdown.
> ....
>
> $ gcc -lm testcase2.s
> $ time ./a.out
>
> real 0m4.239s
> user 0m4.234s
> sys 0m0.001s
>
> The important change is the change of %xmm10 -> %xmm0 in the mulpd instruction.
> The functionality of the test didn't change due to existing "movapd %xmm0,
> %xmm10" at the top of the loop and added extra "movapd %xmm10, %xmm0" before
> the loop.
>
> This all happens on SnB, the code is generated with -O2 only.
>
> H.J., any ideas?

The performance of some loops is very sensitive to code size. This change

- mulpd %xmm10, %xmm2
+ mulpd %xmm0, %xmm2

will impact loop size due to the extra REX prefix.
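To make the size argument concrete, here is a minimal sketch of how the register-to-register form of mulpd is encoded in 64-bit mode, assuming the standard SSE2 encoding 66 0F 59 /r with an optional REX byte between the 66 prefix and the opcode. The function name encode_mulpd_rr is made up for illustration and is not part of GCC or binutils; it only shows that naming %xmm8..%xmm15 costs one extra byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode "mulpd %xmm<src>, %xmm<dst>" (AT&T operand
   order, register-register form only) into out[] and return the number of
   bytes written.  The destination goes into ModRM.reg, the source into
   ModRM.rm; registers 8..15 need a REX prefix to supply the fourth bit.  */
size_t encode_mulpd_rr(unsigned src, unsigned dst, uint8_t out[5])
{
    size_t n = 0;
    uint8_t rex = 0x40;

    assert(src < 16 && dst < 16);

    if (dst >= 8)
        rex |= 0x04;                    /* REX.R extends ModRM.reg */
    if (src >= 8)
        rex |= 0x01;                    /* REX.B extends ModRM.rm */

    out[n++] = 0x66;                    /* mandatory prefix for mulpd */
    if (rex != 0x40)
        out[n++] = rex;                 /* emitted only when needed */
    out[n++] = 0x0F;
    out[n++] = 0x59;
    out[n++] = (uint8_t)(0xC0 | ((dst & 7) << 3) | (src & 7));
    return n;
}
```

Under these assumptions, encode_mulpd_rr(10, 2, buf) yields the five bytes 66 41 0F 59 D2, while encode_mulpd_rr(0, 2, buf) yields the four bytes 66 0F 59 D0, so the modified loop body is one byte shorter.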
(In reply to comment #5)
> The performance of some loops is very sensitive to code size. This change
>
> - mulpd %xmm10, %xmm2
> + mulpd %xmm0, %xmm2
>
> will impact loop size due to the extra REX prefix.

Adding a nop (or several of them, FWIW) around the changed mulpd insn does not affect the performance, so IMO it is not the loop layout that matters in this case.
OK, I located the problematic commit, at least on the 4.5 branch: it's revision number 167492 (fix for PR tree-optimization/46806). Between revisions 167491 and 167492 the CPU time for the testcase2.c jumps from 4.7s to 5.4s. Do you think that anything can be done about this regression?
Can you check if the following patch solves your problem?

Index: tree-ssa-copyrename.c
===================================================================
--- tree-ssa-copyrename.c (revision 168987)
+++ tree-ssa-copyrename.c (working copy)
@@ -226,8 +226,11 @@ copy_rename_partition_coalesce (var_map
       ign2 = false;
     }
 
-  /* Don't coalesce if the two variables are not of the same type. */
-  if (TREE_TYPE (root1) != TREE_TYPE (root2))
+  /* Don't coalesce if the two variables are not compatible . */
+  if (!types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2))
+      || ((TREE_CODE (TREE_TYPE (root1)) == ENUMERAL_TYPE
+           || TREE_CODE (TREE_TYPE (root2)) == ENUMERAL_TYPE)
+          && TREE_TYPE (root1) != TREE_TYPE (root2)))
     {
       if (debug)
         fprintf (debug, " : Different types. No coalesce.\n");

The differences in GIMPLE of the patch do not explain the code-differences though, so it might be just bad luck that the patch regressed things for you. I can see other unwanted differences though.
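For readers not familiar with the copyrename pass, the condition in the patch can be modelled in isolation. The following is only a simplified sketch: struct type, canonical_id and compatible() are hypothetical stand-ins invented here, and GCC's real types_compatible_p is considerably more involved. It contrasts the strict check (type nodes must be identical) with the relaxed one (compatible types may coalesce, unless an enumeral type is involved and the nodes differ).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a GCC type node. */
enum type_kind { KIND_INTEGER, KIND_REAL, KIND_ENUMERAL };

struct type {
    enum type_kind kind;
    int canonical_id;   /* assumption: equal ids mean "compatible" */
};

/* Toy replacement for types_compatible_p; not GCC's actual logic. */
static bool compatible(const struct type *a, const struct type *b)
{
    return a->canonical_id == b->canonical_id;
}

/* Strict check being replaced: only identical type nodes coalesce. */
bool old_may_coalesce(const struct type *t1, const struct type *t2)
{
    assert(t1 && t2);
    return t1 == t2;
}

/* Check from the patch: refuse if the types are incompatible, or if either
   one is an enumeral type and the two nodes are not the very same type. */
bool may_coalesce(const struct type *t1, const struct type *t2)
{
    assert(t1 && t2);
    if (!compatible(t1, t2))
        return false;
    if ((t1->kind == KIND_ENUMERAL || t2->kind == KIND_ENUMERAL) && t1 != t2)
        return false;
    return true;
}
```

In this model, two distinct but compatible floating-point type nodes are rejected by old_may_coalesce() and accepted by may_coalesce(), whereas an enum paired with a compatible integer type is rejected by both.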
(In reply to comment #8)
> Can you check if the following patch solves your problem?

Yes, this patch gets performance back to normal on the 4.5 branch and on trunk. Great!

> The differences in GIMPLE of the patch do not explain the code-differences
> though, so it might be just bad luck that the patch regressed things for
> you. I can see other unwanted differences though.

I would of course be happy if the code generation would be less "erratic", and if the nice performance I'm seeing does not depend on my luck ;)
So if I can do anything to help optimizing this kind of code more consistently, please let me know! Of course, I'm more into numerics than into compiler writing ...
Author: rguenth
Date: Thu Jan 20 10:33:15 2011
New Revision: 169050

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169050
Log:
2011-01-20 Richard Guenther <rguenther@suse.de>

        PR tree-optimization/47167
        * tree-ssa-copyrename.c (copy_rename_partition_coalesce): Revert
        previous change, only avoid enumeral type changes.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-copyrename.c
Author: rguenth
Date: Thu Jan 20 10:36:29 2011
New Revision: 169051

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169051
Log:
2011-01-20 Richard Guenther <rguenther@suse.de>

        PR tree-optimization/47167
        * tree-ssa-copyrename.c (copy_rename_partition_coalesce): Revert
        previous change, only avoid enumeral type changes.

Modified:
    branches/gcc-4_5-branch/gcc/ChangeLog
    branches/gcc-4_5-branch/gcc/tree-ssa-copyrename.c
Fixed.