int bar (void);
void foo (int *);
static int s[10];

void foobar (int i1, int i2, int i3, int i4, int i5, int i6)
{
  int a[100];
  int i, i7;
  i7 = bar ();
  bar ();
  for (i = 0; i < 100; i++)
    a[i] = s[i1] + s[i2] + s[i3] + s[i4] + s[i5] + s[i6] + s[i7];
  foo (&a[0]);
  return;
}

If you compare mainline with the dataflow branch at -O2, you can see:

--- t.i.trunk   2007-02-21 11:31:09.663252586 +0100
+++ t.i.df      2007-02-21 11:31:10.548064364 +0100
@@ -37,7 +37,6 @@
         movl    s(,%rbx,4), %edx
         addl    s(,%rcx,4), %edx
         movslq  %r12d,%r12
-        leaq    16(%rsp), %rdi
         addl    s(,%r13,4), %edx
         addl    s(,%r14,4), %edx
         addl    s(,%r15,4), %edx
@@ -46,10 +45,11 @@
         addl    s(,%r12,4), %edx
         .p2align 4,,7
 .L2:
-        movl    %edx, (%rdi,%rax,4)
+        movl    %edx, 16(%rsp,%rax,4)
         addq    $1, %rax
         cmpq    $100, %rax
         jne     .L2
+        leaq    16(%rsp), %rdi
         call    foo
         addq    $424, %rsp
         popq    %rbx

That is, we choose a more expensive addressing mode inside the loop, without noticing that 16(%rsp) can be (G)CSEd. This makes the loop above run 33% slower on x86_64.
On i686 this happens too, due to fwprop1:

In insn 47, replacing
 (mem/s:SI (plus:SI (plus:SI (mult:SI (reg:SI 75 [ ivtmp.37 ])
                (const_int 4 [0x4]))
            (reg/f:SI 91))
        (const_int -4 [0xfffffffffffffffc])) [3 a S4 A8])
 with (mem/s:SI (plus:SI (plus:SI (mult:SI (reg:SI 75 [ ivtmp.37 ])
                (const_int 4 [0x4]))
            (reg/f:SI 20 frame))
        (const_int -404 [0xfffffffffffffe6c])) [3 a S4 A8])
defering rescan insn with uid = 47.

This results in different code between trunk and the df-branch:

-- trunk
++ df-branch
@@ -12,7 +12,6 @@
         movl    %eax, %ebx
         call    bar
         movl    12(%ebp), %eax
-        leal    -404(%ebp), %ecx
         movl    s(,%eax,4), %edx
         movl    8(%ebp), %eax
         addl    s(,%eax,4), %edx
@@ -28,11 +27,12 @@
         addl    s(,%ebx,4), %edx
         .p2align 4,,7
 .L2:
-        movl    %edx, -4(%ecx,%eax,4)
+        movl    %edx, -408(%ebp,%eax,4)
         addl    $1, %eax
         cmpl    $101, %eax
         jne     .L2
-        movl    %ecx, (%esp)
+        leal    -404(%ebp), %eax
+        movl    %eax, (%esp)
         call    foo
         addl    $404, %esp
         popl    %ebx
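Both diffs show the same rewrite: the loop-invariant base address of a[] is folded into the address of every store in the loop, instead of living in a register (%rdi or %ecx) that is also reused for the call to foo. The arithmetic below is read directly off the diffs and the fwprop1 dump; the last line is an inference, assuming (as fwprop must guarantee) that the substitution preserves the address value. It shows the computed addresses are identical and only the addressing mode changes, gaining a displacement in every iteration:

```latex
\begin{aligned}
\text{x86\_64:}\quad & (\mathtt{rsp} + 16) + 4\,\mathtt{rax} = \mathtt{rsp} + 4\,\mathtt{rax} + 16
  && (\mathtt{rdi} = \mathtt{rsp} + 16) \\
\text{i686:}\quad & (\mathtt{ebp} - 404) + 4\,\mathtt{eax} - 4 = \mathtt{ebp} + 4\,\mathtt{eax} - 408
  && (\mathtt{ecx} = \mathtt{ebp} - 404) \\
\text{RTL:}\quad & r_{91} - 4 = \mathit{frame} - 404 \;\Rightarrow\; r_{91} = \mathit{frame} - 400
\end{aligned}
```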
fwprop has some tricks to avoid propagating within loops before unrolling. The interesting question is why they trigger differently in mainline and on the dataflow branch.
This will also show up in mainline once the patch for PR30841 is applied.
Though we don't have a testcase for mainline, the bug is there too.
Subject: Bug 30907

Author: bonzini
Date: Tue Mar 20 08:31:13 2007
New Revision: 123084

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=123084
Log:
2007-03-19  Paolo Bonzini  <bonzini@gnu.org>

        PR rtl-optimization/30907
        * fwprop.c (forward_propagate_into): Never propagate inside a loop.
        (fwprop_init): Always call loop_optimizer_initialize.
        (fwprop_done): Always call loop_optimizer_finalize.
        (fwprop): We always have loop info now.
        (gate_fwprop_addr): Remove.
        (pass_fwprop_addr): Use gate_fwprop as gate.

        PR rtl-optimization/30841
        * df-problems.c (df_ru_local_compute, df_rd_local_compute,
        df_chain_alloc): Call df_reorganize_refs unconditionally.
        * df-scan.c (df_rescan_blocks, df_reorganize_refs): Change
        refs_organized to refs_organized_size.
        (df_ref_create_structure): Use refs_organized_size instead of
        bitmap_size if refs had been organized, and keep refs_organized_size
        up-to-date.
        * df.h (struct df_ref_info): Change refs_organized to
        refs_organized_size.
        (DF_DEFS_SIZE, DF_USES_SIZE): Use refs_organized_size instead of
        bitmap_size.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-problems.c
    trunk/gcc/df-scan.c
    trunk/gcc/df.h
    trunk/gcc/fwprop.c
Patch committed.