We were running bench++ looking for cases that perform worse with g++-4.x than they do with g++-2.95. We posted a related mail to the gcc list: http://gcc.gnu.org/ml/gcc/2005-08/msg00197.html

There seems to be an interesting regression, exhibited by more than one test, that is related to the inliner. For example, the test s000005a, when compiled with g++-4.0.1, runs faster with -O2 than with -O2 -finline-functions or with -O3. Specifically, the slowdown is on the order of 2.5x. If g++-2.95.3 is used with the same flags, the slowdown does not occur.

Interestingly, if a *dead* function is commented out, or a *dead* call to cerr, the regression goes away. If g++-410_0723 is used, the regression appears as with g++-4.0.1 (when -O2 -finline-functions or -O3 is used), but it does *not* go away when the dead function is commented out; it only goes away when the dead cerr call is commented out.
Created attachment 9457 [details] Source code.
I think this is just a RA issue, as the assembly looks good on ppc-darwin.
I don't get the regression on ppc-darwin, so this is just a RA issue.
Actually maybe not:

  <L8>:;
    first$current$current$current.506 = first$current$current$current.506 + 8B;
    D.34505 = D.34505 + first$current$current$current->value;
    if (last$current$current$current != first$current$current$current.506) goto <L37>; else goto <L10>;

  <L37>:;
    first$current$current$current = first$current$current$current.506;
    goto <bb 6> (<L8>);

That is just wrong, which causes some of the problems, but I don't know how much; it looks like only a second or so. From what I looked at, this is fully a target issue rather than a generic problem; targets which have a few more registers are not affected.
I can confirm a ~2x slowdown going from -O2 to -O2 -finline-functions on i686. This is unfortunate.
This was a P2 before P3 became the default.

(In reply to comment #4)
> first$current$current$current.506 = first$current$current$current.506 + 8B;
> D.34505 = D.34505 + first$current$current$current->value;

If we swapped those two statements around at the tree level, out of SSA would not have produced an extra assignment.
HUH:

  # D.34332_4 = PHI <D.34332_139(7), D.34332_13(6)>;
  # first$current$current$current_3 = PHI <first$current$current$current_98(7), first$current$current$current_11(6)>;
  # first$current$current$current_282 = PHI <first$current$current$current_98(7), first$current$current$current_11(6)>;
  <L10>:;
    first$current$current$current_98 = first$current$current$current_282 + 8B;
    tmp$current$current_113 = first$current$current$current_3 + 8B;
    tmp$current_122 = tmp$current$current_113 - 8B;
    y_134 = tmp$current_122;
    D.34330_138 = y_134->value;
    D.34332_139 = D.34332_4 + D.34330_138;
    if (last$current$current$current_12 != first$current$current$current_98) goto <L10>; else goto <L12>;

Isn't _3 the same as _282? Why don't we eliminate it? (There is no way not to create it in the first place with this testcase, as it is not really created by any pass.) I think if we eliminate that, this should be fixed.
On the trunk there is no difference between -O2 and -O2 -finline-functions (the latter is perhaps 1% better); both are as bad as 4.1/4.2 with -O2 -finline-functions. Compiling with -O2 -fno-inline-small-functions gives the speed back. This holds on both x86_64-linux and i686-linux.
On x86_64-linux -m64 with -O2, gcc doesn't hoist movabsq insns out of the loops, which can give some performance back:

time ./pr23305-slow
real	0m4.028s
user	0m4.023s
sys	0m0.003s
time ./pr23305-slow2
real	0m3.436s
user	0m3.434s
sys	0m0.001s

when I hoist it by hand in assembly:

--- pr23305-slow.s	2007-11-22 17:14:09.000000000 +0100
+++ pr23305-slow2.s	2007-11-22 17:31:31.000000000 +0100
@@ -222,16 +222,16 @@ _Z13s000005a_testv:
 .LVL2:
 .LBB329:
 .LBB330:
 	.loc 1 28697 0
 	cmpq	%rax, %rdx
 	je	.L13
+	movabsq	$4613937818241073152, %r8
 	.p2align 4,,10
 	.p2align 3
 .L14:
-	movabsq	$4613937818241073152, %r8
 	movq	%r8, (%rax)
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L14
 .L13:
 .LBE330:
@@ -242,17 +242,17 @@ _Z13s000005a_testv:
 .LVL3:
 .LBB326:
 .LBB327:
 	.loc 1 28697 0
 	cmpq	%rax, %rdx
 	je	.L15
+	movabsq	$4613937818241073152, %rdi
 	.p2align 4,,10
 	.p2align 3
 .L16:
 .LBE327:
-	movabsq	$4613937818241073152, %rdi
 	movq	%rdi, (%rax)
 .LBB328:
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L16
 .L15:

but still the -O2 -fno-inline-small-functions version is much faster:

time ./pr23305-fast
real	0m1.591s
user	0m1.588s
sys	0m0.001s
The remaining difference is a register allocation issue:

time ./pr23305-vanilla; time ./pr23305-fixed
real	0m4.030s
user	0m4.028s
sys	0m0.002s
real	0m1.593s
user	0m1.592s
sys	0m0.001s

with hand-edited changes:

--- pr23305-vanilla.s	2007-11-22 17:57:15.000000000 +0100
+++ pr23305-fixed.s	2007-11-22 17:57:56.000000000 +0100
@@ -95,49 +95,49 @@ _Z13s000005a_testv:
 	subq	$24, %rsp
 .LCFI1:
 	movq	_ZL3dpe(%rip), %rdx
 	movq	_ZL3dpb(%rip), %rax
 	cmpq	%rax, %rdx
 	je	.L13
+	movabsq	$4613937818241073152, %r8
 	.p2align 4,,10
 	.p2align 3
 .L14:
-	movabsq	$4613937818241073152, %r8
 	movq	%r8, (%rax)
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L14
 .L13:
 	movq	_ZL3Dpe(%rip), %rdx
 	movq	_ZL3Dpb(%rip), %rax
 	cmpq	%rax, %rdx
 	je	.L15
+	movabsq	$4613937818241073152, %rdi
 	.p2align 4,,10
 	.p2align 3
 .L16:
-	movabsq	$4613937818241073152, %rdi
 	movq	%rdi, (%rax)
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L16
 .L15:
 	movq	_ZL5rrDPe(%rip), %rdx
 	movq	_ZL5rrDPb(%rip), %rax
 	movsd	_ZL1D(%rip), %xmm0
 	cmpq	%rdx, %rax
 	movsd	%xmm0, 8(%rsp)
 	je	.L18
+	movsd	8(%rsp), %xmm0
 	.p2align 4,,10
 	.p2align 3
 .L24:
-	movsd	8(%rsp), %xmm0
 	addsd	(%rax), %xmm0
 	addq	$8, %rax
 	cmpq	%rax, %rdx
-	movsd	%xmm0, 8(%rsp)
 	jne	.L24
+	movsd	%xmm0, 8(%rsp)
 .L18:
 	movsd	8(%rsp), %xmm0
 	ucomisd	.LC2(%rip), %xmm0
 	jp	.L23
 	jne	.L23
 	addq	$24, %rsp

In the lreg dump we have:

(code_label:HI 98 35 97 7 24 "" [1 uses])
(note:HI 97 98 45 7 [bb 7] NOTE_INSN_BASIC_BLOCK)
(insn:HI 45 97 46 7 pr23305.ii:28564 (set (reg/v:DF 64 [ result ])
        (plus:DF (reg/v:DF 64 [ result ])
            (mem/s:DF (reg:DI 58 [ ivtmp.254 ]) [29 <variable>.value+0 S8 A8]))) 680 {*fop_df_comm_sse} (nil))
(insn:HI 46 45 48 7 pr23305.ii:28564 (parallel [
            (set (reg:DI 58 [ ivtmp.254 ])
                (plus:DI (reg:DI 58 [ ivtmp.254 ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) 244 {*adddi_1_rex64} (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn:HI 48 46 49 7 pr23305.ii:28673 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg/f:DI 60 [ last$current$current$current ])
            (reg:DI 58 [ ivtmp.254 ]))) 2 {cmpdi_1_insn_rex64} (nil))
(jump_insn:HI 49 48 50 7 pr23305.ii:28673 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0x0]))
            (label_ref:DI 98)
            (pc))) 579 {*jcc_1} (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
        (nil))))

and

Register 64 pref SSE_FIRST_REG, else SSE_REGS
Register 64 used 5 times across 23 insns; set 2 times; user var; crosses 3 calls; pref SSE_FIRST_REG, else SSE_REGS.

Yet global alloc puts it into 8(%rsp), which is certainly fine, except in a tight loop.
This testcase is still slower: 4.4s with -O2 and 3.6s with -O2 -fno-inline-small-functions (on i386). I wondered if the patch counting the frequency of calls crossed helped here. My slowdown is smaller than the one reported by Jakub, so perhaps it helped partially, but we still have a regression here.

Honza
Looks like the last remaining problem is the missed loop invariant motion due to the STACK_REGS hack, as in the case of pr23322.

hubicka@occam:/aux/hubicka/trunk-write/buidl2/gcc$ time ./a.out-nostackregs-hack
real	0m3.637s
user	0m3.588s
sys	0m0.008s
hubicka@occam:/aux/hubicka/trunk-write/buidl2/gcc$ time ./a.out-mainline
real	0m4.627s
user	0m4.484s
sys	0m0.016s
hubicka@occam:/aux/hubicka/trunk-write/buidl2/gcc$ time ./a.out-gcc-3.4
real	0m4.229s
user	0m3.876s
sys	0m0.004s

Does someone have 2.95 around to double-check that it didn't perform significantly better than 3.4?

*** This bug has been marked as a duplicate of 23322 ***