We were running bench++ looking for cases that perform worse with g++-4.x than they do with g++-2.95. We posted a related mail to the gcc list: http://gcc.gnu.org/ml/gcc/2005-08/msg00197.html

There seems to be an interesting regression, exhibited by more than one test, that is related to the inliner. For example, the test s000005a, when compiled with g++-4.0.1, runs faster with -O2 than with -O2 -finline-functions or with -O3. Specifically, the slowdown is on the order of 2.5x. If g++-2.95.3 is used with the same flags, the slowdown does not occur.

Interestingly, if a *dead* function is commented out, or a *dead* call to cerr, the regression goes away. If g++-410_0723 is used, the regression appears as with g++-4.0.1 (when -O2 -finline-functions or -O3 is used), but it does *not* go away when the dead function is commented out; it only goes away when the dead cerr call is commented out.
Created attachment 9457 [details] Source code.
I think this is just a RA issue, as the assembly looks good on ppc-darwin.
I don't get the regression on ppc-darwin, so this is just a RA issue.
Actually maybe not:

  <L8>:;
    first$current$current$current.506 = first$current$current$current.506 + 8B;
    D.34505 = D.34505 + first$current$current$current->value;
    if (last$current$current$current != first$current$current$current.506) goto <L37>; else goto <L10>;

  <L37>:;
    first$current$current$current = first$current$current$current.506;
    goto <bb 6> (<L8>);

That is just wrong, which causes some of the problems, but I don't know how much; it looks like only a second or so. From what I looked at, this is fully a target issue rather than a generic problem; targets which have a few more registers are not affected.
I can confirm a ~2x slowdown going from -O2 to -O2 -finline-functions on i686. This is unfortunate.
This was a P2 before P3 became the default.

(In reply to comment #4)
> first$current$current$current.506 = first$current$current$current.506 + 8B;
> D.34505 = D.34505 + first$current$current$current->value;

If we swapped those two statements around at the tree level, out of SSA would not have produced an extra assignment.
HUH:

  # D.34332_4 = PHI <D.34332_139(7), D.34332_13(6)>;
  # first$current$current$current_3 = PHI <first$current$current$current_98(7), first$current$current$current_11(6)>;
  # first$current$current$current_282 = PHI <first$current$current$current_98(7), first$current$current$current_11(6)>;
  <L10>:;
    first$current$current$current_98 = first$current$current$current_282 + 8B;
    tmp$current$current_113 = first$current$current$current_3 + 8B;
    tmp$current_122 = tmp$current$current_113 - 8B;
    y_134 = tmp$current_122;
    D.34330_138 = y_134->value;
    D.34332_139 = D.34332_4 + D.34330_138;
    if (last$current$current$current_12 != first$current$current$current_98) goto <L10>; else goto <L12>;

Isn't _3 the same as _282? Why don't we eliminate it? (There is no way not to create it in the first place with this testcase, as it is not really created by any pass.) I think if we eliminate that, this should be fixed.
On the trunk there is no difference between -O2 and -O2 -finline-functions (the latter is perhaps 1% better); both are as bad as 4.1/4.2 with -O2 -finline-functions. Compiling with -O2 -fno-inline-small-functions gives the speed back. This holds on both x86_64-linux and i686-linux.
On x86_64-linux -m64 with -O2, gcc doesn't hoist movabsq insns out of the loops, which can give some performance back:

time ./pr23305-slow
real	0m4.028s
user	0m4.023s
sys	0m0.003s
time ./pr23305-slow2
real	0m3.436s
user	0m3.434s
sys	0m0.001s

when I hoist it by hand in assembly:

--- pr23305-slow.s	2007-11-22 17:14:09.000000000 +0100
+++ pr23305-slow2.s	2007-11-22 17:31:31.000000000 +0100
@@ -222,16 +222,16 @@ _Z13s000005a_testv:
 .LVL2:
 .LBB329:
 .LBB330:
 	.loc 1 28697 0
 	cmpq	%rax, %rdx
 	je	.L13
+	movabsq	$4613937818241073152, %r8
 	.p2align 4,,10
 	.p2align 3
 .L14:
-	movabsq	$4613937818241073152, %r8
 	movq	%r8, (%rax)
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L14
 .L13:
 .LBE330:
@@ -242,17 +242,17 @@ _Z13s000005a_testv:
 .LVL3:
 .LBB326:
 .LBB327:
 	.loc 1 28697 0
 	cmpq	%rax, %rdx
 	je	.L15
+	movabsq	$4613937818241073152, %rdi
 	.p2align 4,,10
 	.p2align 3
 .L16:
 .LBE327:
-	movabsq	$4613937818241073152, %rdi
 	movq	%rdi, (%rax)
 .LBB328:
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L16
 .L15:

but still the -O2 -fno-inline-small-functions version is much faster:

time ./pr23305-fast
real	0m1.591s
user	0m1.588s
sys	0m0.001s
The remaining difference is a register allocation issue:

time ./pr23305-vanilla; time ./pr23305-fixed
real	0m4.030s
user	0m4.028s
sys	0m0.002s
real	0m1.593s
user	0m1.592s
sys	0m0.001s

with hand-edited changes:

--- pr23305-vanilla.s	2007-11-22 17:57:15.000000000 +0100
+++ pr23305-fixed.s	2007-11-22 17:57:56.000000000 +0100
@@ -95,49 +95,49 @@ _Z13s000005a_testv:
 	subq	$24, %rsp
 .LCFI1:
 	movq	_ZL3dpe(%rip), %rdx
 	movq	_ZL3dpb(%rip), %rax
 	cmpq	%rax, %rdx
 	je	.L13
+	movabsq	$4613937818241073152, %r8
 	.p2align 4,,10
 	.p2align 3
 .L14:
-	movabsq	$4613937818241073152, %r8
 	movq	%r8, (%rax)
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L14
 .L13:
 	movq	_ZL3Dpe(%rip), %rdx
 	movq	_ZL3Dpb(%rip), %rax
 	cmpq	%rax, %rdx
 	je	.L15
+	movabsq	$4613937818241073152, %rdi
 	.p2align 4,,10
 	.p2align 3
 .L16:
-	movabsq	$4613937818241073152, %rdi
 	movq	%rdi, (%rax)
 	addq	$8, %rax
 	cmpq	%rax, %rdx
 	jne	.L16
 .L15:
 	movq	_ZL5rrDPe(%rip), %rdx
 	movq	_ZL5rrDPb(%rip), %rax
 	movsd	_ZL1D(%rip), %xmm0
 	cmpq	%rdx, %rax
 	movsd	%xmm0, 8(%rsp)
 	je	.L18
+	movsd	8(%rsp), %xmm0
 	.p2align 4,,10
 	.p2align 3
 .L24:
-	movsd	8(%rsp), %xmm0
 	addsd	(%rax), %xmm0
 	addq	$8, %rax
 	cmpq	%rax, %rdx
-	movsd	%xmm0, 8(%rsp)
 	jne	.L24
+	movsd	%xmm0, 8(%rsp)
 .L18:
 	movsd	8(%rsp), %xmm0
 	ucomisd	.LC2(%rip), %xmm0
 	jp	.L23
 	jne	.L23
 	addq	$24, %rsp

In the lreg dump we have:

(code_label:HI 98 35 97 7 24 "" [1 uses])
(note:HI 97 98 45 7 [bb 7] NOTE_INSN_BASIC_BLOCK)
(insn:HI 45 97 46 7 pr23305.ii:28564 (set (reg/v:DF 64 [ result ])
        (plus:DF (reg/v:DF 64 [ result ])
            (mem/s:DF (reg:DI 58 [ ivtmp.254 ]) [29 <variable>.value+0 S8 A8]))) 680 {*fop_df_comm_sse} (nil))
(insn:HI 46 45 48 7 pr23305.ii:28564 (parallel [
            (set (reg:DI 58 [ ivtmp.254 ])
                (plus:DI (reg:DI 58 [ ivtmp.254 ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) 244 {*adddi_1_rex64} (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn:HI 48 46 49 7 pr23305.ii:28673 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg/f:DI 60 [ last$current$current$current ])
            (reg:DI 58 [ ivtmp.254 ]))) 2 {cmpdi_1_insn_rex64} (nil))
(jump_insn:HI 49 48 50 7 pr23305.ii:28673 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0x0]))
            (label_ref:DI 98)
            (pc))) 579 {*jcc_1} (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
        (nil))))

and

Register 64 pref SSE_FIRST_REG, else SSE_REGS
Register 64 used 5 times across 23 insns; set 2 times; user var; crosses 3 calls; pref SSE_FIRST_REG, else SSE_REGS.

Yet global alloc puts it into 8(%rsp), which is certainly fine, except in a tight loop.
This testcase is still slower: 4.4s with -O2 and 3.6s with -O2 -fno-inline-small-functions (on i386). I wondered if the patch counting the frequency of calls crossed helped here. My slowdown is smaller than the one reported by Jakub, so perhaps it helped partially, but we still have a regression here.

Honza
Looks like the last remaining problem is the missed loop invariant motion due to the STACK_REGS hack, as in the case of pr23322.

hubicka@occam:/aux/hubicka/trunk-write/buidl2/gcc$ time ./a.out-nostackregs-hack
real	0m3.637s
user	0m3.588s
sys	0m0.008s
hubicka@occam:/aux/hubicka/trunk-write/buidl2/gcc$ time ./a.out-mainline
real	0m4.627s
user	0m4.484s
sys	0m0.016s
hubicka@occam:/aux/hubicka/trunk-write/buidl2/gcc$ time ./a.out-gcc-3.4
real	0m4.229s
user	0m3.876s
sys	0m0.004s

Does someone have 2.95 around to double-check that it didn't perform significantly better than 3.4?

*** This bug has been marked as a duplicate of 23322 ***