Bug 39839 - [4.6 regression] loop invariant motion causes stack spill
Summary: [4.6 regression] loop invariant motion causes stack spill
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.4.0
: P2 normal
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 42505
Blocks: 16996 33928
  Show dependency treegraph
 
Reported: 2009-04-21 18:27 UTC by Alexander Vodomerov
Modified: 2013-04-12 16:17 UTC (History)
7 users (show)

See Also:
Host: x86_64-unknown-linux-gnu
Target: arm-eabi
Build: x86_64-unknown-linux-gnu
Known to work: 4.7.0
Known to fail:
Last reconfirmed: 2009-05-20 14:17:11


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Vodomerov 2009-04-21 18:27:11 UTC
The following code:
struct S
{
  int count;
  char *addr;
};

void func(const char*, const char*, int, const char*);

void test(struct S *p)
{
  int off = p->count;
  while (p->count >= 0)
    {
      const char *s = "xyz";
      if (*p->addr) s = "pqr";
      func("abcde", p->addr + off, off, s);
      p->count--;
    }
}

is compiled by GCC 4.2.1 to 64 bytes, and by GCC 4.4.0 to 76 bytes. Bisection shows that size is increased several times:
123918 -> 123919: 64 -> 72
124041 -> 124042: 72 -> 76
I already filed a bug for 123919 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39837), so let's take a look at http://gcc.gnu.org/viewcvs?view=rev&revision=124042

GCC rev124041 (with -march=armv5te -mthumb -mthumb-interwork -fpic -Os)
test:
        push    {r4, r5, r6, r7, lr}
        ldr     r3, .L9
        ldr     r2, .L9+4
.LPIC0:
        add     r3, pc
        add     r7, r3, r2
        ldr     r2, .L9+8
        ldr     r5, [r0]
        sub     sp, sp, #4
        mov     r4, r0
        add     r6, r3, r2
        b       .L2
.L3:
        ldr     r0, [r4, #4]
        ldrb    r3, [r0]
        cmp     r3, #0
        beq     .L4
        mov     r2, r6
        b       .L6
.L4:
        mov     r2, r7
.L6:
        add     r0, r0, r5
        lsl     r1, r5, #1
        bl      func
        ldr     r3, [r4]
        sub     r3, r3, #1
        str     r3, [r4]
.L2:
        ldr     r3, [r4]
        cmp     r3, #0
        bge     .L3
        add     sp, sp, #4
        @ sp needed for prologue
        pop     {r4, r5, r6, r7, pc}
.L10:

GCC rev124042:
test:
        push    {r4, r5, r6, r7, lr}
        ldr     r3, .L9
        ldr     r2, .L9+4
.LPIC0:
        add     r3, pc
        add     r2, r3, r2
        sub     sp, sp, #12
        ldr     r5, [r0]
        str     r2, [sp, #4]
        ldr     r2, .L9+8
        mov     r4, r0
        lsl     r6, r5, #1
        add     r7, r3, r2
        b       .L2
.L3:
        ldr     r0, [r4, #4]
        ldrb    r3, [r0]
        cmp     r3, #0
        beq     .L4
        mov     r2, r7
        b       .L6
.L4:
        ldr     r2, [sp, #4]
.L6:
        add     r0, r0, r5
        mov     r1, r6
        bl      func
        ldr     r3, [r4]
        sub     r3, r3, #1
        str     r3, [r4]
.L2:
        ldr     r3, [r4]
        cmp     r3, #0
        bge     .L3
        add     sp, sp, #12
        @ sp needed for prologue
        pop     {r4, r5, r6, r7, pc}

The first different dump is 090t.lim, which moves (off << 1) out of the loop. But this extra variable causes extra stack spill, so it actually a loss, not a win. Any ideas about what to tweak?
Comment 1 Richard Biener 2009-04-22 09:39:47 UTC
Implement re-materialization in reload.
Comment 2 Richard Biener 2009-04-22 13:39:34 UTC
Vlad, with YARA you had implemented rematerialization, right?  How difficult is
it to do the same with IRA for cases like this?
Comment 3 Vladimir Makarov 2009-04-22 20:37:11 UTC
Actually YARA did not have a rematerialization as IRA.  Reload has a primitive rematerialization of constant values.

Although about 5 years I did implemented a register pressure relief through rematerialization which is close to Simpson's thesis.  It was reported on the 2nd GCC summit.  I had a mixed feeling about this: wrong register pressure calculation (because we have not cover classes at that time), small improvement but a few additional percents to compilation time.

Probably it is time to return to this and make it optional or default which will eat a chunk of your recent 5% compilation time improvement :)  I'll try to include this in my plans.
Comment 4 Paolo Bonzini 2009-06-15 16:26:02 UTC
Someone can please check if this is still a problem in 4.5?
Comment 5 Steven Bosscher 2009-06-22 16:29:38 UTC
How is this different from bug 39837?
Comment 6 Richard Biener 2009-08-04 12:30:14 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 7 Steven Bosscher 2010-01-02 00:49:56 UTC
With "GCC: (GNU) 4.5.0 20090808 (experimental) [trunk revision 150579]" I get:

        .arch armv5te
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 18, 4
        .code   16
        .file   "t.c"
        .text
        .align  1
        .global test
        .code   16
        .thumb_func
        .type   test, %function
test:
        push    {r3, r4, r5, r6, r7, lr}
        ldr     r3, .L7
        ldr     r5, .L7+4
.LPIC0:
        add     r3, pc
        add     r5, r3, r5
        mov     r7, r5
        mov     r4, r0
        ldr     r6, [r0]
        add     r7, r7, #8
        b       .L2
.L4:
        ldr     r1, [r4, #4]
        ldrb    r3, [r1]
        cmp     r3, #0
        bne     .L5
        mov     r3, r5
        b       .L3
.L5:
        add     r3, r5, #4
.L3:
        add     r1, r1, r6
        mov     r0, r7
        mov     r2, r6
        bl      func
        ldr     r3, [r4]
        sub     r3, r3, #1
        str     r3, [r4]
.L2:
        ldr     r3, [r4]
        cmp     r3, #0
        bge     .L4
        @ sp needed for prologue
        pop     {r3, r4, r5, r6, r7, pc}
.L8:
       .align  2
.L7:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
        .word   .LANCHOR0(GOTOFF)
        .size   test, .-test
        .section        .rodata
        .set    .LANCHOR0,. + 0
.LC0:
        .ascii  "xyz\000"
.LC1:
        .ascii  "pqr\000"
.LC2:
        .ascii  "abcde\000"
        .ident  "GCC: (GNU) 4.5.0 20090808 (experimental) [trunk revision 150579]"


I have no idea whether that is better or worse. Alexander, could you give this a look please?
Comment 8 Richard Biener 2010-01-02 11:18:31 UTC
You may also want to try -fsched-pressure -fschedule-insns which is new
in GCC 4.5 and might decrease register pressure (no idea if it ever moves
things into loops though - the haifa scheduler might not consider it a
region to work on).
Comment 9 Richard Biener 2010-05-22 18:13:37 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 10 Sandra Loosemore 2010-06-22 16:26:57 UTC
It looks like this bug has been fixed by my proposed patch for PR42505:

http://gcc.gnu.org/ml/gcc-patches/2010-06/msg01920.html

Applying that patch to r160755 gives:

00000000 <test>:
   0:	b570      	push	{r4, r5, r6, lr}
   2:	4d0c      	ldr	r5, [pc, #48]	; (34 <test+0x34>)
   4:	1c04      	adds	r4, r0, #0
   6:	6806      	ldr	r6, [r0, #0]
   8:	447d      	add	r5, pc
   a:	e00f      	b.n	2c <test+0x2c>
   c:	6861      	ldr	r1, [r4, #4]
   e:	1c2b      	adds	r3, r5, #0
  10:	780a      	ldrb	r2, [r1, #0]
  12:	2a00      	cmp	r2, #0
  14:	d101      	bne.n	1a <test+0x1a>
  16:	4b08      	ldr	r3, [pc, #32]	; (38 <test+0x38>)
  18:	447b      	add	r3, pc
  1a:	4808      	ldr	r0, [pc, #32]	; (3c <test+0x3c>)
  1c:	1989      	adds	r1, r1, r6
  1e:	4478      	add	r0, pc
  20:	1c32      	adds	r2, r6, #0
  22:	f7ff fffe 	bl	0 <func>
  26:	6823      	ldr	r3, [r4, #0]
  28:	3b01      	subs	r3, #1
  2a:	6023      	str	r3, [r4, #0]
  2c:	6823      	ldr	r3, [r4, #0]
  2e:	2b00      	cmp	r3, #0
  30:	daec      	bge.n	c <test+0xc>
  32:	bd70      	pop	{r4, r5, r6, pc}
  34:	00000028 	.word	0x00000028
  38:	0000001c 	.word	0x0000001c
  3c:	0000001a 	.word	0x0000001a

So, back down to 64 bytes of code, and no spills to stack.

Assuming the PR42505 patch is approved, probably the only thing required to close this issue is checking in the additional test case.
Comment 11 Paolo Bonzini 2010-06-22 17:28:49 UTC
That would be WONTFIX for 4.5 and earlier, right?  4.4 and earlier are definitely out of question, but maybe your patch alone gives the same effect on 4.5 branch too, without any of the other ivopts and ARM improvements that recently went into 4.6?
Comment 12 Sandra Loosemore 2010-06-22 18:02:16 UTC
Hrmmm, I was planning to attempt a 4.5 backport of the PR42505 patch for internal use, but if it's not easy or doesn't help, I think I have better things to do with my time than to try to come up with some other fix.  ;-)  So, let's wait and see.
Comment 13 Sandra Loosemore 2010-07-11 01:22:27 UTC
Some further analysis:

The part of my PR42505 patch that made the difference was the change to estimate_register_pressure_cost in cfgloopanal.c, to make it exclude the call-clobbered registers.  This part was finally committed separately in a revised version as r162043.

I'm still looking into what to do about the test case and 4.5 backport.
Comment 14 Sandra Loosemore 2010-07-13 16:13:41 UTC
There are two patches that made the difference:  r158189 (Carrot's patch for PR42601) and r162043 (the second part of my patch for PR42505).  I checked that backporting these two changes to the 4.5 branch is sufficient to fix the code size regression on this example there, too.

I posted the test case patch here:
http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01070.html
Comment 15 sandra 2010-07-23 02:18:33 UTC
Subject: Bug 39839

Author: sandra
Date: Fri Jul 23 02:18:07 2010
New Revision: 162438

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162438
Log:
2010-07-22  Sandra Loosemore  <sandra@codesourcery.com>

	PR tree-optimization/39839

	gcc/testsuite/
	* gcc.target/arm/pr39839.c: New test case.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr39839.c
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 16 Steven Bosscher 2011-06-24 14:57:12 UTC
So this is fixed??
Comment 17 Richard Biener 2011-06-24 17:11:01 UTC
At least for 4.7 it seems.
Comment 18 Richard Biener 2011-06-27 12:14:33 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 19 Jakub Jelinek 2012-03-13 12:48:01 UTC
4.4 branch is being closed, moving to 4.5.4 target.
Comment 20 Richard Biener 2012-07-02 11:52:15 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 21 Jakub Jelinek 2013-04-12 16:17:50 UTC
The 4.6 branch has been closed, fixed in GCC 4.7.0.