Bug 34737 - Scheduling of post-modified function arguments is not good
Status: NEW
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.3.0
Importance: P3 enhancement
Target Milestone: ---
Assigned To: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Blocks:
Reported: 2008-01-11 08:14 UTC by Wouter van Gulik
Modified: 2010-09-13 11:38 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail: 4.3.0 4.2.2 4.1.2 3.4.6 3.3.6
Last reconfirmed: 2008-01-11 09:42:38


Attachments
Test case showing the three cases (766 bytes, text/plain)
2008-01-11 08:17 UTC, Wouter van Gulik

Description Wouter van Gulik 2008-01-11 08:14:28 UTC
Consider the following:

char *x;
volatile int y;

void foo(char *p)
{
    y += *p;
}

void main(void)
{
    char *p1 = x;
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
    foo(p1++);
}

For the AVR target this generates ugly code, keeping the variable in two call-saved register pairs, among other problems.

/* prologue: frame size=0 */
    push r14
    push r15
    push r16
    push r17
/* prologue end (size=4) */
    lds r24,x
    lds r25,(x)+1
    movw r16,r24
    subi r16,lo8(-(1))
    sbci r17,hi8(-(1))
    call foo
    movw r14,r16
    sec
    adc r14,__zero_reg__
    adc r15,__zero_reg__
    movw r24,r16
    call foo
    movw r16,r14
    subi r16,lo8(-(1))
    sbci r17,hi8(-(1))
    movw r24,r14
    call foo
etc..

The result gets much better when it is written as "foo(p); p++;":

/* prologue: frame size=0 */
	push r16
	push r17
/* prologue end (size=2) */
	movw r16,r24
	call foo
	subi r16,lo8(-(1))
	sbci r17,hi8(-(1))
	movw r24,r16
	call foo
	subi r16,lo8(-(1))
	sbci r17,hi8(-(1))

And the result gets near optimal when using increments larger than the target can add with a single immediate instruction (>64): the compiler then adds the cumulative offset to the original pointer at each call, which would also be the optimal code for the smaller increments (see the sketch after the assembly below).

	movw r16,r24
	call foo
	movw r24,r16
	subi r24,lo8(-(65))
	sbci r25,hi8(-(65))
	call foo
	movw r24,r16
	subi r24,lo8(-(130))
	sbci r25,hi8(-(130))
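
For comparison, here is a source-level sketch of what that cumulative-offset
form would look like if it were also used for increments of 1 (my own sketch,
function name made up for illustration):

void foo(char *p);

/* Keep only the original pointer live across the calls and add the
   running offset at each call site. */
void test_offsets(char *p)
{
    foo(p);      /* offset 0 */
    foo(p + 1);  /* offset 1 */
    foo(p + 2);  /* offset 2 */
    foo(p + 3);  /* offset 3 */
}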

The worst behaviour is shown by 4.1.2, 4.2.2 and 4.3.0.
Better results (still non-optimal) come from 3.4.6 and 3.3.6,
but 4.0.4 produces the best code for the original foo(p++).

Ugly code is also seen for arm/thumb and pdp-11,
but good code for arm/arm.

So it's a multi-target problem, not just the AVR!
Comment 1 Wouter van Gulik 2008-01-11 08:17:26 UTC
Created attachment 14920 [details]
Test case showing the three cases

Compile using -fno-inline.
For the AVR I used: avr-gcc -Wall -Os -fno-inline -mmcu=avr5 --save-temps main.c
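(To also look at the GIMPLE, -fdump-tree-optimized can probably be added to the same command line to get the optimized tree dump; I am assuming the 4.x spelling of that dump flag here.)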
Comment 2 Richard Biener 2008-01-11 09:42:38 UTC
Confirmed.

void foo(char *p);

void test1(char * p)
{
    foo(p++);
    foo(p++);
    foo(p++);
    foo(p++);
}

void test2(char * p)
{
    foo(p); p++;
    foo(p); p++;
    foo(p); p++;
    foo(p); p++;
}

The problem is that with the first variant we have two registers live over each
function call, while with the second variant only one.  This can be seen
from the optimized tree dump already:

test1 (p)
{
<bb 2>: 
  p_3 = p_1(D) + 1;
  foo (p_1(D));
  p_5 = p_3 + 1;
  foo (p_3);
  p_7 = p_5 + 1;
  foo (p_5);
  foo (p_7) [tail call];
  return;

}

test2 (p)
{
<bb 2>:
  foo (p_1(D));
  p_2 = p_1(D) + 1;
  foo (p_2);
  p_3 = p_2 + 1;
  foo (p_3);
  p_4 = p_3 + 1;
  foo (p_4) [tail call];
  return;

}

and is initially caused by gimplification which produces

  p.0 = p;
  p = p + 1;
  foo (p.0);

from

  foo (p++ );

No further pass undoes this transformation.

With GCC 4.0 TER produced

  foo (p);
  foo (p + 1B);
  foo (p + 2B);
...

from which we can generate good code.  From 4.1 on this is no longer done.
Comment 3 Andrew Pinski 2008-01-11 11:33:09 UTC
No, what happened with 4.0 is rather that DOM would propagate x + 1 for each x.

Really this comes down to scheduling of instructions and moving them closer to their usage.
Comment 4 Steven Bosscher 2009-06-24 07:42:19 UTC
Couldn't this be fixed also by changing the initial gimplification from:

  p.0 = p;
  p = p + 1;
  foo (p.0);

to:

  p.0 = p;
  foo (p.0);
  p = p + 1;

?
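
For illustration, a source-level sketch of the two orders (mine, with made-up
function names); the proposed order corresponds to the hand-written
"foo(p); p++;" variant that produces the better code shown above:

void foo(char *p);

/* Current gimplification order of foo(p++): the increment is emitted
   before the call. */
void current_order(char *p)
{
    char *tmp = p;   /* p.0 = p;    */
    p = p + 1;       /* p = p + 1;  */
    foo(tmp);        /* foo (p.0);  */
}

/* Proposed order: the increment is sunk after the call, matching the
   "foo(p); p++;" form. */
void proposed_order(char *p)
{
    char *tmp = p;   /* p.0 = p;    */
    foo(tmp);        /* foo (p.0);  */
    p = p + 1;       /* p = p + 1;  */
}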
Comment 5 rguenther@suse.de 2009-06-24 09:07:58 UTC
Subject: Re: Scheduling of post-modified function arguments is not good

On Wed, 24 Jun 2009, steven at gcc dot gnu dot org wrote:

> ------- Comment #4 from steven at gcc dot gnu dot org  2009-06-24 07:42 -------
> Couldn't this be fixed also by changing the initial gimplification from:
> 
>   p.0 = p;
>   p = p + 1;
>   foo (p.0);
> 
> to:
> 
>   p.0 = p;
>   foo (p.0);
>   p = p + 1;

Probably yes.

Richard.
Comment 6 abnikant 2010-09-13 11:38:15 UTC
We get better code on current trunk (head): both cases [test1 and test2] now produce the same code.
For the following test case:

void foo(char *p);

void test1(char * p)
{
    foo(p++);
    foo(p++);
    foo(p++);
    foo(p++);
}

void test2(char * p)
{
    foo(p); p++;
    foo(p); p++;
    foo(p); p++;
    foo(p); p++;
}

we get:
test1:
        push r28
        push r29
/* prologue: function */
/* frame size = 0 */
/* stack size = 2 */
.L__stack_usage = 2
        mov r28,r24
        mov r29,r25
        rcall foo
        mov r24,r28
        mov r25,r29
        adiw r24,1
        rcall foo
        mov r24,r28
        mov r25,r29
        adiw r24,2
        rcall foo
        mov r24,r28
        mov r25,r29
        adiw r24,3
        rcall foo
/* epilogue start */
        pop r29
        pop r28
        ret
        .size   test1, .-test1
.global test2
        .type   test2, @function
test2:
        push r28
        push r29
/* prologue: function */
/* frame size = 0 */
/* stack size = 2 */
.L__stack_usage = 2
        mov r28,r24
        mov r29,r25
        rcall foo
        mov r24,r28
        mov r25,r29
        adiw r24,1
        rcall foo
        mov r24,r28
        mov r25,r29
        adiw r24,2
        rcall foo
        mov r24,r28
        mov r25,r29
        adiw r24,3
        rcall foo
/* epilogue start */
        pop r29
        pop r28
        ret
        .size   test2, .-test2