Bug 33970 - Missed optimization using unsigned char loop variable
Summary: Missed optimization using unsigned char loop variable
Status: RESOLVED WORKSFORME
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.1.2
: P3 enhancement
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2007-11-01 13:44 UTC by Mike Henning
Modified: 2011-07-20 17:47 UTC (History)
5 users (show)

See Also:
Host: mingw32
Target: avr-*-*
Build:
Known to work:
Known to fail: 4.1.2, 4.2.2
Last reconfirmed: 2007-11-04 23:28:50


Attachments
Preprocessed testcase. (635 bytes, text/plain)
2007-11-01 17:28 UTC, Eric Weddington
Details
Assembly output of test case using 4.1.2. (1.61 KB, text/plain)
2007-11-01 17:45 UTC, Eric Weddington
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mike Henning 2007-11-01 13:44:38 UTC
When using an unsigned char in a loop with limited range the variable is promoted to a 16 bit integer if the variable is passed to another function within the loop by value. 


Tested using -Os on ATMEGA162:

#include <avr/io.h>


int sub2(uint8_t);

int main(void) {

static uint8_t  x;
volatile uint8_t y;

while(1)
 {

   for(x=0;x<128; x++)
     {
        y+=sub2(x);
     }
  }
 
  return 0;
}
 


int sub2(uint8_t xyz) {

volatile uint8_t abc;

while(xyz < 64)
 {
    xyz++;
    abc+=xyz;
 }
  return abc;
}
Comment 1 Richard Biener 2007-11-01 14:10:16 UTC
I guess you want -fno-tree-loop-ivcanon.
Comment 2 Andrew Pinski 2007-11-01 17:09:44 UTC
I don't see it being promoted on x86-linux-gnu at the tree level.
Comment 3 Eric Weddington 2007-11-01 17:28:31 UTC
Created attachment 14454 [details]
Preprocessed testcase.
Comment 4 Eric Weddington 2007-11-01 17:45:23 UTC
Created attachment 14455 [details]
Assembly output  of test case using 4.1.2.

Maybe I'm wrong, but I don't even see it being promoted on the AVR. See attached assembly output from 4.1.2 with -dP.
Comment 5 Eric Weddington 2007-11-01 17:47:24 UTC
Mike, can you provide additional information as to where the bug is?
Comment 6 Mike Henning 2007-11-01 21:26:17 UTC
(In reply to comment #5)
> Mike, can you provide additional information as to where the bug is?
> 
This is the assembly output I get:

Note that r14,r15 is being reserved for variable x when only a single reg is needed. Strange enough if call sub2 with (x+1) then only one reg is used and the code decreases by 12 bytes.

000000e8 <main>:
  e8:	ef 92       	push	r14
  ea:	ff 92       	push	r15
  ec:	1f 93       	push	r17
  ee:	cf 93       	push	r28
  f0:	df 93       	push	r29
  f2:	cd b7       	in	r28, 0x3d	; 61
  f4:	de b7       	in	r29, 0x3e	; 62
  f6:	21 97       	sbiw	r28, 0x01	; 1
  f8:	0f b6       	in	r0, 0x3f	; 63
  fa:	f8 94       	cli
  fc:	de bf       	out	0x3e, r29	; 62
  fe:	0f be       	out	0x3f, r0	; 63
 100:	cd bf       	out	0x3d, r28	; 61
 102:	ee 24       	eor	r14, r14
 104:	ff 24       	eor	r15, r15
 106:	19 81       	ldd	r17, Y+1	; 0x01
 108:	8e 2d       	mov	r24, r14
 10a:	0e 94 57 00 	call	0xae	; 0xae <sub2>
 10e:	18 0f       	add	r17, r24
 110:	19 83       	std	Y+1, r17	; 0x01
 112:	08 94       	sec
 114:	e1 1c       	adc	r14, r1
 116:	f1 1c       	adc	r15, r1
 118:	80 e8       	ldi	r24, 0x80	; 128
 11a:	e8 16       	cp	r14, r24
 11c:	f1 04       	cpc	r15, r1
 11e:	99 f7       	brne	.-26     	; 0x106 <main+0x1e>
 120:	f0 cf       	rjmp	.-32     	; 0x102 <main+0x1a>

00000122 <_exit>:
 122:	ff cf       	rjmp	.-2      	; 0x122 <_exit>
Comment 7 Eric Weddington 2007-11-04 23:28:15 UTC
With Mike's description in comment #6, confirmed on 4.1.2 and 4.2.2. AVR GCC 4.2.2 is worse than 4.1.2, in that even if sub2 is called with (x+1), the variable is still 16 bits.
Comment 8 Wouter van Gulik 2007-11-05 22:48:51 UTC
(In reply to comment #7)
> With Mike's description in comment #6, confirmed on 4.1.2 and 4.2.2. AVR GCC
> 4.2.2 is worse than 4.1.2, in that even if sub2 is called with (x+1), the
> variable is still 16 bits.
> 

There is something more going on, this is the assembler output when sub2 is not in the same file, and calling sub2(x), i.e not x+1:
===================================================================

	push r17
	push r28
	push r29
	in r28,__SP_L__
	in r29,__SP_H__
	sbiw r28,1
	in __tmp_reg__,__SREG__
	cli
	out __SP_H__,r29
	out __SREG__,__tmp_reg__
	out __SP_L__,r28
/* prologue end (size=11) */
.L3:
	sts x.1261,__zero_reg__
	ldi r24,lo8(0)
.L4:
	ldd r17,Y+1
	call sub2
	add r17,r24
	std Y+1,r17
	lds r24,x.1261
	subi r24,lo8(-(1))
	sts x.1261,r24
	sbrs r24,7
	rjmp .L4
	rjmp .L3
===========================================================

Note that x is now living in memory (as one would expect when it's static in a function). However it's not when looking at the original report. Is this ok? The original post assembler places x in r14,r15.  Does this still adhere the static keyword? I don't know, since it's value is always set at the start what's the point of being static? Unless main is called twice? But then x may not be in r14/r15!

So it seems to me this is some inline optimation problem.

Note that moving the sub2 function above main also gives other results?!? Is this standard behaviour?

Also note that when making sub2 static will completely remove it but also remove x as static, is this allowed?

Tested against avr-gcc 4.1.2
Comment 9 Mike Henning 2007-11-06 12:37:32 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > With Mike's description in comment #6, confirmed on 4.1.2 and 4.2.2. AVR GCC
> > 4.2.2 is worse than 4.1.2, in that even if sub2 is called with (x+1), the
> > variable is still 16 bits.
> > 
> 
> There is something more going on, this is the assembler output when sub2 is not
> in the same file, and calling sub2(x), i.e not x+1:
> ===================================================================

I think you will also find that if x is changed from ststic to auto the same problem appears.
Comment 10 Wouter van Gulik 2007-11-06 19:34:59 UTC
(In reply to comment #9)
> 
> I think you will also find that if x is changed from ststic to auto the same
> problem appears.
> 

Ok, I tried to find the minimum test case.
And it has nothing todo with static/volatile/inline etc.

==============================================
int sub2(unsigned char); // external function

void foo(void) {
  unsigned char x;
  for(x=0;x<128; x++)
  {
    //sub2(x); //x is becomes a int (16bit)
    sub2(x+1); //x is char (8bit)
  }
}

All is compiled using 4.1.2 and  "avr-gcc -Wall -Os -mmcu=avr5 -S test.c"

The output when compiling with "sub2(x)"
=================================================
/* prologue: frame size=0 */
	push r28
	push r29
/* prologue end (size=2) */
	ldi r28,lo8(0)
	ldi r29,hi8(0)
.L2:
	mov r24,r28 ; << loads as a byte!
	call sub2
	adiw r28,1           ; << increment as a int
	cpi r28,128          ; << compare as a int
	cpc r29,__zero_reg__
	brne .L2
/* epilogue: frame size=0 */
	pop r29
	pop r28
	ret

The output when compiling with "sub2(x+1)"
================================================================
/* prologue: frame size=0 */
	push r17
/* prologue end (size=1) */
	ldi r17,lo8(0)
.L2:
	subi r17,lo8(-(1))
	mov r24,r17
	call sub2
	cpi r17,lo8(-128)
	brne .L2
/* epilogue: frame size=0 */
	pop r17
	ret
===========================================================


From compiling with x as a int I got a sort of hint. When using x+1, x is actually loaded with 1 (and not zero). 

So i tried:
======================================================
void foo(void) {
  unsigned char x;
  for(x=1;x<129; x++)
    sub2(x);
}


And it gives the assembler loop check as for the "sub(x+1)" variant of course the loop is now effectivley the same as "sub2(x+1)". But the the incrementing of x is now done after the call, so it proofs that this is not the problem. I also tried different loop lengths, thinking 128 is the nasty one, but it did not help.

======================================================
/* prologue: frame size=0 */
	push r17
/* prologue end (size=1) */
	ldi r17,lo8(1)
.L2:
	mov r24,r17
	call sub2
	subi r17,lo8(-(1))
	cpi r17,lo8(-127)
	brne .L2
/* epilogue: frame size=0 */
	pop r17
	ret


I am out of knowledge now, I do not know how to further debug GCC on this. Just hoping this will help someone to pinpoint the problem.

Comment 11 Wouter van Gulik 2007-11-06 21:01:48 UTC
I just realised I did not tried hard enough to find the smallest case:

===========================================
volatile unsigned char bar;
void foo(void) {
  unsigned char x;
  for(x=0;x<128; x++) {
    //bar = x+1;
    bar = x;
  }
}

Looks like using the loop counter inside the loop causes the problem. Because when not using x, but just e.g. calling a function is also gives the smallest loop.

Looking at the RTL output of test.c.00.expand. it allready is a QI vs HI there... so it's going wrong some where inside gcc I really have no knowledge off...
Comment 12 abnikant 2010-09-13 12:09:11 UTC
I have verified the attached test case and test case with other comments and found the code generated is correct i.e. the variable is not promoted to integer in gcc-4.3.3, gcc-4.4.3, gcc-4.5.0 and also the latest head. 
The assembly for the following piece of code:

int sub2(unsigned char); // external function

void foo(void) {
  unsigned char x;
  for(x=0;x<128; x++)
  {
   sub2(x); //x is becomes a int (16bit)
   // sub2(x+1); //x is char (8bit)
  }
}
in gcc-4.3.3 is:
foo:
    push r17
/* prologue: function */
/* frame size = 0 */
    ldi r17,lo8(0)
.L2:
    mov r24,r17
    rcall sub2
    subi r17,lo8(-(1))
    cpi r17,lo8(-128)
    brne .L2
/* epilogue start */
    pop r17
    ret
    .size   foo, .-foo
Comment 13 Georg-Johann Lay 2011-07-20 17:47:44 UTC
Cloased as WORKS FOR ME.

Using the following test case:


volatile unsigned char bar;

void foo(void)
{
    unsigned char x;
    for(x=0;x<128; x++)
    {
        bar = x;
    }
}

int sub2 (unsigned char);

void foo2 (void)
{
    unsigned char x;
    for(x=0;x<128; x++)
    {
        sub2 (x);
    }
}

void foo21 (void)
{
    unsigned char x;
    for(x=0;x<128; x++)
    {
        sub2 (x+1);
    }
}

And with avr-gcc 4.7 r176517 I get the following result (-Os -mmcu=atmega8).
In no case there is a 16-bit loop variable used.

foo:
	ldi r24,lo8(0)
.L5:
	sts bar,r24
	subi r24,lo8(-(1))
	cpi r24,lo8(-128)
	brne .L5
	ret

foo2:
	push r28
	ldi r28,lo8(0)
.L8:
	mov r24,r28
	rcall sub2
	subi r28,lo8(-(1))
	cpi r28,lo8(-128)
	brne .L8
	pop r28
	ret

foo21:
	push r28
	ldi r28,lo8(0)
.L11:
	subi r28,lo8(-(1))
	mov r24,r28
	rcall sub2
	cpi r28,lo8(-128)
	brne .L11
	pop r28
	ret

	.ident	"GCC: (GNU) 4.7.0 20110720 (experimental)"