Bug 39819 - [avr] Missed optimisation when setting 4-byte values
Summary: [avr] Missed optimisation when setting 4-byte values
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.3.2
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2009-04-19 21:09 UTC by David Brown
Modified: 2011-07-02 21:23 UTC (History)
3 users (show)

See Also:
Host:
Target: avr-*-*
Build:
Known to work: 4.6.1
Known to fail: 4.3.2
Last reconfirmed: 2009-08-21 19:28:03


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Brown 2009-04-19 21:09:38 UTC
avr-gcc misses a number of optimisations when copying 4-byte values or assigning a single byte value to 4 byte values.  The issue actually applies to other sized values as well, but since 4 byte values are common (such as for 32-bit ints, and for floats) the issue is especially relevant.

In summary, the compiler tends to produce code that is either a series of direct memory accesses, or uses indirect access (through Z) in a loop.  A better choice would often be to set up Z as a pointer, then unroll the indirect pointer loop.

All code was compiled using avr-gcc 4.3.2 from winavr-20090313, using -Os.

Look at the code:

typedef unsigned char uint8_t;
typedef unsigned long int uint32_t;

static uint8_t as[4];
static uint8_t bs[4];

void foo1(void) {
	for (uint8_t i = 0; i < sz; i++) {
		bs[i] = as[1];
	}
}

void foo2(void) {
	for (uint8_t i = 0; i < sz; i++) {
		*(bs + i) = *(as + 1);
	}
}

foo1 compiles to:

lds r24, as+1
sts bs, r24
sts bs+1, r24
sts bs+2, r24
sts bs+3, r24
ret

Excluding the "ret", this is 10 words and 10 cycles.

foo2 is logically identical (array access and pointer access are the same thing), but compiles to:

lds r24, as+1
ldi r30, lo8(bs)
ldi r31, hi8(bs)
.L1:
st Z+, r24
ldi r25, hi8(bs+4)
cpi r30, lo8(bs+4)
cpc r31, r25
brne L1
ret

Excluding the "ret", this is 9 words and 31 cycles (27 on the XMega).  Hoisting the "ldi r25, hi8(bs+4)" above the label would save four cycles.

An implementation that is smaller than both of these, and slightly slower on the Mega and slightly faster on the XMega, is:

lds r24, as+1
ldi r30, lo8(bs)
ldi r31, hi8(bs)
st Z+, r24
st Z+, r24
st Z+, r24
st Z+, r24
ret

Excluding the "ret" this is 8 words, and 12 cycles (8 on the XMega).


For the code:

static uint32_t al, bl;
static float af;

void foo3(void) {
	al = 0;
}

void foo4(void) {
	af = 0;
}

we get:

foo3:
sts al, __zero_reg__
sts (al)+1, __zero_reg__
sts (al)+2, __zero_reg__
sts (al)+3, __zero_reg__
ret

That's 8 words and 8 cycles (plus "ret").  Using

ldi r30, lo8(bs)
ldi r31, hi8(bs)
st Z+, __zero_reg__
st Z+, __zero_reg__
st Z+, __zero_reg__
st Z+, __zero_reg__
ret

Gives 6 words and 10 cycles, or 6 cycles on the XMega (plus "ret")

Function foo4() should of course give the same code, but instead compiles to the very inefficient:

foo4:
ldi r24, lo8(0x00)
ldi r25, hi8(0x00)
ldi r26, hlo8(0x00)
ldi r27, hhi8(0x00)
sts af, __zero_reg__
sts (af)+1, __zero_reg__
sts (af)+2, __zero_reg__
sts (af)+3, __zero_reg__
ret

That's 12 words and 12 cycles, and uses 4 registers unnecessarily.


Similar code is produced when copying values:

void foo5(void) {
	al = bl;
}

compiles to:

foo5:
lds r24, bl
lds r25, (bl) + 1
lds r26, (bl) + 2
lds r27, (bl) + 3
sts al, r24
sts (al) + 1, r25
sts (al) + 2, r26
sts (al) + 3, r27

Using the Z and either X or Y pointers would make this code slightly smaller but marginally slower on the Mega (and marginally faster on the XMega).  Even without that, re-arranging the code would allow a single register to be used rather than four.

ret
Comment 1 Eric Weddington 2009-08-21 19:28:03 UTC
Confirmed on 4.3.2.
Comment 2 Georg-Johann Lay 2011-07-02 21:23:59 UTC
Closed as WONTFIX.

Compiled the following code in 4.6.1, -std=c99 -mmcu=atmega88 -Os

typedef unsigned char uint8_t;
typedef unsigned long int uint32_t;

uint8_t as[4];
uint8_t bs[4];
static const uint8_t sz = 4;

void foo1(void) {
    for (uint8_t i = 0; i < sz; i++) {
        bs[i] = as[1];
    }
}

void foo2(void) {
    for (uint8_t i = 0; i < sz; i++) {
        *(bs + i) = *(as + 1);
    }
}

The result is:

foo1:
	lds r24,as+1
	sts bs,r24
	sts bs+1,r24
	sts bs+2,r24
	sts bs+3,r24
	ret

foo2:
	lds r24,as+1
	sts bs,r24
	sts bs+1,r24
	sts bs+2,r24
	sts bs+3,r24
	ret

So the difference has gone.