Bug 29560 - [avr] Poor optimization for byte shifts
Summary: [avr] Poor optimization for byte shifts
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.1.1
: P3 enhancement
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2006-10-23 11:39 UTC by r.leitgeb@x-pin.com
Modified: 2011-08-10 09:19 UTC (History)
4 users (show)

See Also:
Host:
Target: avr-unknown-none
Build:
Known to work:
Known to fail: 4.1.1, 4.3.2
Last reconfirmed: 2011-06-17 17:18:30


Attachments
Test case for 16-bit shifts that use low part only (182 bytes, text/plain)
2011-07-23 14:50 UTC, Georg-Johann Lay
Details
Fix PR29560 by adding peephole2 pattern. (593 bytes, patch)
2011-07-25 12:48 UTC, Georg-Johann Lay
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description r.leitgeb@x-pin.com 2006-10-23 11:39:16 UTC
	For setting individual bits of a register, the construct
	unsigned char BitLocation = Whatever;
	REG |= (1<<BitLocation); 
	is commonly used. avr-gcc fails to realize that the target register
	is only 8 bits wide and performs an unnecessary 16-Bit shift of 1.
	Even if this code snippet is replaced by the following:
	unsigned char BitLocation = Whatever;
	const unsigned char one = 1;
	REG |= (one << BitLocation);
	The inefficient code will be generated, even with the -Os flag.
	I suppose this is because the << operator is not defined for 8-bit
	wide data types.

Environment:
System: Darwin variable.local 7.9.0 Darwin Kernel Version 7.9.0: Wed Mar 30 20:11:17 PST 2005; root:xnu/xnu-517.12.7.obj~1/RELEASE_PPC Power Macintosh powerpc


	
host: powerpc-apple-darwin7.9.0
build: powerpc-apple-darwin7.9.0
target: avr-unknown-none
configured with: ../configure --target=avr --prefix=/usr/local/atmel --enable-languages=c --disable-libssp --enable-__cxa_atexit --enable-clocale=gnu --disable-nls : (reconfigured) ../configure --target=avr --prefix=/usr/local/avr --enable-languages=c --disable-libssp --enable-__cxa_atexit --enable-clocale=gnu --disable-nls

How-To-Repeat:
	Use the following test program, compile with -Os -S to see assembler 
	code generated by avr-gcc:
	#include <avr/io.h>

	void setpin(unsigned char pin, unsigned char val) {
       		if (val == 1) PORTC |= (1<<pin);
        	else PORTC &= ~(1<<pin);
	}

	void setpin1(unsigned char pin, unsigned char val) {
        	const unsigned char one = 1;
        	if (val == 1) PORTC |= (one<<pin);
        	else PORTC &= ~(one<<pin);
	}

	int main(void) {
        	setpin(3, 1);
        	setpin1(3, 1);
        	return 0;
	}
Comment 1 r.leitgeb@x-pin.com 2006-10-23 11:39:16 UTC
Fix:
	Define shift operators for 8 bit wide data types, use if possible.
	I'm not a compiler expert but I know that the multiply operator
	detects the data types involved, so it should be doable with the
	shift operators as well.
Comment 2 r.leitgeb@x-pin.com 2006-10-24 07:31:54 UTC
Here's an excerpt of the assembly code obtained with -Os -S
with some comments from me:

----------------------------------------------------
setpin:
/* prologue: frame size=0 */
/* prologue end (size=0) */
        mov r20,r24
        clr r21
        cpi r22,lo8(1)    <--- The compare is done byte wide
        brne .L2
        in r18,53-0x20
        ldi r24,lo8(1)    <--- Here we load 1 sixteen bit wide
        ldi r25,hi8(1)
        rjmp 2f
1:      lsl r24           <--- Here we shift 1 sixteen bit wide
        rol r25
2:      dec r20
        brpl 1b
        or r18,r24        <--- r25 is not used. Why would we compute it ?
        rjmp .L6
.L2:
        in r18,53-0x20
----------------------------------------------------

Cheers

Rudi
Comment 3 Eric Weddington 2009-08-24 17:18:30 UTC
Confirmed on 4.3.2.
Comment 4 Andy Hutchinson 2009-12-31 03:16:22 UTC
The same occurs in 4.5

I think the bug  is invalid

The expression 1<< pin will be promoted. This produces a defined result if pin>7 and <15
The expression can not then be lower to 8 bit shift - since a shift by >7 is undefined.

If you use pin &7, the shift is indeed 8 bit shift. Though it still loads a int (HIMODE) value of 1 that should have been reduced to QIMODE.

	setpinx:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	mov r25,r24
	andi r25,lo8(7)
	ldi r18,lo8(1)
	ldi r19,hi8(1)
	mov r24,r18
	rjmp 2f
1:	lsl r24
2:	dec r25
	brpl 1b
	cpi r22,lo8(1)
	brne .L7
	in r25,53-0x20
	or r25,r24
	out 53-0x20,r25
	ret
.L7:
	in r25,53-0x20
	com r24
	and r24,r25
	out 53-0x20,r24
	ret
Comment 5 Rudolf Leitgeb 2010-02-18 13:28:19 UTC
Right away: I am NOT an expert in compiler or optimizer design, so please bear with me.

I understand, that (((unsigned char)1) << ((unsigned char)something)) may need more than 8 bits for representation and that gcc's default int size on the ATmega platform is 16 bits. But the result is assigned to an 8 bit value. I take it that there is no way to back propagate this potential optimization from the assignment to the shifting step. If you look in my assembly code, r25 is
computed with great effort yet never transferred anywhere.

The trick with &7 is nice but introduces another unnecessary AND operation. And it is only correct if the input numebr is guaranteed to be between 0 and 7. Am I correct that this whole optimization issue comes from the fact that ATmega is an 8 bit architecture yet gcc's int on that platform is 16 bit?
Comment 6 Georg-Johann Lay 2011-06-21 18:24:44 UTC
Closing this PR as invalid.

As Andy already explained, for shift offsets >7 and <15 there has to be a defined result because int for avr-gcc is 16 bits, and this is actually *not* a byte shift.

With alternative code that tested for these offsets at runtime, jumped around and loaded 0 if appropriate, you were not better of than with the current code -- in the contrary.

Note that gcc deflates you function calls to just one instruction in main.

So maybe what you really want here is to make these functions static inline so that you do no more see the code in the functions.  Presumably you always call this functions with compile time constants.

If you really need to quench out the last tick and call with non-constants, you could invent inline assembler with an instruction sequence that is similar to (1 << (R24 & 7)), perhaps together with __builtin_constant_p magic:

   LDI  R24, 1
   SBRC R24, 1
   LDI  R24, 4
   SBRC R24, 0
   LSL  R24
   SBRC R24, 2
   SWAP R24

Note that with -mint8, int is 8 bits.  That is still present as option, but no more supported and perhaps non-functional.

--

Just for reference; here is a source without external #include:

#define PORTC (*((unsigned char volatile*) 53))

void setpin(unsigned char pin, unsigned char val)
{
    if (val == 1) PORTC |= (1<<pin);
    else PORTC &= ~(1<<pin);
}

void setpin1(unsigned char pin, unsigned char val)
{
    const unsigned char one = 1;
    if (val == 1) PORTC |= (one<<(pin));
    else PORTC &= ~(one<<pin);
}

int main(void)
{
    setpin(3, 1);
    setpin1(3, 1);
    return 0;
}
Comment 7 Georg-Johann Lay 2011-07-23 14:50:58 UTC
Created attachment 24816 [details]
Test case for 16-bit shifts that use low part only

Reopened this optimization issue.

For the attached test case, there are sometimes REG_UNUSED notes for the high part like with avr-gcc-4.6.1 -Os -dP

shift1:
 ; (insn 8 4 17 (set (reg:HI 24 r24 [50])
 ;         (ashift:HI (reg:HI 24 r24 [ x ])
 ;             (reg/v:QI 22 r22 [orig:47 s ] [47]))) foo.c:3 68 {ashlhi3}
 ;      (expr_list:REG_DEAD (reg/v:QI 22 r22 [orig:47 s ] [47])
 ;         (expr_list:REG_UNUSED (reg:QI 25 r25)
 ;             (nil))))
	rjmp 2f	 ;  8	ashlhi3/1	[length = 6]
1:	lsl r24
	rol r25
2:	dec r22
	brpl 1b

So that there is a way to map ashlhi3 to ashlqi3, i.e. 16-bit shift to a 8-bit shift.

Unfortunately, these notes are not always present like in shift2:

shift2:
 ; (insn 15 4 8 (set (reg:HI 18 r18)
 ;         (reg:HI 24 r24 [ x ])) foo.c:10 10 {*movhi}
 ;      (nil))
	movw r18,r24	 ;  15	*movhi/1	[length = 1]
 ; (insn 8 15 9 (set (reg:HI 18 r18)
 ;         (ashift:HI (reg:HI 18 r18)
 ;             (reg/v:QI 22 r22 [orig:46 s ] [46]))) foo.c:10 68 {ashlhi3}
 ;      (expr_list:REG_DEAD (reg/v:QI 22 r22 [orig:46 s ] [46])
 ;         (nil)))
	rjmp 2f	 ;  8	ashlhi3/1	[length = 6]
1:	lsl r18
	rol r19
2:	dec r22
	brpl 1b

The notes are not present in pass .split2 so a split won't help. The notes appear to be available in .peephole2 so that could be a fix. Moreover, the notes won't be back-propagated so that an optimization will cover at most one insn: the shift insn.
Comment 8 Georg-Johann Lay 2011-07-23 14:53:52 UTC
Set to NEW as explained in previous post.
Comment 9 Georg-Johann Lay 2011-07-25 12:48:29 UTC
Created attachment 24827 [details]
Fix PR29560 by adding peephole2 pattern.


	PR target/29560
	* config/avr/avr.md: Add peephole2 to map ashlhi3 to ashlqi3 if
	high part of shift target is unused.
Comment 10 Georg-Johann Lay 2011-08-10 08:58:07 UTC
Author: gjl
Date: Wed Aug 10 08:58:03 2011
New Revision: 177616

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=177616
Log:
	
	PR target/29560
	* config/avr/avr.md (*ashlhiqi3): New insn-and-split.
	(*ashl<extend_prefix>qihiqi3): New insn-and-splits.
	(*ashl<extend_prefix>qihiqi3.mem): New insn-and-splits.
	Add peephole2 to map ashlhi3 to ashlqi3 if high part of
	shift target is unused.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/avr/avr.md
Comment 11 Georg-Johann Lay 2011-08-10 09:19:49 UTC
Closed as FIXED for 4.7.0 trunk with patch from comment #10.

The test case from attachment 24816 [details] compiles with -Os -dp to:

shift1:
	rjmp 2f	 ;  22	*ashlqi3/1	[length = 5]
1:	lsl r24
2:	dec r22
	brpl 1b
	ret	 ;  27	return	[length = 1]

shift2:
	mov r25,r24	 ;  19	*movqi/1	[length = 1]
	rjmp 2f	 ;  15	*ashlqi3/1	[length = 5]
1:	lsl r25
2:	dec r22
	brpl 1b
	sts y,r25	 ;  16	*movqi/3	[length = 2]
	ret	 ;  22	return	[length = 1]

setpin:
	in r25,44-0x20	 ;  11	*movqi/4	[length = 1]
	ldi r18,lo8(1)	 ;  13	*movhi/4	[length = 2]
	ldi r19,hi8(1)
	mov r0,r24	 ;  49	*ashlqi3/1	[length = 5]
	rjmp 2f
1:	lsl r18
2:	dec r0
	brpl 1b
	cpi r22,lo8(1)	 ;  7	*cmpqi/3	[length = 1]
	brne .L4	 ;  8	branch	[length = 1]
	or r25,r18	 ;  15	iorqi3/1	[length = 1]
	out 44-0x20,r25	 ;  17	*movqi/3	[length = 1]
	ret	 ;  46	return	[length = 1]
.L4:
	com r18	 ;  27	one_cmplqi2	[length = 1]
	and r18,r25	 ;  28	andqi3/1	[length = 1]
	out 44-0x20,r18	 ;  30	*movqi/3	[length = 1]
	ret	 ;  48	return	[length = 1]


Notice the *ashlqi3 insns.

The fix uses combine patterns and an RTL peephole to transform
16-bit shift to 8-bit shift if applicable.  There will be
situations where optimization opportunities are not noticed,
e.g. if a move occurs after the shift so that this move will
get a REG_UNUSED note but not the shift loop itself.

The fix just replaces 16-bit loop by 8-bit loop; there is no
back-propagation of the information.