For setting individual bits of a register, the construct unsigned char BitLocation = Whatever; REG |= (1<<BitLocation); is commonly used. avr-gcc fails to realize that the target register is only 8 bits wide and performs an unnecessary 16-Bit shift of 1. Even if this code snippet is replaced by the following: unsigned char BitLocation = Whatever; const unsigned char one = 1; REG |= (one << BitLocation); The inefficient code will be generated, even with the -Os flag. I suppose this is because the << operator is not defined for 8-bit wide data types. Environment: System: Darwin variable.local 7.9.0 Darwin Kernel Version 7.9.0: Wed Mar 30 20:11:17 PST 2005; root:xnu/xnu-517.12.7.obj~1/RELEASE_PPC Power Macintosh powerpc host: powerpc-apple-darwin7.9.0 build: powerpc-apple-darwin7.9.0 target: avr-unknown-none configured with: ../configure --target=avr --prefix=/usr/local/atmel --enable-languages=c --disable-libssp --enable-__cxa_atexit --enable-clocale=gnu --disable-nls : (reconfigured) ../configure --target=avr --prefix=/usr/local/avr --enable-languages=c --disable-libssp --enable-__cxa_atexit --enable-clocale=gnu --disable-nls How-To-Repeat: Use the following test program, compile with -Os -S to see assembler code generated by avr-gcc: #include <avr/io.h> void setpin(unsigned char pin, unsigned char val) { if (val == 1) PORTC |= (1<<pin); else PORTC &= ~(1<<pin); } void setpin1(unsigned char pin, unsigned char val) { const unsigned char one = 1; if (val == 1) PORTC |= (one<<pin); else PORTC &= ~(one<<pin); } int main(void) { setpin(3, 1); setpin1(3, 1); return 0; }
Fix: Define shift operators for 8 bit wide data types, use if possible. I'm not a compiler expert but I know that the multiply operator detects the data types involved, so it should be doable with the shift operators as well.
Here's an excerpt of the assembly code obtained with -Os -S with some comments from me: ---------------------------------------------------- setpin: /* prologue: frame size=0 */ /* prologue end (size=0) */ mov r20,r24 clr r21 cpi r22,lo8(1) <--- The compare is done byte wide brne .L2 in r18,53-0x20 ldi r24,lo8(1) <--- Here we load 1 sixteen bit wide ldi r25,hi8(1) rjmp 2f 1: lsl r24 <--- Here we shift 1 sixteen bit wide rol r25 2: dec r20 brpl 1b or r18,r24 <--- r25 is not used. Why would we compute it ? rjmp .L6 .L2: in r18,53-0x20 ---------------------------------------------------- Cheers Rudi
Confirmed on 4.3.2.
The same occurs in 4.5 I think the bug is invalid The expression 1<< pin will be promoted. This produces a defined result if pin>7 and <15 The expression can not then be lower to 8 bit shift - since a shift by >7 is undefined. If you use pin &7, the shift is indeed 8 bit shift. Though it still loads a int (HIMODE) value of 1 that should have been reduced to QIMODE. setpinx: /* prologue: function */ /* frame size = 0 */ /* stack size = 0 */ .L__stack_usage = 0 mov r25,r24 andi r25,lo8(7) ldi r18,lo8(1) ldi r19,hi8(1) mov r24,r18 rjmp 2f 1: lsl r24 2: dec r25 brpl 1b cpi r22,lo8(1) brne .L7 in r25,53-0x20 or r25,r24 out 53-0x20,r25 ret .L7: in r25,53-0x20 com r24 and r24,r25 out 53-0x20,r24 ret
Right away: I am NOT an expert in compiler or optimizer design, so please bear with me. I understand, that (((unsigned char)1) << ((unsigned char)something)) may need more than 8 bits for representation and that gcc's default int size on the ATmega platform is 16 bits. But the result is assigned to an 8 bit value. I take it that there is no way to back propagate this potential optimization from the assignment to the shifting step. If you look in my assembly code, r25 is computed with great effort yet never transferred anywhere. The trick with &7 is nice but introduces another unnecessary AND operation. And it is only correct if the input numebr is guaranteed to be between 0 and 7. Am I correct that this whole optimization issue comes from the fact that ATmega is an 8 bit architecture yet gcc's int on that platform is 16 bit?
Closing this PR as invalid. As Andy already explained, for shift offsets >7 and <15 there has to be a defined result because int for avr-gcc is 16 bits, and this is actually *not* a byte shift. With alternative code that tested for these offsets at runtime, jumped around and loaded 0 if appropriate, you were not better of than with the current code -- in the contrary. Note that gcc deflates you function calls to just one instruction in main. So maybe what you really want here is to make these functions static inline so that you do no more see the code in the functions. Presumably you always call this functions with compile time constants. If you really need to quench out the last tick and call with non-constants, you could invent inline assembler with an instruction sequence that is similar to (1 << (R24 & 7)), perhaps together with __builtin_constant_p magic: LDI R24, 1 SBRC R24, 1 LDI R24, 4 SBRC R24, 0 LSL R24 SBRC R24, 2 SWAP R24 Note that with -mint8, int is 8 bits. That is still present as option, but no more supported and perhaps non-functional. -- Just for reference; here is a source without external #include: #define PORTC (*((unsigned char volatile*) 53)) void setpin(unsigned char pin, unsigned char val) { if (val == 1) PORTC |= (1<<pin); else PORTC &= ~(1<<pin); } void setpin1(unsigned char pin, unsigned char val) { const unsigned char one = 1; if (val == 1) PORTC |= (one<<(pin)); else PORTC &= ~(one<<pin); } int main(void) { setpin(3, 1); setpin1(3, 1); return 0; }
Created attachment 24816 [details] Test case for 16-bit shifts that use low part only Reopened this optimization issue. For the attached test case, there are sometimes REG_UNUSED notes for the high part like with avr-gcc-4.6.1 -Os -dP shift1: ; (insn 8 4 17 (set (reg:HI 24 r24 [50]) ; (ashift:HI (reg:HI 24 r24 [ x ]) ; (reg/v:QI 22 r22 [orig:47 s ] [47]))) foo.c:3 68 {ashlhi3} ; (expr_list:REG_DEAD (reg/v:QI 22 r22 [orig:47 s ] [47]) ; (expr_list:REG_UNUSED (reg:QI 25 r25) ; (nil)))) rjmp 2f ; 8 ashlhi3/1 [length = 6] 1: lsl r24 rol r25 2: dec r22 brpl 1b So that there is a way to map ashlhi3 to ashlqi3, i.e. 16-bit shift to a 8-bit shift. Unfortunately, these notes are not always present like in shift2: shift2: ; (insn 15 4 8 (set (reg:HI 18 r18) ; (reg:HI 24 r24 [ x ])) foo.c:10 10 {*movhi} ; (nil)) movw r18,r24 ; 15 *movhi/1 [length = 1] ; (insn 8 15 9 (set (reg:HI 18 r18) ; (ashift:HI (reg:HI 18 r18) ; (reg/v:QI 22 r22 [orig:46 s ] [46]))) foo.c:10 68 {ashlhi3} ; (expr_list:REG_DEAD (reg/v:QI 22 r22 [orig:46 s ] [46]) ; (nil))) rjmp 2f ; 8 ashlhi3/1 [length = 6] 1: lsl r18 rol r19 2: dec r22 brpl 1b The notes are not present in pass .split2 so a split won't help. The notes appear to be available in .peephole2 so that could be a fix. Moreover, the notes won't be back-propagated so that an optimization will cover at most one insn: the shift insn.
Set to NEW as explained in previous post.
Created attachment 24827 [details] Fix PR29560 by adding peephole2 pattern. PR target/29560 * config/avr/avr.md: Add peephole2 to map ashlhi3 to ashlqi3 if high part of shift target is unused.
Author: gjl Date: Wed Aug 10 08:58:03 2011 New Revision: 177616 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=177616 Log: PR target/29560 * config/avr/avr.md (*ashlhiqi3): New insn-and-split. (*ashl<extend_prefix>qihiqi3): New insn-and-splits. (*ashl<extend_prefix>qihiqi3.mem): New insn-and-splits. Add peephole2 to map ashlhi3 to ashlqi3 if high part of shift target is unused. Modified: trunk/gcc/ChangeLog trunk/gcc/config/avr/avr.md
Closed as FIXED for 4.7.0 trunk with patch from comment #10. The test case from attachment 24816 [details] compiles with -Os -dp to: shift1: rjmp 2f ; 22 *ashlqi3/1 [length = 5] 1: lsl r24 2: dec r22 brpl 1b ret ; 27 return [length = 1] shift2: mov r25,r24 ; 19 *movqi/1 [length = 1] rjmp 2f ; 15 *ashlqi3/1 [length = 5] 1: lsl r25 2: dec r22 brpl 1b sts y,r25 ; 16 *movqi/3 [length = 2] ret ; 22 return [length = 1] setpin: in r25,44-0x20 ; 11 *movqi/4 [length = 1] ldi r18,lo8(1) ; 13 *movhi/4 [length = 2] ldi r19,hi8(1) mov r0,r24 ; 49 *ashlqi3/1 [length = 5] rjmp 2f 1: lsl r18 2: dec r0 brpl 1b cpi r22,lo8(1) ; 7 *cmpqi/3 [length = 1] brne .L4 ; 8 branch [length = 1] or r25,r18 ; 15 iorqi3/1 [length = 1] out 44-0x20,r25 ; 17 *movqi/3 [length = 1] ret ; 46 return [length = 1] .L4: com r18 ; 27 one_cmplqi2 [length = 1] and r18,r25 ; 28 andqi3/1 [length = 1] out 44-0x20,r18 ; 30 *movqi/3 [length = 1] ret ; 48 return [length = 1] Notice the *ashlqi3 insns. The fix uses combine patterns and an RTL peephole to transform 16-bit shift to 8-bit shift if applicable. There will be situations where optimization opportunities are not noticed, e.g. if a move occurs after the shift so that this move will get a REG_UNUSED note but not the shift loop itself. The fix just replaces 16-bit loop by 8-bit loop; there is no back-propagation of the information.