I am writing some interrupt code for an Atmel AVR microcontroller and I'm trying to shave cycles off, specifically at the beginning of the interrupt. I want to achieve this by minimizing the registers that need to be saved, so I decided to declare a few variables as global register variables. What I found is that GCC will "optimize away" the register assignment and instead produce code that is almost what I want, except it copies the value in and out of another register (allocated the usual way) instead of operating on the assigned one. For example the following C: register unsigned char foo asm ("r4"); void baz(); void quux(); void bar() { foo = foo * 2; if (foo > 10) baz(); else quux(); } generates the following assembly: mov r24,r4 lsl r24 mov r4,r24 cpi r24,lo8(11) brsh .L4 rjmp quux .L4: rjmp baz It does the same thing (copy to r24, manipulate, copy back) on every optimization level. Surely this can't be desired behavior? If you use local register variables it's often even worse, as gcc won't touch the assigned register at all but instead produce identical code that uses a different register. I'm fairly certain this shouldn't work this way...
I think this is correct behavior if you read the manual.
Any specific sections you can point me to? I'd be interested to hear a justification for this behavior.
(In reply to comment #2) > Any specific sections you can point me to? I'd be interested to hear a > justification for this behavior. Simple answer, an interrupt can happen any time after the access/assignment of the global register variable so it needs to be set to the value and treated as a volatile variable.
Wouldn't that be a reason /against/ doing this: mov r24,r4 lsl r24 mov r4,r24 instead of just lsl r4?
(In reply to comment #4) > Wouldn't that be a reason /against/ doing this: > > mov r24,r4 > lsl r24 > mov r4,r24 > > instead of just lsl r4? Just guess... Is that possibly related to zero_extend RTL transformation? Either in gcc optimization, or the design of machine description? Try to use this declaration (use 'int' instead of 'char'): register unsigned int foo asm ("r4"); Is there any difference on code generation?
Well other than the fact that it now uses two registers (the registers are 8-bit and ints are 16-bit), no, it does the same thing register int foo asm ("r30"); void baz(); void quux(); void bar() { foo = foo * 2; if (foo > 10) baz(); else quux(); } yields: bar: /* prologue: function */ /* frame size = 0 */ /* stack size = 0 */ .L__stack_usage = 0 mov r24,r30 mov r25,r31 lsl r24 rol r25 mov r30,r24 mov r31,r25 sbiw r24,11 brge .L4 rjmp quux .L4: rjmp baz
Sorry I didn't notice that it is 8-bit register on avr. I was porting a 32-bit register target and this case is similar to mine. That's why I have such guess in comment 5. Now I tried to build an avr-elf target and test it with your code fragment, using following command: $ avr-elf-gcc -S -Os test.c -fdump-rtl-all Here are my preliminary observation: As Andrew Pinski said in comment 3, global register variable is treated as a volatile variable. So have: $ vi a.c.150r.expand 25 (insn 5 4 6 3 (set (reg:QI 42 [ foo.0 ]) 26 (reg/v:QI 4 r4 [ foo ])) a.c:11 -1 27 (nil)) 28 29 (insn 6 5 7 3 (set (reg:QI 43 [ foo.1 ]) 30 (ashift:QI (reg:QI 42 [ foo.0 ]) 31 (const_int 1 [0x1]))) a.c:11 -1 32 (nil)) 33 34 (insn 7 6 8 3 (set (reg/v:QI 4 r4 [ foo ]) 35 (reg:QI 43 [ foo.1 ])) a.c:11 -1 36 (nil)) 37 Then, in the combine phase, I notice that it does try to combine insn-5 and insn-6. But it does not further combine insn-6 and insn-7. $ vi a.c.185r.combine 29 (insn 6 5 7 2 (set (reg:QI 43 [ foo.1 ]) 30 (ashift:QI (reg/v:QI 4 r4 [ foo ]) 31 (const_int 1 [0x1]))) a.c:11 199 {*ashlqi3} 32 (expr_list:REG_DEAD (reg/v:QI 4 r4 [ foo ]) 33 (nil))) 34 35 (insn 7 6 8 2 (set (reg/v:QI 4 r4 [ foo ]) 36 (reg:QI 43 [ foo.1 ])) a.c:11 28 {movqi_insn} 37 (nil)) 38 Finally, at ira/reload phase, the (reg:QI 43) is assigned r24, which does not satisfy the 'lsl' instruction constraint. So a new insn-24 is created. $ vi a.c.198r.reload 44 (insn 24 5 6 2 (set (reg:QI 24 r24 [orig:43 foo.1 ] [43]) 45 (reg/v:QI 4 r4 [ foo ])) a.c:11 28 {movqi_insn} 46 (nil)) 47 48 (insn 6 24 7 2 (set (reg:QI 24 r24 [orig:43 foo.1 ] [43]) 49 (ashift:QI (reg:QI 24 r24 [orig:43 foo.1 ] [43]) 50 (const_int 1 [0x1]))) a.c:11 199 {*ashlqi3} 51 (nil)) 52 53 (insn 7 6 8 2 (set (reg/v:QI 4 r4 [ foo ]) 54 (reg:QI 24 r24 [orig:43 foo.1 ] [43])) a.c:11 28 {movqi_insn} 55 (nil)) 56 I am not sure if there is a solution to this issue on avr target. Maybe someone else who work on avr porting can improve it in combine phase or machine description design.
(In reply to comment #3) > (In reply to comment #2) >> Any specific sections you can point me to? I'd be interested to hear a >> justification for this behavior. > > Simple answer, an interrupt can happen any time after the access/assignment of > the global register variable so it needs to be set to the value and treated as > a volatile variable. Is this really the case? There is not even a means to tag a REG as volatile. The /v flag is set but for REGs it means "user variable", see rtl.h and the docs for volatil. http://gcc.gnu.org/viewcvs/trunk/gcc/rtl.h?view=markup#l286 And qualifying foo as volatile diagnoses as expected: <stdin>:1:1: warning: optimization may eliminate reads and/or writes to register variables [-Wvolatile-register-var] This issue looks rather like a fallout of reload that for some reasons spills for the shift insn. See also PR49491 Or are global registers spacial for reload? I still don't see why a spilling is needed for the shift insn...
This behaviour is _really_ annoying, and it still exists in gcc 4.9.3 and 5.2.0. It is similar, to the bug you described, Mr. Lay, but not quite the same. This problem seems to affect all arithmetic and bitwise operations on register variables, even on the upper registers >= r16. It greatly diminishes the advantages of using register variables for the purpose of optimising access to often-used global variables. Global register variables do have an enormous program space and speed advantage over accesses to locations in RAM, which is very significant on ATTiny MCUs with limited flash space and in the case of my application, for fulfillment of my RTOS requirements. I did some more research on when this problem actually occurs, and I now have a pretty good idea of when bad things happen. There seem to be two separate cases of this occurring. Please consider this first working example: ##### #include <avr/io.h> register uint16_t globregvar1 asm("r8"); register uint16_t globregvar2 asm("r10"); register uint8_t globregvar3 asm("r12"); int main(void) { register uint8_t locregvar1 asm("r26"); register uint8_t locregvar2 asm("r2"); uint8_t test = 1; while(1) { if(test) test <<= 1; else test = 1; locregvar1 |= test; locregvar2 |= test; globregvar1++; globregvar2 |= test; globregvar3 |= test; __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2)); if(test == 7) PORTB = test; } } ##### The __asm__ statement does nothing and forces the compiler via constraints not to optimize locregvar* away. Compile with: ## $ avr-gcc-5.2.0 -ggdb3 -mmcu=attiny13 -Os -o test.app test.c $ ## Now the output for relevant section: ### locregvar1 |= test; 2e: a8 2b or r26, r24 locregvar2 |= test; 30: 28 2a or r2, r24 globregvar1++; 32: 9f ef ldi r25, 0xFF ; 255 34: 89 1a sub r8, r25 36: 99 0a sbc r9, r25 globregvar2 |= test; 38: a8 2a or r10, r24 globregvar3 |= test; 3a: c8 2a or r12, r24 __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2)); if(test == 7) 3c: 87 30 cpi r24, 0x07 ; 7 [...] ### So far so good. For globregvar1 it's even so intelligent as to load -1 to r25 and then operating directly on r8 and r9 for the 2-byte value, which is great. Watch what happens with these changes, marked with a //**** at end of line: ##### #include <avr/io.h> register uint16_t globregvar1 asm("r8"); register uint16_t globregvar2 asm("r10"); register uint8_t globregvar3 asm("r12"); int main(void) { register uint8_t locregvar1 asm("r26"); register uint8_t locregvar2 asm("r2"); uint8_t test = 1; while(1) { if(test) test <<= 1; else test = 1; locregvar1 |= _BV(4); //**** locregvar2 |= _BV(4); //**** globregvar1++; globregvar2 |= _BV(4); //**** globregvar3 |= _BV(4); //**** __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2)); if(globregvar1 & _BV(2)) //**** PORTB = test; } } ##### and the output: ### locregvar1 |= _BV(4); 2e: a0 61 ori r26, 0x10 ; 16 locregvar2 |= _BV(4); 30: 92 2d mov r25, r2 32: 90 61 ori r25, 0x10 ; 16 34: 29 2e mov r2, r25 globregvar1++; 36: 94 01 movw r18, r8 38: 2f 5f subi r18, 0xFF ; 255 3a: 3f 4f sbci r19, 0xFF ; 255 3c: 49 01 movw r8, r18 globregvar2 |= _BV(4); 3e: 68 94 set 40: a4 f8 bld r10, 4 globregvar3 |= _BV(4); 42: 9c 2d mov r25, r12 44: 90 61 ori r25, 0x10 ; 16 46: c9 2e mov r12, r25 __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2)); if(globregvar1 & _BV(2)) 48: 22 ff sbrs r18, 2 [...] ### So what happens is this: 1. For locregvar1, nothing interesting is happening, because the immediate instruction can operate directly on the target reg. Good. 2. However, for locregvar2 bound to r2, it turns a 2-instruction operation into three operations. Not good. 3. Interestingly, for globregvar2, which is a 16 bit type, it uses the very efficient set/bld pair to set the 5th bit. Nice. 4. globregvar3 is 8 bit. Exactly the same instruction, but here it spills to r25. Not nice, and totally inconsistent with the behaviour displayed for globregvar2. And finally: 5. globregvar1 is still being incremented, no change in input code here except for that it is being used later inside that last if condition. Now that variable spills to r18:r19. This is pretty much what is happening for OP here, and in the case of OP this behaviour is at least understandable: If a copy of the register variable is in one of the higher registers, one can use the cpi instruction for comparison as opposed if you had to compare against lower registers, so you neither lose cycles nor code space. BUT, if an interrupt changes the register variable in the meantime, you will have a bad race condition and you will get totally inconsistent behaviour, because the coming comparison only operates on a copy in the upper registers. Is this really wanted? Furthermore, in my example, there is no need for immediate instructions, the branch condition can be directly applied to the register variable, so no spilling is required here.
*** Bug 65082 has been marked as a duplicate of this bug. ***
here is a nice reduced testcase: ``` typedef unsigned short uint16_t; register uint16_t r4 asm ("r4"); register uint16_t r6 asm ("r6"); uint16_t pllExec(void) { r4 += r6; return r4>>8; }; ``` See bug 43491 comment #12 for some analysis of this issue really.