Bug 56439 - extra register moves for global register and local register variables
Summary: extra register moves for global register and local register variables
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.7.2
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
: 65082 (view as bug list)
Depends on:
Blocks: 56183
  Show dependency treegraph
 
Reported: 2013-02-25 02:32 UTC by Robert "Finny" Merrill
Modified: 2023-05-26 02:33 UTC (History)
3 users (show)

See Also:
Host:
Target: avr
Build:
Known to work: 3.4.6
Known to fail:
Last reconfirmed: 2013-03-07 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Robert "Finny" Merrill 2013-02-25 02:32:02 UTC
I am writing some interrupt code for an Atmel AVR microcontroller and I'm trying to shave cycles off, specifically at the beginning of the interrupt. I want to achieve this by minimizing the registers that need to be saved, so I decided to declare a few variables as global register variables.

What I found is that GCC will "optimize away" the register assignment and instead produce code that is almost what I want, except it copies the value in and out of another register (allocated the usual way) instead of operating on the assigned one.

For example the following C:

register unsigned char foo asm ("r4");

void baz();
void quux();

void bar() {
  foo = foo * 2;
  if (foo > 10)
    baz();
  else
    quux();
}

generates the following assembly:

        mov r24,r4
        lsl r24
        mov r4,r24
        cpi r24,lo8(11)
        brsh .L4
        rjmp quux
.L4:
        rjmp baz


It does the same thing (copy to r24, manipulate, copy back) on every optimization level.

Surely this can't be desired behavior? If you use local register variables it's often even worse, as gcc won't touch the assigned register at all but instead produce identical code that uses a different register.

I'm fairly certain this shouldn't work this way...
Comment 1 Andrew Pinski 2013-02-25 03:10:35 UTC
I think this is correct behavior if you read the manual.
Comment 2 Robert "Finny" Merrill 2013-02-25 03:30:47 UTC
Any specific sections you can point me to? I'd be interested to hear a justification for this behavior.
Comment 3 Andrew Pinski 2013-02-25 03:53:25 UTC
(In reply to comment #2)
> Any specific sections you can point me to? I'd be interested to hear a
> justification for this behavior.

Simple answer, an interrupt can happen any time after the access/assignment of the global register variable so it needs to be set to the value and treated as a volatile variable.
Comment 4 Robert "Finny" Merrill 2013-02-25 05:46:29 UTC
Wouldn't that be a reason /against/ doing this:

        mov r24,r4
        lsl r24
        mov r4,r24

instead of just lsl r4?
Comment 5 Chung-Ju Wu 2013-02-25 06:32:05 UTC
(In reply to comment #4)
> Wouldn't that be a reason /against/ doing this:
> 
>         mov r24,r4
>         lsl r24
>         mov r4,r24
> 
> instead of just lsl r4?

Just guess...

Is that possibly related to zero_extend RTL transformation?
Either in gcc optimization, or the design of machine description?

Try to use this declaration (use 'int' instead of 'char'):
  register unsigned int foo asm ("r4");

Is there any difference on code generation?
Comment 6 Robert "Finny" Merrill 2013-02-25 16:39:39 UTC
Well other than the fact that it now uses two registers (the registers are 8-bit and ints are 16-bit), no, it does the same thing

register int foo asm ("r30");

void baz();
void quux();

void bar() {
  foo = foo * 2;
  if (foo > 10)
    baz();
  else
    quux();
}


yields:

bar:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
        mov r24,r30
        mov r25,r31
        lsl r24
        rol r25
        mov r30,r24
        mov r31,r25
        sbiw r24,11
        brge .L4
        rjmp quux
.L4:
        rjmp baz
Comment 7 Chung-Ju Wu 2013-02-26 03:30:40 UTC
Sorry I didn't notice that it is 8-bit register on avr.
I was porting a 32-bit register target and this case is similar to mine.
That's why I have such guess in comment 5.


Now I tried to build an avr-elf target and test it with your code fragment,
using following command: $ avr-elf-gcc -S -Os test.c -fdump-rtl-all

Here are my preliminary observation:


As Andrew Pinski said in comment 3, global register variable is treated
as a volatile variable. So have:

$ vi a.c.150r.expand

 25 (insn 5 4 6 3 (set (reg:QI 42 [ foo.0 ])
 26         (reg/v:QI 4 r4 [ foo ])) a.c:11 -1
 27      (nil))
 28
 29 (insn 6 5 7 3 (set (reg:QI 43 [ foo.1 ])
 30         (ashift:QI (reg:QI 42 [ foo.0 ])
 31             (const_int 1 [0x1]))) a.c:11 -1
 32      (nil))
 33
 34 (insn 7 6 8 3 (set (reg/v:QI 4 r4 [ foo ])
 35         (reg:QI 43 [ foo.1 ])) a.c:11 -1
 36      (nil))
 37


Then, in the combine phase,
I notice that it does try to combine insn-5 and insn-6.
But it does not further combine insn-6 and insn-7.
 
 $ vi a.c.185r.combine

 29 (insn 6 5 7 2 (set (reg:QI 43 [ foo.1 ])
 30         (ashift:QI (reg/v:QI 4 r4 [ foo ])
 31             (const_int 1 [0x1]))) a.c:11 199 {*ashlqi3}
 32      (expr_list:REG_DEAD (reg/v:QI 4 r4 [ foo ])
 33         (nil)))
 34
 35 (insn 7 6 8 2 (set (reg/v:QI 4 r4 [ foo ])
 36         (reg:QI 43 [ foo.1 ])) a.c:11 28 {movqi_insn}
 37      (nil))
 38


Finally, at ira/reload phase, the (reg:QI 43) is assigned r24,
which does not satisfy the 'lsl' instruction constraint.
So a new insn-24 is created.
 
 $ vi a.c.198r.reload
 
 44 (insn 24 5 6 2 (set (reg:QI 24 r24 [orig:43 foo.1 ] [43])
 45         (reg/v:QI 4 r4 [ foo ])) a.c:11 28 {movqi_insn}
 46      (nil))
 47
 48 (insn 6 24 7 2 (set (reg:QI 24 r24 [orig:43 foo.1 ] [43])
 49         (ashift:QI (reg:QI 24 r24 [orig:43 foo.1 ] [43])
 50             (const_int 1 [0x1]))) a.c:11 199 {*ashlqi3}
 51      (nil))
 52
 53 (insn 7 6 8 2 (set (reg/v:QI 4 r4 [ foo ])
 54         (reg:QI 24 r24 [orig:43 foo.1 ] [43])) a.c:11 28 {movqi_insn}
 55      (nil))
 56


I am not sure if there is a solution to this issue on avr target.
Maybe someone else who work on avr porting can improve it
in combine phase or machine description design.
Comment 8 Georg-Johann Lay 2013-03-07 23:26:01 UTC
(In reply to comment #3)
> (In reply to comment #2)
>> Any specific sections you can point me to? I'd be interested to hear a
>> justification for this behavior.
> 
> Simple answer, an interrupt can happen any time after the access/assignment of
> the global register variable so it needs to be set to the value and treated as
> a volatile variable.

Is this really the case?  There is not even a means to tag a REG as volatile.  The /v flag is set but for REGs it means "user variable", see rtl.h and the docs for volatil.

http://gcc.gnu.org/viewcvs/trunk/gcc/rtl.h?view=markup#l286

And qualifying foo as volatile diagnoses as expected:

<stdin>:1:1: warning: optimization may eliminate reads and/or writes to register variables [-Wvolatile-register-var]

This issue looks rather like a fallout of reload that for some reasons spills for the shift insn.

See also PR49491 

Or are global registers spacial for reload? I still don't see why a spilling is needed for the shift insn...
Comment 9 Thilo Schulz 2015-09-09 23:29:29 UTC
This behaviour is _really_ annoying, and it still exists in gcc 4.9.3 and 5.2.0. It is similar, to the bug you described, Mr. Lay, but not quite the same.
This problem seems to affect all arithmetic and bitwise operations on register variables, even on the upper registers >= r16.
It greatly diminishes the advantages of using register variables for the purpose of optimising access to often-used global variables. Global register variables do have an enormous program space and speed advantage over accesses to locations in RAM, which is very significant on ATTiny MCUs with limited flash space and in the case of my application, for fulfillment of my RTOS requirements.

I did some more research on when this problem actually occurs, and I now have a pretty good idea of when bad things happen. There seem to be two separate cases of this occurring.

Please consider this first working example:

#####
#include <avr/io.h>

register uint16_t globregvar1 asm("r8");
register uint16_t globregvar2 asm("r10");
register uint8_t globregvar3 asm("r12");

int main(void)
{
  register uint8_t locregvar1 asm("r26");
  register uint8_t locregvar2 asm("r2");
  uint8_t test = 1;

  while(1)
  {
    if(test)
      test <<= 1;
    else
      test = 1;

    locregvar1 |= test;
    locregvar2 |= test;
    globregvar1++;
    globregvar2 |= test;
    globregvar3 |= test;

    __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2));

    if(test == 7)
      PORTB = test;
  }
}
#####

The __asm__ statement does nothing and forces the compiler via constraints not to optimize locregvar* away.

Compile with:
##
$ avr-gcc-5.2.0 -ggdb3 -mmcu=attiny13 -Os -o test.app test.c
$
##

Now the output for relevant section:

###
    locregvar1 |= test;
  2e:   a8 2b           or      r26, r24
    locregvar2 |= test;
  30:   28 2a           or      r2, r24
    globregvar1++;
  32:   9f ef           ldi     r25, 0xFF       ; 255
  34:   89 1a           sub     r8, r25
  36:   99 0a           sbc     r9, r25
    globregvar2 |= test;
  38:   a8 2a           or      r10, r24
    globregvar3 |= test;
  3a:   c8 2a           or      r12, r24

    __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2));

    if(test == 7)
  3c:   87 30           cpi     r24, 0x07       ; 7
[...]
###

So far so good. For globregvar1 it's even so intelligent as to load -1 to r25 and then operating directly on r8 and r9 for the 2-byte value, which is great.

Watch what happens with these changes, marked with a //**** at end of line:

#####
#include <avr/io.h>

register uint16_t globregvar1 asm("r8");
register uint16_t globregvar2 asm("r10");
register uint8_t globregvar3 asm("r12");

int main(void)
{
  register uint8_t locregvar1 asm("r26");
  register uint8_t locregvar2 asm("r2");
  uint8_t test = 1;

  while(1)
  {
    if(test)
      test <<= 1;
    else
      test = 1;

    locregvar1 |= _BV(4); //****
    locregvar2 |= _BV(4); //****
    globregvar1++;
    globregvar2 |= _BV(4); //****
    globregvar3 |= _BV(4); //****

    __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2));

    if(globregvar1 & _BV(2)) //****
      PORTB = test;
  }
}
#####

and the output:

###
    locregvar1 |= _BV(4);
  2e:   a0 61           ori     r26, 0x10       ; 16
    locregvar2 |= _BV(4);
  30:   92 2d           mov     r25, r2
  32:   90 61           ori     r25, 0x10       ; 16
  34:   29 2e           mov     r2, r25
    globregvar1++;
  36:   94 01           movw    r18, r8
  38:   2f 5f           subi    r18, 0xFF       ; 255
  3a:   3f 4f           sbci    r19, 0xFF       ; 255
  3c:   49 01           movw    r8, r18
    globregvar2 |= _BV(4);
  3e:   68 94           set
  40:   a4 f8           bld     r10, 4
    globregvar3 |= _BV(4);
  42:   9c 2d           mov     r25, r12
  44:   90 61           ori     r25, 0x10       ; 16
  46:   c9 2e           mov     r12, r25

    __asm__ __volatile__("":: "r" (locregvar1), "r" (locregvar2));

    if(globregvar1 & _BV(2))
  48:   22 ff           sbrs    r18, 2
[...]
###

So what happens is this:

1. For locregvar1, nothing interesting is happening, because the immediate instruction can operate directly on the target reg. Good.

2. However, for locregvar2 bound to r2, it turns a 2-instruction operation into three operations. Not good.

3. Interestingly, for globregvar2, which is a 16 bit type, it uses the very efficient set/bld pair to set the 5th bit. Nice.

4. globregvar3 is 8 bit. Exactly the same instruction, but here it spills to r25. Not nice, and totally inconsistent with the behaviour displayed for globregvar2.

And finally:
5. globregvar1 is still being incremented, no change in input code here except for that it is being used later inside that last if condition.
Now that variable spills to r18:r19. This is pretty much what is happening for OP here, and in the case of OP this behaviour is at least understandable: If a copy of the register variable is in one of the higher registers, one can use the cpi instruction for comparison as opposed if you had to compare against lower registers, so you neither lose cycles nor code space. BUT, if an interrupt changes the register variable in the meantime, you will have a bad race condition and you will get totally inconsistent behaviour, because the coming comparison only operates on a copy in the upper registers. Is this really wanted?

Furthermore, in my example, there is no need for immediate instructions, the branch condition can be directly applied to the register variable, so no spilling is required here.
Comment 10 Andrew Pinski 2023-05-26 02:30:29 UTC
*** Bug 65082 has been marked as a duplicate of this bug. ***
Comment 11 Andrew Pinski 2023-05-26 02:32:26 UTC
here is a nice reduced testcase:
```
typedef unsigned short uint16_t;

register uint16_t r4 asm ("r4");
register uint16_t r6 asm ("r6");

uint16_t pllExec(void)
{
  r4 += r6;
  return r4>>8;
};
```

See bug 43491 comment #12 for some analysis of this issue really.