Bug 82658 - Suboptimal codegen on AVR when right-shifting 8-bit unsigned integers.
Summary: Suboptimal codegen on AVR when right-shifting 8-bit unsigned integers.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.0
: P3 normal
Target Milestone: 10.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 85533 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-10-22 10:15 UTC by mike.k
Modified: 2021-07-22 21:17 UTC (History)
2 users (show)

See Also:
Host:
Target: avr
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-04-26 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description mike.k 2017-10-22 10:15:31 UTC
This issue has been validated to occur back as far as at least 5.4.0, and still occurs in trunk.

When shifting an unsigned char/uint8_t right by less than 4 bits, suboptimal code is generated. This behavior only occurs when compiling source files as C++, not as C, even when the source file is equivalent otherwise. The issue does not manifest with left shifts or with larger composite types (such as uint16_t).

Trivial test:

void test ()
{
    volatile unsigned char val;
    unsigned char local = val;
    local >>= 1;
    val = local;
}

Compiling as C++ (avr-g++ [-O3|-O2] -mmcu=atmega2560 test.cpp -S -c -o test.s) results in the following assembly sequence handling the load, shift, and store:

ldd r24,Y+1
ldi r25,0
asr r25
ror r24
std Y+1,r24

The next operation performed on r25 is a clr. Thus, ldi/asr/ror are entirely equivalent to lsr in this situation, which is what the C frontend does:

Compiling as C (avr-gcc [-O3|-O2] -mmcu=atmega2560 test.c -S -c -o test.s) results in the following assembly sequence handling the load, shift, and store:

ldd r24,Y+1
lsr r24
std Y+1,r24

This is optimal code. This is also the defined behavior in avr.c.

The issue becomes more problematic with larger shifts (up until 4, where the defined behavior takes over again), as it generates the same instruction sequence repeatedly, whereas gcc simply generates 'lsr; lsr; lsr', as expected.

Interestingly, the issue does _not_ manifest if one chooses to use an integer division instead of a shift - if one divides the unsigned char by 2 instead of shifting right 1, it emits 'lsr' as expected.
Comment 1 Richard Biener 2017-10-24 07:02:58 UTC
Ok.  So the issue is a FE one after all.  With the C FE we have

    volatile unsigned char val;
    unsigned char local = val;
  local = local >> 1;
  val = local;

while the C++ FE preserves the widening from the integer promotion rules:

  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (local = (unsigned char) val) >>>>>;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (local = (unsigned char) ((int) local >> 1)) >>>>>;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (val = local) >>>>>;

and nothing in the middle-end shortens the right-shift operation.
Comment 2 mike.k 2017-10-31 18:49:46 UTC
I wanted to validate if this issue was presenting in the toolchains for other architectures, so I tested a bit:

GCC 7.2.0 on x86-64 (-O3):

C:

movzx   eax, BYTE PTR [rsp-1]
shr     al
mov     BYTE PTR [rsp-1], al
ret

C++:

movzx   eax, BYTE PTR [rsp-1]
sar     eax
mov     BYTE PTR [rsp-1], al
ret

While not different in performance, it _is_ generating different code, and the code difference seems to reflect what Richard already found.

I am not able to reproduce any difference on MIPS64, MIPS32, ARM, ARM64, PPC, PPC64. This is probably due to backend differences not causing the sequences to map differently.

I do see it going back to GCC 4.6.4 on AVR.
Comment 3 Georg-Johann Lay 2018-01-30 22:08:53 UTC
According to comment 1, a C++ FE issue
Comment 4 Georg-Johann Lay 2018-04-26 10:21:15 UTC
*** Bug 85533 has been marked as a duplicate of this bug. ***
Comment 5 Randy Brecker 2018-04-28 09:29:20 UTC
I confirm this is still true for x86 (!) with gcc-7.3.1 and gcc-8.0.1 in language-mode c++.
Comment 6 Andrew Pinski 2021-07-22 21:17:05 UTC
Fixed in GCC 10.