This issue has been validated to occur back as far as at least 5.4.0, and still occurs in trunk.

When shifting an unsigned char/uint8_t right by fewer than 4 bits, suboptimal code is generated. This behavior only occurs when compiling source files as C++, not as C, even when the source file is otherwise equivalent. The issue does not manifest with left shifts or with larger types (such as uint16_t).

Trivial test:

    void test(void)
    {
        volatile unsigned char val;
        unsigned char local = val;
        local >>= 1;
        val = local;
    }

Compiling as C++ (avr-g++ [-O3|-O2] -mmcu=atmega2560 test.cpp -S -c -o test.s) results in the following assembly sequence handling the load, shift, and store:

    ldd r24,Y+1
    ldi r25,0
    asr r25
    ror r24
    std Y+1,r24

The next operation performed on r25 is a clr, so the ldi/asr/ror sequence is entirely equivalent to a single lsr in this situation, which is what the C frontend produces.

Compiling as C (avr-gcc [-O3|-O2] -mmcu=atmega2560 test.c -S -c -o test.s) results in the following assembly sequence handling the load, shift, and store:

    ldd r24,Y+1
    lsr r24
    std Y+1,r24

This is optimal code, and it matches the shift patterns defined in avr.c.

The issue becomes more problematic with larger shift counts (up until 4, where the patterns in avr.c take over again), as the C++ compiler generates the same instruction sequence repeatedly, whereas avr-gcc simply generates 'lsr; lsr; lsr', as expected. Interestingly, the issue does _not_ manifest if one uses an integer division instead of a shift: dividing the unsigned char by 2 instead of shifting right by 1 emits 'lsr' as expected.
Ok. So the issue is a FE one after all. With the C FE we have:

    volatile unsigned char val;
    unsigned char local = val;
    local = local >> 1;
    val = local;

while the C++ FE preserves the widening from the integer promotion rules:

    <<cleanup_point <<< Unknown tree: expr_stmt
      (void) (local = (unsigned char) val) >>>>>;
    <<cleanup_point <<< Unknown tree: expr_stmt
      (void) (local = (unsigned char) ((int) local >> 1)) >>>>>;
    <<cleanup_point <<< Unknown tree: expr_stmt
      (void) (val = local) >>>>>;

and nothing in the middle-end shortens the right-shift operation.
I wanted to check whether this issue is present in the toolchains for other architectures, so I tested a bit.

GCC 7.2.0 on x86-64 (-O3):

C:

    movzx eax, BYTE PTR [rsp-1]
    shr al
    mov BYTE PTR [rsp-1], al
    ret

C++:

    movzx eax, BYTE PTR [rsp-1]
    sar eax
    mov BYTE PTR [rsp-1], al
    ret

While not different in performance, it _is_ generating different code, and the difference seems to reflect what Richard already found.

I am not able to reproduce any difference on MIPS64, MIPS32, ARM, ARM64, PPC, or PPC64; this is probably because backend differences keep the sequences from mapping differently. On AVR I see the issue going back to GCC 4.6.4.
According to comment 1, this is a C++ FE issue.
*** Bug 85533 has been marked as a duplicate of this bug. ***
I confirm this still occurs on x86 (!) with gcc-7.3.1 and gcc-8.0.1 when compiling as C++.
Fixed in GCC 10.