When doing whole-byte shifts, gcc usually optimizes away the shifts and ends up moving data between registers instead. However, it seems this doesn't happen when uint64_t is used. Here's an example (assembler output slightly trimmed of unrelated comments and annotations etc.): matthijs@grubby:~$ cat test.cpp #include <stdint.h> uint8_t foo64_8(uint64_t a) { return a >> 8; } uint16_t foo64_16(uint64_t a) { return a >> 8; } uint8_t foo32_8(uint32_t a) { return a >> 8; } uint16_t foo32_16(uint32_t a) { return (a >> 8); } matthijs@grubby:~$ avr-gcc -Os test.cpp -S -o - _Z7foo64_8y: push r16 ldi r16,lo8(8) rcall __lshrdi3 mov r24,r18 pop r16 ret _Z8foo64_16y: push r16 ldi r16,lo8(8) rcall __lshrdi3 mov r24,r18 mov r25,r19 pop r16 ret _Z7foo32_8m: mov r24,r23 ret _Z8foo32_16m: clr r27 mov r26,r25 mov r25,r24 mov r24,r23 ret .ident "GCC: (GNU) 4.9.2 20141224 (prerelease)" The output is identical for 4.8.1 on Debian, and the above 4.9.2 on Arch. I haven't found a readily available 5.x package yet to test. As you can see, the versions operating on 64 bit values preserve the 8-bit shift (which is very inefficient on AVR), while the versions running on 32 bit values simply copy the right registers. The foo32_16 function still has some useless instructions (r27 and r26 are not part of the return value, not sure why these are set) but that is probably an unrelated problem. I've marked this with component "target", since I think these optimizations are avr-specific (or at least not applicable to bigger architectures).
(In reply to Matthijs Kooijman from comment #0) > I haven't found a readily available 5.x package yet to test. It's the same. > As you can see, the versions operating on 64 bit values preserve the > 8-bit shift (which is very inefficient on AVR), while the versions > running on 32 bit values simply copy the right registers. Lib functions are used because users complained about bloated 64-bit arithmetic. Notice that indide these 64-bit shift functions byte-shifts are used. > The foo32_16 function still has some useless instructions (r27 and r26 > are not part of the return value, not sure why these are set) but that > is probably an unrelated problem. Yes. > I've marked this with component "target", since I think these > optimizations are avr-specific (or at least not applicable to bigger > architectures).
So, IIUC, this is quite hard to fix? Either you use lib functions, which prevents the optimizer from just relabeling or coyping registers to apply shifting, or you don't and then more complex operations will become very verbose and messy? Would it make sense (and be possible) to add a special case to not use lib functions for shifts by a constant number of bits that is also a multiple of 8? At first glance, that would make a lot of common cases (where an integer is decomposed into separate bytes or other parts) a lot faster, while still keeping the lib functions for more complex operations?
(In reply to Matthijs Kooijman from comment #2) > So, IIUC, this is quite hard to fix? Either you use lib functions, which > prevents the optimizer from just relabeling or coyping registers to apply > shifting, or you don't and then more complex operations will become very > verbose and messy? avr-gcc is using lib functions, but not as libcall but as transparent call instead. You'll find the implementation in avr-dimode.md. avr-gcc has no proper DImode support. For reasoning cf. the comments in the head of that file.
Created attachment 54871 [details] proposed patch Proposed patch, using a peephole2 in avr-dimode.md to inline calls to __lshrdi3 that require only a single instruction or two (due to truncation). For truncations to char, this is smaller and faster, and for truncations to unsigned short this is the same size, but faster. The drawback is that performing this late (in peephole2) doesn't eliminate the stack frame prolog/epilog. Thoughts?
(In reply to Roger Sayle from comment #4) > Created attachment 54871 [details] > proposed patch > > Proposed patch, using a peephole2 in avr-dimode.md to inline calls to > __lshrdi3 that require only a single instruction or two (due to truncation). > For truncations to char, this is smaller and faster, and for truncations to > unsigned short this is the same size, but faster. The drawback is that > performing this late (in peephole2) doesn't eliminate the stack frame > prolog/epilog. Thoughts? Looks good to me. Many thanks!