(Not sure what the component should be, just selected "other" for now)

Using shifts and bitwise OR to compose multiple bytes into a bigger integer results in suboptimal code on AVR. For example, consider a few simple functions that take two or four bytes and compose them into (big-endian) integers. Since AVR is an 8-bit platform, this essentially just means moving the bytes from the argument registers to the return value registers. However, the generated assembly is significantly bigger than that and contains obvious optimization opportunities. The example below also contains a version that uses a union to compose the integer, which gets optimized as expected (but only works on little-endian systems, since it relies on the native endianness of uint16_t).

matthijs@grubby:~$ cat foo.c
#include <stdint.h>

uint16_t join2(uint8_t a, uint8_t b) {
  return ((uint16_t)a << 8) | b;
}

uint16_t join2_efficient(uint8_t a, uint8_t b) {
  union {
    uint16_t uint;
    uint8_t arr[2];
  } tmp = {.arr = {b, a}};
  return tmp.uint;
}

uint32_t join4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

matthijs@grubby:~$ avr-gcc -c foo.c -O3 && avr-objdump -d foo.o

foo.o:     file format elf32-avr

Disassembly of section .text:

00000000 <join2>:
   0:   70 e0           ldi     r23, 0x00       ; 0
   2:   26 2f           mov     r18, r22
   4:   37 2f           mov     r19, r23
   6:   38 2b           or      r19, r24
   8:   82 2f           mov     r24, r18
   a:   93 2f           mov     r25, r19
   c:   08 95           ret

0000000e <join2_efficient>:
   e:   98 2f           mov     r25, r24
  10:   86 2f           mov     r24, r22
  12:   08 95           ret

00000014 <join4>:
  14:   0f 93           push    r16
  16:   1f 93           push    r17
  18:   02 2f           mov     r16, r18
  1a:   10 e0           ldi     r17, 0x00       ; 0
  1c:   20 e0           ldi     r18, 0x00       ; 0
  1e:   30 e0           ldi     r19, 0x00       ; 0
  20:   14 2b           or      r17, r20
  22:   26 2b           or      r18, r22
  24:   38 2b           or      r19, r24
  26:   93 2f           mov     r25, r19
  28:   82 2f           mov     r24, r18
  2a:   71 2f           mov     r23, r17
  2c:   60 2f           mov     r22, r16
  2e:   1f 91           pop     r17
  30:   0f 91           pop     r16
  32:   08 95           ret
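For completeness, a hypothetical 4-byte counterpart of join2_efficient (a minimal sketch not in the original report, with the same little-endian caveat as the 2-byte union version) shows what the shift-based join4 should ideally compile down to, namely four register moves:

uint32_t join4_efficient(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  /* Store the bytes in reverse order so that, on a little-endian
     target, a ends up in the most significant byte of the uint32_t. */
  union {
    uint32_t uint;
    uint8_t arr[4];
  } tmp = {.arr = {d, c, b, a}};
  return tmp.uint;
}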
Looks like PR41076

*** This bug has been marked as a duplicate of bug 41076 ***
Created attachment 40173 [details]
2 patterns for the 4-byte case

The 2-byte case should be improved by r242909, but the 4-byte case just leads to insane patterns if one tries to catch such code in the backend. FYI, I attached 2 patterns that would improve your code, but I didn't even consider proposing them for review...

If your intention is to swap bytes to adjust endianness, you might consider __builtin_bswap32, cf.
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fbswap32-4387

If you want to compose 32-bit values out of 8-bit values, you can use a union as indicated in your example, or you can use vectors, which are a GCC extension:

#include <stdint.h>

typedef uint8_t v4 __attribute__((vector_size (4)));

uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t) (v4) { b0, b1, b2, b3 };
}

or, equivalently:

uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    v4 vec;

    vec[0] = b0;
    vec[1] = b1;
    vec[2] = b2;
    vec[3] = b3;

    return (uint32_t) vec;
}

And as always, consider inlining such functions.
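To illustrate the __builtin_bswap32 suggestion above: if the four bytes already sit in memory in big-endian order, a plain load followed by the byte swap expresses the intent directly. The function below is only a hypothetical sketch, assuming a little-endian host such as AVR:

#include <stdint.h>
#include <string.h>

uint32_t load_be32 (const uint8_t *p)
{
    uint32_t v;
    /* native-order (little-endian) load of the 4 bytes */
    memcpy (&v, p, sizeof v);
    /* reverse the byte order to interpret them as big endian */
    return __builtin_bswap32 (v);
}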
Thanks for digging into this :-D

I suppose you meant https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=242907 instead of the commit you linked (which is also nice btw, I noticed that extra sbiw in some places as well).

Looking at the generated assembly, the optimizations look like fairly simple (composable) translations, but I assume that the optimization needs to happen before/while the assembly is generated, not afterwards. And then I can see that the patterns would indeed become complex.

My goal was indeed to compose values. Using a union is endian-dependent, which is a downside. If I understand your vector example correctly, vectors are always stored as big endian, so using this approach would be portable? I couldn't find anything about this in the documentation, though.
(In reply to Matthijs Kooijman from comment #3)
> I suppose you meant
> https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=242907 instead of the
> commit you linked (which is also nice btw, I noticed that extra sbiw in some
> places as well).

Ah yes, I meant r242907.

> Looking at the generated assembly, the optimizations look like fairly simple
> (composable) translations, but I assume that the optimization needs to
> happen before/while the assembly is generated, not afterwards. And then I
> can see that the patterns would indeed become complex.

Well, 4 extensions from u8 to u32, 3 shifts and 3 ORs make 10 operations, which would have to be written as ONE single insn if the backend were to catch this... This needs at least one intermediate pattern, because otherwise the insn combiner will not combine such complex expressions.

> My goal was indeed to compose values. Using a union is endian-dependent,
> which is a downside.

Maybe you can use GCC built-in macros like __BYTE_ORDER__ to factor out endianness?

#define __ORDER_LITTLE_ENDIAN__ 1234
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__

(as displayed with -E -dM | grep END)

> If I understand your vector example correctly, vectors are always stored as
> big endian, so using this approach would be portable? I couldn't find
> anything about this in the documentation, though.

Looking at the code that the example

typedef uint8_t v4 __attribute__((vector_size (4)));

uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    v4 vec;

    vec[0] = b0;
    vec[1] = b1;
    vec[2] = b2;
    vec[3] = b3;

    return (uint32_t) vec;
}

generates, it's little endian, i.e. b0 is stored in the low byte of vec, so that vectors are similar to arrays:

join4:
    mov r23,r22  ; b1
    mov r25,r18  ; b3
    mov r22,r24  ; b0
    mov r24,r20  ; b2
    ret

On the tree level, this is represented as BIT_INSERT_EXPR (from -fdump-tree-optimized):

join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
  v4 vec;
  uint32_t _6;

  <bb 2>:
  vec_8 = BIT_INSERT_EXPR <vec_7(D), b0_2(D), 0 (8 bits)>;
  vec_9 = BIT_INSERT_EXPR <vec_8, b1_3(D), 8 (8 bits)>;
  vec_10 = BIT_INSERT_EXPR <vec_9, b2_4(D), 16 (8 bits)>;
  vec_11 = BIT_INSERT_EXPR <vec_10, b3_5(D), 24 (8 bits)>;
  _6 = VIEW_CONVERT_EXPR<uint32_t>(vec_11);
  return _6;
}

Hence an open-coded composition of byte values into a 4-byte value can be represented neatly on the tree level, and this is quite an improvement over ZERO_EXTEND + ASHIFT + IOR w.r.t. the resulting code. Thus this turns into a tree-level optimization issue.
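As a side note, the __BYTE_ORDER__ macro mentioned above could be used to make the union approach endianness-independent. The following is only a hypothetical sketch (not part of this report), assuming the target is either little or big endian; it is meant to produce the same result as the vector-based join4 above, i.e. b0 in the low byte:

#include <stdint.h>

uint32_t join4_union (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    /* Little endian: the first array element is the low byte, so b0
       ends up in the least significant byte of the result. */
    union { uint32_t u; uint8_t arr[4]; } tmp = { .arr = { b0, b1, b2, b3 } };
#else
    /* Big endian: store the bytes in the opposite order. */
    union { uint32_t u; uint8_t arr[4]; } tmp = { .arr = { b3, b2, b1, b0 } };
#endif
    return tmp.u;
}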