Bug 60145 - [AVR] Suboptimal code for byte order shuffling using shift and or
Summary: [AVR] Suboptimal code for byte order shuffling using shift and or
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 65964
 
Reported: 2014-02-11 14:17 UTC by Matthijs Kooijman
Modified: 2016-11-28 14:15 UTC (History)
1 user

See Also:
Host:
Target: avr
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-11-28 00:00:00


Attachments
2 patterns for the 4-byte case (524 bytes, text/markdown)
2016-11-28 10:16 UTC, Georg-Johann Lay

Description Matthijs Kooijman 2014-02-11 14:17:19 UTC
(Not sure what the component should be, just selected "other" for now)

Using shifts and bitwise-or to compose multiple bytes into a bigger integer
results in suboptimal code on AVR.

For example, a few simple functions that take two or four bytes and
compose them into (big endian) integers. Since AVR is an 8-bit platform,
this essentially just means moving two bytes from the argument register
to the return value registers. However, the generated assembly is
significantly bigger than that and contains obvious optimization
opportunities.

The example below also contains a version that uses a union to compose the
integer, which gets optimized as expected (but only works on little-endian
systems, since it relies on the native endianness of uint16_t).

matthijs@grubby:~$ cat foo.c
#include <stdint.h>

uint16_t join2(uint8_t a, uint8_t b) {
        return ((uint16_t)a << 8) | b;
}

uint16_t join2_efficient(uint8_t a, uint8_t b) {
        union {
                uint16_t uint;
                uint8_t arr[2];
        } tmp = {.arr = {b, a}};
        return tmp.uint;
}

uint32_t join4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
        return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}
matthijs@grubby:~$ avr-gcc -c foo.c -O3 && avr-objdump -d foo.o

foo.o:     file format elf32-avr


Disassembly of section .text:

00000000 <join2>:
   0:   70 e0           ldi     r23, 0x00       ; 0
   2:   26 2f           mov     r18, r22
   4:   37 2f           mov     r19, r23
   6:   38 2b           or      r19, r24
   8:   82 2f           mov     r24, r18
   a:   93 2f           mov     r25, r19
   c:   08 95           ret

0000000e <join2_efficient>:
   e:   98 2f           mov     r25, r24
  10:   86 2f           mov     r24, r22
  12:   08 95           ret

00000014 <join4>:
  14:   0f 93           push    r16
  16:   1f 93           push    r17
  18:   02 2f           mov     r16, r18
  1a:   10 e0           ldi     r17, 0x00       ; 0
  1c:   20 e0           ldi     r18, 0x00       ; 0
  1e:   30 e0           ldi     r19, 0x00       ; 0
  20:   14 2b           or      r17, r20
  22:   26 2b           or      r18, r22
  24:   38 2b           or      r19, r24
  26:   93 2f           mov     r25, r19
  28:   82 2f           mov     r24, r18
  2a:   71 2f           mov     r23, r17
  2c:   60 2f           mov     r22, r16
  2e:   1f 91           pop     r17
  30:   0f 91           pop     r16
  32:   08 95           ret
Comment 1 Georg-Johann Lay 2014-02-17 16:35:13 UTC
Looks like PR41076

*** This bug has been marked as a duplicate of bug 41076 ***
Comment 2 Georg-Johann Lay 2016-11-28 10:16:51 UTC
Created attachment 40173 [details]
2 patterns for the 4-byte case

The 2-byte case should be improved by r242909, but the 4-byte case just leads to insane patterns if one tries to catch such code in the backend.  FYI, I attached 2 patterns that would improve your code, but I didn't even consider proposing them for review...

If your intention is to swap bytes to adjust endianness, you might consider __builtin_bswap32, cf.
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fbswap32-4387
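
A minimal sketch of that approach (my illustration, not from the report; the helper name join4_bswap is hypothetical): build the value in host order, which on AVR and other little-endian targets is the cheap layout, then let __builtin_bswap32 reverse the byte order.

```c
#include <stdint.h>

/* join4_bswap: hypothetical helper. On a little-endian host, composing
   the bytes low-to-high is just storing them in memory order; the
   builtin then reverses the four bytes, yielding the same result as
   the shift-and-or join4 ((a << 24) | (b << 16) | (c << 8) | d). */
uint32_t join4_bswap(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint32_t host = (uint32_t)a | ((uint32_t)b << 8)
                  | ((uint32_t)c << 16) | ((uint32_t)d << 24);
    return __builtin_bswap32(host);
}
```

__builtin_bswap32 is a GCC builtin, so this sketch is not portable to non-GCC compilers.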

If you want to compose 32-bit values out of 8-bit values, you can use a union as indicated in your example, or you can use vectors, which are a GCC extension:

#include <stdint.h>

typedef uint8_t v4 __attribute__((vector_size (4)));

uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t) (v4) { b0, b1, b2, b3 };
}

/* Alternative, equivalent definition (only one of the two can appear
   in a single translation unit): */
uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    v4 vec;
    vec[0] = b0;
    vec[1] = b1;
    vec[2] = b2;
    vec[3] = b3;
    return (uint32_t) vec;
}

And as always, consider inlining such functions.
Comment 3 Matthijs Kooijman 2016-11-28 12:32:50 UTC
Thanks for digging into this :-D

I suppose you meant https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=242907 instead of the commit you linked (which is also nice btw, I noticed that extra sbiw in some places as well).

Looking at the generated assembly, the optimizations look like fairly simple (composable) translations, but I assume that the optimization needs to happen before/while the assembly is generated, not afterwards. And then I can see that the patterns would indeed become complex.

My goal was indeed to compose values. Using a union is endian-dependent, which is a downside.

If I understand your vector-example correctly, vectors are always stored as big endian, so using this approach would be portable? I couldn't find anything about this in the documentation, though.
Comment 4 Georg-Johann Lay 2016-11-28 14:15:23 UTC
(In reply to Matthijs Kooijman from comment #3)
> I suppose you meant
> https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=242907 instead of the
> commit you linked (which is also nice btw, I noticed that extra sbiw in some
> places as well).

Ah yes, I meant r242907.

> Looking at the generated assembly, the optimizations look like fairly simple
> (composable) translations, but I assume that the optimization needs to
> happen before/while the assembly is generated, not afterwards. And then I
> can see that the patterns would indeed become complex.

Well, 4 extensions from u8 to u32, 3 shifts and 3 ORs make 10 operations, which would have to be written as ONE single insn if the backend were to catch this...  This needs at least one intermediate pattern, because otherwise the insn combiner will not combine such complex expressions.

> My goal was indeed to compose values. Using a union is endian-dependent,
> which is a downside.

Maybe you can use GCC built-in macros like __BYTE_ORDER__ to factor out endianness?

#define __ORDER_LITTLE_ENDIAN__ 1234
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__

(as displayed with -E -dM | grep END)
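
A sketch of how that could look (my illustration; the name join4_union is hypothetical): select the union initializer order at compile time with GCC's predefined __BYTE_ORDER__ macro, so the union trick from the original report works on either endianness.

```c
#include <stdint.h>

/* join4_union: returns (a << 24) | (b << 16) | (c << 8) | d on any
   byte order, by choosing which byte goes into which array slot. */
uint32_t join4_union(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    union {
        uint32_t u;
        uint8_t arr[4];
    } tmp =
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        { .arr = { d, c, b, a } };  /* low byte stored first */
#else
        { .arr = { a, b, c, d } };  /* big endian: high byte first */
#endif
    return tmp.u;
}
```

__BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ are predefined by GCC (and Clang), so this stays within the same compiler family the bug report already targets.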

> If I understand your vector-example correctly, vectors are always stored as
> big endian, so using this approach would be portable? I couldn't find
> anything about this in the documentation, though.

Looking at the code the example

typedef uint8_t v4 __attribute__((vector_size (4)));

uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    v4 vec;
    vec[0] = b0;
    vec[1] = b1;
    vec[2] = b2;
    vec[3] = b3;
    return (uint32_t) vec;
}

generates, it's little-endian: b0 is stored in the low byte of vec, so vectors behave like arrays:

join4:
	mov r23,r22 ; b1
	mov r25,r18 ; b3
	mov r22,r24 ; b0
	mov r24,r20 ; b2
	ret

On tree level, this is represented as BIT_INSERT_EXPR (from -fdump-tree-optimized)

join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
  v4 vec;
  uint32_t _6;

  <bb 2>:
  vec_8  = BIT_INSERT_EXPR <vec_7(D), b0_2(D), 0 (8 bits)>;
  vec_9  = BIT_INSERT_EXPR <vec_8, b1_3(D), 8 (8 bits)>;
  vec_10 = BIT_INSERT_EXPR <vec_9, b2_4(D), 16 (8 bits)>;
  vec_11 = BIT_INSERT_EXPR <vec_10, b3_5(D), 24 (8 bits)>;
  _6 = VIEW_CONVERT_EXPR<uint32_t>(vec_11);
  return _6;
}

Hence an open-coded composition of byte values into a 4-byte value can be represented neatly at tree level, and this is quite an improvement over ZERO_EXTEND + ASHIFT + IOR w.r.t. the resulting code.

Thus this turns into a tree-level optimization issue.