[Bug rtl-optimization/78963] New: Missed optimization opportunity in copies of small unaligned data

Sun Jan 1 10:26:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78963

            Bug ID: 78963
           Summary: Missed optimization opportunity in copies of small
                    unaligned data
           Product: gcc
           Version: 6.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: eyalroz at technion dot ac.il
  Target Milestone: ---

Preliminary notes:
* This bug report stems from a StackOverflow question I asked:
http://stackoverflow.com/q/41407257/1593077
* This bug concerns the x86_64 architecture, but may apply elsewhere.
* This bug concerns code compiled with -O3.
* The behavior described here is essentially the same for GCC 6.3 and for
GCC 7 (whichever version of 7 GodBolt uses).
* The entire bug is demonstrated here: https://godbolt.org/g/lDJSRm plus here
https://godbolt.org/g/9Y2ebd

Consider the task of copying 3-byte values from one place to another. If both
those places are in memory, it seems reasonable to do four moves (a 2-byte load
and store, then a 1-byte load and store), and indeed GCC compiles this:

  #include <string.h>

  typedef struct { unsigned char data[3]; } uint24_t;

  void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src) {
    memcpy(dest,src,3);
  }

into this (omitting the final ret instruction):

  f(uint24_t*, uint24_t*):
          movzx   eax, WORD PTR [rsi]
          mov     WORD PTR [rdi], ax
          movzx   eax, BYTE PTR [rsi+2]
          mov     BYTE PTR [rdi+2], al

If the source or the destination is a register, two mov's should suffice:
either the two loads or the two stores of the above. However, if I write this
(perhaps contrived, but likely demonstrative of what could happen in larger
programs, especially with multiple translation units, or when the OS gives you
a pointer to work with, etc.):

  #include <string.h>

  typedef struct { unsigned char data[3]; } uint24_t;

  void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src) {
    memcpy(dest,src,3);
  }

  int main() {
    uint24_t* p = (uint24_t*) 48;
    unsigned x;
    f((uint24_t*) &x,p);
    x += 1;
    f(p,(uint24_t*) &x);
    return 0;
  }

The 3-byte value is "constructed" on the stack rather than in a register (first
four mov's), and then one cannot avoid using four more mov's to copy it to the
destination:

        movzx   eax, WORD PTR ds:48
        mov     WORD PTR [rsp-4], ax
        movzx   eax, BYTE PTR ds:50
        mov     BYTE PTR [rsp-2], al
        add     DWORD PTR [rsp-4], 1
        movzx   eax, WORD PTR [rsp-4]
        mov     WORD PTR ds:48, ax
        movzx   eax, BYTE PTR [rsp-2]
        mov     BYTE PTR ds:50, al
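
The test case above dereferences the fixed address 48, so it is useful for
reading the generated assembly but will most likely fault if actually executed.
A minimal sketch of a runnable variant follows. It is an assumption-laden
illustration, not part of the original test case: roundtrip_increment is a
hypothetical helper name, x is zero-initialized here (the original leaves it
uninitialized), and a little-endian target such as x86_64 is assumed.

```c
#include <string.h>

typedef struct { unsigned char data[3]; } uint24_t;

void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src) {
  memcpy(dest, src, 3);
}

/* Hypothetical helper: performs the same steps as main() in the test case,
   but on a caller-supplied pointer instead of the fixed address 48. */
void roundtrip_increment(uint24_t* p) {
  unsigned x = 0;            /* zeroed so the unused fourth byte is defined */
  f((uint24_t*) &x, p);      /* 3 bytes in: fills the low 3 bytes of x */
  x += 1;
  f(p, (uint24_t*) &x);      /* 3 bytes out: any carry into byte 4 is lost */
}
```

On a little-endian machine, calling roundtrip_increment on a buffer holding the
bytes fe ff 00 should leave it holding ff ff 00.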

If we do this with 4-byte values, i.e. replace uint24_t with uint32_t, it's a
single mov both ways, and in fact it gets further optimized, so that this:

  #include <string.h>
  #include <stdint.h> 

  void f(uint32_t* __restrict__ dest, uint32_t* __restrict__ src)
  {
    memcpy(dest,src,4);
  }

  int main() {
    uint32_t* p = (uint32_t*) 48;
    uint32_t x;
    f(&x,p);
    x += 1;
    f(p,&x);
    return 0;
  }

is compiled into just this:

        add     DWORD PTR ds:48, 1

Now obviously you can't expect to optimize-out _that_ much with a 3-byte value,
but 2 mov's in and 2 mov's out should be enough. Indeed, clang (since at least
3.4.1 or so) emits this for the uint24_t code:

        movzx   eax, byte ptr [50]
        shl     eax, 16
        movzx   ecx, word ptr [48]
        lea     eax, [rcx + rax + 1]
        mov     word ptr [48], ax
        shr     eax, 16
        mov     byte ptr [50], al

which has just four mov's (two loads and two stores), with the increment itself
done in a register.
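
The register-based sequence above can be approximated in C by assembling the
3-byte value by hand: one 16-bit load and one 8-bit load combined with a shift,
then the reverse on the way out. The following is only a sketch under stated
assumptions: load24, store24 and increment24 are hypothetical names, a
little-endian target such as x86_64 is assumed, and whether GCC 6.3 emits the
shorter sequence for this source has not been checked here.

```c
#include <stdint.h>
#include <string.h>

typedef struct { unsigned char data[3]; } uint24_t;

/* Hypothetical helper: read 3 bytes into a register-sized integer using
   one 16-bit load and one 8-bit load (little-endian layout assumed). */
static inline uint32_t load24(const uint24_t* p) {
  uint16_t lo;
  memcpy(&lo, p->data, sizeof lo);          /* alignment-safe 16-bit load */
  return (uint32_t) lo | ((uint32_t) p->data[2] << 16);
}

/* Hypothetical helper: write the low 24 bits of v using one 16-bit store
   and one 8-bit store. Bits 24..31 of v are discarded. */
static inline void store24(uint24_t* p, uint32_t v) {
  uint16_t lo = (uint16_t) v;
  memcpy(p->data, &lo, sizeof lo);          /* alignment-safe 16-bit store */
  p->data[2] = (unsigned char) (v >> 16);
}

/* Increment a 3-byte value in place, wrapping modulo 2^24. */
void increment24(uint24_t* p) {
  store24(p, load24(p) + 1);
}
```

Compared with copying through a stack temporary, this form never asks the
compiler for a 3-byte memcpy, so the intermediate value can stay in a register.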