[Bug middle-end/85721] bad codegen for looped copy of primitives at -O2 and -O3 (differently bad)

Thu May 10 06:10:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721

--- Comment #4 from Mathias Stearn <redbeard0531 at gmail dot com> ---
Marc Glisse pointed out at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720#c3 that I missed an
aliasing case when I created this ticket: memmove isn't a valid replacement if
out is in the range (in, in + n), because the forward loop then reads back
bytes it has already written. I did some benchmarking to see what the best
solution is and how much this matters. This seems to do the best on
Sandy Bridge, Haswell, and an Opteron 6344 (Piledriver):

[[gnu::noinline, gnu::optimize("s")]] void copy0(char* out, const char* in,
size_t n) {
    if (n >= 8 && __builtin_expect(out >= in + n || out + n <= in, 1)) {
        memcpy(out, in, n);
        return;
    }
    for (size_t i = 0; i < n; i++){
        out[i] = in[i];
    }
}

copy0(char*, char const*, unsigned long):
        cmp     rdx, 7
        jbe     .L7
        lea     rax, [rsi+rdx]
        cmp     rdi, rax
        jnb     .L3
        lea     rax, [rdi+rdx]
        cmp     rsi, rax
        jb      .L7
.L3:
        jmp     memcpy
.L7:
        xor     eax, eax
.L5:
        cmp     rax, rdx
        je      .L1
        mov     cl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        inc     rax
        jmp     .L5
.L1:
        ret
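
To make the aliasing case concrete, here is a minimal sketch of why the
fallback loop has to stay. The helper names (loop_copy, overlap_with_loop,
overlap_with_memmove) are made up for illustration and are not part of the
benchmark. It runs the same forward loop and memmove with out = in + 1:

```cpp
#include <cassert>   // only needed for the checks in the note below
#include <cstddef>
#include <cstring>
#include <string>

// Forward byte-by-byte copy, the same shape as the loop in copy0.
inline void loop_copy(char* out, const char* in, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

// out = in + 1, so out lies inside (in, in + n): every iteration reads
// the byte that the previous iteration just wrote.
inline std::string overlap_with_loop() {
    char buf[] = "abcdef";
    loop_copy(buf + 1, buf, 5);
    return buf;
}

// Same arguments, but memmove copies as if through a temporary buffer,
// so it preserves the original source bytes.
inline std::string overlap_with_memmove() {
    char buf[] = "abcdef";
    std::memmove(buf + 1, buf, 5);
    return buf;
}
```

The loop smears the first byte across the buffer and yields "aaaaaa", while
memmove yields "aabcde", so the two are only interchangeable when that
overlap is ruled out.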

With char, it is substantially faster than the current codegen for the original
loop at -O2 and moderately faster than at -O3, while being about 10% of the size.
With a TriviallyCopyable type with a non-trivial default ctor, even -O3 copies
byte-by-byte, so it is a substantial win there as well.
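
For reference, a rough sketch of what such a type and a generic version of the
guarded copy could look like. Wrapped and copy_n_guarded are hypothetical names,
not the code I benchmarked, and reusing the n >= 8 threshold for arbitrary T is
an assumption (it was only measured for char):

```cpp
#include <cassert>   // only needed for the checks in the note below
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical example type: trivially copyable, but the user-provided
// default constructor makes default construction non-trivial.
struct Wrapped {
    int value;
    Wrapped() : value(0) {}
};

static_assert(std::is_trivially_copyable<Wrapped>::value,
              "Wrapped can be copied with memcpy");
static_assert(!std::is_trivially_default_constructible<Wrapped>::value,
              "Wrapped has a non-trivial default ctor");

// Same structure as copy0, generalized over T (GCC/Clang builtin used).
template <typename T>
void copy_n_guarded(T* out, const T* in, std::size_t n) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable T");
    // Fast path: enough elements and no overlap between the two ranges.
    if (n >= 8 && __builtin_expect(out >= in + n || out + n <= in, 1)) {
        std::memcpy(static_cast<void*>(out), in, n * sizeof(T));
        return;
    }
    // Fallback: element-by-element forward copy, valid for any overlap.
    for (std::size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}
```

Copying 16 elements between two separate Wrapped arrays goes through the
memcpy branch; shorter or overlapping ranges fall back to the loop.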

Let me know if you'd like me to post the benchmark I was using.