[Bug middle-end/85721] bad codegen for looped copy of primitives at -O2 and -O3 (differently bad)
redbeard0531 at gmail dot com
gcc-bugzilla@gcc.gnu.org
Thu May 10 06:10:00 GMT 2018
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721
--- Comment #4 from Mathias Stearn <redbeard0531 at gmail dot com> ---
Marc Glisse pointed out at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720#c3 that I missed an
aliasing case when I created this ticket. memmove isn't a valid replacement if
out is in the range (in, in + n), because the forward loop then reads bytes it
has already overwritten. I did some benchmarking to see what the best solution
is and how much this matters. This seems to do the best on Sandy Bridge,
Haswell, and an Opteron 6344 (Piledriver):
[[gnu::noinline, gnu::optimize("s")]]
void copy0(char* out, const char* in, size_t n) {
    if (n >= 8 && __builtin_expect(out >= in + n || out + n <= in, 1)) {
        memcpy(out, in, n);
        return;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}
copy0(char*, char const*, unsigned long):
        cmp     rdx, 7
        jbe     .L7
        lea     rax, [rsi+rdx]
        cmp     rdi, rax
        jnb     .L3
        lea     rax, [rdi+rdx]
        cmp     rsi, rax
        jb      .L7
.L3:
        jmp     memcpy
.L7:
        xor     eax, eax
.L5:
        cmp     rax, rdx
        je      .L1
        mov     cl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        inc     rax
        jmp     .L5
.L1:
        ret
With char, it is substantially faster than the current codegen for the original
loop at -O2 and moderately faster than -O3, while being about 10% of the size.
With a TriviallyCopyable type with a non-trivial default ctor, even -O3 copies
byte-by-byte, so it is a substantial win there as well.
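For reference, a type of the kind described above might look like the sketch
below. Elem and copy_elems are hypothetical names chosen for illustration, not
the ones used in the benchmark; the static_asserts only confirm that the type
is trivially copyable while its default constructor is not trivial.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical element type: the user-provided default constructor makes
// default construction non-trivial, but copying is still trivial.
struct Elem {
    int value;
    Elem() : value(0) {}
};

static_assert(std::is_trivially_copyable<Elem>::value,
              "Elem should be trivially copyable");
static_assert(!std::is_trivially_default_constructible<Elem>::value,
              "Elem should have a non-trivial default ctor");

// Same loop shape as the char version, applied to Elem (hypothetical helper).
void copy_elems(Elem* out, const Elem* in, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}
```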
Let me know if you'd like me to post the benchmark I was using.
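To make the aliasing case concrete, here is a minimal sketch (not the
benchmark) that runs the forward loop, memmove, and a guarded copy shaped like
copy0 on an overlapping range with out = in + 2. The names loop_copy,
guarded_copy and overlap_result are made up for this illustration, and the
gnu:: attributes and __builtin_expect are dropped for portability.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Forward byte-by-byte copy, same semantics as the original loop.
void loop_copy(char* out, const char* in, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

// Guarded variant shaped like copy0: memcpy only when the ranges are disjoint.
void guarded_copy(char* out, const char* in, std::size_t n) {
    if (n >= 8 && (out >= in + n || out + n <= in)) {
        std::memcpy(out, in, n);
        return;
    }
    loop_copy(out, in, n);
}

// Copies 8 bytes from buf to buf + 2, so out lies inside (in, in + n).
// which: 0 = forward loop, 1 = memmove, anything else = guarded copy.
std::string overlap_result(int which) {
    char buf[] = "abcdefghij";  // 10 characters plus the terminating NUL
    if (which == 0) {
        loop_copy(buf + 2, buf, 8);
    } else if (which == 1) {
        std::memmove(buf + 2, buf, 8);
    } else {
        guarded_copy(buf + 2, buf, 8);
    }
    return std::string(buf, 10);
}
```

With this input the forward loop yields "ababababab" while memmove yields
"ababcdefgh"; the guarded copy falls through to the loop and so matches the
loop's result.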