[Bug rtl-optimization/78963] New: Missed optimization opportunity in copies of small unaligned data
eyalroz at technion dot ac.il
gcc-bugzilla@gcc.gnu.org
Sun Jan 1 10:26:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78963
Bug ID: 78963
Summary: Missed optimization opportunity in copies of small
unaligned data
Product: gcc
Version: 6.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: eyalroz at technion dot ac.il
Target Milestone: ---
Preliminary notes:
* This bug report stems from a StackOverflow question I asked:
http://stackoverflow.com/q/41407257/1593077
* This bug concerns the x86_64 architecture, but may apply elsewhere.
* This bug concerns -O3 optimizations.
* Everything described here is about the same for GCC 6.3 and for GCC 7
(whichever version of GCC 7 GodBolt uses).
* The entire bug is demonstrated here: https://godbolt.org/g/lDJSRm and here:
https://godbolt.org/g/9Y2ebd
Consider the task of copying 3-byte values from one place to another. If both
those places are in memory, it seems reasonable to do four moves (a 2-byte load
and store, then a 1-byte load and store), and indeed GCC compiles this:
#include <string.h>
typedef struct { unsigned char data[3]; } uint24_t;
void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src) {
memcpy(dest,src,3); }
into this (omitting the final ret; f returns no value):
f(uint24_t*, uint24_t*):
movzx eax, WORD PTR [rsi]
mov WORD PTR [rdi], ax
movzx eax, BYTE PTR [rsi+2]
mov BYTE PTR [rdi+2], al
If the source or the destination is a register, two mov's should suffice -
just the two loads or just the two stores from the listing above. However, if I
write this (perhaps contrived, but likely representative of what could happen in
larger programs, especially ones with multiple translation units, or when the OS
gives you a pointer to work with, etc.):
#include <string.h>
typedef struct { unsigned char data[3]; } uint24_t;
void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src) {
memcpy(dest,src,3); }
int main() {
uint24_t* p = (uint24_t*) 48;
unsigned x;
f((uint24_t*) &x,p);
x += 1;
f(p,(uint24_t*) &x);
return 0;
}
The 3-byte value is "constructed" on the stack rather than in a register (first
four mov's), and then, because it lives in memory, four more mov's are needed to
copy it to the destination:
movzx eax, WORD PTR ds:48
mov WORD PTR [rsp-4], ax
movzx eax, BYTE PTR ds:50
mov BYTE PTR [rsp-2], al
add DWORD PTR [rsp-4], 1
movzx eax, WORD PTR [rsp-4]
mov WORD PTR ds:48, ax
movzx eax, BYTE PTR [rsp-2]
mov BYTE PTR ds:50, al
If we do this with 4-byte values, i.e. replace uint24_t with uint32_t, it's a
single mov both ways, and in fact it gets further optimized, so that this:
#include <string.h>
#include <stdint.h>
void f(uint32_t* __restrict__ dest, uint32_t* __restrict__ src)
{
memcpy(dest,src,4);
}
int main() {
uint32_t* p = (uint32_t*) 48;
uint32_t x;
f(&x,p);
x += 1;
f(p,&x);
return 0;
}
is compiled into just this:
add DWORD PTR ds:48, 1
Now obviously you can't expect to optimize out _that_ much with a 3-byte value,
but 2 mov's in and 2 mov's out should be enough. Indeed, clang (since at least
3.4.1 or so) emits this for the uint24_t code:
movzx eax, byte ptr [50]
shl eax, 16
movzx ecx, word ptr [48]
lea eax, [rcx + rax + 1]
mov word ptr [48], ax
shr eax, 16
mov byte ptr [50], al
which has just four mov's.
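For reference, the register-based sequence can be written out by hand in C. This is only a sketch of the transformation being asked for, not what either compiler does internally: the helper names load24, store24 and increment24 are made up for illustration, and the code assumes a little-endian target such as x86_64. It operates on an ordinary object rather than on address 48, so it can actually be run.

```c
#include <stdint.h>
#include <string.h>

typedef struct { unsigned char data[3]; } uint24_t;

/* Hypothetical helper: assemble the 3-byte value in a register.
   Assumes little-endian byte order, as on x86_64. */
static inline uint32_t load24(const uint24_t *src) {
    uint16_t lo;
    memcpy(&lo, src->data, sizeof lo);                    /* one 16-bit load */
    return (uint32_t)lo | ((uint32_t)src->data[2] << 16); /* one 8-bit load  */
}

/* Hypothetical helper: write the low 24 bits of v back to memory. */
static inline void store24(uint24_t *dest, uint32_t v) {
    uint16_t lo = (uint16_t)v;
    memcpy(dest->data, &lo, sizeof lo);                   /* one 16-bit store */
    dest->data[2] = (unsigned char)(v >> 16);             /* one 8-bit store  */
}

/* The same read-modify-write as in main() above: two loads in, two stores out. */
static inline void increment24(uint24_t *p) {
    store24(p, load24(p) + 1);
}
```

With the value held in a 32-bit register between the load and the store, the addition needs no stack slot, which is the shape of the clang output above.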