[Bug target/81274] New: x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
cody at codygray dot com
gcc-bugzilla@gcc.gnu.org
Sun Jul 2 06:27:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274
Bug ID: 81274
Summary: x86 optimizer emits unnecessary LEA instruction when
using AVX intrinsics
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: cody at codygray dot com
Target Milestone: ---
Target: i?86-*-*
When AVX intrinsics are used in a function, the x86-32 optimizer emits
unnecessary LEA instructions that clobber a register, forcing it to be
preserved at additional expense.
Test Code:
----------
#include <immintrin.h>
__m256 foo(const float *x)
{
__m256 ymmX = _mm256_load_ps(&x[0]);
return _mm256_addsub_ps(ymmX, ymmX);
}
Compile with: "-m32 -mtune=generic -mavx -O2"
The bug also reproduces at -O1 and -O3, and when tuning for any architecture
that supports AVX (it is not specific to "generic" tuning).
It also does not matter whether the code is compiled as C or C++.
This behavior is exhibited by *all* versions of GCC that support AVX code
generation, from at least 4.9.0 through 8.0.0 (20170701).
The code compiles warning-free, of course.
See it live on Godbolt: https://godbolt.org/g/NDDgsA
Actual Disassembly:
-------------------
foo: # -O2 or -O3
pushl %ecx
movl 8(%esp), %eax
leal 8(%esp), %ecx
vmovaps (%eax), %ymm0
popl %ecx
vaddsubps %ymm0, %ymm0, %ymm0
ret
The LEA instruction redundantly computes the address of the parameter's stack
slot into ECX, and that value is then promptly discarded. Clobbering ECX also
has spill-over effects, requiring that additional code be emitted to preserve
the register's original value (the PUSH+POP pair).
The same bug is observed at -O1, but the ordering of the instructions is
slightly different, and there the address computed into ECX is actually used to
load EAX, further lengthening the dependency chain for no benefit whatsoever.
foo: # -O1
pushl %ecx
leal 8(%esp), %ecx
movl (%ecx), %eax
vmovaps (%eax), %ymm0
vaddsubps %ymm0, %ymm0, %ymm0
popl %ecx
ret
Expected Disassembly:
---------------------
foo:
movl 8(%esp), %eax
vmovaps (%eax), %ymm0
vaddsubps %ymm0, %ymm0, %ymm0
ret
Or better yet:
foo:
vmovaps 8(%esp), %ymm0
vaddsubps %ymm0, %ymm0, %ymm0
ret
The expected code shown above is already generated for x86-64 builds (-m64), so
this optimization deficiency affects only x86-32 builds (-m32).