[Bug c/63756] New: _mm_cvtepi16_epi32 with a memory operand produces either broken or slow asm
tterribe at xiph dot org
gcc-bugzilla@gcc.gnu.org
Wed Nov 5 22:38:00 GMT 2014
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63756
Bug ID: 63756
Summary: _mm_cvtepi16_epi32 with a memory operand produces
either broken or slow asm
Product: gcc
Version: 4.9.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: tterribe at xiph dot org
Created attachment 33900
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33900&action=edit
Reduced testcase
With optimizations enabled, the call
_mm_cvtepi16_epi32(*(__m128i *)x)
for some pointer x produces the asm
pmovsxwd (%rax), %xmm0
which is all well and good, and what was intended.
However, with optimizations disabled, the same code produces
movdqa (%rax), %xmm0
movaps %xmm0, -48(%rbp)
movdqa -48(%rbp), %xmm0
pmovsxwd %xmm0, %xmm0
The problem here is that the initial movdqa has added a 16-byte alignment
requirement and reads 8 bytes past where the original pmovsxwd instruction
would have read in the optimized version. This is very much not equivalent, and
causes crashes in code that runs just fine in the optimized version.
_mm_cvtepi16_epi32() takes an __m128i argument, and the dereference happens
before the function call. Even though the asm instruction it stands in for
can do the load and the conversion in a single operation, we don't have a
single intrinsic which specifies exactly that. None of the semantics here are
very well documented anywhere, but
I can understand why the compiler might think it has the right to do what it
did. So I try the following code:
_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)x))
With optimizations disabled, this produces the slightly long-winded, but at
least correct asm of:
movq (%rax), %rax
movl $0, %edx
movq %rdx, -128(%rbp)
movq %rax, -120(%rbp)
movq -120(%rbp), %rax
movq -128(%rbp), %rdx
movq %rdx, -112(%rbp)
movq %rax, -104(%rbp)
movq -112(%rbp), %rax
movq -104(%rbp), %xmm0
pinsrq $1, %rax, %xmm0
movaps %xmm0, -64(%rbp)
movdqa -64(%rbp), %xmm0
pmovsxwd %xmm0, %xmm0
So that's all good: movq has the same load semantics as pmovsxwd's memory
operand (an 8-byte read with no alignment requirement), so we haven't added
any extra alignment requirements or read any extra data. Turning
optimizations back on, one might reasonably expect the
optimizer to collapse the two intrinsics into the same single instruction it
had before, since they should, in fact, be equivalent to what that instruction
did. However, the asm one gets instead is
pxor %xmm0, %xmm0
pinsrq $0, (%rax), %xmm0
pmovsxwd %xmm0, %xmm0
So this is 3 instructions, 4 uops, and a 4-cycle latency (minimum) for what
should have been 1 instruction, 1 fused-domain uop, and a 2-cycle latency
(minimum). It makes a noticeable difference.
The current workaround in my project is to wrap these in a macro that,
#ifdef __OPTIMIZE__, leaves out the _mm_loadl_epi64(), but otherwise includes
it. However, that seems moderately terrible, and like something the compiler
could choose to break at any time.