[Bug c/63756] New: _mm_cvtepi16_epi32 with a memory operand produces either broken or slow asm
tterribe at xiph dot org
gcc-bugzilla@gcc.gnu.org
Wed Nov 5 22:38:00 GMT 2014
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63756
Bug ID: 63756
Summary: _mm_cvtepi16_epi32 with a memory operand produces
either broken or slow asm
Product: gcc
Version: 4.9.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: tterribe at xiph dot org
Created attachment 33900
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33900&action=edit
Reduced testcase
With optimizations enabled, the call
_mm_cvtepi16_epi32(*(__m128i *)x)
for some pointer x produces the asm
pmovsxwd (%rax), %xmm0
which is all well and good, and what was intended.
However, with optimizations disabled, the same code produces
movdqa (%rax), %xmm0
movaps %xmm0, -48(%rbp)
movdqa -48(%rbp), %xmm0
pmovsxwd %xmm0, %xmm0
The problem here is that the initial movdqa has added a 16-byte alignment
requirement and reads 8 bytes past where the original pmovsxwd instruction
would have read in the optimized version. This is very much not equivalent, and
causes crashes in code that runs just fine in the optimized version.
_mm_cvtepi16_epi32() takes an __m128i argument, and the dereference happens
before the function call. Even though the asm instruction it stands in for
can do the load and the conversion in a single operation, we don't have a
single intrinsic which specifies exactly that. None of the semantics here are
very well documented anywhere, but
I can understand why the compiler might think it has the right to do what it
did. So I try the following code:
_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)x))
With optimizations disabled, this produces the slightly long-winded, but at
least correct asm of:
movq (%rax), %rax
movl $0, %edx
movq %rdx, -128(%rbp)
movq %rax, -120(%rbp)
movq -120(%rbp), %rax
movq -128(%rbp), %rdx
movq %rdx, -112(%rbp)
movq %rax, -104(%rbp)
movq -112(%rbp), %rax
movq -104(%rbp), %xmm0
pinsrq $1, %rax, %xmm0
movaps %xmm0, -64(%rbp)
movdqa -64(%rbp), %xmm0
pmovsxwd %xmm0, %xmm0
So that's all good: movq has the same load semantics as pmovsxwd's memory
operand (an 8-byte read with no alignment requirement), so we haven't added
any extra alignment requirements or read any extra data. Turning
optimizations back on, one might reasonably expect the
optimizer to collapse the two intrinsics into the same single instruction it
had before, since they should, in fact, be equivalent to what that instruction
did. However, the asm one gets instead is
pxor %xmm0, %xmm0
pinsrq $0, (%rax), %xmm0
pmovsxwd %xmm0, %xmm0
So this is 3 instructions, 4 uops, and a 4-cycle latency (minimum) for what
should have been 1 instruction, 1 fused-domain uop, and a 2-cycle latency
(minimum). It makes a noticeable difference.
The current workaround in my project is to wrap these in a macro that,
#ifdef __OPTIMIZE__, leaves out the _mm_loadl_epi64(), but otherwise includes
it. However, that seems moderately terrible, and like something the compiler
could choose to break at any time.