[Bug target/91117] New: _mm_movpi64_epi64/_mm_movepi64_pi64 generating store+load instead of using MOVQ2DQ/MOVDQ2Q
wolfwings+gcc at gmail dot com
gcc-bugzilla@gcc.gnu.org
Mon Jul 8 22:31:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91117
Bug ID: 91117
Summary: _mm_movpi64_epi64/_mm_movepi64_pi64 generating
store+load instead of using MOVQ2DQ/MOVDQ2Q
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wolfwings+gcc at gmail dot com
Target Milestone: ---
_mm_movpi64_epi64 never emits MOVQ2DQ (and _mm_movepi64_pi64 never emits
MOVDQ2Q), despite documentation stating that it should when used in mixed
MMX -> SSE situations, and that these are in fact the intrinsics to use when
the MOVQ2DQ/MOVDQ2Q opcodes are desired.
This appears to be because the header definitions fall back to a memory store
followed by a reload, except in (technically invalid) SSE -> SSE cases, where
a MOVD is used instead.
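For illustration, an intrinsic written as a generic vector construction along
the following lines gives the optimizer no register-to-register pattern to
match, so the __m64 value is spilled to the stack and reloaded into an XMM
register. This is a sketch of the fallback shape, not the actual emmintrin.h
source:

#include <emmintrin.h>

/* Hypothetical: zero-extending the __m64 into an __m128i through a
   generic vector build instead of a MOVQ2DQ pattern; the compiler is
   free to lower this as a stack store plus a MOVQ reload. */
static inline __m128i _sketch_movpi64_epi64( __m64 input ) {
    return _mm_set_epi64( _mm_setzero_si64(), input );
}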
Tested locally on GCC 7.4 and 9.1, with additional testing on Godbolt showing
identical code generated all the way back to the 4.x series.
Compiled with -O1:
#include <emmintrin.h>

__m128i test( __m128i input ) {
    __m64 x = _mm_movepi64_pi64( input );
    return _mm_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}
Generated assembly on GCC 9.1:
movq %xmm0, -16(%rsp)
movq -16(%rsp), %mm0
movq %mm0, %mm1
pmullw %mm0, %mm1
movq %mm1, -16(%rsp)
movq -16(%rsp), %xmm0
ret
A version that emits movq2dq/movdq2q explicitly via inline assembly works and
produces the expected assembly sequence:
#include <emmintrin.h>

/* "=y" constrains the output to an MMX register, "x" the input to an
   XMM register, so the compiler must emit a direct MOVDQ2Q. */
static inline __m64 _my_movepi64_pi64( __m128i input ) {
    __m64 result;
    asm( "movdq2q %1, %0" : "=y" (result) : "x" (input) );
    return result;
}

/* Likewise "=x"/"y" forces a direct MOVQ2DQ in the other direction. */
static inline __m128i _my_movpi64_epi64( __m64 input ) {
    __m128i result;
    asm( "movq2dq %1, %0" : "=x" (result) : "y" (input) );
    return result;
}

__m128i test( __m128i input ) {
    __m64 x = _my_movepi64_pi64( input );
    return _my_movpi64_epi64( _mm_mullo_pi16( x, x ) );
}
Generated assembly on GCC 7.4, 9.1, and others via Godbolt, again with -O1 (-O2
and -O3 make no difference):
movdq2q %xmm0, %mm0
pmullw %mm0, %mm0
movq2dq %mm0, %xmm0
ret
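For reference, both listings can be reproduced with a command of this shape
(the file name test.c is illustrative; assembler directives and labels are
trimmed from the listings above):

    gcc -O1 -S -o - test.c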
For completeness, ICC generates the 'short' code form on all available versions
without needing the inline assembly workaround.