Consider the following functions (compiled with "g++-4.1.2 -msse3 -O3"):

#include <emmintrin.h>

__m128i int2vector(int i)        { return _mm_cvtsi32_si128(i); }
int vector2int(__m128i i)        { return _mm_cvtsi128_32(i); }
__m128i long2vector(long long i) { return _mm_cvtsi64x_si128(i); }
long long vector2long(__m128i)   { return _mm_cvtsi128_si64x(i); }

They become:

_Z10int2vectori:
        movd    %edi, %xmm0
        ret
_Z10vector2intU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret
_Z11long2vectorx:
        movd    %rdi, %mm0
        movq    %rdi, -8(%rsp)
        movq2dq %mm0, %xmm0
        ret
_Z11vector2longU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret

long2vector() should use a simple MOVQ instruction the way int2vector() uses MOVD. It appears that the reason for the stack access is that the original code used a reg64->mem->mm->xmm path, which the optimizer partly noticed; gcc-4.3-20070617 leaves the full path in place.

Also, do the vector2<X>() functions really need to access the stack?

Finally, I've noticed several places where instructions involving 64-bit values use the "d/l" suffix (e.g. "long i = 0" ==> "xorl %eax, %eax"), or 32-bit operations that use 64-bit registers (e.g. "movd %xmm0, %rax" above). Are those generally features, bugs, or a "who cares?"
Fixed testcase:

#include <emmintrin.h>

__m128i int2vector(int i)        { return _mm_cvtsi32_si128(i); }
int vector2int(__m128i i)        { return _mm_cvtsi128_si32(i); }
__m128i long2vector(long long i) { return _mm_cvtsi64x_si128(i); }
long long vector2long(__m128i i) { return _mm_cvtsi128_si64x(i); }

mainline generates:

int2vector:
.LFB525:
        pxor    %xmm0, %xmm0
        movq    %rdi, -8(%rsp)
        movq    -8(%rsp), %xmm1
        movss   %xmm1, %xmm0
        ret
.LFE525:
        .size   int2vector, .-int2vector
        .p2align 4,,15
.globl long2vector
        .type   long2vector, @function
long2vector:
.LFB527:
        movq    %rdi, -8(%rsp)
        movq    -8(%rsp), %mm0
        movq2dq %mm0, %xmm0
        ret
.LFE527:
        .size   long2vector, .-long2vector
        .p2align 4,,15
.globl vector2long
        .type   vector2long, @function
vector2long:
.LFB528:
        movq    %xmm0, -16(%rsp)
        movq    -16(%rsp), %rax
        ret
.LFE528:
        .size   vector2long, .-vector2long
        .p2align 4,,15
.globl vector2int
        .type   vector2int, @function
vector2int:
.LFB526:
        movd    %xmm0, -12(%rsp)
        movl    -12(%rsp), %eax
        ret
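As a sanity check of what these conversion intrinsics compute (independent of the code gcc generates for them), here is a minimal round-trip sketch. It assumes an x86-64 target with SSE2; the spelling _mm_cvtsi64_si128 is the standard name for the 64-bit variant, equivalent to the _mm_cvtsi64x_si128 spelling used in the testcase above.

```c
#include <emmintrin.h>

/* Move a 64-bit scalar into the low qword of an xmm register
   (upper qword zeroed) and read it back out. */
static long long roundtrip_ll(long long x)
{
    __m128i v = _mm_cvtsi64_si128(x);   /* x -> low qword, high qword = 0 */
    return _mm_cvtsi128_si64(v);        /* low qword -> scalar */
}

/* Same round trip for the 32-bit case. */
static int roundtrip_int(int x)
{
    __m128i v = _mm_cvtsi32_si128(x);   /* x -> low dword, rest zeroed */
    return _mm_cvtsi128_si32(v);        /* low dword -> scalar */
}
```

Both functions should be identity functions; they are only useful for inspecting the instruction selection discussed in this report.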
This looks like a dup of PR30961.
I don't think so; the _mm_ intrinsics are expanded via target builtins.
(In reply to comment #0)
> long2vector() should use a simple MOVQ instruction the way int2vector() uses
> MOVD. It appears that the reason for the stack access is that the original code
> used a reg64->mem->mm->xmm path, which the optimizer partly noticed;
> gcc-4.3-20070617 leaves the full path in place.
>
> Also, do the vector2<X>() functions really need to access the stack?

The stack access is a remnant of the partially implemented TARGET_INTER_UNIT_MOVES, and this was fully fixed in 4.3. The fact that gcc-4.3 leaves the full path in place is due to a missing pattern for vec_concat:V2DI (implemented in the patch below), where 64-bit registers can be concatenated with zero (on 64-bit targets) using the movq instruction.

With the attached patch, 4.3 generates:

long2vector:
.LFB4:
        movq    %rdi, %xmm0
        ret
.LFE4:

for -march=core2 (TARGET_INTER_UNIT_MOVES allowed) and

long2vector:
.LFB4:
        movq    %rdi, -8(%rsp)
        movq    -8(%rsp), %xmm0
        ret
.LFE4:

for -march=k8 (no TARGET_INTER_UNIT_MOVES allowed).

> Finally, I've noticed several places where instructions involving 64-bit values
> use the "d/l" suffix (e.g. "long i = 0" ==> "xorl %eax, %eax"), or 32-bit
> operations that use 64-bit registers (e.g. "movd %xmm0, %rax" above). Are those
> generally features, bugs, or a "who cares?"

These are "who cares": they are produced by reload to satisfy register constraints (i.e. certain alternatives are missing as an attempt to satisfy INTER_UNIT_MOVES requirements). They are harmless.
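To make the vec_concat:V2DI observation concrete: at the value level, _mm_cvtsi64_si128(x) is exactly the concatenation of x with a zero high qword, which is what a single movq produces. A small C sketch of that equivalence (an illustration, not part of the patch; memcmp is used because __m128i has no == operator, and _mm_set_epi64x takes its arguments as (high, low)):

```c
#include <emmintrin.h>
#include <string.h>

/* _mm_cvtsi64_si128(x) builds the vector (low = x, high = 0) -- the same
   value as explicitly concatenating x with zero via _mm_set_epi64x(0, x). */
static int concat_with_zero_matches(long long x)
{
    __m128i a = _mm_cvtsi64_si128(x);
    __m128i b = _mm_set_epi64x(0, x);   /* (high qword, low qword) */
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Since both sides denote the same V2DI value, a pattern that recognizes the concat-with-zero form can emit the single movq shown above.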
Index: sse.md
===================================================================
--- sse.md      (revision 126510)
+++ sse.md      (working copy)
@@ -4717,7 +4717,7 @@
         (vec_concat:V2DI
           (match_operand:DI 1 "nonimmediate_operand" " m,*y ,0 ,0,0,m")
           (match_operand:DI 2 "vector_move_operand"  " C, C,Yt,x,m,0")))]
-  "TARGET_SSE"
+  "!TARGET_64BIT && TARGET_SSE"
   "@
    movq\t{%1, %0|%0, %1}
    movq2dq\t{%1, %0|%0, %1}
@@ -4728,6 +4728,23 @@
   [(set_attr "type" "ssemov,ssemov,sselog,ssemov,ssemov,ssemov")
    (set_attr "mode" "TI,TI,TI,V4SF,V2SF,V2SF")])
 
+(define_insn "*vec_concatv2di_rex"
+  [(set (match_operand:V2DI 0 "register_operand"     "=Yt,Yi,!Yt,Yt,x,x,x")
+       (vec_concat:V2DI
+         (match_operand:DI 1 "nonimmediate_operand" "  m,r ,*y ,0 ,0,0,m")
+         (match_operand:DI 2 "vector_move_operand"  "  C,C ,C  ,Yt,x,m,0")))]
+  "TARGET_64BIT"
+  "@
+   movq\t{%1, %0|%0, %1}
+   movq\t{%1, %0|%0, %1}
+   movq2dq\t{%1, %0|%0, %1}
+   punpcklqdq\t{%2, %0|%0, %2}
+   movlhps\t{%2, %0|%0, %2}
+   movhps\t{%2, %0|%0, %2}
+   movlps\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssemov,ssemov,ssemov,sselog,ssemov,ssemov,ssemov")
+   (set_attr "mode" "TI,TI,TI,TI,V4SF,V2SF,V2SF")])
+
 (define_expand "vec_setv2di"
   [(match_operand:V2DI 0 "register_operand" "")
    (match_operand:DI 1 "register_operand" "")
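For reference, the punpcklqdq alternative in the pattern above covers the case where the second operand is another vector register rather than zero: it concatenates the low qwords of its two inputs. A hedged C sketch of that semantics via the corresponding intrinsic (an illustration of what vec_concat:V2DI computes, not of the patch itself):

```c
#include <emmintrin.h>

/* punpcklqdq (_mm_unpacklo_epi64) forms the vector
   (low qword of a, low qword of b) -- i.e. vec_concat:V2DI of the two
   low halves.  'which' selects the qword to extract: 0 = low, 1 = high. */
static long long unpack_low_qwords(long long lo_a, long long lo_b, int which)
{
    __m128i a = _mm_cvtsi64_si128(lo_a);
    __m128i b = _mm_cvtsi64_si128(lo_b);
    __m128i c = _mm_unpacklo_epi64(a, b);   /* c = (lo = lo_a, hi = lo_b) */
    return which ? _mm_cvtsi128_si64(_mm_unpackhi_epi64(c, c))  /* high qword */
                 : _mm_cvtsi128_si64(c);                        /* low qword  */
}
```

In the AT&T syntax of the pattern, "punpcklqdq %2, %0" keeps operand 0's low qword in place and moves operand 2's low qword into the high half, matching the (vec_concat (op 1) (op 2)) RTL with operands 0 and 1 tied.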
Subject: Bug 32708

Author: uros
Date: Tue Jul 10 19:26:58 2007
New Revision: 126523

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=126523
Log:
        PR target/32708
        * config/i386/sse.md (vec_concatv2di): Disable for TARGET_64BIT.
        (*vec_concatv2di_rex): New insn pattern.

testsuite/ChangeLog:

        PR target/32708
        * gcc.target/i386/pr32708-1.c: New test.
        * gcc.target/i386/pr32708-2.c: Ditto.
        * gcc.target/i386/pr32708-3.c: Ditto.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr32708-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr32708-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr32708-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
(In reply to comment #0)
> long2vector() should use a simple MOVQ instruction the way int2vector() uses
> MOVD. It appears that the reason for the stack access is that the original code
> used a reg64->mem->mm->xmm path, which the optimizer partly noticed;
> gcc-4.3-20070617 leaves the full path in place.

A note to the reporter: unless a noticeable performance regression is found, missed-optimization bugs should be reported against mainline gcc. Usually, mainline sources implement new infrastructure to support new optimizations, and this infrastructure is rarely backported to release branches. Sometimes the requested optimization is already implemented in mainline, again with little or no chance of being backported.

If you have a particular (complex) application, you are most welcome to try to compile it with the latest gcc sources. This way, compile-time, runtime and performance issues can be fixed early in the gcc development cycle, benefiting your application as well as gcc development.

Otherwise, the patch is committed to mainline. Not a regression on branches.