We should generate "movq" for _mm_move_epi64. But instead, we generate very strange code and never movq: [hjl@gnu-6 gcc]$ cat /tmp/m.c #include <emmintrin.h> __m128i test (__m128i b) { return _mm_move_epi64 (b); } [hjl@gnu-6 gcc]$ ./xgcc -B./ -S /tmp/m.c [hjl@gnu-6 gcc]$ cat m.s .file "m.c" .text .globl test .type test, @function test: .LFB493: pushq %rbp .LCFI0: movq %rsp, %rbp .LCFI1: movdqa %xmm0, -32(%rbp) movdqa -32(%rbp), %xmm0 movdqa %xmm0, %xmm0 movdq2q %xmm0, %mm0 movq %mm0, -40(%rbp) movq -40(%rbp), %rax movl $0, %edx movq %rax, -16(%rbp) movq %rdx, -8(%rbp) movdqu -16(%rbp), %xmm0 leave ret ... [hjl@gnu-6 gcc]$ ./xgcc -B./ -S /tmp/m.c -O2 [hjl@gnu-6 gcc]$ cat m.s .file "m.c" .text .p2align 4,,15 .globl test .type test, @function test: .LFB516: movhps .LC0(%rip), %xmm0 ret
It's very (un-)useful that extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_epi64 (__m128i __A) { return _mm_set_epi64 ((__m64)0LL, _mm_movepi64_pi64 (__A)); } it is not even documented what the function is supposed to do...
Google leads me to VC++ reference which says "Moves the lower 64 bits of the lower 64 bits of the result, zeroing the upper bits. I guess inlining and type fixing messes this up very early.
Btw, 4.3 generates test: .LFB518: movhps .LC0(%rip), %xmm0 ret Likewise does gcc-4.4 (SUSE Linux) 4.4.0 20080611. And current trunk. So ... why are you exactly complaining about the code generated for -O0 again?
(In reply to comment #2) > Google leads me to VC++ reference which says "Moves the lower 64 bits of the > lower 64 bits of the result, zeroing the upper bits. > > I guess inlining and type fixing messes this up very early. > Please download Intel64/IA32 SDM from: http://developer.intel.com/products/processor/manuals/index.htm _mm_move_epi64 is an intrinsic for "movq".
(In reply to comment #3) > Btw, 4.3 generates > > test: > .LFB518: > movhps .LC0(%rip), %xmm0 > ret > > Likewise does gcc-4.4 (SUSE Linux) 4.4.0 20080611. And current trunk. It should be movq %xmm0, %xmm0 with optimization. > > So ... why are you exactly complaining about the code generated for -O0 again? > It should use "movq". There is no need to use MMX moves, like "movq %mm0, -40(%rbp)".
with -march=core2 it uses movd %xmm0, %rax movq %rax, %xmm0 with -march=opteron (and -march=generic) it uses movhps .LC0(%rip), %xmm0 ISTR there is some penalty for using movq on opteron?
(In reply to comment #6) > with -march=core2 it uses > > movd %xmm0, %rax > movq %rax, %xmm0 > Even this isn't necessary. We should just use movq %xmm0,%xmm0 > with -march=opteron (and -march=generic) it uses > > movhps .LC0(%rip), %xmm0 > > ISTR there is some penalty for using movq on opteron? Opteron doesn't like inter-unit move, like movd %xmm0, %rax But it isn't necessary at all. We should use movq %xmm0,%xmm0 anyway.
(In reply to comment #7) > But it isn't necessary at all. We should use > > movq %xmm0,%xmm0 It looks to me that we should add: [(set (match_operand:V2DI 0 "register_operand" "=x") (vec_concat:V2DI (match_operand:DI 1 "nonimmediate_operand" "!x") (match_operand:DI 2 "vector_move_operand" " C")))] to the alternatives of all vec_concatv2di patterns. Perhaps we need something like that also for all vec_concatv2si patterns, since x can also hold SImode value, and movd fills upper part of target SSE reg with zeroes. "!" guarantees, that this alternative applies only when input value is already in SSE reg.
Hm, IA-32 Intel® Architecture Software Developer’s Manual says about movq: Operation MOVQ instruction when operating on MMX technology registers and memory locations: DEST ← SRC; MOVQ instruction when source and destination operands are XMM registers: DEST[63:0] ← SRC[63:0]; MOVQ instruction when source operand is XMM register and destination operand is memory location: DEST ← SRC[63:0]; MOVQ instruction when source operand is memory location and destination operand is XMM register: DEST[63:0] ← SRC; DEST[127:64] ← 0000000000000000H; Please note, that the documentation doesn't say anything about clearing destination bits [127:64] when both source and destination are in XMM register.
"←" in my previous post represents left pointing arrow, <----.
Uh, clearing of top bits is explicitly stated in latest Software Developer's Manual for both movq and movd. I'll fix these patterns.
Patch in testing: Index: testsuite/gcc.target/i386/pr36992.c =================================================================== --- testsuite/gcc.target/i386/pr36992.c (revision 0) +++ testsuite/gcc.target/i386/pr36992.c (revision 0) @@ -0,0 +1,12 @@ +/* { dg-do compile } +/* { dg-options "-msse2 -O2" } */ + +#include <emmintrin.h> + +__m128i +test (__m128i b) +{ + return _mm_move_epi64 (b); +} + +/* { dg-final { scan-assembler-times "mov\[qd\]\[ \\t\]+.*%xmm" 1 } } */ Index: config/i386/sse.md =================================================================== --- config/i386/sse.md (revision 138553) +++ config/i386/sse.md (working copy) @@ -4777,7 +4777,7 @@ "") (define_insn "*sse2_storeq_rex64" - [(set (match_operand:DI 0 "nonimmediate_operand" "=mx,r,r") + [(set (match_operand:DI 0 "nonimmediate_operand" "=mx,*r,r") (vec_select:DI (match_operand:V2DI 1 "nonimmediate_operand" "x,Yi,o") (parallel [(const_int 0)])))] @@ -4940,10 +4940,10 @@ (set_attr "mode" "TI,V4SF,V2SF")]) (define_insn "vec_concatv2di" - [(set (match_operand:V2DI 0 "register_operand" "=Y2,?Y2,Y2,x,x,x") + [(set (match_operand:V2DI 0 "register_operand" "=Y2 ,?Y2,Y2,x,x,x") (vec_concat:V2DI - (match_operand:DI 1 "nonimmediate_operand" " m,*y ,0 ,0,0,m") - (match_operand:DI 2 "vector_move_operand" " C, C,Y2,x,m,0")))] + (match_operand:DI 1 "nonimmediate_operand" " mY2,*y ,0 ,0,0,m") + (match_operand:DI 2 "vector_move_operand" " C , C,Y2,x,m,0")))] "!TARGET_64BIT && TARGET_SSE" "@ movq\t{%1, %0|%0, %1} @@ -4956,10 +4956,10 @@ (set_attr "mode" "TI,TI,TI,V4SF,V2SF,V2SF")]) (define_insn "*vec_concatv2di_rex64_sse4_1" - [(set (match_operand:V2DI 0 "register_operand" "=x,x,Yi,!x,x,x,x,x") + [(set (match_operand:V2DI 0 "register_operand" "=x ,x ,Yi,!x,x,x,x,x") (vec_concat:V2DI - (match_operand:DI 1 "nonimmediate_operand" " 0,m,r ,*y,0,0,0,m") - (match_operand:DI 2 "vector_move_operand" "rm,C,C ,C ,x,x,m,0")))] + (match_operand:DI 1 "nonimmediate_operand" " 0 ,mx,r ,*y,0,0,0,m") + (match_operand:DI 2 "vector_move_operand" " rm,C ,C ,C ,x,x,m,0")))] "TARGET_64BIT && TARGET_SSE4_1" "@ pinsrq\t{$0x1, %2, %0|%0, %2, 0x1} @@ -4975,10 +4975,10 @@ (set_attr "mode" "TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")]) (define_insn "*vec_concatv2di_rex64_sse" - [(set (match_operand:V2DI 0 "register_operand" "=Y2,Yi,!Y2,Y2,x,x,x") + [(set (match_operand:V2DI 0 "register_operand" "=Y2 ,Yi,!Y2,Y2,x,x,x") (vec_concat:V2DI - (match_operand:DI 1 "nonimmediate_operand" " m,r ,*y ,0 ,0,0,m") - (match_operand:DI 2 "vector_move_operand" " C,C ,C ,Y2,x,m,0")))] + (match_operand:DI 1 "nonimmediate_operand" " mY2,r ,*y ,0 ,0,0,m") + (match_operand:DI 2 "vector_move_operand" " C ,C ,C ,Y2,x,m,0")))] "TARGET_64BIT && TARGET_SSE" "@ movq\t{%1, %0|%0, %1}
We should also test -O0. This code: extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_movpi64_epi64 (__m64 __A) { return _mm_set_epi64 ((__m64)0LL, __A); } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_epi64 (__m128i __A) { return _mm_set_epi64 ((__m64)0LL, _mm_movepi64_pi64 (__A)); } Why do we use _mm_movepi64_pi64 at all? _mm_movepi64_pi64 is an MMX intrinsic. It isn't necessary here.
(In reply to comment #13) > We should also test -O0. Usage of MMX regs with -O0 is fixed by following patch: Index: config/i386/mmx.md =================================================================== --- config/i386/mmx.md (revision 138553) +++ config/i386/mmx.md (working copy) @@ -65,9 +65,9 @@ (define_insn "*mov<mode>_internal_rex64" [(set (match_operand:MMXMODEI8 0 "nonimmediate_operand" - "=rm,r,!?y,!?y ,m ,!y,Y2,x,x ,m,r,x") + "=rm,r,!?y,!?y ,m ,!y,*Y2,x,x ,m,r,x") (match_operand:MMXMODEI8 1 "vector_move_operand" - "Cr ,m,C ,!?ym,!?y,Y2,!y,C,xm,x,x,r"))] + "Cr ,m,C ,!?ym,!?y,*Y2,!y,C,xm,x,x,r"))] "TARGET_64BIT && TARGET_MMX && !(MEM_P (operands[0]) && MEM_P (operands[1]))" "@ @@ -124,9 +124,9 @@ (define_insn "*movv2sf_internal_rex64" [(set (match_operand:V2SF 0 "nonimmediate_operand" - "=rm,r ,!?y,!?y ,m ,!y,Y2,x,x,x,m,r,x") + "=rm,r ,!?y,!?y ,m ,!y,*Y2,x,x,x,m,r,x") (match_operand:V2SF 1 "vector_move_operand" - "Cr ,m ,C ,!?ym,!y,Y2,!y,C,x,m,x,x,r"))] + "Cr ,m ,C ,!?ym,!y,*Y2,!y,C,x,m,x,x,r"))] "TARGET_64BIT && TARGET_MMX && !(MEM_P (operands[0]) && MEM_P (operands[1]))" "@ > Why do we use _mm_movepi64_pi64 at all? _mm_movepi64_pi64 is an MMX > intrinsic. It isn't necessary here. _mm_movepi64_pi64 extracts DImode element 0 from V2DI vector. It just generates sse2_storeq_* pattern that operates also on SSE regs.
Complete patch at http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00116.html
Subject: Bug 36992 Author: uros Date: Sun Aug 3 06:13:04 2008 New Revision: 138564 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138564 Log: PR target/36992 * config/i386/sse.md (vec_concatv2di): Add Y2 constraint to alternative 0 of operand 1. (*vec_concatv2di_rex64_sse): Ditto. (*vec_concatv2di_rex64_sse4_1): Add x constraint to alternative 0 of operand 1. (*sse2_storeq_rex64): Penalize allocation of "r" registers. * config/i386/mmx.md (*mov<mode>_internal_rex64): Penalize allocation of "Y2" registers to avoid SSE <-> MMX conversions for DImode moves. (*movv2sf_internal_rex64): Ditto. testsuite/ChangeLog: PR target/36992 * gcc.target/i386/pr36992-1.c: New test. * gcc.target/i386/pr36992-2.c: Ditto. Added: trunk/gcc/testsuite/gcc.target/i386/pr36992-1.c trunk/gcc/testsuite/gcc.target/i386/pr36992-2.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/mmx.md trunk/gcc/config/i386/sse.md trunk/gcc/testsuite/ChangeLog
Fixed.
With -msse4, we got bash-3.2$ ./xgcc -B./ -msse4 -S /export/gnu/src/gcc-work/gcc/gcc/testsuite/gcc.target/i386/pr36992-1.c bash-3.2$ ./xgcc -B./ -msse4 -S /export/gnu/src/gcc-work/gcc/gcc/testsuite/gcc.target/i386/pr36992-1.c -O2 bash-3.2$ cat pr36992-1.s .file "pr36992-1.c" .text .p2align 4,,15 .globl test .type test, @function test: .LFB516: pextrq $0, %xmm0, %rax pxor %xmm0, %xmm0 pinsrq $0, %rax, %xmm0 ret Why can't we just emit movq pattern for _mm_move_epi64?
A patch is posted at http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00324.html
Subject: Bug 36992 Author: hjl Date: Tue Aug 5 17:40:16 2008 New Revision: 138734 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138734 Log: gcc/ 2008-08-05 H.J. Lu <hongjiu.lu@intel.com> PR target/36992 * config/i386/emmintrin.h (_mm_move_epi64): Use __builtin_ia32_movq128. * config/i386/i386.c (ix86_builtins): Add IX86_BUILTIN_MOVQ128. (bdesc_args): Add IX86_BUILTIN_MOVQ128. * config/i386/sse.md (sse2_movq128): New. (*sse2_movq128): Likewise. * doc/extend.texi: Document __builtin_ia32_movq128. gcc/testsuite/ 2008-08-04 H.J. Lu <hongjiu.lu@intel.com> PR target/36992 * gcc.target/i386/pr36992-1.c: Scan movq. * gcc.target/i386/pr36992-2.c: Likewise. * gcc.target/i386/pr36992-3.c: New. Added: branches/ix86/avx/gcc/testsuite/gcc.target/i386/pr36992-3.c Modified: branches/ix86/avx/gcc/ChangeLog.avx branches/ix86/avx/gcc/config/i386/emmintrin.h branches/ix86/avx/gcc/config/i386/i386.c branches/ix86/avx/gcc/config/i386/sse.md branches/ix86/avx/gcc/doc/extend.texi branches/ix86/avx/gcc/testsuite/ChangeLog.avx branches/ix86/avx/gcc/testsuite/gcc.target/i386/pr36992-1.c branches/ix86/avx/gcc/testsuite/gcc.target/i386/pr36992-2.c
Subject: Bug 36992 Author: hjl Date: Thu Aug 7 13:16:23 2008 New Revision: 138839 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138839 Log: gcc/ 2008-08-07 H.J. Lu <hongjiu.lu@intel.com> PR target/36992 * config/i386/emmintrin.h (_mm_move_epi64): Use __builtin_ia32_movq128. * config/i386/i386.c (ix86_builtins): Add IX86_BUILTIN_MOVQ128. (bdesc_args): Add IX86_BUILTIN_MOVQ128. * config/i386/sse.md (sse2_movq128): New. * doc/extend.texi: Document __builtin_ia32_movq128. gcc/testsuite/ 2008-08-07 H.J. Lu <hongjiu.lu@intel.com> PR target/36992 * gcc.target/i386/pr36992-1.c: Scan movq. * gcc.target/i386/pr36992-2.c: Use "-O2 -msse4" instead of "-O0 -msse2". Scan movq. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/emmintrin.h trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/sse.md trunk/gcc/doc/extend.texi trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/i386/pr36992-1.c trunk/gcc/testsuite/gcc.target/i386/pr36992-2.c