36992 – Very stange code for _mm_move_epi64

Bug 36992 - Very stange code for _mm_move_epi64

Summary: Very stange code for _mm_move_epi64

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.4.0

Importance:	P3 normal
Target Milestone:	4.4.0
Assignee:	Uroš Bizjak

URL:	http://gcc.gnu.org/ml/gcc-patches/200...
Keywords:

Depends on:
Blocks:

Reported:	2008-08-01 03:28 UTC by H.J. Lu
Modified:	2008-08-07 13:18 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:	x86-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:	2008-08-02 13:22:36

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description H.J. Lu 2008-08-01 03:28:57 UTC

We should generate "movq" for _mm_move_epi64. But instead, we
generate very strange code and never movq:

[hjl@gnu-6 gcc]$ cat /tmp/m.c
#include <emmintrin.h>

__m128i
test (__m128i b)
{
  return _mm_move_epi64 (b);
}
[hjl@gnu-6 gcc]$ ./xgcc -B./ -S /tmp/m.c
[hjl@gnu-6 gcc]$ cat m.s
        .file   "m.c"
        .text
.globl test
        .type   test, @function
test:
.LFB493:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        movdqa  %xmm0, -32(%rbp)
        movdqa  -32(%rbp), %xmm0
        movdqa  %xmm0, %xmm0
        movdq2q %xmm0, %mm0
        movq    %mm0, -40(%rbp)
        movq    -40(%rbp), %rax
        movl    $0, %edx
        movq    %rax, -16(%rbp)
        movq    %rdx, -8(%rbp)
        movdqu  -16(%rbp), %xmm0
        leave
        ret
...
[hjl@gnu-6 gcc]$ ./xgcc -B./ -S /tmp/m.c -O2
[hjl@gnu-6 gcc]$ cat m.s
        .file   "m.c"
        .text
        .p2align 4,,15
.globl test
        .type   test, @function
test:
.LFB516:
        movhps  .LC0(%rip), %xmm0
        ret

Comment 1 Richard Biener 2008-08-01 11:58:12 UTC

It's very (un-)useful that

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_move_epi64 (__m128i __A)
{
  return _mm_set_epi64 ((__m64)0LL, _mm_movepi64_pi64 (__A));
}

it is not even documented what the function is supposed to do...

Comment 2 Richard Biener 2008-08-01 12:04:29 UTC

Google leads me to VC++ reference which says "Moves the lower 64 bits of the lower 64 bits of the result, zeroing the upper bits.

I guess inlining and type fixing messes this up very early.

Comment 3 Richard Biener 2008-08-01 12:06:58 UTC

Btw, 4.3 generates

test:
.LFB518:
        movhps  .LC0(%rip), %xmm0
        ret

Likewise does gcc-4.4 (SUSE Linux) 4.4.0 20080611.  And current trunk.

So ... why are you exactly complaining about the code generated for -O0 again?

Comment 4 H.J. Lu 2008-08-01 12:53:10 UTC

(In reply to comment #2)
> Google leads me to VC++ reference which says "Moves the lower 64 bits of the
> lower 64 bits of the result, zeroing the upper bits.
> 
> I guess inlining and type fixing messes this up very early.
> 

Please download Intel64/IA32 SDM from:

http://developer.intel.com/products/processor/manuals/index.htm

_mm_move_epi64 is an intrinsic for "movq".

Comment 5 H.J. Lu 2008-08-01 12:55:56 UTC

(In reply to comment #3)
> Btw, 4.3 generates
> 
> test:
> .LFB518:
>         movhps  .LC0(%rip), %xmm0
>         ret
> 
> Likewise does gcc-4.4 (SUSE Linux) 4.4.0 20080611.  And current trunk.

It should be

movq %xmm0, %xmm0

with optimization.

> 
> So ... why are you exactly complaining about the code generated for -O0 again?
> 

It should use "movq". There is no need to use MMX moves, like
"movq %mm0, -40(%rbp)".

Comment 6 Richard Biener 2008-08-01 13:21:35 UTC

with -march=core2 it uses

        movd    %xmm0, %rax
        movq    %rax, %xmm0

with -march=opteron (and -march=generic) it uses

        movhps  .LC0(%rip), %xmm0

ISTR there is some penalty for using movq on opteron?

Comment 7 H.J. Lu 2008-08-01 13:25:42 UTC

(In reply to comment #6)
> with -march=core2 it uses
> 
>         movd    %xmm0, %rax
>         movq    %rax, %xmm0
>

Even this isn't necessary. We should just use

movq %xmm0,%xmm0

> with -march=opteron (and -march=generic) it uses
> 
>         movhps  .LC0(%rip), %xmm0
> 
> ISTR there is some penalty for using movq on opteron?

Opteron doesn't like inter-unit move, like

movd    %xmm0, %rax

But it isn't necessary at all. We should use

movq %xmm0,%xmm0

anyway.

Comment 8 Uroš Bizjak 2008-08-01 19:32:58 UTC

(In reply to comment #7)

> But it isn't necessary at all. We should use
> 
> movq %xmm0,%xmm0

It looks to me that we should add:

  [(set (match_operand:V2DI 0 "register_operand"     "=x")
	(vec_concat:V2DI
	  (match_operand:DI 1 "nonimmediate_operand" "!x")
	  (match_operand:DI 2 "vector_move_operand"  " C")))]

to the alternatives of all vec_concatv2di patterns.

Perhaps we need something like that also for all vec_concatv2si patterns, since x can also hold SImode value, and movd fills upper part of target SSE reg with zeroes.

"!" guarantees, that this alternative applies only when input value is already in SSE reg.

Comment 9 Uroš Bizjak 2008-08-02 13:08:58 UTC

Hm, IA-32 Intel® Architecture Software Developer’s Manual says about movq:

Operation
MOVQ instruction when operating on MMX technology registers and memory locations:
   DEST &#8592; SRC;
MOVQ instruction when source and destination operands are XMM registers:
   DEST[63:0] &#8592; SRC[63:0];
MOVQ instruction when source operand is XMM register and destination
operand is memory location:
   DEST &#8592; SRC[63:0];
MOVQ instruction when source operand is memory location and destination
operand is XMM register:
   DEST[63:0] &#8592; SRC;
   DEST[127:64] &#8592; 0000000000000000H;

Please note, that the documentation doesn't say anything about clearing destination bits [127:64] when both source and destination are in XMM register.

Comment 10 Uroš Bizjak 2008-08-02 13:10:26 UTC

"&#8592;" in my previous post represents left pointing arrow, <----.

Comment 11 Uroš Bizjak 2008-08-02 13:22:36 UTC

Uh, clearing of top bits is explicitly stated in latest Software Developer's Manual for both movq and movd. I'll fix these patterns.

Comment 12 Uroš Bizjak 2008-08-02 15:07:12 UTC

Patch in testing:

Index: testsuite/gcc.target/i386/pr36992.c
===================================================================
--- testsuite/gcc.target/i386/pr36992.c	(revision 0)
+++ testsuite/gcc.target/i386/pr36992.c	(revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile }
+/* { dg-options "-msse2 -O2" } */
+
+#include <emmintrin.h>
+
+__m128i
+test (__m128i b)
+{
+  return _mm_move_epi64 (b);
+}
+
+/* { dg-final { scan-assembler-times "mov\[qd\]\[ \\t\]+.*%xmm" 1 } } */
Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 138553)
+++ config/i386/sse.md	(working copy)
@@ -4777,7 +4777,7 @@
   "")
 
 (define_insn "*sse2_storeq_rex64"
-  [(set (match_operand:DI 0 "nonimmediate_operand" "=mx,r,r")
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=mx,*r,r")
 	(vec_select:DI
 	  (match_operand:V2DI 1 "nonimmediate_operand" "x,Yi,o")
 	  (parallel [(const_int 0)])))]
@@ -4940,10 +4940,10 @@
    (set_attr "mode" "TI,V4SF,V2SF")])
 
 (define_insn "vec_concatv2di"
-  [(set (match_operand:V2DI 0 "register_operand"     "=Y2,?Y2,Y2,x,x,x")
+  [(set (match_operand:V2DI 0 "register_operand"     "=Y2 ,?Y2,Y2,x,x,x")
 	(vec_concat:V2DI
-	  (match_operand:DI 1 "nonimmediate_operand" "  m,*y ,0 ,0,0,m")
-	  (match_operand:DI 2 "vector_move_operand"  "  C,  C,Y2,x,m,0")))]
+	  (match_operand:DI 1 "nonimmediate_operand" " mY2,*y ,0 ,0,0,m")
+	  (match_operand:DI 2 "vector_move_operand"  " C  ,  C,Y2,x,m,0")))]
   "!TARGET_64BIT && TARGET_SSE"
   "@
    movq\t{%1, %0|%0, %1}
@@ -4956,10 +4956,10 @@
    (set_attr "mode" "TI,TI,TI,V4SF,V2SF,V2SF")])
 
 (define_insn "*vec_concatv2di_rex64_sse4_1"
-  [(set (match_operand:V2DI 0 "register_operand"     "=x,x,Yi,!x,x,x,x,x")
+  [(set (match_operand:V2DI 0 "register_operand"     "=x ,x ,Yi,!x,x,x,x,x")
 	(vec_concat:V2DI
-	  (match_operand:DI 1 "nonimmediate_operand" " 0,m,r ,*y,0,0,0,m")
-	  (match_operand:DI 2 "vector_move_operand"  "rm,C,C ,C ,x,x,m,0")))]
+	  (match_operand:DI 1 "nonimmediate_operand" " 0 ,mx,r ,*y,0,0,0,m")
+	  (match_operand:DI 2 "vector_move_operand"  " rm,C ,C ,C ,x,x,m,0")))]
   "TARGET_64BIT && TARGET_SSE4_1"
   "@
    pinsrq\t{$0x1, %2, %0|%0, %2, 0x1}
@@ -4975,10 +4975,10 @@
    (set_attr "mode" "TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
 
 (define_insn "*vec_concatv2di_rex64_sse"
-  [(set (match_operand:V2DI 0 "register_operand"     "=Y2,Yi,!Y2,Y2,x,x,x")
+  [(set (match_operand:V2DI 0 "register_operand"     "=Y2 ,Yi,!Y2,Y2,x,x,x")
 	(vec_concat:V2DI
-	  (match_operand:DI 1 "nonimmediate_operand" "  m,r ,*y ,0 ,0,0,m")
-	  (match_operand:DI 2 "vector_move_operand"  "  C,C ,C  ,Y2,x,m,0")))]
+	  (match_operand:DI 1 "nonimmediate_operand" " mY2,r ,*y ,0 ,0,0,m")
+	  (match_operand:DI 2 "vector_move_operand"  " C  ,C ,C  ,Y2,x,m,0")))]
   "TARGET_64BIT && TARGET_SSE"
   "@
    movq\t{%1, %0|%0, %1}

Comment 13 H.J. Lu 2008-08-02 15:19:38 UTC

We should also test -O0. This code:

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_movpi64_epi64 (__m64 __A) 
{
  return _mm_set_epi64 ((__m64)0LL, __A);
}

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_move_epi64 (__m128i __A) 
{
  return _mm_set_epi64 ((__m64)0LL, _mm_movepi64_pi64 (__A));
}

Why do we use _mm_movepi64_pi64 at all? _mm_movepi64_pi64 is an MMX
intrinsic. It isn't necessary here.

Comment 14 Uroš Bizjak 2008-08-02 16:01:48 UTC

(In reply to comment #13)
> We should also test -O0.

Usage of MMX regs with -O0 is fixed by following patch:

Index: config/i386/mmx.md
===================================================================
--- config/i386/mmx.md	(revision 138553)
+++ config/i386/mmx.md	(working copy)
@@ -65,9 +65,9 @@
 
 (define_insn "*mov<mode>_internal_rex64"
   [(set (match_operand:MMXMODEI8 0 "nonimmediate_operand"
-				"=rm,r,!?y,!?y ,m  ,!y,Y2,x,x ,m,r,x")
+				"=rm,r,!?y,!?y ,m  ,!y,*Y2,x,x ,m,r,x")
 	(match_operand:MMXMODEI8 1 "vector_move_operand"
-				"Cr ,m,C  ,!?ym,!?y,Y2,!y,C,xm,x,x,r"))]
+				"Cr ,m,C  ,!?ym,!?y,*Y2,!y,C,xm,x,x,r"))]
   "TARGET_64BIT && TARGET_MMX
    && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
   "@
@@ -124,9 +124,9 @@
 
 (define_insn "*movv2sf_internal_rex64"
   [(set (match_operand:V2SF 0 "nonimmediate_operand"
-				"=rm,r ,!?y,!?y ,m ,!y,Y2,x,x,x,m,r,x")
+				"=rm,r ,!?y,!?y ,m ,!y,*Y2,x,x,x,m,r,x")
         (match_operand:V2SF 1 "vector_move_operand"
-				"Cr ,m ,C  ,!?ym,!y,Y2,!y,C,x,m,x,x,r"))]
+				"Cr ,m ,C  ,!?ym,!y,*Y2,!y,C,x,m,x,x,r"))]
   "TARGET_64BIT && TARGET_MMX
    && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
   "@

> Why do we use _mm_movepi64_pi64 at all? _mm_movepi64_pi64 is an MMX
> intrinsic. It isn't necessary here.

_mm_movepi64_pi64 extracts DImode element 0 from V2DI vector. It just generates sse2_storeq_* pattern that operates also on SSE regs.

Comment 15 Uroš Bizjak 2008-08-02 18:43:45 UTC

Complete patch at http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00116.html

Comment 16 uros 2008-08-03 06:14:13 UTC

Subject: Bug 36992

Author: uros
Date: Sun Aug  3 06:13:04 2008
New Revision: 138564

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138564
Log:
	PR target/36992
	* config/i386/sse.md (vec_concatv2di): Add Y2 constraint to
	alternative 0 of operand 1.
	(*vec_concatv2di_rex64_sse): Ditto.
	(*vec_concatv2di_rex64_sse4_1): Add x constraint to alternative 0
	of operand 1.
	(*sse2_storeq_rex64): Penalize allocation of "r" registers.
	* config/i386/mmx.md (*mov<mode>_internal_rex64): Penalize allocation
	of "Y2" registers to avoid SSE <-> MMX conversions for DImode moves.
	(*movv2sf_internal_rex64): Ditto.

testsuite/ChangeLog:

	PR target/36992
	* gcc.target/i386/pr36992-1.c: New test.
	* gcc.target/i386/pr36992-2.c: Ditto.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr36992-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr36992-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/mmx.md
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog

Comment 17 Uroš Bizjak 2008-08-03 06:49:16 UTC

Fixed.

Comment 18 H.J. Lu 2008-08-05 14:19:08 UTC

With -msse4, we got

bash-3.2$ ./xgcc -B./ -msse4 -S /export/gnu/src/gcc-work/gcc/gcc/testsuite/gcc.target/i386/pr36992-1.c
bash-3.2$ ./xgcc -B./ -msse4 -S /export/gnu/src/gcc-work/gcc/gcc/testsuite/gcc.target/i386/pr36992-1.c -O2
bash-3.2$ cat pr36992-1.s
	.file	"pr36992-1.c"
	.text
	.p2align 4,,15
.globl test
	.type	test, @function
test:
.LFB516:
	pextrq	$0, %xmm0, %rax
	pxor	%xmm0, %xmm0
	pinsrq	$0, %rax, %xmm0
	ret

Why can't we just emit movq pattern for _mm_move_epi64?

Comment 19 H.J. Lu 2008-08-05 17:34:36 UTC

A patch is posted at

http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00324.html

Comment 20 hjl@gcc.gnu.org 2008-08-05 17:41:31 UTC

Subject: Bug 36992

Author: hjl
Date: Tue Aug  5 17:40:16 2008
New Revision: 138734

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138734
Log:
gcc/

2008-08-05  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/36992
	* config/i386/emmintrin.h (_mm_move_epi64): Use
	__builtin_ia32_movq128.

	* config/i386/i386.c (ix86_builtins): Add IX86_BUILTIN_MOVQ128.
	(bdesc_args): Add IX86_BUILTIN_MOVQ128.

	* config/i386/sse.md (sse2_movq128): New.
	(*sse2_movq128): Likewise.

	* doc/extend.texi: Document __builtin_ia32_movq128.

gcc/testsuite/

2008-08-04  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/36992
	* gcc.target/i386/pr36992-1.c: Scan movq.
	* gcc.target/i386/pr36992-2.c: Likewise.

	* gcc.target/i386/pr36992-3.c: New.

Added:
    branches/ix86/avx/gcc/testsuite/gcc.target/i386/pr36992-3.c
Modified:
    branches/ix86/avx/gcc/ChangeLog.avx
    branches/ix86/avx/gcc/config/i386/emmintrin.h
    branches/ix86/avx/gcc/config/i386/i386.c
    branches/ix86/avx/gcc/config/i386/sse.md
    branches/ix86/avx/gcc/doc/extend.texi
    branches/ix86/avx/gcc/testsuite/ChangeLog.avx
    branches/ix86/avx/gcc/testsuite/gcc.target/i386/pr36992-1.c
    branches/ix86/avx/gcc/testsuite/gcc.target/i386/pr36992-2.c

Comment 21 hjl@gcc.gnu.org 2008-08-07 13:17:35 UTC

Subject: Bug 36992

Author: hjl
Date: Thu Aug  7 13:16:23 2008
New Revision: 138839

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=138839
Log:
gcc/

2008-08-07  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/36992
	* config/i386/emmintrin.h (_mm_move_epi64): Use
	__builtin_ia32_movq128.

	* config/i386/i386.c (ix86_builtins): Add IX86_BUILTIN_MOVQ128.
	(bdesc_args): Add IX86_BUILTIN_MOVQ128.

	* config/i386/sse.md (sse2_movq128): New.

	* doc/extend.texi: Document __builtin_ia32_movq128.

gcc/testsuite/

2008-08-07  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/36992
	* gcc.target/i386/pr36992-1.c: Scan movq.
	* gcc.target/i386/pr36992-2.c: Use "-O2 -msse4" instead of
	"-O0 -msse2".  Scan movq.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/emmintrin.h
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md
    trunk/gcc/doc/extend.texi
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr36992-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr36992-2.c

Comment 22 H.J. Lu 2008-08-07 13:18:21 UTC

Fixed.