Bug 32708 - _mm_cvtsi64x_si128() and _mm_cvtsi128_si64x() inefficient
Summary: _mm_cvtsi64x_si128() and _mm_cvtsi128_si64x() inefficient
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.1.2
Importance: P3 enhancement
Target Milestone: 4.3.0
Assignee: Uroš Bizjak
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2007-07-09 21:05 UTC by Ryan Johnson
Modified: 2007-07-10 19:38 UTC
CC List: 2 users

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-07-10 10:36:00


Description Ryan Johnson 2007-07-09 21:05:21 UTC
Consider the following functions (compiled with "g++-4.1.2 -msse3 -O3"):
#include <emmintrin.h>
__m128i int2vector(int i) { return _mm_cvtsi32_si128(i); }
int vector2int(__m128i i) { return _mm_cvtsi128_32(i); }
__m128i long2vector(long long i) { return _mm_cvtsi64x_si128(i); }
long long vector2long(__m128i) { return _mm_cvtsi128_si64x(i); }

They become:

_Z10int2vectori:
        movd    %edi, %xmm0
        ret
_Z10vector2intU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret
_Z11long2vectorx:
        movd    %rdi, %mm0
        movq    %rdi, -8(%rsp)
        movq2dq %mm0, %xmm0
        ret
_Z11vector2longU8__vectorx:
        movd    %xmm0, %rax
        movq    %xmm0, -16(%rsp)
        ret

long2vector() should use a simple MOVQ instruction the way int2vector() uses MOVD. It appears that the reason for the stack access is that the original code used a reg64->mem->mm->xmm path, which the optimizer partly noticed; gcc-4.3-20070617 leaves the full path in place.

Also, do the vector2<X>() functions really need to access the stack?

Finally, I've noticed several places where instructions involving 64-bit values use the "d/l" suffix (e.g. "long i = 0" ==> "xorl %eax, %eax"), or where 32-bit operations use 64-bit registers (e.g. "movd %xmm0, %rax" above). Are those generally features, bugs, or a "who cares"?
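
For context on that last question: on x86-64, any write to a 32-bit register implicitly zero-extends into the full 64-bit register, so the "l"-suffixed form is a correct and shorter encoding for a 64-bit zero. A minimal sketch (not from the report):

/* On x86-64, writing %eax zeroes the upper 32 bits of %rax, so gcc
   may legitimately emit "xorl %eax, %eax" for a 64-bit zero.  */
long long zero64 (void)
{
  return 0;   /* typically compiles to: xorl %eax, %eax ; ret */
}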
Comment 1 Richard Biener 2007-07-10 09:14:14 UTC
Fixed testcase:

#include <emmintrin.h>
__m128i int2vector(int i) { return _mm_cvtsi32_si128(i); }
int vector2int(__m128i i) { return _mm_cvtsi128_si32(i); }
__m128i long2vector(long long i) { return _mm_cvtsi64x_si128(i); }
long long vector2long(__m128i i) { return _mm_cvtsi128_si64x(i); }

mainline generates:

int2vector:
.LFB525:
        pxor    %xmm0, %xmm0
        movq    %rdi, -8(%rsp)
        movq    -8(%rsp), %xmm1
        movss   %xmm1, %xmm0
        ret
.LFE525:
        .size   int2vector, .-int2vector
        .p2align 4,,15
.globl long2vector
        .type   long2vector, @function
long2vector:
.LFB527:
        movq    %rdi, -8(%rsp)
        movq    -8(%rsp), %mm0
        movq2dq %mm0, %xmm0
        ret
.LFE527:
        .size   long2vector, .-long2vector
        .p2align 4,,15
.globl vector2long
        .type   vector2long, @function
vector2long:
.LFB528:
        movq    %xmm0, -16(%rsp)
        movq    -16(%rsp), %rax
        ret
.LFE528:
        .size   vector2long, .-vector2long
        .p2align 4,,15
.globl vector2int
        .type   vector2int, @function
vector2int:
.LFB526:
        movd    %xmm0, -12(%rsp)
        movl    -12(%rsp), %eax
        ret
Comment 2 Pawel Sikora 2007-07-10 09:23:50 UTC
This looks like a dup of PR30961.
Comment 3 Richard Biener 2007-07-10 09:37:11 UTC
I don't think so; the _mm_ intrinsics are expanded via target builtins.
Comment 4 Uroš Bizjak 2007-07-10 10:36:00 UTC
(In reply to comment #0)

> long2vector() should use a simple MOVQ instruction the way int2vector() uses
> MOVD. It appears that the reason for the stack access is that the original code
> used a reg64->mem->mm->xmm path, which the optimizer partly noticed;
> gcc-4.3-20070617 leaves the full path in place.
> 
> Also, do the vector2<X>() functions really need to access the stack?

Stack access is the remnant of the partially implemented TARGET_INTER_UNIT_MOVES, and this was fully fixed in 4.3. The fact that gcc-4.3 leaves the full path in place is due to a missing pattern for vec_concat:V2DI (implemented in the patch below), where a 64-bit register can be concatenated with zero (on 64-bit targets) using the movq instruction.
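
In source terms, the missing pattern is exactly the "concatenate a 64-bit value with zero" operation. A minimal sketch of the equivalence (assuming SSE2 and a 64-bit target; function names are illustrative, not from the report):

#include <emmintrin.h>

/* _mm_cvtsi64x_si128(x) puts x in the low qword and zeroes the high
   qword, i.e. it is a vec_concat:V2DI of x with zero; the two
   functions below are therefore equivalent, and both can be matched
   by the new pattern.  */
__m128i from_gpr (long long x)    { return _mm_cvtsi64x_si128 (x); }
__m128i concat_zero (long long x) { return _mm_set_epi64x (0, x); }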

With attached patch, 4.3 generates:

long2vector:
.LFB4:
        movq    %rdi, %xmm0
        ret
.LFE4:

for -march=core2 (TARGET_INTER_UNIT_MOVES allowed)

and

long2vector:
.LFB4:
        movq    %rdi, -8(%rsp)
        movq    -8(%rsp), %xmm0
        ret
.LFE4:

for -march=k8 (TARGET_INTER_UNIT_MOVES not allowed).

> Finally, I've noticed several places where instructions involving 64-bit values
> use the "d/l" suffix (e.g. "long i = 0" ==> "xorl %eax, %eax"), or 32-bit
> operations that use 64-bit registers (e.g. "movd %xmm0, %rax" above). Are those
> generally features, bugs, or a "who cares?"

These are "who cares", produced by reload to satisfy register constraints (i.e. certain alternatives missing as an attempt to solve INTER_UNIT_MOVES requirements). They are harmles.

Index: sse.md
===================================================================
--- sse.md      (revision 126510)
+++ sse.md      (working copy)
@@ -4717,7 +4717,7 @@
        (vec_concat:V2DI
          (match_operand:DI 1 "nonimmediate_operand" "  m,*y ,0 ,0,0,m")
          (match_operand:DI 2 "vector_move_operand"  "  C,  C,Yt,x,m,0")))]
-  "TARGET_SSE"
+  "!TARGET_64BIT && TARGET_SSE"
   "@
    movq\t{%1, %0|%0, %1}
    movq2dq\t{%1, %0|%0, %1}
@@ -4728,6 +4728,23 @@
   [(set_attr "type" "ssemov,ssemov,sselog,ssemov,ssemov,ssemov")
    (set_attr "mode" "TI,TI,TI,V4SF,V2SF,V2SF")])
 
+(define_insn "*vec_concatv2di_rex"
+  [(set (match_operand:V2DI 0 "register_operand"     "=Yt,Yi,!Yt,Yt,x,x,x")
+       (vec_concat:V2DI
+         (match_operand:DI 1 "nonimmediate_operand" "  m,r ,*y ,0 ,0,0,m")
+         (match_operand:DI 2 "vector_move_operand"  "  C,C ,C  ,Yt,x,m,0")))]
+  "TARGET_64BIT"
+  "@
+   movq\t{%1, %0|%0, %1}
+   movq\t{%1, %0|%0, %1}
+   movq2dq\t{%1, %0|%0, %1}
+   punpcklqdq\t{%2, %0|%0, %2}
+   movlhps\t{%2, %0|%0, %2}
+   movhps\t{%2, %0|%0, %2}
+   movlps\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssemov,ssemov,ssemov,sselog,ssemov,ssemov,ssemov")
+   (set_attr "mode" "TI,TI,TI,TI,V4SF,V2SF,V2SF")])
+
 (define_expand "vec_setv2di"
   [(match_operand:V2DI 0 "register_operand" "")
    (match_operand:DI 1 "register_operand" "")
Comment 5 uros 2007-07-10 19:27:10 UTC
Subject: Bug 32708

Author: uros
Date: Tue Jul 10 19:26:58 2007
New Revision: 126523

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=126523
Log:
        PR target/32708
        * config/i386/sse.md (vec_concatv2di): Disable for TARGET_64BIT.
        (*vec_concatv2di_rex): New insn pattern.

testsuite/ChangeLog:

        PR target/32708
        * gcc.target/i386/pr32708-1.c: New test.
        * gcc.target/i386/pr32708-2.c: Ditto.
        * gcc.target/i386/pr32708-3.c: Ditto.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr32708-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr32708-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr32708-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
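
The new tests look for the single-movq sequence. A hypothetical reconstruction of one of them, for illustration only (the committed files may differ in directives and options):

/* { dg-do compile { target lp64 } } */
/* { dg-options "-O2 -march=core2" } */

#include <emmintrin.h>

__m128i
long2vector (long long i)
{
  return _mm_cvtsi64x_si128 (i);
}

/* With inter-unit moves allowed, the conversion should be one
   GPR-to-XMM movq, with no trip through memory or MMX.  */
/* { dg-final { scan-assembler "movq" } } */
/* { dg-final { scan-assembler-not "movq2dq" } } */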

Comment 6 Uroš Bizjak 2007-07-10 19:38:41 UTC
(In reply to comment #0)

> long2vector() should use a simple MOVQ instruction the way int2vector() uses
> MOVD. It appears that the reason for the stack access is that the original code
> used a reg64->mem->mm->xmm path, which the optimizer partly noticed;
> gcc-4.3-20070617 leaves the full path in place.

A note to the reporter: unless a noticeable performance regression is found, missed-optimization bugs should be reported against mainline gcc. Usually, mainline sources implement new infrastructure to support new optimizations, and this infrastructure is rarely backported to release branches. Sometimes the requested optimization is already implemented in mainline, again with little or no chance of being backported.

If you have a particular (complex) application, you are most welcome to try to compile it with the latest gcc sources. This way, compile-time, runtime and performance issues can be fixed early in the gcc development cycle, benefiting both your application and gcc development.

Otherwise, the patch is committed to mainline. Not a regression on branches.