Bug 43147 - SSE shuffle merge
Summary: SSE shuffle merge
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 12.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2010-02-23 01:27 UTC by Liran Nuna
Modified: 2022-05-26 06:53 UTC
CC List: 6 users

See Also:
Host: x86_64-linux-gnu
Target: x86_64-linux-gnu
Build: x86_64-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2010-02-23 01:42:16


Attachments
example patch (untested) (318 bytes, patch)
2018-12-28 18:50 UTC, Marc Glisse
ix86_gimple_fold_builtin patch (1.00 KB, patch)
2018-12-29 20:26 UTC, Marc Glisse
A patch (872 bytes, patch)
2021-08-21 23:45 UTC, H.J. Lu

Description Liran Nuna 2010-02-23 01:27:31 UTC
I've noticed that GCC (my current version is 4.4.1) doesn't fully optimize SSE shuffle merges, as seen in this example: 

#include <xmmintrin.h>
 
extern void printv(__m128 m);
 
int main()
{
	m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
	m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
	printv(m);
 
	return 0;
}

This code generates the following assembly:

	movaps	.LC1, %xmm1
	shufps	$201, %xmm1, %xmm1
	shufps	$45, %xmm1, %xmm1    ; <-- Both should merge to 78
	movaps	%xmm1, %xmm0
	movaps	%xmm1, -24(%ebp)

	.LC0:
		.long	1065353216 ; 1.0f
		.long	1073741824 ; 2.0f
		.long	1077936128 ; 3.0f
		.long	1082130432 ; 4.0f

Would be nice to see it as an enhancement!
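(For reference, an illustration that is not part of the original report: the two immediates can be composed by hand. Each shufps immediate holds four 2-bit lane selectors, and because both operands of each shuffle are the same register here, applying 0xC9 and then 0x2D is equivalent to applying the single mask computed by the hypothetical helper compose() below, which is indeed 0x4E.)

#include <stdio.h>

/* Compose two single-register shufps immediates: applying `first` and
   then `second` is equivalent to applying the returned mask once.  */
static unsigned compose (unsigned first, unsigned second)
{
  unsigned result = 0;
  for (int lane = 0; lane < 4; lane++)
    {
      unsigned sel2 = (second >> (2 * lane)) & 3; /* lane the 2nd shuffle reads */
      unsigned sel1 = (first >> (2 * sel2)) & 3;  /* where that lane came from   */
      result |= sel1 << (2 * lane);
    }
  return result;
}

int main (void)
{
  printf ("0x%02X\n", compose (0xC9, 0x2D)); /* prints 0x4E */
  return 0;
}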
Comment 1 Liran Nuna 2010-02-23 01:37:00 UTC
It appears I am missing a line in the code I posted:

#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
        __m128 m = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
        m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
        m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
        printv(m);

        return 0;
}
Comment 2 Andrew Pinski 2010-02-23 01:42:08 UTC
I think that is because nothing simplifies:
    (vec_select:V4SF (vec_concat:V8SF (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 62)
                    (reg:V4SF 62))
                (parallel [
                        (const_int 1 [0x1])
                        (const_int 2 [0x2])
                        (const_int 4 [0x4])
                        (const_int 7 [0x7])
                    ]))
            (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 62)
                    (reg:V4SF 62))
                (parallel [
                        (const_int 1 [0x1])
                        (const_int 2 [0x2])
                        (const_int 4 [0x4])
                        (const_int 7 [0x7])
                    ])))
        (parallel [
                (const_int 1 [0x1])
                (const_int 3 [0x3])
                (const_int 6 [0x6])
                (const_int 4 [0x4])
            ]))
Comment 3 Andrew Pinski 2010-02-23 01:42:16 UTC
Confirmed.
Comment 4 Marc Glisse 2011-10-23 08:40:04 UTC
Apart from combining the two shuffles, I would also expect the set and the shuffle in comment 1 to be combined. I was going to report the following, but it already falls under this bug:
__m128d f(double d){
	__m128d x=_mm_setr_pd(-d,d);
	return _mm_shuffle_pd(x,x,1);
}

	movsd	.LC0(%rip), %xmm1
	xorpd	%xmm0, %xmm1
	movapd	%xmm1, %xmm2
	unpcklpd	%xmm0, %xmm2
	movapd	%xmm2, %xmm0
	shufpd	$1, %xmm2, %xmm0

There are some extra moves, as usual, and a shuffle that could be combined with the unpack.
(Obviously we don't write such code by hand; it only ends up looking this way after inlining.)
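(For illustration only, with a made-up name: the shuffle undoes the element order chosen by the setr, so the whole function reduces by hand to a single setr with the sign flip moved to the other lane.)

#include <emmintrin.h>

/* Hand-simplified equivalent of f() above: shufpd with mask 1 swaps the
   two lanes back, leaving only the negation on the second element.  */
__m128d f_by_hand (double d)
{
  return _mm_setr_pd (d, -d); /* ideally a duplicate of d plus one xorpd on the high lane */
}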
Comment 5 Marc Glisse 2012-05-07 14:52:46 UTC
Actually, why isn't constant propagation happening for the example in comment #1? simplify-rtx.c contains code to that effect; it might just need a little tweaking...

(yes, I know the original report is probably interested in non-constant operands)
Comment 6 Marc Glisse 2018-12-28 18:50:13 UTC
Created attachment 45303 [details]
example patch (untested)

Making the meaning of shuffles visible in GIMPLE could help a bit (although it wouldn't solve the problem completely because IIRC we don't dare combine shuffles, since it is hard to find an optimal expansion for a shuffle and we might pessimize some cases).
This patch is one simple way to make _mm_shuffle_pd less opaque. _mm_shuffle_ps would be a bit longer but still manageable. It has the drawback that it does not diagnose when the mask is not a constant, or not between 0 and 3, and I am not sure how to do that from the C code. An alternative would be to keep the current builtin but turn it into a vec_perm_expr in ix86_gimple_fold_builtin, which could include diagnostics.
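(A rough sketch of this kind of header-side change, assuming a generic __builtin_shuffle is used so the permutation becomes a VEC_PERM_EXPR in GIMPLE. This is not attachment 45303, and the names __shufpd_sel and my_shuffle_pd are made up; the second selector uses the (__mask & 2) ? 3 : 2 form noted later in comment 8.)

#include <emmintrin.h>

/* Hypothetical reimplementation: __builtin_shuffle's selector vector
   indexes the concatenation {__A, __B}, i.e. 0-1 pick from __A and 2-3
   pick from __B.  Note the drawback mentioned above: nothing diagnoses
   a __mask that is not a constant or not between 0 and 3.  */
typedef long long __shufpd_sel __attribute__ ((__vector_size__ (16)));

static __inline __m128d
my_shuffle_pd (__m128d __A, __m128d __B, const int __mask)
{
  __shufpd_sel __sel = { __mask & 1, (__mask & 2) ? 3 : 2 };
  return __builtin_shuffle (__A, __B, __sel);
}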
Comment 7 Jakub Jelinek 2018-12-28 20:50:19 UTC
(In reply to Marc Glisse from comment #6)
> Created attachment 45303 [details]
> example patch (untested)
> 
> Making the meaning of shuffles visible in GIMPLE could help a bit (although
> it wouldn't solve the problem completely because IIRC we don't dare combine
> shuffles, since it is hard to find an optimal expansion for a shuffle and we
> might pessimize some cases).
> This patch is one simple way to make _mm_shuffle_pd less opaque.
> _mm_shuffle_ps would be a bit longer but still manageable. It has the
> drawback that it does not diagnose when the mask is not a constant, or not
> between 0 and 3, and I am not sure how to do that from the C code. An
> alternative would be to keep the current builtin but turn it into a
> vec_perm_expr in ix86_gimple_fold_builtin, which could include diagnostics.

I think doing it in ix86_gimple_fold_builtin is better.
Comment 8 Marc Glisse 2018-12-29 20:26:16 UTC
Created attachment 45306 [details]
ix86_gimple_fold_builtin patch

Like this then?

I realized (because of the testsuite) that we do not currently validate that the 3rd argument is a 2-bit immediate (the error message on non-constants is misleading), and Intel's documentation indeed seems to indicate that any integer is valid; we only look at the two lowest bits.

If possible, I think it would be nice to reduce the number of builtins; that's why I started with a patch to emmintrin.h (I now know that it should use (mask&2)?3:2 for the second index). For validation, something like:

  if(!__builtin_constant_p(__mask)) complain();

with

extern void complain(void) __attribute__((error("argument must be a constant")));

may be good enough. But we still need to handle -O0... Yeah, maybe ix86_gimple_fold_builtin is easier :-(
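(A self-contained sketch of that validation idea, with made-up names and not taken from any committed patch; the shuffle body mirrors the sketch after comment 6. As noted above, the error attribute only triggers when the offending call survives optimization, which is exactly why -O0 stays problematic.)

#include <emmintrin.h>

/* Hypothetical helper: a call to it is diagnosed at compile time if it
   is not eliminated as dead code, so a non-constant __mask produces an
   error -- but only when optimization removes the branch for constant
   masks, hence the -O0 caveat.  */
extern void __shufpd_mask_not_constant (void)
  __attribute__ ((__error__ ("argument must be a constant")));

typedef long long __shufpd_sel2 __attribute__ ((__vector_size__ (16)));

static __inline __m128d __attribute__ ((__always_inline__))
checked_shuffle_pd (__m128d __A, __m128d __B, const int __mask)
{
  if (!__builtin_constant_p (__mask))
    __shufpd_mask_not_constant ();
  __shufpd_sel2 __sel = { __mask & 1, (__mask & 2) ? 3 : 2 };
  return __builtin_shuffle (__A, __B, __sel);
}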
Comment 9 Allan Jensen 2019-05-19 19:45:39 UTC
(In reply to Marc Glisse from comment #6)
> Created attachment 45303 [details]
> example patch (untested)
> 
> Making the meaning of shuffles visible in GIMPLE could help a bit (although
> it wouldn't solve the problem completely because IIRC we don't dare combine
> shuffles, since it is hard to find an optimal expansion for a shuffle and we
> might pessimize some cases).

In some other cases there are checks that a combined new tree can be generated as a single instruction, and the combination is only done in that case. As soon as the compiler has SSSE3 available, we can shuffle anything as a single instruction, so combining shuffles is always safe and fast.
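(Illustration of that point, assuming -mssse3 and a made-up byte permutation: any constant single-source byte shuffle like this comes out as one pshufb, with the control vector loaded from the constant pool.)

/* Arbitrary constant permutation of the 16 bytes of one register; with
   SSSE3 this expands to a single pshufb plus a load of the constant
   control vector, so merging two such shuffles can never make things
   worse.  */
typedef unsigned char v16qi __attribute__ ((__vector_size__ (16)));

v16qi arbitrary_byte_shuffle (v16qi x)
{
  const v16qi sel = { 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3 };
  return __builtin_shuffle (x, sel);
}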
Comment 10 Marc Glisse 2019-05-20 14:54:00 UTC
Author: glisse
Date: Mon May 20 14:53:29 2019
New Revision: 271422

URL: https://gcc.gnu.org/viewcvs?rev=271422&root=gcc&view=rev
Log:
[i386] Fold __builtin_ia32_shufpd to VEC_PERM_EXPR

2019-05-20  Marc Glisse  <marc.glisse@inria.fr>

	PR rtl-optimization/43147
	* config/i386/i386.c (ix86_gimple_fold_builtin): Handle
	IX86_BUILTIN_SHUFPD.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
Comment 11 Andrew Pinski 2021-08-21 22:44:06 UTC
We produce:
Trying 5, 7 -> 11:
    5: r86:V4SF=[`*.LC0']
      REG_EQUAL const_vector
    7: r85:V4SF=vec_select(vec_concat(r86:V4SF,r86:V4SF),parallel)
      REG_DEAD r86:V4SF
      REG_EQUAL const_vector
   11: r88:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
      REG_DEAD r85:V4SF
      REG_EQUAL const_vector
Failed to match this instruction:
(set (reg:V4SF 88)
    (const_vector:V4SF [
            (const_double:SF 2.0e+0 [0x0.8p+2])
            (const_double:SF 1.0e+0 [0x0.8p+1])
            (const_double:SF 4.0e+0 [0x0.8p+3])
            (const_double:SF 3.0e+0 [0x0.cp+2])
        ]))

Which means the vec_selects are merging at the RTL level just fine.

Anyway, if the target expanded __builtin_ia32_shufps to a VEC_PERM_EXPR, this would have been optimized at the GIMPLE level.  So this is a target issue.
Comment 12 H.J. Lu 2021-08-21 23:45:42 UTC
Created attachment 51345 [details]
A patch
Comment 13 H.J. Lu 2021-08-22 12:54:52 UTC
A patch is posted at

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577884.html
Comment 14 H.J. Lu 2021-08-24 01:57:42 UTC
From

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577991.html

Trying 5 -> 7:
    5: r85:V4SF=[`*.LC0']
      REG_EQUAL const_vector
    7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
      REG_DEAD r85:V4SF
      REG_EQUAL const_vector
Failed to match this instruction:
(set (reg:V4SF 84)
    (const_vector:V4SF [
            (const_double:SF 3.0e+0 [0x0.cp+2])
            (const_double:SF 2.0e+0 [0x0.8p+2])
            (const_double:SF 4.0e+0 [0x0.8p+3])
            (const_double:SF 1.0e+0 [0x0.8p+1])
        ]))

(insn 5 2 7 2 (set (reg:V4SF 85)
        (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16 A128])) "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19 1600 {movv4sf_internal}
     (expr_list:REG_EQUAL (const_vector:V4SF [
                (const_double:SF 4.0e+0 [0x0.8p+3])
                (const_double:SF 3.0e+0 [0x0.cp+2])
                (const_double:SF 2.0e+0 [0x0.8p+2])
                (const_double:SF 1.0e+0 [0x0.8p+1])
            ])
        (nil)))
(insn 7 5 11 2 (set (reg:V4SF 84)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 85)
                (reg:V4SF 85))
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 4 [0x4])
                    (const_int 7 [0x7])
                ]))) "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19 3015 {sse_shufps_v4sf}
     (expr_list:REG_DEAD (reg:V4SF 85)
        (expr_list:REG_EQUAL (const_vector:V4SF [
                    (const_double:SF 3.0e+0 [0x0.cp+2])
                    (const_double:SF 2.0e+0 [0x0.8p+2])
                    (const_double:SF 4.0e+0 [0x0.8p+3])
                    (const_double:SF 1.0e+0 [0x0.8p+1])
                ])
            (nil))))

I think pass_combine should be extended to force the illegitimate constant
to the constant pool and recog the load insn again. It looks like a general
optimization that is better not done in the backend.
Comment 15 Hongtao.liu 2021-08-25 08:42:52 UTC
> I think pass_combine should be extended to force the illegitimate constant
> to the constant pool and recog the load insn again. It looks like a general
> optimization that is better not done in the backend.

The issue can also be solved by folding __builtin_ia32_shufps to a GIMPLE VEC_PERM_EXPR; i.e. the testcase below doesn't have the problem:

typedef int v4si __attribute__((vector_size (16)));

v4si
foo ()
{
    v4si a = __extension__ (v4si) {4, 3, 2, 1};
    v4si b = __extension__ (v4si) {5, 6, 7, 8};
    v4si c = __builtin_shufflevector (a, b, 1, 4, 2, 7);
    v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 });
    return d;
}

foo():
  movdqa .LC0(%rip), %xmm0
  ret
.LC0:
  .long 8
  .long 2
  .long 3
  .long 5
Comment 16 Andrew Pinski 2021-08-25 08:52:19 UTC
(In reply to Hongtao.liu from comment #15)
> > I think pass_combine should be extended to force the illegitimate constant
> > to the constant pool and recog the load insn again. It looks like a general
> > optimization that is better not done in the backend.
> 
> The issue can also be solved by folding __builtin_ia32_shufps to a GIMPLE
> VEC_PERM_EXPR; i.e. the testcase below doesn't have the problem:
> 
> typedef int v4si __attribute__((vector_size (16)));
> 
> v4si
> foo ()
> {
>     v4si a = __extension__ (v4si) {4, 3, 2, 1};
>     v4si b = __extension__ (v4si) {5, 6, 7, 8};
>     v4si c = __builtin_shufflevector (a, b, 1, 4, 2, 7);
>     v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 });
>     return d;
> }

But that is because we constant-fold PERMs at the GIMPLE level.
Combining VEC_PERM_EXPRs at the GIMPLE level is PR 54346; note I found this while looking at other issues too :).
Comment 17 Marc Glisse 2021-08-25 08:54:11 UTC
(In reply to Hongtao.liu from comment #15)
> The issue can also be solved by folding __builtin_ia32_shufps to gimple
> VEC_PERM_EXPR,

Didn't you post a patch to do that last year? What happened to it?
Comment 18 Hongtao.liu 2021-08-25 09:20:08 UTC
(In reply to Marc Glisse from comment #17)
> (In reply to Hongtao.liu from comment #15)
> > The issue can also be solved by folding __builtin_ia32_shufps to gimple
> > VEC_PERM_EXPR,
> 
> Didn't you post a patch to do that last year? What happened to it?

I almost forgot about it; let me retest my patch. It's at https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562029.html
Comment 19 CVS Commits 2021-08-27 00:51:47 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:0fa4787bf34b173ce6f198e99b6f6dd8a3f98014

commit r12-3177-g0fa4787bf34b173ce6f198e99b6f6dd8a3f98014
Author: liuhongt <hongtao.liu@intel.com>
Date:   Fri Dec 11 19:02:43 2020 +0800

    Fold more shuffle builtins to VEC_PERM_EXPR.
    
    A follow-up to https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html
    
    gcc/
            PR target/98167
            PR target/43147
            * config/i386/i386.c (ix86_gimple_fold_builtin): Fold
            IX86_BUILTIN_SHUFPD512, IX86_BUILTIN_SHUFPS512,
            IX86_BUILTIN_SHUFPD256, IX86_BUILTIN_SHUFPS,
            IX86_BUILTIN_SHUFPS256.
            (ix86_masked_all_ones): New function.
    
    gcc/testsuite/
            * gcc.target/i386/avx512f-vshufpd-1.c: Adjust testcase.
            * gcc.target/i386/avx512f-vshufps-1.c: Adjust testcase.
            * gcc.target/i386/pr43147.c: New test.
Comment 20 Hongtao.liu 2021-08-27 01:00:46 UTC
Fixed in GCC 12; GCC now generates optimal code.

main:
.LFB532:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movaps	.LC0(%rip), %xmm0
	call	printv
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE532:
	.size	main, .-main
	.section	.rodata.cst16,"aM",@progbits,16
	.align 16
.LC0:
	.long	1073741824
	.long	1065353216
	.long	1082130432
	.long	1077936128
	.ident	"GCC: (GNU) 12.0.0 20210825 (experimental)"
	.section	.note.GNU-stack,"",@progbits