Bug 107563 - __builtin_shufflevector fails to pshufd instructions under default x86_64 compilation toggle which is the sse2 one
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2022-11-07 22:56 UTC by cqwrteur
Modified: 2026-01-13 22:54 UTC

See Also:
Host: x86_64-linux-gnu
Target: x86_64-*-* i?86-*-*
Build: x86_64-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:


Description cqwrteur 2022-11-07 22:56:27 UTC
#if defined(__SSE2__)

using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
void foo(temp_vec_type& v) noexcept
{
	v=__builtin_shufflevector(v,v,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
}

#endif

Compiling with
g++ -S pq.cc -Ofast
shows that SSE2 is enabled by default (the code inside #if defined(__SSE2__) is compiled), but the output uses neither
https://www.felixcloutier.com/x86/pshufb
nor
https://www.felixcloutier.com/x86/pshufd

whereas g++ -S pq.cc -Ofast -msse4.2 generates them correctly. This looks like a bug.
Comment 1 cqwrteur 2022-11-07 23:00:06 UTC
see

https://godbolt.org/z/1aM57z7jn

vs

https://godbolt.org/z/b356qzrMY

Clang, however, does the right thing here:

https://godbolt.org/z/hnfrnb694
Comment 2 cqwrteur 2022-11-07 23:05:43 UTC
(In reply to cqwrteur from comment #0)
> #if defined(__SSE2__)
> 
> using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
> void foo(temp_vec_type& v) noexcept
> {
> 	v=__builtin_shufflevector(v,v,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
> }
> 
> #endif
> 
> g++ -S pq.cc -Ofast
> proves sse2 is enabled by default, but it does not call
> https://www.felixcloutier.com/x86/pshufb
> neither
> https://www.felixcloutier.com/x86/pshufd
> 
> while g++ -S pq.cc -Ofast -msse4.2 will generate them correctly. Which is
> buggy

Sorry, pshufb is not SSE2 (it needs SSSE3). But pshufd is SSE2, so it can be used to generate the right instructions.
Comment 3 cqwrteur 2022-11-08 00:11:44 UTC
(In reply to cqwrteur from comment #2)
> (In reply to cqwrteur from comment #0)
> > #if defined(__SSE2__)
> > 
> > using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
> > void foo(temp_vec_type& v) noexcept
> > {
> > 	v=__builtin_shufflevector(v,v,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
> > }
> > 
> > #endif
> > 
> > g++ -S pq.cc -Ofast
> > proves sse2 is enabled by default, but it does not call
> > https://www.felixcloutier.com/x86/pshufb
> > neither
> > https://www.felixcloutier.com/x86/pshufd
> > 
> > while g++ -S pq.cc -Ofast -msse4.2 will generate them correctly. Which is
> > buggy
> 
> pshufb is sse3 sorry. but pshufd is sse2. It can be used for generating the
> right instruction.

https://godbolt.org/z/6baWWoE4e
BTW, -msse3 does not use pshufb either. I do not know why.
Comment 5 Hongtao.liu 2022-11-08 03:23:44 UTC
> 
> https://godbolt.org/z/6baWWoE4e
> BTW. -msse3 does not use pshufb either. i do not know why

It should be -mssse3: pshufb is an SSSE3 instruction, and -msse3 does not enable it.
Comment 6 Hongtao.liu 2022-11-08 03:33:03 UTC
pshufd only handles shuffles that move whole 32-bit elements, such as

void foo1(temp_vec_type& v) noexcept
{
	v=__builtin_shufflevector(v,v,12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3);
}

which is not the case for the byte-wise reversal in #c0.
Comment 7 cqwrteur 2022-11-08 06:11:08 UTC
(In reply to Hongtao.liu from comment #6)
> Shufd only handles
> 
> void foo1(temp_vec_type& v) noexcept
> {
> 	v=__builtin_shufflevector(v,v,12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3);
> }
> 
> Not the case in #c0.

I am using it for a byte swap.

Actually, clang has a solution for SSE2:

// Fragment: temp_vec (an x86_64_v16qi) and T (the element type) come from the surrounding code.
using x86_64_v4si [[__gnu__::__vector_size__ (16)]] = int;
using x86_64_v16qi [[__gnu__::__vector_size__ (16)]] = char;
using x86_64_v8hi [[__gnu__::__vector_size__ (16)]] = short;
constexpr x86_64_v16qi zero{};
if constexpr(sizeof(T)==8)
{
	// Widen the bytes to words, reverse all 8 words of each half, pack back to bytes.
	auto res0{__builtin_ia32_punpcklbw128(temp_vec,zero)};
	auto res1{__builtin_ia32_pshufd((x86_64_v4si)res0,78)};
	auto res2{__builtin_ia32_pshuflw((x86_64_v8hi)res1,27)};
	auto res3{__builtin_ia32_pshufhw(res2,27)};
	auto res4{__builtin_ia32_punpckhbw128(temp_vec,zero)};
	auto res5{__builtin_ia32_pshufd((x86_64_v4si)res4,78)};
	auto res6{__builtin_ia32_pshuflw((x86_64_v8hi)res5,27)};
	auto res7{__builtin_ia32_pshufhw(res6,27)};
	temp_vec=__builtin_ia32_packuswb128(res3,res7);
}
else if constexpr(sizeof(T)==4)
{
	// Same idea, but only reverse within each group of 4 words.
	auto res0{__builtin_ia32_punpcklbw128(temp_vec,zero)};
	auto res2{__builtin_ia32_pshuflw((x86_64_v8hi)res0,27)};
	auto res3{__builtin_ia32_pshufhw(res2,27)};
	auto res4{__builtin_ia32_punpckhbw128(temp_vec,zero)};
	auto res6{__builtin_ia32_pshuflw((x86_64_v8hi)res4,27)};
	auto res7{__builtin_ia32_pshufhw(res6,27)};
	temp_vec=__builtin_ia32_packuswb128(res3,res7);
}
else if constexpr(sizeof(T)==2)
{
	// Swap the two bytes of every 16-bit lane with two shifts and an or.
	using x86_64_v8hu [[__gnu__::__vector_size__ (16)]] = unsigned short;
	auto res0{(x86_64_v8hu)temp_vec};
	temp_vec=(x86_64_v16qi)((res0>>8)|(res0<<8));
}
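
For anyone who wants to try the sequence outside the original code base, here is a self-contained sketch of the sizeof(T)==8 path using the <emmintrin.h> intrinsics that correspond to the builtins above (_mm_unpacklo_epi8/_mm_unpackhi_epi8, _mm_shuffle_epi32, _mm_shufflelo_epi16, _mm_shufflehi_epi16, _mm_packus_epi16). The function names are invented for the example. The second function is my own variation: packing the two halves in the opposite order gives the full 16-byte reversal from comment 0.

```cpp
#include <cassert>     // only needed if you add checks like the one in the note below
#include <emmintrin.h> // SSE2 intrinsics

// Reverse the order of the 8 words in w (each word holds one zero-extended byte).
static inline __m128i reverse_words(__m128i w) noexcept
{
	w=_mm_shuffle_epi32(w,78);        // pshufd: swap the two 64-bit halves
	w=_mm_shufflelo_epi16(w,27);      // pshuflw: reverse the low 4 words
	return _mm_shufflehi_epi16(w,27); // pshufhw: reverse the high 4 words
}

// Byte swap of each 64-bit element, as in the sizeof(T)==8 branch.
__m128i bswap64_lanes_sse2(__m128i v) noexcept
{
	const __m128i zero=_mm_setzero_si128();
	const __m128i lo=reverse_words(_mm_unpacklo_epi8(v,zero)); // bytes 7..0 as words
	const __m128i hi=reverse_words(_mm_unpackhi_epi8(v,zero)); // bytes 15..8 as words
	return _mm_packus_epi16(lo,hi); // words are <=255, so saturation never triggers
}

// Full 16-byte reversal: same pieces, packed in the opposite order.
__m128i reverse_bytes_sse2(__m128i v) noexcept
{
	const __m128i zero=_mm_setzero_si128();
	const __m128i lo=reverse_words(_mm_unpacklo_epi8(v,zero));
	const __m128i hi=reverse_words(_mm_unpackhi_epi8(v,zero));
	return _mm_packus_epi16(hi,lo);
}
```

A quick check is to load 16 distinct bytes with _mm_loadu_si128, store the result with _mm_storeu_si128, and assert that `out[i]==in[15-i]` for reverse_bytes_sse2. This costs 9 instructions plus the zero register, so it is only interesting when pshufb is not available.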
Comment 8 cqwrteur 2022-11-08 06:11:42 UTC
for sse2 to do the __builtin_convertvector job yeah
Comment 9 cqwrteur 2022-11-08 06:14:21 UTC
(In reply to cqwrteur from comment #8)
> for sse2 to do the __builtin_convertvector job yeah

https://godbolt.org/z/dsf3WK58E

using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
void foo4(temp_vec_type& v) noexcept
{
	v=__builtin_shufflevector(v,v,1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14);
}

This is even more interesting.

foo4(char __vector(16)&): # @foo4(char __vector(16)&)
  movdqa (%rdi), %xmm0
  movdqa %xmm0, %xmm1
  psrlw $8, %xmm1
  psllw $8, %xmm0
  por %xmm1, %xmm0
  movdqa %xmm0, (%rdi)
  retq

clang generates this, using a pair of 16-bit shifts (psrlw/psllw) and por, which together rotate every 16-bit lane by 8 bits.
Comment 10 Hongtao.liu 2022-11-08 08:28:44 UTC
(In reply to cqwrteur from comment #9)
> (In reply to cqwrteur from comment #8)
> > for sse2 to do the __builtin_convertvector job yeah
> 
> https://godbolt.org/z/dsf3WK58E
> 
> using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
> void foo4(temp_vec_type& v) noexcept
> {
> 	v=__builtin_shufflevector(v,v,1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14);
> }
> 
> This is even more interesting.
> 
> foo4(char __vector(16)&): # @foo4(char __vector(16)&)
>   movdqa (%rdi), %xmm0
>   movdqa %xmm0, %xmm1
>   psrlw $8, %xmm1
>   psllw $8, %xmm0
>   por %xmm1, %xmm0
>   movdqa %xmm0, (%rdi)
>   retq
> 
> clang generates this. by using ror and or

This is an interesting case; the same applies to psrld/psrlq + pslld/psllq + por.
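
My reading of this (an assumption, the comment does not spell it out) is that the wider shifts cover byte shuffles that swap the two halves of each 32-bit or 64-bit lane. A sketch with the generic vector extensions, using invented names:

```cpp
#include <cassert> // only needed if you add checks like the one in the note below

using v16qi [[__gnu__::__vector_size__ (16)]] = char;
using v4su [[__gnu__::__vector_size__ (16)]] = unsigned int;
using v2du [[__gnu__::__vector_size__ (16)]] = unsigned long long;

// Rotate each 32-bit lane by 16 (psrld + pslld + por):
// equivalent to the byte mask 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13.
v16qi swap_halves_32(v16qi v) noexcept
{
	const v4su d=(v4su)v;
	return (v16qi)((d>>16)|(d<<16));
}

// Rotate each 64-bit lane by 32 (psrlq + psllq + por):
// equivalent to the byte mask 4,5,6,7,0,1,2,3,12,13,14,15,8,9,10,11.
v16qi swap_halves_64(v16qi v) noexcept
{
	const v2du q=(v2du)v;
	return (v16qi)((q>>32)|(q<<32));
}
```

Byte i of the result equals byte i^2 of the input for the 32-bit version and byte i^4 for the 64-bit version, e.g. `assert(swap_halves_32(v)[0]==v[2]);`.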
Comment 11 GCC Commits 2024-05-15 04:47:26 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:a71f90c5a7ae2942083921033cb23dcd63e70525

commit r15-499-ga71f90c5a7ae2942083921033cb23dcd63e70525
Author: Levy Hsu <admin@levyhsu.com>
Date:   Thu May 9 16:50:56 2024 +0800

    x86: Add 3-instruction subroutine vector shift for V16QI in ix86_expand_vec_perm_const_1 [PR107563]
    
    Hi All
    
    We've introduced a new subroutine in ix86_expand_vec_perm_const_1
    to optimize vector shifting for the V16QI type on x86.
    This patch uses a three-instruction sequence psrlw, psllw, and por
    to handle specific vector shuffle operations more efficiently.
    The change aims to improve assembly code generation for configurations
    supporting SSE2.
    
    Bootstrapped and tested on x86_64-linux-gnu, OK for trunk?
    
    Best
    Levy
    
    gcc/ChangeLog:
    
            PR target/107563
            * config/i386/i386-expand.cc (expand_vec_perm_psrlw_psllw_por): New
            subroutine.
            (ix86_expand_vec_perm_const_1): Call expand_vec_perm_psrlw_psllw_por.
    
    gcc/testsuite/ChangeLog:
    
            PR target/107563
            * g++.target/i386/pr107563-a.C: New test.
            * g++.target/i386/pr107563-b.C: New test.
Comment 12 Levy Hsu 2024-05-18 15:17:12 UTC
switch (d->vmode)
    {
    case E_V8QImode:
      if (!TARGET_MMX_WITH_SSE)
	return false;
      mode = V4HImode;
      gen_shr = gen_ashrv4hi3; /* should be gen_lshrv4hi3 */
      gen_shl = gen_ashlv4hi3;
      gen_or = gen_iorv4hi3;
      break;
    case E_V16QImode:
      mode = V8HImode;
      gen_shr = gen_vlshrv8hi3;
      gen_shl = gen_vashlv8hi3;
      gen_or = gen_iorv8hi3;
      break;
    default: return false;
    }

Under V8QImode this should be gen_lshrv4hi3 rather than gen_ashrv4hi3.

I mistakenly used gen_ashrv4hi3 because the names are so similar, and did not catch it. gen_lshrv4hi3 is the logical right shift that is needed here; an arithmetic shift copies the sign bit into the upper byte of each lane.

Will send a patch soon.
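
A scalar sketch of what goes wrong, on a single 16-bit lane (the function names are made up; this is not the GCC code): with an arithmetic right shift, any lane whose top bit is set gets 0xFF in the upper byte instead of the byte that should have been shifted in by the left shift.

```cpp
#include <cassert> // only needed if you add checks like the ones in the note below
#include <cstdint>

// Byte swap of one 16-bit lane with a logical right shift (psrlw behaviour).
constexpr std::uint16_t swap_logical(std::uint16_t w) noexcept
{
	return static_cast<std::uint16_t>((w>>8)|(w<<8));
}

// The same with an arithmetic right shift (psraw behaviour).
// Right-shifting a negative signed value is arithmetic on GCC and clang
// (and required to be since C++20).
constexpr std::uint16_t swap_arithmetic(std::uint16_t w) noexcept
{
	const std::int16_t s=static_cast<std::int16_t>(w);
	return static_cast<std::uint16_t>(static_cast<std::uint16_t>(s>>8)|(w<<8));
}
```

For example, `swap_logical(0x8001)` gives 0x0180, while `swap_arithmetic(0x8001)` gives 0xFF80. Lanes with the top bit clear, such as 0x1234, come out as 0x3412 either way, which is probably why the mistake is easy to miss in testing.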
Comment 13 Cory Fields 2026-01-13 17:07:20 UTC
Chiming in to say I'm seeing the exact same thing on trunk.

Here's a minimal reproducer:

using vec256 = unsigned __attribute__((__vector_size__(32)));

void slow_rotate(vec256& x)
{
    x = __builtin_shufflevector(x, x, 3, 0, 1, 2, 7, 4, 5, 6);
}

void fast_rotate(vec256& x)
{
    x = vec256{x[3], x[0], x[1], x[2], x[7], x[4], x[5], x[6]};
}

Godbolt link: https://godbolt.org/z/YY9P7xKbh

fast_rotate generates pshufd as expected on x86_64 with generic compilation flags. slow_rotate is *much* slower.
Comment 14 Andrew Pinski 2026-01-13 17:21:37 UTC
Generic vector lowering does not split
  _2 = VEC_PERM_EXPR <_1, _1, { 3, 0, 1, 2, 7, 4, 5, 6 }>;

into 2 PERMs.
Comment 15 Levy Hsu 2026-01-13 22:54:27 UTC
Tree (lower/tree) dump:
https://godbolt.org/z/o7GrvjMqq
slow_rotate still contains a single wide
VEC_PERM_EXPR <_1, _1, {3,0,1,2,7,4,5,6}>
while fast_rotate is already expressed as element extracts + vector constructor.

RTL (expand) dump:
https://godbolt.org/z/WT9cqbx7h
fast_rotate expands to two 128-bit vec_select:V4SI shuffles (one per 16B half), which is the expected shape to select pshufd on an SSE2 baseline. In contrast, slow_rotate expands to scalar loads/stores (no vector perm/select remains), so the backend never sees a permute it can map to pshufd.

So this looks like a generic vector-lowering / tree -> RTL expansion gap for non-native (32B) VEC_PERM_EXPR on SSE2 targets: masks that do not cross the 128-bit boundary should be decomposed into two 16B perms, but currently fall back to scalarization.