#if defined(__SSE2__)

using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
void foo(temp_vec_type& v) noexcept
{
    v=__builtin_shufflevector(v,v,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
}

#endif

g++ -S pq.cc -Ofast
proves SSE2 is enabled by default, but it calls neither
https://www.felixcloutier.com/x86/pshufb
nor
https://www.felixcloutier.com/x86/pshufd

while g++ -S pq.cc -Ofast -msse4.2 generates them correctly, which suggests the default codegen is buggy.
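For reference, the byte reversal is a single pshufb once SSSE3 is available; a minimal intrinsics sketch (an illustration, not from the original testcase):

#include <tmmintrin.h>  // SSSE3

__m128i reverse_bytes(__m128i v)
{
    // pshufb: result byte i = source byte mask[i]; mask[i] = 15 - i reverses.
    // _mm_set_epi8 lists bytes from element 15 down to element 0.
    const __m128i mask = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    return _mm_shuffle_epi8(v, mask);  // one pshufb
}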
See https://godbolt.org/z/1aM57z7jn vs https://godbolt.org/z/b356qzrMY

Clang does the right thing here: https://godbolt.org/z/hnfrnb694
(In reply to cqwrteur from comment #0)
> g++ -S pq.cc -Ofast
> proves sse2 is enabled by default, but it does not call pshufb or pshufd
> [...]

pshufb is sse3, sorry. But pshufd is sse2; it can be used to generate the right instruction.
(In reply to cqwrteur from comment #2)
> pshufb is sse3, sorry. But pshufd is sse2; it can be used to generate the
> right instruction.

https://godbolt.org/z/6baWWoE4e

BTW, -msse3 does not use pshufb either. I do not know why.
(In reply to cqwrteur from comment #3)
> https://godbolt.org/z/6baWWoE4e
>
> BTW, -msse3 does not use pshufb either. I do not know why.

It should be -mssse3: pshufb is an SSSE3 instruction, not SSE3.
pshufd only handles dword-granularity shuffles such as

void foo1(temp_vec_type& v) noexcept
{
    v=__builtin_shufflevector(v,v,12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3);
}

not the byte-granularity case in #c0.
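For illustration, that dword reversal in intrinsic form (a sketch; the immediate 0x1B selects source dwords 3,2,1,0):

#include <emmintrin.h>  // SSE2

__m128i reverse_dwords(__m128i v)
{
    // imm 0x1B = (0<<6)|(1<<4)|(2<<2)|3: dst dword i = src dword (3 - i).
    return _mm_shuffle_epi32(v, 0x1B);  // one pshufd
}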
(In reply to Hongtao.liu from comment #6)
> pshufd only handles dword-granularity shuffles [...],
> not the byte-granularity case in #c0.

I am actually using it for byte swap; clang has a solution:

using x86_64_v4si [[__gnu__::__vector_size__ (16)]] = int;
using x86_64_v16qi [[__gnu__::__vector_size__ (16)]] = char;
using x86_64_v8hi [[__gnu__::__vector_size__ (16)]] = short;

constexpr x86_64_v16qi zero{};
if constexpr(sizeof(T)==8)
{
    // Widen each byte to a word (low/high half separately), permute at
    // dword/word granularity, then pack the halves back down to bytes.
    auto res0{__builtin_ia32_punpcklbw128(temp_vec,zero)};
    auto res1{__builtin_ia32_pshufd((x86_64_v4si)res0,78)};
    auto res2{__builtin_ia32_pshuflw((x86_64_v8hi)res1,27)};
    auto res3{__builtin_ia32_pshufhw(res2,27)};
    auto res4{__builtin_ia32_punpckhbw128(temp_vec,zero)};
    auto res5{__builtin_ia32_pshufd((x86_64_v4si)res4,78)};
    auto res6{__builtin_ia32_pshuflw((x86_64_v8hi)res5,27)};
    auto res7{__builtin_ia32_pshufhw(res6,27)};
    temp_vec=__builtin_ia32_packuswb128(res3,res7);
}
else if constexpr(sizeof(T)==4)
{
    auto res0{__builtin_ia32_punpcklbw128(temp_vec,zero)};
    auto res2{__builtin_ia32_pshuflw((x86_64_v8hi)res0,27)};
    auto res3{__builtin_ia32_pshufhw(res2,27)};
    auto res4{__builtin_ia32_punpckhbw128(temp_vec,zero)};
    auto res6{__builtin_ia32_pshuflw((x86_64_v8hi)res4,27)};
    auto res7{__builtin_ia32_pshufhw(res6,27)};
    temp_vec=__builtin_ia32_packuswb128(res3,res7);
}
else if constexpr(sizeof(T)==2)
{
    using x86_64_v8hu [[__gnu__::__vector_size__ (16)]] = unsigned short;
    // 16-bit elements: swap the two bytes with a shift-left/shift-right/or.
    auto res0{(x86_64_v8hu)temp_vec};
    temp_vec=(x86_64_v16qi)((res0>>8)|(res0<<8));
}
That is, for SSE2 to do the __builtin_convertvector job, yeah.
(In reply to cqwrteur from comment #8)
> That is, for SSE2 to do the __builtin_convertvector job, yeah.

https://godbolt.org/z/dsf3WK58E

using temp_vec_type [[__gnu__::__vector_size__ (16)]] = char;
void foo4(temp_vec_type& v) noexcept
{
    v=__builtin_shufflevector(v,v,1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14);
}

This is even more interesting:

foo4(char __vector(16)&):                # @foo4(char __vector(16)&)
        movdqa  (%rdi), %xmm0
        movdqa  %xmm0, %xmm1
        psrlw   $8, %xmm1
        psllw   $8, %xmm0
        por     %xmm1, %xmm0
        movdqa  %xmm0, (%rdi)
        retq

clang generates this, using word shifts (effectively a rotate by 8) plus or.
(In reply to cqwrteur from comment #9)
> foo4(char __vector(16)&):                # @foo4(char __vector(16)&)
>         psrlw   $8, %xmm1
>         psllw   $8, %xmm0
>         por     %xmm1, %xmm0
>
> clang generates this, using word shifts (effectively a rotate by 8) plus or.

This is an interesting case; the same applies at dword/qword granularity with psrld/psrlq + pslld/psllq + or.
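For illustration, the same shift+or family in intrinsic form (a sketch assuming SSE2; each function swaps the two halves of the enclosing lane):

#include <emmintrin.h>  // SSE2

// Swap the two bytes inside every 16-bit lane (psrlw/psllw/por).
__m128i swap_bytes_in_words(__m128i v)
{
    return _mm_or_si128(_mm_srli_epi16(v, 8), _mm_slli_epi16(v, 8));
}

// Swap the two words inside every 32-bit lane (psrld/pslld/por).
__m128i swap_words_in_dwords(__m128i v)
{
    return _mm_or_si128(_mm_srli_epi32(v, 16), _mm_slli_epi32(v, 16));
}

// Swap the two dwords inside every 64-bit lane (psrlq/psllq/por).
__m128i swap_dwords_in_qwords(__m128i v)
{
    return _mm_or_si128(_mm_srli_epi64(v, 32), _mm_slli_epi64(v, 32));
}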
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:a71f90c5a7ae2942083921033cb23dcd63e70525

commit r15-499-ga71f90c5a7ae2942083921033cb23dcd63e70525
Author: Levy Hsu <admin@levyhsu.com>
Date:   Thu May 9 16:50:56 2024 +0800

    x86: Add 3-instruction subroutine vector shift for V16QI in ix86_expand_vec_perm_const_1 [PR107563]

    Hi All

    We've introduced a new subroutine in ix86_expand_vec_perm_const_1 to
    optimize vector shifting for the V16QI type on x86. This patch uses a
    three-instruction sequence psrlw, psllw, and por to handle specific
    vector shuffle operations more efficiently. The change aims to improve
    assembly code generation for configurations supporting SSE2.

    Bootstrapped and tested on x86_64-linux-gnu, OK for trunk?

    Best
    Levy

    gcc/ChangeLog:

            PR target/107563
            * config/i386/i386-expand.cc (expand_vec_perm_psrlw_psllw_por):
            New subroutine.
            (ix86_expand_vec_perm_const_1): Call
            expand_vec_perm_psrlw_psllw_por.

    gcc/testsuite/ChangeLog:

            PR target/107563
            * g++.target/i386/pr107563-a.C: New test.
            * g++.target/i386/pr107563-b.C: New test.
switch (d->vmode)
  {
  case E_V8QImode:
    if (!TARGET_MMX_WITH_SSE)
      return false;
    mode = V4HImode;
    gen_shr = gen_ashrv4hi3;   /* should be gen_lshrv4hi3 */
    gen_shl = gen_ashlv4hi3;
    gen_or = gen_iorv4hi3;
    break;
  case E_V16QImode:
    mode = V8HImode;
    gen_shr = gen_vlshrv8hi3;
    gen_shl = gen_vashlv8hi3;
    gen_or = gen_iorv8hi3;
    break;
  default:
    return false;
  }

Obviously, under V8QImode it should be gen_lshrv4hi3 instead of gen_ashrv4hi3. I mistakenly used gen_ashrv4hi3 because of the similar naming and failed to notice; gen_lshrv4hi3 is the correct logical shift. Will send a patch soon.
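To see why the arithmetic shift is wrong here, a scalar analogue (an illustration, not code from the patch): when the high byte of a word has its top bit set, an arithmetic right shift smears sign bits into the result, and the final or produces a corrupted swap:

#include <cstdint>
#include <cstdio>

int main()
{
    uint16_t w = 0x8001;                 // high byte 0x80 has the sign bit set
    uint16_t hi_logical = w >> 8;        // 0x0080: what a logical shift (psrlw) gives
    uint16_t hi_arith = (uint16_t)((int16_t)w >> 8);  // 0xff80: sign bits smeared in
    uint16_t lo = (uint16_t)(w << 8);    // 0x0100

    printf("%04x\n", (uint16_t)(hi_logical | lo));  // 0180: correct byte swap
    printf("%04x\n", (uint16_t)(hi_arith | lo));    // ff80: wrong
    return 0;
}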
Chiming in to say I'm seeing the exact same thing on trunk. Here's a minimal reproducer:

using vec256 = unsigned __attribute__((__vector_size__(32)));

void slow_rotate(vec256& x)
{
    x = __builtin_shufflevector(x, x, 3, 0, 1, 2, 7, 4, 5, 6);
}

void fast_rotate(vec256& x)
{
    x = vec256{x[3], x[0], x[1], x[2], x[7], x[4], x[5], x[6]};
}

Godbolt link: https://godbolt.org/z/YY9P7xKbh

fast_rotate generates pshufd as expected on x86_64 with generic compilation flags; slow_rotate is *much* slower.
Generic vector lowering does not handle splitting

_2 = VEC_PERM_EXPR <_1, _1, { 3, 0, 1, 2, 7, 4, 5, 6 }>;

into 2 PERMs.
Tree (lower/tree) dump: https://godbolt.org/z/o7GrvjMqq

slow_rotate still contains a single wide VEC_PERM_EXPR <_1, _1, {3,0,1,2,7,4,5,6}>, while fast_rotate is already expressed as element extracts + vector constructor.

RTL (expand) dump: https://godbolt.org/z/WT9cqbx7h

fast_rotate expands to two 128-bit vec_select:V4SI shuffles (one per 16B half), which is the expected shape to select pshufd on an SSE2 baseline. In contrast, slow_rotate expands to scalar loads/stores (no vector perm/select remains), so the backend never sees a permute it can map to pshufd.

So this looks like a generic vector-lowering / tree -> RTL expansion gap for non-native (32B) VEC_PERM_EXPR on SSE2 targets: masks that do not cross the 128-bit boundary should be decomposed into two 16B perms, but currently fall back to scalarization.
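For comparison, a source-level form of that decomposition (a sketch; it splits at the 128-bit boundary so each half is a native 4-element permute, since the mask {3,0,1,2,7,4,5,6} never crosses the halves):

using vec256 = unsigned __attribute__((__vector_size__(32)));
using vec128 = unsigned __attribute__((__vector_size__(16)));

void rotate_by_halves(vec256& x)
{
    vec128 lo{x[0], x[1], x[2], x[3]};
    vec128 hi{x[4], x[5], x[6], x[7]};
    // Each 4-element permute can map directly to one pshufd.
    lo = __builtin_shufflevector(lo, lo, 3, 0, 1, 2);
    hi = __builtin_shufflevector(hi, hi, 3, 0, 1, 2);
    x = vec256{lo[0], lo[1], lo[2], lo[3], hi[0], hi[1], hi[2], hi[3]};
}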