As a workaround for PR 82887 some code (e.g. a memset) uses

  __m512i zmm = _mm512_set1_epi8((char)c);
  __m256i ymm = _mm256_set1_epi8((char)c);

instead of ymm = _mm512_castsi512_si256(zmm); (found in the persistent-memory library
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193
). Obviously we'd like to CSE that instead of actually broadcasting twice.

MVCE:

#include <immintrin.h>
__m512i sinkz;
__m256i sinky;
void foo(char c) {
    sinkz = _mm512_set1_epi8(c);
    sinky = _mm256_set1_epi8(c);
}

https://godbolt.org/z/CeXhi8
g++ (Compiler-Explorer-Build) 10.0.0 20191012
# g++ -O3 -march=skylake-avx512    (AVX512BW + AVX512VL are the relevant ones)

foo(char):
        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vpbroadcastb    %edi, %ymm0          # wasted insn
        vmovdqa64       %ymm0, sinky(%rip)   # wasted EVEX prefix
        vzeroupper
        ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb xmm,ymm),
even though AVX512BW vpbroadcastb zmm does set the YMM register. (There are no CPUs
with AVX512BW but not AVX512VL; if people compile that way it's their own fault. But
this might be relevant for set1_epi32() on KNL.)

Clang finds this optimization, and uses a shorter vmovdqa for the YMM store, saving
another 2 bytes of code size:

        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vmovdqa         %ymm0, sinky(%rip)
        vzeroupper
        ret
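For reference, a minimal sketch (not from the PR; foo_cast is a made-up name) of the source-level form using _mm512_castsi512_si256 that the report says the workaround replaced, which only ever asks for one broadcast:

#include <immintrin.h>

__m512i sinkz;
__m256i sinky;

void foo_cast(char c) {
    __m512i zmm = _mm512_set1_epi8(c);    /* single broadcast */
    sinkz = zmm;
    sinky = _mm512_castsi512_si256(zmm);  /* reinterpretation only, no instruction expected */
}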
Interestingly enough, with just -mavx512f we get

        vmovd   %edi, %xmm0
        vpbroadcastb    %xmm0, %ymm0
        vinserti64x4    $0x1, %ymm0, %zmm0, %zmm1
        vmovdqa %ymm0, sinky(%rip)
        vmovdqa64       %zmm1, sinkz(%rip)

The GIMPLE we expand from is

  _7 = {c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D)};
  _8 = VIEW_CONVERT_EXPR<__m512i>(_7);
  sinkz = _8;
  _3 = {c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D), c_1(D)};
  _6 = VIEW_CONVERT_EXPR<__m256i>(_3);
  sinky = _6;

where we could replace _6 with a BIT_FIELD_REF, but that would be a quite costly thing to do in general. Our representation for the splats isn't too nice either... So without avx512bw we seem to miss the splat on V64QI and do a V32QI splat plus a concat instead. On the RTL side optimizing this isn't any less awkward than on GIMPLE, I guess.
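As an illustrative aside (foo_i is a hypothetical variant, not part of the report): with 32-bit elements a full-width broadcast exists with plain -mavx512f, so the V32QI-splat-plus-concat shape above should be specific to QI elements without avx512bw; the redundant narrower broadcast itself remains, which is the set1_epi32()-on-KNL situation comment #0 alludes to:

#include <immintrin.h>

__m512i sinkz;
__m256i sinky;

void foo_i(int c) {
    sinkz = _mm512_set1_epi32(c);  /* full-width V16SI splat exists with plain -mavx512f */
    sinky = _mm256_set1_epi32(c);  /* redundant narrower broadcast of the same scalar */
}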
Yeah, it isn't e.g. something RTL CSE would naturally do, because there is no common subexpression. This needs to know that a narrower broadcast is part of a wider broadcast of the same argument, and how to replace it with a backend instruction that takes the low bits from the wider result (while that usually expands to no code, at least before RA it needs to be expressed in some way and is very backend specific; we don't allow a vector mode to vector mode subreg with a different size). So the only place to deal with this in RTL would be some backend-specific pass, I'm afraid.
On Mon, 14 Oct 2019, jakub at gcc dot gnu.org wrote:

> --- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> [...]
> So the only place to deal with this in RTL would be some backend-specific
> pass, I'm afraid.

So what RTL CSE would need to do is, when seeing

  (set reg:VNQI ...)

know (via a target hook?) which subregs can be accessed at zero cost and register the appropriate smaller vector sets with a subreg value. That probably makes sense only after reload, to not constrain RA too much. It could be restricted to vec_duplicate, since there it's easy to derive the lowpart expression to register.
On Mon, 14 Oct 2019, rguenther at suse dot de wrote:

> --- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
> So what RTL CSE would need to do is, when seeing
>
>   (set reg:VNQI ...)
>
> know (via a target hook?) which subregs can be accessed at zero cost and
> register the appropriate smaller vector sets with a subreg value.
> [...]

Or IRA/LRA rematerialization / inheritance could be taught to do this.
This gives good code:

#include <immintrin.h>
__m512i sinkz;
__m256i sinky;
void foo(char c) {
    __m512i a = _mm512_set1_epi8(c);
    sinkz = a;
    sinky = *((__m256i*)&a);
}
Similar when vectorizing

int a[4096];

void foo ()
{
  for (int i = 1; i < 4095; ++i)
    a[i] = 42;
}

the combination of peeling for alignment and the epilogue yields on GIMPLE:

  <bb 2> [local count: 10737416]:
  MEM <vector(8) int> [(int *)&a + 4B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
  MEM <vector(4) int> [(int *)&a + 36B] = { 42, 42, 42, 42 };
  MEM <vector(2) int> [(int *)&a + 52B] = { 42, 42 };
  a[15] = 42;
  ivtmp.28_59 = (unsigned long) &MEM <int[4096]> [(void *)&a + 64B];
  _1 = (unsigned long) &a;
  _182 = _1 + 16320;

  <bb 3> [local count: 75161909]:
  # ivtmp.28_71 = PHI <ivtmp.28_65(3), ivtmp.28_59(2)>
  _21 = (void *) ivtmp.28_71;
  MEM <vector(16) int> [(int *)_21] = { 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 };
  ivtmp.28_65 = ivtmp.28_71 + 64;
  if (ivtmp.28_65 != _182)
    goto <bb 3>; [85.71%]
  else
    goto <bb 4>; [14.29%]

  <bb 4> [local count: 21474835]:
  MEM <vector(8) int> [(int *)&a + 16320B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
  MEM <vector(4) int> [(int *)&a + 16352B] = { 42, 42, 42, 42 };
  MEM <vector(2) int> [(int *)&a + 16368B] = { 42, 42 };
  a[4094] = 42;
  return;

and that in turn causes a lot of redundant broadcasts from constants (via GPRs):

foo:
.LFB0:
        .cfi_startproc
        movl    $42, %eax
        movq    .LC2(%rip), %rcx
        movl    $42, %edx
        movl    $42, a+60(%rip)
        vpbroadcastd    %eax, %ymm0
        vmovdqu %ymm0, a+4(%rip)
        vpbroadcastd    %eax, %xmm0
        movl    $a+64, %eax
        vmovdqu %xmm0, a+36(%rip)
        vpbroadcastd    %edx, %zmm0
        movq    %rcx, a+52(%rip)
.L2:
        vmovdqa32       %zmm0, (%rax)
        subq    $-128, %rax
        vmovdqa32       %zmm0, -64(%rax)
        cmpq    $a+16320, %rax
        jne     .L2
        vpbroadcastd    %edx, %ymm0
        movq    %rcx, a+16368(%rip)
        movl    $42, a+16376(%rip)
        vmovdqa %ymm0, a+16320(%rip)
        vpbroadcastd    %edx, %xmm0
        vmovdqa %xmm0, a+16352(%rip)
        vzeroupper
        ret

Since the vectors are constant on GIMPLE, any "CSE" we'd perform there would quickly be undone by constant propagation. So it's only on RTL, where the actual broadcast is a non-constant operation, that we can and should optimize this somehow. Some kind of LCM would be necessary here to also handle the case of an earlier small but later bigger broadcast.
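A hand-reduced illustration of that last point (bar is a made-up testcase, not from the PR): when the narrower broadcast comes first, simply reusing the low part of an already computed register is not possible; the earlier broadcast would have to be widened or the wider one hoisted, which is what an LCM-style placement would have to model:

#include <immintrin.h>

__m256i sinky;
__m512i sinkz;

void bar(int c) {
    sinky = _mm256_set1_epi32(c);  /* narrower broadcast happens first */
    sinkz = _mm512_set1_epi32(c);  /* the full-width value only exists afterwards */
}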
Another simple case is

typedef int v4si __attribute__((vector_size(16)));
typedef short v8hi __attribute__((vector_size(16)));

v8hi a;
v4si b;
void
foo ()
{
  b = __extension__(v4si){0, 0, 0, 0};
  a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
}

GCC generates two vpxor instructions:

foo():
        vpxor   xmm0, xmm0, xmm0
        vmovdqa XMMWORD PTR b[rip], xmm0
        vpxor   xmm0, xmm0, xmm0
        vmovdqa XMMWORD PTR a[rip], xmm0
        ret
On Thu, 21 Mar 2024, liuhongt at gcc dot gnu.org wrote:

> --- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> Another simple case is [...]
> GCC generates two vpxor instructions

If we were to expose that vpxor before postreload we'd likely CSE, but we have

  5: xmm0:V4SI=const_vector
      REG_EQUIV const_vector
  6: [`b']=xmm0:V4SI
  7: xmm0:V8HI=const_vector
      REG_EQUIV const_vector
  8: [`a']=xmm0:V8HI

until the very end. But since the xmm0 sets have the same mode size, CSE could easily handle (integral) constants by hashing/comparing their byte representation rather than the RTX structure. OTOH, as the constants we allow in the IL are mostly special ones like this, treating all-zeros and all-ones specially might be good enough ...
> But since the xmm0 sets have the same mode size, CSE could easily handle
> (integral) constants by hashing/comparing their byte representation rather
> than the RTX structure. OTOH, as the constants we allow in the IL are
> mostly special ones like this, treating all-zeros and all-ones specially
> might be good enough ...

We only handle scalar code there; I guess we could do something similar, maybe:

1. iterate over vector modes with the same vector length?
2. iterate over vector modes with the same component mode but a bigger vector length?

But that will miss the v8hi/v8si pxor case (a sketch of it follows after the code below). Another alternative is to canonicalize the const_vector with a scalar mode, i.e. v4si -> TI, v8si -> OI, v16si -> XI; then we can just query with TI/OI/XImode. The existing scalar code (in cse.cc) is:

  /* See if we have a CONST_INT that is already in a register in a
     wider mode.  */

  if (src_const && src_related == 0 && CONST_INT_P (src_const)
      && is_int_mode (mode, &int_mode)
      && GET_MODE_PRECISION (int_mode) < BITS_PER_WORD)
    {
      opt_scalar_int_mode wider_mode_iter;
      FOR_EACH_WIDER_MODE (wider_mode_iter, int_mode)
        {
          scalar_int_mode wider_mode = wider_mode_iter.require ();
          if (GET_MODE_PRECISION (wider_mode) > BITS_PER_WORD)
            break;

          struct table_elt *const_elt
            = lookup (src_const, HASH (src_const, wider_mode), wider_mode);

          if (const_elt == 0)
            continue;

          for (const_elt = const_elt->first_same_value;
               const_elt; const_elt = const_elt->next_same_value)
            if (REG_P (const_elt->exp))
              {
                src_related = gen_lowpart (int_mode, const_elt->exp);
                break;
              }

          if (src_related != 0)
            break;
        }
    }
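A sketch of the combination mentioned as being missed (a made-up testcase, not from the thread): v8hi and v8si zeros have neither the same vector size nor the same component mode, yet after canonicalizing to TImode/OImode the second zero is just the lowpart of the first:

typedef short v8hi __attribute__((vector_size(16)));
typedef int v8si __attribute__((vector_size(32)));

v8si c;
v8hi a;
void
bar ()
{
  c = __extension__(v8si){0, 0, 0, 0, 0, 0, 0, 0};  /* 32-byte zero (OImode)  */
  a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};  /* 16-byte zero (TImode), the lowpart of the above  */
}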
But it's even simpler than the cited case - the modes have the same size (for the latest testcase, not for the original one, of course). Note also that after reload a zeroing of V4SImode also zeros the ymm register, but setting V4SImode to all-ones does not set the upper half of the ymm to all-ones; it "zero-extends" instead. With CSE it then becomes important which set comes first. If the larger-mode set comes first it's easier; if the smaller-mode set comes first you'd have to change it to a larger one (if the zero-extension is not what you want).
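Two made-up variants (not from the thread) illustrating that ordering point for all-ones:

typedef int v4si __attribute__((vector_size(16)));
typedef int v8si __attribute__((vector_size(32)));

v8si c;
v4si b;

void
easy ()
{
  /* Larger-mode set first: the xmm all-ones is simply the low half of the
     ymm all-ones already in the register.  */
  c = __extension__(v8si){-1, -1, -1, -1, -1, -1, -1, -1};
  b = __extension__(v4si){-1, -1, -1, -1};
}

void
hard ()
{
  /* Smaller-mode set first: writing the 128-bit all-ones zero-extends into
     the upper half of the ymm register, so the later 256-bit all-ones cannot
     be taken from it; the first set would have to be widened instead.  */
  b = __extension__(v4si){-1, -1, -1, -1};
  c = __extension__(v8si){-1, -1, -1, -1, -1, -1, -1, -1};
}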
(In reply to Richard Biener from comment #6)
> Similar when vectorizing
>
> int a[4096];
>
> void foo ()
> {
>   for (int i = 1; i < 4095; ++i)
>     a[i] = 42;
> }

This was actually reported by me in PR 99639, but for aarch64.