Trying to implement WHILE_ULT for AVX512 I run into optimization issues. Consider

double a[1024], b[1024];
void foo (int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 3.;
}

compiled with -O3 -march=cascadelake --param vect-partial-vector-usage=2. I get snippets like

        kxnorb  %k1, %k1, %k1
        kortestb        %k1, %k1
        je      .L11

or

        kxorb   %k1, %k1, %k1
        kxnorb  %k1, %k1, %k1

where we fail to simplify the operations (kxnor of a register with itself produces all-ones, so the following kortest/je can never take the branch, and the preceding kxor is dead). Looking at the RTL it looks like missed jump threading, but I do see the ops being

(insn 18 72 74 5 (parallel [
            (set (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                (not:QI (xor:QI (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86])
                        (reg:QI 69 k1 [orig:86 loop_mask_15 ] [86]))))
            (unspec [
                    (const_int 0 [0])
                ] UNSPEC_MASKOP)
        ]) 1912 {kxnorqi}
     (expr_list:REG_EQUAL (const_int -1 [0xffffffffffffffff])
        (nil)))

thus having an UNSPEC in them. When emitting a SET from constm1 I end up with mask<->GPR moves and if-converted code, which isn't optimal either. With -fno-if-conversion I get

.L7:
        vmovapd b(%rax), %ymm1{%k1}
        addl    $4, %ecx
        movl    %edi, %edx
        vmulpd  %ymm2, %ymm1, %ymm0
        subl    %ecx, %edx
        vmovapd %ymm0, a(%rax){%k1}
        kxnorb  %k1, %k1, %k1
        cmpl    $4, %edx
        jge     .L5
        vpbroadcastd    %edx, %xmm0
        vpcmpd  $1, %xmm0, %xmm3, %k1
.L5:
        addq    $32, %rax
        kortestb        %k1, %k1
        jne     .L7

which also doesn't have the desired short-cut from the cmpl $4, %edx.
Created attachment 53645: prototype for WHILE_ULT

I'm playing with the attached. Note it requires the third operand patch for WHILE_ULT (but it's basically approved).
For the UNSPEC part, we can create a new define_insn with the general operation that accepts both GPR and mask alternatives, just like the other logic patterns. The GPR version can be split to xor + not after reload; the mask version can be split to kxnor with the UNSPEC after reload. That should help general simplification of the xor and not.
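As a sketch of the source-level pattern this targets (my example, not the patch's testcase; it assumes the AVX512F mask intrinsics from immintrin.h):

#include <immintrin.h>

/* xor followed by not on a mask value: with a combined pattern this can
   become a single kxnorw when the value is allocated to a mask register,
   or xorl + notl when it is allocated to a GPR.  */
__mmask16
mask_xnor (__mmask16 a, __mmask16 b)
{
  return _knot_mask16 (_kxor_mask16 (a, b));
}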
The master branch has been updated by Hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:498ad738690b3c464f901d63dcd4d0f49a50dd00

commit r13-3218-g498ad738690b3c464f901d63dcd4d0f49a50dd00
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Oct 10 11:31:48 2022 +0800

    Add define_insn_and_split to support general version of "kxnor".

    For general_reg_operand, it will be split into xor + not.  For
    mask_reg_operand, it will be split with UNSPEC_MASK_OP just like
    what we did for other logic operations.  The patch will optimize
    xor + not to kxnor when possible.

    gcc/ChangeLog:

            PR target/107093
            * config/i386/i386.md (*notxor<mode>_1): New post_reload
            define_insn_and_split.
            (*notxorqi_1): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr107093.c: New test.
change "*k, CBC" to "?k, CBC", in *mov{qi,hi,si,di}_internal. then RA works good to choose kxnor for setting constm1_rtx to mask register, and i got below with your attached patch(change #if 0 to #if 1), seems better than orginal patch. 6foo: 7.LFB0: 8 .cfi_startproc 9 testl %edi, %edi 10 jle .L9 11 kxnorb %k1, %k1, %k1 12 cmpl $4, %edi 13 jl .L11 14.L3: 15 vbroadcastsd .LC2(%rip), %ymm3 16 vmovdqa .LC0(%rip), %xmm2 17 xorl %eax, %eax 18 xorl %ecx, %ecx 19 .p2align 4,,10 20 .p2align 3 21.L7: 22 vmovapd b(%rax), %ymm0{%k1} 23 addl $4, %ecx 24 movl %edi, %edx 25 vmulpd %ymm3, %ymm0, %ymm1 26 subl %ecx, %edx 27 cmpl $4, %edx 28 vmovapd %ymm1, a(%rax){%k1} 29 vpbroadcastd %edx, %xmm1 30 movl $-1, %edx 31 vpcmpd $1, %xmm1, %xmm2, %k1 32 kmovb %k1, %esi 33 cmovge %edx, %esi 34 addq $32, %rax 35 kmovb %esi, %k1 36 kortestb %k1, %k1 37 jne .L7 38 vzeroupper 39.L9: 40 ret 41 .p2align 4,,10 42 .p2align 3 43.L11: 44 vmovdqa .LC0(%rip), %xmm2 45 vpbroadcastd %edi, %xmm1 46 vpcmpd $1, %xmm1, %xmm2, %k1 47 jmp .L3 48 .cfi_endproc
Also I think a masked epilogue (--param=vect-partial-vector-usage=1) should be good for the general case under AVX512, especially when the main loop's vector width is 512 and the remaining trip count is not enough for 256-bit vectorization but OK for 128-bit vectorization.
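For illustration, the loop from the description compiled for a masked epilogue (flags assumed from this discussion, not taken from a posted reproducer):

/* gcc -O3 -march=cascadelake -mprefer-vector-width=512 \
       --param vect-partial-vector-usage=1
   The main loop then runs unmasked at full width and only the
   remainder iterations use a masked epilogue.  */
double a[1024], b[1024];

void
foo (int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 3.;
}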
(In reply to Hongtao.liu from comment #4)
> Change "*k, CBC" to "?k, CBC" in *mov{qi,hi,si,di}_internal; then RA
> does a good job of choosing kxnor for setting constm1_rtx into a mask
> register.
> [...]
>         kmovb   %k1, %esi
>         cmovge  %edx, %esi

Not sure the round-trip to GPRs for the sake of a cmovge is worth it; I guess a branch would be better.
(In reply to Hongtao.liu from comment #5)
> Also I think a masked epilogue (--param=vect-partial-vector-usage=1)
> should be good for the general case under AVX512, especially when the
> main loop's vector width is 512 and the remaining trip count is not
> enough for 256-bit vectorization but OK for 128-bit vectorization.

Yes, for the fully masked variant I was mostly targeting -O2 with its very-cheap (size-wise) cost model. Since we don't vectorize the epilogue of a vectorized epilogue (yet), going fully masked there should indeed help. Also, once we start to use the unroll hint, the vectorized epilogue might get full-width iterations to handle as well.

One downside of a fully masked body is that we're using masked stores, which usually have higher latency due to the "merge" semantics, meaning an extra memory input plus a merge operation. I'm not sure whether modern uArchs can optimize the all-ones mask case. The vectorizer, for .MASK_STORE, still has the code to change those to emit a mask compare against all-zeros and only conditionally perform the .MASK_STORE. That could be enhanced to single out the all-ones case, at least for the .MASK_STOREs in a fully masked main loop where the mask comes only from the iteration (rather than from conditional execution).
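A minimal sketch of that guard at the source level, assuming AVX512F+VL intrinsics (illustrative only, not the vectorizer's actual output; guarded_mask_store is a hypothetical helper):

#include <immintrin.h>

/* Skip the store entirely when the mask is all-zeros (what the
   vectorizer can already emit for .MASK_STORE) and use a plain store
   when it is all-ones (the suggested enhancement), so the common
   full-vector iteration avoids merge-semantics masked stores.  */
static inline void
guarded_mask_store (double *p, __m256d val, __mmask8 m)
{
  if ((m & 0xf) == 0xf)                /* all four lanes active */
    _mm256_storeu_pd (p, val);
  else if (m != 0)                     /* at least one lane active */
    _mm256_mask_storeu_pd (p, m, val);
}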
> One downside of a fully masked body is that we're using masked stores,
> which usually have higher latency due to the "merge" semantics, meaning
> an extra memory input plus a merge operation. [...]

Also, I guess a masked store won't act as a store-forwarding source even when the load is fully contained within the masked store.
On Tue, 11 Oct 2022, crazylht at gmail dot com wrote:

> Also, I guess a masked store won't act as a store-forwarding source even
> when the load is fully contained within the masked store.

I guess the masking of the store is resolved in the load-store unit, and not by splitting the operation into load, modify, store, because that cannot easily hide exceptions. So yes, a masked store in the store buffer likely cannot act as a forwarding source (though the actual mask should be fully resolved there), since the actual merging takes place later.
Icelake is able to forward a masked store with an all-ones mask; Zen 4 isn't able to do that. Other masked stores indeed do not forward.

There's a related problem also when an outer loop causes a low-trip inner loop to use masked loads/stores on overlapping vectors:

outer iteration 1
  ... = .MASK_LOAD (p, {-1, -1, -1, -1, 0, 0, 0, 0});
  ...
  .MASK_STORE (p, val, {-1, -1, -1, -1, 0, 0, 0, 0});
outer iteration 2
  ... = .MASK_LOAD (p + delta, {-1, -1, -1, -1, 0, 0, 0, 0});
  ...
  .MASK_STORE (p + delta, val, {-1, -1, -1, -1, 0, 0, 0, 0});

with delta causing the next outer iteration to access the masked-out values from the previous iteration. That gets a STLF failure (obviously), but we now also need to wait for the masked store to retire before the masked load of iteration 2 can be carried out. We are hitting this case in SPEC CPU 2017 with masked epilogues (the inner loop just iterates 4 times, vectorized with V8DFmode vectors).

Ideally the implementation (the CPU) would "shorten" loads/stores for trailing sequences of zeros so this hazard doesn't occur. Not sure if that would be allowed by the x86 memory model, though (I didn't find anything specific there with respect to load/store masking). ISTR store buffer entries are usually assigned at instruction issue time; I'm not sure whether the mask is resolved there or whether the size of the store could be adjusted later when it is. The implementation could also somehow ignore the "conflict".

Note I didn't yet fully benchmark masked epilogues with -mprefer-vector-width=512 on Icelake or Sapphire Rapids; maybe Intel CPUs are not affected by this issue.

The original issue in the description seems solved; we now generate the following with the code generation variant that's now on trunk:

.L3:
        vmovapd b(%rax), %ymm0{%k1}
        movl    %edx, %ecx
        subl    $4, %edx
        kmovw   %edx, %k0
        vmulpd  %ymm3, %ymm0, %ymm1{%k1}{z}
        vmovapd %ymm1, a(%rax){%k1}
        vpbroadcastmw2d %k0, %xmm1
        addq    $32, %rax
        vpcmpud $6, %xmm2, %xmm1, %k1
        cmpw    $4, %cx
        ja      .L3

which avoids using the slow mask ops for loop control. It oddly does

        subl    $4, %edx
        kmovw   %edx, %k0
        vpbroadcastmw2d %k0, %xmm1

with -march=cascadelake - with -march=znver4 I get the expected

        subl    $8, %edx
        vpbroadcastw    %edx, %xmm1

but I can reproduce the mask register "spill" with -mprefer-vector-width=256. We expand to

(insn 14 13 15 (set (reg:V4SI 96)
        (vec_duplicate:V4SI (reg:SI 93 [ _27 ])))
     8167 {*avx512vl_vec_dup_gprv4si}
     (nil))

I'll file a separate bug report for this.
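A minimal C sketch of the problematic shape (hypothetical code, not the SPEC source): a four-iteration inner loop vectorized as one masked V8DF access, where the outer-loop stride of 4 doubles makes iteration j+1's masked load fall inside the 64-byte region covered by iteration j's masked store:

/* Each outer iteration updates 4 doubles; with V8DF vectors the inner
   loop becomes a masked load/store pair with mask
   {-1,-1,-1,-1,0,0,0,0}.  The next iteration's load at p + 4 overlaps
   the previous masked store's 64-byte region, so it must wait for that
   store to retire (after the STLF failure).  */
void
outer (double *p, int n)
{
  for (int j = 0; j < n; ++j, p += 4)
    for (int i = 0; i < 4; ++i)        /* low trip count -> masked */
      p[i] = p[i] * 2.0;
}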