Hi,

I have the following simple loop which I want to compile for Skylake (AVX-512):

================================
#pragma GCC ivdep
for (int i = 0; i < n; ++i) {
    if (b[off1[i]] < b[off2[i]])
        a[off1[i]] = b[off1[i]];
    else
        a[off2[i]] = b[off2[i]];
}
================================

Given AVX-512 masked scatter instructions and the absence of data conflicts ("ivdep"), vectorization should be possible along the lines of:

1. gather b[off1[i]] into zmm1
2. gather b[off2[i]] into zmm2
3. compare zmm1 and zmm2 with "<" and store the result in mask1
4. compare zmm1 and zmm2 with ">=" and store the result in mask2
5. scatter zmm1 to a[off1[i]] with mask1
6. scatter zmm2 to a[off2[i]] with mask2

However, GCC is not able to vectorize this loop (failing with "not vectorized: not suitable for scatter store"). I have tested this with the latest GCC trunk, but the issue also occurs with all previous versions. If you want to have a look, here's a Godbolt example: https://godbolt.org/z/Is7Zml

I understand that this loop is not a trivial case for vectorization and AVX-512 hasn't been around for too long, so it's likely that it isn't fully supported yet. But still, I'm wondering:

1. Am I missing some flags or hints to GCC in order to vectorize this loop? (I can imagine something related to the cost model, etc.)
2. Or is GCC currently just not capable of vectorizing it?

If the answer is "2.":

3. Can we estimate the amount of work needed to support this?
4. Is there any plan for when this kind of pattern will be supported?
5. If it's realistic for a non-GCC developer to look into this, is there anything I can do to help?

Many thanks in advance,
Moritz
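For what it's worth, the six steps can be modeled in plain scalar C to make the intended masked-gather/masked-scatter semantics concrete. This is only an illustrative sketch of one 8-lane iteration (one zmm register of doubles), not the AVX-512 code itself; the function name is made up for the example.

```c
#include <stddef.h>

/* Scalar model of one 8-lane iteration of the proposed vectorization:
   gather both operands, build two complementary lane masks from the
   compare, then perform the two masked scatters.  Purely illustrative. */
static void masked_scatter_step(double *a, const double *b,
                                const int *off1, const int *off2)
{
    double zmm1[8], zmm2[8];
    unsigned char mask1 = 0, mask2 = 0;

    for (int l = 0; l < 8; ++l) {            /* steps 1-2: gathers */
        zmm1[l] = b[off1[l]];
        zmm2[l] = b[off2[l]];
    }
    for (int l = 0; l < 8; ++l) {            /* steps 3-4: compares -> masks */
        if (zmm1[l] < zmm2[l])
            mask1 |= 1u << l;
        else
            mask2 |= 1u << l;
    }
    for (int l = 0; l < 8; ++l) {            /* steps 5-6: masked scatters */
        if (mask1 & (1u << l))
            a[off1[l]] = zmm1[l];
        if (mask2 & (1u << l))
            a[off2[l]] = zmm2[l];
    }
}
```

Since the gathers read only from b and each lane takes exactly one of the two stores, this matches the scalar loop whenever the "ivdep" promise (no write conflicts between lanes) holds.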
Confirmed. The issue here is that the target doesn't advertise scatter:

3901	  if (targetm.vectorize.builtin_scatter)
(gdb) n
3902	    decl = targetm.vectorize.builtin_scatter (vectype, offtype, scale);
(gdb)
3905	  if (!decl)
(gdb) p decl
$5 = (tree) 0x0
(gdb) p decl
$5 = (tree) 0x0
(gdb) p debug_generic_expr (vectype)
vector(4) double
$6 = void
(gdb) p debug_generic_expr (offtype)
int
$7 = void
(gdb) p scale
$8 = 8

probably because of the "mixup" of 4-byte index and 8-byte store element type. I'm not sure whether Jakub implemented widening of the index at all. OTOH, using long for the index doesn't help. Using -march=knl we get a little bit further but ultimately fail the same way.
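For reference, the failing builtin_scatter query above is for vectype vector(4) double, offtype int, and scale 8, i.e. a vgatherdpd/vscatterdpd-shaped access where each 32-bit index must be widened (sign-extended) before the scale is applied. A scalar model of just the addressing, as a sketch (the helper name is invented for illustration):

```c
#include <stddef.h>

/* Scalar model of the vector(4)-double / int-index / scale-8 addressing
   the vectorizer asks the target about above.  Each 32-bit index is
   sign-extended to pointer width before the scale is applied. */
static void gather4_pd(double dst[4], const double *base, const int idx[4])
{
    for (int l = 0; l < 4; ++l)
        dst[l] = *(const double *)((const char *)base
                                   + (ptrdiff_t)idx[l] * 8);
}
```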
I haven't been involved with the scatters, only gathers. So I guess the first step would be to try something similar with gathers only, see if we vectorize that, and then see what is different for scatters.
Looks like there are no scatter patterns with DFmode, only SFmode ones? But even

void loop (float * __restrict__ a, float const * __restrict__ b,
           int const * __restrict__ off1, int const * __restrict__ off2, int n)
{
#if defined(__clang__)
#pragma clang loop vectorize(assume_safety)
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
  for (int i = 0; i < n; ++i) {
    if (b[i] < b[i])
      a[off1[i]] = b[i];
    else
      a[off2[i]] = b[i];
  }
}

doesn't work.
So, let's use a more complete testcase:

void f1 (double * __restrict__ a, double const * __restrict__ b,
         int const * __restrict__ off1, int const * __restrict__ off2, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i) {
    if (b[off1[i]] < b[off2[i]])
      a[off1[i]] = b[off1[i]];
    else
      a[off2[i]] = b[off2[i]];
  }
}

void f2 (double * __restrict__ a, double const * __restrict__ b,
         long const * __restrict__ off1, long const * __restrict__ off2, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i) {
    if (b[off1[i]] < b[off2[i]])
      a[off1[i]] = b[off1[i]];
    else
      a[off2[i]] = b[off2[i]];
  }
}

void f3 (float * __restrict__ a, float const * __restrict__ b,
         int const * __restrict__ off1, int const * __restrict__ off2, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i) {
    if (b[off1[i]] < b[off2[i]])
      a[off1[i]] = b[off1[i]];
    else
      a[off2[i]] = b[off2[i]];
  }
}

void f4 (double * __restrict__ a, const double * __restrict__ b,
         int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[i] = b[off[i]];
}

void f5 (double * __restrict__ a, const double * __restrict__ b,
         long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[i] = b[off[i]];
}

void f6 (float * __restrict__ a, const float * __restrict__ b,
         int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[i] = b[off[i]];
}

void f7 (double * __restrict__ a, const double * __restrict__ b,
         int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

void f8 (double * __restrict__ a, const double * __restrict__ b,
         long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

void f9 (float * __restrict__ a, const float * __restrict__ b,
         int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    a[off[i]] = b[i];
}

void f10 (double * __restrict__ a, const double * __restrict__ b,
          int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (a[i] > 10)
      a[i] = b[off[i]];
}

void f11 (double * __restrict__ a, const double * __restrict__ b,
          long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (a[i] > 10)
      a[i] = b[off[i]];
}

void f12 (float * __restrict__ a, const float * __restrict__ b,
          int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (a[i] > 10)
      a[i] = b[off[i]];
}

void f13 (double * __restrict__ a, const double * __restrict__ b,
          int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (b[i] > 10)
      a[off[i]] = b[i];
}

void f14 (double * __restrict__ a, const double * __restrict__ b,
          long const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (b[i] > 10)
      a[off[i]] = b[i];
}

void f15 (float * __restrict__ a, const float * __restrict__ b,
          int const * __restrict__ off, int n)
{
#pragma GCC ivdep
  for (int i = 0; i < n; ++i)
    if (b[i] > 10)
      a[off[i]] = b[i];
}

With -O3 -mavx512{f,bw,dq,vl} -mtune=skylake-avx512 the f4/f5/f6 loops are vectorized for all of -mprefer-vector-width={128,256,512} (though not, e.g., for -mtune=generic), and f7/f8/f9 are vectorized only with -mprefer-vector-width=512, using vscatter*. f10/f11/f12 are vectorized using vgather*, but strangely even in -mprefer-vector-width=512 mode using the AVX2 gathers. So that suggests that while unconditional scatters and gathers are properly supported, for conditional ones (both MASK_LOAD and MASK_STORE) the support for the cases using a mask register rather than a vector register with mask either hasn't been done or doesn't work properly.
We have:

	  else if (memory_access_type == VMAT_GATHER_SCATTER && gs_info.decl)
	    {
	      tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gs_info.decl));
	      tree masktype
		= TREE_VALUE (TREE_CHAIN (TREE_CHAIN (TREE_CHAIN (arglist))));
	      if (TREE_CODE (masktype) == INTEGER_TYPE)
		{
		  if (dump_enabled_p ())
		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
				     "masked gather with integer mask not"
				     " supported.");
		  return false;
		}
	    }

and no support for masked scatter stores when not using the aarch64-style ifns.
See PR59617 for the masked gathers.
Created attachment 45220 [details]
gcc9-pr88464.patch

Completely untested patch for the AVX512F conditional gather vectorization support.
Thanks for the input, and for confirming that "for conditional ones (both MASK_LOAD and MASK_STORE) the support for the cases when using a mask register rather than a vector register with mask either hasn't been done or doesn't work properly."

Any idea about a possible way forward to fix or implement those? As stated above, even though I have no experience with GCC development, I'd be happy to assist to the best of my abilities if at all possible.
Author: jakub
Date: Thu Dec 13 17:01:50 2018
New Revision: 267097

URL: https://gcc.gnu.org/viewcvs?rev=267097&root=gcc&view=rev
Log:
	PR tree-optimization/88464
	* tree-vect-stmts.c (vect_build_gather_load_calls): Handle
	INTEGER_TYPE masktype if mask is non-NULL.
	(vectorizable_load): Don't reject masked gather loads if masktype
	in the decl is INTEGER_TYPE.

	* gcc.target/i386/avx512f-pr88462-1.c: New test.
	* gcc.target/i386/avx512f-pr88462-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
For masked gathers it is now fixed on the trunk; for masked stores I'll have a look tomorrow.
Jakub, I can confirm it's working for masked gathers (we have a similar pattern elsewhere in our code) with the latest trunk. Thanks for looking at the scatters as well!
Created attachment 45239 [details]
gcc9-pr88464-2.patch

Incremental untested patch. First of all, I had missed an important case for gathers in the testsuite coverage, float with long index type on -m64, which actually ICEd due to multiple issues. The rest of the patch implements masked scatters, though at least for now only for 512-bit vectors. The problem for 128-bit and 256-bit is that the vectorizer computes the comparisons in an integral vector rather than a bool vector (i.e. int); perhaps we could just emit an instruction that sets masks from the MSB bits of such a vector.
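The "set masks from the MSB bits" idea above corresponds to extracting the sign bit of each lane of the integral comparison-result vector into a mask (roughly what vmovmskps does, or vpmovd2m with AVX512DQ+VL). As a scalar sketch of that extraction, for a hypothetical 4-lane case:

```c
/* Collect the sign (MSB) bit of each 32-bit lane into a bit-mask --
   the operation that would convert an integral comparison-result
   vector into a mask register.  Sketch for a 4-lane vector. */
static unsigned msb_mask4(const int v[4])
{
    unsigned m = 0;
    for (int l = 0; l < 4; ++l)
        m |= ((unsigned)v[l] >> 31) << l;
    return m;
}
```

This works because vector compares produce all-ones (-1) for true lanes and all-zeros for false lanes, so the MSB alone carries the lane's truth value.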
Author: jakub
Date: Sat Dec 15 11:02:28 2018
New Revision: 267169

URL: https://gcc.gnu.org/viewcvs?rev=267169&root=gcc&view=rev
Log:
	PR tree-optimization/88464
	PR target/88498
	* tree-vect-stmts.c (vect_build_gather_load_calls): For NARROWING
	and mask with integral masktype, don't try to permute mask vectors,
	instead emit VEC_UNPACK_{LO,HI}_EXPR.  Fix up NOP_EXPR operand.
	(vectorizable_store): Handle masked scatters with decl and integral
	mask type.
	(permute_vec_elements): Allow scalar_dest to be NULL.
	* config/i386/i386.c (ix86_get_builtin)
	<case IX86_BUILTIN_GATHER3ALTDIV16SF>: Use lowpart_subreg for masks.
	<case IX86_BUILTIN_GATHER3ALTDIV8SF>: Don't assume mask and src have
	to be the same.

	* gcc.target/i386/avx512f-pr88462-1.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-1.c: ... this.  Fix up PR number.
	Expect 4 vectorized loops instead of 3.
	(f4): New function.
	* gcc.target/i386/avx512f-pr88462-2.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-2.c: ... this.  Fix up PR number
	and #include.
	(avx512f_test): Prepare arguments for f4 and check the results.
	* gcc.target/i386/avx512f-pr88464-3.c: New test.
	* gcc.target/i386/avx512f-pr88464-4.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-1.c
      - copied, changed from r267168, trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-2.c
      - copied, changed from r267168, trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-2.c
Removed:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88462-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
Author: jakub
Date: Sat Dec 15 11:04:41 2018
New Revision: 267170

URL: https://gcc.gnu.org/viewcvs?rev=267170&root=gcc&view=rev
Log:
	PR tree-optimization/88464
	PR target/88498
	* tree-vect-stmts.c (vect_build_gather_load_calls): For NARROWING
	and mask with integral masktype, don't try to permute mask vectors,
	instead emit VEC_UNPACK_{LO,HI}_EXPR.  Fix up NOP_EXPR operand.
	(vectorizable_store): Handle masked scatters with decl and integral
	mask type.
	(permute_vec_elements): Allow scalar_dest to be NULL.
	* config/i386/i386.c (ix86_get_builtin)
	<case IX86_BUILTIN_GATHER3ALTDIV16SF>: Use lowpart_subreg for masks.
	<case IX86_BUILTIN_GATHER3ALTDIV8SF>: Don't assume mask and src have
	to be the same.

	* gcc.target/i386/avx512f-pr88462-1.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-1.c: ... this.  Fix up PR number.
	Expect 4 vectorized loops instead of 3.
	(f4): New function.
	* gcc.target/i386/avx512f-pr88462-2.c: Rename to ...
	* gcc.target/i386/avx512f-pr88464-2.c: ... this.  Fix up PR number
	and #include.
	(avx512f_test): Prepare arguments for f4 and check the results.
	* gcc.target/i386/avx512f-pr88464-3.c: New test.
	* gcc.target/i386/avx512f-pr88464-4.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-3.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-4.c
Should be fixed now.
I can confirm the fix from my side. Thanks again!
Created attachment 45251 [details]
gcc9-pr88464-3.patch

Incremental patch over the PR88514 fix to add scatter vectorization with AVX512VL even for -mprefer-vector-width={128,256}.
Author: jakub
Date: Tue Dec 18 18:41:26 2018
New Revision: 267239

URL: https://gcc.gnu.org/viewcvs?rev=267239&root=gcc&view=rev
Log:
	PR target/88464
	* config/i386/i386-builtin-types.def
	(VOID_FTYPE_PDOUBLE_QI_V8SI_V4DF_INT,
	VOID_FTYPE_PFLOAT_QI_V4DI_V8SF_INT,
	VOID_FTYPE_PLONGLONG_QI_V8SI_V4DI_INT,
	VOID_FTYPE_PINT_QI_V4DI_V8SI_INT,
	VOID_FTYPE_PDOUBLE_QI_V4SI_V2DF_INT,
	VOID_FTYPE_PFLOAT_QI_V2DI_V4SF_INT,
	VOID_FTYPE_PLONGLONG_QI_V4SI_V2DI_INT,
	VOID_FTYPE_PINT_QI_V2DI_V4SI_INT): New builtin types.
	* config/i386/i386.c (enum ix86_builtins): Add
	IX86_BUILTIN_SCATTERALTSIV4DF, IX86_BUILTIN_SCATTERALTDIV8SF,
	IX86_BUILTIN_SCATTERALTSIV4DI, IX86_BUILTIN_SCATTERALTDIV8SI,
	IX86_BUILTIN_SCATTERALTSIV2DF, IX86_BUILTIN_SCATTERALTDIV4SF,
	IX86_BUILTIN_SCATTERALTSIV2DI and IX86_BUILTIN_SCATTERALTDIV4SI.
	(ix86_init_mmx_sse_builtins): Fix up names of
	IX86_BUILTIN_GATHERALT*, IX86_BUILTIN_GATHER3ALT* and
	IX86_BUILTIN_SCATTERALT* builtins to match the IX86_BUILTIN codes.
	Build IX86_BUILTIN_SCATTERALTSIV4DF, IX86_BUILTIN_SCATTERALTDIV8SF,
	IX86_BUILTIN_SCATTERALTSIV4DI, IX86_BUILTIN_SCATTERALTDIV8SI,
	IX86_BUILTIN_SCATTERALTSIV2DF, IX86_BUILTIN_SCATTERALTDIV4SF,
	IX86_BUILTIN_SCATTERALTSIV2DI and IX86_BUILTIN_SCATTERALTDIV4SI
	decls.
	(ix86_vectorize_builtin_scatter): Expand those new builtins.

	* gcc.target/i386/avx512f-pr88464-5.c: New test.
	* gcc.target/i386/avx512f-pr88464-6.c: New test.
	* gcc.target/i386/avx512f-pr88464-7.c: New test.
	* gcc.target/i386/avx512f-pr88464-8.c: New test.
	* gcc.target/i386/avx512vl-pr88464-5.c: New test.
	* gcc.target/i386/avx512vl-pr88464-6.c: New test.
	* gcc.target/i386/avx512vl-pr88464-7.c: New test.
	* gcc.target/i386/avx512vl-pr88464-8.c: New test.
	* gcc.target/i386/avx512vl-pr88464-9.c: New test.
	* gcc.target/i386/avx512vl-pr88464-10.c: New test.
	* gcc.target/i386/avx512vl-pr88464-11.c: New test.
	* gcc.target/i386/avx512vl-pr88464-12.c: New test.
	* gcc.target/i386/avx512vl-pr88464-13.c: New test.
	* gcc.target/i386/avx512vl-pr88464-14.c: New test.
	* gcc.target/i386/avx512vl-pr88464-15.c: New test.
	* gcc.target/i386/avx512vl-pr88464-16.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-5.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-6.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-7.c
    trunk/gcc/testsuite/gcc.target/i386/avx512f-pr88464-8.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-10.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-11.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-12.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-13.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-14.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-15.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-16.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-5.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-6.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-7.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-8.c
    trunk/gcc/testsuite/gcc.target/i386/avx512vl-pr88464-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-builtin-types.def
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
FYI, there are quite a few sequences like:

        kmovw   %k1, %r11d
        testb   %r11b, %r11b
        jne     .L63

(e.g. when compiling avx512f-pr88464-1.c).

Perhaps the ktest insn can be utilized here?
(In reply to Uroš Bizjak from comment #19)
> FYI, there are quite some sequences like:
>
>         kmovw   %k1, %r11d
>         testb   %r11b, %r11b
>         jne     .L63
>
> (e.g. when compiling avx512f-pr88464-1.c).
>
> Perhaps ktest insn can be utilized here?

It can if -mavx512dq, because we need ktestb rather than ktestw. Or we could do kandw first, but that would be longer (movl $255, %r11d; kmovw %r11d, %k2; kandw %k1, %k2; ktestw %k2, %k2). Have you tried with -mavx512dq? I haven't added it to the dg-options of the tests, because I wanted to test the behavior without it too, and duplicating all the tests also for -mavx512dq would be too much.
(In reply to Jakub Jelinek from comment #20)
> (In reply to Uroš Bizjak from comment #19)
> > FYI, there are quite some sequences like:
> >
> >         kmovw   %k1, %r11d
> >         testb   %r11b, %r11b
> >         jne     .L63
> >
> > (e.g. when compiling avx512f-pr88464-1.c).
> >
> > Perhaps ktest insn can be utilized here?
>
> It can if -mavx512dq, because we need ktestb rather than ktestw. Or we
> could do kandw first, but that would be longer (movl $255, %r11d; kmovw
> $r11d, %k2; kandw %k1, %k2; ktestw %k2, %k2). Have you tried with
> -mavx512dq? I haven't added it to dg-options of the tests, because I wanted
> to test the behavior without it too and duplicating all the tests also for
> -mavx512dq would be too much.

Indeed, -mavx512dq produces the expected ktestb. Sorry for the false alarm.
(In reply to Uroš Bizjak from comment #21)
> Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.

The f3 and f4 functions, however, can use ktestw even without -mavx512dq.
(In reply to Uroš Bizjak from comment #22)
> (In reply to Uroš Bizjak from comment #21)
> > Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.
>
> f3 and f4 functions however can use ktestw even without -mavx512dq.

Eh, not my day. We can use kortestw to check if the operand is zero with -mavx512f.
(In reply to Uroš Bizjak from comment #23)
> (In reply to Uroš Bizjak from comment #22)
> > (In reply to Uroš Bizjak from comment #21)
> > > Indeed, -mavx512dq produces expected ktestb. Sorry for the false alarm.
> >
> > f3 and f4 functions however can use ktestw even without -mavx512dq.
>
> Eh, not my day. We can use kortestw to check if operand is zero with
> -mavx512f.

So, something like:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 6e29427e30c..c529406cccc 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1244,20 +1244,20 @@
 	 (compare:CC (match_operand:SWI48 0 "nonimmediate_operand")
 		     (match_operand:SWI48 1 "<general_operand>")))])
 
-(define_mode_iterator SWI1248_AVX512BWDQ2_64
-  [(QI "TARGET_AVX512DQ") (HI "TARGET_AVX512DQ")
+(define_mode_iterator SWI1248_AVX512BWDQ_64
+  [(QI "TARGET_AVX512DQ") HI
    (SI "TARGET_AVX512BW") (DI "TARGET_AVX512BW && TARGET_64BIT")])
 
 (define_insn "*cmp<mode>_ccz_1"
   [(set (reg FLAGS_REG)
-	(compare (match_operand:SWI1248_AVX512BWDQ2_64 0
+	(compare (match_operand:SWI1248_AVX512BWDQ_64 0
 			"nonimmediate_operand" "<r>,?m<r>,$k")
-		 (match_operand:SWI1248_AVX512BWDQ2_64 1 "const0_operand")))]
-  "ix86_match_ccmode (insn, CCZmode)"
+		 (match_operand:SWI1248_AVX512BWDQ_64 1 "const0_operand")))]
+  "TARGET_AVX512F && ix86_match_ccmode (insn, CCZmode)"
   "@
    test{<imodesuffix>}\t%0, %0
    cmp{<imodesuffix>}\t{%1, %0|%0, %1}
-   ktest<mskmodesuffix>\t%0, %0"
+   kortest<mskmodesuffix>\t%0, %0"
   [(set_attr "type" "test,icmp,msklog")
    (set_attr "length_immediate" "0,1,*")
    (set_attr "prefix" "*,*,vex")
--cut here--
Aren't ktestw and kortestw the same thing when both operands are the same mask register?
And the TARGET_AVX512F && looks incorrect: then we wouldn't be able to use test or cmp without -mavx512f.
(In reply to Jakub Jelinek from comment #25)
> Isn't ktestw and kortestw the same thing when both operands are the same
> mask register?

True, but kortestw is available with AVX512F, where ktestw is not.

(In reply to Jakub Jelinek from comment #26)
> And the TARGET_AVX512F && looks incorrect, then we wouldn't be able to test
> or cmp without -mavx512f.

No, we fall through to *cmp<mode>_ccno_1, which is compatible with CCZmode.
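The equivalence being discussed follows from the flag definitions: ktest sets ZF iff (src1 & src2) == 0, kortest sets ZF iff (src1 | src2) == 0, and when both operands name the same register both conditions reduce to "the mask is zero". A quick scalar check of the ZF semantics for the w (16-bit) forms:

```c
/* ZF definitions of the two mask-test instructions (w forms). */
static int ktestw_zf(unsigned short a, unsigned short b)
{
    return (a & b) == 0;   /* KTESTW sets ZF iff the AND is zero */
}

static int kortestw_zf(unsigned short a, unsigned short b)
{
    return (a | b) == 0;   /* KORTESTW sets ZF iff the OR is zero */
}
```

For a == b, a & a == a | a == a, so the two instructions set ZF identically; the only difference here is that kortestw needs just AVX512F while ktestw needs AVX512DQ.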
(In reply to Uroš Bizjak from comment #27)
> (In reply to Jakub Jelinek from comment #25)
> > Isn't ktestw and kortestw the same thing when both operands are the same
> > mask register?
> True, but kortestw is available with AVX512F, where ktestw is not.
>
> (In reply to Jakub Jelinek from comment #26)
> > And the TARGET_AVX512F && looks incorrect, then we wouldn't be able to
> > test or cmp without -mavx512f.
> No, we fall to *cmp<mode>_ccno_1, which is compatible with CCZmode.

You're right, sorry for the noise. Your patch looks good to me.

There is another issue though (I guess not correctness, but efficiency), e.g. on avx512vl-pr88464-{1,3}.c. E.g. in avx512vl-pr88464-3.c we have:

  if (mask__40.16_82 == { 0, 0 })
    goto <bb 7>; [100.00%]
  else
    goto <bb 6>; [20.00%]

in *.optimized, an attempt to jump around masked stores if the mask is all zeros. We emit:

;; if (mask__40.16_82 == { 0, 0 })

(insn 44 43 45 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 131 [ mask__40.16 ])
            (const_int 0 [0]))) -1
     (nil))

(jump_insn 45 44 0 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 0)
            (pc))) -1
     (int_list:REG_BR_PROB 1073741831 (nil)))

for this, i.e.

        kmovw   %k2, %r10d
        testb   %r10b, %r10b
        je      .L4

(without -mavx512dq; I guess ktestb or kortestb with -mavx512dq), but perhaps we should emit kmovw %k2, %r10d; testb $3, %r10b; je .L4 instead? If the setter is a compare that clears the higher bits, then it makes no difference, but if we are e.g. looking at the low 2 or 4 bits of a 4- or 8-bit mask, then it will perform the masked store even though the 2 or 4 bits we care about are all clear and only some of the upper bits are set.
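The suggested testb $3 check amounts to masking the register value against the live-lane bits before testing for zero, so that stale upper bits of the mask register cannot prevent the skip. A sketch of that decision (function name invented; `lanes` stands for the vectorization factor):

```c
/* Decide whether a masked store can be skipped: only the low `lanes`
   bits of the mask register are meaningful, so mask them off before
   testing for zero (the proposed "testb $3, %r10b" for lanes == 2,
   as opposed to testing the whole byte). */
static int can_skip_masked_store(unsigned char k, unsigned lanes)
{
    unsigned char live = (unsigned char)((1u << lanes) - 1u);
    return (k & live) == 0;
}
```

For example, with a 2-lane store and k == 0x0C the naive whole-byte test sees a nonzero value and takes the store path, while only dead upper bits are set and the store could have been skipped.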