void foo (float * __restrict x, int *flag)
{
  for (int i = 0; i < 512; ++i)
    {
      if (flag[i])
        {
          float a = x[2*i+0] + 3.f;
          float b = x[2*i+1] + 177.f;
          x[2*i+0] = a;
          x[2*i+1] = b;
        }
    }
}

fails to vectorize on x86_64 with -march=znver4 (it needs masked stores enabled by tuning). This is because we support VMAT_CONTIGUOUS_PERMUTE for neither .MASK_LOAD nor .MASK_STORE. Simply enabling it shows that we fail to handle the mask operand properly. The proper solution is to handle them in SLP, which we do not do either.
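One way to picture what handling the mask involves: with SLP, the stores to x[2*i+0] and x[2*i+1] form a group, so one contiguous vector covers several iterations and each flag[i] has to be duplicated into two adjacent mask lanes. The portable C sketch below emulates this with a hypothetical 8-lane vector (VLEN) and plain loops standing in for real masked loads and stores. The names foo_ref, foo_slp_emul and variants_agree are illustrative only; this is not what the vectorizer actually emits.

```c
#define N 512   /* trip count from the testcase */
#define VLEN 8  /* hypothetical vector length in floats: 4 iterations per vector */

/* Scalar reference: the loop from the report. */
void foo_ref(float *restrict x, const int *flag)
{
  for (int i = 0; i < N; ++i)
    if (flag[i])
      {
        float a = x[2 * i + 0] + 3.f;
        float b = x[2 * i + 1] + 177.f;
        x[2 * i + 0] = a;
        x[2 * i + 1] = b;
      }
}

/* Emulated SLP vectorization: lane l of the vector starting at element BASE
   belongs to iteration (BASE + l) / 2, so its mask bit is flag[(BASE + l) / 2],
   i.e. every flag element is duplicated into two adjacent lanes. */
void foo_slp_emul(float *restrict x, const int *flag)
{
  static const float addend[2] = { 3.f, 177.f }; /* constant pattern {3, 177, 3, 177, ...} */
  for (int base = 0; base < 2 * N; base += VLEN)
    {
      int mask[VLEN];
      float v[VLEN];
      for (int l = 0; l < VLEN; ++l)
        {
          mask[l] = flag[(base + l) / 2] != 0;   /* duplicated mask lane */
          /* emulated masked load: inactive lanes do not read memory */
          v[l] = mask[l] ? x[base + l] + addend[l % 2] : 0.f;
        }
      for (int l = 0; l < VLEN; ++l)             /* emulated masked store */
        if (mask[l])
          x[base + l] = v[l];
    }
}

/* Returns 1 if both variants produce identical results on a fixed input. */
int variants_agree(void)
{
  static float x_ref[2 * N], x_vec[2 * N];
  static int flag[N];
  for (int i = 0; i < N; ++i)
    flag[i] = (i % 3) == 0;                      /* arbitrary mask pattern */
  for (int k = 0; k < 2 * N; ++k)
    x_ref[k] = x_vec[k] = (float) k;
  foo_ref(x_ref, flag);
  foo_slp_emul(x_vec, flag);
  for (int k = 0; k < 2 * N; ++k)
    if (x_ref[k] != x_vec[k])
      return 0;
  return 1;
}
```

A test driver calling variants_agree() should get 1. The point of the sketch is the mask[l] computation: the vector mask is not flag[] loaded contiguously but flag[] with each element repeated once per group member.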
I have a patch.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:a1558e9ad856938f165f838733955b331ebbec09

commit r14-3441-ga1558e9ad856938f165f838733955b331ebbec09
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Aug 23 14:28:26 2023 +0200

    tree-optimization/111115 - SLP of masked stores

    The following adds the capability to do SLP on .MASK_STORE, I do not
    plan to add interleaving support.

            PR tree-optimization/111115
    gcc/
            * tree-vectorizer.h (vect_slp_child_index_for_operand): New.
            * tree-vect-data-refs.cc (can_group_stmts_p): Also group
            .MASK_STORE.
            * tree-vect-slp.cc (arg3_arg2_map): New.
            (vect_get_operand_map): Handle IFN_MASK_STORE.
            (vect_slp_child_index_for_operand): New function.
            (vect_build_slp_tree_1): Handle statements with no LHS,
            masked store ifns.
            (vect_remove_slp_scalar_calls): Likewise.
            * tree-vect-stmts.cc (vect_check_store_rhs): Lookup the
            SLP child corresponding to the ifn value index.
            (vectorizable_store): Likewise for the mask index.  Support
            masked stores.
            (vectorizable_load): Lookup the SLP child corresponding to the
            ifn mask index.

    gcc/testsuite/
            * lib/target-supports.exp (check_effective_target_vect_masked_store):
            Supported with check_avx_available.
            * gcc.dg/vect/slp-mask-store-1.c: New testcase.
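The ChangeLog mentions an operand map (arg3_arg2_map) and vect_slp_child_index_for_operand, which let vect_check_store_rhs and vectorizable_store find the SLP child holding the stored value and the one holding the mask. Below is a minimal sketch of that kind of lookup, assuming a map layout of {number of children, call argument for child 0, call argument for child 1, ...} and assuming .MASK_STORE takes (pointer, alignment, mask, value), so the value is argument 3 and the mask is argument 2. The names mask_store_operand_map and child_index_for_operand are hypothetical; this is not GCC's source.

```c
#include <stddef.h>

/* Hypothetical operand map for .MASK_STORE (ptr, align, mask, value):
   SLP child 0 is call argument 3 (the stored value),
   SLP child 1 is call argument 2 (the mask). */
const int mask_store_operand_map[] = { 2, 3, 2 };

/* Return the SLP child index that holds call argument OP, or -1 if that
   argument is not represented as an SLP child (e.g. the pointer).
   Without a map, children are assumed to follow argument order. */
int child_index_for_operand(const int *map, int op)
{
  if (map == NULL)
    return op;
  for (int i = 0; i < map[0]; ++i)
    if (map[1 + i] == op)
      return i;
  return -1;
}
```

With such a map, code that knows "the mask is argument 2" does not need to know how the SLP children were ordered; it asks the lookup instead of hard-coding a child index.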
Fixed for GCC 14.