For the testcase

void foo( int* restrict x, int n, int start, int m, int* restrict ret )
{
  for (int i = 0; i < n; i++)
    {
      int pos = start + i;
      if (pos <= m)
        ret[0] += x[i];
    }
}

compiled with -O3 -mavx2, the loop is not vectorized: ret[0] += x[i] becomes a zero-step MASK_STORE inside the loop, and data-reference (DR) analysis fails for a zero-step store.

But with loop store motion applied manually,

void foo2( int* restrict x, int n, int start, int m, int* restrict ret )
{
  int tmp = 0;
  for (int i = 0; i < n; i++)
    {
      int pos = start + i;
      if (pos <= m)
        tmp += x[i];
    }
  ret[0] += tmp;
}

the loop can be vectorized.

godbolt: https://godbolt.org/z/Kcv8hP

There is no LIM pass between ifcvt and vect, and the current LIM cannot handle MASK_STORE. Is there any possibility of vectorizing foo, for example by doing loop store motion in ifcvt instead of creating a MASK_STORE?
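As a sanity check on the manual transformation, the following is a minimal sketch of a harness that compares the two functions on the same input. The helper name same_result and the test values are hypothetical and not part of the report. Note that it only compares results: foo2 always reads and writes ret[0], while foo touches it only when the condition holds for some iteration.

```c
#include <stdbool.h>

/* Original form: conditional read-modify-write of ret[0] inside the loop. */
void foo(int *restrict x, int n, int start, int m, int *restrict ret)
{
    for (int i = 0; i < n; i++) {
        int pos = start + i;
        if (pos <= m)
            ret[0] += x[i];
    }
}

/* Hand-applied store motion: accumulate locally, update ret[0] once. */
void foo2(int *restrict x, int n, int start, int m, int *restrict ret)
{
    int tmp = 0;
    for (int i = 0; i < n; i++) {
        int pos = start + i;
        if (pos <= m)
            tmp += x[i];
    }
    ret[0] += tmp;
}

/* Hypothetical helper: run both variants from the same initial value
   and report whether they agree. */
bool same_result(int *x, int n, int start, int m, int init)
{
    int r1 = init;
    int r2 = init;
    foo(x, n, start, m, &r1);
    foo2(x, n, start, m, &r2);
    return r1 == r2;
}
```

To see which of the two loops GCC vectorizes locally, building with -O3 -mavx2 -fopt-info-vec should report it.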
The issue is that we need to vectorize this as a reduction, and since there is no "masked scalar store" on GIMPLE, LIM by itself doesn't help. The reason LIM doesn't apply store motion here is the _load_, which can trap. LIM would like to do

ret0 = ret[0];
bool stored = false;
for (int i = 0; i < n; i++)
  {
    int pos = start + i;
    if (pos <= m)
      {
        ret0 += x[i];
        stored = true;
      }
  }
if (stored)
  ret[0] = ret0;

but as you can see, the unconditional load breaks this. LIM would need to be changed to handle the whole load-update-store sequence, delaying the load as well (thereby re-associating the reduction).

An alternative would be to split the loop and apply store motion to the tail:

int i;
for (i = 0; i < n; i++)
  {
    int pos = start + i;
    if (pos <= m)
      break;
  }
if (i < n)
  {
    ret0 = ret[0];
    for (; i < n; i++)
      {
        int pos = start + i;
        if (pos <= m)
          ret0 += x[i];
      }
    ret[0] = ret0;
  }

We can then vectorize the second loop.

At the source level, the fix is to make sure the load from ret[0] doesn't trap.
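For illustration, the loop-splitting idea can be written out as compilable C. This is only a sketch of the transformation applied by hand, not what any GCC pass emits, and the name foo_split is hypothetical. The first loop searches for the first iteration whose condition holds. Only if one exists does the function load ret[0], so the load happens exactly when the original code would also have accessed ret[0].

```c
/* Hand-written sketch of loop splitting plus store motion on the tail.
   foo_split is a hypothetical name used only for this example. */
void foo_split(int *restrict x, int n, int start, int m, int *restrict ret)
{
    int i;

    /* Head: find the first iteration in which the store would happen. */
    for (i = 0; i < n; i++) {
        int pos = start + i;
        if (pos <= m)
            break;
    }

    if (i < n) {
        /* Tail: ret[0] is known to be accessed by the original loop,
           so loading it up front cannot introduce a new trap. */
        int ret0 = ret[0];
        for (; i < n; i++) {
            int pos = start + i;
            if (pos <= m)
                ret0 += x[i];
        }
        ret[0] = ret0;
    }
}
```

When no iteration satisfies pos <= m, ret is never dereferenced, which preserves the behavior of the original loop even if ret does not point to accessible memory.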