Consider the following snippet: void f(int n, int m, double *a) { #pragma omp simd for (int i = 0; i < n; i++) for (int j = 0; j < m; j++) a[i] += 2*a[i] + j; } where the objective is to vectorize the outer loop. At the moment GCC will refuse to vectorize this due to the inner loop. However, this loop presents no issues and, indeed, if m is substituted for a small constant it will vectorize fine (presumably because of unrolling). When m is known at compile time (pretty common) and the loop body is small (such as in this example) unrolling is viable. But for larger inner loop bodies it quickly becomes expensive and leads to large amounts of unnecessary code bloat. It would therefore be nice if the vectorizer could explicitly recognize this idiom of a non-problematic inner loop. (For some context this loop structure appears frequently in PDE solvers where you need to apply some kind of iterative method at each grid-point. Typically, with something like Newton's method we can bound the trip count and thus avoid breaks/tests, thus giving rise to these inner loops with fixed trip counts.)
The issue is that we fail to hoist the m > 0 check out of the outer loop: t.c:4:23: note: Considering guard 7 -> 6 in loop 1 t.c:4:23: missed: Block 5 has side effects <bb 5> [local count: 105119324]: # _10 = PHI <_8(4)> *_3 = _10; I think the logic in find_loop_guard is a bit flawed and it checks too many blocks for side-effects. I'm testing a fix.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:da64698159fe69b68f5264b54cebcb67c501b3cf commit r15-5082-gda64698159fe69b68f5264b54cebcb67c501b3cf Author: Richard Biener <rguenther@suse.de> Date: Mon Nov 11 09:40:20 2024 +0100 tree-optimization/117510 - fix guard hoisting validity check For the loop in the testcase we currently fail to hoist the guard check of the inner loop (m > 0) out of the outer loop because find_loop_guard checks all blocks of the outer loop for side-effects, including those that are skipped by the guard. This usually is harmless as the guard does not skip any blocks in the outer loop but in this case store-motion was applied to the inner loop and thus there's now a skipped store in the outer loop. The following properly skips blocks that are dominated by the entry to the skipped region. PR tree-optimization/117510 * tree-ssa-loop-unswitch.cc (find_loop_guard): Only check not skipped blocks for side-effects. * gcc.dg/vect/vect-outer-pr117510.c: New testcase.
Fixed for GCC 15.
(In reply to Richard Biener from comment #3) > Fixed for GCC 15. Thanks! If I have cases which, when m is a compile time constant, vectorize for m small but not m large is that likely to be a separate issue?
(In reply to Freddie Witherden from comment #4) > (In reply to Richard Biener from comment #3) > > Fixed for GCC 15. > > Thanks! If I have cases which, when m is a compile time constant, vectorize > for m small but not m large is that likely to be a separate issue? I guess it's the same issue if for the small M we unroll but for the large M is not. I've checked that with m constant 234 we now vectorize the outer loop.