Bug 117510 - Inner loop with static trip count breaks vectorization of outer loop
Summary: Inner loop with static trip count breaks vectorization of outer loop
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 14.2.0
: P3 enhancement
Target Milestone: 15.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2024-11-08 22:11 UTC by Freddie Witherden
Modified: 2024-11-11 12:24 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-11-11 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Freddie Witherden 2024-11-08 22:11:18 UTC
Consider the following snippet:

void f(int n, int m, double *a)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i] += 2*a[i] + j;
}

where the objective is to vectorize the outer loop.  At the moment GCC will refuse to vectorize this due to the inner loop.  However, this loop presents no issues and, indeed, if m is substituted for a small constant it will vectorize fine (presumably because of unrolling).

When m is known at compile time (pretty common) and the loop body is small (such as in this example) unrolling is viable.  But for larger inner loop bodies it quickly becomes expensive and leads to large amounts of unnecessary code bloat.  It would therefore be nice if the vectorizer could explicitly recognize this idiom of a non-problematic inner loop.

(For some context this loop structure appears frequently in PDE solvers where you need to apply some kind of iterative method at each grid-point.  Typically, with something like Newton's method we can bound the trip count and thus avoid breaks/tests, thus giving rise to these inner loops with fixed trip counts.)
Comment 1 Richard Biener 2024-11-11 08:34:22 UTC
The issue is that we fail to hoist the m > 0 check out of the outer loop:

t.c:4:23: note: Considering guard 7 -> 6 in loop 1
t.c:4:23: missed: Block 5 has side effects

  <bb 5> [local count: 105119324]:
  # _10 = PHI <_8(4)>
  *_3 = _10;

I think the logic in find_loop_guard is a bit flawed and it checks too many
blocks for side-effects.

I'm testing a fix.
Comment 2 GCC Commits 2024-11-11 12:03:37 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:da64698159fe69b68f5264b54cebcb67c501b3cf

commit r15-5082-gda64698159fe69b68f5264b54cebcb67c501b3cf
Author: Richard Biener <rguenther@suse.de>
Date:   Mon Nov 11 09:40:20 2024 +0100

    tree-optimization/117510 - fix guard hoisting validity check
    
    For the loop in the testcase we currently fail to hoist the guard
    check of the inner loop (m > 0) out of the outer loop because
    find_loop_guard checks all blocks of the outer loop for side-effects,
    including those that are skipped by the guard.  This usually
    is harmless as the guard does not skip any blocks in the outer loop
    but in this case store-motion was applied to the inner loop and thus
    there's now a skipped store in the outer loop.
    
    The following properly skips blocks that are dominated by the
    entry to the skipped region.
    
            PR tree-optimization/117510
            * tree-ssa-loop-unswitch.cc (find_loop_guard): Only check
            not skipped blocks for side-effects.
    
            * gcc.dg/vect/vect-outer-pr117510.c: New testcase.
Comment 3 Richard Biener 2024-11-11 12:03:54 UTC
Fixed for GCC 15.
Comment 4 Freddie Witherden 2024-11-11 12:15:42 UTC
(In reply to Richard Biener from comment #3)
> Fixed for GCC 15.

Thanks! If I have cases which, when m is a compile time constant, vectorize for m small but not m large is that likely to be a separate issue?
Comment 5 Richard Biener 2024-11-11 12:24:22 UTC
(In reply to Freddie Witherden from comment #4)
> (In reply to Richard Biener from comment #3)
> > Fixed for GCC 15.
> 
> Thanks! If I have cases which, when m is a compile time constant, vectorize
> for m small but not m large is that likely to be a separate issue?

I guess it's the same issue if for the small M we unroll but for the large M is not.  I've checked that with m constant 234 we now vectorize the outer loop.