[Bug tree-optimization/98535] [11 Regression] ICE in operands_scanner::get_expr_operands(tree_node**, int) building 538.imagick_r

Fri Jan 22 09:13:36 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98535

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-10 branch has been updated by Richard Sandiford
<rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:51b23ba76f00610360a023de4cdae1641ca3b961

commit r10-9287-g51b23ba76f00610360a023de4cdae1641ca3b961
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Fri Jan 22 09:13:12 2021 +0000

    vect: Fix VLA SLP invariant optimisation [PR98535]

    duplicate_and_interleave is the main fallback way of loading
    a repeating sequence of elements into variable-length vectors.
    The code handles cases in which the number of elements in the
    sequence is potentially several times greater than the number
    of elements in a vector.

    Let:

    - NE be the (compile-time) number of elements in the sequence
    - NR be the (compile-time) number of vector results and
    - VE be the (run-time) number of elements in each vector

    The basic approach is to duplicate each element into a
    separate vector, giving NE vectors in total, then use
    log2(NE) rows of NE permutes to generate NE results.

    In the worst case --- when VE has no known compile-time factor
    and NR >= NE --- all of these permutes are necessary.  However,
    if VE is known to be a multiple of 2**F, then each of the
    first F permute rows produces duplicate results; specifically,
    the high permute for a given pair is the same as the low permute.
    The code dealt with this by reusing the low result for the
    high result.  This part was OK.

    However, having duplicate results from one row meant that the
    next row did duplicate work.  The redundancies would be optimised
    away by later passes, but the code tried to avoid generating them
    in the first place.  This is the part that went wrong.

    Specifically, NR is typically less than NE when some permutes are
    redundant, so the code tried to use NR to reduce the amount of work
    performed.  The problem was that, although it correctly calculated
    a conservative bound on how many results were needed in each row,
    it chose the wrong results for anything other than the final row.

    This doesn't usually matter for fully-packed SVE vectors.  We first
    try to coalesce smaller elements into larger ones, so normally
    VE ends up being 2**VQ (where VQ is the number of 128-bit blocks
    in an SVE vector).  In that situation we'd only apply the faulty
    optimisation to the final row, i.e. the case it handled correctly.
    E.g. for things like:

      void
      f (long *x)
      {
        for (int i = 0; i < 100; i += 8)
          {
            x[i] += 1;
            x[i + 1] += 2;
            x[i + 2] += 3;
            x[i + 3] += 4;
            x[i + 4] += 5;
            x[i + 5] += 6;
            x[i + 6] += 7;
            x[i + 7] += 8;
          }
      }

    (already tested by the testsuite), we'd have 3 rows of permutes
    producing 4 vector results.  The schemne produced:

    1st row: 8 results from 4 permutes, highs duplicates of lows
    2nd row: 8 results from 8 permutes (half of which are actually redundant)
    3rd row: 4 results from 4 permutes

    However, coalescing elements is trickier for unpacked vectors,
    and at the moment we don't try to do it (see the GET_MODE_SIZE
    check in can_duplicate_and_interleave_p).  Unpacked vectors
    therefore stress the code in ways that packed vectors didn't.

    The patch fixes this by removing the redundancies from each row,
    rather than trying to work around them later.  This also removes
    the redundant work in the second row of the example above.

    gcc/
            PR tree-optimization/98535
            * tree-vect-slp.c (duplicate_and_interleave): Use
quick_grow_cleared.
            If the high and low permutes are the same, remove the high permutes
            from the working set and only continue with the low ones.

    (cherry picked from commit ea74a3f548eb321429c371e327e778e63d9128a0)