Bug 101340

Summary: SLP discovery via vect_slp_linearize_chain is imperfect
Product: gcc
Component: tree-optimization
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Keywords: missed-optimization
Reporter: Richard Biener <rguenth>
Assignee: Not yet assigned to anyone <unassigned>
CC: rsandifo
Target Milestone: ---
Bug Blocks: 53947
Attachments: testcase for the testsuite

Description Richard Biener 2021-07-06 10:52:05 UTC
For polyhedron capacita we have the hot fourir routine which, with -ffast-math,
suffers from reassoc perturbing the SLP lanes so that they no longer match.
SLP discovery reassociation could help here but it's limited by the
single_use check.

For loop vect we could allow uses outside of the chain, but of course we
do not want to expand multi-uses inside it.  That doesn't fit well with the
simple worklist approach and would need something more elaborate.
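As a hypothetical reduced illustration of the first case (made-up names, not
the attached testcase): here t has a second use outside of the +-chain, so
the single_use check stops discovery at t, even though for loop vect the
outside use could simply keep that lane live:

  double escape;

  double chain_with_outside_use (double a, double b, double c, double d)
  {
    double t = a + b;     /* part of the chain that linearization walks  */
    escape = t;           /* use outside of the chain: t is no longer
                             single-use, so discovery stops here  */
    return (t + c) + d;
  }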
Comment 1 Richard Biener 2021-07-06 10:52:51 UTC
Created attachment 51108 [details]
testcase for the testsuite
Comment 2 Richard Biener 2021-07-06 10:54:20 UTC
Patch that breaks, for example, gfortran.dg/PR100120.f90 because it expands
multi-uses inside a chain:

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 2813b3dbe91..0c93be8e4d5 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1491,7 +1491,11 @@ vect_slp_linearize_chain (vec_info *vinfo,
          gimple *use_stmt;
          use_operand_p use_p;
          if (dt == vect_internal_def
-             && single_imm_use (op, &use_p, &use_stmt)
+             /* For the loop SLP discovery case associate across multiple
+                uses as well, for BB vect avoid this since live lane
+                handling is not good enough yet.  */
+             && (is_a <loop_vec_info> (vinfo)
+                 || single_imm_use (op, &use_p, &use_stmt))
              && is_gimple_assign (def_stmt_info->stmt)
              && (gimple_assign_rhs_code (def_stmt_info->stmt) == code
                  || (code == PLUS_EXPR
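
For illustration, a hypothetical shape of the failure mode (made-up example,
not the gfortran testcase itself): with the check relaxed for loop vect,
linearization follows both in-chain uses of t and the a + b addition ends up
duplicated in the linearized chain:

  double multi_use_inside (double a, double b, double c)
  {
    double t = a + b;     /* both uses of t are inside the chain  */
    return (t + c) + t;   /* expanding both uses duplicates a + b  */
  }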
Comment 3 Richard Sandiford 2022-02-03 11:57:59 UTC
Not sure if this is exactly the same issue (I can file a separate
PR if it's not), but there's a similar inefficiency in
gcc.dg/vect/pr97832-2.c.  There we unroll:

#pragma GCC unroll 4
    for (int k = 0; k < 4; ++k) {
      double x_re = x[c+0+k];
      double x_im = x[c+4+k];
      double y_re = y[c+0+k];
      double y_im = y[c+4+k];
      y_re = y_re - x_re * f_re - x_im * f_im;
      y_im = y_im + x_re * f_im - x_im * f_re;
      y[c+0+k] = y_re;
      y[c+4+k] = y_im;
    }

The depths of the y_re and x_re calculations for k==0 are one
less than for k>0, due to the extra c+N additions for the latter.
k==0 therefore gets a lower reassociation rank, so we end
up with:

  _65 = f_re_34 * x_re_54;
  _66 = y_re_62 - _65;
  _67 = f_im_35 * x_im_60;
  y_re_68 = _66 - _67;

for k==0 but:

  _93 = f_re_34 * x_re_82;
  _95 = f_im_35 * x_im_88;
  _41 = _93 + _95;
  y_re_96 = y_re_90 - _41;

etc. for k>0.  This persists into the SLP code, where we use
the following load permutations:

  load permutation { 4 1 2 3 0 1 2 3 }
  load permutation { 0 5 6 7 4 5 6 7 }

With different reassociation we could have used:

  load permutation { 0 1 2 3 0 1 2 3 }
  load permutation { 4 5 6 7 4 5 6 7 }

instead.
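
For reference, a sketch in source form (a hypothetical rewrite of the k==0
statement from the loop above, reusing its variables; not actual GIMPLE): if
reassociation gave k==0 the same shape as k>0, all lanes would match and the
uniform permutations would fall out:

      double t = x_re * f_re + x_im * f_im;   /* same shape as the k>0 lanes  */
      y_re = y_re - t;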