For Polyhedron capacita we have the hot fourier routine which, with -ffast-math, suffers from reassociation perturbing the SLP lanes so that they no longer match. SLP discovery reassociation could help here, but it is limited by the single_use check. For loop vectorization we could allow uses outside of the chain, but of course we do not want to expand multi-uses inside the chain. That doesn't fit well with the simple worklist approach and would need something more elaborate.
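To illustrate the single_use limitation (a hypothetical reduced example, not from the PR; all names are made up): an intermediate sum in an associative chain has a second use outside the chain, so the single_imm_use check stops SLP discovery from linearizing across it.

  /* t1 is part of the a+b+c+d chain but also feeds *extra, so
     single_imm_use (t1, ...) fails and chain linearization stops
     at t1 instead of seeing the full lane set { a, b, c, d }.  */
  double chain (double a, double b, double c, double d, double *extra)
  {
    double t1 = a + b;   /* multi-use: in the chain and stored below  */
    double t2 = t1 + c;
    *extra = t1;         /* second use, outside the chain  */
    return t2 + d;
  }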
Created attachment 51108: testcase for the testsuite
Patch that breaks for example gfortran.dg/PR100120.f90 because it expands multi-uses inside a chain:

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 2813b3dbe91..0c93be8e4d5 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1491,7 +1491,11 @@ vect_slp_linearize_chain (vec_info *vinfo,
       gimple *use_stmt;
       use_operand_p use_p;
       if (dt == vect_internal_def
-	  && single_imm_use (op, &use_p, &use_stmt)
+	  /* For the loop SLP discovery case associate across multiple
+	     uses as well, for BB vect avoid this since live lane
+	     handling is not good enough yet.  */
+	  && (is_a <loop_vec_info> (vinfo)
+	      || single_imm_use (op, &use_p, &use_stmt))
	  && is_gimple_assign (def_stmt_info->stmt)
	  && (gimple_assign_rhs_code (def_stmt_info->stmt) == code
	      || (code == PLUS_EXPR
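For reference, a hypothetical sketch of the failure mode the new comment alludes to (not the actual PR100120.f90 reduction): when both uses of an intermediate value are inside the same chain, associating through it duplicates its operands rather than keeping the value live.

  /* t appears twice inside the chain; linearizing through both uses
     would turn a + b + (a + b) into the lanes { a, b, a, b },
     double-counting the multi-use value instead of keeping t live.  */
  double twice (double a, double b)
  {
    double t = a + b;
    return t + t;
  }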
Not sure if this is exactly the same issue (I can file a separate PR if it's not), but there's a similar inefficiency in gcc.dg/vect/pr97832-2.c. There we unroll:

  #pragma GCC unroll 4
  for (int k = 0; k < 4; ++k) {
    double x_re = x[c+0+k];
    double x_im = x[c+4+k];
    double y_re = y[c+0+k];
    double y_im = y[c+4+k];
    y_re = y_re - x_re * f_re - x_im * f_im;
    y_im = y_im + x_re * f_im - x_im * f_re;
    y[c+0+k] = y_re;
    y[c+4+k] = y_im;
  }

The depth of the y_re and x_re calculations for k==0 is one less than for k>0, due to the extra c+N additions for the latter. k==0 therefore gets a lower reassociation rank, so we end up with:

  _65 = f_re_34 * x_re_54;
  _66 = y_re_62 - _65;
  _67 = f_im_35 * x_im_60;
  y_re_68 = _66 - _67;

for k==0 but:

  _93 = f_re_34 * x_re_82;
  _95 = f_im_35 * x_im_88;
  _41 = _93 + _95;
  y_re_96 = y_re_90 - _41;

etc. for k>0. This persists into the SLP code, where we use the following load permutations:

  load permutation { 4 1 2 3 0 1 2 3 }
  load permutation { 0 5 6 7 4 5 6 7 }

With different reassociation we could have used:

  load permutation { 0 1 2 3 0 1 2 3 }
  load permutation { 4 5 6 7 4 5 6 7 }

instead.
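For completeness, this is what the k==0 statements would look like if reassociated to match the k>0 shape (a hypothetical GIMPLE rewrite reusing the SSA names quoted above; _t is a made-up temporary), which would make all four lanes uniform and enable the contiguous permutations:

  /* Hypothetical k==0 form matching the k>0 association.  */
  _65 = f_re_34 * x_re_54;
  _67 = f_im_35 * x_im_60;
  _t = _65 + _67;
  y_re_68 = y_re_62 - _t;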