| Summary: | SLP discovery via vect_slp_linearize_chain is imperfect | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Richard Biener <rguenth> |
| Component: | tree-optimization | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | UNCONFIRMED | | |
| Severity: | normal | CC: | rsandifo |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 12.0 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | |
| Bug Depends on: | | | |
| Bug Blocks: | 53947 | | |
| Attachments: | testcase for the testsuite | | |
Description
Richard Biener 2021-07-06 10:52:05 UTC
Created attachment 51108 [details]
testcase for the testsuite
A patch that breaks, for example, gfortran.dg/PR100120.f90 because it expands multi-uses inside a chain:

```diff
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 2813b3dbe91..0c93be8e4d5 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1491,7 +1491,11 @@ vect_slp_linearize_chain (vec_info *vinfo,
       gimple *use_stmt;
       use_operand_p use_p;
       if (dt == vect_internal_def
-	  && single_imm_use (op, &use_p, &use_stmt)
+	  /* For the loop SLP discovery case associate across multiple
+	     uses as well, for BB vect avoid this since live lane
+	     handling is not good enough yet.  */
+	  && (is_a <loop_vec_info> (vinfo)
+	      || single_imm_use (op, &use_p, &use_stmt))
 	  && is_gimple_assign (def_stmt_info->stmt)
 	  && (gimple_assign_rhs_code (def_stmt_info->stmt) == code
 	      || (code == PLUS_EXPR
```

Not sure if this is exactly the same issue (I can file a separate PR if it's not), but there's a similar inefficiency in gcc.dg/vect/pr97832-2.c. There we unroll:

```c
#pragma GCC unroll 4
for (int k = 0; k < 4; ++k) {
  double x_re = x[c+0+k];
  double x_im = x[c+4+k];
  double y_re = y[c+0+k];
  double y_im = y[c+4+k];
  y_re = y_re - x_re * f_re - x_im * f_im;;
  y_im = y_im + x_re * f_im - x_im * f_re;
  y[c+0+k] = y_re;
  y[c+4+k] = y_im;
}
```

The depth of the y_re and x_re calculations for k==0 is one less than for k>0, due to the extra c+N additions for the latter. k==0 therefore gets a lower reassociation rank, so we end up with:

```
  _65 = f_re_34 * x_re_54;
  _66 = y_re_62 - _65;
  _67 = f_im_35 * x_im_60;
  y_re_68 = _66 - _67;
```

for k==0 but:

```
  _93 = f_re_34 * x_re_82;
  _95 = f_im_35 * x_im_88;
  _41 = _93 + _95;
  y_re_96 = y_re_90 - _41;
```

etc. for k>0.

This persists into the SLP code, where we use the following load permutations:

```
load permutation { 4 1 2 3 0 1 2 3 }
load permutation { 0 5 6 7 4 5 6 7 }
```

With different reassociation we could have used:

```
load permutation { 0 1 2 3 0 1 2 3 }
load permutation { 4 5 6 7 4 5 6 7 }
```

instead.
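A minimal sketch (illustrative only, not GCC code) of the two observations above: the k==0 and k>0 GIMPLE shapes compute the same value once reassociation is permitted, and the preferable load permutation is simply one half of the 8-element load group repeated, whereas the emitted one mixes lanes from both halves:

```python
# Hypothetical illustration of PR comments above; names are made up.

# The k == 0 shape ((y_re - f_re*x_re) - f_im*x_im) and the k > 0 shape
# (y_re - (f_re*x_re + f_im*x_im)) agree once reassociation is allowed.
# Exactly representable values are used so both orders round identically
# (in general they can differ, which is why reassociation of FP needs
# -ffast-math-style flags).
y_re, f_re, x_re, f_im, x_im = 8.0, 2.0, 3.0, 1.0, 2.0
k0_shape = y_re - f_re * x_re - f_im * x_im
kn_shape = y_re - (f_re * x_re + f_im * x_im)
assert k0_shape == kn_shape

# Lane selection: indices into the 8-element load group y[c+0] .. y[c+7].
group = [f"y[c+{i}]" for i in range(8)]
actual = [group[i] for i in (4, 1, 2, 3, 0, 1, 2, 3)]  # mixes both halves
better = [group[i] for i in (0, 1, 2, 3, 0, 1, 2, 3)]  # repeats low half

# The better permutation is just the low half of the group twice ...
assert better == group[:4] * 2
# ... while the emitted one is not a repetition of either half.
assert actual[:4] != actual[4:]
```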