[PATCH] vect: Fix load costs for SLP permutes
Richard Biener
richard.guenther@gmail.com
Wed Oct 28 13:44:08 GMT 2020
On Wed, Oct 28, 2020 at 2:39 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> For the following test case (compiled with load/store lanes
> disabled locally):
>
> #include <stdint.h>
>
> void
> f (uint32_t *restrict x, uint8_t *restrict y, int n)
> {
>   for (int i = 0; i < n; ++i)
>     {
>       x[i * 2] = x[i * 2] + y[i * 2];
>       x[i * 2 + 1] = x[i * 2 + 1] + y[i * 2];
>     }
> }
>
> we have a redundant no-op permute on the x[] load node:
>
>   node 0x4472350 (max_nunits=8, refcnt=2)
>     stmt 0 _5 = *_4;
>     stmt 1 _13 = *_12;
>     load permutation { 0 1 }
>
> Then, when costing it, we pick a cost of 1, even though we need 4 copies
> of the x[] load to match a single y[] load:
>
> ==> examining statement: _5 = *_4;
> Vectorizing an unaligned access.
> vect_model_load_cost: unaligned supported by hardware.
> vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
>
> The problem is that the code only considers the permutation for
> the first scalar iteration, rather than for all VF iterations.
>
> This patch tries to fix that by using similar logic to
> vect_transform_slp_perm_load.
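>
> To make the arithmetic concrete, here is a rough standalone sketch of
> the old and new counting, using plain arrays instead of sbitmaps.  The
> parameters (group size 2, load permutation { 0, 1 }, 4 uint32_t lanes
> per vector, VF 8) are assumptions chosen to match the example above:
>
> #include <stdio.h>
> #include <string.h>
>
> #define GROUP_SIZE 2   /* x[] loads per scalar iteration.  */
> #define NSCALARS 2     /* SLP statements using those loads.  */
> #define NUNITS 4       /* uint32_t lanes per vector (assumed).  */
> #define VF 8           /* vectorization factor (assumed).  */
>
> static const unsigned load_perm[NSCALARS] = { 0, 1 };
>
> /* Count how many NUNITS-lane vectors contain at least one live lane.  */
> static unsigned
> count_vectors (const char *live, unsigned nlanes)
> {
>   unsigned ncopies = 0;
>   int load_seen = 0;
>   for (unsigned i = 0; i < nlanes; ++i)
>     {
>       if (i % NUNITS == 0)
>         {
>           ncopies += load_seen;
>           load_seen = 0;
>         }
>       load_seen |= live[i];
>     }
>   return ncopies + load_seen;
> }
>
> int
> main (void)
> {
>   /* Old counting: mark only the lanes of one scalar iteration.  */
>   char old_live[GROUP_SIZE] = { 0 };
>   for (unsigned i = 0; i < NSCALARS; ++i)
>     old_live[load_perm[i]] = 1;
>
>   /* New counting: mark the input lanes of all VF iterations.  */
>   char new_live[GROUP_SIZE * VF];
>   memset (new_live, 0, sizeof new_live);
>   for (unsigned i = 0; i < NSCALARS * VF; ++i)
>     new_live[(i / NSCALARS) * GROUP_SIZE + load_perm[i % NSCALARS]] = 1;
>
>   printf ("old ncopies = %u, new ncopies = %u\n",
>           count_vectors (old_live, GROUP_SIZE),
>           count_vectors (new_live, GROUP_SIZE * VF));
>   /* Prints: old ncopies = 1, new ncopies = 4.  */
>   return 0;
> }
>
> The old walk sees only the 2 lanes of one scalar iteration, which fit
> in a single vector, and answers 1; the new walk spreads the permutation
> over the 16 lanes read by the 8 unrolled iterations and answers 4, the
> number of x[] vector loads the loop actually needs.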
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install?
I wonder if we can instead do this counting in vect_transform_slp_perm_load,
where we already count the number of permutes.  That would avoid
duplicating the "logic".
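
Roughly, as a sketch of the idea only (the extra n_loads out-parameter
is hypothetical, not an existing interface):

  /* Hypothetical: let the lane walk that already computes *N_PERMS also
     report how many distinct vector loads the permutation consumes.  */
  bool
  vect_transform_slp_perm_load (vec_info *vinfo, slp_tree node,
                                vec<tree> dr_chain, gimple_stmt_iterator *gsi,
                                poly_uint64 vf, bool analyze_only,
                                unsigned *n_perms, unsigned *n_loads);

vect_model_load_cost could then pass &n_loads alongside &n_perms instead
of redoing the bitmap walk.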
Richard.
> Richard
>
> [-b version also included]
>
>
> gcc/
>         * tree-vect-stmts.c (vect_model_load_cost): Use similar logic
>         to vect_transform_slp_perm_load when counting the number of
>         loads in a permuted SLP operation.
> ---
> gcc/tree-vect-stmts.c | 67 ++++++++++++++++++++++++++-----------------
> 1 file changed, 41 insertions(+), 26 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 3575f25241f..6eacd641e6b 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1099,38 +1099,53 @@ vect_model_load_cost (vec_info *vinfo,
>        stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>        /* Record the cost for the permutation.  */
>        unsigned n_perms;
> -      unsigned assumed_nunits
> -        = vect_nunits_for_cost (STMT_VINFO_VECTYPE (first_stmt_info));
>        vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL,
>                                      vf, true, &n_perms);
>        inside_cost += record_stmt_cost (cost_vec, n_perms, vec_perm,
>                                         first_stmt_info, 0, vect_body);
> -      /* And adjust the number of loads performed.  This handles
> -         redundancies as well as loads that are later dead.  */
> -      auto_sbitmap perm (DR_GROUP_SIZE (first_stmt_info));
> -      bitmap_clear (perm);
> -      for (unsigned i = 0;
> -           i < SLP_TREE_LOAD_PERMUTATION (slp_node).length (); ++i)
> -        bitmap_set_bit (perm, SLP_TREE_LOAD_PERMUTATION (slp_node)[i]);
> -      ncopies = 0;
> -      bool load_seen = false;
> -      for (unsigned i = 0; i < DR_GROUP_SIZE (first_stmt_info); ++i)
> -        {
> -          if (i % assumed_nunits == 0)
> +
> +      /* And if this is not a simple "load N vectors and then permute each
> +         vector internally" operation, adjust the number of loads performed.
> +         This handles redundancies as well as loads that are later dead.  */
> +      unsigned int nscalars = SLP_TREE_SCALAR_STMTS (slp_node).length ();
> +      unsigned int dr_group_size = DR_GROUP_SIZE (first_stmt_info);
> +      tree vectype = STMT_VINFO_VECTYPE (first_stmt_info);
> +      poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +      if (nscalars != dr_group_size
> +          || !multiple_p (nunits, nscalars))
> +        {
> +          /* The constant VF and nunits are enforced by
> +             vect_transform_slp_perm_load for this case.  */
> +          unsigned int const_nunits = nunits.to_constant ();
> +          unsigned int in_nlanes = dr_group_size * vf.to_constant ();
> +          unsigned int out_nlanes = nscalars * vf.to_constant ();
> +          auto_sbitmap perm (in_nlanes);
> +          bitmap_clear (perm);
> +          for (unsigned i = 0; i < out_nlanes; ++i)
> +            {
> +              unsigned int iter_num = i / nscalars;
> +              unsigned int stmt_num = i % nscalars;
> +              unsigned int in_lane
> +                = (iter_num * dr_group_size
> +                   + SLP_TREE_LOAD_PERMUTATION (slp_node)[stmt_num]);
> +              bitmap_set_bit (perm, in_lane);
> +            }
> +          ncopies = 0;
> +          bool load_seen = false;
> +          for (unsigned i = 0; i < in_nlanes; ++i)
>              {
> -              if (load_seen)
> -                ncopies++;
> -              load_seen = false;
> +              if (i % const_nunits == 0)
> +                {
> +                  if (load_seen)
> +                    ncopies++;
> +                  load_seen = false;
> +                }
> +              if (bitmap_bit_p (perm, i))
> +                load_seen = true;
>              }
> -          if (bitmap_bit_p (perm, i))
> -            load_seen = true;
> -        }
> -      if (load_seen)
> -        ncopies++;
> -      gcc_assert (ncopies
> -                  <= (DR_GROUP_SIZE (first_stmt_info)
> -                      - DR_GROUP_GAP (first_stmt_info)
> -                      + assumed_nunits - 1) / assumed_nunits);
> +          if (load_seen)
> +            ncopies++;
> +        }
>      }
>
>    /* Grouped loads read all elements in the group at once,
>