[PATCH] vect: Fix load costs for SLP permutes

Richard Biener <richard.guenther@gmail.com>
Wed Oct 28 13:44:08 GMT 2020


On Wed, Oct 28, 2020 at 2:39 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> For the following test case (compiled with load/store lanes
> disabled locally):
>
>   #include <stdint.h>
>
>   void
>   f (uint32_t *restrict x, uint8_t *restrict y, int n)
>   {
>     for (int i = 0; i < n; ++i)
>       {
>         x[i * 2] = x[i * 2] + y[i * 2];
>         x[i * 2 + 1] = x[i * 2 + 1] + y[i * 2];
>       }
>   }
>
> we have a redundant no-op permute on the x[] load node:
>
>    node 0x4472350 (max_nunits=8, refcnt=2)
>           stmt 0 _5 = *_4;
>           stmt 1 _13 = *_12;
>           load permutation { 0 1 }
>
> Then, when costing it, we pick a cost of 1, even though we need 4 copies
> of the x[] load to match a single y[] load:
>
>    ==> examining statement: _5 = *_4;
>    Vectorizing an unaligned access.
>    vect_model_load_cost: unaligned supported by hardware.
>    vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
>
> The problem is that the code only considers the permutation for
> the first scalar iteration, rather than for all VF iterations.
>
> This patch tries to fix that by using similar logic to
> vect_transform_slp_perm_load.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

I wonder if we can instead do this counting in vect_transform_slp_perm_load
where we already count the number of permutes.  That would avoid
the duplication of the "logic".
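
[Editorial note: for reference, a minimal standalone sketch of the counting
logic in question.  This is not GCC code; the helper name, its parameters
and the V4SI / VF = 16 numbers below are illustrative assumptions.  The idea
is to mark every input lane that the SLP load permutation touches across all
VF scalar iterations, then count how many vector-sized chunks contain at
least one used lane.]

  // Standalone illustration only; names and parameters are made up
  // for this sketch and do not exist in GCC.
  #include <cstdio>
  #include <vector>

  /* Count how many vector loads of NUNITS lanes are needed for a grouped
     load of GROUP_SIZE lanes per scalar iteration, once the SLP load
     permutation PERM has been applied across VF scalar iterations.  */
  static unsigned
  count_vector_loads (unsigned group_size, unsigned nunits, unsigned vf,
                      const std::vector<unsigned> &perm)
  {
    unsigned nscalars = perm.size ();
    unsigned in_nlanes = group_size * vf;  /* Lanes the grouped load covers.  */
    unsigned out_nlanes = nscalars * vf;   /* Lanes the SLP node produces.  */

    /* Mark every input lane that some output lane actually reads.  */
    std::vector<bool> used (in_nlanes, false);
    for (unsigned i = 0; i < out_nlanes; ++i)
      {
        unsigned iter_num = i / nscalars;
        unsigned stmt_num = i % nscalars;
        used[iter_num * group_size + perm[stmt_num]] = true;
      }

    /* One vector load per NUNITS-sized chunk that has at least one used
       lane; untouched chunks are redundant or dead.  */
    unsigned ncopies = 0;
    bool load_seen = false;
    for (unsigned i = 0; i < in_nlanes; ++i)
      {
        if (i % nunits == 0)
          {
            if (load_seen)
              ncopies++;
            load_seen = false;
          }
        if (used[i])
          load_seen = true;
      }
    if (load_seen)
      ncopies++;
    return ncopies;
  }

  int
  main ()
  {
    /* The x[] node from the example: group size 2, permutation { 0 1 },
       and (assuming 128-bit V4SI vectors) nunits 4 with VF = 16.  */
    printf ("%u\n", count_vector_loads (2, 4, 16, { 0, 1 }));
    return 0;
  }

[With those assumed numbers this counts 8 vector loads for the x[] node,
whereas a count based on a single scalar iteration only (group size 2
against nunits 4) arrives at 1, matching the inside_cost = 1 in the dump
above.]
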

Richard.

> Richard
>
> [-b version also included]
>
>
> gcc/
>         * tree-vect-stmts.c (vect_model_load_cost): Use similar logic
>         to vect_transform_slp_perm_load when counting the number of
>         loads in a permuted SLP operation.
> ---
>  gcc/tree-vect-stmts.c | 67 ++++++++++++++++++++++++++-----------------
>  1 file changed, 41 insertions(+), 26 deletions(-)
>
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 3575f25241f..6eacd641e6b 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1099,38 +1099,53 @@ vect_model_load_cost (vec_info *vinfo,
>        stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>        /* Record the cost for the permutation.  */
>        unsigned n_perms;
> -      unsigned assumed_nunits
> -       = vect_nunits_for_cost (STMT_VINFO_VECTYPE (first_stmt_info));
>        vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL,
>                                     vf, true, &n_perms);
>        inside_cost += record_stmt_cost (cost_vec, n_perms, vec_perm,
>                                        first_stmt_info, 0, vect_body);
> -      /* And adjust the number of loads performed.  This handles
> -        redundancies as well as loads that are later dead.  */
> -      auto_sbitmap perm (DR_GROUP_SIZE (first_stmt_info));
> -      bitmap_clear (perm);
> -      for (unsigned i = 0;
> -          i < SLP_TREE_LOAD_PERMUTATION (slp_node).length (); ++i)
> -       bitmap_set_bit (perm, SLP_TREE_LOAD_PERMUTATION (slp_node)[i]);
> -      ncopies = 0;
> -      bool load_seen = false;
> -      for (unsigned i = 0; i < DR_GROUP_SIZE (first_stmt_info); ++i)
> -       {
> -         if (i % assumed_nunits == 0)
> +
> +      /* And if this is not a simple "load N vectors and then permute each
> +        vector internally" operation, adjust the number of loads performed.
> +        This handles redundancies as well as loads that are later dead.  */
> +      unsigned int nscalars = SLP_TREE_SCALAR_STMTS (slp_node).length ();
> +      unsigned int dr_group_size = DR_GROUP_SIZE (first_stmt_info);
> +      tree vectype = STMT_VINFO_VECTYPE (first_stmt_info);
> +      poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +      if (nscalars != dr_group_size
> +         || !multiple_p (nunits, nscalars))
> +       {
> +         /* The constant VF and nunits are enforced by
> +            vect_transform_slp_perm_load for this case.  */
> +         unsigned int const_nunits = nunits.to_constant ();
> +         unsigned int in_nlanes = dr_group_size * vf.to_constant ();
> +         unsigned int out_nlanes = nscalars * vf.to_constant ();
> +         auto_sbitmap perm (in_nlanes);
> +         bitmap_clear (perm);
> +         for (unsigned i = 0; i < out_nlanes; ++i)
> +           {
> +             unsigned int iter_num = i / nscalars;
> +             unsigned int stmt_num = i % nscalars;
> +             unsigned int in_lane
> +               = (iter_num * dr_group_size
> +                  + SLP_TREE_LOAD_PERMUTATION (slp_node)[stmt_num]);
> +             bitmap_set_bit (perm, in_lane);
> +           }
> +         ncopies = 0;
> +         bool load_seen = false;
> +         for (unsigned i = 0; i < in_nlanes; ++i)
>             {
> -             if (load_seen)
> -               ncopies++;
> -             load_seen = false;
> +             if (i % const_nunits == 0)
> +               {
> +                 if (load_seen)
> +                   ncopies++;
> +                 load_seen = false;
> +               }
> +             if (bitmap_bit_p (perm, i))
> +               load_seen = true;
>             }
> -         if (bitmap_bit_p (perm, i))
> -           load_seen = true;
> -       }
> -      if (load_seen)
> -       ncopies++;
> -      gcc_assert (ncopies
> -                 <= (DR_GROUP_SIZE (first_stmt_info)
> -                     - DR_GROUP_GAP (first_stmt_info)
> -                     + assumed_nunits - 1) / assumed_nunits);
> +         if (load_seen)
> +           ncopies++;
> +       }
>      }
>
>    /* Grouped loads read all elements in the group at once,
>

