This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Fix a case in which the vector cost model was ignored


On Mon, Mar 18, 2019 at 10:59 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> This patch fixes a case in which we vectorised something with a
> fully-predicated loop even after the cost model had rejected it.
> E.g. the loop in the testcase has the costs:
>
>   Vector inside of loop cost: 27
>   Vector prologue cost: 0
>   Vector epilogue cost: 0
>   Scalar iteration cost: 7
>   Scalar outside cost: 6
>   Vector outside cost: 0
>   prologue iterations: 0
>   epilogue iterations: 0
>
> and we can see that the loop executes at most three times, but we
> decided to vectorise it anyway.
>
> (The costs here are equal for three iterations, but the same thing
> happens even when the vector code is strictly more expensive.)
>
> The problem is the handling of "/VF" in:
>
>   /* Calculate number of iterations required to make the vector version
>      profitable, relative to the loop bodies only.  The following condition
>      must hold true:
>      SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
>      where
>      SIC = scalar iteration cost, VIC = vector iteration cost,
>      VOC = vector outside cost, VF = vectorization factor,
>      PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
>      SOC = scalar outside cost for run time cost model check.  */
>
> We treat the "/VF" as truncating, but for fully-predicated loops, it's
> closer to a ceil division, since fractional iterations are handled by a
> full iteration with some predicate bits set to false.
>
> The easiest fix seemed to be to calculate the minimum number of vector
> iterations first, then use that to calculate the minimum number of scalar
> iterations.
>
> Calculating the minimum number of vector iterations might make sense for
> unpredicated loops too, since calculating the scalar niters directly
> doesn't take into account the fact that the VIC multiple has to be an
> integer.  But the handling of PL_ITERS and EP_ITERS for unpredicated
> loops is a bit hand-wavy anyway, so maybe vagueness here cancels out
> vagueness there?

Well, their estimate if we do not know them statically is.  If we statically
know pl/ep iters the formulat should match, no?  Hopefully for GCC 10
we can even fix the case in PR89754.

> Either way, changing this for unpredicated loops would be much too
> invasive for stage 4, so the patch keeps it specific to fully-predicated
> loops (i.e. SVE) for now.  There's no functional change for other targets.
>
> Tested on aarch64-linux-gnu with and without SVE, and on x86_64-linux-gnu.
> This is a regression introduced by the original cost model patches for
> fully-predicated loops, so OK for GCC 9?

OK.

Thanks,
Richard.

> Thanks,
> Richard
>
>
> 2019-03-18  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the
>         calculation of the minimum number of scalar iterations for
>         fully-predicated loops.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/cost_model_1.c: New test.
>
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2019-03-08 18:15:33.668751875 +0000
> +++ gcc/tree-vect-loop.c        2019-03-18 09:55:03.257194326 +0000
> @@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop
>    /* Calculate number of iterations required to make the vector version
>       profitable, relative to the loop bodies only.  The following condition
>       must hold true:
> -     SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> +     SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC
>       where
>       SIC = scalar iteration cost, VIC = vector iteration cost,
>       VOC = vector outside cost, VF = vectorization factor,
> -     PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
> +     NPEEL = prologue iterations + epilogue iterations,
>       SOC = scalar outside cost for run time cost model check.  */
>
> -  if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost)
> +  int saving_per_viter = (scalar_single_iter_cost * assumed_vf
> +                         - vec_inside_cost);
> +  if (saving_per_viter <= 0)
> +    {
> +      if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
> +       warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
> +                   "vectorization did not happen for a simd loop");
> +
> +      if (dump_enabled_p ())
> +        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cost model: the vector iteration cost = %d "
> +                        "divided by the scalar iteration cost = %d "
> +                        "is greater or equal to the vectorization factor = %d"
> +                         ".\n",
> +                        vec_inside_cost, scalar_single_iter_cost, assumed_vf);
> +      *ret_min_profitable_niters = -1;
> +      *ret_min_profitable_estimate = -1;
> +      return;
> +    }
> +
> +  /* ??? The "if" arm is written to handle all cases; see below for what
> +     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +    {
> +      /* Rewriting the condition above in terms of the number of
> +        vector iterations (vniters) rather than the number of
> +        scalar iterations (niters) gives:
> +
> +        SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC
> +
> +        <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC
> +
> +        For integer N, X and Y when X > 0:
> +
> +        N * X > Y <==> N >= (Y /[floor] X) + 1.  */
> +      int outside_overhead = (vec_outside_cost
> +                             - scalar_single_iter_cost * peel_iters_prologue
> +                             - scalar_single_iter_cost * peel_iters_epilogue
> +                             - scalar_outside_cost);
> +      /* We're only interested in cases that require at least one
> +        vector iteration.  */
> +      int min_vec_niters = 1;
> +      if (outside_overhead > 0)
> +       min_vec_niters = outside_overhead / saving_per_viter + 1;
> +
> +      if (dump_enabled_p ())
> +       dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
> +                    min_vec_niters);
> +
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +       {
> +         /* Now that we know the minimum number of vector iterations,
> +            find the minimum niters for which the scalar cost is larger:
> +
> +            SIC * niters > VIC * vniters + VOC - SOC
> +
> +            We know that the minimum niters is no more than
> +            vniters * VF + NPEEL, but it might be (and often is) less
> +            than that if a partial vector iteration is cheaper than the
> +            equivalent scalar code.  */
> +         int threshold = (vec_inside_cost * min_vec_niters
> +                          + vec_outside_cost
> +                          - scalar_outside_cost);
> +         if (threshold <= 0)
> +           min_profitable_iters = 1;
> +         else
> +           min_profitable_iters = threshold / scalar_single_iter_cost + 1;
> +       }
> +      else
> +       /* Convert the number of vector iterations into a number of
> +          scalar iterations.  */
> +       min_profitable_iters = (min_vec_niters * assumed_vf
> +                               + peel_iters_prologue
> +                               + peel_iters_epilogue);
> +    }
> +  else
>      {
>        min_profitable_iters = ((vec_outside_cost - scalar_outside_cost)
>                               * assumed_vf
> @@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop
>          min_profitable_iters = 0;
>        else
>         {
> -         min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf)
> -                                  - vec_inside_cost);
> +         min_profitable_iters /= saving_per_viter;
>
>           if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters)
>               <= (((int) vec_inside_cost * min_profitable_iters)
> @@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop
>             min_profitable_iters++;
>         }
>      }
> -  /* vector version will never be profitable.  */
> -  else
> -    {
> -      if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
> -       warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
> -                   "vectorization did not happen for a simd loop");
> -
> -      if (dump_enabled_p ())
> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -                        "cost model: the vector iteration cost = %d "
> -                        "divided by the scalar iteration cost = %d "
> -                        "is greater or equal to the vectorization factor = %d"
> -                         ".\n",
> -                        vec_inside_cost, scalar_single_iter_cost, assumed_vf);
> -      *ret_min_profitable_niters = -1;
> -      *ret_min_profitable_estimate = -1;
> -      return;
> -    }
>
>    if (dump_enabled_p ())
>      dump_printf (MSG_NOTE,
> @@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop
>
>       Non-vectorized variant is SIC * niters and it must win over vector
>       variant on the expected loop trip count.  The following condition must hold true:
> -     SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC  */
> +     SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC  */
>
>    if (vec_outside_cost <= 0)
>      min_profitable_estimate = 0;
> +  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +    {
> +      /* This is a repeat of the code above, but with + SOC rather
> +        than - SOC.  */
> +      int outside_overhead = (vec_outside_cost
> +                             - scalar_single_iter_cost * peel_iters_prologue
> +                             - scalar_single_iter_cost * peel_iters_epilogue
> +                             + scalar_outside_cost);
> +      int min_vec_niters = 1;
> +      if (outside_overhead > 0)
> +       min_vec_niters = outside_overhead / saving_per_viter + 1;
> +
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +       {
> +         int threshold = (vec_inside_cost * min_vec_niters
> +                          + vec_outside_cost
> +                          + scalar_outside_cost);
> +         min_profitable_estimate = threshold / scalar_single_iter_cost + 1;
> +       }
> +      else
> +       min_profitable_estimate = (min_vec_niters * assumed_vf
> +                                  + peel_iters_prologue
> +                                  + peel_iters_epilogue);
> +    }
>    else
>      {
>        min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost)
> Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c
> ===================================================================
> --- /dev/null   2019-03-08 11:40:14.606883727 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18 09:55:03.257194326 +0000
> @@ -0,0 +1,12 @@
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
> +
> +void
> +f (unsigned int *restrict x, unsigned int *restrict y,
> +   unsigned char *restrict z, unsigned int n)
> +{
> +  for (unsigned int i = 0; i < n % 4; ++i)
> +    x[i] = x[i] + y[i] + z[i];
> +}
> +
> +/* { dg-final { scan-tree-dump "not vectorized: estimated iteration count too small" vect } } */
> +/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]