This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Fix a case in which the vector cost model was ignored
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: GCC Patches <gcc-patches at gcc dot gnu dot org>, Richard Sandiford <richard dot sandiford at arm dot com>
- Date: Mon, 18 Mar 2019 12:14:07 +0100
- Subject: Re: Fix a case in which the vector cost model was ignored
- References: <mptk1gwzeoy.fsf@arm.com>
On Mon, Mar 18, 2019 at 10:59 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> This patch fixes a case in which we vectorised something with a
> fully-predicated loop even after the cost model had rejected it.
> E.g. the loop in the testcase has the costs:
>
> Vector inside of loop cost: 27
> Vector prologue cost: 0
> Vector epilogue cost: 0
> Scalar iteration cost: 7
> Scalar outside cost: 6
> Vector outside cost: 0
> prologue iterations: 0
> epilogue iterations: 0
>
> and we can see that the loop executes at most three times, but we
> decided to vectorise it anyway.
>
> (The costs here are equal for three iterations, but the same thing
> happens even when the vector code is strictly more expensive.)
>
> The problem is the handling of "/VF" in:
>
> /* Calculate number of iterations required to make the vector version
> profitable, relative to the loop bodies only. The following condition
> must hold true:
> SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> where
> SIC = scalar iteration cost, VIC = vector iteration cost,
> VOC = vector outside cost, VF = vectorization factor,
> PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
> SOC = scalar outside cost for run time cost model check. */
>
> We treat the "/VF" as truncating, but for fully-predicated loops, it's
> closer to a ceil division, since fractional iterations are handled by a
> full iteration with some predicate bits set to false.
>
> The easiest fix seemed to be to calculate the minimum number of vector
> iterations first, then use that to calculate the minimum number of scalar
> iterations.
>
> Calculating the minimum number of vector iterations might make sense for
> unpredicated loops too, since calculating the scalar niters directly
> doesn't take into account the fact that the VIC multiple has to be an
> integer. But the handling of PL_ITERS and EP_ITERS for unpredicated
> loops is a bit hand-wavy anyway, so maybe vagueness here cancels out
> vagueness there?
Well, their estimate if we do not know them statically is. If we statically
know pl/ep iters the formulat should match, no? Hopefully for GCC 10
we can even fix the case in PR89754.
> Either way, changing this for unpredicated loops would be much too
> invasive for stage 4, so the patch keeps it specific to fully-predicated
> loops (i.e. SVE) for now. There's no functional change for other targets.
>
> Tested on aarch64-linux-gnu with and without SVE, and on x86_64-linux-gnu.
> This is a regression introduced by the original cost model patches for
> fully-predicated loops, so OK for GCC 9?
OK.
Thanks,
Richard.
> Thanks,
> Richard
>
>
> 2019-03-18 Richard Sandiford <richard.sandiford@arm.com>
>
> gcc/
> * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the
> calculation of the minimum number of scalar iterations for
> fully-predicated loops.
>
> gcc/testsuite/
> * gcc.target/aarch64/sve/cost_model_1.c: New test.
>
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000
> +++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000
> @@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop
> /* Calculate number of iterations required to make the vector version
> profitable, relative to the loop bodies only. The following condition
> must hold true:
> - SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> + SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC
> where
> SIC = scalar iteration cost, VIC = vector iteration cost,
> VOC = vector outside cost, VF = vectorization factor,
> - PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
> + NPEEL = prologue iterations + epilogue iterations,
> SOC = scalar outside cost for run time cost model check. */
>
> - if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost)
> + int saving_per_viter = (scalar_single_iter_cost * assumed_vf
> + - vec_inside_cost);
> + if (saving_per_viter <= 0)
> + {
> + if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
> + warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
> + "vectorization did not happen for a simd loop");
> +
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "cost model: the vector iteration cost = %d "
> + "divided by the scalar iteration cost = %d "
> + "is greater or equal to the vectorization factor = %d"
> + ".\n",
> + vec_inside_cost, scalar_single_iter_cost, assumed_vf);
> + *ret_min_profitable_niters = -1;
> + *ret_min_profitable_estimate = -1;
> + return;
> + }
> +
> + /* ??? The "if" arm is written to handle all cases; see below for what
> + we would do for !LOOP_VINFO_FULLY_MASKED_P. */
> + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> + {
> + /* Rewriting the condition above in terms of the number of
> + vector iterations (vniters) rather than the number of
> + scalar iterations (niters) gives:
> +
> + SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC
> +
> + <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC
> +
> + For integer N, X and Y when X > 0:
> +
> + N * X > Y <==> N >= (Y /[floor] X) + 1. */
> + int outside_overhead = (vec_outside_cost
> + - scalar_single_iter_cost * peel_iters_prologue
> + - scalar_single_iter_cost * peel_iters_epilogue
> + - scalar_outside_cost);
> + /* We're only interested in cases that require at least one
> + vector iteration. */
> + int min_vec_niters = 1;
> + if (outside_overhead > 0)
> + min_vec_niters = outside_overhead / saving_per_viter + 1;
> +
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n",
> + min_vec_niters);
> +
> + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> + {
> + /* Now that we know the minimum number of vector iterations,
> + find the minimum niters for which the scalar cost is larger:
> +
> + SIC * niters > VIC * vniters + VOC - SOC
> +
> + We know that the minimum niters is no more than
> + vniters * VF + NPEEL, but it might be (and often is) less
> + than that if a partial vector iteration is cheaper than the
> + equivalent scalar code. */
> + int threshold = (vec_inside_cost * min_vec_niters
> + + vec_outside_cost
> + - scalar_outside_cost);
> + if (threshold <= 0)
> + min_profitable_iters = 1;
> + else
> + min_profitable_iters = threshold / scalar_single_iter_cost + 1;
> + }
> + else
> + /* Convert the number of vector iterations into a number of
> + scalar iterations. */
> + min_profitable_iters = (min_vec_niters * assumed_vf
> + + peel_iters_prologue
> + + peel_iters_epilogue);
> + }
> + else
> {
> min_profitable_iters = ((vec_outside_cost - scalar_outside_cost)
> * assumed_vf
> @@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop
> min_profitable_iters = 0;
> else
> {
> - min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf)
> - - vec_inside_cost);
> + min_profitable_iters /= saving_per_viter;
>
> if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters)
> <= (((int) vec_inside_cost * min_profitable_iters)
> @@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop
> min_profitable_iters++;
> }
> }
> - /* vector version will never be profitable. */
> - else
> - {
> - if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
> - warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
> - "vectorization did not happen for a simd loop");
> -
> - if (dump_enabled_p ())
> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> - "cost model: the vector iteration cost = %d "
> - "divided by the scalar iteration cost = %d "
> - "is greater or equal to the vectorization factor = %d"
> - ".\n",
> - vec_inside_cost, scalar_single_iter_cost, assumed_vf);
> - *ret_min_profitable_niters = -1;
> - *ret_min_profitable_estimate = -1;
> - return;
> - }
>
> if (dump_enabled_p ())
> dump_printf (MSG_NOTE,
> @@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop
>
> Non-vectorized variant is SIC * niters and it must win over vector
> variant on the expected loop trip count. The following condition must hold true:
> - SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC */
> + SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC */
>
> if (vec_outside_cost <= 0)
> min_profitable_estimate = 0;
> + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> + {
> + /* This is a repeat of the code above, but with + SOC rather
> + than - SOC. */
> + int outside_overhead = (vec_outside_cost
> + - scalar_single_iter_cost * peel_iters_prologue
> + - scalar_single_iter_cost * peel_iters_epilogue
> + + scalar_outside_cost);
> + int min_vec_niters = 1;
> + if (outside_overhead > 0)
> + min_vec_niters = outside_overhead / saving_per_viter + 1;
> +
> + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> + {
> + int threshold = (vec_inside_cost * min_vec_niters
> + + vec_outside_cost
> + + scalar_outside_cost);
> + min_profitable_estimate = threshold / scalar_single_iter_cost + 1;
> + }
> + else
> + min_profitable_estimate = (min_vec_niters * assumed_vf
> + + peel_iters_prologue
> + + peel_iters_epilogue);
> + }
> else
> {
> min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost)
> Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c
> ===================================================================
> --- /dev/null 2019-03-08 11:40:14.606883727 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18 09:55:03.257194326 +0000
> @@ -0,0 +1,12 @@
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
> +
> +void
> +f (unsigned int *restrict x, unsigned int *restrict y,
> + unsigned char *restrict z, unsigned int n)
> +{
> + for (unsigned int i = 0; i < n % 4; ++i)
> + x[i] = x[i] + y[i] + z[i];
> +}
> +
> +/* { dg-final { scan-tree-dump "not vectorized: estimated iteration count too small" vect } } */
> +/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */