target cost model tuning for x86

Dorit Nuzman DORIT@il.ibm.com
Sat Sep 8 10:56:00 GMT 2007


"Jagasia, Harsha" <harsha.jagasia@amd.com> wrote on 06/09/2007 23:40:22:

> Hello,
>
> This patch defines the cost-model target-specific costs for x86 (only
> AMD K8 for now). I looked at some Polyhedron benchmarks to arrive at the
> costs. I also made some small fixes to the cost model itself.
>
> - The processor costs for all processors have currently been filled in
> with the default used by the cost model, except K8, which is tuned. For
> now, I have filled in size_cost with just 1 byte for all statements. The
> size_cost is not being used currently, but can be used in the future as
> there was some discussion at the summit about using size as an
> alternative to make the decision to vectorize.
>
> - An issue I have run into is the testsuite. I have not added any test
> cases because the test cases under vect/costmodel/x86 are run without
> any -mtune options and hence they default to mtune=generic. So the
> k8_costs added by this patch do not get used by make check right now.
>
> One way to address this is to add sub directories for individual x86
> processors that can be tuned and replicating the test cases for each of
> those processors with possibly different behavior when vectorized with
> the cost model.
>
> Another way would be to use -mtune=native in the x86 costmodel-vect.exp
> file and having the actual processor tuned emitted in the vectorizer
> dump and having some macro along the lines of "istarget" that can search
> for the processor string. Any input on the approaches would be very
> welcome.
>

either is fine w me

> The following changes have been made to the cost model itself:
>
> - When we don't know the number of prologue/epilogue iterations we
> currently assume (VF-1)/2. In case of double precision on x86, this
> causes prologue and epilogue to be estimated as 0, when alignment is
> unknown. At this point if the number of iterations is unknown or not
> exactly divisible by the vectorization factor, the cost model goes on to
> account for guard branches in the epilogue even though the epilogue
> iterations are estimated as 0.

Just to make sure we understand each other: the desired behavior is that
the guard-branches will be taken into account even when the epilog/prolog
iterations are estimated to 0 (since the run-time tests may really execute.
we'll skip the epilog-loop if we discover at run-time that the number if
iterations is 0, but we will still pay the penalty of the branch). (The way
I read the wording above it seems to suggest otherwise, but as far as I
understand, your patch does make us account for guard-branches when
epilog/prolog iterations are estimated to 0, which is ok).

> (This may be a different kind of a
> problem on the spu also, where the branch cost is 6 and run time test
> cost is -19. This may result in outside vector cost of (2*6) + (-19) =
> -7, a negative value). This patch changes the estimated
> prologue/epilogue, when unknown to vf/2 to avoid being estimated as 0
> for double precision (this will still not help eradicate the corner case
> mentioned above for the spu).
>

Indeed I see that there's a bit I missed here when I ported the spu tuning
stuff from the Cell SDK to mainline - it should be
      "if (vec_outside_cost <= 0)"
instead of
      "if (vec_outside_cost == 0)"
Would you please include this fix in your patch? (thanks!)

> - The patch also changes the number of branches per prologue and
> epilogue each to 1 instead of 2. Looking at some loops, the branches can
> get optimized out,

could you please show an example where these guards are optimized away?

> especially when the alignment and iterations are
> known at compile time.
>

so maybe there's room to reduce the number of estimated branches from 2 to
1 only when the number of iterations is known at compile time, and not
always?

> - The cost model equation is changed to
> min_profitable_iters = (voc * vf
>             - vic * prologue_iters
>             - vic * epilogue_iters)
>             /(sic * vf - vic)
>
> instead of
> min_iters = (voc * vf)/(sic * vf - vic)
>

can you please explain why you are ignoring the costs of the prolog and
epilog loops? actually, I think I may have found the answer in the
following paragraph?:

 > - The cost model checks at compile time if the loop iterations are less
> than or equal to whats estimated by the cost model. It also checks at
> run time if the iterations left after peeling for prologue and epilogue
> are less than or equal to whats estimated by the cost model. If a loop
> is determined as profitable at compile-time, it may be evaluated as
> unprofitable at run-time.
> For example, num_iters=7 and vf=4 and cost model estimates
> profitable_iters=6 and data is aligned. At compile time since (num_iters
> > min_profitable_iters), this loop is profitable to vectorize. However,
> the run time check compares the iterations left after peeling for vf i.e
> 4 versus the min_profitable_iters and finds it to be unprofitable.

This is correct, but

> This patch sets the runtime threshold as 0 for cases which can be
> evaluated for profitability at compile time.
>

I don't like this solution so much. Actually it seems to me that by
ignoring the costs of the prolog/epilog loops, you already fixed the
inaccuracy in the run-time check, so you don't need this extra bit. The
problem is that at the same time you made the compile-time test less
accurate than it was before. Please correct me if I'm wrong: AFAIU, there
are 3 components to the overall iteration count: prolog_iters,
main_loop_iters and epilog_iters. The threshold that the cost model
computed before your patch consisted of all three:
      TH_before = prolog_iters + main_loop_iters + epilog_iters
This was ok for the compile-time test (which compares the above to
LOOP_VINFO_INT_NITERS which also represents the overall iteration count),
but was not ok for the run-time test (which compared it only to
main_loop_iters, and not the overall count).
With your fix to ignore the prolog and epilog loops we now have:
      TH_after = main_loop_iters
This is ok for the run-time test (so I don't see why you need the extra fix
you mention above), however it is not ok for the compile time test. So I
think what you need to do now is fix the compile-time test instead of the
run-time test. i.e., instead of comparing th with LOOP_VINFO_INT_NITERS,
you need to compare th with:
      LOOP_VINFO_INT_NITERS - prolog_iters - LOOP_VINFO_INT_NITERS%VF
where prolog_iters =
      LOOP_PEELING_FOR_ALIGNMENT >= 0 ? LOOP_PEELING_FOR_ALIGNMENT : VF/2;

> - Currently, the cost model checks if the iterations are LESS than whats
> estimated by the cost model at compile time, instead of LESS than or
> EQUAL.
>

I think what caused the confusion here is that PARAM_MIN_VECT_LOOP_BOUND is
defined as the minimum number of iterations of the vectorized loop. So if
we don't allow vectorization when niters is LESS_THAN_OR_EQUAL
min_vect_loop_bound, we are more conservative than what the user asked for.
So lets indeed change the test to LESS_THAN_OR_EQUAL, but then lets also
add 1 to PARAM_MIN_VECT_LOOP_BOUND (both in
tree-vect-analyze.c:vect_analyze_operations, and in
tree-vect-transform.c:vect_do_peeling_for_loop_bound).

thanks,
dorit

> - The patch has been bootstrapped on x86-64 and all the 64 and 32-bit
> tests pass as expected, except costmodel-vect-31.c in x86_64 and i386.
>
> The loop at line 66 in vect-31.c was being estimated as not profitable
> to vectorize. Because of the changes to the cost model, this is now
> estimated as profitable. Running it on K8 indicates that it is
> profitable to vectorize and the cost model estimates this correctly for
> K8.
>
> ChangeLogs:
> gcc/ChangeLog:
> 2007-09-04  Harsha Jagasia <harsha.jagasia@amd.com>
>
>    * config/i386/i386.h (processor_costs): Add scalar_stmt_cost,
>      scalar_load_cost, scalar_store_cost, vec_stmt_cost,
>      vec_to_scalar_cost, scalar_to_vec_cost, vec_align_load_cost,
>      vect_unalign_load_cost, vec_store_cost.
>      Define macros for x86 costs.
>    * config/i386/i386.c
>      (size_cost): Set scalar_stmt_cost, scalar_load_cost,
>      scalar_store_cost, vec_stmt_cost, vec_to_scalar_cost,
>      scalar_to_vec_cost, vec_align_load_cost,
> vect_unalign_load_cost,
>      vec_store_cost to 1.
>         (i386_cost, i486_cost, pentium_cost, pentiumpro_cost,
> geode_cost,
>      k6_cost, athlon_cost, amdfam10_cost, pentium4_cost,
> nocona_cost,
>      core2_cost, generic64_cost, generic32_cost): Set to default
> untuned
>      costs.
>      (k8_cost): Costs for vectorization tuned.
>      (x86_builtin_vectorization_cost): New.
>    * tree-vect-analyze.c (vect_analyze_operations): Change
> comparison of
>      loop iterations with threshold to less than or equal to
> instead of
>      less than.
>    * tree-vect-transform.c (vect_estimate_min_profitable_iters):
>      Change prologue and epilogue iterations estimate to vf/2, when
>      unknown at compile-time. Change the number of guards for
> prologue
>      and epilogue to be 1 each. Change the cost model equation to
>      consider vector iterations as the loop iterations less the
>      prologue and epilogue iterations.
>      (vect_do_peeling_for_loop_bound): Set profitability threshold
> to
>      zero when loop has known number of iterations and known
> alignment
>      for data.
>
> gcc/testsuite/ChangeLog:
> 2007-09-04  Harsha Jagasia <harsha.jagasia@amd.com>
>
>    * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Change
> dg-final to
>      expect 1 non-profitable loop and 3 profitable loops.
>    * gcc.dg/vect/costmodel/x86-64/costmodel-vect-31.c: Change
> dg-final to
>      expect 1 non-profitable loop and 3 profitable loops.
>
> Thanks,
> Harsha
> [attachment "target.patch.txt" deleted by Dorit Nuzman/Haifa/IBM]



More information about the Gcc-patches mailing list