[PATCH v2] rs6000: Add load density heuristic
will schmidt
will_schmidt@vnet.ibm.com
Tue Jul 27 22:25:05 GMT 2021
On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
> Hi,
>
Hi,
> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
>
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.
ok
>
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts. Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test. With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.
ok
>
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.
ok
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
> * {-O3, -Ofast} {,-funroll-loops}
>
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.
So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
degredation with no additional regressions. that sounds good.
>
> Is it ok for trunk?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> nstmts, nloads and extra_ctor_cost.
> (rs6000_density_test): Add load density related heuristics and the
> checks, do extra costing on vector construction statements if need.
> (rs6000_init_cost): Init new members.
> (rs6000_update_target_cost_per_stmt): New function.
> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> rs6000_update_target_cost_per_stmt and call it.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 83d29cbfac1..806c3335cbc 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
>
> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
> {
> struct loop *loop_info;
> unsigned cost[3];
> + /* Total number of vectorized stmts (loop only). */
> + unsigned nstmts;
> + /* Total number of loads (loop only). */
> + unsigned nloads;
> + /* Possible extra penalized cost on vector construction (loop only). */
> + unsigned extra_ctor_cost;
>
> /* For each vectorized loop, this var holds TRUE iff a non-memory vector
> instruction is needed by the vectorization. */
> bool vect_nonmem;
> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> "density %d%%, cost %d exceeds threshold, penalizing "
> - "loop body cost by %d%%", density_pct,
> + "loop body cost by %d%%\n", density_pct,
> vec_cost + not_vec_cost, DENSITY_PENALTY);
> }
> +
> + /* Check if we need to penalize the body cost for latency and
> + execution resources bound from strided or elementwise loads
> + into a vector. */
> + if (data->extra_ctor_cost > 0)
> + {
> + /* Threshold for load stmts percentage in all vectorized stmts. */
> + const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> + /* Threshold for total number of load stmts. */
> + const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> +
> + gcc_assert (data->nloads <= data->nstmts);
> + unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> +
> + /* It's likely to be bounded by latency and execution resources
> + from many scalar loads which are strided or elementwise loads
> + into a vector if both conditions below are found:
> + 1. there are many loads, it's easy to result in a long wait
> + for load units;
> + 2. load has a big proportion of all vectorized statements,
> + it's not easy to schedule other statements to spread among
> + the loads.
> + One typical case is the innermost loop of the hotspot of SPEC2017
> + 503.bwaves_r without loop interchange. */
> + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
> + && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
> + {
> + data->cost[vect_body] += data->extra_ctor_cost;
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_NOTE, vect_location,
> + "Found %u loads and load pct. %u%% exceed "
> + "the threshold, penalizing loop body "
> + "cost by extra cost %u for ctor.\n",
> + data->nloads, load_pct, data->extra_ctor_cost);
> + }
> + }
>
> }
>
> /* Implement targetm.vectorize.init_cost. */
> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
> data->cost[vect_body] = 0;
> data->cost[vect_epilogue] = 0;
> data->vect_nonmem = false;
> + data->nstmts = 0;
> + data->nloads = 0;
> + data->extra_ctor_cost = 0;
>
> data->costing_for_scalar = costing_for_scalar;
> return data;
> }
> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
> return 0;
> }
>
> +/* As a visitor function for each statement cost entry handled in
> + function add_stmt_cost, gather some information and update its
> + relevant fields in target cost accordingly. */
I got lost trying to parse that.. (could be just me :-)
Possibly instead something like
/* Helper function for add_stmt_cost ; gather information and update
the target_cost fields accordingly. */
> +static void
> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
> + enum vect_cost_for_stmt kind,
> + struct _stmt_vec_info *stmt_info,
> + enum vect_cost_model_location where,
> + int stmt_cost, unsigned int orig_count)
> +{
> +
> + /* Check whether we're doing something other than just a copy loop.
> + Not all such loops may be profitably vectorized; see
> + rs6000_finish_cost. */
> + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
> + || kind == vec_construct || kind == scalar_to_vec)
> + || (where == vect_body && kind == vector_stmt))
I thought I saw an alignment issue, then noticed the "(" that was
hiding on me.. :-)
Maybe clearer to read if you rearrange slightly and flatten it ? I
defer to others on that..
if ((kind == vec_to_scalar
|| kind == vec_perm
|| kind == vec_promote_demote
|| kind ==
vec_construct
|| kind == scalar_to_vec)
|| (kind == vector_stmt && where == vect_body)
> + data->vect_nonmem = true;
> +
> + /* Gather some information when we are costing the vector version for
> + the statements locate in a loop body. */
s/version/instruction? operation?/ ? ?
s/locate/located/
> + if (!data->costing_for_scalar && data->loop_info && where == vect_body)
> + {
> + data->nstmts += orig_count;
> +
> + if (kind == scalar_load || kind == vector_load || kind == unaligned_load
> + || kind == vector_gather_load)
Cosmetically, I'd move the second "||" to the next line to balance
those two lines a bit more.
> + data->nloads += orig_count;
> +
> + /* If we have strided or elementwise loads into a vector, it's
> + possible to be bounded by latency and execution resources for
> + many scalar loads. Try to account for this by scaling the
> + construction cost by the number of elements involved, when
> + handling each matching statement we record the possible extra
> + penalized cost into target cost, in the end of costing for
> + the whole loop, we do the actual penalization once some load
> + density heuristics are satisfied. */
> + if (kind == vec_construct && stmt_info
> + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
> + {
> + tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> + unsigned int nunits = vect_nunits_for_cost (vectype);
> + unsigned int extra_cost = nunits * stmt_cost;
> + /* As function rs6000_builtin_vectorization_cost shows, we have
> + priced much on V16QI/V8HI vector construction as their units,
> + if we penalize them with nunits * stmt_cost, it can result in
> + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
s/a/an/
> + 20 and nunits is 16, the extra cost is 320 which looks much
> + exaggerated. So let's use one maximum bound for the extra
> + penalized cost for vector construction here. */
> + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> + extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> + data->extra_ctor_cost += extra_cost;
> + }
> + }
> +}
ok
> +
>
> /* Implement targetm.vectorize.add_stmt_cost. */
>
> static unsigned
> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
> /* Statements in an inner loop relative to the loop being
> vectorized are weighted more heavily. The value here is
> arbitrary and could potentially be improved with analysis. */
> + unsigned int orig_count = count;
I don't see any other assignments to orig_count. Does 'count' get
updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
variable is necessary?
>
> if (where == vect_body && stmt_info
> && stmt_in_inner_loop_p (vinfo, stmt_info))
> {
> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
> retval = (unsigned) (count * stmt_cost);
> cost_data->cost[where] += retval;
>
> - /* Check whether we're doing something other than just a copy loop.
> - Not all such loops may be profitably vectorized; see
> - rs6000_finish_cost. */
> - if ((kind == vec_to_scalar || kind == vec_perm
> - || kind == vec_promote_demote || kind == vec_construct
> - || kind == scalar_to_vec)
> - || (where == vect_body && kind == vector_stmt))
> - cost_data->vect_nonmem = true;
> + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
> + stmt_cost, orig_count);
>
> }
>
> return retval;
>
More information about the Gcc-patches
mailing list