[PATCH v2] rs6000: Add load density heuristic

Tue Jul 27 22:25:05 GMT 2021

On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 

Hi,

> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
> 
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.

ok

> 
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.

ok

> 
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.

ok

> 
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> 
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>   * {-O3, -Ofast} {,-funroll-loops}
> 
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.

So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
degredation with no additional regressions.  that sounds good. 

> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
> 

> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 83d29cbfac1..806c3335cbc 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> 
> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>  {
>    struct loop *loop_info;
>    unsigned cost[3];
> +  /* Total number of vectorized stmts (loop only).  */
> +  unsigned nstmts;
> +  /* Total number of loads (loop only).  */
> +  unsigned nloads;
> +  /* Possible extra penalized cost on vector construction (loop only).  */
> +  unsigned extra_ctor_cost;
> 
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool vect_nonmem;
> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "density %d%%, cost %d exceeds threshold, penalizing "
> -			 "loop body cost by %d%%", density_pct,
> +			 "loop body cost by %d%%\n", density_pct,
>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>      }
> +
> +  /* Check if we need to penalize the body cost for latency and
> +     execution resources bound from strided or elementwise loads
> +     into a vector.  */
> +  if (data->extra_ctor_cost > 0)
> +    {
> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> +      /* Threshold for total number of load stmts.  */
> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> +
> +      gcc_assert (data->nloads <= data->nstmts);
> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> +
> +      /* It's likely to be bounded by latency and execution resources
> +	 from many scalar loads which are strided or elementwise loads
> +	 into a vector if both conditions below are found:
> +	   1. there are many loads, it's easy to result in a long wait
> +	      for load units;
> +	   2. load has a big proportion of all vectorized statements,
> +	      it's not easy to schedule other statements to spread among
> +	      the loads.
> +	 One typical case is the innermost loop of the hotspot of SPEC2017
> +	 503.bwaves_r without loop interchange.  */
> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
> +	{
> +	  data->cost[vect_body] += data->extra_ctor_cost;
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "Found %u loads and load pct. %u%% exceed "
> +			     "the threshold, penalizing loop body "
> +			     "cost by extra cost %u for ctor.\n",
> +			     data->nloads, load_pct, data->extra_ctor_cost);
> +	}
> +    }
> 
>  }
> 
>  /* Implement targetm.vectorize.init_cost.  */
> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>    data->cost[vect_body]     = 0;
>    data->cost[vect_epilogue] = 0;
>    data->vect_nonmem = false;
> +  data->nstmts = 0;
> +  data->nloads = 0;
> +  data->extra_ctor_cost = 0;
> 
>    data->costing_for_scalar = costing_for_scalar;
>    return data;
>  }
> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>    return 0;
>  }
> 
> +/* As a visitor function for each statement cost entry handled in
> +   function add_stmt_cost, gather some information and update its
> +   relevant fields in target cost accordingly.  */

I got lost trying to parse that..  (could be just me :-) 

Possibly instead something like
/* Helper function for add_stmt_cost ; gather information and update
the target_cost fields accordingly.  */

> +static void
> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
> +				    enum vect_cost_for_stmt kind,
> +				    struct _stmt_vec_info *stmt_info,
> +				    enum vect_cost_model_location where,
> +				    int stmt_cost, unsigned int orig_count)
> +{
> +
> +  /* Check whether we're doing something other than just a copy loop.
> +     Not all such loops may be profitably vectorized; see
> +     rs6000_finish_cost.  */
> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
> +       || kind == vec_construct || kind == scalar_to_vec)
> +      || (where == vect_body && kind == vector_stmt))

I thought I saw an alignment issue, then noticed the "(" that was
hiding on me.. :-)    

Maybe clearer to read if you rearrange slightly and flatten it ?  I
defer to others on that..

	if ((kind == vec_to_scalar
	     || kind == vec_perm
	     || kind == vec_promote_demote
	     || kind ==
vec_construct	
	     || kind == scalar_to_vec)
	    || (kind == vector_stmt && where == vect_body)

> +    data->vect_nonmem = true;
> +
> +  /* Gather some information when we are costing the vector version for
> +     the statements locate in a loop body.  */
s/version/instruction? operation?/  ? ? 
s/locate/located/

> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
> +    {
> +      data->nstmts += orig_count;
> +
> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
> +	  || kind == vector_gather_load)

Cosmetically, I'd move the second "||" to the next line to balance
those two lines a bit more. 

> +	data->nloads += orig_count;
> +
> +      /* If we have strided or elementwise loads into a vector, it's
> +	 possible to be bounded by latency and execution resources for
> +	 many scalar loads.  Try to account for this by scaling the
> +	 construction cost by the number of elements involved, when
> +	 handling each matching statement we record the possible extra
> +	 penalized cost into target cost, in the end of costing for
> +	 the whole loop, we do the actual penalization once some load
> +	 density heuristics are satisfied.  */
> +      if (kind == vec_construct && stmt_info
> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
> +	{
> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
> +	  unsigned int extra_cost = nunits * stmt_cost;
> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
> +	     priced much on V16QI/V8HI vector construction as their units,
> +	     if we penalize them with nunits * stmt_cost, it can result in
> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
s/a/an/ 
> +	     20 and nunits is 16, the extra cost is 320 which looks much
> +	     exaggerated.  So let's use one maximum bound for the extra
> +	     penalized cost for vector construction here.  */
> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +	  data->extra_ctor_cost += extra_cost;
> +	}
> +    }
> +}
ok

> +
> 
>  /* Implement targetm.vectorize.add_stmt_cost.  */
> 
>  static unsigned
> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        /* Statements in an inner loop relative to the loop being
>  	 vectorized are weighted more heavily.  The value here is
>  	 arbitrary and could potentially be improved with analysis.  */
> +      unsigned int orig_count = count;

I don't see any other assignments to orig_count.  Does 'count' get
updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
variable is necessary?

> 
>        if (where == vect_body && stmt_info
>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>  	{
> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        retval = (unsigned) (count * stmt_cost);
>        cost_data->cost[where] += retval;
> 
> -      /* Check whether we're doing something other than just a copy loop.
> -	 Not all such loops may be profitably vectorized; see
> -	 rs6000_finish_cost.  */
> -      if ((kind == vec_to_scalar || kind == vec_perm
> -	   || kind == vec_promote_demote || kind == vec_construct
> -	   || kind == scalar_to_vec)
> -	  || (where == vect_body && kind == vector_stmt))
> -	cost_data->vect_nonmem = true;
> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
> +					  stmt_cost, orig_count);
> 
>      }
> 
>    return retval;
>