PING^4 [PATCH v2] rs6000: Modify the way for extra penalized cost

Kewen.Lin linkw@linux.ibm.com
Mon Nov 22 02:23:09 GMT 2021


Hi,

Gentle ping for this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen

>>> on 2021/9/28 4:16 PM, Kewen.Lin via Gcc-patches wrote:
>>>> Hi,
>>>>
>>>> This patch follows the discussions here[1][2], where Segher
>>>> pointed out that the existing way of guarding the extra
>>>> penalized cost for strided/elementwise loads with a magic
>>>> bound does not scale.
>>>>
>>>> Computing the penalty as nunits * stmt_cost can yield a greatly
>>>> exaggerated cost: for V16QI on P8 it is 16 * 20 = 320, which is
>>>> why a bound was needed.  To make this simpler and more readable,
>>>> the penalized cost is now computed as:
>>>>
>>>>     unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>>>>     unsigned extra_cost = nunits * adjusted_cost;
>>>>
>>>> For V2DI/V2DF, it uses a penalized cost of 2 for each scalar
>>>> load, while for the other modes it uses 1.  This is mainly
>>>> derived from the performance evaluations.  One possibly related
>>>> point: constructing a vector with more units uses more
>>>> instructions, which gives more chances to schedule them well
>>>> (or even run them in parallel when enough units are available
>>>> at that time), so it seems reasonable not to penalize them
>>>> more.
>>>>
>>>> The SPEC2017 evaluations on Power8/Power9/Power10 with the
>>>> option sets O2-vect and Ofast-unroll show this change is neutral.
>>>>
>>>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>>>
>>>> Is it ok for trunk?
>>>>
>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>>>
>>>> BR,
>>>> Kewen
>>>> -----
>>>> gcc/ChangeLog:
>>>>
>>>> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>>> 	the way to compute extra penalized cost.  Remove useless parameter.
>>>> 	(rs6000_add_stmt_cost): Adjust the call to function
>>>> 	rs6000_update_target_cost_per_stmt.
>>>>
>>>>
>>>> ---
>>>>  gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>>>>  1 file changed, 18 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>>>> index dd42b0964f1..8200e1152c2 100644
>>>> --- a/gcc/config/rs6000/rs6000.c
>>>> +++ b/gcc/config/rs6000/rs6000.c
>>>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>>  				    enum vect_cost_for_stmt kind,
>>>>  				    struct _stmt_vec_info *stmt_info,
>>>>  				    enum vect_cost_model_location where,
>>>> -				    int stmt_cost,
>>>>  				    unsigned int orig_count)
>>>>  {
>>>>
>>>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>>  	{
>>>>  	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>>  	  unsigned int nunits = vect_nunits_for_cost (vectype);
>>>> -	  unsigned int extra_cost = nunits * stmt_cost;
>>>> -	  /* As function rs6000_builtin_vectorization_cost shows, we have
>>>> -	     priced much on V16QI/V8HI vector construction as their units,
>>>> -	     if we penalize them with nunits * stmt_cost, it can result in
>>>> -	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>>> -	     is 20 and nunits is 16, the extra cost is 320 which looks
>>>> -	     much exaggerated.  So let's use one maximum bound for the
>>>> -	     extra penalized cost for vector construction here.  */
>>>> -	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>>> -	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>>> -	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>>> +	  /* Don't expect strided/elementwise loads for just 1 nunit.  */
>>>> +	  gcc_assert (nunits > 1);
>>>> +	  /* i386 port adopts nunits * stmt_cost as the penalized cost
>>>> +	     for this kind of penalization, we used to follow it but
>>>> +	     found it could result in an unreliable body cost especially
>>>> +	     for V16QI/V8HI modes.  To make it better, we choose this
>>>> +	     new heuristic: for each scalar load, we use 2 as penalized
>>>> +	     cost for the case with 2 nunits and use 1 for the other
>>>> +	     cases.  It's without much supporting theory, mainly
>>>> +	     concluded from the broad performance evaluations on Power8,
>>>> +	     Power9 and Power10.  One possibly related point is that:
>>>> +	     vector construction for more units would use more insns,
>>>> +	     it has more chances to schedule them better (even run them
>>>> +	     in parallel when enough units are available at that time), so
>>>> +	     it seems reasonable not to penalize that much for them.  */
>>>> +	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
>>>> +	  unsigned int extra_cost = nunits * adjusted_cost;
>>>>  	  data->extra_ctor_cost += extra_cost;
>>>>  	}
>>>>      }
>>>> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>>>        cost_data->cost[where] += retval;
>>>>
>>>>        rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>>>> -					  stmt_cost, orig_count);
>>>> +					  orig_count);
>>>>      }
>>>>
>>>>    return retval;
>>>> --
>>>> 2.27.0
>>>>
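
For reference, the arithmetic behind the old and new penalties in the
quoted patch can be sketched as two small standalone C functions.  This
is only an illustration and not code from rs6000.c: the helper names
old_extra_cost and new_extra_cost are hypothetical, and stmt_cost is
supplied by the caller (the only concrete value taken from the patch is
20 for V16QI on Power8).

```c
#include <assert.h>

/* Hypothetical helper, not part of rs6000.c: the penalty before the
   patch, i.e. nunits * stmt_cost clamped to the magic bound of 12.  */
unsigned int
old_extra_cost (unsigned int nunits, unsigned int stmt_cost)
{
  const unsigned int max_penalized_cost_for_ctor = 12;
  unsigned int extra_cost = nunits * stmt_cost;
  if (extra_cost > max_penalized_cost_for_ctor)
    extra_cost = max_penalized_cost_for_ctor;
  return extra_cost;
}

/* Hypothetical helper: the penalty after the patch, which charges 2
   per scalar load for 2-unit vectors and 1 per scalar load otherwise.  */
unsigned int
new_extra_cost (unsigned int nunits)
{
  /* Mirrors the gcc_assert in the patch: strided/elementwise loads
     are not expected for a single unit.  */
  assert (nunits > 1);
  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
  return nunits * adjusted_cost;
}
```

With these, V16QI on Power8 (nunits 16, stmt_cost 20) goes from
16 * 20 = 320 before clamping (12 after the old bound) to 16 * 1 = 16,
while a 2-unit vector such as V2DI gets 2 * 2 = 4.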

