PING^2 [PATCH v2] rs6000: Modify the way for extra penalized cost
Kewen.Lin
linkw@linux.ibm.com
Wed Oct 20 09:29:36 GMT 2021
Hi,
Gentle ping this:
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html
BR,
Kewen
> on 2021/9/28, 4:16 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> This patch follows the discussions here[1][2], where Segher
>> pointed out the existing way to guard the extra penalized
>> cost for strided/elementwise loads with a magic bound does
>> not scale.
>>
>> The previous approach, nunits * stmt_cost, can produce a
>> much exaggerated penalized cost: for V16QI on Power8 it is
>> 16 * 20 = 320, which is why a bound was needed. To make it
>> better and more readable, the penalized cost is simplified
>> as:
>>
>> unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
>> unsigned extra_cost = nunits * adjusted_cost;
>>
>> For V2DI/V2DF, the penalized cost is 2 for each scalar
>> load, while for the other modes it is 1. This is mainly
>> concluded from performance evaluations. One possibly
>> related point: the more units a vector has, the more
>> instructions its construction uses, which gives the
>> scheduler more chances to arrange them well (they may even
>> run in parallel when enough units are available), so it
>> seems reasonable not to penalize larger unit counts more.
>>
>> The SPEC2017 evaluations on Power8/Power9/Power10 with the
>> option sets O2-vect and Ofast-unroll show this change is neutral.
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579529.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>> the way to compute extra penalized cost. Remove useless parameter.
>> (rs6000_add_stmt_cost): Adjust the call to function
>> rs6000_update_target_cost_per_stmt.
>>
>>
>> ---
>> gcc/config/rs6000/rs6000.c | 31 ++++++++++++++++++-------------
>> 1 file changed, 18 insertions(+), 13 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index dd42b0964f1..8200e1152c2 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5422,7 +5422,6 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> enum vect_cost_for_stmt kind,
>> struct _stmt_vec_info *stmt_info,
>> enum vect_cost_model_location where,
>> - int stmt_cost,
>> unsigned int orig_count)
>> {
>>
>> @@ -5462,17 +5461,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> {
>> tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> unsigned int nunits = vect_nunits_for_cost (vectype);
>> - unsigned int extra_cost = nunits * stmt_cost;
>> - /* As function rs6000_builtin_vectorization_cost shows, we have
>> - priced much on V16QI/V8HI vector construction as their units,
>> - if we penalize them with nunits * stmt_cost, it can result in
>> - an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> - is 20 and nunits is 16, the extra cost is 320 which looks
>> - much exaggerated. So let's use one maximum bound for the
>> - extra penalized cost for vector construction here. */
>> - const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> - if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> - extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> + /* Don't expect strided/elementwise loads for just 1 nunit.  */
>> + gcc_assert (nunits > 1);
>> + /* The i386 port uses nunits * stmt_cost as the penalized
>> + cost; we used to follow it, but found it could result in
>> + an unreliable body cost, especially for the V16QI/V8HI
>> + modes.  Instead, use this heuristic: for each scalar
>> + load, the penalized cost is 2 when nunits is 2, and 1
>> + otherwise.  It has little supporting theory and is
>> + mainly concluded from broad performance evaluations on
>> + Power8, Power9 and Power10.  One possibly related point:
>> + constructing a vector with more units takes more insns,
>> + which gives the scheduler more chances to arrange them
>> + well (they may even run in parallel when enough units
>> + are available), so it seems reasonable not to penalize
>> + larger unit counts more.  */
>> + unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
>> + unsigned int extra_cost = nunits * adjusted_cost;
>> data->extra_ctor_cost += extra_cost;
>> }
>> }
>> @@ -5510,7 +5515,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>> cost_data->cost[where] += retval;
>>
>> rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> - stmt_cost, orig_count);
>> + orig_count);
>> }
>>
>> return retval;
>> --
>> 2.27.0
>>