[PATCH, RFC] Improve ivopts group costs

Bin.Cheng amker.cheng@gmail.com
Wed Nov 9 13:02:00 GMT 2016


On Thu, Nov 3, 2016 at 4:00 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 1:35 PM, Evgeny Kudryashov <kudryashov@ispras.ru> wrote:
>> Hello,
>>
>> I'm facing the following problem related to ivopts. The problem is that GCC
>> generates a lot of induction variables and as a result there is an
>> unnecessary increase of stack usage and register pressure.
>>
>> For instance, for the attached testcase (tc_ivopts.c) GCC generates 26
>> induction variables (25, one for each of the lhsX[{0-4}][{0-4}][k]
>> accesses, and one for rhs[k][j][{0-2}]). You should be able to reproduce
>> this issue on an arm target using GCC with "-O2 -mcpu=cortex-a9
>> -mtune=cortex-a9". For reference, I'm attaching the assembly I get on
>> current trunk.
>>
>> The reason might be in the costs of use groups, in particular in the way
>> they are estimated. Currently, the cost of a tuple (group, candidate) is
>> the sum of the per-use costs in the group, so the cost of a group grows
>> in proportion to the number of uses in the group. This has a negative
>> effect on the algorithm for finding the best set of induction variables:
>> the part of the total cost related to adding a new candidate is almost
>> washed out by the cost of the group. In addition, when there are a lot of
>> groups with many uses inside and the target is out of free registers, the
>> cost of spilling is washed out too. As a result, GCC prefers the native
>> candidate (the candidate created for a particular group) for each group
>> of uses instead of considering the real cost of introducing a new
>> variable into the set.
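>>
>> For illustration, a minimal sketch with made-up numbers (the per-use
>> costs and the charge for one more IV are invented, not taken from the
>> real cost model):
>>
>> #include <stdio.h>
>>
>> /* A group of 25 uses: a dedicated candidate costs 4 per use, a shared
>>    one costs 5 per use, and introducing one more IV is charged 15.  */
>> int
>> main (void)
>> {
>>   int n = 25, dedicated = 4, shared = 5, new_iv = 15;
>>
>>   /* Summing: the 25-point per-use gap dwarfs the new-IV charge, so
>>      every group ends up with its own candidate.  */
>>   printf ("sum: dedicated=%d shared=%d\n",
>>           n * dedicated + new_iv, n * shared);
>>
>>   /* Taking the max: the gap shrinks to one point, the new-IV charge
>>      dominates, and the shared candidate wins.  */
>>   printf ("max: dedicated=%d shared=%d\n",
>>           dedicated + new_iv, shared);
>>   return 0;
>> }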
>>
>> The summing approach was added as part of this patch
>> (https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00641.html), and the
>> motivation for taking the sum does not seem to have been specifically
>> discussed.
>>
>> I propose the following patch, which changes a group's cost from the sum
>> of the per-use costs to the largest cost inside the group. For the given
>> test case, the required stack size is decreased almost 3 times and the
>> number of ldr/str instructions a little less than 2 times. I'm also
>> attaching the assembly after applying the patch.
>>
>> The essential change in the patch is just:
>>
>> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
>> index f9211ad..a149418 100644
>> --- a/gcc/tree-ssa-loop-ivopts.c
>> +++ b/gcc/tree-ssa-loop-ivopts.c
>> @@ -5151,7 +5151,8 @@ determine_group_iv_cost_address (struct ivopts_data *data,
>>          offset and where the iv_use is.  */
>>         cost = get_computation_cost (data, next, cand, true,
>>                                      NULL, &can_autoinc, NULL);
>> -      sum_cost += cost;
>> +      if (sum_cost < cost)
>> +        sum_cost = cost;
>>      }
>>    set_group_iv_cost (data, group, cand, sum_cost, depends_on,
>>                      NULL_TREE, ERROR_MARK, inv_expr);
>>
>> Any suggestions?
> Hi Evgeny,
>
> I tend to be practical here.  Given that cost is not modeled that well in
> IVOPT, any heuristic tuning of the cost is quite likely to result in
> improvements in some cases and regressions in others.  At the least we
> need to run a good number of benchmarks for such changes.  Specific to
> this one, summing the costs in a group is a natural result of the
> original code, in which the uses' costs were added up before candidate
> selection.  Take your code as an example: choosing the loop's original
> candidate for a group results in lots of memory accesses with the
> [base + index << scale] addressing mode, which could have a higher cost
> than the [base + offset] mode with respect to the micro-architecture;
> IMHO, it makes sense to add up this cost.  OTOH, I totally agree that
> IVOPT tends to choose too many candidates at the moment, especially for
> large loops.  Is it possible to achieve the same effect by penalizing
> the register pressure cost?  Meanwhile, I can collect benchmark data
> for your patch on AArch64 and get back to you later.
I collected spec2k6 data on one AArch64 processor; it doesn't show a
meaningful improvement or regression.  Looking at the test, it boils
down to how we choose the induction variable for loops with memory
references of the form below:
for (biv)
  MEM[base + biv << scale + off1];
  MEM[base + biv << scale + off2];
  // ...
  MEM[base + biv << scale + offn];
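
In concrete C, the pattern might look like this (a hypothetical
reduction of the attached tc_ivopts.c, not the testcase itself; the
names are invented):

/* Every access shares the biv `k' and the same base; they differ
   only in a constant offset.  */
void
accumulate (int *base, const int *rhs, int n)
{
  for (int k = 0; k < n; k++)
    {
      base[k + 0]  += rhs[k];	/* MEM[base + k << 2 + 0]   */
      base[k + 16] += rhs[k];	/* MEM[base + k << 2 + 64]  */
      base[k + 32] += rhs[k];	/* MEM[base + k << 2 + 128] */
    }
}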

On targets supporting the [base + index << scale + offset] addressing
mode, the biv should be preferred (if the cost of that addressing mode
is not blocking), and the register pressure is 2.  On targets that only
support [base + index << scale], it is more complicated: choosing the
biv actually increases the register pressure by 1 compared with
choosing {base_1 + biv << scale, step} as the candidate, or else an
additional computation outside of the address expression is required
for each memory reference.  There is one exception, though: when "base"
is actually anticipated on the exit edge of this loop.  So this is
really a target-dependent cost-model issue; IMHO, decreasing the group
cost in a target-independent way is not that good.  What do you think?
I will look deeper into why choosing the biv gets rid of the spilling.
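
For illustration, here is roughly what the two choices could lower to
on a target with only [base + index << scale] and [base + offset]
modes (a hand-written sketch, not GCC output; all names are invented):

/* (a) Keep the biv: `base' and `biv' are both live across the loop,
   and an extra add is needed to fold the scaled index into a base
   register before the constant offsets can be applied.  */
void
with_biv (int *base, long n)
{
  for (long biv = 0; biv < n; biv++)
    {
      int *tmp = base + biv;	/* extra computation per iteration */
      tmp[1] = 0;		/* [tmp + 4] */
      tmp[2] = 0;		/* [tmp + 8] */
    }
}

/* (b) Pointer IV {base + biv << 2, 4}: the single register `p'
   replaces the base/biv pair, and each use fits [base + offset].  */
void
with_pointer_iv (int *base, long n)
{
  for (int *p = base; p < base + n; p++)
    {
      p[1] = 0;			/* [p + 4] */
      p[2] = 0;			/* [p + 8] */
    }
}

In (a) three registers (base, biv, n) are live across the loop, versus
two (p and the end bound) in (b), matching the pressure difference
described above.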

BTW, the case can be further extended:
for (biv)
  MEM[base_1 + biv << scale + off1];
  MEM[base_1 + biv << scale + off2];
  // ...
  MEM[base_1 + biv << scale + offn];
  //
  // ...
  //
  MEM[base_m + biv << scale + off1];
  MEM[base_m + biv << scale + off2];
  // ...
  MEM[base_m + biv << scale + offn];

IVOPT doesn't work well for this either.

Thanks,
bin


