Ping: IRA-based register pressure calculation for RTL loop invariant motion

Sat Oct 17 11:17:00 GMT 2009

On Sat, Oct 17, 2009 at 5:34 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> Richard Guenther wrote:
>>
>> On Wed, Oct 14, 2009 at 6:27 PM, Vladimir Makarov <vmakarov@redhat.com>
>> wrote:
>>
>>>
>>> Zdenek Dvorak wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>>>>
>>>>>>> +      if (i < ira_reg_class_cover_size)
>>>>>>> +       size_cost = comp_cost + 10;
>>>>>>> +      else
>>>>>>> +       size_cost = 0;
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Including comp_cost in size_cost makes no sense (this would prevent us
>>>>>> from
>>>>>> moving even very costly invariants out of the loop if we run out of
>>>>>> registers).
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> That is exactly what I intended.  As I wrote above, I tried a lot of
>>>>>  heuristics with different parameters which decided to move loop
>>>>>  invariant
>>>>> depending on spill cost and loop invariant cost.  But they  don't  work
>>>>> well
>>>>> at least for x86/x86_64 and power6.  I have some  speculation for this.
>>>>>  x86/x86_64 is OOO processors these days.  And  costly invariant will
>>>>> be
>>>>> hidden because usually the invariant has a lot  of freedom to be
>>>>> executed
>>>>> out-of-order.  For power6, long latency is  hidden by insn scheduling.
>>>>>  It
>>>>> is hard to me find a processor where it  will be important.  Another
>>>>> reason
>>>>> for this, it is very hard to evaluate  accurately spill cost at this
>>>>> stage.
>>>>>  So I decided not to use  combination of register pressure and
>>>>> invariant
>>>>> cost in my approach.
>>>>>
>>>>
>>>> could you please add this reasoning to the comment?  Another reason why
>>>> preventing the invariant motion does not hurt might be that all
>>>> expensive
>>>> invariants were already moved out of the loop by PRE and gimple
>>>> invariant
>>>> motion pass.
>>>>
>>>>
>>>>>
>>>>> +      for (i = 0; i < ira_reg_class_cover_size; i++)
>>>>> +       {
>>>>> +         cover_class = ira_reg_class_cover[i];
>>>>> +         if ((int) new_regs[cover_class]
>>>>> +             + (int) regs_needed[cover_class]
>>>>> +             + LOOP_DATA (curr_loop)->max_reg_pressure[cover_class]
>>>>> +             + IRA_LOOP_RESERVED_REGS
>>>>> +             - ira_available_class_regs[cover_class] > 0)
>>>>> +           break;
>>>>> +       }
>>>>>
>>>>
>>>> It might be clearer to write this as ... >
>>>> ira_available_class_regs[cover_class] instead
>>>> of ... - ira_available_class_regs[cover_class] > 0.  Otherwise, the
>>>> patch
>>>> is OK.
>>>>
>>>>
>>>
>>> Zdenek, thanks for the additional comments.  I incorporated them into the
>>> patch just before committing.  Here is the affected patch part:
>>>
>>
>> I think this consistently regressed both compile-time and runtime for
>> Polyhedron on x86_64.  For Itanium the story isn't clear, but effects
>> are seen there as well (it's also the only one I see off-noise effects
>> on SPEC 2000 - significant ups and downs).
>>
>>
>
>  Yes, it is expensive optimization (at least 3 additional passes
> through RTL insns one for calculating register pressure and two very
> expensive passes for finding register classes for pseudos).  It is
> clearly seen from SPEC compilation time graphs on
>
> http://vmakarov.fedorapeople.org/spec
>
> for 2 last benchmarking.  Therefore I proposed it only for -O3.
>
> Overall SPEC2000 scores are practically the same on x86/x86_64.
>
> As for Polyhedron benchmarks, here is my results on Core I7:
>
> first:  -ffast-math -funroll-loops -O3 -fno-ira-loop-pressure
> second: -ffast-math -funroll-loops -O3 -fira-loop-pressure
>
> x86:
> Geometric Mean Execution Time =      12.84 seconds
> Geometric Mean Execution Time =      12.82 seconds
>
> x86_64:
> Geometric Mean Execution Time =       9.89 seconds
> Geometric Mean Execution Time =       9.91 seconds
>
> On power6:
> first:  -mtune=power6 -ffast-math -funroll-loops -O3 -fno-ira-loop-pressure
> second: -mtune=power6 -ffast-math -funroll-loops -O3 -fira-loop-pressure
>
> Geometric Mean Execution Time =      19.22 seconds
> Geometric Mean Execution Time =      19.04 seconds
>
>  As I wrote earlier the winner of the optimization usage will be
> loops with pressure lower (but not too lower) than #registers.  For
> x86/x86_64, practically all loops have pressure more than #registers.
> For such loops, evaluation of invariant cost vs spill cost would be
> important.  But at this stage, spill cost is impossible to evaluate
> accurately.  So usage of old and new loop invariant motion criteria on
> processors similar x86/x86_64 will give different results for particular
> tests (some tests better, some worse) but overall score will be
> practically the same.
>
>  Probably, there is no sense to use IRA-based register pressure calculation
> for all targets (including x86/x86_64) but for power it is a clear win as it
> is seen from polyhedron and as I reported for SPEC2000.
>
>  So we could switch it off by default for -O3.  What do you think about this
> solution, Richard?

I think we could switch it on by default at -O3 for a selected group of
targets.  Itanium overall also improves with the new heuristics.  That would
make it power and Itanium.  Did you try restricting the heuristics to certain
register classes, like SSE registers on x86_64?

Thanks,
Richard.