LTO inliner -- sensitivity to increasing register pressure

Xinliang David Li davidxl@google.com
Fri Apr 18 20:34:00 GMT 2014


On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> What I've observed on power is that LTO alone reduces performance and
>> LTO+FDO is not significantly different than FDO alone.
> On SPEC2k6?
>
> This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64), LTO seems
> an off-noise win on SPEC2k6
> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
> http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
>
> I do not see why PPC should be significantly more constrained by register
> pressure.
>
> I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC
> http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
> shows a noticeable drop in calculix and gamess.
> Martin profiled calculix and tracked it down to a loop that is not trained
> but hot in the reference run.  That makes it optimized for size.
>
> http://dromaeo.com/?id=219677,219672,219965,219877
> compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO
> Here the benefits of LTO and FDO seem to add up nicely.
>>
>> I agree that an exact estimate of the register pressure would be a
>> difficult problem. I'm hoping that something that approximates potential
>> register pressure downstream will be sufficient to help inlining
>> decisions.
>
> Yep, register pressure and I-cache overhead estimates are used for inline
> decisions by some compilers.
>
> I am mostly concerned about the metric suffering from the GIGO principle if we mix
> together too many estimates that are somewhat wrong by their nature. This is
> why I have mostly tried to focus on size/time estimates and not add too many other
> metrics. But perhaps it is time to experiment with these, since obviously we
> have mostly pushed the current infrastructure to its limits.
>

I like the word GIGO here. Getting inline signals right requires deep
analysis (including interprocedural analysis). Different signals/hints
may also come with different quality, and thus deserve different weights.

Another challenge is how to quantify cycle savings/overhead more
precisely. With that, we could abandon the threshold-based scheme -- any
call site with a net saving would be considered.
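The threshold-free decision David describes can be sketched roughly as follows. This is an illustrative model only, not GCC's implementation; the parameter names and the simple linear cost model are assumptions made for the sketch:

```python
# Hypothetical sketch of a threshold-free inline decision: instead of
# comparing a badness score against a fixed cutoff, inline any call site
# whose estimated cycle saving exceeds its estimated cost. All parameters
# and the cost model here are illustrative assumptions.
def should_inline(call_freq, call_overhead_cycles,
                  callee_growth_bytes, icache_penalty_per_byte):
    # Cycles saved by eliminating the call/return and argument setup.
    saving = call_freq * call_overhead_cycles
    # Modeled cost of code growth (I-cache pressure, size), per byte.
    cost = callee_growth_bytes * icache_penalty_per_byte
    return saving > cost  # inline iff there is a net saving
```

For example, a hot call site (high `call_freq`) justifies a large callee, while a cold call into a big function does not, with no fixed size threshold involved.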

David


> Honza
>>
>>   Aaron
>>
>> On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
>> > Do you witness similar problems with LTO +FDO?
>> >
>> > My concern is that it can be tricky to get the register pressure estimate
>> > right. The register pressure problem is created by downstream
>> > components (code motion etc.) but only exposed by the inliner.  If you
>> > want to get it 'right' (i.e., not expose the problems), you will
>> > need to bake knowledge of the downstream components (possibly
>> > bugs) into the analysis, which might not be a good thing to do longer
>> > term.
>> >
>> > David
>> >
>> > On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
>> > <acsawdey@linux.vnet.ibm.com> wrote:
>> > > Honza,
>> > >   Seeing your recent patches relating to inliner heuristics for LTO, I
>> > > thought I should mention some related work I'm doing.
>> > >
>> > > By way of introduction, I've recently joined the IBM LTC's PPC Toolchain
>> > > team, working on gcc performance.
>> > >
>> > > We have not generally seen good results using LTO on IBM power processors,
>> > > and one of the problems seems to be excessive inlining that results in the
>> > > generation of excessive spill code. So, I have set out to tackle this by
>> > > doing some analysis at inliner-pass time to compute something analogous to
>> > > register pressure, which is then used to shut down inlining of routines
>> > > that have a lot of pressure.
>> > >
>> > > The analysis is basically a liveness analysis on the SSA names per basic
>> > > block, looking for the maximum number live in any block. I've been using
>> > > "liveness pressure" as a shorthand name for this.
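Aaron's per-block liveness computation can be sketched roughly as below. This is a generic backward dataflow over explicit use/def sets, not GCC's actual SSA machinery; the data representation and the per-block pressure formula are illustrative assumptions:

```python
# Hypothetical sketch of "liveness pressure": backward liveness dataflow
# over SSA names per basic block, reporting the maximum number live in
# any block. The CFG/use/def representation is illustrative, not GCC's.
def liveness_pressure(blocks, succs, uses, defs):
    """blocks: list of block ids; succs: id -> list of successor ids;
    uses/defs: id -> set of SSA names used/defined in that block."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate the standard equations to a fixpoint
        changed = False
        for b in blocks:
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            inn = uses[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    # Crude per-block pressure: names live out of the block plus its defs.
    return max(len(live_out[b] | defs[b]) for b in blocks)
```

On a small linear CFG where an entry block defines three names consumed later, this reports a pressure of 3; a real implementation would track pressure statement by statement within each block rather than per block.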
>> > >
>> > > This can then be used in two ways.
>> > > 1) want_inline_function_to_all_callers_p at present always inlines
>> > > functions that have only one call site, without regard to size or what this
>> > > may do to the register allocator downstream. In particular, BZ2_decompress
>> > > in bzip2 gets inlined, and this causes the pressure reported downstream for
>> > > the int register class to increase 10x. Looking at some combination of
>> > > pressure in the caller/callee may help avoid this kind of situation.
>> > > 2) I also want to experiment with adding the liveness pressure in the callee
>> > > into the badness calculation in edge_badness used by inline_small_functions.
>> > > The idea here is to try to inline functions that are less likely to cause
>> > > register allocator difficulty downstream first.
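Idea (2) above, folding the callee's liveness pressure into the badness metric, might look roughly like this. The scaling formula and the `PRESSURE_WEIGHT` constant are purely illustrative assumptions, not GCC's `edge_badness` computation:

```python
# Hypothetical sketch of biasing inline badness by callee liveness
# pressure: lower badness means "inline sooner", so register-hungry
# callees are deprioritized. PRESSURE_WEIGHT is an invented tuning knob.
PRESSURE_WEIGHT = 4

def edge_badness(size_growth, time_saved, callee_pressure):
    # Scale the size-growth cost by a pressure factor, then trade it off
    # against estimated time saved (guarding against division by zero).
    return size_growth * (1 + PRESSURE_WEIGHT * callee_pressure) / max(time_saved, 1)
```

With this shape, two callees of equal size and benefit sort by pressure, so `inline_small_functions` would consume the low-pressure one first, which is the ordering effect Aaron describes.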
>> > >
>> > > I am just at the point of getting a prototype working; I will post a patch
>> > > you could take a look at next week. In the meantime, do you have any
>> > > comments or feedback?
>> > >
>> > > Thanks,
>> > >    Aaron
>> > >
>> > > --
>> > > Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
>> > > 050-2/C113  (507) 253-7520 home: 507/263-0782
>> > > IBM Linux Technology Center - PPC Toolchain
>> > >
>> >
>>
>> --
>> Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
>> 050-2/C113  (507) 253-7520 home: 507/263-0782
>> IBM Linux Technology Center - PPC Toolchain
