This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: LTO inliner -- sensitivity to increasing register pressure
- From: Jan Hubicka <hubicka at ucw dot cz>
- To: Aaron Sawdey <acsawdey at linux dot vnet dot ibm dot com>
- Cc: Xinliang David Li <davidxl at google dot com>, Jan Hubicka <hubicka at ucw dot cz>, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 18 Apr 2014 21:27:44 +0200
- Subject: Re: LTO inliner -- sensitivity to increasing register pressure
- Authentication-results: sourceware.org; auth=none
- References: <53515624 dot 5060509 at linux dot vnet dot ibm dot com> <CAAkRFZJ4S+tT2+DtYzptvyb8BOOAatFj-dJzABd8UiVtn2KUug at mail dot gmail dot com> <1397847967 dot 19063 dot 10 dot camel at ragesh3 dot rchland dot ibm dot com>
> What I've observed on power is that LTO alone reduces performance and
> LTO+FDO is not significantly different than FDO alone.
On SPEC2k6?
This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64), LTO seems
an off-noise win on SPEC2k6:
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
I do not see why PPC should be significantly more constrained by register
pressure.
I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC, but
http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
shows a noticeable drop in calculix and gamess.
Martin profiled calculix and tracked it down to a loop that is not exercised by
the training run but is hot in the reference run, so it gets optimized for size.
http://dromaeo.com/?id=219677,219672,219965,219877
compares Firefox's dromaeo runs with default build, LTO, FDO and LTO+FDO
Here the benefits of LTO and FDO seem to add up nicely.
>
> I agree that an exact estimate of the register pressure would be a
> difficult problem. I'm hoping that something that approximates potential
> register pressure downstream will be sufficient to help inlining
> decisions.
Yep, register pressure and I-cache overhead estimates are used for inline
decisions by some compilers.
I am mostly concerned about the metric suffering from the GIGO principle if we mix
together too many estimates that are somewhat wrong by their nature. This is
why I mostly tried to focus on size/time estimates and not add too many other
metrics. But perhaps it is time to experiment with these, since we have obviously
pushed the current infrastructure mostly to its limits.
Honza
>
> Aaron
>
> On Fri, 2014-04-18 at 10:36 -0700, Xinliang David Li wrote:
> > Do you witness similar problems with LTO +FDO?
> >
> > My concern is it can be tricky to get the register pressure estimate
> > right. The register pressure problem is created by downstream
> > components (code motions etc) but only exposed by the inliner. If you
> > want to get it 'right' (i.e., not exposing the problems), you will
> > need to bake the knowledge of the downstream components (possibly
> > bugs) into the analysis which might not be a good thing to do longer
> > term.
> >
> > David
> >
> > On Fri, Apr 18, 2014 at 9:43 AM, Aaron Sawdey
> > <acsawdey@linux.vnet.ibm.com> wrote:
> > > Honza,
> > > Seeing your recent patches relating to inliner heuristics for LTO, I
> > > thought I should mention some related work I'm doing.
> > >
> > > By way of introduction, I've recently joined the IBM LTC's PPC Toolchain
> > > team, working on gcc performance.
> > >
> > > We have not generally seen good results using LTO on IBM power processors,
> > > and one of the problems seems to be excessive inlining that results in the
> > > generation of a large amount of spill code. So, I have set out to tackle this
> > > by doing some analysis at the time of the inliner pass to compute something
> > > analogous to register pressure, which is then used to shut down inlining of
> > > routines that have a lot of pressure.
> > >
> > > The analysis is basically a liveness analysis on the SSA names per basic
> > > block, looking for the maximum number live in any block. I've been using
> > > "liveness pressure" as a shorthand name for this.
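[Archive note: the per-block liveness computation described above can be sketched
as a toy example. Everything here -- the block/def/use data structures and the
pressure formula -- is invented for illustration; it is not GCC's tree-ssa-live
code, just the textbook backward dataflow the mail alludes to.]

```python
def liveness_pressure(blocks, succs, defs, uses):
    """Approximate "liveness pressure": the maximum number of SSA names
    simultaneously live across any basic block, computed by iterating
    the standard backward liveness equations to a fixpoint.

    blocks: iterable of block ids
    succs[b]: list of successor block ids
    defs[b]/uses[b]: sets of SSA names defined/used in block b
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                       # iterate to a fixpoint
        changed = False
        for b in blocks:
            # live-out is the union of live-in over all successors
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            # live-in: names used here, plus live-out names not defined here
            inn = uses[b] | (out - defs[b])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    # pressure: most names live out of (or defined in) any single block
    return max(len(live_out[b] | defs[b]) for b in blocks)
```

For a three-block chain where `a` and `b` are defined in the entry block and
consumed in later blocks, this reports a pressure of 2.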
> > >
> > > This can then be used in two ways.
> > > 1) want_inline_function_to_all_callers_p at present always says to inline
> > > things that have only one call site without regard to size or what this may
> > > do to the register allocator downstream. In particular, BZ2_decompress in
> > > bzip2 gets inlined and this causes the pressure reported downstream for the
> > > int register class to increase 10x. Looking at some combination of pressure
> > > in caller/callee may help avoid this kind of situation.
> > > 2) I also want to experiment with adding the liveness pressure in the callee
> > > into the badness calculation in edge_badness used by inline_small_functions.
> > > The idea here is to try to inline functions that are less likely to cause
> > > register allocator difficulty downstream first.
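[Archive note: the second idea -- folding callee pressure into the badness
ordering -- might look roughly like the following. The linear penalty and the
`PRESSURE_SCALE` constant are invented for illustration; GCC's real
edge_badness in inline_small_functions uses sreal fixed-point growth/time
terms, not this formula.]

```python
PRESSURE_SCALE = 32  # assumed register-file size to normalize against

def adjusted_badness(badness, callee_pressure):
    """Lower badness inlines first; scale it up once the callee's
    liveness pressure exceeds the (assumed) register file size, so
    pressure-heavy callees are inlined later, if at all."""
    penalty = 1.0 + max(0, callee_pressure - PRESSURE_SCALE) / PRESSURE_SCALE
    return badness * penalty
```

A callee whose pressure fits in the register file keeps its badness unchanged;
one at twice the register count has its badness doubled.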
> > >
> > > I am just at the point of getting a prototype working; I will post a patch
> > > you could take a look at next week. In the meantime, do you have any
> > > comments or feedback?
> > >
> > > Thanks,
> > > Aaron
> > >
> > > --
> > > Aaron Sawdey, Ph.D. acsawdey@linux.vnet.ibm.com
> > > 050-2/C113 (507) 253-7520 home: 507/263-0782
> > > IBM Linux Technology Center - PPC Toolchain
> > >
> >
>
> --
> Aaron Sawdey, Ph.D. acsawdey@linux.vnet.ibm.com
> 050-2/C113 (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain