[PATCH] Drop callee function size limits for IPA inlining

Thu Feb 17 10:26:00 GMT 2011

> On Wed, 16 Feb 2011, Jack Howarth wrote:
> 
> > On Wed, Feb 16, 2011 at 08:39:53PM +0100, Richard Guenther wrote:
> > > On Wed, 16 Feb 2011, Dominique Dhumieres wrote:
> > > 
> > > > Richard,
> > > > 
> > > > The patch seems to fix pr45810 (at least most of it) without visible
> > > > degradation of the other polyhedron tests. Is it really to late to
> > > > apply it for 4.6? Would it be possible to test it on SPEC?
> > > 
> > > Heh, definitely way too late for 4.6!  I'm testing it tonight on
> > > our usual benchmarks (including SPEC).  We also do not have reached
> > > conclusion on whether the patch is a good idea.
> > 
> > Richard,
> >    If it is eventually found to be the correct fix, might this be backported for
> > gcc 4.6.1?
> 
> It isn't a "fix" it is a pretty substantial rework of how our IPA
> inlining heuristics work.

I just checked tonight's results.  The patch seems to cause 10% code size growth on
SPECfp and 20% on SPECint with relatively little benefit (about 0.3% at most).  This is
a more aggressive code size/speed tradeoff than we ever made before, more than doubling
the -O2 to -O3 code size gap.

We have improvements in polyhedron, applu, gzip (here I am convinced it is a side effect
of code layout, not a benefit of inlining - I analyzed that problem previously), and vpr.
For some reason c-ray did not improve, but I can imagine it should.  Perhaps the wrong
flags are used?

The wave benchmark gets smaller and has no slowdown.  There is a performance regression
in botan, but its code gets smaller.

We are still waiting for the LTO spec2k6 results, which IMO will be interesting given
the discussion below.

My overall opinion on this is that the inliner should not, even at -O3, blindly keep
inlining with no idea whether it is profitable until the overall program growth limit
is hit.  Doing this would make -O3 even more benchmark centric.

What I see as the main problem with this approach is that it will make tuning the
overall unit growth difficult.  This parameter is problematic in the following ways:

  1) programs written in "kernel" style push it up.  If you have a program split into
     many tiny units, some units have to expand a lot to get good code quality, while
     other units do not expand at all
  2) programs written with large C++ abstractions push it up, as our estimates of size
     after inlining are unrealistic.  The estimated resulting program size is much bigger
     than what we get in the final binary because inlining enables a lot of additional
     optimizations
  3) LTO pushes the limits down.  When you see the whole program you do not have the
     problem described in 1).  On the other hand the number of inline candidates explodes
     as you can do cross-module inlining.  Inlining them all leads to excessive code
     size growth.

     Moreover LTO behaviour differs with and without -fwhole-program.
     -fwhole-program allows a lot smaller unit growth as many of the offline copies of
     functions can be optimized out.

I think there is no way to solve 1) with this approach.  The proposed patch bumps down
the large unit size, which will cause problems with kernel style codebases.
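
To make the concern concrete, here is a toy model of a unit growth budget.  It
assumes the limit is a percentage over max(unit size, large unit floor), roughly in
the spirit of the large-unit-insns and inline-unit-growth params.  The function
names and all numbers are made up for illustration and are not GCC's defaults or
its actual code.

```python
def max_unit_size(initial_size, large_unit_floor, growth_percent):
    """Largest size a unit may reach through inlining (simplified model)."""
    # Small units are treated as if they were at least `large_unit_floor` big.
    base = max(initial_size, large_unit_floor)
    return base * (100 + growth_percent) // 100


def growth_headroom(initial_size, large_unit_floor, growth_percent):
    """How many size units the inliner may still add to this unit."""
    return max_unit_size(initial_size, large_unit_floor, growth_percent) - initial_size


# Hypothetical units: a tiny "kernel style" file and a big one.
tiny_unit, big_unit = 500, 50000

for floor in (10000, 1000):  # illustrative floors: before and after bumping it down
    print(floor,
          growth_headroom(tiny_unit, floor, 30),
          growth_headroom(big_unit, floor, 30))
```

In this model, lowering the floor from 10000 to 1000 cuts the tiny unit's headroom
from 12500 to 800, while the big unit keeps its 15000 either way.  That is the shape
of the problem in 1): the tiny units are the ones that lose their room to expand.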

It seems to me that the inliner should inline when it has reason to believe that the
code will improve noticeably.  At present we are very simplistic in estimating these
improvements, and our only guide is that if a function is small, inlining is probably
a good idea.

To handle cases like c-ray or polyhedron, we really need other kinds of analysis
to contribute to the selection of which inlining is profitable.  One of the easiest
bits is to analyze how much a function body simplifies when operands are known.  Martin
has code for that and it will help in some of the testcases above.
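
A minimal sketch of what such an estimate could look like.  This is not Martin's
code; it is a hypothetical model in which each statement of the callee carries a size
and the set of parameters it depends on, and a statement is assumed to fold away
when all of those parameters are known constants at the call site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    size: int              # estimated size of the statement
    depends_on: frozenset  # parameter names; empty means it never folds away


def size_after_inlining(body, known_params):
    """Estimated callee size at a call site where `known_params` are constants."""
    known = set(known_params)
    return sum(
        s.size
        for s in body
        # Keep the statement unless every parameter it depends on is known.
        if not (s.depends_on and s.depends_on <= known)
    )


# Hypothetical callee: a dispatch on `mode`, a check on `mode` and `n`, and real work.
body = [
    Stmt(4, frozenset({"mode"})),
    Stmt(6, frozenset({"mode", "n"})),
    Stmt(10, frozenset()),
]

print(size_after_inlining(body, []))             # nothing known
print(size_after_inlining(body, ["mode"]))       # `mode` is a constant
print(size_after_inlining(body, ["mode", "n"]))  # both are constants
```

With these made-up numbers the estimate drops from 20 to 16 to 10 as more operands
become known, so the same callee can be "large" at one call site and "small" at
another.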

There are a number of other indicators of inline profitability.  I am a bit
hesitant to add too many of them as we will end up with a difficult to maintain
inliner, but we probably can't avoid implementing some of the important ones.

I know that to solve polyhedron like problems, Open64 uses loop nest analysis
info and I think ICC does that too.  Obviously ICC is a lot less aggressive on
e.g. inlining functions called once, but more aggressive in inlining within
loopy regions.

With more analysis, we will need to refine what an inline candidate is.  At first, an
inline candidate should be a function that is small after inlining (not small before
inlining), but we should also consider inlining regardless of function size when we
know it will help, e.g. from LNO analysis.
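
The refined notion of a candidate could be sketched as below.  Everything here is
hypothetical: the size limit, the call site records and the `lno_hint` flag are
placeholders for whatever the real analyses would provide.

```python
def is_candidate_by_size_before(size_before, limit):
    """Simple rule: the callee body is small before inlining."""
    return size_before <= limit


def is_candidate_refined(size_after, limit, lno_hint=False):
    """Refined rule: small after inlining, or loop nest analysis says it helps."""
    return lno_hint or size_after <= limit


LIMIT = 12  # illustrative size limit

# (name, size before inlining, estimated size after inlining, LNO says it helps)
call_sites = [
    ("small_helper", 8, 8, False),
    ("folds_with_constants", 20, 10, False),
    ("big_in_hot_loop_nest", 60, 55, True),
    ("big_and_cold", 60, 58, False),
]

for name, before, after, hint in call_sites:
    print(name,
          is_candidate_by_size_before(before, LIMIT),
          is_candidate_refined(after, LIMIT, hint))
```

Only the first call site passes the size-before rule, while the refined rule also
accepts the second (small once constants are folded) and the third (flagged by the
loop nest hint), and still rejects the last.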

Honza