Re: [PATCH] Drop callee function size limits for IPA inlining


On Thu, 17 Feb 2011, Jan Hubicka wrote:

> > Heh, it was a radical approach to try to solve the C-Ray issue, for which
> > I don't see how we can do anything reasonable with more analysis (we'd
> > really need to see that we expose CSE opportunities, and even then we're
> > not going to get down to the 40 instruction limit, even when considering
> > savings).
> 
> The limit of 40 is low.  It is set that way because it ended up as the most
> reasonable code size/performance tradeoff on our benchmark suite the last
> time I retuned it.
> 
> Now we have c-ray and polyhedron, which strongly prefer inlining of bigger
> functions, both as a result of a somewhat weird coding style.
> 
> We need to develop something that accommodates those and does not bring too
> many new counterexamples ;) Note that c-ray and polyhedron showed up only a
> long while after the current heuristics were in place, so these are somewhat
> extreme testcases where the current implementation falls apart.
> > 
> > >   1) programs written in "kernel" style push it up.  If you have a program split into
> > >      many tiny units, then to get good code quality some units have to expand a lot, while
> > >      other units not at all
> > 
> > Do the units really have to expand more than 30%?  I really doubt it.
> > I think we also really need FRE in early opts.
> 
> Yes, run the kernel w/o its always_inline hack and with -Winline.  There are
> units that need to expand 5 or 10 times.  This is a result of weird things,
> like heavy use of __builtin_constant_p and extern inlines.

Hm, ok - so you tuned it up to the point where -Winline doesn't show
any warnings?  Kernel people have been dropping inline decls for some
time now.  For the __builtin_constant_p case we can really do better
by simply looking at call sites and adjusting our size estimates accordingly.
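To illustrate the kind of pattern I mean, here is a made-up example (not
actual kernel code): the callee looks big to the pre-inlining size estimate,
but at any call site with a constant argument only one branch survives
after inlining and folding.

  /* Made-up example, not kernel code: the estimator counts both
     branches, the final code keeps only one of them.  */

  extern void set_bits_slow (unsigned long *dst, unsigned long mask);

  static inline void
  set_bits (unsigned long *dst, unsigned long mask)
  {
    if (__builtin_constant_p (mask))
      *dst |= mask;               /* constant case: a single OR insn */
    else
      set_bits_slow (dst, mask);  /* out-of-line fallback */
  }

  void
  user (unsigned long *dst)
  {
    set_bits (dst, 0x4UL);        /* mask is a compile-time constant */
  }

Looking at the call site here would tell us that the else branch (and thus
most of the estimated callee size) goes away, so the per-function estimate
is far too pessimistic.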

> > >   2) programs written with large C++ abstractions push it up, as our size-after-inlining
> > >      estimates are unrealistic.  The estimated resulting program size is much bigger
> > >      than what we get in the final binary because inlining enables a lot of additional
> > >      optimizations
> > >   3) LTO pushes the limits down.  When you see the whole program you do not have the
> > >      problem described in 1).  On the other hand, the number of inline candidates
> > >      explodes as you can do cross-module inlining; inlining them all leads to excessive
> > >      code size growth.
> > > 
> > >      Moreover, LTO behaviour differs with and without -fwhole-program.
> > >      -fwhole-program allows a much smaller unit growth as many of the offline copies of
> > >      functions can be optimized out.
> > 
> > Well, with -fuse-linker-plugin being the sort-of default now,
> > -fwhole-program is in effect by default.
> 
> Yes, but it does not apply to poor man's LTO a-la sqlite, for example.
> Again, it is coding-style sensitivity.

poor-man's LTO?

> > 
> > > I think there is no way to solve 1) with this approach.  The proposed patch bumps down
> > > the large unit size, which will result in problems with kernel-style codebases.
> > 
> > The large unit size basically allows "small" units to grow unbounded if there
> > are enough candidates within the given bounds (40 insns for non-inline declared
> > and 400(!) for inline declared functions).  I can only guess that
> > this allows extra growth for small units that use a lot of inline
> > declared functions.  But even then I'd be curious to see the unit growth
> > we allow for kernel units with -O2; can we have a set of worst-offender
> > files for the random tester?
> 
> I guess we can try to collect units exposing this behaviour, yes.

That would be nice.

> > I'd like us to simplify the current design first; it is basically
> > impossible to do reasonable analysis with the current heuristics and
> > hard limits.  We seem to have many limits that are closely related
> > but interact in very weird ways - not to even start thinking about
> > users that want to get more inlining ...
> 
> Well, let's see how much we can simplify things.  I was always trying to avoid too
> much complexity here, but after more than 8 years quite some has certainly accumulated ;)

Ideally I'd have a -finline-limit that works in a way that only inlines
incrementally more functions when you raise it (as in, increases the unit
growth limit), rather than causing randomly different inlining when you
change it, as it does right now, because the list of candidates and the
list of good candidates aren't related ;)

> > > I know that to solve polyhedron-like problems, Open64 uses loop nest analysis
> > > info, and I think ICC does that too.  Obviously ICC is a lot less aggressive on
> > > e.g. inlining functions called once, but more aggressive in inlining within
> > > loopy regions.
> > 
> > You already do that, sort of, inside the edge badness computation.  But
> > of course the hard limits completely ignore the badness ;)
> 
> Well, badness is really intended to drive the greedy algorithm to fit within the limits
> effectively.  You can't think of it as a device that decides whether a given call is good
> to inline.  That is because the only way it can make a function not be inlined is to inline
> many functions elsewhere so that the inliner bails out at the function/unit growth limit.
> Many real units don't really need/want 30% growth to fare well.  Mozilla definitely doesn't ;)

Sure - I didn't even try to change the growth limits, nor did I touch
the badness function, because changing too many things at once doesn't
result in useful information from an experiment.
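Just to make sure we mean the same thing: as I read it, badness only orders
the candidates and the growth caps do the actual cutting - roughly like the
toy below, which is an illustration only, not the ipa-inline.c code (the
real inliner uses a fibheap keyed on badness, a plain sort stands in here,
and the sizes and the 30% cap are invented to match the numbers above).

  /* Toy illustration only, not GCC code.  */
  #include <stdio.h>
  #include <stdlib.h>

  struct candidate { const char *name; int badness; int growth; };

  static int
  by_badness (const void *a, const void *b)
  {
    return ((const struct candidate *) a)->badness
           - ((const struct candidate *) b)->badness;
  }

  int
  main (void)
  {
    struct candidate c[] = { { "hot_leaf", 1, 10 },
                             { "warm_loop", 5, 60 },
                             { "cold_big", 50, 300 } };
    int n = sizeof c / sizeof c[0];
    int unit_size = 200;
    int limit = unit_size + unit_size * 30 / 100;  /* 30% unit growth */
    int i;

    /* Inline in badness order until the unit growth limit says stop.  */
    qsort (c, n, sizeof c[0], by_badness);
    for (i = 0; i < n; i++)
      if (unit_size + c[i].growth <= limit)
        {
          unit_size += c[i].growth;
          printf ("inline %s -> unit size %d\n", c[i].name, unit_size);
        }
    return 0;
  }

Note that warm_loop only misses the cut because hot_leaf was inlined first -
which is exactly the indirect effect you describe.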

> I see the badness priority and the question of whether we want to inline a given
> call at all as two independent things.  Currently we declare that we want to
> inline the call if:
> 
>  1) the callee is an inline candidate (i.e. it fits in the questionable size limits and is inlinable)
>  2) the call is hot, or
>     the call is cold and the expected whole-program growth is negative after inlining
>     (i.e. we inline the last copy of a static function, or the function is very small so our
>     benefit analysis thinks the resulting program will get smaller).
>
> I am aware that 1) is old and stupid.  It should consider the function body size
> after inlining; I have a patch for that which I didn't merge for 4.6 because it was
> past stage1 and didn't show any noticeable benefits in benchmarks.  It also should
> be edge specific, not function specific, as soon as we are able to work out that
> functions in certain contexts simplify more than in others (based on jump
> function info).

Right.  1) is the main problem, and my patch only addresses 1) in a very
effective way.  It doesn't even try to address the fallout - that we now
always inline up to 30% unit growth in case the function passes the
tests in 2).  One could argue though that the tests in 2) need to be
better - 2) is really about the "usefulness" of inlining.
 
> I am also not claiming that we need to apply a hard limit on (inlined) function size
> in all cases if we develop other ways to conclude that the inlining seems good
> in a particular case.

We effectively already have a hard limit on (inlined) function size: the
large function (growth) limits, which are useful anyway, if only to limit
exposure to non-linear algorithms in GCC.  I doubt we need three limits
(large-function-insns, max-inline-insns-auto and max-inline-insns-single)
though - I'd also be interested to see testcases where it is important
that large-function-insns is 2700, as it causes the function growth limit
to apply very late.  We could for example try to merge
large-function-insns and max-inline-insns-auto into a single hard
limit and drop max-inline-insns-single.

> > Well, all functions that can be inlined should be candidates.  What we
> > end up inlining might not be all of them, of course.  Currently we
> > don't even consider most functions, regardless of how profitable inlining
> > them would be.
> 
> Yes, I agree that this is wrong and should be updated.
> 
> Concerning the implementation, I would suggest introducing a
> want_to_inline_edge_p predicate that initially implements the tests above.
> The small-function inliner should then put all edges leading to inlinable
> functions passing this predicate into the fibheap and work on them.  Then we
> have a convenient place to see what kind of analysis we want to do there.

Yes, I have patches to separate want-* predicates for the early and IPA
inliners.
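
To make that concrete, here is a self-contained toy of the two tests from
your mail above - a sketch only, not my actual patches: the struct fields
stand in for the callee summaries and profile information the inliner
already has, and the 40/400 bounds are the limits mentioned earlier in the
thread.

  #include <stdbool.h>

  /* Toy model, not GCC code: the fields stand in for the existing
     callee summaries and profile information.  */
  struct edge_info
  {
    int callee_size;        /* estimated callee size in insns           */
    bool callee_inlinable;  /* inlining is possible at all              */
    bool declared_inline;   /* callee carries the inline keyword        */
    bool call_is_hot;       /* profile/heuristic considers the call hot */
    int program_growth;     /* estimated whole-program size change      */
  };

  /* Tests 1) and 2) quoted above, with the 40 insn (auto) and
     400 insn (declared inline) bounds.  */
  bool
  want_to_inline_edge_p (const struct edge_info *e)
  {
    int limit = e->declared_inline ? 400 : 40;

    if (!e->callee_inlinable || e->callee_size > limit)   /* test 1) */
      return false;
    if (e->call_is_hot)                                   /* test 2) */
      return true;
    return e->program_growth < 0;  /* cold call, but the program shrinks */
  }

The small-function inliner would then push exactly the edges for which this
returns true onto the fibheap, as you suggest.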

Richard.

