This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Inlining and estimate_num_insns


Hi,
after reading the whole discussion, here are some of my thoughts,
combined into one message as a sanity check and to reduce inbox
pollution ;)

Concerning Richard's cost tweaks:

There is a longer story behind why I like it ;)
I originally considered further tweaking of the cost function a mostly
lost cause, since the representation of the program at that stage is way
too close to the source level to estimate its cost properly.  This got
even a bit worse in the 4.0 timeframe, with the gimplifier introducing
rather unpredictable noise.

Instead I went with the plan of optimizing functions early, which ought
to give better estimates.  It seems to me that we need to know both the
code size and the expected time consumed by a function to have a chance
of predicting the benefits in some way.  With tree-profiling and some
local patches I hope to sort out soonish, I am mostly there, and I did
some limited benchmarking.  Overall, the early optimization seems to do
a good job for SPEC (over 1% speedup in whole-program mode is more than
I expected), but it does almost nothing for the C++ testcases (about 10%
speedup on POOMA and about 0 on Gerald's application).  I believe the
reason is that the C++ testcases consist of little functions that are
unoptimizable by themselves, so the context is not big enough.

In parallel with Richard's efforts, I thought that the problem there is
indeed with the "abstraction functions", i.e. functions that just accept
arguments and call another function or return some field.  There is an
extremely high number of those (from early profiling one can see that
for every operation executed in the resulting program, there are
hundreds of function calls eliminated by inlining).  Clearly, with any
inlining limits, if the cost function assigns a non-zero cost to such
forwarders, we are going to have a difficult time finding thresholds.

I planned to write pattern matching for these functions to bump them to
0 cost, but Richard's patch looks like a pretty interesting idea.  His
results with the limits set to 50 show that he indeed managed to make
those forwarders very cheap, so I believe this idea might indeed work
well with some additional tweaking.
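To illustrate what such pattern matching could look like, here is a toy
sketch (all names and weights are invented for illustration; this is not
GCC's actual estimate_num_insns logic): a function with at most one call
and no other real work is treated as a free forwarder.

```python
# Hypothetical cost estimator, NOT GCC code: a function body is a list
# of ('call', callee) and ('op', name) statements.  A "forwarder" --
# at most one call and no other work -- gets cost 0, so chains of C++
# abstraction wrappers stay arbitrarily cheap to inline.

def estimate_cost(stmts):
    calls = sum(1 for kind, _ in stmts if kind == 'call')
    ops = sum(1 for kind, _ in stmts if kind == 'op')
    if ops == 0 and calls <= 1:
        return 0                 # pure forwarder: inlining is free
    return ops + 10 * calls      # made-up weights for everything else

print(estimate_cost([('call', 'get_field')]))          # forwarder -> 0
print(estimate_cost([('op', 'add'), ('call', 'f')]))   # real work -> 11
```

With this shape of cost function, a chain of getters and call wrappers
never eats into the inlining budget, which is the property the forwarder
problem needs.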

The only thing I am afraid of is that the number of inlines will no
longer be a linear function of the code size estimate increase, which is
limited to a linear fraction of the whole unit.  However, only
"forwarders" with at most one call come out free, so this is still
dominated by the longest path in the callgraph consisting of these.
Unfortunately that path can be long, and we can produce _a lot_ of
garbage inlining these.

One trick that I came across is to do two-stage inlining: first inline
just the functions whose growth estimates are <= 0 in the bottom-up
approach, do early optimizations to reduce the garbage, and then do the
"real inlining job".  This way we might throttle the amount of garbage
produced by the inliner and get more realistic estimates of the function
bodies, but I am not at all sure about this.  It would definitely also
help profiling performance on tramp3d.
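The two stages could be sketched roughly as follows (a toy model with
invented names and a simplified growth formula, not the actual cgraph
implementation):

```python
# Hypothetical two-stage inliner sketch.  funcs maps a function name to
# its size estimate; calls is a list of (caller, callee) edges.
# Stage 1 inlines only edges whose growth estimate is <= 0; stage 2
# (the "real inlining job") would run the usual greedy heuristics on
# the cleaned-up, more realistically sized bodies.

def growth_estimate(callee_size, call_cost=1):
    # Inlining replaces the call statement with a copy of the callee.
    return callee_size - call_cost

def stage_one_inline(funcs, calls):
    inlined = []
    for caller, callee in calls:
        g = growth_estimate(funcs[callee])
        if g <= 0:                      # never grows the unit
            funcs[caller] += g
            inlined.append((caller, callee))
    # ... early optimizations and stage 2 would follow here ...
    return inlined
```

For example, with funcs = {'a': 5, 'wrap': 1, 'big': 50} and edges
('a', 'wrap') and ('a', 'big'), only the wrapper edge is inlined in
stage one; the big callee is left for the real heuristics.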

Concerning -fobey-inline:

I really doubt this is going to help C++ programmers.  I think it might
be useful for the kernel, and I can make a slightly cleaner
implementation (without changes in the frontends) if there is a really
good use for it.  Can someone point me to an existing codebase where
-fobey-inline brings considerable improvements over the default inlining
heuristics?  I've seen a lot of arguing in this direction, but never an
actual bigger application that needs it.

It might also be possible to strengthen the effect the "inline" keyword
has on the heuristics: either multiply the priority by 2 for functions
declared inline, so those candidates get in first, or do two-stage
inlining, first for inline functions and then for auto-inlining.  But
this is probably not going to help the folks complaining mostly about
-O2 ignoring inline, right?
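The first variant is just a multiplier in the priority computation; a
toy sketch (the benefit/size ratio and the factor of 2 are assumptions
for illustration, not the real badness formula):

```python
# Hypothetical candidate priority: higher means "inline this first".
# Doubling the score for explicitly declared "inline" functions lets
# them win against auto-inline candidates of the same shape without
# making the keyword a hard command.

def inline_priority(benefit, size, declared_inline=False):
    base = benefit / size
    return 2 * base if declared_inline else base
```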

Concerning multiple heuristics:

I really don't like this idea ;)  I still think we can make the
heuristics adapt to the programming style they are fed, precisely
because programs often consist of a mix of such styles.

Concerning compilation time/speed tradeoffs:

Since the whole task of the inliner is to slow down the compiler in
order to improve the resulting code, it is difficult to blame it for
doing its job.  I was in an easy position with the original heuristics,
where the pre-cgraph code produced just as much inlining, so it was easy
to speed up both; now we obviously do too little inlining, so we need to
expect some slowdowns.  I would call the heuristics a success if they
result in faster and smaller code; compilation time is kind of
secondary.  However, for code bases (like SPEC) where the extra inlining
doesn't help, we definitely should not slow down seriously (over 1%, I
guess).

Concerning growth limits:

If you take a look at when the -finline-unit-growth limit hits, it is
clear that it hits very often on small units (several times in the
kernel, the testsuite and such), just because there is tiny space to
maneuver.  It hits almost never on medium units (in a GCC bootstrap it
almost never triggers) and almost always on big units.

My intuition has always been that for larger units the limits should be
much smaller, and POOMA was the major counterexample.  If we succeed in
solving this, I would guess we can introduce something like a
small-unit-insns limit and only restrict units that exceed it.  Does
this sound sane?
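A sketch of what such a small-unit-insns parameter could mean (both
parameter names and values below are invented; cf. the existing
-finline-unit-growth percentage):

```python
# Hypothetical growth budget: units below SMALL_UNIT_INSNS are treated
# as if they had that size, giving small units room to maneuver, while
# big units keep the plain percentage limit.

SMALL_UNIT_INSNS = 1000   # invented threshold
UNIT_GROWTH_PCT = 50      # percentage limit, as with -finline-unit-growth

def allowed_growth(unit_size):
    effective = max(unit_size, SMALL_UNIT_INSNS)
    return effective * UNIT_GROWTH_PCT // 100
```

Under this scheme a tiny unit of 100 insns still gets a 500-insn
budget, while a 4000-insn unit gets the plain 50% limit.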

Concerning 4.0 timing:

I agree that we should have started a month or two ago, but
unfortunately I wasn't able to do any useful work at that time.  Tuning
much earlier in the 4.0 cycle would have been unprofitable anyway, since
the compiler was just moving too fast.  I experimented with this at
tree-ssa merge time, but I basically ended up with a slowdown that would
have shot it off its release criteria, so I didn't want to interfere
with it.  We have however solved a number of problems since then, so
tuning now is more pleasant ;)
In 3.4 we also tuned late, and that actually seems like a sane step to
me.

It is not difficult to compute code size estimates before we go to
GIMPLE, to get a more apples-to-apples comparison.  While tree-SSA
scored pretty badly in this test, I tried it in mid-December and things
were quite comparable to -O2 compilation time.  The problem with the
inliner is that it depends a lot on the rest of the compiler...

Overall, I would like to continue on the path of pre-inlining, attempt
to somehow tune Richard's idea for the inlining cost function now, and
sort out the quadratic issues in cgraph if they really show up; let's
hope we end up with something generally usable soon.  If simple changes
to the costs would help 4.0, I would still like to consider them, and
similarly Richard's doubly-linked stuff, as his benchmarks look pretty
convincing ;)

Honza

