Re: Reduce inline-insns-auto


> Are the units for the parameter RTL instructions?  And you are measuring the
> effect on x86_64?

The units are not RTL instructions but an estimate of the number of
GIMPLE statements, which is at present architecture independent.  Every
GIMPLE statement is estimated in a very primitive way after early
optimizations, so the estimates are relatively rough, and they become
even rougher because we are really shooting for new optimization
opportunities exposed by inlining (this is the case of the eon
benchmark; I will send a separate mail on this) rather than just
reducing the call cost itself.
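
To make the units concrete (a hypothetical example; I am assuming the
knob this thread is about is spelled max-inline-insns-auto as in
current mainline):

  /* The size estimate for this body is a count of its GIMPLE
     statements after early optimization, so it is the same on x86_64
     and on a RISC target even though the emitted instruction counts
     differ.  To experiment with the limit, it can be overridden at
     compile time, e.g.:
       g++ -O2 --param max-inline-insns-auto=40 test.cc  */
  static int
  clamp (int v, int lo, int hi)
  {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
  }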

As for RISC/CISC differences, I did run tests at least on ia-64, which
is the only non-x86_64 architecture we have set up for the SPEC/C++
tester, with the whole patch re-tuning all the parameters as it is
present on the pretty-ipa branch.  While there are definitely minor
differences in behaviour between x86_64/x86 and ia-64, they are not
monotonically in one direction (i.e. towards more inlining) as one
might expect.  In fact ia-64 seems less sensitive to inlining.

I have several times considered adding more precise size estimates for
individual instructions (for example, using our algorithm for expanding
division by a constant into a sequence of shifts, or having
architecture dependent call costs), but because this would make
inlining decisions a lot more sensitive to the given architecture and
thus a lot more difficult to tune in general, I am hesitant to do that,
to avoid an explosion of the testing matrix.  Inlining heuristics are
difficult to tune even now, and I am trying to re-tune them every major
release, which takes a considerable amount of time.
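
To make the imprecision concrete: a single GIMPLE division statement
can expand into a target dependent multi-instruction sequence.  A toy
example (hypothetical, just to illustrate the point):

  /* One GIMPLE statement, but on most targets the division by the
     constant 10 is expanded into a multiply-high plus shifts rather
     than a divide instruction, so the real cost depends on the
     target's expansion sequence.  */
  static unsigned
  div10 (unsigned x)
  {
    return x / 10;
  }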

Let me know if you notice any ill effects on PPC benchmarks and I can
definitely investigate them.

To give a short outline of what is going on here:

One of the main reasons for pretty-ipa was to strengthen the early
optimizations.  We now do more passes (such as the empty loop removal
needed to get rid of initializers in vector code, alias analysis,
better DCE, and EH cleanup).  I also experimented with pushing FRE and
some loop optimizations early, which helps in some cases (eon in
particular), but I didn't find a very good justification for it in
general - I will re-test on current mainline.
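
As an illustration of the kind of code the early empty loop removal
targets (a hypothetical reduction, not a real testcase):

  /* After early DCE the stores below are dead (tmp is never read),
     the loop becomes empty, and empty loop removal deletes it
     entirely, so the function no longer looks big to the inliner.  */
  static int
  f (void)
  {
    int tmp[16];
    for (int i = 0; i < 16; i++)
      tmp[i] = 0;
    return 42;
  }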

These changes had the effect that functions now all appear smaller to
the inliner, and we do more inlining than before early optimizations
were introduced.  We also improved constant propagation/cloning/
pure-const discovery and other IPA passes, which makes code quality
less dependent on excessive inlining.  The inliner also now has more
realistic cost estimates, tracking time/size ratios as opposed to the
combined metric used in previous releases, so it makes the inlining
decisions in a better order.  Thus the need to trim down the
parameters.
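
Roughly, the ordering change looks like this (a simplified sketch with
invented names, not the actual cgraph code):

  /* Instead of a single combined cost per call, the inliner weighs
     estimated time benefit against estimated size growth, so
     small-benefit/large-growth edges sort after large-benefit/
     small-growth ones.  Invented names, for illustration only.  */
  struct edge_estimate
  {
    double time_benefit;  /* estimated runtime saved by inlining */
    double size_growth;   /* estimated code size added by inlining */
  };

  /* Lower badness means the edge is inlined earlier.  */
  static double
  badness (const edge_estimate &e)
  {
    return e.size_growth / (e.time_benefit + 1e-9);
  }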

Since the new cgraph code was introduced, the main issue has
traditionally been C++ code (i.e. tramp3d), where we have the property
that over 99% of the code the inliner produces is subsequently
optimized out, so the whole metric based roughly on the number of
statements is completely confused (counting 99% garbage).  This has
gradually improved, and with current mainline we are at a little over
one third.
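
A toy example of the abstraction penalty in question (hypothetical,
tramp3d-style in miniature):

  /* Once get() is inlined, the wrapper class and the call overhead
     disappear completely; nearly all of the code the inliner "grew"
     is optimized out again.  */
  class Wrapped
  {
    int v_;
  public:
    explicit Wrapped (int v) : v_ (v) {}
    int get () const { return v_; }
  };

  int
  use (int x)
  {
    Wrapped w (x);
    return w.get () + 1;  /* folds to x + 1 after inlining */
  }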

Originally we handled this with pretty intensive early inlining (which
got rid of some of the abstraction penalty) combined with a large
overall unit growth limit (200%) and artificially increased call costs
(10 instructions).  This made the original unit estimates (which
contained many calls) larger and gave the inliner a lot of freedom to
inline everything in C++ style code, while in C style code we usually
hit the bound on the size of inline candidates and didn't hit the
overall unit growth limits, resulting in reasonable code size growth on
C testcases.
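
The overall growth limit works roughly like this (a simplified sketch,
not the real cgraph code; the percentage corresponds to --param
inline-unit-growth):

  /* With growth_percent == 200 the whole unit may grow to 3x its
     original estimated size before further non-mandatory inlining is
     refused; with 10 it is capped at 1.1x.  */
  static bool
  unit_growth_ok (long original_size, long new_size, int growth_percent)
  {
    return new_size <= original_size
                       + original_size * growth_percent / 100;
  }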

This worked pretty well in most cases, but it has major issues with
LTO.  There we end up, in C code, with many inline candidates, and
since the percentage of code optimized out after inlining (i.e. the
abstraction penalty) is a lot smaller, we just end up doubling the size
of every program.  This also causes some degradation by producing
impractically large function bodies that don't optimize as well as we
would like.

With the current code the inline size hack is gone, and I am just about
to reduce the overall unit growth to 10%, which seems to be enough now
for C++ code too.

There are a few patches waiting, especially Martin's code for
estimating the size of a function body based on the known values of its
arguments (i.e. if an argument is constant, it can work out what will
be optimized out), IPA-SRA, and other work that in general helps reduce
the need for excessive inlining done on the speculation that the code
will probably optimize out.
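
The idea behind the argument-aware estimates can be shown on a toy case
(a hypothetical illustration, not Martin's actual implementation):

  /* When 'debug' is known to be the constant 0 at a call site, the
     whole branch below folds away, so the effective size of the
     inlined body is much smaller than its raw statement count.  */
  static int
  process (int x, int debug)
  {
    if (debug)
      {
        /* ...large diagnostic code, all dead when debug == 0... */
        return -x;
      }
    return x + 1;
  }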

There is a problem with merging and evaluating these: with the current
inline limits (tuned to give good results on our C++ testcases during
the early stages of pretty-ipa) we inline everything anyway and see no
benefits on our testcases.

So I am trying to push out the inline limit reduction, see which
testcases are most sensitive to the limits, and see what can be handled
with the remaining improvements we have in the queue.  This should help
in evaluating both the C++ changes and LTO.

Honza
> 
> If I have understood your message correctly, you have just set a generic
> parameter of GCC, calculated in target instructions, based on measurements
> of x86_64, a CISC architecture.  You report that the eon benchmark is very
> sensitive to the result and have tuned the parameters for x86_64, which
> seems like it will push all RISC architectures over the limit.
> 
> This does not seem like an appropriate method for determining the default
> value of this parameter.  Have I misunderstood something?
> 
> Thanks, David

