[PATCH] Add capability to run several iterations of early optimizations

Fri Oct 28 10:01:00 GMT 2011

On Thu, Oct 27, 2011 at 11:53 PM, Matt <matt@use.net> wrote:
>>>> Then you'd have to analyze the compile-time impact of the IPA
>>>> splitting on its own when not iterating.  Then you should look
>>>> at what actually was the optimizations that were performed
>>>> that lead to the improvement (I can see some indirect inlining
>>>> happening, but everything else would be a bug in present
>>>> optimizers in the early pipeline - they are all designed to be
>>>> roughly independent on each other and _not_ expose new
>>>> opportunities by iteration).  Thus - testcases?
>>>
>>> The initial motivation for the patch was to enable more indirect
>>> inlining and devirtualization opportunities.
>
>> Hm.
>
> It is the proprietary codebase of my employer that these optimizations were
> developed for. Multiple iterations specifically help propagate the concrete
> type information from functions that implement the Abstract Factory design
> pattern, allowing for cleaner runtime dynamic dispatch. I can verify that in
> said codebase (and in the reduced, non-proprietary examples Maxim provided
> earlier in the year) it works quite effectively.
>
> Many of the devirt examples focus on a pure top-down approach like this:
> class I { public: virtual void f() = 0; };
> class K : public I { public: virtual void f() {} };
> class L : public I { public: virtual void f() {} };
> void g(I& i) { i.f(); }
> int main(void) { L l; g(l); return 0; }
>
> While that strategy isn't unheard of, it implies a link-time substitution to
> inject new/different sub-classes of the parameterized interface. Besides
> limiting extensibility by requiring a rebuild/relink, it also presupposes
> that two different implementations would be mutually exclusive for that
> module. That is often not the case, hence the factory pattern expressed in
> the other examples Maxim provided.
>
>>> Since then I found the patch to be helpful in searching for
>>> optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp drops 20%
>>> with 2 additional iterations of early optimizations [*].  Given that
>>> applying more optimizations should, theoretically, not decrease performance,
>>> there is likely a very real bug or deficiency behind that.
>
>> It is likely early SRA that messes up, or maybe convert switch.  Early
>> passes should be really restricted to always profitable cleanups.
>
>> Your experiment looks useful to track down these bugs, but in general
>> I don't think we want to expose iterating early passes.
>
> In these other more top-down examples of devirt I mention above, I agree
> with you. Once the CFG is ordered and the analyses happen, things should be
> propagated forward without issue. In the case of factory functions, my
> understanding and experience on this real-world codebase is that multiple
> passes are required. First, to "bubble up" the concrete type info coming out
> of the factory function. Depending on how many layers, it may require a
> couple. Second, to then forward propagate that concrete type information for
> the pointer.
>
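
(To illustrate the factory-function case described above, here is a minimal
sketch. The names make_impl, make_layer1, make, use and run are hypothetical
and are not taken from the proprietary codebase or from Maxim's examples, and
nothing here claims that a particular GCC version devirtualizes this call. The
point is only that the concrete type L is visible solely inside the innermost
factory, so it has to "bubble up" through each wrapper layer before it can be
propagated forward to the virtual call in use().)

```cpp
#include <memory>

// Interface and implementations, named after the quoted top-down example.
struct I {
    virtual ~I() {}
    virtual int f() = 0;
};
struct K : I { int f() override { return 1; } };
struct L : I { int f() override { return 2; } };

// Innermost factory: the only place where the concrete type is known.
static I* make_impl() { return new L; }

// Hypothetical wrapper layers. Each layer has to be inlined into its caller
// before the caller can see that the returned object is really an L.
static I* make_layer1() { return make_impl(); }
static I* make() { return make_layer1(); }

// Virtual call site that could be resolved to L::f once the type is known.
static int use(I& i) { return i.f(); }

int run() {
    std::unique_ptr<I> p(make());  // static type I*, dynamic type L
    return use(*p);                // candidate for devirtualization
}
```

(run() returns 2 whether or not the call is devirtualized; the optimization
only changes how the call is dispatched, not the result.)
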
> There was a surprising side-effect when I started experimenting with this
> ipa-passes feature. In a module that contains ~100KLOC, I implemented
> mega-compilation (a poor-man's LTO). At two passes, the module got larger,
> which I expected. This minor growth continued with each additional pass,
> until at about 7 passes when it decreased by over 10%. I set up a script to
> run overnight to incrementally try passes and record the module size, and
> the "sweet spot" ended up being 54 passes as far as size. I took the three
> smallest binaries and did a full performance regression at the system level,
> and the smallest binary's inclusion resulted in an ~6% performance
> improvement (measured as overall network I/O throughput) while using less
> CPU on a Transmeta Crusoe-based appliance. (This is a web proxy, with about
> 500KLOC of other code that was not compiled in this new way.)
>
> The idea of multiple passes resulting in a smaller binary and higher
> performance was like a dream. I reproduced a similar pattern on open source
> projects, namely scummvm (on which I was able to use proper LTO)*. That is,
> smaller binaries resulted as well as decreased CPU usage. On some projects,
> this could possibly be correlated with micro-level metrics such as
> fewer branch mispredictions and L1 cache misses as reported by callgrind.
>
> While it's possible/probable that some of the performance improvements I saw
> by increasing ipa-passes were ultimately due to missed-optimization bugs that
> should be fixed, I'd be very surprised if that were the case for *all* of
> them. As such, I would still like to see this exposed. I would be happy
> to file bugs and help test any instances where it looks like an optimization
> should have been achieved within a single ipa-pass.

I discussed the idea of iterating early optimizations briefly with Honza.
I was trying to step back a bit and look at what we try to do right now,
which is, optimize functions in topological order (thus, try to make sure
all callees are already early optimized when optimizing callers).  That
of course is difficult when there are cycles in the cgraph (for which we
basically start at a random place), especially when there would be
may-edges (which we do not have) for indirect/virtual calls, as they
basically make the whole cgraph cyclic.  So my idea was to make
cycle processing more explicit in early optimizations, and, whenever
we discover a new direct cgraph edge make sure we optimize the
callee, and whenever we optimized a callee queue all callers for
re-optimization.  You of course have to limit the number of times you
want to process a function, otherwise for a cycle, you'd optimize
indefinitely.  We already do the inlining itself repeatedly (via
--param early-inliner-max-iterations), though that only iterates
the inlining itself, allowing for "deep" inlining and some cases
of inlining of indirect calls if the inliner substituted a function
pointer parameter into a call of the inlined function.  Moving that
iteration over to iterating over the optimizations could make sense.
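
The queueing scheme above could be sketched roughly as follows.  This is a
toy model only, not GCC code: Node, iterate_early_opts, the discovers field
and max_visits are hypothetical names, and the actual early passes are
replaced by a visit counter.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Toy call-graph node (hypothetical, not GCC's cgraph_node).
struct Node {
    std::vector<int> callers;  // functions with a direct call to this one
    int discovers = -1;        // callee that becomes a direct call once this
                               // function is optimized (-1: none)
};

// Returns how often each function was (re)optimized.  'order' is the initial
// callee-before-caller order; 'max_visits' bounds the work done on cycles.
std::vector<int> iterate_early_opts(std::vector<Node> graph,
                                    const std::vector<int>& order,
                                    int max_visits) {
    std::vector<int> visits(graph.size(), 0);
    std::vector<char> queued(graph.size(), 0);
    std::deque<int> work;

    auto push = [&](int f) {
        if (!queued[f] && visits[f] < max_visits) {
            queued[f] = 1;
            work.push_back(f);
        }
    };

    for (std::size_t i = 0; i < order.size(); ++i)
        push(order[i]);

    while (!work.empty()) {
        int f = work.front();
        work.pop_front();
        queued[f] = 0;
        ++visits[f];  // stand-in for running the early passes on f

        // A new direct edge was discovered: record it and optimize the callee.
        int callee = graph[f].discovers;
        if (callee >= 0) {
            graph[f].discovers = -1;
            graph[callee].callers.push_back(f);
            push(callee);
        }

        // f was optimized, so queue its callers for re-optimization.
        for (std::size_t i = 0; i < graph[f].callers.size(); ++i)
            push(graph[f].callers[i]);
    }
    return visits;
}
```

In this model an acyclic chain processed callees-first is visited once per
function, a two-function cycle stops only at the max_visits cap, and a newly
discovered direct edge causes one extra round for the callee and its caller.
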

Thus, I'd really like to at least make iterating depend on some
profitability analysis, even if it is only based on cgraph analysis
such as 'we discovered a new direct edge'.

Richard.

>
> Thanks for helping to get this feature (and the other devirt-related pieces)
> into 4.7 -- it's been a huge boon to improving our C++ designs without
> sacrificing performance.
>
>
> * Note that scummvm's "sweet spot" number of iterations was different.
> That being said, the default of three iterations to make the typical use of
> Factory pattern devirtualize correctly still resulted in improved
> performance over a single pass -- just not necessarily a smaller binary.
>
>
>
> --
> tangled strands of DNA explain the way that I behave.
> http://www.clock.org/~matt
>