This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH] Add working-set size and hotness information to fdo summary (issue6465057)


On Mon, Aug 20, 2012 at 11:33 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> This is useful for large applications with a long tail. The
>> instruction working set for those applications is very large, and
>> the inliner and unroller need to be aware of that so that good
>> heuristics can be developed to throttle aggressive code-bloating
>> transformations. For the inliner, it acts somewhat like the global
>> budget, but is more application dependent. In the long run, we will
>> collect more advanced fdo summary data regarding the working set --
>> it will be the working set size for each code region (locality region).
>
> I see, so you use it to estimate the size of the working set and the effect of
> bloating optimizations on the cache footprint. This sounds interesting. What are
> your experiences with this?

Teresa has done some tunings for the unroller so far. The inliner
tuning is the next step.

>
> What concerns me is that it is greatly inaccurate - you have no idea how many
> instructions a given counter is guarding, and that can differ quite a lot. Also,
> inlining/optimization makes working sets significantly different (by a factor of
> 100 for tramp3d).

The pre-ipa-inline working set is the one needed for ipa inliner
tuning. For post-ipa-inline code-increase transformations, some
updating is probably needed.

> But on the other hand, any solution at this level will be
> greatly inaccurate. So I am curious: how reliable is the data you can get from this?
> How do you take it into account in the heuristics?

This effort is just the first step to allow good heuristics to develop.

>
> It seems to me that for this use, perhaps the simple logic in the histogram merging
> of maximizing the number of BBs for a given bucket will work well?  It is
> inaccurate, but we are working with greatly inaccurate data anyway.
> Except for degenerated cases, the small and unimportant runs will have small BB
> counts, while large runs will have larger counts and those are ones we optimize
> for anyway.

The working set curve for each type of application contains a lot of
information that can be mined. The inaccuracy can also be mitigated by
more data 'calibration'.

>>
>>
>> >  2) Do we plan to add some features in near future that will anyway require global locking?
>> >     I guess LIPO itself does not count since it streams its data into independent file as you
>> >     mentioned earlier and locking LIPO file is not that hard.
>> >     Does LIPO stream everything into that common file, or does it use combination of gcda files
>> >     and common summary?
>>
>> Actually, the LIPO module grouping information is stored in the gcda
>> files. It is also stored in a separate .imports file (one per object) ---
>> this is primarily used by our build system for dependence information.
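For illustration, an .imports file is essentially a list of the auxiliary modules grouped with a primary module, roughly like the following (the file name and module names here are hypothetical, not taken from an actual LIPO build):

```
bar.c
baz.c
```

A build system can treat this as a dependence edge list: rebuilding foo.o requires the listed auxiliary sources to be available.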
>
> I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave
> on a GCC bootstrap?

We have not tried a gcc bootstrap with LIPO. GCC compile time is not the
main problem for application builds -- the link time (for debug builds)
is.

> (i.e. it does a lot more work in the libgcov module per each
> invocation, so I am curious if it is practically useful at all).
>
> With an LTO-based solution a lot can probably be pushed to link time. Before
> the actual GCC is started from the linker plugin, the LIPO module can read the gcov
> CFGs from the gcda files and do all the merging/updating/CFG construction that is
> currently performed at runtime, right?

The dynamic cgraph build and analysis are still done at runtime.
However, with the new implementation, the FE is no longer involved. The
gcc driver is modified to understand module grouping, and LTO is used
to merge the streamed output from the aux modules.


David

>>
>>
>> >
>> >     What other stuff does Google plan to merge?
>> >     (In general I would be curious about merging plans WRT the profile stuff, so we get more
>> >     synchronized and effective at getting patches in. We have about two months to get it done
>> >     in stage1 and it would be nice to get as much in as possible. Obviously some of the patches will
>> >     need a bit of discussion, like this one. Hope you do not find it frustrating; I actually think
>> >     this is an important feature.)
>>
>> We plan to merge in the new LIPO implementation based on LTO
>> streaming. Rong Xu finished this in the 4.6-based compiler, and he
>> needs to port it to 4.8.
>
> Good.  Looks like a lot of work ahead. It would be nice if we could perhaps start
> by merging the libgcov infrastructure updates prior to the LIPO changes.  From
> what I saw on the LIPO branch some time ago, it has a lot of stuff that is not
> exactly LIPO specific.
>
> Honza
>>
>>
>> thanks,
>>
>> David
>>
>> >
>> > I also realized today that the common value counters (used by switch, indirect
>> > call and div/mod value profiling) are non-stable WRT different merging orders
>> > (i.e. parallel make in a train run).  I do not think there is an actual solution
>> > to that except for not merging counter sections of this type in libgcov and
>> > merging them in some canonical order at profile-feedback time.  Perhaps we just
>> > want to live with this, since the discrepancy here is small (i.e. these
>> > counters are quite rare and their outcome has just a local effect on the final
>> > binary, unlike the global summaries/edge counters).
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> >
>> >> > Honza
>> >> >>
>> >> >> -Andi
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

