This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: Optimize type streaming
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: Richard Biener <rguenther at suse dot de>, Dominique Dhumieres <dominiq at lps dot ens dot fr>, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Wed, 9 Jul 2014 12:48:43 +0200
- Subject: Re: Optimize type streaming
- Authentication-results: sourceware.org; auth=none
- References: <20140629111311 dot A2B96105 at mailhost dot lps dot ens dot fr> <20140629195303 dot GA14692 at kam dot mff dot cuni dot cz> <alpine dot LSU dot 2 dot 11 dot 1407071002000 dot 5753 at zhemvz dot fhfr dot qr> <20140708083414 dot GA27367 at kam dot mff dot cuni dot cz> <20140709085857 dot GA12462 at kam dot mff dot cuni dot cz>
On Wed, Jul 9, 2014 at 10:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hello,
> perhaps I could write bit more on my longer term plans. At the moment 30% of firefox WPA is taken
> by straming trees and another roughly 30% is taken by inliner. It is bit anoying but relatively
> easy to optimize inliner, but trees represent bigger problem.
>
> According to the stats average tree is streamed in 20times and according to perf we spend about 1/4th
> by unpacking the sections and then actual read of fields & SCC unification dominates.
This part can probably be speed up quite a bit by doing the SCC unification
before materializing the SCC, that is, doing the "on-disk" format compare idea.
The issue here is that for bigger SCCs that have hash collisions in their
entries we need to do the edge walk - but eventually having two paths, one
for SCCs with all distinct entry hashes doing the on-disk format compare
and one doing the compare after materialization is possible (at least we
could do some statistics). I hope to get to this after the Cauldron (but I
also have a half-way-done re-org on how we do the compression to save
memory that I want to finish at some point...)
> At low level,
> tree streaming is already pretty well optimized.
>
> I started to look into the following:
>
> 1) putting types&decls on diet
>
> I started to move individual fields into more fitting locations, getting rid of
> one field for many different reasons. I am trying to do this incrementally
> and keeping about one field per week flow. Currenlty I am stuck at:
>
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01969.html
>
> (moving DECL_ARGUMENTS). The plan is to
> - get rid of decl_with_vis. I removed all fields except for symbol table pointer
> (that will stay) and some flags I plan to handle soon - comdat, weak and visibility.
> The last one is harder because C++ FE uses it on type declarations, but it is almost done).
> The rest of flags (few variable/function specific items that has nothing
> to do with visibility) can go into decl_common where it is enough of space.
> - get rid of decl_non_common
> Here I need to move arguments and results. Have patches for both.
> - I plan to do the same on type side - decompose TYPE_NON_COMMON in favour of explicit type
> hiearchy.
> - experiment with getting rid of RTL pointer
>
> I plan to test moving DECL_RTL into on-side tables (one global for
> global RTLs and one local for per-functoin RTLs). This should get us closer moving RTL
> into per-function storage again and make RTL easier to reclaim.
> - Once done with these I can recast the inheritance to have DATA_TYPE and DATA_DECL
> that is common base of types/decls that do have data associated with them. Those can
> cary mode, sizes, alias info that is not needed for functions, labels, type declarations
> etc.
>
> I also wonder if we need canonical types for FUNCTION_TYPE/METHOD_TYPE and other thing
> that is not associated with readable data.
>
> This has bit of multiple inheritance issues (that I do not want to introduce),
> since we have decls with symbol table and decls with data. I think simple union
> for that single symtab pointer will do. In fact I already tested restricting
> DECL_SIZE&friends to decls with data, but there is a lot of frontend updating to do,
> as these fields are overriden for many of the FE declarations. (it is reason why I
> added FE machinery to allow custom memory storage for newly added ecls in the patch above)
>
> Naturally this is good from maintenance point of view, it has potential to reduce memory
> footprint, streaming size, improve mergeability of trees (if definition and external declarations
> looks the same in tree decls, we will merge more type variants, because currently we keep class types
> in two copies, one for unit definig them and other for units using them) and also avoid
> stremaing of stale pointers, but it is a slow progress and the direct benefits are limited.
Splitting variant types and main variants up also would be a big saver but
interesting from an inheritance perspective.
Note that trying to get somewhere with debug info and LTO would also make
a lot of things no longer necessary to stream. To summarize a recent IRC
discussion a working plan is to
- emit debug info for types and decls at the compile-stage, producing one
debug object per source TU (or emit the debug info into the (slim)
LTO objects
and play linker tricks later); add extra hidden symbols to be able to refer
to decls and record that symbols on-the-side (and stream them to
the LTO data)
- at WPA / LTRANS phase materialize the debug symbol references in
decl-lang-specific
- at LTRANS phase emit debug info for the actual functions refering to the
already output declarations via abstract origins. For that to work we need
to DW_tag_import their containers. An LTRANS file creates
a single DW_tag_compilation_unit (all other units are partially
"inlined" into it).
- we link the LTRANS objects and the per-TU debug objects in the final link.
that should get us to the point where most of the langhook issues in dwarf2out.c
for LTO are gone (most of, not all I fear ...). And it should enable
us to avoid
streaming type details (in theory we only need to preserve the type structure,
a bit more in detail than we do for canonical type merging).
Note that all this applies to non-LTO as well - if we emit the debug info
before gimplification we can finally introduce sth like a "gimple" type.
> 2) put BINFOs on diet
>
> BINFOs are currently added to every class type. We can drop them in case they do
> not hold useful information for devirtualization neither debug info. This is now
> quite well defined. Main offender is ipa-prop that still uses get_binfo_at_offset
> and walks binfos it should not. I am working on it.
>
> 3) ODR type merging
>
> I have patches for this, but want to go bit curefuly - I need to discuss with Jason
> the anonymous types and get code for checking ODR violations working well.
>
> Basically for ODR types I can merge variant lists that results in leaner debug info
> and bit less of streaming WPA->ltrans
> It is also important for type propagation and I have prototype to handle canonical types
> of ODR and anonymous types specially.
>
> This actually increases LTO stream sizes (uncompressed) by about 6% to stream explicit
> mangled names. My 4.10 with the patch is still faster than 4.9 but definitely would be
> happier if there was easier way around
I think it would be better to do the debug stuff first to see what we
really need.
> 4) Reduce size of LTO streams
>
> This is what I was shooting for with the variant streaming (in addition to have sanity checker
> for 3 as bugs in these may turn types into a crazy soup quite easily).
> Types and decls are most common things to stream, 50% of types are variants, so not streaming
> duplicated data in variants has chance to save about 30-40% of type storage.
> Decls inherits some stuff from types (99% of time), like DECL_SIZE and friends.
Yeah, I can see that. Though I'd rather organize the streaming in a way
the tree hierarchy ideally would look like. That is, stream a new
LTO_type_variant kind, output a reference to the main variant and what
is changed, for example as n_changes, { change1, chage2, ... }, for example
1, QUAL_CONST for a const qualified vairant.
Rather than adding all those noisy checks all over the place.
The above is also easier to compare in the future on-disk compare stage.
> In my tests I went from compression ration over 3 to 2.1 keeping about the same gzipped
> data - so this speeds up unpacking & rebuilding trees, since direct copies are faster than
> LTO streamer table lookups.
>
> 5) Avoid merging of unmergeable things
>
> This is the patch that drops hashtable to 1 for things where we know we do not want to merge.
> This is needed for correctness of ODR types and it also improves compression ration of the
> streams as SCC hashes are hard to gzip.
Yeah, that's indeed a good improvement we should get back to (finally
compute mergeability and indexability in a sane and consistent way)
> 6) Put variable initializers into named sections (as function bodies)
>
> This is supposed to help vtables, but I am always too lazy to dive into details of our
> ugly low level section API.
That would be indeed nice.
> 7) Improve streaming of locations, as discussed several times. Again I am bit discouraged
> but need to make extra section etc. Location lookup still shows high in the profile.
Likewise nice.
Any chance somebody would start on that debug info thing? I guess we can
sit down at the cauldron for that.
Richard.
> So some of my immediate plans.
> Honza