This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Optimize type streaming


On July 9, 2014 2:36:57 PM CEST, Jan Hubicka <hubicka@ucw.cz> wrote:
>> This part can probably be speed up quite a bit by doing the SCC
>unification
>> before materializing the SCC, that is, doing the "on-disk" format
>compare idea.
>> The issue here is that for bigger SCCs that have hash collisions in
>their
>> entries we need to do the edge walk - but eventually having two
>paths, one
>> for SCCs with all distinct entry hashes doing the on-disk format
>compare
>> and one doing the compare after materialization is possible (at least
>we
>> could do some statistics).  I hope to get to this after the Cauldron
>(but I
>> also have a half-way-done re-org on how we do the compression to save
>> memory that I want to finish at some point...)
>
>Cool,
>I was actually running into issues with the hashing after increasing
>SCC size
>(by straming DECL_METHODS lists for classes). I noticed that the
>quadratic loop
>in hash_scc is actually the reason why iterative_hash shows high in
>profiles.
>I do not see why it is needed: we compute hashes of all trees in the
>SCC
>and then we can identify the SCC hash on it.  To have unique enoug
>hashes of
>individual trees to propagate further, I think we only need to hash the
>SCC hash
>with the hash of given element again.
>
>The patch bellow does that and shots iterative_hash off my profiles
>while
>apparently not breaking anything (i.e. not increasing conflicts).

Yeah, u think it shouldn't be worse.

>It seems to me that main problem with duplicated hash entires within
>SCC is the
>fact that when we compute hashes, all pointers within SCC have has code
>0.  This
>means that we end up making no difference i.e. between two identically
>looking
>variatns of two different types that are part of the SCC.
>Or two constants of two different type.

Yes.

>I think if we want to get more serious about getting rid of hash value
>conflicts
>we can just feed the streamer cache with hashes and re-hash.  This will
>propagate
>differences across SCC and we can iterate until this stabilize.

Are we sure it ever stabilizes? But yes, I had something like this in mind (just do one iteration always) in case we need to improve hashing.

Patch is OK if you keep the sorting part of the comment, which is the important part if all this - hash values have to be independent of SCC entry chosen.

Richard.

>Index: lto-streamer-out.c
>===================================================================
>--- lto-streamer-out.c	(revision 212351)
>+++ lto-streamer-out.c	(working copy)
>@@ -1129,34 +1180,17 @@ hash_scc (struct streamer_tree_cache_d *
>   if (size == 1)
>     return sccstack[first].hash;
> 
>-  /* Sort the SCC of type, hash pairs so that when we mix in
>-     all members of the SCC the hash value becomes independent on
>-     the order we visited the SCC.  Disregard hashes equal to
>-     the hash of the tree we mix into because we cannot guarantee
>-     a stable sort for those across different TUs.  */
>+  /* Produce hash of the whole SCC as combination of hashes of
>individual elements.
>+     Then combine that hash into hash of each element, so othewise
>identically looking
>+     elements from two different SCCs are distinguished.  */
> qsort (&sccstack[first], size, sizeof (scc_entry), scc_entry_compare);
>-  hashval_t *tem = XALLOCAVEC (hashval_t, size);
>-  for (unsigned i = 0; i < size; ++i)
>-    {
>-      hashval_t hash = sccstack[first+i].hash;
>-      hashval_t orig_hash = hash;
>-      unsigned j;
>-      /* Skip same hashes.  */
>-      for (j = i + 1;
>-	   j < size && sccstack[first+j].hash == orig_hash; ++j)
>-	;
>-      for (; j < size; ++j)
>-	hash = iterative_hash_hashval_t (sccstack[first+j].hash, hash);
>-      for (j = 0; sccstack[first+j].hash != orig_hash; ++j)
>-	hash = iterative_hash_hashval_t (sccstack[first+j].hash, hash);
>-      tem[i] = hash;
>-    }
>-  hashval_t scc_hash = 0;
>+
>+  hashval_t scc_hash = sccstack[first].hash;
>+  for (unsigned i = 1; i < size; ++i)
>+    scc_hash = iterative_hash_hashval_t (scc_hash,
>+					 sccstack[first+i].hash);
>   for (unsigned i = 0; i < size; ++i)
>-    {
>-      sccstack[first+i].hash = tem[i];
>-      scc_hash = iterative_hash_hashval_t (tem[i], scc_hash);
>-    }
>+    sccstack[first+i].hash = iterative_hash_hashval_t
>(sccstack[first+i].hash, scc_hash);
>   return scc_hash;
> }
> 
>> 
>> >  At low level,
>> > tree streaming is already pretty well optimized.
>> >
>> > I started to look into the following:
>> >
>> > 1) putting types&decls on diet
>> >
>> >    I started to move individual fields into more fitting locations,
>getting rid of
>> >    one field for many different reasons.  I am trying to do this
>incrementally
>> >    and keeping about one field per week flow. Currenlty I am stuck
>at:
>> >
>> >    https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01969.html
>> >
>> >    (moving DECL_ARGUMENTS). The plan is to
>> >    - get rid of decl_with_vis.  I removed all fields except for
>symbol table pointer
>> >      (that will stay) and some flags I plan to handle soon -
>comdat, weak and visibility.
>> >      The last one is harder because C++ FE uses it on type
>declarations, but it is almost done).
>> >      The rest of flags (few variable/function specific items that
>has nothing
>> >      to do with visibility) can go into decl_common where it is
>enough of space.
>> >    - get rid of decl_non_common
>> >      Here I need to move arguments and results. Have patches for
>both.
>> >    - I plan to do the same on type side - decompose TYPE_NON_COMMON
>in favour of explicit type
>> >      hiearchy.
>> >    - experiment with getting rid of RTL pointer
>> >
>> >      I plan to test moving DECL_RTL into on-side tables (one global
>for
>> >      global RTLs and one local for per-functoin RTLs). This should
>get us closer moving RTL
>> >      into per-function storage again and make RTL easier to
>reclaim.
>> >    - Once done with these I can recast the inheritance to have
>DATA_TYPE and DATA_DECL
>> >      that is common base of types/decls that do have data
>associated with them.  Those can
>> >      cary mode, sizes, alias info that is not needed for functions,
>labels, type declarations
>> >      etc.
>> >
>> >      I also wonder if we need canonical types for
>FUNCTION_TYPE/METHOD_TYPE and other thing
>> >      that is not associated with readable data.
>> >
>> >      This has bit of multiple inheritance issues (that I do not
>want to introduce),
>> >      since we have decls with symbol table and decls with data.  I
>think simple union
>> >      for that single symtab pointer will do.  In fact I already
>tested restricting
>> >      DECL_SIZE&friends to decls with data, but there is a lot of
>frontend updating to do,
>> >      as these fields are overriden for many of the FE declarations.
> (it is reason why I
>> >      added FE machinery to allow custom memory storage for newly
>added ecls in the patch above)
>> >
>> >    Naturally this is good from maintenance point of view, it has
>potential to reduce memory
>> >    footprint, streaming size, improve mergeability of trees (if
>definition and external declarations
>> >    looks the same in tree decls, we will merge more type variants,
>because currently we keep class types
>> >    in two copies, one for unit definig them and other for units
>using them) and also avoid
>> >    stremaing of stale pointers, but it is a slow progress and the
>direct benefits are limited.
>> 
>> Splitting variant types and main variants up also would be a big
>saver but
>> interesting from an inheritance perspective.
>> 
>> Note that trying to get somewhere with debug info and LTO would also
>make
>> a lot of things no longer necessary to stream.  To summarize a recent
>IRC
>> discussion a working plan is to
>> 
>>  - emit debug info for types and decls at the compile-stage,
>producing one
>>    debug object per source TU (or emit the debug info into the (slim)
>> LTO objects
>>    and play linker tricks later); add extra hidden symbols to be able
>to refer
>>    to decls and record that symbols on-the-side (and stream them to
>> the LTO data)
>> 
>>  - at WPA / LTRANS phase materialize the debug symbol references in
>>    decl-lang-specific
>> 
>>  - at LTRANS phase emit debug info for the actual functions refering
>to the
>>    already output declarations via abstract origins.  For that to
>work we need
>>    to DW_tag_import their containers.  An LTRANS file creates
>>    a single DW_tag_compilation_unit (all other units are partially
>> "inlined" into it).
>> 
>>  - we link the LTRANS objects and the per-TU debug objects in the
>final link.
>> 
>> that should get us to the point where most of the langhook issues in
>dwarf2out.c
>> for LTO are gone (most of, not all I fear ...).  And it should enable
>> us to avoid
>> streaming type details (in theory we only need to preserve the type
>structure,
>> a bit more in detail than we do for canonical type merging).
>> 
>> Note that all this applies to non-LTO as well - if we emit the debug
>info
>> before gimplification we can finally introduce sth like a "gimple"
>type.
>> 
>> > 2) put BINFOs on diet
>> >
>> >    BINFOs are currently added to every class type.  We can drop
>them in case they do
>> >    not hold useful information for devirtualization neither debug
>info.  This is now
>> >    quite well defined.  Main offender is ipa-prop that still uses
>get_binfo_at_offset
>> >    and walks binfos it should not.  I am working on it.
>> >
>> > 3) ODR type merging
>> >
>> >    I have patches for this, but want to go bit curefuly - I need to
>discuss with Jason
>> >    the anonymous types and get code for checking ODR violations
>working well.
>> >
>> >    Basically for ODR types I can merge variant lists that results
>in leaner debug info
>> >    and bit less of streaming WPA->ltrans
>> >    It is also important for type propagation and I have prototype
>to handle canonical types
>> >    of ODR and anonymous types specially.
>> >
>> >    This actually increases LTO stream sizes (uncompressed) by about
>6% to stream explicit
>> >    mangled names.  My 4.10 with the patch is still faster than 4.9
>but definitely would be
>> >    happier if there was easier way around
>> 
>> I think it would be better to do the debug stuff first to see what we
>> really need.
>
>Well, I am not shooting only for debug info unification, but for ODR
>violation
>warkings, better TBAA, devirtualization and less streaming from WPA to
>ltrans.
>LLVM implements kind of early debug info and does use ODR names for
>unification BTW.
>(and have its own set of memory usage/streaming speed issues for LTO,
>too)
>
>I will push out the minimal patches for this, so we get better idea
>where this leads.
>> 
>> > 4) Reduce size of LTO streams
>> >
>> >    This is what I was shooting for with the variant streaming (in
>addition to have sanity checker
>> >    for 3 as bugs in these may turn types into a crazy soup quite
>easily).
>> >    Types and decls are most common things to stream, 50% of types
>are variants, so not streaming
>> >    duplicated data in variants has chance to save about 30-40% of
>type storage.
>> >    Decls inherits some stuff from types (99% of time), like
>DECL_SIZE and friends.
>> 
>> Yeah, I can see that.  Though I'd rather organize the streaming in a
>way
>> the tree hierarchy ideally would look like.  That is, stream a new
>> LTO_type_variant kind, output a reference to the main variant and
>what
>> is changed, for example as n_changes, { change1, chage2, ... }, for
>example
>> 1, QUAL_CONST for a const qualified vairant.
>
>This is kind of what I do.  Type variants gets a bitpack that
>represents what
>changed.  There are not that many things that.
>> 
>> Rather than adding all those noisy checks all over the place.
>
>I moved the verification bits into type verifier, will push out patch
>for that oo.
>> 
>> The above is also easier to compare in the future on-disk compare
>stage.
>
>Still do not get where is the difference.  We stream reference to the
>main variant + differences.
>Types are same only if main variant is same and differences are the
>same.  Nothing seems to change
>for on-disk comparing.
>> 
>> >    In my tests I went from compression ration over 3 to 2.1 keeping
>about the same gzipped
>> >    data - so this speeds up unpacking & rebuilding trees, since
>direct copies are faster than
>> >    LTO streamer table lookups.
>> >
>> > 5) Avoid merging of unmergeable things
>> >
>> >    This is the patch that drops hashtable to 1 for things where we
>know we do not want to merge.
>> >    This is needed for correctness of ODR types and it also improves
>compression ration of the
>> >    streams as SCC hashes are hard to gzip.
>> 
>> Yeah, that's indeed a good improvement we should get back to (finally
>> compute mergeability and indexability in a sane and consistent way)
>> 
>> > 6) Put variable initializers into named sections (as function
>bodies)
>> >
>> >    This is supposed to help vtables, but I am always too lazy to
>dive into details of our
>> >    ugly low level section API.
>> 
>> That would be indeed nice.
>> 
>> > 7) Improve streaming of locations, as discussed several times. 
>Again I am bit discouraged
>> >    but need to make extra section etc.  Location lookup still shows
>high in the profile.
>> 
>> Likewise nice.
>> 
>> Any chance somebody would start on that debug info thing?  I guess we
>can
>> sit down at the cauldron for that.
>
>Definitely good topic to discuss.  I am not 100% sure how much time I
>can get into this - the symtab
>redesign and other things are already consuming a lot of time...
>Debug info and command line options issues indeed is something we ought
>to solve soon...
>
>Honza
>> 
>> Richard.
>> 
>> > So some of my immediate plans.
>> > Honza



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]