This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Optimize type streaming


> This part can probably be speed up quite a bit by doing the SCC unification
> before materializing the SCC, that is, doing the "on-disk" format compare idea.
> The issue here is that for bigger SCCs that have hash collisions in their
> entries we need to do the edge walk - but eventually having two paths, one
> for SCCs with all distinct entry hashes doing the on-disk format compare
> and one doing the compare after materialization is possible (at least we
> could do some statistics).  I hope to get to this after the Cauldron (but I
> also have a half-way-done re-org on how we do the compression to save
> memory that I want to finish at some point...)

Cool,
I was actually running into issues with the hashing after increasing SCC size
(by straming DECL_METHODS lists for classes). I noticed that the quadratic loop
in hash_scc is actually the reason why iterative_hash shows high in profiles.
I do not see why it is needed: we compute hashes of all trees in the SCC
and then we can identify the SCC hash on it.  To have unique enoug hashes of
individual trees to propagate further, I think we only need to hash the SCC hash
with the hash of given element again.

The patch bellow does that and shots iterative_hash off my profiles while
apparently not breaking anything (i.e. not increasing conflicts).

It seems to me that main problem with duplicated hash entires within SCC is the
fact that when we compute hashes, all pointers within SCC have has code 0.  This
means that we end up making no difference i.e. between two identically looking
variatns of two different types that are part of the SCC.
Or two constants of two different type.

I think if we want to get more serious about getting rid of hash value conflicts
we can just feed the streamer cache with hashes and re-hash.  This will propagate
differences across SCC and we can iterate until this stabilize.

Index: lto-streamer-out.c
===================================================================
--- lto-streamer-out.c	(revision 212351)
+++ lto-streamer-out.c	(working copy)
@@ -1129,34 +1180,17 @@ hash_scc (struct streamer_tree_cache_d *
   if (size == 1)
     return sccstack[first].hash;
 
-  /* Sort the SCC of type, hash pairs so that when we mix in
-     all members of the SCC the hash value becomes independent on
-     the order we visited the SCC.  Disregard hashes equal to
-     the hash of the tree we mix into because we cannot guarantee
-     a stable sort for those across different TUs.  */
+  /* Produce hash of the whole SCC as combination of hashes of individual elements.
+     Then combine that hash into hash of each element, so othewise identically looking
+     elements from two different SCCs are distinguished.  */
   qsort (&sccstack[first], size, sizeof (scc_entry), scc_entry_compare);
-  hashval_t *tem = XALLOCAVEC (hashval_t, size);
-  for (unsigned i = 0; i < size; ++i)
-    {
-      hashval_t hash = sccstack[first+i].hash;
-      hashval_t orig_hash = hash;
-      unsigned j;
-      /* Skip same hashes.  */
-      for (j = i + 1;
-	   j < size && sccstack[first+j].hash == orig_hash; ++j)
-	;
-      for (; j < size; ++j)
-	hash = iterative_hash_hashval_t (sccstack[first+j].hash, hash);
-      for (j = 0; sccstack[first+j].hash != orig_hash; ++j)
-	hash = iterative_hash_hashval_t (sccstack[first+j].hash, hash);
-      tem[i] = hash;
-    }
-  hashval_t scc_hash = 0;
+
+  hashval_t scc_hash = sccstack[first].hash;
+  for (unsigned i = 1; i < size; ++i)
+    scc_hash = iterative_hash_hashval_t (scc_hash,
+					 sccstack[first+i].hash);
   for (unsigned i = 0; i < size; ++i)
-    {
-      sccstack[first+i].hash = tem[i];
-      scc_hash = iterative_hash_hashval_t (tem[i], scc_hash);
-    }
+    sccstack[first+i].hash = iterative_hash_hashval_t (sccstack[first+i].hash, scc_hash);
   return scc_hash;
 }
 
> 
> >  At low level,
> > tree streaming is already pretty well optimized.
> >
> > I started to look into the following:
> >
> > 1) putting types&decls on diet
> >
> >    I started to move individual fields into more fitting locations, getting rid of
> >    one field for many different reasons.  I am trying to do this incrementally
> >    and keeping about one field per week flow. Currenlty I am stuck at:
> >
> >    https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01969.html
> >
> >    (moving DECL_ARGUMENTS). The plan is to
> >    - get rid of decl_with_vis.  I removed all fields except for symbol table pointer
> >      (that will stay) and some flags I plan to handle soon - comdat, weak and visibility.
> >      The last one is harder because C++ FE uses it on type declarations, but it is almost done).
> >      The rest of flags (few variable/function specific items that has nothing
> >      to do with visibility) can go into decl_common where it is enough of space.
> >    - get rid of decl_non_common
> >      Here I need to move arguments and results. Have patches for both.
> >    - I plan to do the same on type side - decompose TYPE_NON_COMMON in favour of explicit type
> >      hiearchy.
> >    - experiment with getting rid of RTL pointer
> >
> >      I plan to test moving DECL_RTL into on-side tables (one global for
> >      global RTLs and one local for per-functoin RTLs). This should get us closer moving RTL
> >      into per-function storage again and make RTL easier to reclaim.
> >    - Once done with these I can recast the inheritance to have DATA_TYPE and DATA_DECL
> >      that is common base of types/decls that do have data associated with them.  Those can
> >      cary mode, sizes, alias info that is not needed for functions, labels, type declarations
> >      etc.
> >
> >      I also wonder if we need canonical types for FUNCTION_TYPE/METHOD_TYPE and other thing
> >      that is not associated with readable data.
> >
> >      This has bit of multiple inheritance issues (that I do not want to introduce),
> >      since we have decls with symbol table and decls with data.  I think simple union
> >      for that single symtab pointer will do.  In fact I already tested restricting
> >      DECL_SIZE&friends to decls with data, but there is a lot of frontend updating to do,
> >      as these fields are overriden for many of the FE declarations.  (it is reason why I
> >      added FE machinery to allow custom memory storage for newly added ecls in the patch above)
> >
> >    Naturally this is good from maintenance point of view, it has potential to reduce memory
> >    footprint, streaming size, improve mergeability of trees (if definition and external declarations
> >    looks the same in tree decls, we will merge more type variants, because currently we keep class types
> >    in two copies, one for unit definig them and other for units using them) and also avoid
> >    stremaing of stale pointers, but it is a slow progress and the direct benefits are limited.
> 
> Splitting variant types and main variants up also would be a big saver but
> interesting from an inheritance perspective.
> 
> Note that trying to get somewhere with debug info and LTO would also make
> a lot of things no longer necessary to stream.  To summarize a recent IRC
> discussion a working plan is to
> 
>  - emit debug info for types and decls at the compile-stage, producing one
>    debug object per source TU (or emit the debug info into the (slim)
> LTO objects
>    and play linker tricks later); add extra hidden symbols to be able to refer
>    to decls and record that symbols on-the-side (and stream them to
> the LTO data)
> 
>  - at WPA / LTRANS phase materialize the debug symbol references in
>    decl-lang-specific
> 
>  - at LTRANS phase emit debug info for the actual functions refering to the
>    already output declarations via abstract origins.  For that to work we need
>    to DW_tag_import their containers.  An LTRANS file creates
>    a single DW_tag_compilation_unit (all other units are partially
> "inlined" into it).
> 
>  - we link the LTRANS objects and the per-TU debug objects in the final link.
> 
> that should get us to the point where most of the langhook issues in dwarf2out.c
> for LTO are gone (most of, not all I fear ...).  And it should enable
> us to avoid
> streaming type details (in theory we only need to preserve the type structure,
> a bit more in detail than we do for canonical type merging).
> 
> Note that all this applies to non-LTO as well - if we emit the debug info
> before gimplification we can finally introduce sth like a "gimple" type.
> 
> > 2) put BINFOs on diet
> >
> >    BINFOs are currently added to every class type.  We can drop them in case they do
> >    not hold useful information for devirtualization neither debug info.  This is now
> >    quite well defined.  Main offender is ipa-prop that still uses get_binfo_at_offset
> >    and walks binfos it should not.  I am working on it.
> >
> > 3) ODR type merging
> >
> >    I have patches for this, but want to go bit curefuly - I need to discuss with Jason
> >    the anonymous types and get code for checking ODR violations working well.
> >
> >    Basically for ODR types I can merge variant lists that results in leaner debug info
> >    and bit less of streaming WPA->ltrans
> >    It is also important for type propagation and I have prototype to handle canonical types
> >    of ODR and anonymous types specially.
> >
> >    This actually increases LTO stream sizes (uncompressed) by about 6% to stream explicit
> >    mangled names.  My 4.10 with the patch is still faster than 4.9 but definitely would be
> >    happier if there was easier way around
> 
> I think it would be better to do the debug stuff first to see what we
> really need.

Well, I am not shooting only for debug info unification, but for ODR violation
warkings, better TBAA, devirtualization and less streaming from WPA to ltrans.
LLVM implements kind of early debug info and does use ODR names for unification BTW.
(and have its own set of memory usage/streaming speed issues for LTO, too)

I will push out the minimal patches for this, so we get better idea where this leads.
> 
> > 4) Reduce size of LTO streams
> >
> >    This is what I was shooting for with the variant streaming (in addition to have sanity checker
> >    for 3 as bugs in these may turn types into a crazy soup quite easily).
> >    Types and decls are most common things to stream, 50% of types are variants, so not streaming
> >    duplicated data in variants has chance to save about 30-40% of type storage.
> >    Decls inherits some stuff from types (99% of time), like DECL_SIZE and friends.
> 
> Yeah, I can see that.  Though I'd rather organize the streaming in a way
> the tree hierarchy ideally would look like.  That is, stream a new
> LTO_type_variant kind, output a reference to the main variant and what
> is changed, for example as n_changes, { change1, chage2, ... }, for example
> 1, QUAL_CONST for a const qualified vairant.

This is kind of what I do.  Type variants gets a bitpack that represents what
changed.  There are not that many things that.
> 
> Rather than adding all those noisy checks all over the place.

I moved the verification bits into type verifier, will push out patch for that oo.
> 
> The above is also easier to compare in the future on-disk compare stage.

Still do not get where is the difference.  We stream reference to the main variant + differences.
Types are same only if main variant is same and differences are the same.  Nothing seems to change
for on-disk comparing.
> 
> >    In my tests I went from compression ration over 3 to 2.1 keeping about the same gzipped
> >    data - so this speeds up unpacking & rebuilding trees, since direct copies are faster than
> >    LTO streamer table lookups.
> >
> > 5) Avoid merging of unmergeable things
> >
> >    This is the patch that drops hashtable to 1 for things where we know we do not want to merge.
> >    This is needed for correctness of ODR types and it also improves compression ration of the
> >    streams as SCC hashes are hard to gzip.
> 
> Yeah, that's indeed a good improvement we should get back to (finally
> compute mergeability and indexability in a sane and consistent way)
> 
> > 6) Put variable initializers into named sections (as function bodies)
> >
> >    This is supposed to help vtables, but I am always too lazy to dive into details of our
> >    ugly low level section API.
> 
> That would be indeed nice.
> 
> > 7) Improve streaming of locations, as discussed several times.  Again I am bit discouraged
> >    but need to make extra section etc.  Location lookup still shows high in the profile.
> 
> Likewise nice.
> 
> Any chance somebody would start on that debug info thing?  I guess we can
> sit down at the cauldron for that.

Definitely good topic to discuss.  I am not 100% sure how much time I can get into this - the symtab
redesign and other things are already consuming a lot of time...
Debug info and command line options issues indeed is something we ought to solve soon...

Honza
> 
> Richard.
> 
> > So some of my immediate plans.
> > Honza


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]