[PATCH][RFC] Re-write LTO type merging again, do tree merging

Richard Biener rguenther@suse.de
Thu Jun 13 10:08:00 GMT 2013


On Wed, 12 Jun 2013, Richard Biener wrote:

> 
> The following patch re-writes LTO type merging completely with the
> goal to move as much work as possible to the compile step, away
> from WPA time.  At the same time the merging itself gets very
> conservative but also more general - it now merges arbitrary trees,
> not only types, but only if they are bit-identical and have the
> same outgoing tree pointers.
> 
> Especially the latter means that we now have to merge SCCs of trees
> together and either take the whole SCC as prevailing or throw it
> away.  Moving work to the compile step means that we compute
> SCCs and their hashes there, re-organizing streaming to stream
> tree bodies as SCC blocks together with the computed hash.
> 
> When we ask the streamer to output a tree T then it now has
> to DFS walk all tree pointers, collecting SCCs of not yet
> streamed trees and output them like the following:
> 
>  { LTO_tree_scc, N, hash, entry_len,
>    { header1, header2, ... headerN },
>    { bits1, refs1, bits2, refs2, ... bitsN, refsN } }
>  { LTO_tree_scc, 1, hash, header, bits, refs }
>  { LTO_tree_scc, M, hash, entry_len,
>    { header1, header2, ... headerM },
>    { bits1, refs1, bits2, refs2, ... bitsM, refsM } }
>  LTO_tree_pickle_reference to T
> 
> with tree references in refsN always being LTO_tree_pickle_references
> instead of starting a new tree inline.  That results in at most
> N extra LTO_tree_pickle_references for N streamed trees.  Together
> with the LTO_tree_scc wrapping overhead this causes a slight
> increase in LTO object size (around 10% last time I measured, which
> was before some additional optimization went in).
> 
> The overhead also happens on the LTRANS file producing side
> which now has to do the DFS walk and stream the extra data.
> It doesn't do the hashing though as on the LTRANS consumer
> side no merging is performed.
> 
> The patch preserves the core of the old merging code to compare
> with the new code and output some statistics.  That means that
> if you build with -flto-report[-wpa] you get an additional
> compile-time and memory overhead.
> 
> For reference here are the stats when LTO bootstrapping for
> stage2 cc1:
> 
> WPA statistics
> [WPA] read 2494507 SCCs of average size 2.380067
> [WPA] 5937095 tree bodies read in total
> [WPA] tree SCC table: size 524287, 286280 elements, collision ratio: 0.806376
> [WPA] tree SCC max chain length 11 (size 1)
> [WPA] Compared 403361 SCCs, 6226 collisions (0.015435)
> [WPA] Merged 399980 SCCs
> [WPA] Merged 2438250 tree bodies
> [WPA] Merged 192475 types
> [WPA] 195422 types prevailed
> [WPA] Old merging code merges an additional 54582 types of which 21083 are in the same SCC with their prevailing variant
> 
> This says that we've streamed in 5937095 tree bodies in
> 2494507 SCCs (so the average SCC size is small), of those
> we were able to immediately ggc_free 399980 SCCs because they
> already existed in identical form (16% of the SCCs, 41% of the trees
> and 49% of the types).  The old merging code forced the merge
> of an additional 54582 types (but 21083 of them it merged with
> a type that is in the same SCC, that is, it changed the shape
> of the SCC and collapsed parts of it - something that is
> suspicious).
> 
> The patch was LTO bootstrapped (testing currently running) on
> x86_64-unknown-linux-gnu and I've built SPEC2k6 with -Ofast -g -flto
> and did a test run of the binaries, which shows that
> 471.omnetpp, 483.xalancbmk and 447.dealII currently fail
> (471.omnetpp segfaults in __cxxabiv1::__dynamic_cast).  These
> failures were introduced quite recently, likely due to the improved
> FUNCTION_DECL and VAR_DECL merging and the cgraph fixup Honza did.

The following incremental patch fixes that.

Index: trunk/gcc/lto-symtab.c
===================================================================
--- trunk.orig/gcc/lto-symtab.c 2013-06-12 16:47:38.000000000 +0200
+++ trunk/gcc/lto-symtab.c      2013-06-12 17:00:12.664126423 +0200
@@ -96,9 +96,6 @@ lto_varpool_replace_node (struct varpool
 
   ipa_clone_referring ((symtab_node)prevailing_node, &vnode->symbol.ref_list);
 
-  /* Be sure we can garbage collect the initializer.  */
-  if (DECL_INITIAL (vnode->symbol.decl))
-    DECL_INITIAL (vnode->symbol.decl) = error_mark_node;
   /* Finally remove the replaced node.  */
   varpool_remove_node (vnode);
 }
Index: trunk/gcc/varpool.c
===================================================================
--- trunk.orig/gcc/varpool.c    2013-06-12 13:13:06.000000000 +0200
+++ trunk/gcc/varpool.c 2013-06-12 17:01:46.088248807 +0200
@@ -77,15 +77,8 @@ varpool_remove_node (struct varpool_node
 
 /* Renove node initializer when it is no longer needed.  */
 void
-varpool_remove_initializer (struct varpool_node *node)
+varpool_remove_initializer (struct varpool_node *)
 {
-  if (DECL_INITIAL (node->symbol.decl)
-      && !DECL_IN_CONSTANT_POOL (node->symbol.decl)
-      /* Keep vtables for BINFO folding.  */
-      && !DECL_VIRTUAL_P (node->symbol.decl)
-      /* FIXME: http://gcc.gnu.org/PR55395 */
-      && debug_info_level == DINFO_LEVEL_NONE)
-    DECL_INITIAL (node->symbol.decl) = error_mark_node;
 }
 
 /* Dump given cgraph node.  */


Here are some updated numbers on cc1 disk, memory and compile-time
usage when built in stage3 with LTO and release checking:

Input object file size unpatched: 324509482 bytes
Input object file size patched:   373225406 bytes (115%)

WPA time/maxrss unpatched:

10.83user 0.67system 0:11.53elapsed 99%CPU (0avgtext+0avgdata 575108maxresident)k
0inputs+778264outputs (0major+397754minor)pagefaults 0swaps

 ipa lto decl in         :   3.39 (32%) usr
 ipa lto decl out        :   3.88 (36%) usr

WPA time/maxrss patched:

18.35user 2.12system 0:20.56elapsed 99%CPU (0avgtext+0avgdata 648800maxresident)k
16inputs+1263088outputs (0major+606347minor)pagefaults 0swaps

 ipa lto decl in         :   3.30 (18%) usr
 ipa lto decl out        :  11.46 (63%) usr

LTRANS file size unpatched: 398407935 bytes
LTRANS file size patched:   645106888 bytes (162%)

Unpatched WPA statistics:

[WPA] GIMPLE type table: size 262139, 134158 elements, 365200 searches, 427507 collisions (ratio: 1.170611)
[WPA] GIMPLE type hash cache table: size 524287, 200149 elements, 4042887 searches, 3624678 collisions (ratio: 0.896557)
[WPA] Merged 229106 types
[WPA] GIMPLE canonical type table: size 16381, 6221 elements, 134279 searches, 4370 collisions (ratio: 0.032544)
[WPA] GIMPLE canonical type hash table: size 262139, 134223 elements, 574837 searches, 506302 collisions (ratio: 0.880775)
[WPA] # of input files: 422
...
[WPA] Size of mmap'd section decls: 36618988 bytes
[WPA] Size of mmap'd section function_body: 40566108 bytes
[WPA] Size of mmap'd section statics: 0 bytes
[WPA] Size of mmap'd section symtab: 0 bytes
[WPA] Size of mmap'd section refs: 138315 bytes
[WPA] Size of mmap'd section asm: 0 bytes
[WPA] Size of mmap'd section jmpfuncs: 665899 bytes
[WPA] Size of mmap'd section pureconst: 59901 bytes
[WPA] Size of mmap'd section reference: 0 bytes
[WPA] Size of mmap'd section profile: 8845 bytes
[WPA] Size of mmap'd section symbol_nodes: 1144120 bytes
[WPA] Size of mmap'd section opts: 0 bytes
[WPA] Size of mmap'd section cgraphopt: 0 bytes
[WPA] Size of mmap'd section inline: 844004 bytes
[WPA] Size of mmap'd section ipcp_trans: 0 bytes

Patched WPA statistics:

[WPA] read 2386084 SCCs of average size 2.421782
[WPA] 5778576 tree bodies read in total
[WPA] tree SCC table: size 524287, 267346 elements, collision ratio: 0.855240
[WPA] tree SCC max chain length 11 (size 1)
[WPA] Compared 362074 SCCs, 5417 collisions (0.014961)
[WPA] Merged 358942 SCCs
[WPA] Merged 2384340 tree bodies
[WPA] Merged 175201 types
[WPA] 188264 types prevailed
[WPA] Old merging code merges an additional 53081 types of which 21094 are in the same SCC
[WPA] GIMPLE canonical type table: size 16381, 6234 elements, 188399 searches, 5850 collisions (ratio: 0.031051)
[WPA] GIMPLE canonical type hash table: size 262139, 188343 elements, 1217265 searches, 817652 collisions (ratio: 0.671712)
[WPA] # of input files: 422
...
[WPA] Size of mmap'd section decls: 68171523 bytes
[WPA] Size of mmap'd section function_body: 56454939 bytes
[WPA] Size of mmap'd section statics: 0 bytes
[WPA] Size of mmap'd section symtab: 0 bytes
[WPA] Size of mmap'd section refs: 138463 bytes
[WPA] Size of mmap'd section asm: 0 bytes
[WPA] Size of mmap'd section jmpfuncs: 1185254 bytes
[WPA] Size of mmap'd section pureconst: 59943 bytes
[WPA] Size of mmap'd section reference: 0 bytes
[WPA] Size of mmap'd section profile: 8845 bytes
[WPA] Size of mmap'd section symbol_nodes: 1145897 bytes
[WPA] Size of mmap'd section opts: 0 bytes
[WPA] Size of mmap'd section cgraphopt: 0 bytes
[WPA] Size of mmap'd section inline: 895848 bytes
[WPA] Size of mmap'd section ipcp_trans: 0 bytes


That leaves me with mixed feelings despite the clearly superior
overall design (and correctness).  The WPA LTRANS output side has a
hard time with both writing more trees (we merged fewer types) and
probably performing the DFS walks.
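
For readers who want to picture the DFS walk, here is a minimal sketch
of collecting SCCs with Tarjan's algorithm.  It is not the streamer
code from the patch: Node, Scc and SccCollector are hypothetical
stand-ins, nodes refer to each other by array index rather than by
tree pointer, and the hash is a toy order-independent sum.  It does
show the property the stream layout relies on: an SCC is only emitted
after every SCC it refers to, and nodes that were already emitted are
not walked again.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tree: payload bits plus outgoing
// references (indices into the node array).  Not GCC's tree type.
struct Node {
  std::uint32_t bits;
  std::vector<std::size_t> refs;
};

struct Scc {
  std::vector<std::size_t> members;  // node indices in this component
  std::uint32_t hash;                // toy hash over the members' bits
};

const std::size_t kUnvisited = static_cast<std::size_t>(-1);

class SccCollector {
public:
  explicit SccCollector(const std::vector<Node> &nodes)
      : nodes_(nodes), index_(nodes.size(), kUnvisited),
        lowlink_(nodes.size(), 0), on_stack_(nodes.size(), false),
        next_index_(0) {}

  // DFS from root; appends each newly completed SCC to out.
  // Nodes reached by an earlier walk are skipped.
  void walk(std::size_t root, std::vector<Scc> &out) {
    if (index_[root] == kUnvisited)
      visit(root, out);
  }

private:
  void visit(std::size_t v, std::vector<Scc> &out) {
    index_[v] = lowlink_[v] = next_index_++;
    stack_.push_back(v);
    on_stack_[v] = true;
    for (std::size_t w : nodes_[v].refs) {
      if (index_[w] == kUnvisited) {
        visit(w, out);
        lowlink_[v] = std::min(lowlink_[v], lowlink_[w]);
      } else if (on_stack_[w]) {
        lowlink_[v] = std::min(lowlink_[v], index_[w]);
      }
    }
    if (lowlink_[v] != index_[v])
      return;  // v is not the root of its SCC
    // v is an SCC root: pop the whole component off the stack.
    Scc scc;
    scc.hash = 0;
    std::size_t w;
    do {
      w = stack_.back();
      stack_.pop_back();
      on_stack_[w] = false;
      scc.members.push_back(w);
      // A sum does not depend on the order the members were visited in.
      scc.hash += nodes_[w].bits * 2654435761u;
    } while (w != v);
    out.push_back(scc);
  }

  const std::vector<Node> &nodes_;
  std::vector<std::size_t> index_;
  std::vector<std::size_t> lowlink_;
  std::vector<bool> on_stack_;
  std::vector<std::size_t> stack_;
  std::size_t next_index_;
};
```

With nodes 0 -> 1, 1 -> 2, 2 -> {1, 3} and a leaf 3, a walk from node 0
emits the SCCs {3}, {2, 1} and {0} in that order, and a later walk from
node 1 emits nothing new.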

As you can see from the compile-step object file sizes (115%), the
overhead of streaming in SCCs is manageable, so the growth in LTRANS
file size (to 162%) has to come from extra trees we stream there (I'm
going to add some more counters and try doing stats per LTRANS
file we output).
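
To make the comparison easy to re-check, the ratios can be recomputed
from the raw numbers quoted above.  The helper below is only an
illustration (percent_of is a made-up name, not part of the patch); it
expresses the patched figure as a percentage of the unpatched one,
rounded to the nearest integer.

```cpp
#include <cmath>

// Patched value as a percentage of the unpatched value, rounded to
// the nearest integer.
long percent_of(double patched, double unpatched) {
  return std::lround(100.0 * patched / unpatched);
}
```

Applied to the figures in this mail, percent_of (373225406, 324509482)
gives 115 for the input object files, percent_of (645106888, 398407935)
gives 162 for the LTRANS files, and percent_of (18.35, 10.83) gives 169
for WPA user time.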

Richard.
