This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: WPA stream_out form & memory consumption



On 04/03/2014 03:07 PM, Richard Biener wrote:
On Thu, Apr 3, 2014 at 2:07 PM, Martin LiÅka <mliska@suse.cz> wrote:
On 04/03/2014 11:41 AM, Richard Biener wrote:
On Wed, Apr 2, 2014 at 6:11 PM, Martin LiÅka <mliska@suse.cz> wrote:
On 04/02/2014 04:13 PM, Martin LiÅka wrote:

On 03/27/2014 10:48 AM, Martin LiÅka wrote:
Previous patch is wrong, I did a mistake in name ;)

Martin

On 03/27/2014 09:52 AM, Martin LiÅka wrote:

On 03/25/2014 09:50 PM, Jan Hubicka wrote:
Hello,
      I've been compiling Chromium with LTO and I noticed that WPA
stream_out forks and do parallel:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html.

I am unable to fit in 16GB memory: ld uses about 8GB and lto1 about
6GB. When WPA start to fork, memory consumption increases so that
lto1 is killed. I would appreciate an --param option to disable this
WPA fork. The number of forks is taken from build system (-flto=9)
which is fine for ltrans phase, because LD releases aforementioned
8GB.

What do you think about that?
I can take a look - our measurements suggested that the WPA memory
will
be later dominated by ltrans.  Perhaps Chromium does something that
makes
WPA to explode that would be interesting to analyze.  I did not
managed
to get through Chromium LTO build process recently (ninja builds are
not
my friends), can you send me the instructions?

Honza
Thanks,
Martin

There are instructions how can one build chromium with LTO:
1) install depot-tools and export PATH variable according to guide:
http://www.chromium.org/developers/how-tos/install-depot-tools
2) Checkout source code: gclient sync; cd src
3) Apply patch (enables system gold linker and disables LTO for a
sandbox that uses top-level asm)
4) which ld should point to ld.gold
5) unsure that ld.bfd points to ld.bfd
6) run: build/gyp_chromium -Dwerror=
7) ninja -C out/Release chrome -jX

If there are any problems, follow:
https://code.google.com/p/chromium/wiki/LinuxBuildInstructions

Martin

Hello,
    taking latest trunk gcc, I built Firefox and Chromium. Both projects
compiled without debugging symbols and -O2 on an 8-core machine.

Firefox:
-flto=9, peak memory usage (in LTRANS): 11GB

Chromium:
-flto=6, peak memory usage (in parallel WPA phase ): 16.5GB

For details please see attached with graphs. The attachment contains
also
-fmem-report and -fmem-report-wpa.
I think reduced memory footprint to ~3.5GB is a bit optimistic:
http://gcc.gnu.org/gcc-4.9/changes.html

Is there any way we can reduce the memory footprint?

Attachment (due to size restriction):

https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing

Thank you,
Martin

Previous email presents a bit misleading graphs (influenced by
--enable-gather-detailed-mem-stats).

Firefox:
-flto=9, WPA peak: 8GB, LTRANS peak: 8GB
-flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
-flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB

These data shows that parallel WPA streaming increases short-time memory
footprint by 4.5GB for -flto=9 (respectively by 1.5GB in case of
-flto=4).

For more details, please see the attachment.
The main overhead comes from maintaining the state during output of
the global types/decls.  We maintain somewhat "duplicate" info
here by having both the tree_ref_encoder and the streamer cache.
Eventually we can free the tree_ref_encoder pointer-map early, like with

Index: lto-streamer-out.c
===================================================================
--- lto-streamer-out.c  (revision 209018)
+++ lto-streamer-out.c  (working copy)
@@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)

     gcc_assert (!alias_pairs);

-  /* Write the global symbols.  */
+  /* Get rid of the global decl state hash tables to save some memory.
*/
     out_state = lto_get_out_decl_state ();
-  num_fns = lto_function_decl_states.length ();
+  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+    if (out_state->streams[i].tree_hash_table)
+      {
+       delete out_state->streams[i].tree_hash_table;
+       out_state->streams[i].tree_hash_table = NULL;
+      }
+
+  /* Write the global symbols.  */
     lto_output_decl_state_streams (ob, out_state);
+  num_fns = lto_function_decl_states.length ();
     for (idx = 0; idx < num_fns; idx++)
       {
         fn_out_state =

as we do already for the fn state streams (untested).

we can also avoid re-allocating the output hashtable/vector by, after
(or in) create_output_block, allocate a bigger initial size for the
streamer_tree_cache.  Note that the pointer-set already expands if
the fill level is > 25%, and it really exponentially grows (similar to
hash_table, btw, but that grows only at 75% fill level).

OTOH simply summing then lengths of all decl streams results in
a lower value than the actual number of output trees in the output block.
Humm.

But this is clearly the data structure that could be worth optimizing
in some way.  For example during writing we don't need the
streamer cache nodes array (we just need a counter to assign indexes).

The attached is a patch that tries to do that plus the above (in testing
right now).  Maybe you can check if it makes a noticable difference.

Richard.

I run test of your patch for twice, according to graphs memory footprint
looks similar. Looks, after application of the patch, WPA phase was a bit
faster, but can be influenced by HDD heavily utilized at the end of WPA.
Sent graphs are executed after: echo 3 > /proc/sys/vm/drop_caches

One another idea is to use threads instead of process fork. But I am not
familiar with sharing data problems between threads?
Try

Index: gcc/tree-streamer-out.c
===================================================================
--- gcc/tree-streamer-out.c     (revision 209054)
+++ gcc/tree-streamer-out.c     (working copy)
@@ -523,13 +523,6 @@ streamer_write_chain (struct output_bloc
  {
    while (t)
      {
-      tree saved_chain;
-
-      /* Clear TREE_CHAIN to avoid blindly recursing into the rest
-        of the list.  */
-      saved_chain = TREE_CHAIN (t);
-      TREE_CHAIN (t) = NULL_TREE;
-
        /* We avoid outputting external vars or functions by reference
          to the global decls section as we do not want to have them
          enter decl merging.  This is, of course, only for the call
@@ -541,7 +534,6 @@ streamer_write_chain (struct output_bloc
        else
         stream_write_tree (ob, t, ref_p);

-      TREE_CHAIN (t) = saved_chain;
        t = TREE_CHAIN (t);
      }

Hi!

I've just finally written enhanced stats for RAM and CPU utilization. I did 3 tests that are names in the graph as follows:

1) TRUNK.LOG = trunk gcc 20140401
2) PATCH1.LOG = Richard's patches that are in attachment (saves ~300MB)
3) PATCH2.LOG = Additional Richard's patch from the email I reply to (streamer_write_chain change) (saves additional ~1.5GB)

Good job Richard!

Note: graphs are work in progress, I should add legend for RAM graph where dark blue = ld, ligher blue = lto1 WPA (without additional mem after fork) and the rest are lto1 LTRANS. Blue lines display overall memory consumption taken from 'free' program.

Graph link: https://drive.google.com/file/d/0B0pisUJ80pO1VS0zdVRoUTVURVU/edit?usp=sharing

Martin




Martin


Martin


diff --git a/gcc/lto-section-out.c b/gcc/lto-section-out.c
index 9d6926c..075af76 100644
--- a/gcc/lto-section-out.c
+++ b/gcc/lto-section-out.c
@@ -99,13 +99,19 @@ lto_end_section (void)
 }
 
 
+/* We exponentially grow the size of the blocks as we need to make
+   room for more data to be written.  Start with a single page and go up
+   to 2MB pages for this.  */
+#define FIRST_BLOCK_SIZE 4096
+#define MAX_BLOCK_SIZE (2 * 1024 * 1024)
+
 /* Write all of the chars in OBS to the assembler.  Recycle the blocks
    in obs as this is being done.  */
 
 void
 lto_write_stream (struct lto_output_stream *obs)
 {
-  unsigned int block_size = 1024;
+  unsigned int block_size = FIRST_BLOCK_SIZE;
   struct lto_char_ptr_base *block;
   struct lto_char_ptr_base *next_block;
   if (!obs->first_block)
@@ -135,6 +141,7 @@ lto_write_stream (struct lto_output_stream *obs)
       else
 	lang_hooks.lto.append_data (base, num_chars, block);
       block_size *= 2;
+      block_size = MIN (MAX_BLOCK_SIZE, block_size);
     }
 }
 
@@ -152,7 +159,7 @@ lto_append_block (struct lto_output_stream *obs)
     {
       /* This is the first time the stream has been written
 	 into.  */
-      obs->block_size = 1024;
+      obs->block_size = FIRST_BLOCK_SIZE;
       new_block = (struct lto_char_ptr_base*) xmalloc (obs->block_size);
       obs->first_block = new_block;
     }
@@ -162,6 +169,7 @@ lto_append_block (struct lto_output_stream *obs)
       /* Get a new block that is twice as big as the last block
 	 and link it into the list.  */
       obs->block_size *= 2;
+      obs->block_size = MIN (MAX_BLOCK_SIZE, obs->block_size);
       new_block = (struct lto_char_ptr_base*) xmalloc (obs->block_size);
       /* The first bytes of the block are reserved as a pointer to
 	 the next block.  Set the chain of the full block to the
diff --git a/gcc/lto-streamer-in.c b/gcc/lto-streamer-in.c
index 3238ab8..e19b115 100644
--- a/gcc/lto-streamer-in.c
+++ b/gcc/lto-streamer-in.c
@@ -1365,8 +1365,7 @@ lto_data_in_create (struct lto_file_decl_data *file_data, const char *strings,
   data_in->strings = strings;
   data_in->strings_len = len;
   data_in->globals_resolution = resolutions;
-  data_in->reader_cache = streamer_tree_cache_create (false, false);
-
+  data_in->reader_cache = streamer_tree_cache_create (false, false, true);
   return data_in;
 }
 
diff --git a/gcc/lto-streamer-out.c b/gcc/lto-streamer-out.c
index 0f37f1c..69b5a79 100644
--- a/gcc/lto-streamer-out.c
+++ b/gcc/lto-streamer-out.c
@@ -79,7 +79,7 @@ create_output_block (enum lto_section_type section_type)
   ob->decl_state = lto_get_out_decl_state ();
   ob->main_stream = XCNEW (struct lto_output_stream);
   ob->string_stream = XCNEW (struct lto_output_stream);
-  ob->writer_cache = streamer_tree_cache_create (!flag_wpa, true);
+  ob->writer_cache = streamer_tree_cache_create (!flag_wpa, true, false);
 
   if (section_type == LTO_section_function_body)
     ob->cfg_stream = XCNEW (struct lto_output_stream);
@@ -1277,7 +1277,6 @@ DFS_write_tree (struct output_block *ob, sccs *from_state,
 	     ???  We still wrap these in LTO_tree_scc so at the
 	     input side we can properly identify the tree we want
 	     to ultimatively return.  */
-	  size_t old_len = ob->writer_cache->nodes.length ();
 	  if (size == 1)
 	    lto_output_tree_1 (ob, expr, scc_hash, ref_p, this_ref_p);
 	  else
@@ -1315,7 +1314,6 @@ DFS_write_tree (struct output_block *ob, sccs *from_state,
 		  streamer_write_zero (ob);
 		}
 	    }
-	  gcc_assert (old_len + size == ob->writer_cache->nodes.length ());
 
 	  /* Finally truncate the vector.  */
 	  sccstack.truncate (first);
@@ -2423,10 +2421,18 @@ produce_asm_for_decls (void)
 
   gcc_assert (!alias_pairs);
 
-  /* Write the global symbols.  */
+  /* Get rid of the global decl state hash tables to save some memory.  */
   out_state = lto_get_out_decl_state ();
-  num_fns = lto_function_decl_states.length ();
+  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+    if (out_state->streams[i].tree_hash_table)
+      {
+	delete out_state->streams[i].tree_hash_table;
+	out_state->streams[i].tree_hash_table = NULL;
+      }
+
+  /* Write the global symbols.  */
   lto_output_decl_state_streams (ob, out_state);
+  num_fns = lto_function_decl_states.length ();
   for (idx = 0; idx < num_fns; idx++)
     {
       fn_out_state =
diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index df9f031..e3cd9f4 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -2568,6 +2568,11 @@ lto_wpa_write_files (void)
   FOR_EACH_VEC_ELT (ltrans_partitions, i, part)
     lto_stats.num_output_symtab_nodes += lto_symtab_encoder_size (part->encoder);
 
+  /* Find out statics that need to be promoted
+     to globals with hidden visibility because they are accessed from multiple
+     partitions.  */
+  lto_promote_cross_file_statics ();
+
   timevar_pop (TV_WHOPR_WPA);
 
   timevar_push (TV_WHOPR_WPA_IO);
@@ -3280,20 +3285,11 @@ do_whole_program_analysis (void)
 
   lto_stats.num_cgraph_partitions += ltrans_partitions.length ();
 
-  /* Find out statics that need to be promoted
-     to globals with hidden visibility because they are accessed from multiple
-     partitions.  */
-  lto_promote_cross_file_statics ();
   timevar_pop (TV_WHOPR_PARTITIONING);
 
   timevar_stop (TV_PHASE_OPT_GEN);
-
-  /* Collect a last time - in lto_wpa_write_files we may end up forking
-     with the idea that this doesn't increase memory usage.  So we
-     absoultely do not want to collect after that.  */
-  ggc_collect ();
-
   timevar_start (TV_PHASE_STREAM_OUT);
+
   if (!quiet_flag)
     {
       fprintf (stderr, "\nStreaming out");
@@ -3302,8 +3298,10 @@ do_whole_program_analysis (void)
   lto_wpa_write_files ();
   if (!quiet_flag)
     fprintf (stderr, "\n");
+
   timevar_stop (TV_PHASE_STREAM_OUT);
 
+  ggc_collect ();
   if (post_ipa_mem_report)
     {
       fprintf (stderr, "Memory consumption after IPA\n");
diff --git a/gcc/tree-streamer-out.c b/gcc/tree-streamer-out.c
index 646fba5..4f597fa 100644
--- a/gcc/tree-streamer-out.c
+++ b/gcc/tree-streamer-out.c
@@ -526,10 +526,11 @@ streamer_write_chain (struct output_block *ob, tree t, bool ref_p)
       tree saved_chain;
 
       /* Clear TREE_CHAIN to avoid blindly recursing into the rest
-	 of the list.  */
+        of the list.  */
       saved_chain = TREE_CHAIN (t);
       TREE_CHAIN (t) = NULL_TREE;
 
+
       /* We avoid outputting external vars or functions by reference
 	 to the global decls section as we do not want to have them
 	 enter decl merging.  This is, of course, only for the call
diff --git a/gcc/tree-streamer.c b/gcc/tree-streamer.c
index af9461e..517bf77 100644
--- a/gcc/tree-streamer.c
+++ b/gcc/tree-streamer.c
@@ -101,20 +101,19 @@ static void
 streamer_tree_cache_add_to_node_array (struct streamer_tree_cache_d *cache,
 				       unsigned ix, tree t, hashval_t hash)
 {
-  /* Make sure we're either replacing an old element or
-     appending consecutively.  */
-  gcc_assert (ix <= cache->nodes.length ());
-
-  if (ix == cache->nodes.length ())
+  /* We're either replacing an old element or appending consecutively.  */
+  if (cache->nodes.exists ())
     {
-      cache->nodes.safe_push (t);
-      if (cache->hashes.exists ())
-	cache->hashes.safe_push (hash);
+      if (cache->nodes.length () == ix)
+	cache->nodes.safe_push (t);
+      else
+	cache->nodes[ix] = t;
     }
-  else
+  if (cache->hashes.exists ())
     {
-      cache->nodes[ix] = t;
-      if (cache->hashes.exists ())
+      if (cache->hashes.length () == ix)
+	cache->hashes.safe_push (hash);
+      else
 	cache->hashes[ix] = hash;
     }
 }
@@ -146,7 +145,7 @@ streamer_tree_cache_insert_1 (struct streamer_tree_cache_d *cache,
     {
       /* Determine the next slot to use in the cache.  */
       if (insert_at_next_slot_p)
-	ix = cache->nodes.length ();
+	ix = cache->next_idx++;
       else
 	ix = *ix_p;
        *slot = ix;
@@ -211,7 +210,7 @@ void
 streamer_tree_cache_append (struct streamer_tree_cache_d *cache,
 			    tree t, hashval_t hash)
 {
-  unsigned ix = cache->nodes.length ();
+  unsigned ix = cache->next_idx++;
   if (!cache->node_map)
     streamer_tree_cache_add_to_node_array (cache, ix, t, hash);
   else
@@ -326,7 +325,7 @@ preload_common_nodes (struct streamer_tree_cache_d *cache)
 /* Create a cache of pickled nodes.  */
 
 struct streamer_tree_cache_d *
-streamer_tree_cache_create (bool with_hashes, bool with_map)
+streamer_tree_cache_create (bool with_hashes, bool with_map, bool with_vec)
 {
   struct streamer_tree_cache_d *cache;
 
@@ -334,7 +333,9 @@ streamer_tree_cache_create (bool with_hashes, bool with_map)
 
   if (with_map)
     cache->node_map = new pointer_map<unsigned>;
-  cache->nodes.create (165);
+  cache->next_idx = 0;
+  if (with_vec)
+    cache->nodes.create (165);
   if (with_hashes)
     cache->hashes.create (165);
 
diff --git a/gcc/tree-streamer.h b/gcc/tree-streamer.h
index 2aca29a..20dbba0 100644
--- a/gcc/tree-streamer.h
+++ b/gcc/tree-streamer.h
@@ -52,6 +52,9 @@ struct streamer_tree_cache_d
   vec<tree> nodes;
   /* The node hashes (if available).  */
   vec<hashval_t> hashes;
+
+  /* Next index to assign.  */
+  unsigned next_idx;
 };
 
 /* Return true if tree node EXPR should be streamed as a builtin.  For
@@ -97,7 +100,7 @@ void streamer_tree_cache_append (struct streamer_tree_cache_d *, tree,
 				 hashval_t);
 bool streamer_tree_cache_lookup (struct streamer_tree_cache_d *, tree,
 				 unsigned *);
-struct streamer_tree_cache_d *streamer_tree_cache_create (bool, bool);
+struct streamer_tree_cache_d *streamer_tree_cache_create (bool, bool, bool);
 void streamer_tree_cache_delete (struct streamer_tree_cache_d *);
 
 /* Return the tree node at slot IX in CACHE.  */

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]