This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: WPA stream_out form & memory consumption
- From: Martin LiÅka <mliska at suse dot cz>
- To: gcc at gcc dot gnu dot org
- Cc: Richard Biener <richard dot guenther at gmail dot com>, Jan Hubicka <hubicka at ucw dot cz>
- Date: Fri, 04 Apr 2014 17:10:55 +0200
- Subject: Re: WPA stream_out form & memory consumption
- Authentication-results: sourceware.org; auth=none
- References: <53286192 dot 3030600 at suse dot cz> <20140325205021 dot GA6581 at atrey dot karlin dot mff dot cuni dot cz> <5333E6B8 dot 3000504 at suse dot cz> <5333F3D3 dot 1010009 at suse dot cz> <533C1B04 dot 40407 at suse dot cz> <533C36C8 dot 6090604 at suse dot cz> <CAFiYyc3vjeaNSdoOY=A07PLOMxK6e06PfEv6xa2OfN-hsZev4g at mail dot gmail dot com> <533D4EF9 dot 2050103 at suse dot cz> <CAFiYyc3XhRee2Lm1iscr20AtAHHC_yhn5i6P8cN0wrwJWib7PA at mail dot gmail dot com>
On 04/03/2014 03:07 PM, Richard Biener wrote:
On Thu, Apr 3, 2014 at 2:07 PM, Martin LiÅka <mliska@suse.cz> wrote:
On 04/03/2014 11:41 AM, Richard Biener wrote:
On Wed, Apr 2, 2014 at 6:11 PM, Martin LiÅka <mliska@suse.cz> wrote:
On 04/02/2014 04:13 PM, Martin LiÅka wrote:
On 03/27/2014 10:48 AM, Martin LiÅka wrote:
Previous patch is wrong, I did a mistake in name ;)
Martin
On 03/27/2014 09:52 AM, Martin LiÅka wrote:
On 03/25/2014 09:50 PM, Jan Hubicka wrote:
Hello,
I've been compiling Chromium with LTO and I noticed that WPA
stream_out forks and do parallel:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html.
I am unable to fit in 16GB memory: ld uses about 8GB and lto1 about
6GB. When WPA start to fork, memory consumption increases so that
lto1 is killed. I would appreciate an --param option to disable this
WPA fork. The number of forks is taken from build system (-flto=9)
which is fine for ltrans phase, because LD releases aforementioned
8GB.
What do you think about that?
I can take a look - our measurements suggested that the WPA memory
will
be later dominated by ltrans. Perhaps Chromium does something that
makes
WPA to explode that would be interesting to analyze. I did not
managed
to get through Chromium LTO build process recently (ninja builds are
not
my friends), can you send me the instructions?
Honza
Thanks,
Martin
There are instructions how can one build chromium with LTO:
1) install depot-tools and export PATH variable according to guide:
http://www.chromium.org/developers/how-tos/install-depot-tools
2) Checkout source code: gclient sync; cd src
3) Apply patch (enables system gold linker and disables LTO for a
sandbox that uses top-level asm)
4) which ld should point to ld.gold
5) unsure that ld.bfd points to ld.bfd
6) run: build/gyp_chromium -Dwerror=
7) ninja -C out/Release chrome -jX
If there are any problems, follow:
https://code.google.com/p/chromium/wiki/LinuxBuildInstructions
Martin
Hello,
taking latest trunk gcc, I built Firefox and Chromium. Both projects
compiled without debugging symbols and -O2 on an 8-core machine.
Firefox:
-flto=9, peak memory usage (in LTRANS): 11GB
Chromium:
-flto=6, peak memory usage (in parallel WPA phase ): 16.5GB
For details please see attached with graphs. The attachment contains
also
-fmem-report and -fmem-report-wpa.
I think reduced memory footprint to ~3.5GB is a bit optimistic:
http://gcc.gnu.org/gcc-4.9/changes.html
Is there any way we can reduce the memory footprint?
Attachment (due to size restriction):
https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing
Thank you,
Martin
Previous email presents a bit misleading graphs (influenced by
--enable-gather-detailed-mem-stats).
Firefox:
-flto=9, WPA peak: 8GB, LTRANS peak: 8GB
-flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
-flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB
These data shows that parallel WPA streaming increases short-time memory
footprint by 4.5GB for -flto=9 (respectively by 1.5GB in case of
-flto=4).
For more details, please see the attachment.
The main overhead comes from maintaining the state during output of
the global types/decls. We maintain somewhat "duplicate" info
here by having both the tree_ref_encoder and the streamer cache.
Eventually we can free the tree_ref_encoder pointer-map early, like with
Index: lto-streamer-out.c
===================================================================
--- lto-streamer-out.c (revision 209018)
+++ lto-streamer-out.c (working copy)
@@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)
gcc_assert (!alias_pairs);
- /* Write the global symbols. */
+ /* Get rid of the global decl state hash tables to save some memory.
*/
out_state = lto_get_out_decl_state ();
- num_fns = lto_function_decl_states.length ();
+ for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+ if (out_state->streams[i].tree_hash_table)
+ {
+ delete out_state->streams[i].tree_hash_table;
+ out_state->streams[i].tree_hash_table = NULL;
+ }
+
+ /* Write the global symbols. */
lto_output_decl_state_streams (ob, out_state);
+ num_fns = lto_function_decl_states.length ();
for (idx = 0; idx < num_fns; idx++)
{
fn_out_state =
as we do already for the fn state streams (untested).
we can also avoid re-allocating the output hashtable/vector by, after
(or in) create_output_block, allocate a bigger initial size for the
streamer_tree_cache. Note that the pointer-set already expands if
the fill level is > 25%, and it really exponentially grows (similar to
hash_table, btw, but that grows only at 75% fill level).
OTOH simply summing then lengths of all decl streams results in
a lower value than the actual number of output trees in the output block.
Humm.
But this is clearly the data structure that could be worth optimizing
in some way. For example during writing we don't need the
streamer cache nodes array (we just need a counter to assign indexes).
The attached is a patch that tries to do that plus the above (in testing
right now). Maybe you can check if it makes a noticable difference.
Richard.
I run test of your patch for twice, according to graphs memory footprint
looks similar. Looks, after application of the patch, WPA phase was a bit
faster, but can be influenced by HDD heavily utilized at the end of WPA.
Sent graphs are executed after: echo 3 > /proc/sys/vm/drop_caches
One another idea is to use threads instead of process fork. But I am not
familiar with sharing data problems between threads?
Try
Index: gcc/tree-streamer-out.c
===================================================================
--- gcc/tree-streamer-out.c (revision 209054)
+++ gcc/tree-streamer-out.c (working copy)
@@ -523,13 +523,6 @@ streamer_write_chain (struct output_bloc
{
while (t)
{
- tree saved_chain;
-
- /* Clear TREE_CHAIN to avoid blindly recursing into the rest
- of the list. */
- saved_chain = TREE_CHAIN (t);
- TREE_CHAIN (t) = NULL_TREE;
-
/* We avoid outputting external vars or functions by reference
to the global decls section as we do not want to have them
enter decl merging. This is, of course, only for the call
@@ -541,7 +534,6 @@ streamer_write_chain (struct output_bloc
else
stream_write_tree (ob, t, ref_p);
- TREE_CHAIN (t) = saved_chain;
t = TREE_CHAIN (t);
}
Hi!
I've just finally written enhanced stats for RAM and CPU utilization. I
did 3 tests that are names in the graph as follows:
1) TRUNK.LOG = trunk gcc 20140401
2) PATCH1.LOG = Richard's patches that are in attachment (saves ~300MB)
3) PATCH2.LOG = Additional Richard's patch from the email I reply to
(streamer_write_chain change) (saves additional ~1.5GB)
Good job Richard!
Note: graphs are work in progress, I should add legend for RAM graph
where dark blue = ld, ligher blue = lto1 WPA (without additional mem
after fork) and the rest are lto1 LTRANS. Blue lines display overall
memory consumption taken from 'free' program.
Graph link:
https://drive.google.com/file/d/0B0pisUJ80pO1VS0zdVRoUTVURVU/edit?usp=sharing
Martin
Martin
Martin
diff --git a/gcc/lto-section-out.c b/gcc/lto-section-out.c
index 9d6926c..075af76 100644
--- a/gcc/lto-section-out.c
+++ b/gcc/lto-section-out.c
@@ -99,13 +99,19 @@ lto_end_section (void)
}
+/* We exponentially grow the size of the blocks as we need to make
+ room for more data to be written. Start with a single page and go up
+ to 2MB pages for this. */
+#define FIRST_BLOCK_SIZE 4096
+#define MAX_BLOCK_SIZE (2 * 1024 * 1024)
+
/* Write all of the chars in OBS to the assembler. Recycle the blocks
in obs as this is being done. */
void
lto_write_stream (struct lto_output_stream *obs)
{
- unsigned int block_size = 1024;
+ unsigned int block_size = FIRST_BLOCK_SIZE;
struct lto_char_ptr_base *block;
struct lto_char_ptr_base *next_block;
if (!obs->first_block)
@@ -135,6 +141,7 @@ lto_write_stream (struct lto_output_stream *obs)
else
lang_hooks.lto.append_data (base, num_chars, block);
block_size *= 2;
+ block_size = MIN (MAX_BLOCK_SIZE, block_size);
}
}
@@ -152,7 +159,7 @@ lto_append_block (struct lto_output_stream *obs)
{
/* This is the first time the stream has been written
into. */
- obs->block_size = 1024;
+ obs->block_size = FIRST_BLOCK_SIZE;
new_block = (struct lto_char_ptr_base*) xmalloc (obs->block_size);
obs->first_block = new_block;
}
@@ -162,6 +169,7 @@ lto_append_block (struct lto_output_stream *obs)
/* Get a new block that is twice as big as the last block
and link it into the list. */
obs->block_size *= 2;
+ obs->block_size = MIN (MAX_BLOCK_SIZE, obs->block_size);
new_block = (struct lto_char_ptr_base*) xmalloc (obs->block_size);
/* The first bytes of the block are reserved as a pointer to
the next block. Set the chain of the full block to the
diff --git a/gcc/lto-streamer-in.c b/gcc/lto-streamer-in.c
index 3238ab8..e19b115 100644
--- a/gcc/lto-streamer-in.c
+++ b/gcc/lto-streamer-in.c
@@ -1365,8 +1365,7 @@ lto_data_in_create (struct lto_file_decl_data *file_data, const char *strings,
data_in->strings = strings;
data_in->strings_len = len;
data_in->globals_resolution = resolutions;
- data_in->reader_cache = streamer_tree_cache_create (false, false);
-
+ data_in->reader_cache = streamer_tree_cache_create (false, false, true);
return data_in;
}
diff --git a/gcc/lto-streamer-out.c b/gcc/lto-streamer-out.c
index 0f37f1c..69b5a79 100644
--- a/gcc/lto-streamer-out.c
+++ b/gcc/lto-streamer-out.c
@@ -79,7 +79,7 @@ create_output_block (enum lto_section_type section_type)
ob->decl_state = lto_get_out_decl_state ();
ob->main_stream = XCNEW (struct lto_output_stream);
ob->string_stream = XCNEW (struct lto_output_stream);
- ob->writer_cache = streamer_tree_cache_create (!flag_wpa, true);
+ ob->writer_cache = streamer_tree_cache_create (!flag_wpa, true, false);
if (section_type == LTO_section_function_body)
ob->cfg_stream = XCNEW (struct lto_output_stream);
@@ -1277,7 +1277,6 @@ DFS_write_tree (struct output_block *ob, sccs *from_state,
??? We still wrap these in LTO_tree_scc so at the
input side we can properly identify the tree we want
to ultimatively return. */
- size_t old_len = ob->writer_cache->nodes.length ();
if (size == 1)
lto_output_tree_1 (ob, expr, scc_hash, ref_p, this_ref_p);
else
@@ -1315,7 +1314,6 @@ DFS_write_tree (struct output_block *ob, sccs *from_state,
streamer_write_zero (ob);
}
}
- gcc_assert (old_len + size == ob->writer_cache->nodes.length ());
/* Finally truncate the vector. */
sccstack.truncate (first);
@@ -2423,10 +2421,18 @@ produce_asm_for_decls (void)
gcc_assert (!alias_pairs);
- /* Write the global symbols. */
+ /* Get rid of the global decl state hash tables to save some memory. */
out_state = lto_get_out_decl_state ();
- num_fns = lto_function_decl_states.length ();
+ for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+ if (out_state->streams[i].tree_hash_table)
+ {
+ delete out_state->streams[i].tree_hash_table;
+ out_state->streams[i].tree_hash_table = NULL;
+ }
+
+ /* Write the global symbols. */
lto_output_decl_state_streams (ob, out_state);
+ num_fns = lto_function_decl_states.length ();
for (idx = 0; idx < num_fns; idx++)
{
fn_out_state =
diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index df9f031..e3cd9f4 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -2568,6 +2568,11 @@ lto_wpa_write_files (void)
FOR_EACH_VEC_ELT (ltrans_partitions, i, part)
lto_stats.num_output_symtab_nodes += lto_symtab_encoder_size (part->encoder);
+ /* Find out statics that need to be promoted
+ to globals with hidden visibility because they are accessed from multiple
+ partitions. */
+ lto_promote_cross_file_statics ();
+
timevar_pop (TV_WHOPR_WPA);
timevar_push (TV_WHOPR_WPA_IO);
@@ -3280,20 +3285,11 @@ do_whole_program_analysis (void)
lto_stats.num_cgraph_partitions += ltrans_partitions.length ();
- /* Find out statics that need to be promoted
- to globals with hidden visibility because they are accessed from multiple
- partitions. */
- lto_promote_cross_file_statics ();
timevar_pop (TV_WHOPR_PARTITIONING);
timevar_stop (TV_PHASE_OPT_GEN);
-
- /* Collect a last time - in lto_wpa_write_files we may end up forking
- with the idea that this doesn't increase memory usage. So we
- absoultely do not want to collect after that. */
- ggc_collect ();
-
timevar_start (TV_PHASE_STREAM_OUT);
+
if (!quiet_flag)
{
fprintf (stderr, "\nStreaming out");
@@ -3302,8 +3298,10 @@ do_whole_program_analysis (void)
lto_wpa_write_files ();
if (!quiet_flag)
fprintf (stderr, "\n");
+
timevar_stop (TV_PHASE_STREAM_OUT);
+ ggc_collect ();
if (post_ipa_mem_report)
{
fprintf (stderr, "Memory consumption after IPA\n");
diff --git a/gcc/tree-streamer-out.c b/gcc/tree-streamer-out.c
index 646fba5..4f597fa 100644
--- a/gcc/tree-streamer-out.c
+++ b/gcc/tree-streamer-out.c
@@ -526,10 +526,11 @@ streamer_write_chain (struct output_block *ob, tree t, bool ref_p)
tree saved_chain;
/* Clear TREE_CHAIN to avoid blindly recursing into the rest
- of the list. */
+ of the list. */
saved_chain = TREE_CHAIN (t);
TREE_CHAIN (t) = NULL_TREE;
+
/* We avoid outputting external vars or functions by reference
to the global decls section as we do not want to have them
enter decl merging. This is, of course, only for the call
diff --git a/gcc/tree-streamer.c b/gcc/tree-streamer.c
index af9461e..517bf77 100644
--- a/gcc/tree-streamer.c
+++ b/gcc/tree-streamer.c
@@ -101,20 +101,19 @@ static void
streamer_tree_cache_add_to_node_array (struct streamer_tree_cache_d *cache,
unsigned ix, tree t, hashval_t hash)
{
- /* Make sure we're either replacing an old element or
- appending consecutively. */
- gcc_assert (ix <= cache->nodes.length ());
-
- if (ix == cache->nodes.length ())
+ /* We're either replacing an old element or appending consecutively. */
+ if (cache->nodes.exists ())
{
- cache->nodes.safe_push (t);
- if (cache->hashes.exists ())
- cache->hashes.safe_push (hash);
+ if (cache->nodes.length () == ix)
+ cache->nodes.safe_push (t);
+ else
+ cache->nodes[ix] = t;
}
- else
+ if (cache->hashes.exists ())
{
- cache->nodes[ix] = t;
- if (cache->hashes.exists ())
+ if (cache->hashes.length () == ix)
+ cache->hashes.safe_push (hash);
+ else
cache->hashes[ix] = hash;
}
}
@@ -146,7 +145,7 @@ streamer_tree_cache_insert_1 (struct streamer_tree_cache_d *cache,
{
/* Determine the next slot to use in the cache. */
if (insert_at_next_slot_p)
- ix = cache->nodes.length ();
+ ix = cache->next_idx++;
else
ix = *ix_p;
*slot = ix;
@@ -211,7 +210,7 @@ void
streamer_tree_cache_append (struct streamer_tree_cache_d *cache,
tree t, hashval_t hash)
{
- unsigned ix = cache->nodes.length ();
+ unsigned ix = cache->next_idx++;
if (!cache->node_map)
streamer_tree_cache_add_to_node_array (cache, ix, t, hash);
else
@@ -326,7 +325,7 @@ preload_common_nodes (struct streamer_tree_cache_d *cache)
/* Create a cache of pickled nodes. */
struct streamer_tree_cache_d *
-streamer_tree_cache_create (bool with_hashes, bool with_map)
+streamer_tree_cache_create (bool with_hashes, bool with_map, bool with_vec)
{
struct streamer_tree_cache_d *cache;
@@ -334,7 +333,9 @@ streamer_tree_cache_create (bool with_hashes, bool with_map)
if (with_map)
cache->node_map = new pointer_map<unsigned>;
- cache->nodes.create (165);
+ cache->next_idx = 0;
+ if (with_vec)
+ cache->nodes.create (165);
if (with_hashes)
cache->hashes.create (165);
diff --git a/gcc/tree-streamer.h b/gcc/tree-streamer.h
index 2aca29a..20dbba0 100644
--- a/gcc/tree-streamer.h
+++ b/gcc/tree-streamer.h
@@ -52,6 +52,9 @@ struct streamer_tree_cache_d
vec<tree> nodes;
/* The node hashes (if available). */
vec<hashval_t> hashes;
+
+ /* Next index to assign. */
+ unsigned next_idx;
};
/* Return true if tree node EXPR should be streamed as a builtin. For
@@ -97,7 +100,7 @@ void streamer_tree_cache_append (struct streamer_tree_cache_d *, tree,
hashval_t);
bool streamer_tree_cache_lookup (struct streamer_tree_cache_d *, tree,
unsigned *);
-struct streamer_tree_cache_d *streamer_tree_cache_create (bool, bool);
+struct streamer_tree_cache_d *streamer_tree_cache_create (bool, bool, bool);
void streamer_tree_cache_delete (struct streamer_tree_cache_d *);
/* Return the tree node at slot IX in CACHE. */