This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: WPA stream_out form & memory consumption



On 04/04/2014 05:10 PM, Martin Liška wrote:

On 04/03/2014 03:07 PM, Richard Biener wrote:
On Thu, Apr 3, 2014 at 2:07 PM, Martin Liška <mliska@suse.cz> wrote:
On 04/03/2014 11:41 AM, Richard Biener wrote:
On Wed, Apr 2, 2014 at 6:11 PM, Martin Liška <mliska@suse.cz> wrote:
On 04/02/2014 04:13 PM, Martin Liška wrote:

On 03/27/2014 10:48 AM, Martin Liška wrote:
The previous patch is wrong; I made a mistake in the name ;)

Martin

On 03/27/2014 09:52 AM, Martin Liška wrote:

On 03/25/2014 09:50 PM, Jan Hubicka wrote:
Hello,
I've been compiling Chromium with LTO and I noticed that WPA
stream_out forks and streams in parallel:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html.

I am unable to fit in 16GB of memory: ld uses about 8GB and lto1 about 6GB. When WPA starts to fork, memory consumption increases so much that lto1 is killed. I would appreciate a --param option to disable this WPA fork. The number of forks is taken from the build system (-flto=9), which is fine for the ltrans phase, because ld releases the aforementioned 8GB.

What do you think about that?
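
A minimal sketch of the kind of switch being requested, assuming a POSIX system. This is not GCC's code: stream_partitions, write_partition and max_jobs are hypothetical names, and max_jobs stands in for whatever value a --param would supply. With max_jobs <= 1 nothing is forked, so no extra copy of the WPA process is kept alive while streaming.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <functional>

// Reap one child; returns true if it exited with status 0.
// Sets *reaped to false when there was no child left to wait for.
static bool reap_one(bool *reaped)
{
  int status = 0;
  *reaped = wait(&status) > 0;
  return *reaped && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Hypothetical sketch: write n_parts partitions with at most max_jobs
// forked children running at once.  Returns the number of partitions
// whose writer reported success.
int stream_partitions(int n_parts, int max_jobs,
                      const std::function<bool(int)> &write_partition)
{
  int ok = 0;

  // Serial path: what a "disable the WPA fork" option would select.
  if (max_jobs <= 1)
    {
      for (int i = 0; i < n_parts; i++)
        ok += write_partition(i) ? 1 : 0;
      return ok;
    }

  int running = 0;
  for (int i = 0; i < n_parts; i++)
    {
      // Throttle: wait for a free slot before forking again.
      while (running >= max_jobs)
        {
          bool reaped;
          ok += reap_one(&reaped) ? 1 : 0;
          if (!reaped)
            break;
          running--;
        }

      pid_t pid = fork();
      if (pid == 0)
        _exit(write_partition(i) ? 0 : 1);  // child: write and leave
      if (pid < 0)
        ok += write_partition(i) ? 1 : 0;   // fork failed: do it in-process
      else
        running++;
    }

  // Collect the remaining children.
  while (running > 0)
    {
      bool reaped;
      ok += reap_one(&reaped) ? 1 : 0;
      if (!reaped)
        break;
      running--;
    }
  return ok;
}
```

Calling stream_partitions (n, 1, fn) keeps everything in one process; a larger max_jobs trades memory for wall-clock time, which is the trade-off discussed in this thread.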
I can take a look - our measurements suggested that the WPA memory will
be later dominated by ltrans. Perhaps Chromium does something that makes
WPA explode; that would be interesting to analyze.  I did not manage
to get through the Chromium LTO build process recently (ninja builds are not
my friends), can you send me the instructions?

Honza
Thanks,
Martin

Here are instructions for building Chromium with LTO:
1) install depot-tools and export the PATH variable according to the guide:
http://www.chromium.org/developers/how-tos/install-depot-tools
2) Check out the source code: gclient sync; cd src
3) Apply the patch (enables the system gold linker and disables LTO for a
sandbox that uses top-level asm)
4) `which ld` should point to ld.gold
5) ensure that ld.bfd points to ld.bfd
6) run: build/gyp_chromium -Dwerror=
7) ninja -C out/Release chrome -jX

If there are any problems, follow:
https://code.google.com/p/chromium/wiki/LinuxBuildInstructions

Martin

Hello,
using the latest trunk gcc, I built Firefox and Chromium. Both projects
were compiled without debugging symbols and with -O2 on an 8-core machine.

Firefox:
-flto=9, peak memory usage (in LTRANS): 11GB

Chromium:
-flto=6, peak memory usage (in parallel WPA phase ): 16.5GB

For details please see the attachment with graphs. The attachment also
contains the -fmem-report and -fmem-report-wpa output.
I think the claim of a memory footprint reduced to ~3.5GB is a bit optimistic:
http://gcc.gnu.org/gcc-4.9/changes.html

Is there any way we can reduce the memory footprint?

Attachment (due to size restriction):

https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing

Thank you,
Martin

The previous email presented somewhat misleading graphs (influenced by
--enable-gather-detailed-mem-stats).

Firefox:
-flto=9, WPA peak: 8GB, LTRANS peak: 8GB
-flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
-flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB

These data show that parallel WPA streaming increases the short-term memory
footprint by 4.5GB for -flto=9 and by 1.5GB for -flto=4, compared with the
3.5GB WPA peak at -flto=1.
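
The overhead figures are simply the parallel WPA peak minus the WPA peak measured at -flto=1. A small sketch of that arithmetic, using a hypothetical helper name and only the numbers from the table above:

```cpp
#include <cassert>

// Extra WPA peak memory (in GB) attributed to parallel streaming:
// the peak with -flto=N minus the peak with -flto=1.
double parallel_streaming_overhead_gb(double wpa_peak_parallel_gb,
                                      double wpa_peak_serial_gb)
{
  assert(wpa_peak_parallel_gb >= wpa_peak_serial_gb);
  return wpa_peak_parallel_gb - wpa_peak_serial_gb;
}
```

With the Firefox numbers, parallel_streaming_overhead_gb (8.0, 3.5) gives 4.5 for -flto=9 and parallel_streaming_overhead_gb (5.0, 3.5) gives 1.5 for -flto=4.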

For more details, please see the attachment.
The main overhead comes from maintaining the state during output of
the global types/decls.  We maintain somewhat "duplicate" info
here by having both the tree_ref_encoder and the streamer cache.
Eventually we can free the tree_ref_encoder pointer-map early, like with

Index: lto-streamer-out.c
===================================================================
--- lto-streamer-out.c  (revision 209018)
+++ lto-streamer-out.c  (working copy)
@@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)

     gcc_assert (!alias_pairs);

-  /* Write the global symbols.  */
+  /* Get rid of the global decl state hash tables to save some memory.  */
     out_state = lto_get_out_decl_state ();
-  num_fns = lto_function_decl_states.length ();
+  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+    if (out_state->streams[i].tree_hash_table)
+      {
+       delete out_state->streams[i].tree_hash_table;
+       out_state->streams[i].tree_hash_table = NULL;
+      }
+
+  /* Write the global symbols.  */
     lto_output_decl_state_streams (ob, out_state);
+  num_fns = lto_function_decl_states.length ();
     for (idx = 0; idx < num_fns; idx++)
       {
         fn_out_state =

as we do already for the fn state streams (untested).

We can also avoid re-allocating the output hashtable/vector by allocating,
after (or in) create_output_block, a bigger initial size for the
streamer_tree_cache.  Note that the pointer-set already expands if
the fill level is > 25%, and it really grows exponentially (similar to
hash_table, btw, but that grows only at 75% fill level).
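
To illustrate the pre-sizing idea in isolation, here is a rough sketch that counts how often a hash table replaces its bucket array while being filled. std::unordered_set is only a stand-in for GCC's pointer_set/hash_table (whose growth thresholds differ, as noted above), and count_rehashes is a hypothetical helper.

```cpp
#include <cstddef>
#include <unordered_set>

// Insert n distinct keys and count how many times the bucket array
// was reallocated along the way.  With presize set, room for n keys
// is reserved up front.
std::size_t count_rehashes(std::size_t n, bool presize)
{
  std::unordered_set<std::size_t> table;
  if (presize)
    table.reserve(n);

  std::size_t rehashes = 0;
  std::size_t buckets = table.bucket_count();
  for (std::size_t i = 0; i < n; i++)
    {
      table.insert(i);
      if (table.bucket_count() != buckets)
        {
          rehashes++;
          buckets = table.bucket_count();
        }
    }
  return rehashes;
}
```

When the final element count is known (or can be estimated) in advance, reserving once avoids the repeated grow-and-copy steps, each of which briefly holds both the old and the new table in memory.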

OTOH simply summing the lengths of all decl streams results in
a lower value than the actual number of output trees in the output block.
Humm.

But this is clearly the data structure that could be worth optimizing
in some way.  For example during writing we don't need the
streamer cache nodes array (we just need a counter to assign indexes).
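
A minimal sketch of what a writer-only cache could look like under that observation. This is not the streamer_tree_cache API; writer_cache and its members are hypothetical, and nodes are represented as plain pointers.

```cpp
#include <unordered_map>

// Writer-side cache: maps each node to the index it was assigned.
// No array of nodes is kept; a counter is enough to hand out indexes.
class writer_cache
{
public:
  // Look up NODE, assigning the next free index if it is new.
  // Stores the index in *IX_P and returns true if NODE was newly added.
  bool insert(const void *node, unsigned *ix_p)
  {
    auto res = index_of_.emplace(node, next_index_);
    if (res.second)
      next_index_++;
    *ix_p = res.first->second;
    return res.second;
  }

  // Number of distinct nodes seen so far.
  unsigned size() const { return next_index_; }

private:
  std::unordered_map<const void *, unsigned> index_of_;
  unsigned next_index_ = 0;
};
```

A reader still needs the index-to-node array to resolve references, so this simplification only applies to the output side.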

Attached is a patch that tries to do that plus the above (in testing
right now).  Maybe you can check if it makes a noticeable difference.

Richard.

I ran the test of your patch twice; according to the graphs, the memory footprint looks similar. It looks like the WPA phase was a bit faster after applying the patch, but that may be influenced by the HDD being heavily utilized at the end of WPA.
The graphs I sent were measured after running: echo 3 > /proc/sys/vm/drop_caches

Another idea is to use threads instead of a process fork, but I am not
familiar with the problems of sharing data between threads.
Try

Index: gcc/tree-streamer-out.c
===================================================================
--- gcc/tree-streamer-out.c     (revision 209054)
+++ gcc/tree-streamer-out.c     (working copy)
@@ -523,13 +523,6 @@ streamer_write_chain (struct output_bloc
  {
    while (t)
      {
-      tree saved_chain;
-
-      /* Clear TREE_CHAIN to avoid blindly recursing into the rest
-        of the list.  */
-      saved_chain = TREE_CHAIN (t);
-      TREE_CHAIN (t) = NULL_TREE;
-
        /* We avoid outputting external vars or functions by reference
          to the global decls section as we do not want to have them
          enter decl merging.  This is, of course, only for the call
@@ -541,7 +534,6 @@ streamer_write_chain (struct output_bloc
        else
         stream_write_tree (ob, t, ref_p);

-      TREE_CHAIN (t) = saved_chain;
        t = TREE_CHAIN (t);
      }

Hi!

I've finally written enhanced stats for RAM and CPU utilization. I ran 3 tests, which are named in the graph as follows:

1) TRUNK.LOG = trunk gcc 20140401
2) PATCH1.LOG = Richard's patches that are in attachment (saves ~300MB)
3) PATCH2.LOG = Richard's additional patch from the email I am replying to (streamer_write_chain change) (saves an additional ~1.5GB)

Good job Richard!

Note: the graphs are a work in progress; I should add a legend for the RAM graph, where dark blue = ld, lighter blue = lto1 WPA (without the additional memory after fork) and the rest are lto1 LTRANS. The blue lines display overall memory consumption taken from the 'free' program.

Graph link: https://drive.google.com/file/d/0B0pisUJ80pO1VS0zdVRoUTVURVU/edit?usp=sharing

Martin

Hello,
   here are stats for the streamed Chromium LTRANS objects:

246M    chrome.ltrans0.o
242M    chrome.ltrans10.o
245M    chrome.ltrans11.o
247M    chrome.ltrans12.o
248M    chrome.ltrans13.o
253M    chrome.ltrans14.o
244M    chrome.ltrans15.o
242M    chrome.ltrans16.o
359M    chrome.ltrans17.o
244M    chrome.ltrans18.o
297M    chrome.ltrans19.o
191M    chrome.ltrans1.o
50M    chrome.ltrans20.o
284M    chrome.ltrans21.o
190M    chrome.ltrans22.o
214M    chrome.ltrans23.o
230M    chrome.ltrans24.o
219M    chrome.ltrans25.o
235M    chrome.ltrans26.o
296M    chrome.ltrans27.o
281M    chrome.ltrans28.o
416M    chrome.ltrans29.o
322M    chrome.ltrans2.o
349M    chrome.ltrans30.o
283M    chrome.ltrans3.o
250M    chrome.ltrans4.o
241M    chrome.ltrans5.o
204M    chrome.ltrans6.o
198M    chrome.ltrans7.o
205M    chrome.ltrans8.o
321M    chrome.ltrans9.o
7.7G    total

Richard asked me to get data about decls sections:

.gnu.lto_.decls.4  34784856  170480445  162.6 MB  82.68%  0  0.0 B  0.00%
.gnu.lto_.decls.4  35442048  338188213  322.5 MB  90.34%  0  0.0 B  0.00%
.gnu.lto_.decls.4  33644416  210375388  200.6 MB  86.02%  0  0.0 B  0.00%
.gnu.lto_.decls.4  41153440  222168434  211.9 MB  84.20%  0  0.0 B  0.00%
.gnu.lto_.decls.4  38743824  214816021  204.9 MB  84.49%  0  0.0 B  0.00%
.gnu.lto_.decls.4  34815904  261343148  249.2 MB  88.07%  0  0.0 B  0.00%
.gnu.lto_.decls.4  35538200  218003166  207.9 MB  85.73%  0  0.0 B  0.00%
.gnu.lto_.decls.4  29311096  335188814  319.7 MB  91.91%  0  0.0 B  0.00%
.gnu.lto_.decls.4  37092008  219239738  209.1 MB  85.26%  0  0.0 B  0.00%
.gnu.lto_.decls.4  31988880  278001873  265.1 MB  89.56%  0  0.0 B  0.00%
.gnu.lto_.decls.4  33093656  189868273  181.1 MB  84.99%  0  0.0 B  0.00%
.gnu.lto_.decls.4  40208184  210523895  200.8 MB  83.65%  0  0.0 B  0.00%
.gnu.lto_.decls.4  41315304  156945830  149.7 MB  78.96%  0  0.0 B  0.00%
.gnu.lto_.decls.4  49383776  285180940  272.0 MB  85.02%  0  0.0 B  0.00%
.gnu.lto_.decls.4  23059400   27968474   26.7 MB  54.57%  0  0.0 B  0.00%
.gnu.lto_.decls.4  38980280  218925856  208.8 MB  84.58%  0  0.0 B  0.00%
.gnu.lto_.decls.4  43520848  216520786  206.5 MB  83.06%  0  0.0 B  0.00%
.gnu.lto_.decls.4  50497368  204989891  195.5 MB  80.12%  0  0.0 B  0.00%
.gnu.lto_.decls.4  31062216  261729526  249.6 MB  89.24%  0  0.0 B  0.00%

The 4th column shows the size of the section and the following one shows what portion of the object this section covers. If I sum these numbers, declarations take 6.65GB / 7.7GB (86%). When I compress each file with gzip, the overall size is 1.9 GB. I think that, if it's safe, it would be good to enable zlib compression for LTRANS objects.
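
For reference, the quoted ratios can be rechecked with a few lines. The helper names are hypothetical; the only inputs are figures from this message. The first row's 162.6 MB matches 170480445 bytes divided by 2^20, which suggests (as an assumption) that the tool reports binary megabytes.

```cpp
#include <cassert>

// Convert a byte count to binary megabytes (1 MB = 2^20 bytes here).
double to_mib(double bytes)
{
  return bytes / (1024.0 * 1024.0);
}

// Percentage that PART represents of WHOLE.
double share_percent(double part, double whole)
{
  assert(whole > 0.0);
  return 100.0 * part / whole;
}
```

share_percent (6.65, 7.7) is a little over 86, matching the figure above, and share_percent (1.9, 7.7) shows that the gzip-compressed objects are roughly a quarter of the original 7.7GB.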

Martin









