This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: WPA stream_out form & memory consumption


On Thu, Apr 3, 2014 at 2:07 PM, Martin Liška <mliska@suse.cz> wrote:
>
> On 04/03/2014 11:41 AM, Richard Biener wrote:
>>
>> On Wed, Apr 2, 2014 at 6:11 PM, Martin Liška <mliska@suse.cz> wrote:
>>>
>>> On 04/02/2014 04:13 PM, Martin Liška wrote:
>>>>
>>>>
>>>> On 03/27/2014 10:48 AM, Martin Liška wrote:
>>>>>
>>>>> The previous patch is wrong; I made a mistake in a name ;)
>>>>>
>>>>> Martin
>>>>>
>>>>> On 03/27/2014 09:52 AM, Martin Liška wrote:
>>>>>>
>>>>>>
>>>>>> On 03/25/2014 09:50 PM, Jan Hubicka wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>      I've been compiling Chromium with LTO and I noticed that WPA
>>>>>>>> stream_out forks and streams in parallel:
>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html.
>>>>>>>>
>>>>>>>> I am unable to fit into 16GB of memory: ld uses about 8GB and lto1
>>>>>>>> about 6GB.  When WPA starts to fork, memory consumption increases so
>>>>>>>> much that lto1 is killed.  I would appreciate a --param option to
>>>>>>>> disable this WPA fork.  The number of forks is taken from the build
>>>>>>>> system (-flto=9), which is fine for the ltrans phase, because ld has
>>>>>>>> released the aforementioned 8GB by then.
>>>>>>>>
>>>>>>>> What do you think about that?
>>>>>>>
>>>>>>> I can take a look - our measurements suggested that WPA memory
>>>>>>> will later be dominated by ltrans.  Perhaps Chromium does something
>>>>>>> that makes WPA explode; that would be interesting to analyze.  I
>>>>>>> have not managed to get through the Chromium LTO build process
>>>>>>> recently (ninja builds are not my friends) - can you send me the
>>>>>>> instructions?
>>>>>>>
>>>>>>> Honza
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> Here are instructions for building Chromium with LTO:
>>>>>> 1) install depot-tools and export the PATH variable according to the guide:
>>>>>> http://www.chromium.org/developers/how-tos/install-depot-tools
>>>>>> 2) check out the source code: gclient sync; cd src
>>>>>> 3) apply the patch (enables the system gold linker and disables LTO
>>>>>> for a sandbox that uses top-level asm)
>>>>>> 4) `which ld` should point to ld.gold
>>>>>> 5) ensure that ld.bfd points to ld.bfd
>>>>>> 6) run: build/gyp_chromium -Dwerror=
>>>>>> 7) ninja -C out/Release chrome -jX
>>>>>>
>>>>>> If there are any problems, follow:
>>>>>> https://code.google.com/p/chromium/wiki/LinuxBuildInstructions
>>>>>>
>>>>>> Martin
>>>>>>
>>>> Hello,
>>>>    taking the latest trunk GCC, I built Firefox and Chromium.  Both
>>>> projects were compiled without debugging symbols, with -O2, on an
>>>> 8-core machine.
>>>>
>>>> Firefox:
>>>> -flto=9, peak memory usage (in LTRANS): 11GB
>>>>
>>>> Chromium:
>>>> -flto=6, peak memory usage (in parallel WPA phase): 16.5GB
>>>>
>>>> For details, please see the attached graphs.  The attachment also
>>>> contains -fmem-report and -fmem-report-wpa output.
>>>> I think the reported reduction of the memory footprint to ~3.5GB is a
>>>> bit optimistic:
>>>> http://gcc.gnu.org/gcc-4.9/changes.html
>>>>
>>>> Is there any way we can reduce the memory footprint?
>>>>
>>>> Attachment (due to size restriction):
>>>>
>>>> https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing
>>>>
>>>> Thank you,
>>>> Martin
>>>
>>>
>>> The previous email presented somewhat misleading graphs (influenced by
>>> --enable-gather-detailed-mem-stats).
>>>
>>> Firefox:
>>> -flto=9, WPA peak: 8GB, LTRANS peak: 8GB
>>> -flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
>>> -flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB
>>>
>>> These data show that parallel WPA streaming increases the short-term
>>> memory footprint by 4.5GB for -flto=9 (respectively by 1.5GB in the
>>> case of -flto=4).
>>>
>>> For more details, please see the attachment.
>>
>> The main overhead comes from maintaining the state during output of
>> the global types/decls.  We maintain somewhat "duplicate" info
>> here by having both the tree_ref_encoder and the streamer cache.
>> Eventually we can free the tree_ref_encoder pointer-map early, like with
>>
>> Index: lto-streamer-out.c
>> ===================================================================
>> --- lto-streamer-out.c  (revision 209018)
>> +++ lto-streamer-out.c  (working copy)
>> @@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)
>>
>>     gcc_assert (!alias_pairs);
>>
>> -  /* Write the global symbols.  */
>> +  /* Get rid of the global decl state hash tables to save some memory.
>> */
>>     out_state = lto_get_out_decl_state ();
>> -  num_fns = lto_function_decl_states.length ();
>> +  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
>> +    if (out_state->streams[i].tree_hash_table)
>> +      {
>> +       delete out_state->streams[i].tree_hash_table;
>> +       out_state->streams[i].tree_hash_table = NULL;
>> +      }
>> +
>> +  /* Write the global symbols.  */
>>     lto_output_decl_state_streams (ob, out_state);
>> +  num_fns = lto_function_decl_states.length ();
>>     for (idx = 0; idx < num_fns; idx++)
>>       {
>>         fn_out_state =
>>
>> as we do already for the fn state streams (untested).
>>
>> We can also avoid re-allocating the output hashtable/vector by
>> allocating a bigger initial size for the streamer_tree_cache, after (or
>> in) create_output_block.  Note that the pointer-set already expands when
>> the fill level exceeds 25%, and it grows exponentially (similar to
>> hash_table, btw, though that grows only at a 75% fill level).
>>
>> OTOH simply summing the lengths of all decl streams results in
>> a lower value than the actual number of output trees in the output block.
>> Humm.
>>
>> But this is clearly the data structure that could be worth optimizing
>> in some way.  For example during writing we don't need the
>> streamer cache nodes array (we just need a counter to assign indexes).
>>
>> Attached is a patch that tries to do that, plus the above (in testing
>> right now).  Maybe you can check whether it makes a noticeable
>> difference.
>>
>> Richard.
>
>
> I ran the test of your patch twice; according to the graphs, the memory
> footprint looks similar.  After applying the patch, the WPA phase seemed
> a bit faster, but that can be influenced by the HDD being heavily
> utilized at the end of WPA.  The graphs I sent were produced after:
> echo 3 > /proc/sys/vm/drop_caches
>
> Another idea is to use threads instead of forking processes.  But I am
> not familiar with the problems of sharing data between threads.

Try

Index: gcc/tree-streamer-out.c
===================================================================
--- gcc/tree-streamer-out.c     (revision 209054)
+++ gcc/tree-streamer-out.c     (working copy)
@@ -523,13 +523,6 @@ streamer_write_chain (struct output_bloc
 {
   while (t)
     {
-      tree saved_chain;
-
-      /* Clear TREE_CHAIN to avoid blindly recursing into the rest
-        of the list.  */
-      saved_chain = TREE_CHAIN (t);
-      TREE_CHAIN (t) = NULL_TREE;
-
       /* We avoid outputting external vars or functions by reference
         to the global decls section as we do not want to have them
         enter decl merging.  This is, of course, only for the call
@@ -541,7 +534,6 @@ streamer_write_chain (struct output_bloc
       else
        stream_write_tree (ob, t, ref_p);

-      TREE_CHAIN (t) = saved_chain;
       t = TREE_CHAIN (t);
     }



> Martin
>
>
>>
>>> Martin
>
>

