This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [RFC] Old school parallelization of WPA streaming

From: Richard Biener <rguenther at suse dot de>
To: Jan Hubicka <hubicka at ucw dot cz>
Cc: Michael Matz <matz at suse dot de>, Andi Kleen <ak at linux dot intel dot com>, gcc-patches at gcc dot gnu dot org, dnovillo at google dot com, dmalcolm at redhat dot com, jakub at redhat dot com
Date: Thu, 29 Aug 2013 15:39:56 +0200 (CEST)
Subject: Re: [RFC] Old school parallelization of WPA streaming
Authentication-results: sourceware.org; auth=none
References: <20130821141747 dot GD24782 at kam dot mff dot cuni dot cz> <20130821145853 dot GB2139 at tassilo dot jf dot intel dot com> <aa686933-7ae6-461b-9403-d5d47ad11289 at email dot android dot com> <alpine dot LNX dot 2 dot 00 dot 1308281728110 dot 9949 at wotan dot suse dot de> <alpine dot LNX dot 2 dot 00 dot 1308290931540 dot 20077 at zhemvz dot fhfr dot qr> <20130829125103 dot GB20627 at kam dot mff dot cuni dot cz>

On Thu, 29 Aug 2013, Jan Hubicka wrote:

> Jakub,
> I am adding you to CC since I put my current thoughts on LTO and debug info
> in here.
> > > Fork-fire-forget is really a much simpler choice here IMO; no worries 
> > > about shared resources, less debug hassle.
> > 
> > It might not be as cheap on other hosts as it is on Linux hosts, of
> > course.  Also I'd rather try to avoid I/O than solving the issue
> 
> I still have some items on my list here:
>  1) avoid having function sections decompressed by WPA
>     (this won't bring much compile-time improvement, as decompression is
>      well below 10% of runtime)

still low-hanging

finally get an LTO section header!  (with a flag telling whether the
section is compressed)
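
A rough sketch of what such a header could look like, so WPA can check
one flag and pass a compressed function body through untouched.  The
struct layout, field names and flag value are made up for illustration;
this is not an existing GCC on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bit: the payload after the header is compressed.  */
#define SKETCH_SECTION_COMPRESSED 0x1u

/* Hypothetical fixed-size header placed at the start of each section.  */
struct sketch_section_header
{
  uint16_t major_version;
  uint16_t minor_version;
  uint32_t flags;
  uint64_t payload_size;   /* Size in bytes of the data after the header.  */
};

/* Return nonzero if the section payload is stored compressed.  */
int
sketch_section_compressed_p (const struct sketch_section_header *h)
{
  assert (h != NULL);
  return (h->flags & SKETCH_SECTION_COMPRESSED) != 0;
}

/* Copy the header out of the raw section DATA of length LEN.  Return a
   pointer to the payload, or NULL if the section is too short to hold
   a header.  */
const unsigned char *
sketch_read_header (const unsigned char *data, size_t len,
                    struct sketch_section_header *out)
{
  assert (data != NULL && out != NULL);
  if (len < sizeof *out)
    return NULL;
  memcpy (out, data, sizeof *out);
  return data + sizeof *out;
}
```

With something like this, a consumer that only copies a section into a
partition can test sketch_section_compressed_p and skip the
decompress/recompress round trip.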

>  2) put variable initializers into named sections just as function bodies
>     are.
>     Seeing Martin's systemtaps of firefox/gimp/inkscape, to my surprise the
>     initializers are actually about as big as the text segment.  While
>     it seems a bit wasteful to put a single integer_cst there (and we can
>     special-case this), there seems to be promise for vtables
>     and other stuff.
> 
>     To make devirt work, we will need to load vtables into memory (or
>     invent a representation to stream them some other way that would be
>     similarly big).  Still, we will avoid the need to load them in 5000
>     copies and merge them.
>  3) I think a good part of the function/partitioning overhead is because abstract
>     origin streaming is utterly broken.
> 
>     Currently we can have DECL_ABSTRACT_ORIGIN on a function.  This I can now
>     track by the used_as_abstract_origin flag, and I can stream those functions
>     into the partitions using them.
> 
>     This is still wrong for a multitude of reasons:
> 
>     1) we really want the DECL_INITIAL tree of the functions used as abstract
>        origins in the form it had before any gimple optimizations happened on them
>        (that is, when the debug hook is called).
>        This is not what happens - we stream the tree as it looks at
>        LTO streaming time - i.e. after early optimizations.
> 
>        I think we may just (at the time of calling the debug hook) duplicate DECL_INITIAL
>        the same way we duplicate decls for save_function_body, saving it elsewhere
>        and making this tree the abstract origin of the offline copy of the function itself.
> 
>     2) dwarf2out doesn't really load the DECL_INITIAL tree, so it does something useful
>        only when it is already there.
>        It can simply call cgraph_get_body when it needs the DECL_INITIAL, but it
>        doesn't because push_cfun causes an ICE.
>        If we really can't push_cfun from the middle of the RTL queue, I suppose I can
>        just save it elsewhere.
> 
>     3) It is not only the toplevel decl that has an origin, but all local vars in the
>        function.
> 
>        I think this goes terribly wrong - these decls are not indexable so they
>        are stored into function section of every function referring to them.
>        They are then read in many duplicates and never merged with the DECL_INITIAL
>        tree of the actual abstract origin. For some reason dwarf2out doesn't
>        seem to ICE, but I also do not see how this can produce working debug.
>        Moreover I think the duplicates contribute to our current debug info
>        size problems with LTO.
> 
>        If we solve 1) as discussed above (i.e. by having separate
>        block trees for functions that are abstract origins), we can then solve 3)
>        by streaming those into the global decl stream and making cross-function_context
>        tree references global.
> 
>     4) Of course, after early inlining a function may need abstract origins from
>        multiple other functions.  I do not track this at all.
>        It may be easy to just collect a vector of the functions that are needed
>        into cgraph_node.
> 
>     Of course solving 1)-4) is a bit of early debug info without actually going to
>     stream the dwarf DIEs, but by using the BLOCK trees as a temporary representation.
>     Incrementally we can have this saved BLOCK tree become a dwarf DIE and have
>     origins point to them instead of decls.
> 
>     To get reasonable streaming performance it would be nice to have a way to get
>     abstract origin references cross-partition that debug info can accomplish.

Most of the abstract origin stuff is dropped on the floor by streaming
because you cannot really stream that stuff.  And yes, we need early
debug info to generate the offline abstract origin copy of later inlined
functions, and yes, we have to handle streaming / referencing those in
some way.  But OTOH abstract origins are an optimization for debug info
size, so we can as well not have them.
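
If the origins are kept, the bookkeeping from item 4) in the quoted list
(a per-function vector of the functions whose abstract origins are
needed) could be sketched roughly as below.  This is a standalone toy:
struct sketch_node, its uid field and both helpers are hypothetical
stand-ins and not the real cgraph_node API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a call graph node.  */
struct sketch_node
{
  int uid;            /* Identifier of this function.  */
  int *origins;       /* uids of functions needed as abstract origins.  */
  size_t n_origins;   /* Number of entries used in ORIGINS.  */
  size_t cap;         /* Number of entries allocated in ORIGINS.  */
};

/* Record that NODE needs ORIGIN_UID as an abstract origin.  Duplicates
   are ignored.  Return 0 on success, -1 if allocation fails.  */
int
sketch_add_origin (struct sketch_node *node, int origin_uid)
{
  assert (node != NULL);
  for (size_t i = 0; i < node->n_origins; i++)
    if (node->origins[i] == origin_uid)
      return 0;
  if (node->n_origins == node->cap)
    {
      size_t new_cap = node->cap ? node->cap * 2 : 4;
      int *p = realloc (node->origins, new_cap * sizeof *p);
      if (p == NULL)
        return -1;
      node->origins = p;
      node->cap = new_cap;
    }
  node->origins[node->n_origins++] = origin_uid;
  return 0;
}

/* Called when CALLEE is inlined into CALLER: the caller now needs the
   callee itself plus everything the callee already needed.  */
int
sketch_note_inlined (struct sketch_node *caller,
                     const struct sketch_node *callee)
{
  assert (caller != NULL && callee != NULL && caller != callee);
  if (sketch_add_origin (caller, callee->uid) != 0)
    return -1;
  for (size_t i = 0; i < callee->n_origins; i++)
    if (sketch_add_origin (caller, callee->origins[i]) != 0)
      return -1;
  return 0;
}
```

Partitioning could then walk this vector to decide which origin bodies
have to travel with a function; the linear duplicate check is only
reasonable while the vectors stay short.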

> That said, I now have the fork() patch in all my trees and enjoy 50% faster
> WPA times.  I changed my mind about the claim that streaming should be disk bound -
> it is hard to hope for disk boundness for something that should fit in cache.

It should at least limit its fork rate according to -flto=N or jobserver.
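
A minimal sketch of such a limit, assuming plain POSIX fork()/wait() and
a fixed cap taken from -flto=N.  It does not talk to a make jobserver,
and the function names and the callback are hypothetical, not taken from
the patch.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for any one child.  Return 0 if it exited cleanly, 1 if it
   failed, -1 if wait itself failed.  */
static int
sketch_reap_one (void)
{
  int status;
  if (wait (&status) < 0)
    return -1;
  return !(WIFEXITED (status) && WEXITSTATUS (status) == 0);
}

/* Example callback: pretend that odd-numbered partitions fail.  */
int
sketch_example_job (int partition)
{
  return partition % 2;
}

/* Stream N_PARTITIONS partitions, each in its own forked child, with at
   most MAX_PARALLEL children alive at any time.  STREAM_ONE returns 0
   on success.  Return the number of partitions that failed.  */
int
sketch_stream_partitions (int n_partitions, int max_parallel,
                          int (*stream_one) (int))
{
  int running = 0, failures = 0;
  assert (max_parallel > 0 && stream_one != NULL);
  for (int i = 0; i < n_partitions; i++)
    {
      /* Throttle: wait for a free slot before forking again.  */
      if (running == max_parallel)
        {
          failures += sketch_reap_one () != 0;
          running--;
        }
      pid_t pid = fork ();
      if (pid < 0)
        {
          /* fork failed: fall back to streaming in this process.  */
          failures += stream_one (i) != 0;
          continue;
        }
      if (pid == 0)
        _exit (stream_one (i) != 0);   /* Child: stream and exit.  */
      running++;
    }
  /* Collect the children that are still running.  */
  while (running-- > 0)
    failures += sketch_reap_one () != 0;
  return failures;
}
```

For example, sketch_stream_partitions (n, 4, job) never keeps more than
four children alive, which is the behaviour one would expect from
-flto=4.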

> We went down from 5GB to 2GB of streaming for Firefox, which is good.  But we will
> see 4GB again once Martin's code layout work lands.  I think it is in good
> part because of the origin fun above.

Ugh.

Richard.

Follow-Ups:
- Re: [RFC] Old school parallelization of WPA streaming
  - From: Jan Hubicka

References:
- [RFC] Old school parallelization of WPA streaming
  - From: Jan Hubicka
- Re: [RFC] Old school parallelization of WPA streaming
  - From: Andi Kleen
- Re: [RFC] Old school parallelization of WPA streaming
  - From: Richard Biener
- Re: [RFC] Old school parallelization of WPA streaming
  - From: Michael Matz
- Re: [RFC] Old school parallelization of WPA streaming
  - From: Richard Biener
- Re: [RFC] Old school parallelization of WPA streaming
  - From: Jan Hubicka
