This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: lto


Jan Hubicka wrote:
> Hi,
>   
>> Diego and Honza,
>>
>> Diego asked on IRC whether we were planning to be able to serialize
>> out all of the combined declarations.
>> My response was that we could certainly use the machinery that we
>> currently have to do this.
>>
>> However, I believe that Diego's motivation is somewhat flawed, and so,
>> for that matter, are the plans in the WHOPR document.
>>
>> I think it is clear that for LTO to be successful, we are going to
>> have to distribute the computation among either multiple processors in
>> a single box or multiple machines scattered throughout a network.
>> However, there are a lot of ways to skin this cat, and the WHOPR
>> approach appears quite complex.  I know that the Google people are not
>> scared of writing a lot of code, but in truth we can get what is
>> likely to be a good solution in fewer than 1000 lines of code, and I
>> believe that that is what we should shoot for first.
>>
>> Honza and I are a few days away from being able to read in the cgraph
>> from each of the .o files, analyze the cgraph, and then load the
>> functions from those same .o files, AS NECESSARY, for computation.
>>
>> It seems that there are two small pieces of code that need to be
>> written to make a distributed LTO:
>>
>> 1) A driver that knows what computational resources are available.
>>
>> Say you have N processors.  All that this driver has to do is spawn N
>> copies of the command line to the N different processors.  Each
>> command line differs in only one option: the index of that processor
>> in the group of resources, i.e. "-flto-index n" for n = 0 ... N-1.  If
>> you are using an MP, this is done with threads; if on a network, it is
>> done with ssh commands.
>>
>> It will also need to make sure the .o file produced has the n encoded
>> in it somehow so that things can be put back together later by the
>> linker.
>>     
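[A minimal sketch of the driver described in (1), for the local
N-processor case.  It assumes the hypothetical -flto-index option
discussed above; the compile command line and the fixed N are invented
for illustration, and the network case would hand the same command
string to ssh instead of a local shell.]

    /* Spawn N copies of the same command line, differing only in
       -flto-index and in the output name, so that the linker can put
       the pieces back together later.  */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define N 4   /* number of available processors (assumed fixed) */

    int
    main (void)
    {
      pid_t pids[N];

      for (int n = 0; n < N; n++)
        {
          pids[n] = fork ();
          if (pids[n] == 0)
            {
              char cmd[1024];
              snprintf (cmd, sizeof cmd,
                        "gcc -flto -c big.i -o big-%d.o -flto-index %d",
                        n, n);
              exit (system (cmd) != 0);
            }
        }

      /* Wait for all N compilations before handing off to the linker.  */
      for (int n = 0; n < N; n++)
        waitpid (pids[n], NULL, 0);
      return 0;
    }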
>
> I might be somewhat biased, since my mental model here (described a bit
> in the original cgraph document) has always been to simply produce a
> separate LTO .o file for each function (at least in an initial
> implementation) and compile it with the usual distribution tools (make
> -j or icecream, perhaps).  I see that for reasonable performance it
> would make sense to split it more coarsely and somehow balance the
> number of object files shipped against the fact that compilation time
> is quite unpredictable.
>
> The problem is that the .o files will have quite a large amount of
> redundancy: symbols and types that overlap, as well as summaries from
> IPA analysis, such as alias analysis for the types in question,
> functions to inline, and so on.  There is also another problem ....
>   
>> 2) The cgraph analyzer, which is the part that figures out what to
>> compile next, needs to do two additional tasks:
>>
>> a) Sort the functions by estimated size (after inlining).  This will
>> be the order that the functions are compiled in.  This sort needs to
>> be stable and deterministic (both easy constraints).
>>     
>
> ... at the moment the cgraph code sorts things to be compiled in
> topological order.  This order allows the low-level optimizers to
> propagate info from callee to caller.  At the moment only a few simple
> bits are propagated (such as stack frame requirements), but I believe
> this is a reasonable way to get simple interprocedural register
> allocation.
>
> This pass is especially irritating because of two properties:
> 1) it is a deeply RTL-level pass
> 2) it is important, so I don't think we should leave it out completely
> (one can come up with other cases of IPA optimizations that don't fit
> the framework)
>
> so I see a chance in trying to make clusters of the callgraph of
> reasonable sizes based on estimated call frequencies (a similar thing
> is used for program layout: you want often-called functions near each
> other).  This partitioned program will get this low-level IPA
> propagation (and perhaps some IPA optimizations that won't fit into the
> big IPA framework) locally.
>
> This, naturally, has a number of irritating implications, such as that
> more nodes or better balancing would imply worse code.  An alternative
> would be to insist on the topological ordering and communicate the
> information back from the compilation nodes to the compilation manager,
> making nodes wait if no function with all of its callees compiled is
> available, but that makes the whole thing considerably more
> complicated.
>
> I would wonder if anyone knows what others are doing here.
>   
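[A rough sketch of the greedy clustering Honza describes above: walk the
call edges from most to least frequently executed and merge the caller's
and callee's clusters whenever the result stays under a size cap.  All
of the data structures and names are invented for illustration; nothing
here corresponds to real cgraph interfaces.]

    #include <stdlib.h>

    struct cg_edge { int caller, callee; double freq; };

    static int *cluster_of;    /* union-find parent, one per function */
    static int *cluster_size;  /* total estimated size of each cluster */

    /* Union-find lookup with path halving.  */
    static int
    find (int x)
    {
      while (cluster_of[x] != x)
        x = cluster_of[x] = cluster_of[cluster_of[x]];
      return x;
    }

    /* Sort call edges so the most frequently executed come first.  */
    static int
    freq_cmp (const void *a, const void *b)
    {
      double fa = ((const struct cg_edge *) a)->freq;
      double fb = ((const struct cg_edge *) b)->freq;
      return fa < fb ? 1 : fa > fb ? -1 : 0;
    }

    void
    cluster_cgraph (struct cg_edge *edges, int n_edges,
                    const int *fn_size, int n_fns, int size_cap)
    {
      cluster_of = malloc (n_fns * sizeof *cluster_of);
      cluster_size = malloc (n_fns * sizeof *cluster_size);
      for (int i = 0; i < n_fns; i++)
        {
          cluster_of[i] = i;
          cluster_size[i] = fn_size[i];
        }

      qsort (edges, n_edges, sizeof *edges, freq_cmp);

      for (int i = 0; i < n_edges; i++)
        {
          int a = find (edges[i].caller), b = find (edges[i].callee);
          /* Keep hot caller/callee pairs together, but cap the cluster
             size so the clusters stay roughly balanced across nodes.  */
          if (a != b && cluster_size[a] + cluster_size[b] <= size_cap)
            {
              cluster_of[b] = a;
              cluster_size[a] += cluster_size[b];
            }
        }
    }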
>>   
>> b) Each of the spawned compilers actually compiles function X only if
>> its index in the sort is x and x mod N == the value of -flto-index.
>> The sort is important because it allows for load balancing with very
>> little effort.
>>     
>
> If we decide to completely give up the propagation of results of late
> optimizations to other functions (i.e. IPA register allocation, stack
> alignment, GOT pointer loading code removal, and such), this seems to
> make sense.  I see value in not needing to actually produce new .o
> files.  Determinism is probably not a big issue; we will require the
> same compiler version etc. anyway.
>
>   
I have to admit that I had not considered the issue of having the later
stages of compilation of one function feed into the early stages of
another.  The problem with this is that it implies a serial ordering of
the compilation, and that is not going to yield a usable system for
compiling very large programs.

One of the things to consider on this issue is how many of these
feedbacks can actually be preanalyzed.  Before I wrote the pure-const
IPA pass, we discovered whether a function was pure or const in some
late RTL pass, and only later functions (in the compilation order) were
able to take advantage of that fact.  By determining this as an IPA
pass, we have completely removed that ordering dependency.  I admit that
other dependencies will be harder, if not impossible, to accommodate,
but we may find that many of them can be.

I also have to admit that I did not consider the placement of functions
when I came up with my sorting strategy.  I admit that my first simple
trick will likely lead to a bad ordering of functions, but so will
producing a separate .o file for every function (unless we later have a
smart linker that can reorder the functions to get a good answer).

However, my meta point still stands.  There is some global deterministic
sorting that the cgraph analyzer can do, such that each processor, with
no additional coordination, can deterministically pick which functions
to compile in its own space.  For instance, we can use the existing
order and just have each process divide that order in some standard way
by N and take the nth slice for itself.
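[A minimal sketch of that slicing, assuming every one of the N compiler
processes sees the same combined cgraph and its own -flto-index value,
so that no coordination is needed.  The struct, the size estimate, and
the helper names are stand-ins for illustration, not real cgraph code.]

    #include <stdlib.h>

    struct fn_entry
    {
      const char *name;
      long est_size;     /* estimated size after inlining */
      int cgraph_order;  /* original, deterministic cgraph order */
    };

    static int
    size_cmp (const void *a, const void *b)
    {
      const struct fn_entry *fa = a, *fb = b;
      /* Biggest functions first, so the expensive ones get spread
         evenly across the N processes by the round-robin below.  */
      if (fa->est_size != fb->est_size)
        return fa->est_size > fb->est_size ? -1 : 1;
      /* Tie-break on the original cgraph order so the resulting order
         is identical in every process.  */
      return fa->cgraph_order - fb->cgraph_order;
    }

    /* Each process sorts the same list the same way and then takes
       every Nth entry, where lto_index comes from -flto-index.  */
    void
    compile_my_slice (struct fn_entry *fns, int n_fns,
                      int lto_index, int n_processes)
    {
      qsort (fns, n_fns, sizeof *fns, size_cmp);
      for (int i = 0; i < n_fns; i++)
        if (i % n_processes == lto_index)
          {
            /* compile_function (&fns[i]);  -- this process owns slot i.  */
          }
    }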

If you do this, you can then, at least locally within the cluster of
functions allocated to a single processor, do many of the tricks that
you describe above.

> Load balancing might be an issue here: at least my experience from GCC
> bootstrap is that all the time goes to insn-attrtab, which is not the
> majority of the source but is huge and drives our optimizers crazy.
> Large projects might always have some such dominating component that
> will make some nodes run a lot longer than others.  Also, expecting
> that all nodes have the same CPU power is not realistic IMO.
>
> It seems to me that this might be IO bound too.  Compile 1GB of
> ooffice source on 100 machines and you will distribute 100GB across
> your network.
>
>   
The numbers are not that simple to calculate, and the conclusion is not
that obvious.  Different file systems do different things.  I am old
enough to remember the Andrew File System, where when you opened any
file a copy of that entire file was brought onto your machine.  However,
I do not believe that that is the way modern network file systems work.

I do make the assumption that the IPA summaries are small compared to
the sizes of the functions that they summarize.  If not, we are all
doomed.

Then there is the question of the sizes of the global symbols.  If you
start from where I propose, we can always add a layer that combines the
symbols and writes them to a file that is then broadcast to the 100
servers.  But we do not have to do that to get things working.  The
advantage of my plan is that it is relatively simple, and we can graft
on more distribution from a central compilation manager as we discover
the bottlenecks.

My primary concern is inlining, intermodular inlining in particular.
Ultimately, inlining, and the downstream optimizations that are possible
because the function calls have gone away, is where we will get most of
the performance gain.  We have to ask the question: is it better to have
a single centralized processor do the inlining and parcel out the
inlined function bodies to the compilation servers, or is it better to
have the compilation servers open the various .o files and inline the
functions themselves?

In the former, you have a single process, and it is going to have to do
a lot of writing to files or sockets to pass the inlined functions and
all of the global state to the compilation servers.  This could take a
lot of wall time.  In the latter, you have a lot of files that are
fairly stable being made available to a lot of compilation servers, but
only a small portion of those files will actually be read in.

I seriously doubt that anyone can actually predict which will be faster
over a wide range of computation models.  Certainly, if the multicore
people start winning, those 100 servers may be only 10 to 15 boxes, and
the broadcasting is not that bad.

I am not making my suggestion based on some assumption that I know which
of the two schemes will perform best.  What I do know is this: (1) I
will most likely not live long enough to see a multithreaded version of
GCC, so any plan that says that that is the way to handle multicores is,
in my opinion, just a dream.  (2) My plan can work with a very small
amount of programming, so we get to start playing earlier.  (3) We can
add centralized smarts to handle bottlenecks as we measure them.

Writing lots of code first without the measurements just seems wrong.



> With the other way, you will distribute the compilation, get 1GB of
> object files (with a lot of header IO, but headers are hopefully just
> a minority, of course), collect them on the linker machine that will do
> the big thing once and produce separate object files, ship those back
> to the nodes, and collect the real objects (with assembly) and really
> link.
>
> So all the code gets shipped once as source, twice as LTO objects, and
> once as final objects.  But I am not a parallelization expert by any
> means.
>
> The last thing to consider is how to handle static declarations.  We
> need to give them unique names and make them exported from the .o file
> but not exported from the linked library/binary.  This is not a big
> deal, at least for ELF.
>   
>> What this scheme means is that all of the type merging and IPA pass
>> analysis will be replicated on all of the processors in the group.
>> However, this will not increase the wall time, only the CPU time
>> (what's time to a pig?).  It may, and most likely will, decrease the
>> wall time, which is the resource that we are actually most interested
>> in.
>>
>> I really do believe that this can be made to work very quickly, and
>> unless/until it is shown not to be good enough it is going to be the
>> fastest way to get the system up and working on more than one processor.
>> Writing out enough state to get things restarted on some other machine
>> is going to be a lot of work. 
>>
>> Of course all of this depends on our compiler being deterministic.  The
>> hashing police may have to be given special emergency powers to enforce
>> this, but the truth is we are pretty good about this now.
>>     
>
>
> Honza
>   
>> Comments?
>>
>> Kenny
>>     

