WHOPR partitioning algorithm

Richard Guenther rguenther@suse.de
Mon Sep 6 13:10:00 GMT 2010


On Mon, 6 Sep 2010, Jan Hubicka wrote:

> > 
> > Well, if with 32 partitions all partitions exceed max partition size
> > (well, or reasonable partition size), we should create more partitions.
> > 
> > The question is really if we should create sane partition sizes
> > by default or if we should create N partitions by default.  I'd vote
> > for sane partition sizes which makes the lto-partitions parameter
> > useless (I don't expect easy reproducible testcases anyway with
> > dynamic partitioning).
> 
> Well, even if I tended to do the same originally, i concluded that this won't
> work well.
> 
> With 32 partitions for Mozilla, we get about 4GB of WPA memory use, and about 2
> minutes of streaming.  Then each partition is about 200MB of memory, but they
> takes about 3 minutes to compile.
> 
> Partition size sort of define ltrans memory use and time of compilation. Since
> memory use in ltrans is not a problem (our WPA stage is way too big of memory
> hog, we save only about half of the memory), we can set hard limit on time.
> Now lets decide that we will tune partition size to compile in about 1 minute.
> This would be hard per se, since sizes of partitions are pretty rough estimates
> by anyway.
> 
> Then we would get about 3*32 partitions for mozilla that would just increase
> streaming time (32partitions is 1.5GB, 2000 partitions is 7GB) On the other
> hand on project that is small enough to compile in minute (lets say GCC) we
> would get no partitioning at all.
> 
> Since too much partitions has expenses too, I believe number of partitions
> should be large enough so ltrans does not dominate memory use for single CPU
> compilation (this is about 2-4 in current implementation I guess).
> 
> In parallel compilation the number of partitions should be higher than the
> number of CPUs available, but not terribly much higher since streaming size
> increase with too many parttions.
> 
> 32 seemed resonable default - not that high so streaming cost would terribly
> increase and still more than number of CPUs on average developer workstation
> today.

Ok, we seem to disagree here ;)  I suppose we can change things once
we get WPA memory use into control.

> > > The other algorithm I tried was the usual graph clustering algorithm - just
> > > start with partition per node and then start concatenating partitions based on
> > > priority of edges until their size reach the expected size of partition.  But
> > > this does not play well with my plan for function reordering and since function
> > > reordering algorithm is essentially same thing, just producing one linear order
> > > instead of buckets, i expect that it will combine well with the linear
> > > partitioning one. (and the differences are minimal anyway)
> > 
> > Hm, I don't see the issue with function reordering - it should exactly
> > behave the same way (and thus is unneeded?)
> 
> Well, partitioning algorithm clusters graphs minimizing the cross cluster edges,
> function reordering algorithm orders the functions so most calls are short distance
> and forwarding.
> 
> So when you at WPA stage decide in what order the functions are, you need to partition
> them in linear segments of this order and keep partition in order as they get to linker
> or assign each function section and tell linker order of the sections.
> The second is more expensive. So I basicaly want
> 1) decide on linear order of functions
> 2) partition program linearly in this segments
> 3) in each partition arrange function order in the final .s files (either via gas section fragments
> or by just breaking our current topological ocomplation order0
> 4) link the ltrans partition in the original order

But if you cluster minimizing cross cluster edges most edges in a cluster
will be intra-cluster edges.  Which means you can simply order functions
inside a cluster and you should arrive at a similar (or the same) solution
as with first ordering and then clustering.  Thus I'd do

 1) cluster minimizing cross-cluster edges
 2) decide on linear oder of functions in each cluster, late in LTRANS

Why is that not simpler or why would it not work?

Richard.



More information about the Gcc-patches mailing list