WHOPR partitioning algorithm

Mon Sep 6 13:10:00 GMT 2010

On Mon, 6 Sep 2010, Jan Hubicka wrote:

> > 
> > Well, if with 32 partitions all partitions exceed max partition size
> > (well, or reasonable partition size), we should create more partitions.
> > 
> > The question is really if we should create sane partition sizes
> > by default or if we should create N partitions by default.  I'd vote
> > for sane partition sizes which makes the lto-partitions parameter
> > useless (I don't expect easy reproducible testcases anyway with
> > dynamic partitioning).
> 
> Well, even if I tended to do the same originally, i concluded that this won't
> work well.
> 
> With 32 partitions for Mozilla, we get about 4GB of WPA memory use, and about 2
> minutes of streaming.  Then each partition is about 200MB of memory, but they
> takes about 3 minutes to compile.
> 
> Partition size sort of define ltrans memory use and time of compilation. Since
> memory use in ltrans is not a problem (our WPA stage is way too big of memory
> hog, we save only about half of the memory), we can set hard limit on time.
> Now lets decide that we will tune partition size to compile in about 1 minute.
> This would be hard per se, since sizes of partitions are pretty rough estimates
> by anyway.
> 
> Then we would get about 3*32 partitions for mozilla that would just increase
> streaming time (32partitions is 1.5GB, 2000 partitions is 7GB) On the other
> hand on project that is small enough to compile in minute (lets say GCC) we
> would get no partitioning at all.
> 
> Since too much partitions has expenses too, I believe number of partitions
> should be large enough so ltrans does not dominate memory use for single CPU
> compilation (this is about 2-4 in current implementation I guess).
> 
> In parallel compilation the number of partitions should be higher than the
> number of CPUs available, but not terribly much higher since streaming size
> increase with too many parttions.
> 
> 32 seemed resonable default - not that high so streaming cost would terribly
> increase and still more than number of CPUs on average developer workstation
> today.

Ok, we seem to disagree here ;)  I suppose we can change things once
we get WPA memory use into control.

> > > The other algorithm I tried was the usual graph clustering algorithm - just
> > > start with partition per node and then start concatenating partitions based on
> > > priority of edges until their size reach the expected size of partition.  But
> > > this does not play well with my plan for function reordering and since function
> > > reordering algorithm is essentially same thing, just producing one linear order
> > > instead of buckets, i expect that it will combine well with the linear
> > > partitioning one. (and the differences are minimal anyway)
> > 
> > Hm, I don't see the issue with function reordering - it should exactly
> > behave the same way (and thus is unneeded?)
> 
> Well, partitioning algorithm clusters graphs minimizing the cross cluster edges,
> function reordering algorithm orders the functions so most calls are short distance
> and forwarding.
> 
> So when you at WPA stage decide in what order the functions are, you need to partition
> them in linear segments of this order and keep partition in order as they get to linker
> or assign each function section and tell linker order of the sections.
> The second is more expensive. So I basicaly want
> 1) decide on linear order of functions
> 2) partition program linearly in this segments
> 3) in each partition arrange function order in the final .s files (either via gas section fragments
> or by just breaking our current topological ocomplation order0
> 4) link the ltrans partition in the original order

But if you cluster minimizing cross cluster edges most edges in a cluster
will be intra-cluster edges.  Which means you can simply order functions
inside a cluster and you should arrive at a similar (or the same) solution
as with first ordering and then clustering.  Thus I'd do

 1) cluster minimizing cross-cluster edges
 2) decide on linear oder of functions in each cluster, late in LTRANS

Why is that not simpler or why would it not work?

Richard.