WHOPR partitioning algorithm

Jan Hubicka hubicka@ucw.cz
Mon Sep 6 15:59:00 GMT 2010


> > 32 seemed a reasonable default - not so high that streaming cost would
> > increase terribly, and still more than the number of CPUs on an average
> > developer workstation today.
> 
> Ok, we seem to disagree here ;)  I suppose we can change things once
> we get WPA memory use under control.

Yep, with WPA memory not dominating we will have a use for a maximal partition
size parameter. We don't want a single partition to need 2GB of RAM or so, but
we simply don't scale to projects of this size at the moment.

Still, I don't think we should be partitioning purely on the size limit, for the
reasons I described.  More partitions imply more streaming and thus a longer WPA
stage.  So when you compile GCC or Mozilla on a 24 CPU machine, the number of
partitions is pretty much the same even though GCC is much smaller.  So I don't
think it is a good idea to partition gigantic projects into very many partitions
just because the hard limit on single partition size is tuned for smaller projects.
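
To make the trade-off concrete, here is a minimal sketch of how such a decision
could look, assuming a default partition count plus a hard per-partition size
cap (the names are illustrative only, not actual GCC --param knobs):

/* Illustration only: pick the number of LTRANS partitions from a default
   count, but never let a single partition exceed a hard size cap.  The
   parameter names are hypothetical, not actual GCC --params.  */

#include <stddef.h>

static size_t
choose_n_partitions (size_t total_size, size_t default_partitions,
                     size_t max_partition_size)
{
  size_t n = default_partitions;

  if (n == 0)
    n = 1;

  /* Grow the partition count beyond the default only when the hard
     per-partition limit would otherwise be exceeded; this keeps the
     streaming overhead proportional to project size rather than being
     driven purely by the size cap.  */
  if (total_size / n > max_partition_size)
    n = (total_size + max_partition_size - 1) / max_partition_size;

  return n;
}
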
> 
> > > > The other algorithm I tried was the usual graph clustering algorithm - just
> > > > start with a partition per node and then concatenate partitions based on the
> > > > priority of edges until their size reaches the expected partition size.  But
> > > > this does not play well with my plan for function reordering, and since the
> > > > function reordering algorithm is essentially the same thing, just producing
> > > > one linear order instead of buckets, I expect that it will combine well with
> > > > the linear partitioning one. (and the differences are minimal anyway)
> > > 
> > > Hm, I don't see the issue with function reordering - it should exactly
> > > behave the same way (and thus is unneeded?)
> > 
> > Well, the partitioning algorithm clusters the graph minimizing the cross-cluster
> > edges, while the function reordering algorithm orders the functions so that most
> > calls are short distance and forward.
> > 
> > So when you decide at WPA stage what order the functions are in, you need to
> > partition them into linear segments of this order and keep the partitions in
> > order as they get to the linker, or assign each function a section and tell the
> > linker the order of the sections.  The second is more expensive.  So I basically want to
> > 1) decide on a linear order of functions
> > 2) partition the program linearly into these segments
> > 3) in each partition, arrange the function order in the final .s files (either via
> > gas section fragments or by just breaking our current topological compilation order)
> > 4) link the ltrans partitions in the original order
> 
> But if you cluster minimizing cross cluster edges most edges in a cluster
> will be intra-cluster edges.  Which means you can simply order functions
> inside a cluster and you should arrive at a similar (or the same) solution
> as with first ordering and then clustering.  Thus I'd do
> 
>  1) cluster minimizing cross-cluster edges
>  2) decide on linear order of functions in each cluster, late in LTRANS
> 
> Why is that not simpler or why would it not work?

It does not seem simpler to me, since one would have a similar clustering
algorithm implemented twice, and it is a bit harder than what I do on the
predefined sequence.

What you optimize for in function layout is not exactly equivalent to what you
optimize for when minimizing boundaries.  Your proposal gives precedence to
minimizing boundaries over runtime layout; mine goes the other way around.  No
idea how much practical difference that would make.
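
To make the order-first scheme above concrete, here is a rough sketch of cutting
a predefined function order into linear segments.  It only illustrates the idea;
the structures are made up and the real lto code looks different:

/* Sketch of linear partitioning over a predefined function order (the
   order is assumed to come from a separate reordering pass).  */

#include <stddef.h>

struct ordered_fn
{
  const char *name;   /* hypothetical; just for illustration */
  size_t size;        /* estimated size of the function body */
};

/* Cut the ordered array FNS[0..N-1] into segments of roughly TARGET
   size.  FIRST[i] receives the index of the first function of segment
   i (FIRST must have room for N entries).  Returns the number of
   segments.  The predefined order is never changed, only cut.  */

static size_t
partition_linear_order (const struct ordered_fn *fns, size_t n,
                        size_t target, size_t *first)
{
  size_t i, acc = 0, n_segments = 0;

  for (i = 0; i < n; i++)
    {
      if (acc == 0)
        first[n_segments++] = i;
      acc += fns[i].size;
      if (acc >= target)
        acc = 0;   /* close the current segment, start a new one */
    }
  return n_segments;
}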

From a high level I tend to like the idea of function ordering being a regular
IPA pass and WHOPR/ltrans partitioning being as transparent as possible, rather
than making function ordering a combination of the partitioning algorithm and a
late small IPA function ordering pass, both contributing to the final order.
It is not a grand difference either ;)
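
For comparison, the cluster-first alternative discussed above would amount to
something like the following greedy merge over edge priorities (again only a
sketch with invented structures, not the actual implementation; union-find is
used to track which partition each node ended up in):

/* Start with one partition per callgraph node and greedily merge
   partitions across the highest priority edges, refusing merges that
   would exceed the size target.  */

#include <stdlib.h>

struct cg_edge
{
  int src, dst;      /* node indexes */
  long priority;     /* e.g. estimated call frequency */
};

static int
find_rep (int *parent, int x)
{
  while (parent[x] != x)
    {
      parent[x] = parent[parent[x]];   /* path halving */
      x = parent[x];
    }
  return x;
}

static int
cmp_priority_desc (const void *a, const void *b)
{
  const struct cg_edge *ea = a, *eb = b;
  return (eb->priority > ea->priority) - (eb->priority < ea->priority);
}

/* Returns a PARENT array; nodes with the same representative ended up
   in the same partition.  NODE_SIZE[i] is the size estimate of node i.  */

static int *
cluster_by_edge_priority (struct cg_edge *edges, size_t n_edges,
                          const long *node_size, int n_nodes, long target)
{
  int *parent = malloc (n_nodes * sizeof *parent);
  long *size = malloc (n_nodes * sizeof *size);
  int i;
  size_t e;

  for (i = 0; i < n_nodes; i++)
    {
      parent[i] = i;
      size[i] = node_size[i];
    }

  qsort (edges, n_edges, sizeof *edges, cmp_priority_desc);
  for (e = 0; e < n_edges; e++)
    {
      int a = find_rep (parent, edges[e].src);
      int b = find_rep (parent, edges[e].dst);
      if (a != b && size[a] + size[b] <= target)
        {
          parent[b] = a;
          size[a] += size[b];
        }
    }
  free (size);
  return parent;
}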

What I would also like to support in the longer run is profile feedback
specifying in what order the functions are executed at runtime, and laying out
code based on that to optimize startup times.  This is already done by some
tools using iceberg and linker scripts.  Again, I guess it is sort of
implementable in both schemes - the partitioning algorithm should naturally
lean toward putting the startup code into a single partition and the local
ordering will handle it, but I can imagine it being somewhat fragile this way
(the partitioning algorithm might get other preferences or split the init code
rather irregularly for more complicated callgraphs).
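
As a sketch of what that profile-driven layout step could look like, assuming a
hypothetical per-function counter recording when the function first ran
(nothing like this exists yet; it only illustrates the intent):

/* Sort functions by a hypothetical "first executed at" counter so that
   code running early at startup ends up laid out together, and thus in
   the first partitions/pages.  */

#include <stdlib.h>

struct fn_profile
{
  const char *name;
  unsigned long first_run;   /* hypothetical: order of first execution */
};

static int
cmp_first_run (const void *a, const void *b)
{
  const struct fn_profile *fa = a, *fb = b;
  return (fa->first_run > fb->first_run) - (fa->first_run < fb->first_run);
}

static void
order_for_startup (struct fn_profile *fns, size_t n)
{
  qsort (fns, n, sizeof *fns, cmp_first_run);
}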

Honza


