WHOPR partitioning algorithm

Jan Hubicka hubicka@ucw.cz
Sat Sep 4 14:07:00 GMT 2010


> 
> Can you add -flto-partition=single so we can eventually get rid of
> -flto (or make it an alias for -fwhopr -flto-partition=single)?
> 
> That would avoid the above and eventually make the implementation
> simpler.

Well, we still need to test whether the user specified one algorithm or many.  But yes, I
can add single.  The problem is that we still create the tmp file then.  I have
plans to merge the WHOPR/LTO queues up to the point where we will be able to decide on the fly
what variant we want.  It is not that hard; we just need to make the LTO queue
read function bodies on demand instead of reading them all at once, which I hope
is possible (I did not look into merging issues with this idea yet).
> > +/* Return true if NODE should be partitioned.  */
> 
> What does "partitioned" mean?  Duplicating the node in more than
> 1 partition?  Or does !partitioned mean that it isn't needed at all?
> Please clarify.

Partitioned nodes are those interesting to the partitioning algorithm; nonpartitioned
nodes are dragged in on demand when computing the boundary.  I.e. COMDAT is ignored
by the partitioning algorithm and brought in only to the partitions that need it.
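To make the distinction concrete, here is a minimal sketch of such a predicate; the struct and names below are made up for illustration, not the actual cgraph representation:

/* Stand-in for a symbol table node; only the fields the sketch needs.  */
struct sym_node
{
  int defined;	/* Body or initializer is available in this unit.  */
  int comdat;	/* COMDAT: duplicated into whichever partition needs it.  */
};

/* Return nonzero if NODE is interesting to the partitioning algorithm.
   Everything else is added later, when the boundary of each partition
   is computed.  */
static int
node_should_be_partitioned_p (const struct sym_node *node)
{
  if (!node->defined)
    return 0;	/* External nodes are only dragged in via the boundary.  */
  if (node->comdat)
    return 0;	/* COMDAT is ignored here and copied on demand later.  */
  return 1;
}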
> > +  partition_size = total_size / PARAM_VALUE (PARAM_LTO_PARTITIONS);
> > +  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
> > +    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);
> 
> Hm, so there is no max-partition-size?

Well, I plan to do this, to avoid excessive memory use in the ltrans stage.
The problem is that at the moment the WPA stage does not scale that well,
so when we do the 32 partitions the ltrans stages consume a fraction of
the WPA memory that seems acceptable, so I left it out of the initial patch.
So the parameter buys you nothing: when you can't compile with the default
configuration, you either exploded in WPA or hit some memory usage bug in the
compiler.

I can add the parameter and tune it based on what I get in Mozilla.
> 
> I expected that we try to partition the program by trying to
> minimize reference edges between partitions while keeping
> partition size in a min/max bound.  Is that the graph algorithm

Yes, that is what the algorithm does; the min/max bound is computed
by dividing the program size by the number of partitions, rounding it up to
the minimal partition size, and then partitions 1/4th smaller or 1/4th bigger
are accepted.
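In code, the bound computation amounts to something like the following sketch (the parameter names are placeholders for the --param values, not the actual implementation):

static void
compute_partition_bounds (int total_size, int n_partitions,
                          int min_partition_size,
                          int *target, int *lower, int *upper)
{
  int size = total_size / n_partitions;

  if (size < min_partition_size)
    size = min_partition_size;

  *target = size;
  *lower = size * 3 / 4;	/* Partitions up to 1/4th smaller are accepted.  */
  *upper = size * 5 / 4;	/* ... as are partitions up to 1/4th bigger.  */
}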

> The above suggests that users have to adjust --param lto-partitions
> for very large programs to be able to compile them, which sounds
> bad.

The only place I expect users will actually need to tune lto-partitions is when
they want more than 32 partitions (have more CPUs).  I personally don't
much like the idea of making this depend on the WHOPR parameter, since this
would imply that the resulting binary would depend on the -j configuration of
the Makefile (and we probably don't have the number handy with -fwhopr=jobserv),
which would make reproducing the compilation harder.

Don't know here, really.
> 
> Again, when trying to do partitioning it would be really helpful
> if cgraph nodes and varpool nodes would share a common symtab
> base and references would just reference those ...
> 
> > +  npartitions = 1;
> > +  partition = new_partition ("");
> > +  if (cgraph_dump_file)
> > +    fprintf (cgraph_dump_file, "Total unit size: %i, partition size: %i\n",
> > +	     total_size, partition_size);
> 
> The following needs a _big_ comment describing what it is doing.
> What's its complexity?  What timevar does it get attributed to
> (maybe add a new one?)

There is the WPA output partition timevar.  Every node is added at most twice and every edge
is visited at most twice, so complexity should not be a problem.

As I tried to explain in the mail, the algorithm is simple.
We keep adding nodes (in a predefined order, currently reverse postorder)
into the current partition and compute the number of internal edges and external
edges on the go.  As soon as the partition is bigger than 3/4 of the expected size
we start looking for the point where the fraction reaches its minimum, and keep
looking up to 5/4 of the expected size.  Then we undo the process back to the place
where we reached the minimum and start a new partition.

Since the program size estimate is not really good - it does not include the size
of nodes that get duplicated - we always update the expected partition size
after each run.  Otherwise the algorithm tended to make a bit more partitions than
requested (like 40 instead of 32).

So it is a very primitive greedy algorithm minimizing the boundary size of every
partition.
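A self-contained sketch of that greedy pass follows.  It is not the actual cgraph code: nodes are reduced to their size and the internal/external edge bookkeeping is hidden behind a hypothetical CUT_COST callback that returns the boundary cost of ending the current partition after the first END nodes (in the real pass those counts are maintained incrementally while nodes are added):

struct pnode { int size; };

typedef int (*cut_cost_fn) (int end);

static int
greedy_partition (const struct pnode *nodes, int n_nodes,
		  int total_size, int max_partitions,
		  int min_partition_size, cut_cost_fn cut_cost)
{
  int npartitions = 1;
  int remaining_size = total_size;
  int partition_size = total_size / max_partitions;
  int i = 0;

  if (partition_size < min_partition_size)
    partition_size = min_partition_size;

  while (i < n_nodes)
    {
      int current_size = 0;
      int best_end = -1, best_cost = 0;

      /* Add nodes in the predefined order.  Once the partition is past
	 3/4 of the expected size, remember the cheapest cut seen so far;
	 stop growing at 5/4 of the expected size.  */
      do
	{
	  current_size += nodes[i].size;
	  i++;
	  if (current_size >= partition_size * 3 / 4
	      && (best_end < 0 || cut_cost (i) < best_cost))
	    {
	      best_end = i;
	      best_cost = cut_cost (i);
	    }
	}
      while (i < n_nodes && current_size < partition_size * 5 / 4);

      /* Undo back to the cheapest cut and close the partition there.  */
      if (best_end >= 0)
	while (i > best_end)
	  {
	    i--;
	    current_size -= nodes[i].size;
	  }

      /* Re-estimate the expected size from what is left; the original
	 estimate ignores duplicated boundary nodes and would otherwise
	 produce more partitions than requested.  */
      remaining_size -= current_size;
      if (i < n_nodes)
	{
	  npartitions++;
	  if (npartitions <= max_partitions)
	    partition_size = remaining_size / (max_partitions - npartitions + 1);
	  if (partition_size < min_partition_size)
	    partition_size = min_partition_size;
	}
    }

  return npartitions;
}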

The other algorithm I tried was the usual graph clustering algorithm - just
start with a partition per node and then concatenate partitions based on
the priority of edges until their size reaches the expected partition size.  But
this does not play well with my plan for function reordering, and since the function
reordering algorithm is essentially the same thing, just producing one linear order
instead of buckets, I expect it will combine well with the linear
partitioning one.  (And the differences are minimal anyway.)
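For comparison, here is a compact sketch of that clustering variant, again with made-up data structures rather than the cgraph ones; a union-find forest tracks the clusters and edges are merged in order of decreasing priority while the merged size stays within the expected partition size:

#include <stdlib.h>

struct cluster_edge { int a, b, priority; };

static int
edge_priority_cmp (const void *pa, const void *pb)
{
  const struct cluster_edge *ea = (const struct cluster_edge *) pa;
  const struct cluster_edge *eb = (const struct cluster_edge *) pb;
  /* Highest priority first.  */
  return (eb->priority > ea->priority) - (eb->priority < ea->priority);
}

/* Union-find lookup with path halving.  */
static int
find_root (int *parent, int x)
{
  while (parent[x] != x)
    x = parent[x] = parent[parent[x]];
  return x;
}

static void
cluster_partition (int n_nodes, const int *node_size,
		   struct cluster_edge *edges, int n_edges,
		   int expected_size, int *parent, int *cluster_size)
{
  int i;

  /* Start with one cluster per node.  */
  for (i = 0; i < n_nodes; i++)
    {
      parent[i] = i;
      cluster_size[i] = node_size[i];
    }

  qsort (edges, n_edges, sizeof *edges, edge_priority_cmp);

  /* Merge the clusters connected by the highest-priority edges as long
     as the result fits into the expected partition size.  */
  for (i = 0; i < n_edges; i++)
    {
      int ra = find_root (parent, edges[i].a);
      int rb = find_root (parent, edges[i].b);

      if (ra != rb && cluster_size[ra] + cluster_size[rb] <= expected_size)
	{
	  parent[rb] = ra;
	  cluster_size[ra] += cluster_size[rb];
	}
    }
}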

> > +/* WHOPR partitioning configuration.  */
> > +
> > +DEFPARAM (PARAM_LTO_PARTITIONS,
> > +	  "lto-partitions",
> > +	  "Number of paritions program should be split to",
> 
> As I understand this is really lto-max-partitions, right?

At the moment yes, until we get a maximal partition size ;)

OK, once we agree on the details, I will update the patch and add comments as
requested.

Honza
