[PATCH 2/6] Implement a new partitioner for parallel compilation

Giuliano Belinassi giuliano.belinassi@usp.br
Thu Aug 27 21:42:04 GMT 2020


Hi, Honza.

Again, thank you for your detailed review!

On 08/27, Jan Hubicka wrote:
> > When using the LTO infrastructure to compile files in parallel, we
> > can't simply use any of the LTO partitioner, once extra dependency
> > analysis is required to ensure that some nodes are correctly
> > partitioned together.
> > 
> > Therefore, here we implement a new partitioner called
> > "lto_merge_comdat_map" that does all these required analysis.
> > The partitioner works as follows:
> > 
> > 1. We create a number of disjoint sets and inserts each node into a
> >    separate set, which may be merged together in the future.
> > 
> > 2. Find COMDAT groups, and mark them to be partitioned together.
> > 
> > 3. Check all nodes that would require any COMDAT group to be
> >    copied to its partition (which we name "COMDAT frontier"),
> >    and mark them to be partitioned together.
> >    This avoids duplication of COMDAT groups and crashes on the LTO
> >    partitioning infrastructure.
> 
> What kind of crash you get here?

This assertion.

	  bool added = add_symbol_to_partition_1 (part, node1);
	  gcc_assert (added);

It checks if the COMDAT node was not already inserted into somewhere
else partition.

> > 
> > 4. Check if the user allows the partitioner to promote non-public
> >    functions or variables to global to improve parallelization
> >    opportunity with a cost of modifying the output code layout.
> > 
> > 5. Balance generated partitions for performance unless not told to.
> > 
> > The choice of 1. was by design, so we could use a union-find
> > data structure, which are know for being very fast on set unite
> > operations.
> 
> In LTO partitioner the groups of objects that "must go toghether"
> are discovered when first object is placed into the partition (via
> add_to_partition) because with the LTO rules it is always possible to
> discover all members from the group starting from any random element via
> graph walking.
> 
> I guess it is same with your partitioner?  I basically wonder how much
> code can be shared and what needs to be duplicated.
> It is not very nice to have partitioning implemented twice - it is bit
> subtle problem when it comes to details so I would be happier if we
> brought in the lto/lto-partition.c to middle end and updaed/cleaned it
> up as needed.

They are almost the same thing, they group together nodes that
require to be in the same partition before deciding how to partition
them.

Things are a little different in the way that this partitioner starts
with n partitions and merge nodes together as we decide that these
nodes needs to be in the same partition.  The advantage of this is that
merging partitions is quite cheap, but the drawback is that I can't
undo partitions easily. You can also see that I only use the
add_node_to_partition after it decides what nodes should be in the
partition.

I think that if there is a way to avoid failing that assertion that I
mentioned above, we could even get rid of this step and use the
balanced_map partitioner.

> > 
> > For 3. to work properly, we also had to modify
> > lto_promote_cross_file_statics to handle this case.
> > 
> > The parameters --param=promote-statics and --param=balance-partitions
> > control 4. and 5., respectively
> > 
> > gcc/ChangeLog:
> > 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
> > 
> > 	* Makefile.in: Add lto-partition.o
> > 	* cgraph.h (struct symtab_node::aux2): New variable.
> > 	* lto-partition.c: Move from gcc/lto/lto-partition.c
> > 	(add_symbol_to_partition_1): Only compute insn size
> > 	if information is available.
> > 	(node_cmp): Same as above.
> > 	(class union_find): New.
> > 	(ds_print_roots): New function.
> > 	(balance_partitions): New function.
> > 	(build_ltrans_partitions): New function.
> > 	(merge_comdat_nodes): New function.
> > 	(merge_static_calls): New function.
> > 	(merge_contained_symbols): New function.
> > 	(lto_merge_comdat_map): New function.
> > 	(privatize_symbol_name_1): Handle when WPA is not enabled.
> > 	(privatize_symbol_name): Same as above.
> > 	(lto_promote_cross_file_statics): New parameter to select when
> > 	to promote to global.
> > 	(lto_check_usage_from_other_partitions): New function.
> > 	* lto-partition.h: Move from gcc/lto/lto-partition.h
> > 	(lto_promote_cross_file_statics): Update prototype.
> > 	(lto_check_usage_from_other_partitions): Declare.
> > 	(lto_merge_comdat_map): Declare.
> > 
> > gcc/lto/ChangeLog:
> > 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
> > 
> > 	* lto-partition.c: Move to gcc/lto-partition.c.
> > 	* lto-partition.h: Move to gcc/lto-partition.h.
> > 	* lto.c: Update call to lto_promote_cross_file_statics.
> > 	* Makefile.in: Remove lto-partition.o.
> > ---
> >  gcc/Makefile.in               |   1 +
> >  gcc/cgraph.h                  |   1 +
> >  gcc/{lto => }/lto-partition.c | 463 +++++++++++++++++++++++++++++++++-
> >  gcc/{lto => }/lto-partition.h |   4 +-
> >  gcc/lto/Make-lang.in          |   4 +-
> >  gcc/lto/lto.c                 |   2 +-
> >  gcc/params.opt                |   8 +
> >  gcc/tree.c                    |  23 +-
> >  8 files changed, 489 insertions(+), 17 deletions(-)
> >  rename gcc/{lto => }/lto-partition.c (78%)
> >  rename gcc/{lto => }/lto-partition.h (89%)
> > 
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index 79e854aa938..be42b15f4ff 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1457,6 +1457,7 @@ OBJS = \
> >  	lra-spills.o \
> >  	lto-cgraph.o \
> >  	lto-streamer.o \
> > +	lto-partition.o \
> >  	lto-streamer-in.o \
> >  	lto-streamer-out.o \
> >  	lto-section-in.o \
> > diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> > index 0211f08964f..b4a7871bd3d 100644
> > --- a/gcc/cgraph.h
> > +++ b/gcc/cgraph.h
> > @@ -615,6 +615,7 @@ public:
> >    struct lto_file_decl_data * lto_file_data;
> >  
> >    PTR GTY ((skip)) aux;
> > +  int aux2;
> 
> We do not really want to add more pass specific data to the symbol table
> (since it is critical wrt memory use during WPA stage).
> It is possible to attach extra info using the symbol-summary.h

How about if I refactor add_node_to_partition to not use the aux
pointer, but a single bit for marking if a node was already
partitioned? I counted 18 bits used in symtab_node, so there are plenty
of space left in this bitfield :)

Then I can use the aux pointer variable for storing the partitioning.
This will avoid me the cost of hash accesses.

> >  
> >    /* Comdat group the symbol is in.  Can be private if GGC allowed that.  */
> >    tree x_comdat_group;
> > diff --git a/gcc/lto/lto-partition.c b/gcc/lto-partition.c
> > similarity index 78%
> > rename from gcc/lto/lto-partition.c
> > rename to gcc/lto-partition.c
> > index 8e0488ab13e..ca962e69b5d 100644
> > --- a/gcc/lto/lto-partition.c
> > +++ b/gcc/lto-partition.c
> > @@ -170,7 +170,11 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
> >      {
> >        struct cgraph_edge *e;
> >        if (!node->alias && c == SYMBOL_PARTITION)
> > -	part->insns += ipa_size_summaries->get (cnode)->size;
> > +	{
> > +	  /* FIXME: Find out why this is being returned NULL in some cases.  */
> > +	  if (ipa_size_summaries->get (cnode))
> > +	    part->insns += ipa_size_summaries->get (cnode)->size;
> 
> It returns NULL when symbol summary is not available.  Either symbol is
> not defined (in which case it should not be SYMBOL_PARTITION) or it is
> not computed (probaly because of -O0?)
> > +/* Quickly balance partitions, trying to reach target_size in each of
> > +   them.  Returns true if something was done, or false if we decided
> > +   that it is not worth.  */
> > +
> > +static bool
> > +balance_partitions (union_find *ds, int n, int jobs)
> 
> Generally options should be documented, so I would learn what N means ;)

Woops.  n is just the number of nodes.

> > +{
> > +  int *sizes, i, j;
> > +  int total_size = 0, max_size = -1;
> > +  int target_size;
> > +  const int eps = 0;
> > +
> > +  symtab_node *node;
> > +
> > +  sizes = (int *) alloca (n * sizeof (*sizes));
> > +  memset (sizes, 0, n * sizeof (*sizes));
> 
> And we avoid using alloca for arrays that can grow over stack limit.
> I assume this has the size of symbol table, which means that you want to
> use vec.h API and allocae on heap.

Okay.

> > + 
> > +  /* Compute costs.  */
> > +  i = 0;
> > +  FOR_EACH_SYMBOL (node)
> > +    {
> > +      int root = ds->find (i);
> > +
> > +      if (cgraph_node *cnode = dyn_cast<cgraph_node *> (node))
> > +	{
> > +	  ipa_size_summary *summary = ipa_size_summaries->get (cnode);
> > +	  if (summary)
> > +	    sizes[root] += summary->size;
> > +	  else
> > +	    sizes[root] += 10;
> > +	}
> > +      else
> > +	sizes[root] += 10;
> 
> Do you have testcases where summary is mising?

This may be related to an early problem.  I will just remove these and
check if i hit these again.

> > +
> > +
> > +      i++;
> > +    }
> > +
> > +  /* Compute the total size and maximum size.  */
> > +  for (i = 0; i < n; ++i)
> > +    {
> > +      total_size += sizes[i];
> > +      max_size    = MAX (max_size, sizes[i]);
> 
> Also i think we should start using 64bit values for total sizes of
> units. In some extreme cases they already get close to overflow.

Okay :)

> > +    }
> > +
> > +  /* Quick return if total size is small.  */
> > +  if (total_size < param_min_partition_size)
> > +    return false;
> > +
> > +  target_size = total_size / (jobs + 1);
> > +
> > +  /* Unite small partitions.  */
> > +  for (i = 0, j = 0; j < n; ++j)
> > +    {
> > +      if (sizes[j] == 0)
> > +	continue;
> > +
> > +      if (i == -1)
> > +	i = j;
> > +      else
> > +	{
> > +	  if (sizes[i] + sizes[j] < target_size + eps)
> > +	    {
> > +	      ds->unite (i, j);
> > +	      sizes[i] += sizes[j];
> > +	      sizes[j] = 0;
> > +	    }
> > +	  else
> > +	      i = j;
> > +	}
> > +    }
> > +  return true;
> 
> Note that partitioning is not free, since reference to static var or a
> call from one partition to another will lead to more expensive
> relocations/instructions to be used (since gas will not be able to relax
> it to IP relative addressing where available).
> 
> For that reason LTO partitioner has the logic tracking boundary and
> minimizing it.  I think we should merge them and also cleanup
> lto/lto-partition.c - the partitioner was one late night experiment that
> I intended as a first proof of concept to be rewritten later, but we
> never got into impementing anything much smarter. On the other hand we
> added a lot of extra hacks to it (order preserving and other things), so
> it deserves some TLC.

You mean the balanced_map or the add_node_to_partition?

> 
> Also I think you unite partitions in the FOR_EACH_* order that is not
> really meaningful, the code layout is controlled by node->order values.

This was mainly to improve compilation performance and did not use the
code layout as account. Maybe the best strategy is to "fix" the LTO
partitioner so that we could use it for this, instead of implementing
a new one.

> 
> > +}
> > +
> > +/* Builds the LTRANS partitions, or return if not needed.  */
> > +
> > +static int
> > +build_ltrans_partitions (union_find *ds, int n)
> > +{
> > +  int i, n_partitions;
> > +  symtab_node *node;
> > +
> > +  int *compression = (int *) alloca (n * sizeof (*compression));
> > +  for (i = 0; i < n; ++i)
> > +    compression[i] = -1; /* Invalid value.  */
> > +
> > +  i = 0, n_partitions = 0;
> > +  FOR_EACH_SYMBOL (node)
> > +    {
> > +      int root = ds->find (i);
> > +      node->aux2 = root;
> > +      node->aux = NULL;
> > +
> > +      if (node->get_partitioning_class () == SYMBOL_PARTITION
> > +	  && compression[root] < 0)
> > +	compression[root] = n_partitions++;
> > +      i++;
> > +    }
> 
> What exactly compression is used for?

This is coordinate compression.  Partitions generated by my partitioner
will be identified as an integer between 0, ..., n-1, where n is the
number of nodes in the graph, but we may have m partitions, where
0 < m <= n.  What that does is map these 0, ..., n-1 identifiers to
a unique number between 0, ..., m-1.  If you take a closer look, the
algorithm ressembles the counting sort.

> > +
> > +  if (dump_file)
> > +    fprintf (dump_file, "n_partitions = %d\n", n_partitions);
> > +
> > +  if (n_partitions <= 1)
> > +    return false;
> > +
> > +  /* Create LTRANS partitions.  */
> > +  ltrans_partitions.create (n_partitions);
> > +  for (i = 0; i < n_partitions; i++)
> > +    new_partition ("");
> > +
> > +  FOR_EACH_SYMBOL (node)
> > +    {
> > +      if (node->get_partitioning_class () != SYMBOL_PARTITION
> > +	  || symbol_partitioned_p (node))
> > +	  continue;
> > +
> > +      int p = compression[node->aux2];
> > +      if (dump_file)
> > +	fprintf (dump_file, "p = %d\t;; %s\n", p, node->dump_name ());
> > +      add_symbol_to_partition (ltrans_partitions[p], node);
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Partition COMDAT groups together, and also bring together nodes that
> > +   requires them. Such nodes that are not in the COMDAT group that have
> > +   references to COMDAT grouped nodes are called the COMDAT frontier.  */
> > +
> > +static bool
> > +merge_comdat_nodes (symtab_node *node, int set)
> > +{
> > +  enum symbol_partitioning_class c = node->get_partitioning_class ();
> > +  bool ret = false;
> > +  symtab_node *node1;
> > +  cgraph_edge *e;
> > +
> > +  /* If node is already analysed, quickly return.  */
> > +  if (node->aux)
> > +    return false;
> > +
> > +  /* Mark as analysed.  */
> > +  node->aux = (void *) 1;
> > +
> > +
> > +  /* Aglomerate the COMDAT group into the same partition.  */
> > +  if (node->same_comdat_group)
> > +    {
> > +      for (node1 = node->same_comdat_group;
> > +	   node1 != node; node1 = node1->same_comdat_group)
> > +	if (!node->alias)
> > +	  {
> > +	    ds->unite (node1->aux2, set);
> > +	    merge_comdat_nodes (node1, set);
> > +	    ret = true;
> > +	  }
> > +    }
> > +
> > +  /* Look at nodes that can reach the COMDAT group, and aglomerate to the
> > +     same partition.  These nodes are called the "COMDAT Frontier".  The
> > +     idea is that every unpartitioned node that reaches a COMDAT group MUST
> > +     go through the COMDAT frontier before reaching it.  Therefore, only
> > +     nodes in the frontier are exported.  */
> > +  if (node->same_comdat_group || c == SYMBOL_DUPLICATE)
> > +    {
> > +      int i;
> > +      struct ipa_ref *ref = NULL;
> > +
> > +      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> > +	{
> > +	  /* Add all inline clones and callees that are duplicated.  */
> > +	  for (e = cnode->callers; e; e = e->next_caller)
> > +	    if (!e->inline_failed || c == SYMBOL_DUPLICATE)
> > +	      {
> > +		ds->unite (set, e->caller->aux2);
> > +		merge_comdat_nodes (e->caller, set);
> > +		ret = true;
> > +	      }
> > +
> > +	  /* Add all thunks associated with the function.  */
> > +	  for (e = cnode->callees; e; e = e->next_callee)
> > +	    if (e->caller->thunk.thunk_p && !e->caller->inlined_to)
> > +	      {
> > +		ds->unite (set, e->callee->aux2);
> > +		merge_comdat_nodes (e->callee, set);
> > +		ret = true;
> > +	      }
> > +	}
> 
> I do not think it is strictly necessary to merge comdat function with
> all users. If the comdat is some easy accessor this may prevent a lot of
> merging.
> 
> All you need to do is to place it into one of partitins and mark set
> "used" flag on the symbols used in other partitions, so it does not get
> optimized away when unused.  Other partitions should reffer to it w/o
> problems.

I guess I will need your help in the future with this, then.  In any
case, speedups was quite OK even with this, so probably I will leave it
as is, so it could be improved incrementally.

> 
> I am not sure what exactly the goal is here about not changing code
> layout.
> 
> > +/* Partition the program into several partitions with a restriction that
> > +   COMDATS are partitioned together with all nodes requiring them.  If
> > +   promote_statics is false, we also partition together static functions
> > +   and nodes that call eachother, so non-public functions are not promoted
> > +   to globals.  */
> > +
> > +void
> > +lto_merge_comdat_map (bool balance, bool promote_statics, int jobs)
> 
> BTW I think it is odd name for partitioner. Comdat is just one of
> problems it deals with. But I would still like to see this merged with
> the lto partitioning logic, so we could also use all kinds of
> partitioners in both modes.

Well... it does merge COMDATs into the same partition :D

> 
> It all looks quite nice, but lets work on avoiding the code duplication
> here...

Thank you for your detailed review.
Giuliano.

> 
> Honza


More information about the Gcc-patches mailing list