diff mbox series

[2/6] Implement a new partitioner for parallel compilation

Message ID 20200820220019.3131804-3-giuliano.belinassi@usp.br
State New
Headers show
Series Parallelize Intra-Procedural Optimizations using the LTO Engine. | expand

Commit Message

Giuliano Belinassi Aug. 20, 2020, 10 p.m. UTC
When using the LTO infrastructure to compile files in parallel, we
can't simply use any of the LTO partitioner, once extra dependency
analysis is required to ensure that some nodes are correctly
partitioned together.

Therefore, here we implement a new partitioner called
"lto_merge_comdat_map" that does all these required analysis.
The partitioner works as follows:

1. We create a number of disjoint sets and inserts each node into a
   separate set, which may be merged together in the future.

2. Find COMDAT groups, and mark them to be partitioned together.

3. Check all nodes that would require any COMDAT group to be
   copied to its partition (which we name "COMDAT frontier"),
   and mark them to be partitioned together.
   This avoids duplication of COMDAT groups and crashes on the LTO
   partitioning infrastructure.

4. Check if the user allows the partitioner to promote non-public
   functions or variables to global to improve parallelization
   opportunity with a cost of modifying the output code layout.

5. Balance generated partitions for performance unless not told to.

The choice of 1. was by design, so we could use a union-find
data structure, which are know for being very fast on set unite
operations.

For 3. to work properly, we also had to modify
lto_promote_cross_file_statics to handle this case.

The parameters --param=promote-statics and --param=balance-partitions
control 4. and 5., respectively

gcc/ChangeLog:
2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>

	* Makefile.in: Add lto-partition.o
	* cgraph.h (struct symtab_node::aux2): New variable.
	* lto-partition.c: Move from gcc/lto/lto-partition.c
	(add_symbol_to_partition_1): Only compute insn size
	if information is available.
	(node_cmp): Same as above.
	(class union_find): New.
	(ds_print_roots): New function.
	(balance_partitions): New function.
	(build_ltrans_partitions): New function.
	(merge_comdat_nodes): New function.
	(merge_static_calls): New function.
	(merge_contained_symbols): New function.
	(lto_merge_comdat_map): New function.
	(privatize_symbol_name_1): Handle when WPA is not enabled.
	(privatize_symbol_name): Same as above.
	(lto_promote_cross_file_statics): New parameter to select when
	to promote to global.
	(lto_check_usage_from_other_partitions): New function.
	* lto-partition.h: Move from gcc/lto/lto-partition.h
	(lto_promote_cross_file_statics): Update prototype.
	(lto_check_usage_from_other_partitions): Declare.
	(lto_merge_comdat_map): Declare.

gcc/lto/ChangeLog:
2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>

	* lto-partition.c: Move to gcc/lto-partition.c.
	* lto-partition.h: Move to gcc/lto-partition.h.
	* lto.c: Update call to lto_promote_cross_file_statics.
	* Makefile.in: Remove lto-partition.o.
---
 gcc/Makefile.in               |   1 +
 gcc/cgraph.h                  |   1 +
 gcc/{lto => }/lto-partition.c | 463 +++++++++++++++++++++++++++++++++-
 gcc/{lto => }/lto-partition.h |   4 +-
 gcc/lto/Make-lang.in          |   4 +-
 gcc/lto/lto.c                 |   2 +-
 gcc/params.opt                |   8 +
 gcc/tree.c                    |  23 +-
 8 files changed, 489 insertions(+), 17 deletions(-)
 rename gcc/{lto => }/lto-partition.c (78%)
 rename gcc/{lto => }/lto-partition.h (89%)

Comments

Jan Hubicka Aug. 27, 2020, 3:18 p.m. UTC | #1
> When using the LTO infrastructure to compile files in parallel, we
> can't simply use any of the LTO partitioner, once extra dependency
> analysis is required to ensure that some nodes are correctly
> partitioned together.
> 
> Therefore, here we implement a new partitioner called
> "lto_merge_comdat_map" that does all these required analysis.
> The partitioner works as follows:
> 
> 1. We create a number of disjoint sets and inserts each node into a
>    separate set, which may be merged together in the future.
> 
> 2. Find COMDAT groups, and mark them to be partitioned together.
> 
> 3. Check all nodes that would require any COMDAT group to be
>    copied to its partition (which we name "COMDAT frontier"),
>    and mark them to be partitioned together.
>    This avoids duplication of COMDAT groups and crashes on the LTO
>    partitioning infrastructure.

What kind of crash you get here?
> 
> 4. Check if the user allows the partitioner to promote non-public
>    functions or variables to global to improve parallelization
>    opportunity with a cost of modifying the output code layout.
> 
> 5. Balance generated partitions for performance unless not told to.
> 
> The choice of 1. was by design, so we could use a union-find
> data structure, which are know for being very fast on set unite
> operations.

In LTO partitioner the groups of objects that "must go toghether"
are discovered when first object is placed into the partition (via
add_to_partition) because with the LTO rules it is always possible to
discover all members from the group starting from any random element via
graph walking.

I guess it is same with your partitioner?  I basically wonder how much
code can be shared and what needs to be duplicated.
It is not very nice to have partitioning implemented twice - it is bit
subtle problem when it comes to details so I would be happier if we
brought in the lto/lto-partition.c to middle end and updaed/cleaned it
up as needed.
> 
> For 3. to work properly, we also had to modify
> lto_promote_cross_file_statics to handle this case.
> 
> The parameters --param=promote-statics and --param=balance-partitions
> control 4. and 5., respectively
> 
> gcc/ChangeLog:
> 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
> 
> 	* Makefile.in: Add lto-partition.o
> 	* cgraph.h (struct symtab_node::aux2): New variable.
> 	* lto-partition.c: Move from gcc/lto/lto-partition.c
> 	(add_symbol_to_partition_1): Only compute insn size
> 	if information is available.
> 	(node_cmp): Same as above.
> 	(class union_find): New.
> 	(ds_print_roots): New function.
> 	(balance_partitions): New function.
> 	(build_ltrans_partitions): New function.
> 	(merge_comdat_nodes): New function.
> 	(merge_static_calls): New function.
> 	(merge_contained_symbols): New function.
> 	(lto_merge_comdat_map): New function.
> 	(privatize_symbol_name_1): Handle when WPA is not enabled.
> 	(privatize_symbol_name): Same as above.
> 	(lto_promote_cross_file_statics): New parameter to select when
> 	to promote to global.
> 	(lto_check_usage_from_other_partitions): New function.
> 	* lto-partition.h: Move from gcc/lto/lto-partition.h
> 	(lto_promote_cross_file_statics): Update prototype.
> 	(lto_check_usage_from_other_partitions): Declare.
> 	(lto_merge_comdat_map): Declare.
> 
> gcc/lto/ChangeLog:
> 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
> 
> 	* lto-partition.c: Move to gcc/lto-partition.c.
> 	* lto-partition.h: Move to gcc/lto-partition.h.
> 	* lto.c: Update call to lto_promote_cross_file_statics.
> 	* Makefile.in: Remove lto-partition.o.
> ---
>  gcc/Makefile.in               |   1 +
>  gcc/cgraph.h                  |   1 +
>  gcc/{lto => }/lto-partition.c | 463 +++++++++++++++++++++++++++++++++-
>  gcc/{lto => }/lto-partition.h |   4 +-
>  gcc/lto/Make-lang.in          |   4 +-
>  gcc/lto/lto.c                 |   2 +-
>  gcc/params.opt                |   8 +
>  gcc/tree.c                    |  23 +-
>  8 files changed, 489 insertions(+), 17 deletions(-)
>  rename gcc/{lto => }/lto-partition.c (78%)
>  rename gcc/{lto => }/lto-partition.h (89%)
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 79e854aa938..be42b15f4ff 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1457,6 +1457,7 @@ OBJS = \
>  	lra-spills.o \
>  	lto-cgraph.o \
>  	lto-streamer.o \
> +	lto-partition.o \
>  	lto-streamer-in.o \
>  	lto-streamer-out.o \
>  	lto-section-in.o \
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 0211f08964f..b4a7871bd3d 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -615,6 +615,7 @@ public:
>    struct lto_file_decl_data * lto_file_data;
>  
>    PTR GTY ((skip)) aux;
> +  int aux2;

We do not really want to add more pass specific data to the symbol table
(since it is critical wrt memory use during WPA stage).
It is possible to attach extra info using the symbol-summary.h
>  
>    /* Comdat group the symbol is in.  Can be private if GGC allowed that.  */
>    tree x_comdat_group;
> diff --git a/gcc/lto/lto-partition.c b/gcc/lto-partition.c
> similarity index 78%
> rename from gcc/lto/lto-partition.c
> rename to gcc/lto-partition.c
> index 8e0488ab13e..ca962e69b5d 100644
> --- a/gcc/lto/lto-partition.c
> +++ b/gcc/lto-partition.c
> @@ -170,7 +170,11 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
>      {
>        struct cgraph_edge *e;
>        if (!node->alias && c == SYMBOL_PARTITION)
> -	part->insns += ipa_size_summaries->get (cnode)->size;
> +	{
> +	  /* FIXME: Find out why this is being returned NULL in some cases.  */
> +	  if (ipa_size_summaries->get (cnode))
> +	    part->insns += ipa_size_summaries->get (cnode)->size;

It returns NULL when symbol summary is not available.  Either symbol is
not defined (in which case it should not be SYMBOL_PARTITION) or it is
not computed (probaly because of -O0?)
> +/* Quickly balance partitions, trying to reach target_size in each of
> +   them.  Returns true if something was done, or false if we decided
> +   that it is not worth.  */
> +
> +static bool
> +balance_partitions (union_find *ds, int n, int jobs)

Generally options should be documented, so I would learn what N means ;)
> +{
> +  int *sizes, i, j;
> +  int total_size = 0, max_size = -1;
> +  int target_size;
> +  const int eps = 0;
> +
> +  symtab_node *node;
> +
> +  sizes = (int *) alloca (n * sizeof (*sizes));
> +  memset (sizes, 0, n * sizeof (*sizes));

And we avoid using alloca for arrays that can grow over stack limit.
I assume this has the size of symbol table, which means that you want to
use vec.h API and allocae on heap.
> + 
> +  /* Compute costs.  */
> +  i = 0;
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      int root = ds->find (i);
> +
> +      if (cgraph_node *cnode = dyn_cast<cgraph_node *> (node))
> +	{
> +	  ipa_size_summary *summary = ipa_size_summaries->get (cnode);
> +	  if (summary)
> +	    sizes[root] += summary->size;
> +	  else
> +	    sizes[root] += 10;
> +	}
> +      else
> +	sizes[root] += 10;

Do you have testcases where summary is mising?
> +
> +
> +      i++;
> +    }
> +
> +  /* Compute the total size and maximum size.  */
> +  for (i = 0; i < n; ++i)
> +    {
> +      total_size += sizes[i];
> +      max_size    = MAX (max_size, sizes[i]);

Also i think we should start using 64bit values for total sizes of
units. In some extreme cases they already get close to overflow.
> +    }
> +
> +  /* Quick return if total size is small.  */
> +  if (total_size < param_min_partition_size)
> +    return false;
> +
> +  target_size = total_size / (jobs + 1);
> +
> +  /* Unite small partitions.  */
> +  for (i = 0, j = 0; j < n; ++j)
> +    {
> +      if (sizes[j] == 0)
> +	continue;
> +
> +      if (i == -1)
> +	i = j;
> +      else
> +	{
> +	  if (sizes[i] + sizes[j] < target_size + eps)
> +	    {
> +	      ds->unite (i, j);
> +	      sizes[i] += sizes[j];
> +	      sizes[j] = 0;
> +	    }
> +	  else
> +	      i = j;
> +	}
> +    }
> +  return true;

Note that partitioning is not free, since reference to static var or a
call from one partition to another will lead to more expensive
relocations/instructions to be used (since gas will not be able to relax
it to IP relative addressing where available).

For that reason LTO partitioner has the logic tracking boundary and
minimizing it.  I think we should merge them and also cleanup
lto/lto-partition.c - the partitioner was one late night experiment that
I intended as a first proof of concept to be rewritten later, but we
never got into impementing anything much smarter. On the other hand we
added a lot of extra hacks to it (order preserving and other things), so
it deserves some TLC.

Also I think you unite partitions in the FOR_EACH_* order that is not
really meaningful, the code layout is controlled by node->order values.

> +}
> +
> +/* Builds the LTRANS partitions, or return if not needed.  */
> +
> +static int
> +build_ltrans_partitions (union_find *ds, int n)
> +{
> +  int i, n_partitions;
> +  symtab_node *node;
> +
> +  int *compression = (int *) alloca (n * sizeof (*compression));
> +  for (i = 0; i < n; ++i)
> +    compression[i] = -1; /* Invalid value.  */
> +
> +  i = 0, n_partitions = 0;
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      int root = ds->find (i);
> +      node->aux2 = root;
> +      node->aux = NULL;
> +
> +      if (node->get_partitioning_class () == SYMBOL_PARTITION
> +	  && compression[root] < 0)
> +	compression[root] = n_partitions++;
> +      i++;
> +    }

What exactly compression is used for?
> +
> +  if (dump_file)
> +    fprintf (dump_file, "n_partitions = %d\n", n_partitions);
> +
> +  if (n_partitions <= 1)
> +    return false;
> +
> +  /* Create LTRANS partitions.  */
> +  ltrans_partitions.create (n_partitions);
> +  for (i = 0; i < n_partitions; i++)
> +    new_partition ("");
> +
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      if (node->get_partitioning_class () != SYMBOL_PARTITION
> +	  || symbol_partitioned_p (node))
> +	  continue;
> +
> +      int p = compression[node->aux2];
> +      if (dump_file)
> +	fprintf (dump_file, "p = %d\t;; %s\n", p, node->dump_name ());
> +      add_symbol_to_partition (ltrans_partitions[p], node);
> +    }
> +
> +  return true;
> +}
> +
> +/* Partition COMDAT groups together, and also bring together nodes that
> +   requires them. Such nodes that are not in the COMDAT group that have
> +   references to COMDAT grouped nodes are called the COMDAT frontier.  */
> +
> +static bool
> +merge_comdat_nodes (symtab_node *node, int set)
> +{
> +  enum symbol_partitioning_class c = node->get_partitioning_class ();
> +  bool ret = false;
> +  symtab_node *node1;
> +  cgraph_edge *e;
> +
> +  /* If node is already analysed, quickly return.  */
> +  if (node->aux)
> +    return false;
> +
> +  /* Mark as analysed.  */
> +  node->aux = (void *) 1;
> +
> +
> +  /* Aglomerate the COMDAT group into the same partition.  */
> +  if (node->same_comdat_group)
> +    {
> +      for (node1 = node->same_comdat_group;
> +	   node1 != node; node1 = node1->same_comdat_group)
> +	if (!node->alias)
> +	  {
> +	    ds->unite (node1->aux2, set);
> +	    merge_comdat_nodes (node1, set);
> +	    ret = true;
> +	  }
> +    }
> +
> +  /* Look at nodes that can reach the COMDAT group, and aglomerate to the
> +     same partition.  These nodes are called the "COMDAT Frontier".  The
> +     idea is that every unpartitioned node that reaches a COMDAT group MUST
> +     go through the COMDAT frontier before reaching it.  Therefore, only
> +     nodes in the frontier are exported.  */
> +  if (node->same_comdat_group || c == SYMBOL_DUPLICATE)
> +    {
> +      int i;
> +      struct ipa_ref *ref = NULL;
> +
> +      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> +	{
> +	  /* Add all inline clones and callees that are duplicated.  */
> +	  for (e = cnode->callers; e; e = e->next_caller)
> +	    if (!e->inline_failed || c == SYMBOL_DUPLICATE)
> +	      {
> +		ds->unite (set, e->caller->aux2);
> +		merge_comdat_nodes (e->caller, set);
> +		ret = true;
> +	      }
> +
> +	  /* Add all thunks associated with the function.  */
> +	  for (e = cnode->callees; e; e = e->next_callee)
> +	    if (e->caller->thunk.thunk_p && !e->caller->inlined_to)
> +	      {
> +		ds->unite (set, e->callee->aux2);
> +		merge_comdat_nodes (e->callee, set);
> +		ret = true;
> +	      }
> +	}

I do not think it is strictly necessary to merge comdat function with
all users. If the comdat is some easy accessor this may prevent a lot of
merging.

All you need to do is to place it into one of partitins and mark set
"used" flag on the symbols used in other partitions, so it does not get
optimized away when unused.  Other partitions should reffer to it w/o
problems.

I am not sure what exactly the goal is here about not changing code
layout.

> +/* Partition the program into several partitions with a restriction that
> +   COMDATS are partitioned together with all nodes requiring them.  If
> +   promote_statics is false, we also partition together static functions
> +   and nodes that call eachother, so non-public functions are not promoted
> +   to globals.  */
> +
> +void
> +lto_merge_comdat_map (bool balance, bool promote_statics, int jobs)

BTW I think it is odd name for partitioner. Comdat is just one of
problems it deals with. But I would still like to see this merged with
the lto partitioning logic, so we could also use all kinds of
partitioners in both modes.

It all looks quite nice, but lets work on avoiding the code duplication
here...

Honza
Giuliano Belinassi Aug. 27, 2020, 9:42 p.m. UTC | #2
Hi, Honza.

Again, thank you for your detailed review!

On 08/27, Jan Hubicka wrote:
> > When using the LTO infrastructure to compile files in parallel, we
> > can't simply use any of the LTO partitioner, once extra dependency
> > analysis is required to ensure that some nodes are correctly
> > partitioned together.
> > 
> > Therefore, here we implement a new partitioner called
> > "lto_merge_comdat_map" that does all these required analysis.
> > The partitioner works as follows:
> > 
> > 1. We create a number of disjoint sets and inserts each node into a
> >    separate set, which may be merged together in the future.
> > 
> > 2. Find COMDAT groups, and mark them to be partitioned together.
> > 
> > 3. Check all nodes that would require any COMDAT group to be
> >    copied to its partition (which we name "COMDAT frontier"),
> >    and mark them to be partitioned together.
> >    This avoids duplication of COMDAT groups and crashes on the LTO
> >    partitioning infrastructure.
> 
> What kind of crash you get here?

This assertion.

	  bool added = add_symbol_to_partition_1 (part, node1);
	  gcc_assert (added);

It checks if the COMDAT node was not already inserted into somewhere
else partition.

> > 
> > 4. Check if the user allows the partitioner to promote non-public
> >    functions or variables to global to improve parallelization
> >    opportunity with a cost of modifying the output code layout.
> > 
> > 5. Balance generated partitions for performance unless not told to.
> > 
> > The choice of 1. was by design, so we could use a union-find
> > data structure, which are know for being very fast on set unite
> > operations.
> 
> In LTO partitioner the groups of objects that "must go toghether"
> are discovered when first object is placed into the partition (via
> add_to_partition) because with the LTO rules it is always possible to
> discover all members from the group starting from any random element via
> graph walking.
> 
> I guess it is same with your partitioner?  I basically wonder how much
> code can be shared and what needs to be duplicated.
> It is not very nice to have partitioning implemented twice - it is bit
> subtle problem when it comes to details so I would be happier if we
> brought in the lto/lto-partition.c to middle end and updaed/cleaned it
> up as needed.

They are almost the same thing, they group together nodes that
require to be in the same partition before deciding how to partition
them.

Things are a little different in the way that this partitioner starts
with n partitions and merge nodes together as we decide that these
nodes needs to be in the same partition.  The advantage of this is that
merging partitions is quite cheap, but the drawback is that I can't
undo partitions easily. You can also see that I only use the
add_node_to_partition after it decides what nodes should be in the
partition.

I think that if there is a way to avoid failing that assertion that I
mentioned above, we could even get rid of this step and use the
balanced_map partitioner.

> > 
> > For 3. to work properly, we also had to modify
> > lto_promote_cross_file_statics to handle this case.
> > 
> > The parameters --param=promote-statics and --param=balance-partitions
> > control 4. and 5., respectively
> > 
> > gcc/ChangeLog:
> > 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
> > 
> > 	* Makefile.in: Add lto-partition.o
> > 	* cgraph.h (struct symtab_node::aux2): New variable.
> > 	* lto-partition.c: Move from gcc/lto/lto-partition.c
> > 	(add_symbol_to_partition_1): Only compute insn size
> > 	if information is available.
> > 	(node_cmp): Same as above.
> > 	(class union_find): New.
> > 	(ds_print_roots): New function.
> > 	(balance_partitions): New function.
> > 	(build_ltrans_partitions): New function.
> > 	(merge_comdat_nodes): New function.
> > 	(merge_static_calls): New function.
> > 	(merge_contained_symbols): New function.
> > 	(lto_merge_comdat_map): New function.
> > 	(privatize_symbol_name_1): Handle when WPA is not enabled.
> > 	(privatize_symbol_name): Same as above.
> > 	(lto_promote_cross_file_statics): New parameter to select when
> > 	to promote to global.
> > 	(lto_check_usage_from_other_partitions): New function.
> > 	* lto-partition.h: Move from gcc/lto/lto-partition.h
> > 	(lto_promote_cross_file_statics): Update prototype.
> > 	(lto_check_usage_from_other_partitions): Declare.
> > 	(lto_merge_comdat_map): Declare.
> > 
> > gcc/lto/ChangeLog:
> > 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
> > 
> > 	* lto-partition.c: Move to gcc/lto-partition.c.
> > 	* lto-partition.h: Move to gcc/lto-partition.h.
> > 	* lto.c: Update call to lto_promote_cross_file_statics.
> > 	* Makefile.in: Remove lto-partition.o.
> > ---
> >  gcc/Makefile.in               |   1 +
> >  gcc/cgraph.h                  |   1 +
> >  gcc/{lto => }/lto-partition.c | 463 +++++++++++++++++++++++++++++++++-
> >  gcc/{lto => }/lto-partition.h |   4 +-
> >  gcc/lto/Make-lang.in          |   4 +-
> >  gcc/lto/lto.c                 |   2 +-
> >  gcc/params.opt                |   8 +
> >  gcc/tree.c                    |  23 +-
> >  8 files changed, 489 insertions(+), 17 deletions(-)
> >  rename gcc/{lto => }/lto-partition.c (78%)
> >  rename gcc/{lto => }/lto-partition.h (89%)
> > 
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index 79e854aa938..be42b15f4ff 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1457,6 +1457,7 @@ OBJS = \
> >  	lra-spills.o \
> >  	lto-cgraph.o \
> >  	lto-streamer.o \
> > +	lto-partition.o \
> >  	lto-streamer-in.o \
> >  	lto-streamer-out.o \
> >  	lto-section-in.o \
> > diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> > index 0211f08964f..b4a7871bd3d 100644
> > --- a/gcc/cgraph.h
> > +++ b/gcc/cgraph.h
> > @@ -615,6 +615,7 @@ public:
> >    struct lto_file_decl_data * lto_file_data;
> >  
> >    PTR GTY ((skip)) aux;
> > +  int aux2;
> 
> We do not really want to add more pass specific data to the symbol table
> (since it is critical wrt memory use during WPA stage).
> It is possible to attach extra info using the symbol-summary.h

How about if I refactor add_node_to_partition to not use the aux
pointer, but a single bit for marking if a node was already
partitioned? I counted 18 bits used in symtab_node, so there are plenty
of space left in this bitfield :)

Then I can use the aux pointer variable for storing the partitioning.
This will avoid me the cost of hash accesses.

> >  
> >    /* Comdat group the symbol is in.  Can be private if GGC allowed that.  */
> >    tree x_comdat_group;
> > diff --git a/gcc/lto/lto-partition.c b/gcc/lto-partition.c
> > similarity index 78%
> > rename from gcc/lto/lto-partition.c
> > rename to gcc/lto-partition.c
> > index 8e0488ab13e..ca962e69b5d 100644
> > --- a/gcc/lto/lto-partition.c
> > +++ b/gcc/lto-partition.c
> > @@ -170,7 +170,11 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
> >      {
> >        struct cgraph_edge *e;
> >        if (!node->alias && c == SYMBOL_PARTITION)
> > -	part->insns += ipa_size_summaries->get (cnode)->size;
> > +	{
> > +	  /* FIXME: Find out why this is being returned NULL in some cases.  */
> > +	  if (ipa_size_summaries->get (cnode))
> > +	    part->insns += ipa_size_summaries->get (cnode)->size;
> 
> It returns NULL when symbol summary is not available.  Either symbol is
> not defined (in which case it should not be SYMBOL_PARTITION) or it is
> not computed (probaly because of -O0?)
> > +/* Quickly balance partitions, trying to reach target_size in each of
> > +   them.  Returns true if something was done, or false if we decided
> > +   that it is not worth.  */
> > +
> > +static bool
> > +balance_partitions (union_find *ds, int n, int jobs)
> 
> Generally options should be documented, so I would learn what N means ;)

Woops.  n is just the number of nodes.

> > +{
> > +  int *sizes, i, j;
> > +  int total_size = 0, max_size = -1;
> > +  int target_size;
> > +  const int eps = 0;
> > +
> > +  symtab_node *node;
> > +
> > +  sizes = (int *) alloca (n * sizeof (*sizes));
> > +  memset (sizes, 0, n * sizeof (*sizes));
> 
> And we avoid using alloca for arrays that can grow over stack limit.
> I assume this has the size of symbol table, which means that you want to
> use vec.h API and allocae on heap.

Okay.

> > + 
> > +  /* Compute costs.  */
> > +  i = 0;
> > +  FOR_EACH_SYMBOL (node)
> > +    {
> > +      int root = ds->find (i);
> > +
> > +      if (cgraph_node *cnode = dyn_cast<cgraph_node *> (node))
> > +	{
> > +	  ipa_size_summary *summary = ipa_size_summaries->get (cnode);
> > +	  if (summary)
> > +	    sizes[root] += summary->size;
> > +	  else
> > +	    sizes[root] += 10;
> > +	}
> > +      else
> > +	sizes[root] += 10;
> 
> Do you have testcases where summary is mising?

This may be related to an early problem.  I will just remove these and
check if i hit these again.

> > +
> > +
> > +      i++;
> > +    }
> > +
> > +  /* Compute the total size and maximum size.  */
> > +  for (i = 0; i < n; ++i)
> > +    {
> > +      total_size += sizes[i];
> > +      max_size    = MAX (max_size, sizes[i]);
> 
> Also i think we should start using 64bit values for total sizes of
> units. In some extreme cases they already get close to overflow.

Okay :)

> > +    }
> > +
> > +  /* Quick return if total size is small.  */
> > +  if (total_size < param_min_partition_size)
> > +    return false;
> > +
> > +  target_size = total_size / (jobs + 1);
> > +
> > +  /* Unite small partitions.  */
> > +  for (i = 0, j = 0; j < n; ++j)
> > +    {
> > +      if (sizes[j] == 0)
> > +	continue;
> > +
> > +      if (i == -1)
> > +	i = j;
> > +      else
> > +	{
> > +	  if (sizes[i] + sizes[j] < target_size + eps)
> > +	    {
> > +	      ds->unite (i, j);
> > +	      sizes[i] += sizes[j];
> > +	      sizes[j] = 0;
> > +	    }
> > +	  else
> > +	      i = j;
> > +	}
> > +    }
> > +  return true;
> 
> Note that partitioning is not free, since reference to static var or a
> call from one partition to another will lead to more expensive
> relocations/instructions to be used (since gas will not be able to relax
> it to IP relative addressing where available).
> 
> For that reason LTO partitioner has the logic tracking boundary and
> minimizing it.  I think we should merge them and also cleanup
> lto/lto-partition.c - the partitioner was one late night experiment that
> I intended as a first proof of concept to be rewritten later, but we
> never got into impementing anything much smarter. On the other hand we
> added a lot of extra hacks to it (order preserving and other things), so
> it deserves some TLC.

You mean the balanced_map or the add_node_to_partition?

> 
> Also I think you unite partitions in the FOR_EACH_* order that is not
> really meaningful, the code layout is controlled by node->order values.

This was mainly to improve compilation performance and did not use the
code layout as account. Maybe the best strategy is to "fix" the LTO
partitioner so that we could use it for this, instead of implementing
a new one.

> 
> > +}
> > +
> > +/* Builds the LTRANS partitions, or return if not needed.  */
> > +
> > +static int
> > +build_ltrans_partitions (union_find *ds, int n)
> > +{
> > +  int i, n_partitions;
> > +  symtab_node *node;
> > +
> > +  int *compression = (int *) alloca (n * sizeof (*compression));
> > +  for (i = 0; i < n; ++i)
> > +    compression[i] = -1; /* Invalid value.  */
> > +
> > +  i = 0, n_partitions = 0;
> > +  FOR_EACH_SYMBOL (node)
> > +    {
> > +      int root = ds->find (i);
> > +      node->aux2 = root;
> > +      node->aux = NULL;
> > +
> > +      if (node->get_partitioning_class () == SYMBOL_PARTITION
> > +	  && compression[root] < 0)
> > +	compression[root] = n_partitions++;
> > +      i++;
> > +    }
> 
> What exactly compression is used for?

This is coordinate compression.  Partitions generated by my partitioner
will be identified as an integer between 0, ..., n-1, where n is the
number of nodes in the graph, but we may have m partitions, where
0 < m <= n.  What that does is map these 0, ..., n-1 identifiers to
a unique number between 0, ..., m-1.  If you take a closer look, the
algorithm ressembles the counting sort.

> > +
> > +  if (dump_file)
> > +    fprintf (dump_file, "n_partitions = %d\n", n_partitions);
> > +
> > +  if (n_partitions <= 1)
> > +    return false;
> > +
> > +  /* Create LTRANS partitions.  */
> > +  ltrans_partitions.create (n_partitions);
> > +  for (i = 0; i < n_partitions; i++)
> > +    new_partition ("");
> > +
> > +  FOR_EACH_SYMBOL (node)
> > +    {
> > +      if (node->get_partitioning_class () != SYMBOL_PARTITION
> > +	  || symbol_partitioned_p (node))
> > +	  continue;
> > +
> > +      int p = compression[node->aux2];
> > +      if (dump_file)
> > +	fprintf (dump_file, "p = %d\t;; %s\n", p, node->dump_name ());
> > +      add_symbol_to_partition (ltrans_partitions[p], node);
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Partition COMDAT groups together, and also bring together nodes that
> > +   requires them. Such nodes that are not in the COMDAT group that have
> > +   references to COMDAT grouped nodes are called the COMDAT frontier.  */
> > +
> > +static bool
> > +merge_comdat_nodes (symtab_node *node, int set)
> > +{
> > +  enum symbol_partitioning_class c = node->get_partitioning_class ();
> > +  bool ret = false;
> > +  symtab_node *node1;
> > +  cgraph_edge *e;
> > +
> > +  /* If node is already analysed, quickly return.  */
> > +  if (node->aux)
> > +    return false;
> > +
> > +  /* Mark as analysed.  */
> > +  node->aux = (void *) 1;
> > +
> > +
> > +  /* Aglomerate the COMDAT group into the same partition.  */
> > +  if (node->same_comdat_group)
> > +    {
> > +      for (node1 = node->same_comdat_group;
> > +	   node1 != node; node1 = node1->same_comdat_group)
> > +	if (!node->alias)
> > +	  {
> > +	    ds->unite (node1->aux2, set);
> > +	    merge_comdat_nodes (node1, set);
> > +	    ret = true;
> > +	  }
> > +    }
> > +
> > +  /* Look at nodes that can reach the COMDAT group, and aglomerate to the
> > +     same partition.  These nodes are called the "COMDAT Frontier".  The
> > +     idea is that every unpartitioned node that reaches a COMDAT group MUST
> > +     go through the COMDAT frontier before reaching it.  Therefore, only
> > +     nodes in the frontier are exported.  */
> > +  if (node->same_comdat_group || c == SYMBOL_DUPLICATE)
> > +    {
> > +      int i;
> > +      struct ipa_ref *ref = NULL;
> > +
> > +      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> > +	{
> > +	  /* Add all inline clones and callees that are duplicated.  */
> > +	  for (e = cnode->callers; e; e = e->next_caller)
> > +	    if (!e->inline_failed || c == SYMBOL_DUPLICATE)
> > +	      {
> > +		ds->unite (set, e->caller->aux2);
> > +		merge_comdat_nodes (e->caller, set);
> > +		ret = true;
> > +	      }
> > +
> > +	  /* Add all thunks associated with the function.  */
> > +	  for (e = cnode->callees; e; e = e->next_callee)
> > +	    if (e->caller->thunk.thunk_p && !e->caller->inlined_to)
> > +	      {
> > +		ds->unite (set, e->callee->aux2);
> > +		merge_comdat_nodes (e->callee, set);
> > +		ret = true;
> > +	      }
> > +	}
> 
> I do not think it is strictly necessary to merge comdat function with
> all users. If the comdat is some easy accessor this may prevent a lot of
> merging.
> 
> All you need to do is to place it into one of partitins and mark set
> "used" flag on the symbols used in other partitions, so it does not get
> optimized away when unused.  Other partitions should reffer to it w/o
> problems.

I guess I will need your help in the future with this, then.  In any
case, speedups was quite OK even with this, so probably I will leave it
as is, so it could be improved incrementally.

> 
> I am not sure what exactly the goal is here about not changing code
> layout.
> 
> > +/* Partition the program into several partitions with a restriction that
> > +   COMDATS are partitioned together with all nodes requiring them.  If
> > +   promote_statics is false, we also partition together static functions
> > +   and nodes that call eachother, so non-public functions are not promoted
> > +   to globals.  */
> > +
> > +void
> > +lto_merge_comdat_map (bool balance, bool promote_statics, int jobs)
> 
> BTW I think it is odd name for partitioner. Comdat is just one of
> problems it deals with. But I would still like to see this merged with
> the lto partitioning logic, so we could also use all kinds of
> partitioners in both modes.

Well... it does merge COMDATs into the same partition :D

> 
> It all looks quite nice, but lets work on avoiding the code duplication
> here...

Thank you for your detailed review.
Giuliano.

> 
> Honza
Richard Biener Aug. 31, 2020, 9:25 a.m. UTC | #3
On Fri, Aug 21, 2020 at 12:00 AM Giuliano Belinassi
<giuliano.belinassi@usp.br> wrote:
>
> When using the LTO infrastructure to compile files in parallel, we
> can't simply use any of the LTO partitioner, once extra dependency
> analysis is required to ensure that some nodes are correctly
> partitioned together.
>
> Therefore, here we implement a new partitioner called
> "lto_merge_comdat_map" that does all these required analysis.
> The partitioner works as follows:
>
> 1. We create a number of disjoint sets and inserts each node into a
>    separate set, which may be merged together in the future.
>
> 2. Find COMDAT groups, and mark them to be partitioned together.
>
> 3. Check all nodes that would require any COMDAT group to be
>    copied to its partition (which we name "COMDAT frontier"),
>    and mark them to be partitioned together.
>    This avoids duplication of COMDAT groups and crashes on the LTO
>    partitioning infrastructure.
>
> 4. Check if the user allows the partitioner to promote non-public
>    functions or variables to global to improve parallelization
>    opportunity with a cost of modifying the output code layout.
>
> 5. Balance generated partitions for performance unless not told to.
>
> The choice of 1. was by design, so we could use a union-find
> data structure, which are know for being very fast on set unite
> operations.
>
> For 3. to work properly, we also had to modify
> lto_promote_cross_file_statics to handle this case.
>
> The parameters --param=promote-statics and --param=balance-partitions
> control 4. and 5., respectively

Just a few comments ontop of Honzas remarks:

> gcc/ChangeLog:
> 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
>
>         * Makefile.in: Add lto-partition.o
>         * cgraph.h (struct symtab_node::aux2): New variable.
>         * lto-partition.c: Move from gcc/lto/lto-partition.c
>         (add_symbol_to_partition_1): Only compute insn size
>         if information is available.
>         (node_cmp): Same as above.
>         (class union_find): New.
>         (ds_print_roots): New function.
>         (balance_partitions): New function.
>         (build_ltrans_partitions): New function.
>         (merge_comdat_nodes): New function.
>         (merge_static_calls): New function.
>         (merge_contained_symbols): New function.
>         (lto_merge_comdat_map): New function.
>         (privatize_symbol_name_1): Handle when WPA is not enabled.
>         (privatize_symbol_name): Same as above.
>         (lto_promote_cross_file_statics): New parameter to select when
>         to promote to global.
>         (lto_check_usage_from_other_partitions): New function.
>         * lto-partition.h: Move from gcc/lto/lto-partition.h
>         (lto_promote_cross_file_statics): Update prototype.
>         (lto_check_usage_from_other_partitions): Declare.
>         (lto_merge_comdat_map): Declare.
>
> gcc/lto/ChangeLog:
> 2020-08-20  Giuliano Belinassi  <giuliano.belinassi@usp.br>
>
>         * lto-partition.c: Move to gcc/lto-partition.c.
>         * lto-partition.h: Move to gcc/lto-partition.h.
>         * lto.c: Update call to lto_promote_cross_file_statics.
>         * Makefile.in: Remove lto-partition.o.
> ---
>  gcc/Makefile.in               |   1 +
>  gcc/cgraph.h                  |   1 +
>  gcc/{lto => }/lto-partition.c | 463 +++++++++++++++++++++++++++++++++-
>  gcc/{lto => }/lto-partition.h |   4 +-
>  gcc/lto/Make-lang.in          |   4 +-
>  gcc/lto/lto.c                 |   2 +-
>  gcc/params.opt                |   8 +
>  gcc/tree.c                    |  23 +-
>  8 files changed, 489 insertions(+), 17 deletions(-)
>  rename gcc/{lto => }/lto-partition.c (78%)
>  rename gcc/{lto => }/lto-partition.h (89%)
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 79e854aa938..be42b15f4ff 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1457,6 +1457,7 @@ OBJS = \
>         lra-spills.o \
>         lto-cgraph.o \
>         lto-streamer.o \
> +       lto-partition.o \
>         lto-streamer-in.o \
>         lto-streamer-out.o \
>         lto-section-in.o \
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 0211f08964f..b4a7871bd3d 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -615,6 +615,7 @@ public:
>    struct lto_file_decl_data * lto_file_data;
>
>    PTR GTY ((skip)) aux;
> +  int aux2;
>
>    /* Comdat group the symbol is in.  Can be private if GGC allowed that.  */
>    tree x_comdat_group;
> diff --git a/gcc/lto/lto-partition.c b/gcc/lto-partition.c
> similarity index 78%
> rename from gcc/lto/lto-partition.c
> rename to gcc/lto-partition.c
> index 8e0488ab13e..ca962e69b5d 100644
> --- a/gcc/lto/lto-partition.c
> +++ b/gcc/lto-partition.c
> @@ -170,7 +170,11 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
>      {
>        struct cgraph_edge *e;
>        if (!node->alias && c == SYMBOL_PARTITION)
> -       part->insns += ipa_size_summaries->get (cnode)->size;
> +       {
> +         /* FIXME: Find out why this is being returned NULL in some cases.  */
> +         if (ipa_size_summaries->get (cnode))
> +           part->insns += ipa_size_summaries->get (cnode)->size;
> +       }
>
>        /* Add all inline clones and callees that are duplicated.  */
>        for (e = cnode->callees; e; e = e->next_callee)
> @@ -372,6 +376,402 @@ lto_max_map (void)
>      new_partition ("empty");
>  }
>
> +/* Class implementing a union-find algorithm.  */
> +
> +class union_find
> +{
> +public:
> +
> +  int *parent;
> +  int *rank;
> +  int n;
> +  int successful_unions;
> +
> +  union_find (int num_nodes)
> +    {
> +      n = num_nodes;
> +      parent = XNEWVEC (int, n);
> +      rank   = XNEWVEC (int, n);
> +
> +      for (int i = 0; i < n; ++i)
> +       parent[i] = i;
> +
> +      memset (rank, 0, n*sizeof(*rank));
> +      successful_unions = 0;
> +    }
> +
> +  ~union_find ()
> +    {
> +      free (parent);
> +      free (rank);
> +    }
> +
> +  int find (int x)
> +    {
> +      while (parent[x] != x)
> +       {
> +         parent[x] = parent[parent[x]];
> +         x = parent[x];
> +       }
> +      return x;
> +    }
> +
> +  void unite (int x, int y)
> +    {
> +      int x_root = find (x);
> +      int y_root = find (y);
> +
> +      if (x_root == y_root) /* If x and y are in same set.  */
> +       return;
> +
> +      successful_unions++;
> +
> +      if (rank[x_root] > rank[y_root]) /* Get which ones have greater rank.  */
> +       {
> +         x_root ^= y_root; /* Swap.  */
> +         y_root ^= x_root;
> +         x_root ^= y_root;

You can use std::swap (x_root, y_root); which is nicer to read.

> +       }
> +
> +      parent[y_root] = x_root;
> +      if (rank[x_root] == rank[y_root])
> +       rank[x_root]++;
> +    }
> +
> +  void print_roots ()
> +    {
> +      int i;
> +      for (i = 0; i < n; ++i)
> +       printf ("%d, ", find (i));
> +      printf ("\n");
> +    }
> +};
> +
> +static union_find *ds;
> +
> +DEBUG_FUNCTION void ds_print_roots (void)
> +{
> +  ds->print_roots ();
> +}
> +
> +static bool
> +privatize_symbol_name (symtab_node *);
> +
> +static void
> +promote_symbol (symtab_node *);
> +
> +/* Quickly balance partitions, trying to reach target_size in each of
> +   them.  Returns true if something was done, or false if we decided
> +   that it is not worth.  */
> +
> +static bool
> +balance_partitions (union_find *ds, int n, int jobs)
> +{
> +  int *sizes, i, j;
> +  int total_size = 0, max_size = -1;
> +  int target_size;
> +  const int eps = 0;
> +
> +  symtab_node *node;
> +
> +  sizes = (int *) alloca (n * sizeof (*sizes));
> +  memset (sizes, 0, n * sizeof (*sizes));
> +
> +  /* Compute costs.  */
> +  i = 0;
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      int root = ds->find (i);
> +
> +      if (cgraph_node *cnode = dyn_cast<cgraph_node *> (node))
> +       {
> +         ipa_size_summary *summary = ipa_size_summaries->get (cnode);
> +         if (summary)
> +           sizes[root] += summary->size;
> +         else
> +           sizes[root] += 10;
> +       }
> +      else
> +       sizes[root] += 10;
> +
> +
> +      i++;
> +    }
> +
> +  /* Compute the total size and maximum size.  */
> +  for (i = 0; i < n; ++i)
> +    {
> +      total_size += sizes[i];
> +      max_size    = MAX (max_size, sizes[i]);
> +    }
> +
> +  /* Quick return if total size is small.  */
> +  if (total_size < param_min_partition_size)
> +    return false;
> +
> +  target_size = total_size / (jobs + 1);
> +
> +  /* Unite small partitions.  */
> +  for (i = 0, j = 0; j < n; ++j)
> +    {
> +      if (sizes[j] == 0)
> +       continue;
> +
> +      if (i == -1)
> +       i = j;
> +      else
> +       {
> +         if (sizes[i] + sizes[j] < target_size + eps)
> +           {
> +             ds->unite (i, j);
> +             sizes[i] += sizes[j];
> +             sizes[j] = 0;
> +           }
> +         else
> +             i = j;
> +       }
> +    }
> +  return true;
> +}
> +
> +/* Builds the LTRANS partitions, or return if not needed.  */
> +
> +static int
> +build_ltrans_partitions (union_find *ds, int n)
> +{
> +  int i, n_partitions;
> +  symtab_node *node;
> +
> +  int *compression = (int *) alloca (n * sizeof (*compression));
> +  for (i = 0; i < n; ++i)
> +    compression[i] = -1; /* Invalid value.  */
> +
> +  i = 0, n_partitions = 0;
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      int root = ds->find (i);
> +      node->aux2 = root;
> +      node->aux = NULL;
> +
> +      if (node->get_partitioning_class () == SYMBOL_PARTITION
> +         && compression[root] < 0)
> +       compression[root] = n_partitions++;
> +      i++;
> +    }
> +
> +  if (dump_file)
> +    fprintf (dump_file, "n_partitions = %d\n", n_partitions);
> +
> +  if (n_partitions <= 1)
> +    return false;
> +
> +  /* Create LTRANS partitions.  */
> +  ltrans_partitions.create (n_partitions);
> +  for (i = 0; i < n_partitions; i++)
> +    new_partition ("");
> +
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      if (node->get_partitioning_class () != SYMBOL_PARTITION
> +         || symbol_partitioned_p (node))
> +         continue;
> +
> +      int p = compression[node->aux2];
> +      if (dump_file)
> +       fprintf (dump_file, "p = %d\t;; %s\n", p, node->dump_name ());
> +      add_symbol_to_partition (ltrans_partitions[p], node);
> +    }
> +
> +  return true;
> +}
> +
> +/* Partition COMDAT groups together, and also bring together nodes that
> +   requires them. Such nodes that are not in the COMDAT group that have
> +   references to COMDAT grouped nodes are called the COMDAT frontier.  */
> +
> +static bool
> +merge_comdat_nodes (symtab_node *node, int set)
> +{
> +  enum symbol_partitioning_class c = node->get_partitioning_class ();
> +  bool ret = false;
> +  symtab_node *node1;
> +  cgraph_edge *e;
> +
> +  /* If node is already analysed, quickly return.  */
> +  if (node->aux)
> +    return false;
> +
> +  /* Mark as analysed.  */
> +  node->aux = (void *) 1;
> +
> +
> +  /* Aglomerate the COMDAT group into the same partition.  */
> +  if (node->same_comdat_group)
> +    {
> +      for (node1 = node->same_comdat_group;
> +          node1 != node; node1 = node1->same_comdat_group)
> +       if (!node->alias)
> +         {
> +           ds->unite (node1->aux2, set);
> +           merge_comdat_nodes (node1, set);
> +           ret = true;
> +         }
> +    }
> +
> +  /* Look at nodes that can reach the COMDAT group, and aglomerate to the
> +     same partition.  These nodes are called the "COMDAT Frontier".  The
> +     idea is that every unpartitioned node that reaches a COMDAT group MUST
> +     go through the COMDAT frontier before reaching it.  Therefore, only
> +     nodes in the frontier are exported.  */
> +  if (node->same_comdat_group || c == SYMBOL_DUPLICATE)
> +    {
> +      int i;
> +      struct ipa_ref *ref = NULL;
> +
> +      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> +       {
> +         /* Add all inline clones and callees that are duplicated.  */
> +         for (e = cnode->callers; e; e = e->next_caller)
> +           if (!e->inline_failed || c == SYMBOL_DUPLICATE)
> +             {
> +               ds->unite (set, e->caller->aux2);
> +               merge_comdat_nodes (e->caller, set);
> +               ret = true;
> +             }
> +
> +         /* Add all thunks associated with the function.  */
> +         for (e = cnode->callees; e; e = e->next_callee)
> +           if (e->caller->thunk.thunk_p && !e->caller->inlined_to)
> +             {
> +               ds->unite (set, e->callee->aux2);
> +               merge_comdat_nodes (e->callee, set);
> +               ret = true;
> +             }
> +       }
> +
> +      for (i = 0; node->iterate_referring (i, ref); i++)
> +       {
> +         symtab_node *node1 = ref->referring;
> +         ds->unite (node1->aux2, set);
> +         ret = true;
> +
> +         if (node1->get_partitioning_class () == SYMBOL_DUPLICATE)
> +           merge_comdat_nodes (node1, set);
> +       }
> +    }
> +
> +  return ret;
> +}
> +
> +/* Bring together static nodes that are called by static functions, so
> +   promotion of statics to globals are not required.  This *MIGHT* negatively
> +   impact the number of partitions, and even generate very umbalanced
> +   partitions that can't be fixed.  */
> +
> +static bool
> +merge_static_calls (symtab_node *node, int set)
> +{
> +  bool ret = false;
> +  enum symbol_partitioning_class c = node->get_partitioning_class ();
> +
> +  if (node->aux)
> +    return false;
> +
> +  node->aux = (void *) 1;
> +
> +
> +  if (!TREE_PUBLIC (node->decl) || c == SYMBOL_DUPLICATE)
> +    {
> +      int i;
> +      struct ipa_ref *ref = NULL;
> +
> +      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
> +       {
> +         for (cgraph_edge *e = cnode->callers; e; e = e->next_caller)
> +           {
> +             /* FIXME: In theory, inlined functions should be a criteria to not
> +                merge partitions.  */
> +             ds->unite (node->aux2, e->caller->aux2);
> +             merge_static_calls (e->caller, set);
> +             ret = true;
> +           }
> +
> +       }
> +
> +      for (i = 0; node->iterate_referring (i, ref); ++i)
> +       {
> +         symtab_node *node1 = ref->referring;
> +         ds->unite (node1->aux2, set);
> +         merge_static_calls (node1, set);
> +         ret = true;
> +       }
> +    }
> +
> +  return ret;
> +}
> +
> +static bool
> +merge_contained_symbols (symtab_node *node, int set)
> +{
> +  bool ret = false;
> +  symtab_node *node1;
> +
> +  while ((node1 = contained_in_symbol (node)) != node)
> +    {
> +      node = node1;
> +      ds->unite (node->aux2, set);
> +      ret = true;
> +    }
> +
> +  return ret;
> +}
> +
> +/* Partition the program into several partitions with a restriction that
> +   COMDATS are partitioned together with all nodes requiring them.  If
> +   promote_statics is false, we also partition together static functions
> +   and nodes that call eachother, so non-public functions are not promoted
> +   to globals.  */
> +
> +void
> +lto_merge_comdat_map (bool balance, bool promote_statics, int jobs)
> +{
> +  symtab_node *node;
> +  int n = 0;
> +
> +  /* Initialize each not into its own distinct disjoint sets.  */
> +  FOR_EACH_SYMBOL (node)
> +    node->aux2 = n++;
> +
> +  union_find disjoint_sets = union_find (n);
> +  ds = &disjoint_sets;
> +
> +  /* First look at COMDATs.  */
> +  FOR_EACH_SYMBOL (node)
> +    {
> +      if (node->same_comdat_group)
> +       merge_comdat_nodes (node, node->aux2);
> +      merge_contained_symbols (node, node->aux2);
> +    }
> +
> +  FOR_EACH_SYMBOL (node)
> +    node->aux = NULL;
> +
> +  /* Then look at STATICs, if needed.  */
> +  if (!promote_statics)
> +    FOR_EACH_SYMBOL (node)
> +      if (!TREE_PUBLIC (node->decl))
> +       merge_static_calls (node, node->aux2);
> +
> +  FOR_EACH_SYMBOL (node)
> +    node->aux = NULL;
> +
> +  if (balance && !balance_partitions (&disjoint_sets, n, jobs))
> +    return;
> +
> +  build_ltrans_partitions (&disjoint_sets, n);
> +}
> +
> +
>  /* Helper function for qsort; sort nodes by order.  */
>  static int
>  node_cmp (const void *pa, const void *pb)
> @@ -931,7 +1331,7 @@ static hash_map<const char *, unsigned> *lto_clone_numbers;
>     represented by DECL.  */
>
>  static bool
> -privatize_symbol_name_1 (symtab_node *node, tree decl)
> +privatize_symbol_name_1 (symtab_node *node, tree decl, bool wpa)
>  {
>    const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl));
>
> @@ -939,11 +1339,19 @@ privatize_symbol_name_1 (symtab_node *node, tree decl)
>      return false;
>
>    name = maybe_rewrite_identifier (name);
> -  unsigned &clone_number = lto_clone_numbers->get_or_insert (name);
> -  symtab->change_decl_assembler_name (decl,
> -                                     clone_function_name (
> -                                         name, "lto_priv", clone_number));
> -  clone_number++;
> +  if (wpa)
> +    {
> +      gcc_assert (lto_clone_numbers);
> +
> +      unsigned &clone_number = lto_clone_numbers->get_or_insert (name);
> +      symtab->change_decl_assembler_name (decl,
> +                                         clone_function_name (
> +                                             name, "lto_priv", clone_number));
> +      clone_number++;
> +    }
> +  else
> +    symtab->change_decl_assembler_name (decl, get_file_function_name
> +                                       (node->asm_name ()));
>
>    if (node->lto_file_data)
>      lto_record_renamed_decl (node->lto_file_data, name,
> @@ -968,7 +1376,9 @@ privatize_symbol_name_1 (symtab_node *node, tree decl)
>  static bool
>  privatize_symbol_name (symtab_node *node)
>  {
> -  if (!privatize_symbol_name_1 (node, node->decl))
> +  bool wpa = !split_outputs;
> +
> +  if (!privatize_symbol_name_1 (node, node->decl, wpa))
>      return false;
>
>    return true;
> @@ -1117,7 +1527,7 @@ rename_statics (lto_symtab_encoder_t encoder, symtab_node *node)
>     all inlinees are added.  */
>
>  void
> -lto_promote_cross_file_statics (void)
> +lto_promote_cross_file_statics (bool promote)
>  {
>    unsigned i, n_sets;
>
> @@ -1147,10 +1557,17 @@ lto_promote_cross_file_statics (void)
>            lsei_next (&lsei))
>          {
>            symtab_node *node = lsei_node (lsei);
> +         cgraph_node *cnode = dyn_cast <cgraph_node *> (node);
>
>           /* If symbol is static, rename it if its assembler name
>              clashes with anything else in this unit.  */
>           rename_statics (encoder, node);
> +         if (cnode)
> +           {
> +             bool in_partition = lsei.encoder->nodes[lsei.index].in_partition;
> +             if (!in_partition)
> +               cnode->local = false;
> +           }
>
>           /* No need to promote if symbol already is externally visible ... */
>           if (node->externally_visible
> @@ -1163,8 +1580,12 @@ lto_promote_cross_file_statics (void)
>               validize_symbol_for_target (node);
>               continue;
>             }
> -
> -          promote_symbol (node);
> +         if (promote)
> +           {
> +             promote_symbol (node);
> +             if (cnode && split_outputs)
> +               cnode->local = false;
> +           }
>          }
>      }
>    delete lto_clone_numbers;
> @@ -1186,3 +1607,23 @@ lto_promote_statics_nonwpa (void)
>      }
>    delete lto_clone_numbers;
>  }
> +
> +/* Check if a variable is accessed across partitions.  If yesm then update
> +   used_from_other_partition.  */
> +
> +void
> +lto_check_usage_from_other_partitions (void)
> +{
> +  unsigned int i, j;
> +  for (i = 0; i < ltrans_partitions.length (); i++)
> +    {
> +      vec<lto_encoder_entry> &nodes = (ltrans_partitions[i])->encoder->nodes;
> +
> +      for (j = 0; j < nodes.length (); j++)
> +       {
> +         symtab_node *node = nodes[j].node;
> +         if (node && !nodes[j].in_partition)
> +           node->used_from_other_partition = true;
> +       }
> +    }
> +}
> diff --git a/gcc/lto/lto-partition.h b/gcc/lto-partition.h
> similarity index 89%
> rename from gcc/lto/lto-partition.h
> rename to gcc/lto-partition.h
> index 42b5ea8c80c..4a1b17fa728 100644
> --- a/gcc/lto/lto-partition.h
> +++ b/gcc/lto-partition.h
> @@ -36,6 +36,8 @@ extern vec<ltrans_partition> ltrans_partitions;
>  void lto_1_to_1_map (void);
>  void lto_max_map (void);
>  void lto_balanced_map (int, int);
> -void lto_promote_cross_file_statics (void);
> +void lto_promote_cross_file_statics (bool promote);
>  void free_ltrans_partitions (void);
>  void lto_promote_statics_nonwpa (void);
> +void lto_check_usage_from_other_partitions (void);
> +void lto_merge_comdat_map (bool, bool, int);
> diff --git a/gcc/lto/Make-lang.in b/gcc/lto/Make-lang.in
> index 0b73f9ef7bb..46b52cff183 100644
> --- a/gcc/lto/Make-lang.in
> +++ b/gcc/lto/Make-lang.in
> @@ -24,9 +24,9 @@ LTO_EXE = lto1$(exeext)
>  LTO_DUMP_EXE = lto-dump$(exeext)
>  LTO_DUMP_INSTALL_NAME := $(shell echo lto-dump|sed '$(program_transform_name)')
>  # The LTO-specific object files inclued in $(LTO_EXE).
> -LTO_OBJS = lto/lto-lang.o lto/lto.o lto/lto-object.o attribs.o lto/lto-partition.o lto/lto-symtab.o lto/lto-common.o
> +LTO_OBJS = lto/lto-lang.o lto/lto.o lto/lto-object.o attribs.o lto/lto-symtab.o lto/lto-common.o
>  lto_OBJS = $(LTO_OBJS)
> -LTO_DUMP_OBJS = lto/lto-lang.o lto/lto-object.o attribs.o lto/lto-partition.o lto/lto-symtab.o lto/lto-dump.o lto/lto-common.o
> +LTO_DUMP_OBJS = lto/lto-lang.o lto/lto-object.o attribs.o lto/lto-symtab.o lto/lto-dump.o lto/lto-common.o
>  lto_dump_OBJS = $(LTO_DUMP_OBJS)
>
>  # this is only useful in a LTO bootstrap, but this does not work right
> diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
> index 1c37814bde4..803b9920e35 100644
> --- a/gcc/lto/lto.c
> +++ b/gcc/lto/lto.c
> @@ -515,7 +515,7 @@ do_whole_program_analysis (void)
>    /* Find out statics that need to be promoted
>       to globals with hidden visibility because they are accessed from multiple
>       partitions.  */
> -  lto_promote_cross_file_statics ();
> +  lto_promote_cross_file_statics (true);
>    if (dump_file)
>       dump_end (partition_dump_id, dump_file);
>    dump_file = NULL;
> diff --git a/gcc/params.opt b/gcc/params.opt
> index f39e5d1a012..00fc58cd5cc 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -366,6 +366,14 @@ Minimal size of a partition for LTO (in estimated instructions).
>  Common Joined UInteger Var(param_lto_partitions) Init(128) IntegerRange(1, 65536) Param
>  Number of partitions the program should be split to.
>
> +-param=promote-statics=
> +Common Joined UInteger Var(param_promote_statics) Init(0) IntegerRange(0, 1) Param
> +Allow statics and non-public functions to be promoted as public when compiling in parallel.
> +
> +-param=balance-partitions=
> +Common Joined UInteger Var(param_balance_partitions) Init(1) IntegerRange(0, 1) Param
> +When compiling in parallel, try to balance the partitions for compilation performance.
> +
>  -param=max-average-unrolled-insns=
>  Common Joined UInteger Var(param_max_average_unrolled_insns) Init(80) Param Optimization
>  The maximum number of instructions to consider to unroll in a loop on average.
> diff --git a/gcc/tree.c b/gcc/tree.c
> index d0202c3f785..3ca162d5070 100644
> --- a/gcc/tree.c
> +++ b/gcc/tree.c
> @@ -9595,6 +9595,24 @@ make_anon_name ()
>    return id;
>  }
>
> +/* Filter the input name removing characters that may confuse the linker.  */
> +
> +static void
> +filter_name (char *name)
> +{
> +  char *p = name;
> +
> +  while (*p != '\0')
> +    {
> +      switch (*p)
> +       {
> +         case '*':
> +           *p = '_';
> +       }
> +      p++;
> +    }
> +}
> +
>  /* Generate a name for a special-purpose function.
>     The generated name may need to be unique across the whole link.
>     Changes to this function may also require corresponding changes to
> @@ -9651,8 +9669,7 @@ get_file_function_name (const char *type)
>        q = (char *) alloca (9 + 19 + len + 1);
>        memcpy (q, file, len + 1);
>
> -      snprintf (q + len, 9 + 19 + 1, "_%08X_" HOST_WIDE_INT_PRINT_HEX,
> -               crc32_string (0, name), get_random_seed (false));
> +      snprintf (q + len, 9 + 19 + 1, "_%08X", crc32_string (0, name));
>
>        p = q;
>      }
> @@ -9665,7 +9682,9 @@ get_file_function_name (const char *type)
>       Use a global object (which is already required to be unique over
>       the program) rather than the file name (which imposes extra
>       constraints).  */
> +
>    sprintf (buf, FILE_FUNCTION_FORMAT, type, p);
> +  filter_name (buf);

I wonder why you need this - none of the other callers of
get_file_function_name do so?

I'd rather have you not use get_file_function_name but instead
modify the partitioners symbol promotion code to append a
hash computed once by the partitioning code.  Like maybe simply
hash the cgraph structure somehow:  for-each cgraph node in
UID oder hash UID and merge hashes of all callees (or if we
have some simple way, hash node UIDs in PRE order of the
graph?).

I wonder why the LTO code does not run into collisions - maybe
we do not try hard enough?  Guess doing LTO bootstrap with a modified
-flto that randomly turns itself on/off might show some cases.

Btw, I'd default to the symbol promotion being enabled for
explicit -fparallel.

Richard.

>
>    return get_identifier (buf);
>  }
> --
> 2.28.0
>
diff mbox series

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 79e854aa938..be42b15f4ff 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1457,6 +1457,7 @@  OBJS = \
 	lra-spills.o \
 	lto-cgraph.o \
 	lto-streamer.o \
+	lto-partition.o \
 	lto-streamer-in.o \
 	lto-streamer-out.o \
 	lto-section-in.o \
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 0211f08964f..b4a7871bd3d 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -615,6 +615,7 @@  public:
   struct lto_file_decl_data * lto_file_data;
 
   PTR GTY ((skip)) aux;
+  int aux2;
 
   /* Comdat group the symbol is in.  Can be private if GGC allowed that.  */
   tree x_comdat_group;
diff --git a/gcc/lto/lto-partition.c b/gcc/lto-partition.c
similarity index 78%
rename from gcc/lto/lto-partition.c
rename to gcc/lto-partition.c
index 8e0488ab13e..ca962e69b5d 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto-partition.c
@@ -170,7 +170,11 @@  add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
     {
       struct cgraph_edge *e;
       if (!node->alias && c == SYMBOL_PARTITION)
-	part->insns += ipa_size_summaries->get (cnode)->size;
+	{
+	  /* FIXME: Find out why this is being returned NULL in some cases.  */
+	  if (ipa_size_summaries->get (cnode))
+	    part->insns += ipa_size_summaries->get (cnode)->size;
+	}
 
       /* Add all inline clones and callees that are duplicated.  */
       for (e = cnode->callees; e; e = e->next_callee)
@@ -372,6 +376,402 @@  lto_max_map (void)
     new_partition ("empty");
 }
 
+/* Class implementing a union-find algorithm.  */
+
+class union_find
+{
+public:
+
+  int *parent;
+  int *rank;
+  int n;
+  int successful_unions;
+
+  union_find (int num_nodes)
+    {
+      n = num_nodes;
+      parent = XNEWVEC (int, n);
+      rank   = XNEWVEC (int, n);
+
+      for (int i = 0; i < n; ++i)
+	parent[i] = i;
+
+      memset (rank, 0, n*sizeof(*rank));
+      successful_unions = 0;
+    }
+
+  ~union_find ()
+    {
+      free (parent);
+      free (rank);
+    }
+
+  int find (int x)
+    {
+      while (parent[x] != x)
+	{
+	  parent[x] = parent[parent[x]];
+	  x = parent[x];
+	}
+      return x;
+    }
+
+  void unite (int x, int y)
+    {
+      int x_root = find (x);
+      int y_root = find (y);
+
+      if (x_root == y_root) /* If x and y are in same set.  */
+	return;
+
+      successful_unions++;
+
+      if (rank[x_root] > rank[y_root]) /* Get which ones have greater rank.  */
+	{
+	  x_root ^= y_root; /* Swap.  */
+	  y_root ^= x_root;
+	  x_root ^= y_root;
+	}
+
+      parent[y_root] = x_root;
+      if (rank[x_root] == rank[y_root])
+	rank[x_root]++;
+    }
+
+  void print_roots ()
+    {
+      int i;
+      for (i = 0; i < n; ++i)
+	printf ("%d, ", find (i));
+      printf ("\n");
+    }
+};
+
+static union_find *ds;
+
+DEBUG_FUNCTION void ds_print_roots (void)
+{
+  ds->print_roots ();
+}
+
+static bool
+privatize_symbol_name (symtab_node *);
+
+static void
+promote_symbol (symtab_node *);
+
+/* Quickly balance partitions, trying to reach target_size in each of
+   them.  Returns true if something was done, or false if we decided
+   that it is not worth.  */
+
+static bool
+balance_partitions (union_find *ds, int n, int jobs)
+{
+  int *sizes, i, j;
+  int total_size = 0, max_size = -1;
+  int target_size;
+  const int eps = 0;
+
+  symtab_node *node;
+
+  sizes = (int *) alloca (n * sizeof (*sizes));
+  memset (sizes, 0, n * sizeof (*sizes));
+ 
+  /* Compute costs.  */
+  i = 0;
+  FOR_EACH_SYMBOL (node)
+    {
+      int root = ds->find (i);
+
+      if (cgraph_node *cnode = dyn_cast<cgraph_node *> (node))
+	{
+	  ipa_size_summary *summary = ipa_size_summaries->get (cnode);
+	  if (summary)
+	    sizes[root] += summary->size;
+	  else
+	    sizes[root] += 10;
+	}
+      else
+	sizes[root] += 10;
+
+
+      i++;
+    }
+
+  /* Compute the total size and maximum size.  */
+  for (i = 0; i < n; ++i)
+    {
+      total_size += sizes[i];
+      max_size    = MAX (max_size, sizes[i]);
+    }
+
+  /* Quick return if total size is small.  */
+  if (total_size < param_min_partition_size)
+    return false;
+
+  target_size = total_size / (jobs + 1);
+
+  /* Unite small partitions.  */
+  for (i = 0, j = 0; j < n; ++j)
+    {
+      if (sizes[j] == 0)
+	continue;
+
+      if (i == -1)
+	i = j;
+      else
+	{
+	  if (sizes[i] + sizes[j] < target_size + eps)
+	    {
+	      ds->unite (i, j);
+	      sizes[i] += sizes[j];
+	      sizes[j] = 0;
+	    }
+	  else
+	      i = j;
+	}
+    }
+  return true;
+}
+
+/* Builds the LTRANS partitions, or return if not needed.  */
+
+static int
+build_ltrans_partitions (union_find *ds, int n)
+{
+  int i, n_partitions;
+  symtab_node *node;
+
+  int *compression = (int *) alloca (n * sizeof (*compression));
+  for (i = 0; i < n; ++i)
+    compression[i] = -1; /* Invalid value.  */
+
+  i = 0, n_partitions = 0;
+  FOR_EACH_SYMBOL (node)
+    {
+      int root = ds->find (i);
+      node->aux2 = root;
+      node->aux = NULL;
+
+      if (node->get_partitioning_class () == SYMBOL_PARTITION
+	  && compression[root] < 0)
+	compression[root] = n_partitions++;
+      i++;
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "n_partitions = %d\n", n_partitions);
+
+  if (n_partitions <= 1)
+    return false;
+
+  /* Create LTRANS partitions.  */
+  ltrans_partitions.create (n_partitions);
+  for (i = 0; i < n_partitions; i++)
+    new_partition ("");
+
+  FOR_EACH_SYMBOL (node)
+    {
+      if (node->get_partitioning_class () != SYMBOL_PARTITION
+	  || symbol_partitioned_p (node))
+	  continue;
+
+      int p = compression[node->aux2];
+      if (dump_file)
+	fprintf (dump_file, "p = %d\t;; %s\n", p, node->dump_name ());
+      add_symbol_to_partition (ltrans_partitions[p], node);
+    }
+
+  return true;
+}
+
+/* Partition COMDAT groups together, and also bring together nodes that
+   requires them. Such nodes that are not in the COMDAT group that have
+   references to COMDAT grouped nodes are called the COMDAT frontier.  */
+
+static bool
+merge_comdat_nodes (symtab_node *node, int set)
+{
+  enum symbol_partitioning_class c = node->get_partitioning_class ();
+  bool ret = false;
+  symtab_node *node1;
+  cgraph_edge *e;
+
+  /* If node is already analysed, quickly return.  */
+  if (node->aux)
+    return false;
+
+  /* Mark as analysed.  */
+  node->aux = (void *) 1;
+
+
+  /* Aglomerate the COMDAT group into the same partition.  */
+  if (node->same_comdat_group)
+    {
+      for (node1 = node->same_comdat_group;
+	   node1 != node; node1 = node1->same_comdat_group)
+	if (!node->alias)
+	  {
+	    ds->unite (node1->aux2, set);
+	    merge_comdat_nodes (node1, set);
+	    ret = true;
+	  }
+    }
+
+  /* Look at nodes that can reach the COMDAT group, and aglomerate to the
+     same partition.  These nodes are called the "COMDAT Frontier".  The
+     idea is that every unpartitioned node that reaches a COMDAT group MUST
+     go through the COMDAT frontier before reaching it.  Therefore, only
+     nodes in the frontier are exported.  */
+  if (node->same_comdat_group || c == SYMBOL_DUPLICATE)
+    {
+      int i;
+      struct ipa_ref *ref = NULL;
+
+      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
+	{
+	  /* Add all inline clones and callees that are duplicated.  */
+	  for (e = cnode->callers; e; e = e->next_caller)
+	    if (!e->inline_failed || c == SYMBOL_DUPLICATE)
+	      {
+		ds->unite (set, e->caller->aux2);
+		merge_comdat_nodes (e->caller, set);
+		ret = true;
+	      }
+
+	  /* Add all thunks associated with the function.  */
+	  for (e = cnode->callees; e; e = e->next_callee)
+	    if (e->caller->thunk.thunk_p && !e->caller->inlined_to)
+	      {
+		ds->unite (set, e->callee->aux2);
+		merge_comdat_nodes (e->callee, set);
+		ret = true;
+	      }
+	}
+
+      for (i = 0; node->iterate_referring (i, ref); i++)
+	{
+	  symtab_node *node1 = ref->referring;
+	  ds->unite (node1->aux2, set);
+	  ret = true;
+
+	  if (node1->get_partitioning_class () == SYMBOL_DUPLICATE)
+	    merge_comdat_nodes (node1, set);
+	}
+    }
+
+  return ret;
+}
+
+/* Bring together static nodes that are called by static functions, so 
+   promotion of statics to globals are not required.  This *MIGHT* negatively
+   impact the number of partitions, and even generate very umbalanced
+   partitions that can't be fixed.  */
+
+static bool
+merge_static_calls (symtab_node *node, int set)
+{
+  bool ret = false;
+  enum symbol_partitioning_class c = node->get_partitioning_class ();
+
+  if (node->aux)
+    return false;
+
+  node->aux = (void *) 1;
+
+
+  if (!TREE_PUBLIC (node->decl) || c == SYMBOL_DUPLICATE)
+    {
+      int i;
+      struct ipa_ref *ref = NULL;
+
+      if (cgraph_node *cnode = dyn_cast <cgraph_node *> (node))
+	{
+	  for (cgraph_edge *e = cnode->callers; e; e = e->next_caller)
+	    {
+	      /* FIXME: In theory, inlined functions should be a criteria to not
+		 merge partitions.  */
+	      ds->unite (node->aux2, e->caller->aux2);
+	      merge_static_calls (e->caller, set);
+	      ret = true;
+	    }
+
+	}
+
+      for (i = 0; node->iterate_referring (i, ref); ++i)
+	{
+	  symtab_node *node1 = ref->referring;
+	  ds->unite (node1->aux2, set);
+	  merge_static_calls (node1, set);
+	  ret = true;
+	}
+    }
+
+  return ret;
+}
+
+static bool
+merge_contained_symbols (symtab_node *node, int set)
+{
+  bool ret = false;
+  symtab_node *node1;
+
+  while ((node1 = contained_in_symbol (node)) != node)
+    {
+      node = node1;
+      ds->unite (node->aux2, set);
+      ret = true;
+    }
+
+  return ret;
+}
+
+/* Partition the program into several partitions with a restriction that
+   COMDATS are partitioned together with all nodes requiring them.  If
+   promote_statics is false, we also partition together static functions
+   and nodes that call eachother, so non-public functions are not promoted
+   to globals.  */
+
+void
+lto_merge_comdat_map (bool balance, bool promote_statics, int jobs)
+{
+  symtab_node *node;
+  int n = 0;
+
+  /* Initialize each not into its own distinct disjoint sets.  */
+  FOR_EACH_SYMBOL (node)
+    node->aux2 = n++;
+
+  union_find disjoint_sets = union_find (n);
+  ds = &disjoint_sets;
+
+  /* First look at COMDATs.  */
+  FOR_EACH_SYMBOL (node)
+    {
+      if (node->same_comdat_group)
+	merge_comdat_nodes (node, node->aux2);
+      merge_contained_symbols (node, node->aux2);
+    }
+
+  FOR_EACH_SYMBOL (node)
+    node->aux = NULL;
+
+  /* Then look at STATICs, if needed.  */
+  if (!promote_statics)
+    FOR_EACH_SYMBOL (node)
+      if (!TREE_PUBLIC (node->decl))
+	merge_static_calls (node, node->aux2);
+
+  FOR_EACH_SYMBOL (node)
+    node->aux = NULL;
+
+  if (balance && !balance_partitions (&disjoint_sets, n, jobs))
+    return;
+
+  build_ltrans_partitions (&disjoint_sets, n);
+}
+
+
 /* Helper function for qsort; sort nodes by order.  */
 static int
 node_cmp (const void *pa, const void *pb)
@@ -931,7 +1331,7 @@  static hash_map<const char *, unsigned> *lto_clone_numbers;
    represented by DECL.  */
 
 static bool
-privatize_symbol_name_1 (symtab_node *node, tree decl)
+privatize_symbol_name_1 (symtab_node *node, tree decl, bool wpa)
 {
   const char *name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl));
 
@@ -939,11 +1339,19 @@  privatize_symbol_name_1 (symtab_node *node, tree decl)
     return false;
 
   name = maybe_rewrite_identifier (name);
-  unsigned &clone_number = lto_clone_numbers->get_or_insert (name);
-  symtab->change_decl_assembler_name (decl,
-				      clone_function_name (
-					  name, "lto_priv", clone_number));
-  clone_number++;
+  if (wpa)
+    {
+      gcc_assert (lto_clone_numbers);
+
+      unsigned &clone_number = lto_clone_numbers->get_or_insert (name);
+      symtab->change_decl_assembler_name (decl,
+					  clone_function_name (
+					      name, "lto_priv", clone_number));
+      clone_number++;
+    }
+  else
+    symtab->change_decl_assembler_name (decl, get_file_function_name
+					(node->asm_name ()));
 
   if (node->lto_file_data)
     lto_record_renamed_decl (node->lto_file_data, name,
@@ -968,7 +1376,9 @@  privatize_symbol_name_1 (symtab_node *node, tree decl)
 static bool
 privatize_symbol_name (symtab_node *node)
 {
-  if (!privatize_symbol_name_1 (node, node->decl))
+  bool wpa = !split_outputs;
+
+  if (!privatize_symbol_name_1 (node, node->decl, wpa))
     return false;
 
   return true;
@@ -1117,7 +1527,7 @@  rename_statics (lto_symtab_encoder_t encoder, symtab_node *node)
    all inlinees are added.  */
 
 void
-lto_promote_cross_file_statics (void)
+lto_promote_cross_file_statics (bool promote)
 {
   unsigned i, n_sets;
 
@@ -1147,10 +1557,17 @@  lto_promote_cross_file_statics (void)
 	   lsei_next (&lsei))
         {
           symtab_node *node = lsei_node (lsei);
+	  cgraph_node *cnode = dyn_cast <cgraph_node *> (node);
 
 	  /* If symbol is static, rename it if its assembler name
 	     clashes with anything else in this unit.  */
 	  rename_statics (encoder, node);
+	  if (cnode)
+	    {
+	      bool in_partition = lsei.encoder->nodes[lsei.index].in_partition;
+	      if (!in_partition)
+		cnode->local = false;
+	    }
 
 	  /* No need to promote if symbol already is externally visible ... */
 	  if (node->externally_visible
@@ -1163,8 +1580,12 @@  lto_promote_cross_file_statics (void)
 	      validize_symbol_for_target (node);
 	      continue;
 	    }
-
-          promote_symbol (node);
+	  if (promote)
+	    {
+	      promote_symbol (node);
+	      if (cnode && split_outputs)
+		cnode->local = false;
+	    }
         }
     }
   delete lto_clone_numbers;
@@ -1186,3 +1607,23 @@  lto_promote_statics_nonwpa (void)
     }
   delete lto_clone_numbers;
 }
+
+/* Check if a variable is accessed across partitions.  If yesm then update
+   used_from_other_partition.  */
+
+void
+lto_check_usage_from_other_partitions (void)
+{
+  unsigned int i, j;
+  for (i = 0; i < ltrans_partitions.length (); i++)
+    {
+      vec<lto_encoder_entry> &nodes = (ltrans_partitions[i])->encoder->nodes;
+
+      for (j = 0; j < nodes.length (); j++)
+	{
+	  symtab_node *node = nodes[j].node;
+	  if (node && !nodes[j].in_partition)
+	    node->used_from_other_partition = true;
+	}
+    }
+}
diff --git a/gcc/lto/lto-partition.h b/gcc/lto-partition.h
similarity index 89%
rename from gcc/lto/lto-partition.h
rename to gcc/lto-partition.h
index 42b5ea8c80c..4a1b17fa728 100644
--- a/gcc/lto/lto-partition.h
+++ b/gcc/lto-partition.h
@@ -36,6 +36,8 @@  extern vec<ltrans_partition> ltrans_partitions;
 void lto_1_to_1_map (void);
 void lto_max_map (void);
 void lto_balanced_map (int, int);
-void lto_promote_cross_file_statics (void);
+void lto_promote_cross_file_statics (bool promote);
 void free_ltrans_partitions (void);
 void lto_promote_statics_nonwpa (void);
+void lto_check_usage_from_other_partitions (void);
+void lto_merge_comdat_map (bool, bool, int);
diff --git a/gcc/lto/Make-lang.in b/gcc/lto/Make-lang.in
index 0b73f9ef7bb..46b52cff183 100644
--- a/gcc/lto/Make-lang.in
+++ b/gcc/lto/Make-lang.in
@@ -24,9 +24,9 @@  LTO_EXE = lto1$(exeext)
 LTO_DUMP_EXE = lto-dump$(exeext)
 LTO_DUMP_INSTALL_NAME := $(shell echo lto-dump|sed '$(program_transform_name)')
 # The LTO-specific object files inclued in $(LTO_EXE).
-LTO_OBJS = lto/lto-lang.o lto/lto.o lto/lto-object.o attribs.o lto/lto-partition.o lto/lto-symtab.o lto/lto-common.o
+LTO_OBJS = lto/lto-lang.o lto/lto.o lto/lto-object.o attribs.o lto/lto-symtab.o lto/lto-common.o
 lto_OBJS = $(LTO_OBJS)
-LTO_DUMP_OBJS = lto/lto-lang.o lto/lto-object.o attribs.o lto/lto-partition.o lto/lto-symtab.o lto/lto-dump.o lto/lto-common.o
+LTO_DUMP_OBJS = lto/lto-lang.o lto/lto-object.o attribs.o lto/lto-symtab.o lto/lto-dump.o lto/lto-common.o
 lto_dump_OBJS = $(LTO_DUMP_OBJS)
 
 # this is only useful in a LTO bootstrap, but this does not work right
diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index 1c37814bde4..803b9920e35 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -515,7 +515,7 @@  do_whole_program_analysis (void)
   /* Find out statics that need to be promoted
      to globals with hidden visibility because they are accessed from multiple
      partitions.  */
-  lto_promote_cross_file_statics ();
+  lto_promote_cross_file_statics (true);
   if (dump_file)
      dump_end (partition_dump_id, dump_file);
   dump_file = NULL;
diff --git a/gcc/params.opt b/gcc/params.opt
index f39e5d1a012..00fc58cd5cc 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -366,6 +366,14 @@  Minimal size of a partition for LTO (in estimated instructions).
 Common Joined UInteger Var(param_lto_partitions) Init(128) IntegerRange(1, 65536) Param
 Number of partitions the program should be split to.
 
+-param=promote-statics=
+Common Joined UInteger Var(param_promote_statics) Init(0) IntegerRange(0, 1) Param
+Allow statics and non-public functions to be promoted as public when compiling in parallel.
+
+-param=balance-partitions=
+Common Joined UInteger Var(param_balance_partitions) Init(1) IntegerRange(0, 1) Param
+When compiling in parallel, try to balance the partitions for compilation performance.
+
 -param=max-average-unrolled-insns=
 Common Joined UInteger Var(param_max_average_unrolled_insns) Init(80) Param Optimization
 The maximum number of instructions to consider to unroll in a loop on average.
diff --git a/gcc/tree.c b/gcc/tree.c
index d0202c3f785..3ca162d5070 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -9595,6 +9595,24 @@  make_anon_name ()
   return id;
 }
 
+/* Filter the input name removing characters that may confuse the linker.  */
+
+static void
+filter_name (char *name)
+{
+  char *p = name;
+
+  while (*p != '\0')
+    {
+      switch (*p)
+	{
+	  case '*':
+	    *p = '_';
+	}
+      p++;
+    }
+}
+
 /* Generate a name for a special-purpose function.
    The generated name may need to be unique across the whole link.
    Changes to this function may also require corresponding changes to
@@ -9651,8 +9669,7 @@  get_file_function_name (const char *type)
       q = (char *) alloca (9 + 19 + len + 1);
       memcpy (q, file, len + 1);
 
-      snprintf (q + len, 9 + 19 + 1, "_%08X_" HOST_WIDE_INT_PRINT_HEX,
-		crc32_string (0, name), get_random_seed (false));
+      snprintf (q + len, 9 + 19 + 1, "_%08X", crc32_string (0, name));
 
       p = q;
     }
@@ -9665,7 +9682,9 @@  get_file_function_name (const char *type)
      Use a global object (which is already required to be unique over
      the program) rather than the file name (which imposes extra
      constraints).  */
+
   sprintf (buf, FILE_FUNCTION_FORMAT, type, p);
+  filter_name (buf);
 
   return get_identifier (buf);
 }