Patchwork WHOPR partitioning algorithm

Submitter Jan Hubicka
Date Sept. 3, 2010, 10:29 p.m.
Message ID <20100903222954.GQ1664@kam.mff.cuni.cz>
Permalink /patch/63728/
State New

Comments

Jan Hubicka - Sept. 3, 2010, 10:29 p.m.
Hi,
this patch adds a simple partitioning algorithm that is not source-file driven.
It performs measurably better than 1-to-1 partitioning, saving about 1 minute
of Mozilla compilation time (out of 5), reducing /tmp usage from 7GB to 1.6GB, and
reducing the resulting executable size by about 3-5%.

The algorithm is simple: it takes cgraph nodes in a predefined order and inserts
them into the current partition; when the partition grows too large, insertion
continues into a new partition.  To reduce the boundary between partitions, it
can undo the last few insertion decisions until the ratio of inter-partition
references to in-partition references is minimized.

Variables are dragged into the current partition when they are referenced and
are not already in an earlier partition.

I tried smarter graph clustering algorithms, but ended up with this one
because the others won't play well with the function reordering pass I plan to
implement next (the idea is to feed in the order from this new pass instead of
the currently used reverse postorder).  For Mozilla and other large projects it
seems very important to order functions sequentially in the binary.

All the algorithms resulted in pretty much the same binary size; the trick with
unwinding brings just a small benefit, about 1%.

Bootstrapped/regtested x86_64-linux.  Does this seem to make sense?

If accepted, I will update the existing WHOPR testcases to use -flto-partition=1to1,
since otherwise they will get optimized into a single partition and stop testing
what they are intended to test.

Honza

	* doc/invoke.texi (-flto-partition, lto-partitions, lto-minpartition):
	Document.
	* opts.c (decode_options): Handle lto partitions.
	* common.opt (flto-partition): New.
	* params.def (PARAM_LTO_PARTITIONS, MIN_PARTITION_SIZE): New.

	* lto.c:  Include params.h.
	(add_cgraph_node_to_partition, add_varpool_node_to_partition): Do
	refcounting in aux field.
	(undo_partition, partition_cgraph_node_p, partition_varpool_node_p):
	New functions.
	(lto_1_to_1_map): Simplify.
	(lto_balanced_map): New function.
	(do_whole_program_analysis): Choose proper partitioning algorithm.
	* Makefile.in (lto.o): Add dependency on params.h
Richard Guenther - Sept. 4, 2010, 9:13 a.m.
On Sat, 4 Sep 2010, Jan Hubicka wrote:

> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 163809)
> +++ doc/invoke.texi	(working copy)
> @@ -352,8 +352,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fira-loop-pressure -fno-ira-share-save-slots @gol
>  -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
>  -fivopts -fkeep-inline-functions -fkeep-static-consts @gol
> --floop-block -floop-interchange -floop-strip-mine @gol
> --floop-parallelize-all -flto -flto-compression-level -flto-report -fltrans @gol
> +-floop-block -floop-interchange -floop-strip-mine -floop-parallelize-all @gol
> +-flto -flto-partition=@var{alg} -flto-compression-level -flto-report -fltrans @gol
>  -fltrans-output-list -fmerge-all-constants -fmerge-constants -fmodulo-sched @gol
>  -fmodulo-sched-allow-regmoves -fmove-loop-invariants -fmudflap @gol
>  -fmudflapir -fmudflapth -fno-branch-count-reg -fno-default-inline @gol
> @@ -7628,6 +7628,13 @@ GNU make.
>  
>  Disabled by default.
>  
> +@item -flto-partition=@var{alg}
> +@opindex flto-partition
> +Specify partitioning algorithm used by @option{-fwhopr} mode.  The value is
> +either @code{1to1} to specify partitioning corresponding to source files
> +or @code{balanced} to specify partitioning into, if possible, equally sized
> +chunks.

Which one is the default?

>  @item -fwpa
>  @opindex fwpa
>  This is an internal option used by GCC when compiling with
> @@ -8821,6 +8828,16 @@ parameter in order to perform devirtuali
>  @option{devirt-type-list-size} is the maximum number of types it
>  stores per a single formal parameter of a function.
>  
> +@item lto-partitions
> +Specify the desired number of partitions produced during WHOPR compilation.
> +The number of partitions should exceed the number of CPUs used for compilation.
> +Default value is 32.
> +
> +@item lto-minpartition
> +Size of minimal partition for WHOPR (in estimated instructions).
> +This prevents the expense of splitting very small programs into too many
> +partitions.
> +
>  @end table
>  @end table
>  
> Index: opts.c
> ===================================================================
> --- opts.c	(revision 163809)
> +++ opts.c	(working copy)
> @@ -1083,6 +1083,13 @@ decode_options (unsigned int argc, const
>        error ("LTO support has not been enabled in this configuration");
>  #endif
>      }
> +  if (flag_lto_partition_balanced || flag_lto_partition_1to1)
> +    {
> +      if (flag_lto_partition_balanced && flag_lto_partition_1to1)
> 	error ("Only one -flto-partition value can be specified");
> +      if (!flag_whopr)
> 	error ("-flto-partition has effect only with -fwhopr");
> +    }

Can you add -flto-partition=single so we can eventually get rid of
-flto (or make it an alias for -fwhopr -flto-partition=single)?

That would avoid the above and eventually make the implementation
simpler.

>    /* Reconcile -flto and -fwhopr.  Set additional flags as appropriate and
>       check option consistency.  */
> Index: lto/lto.c
> ===================================================================
> --- lto/lto.c	(revision 163809)
> +++ lto/lto.c	(working copy)
> @@ -44,6 +44,7 @@ along with GCC; see the file COPYING3.  
>  #include "lto-tree.h"
>  #include "lto-streamer.h"
>  #include "splay-tree.h"
> +#include "params.h"
>  
>  /* This needs to be included after config.h.  Otherwise, _GNU_SOURCE will not
>     be defined in time to set __USE_GNU in the system headers, and strsignal
> @@ -760,12 +761,8 @@ add_cgraph_node_to_partition (ltrans_par
>    part->insns += node->local.inline_summary.self_size;
>  
>    if (node->aux)
> -    {
> -      gcc_assert (node->aux != part);
> -      node->in_other_partition = 1;
> -    }
> -  else
> -    node->aux = part;
> +    node->in_other_partition = 1;
> +  node->aux = (void *)((size_t)node->aux + 1);
>  
>    cgraph_node_set_add (part->cgraph_set, node);
>  
> @@ -789,12 +786,8 @@ add_varpool_node_to_partition (ltrans_pa
>    varpool_node_set_add (part->varpool_set, vnode);
>  
>    if (vnode->aux)
> -    {
> -      gcc_assert (vnode->aux != part);
> -      vnode->in_other_partition = 1;
> -    }
> -  else
> -    vnode->aux = part;
> +    vnode->in_other_partition = 1;
> +  vnode->aux = (void *)((size_t)vnode->aux + 1);
>  
>    add_references_to_partition (part, &vnode->ref_list);
>  
> @@ -803,6 +796,66 @@ add_varpool_node_to_partition (ltrans_pa
>      add_varpool_node_to_partition (part, vnode->same_comdat_group);
>  }
>  
> +/* Undo all additions until the number of cgraph nodes in PARTITION is N_CGRAPH_NODES
> +   and the number of varpool nodes is N_VARPOOL_NODES.  */
> +
> +static void
> +undo_partition (ltrans_partition partition, unsigned int n_cgraph_nodes,
> +		unsigned int n_varpool_nodes)
> +{
> +  while (VEC_length (cgraph_node_ptr, partition->cgraph_set->nodes) >
> +	 n_cgraph_nodes)
> +    {
> +      struct cgraph_node *node = VEC_index (cgraph_node_ptr,
> +					    partition->cgraph_set->nodes,
> +					    n_cgraph_nodes);
> +      partition->insns -= node->local.inline_summary.self_size;
> +      cgraph_node_set_remove (partition->cgraph_set, node);
> +      node->aux = (void *)((size_t)node->aux - 1);
> +    }
> +  while (VEC_length (varpool_node_ptr, partition->varpool_set->nodes) >
> +	 n_varpool_nodes)
> +    {
> +      struct varpool_node *node = VEC_index (varpool_node_ptr,
> +					     partition->varpool_set->nodes,
> +					     n_varpool_nodes);
> +      varpool_node_set_remove (partition->varpool_set, node);
> +      node->aux = (void *)((size_t)node->aux - 1);
> +    }
> +}
> +
> +/* Return true if NODE should be partitioned.  */

What does "partitioned" mean?  Duplicating the node in more than
1 partition?  Or does !partitioned mean that it isn't needed at all?
Please clarify.

> +static bool
> +partition_cgraph_node_p (struct cgraph_node *node)
> +{
> +  /* We will get proper partition based on function they are inlined to.  */
> +  if (node->global.inlined_to)
> +    return false;
> +  /* Nodes without a body do not need partitioning.  */
> +  if (!node->analyzed)
> +    return false;
> +  /* Extern inlines and comdat are always only in partitions they are needed.  */
> +  if (DECL_EXTERNAL (node->decl)
> +      || DECL_COMDAT (node->decl))
> +    return false;
> +  return true;
> +}
> +
> +/* Return true if VNODE should be partitioned.  */
> +
> +static bool
> +partition_varpool_node_p (struct varpool_node *vnode)
> +{
> +  if (vnode->alias || !vnode->needed)
> +    return false;
> +  /* Constant pool and comdat are always only in partitions they are needed.  */
> +  if (DECL_IN_CONSTANT_POOL (vnode->decl)
> +      || DECL_COMDAT (vnode->decl))
> +    return false;
> +  return true;
> +}
> +
>  /* Group cgraph nodes by input files.  This is used mainly for testing
>     right now.  */
>  
> @@ -823,15 +876,7 @@ lto_1_to_1_map (void)
>  
>    for (node = cgraph_nodes; node; node = node->next)
>      {
> -      /* We will get proper partition based on function they are inlined to.  */
> -      if (node->global.inlined_to)
> -	continue;
> -      /* Nodes without a body do not need partitioning.  */
> -      if (!node->analyzed)
> -	continue;
> -      /* Extern inlines and comdat are always only in partitions they are needed.  */
> -      if (DECL_EXTERNAL (node->decl)
> -	  || DECL_COMDAT (node->decl))
> +      if (!partition_cgraph_node_p (node))
>  	continue;
>  
>        file_data = node->local.lto_file_data;
> @@ -866,11 +911,7 @@ lto_1_to_1_map (void)
>  
>    for (vnode = varpool_nodes; vnode; vnode = vnode->next)
>      {
> -      if (vnode->alias || !vnode->needed)
> -	continue;
> -      /* Constant pool and comdat are always only in partitions they are needed.  */
> -      if (DECL_IN_CONSTANT_POOL (vnode->decl)
> -	  || DECL_COMDAT (vnode->decl))
> +      if (!partition_varpool_node_p (vnode))
>  	continue;
>        file_data = vnode->lto_file_data;
>        slot = pointer_map_contains (pmap, file_data);
> @@ -904,6 +945,235 @@ lto_1_to_1_map (void)
>  						 ltrans_partitions);
>  }
>  
> +
> +/* Group cgraph nodes in evenly sized partitions.  */

equally sized

> +static void
> +lto_balanced_map (void)
> +{
> +  int n_nodes = 0;
> +  struct cgraph_node **postorder =
> +    XCNEWVEC (struct cgraph_node *, cgraph_n_nodes);
> +  struct cgraph_node **order = XNEWVEC (struct cgraph_node *, cgraph_max_uid);
> +  int i, postorder_len;
> +  struct cgraph_node *node;
> +  int total_size = 0;
> +  int partition_size;
> +  ltrans_partition partition;
> +  unsigned int last_visited_cgraph_node = 0, last_visited_varpool_node = 0;
> +  struct varpool_node *vnode;
> +  int cost = 0, internal = 0;
> +  int best_n_nodes = 0, best_n_varpool_nodes = 0, best_i = 0, best_cost =
> +    INT_MAX, best_internal = 0;
> +  int npartitions;
> +
> +  postorder_len = cgraph_postorder (postorder);
> +  for (i = 0; i < postorder_len; i++)
> +    {
> +      node = postorder[i];
> +      if (partition_cgraph_node_p (node))
> +	{
> +	  order[n_nodes++] = node;
> +          total_size += node->local.inline_summary.self_size;
> +	}
> +    }
> +  free (postorder);
> +
> +  partition_size = total_size / PARAM_VALUE (PARAM_LTO_PARTITIONS);
> +  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
> +    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);

Hm, so there is no max-partition-size?

I expected that we try to partition the program by trying to
minimize reference edges between partitions while keeping
partition size in a min/max bound.  Is that the graph algorithm
you tried?

The above suggests that users have to adjust --param lto-partitions
for very large programs to be able to compile them, which sounds
bad.

Again, when trying to do partitioning it would be really helpful
if cgraph nodes and varpool nodes would share a common symtab
base and references would just reference those ...

> +  npartitions = 1;
> +  partition = new_partition ("");
> +  if (cgraph_dump_file)
> +    fprintf (cgraph_dump_file, "Total unit size: %i, partition size: %i\n",
> +	     total_size, partition_size);

The following needs a _big_ comment describing what it is doing.
What's its complexity?  What timevar does it get attributed to
(maybe add a new one?)

> +  for (i = 0; i < n_nodes; i++)
> +    {
> +      add_cgraph_node_to_partition (partition, order[i]);
> +      /* Also add all referenced variables to the partition unless they were
> +         added earlier.  */
> +      while (last_visited_cgraph_node <
> +	     VEC_length (cgraph_node_ptr, partition->cgraph_set->nodes)
> +	     || last_visited_varpool_node < VEC_length (varpool_node_ptr,
> +							partition->varpool_set->
> +							nodes))
> +	{
> +	  struct ipa_ref_list *refs;
> +	  int j;
> +	  struct ipa_ref *ref;
> +	  bool cgraph_p = false;
> +	  if (last_visited_cgraph_node <
> +	      VEC_length (cgraph_node_ptr, partition->cgraph_set->nodes))
> +	    {
> +	      struct cgraph_edge *edge;
> +
> +	      cgraph_p = true;
> +	      node = VEC_index (cgraph_node_ptr, partition->cgraph_set->nodes,
> +				last_visited_cgraph_node);
> +	      refs = &node->ref_list;
> +	      total_size -= node->local.inline_summary.self_size;
> +	      last_visited_cgraph_node++;
> +	      gcc_assert (node->analyzed);
> +	      for (edge = node->callees; edge; edge = edge->next_callee)
> +		if (edge->callee->analyzed)
> +		  {
> +		    int edge_cost = edge->frequency;
> +		    cgraph_node_set_iterator csi;
> +
> +		    if (!edge_cost)
> +		      edge_cost = 1;
> +		    gcc_assert (edge_cost > 0);
> +		    csi = cgraph_node_set_find (partition->cgraph_set, edge->callee);
> +		    if (!csi_end_p (csi)
> +		        && csi.index < last_visited_cgraph_node - 1)
> +		      cost -= edge_cost, internal+= edge_cost;
> +		    else
> +		      cost += edge_cost;
> +		  }
> +	      for (edge = node->callers; edge; edge = edge->next_caller)
> +		{
> +		  int edge_cost = edge->frequency;
> +		  cgraph_node_set_iterator csi;
> +
> +		  gcc_assert (edge->caller->analyzed);
> +		  if (!edge_cost)
> +		    edge_cost = 1;
> +		  gcc_assert (edge_cost > 0);
> +		  csi = cgraph_node_set_find (partition->cgraph_set, edge->caller);
> +		  if (!csi_end_p (csi)
> +		      && csi.index < last_visited_cgraph_node)
> +		    cost -= edge_cost;
> +		  else
> +		    cost += edge_cost;
> +		}
> +	    }
> +	  else
> +	    {
> +	      refs =
> +		&VEC_index (varpool_node_ptr, partition->varpool_set->nodes,
> +			    last_visited_varpool_node)->ref_list;
> +	      last_visited_varpool_node++;
> +	    }
> +	  for (j = 0; ipa_ref_list_reference_iterate (refs, j, ref); j++)
> +	    if (ref->refered_type == IPA_REF_VARPOOL)
> +	      {
> +		varpool_node_set_iterator vsi;
> +
> +		vnode = ipa_ref_varpool_node (ref);
> +		if (!vnode->finalized)
> +		  continue;
> +		if (!vnode->aux && partition_varpool_node_p (vnode))
> +		  add_varpool_node_to_partition (partition, vnode);
> +		vsi = varpool_node_set_find (partition->varpool_set, vnode);
> +		if (!vsi_end_p (vsi)
> +		    && vsi.index < last_visited_varpool_node - !cgraph_p)
> +		  cost--, internal++;
> +		else
> +		  cost++;
> +	      }
> +	    else
> +	      {
> +		cgraph_node_set_iterator csi;
> +
> +		node = ipa_ref_node (ref);
> +		if (!node->analyzed)
> +		  continue;
> +		csi = cgraph_node_set_find (partition->cgraph_set, node);
> +		if (!csi_end_p (csi)
> +		    && csi.index < last_visited_cgraph_node - cgraph_p)
> +		  cost--, internal++;
> +		else
> +		  cost++;
> +	      }
> +	  for (j = 0; ipa_ref_list_refering_iterate (refs, j, ref); j++)
> +	    if (ref->refering_type == IPA_REF_VARPOOL)
> +	      {
> +		varpool_node_set_iterator vsi;
> +
> +		vnode = ipa_ref_refering_varpool_node (ref);
> +		gcc_assert (vnode->finalized);
> +		if (!vnode->aux && partition_varpool_node_p (vnode))
> +		  add_varpool_node_to_partition (partition, vnode);
> +		vsi = varpool_node_set_find (partition->varpool_set, vnode);
> +		if (!vsi_end_p (vsi)
> +		    && vsi.index < last_visited_varpool_node)
> +		  cost--;
> +		else
> +		  cost++;
> +	      }
> +	    else
> +	      {
> +		cgraph_node_set_iterator csi;
> +
> +		node = ipa_ref_refering_node (ref);
> +		gcc_assert (node->analyzed);
> +		csi = cgraph_node_set_find (partition->cgraph_set, node);
> +		if (!csi_end_p (csi)
> +		    && csi.index < last_visited_cgraph_node)
> +		  cost--;
> +		else
> +		  cost++;
> +	      }
> +	}
> +      if (partition->insns < partition_size * 3 / 4
> +	  || best_cost == INT_MAX
> +	  || ((!cost 
> +	       || (best_internal * (HOST_WIDE_INT) cost > (internal * (HOST_WIDE_INT)best_cost)))
> +  	      && partition->insns < partition_size * 5 / 4))
> +	{
> +	  best_cost = cost;
> +	  best_internal = internal;
> +	  best_i = i;
> +	  best_n_nodes = VEC_length (cgraph_node_ptr,
> +				     partition->cgraph_set->nodes);
> +	  best_n_varpool_nodes = VEC_length (varpool_node_ptr,
> +					     partition->varpool_set->nodes);
> +	}
> +      if (cgraph_dump_file)
> +	fprintf (cgraph_dump_file, "Step %i: added %s, size %i, cost %i/%i best %i/%i, step %i\n", i,
> +		 cgraph_node_name (order[i]), partition->insns, cost, internal,
> +		 best_cost, best_internal, best_i);
> +      if (partition->insns > 2 * partition_size)
> +	{
> +	  if (best_i != i)
> +	    {
> +	      if (cgraph_dump_file)
> +		fprintf (cgraph_dump_file, "Unwinding %i insertions to step %i\n",
> +			 i - best_i, best_i);
> +	      undo_partition (partition, best_n_nodes, best_n_varpool_nodes);
> +	    }
> +	  i = best_i;
> +	  partition = new_partition ("");
> +	  last_visited_cgraph_node = 0;
> +	  last_visited_varpool_node = 0;
> +	  cost = 0;
> +
> +	  if (cgraph_dump_file)
> +	    fprintf (cgraph_dump_file, "New partition\n");
> +	  best_n_nodes = 0;
> +	  best_n_varpool_nodes = 0;
> +	  best_cost = INT_MAX;
> +
> +	  /* Since the size of partitions is just approximate, update the size after
> +	     we finished current one.  */
> +	  if (npartitions < PARAM_VALUE (PARAM_LTO_PARTITIONS))
> +	    partition_size = total_size
> +	      / (PARAM_VALUE (PARAM_LTO_PARTITIONS) - npartitions);
> +	  else
> +	    partition_size = INT_MAX;
> +
> +	  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
> +	    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);
> +	  npartitions ++;
> +	}
> +    }
> +  for (vnode = varpool_nodes; vnode; vnode = vnode->next)
> +    if (partition_varpool_node_p (vnode) && !vnode->aux)
> +      add_varpool_node_to_partition (partition, vnode);
> +  free (order);
> +}
> +
>  /* Promote variable VNODE to be static.  */
>  
>  static bool
> @@ -1977,7 +2247,10 @@ do_whole_program_analysis (void)
>    /* We are about to launch the final LTRANS phase, stop the WPA timer.  */
>    timevar_pop (TV_WHOPR_WPA);
>  
> -  lto_1_to_1_map ();
> +  if (flag_lto_partition_1to1)
> +    lto_1_to_1_map ();
> +  else
> +    lto_balanced_map ();
>  
>    if (!quiet_flag)
>      {
> Index: lto/Make-lang.in
> ===================================================================
> --- lto/Make-lang.in	(revision 163809)
> +++ lto/Make-lang.in	(working copy)
> @@ -85,7 +85,7 @@ lto/lto.o: lto/lto.c $(CONFIG_H) $(SYSTE
>  	$(CGRAPH_H) $(GGC_H) tree-ssa-operands.h $(TREE_PASS_H) \
>  	langhooks.h $(VEC_H) $(BITMAP_H) pointer-set.h $(IPA_PROP_H) \
>  	$(COMMON_H) debug.h $(TIMEVAR_H) $(GIMPLE_H) $(LTO_H) $(LTO_TREE_H) \
> -	$(LTO_TAGS_H) $(LTO_STREAMER_H) $(SPLAY_TREE_H) gt-lto-lto.h
> +	$(LTO_TAGS_H) $(LTO_STREAMER_H) $(SPLAY_TREE_H) gt-lto-lto.h $(PARAMS_H)
>  lto/lto-elf.o: lto/lto-elf.c $(CONFIG_H) coretypes.h $(SYSTEM_H) \
>  	toplev.h $(LTO_H) $(TM_H) $(LIBIBERTY_H) $(GGC_H) $(LTO_STREAMER_H)
>  lto/lto-coff.o: lto/lto-coff.c $(CONFIG_H) coretypes.h $(SYSTEM_H) \
> Index: common.opt
> ===================================================================
> --- common.opt	(revision 163809)
> +++ common.opt	(working copy)
> @@ -859,6 +859,14 @@ flto
>  Common Var(flag_lto)
>  Enable link-time optimization.
>  
> +flto-partition=1to1
> +Common Var(flag_lto_partition_1to1)
> +Partition functions and vars at linktime based on object files they originate from
> +
> +flto-partition=balanced
> +Common Var(flag_lto_partition_balanced)
> +Partition functions and vars at linktime into approximately same sized buckets
> +
>  ; The initial value of -1 comes from Z_DEFAULT_COMPRESSION in zlib.h.
>  flto-compression-level=
>  Common Joined RejectNegative UInteger Var(flag_lto_compression_level) Init(-1)
> Index: params.def
> ===================================================================
> --- params.def	(revision 163809)
> +++ params.def	(working copy)
> @@ -838,6 +838,17 @@ DEFPARAM (PARAM_DEVIRT_TYPE_LIST_SIZE,
>  	  "devirtualization",
>  	  8, 0, 0)
>  
> +/* WHOPR partitioning configuration.  */
> +
> +DEFPARAM (PARAM_LTO_PARTITIONS,
> +	  "lto-partitions",
> +	  "Number of partitions program should be split to",

As I understand this is really lto-max-partitions, right?

> +	  32, 0, 0)
> +
> +DEFPARAM (MIN_PARTITION_SIZE,
> +	  "lto-min-partition",
> +	  "Size of minimal partition for WHOPR (in estimated instructions)",
> +	  1000, 0, 0)
>  /*
>  Local variables:
>  mode:c
Jan Hubicka - Sept. 4, 2010, 1:30 p.m.
> 
> Can you add -flto-partition=single so we can eventually get rid of
> -flto (or make it an alias for -fwhopr -flto-partition=single)?
> 
> That would avoid the above and eventually make the implementation
> simpler.

well, we still need to test whether the user specified one algorithm or many.  But yes, I
can add single.  The problem is that we still create the tmp file then.  I have
plans to merge the WHOPR/LTO queues up to the point where we will be able to decide on
the fly which variant we want.  It is not that hard; we just need to make the LTO queue
read function bodies on demand instead of reading them all at once, which I hope
is possible (I did not look into the merging issues with this idea yet)
> > +/* Return true if NODE should be partitioned.  */
> 
> What does "partitioned" mean?  Duplicating the node in more than
> 1 partition?  Or does !partitioned mean that it isn't needed at all?
> Please clarify.

Partitioned nodes are those interesting to the partitioning algorithm; nonpartitioned
nodes are dragged in on demand when computing the boundary.  I.e. COMDAT is ignored
by the partitioning algorithm and brought only into partitions that need it.
> > +  partition_size = total_size / PARAM_VALUE (PARAM_LTO_PARTITIONS);
> > +  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
> > +    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);
> 
> Hm, so there is no max-partition-size?

well, I plan to do this, to avoid excessive memory use in the ltrans stage.
The problem with this is that at the moment the WPA stage does not scale that well,
so when we do the 32 partitions the ltrans stages consume a fraction of
WPA memory that seems acceptable, so I left it out of the initial patch.
So the parameter buys you nothing: when you can't compile with the default
configuration, you either exploded in WPA or hit some memory usage bug in the
compiler.

I can add the parameter and tune it based on what I get in Mozilla.
> 
> I expected that we try to partition the program by trying to
> minimize reference edges between partitions while keeping
> partition size in a min/max bound.  Is that the graph algorithm

Yes, that is what the algorithm does; the min/max bound is computed
by dividing the program size by the number of partitions, rounding it up to
the minimum partition size, and then partitions 1/4th smaller or 1/4th bigger
are accepted.

> The above suggests that users have to adjust --param lto-partitions
> for very large programs to be able to compile them, which sounds
> bad.

The only place I expect users will actually need to tune lto-partitions is when
they want more than 32 partitions (have more CPUs).  I personally don't
much like the idea of making this depend on a WHOPR parameter, since this
would imply that the resulting binary depends on the -j configuration of the
Makefile (and we probably don't have the number handy with -fwhopr=jobserv);
that would make reproducing the compilation harder.

Don't know here, really.
> 
> Again, when trying to do partitioning it would be really helpful
> if cgraph nodes and varpool nodes would share a common symtab
> base and references would just reference those ...
> 
> > +  npartitions = 1;
> > +  partition = new_partition ("");
> > +  if (cgraph_dump_file)
> > +    fprintf (cgraph_dump_file, "Total unit size: %i, partition size: %i\n",
> > +	     total_size, partition_size);
> 
> The following needs a _big_ comment describing what it is doing.
> What's its complexity?  What timevar does it get attributed to
> (maybe add a new one?)

There is the WPA output partition timevar.  Every node is added at most twice, and every
edge is visited twice too, so complexity should not be a problem.

As I tried to explain in the mail, the algorithm is simple.
We keep adding nodes (in a predefined order that is currently reverse postorder)
into the current partition and compute the number of internal edges and external
edges on the go.  As soon as the partition is bigger than 3/4ths of the expected
size we start to look for the point where the fraction reaches its minimum,
looking up to 5/4ths of the expected size.  Then we undo the process back to the
place where we reached the minimum and start a new partition.

Since the program size estimate is not really good - it does not include the
size of nodes that get duplicated - we always update the expected partition size
after we finish each one.  Otherwise the algorithm tended to make a bit more
partitions than requested (like 40 instead of 32).

So it is a very primitive greedy algorithm minimizing the boundary size of every
partition.

The other algorithm I tried was the usual graph clustering algorithm - just
start with a partition per node and then keep concatenating partitions based on
the priority of edges until their size reaches the expected partition size.  But
this does not play well with my plan for function reordering, and since the
function reordering algorithm is essentially the same thing, just producing one
linear order instead of buckets, I expect that it will combine well with the
linear partitioning one.  (And the differences are minimal anyway.)

> > +/* WHOPR partitioning configuration.  */
> > +
> > +DEFPARAM (PARAM_LTO_PARTITIONS,
> > +	  "lto-partitions",
> > +	  "Number of partitions program should be split to",
> 
> As I understand this is really lto-max-partitions, right?

At the moment, yes - until we get a maximal partition size ;)

OK, once we agree on the details, I will update the patch and add comments as
requested.

Honza
Richard Guenther - Sept. 6, 2010, 12:26 p.m.
On Sat, 4 Sep 2010, Jan Hubicka wrote:

> > 
> > Can you add -flto-partition=single so we can eventually get rid of
> > -flto (or make it an alias for -fwhopr -flto-partition=single)?
> > 
> > That would avoid the above and eventually make the implementation
> > simpler.
> 
> well, still need to test if user specified one algorithm or many.  But yes, i
> can add single.  Problem is that we still create the tmp file then, I have
> plans to merge WHOPR/LTO queue up to point we will be able to decide on the fly
> what variant we want.  It is not that hard, we just need to make LTO queue to
> read function bodies on demand instead of reading them all at once that I hope
> is possible (did not look into merging issues with this idea yet)

Hm.  The question is whether we care - if a project is small enough then
the tmp file won't be too big, and if it isn't, one wouldn't use single anyway.

But yes, it would be nice, instead of writing the big tmp files, to
store references to the original files there and only store the WPA
ltrans info in them ...

> > > +/* Return true if NODE should be partitioned.  */
> > 
> > What does "partitioned" mean?  Duplicating the node in more than
> > 1 partition?  Or does !partitioned mean that it isn't needed at all?
> > Please clarify.
> 
> Partitioned nodes are those interesting for partitioning algorithm, nonpartitioned
> nodes are dragged in on demand when computing the boundary. I.e. COMDAT is ignored
> by partitioning algorithm and brought only to partitions that need it.

Ok, please update the comment.

> > > +  partition_size = total_size / PARAM_VALUE (PARAM_LTO_PARTITIONS);
> > > +  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
> > > +    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);
> > 
> > Hm, so there is no max-partition-size?
> 
> well, I plan to do this, to avoid excessive memory use in ltrans stage.
> Problem with this is that at the moment WPA stage does not scale that well,
> so when we do the 32 partitions the ltrans stages consume fraction of
> WPA memory that seems acceptable, so I skipped it from initial patch.
> So the parameter buys you nothing, since when you can't compile with default
> configuration, you either exploded in WPA or hit some memory usage bug in the
> compiler.
> 
> I can add the parameter and tune it based on what I get in Mozilla.

Ok.

> > I expected that we try to partition the program by trying to
> > minimize reference edges between partitions while keeping
> > partition size in a min/max bound.  Is that the graph algorithm
> 
> Yes, it is what the algorithm does, just the min/max bound is computed
> by dividing program size by number of partitions, rounding it up to
> min partition size and then partitions 1/4th smaller or 1/4th bigger
> are accepted
> 
> > The above suggests that users have to adjust --param lto-partitions
> > for very large programs to be able to compile them, which sounds
> > bad.
> 
> The only place I expect users will actually need to tune lto-partitions is when
> they want more than 32 partitions (have more CPUs).  I personally don't
> much like the idea of making this depend on a WHOPR parameter, since that
> would imply that the resulting binary would depend on the -j configuration of
> the Makefile (and we probably don't have the number handy with -fwhopr=jobserv);
> that would make reproducing the compilation harder.
> 
> Don't know here, really.

Well, if with 32 partitions all partitions exceed max partition size
(well, or reasonable partition size), we should create more partitions.

The question is really if we should create sane partition sizes
by default or if we should create N partitions by default.  I'd vote
for sane partition sizes, which would make the lto-partitions parameter
useless (I don't expect easily reproducible testcases anyway with
dynamic partitioning).

> > 
> > Again, when trying to do partitioning it would be really helpful
> > if cgraph nodes and varpool nodes would share a common symtab
> > base and references would just reference those ...
> > 
> > > +  npartitions = 1;
> > > +  partition = new_partition ("");
> > > +  if (cgraph_dump_file)
> > > +    fprintf (cgraph_dump_file, "Total unit size: %i, partition size: %i\n",
> > > +	     total_size, partition_size);
> > 
> > The following needs a _big_ comment describing what it is doing.
> > What's its complexity?  What timevar does it get attributed to
> > (maybe add a new one?)
> 
> There is WPA output partition. Every node is added at most twice, every edge
> visited twice too, so complexity should not be problem.
> 
> As I tried to explain in the mail, the algorithm is simple.

Please add the algorithm description as a comment then.

> We keep adding nodes (in a predefined order that is currently reverse postorder)
> into the current partition and compute the number of internal and external
> edges on the go.  As soon as the partition is bigger than 3/4 of the expected
> size we start to look for the point where the fraction reaches its minimum,
> scanning up to 5/4 of the expected size.  Then we undo the process back to the
> place where we reached the minimum and start a new partition.
> 
> Since the program size estimate is not really good - it does not include the
> size of nodes that get duplicated - we always update the expected partition size
> after each run. Otherwise the algorithm tended to make a bit more partitions than
> requested (like 40 instead of 32).
> 
> So it is a very primitive greedy algorithm minimizing the boundary size of every
> partition.
> 
> The other algorithm I tried was the usual graph clustering algorithm - just
> start with partition per node and then start concatenating partitions based on
> priority of edges until their size reach the expected size of partition.  But
> this does not play well with my plan for function reordering and since function
> reordering algorithm is essentially same thing, just producing one linear order
> instead of buckets, i expect that it will combine well with the linear
> partitioning one. (and the differences are minimal anyway)

Hm, I don't see the issue with function reordering - it should exactly
behave the same way (and thus is unneeded?)

> > > +/* WHOPR partitioning configuration.  */
> > > +
> > > +DEFPARAM (PARAM_LTO_PARTITIONS,
> > > +	  "lto-partitions",
> > > +	  "Number of paritions program should be split to",
> > 
> > As I understand this is really lto-max-partitions, right?
> 
> At the moment yet, until we get maximal partition size ;)

I'd prefer to not have this parameter, at least not for this
partitioning algorithm.

> OK, once we agree at the details, I will update patch and add comments as
> requested.

Thanks,
Richard.
Jan Hubicka - Sept. 6, 2010, 12:52 p.m.
> 
> Well, if with 32 partitions all partitions exceed max partition size
> (well, or reasonable partition size), we should create more partitions.
> 
> The question is really if we should create sane partition sizes
> by default or if we should create N partitions by default.  I'd vote
> for sane partition sizes which makes the lto-partitions parameter
> useless (I don't expect easy reproducible testcases anyway with
> dynamic partitioning).

Well, even though I tended toward the same approach originally, I concluded that
it won't work well.

With 32 partitions for Mozilla, we get about 4GB of WPA memory use, and about 2
minutes of streaming.  Then each partition is about 200MB of memory, but each
takes about 3 minutes to compile.

Partition size sort of defines ltrans memory use and compile time. Since
memory use in ltrans is not a problem (our WPA stage is way too big a memory
hog; we save only about half of the memory), we could set a hard limit on time.
Now let's say we decide to tune partition size so each partition compiles in
about 1 minute.  This would be hard per se, since partition sizes are pretty
rough estimates anyway.

Then we would get about 3*32 partitions for Mozilla, which would just increase
streaming time (32 partitions is 1.5GB; 2000 partitions is 7GB).  On the other
hand, on a project small enough to compile in a minute (let's say GCC) we
would get no partitioning at all.

Since too many partitions have costs too, I believe the number of partitions
should be large enough that ltrans does not dominate memory use for single-CPU
compilation (this is about 2-4 in the current implementation, I guess).

In parallel compilation the number of partitions should be higher than the
number of CPUs available, but not terribly much higher, since streaming size
increases with too many partitions.

32 seemed a reasonable default - not so high that streaming cost would increase
terribly, and still more than the number of CPUs on an average developer
workstation today.
> > 
> > The other algorithm I tried was the usual graph clustering algorithm - just
> > start with partition per node and then start concatenating partitions based on
> > priority of edges until their size reach the expected size of partition.  But
> > this does not play well with my plan for function reordering and since function
> > reordering algorithm is essentially same thing, just producing one linear order
> > instead of buckets, i expect that it will combine well with the linear
> > partitioning one. (and the differences are minimal anyway)
> 
> Hm, I don't see the issue with function reordering - it should exactly
> behave the same way (and thus is unneeded?)

Well, the partitioning algorithm clusters the graph minimizing cross-cluster
edges, while the function reordering algorithm orders the functions so most
calls are short-distance and forward.

So when you decide at WPA stage in what order the functions are, you need to
either partition them into linear segments of this order and keep the partitions
in order as they get to the linker, or assign each function a section and tell
the linker the order of the sections.  The second is more expensive.
So I basically want to:
1) decide on a linear order of functions
2) partition the program linearly into segments of this order
3) in each partition, arrange the function order in the final .s files (either
via gas section fragments or by just breaking our current topological
compilation order)
4) link the ltrans partitions in the original order

Honza
> 
> > > > +/* WHOPR partitioning configuration.  */
> > > > +
> > > > +DEFPARAM (PARAM_LTO_PARTITIONS,
> > > > +	  "lto-partitions",
> > > > +	  "Number of paritions program should be split to",
> > > 
> > > As I understand this is really lto-max-partitions, right?
> > 
> > At the moment yet, until we get maximal partition size ;)
> 
> I'd prefer to not have this parameter, at least not for this
> partitioning algorithm.
> 
> > OK, once we agree at the details, I will update patch and add comments as
> > requested.
> 
> Thanks,
> Richard.
Richard Guenther - Sept. 6, 2010, 1:04 p.m.
On Mon, 6 Sep 2010, Jan Hubicka wrote:

> > 
> > Well, if with 32 partitions all partitions exceed max partition size
> > (well, or reasonable partition size), we should create more partitions.
> > 
> > The question is really if we should create sane partition sizes
> > by default or if we should create N partitions by default.  I'd vote
> > for sane partition sizes which makes the lto-partitions parameter
> > useless (I don't expect easy reproducible testcases anyway with
> > dynamic partitioning).
> 
> Well, even if I tended to do the same originally, i concluded that this won't
> work well.
> 
> With 32 partitions for Mozilla, we get about 4GB of WPA memory use, and about 2
> minutes of streaming.  Then each partition is about 200MB of memory, but they
> takes about 3 minutes to compile.
> 
> Partition size sort of define ltrans memory use and time of compilation. Since
> memory use in ltrans is not a problem (our WPA stage is way too big of memory
> hog, we save only about half of the memory), we can set hard limit on time.
> Now lets decide that we will tune partition size to compile in about 1 minute.
> This would be hard per se, since sizes of partitions are pretty rough estimates
> by anyway.
> 
> Then we would get about 3*32 partitions for mozilla that would just increase
> streaming time (32partitions is 1.5GB, 2000 partitions is 7GB) On the other
> hand on project that is small enough to compile in minute (lets say GCC) we
> would get no partitioning at all.
> 
> Since too much partitions has expenses too, I believe number of partitions
> should be large enough so ltrans does not dominate memory use for single CPU
> compilation (this is about 2-4 in current implementation I guess).
> 
> In parallel compilation the number of partitions should be higher than the
> number of CPUs available, but not terribly much higher since streaming size
> increase with too many parttions.
> 
> 32 seemed resonable default - not that high so streaming cost would terribly
> increase and still more than number of CPUs on average developer workstation
> today.

Ok, we seem to disagree here ;)  I suppose we can change things once
we get WPA memory use under control.

> > > The other algorithm I tried was the usual graph clustering algorithm - just
> > > start with partition per node and then start concatenating partitions based on
> > > priority of edges until their size reach the expected size of partition.  But
> > > this does not play well with my plan for function reordering and since function
> > > reordering algorithm is essentially same thing, just producing one linear order
> > > instead of buckets, i expect that it will combine well with the linear
> > > partitioning one. (and the differences are minimal anyway)
> > 
> > Hm, I don't see the issue with function reordering - it should exactly
> > behave the same way (and thus is unneeded?)
> 
> Well, partitioning algorithm clusters graphs minimizing the cross cluster edges,
> function reordering algorithm orders the functions so most calls are short distance
> and forwarding.
> 
> So when you at WPA stage decide in what order the functions are, you need to partition
> them in linear segments of this order and keep partition in order as they get to linker
> or assign each function section and tell linker order of the sections.
> The second is more expensive. So I basicaly want
> 1) decide on linear order of functions
> 2) partition program linearly in this segments
> 3) in each partition arrange function order in the final .s files (either via gas section fragments
> or by just breaking our current topological ocomplation order0
> 4) link the ltrans partition in the original order

But if you cluster minimizing cross cluster edges most edges in a cluster
will be intra-cluster edges.  Which means you can simply order functions
inside a cluster and you should arrive at a similar (or the same) solution
as with first ordering and then clustering.  Thus I'd do

 1) cluster minimizing cross-cluster edges
 2) decide on linear oder of functions in each cluster, late in LTRANS

Why is that not simpler or why would it not work?

Richard.
Andi Kleen - Sept. 6, 2010, 1:06 p.m.
Richard Guenther <rguenther@suse.de> writes:
>
> Hm.  The question is if we care - if a project is small enough then
> the tmp file won't be too big, if it is one wouldn't use single anyways.
>
> But yes, it would be nice to instead of writing the big tmp files
> store references to the original files there and only store WPA
> ltrans info in them ...

Does the function information not change at all?

That would definitely help for the tmpdir-in-tmpfs case too.

On my original attempts to use LTO on large codebases with the old
partition algorithm and moderate make -j I created several quite severe
swap storms that essentially froze the desktop for minutes.

At least I understood what was going on, but for a normal
user it would look like "gcc crashed my system".

(tmpfs has a default limit of half of physical memory, but together
with several large lto1s it was enough to drive the system badly
into swap and in some cases even out of swap space) 

I assume this will be better with Honza's new algorithm, but 
there's still some danger of this.

-Andi
Richard Guenther - Sept. 6, 2010, 1:10 p.m.
On Mon, 6 Sep 2010, Andi Kleen wrote:

> Richard Guenther <rguenther@suse.de> writes:
> >
> > Hm.  The question is if we care - if a project is small enough then
> > the tmp file won't be too big, if it is one wouldn't use single anyways.
> >
> > But yes, it would be nice to instead of writing the big tmp files
> > store references to the original files there and only store WPA
> > ltrans info in them ...
> 
> Does the function information not change at all?

No.  What may change is the global types/decls section which could
be a problem (I think the per-function type/decl info inherits
from the global one?)

Richard.
Andi Kleen - Sept. 6, 2010, 1:21 p.m.
On Mon, 6 Sep 2010 15:10:11 +0200 (CEST)
Richard Guenther <rguenther@suse.de> wrote:

> > Does the function information not change at all?
> 
> No.  What may change is the global types/decls section which could
> be a problem (I think the per-function type/decl info inherits
> from the global one?)

If that's true it probably wouldn't be too difficult to teach
the code I wrote for handling ld -r combined objects about this.
It already creates fake files, you could just plug in another
file.

-Andi
Jan Hubicka - Sept. 6, 2010, 3:54 p.m.
> > 32 seemed resonable default - not that high so streaming cost would terribly
> > increase and still more than number of CPUs on average developer workstation
> > today.
> 
> Ok, we seem to disagree here ;)  I suppose we can change things once
> we get WPA memory use into control.

Yep, once WPA memory is not dominating we will have use for a maximal partition
size parameter. We don't want a single partition to need 2GB of RAM or so, but
we simply don't scale to projects of that size at the moment.

Still, I don't think we should partition purely on it, for the reasons I
described.  More partitions imply more streaming and thus longer WPA.  So when
you compile GCC or Mozilla on a 24-CPU machine, the number of partitions is
pretty much the same even though GCC is much smaller. So I don't think it is a
good idea to partition gigantic projects into very many partitions just because
a hard limit on single partition size was tuned for smaller projects.
> 
> > > > The other algorithm I tried was the usual graph clustering algorithm - just
> > > > start with partition per node and then start concatenating partitions based on
> > > > priority of edges until their size reach the expected size of partition.  But
> > > > this does not play well with my plan for function reordering and since function
> > > > reordering algorithm is essentially same thing, just producing one linear order
> > > > instead of buckets, i expect that it will combine well with the linear
> > > > partitioning one. (and the differences are minimal anyway)
> > > 
> > > Hm, I don't see the issue with function reordering - it should exactly
> > > behave the same way (and thus is unneeded?)
> > 
> > Well, partitioning algorithm clusters graphs minimizing the cross cluster edges,
> > function reordering algorithm orders the functions so most calls are short distance
> > and forwarding.
> > 
> > So when you at WPA stage decide in what order the functions are, you need to partition
> > them in linear segments of this order and keep partition in order as they get to linker
> > or assign each function section and tell linker order of the sections.
> > The second is more expensive. So I basicaly want
> > 1) decide on linear order of functions
> > 2) partition program linearly in this segments
> > 3) in each partition arrange function order in the final .s files (either via gas section fragments
> > or by just breaking our current topological ocomplation order0
> > 4) link the ltrans partition in the original order
> 
> But if you cluster minimizing cross cluster edges most edges in a cluster
> will be intra-cluster edges.  Which means you can simply order functions
> inside a cluster and you should arrive at a similar (or the same) solution
> as with first ordering and then clustering.  Thus I'd do
> 
>  1) cluster minimizing cross-cluster edges
>  2) decide on linear oder of functions in each cluster, late in LTRANS
> 
> Why is that not simpler or why would it not work?

It does not seem simpler to me, since one would have a similar clustering
algorithm implemented twice, and it is a bit harder than what I do on the
predefined sequence.

What you optimize for in function layout is not exactly equivalent to what you
optimize for when minimizing boundaries.  Your proposal gives precedence to
minimizing boundaries over runtime layout; mine goes the other way around. No
idea how much practical difference that would make.

From a high level I tend to like the idea of function ordering being a regular
IPA pass and WHOPR/ltrans partitioning being as transparent as possible, rather
than making function ordering a combination of the partitioning algorithm and a
late small IPA function ordering pass, both contributing to the final order.
It is not a grand difference either ;)

What I would also like to support in the longer run is profile feedback
specifying in what order the functions are executed at runtime, and laying out
code based on that to optimize startup times. This is already done by some tools
using iceberg and linker scripts.  Again, I guess it is sort of implementable in
both schemes - the partitioning algorithm should naturally lean toward putting
startup code into a single partition and local ordering will handle it, but I
can imagine it being somewhat fragile this way (the partitioning algorithm might
get other preferences, or split the init code rather irregularly for more
complicated callgraphs).

Honza
Jan Hubicka - Sept. 6, 2010, 3:59 p.m.
> Richard Guenther <rguenther@suse.de> writes:
> >
> > Hm.  The question is if we care - if a project is small enough then
> > the tmp file won't be too big, if it is one wouldn't use single anyways.
> >
> > But yes, it would be nice to instead of writing the big tmp files
> > store references to the original files there and only store WPA
> > ltrans info in them ...
> 
> Does the function information not change at all?

Functions stay the same (they are just copied). We have a problem with overly
large decl sections (they get too much context); this is why too many
partitions lead to such large tmp files.

Instead of copying the sections, we might make the ltrans stages read them from
the original .o files, but I am not sure it would be a win in general. At the
moment neither the size occupied by function bodies nor the time needed by the
WPA stage to copy sections is the problem.  So I would first look into ways of
eliminating unnecessary noise in .o files before jumping into this business.

Also, making ltrans read all the source .o files will probably lead to
more I/O and to problems with distributed compilation.
> 
> (tmpfs has a half of physical memory default limit, but together
> with several large lto1s it was enough to drive the system badly
> into swap and in some cases even out of swap space) 
> 
> I assume this will be better with Honza's new algorithm, but 
> there's still some danger of this.

I still get over 1 gig of tmp files for mozilla.  Not that bad but still a lot.

Honza
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.
Steven Bosscher - Sept. 6, 2010, 5:06 p.m.
> What I would like to also support in longer run is profile feedback
> specifying
> in what order the functions are executed at runtime and laying code based on
> that to optimize startup times.

btw, what about a profile-driven partitioning algorithm, like LIPO?

Ciao,
Steven
Jan Hubicka - Sept. 6, 2010, 5:27 p.m.
> > What I would like to also support in longer run is profile feedback
> > specifying
> > in what order the functions are executed at runtime and laying code based on
> > that to optimize startup times.
> 
> btw, what about a profile driven partitioning algoritm, like LIPO?

Well, AFAIK LIPO is not shooting for code layout.

The current algorithm and the planned function reordering pass both use
callgraph edge frequencies, which are profile driven too.

Honza
> 
> Ciao,
> Steven
Xinliang David Li - Sept. 6, 2010, 6 p.m.
I have not followed the thread carefully, but what is the advantage of
doing code layout in IPO instead of at link time? Support for
user-directed code and data layout is already in gold (and
Mozilla folks have tried this functionality to lay out functions
according to load-time affinity data collected by instrumentation and
got a large speedup).  The remaining part of the support is also almost
ready but not yet submitted upstream (needs more tuning on various
targets): 1) the compiler part of the support (e.g. writing down call
graph hot edges as note sections for the linker to consume); 2) the code
layout algorithm in the linker, using the data from the compiler.

Thanks,

David

On Mon, Sep 6, 2010 at 10:27 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > What I would like to also support in longer run is profile feedback
>> > specifying
>> > in what order the functions are executed at runtime and laying code based on
>> > that to optimize startup times.
>>
>> btw, what about a profile driven partitioning algoritm, like LIPO?
>
> Well, AFAIK LIPO is not shooting for code layout.
>
> Current algorithm (and the planned function reordering pass) both use callgraph edge
> frequencies that are profile driven too.
>
> Honza
>>
>> Ciao,
>> Steven
>
Andi Kleen - Sept. 6, 2010, 6:32 p.m.
Xinliang David Li <davidxl@google.com> writes:

> I have not followed the thread carefully, but what is the advantage of
> doing code layout in IPO instead of doing this at link time? 

Just guessing, but gcc already knows the profile feedback data
and the linker would need to be told about it in a special way?

Although I'm not sure the profile feedback data that gcc
collects is the one that's needed to speed up startup.
Perhaps the profilers would need to be extended to collect
"first time only" data separately.

-Andi
Jack Howarth - Sept. 6, 2010, 7:48 p.m.
On Mon, Sep 06, 2010 at 11:00:40AM -0700, Xinliang David Li wrote:
> I have not followed the thread carefully, but what is the advantage of
> doing code layout in IPO instead of doing this at link time? The
> support of user directed code and data layout is already in gold  (and
> mozilla folks have tried this functionality to layout functions
> according to load time affinity data collected by instrumentation and
> got large speed up).  The remaining part of the support is also almost
> ready but not yet submitted upstream (needs more tuning on various
> targets): 1) the compiler part of the support (e.g. writing down call
> graph hot edges as note sections for linker to consume; 2) the code
> layout algorithm in linker using the data from the compiler.
> 

   Wouldn't that present issues for other targets like darwin and cygwin
that can't depend on such support in the linker? Also, wouldn't such a change
actually be heading FSF gcc's LTO design towards llvm's design of using
a libLTO linker module?
               Jack

> Thanks,
> 
> David
>
Xinliang David Li - Sept. 6, 2010, 10:03 p.m.
On Mon, Sep 6, 2010 at 12:48 PM, Jack Howarth <howarth@bromo.med.uc.edu> wrote:
> On Mon, Sep 06, 2010 at 11:00:40AM -0700, Xinliang David Li wrote:
>> I have not followed the thread carefully, but what is the advantage of
>> doing code layout in IPO instead of doing this at link time? The
>> support of user directed code and data layout is already in gold  (and
>> mozilla folks have tried this functionality to layout functions
>> according to load time affinity data collected by instrumentation and
>> got large speed up).  The remaining part of the support is also almost
>> ready but not yet submitted upstream (needs more tuning on various
>> targets): 1) the compiler part of the support (e.g. writing down call
>> graph hot edges as note sections for linker to consume; 2) the code
>> layout algorithm in linker using the data from the compiler.
>>
>
>   Wouldn't that present issues for other targets like darwin and cygwin
> that can't depend on such support in the linker?

Yes, that is true. Both methods have usability limitations.

> Also, wouldn't such a change
> actually be be heading FSF gcc's LTO design towards llvm's design of using
> a libLTO linker module?

I don't see a connection here. It does not change how gcc's LTO works
and they can co-exist.

David

>               Jack
>
>> Thanks,
>>
>> David
>>
>
Jack Howarth - Sept. 6, 2010, 10:37 p.m.
On Mon, Sep 06, 2010 at 03:03:04PM -0700, Xinliang David Li wrote:
> On Mon, Sep 6, 2010 at 12:48 PM, Jack Howarth <howarth@bromo.med.uc.edu> wrote:
> 
> > Also, wouldn't such a change
> > actually be be heading FSF gcc's LTO design towards llvm's design of using
> > a libLTO linker module?
> 
> I don't see a connection here. It does not change how gcc's LTO works
> and they can co-exist.
> 

David,
   I was thinking back to the original llvm/gcc integration proposal...

http://gcc.gnu.org/ml/gcc/2005-11/msg00888.html

Wasn't one reason that FSF gcc and llvm LTO went their separate ways a 
difference of opinion on where the bulk of the LTO should be done?
My understanding was that the FSF gcc approach was always oriented towards
performing most of the LTO at compile time whereas the llvm LTO approach
was to shift the work off to the linker (hence the libLTO linker module).
It would seem that your proposal would be a reorientation more towards
the llvm approach to LTO, no?
            Jack

Patch

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 163809)
+++ doc/invoke.texi	(working copy)
@@ -352,8 +352,8 @@  Objective-C and Objective-C++ Dialects}.
 -fira-loop-pressure -fno-ira-share-save-slots @gol
 -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
 -fivopts -fkeep-inline-functions -fkeep-static-consts @gol
--floop-block -floop-interchange -floop-strip-mine @gol
--floop-parallelize-all -flto -flto-compression-level -flto-report -fltrans @gol
+-floop-block -floop-interchange -floop-strip-mine -floop-parallelize-all @gol
+-flto -flto-partition=@var{alg} -flto-compression-level -flto-report -fltrans @gol
 -fltrans-output-list -fmerge-all-constants -fmerge-constants -fmodulo-sched @gol
 -fmodulo-sched-allow-regmoves -fmove-loop-invariants -fmudflap @gol
 -fmudflapir -fmudflapth -fno-branch-count-reg -fno-default-inline @gol
@@ -7628,6 +7628,13 @@  GNU make.
 
 Disabled by default.
 
+@item -flto-partition=@var{alg}
+@opindex flto-partition
+Specify partitioning algorithm used by @option{-fwhopr} mode.  The value is
+either @code{1to1} to specify partitioning corresponding to source files
+or @code{balanced} to specify partitioning into, if possible, equally sized
+chunks.
+
 @item -fwpa
 @opindex fwpa
 This is an internal option used by GCC when compiling with
@@ -8821,6 +8828,16 @@  parameter in order to perform devirtuali
 @option{devirt-type-list-size} is the maximum number of types it
 stores per a single formal parameter of a function.
 
+@item lto-partitions
+Specify the desired number of partitions produced during WHOPR compilation.
+The number of partitions should exceed the number of CPUs used for compilation.
+The default value is 32.
+
+@item lto-minpartition
+Size of the minimal partition for WHOPR (in estimated instructions).
+This prevents the expense of splitting very small programs into too many
+partitions.
+
 @end table
 @end table
 
Index: opts.c
===================================================================
--- opts.c	(revision 163809)
+++ opts.c	(working copy)
@@ -1083,6 +1083,13 @@  decode_options (unsigned int argc, const
       error ("LTO support has not been enabled in this configuration");
 #endif
     }
+  if (flag_lto_partition_balanced || flag_lto_partition_1to1)
+    {
+      if (flag_lto_partition_balanced && flag_lto_partition_1to1)
+	error ("Only one -flto-partition value can be specified");
+      if (!flag_whopr)
+	error ("-flto-partition has effect only with -fwhopr");
+    }
 
   /* Reconcile -flto and -fwhopr.  Set additional flags as appropriate and
      check option consistency.  */
Index: lto/lto.c
===================================================================
--- lto/lto.c	(revision 163809)
+++ lto/lto.c	(working copy)
@@ -44,6 +44,7 @@  along with GCC; see the file COPYING3.  
 #include "lto-tree.h"
 #include "lto-streamer.h"
 #include "splay-tree.h"
+#include "params.h"
 
 /* This needs to be included after config.h.  Otherwise, _GNU_SOURCE will not
    be defined in time to set __USE_GNU in the system headers, and strsignal
@@ -760,12 +761,8 @@  add_cgraph_node_to_partition (ltrans_par
   part->insns += node->local.inline_summary.self_size;
 
   if (node->aux)
-    {
-      gcc_assert (node->aux != part);
-      node->in_other_partition = 1;
-    }
-  else
-    node->aux = part;
+    node->in_other_partition = 1;
+  node->aux = (void *)((size_t)node->aux + 1);
 
   cgraph_node_set_add (part->cgraph_set, node);
 
@@ -789,12 +786,8 @@  add_varpool_node_to_partition (ltrans_pa
   varpool_node_set_add (part->varpool_set, vnode);
 
   if (vnode->aux)
-    {
-      gcc_assert (vnode->aux != part);
-      vnode->in_other_partition = 1;
-    }
-  else
-    vnode->aux = part;
+    vnode->in_other_partition = 1;
+  vnode->aux = (void *)((size_t)vnode->aux + 1);
 
   add_references_to_partition (part, &vnode->ref_list);
 
@@ -803,6 +796,66 @@  add_varpool_node_to_partition (ltrans_pa
     add_varpool_node_to_partition (part, vnode->same_comdat_group);
 }
 
+/* Undo all additions until the number of cgraph nodes in PARTITION is
+   N_CGRAPH_NODES and the number of varpool nodes is N_VARPOOL_NODES.  */
+
+static void
+undo_partition (ltrans_partition partition, unsigned int n_cgraph_nodes,
+		unsigned int n_varpool_nodes)
+{
+  while (VEC_length (cgraph_node_ptr, partition->cgraph_set->nodes) >
+	 n_cgraph_nodes)
+    {
+      struct cgraph_node *node = VEC_index (cgraph_node_ptr,
+					    partition->cgraph_set->nodes,
+					    n_cgraph_nodes);
+      partition->insns -= node->local.inline_summary.self_size;
+      cgraph_node_set_remove (partition->cgraph_set, node);
+      node->aux = (void *)((size_t)node->aux - 1);
+    }
+  while (VEC_length (varpool_node_ptr, partition->varpool_set->nodes) >
+	 n_varpool_nodes)
+    {
+      struct varpool_node *node = VEC_index (varpool_node_ptr,
+					     partition->varpool_set->nodes,
+					     n_varpool_nodes);
+      varpool_node_set_remove (partition->varpool_set, node);
+      node->aux = (void *)((size_t)node->aux - 1);
+    }
+}
+
+/* Return true if NODE should be partitioned.  */
+
+static bool
+partition_cgraph_node_p (struct cgraph_node *node)
+{
+  /* We will get proper partition based on function they are inlined to.  */
+  if (node->global.inlined_to)
+    return false;
+  /* Nodes without a body do not need partitioning.  */
+  if (!node->analyzed)
+    return false;
+  /* Extern inlines and comdats are only ever in the partitions where they are needed.  */
+  if (DECL_EXTERNAL (node->decl)
+      || DECL_COMDAT (node->decl))
+    return false;
+  return true;
+}
+
+/* Return true if VNODE should be partitioned.  */
+
+static bool
+partition_varpool_node_p (struct varpool_node *vnode)
+{
+  if (vnode->alias || !vnode->needed)
+    return false;
+  /* Constant pool entries and comdats are only ever in the partitions where they are needed.  */
+  if (DECL_IN_CONSTANT_POOL (vnode->decl)
+      || DECL_COMDAT (vnode->decl))
+    return false;
+  return true;
+}
+
 /* Group cgrah nodes by input files.  This is used mainly for testing
    right now.  */
 
@@ -823,15 +876,7 @@  lto_1_to_1_map (void)
 
   for (node = cgraph_nodes; node; node = node->next)
     {
-      /* We will get proper partition based on function they are inlined to.  */
-      if (node->global.inlined_to)
-	continue;
-      /* Nodes without a body do not need partitioning.  */
-      if (!node->analyzed)
-	continue;
-      /* Extern inlines and comdat are always only in partitions they are needed.  */
-      if (DECL_EXTERNAL (node->decl)
-	  || DECL_COMDAT (node->decl))
+      if (!partition_cgraph_node_p (node))
 	continue;
 
       file_data = node->local.lto_file_data;
@@ -866,11 +911,7 @@  lto_1_to_1_map (void)
 
   for (vnode = varpool_nodes; vnode; vnode = vnode->next)
     {
-      if (vnode->alias || !vnode->needed)
-	continue;
-      /* Constant pool and comdat are always only in partitions they are needed.  */
-      if (DECL_IN_CONSTANT_POOL (vnode->decl)
-	  || DECL_COMDAT (vnode->decl))
+      if (!partition_varpool_node_p (vnode))
 	continue;
       file_data = vnode->lto_file_data;
       slot = pointer_map_contains (pmap, file_data);
@@ -904,6 +945,235 @@  lto_1_to_1_map (void)
 						 ltrans_partitions);
 }
 
+
+/* Group cgraph nodes into approximately evenly sized partitions.  */
+
+static void
+lto_balanced_map (void)
+{
+  int n_nodes = 0;
+  struct cgraph_node **postorder =
+    XCNEWVEC (struct cgraph_node *, cgraph_n_nodes);
+  struct cgraph_node **order = XNEWVEC (struct cgraph_node *, cgraph_max_uid);
+  int i, postorder_len;
+  struct cgraph_node *node;
+  int total_size = 0;
+  int partition_size;
+  ltrans_partition partition;
+  unsigned int last_visited_cgraph_node = 0, last_visited_varpool_node = 0;
+  struct varpool_node *vnode;
+  int cost = 0, internal = 0;
+  int best_n_nodes = 0, best_n_varpool_nodes = 0, best_i = 0, best_cost =
+    INT_MAX, best_internal = 0;
+  int npartitions;
+
+  postorder_len = cgraph_postorder (postorder);
+  for (i = 0; i < postorder_len; i++)
+    {
+      node = postorder[i];
+      if (partition_cgraph_node_p (node))
+	{
+	  order[n_nodes++] = node;
+          total_size += node->local.inline_summary.self_size;
+	}
+    }
+  free (postorder);
+
+  partition_size = total_size / PARAM_VALUE (PARAM_LTO_PARTITIONS);
+  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
+    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);
+  npartitions = 1;
+  partition = new_partition ("");
+  if (cgraph_dump_file)
+    fprintf (cgraph_dump_file, "Total unit size: %i, partition size: %i\n",
+	     total_size, partition_size);
+
+  for (i = 0; i < n_nodes; i++)
+    {
+      add_cgraph_node_to_partition (partition, order[i]);
+      /* Also add all referenced variables to the partition unless they were
+         added earlier.  */
+      while (last_visited_cgraph_node <
+	     VEC_length (cgraph_node_ptr, partition->cgraph_set->nodes)
+	     || last_visited_varpool_node < VEC_length (varpool_node_ptr,
+							partition->varpool_set->
+							nodes))
+	{
+	  struct ipa_ref_list *refs;
+	  int j;
+	  struct ipa_ref *ref;
+	  bool cgraph_p = false;
+	  if (last_visited_cgraph_node <
+	      VEC_length (cgraph_node_ptr, partition->cgraph_set->nodes))
+	    {
+	      struct cgraph_edge *edge;
+
+	      cgraph_p = true;
+	      node = VEC_index (cgraph_node_ptr, partition->cgraph_set->nodes,
+				last_visited_cgraph_node);
+	      refs = &node->ref_list;
+	      total_size -= node->local.inline_summary.self_size;
+	      last_visited_cgraph_node++;
+	      gcc_assert (node->analyzed);
+	      for (edge = node->callees; edge; edge = edge->next_callee)
+		if (edge->callee->analyzed)
+		  {
+		    int edge_cost = edge->frequency;
+		    cgraph_node_set_iterator csi;
+
+		    if (!edge_cost)
+		      edge_cost = 1;
+		    gcc_assert (edge_cost > 0);
+		    csi = cgraph_node_set_find (partition->cgraph_set, edge->callee);
+		    if (!csi_end_p (csi)
+		        && csi.index < last_visited_cgraph_node - 1)
+		      cost -= edge_cost, internal+= edge_cost;
+		    else
+		      cost += edge_cost;
+		  }
+	      for (edge = node->callers; edge; edge = edge->next_caller)
+		{
+		  int edge_cost = edge->frequency;
+		  cgraph_node_set_iterator csi;
+
+		  gcc_assert (edge->caller->analyzed);
+		  if (!edge_cost)
+		    edge_cost = 1;
+		  gcc_assert (edge_cost > 0);
+		  csi = cgraph_node_set_find (partition->cgraph_set, edge->caller);
+		  if (!csi_end_p (csi)
+		      && csi.index < last_visited_cgraph_node)
+		    cost -= edge_cost;
+		  else
+		    cost += edge_cost;
+		}
+	    }
+	  else
+	    {
+	      refs =
+		&VEC_index (varpool_node_ptr, partition->varpool_set->nodes,
+			    last_visited_varpool_node)->ref_list;
+	      last_visited_varpool_node++;
+	    }
+	  for (j = 0; ipa_ref_list_reference_iterate (refs, j, ref); j++)
+	    if (ref->refered_type == IPA_REF_VARPOOL)
+	      {
+		varpool_node_set_iterator vsi;
+
+		vnode = ipa_ref_varpool_node (ref);
+		if (!vnode->finalized)
+		  continue;
+		if (!vnode->aux && partition_varpool_node_p (vnode))
+		  add_varpool_node_to_partition (partition, vnode);
+		vsi = varpool_node_set_find (partition->varpool_set, vnode);
+		if (!vsi_end_p (vsi)
+		    && vsi.index < last_visited_varpool_node - !cgraph_p)
+		  cost--, internal++;
+		else
+		  cost++;
+	      }
+	    else
+	      {
+		cgraph_node_set_iterator csi;
+
+		node = ipa_ref_node (ref);
+		if (!node->analyzed)
+		  continue;
+		csi = cgraph_node_set_find (partition->cgraph_set, node);
+		if (!csi_end_p (csi)
+		    && csi.index < last_visited_cgraph_node - cgraph_p)
+		  cost--, internal++;
+		else
+		  cost++;
+	      }
+	  for (j = 0; ipa_ref_list_refering_iterate (refs, j, ref); j++)
+	    if (ref->refering_type == IPA_REF_VARPOOL)
+	      {
+		varpool_node_set_iterator vsi;
+
+		vnode = ipa_ref_refering_varpool_node (ref);
+		gcc_assert (vnode->finalized);
+		if (!vnode->aux && partition_varpool_node_p (vnode))
+		  add_varpool_node_to_partition (partition, vnode);
+		vsi = varpool_node_set_find (partition->varpool_set, vnode);
+		if (!vsi_end_p (vsi)
+		    && vsi.index < last_visited_varpool_node)
+		  cost--;
+		else
+		  cost++;
+	      }
+	    else
+	      {
+		cgraph_node_set_iterator csi;
+
+		node = ipa_ref_refering_node (ref);
+		gcc_assert (node->analyzed);
+		csi = cgraph_node_set_find (partition->cgraph_set, node);
+		if (!csi_end_p (csi)
+		    && csi.index < last_visited_cgraph_node)
+		  cost--;
+		else
+		  cost++;
+	      }
+	}
+      if (partition->insns < partition_size * 3 / 4
+	  || best_cost == INT_MAX
+	  || ((!cost 
+	       || (best_internal * (HOST_WIDE_INT) cost > (internal * (HOST_WIDE_INT)best_cost)))
+  	      && partition->insns < partition_size * 5 / 4))
+	{
+	  best_cost = cost;
+	  best_internal = internal;
+	  best_i = i;
+	  best_n_nodes = VEC_length (cgraph_node_ptr,
+				     partition->cgraph_set->nodes);
+	  best_n_varpool_nodes = VEC_length (varpool_node_ptr,
+					     partition->varpool_set->nodes);
+	}
+      if (cgraph_dump_file)
+	fprintf (cgraph_dump_file, "Step %i: added %s, size %i, cost %i/%i best %i/%i, step %i\n", i,
+		 cgraph_node_name (order[i]), partition->insns, cost, internal,
+		 best_cost, best_internal, best_i);
+      if (partition->insns > 2 * partition_size)
+	{
+	  if (best_i != i)
+	    {
+	      if (cgraph_dump_file)
+		fprintf (cgraph_dump_file, "Unwinding %i insertions to step %i\n",
+			 i - best_i, best_i);
+	      undo_partition (partition, best_n_nodes, best_n_varpool_nodes);
+	    }
+	  i = best_i;
+	  partition = new_partition ("");
+	  last_visited_cgraph_node = 0;
+	  last_visited_varpool_node = 0;
+	  cost = 0;
+
+	  if (cgraph_dump_file)
+	    fprintf (cgraph_dump_file, "New partition\n");
+	  best_n_nodes = 0;
+	  best_n_varpool_nodes = 0;
+	  best_cost = INT_MAX;
+
+	  /* Since the size of partitions is just approximate, update the target
+	     size after we finish the current one.  */
+	  if (npartitions < PARAM_VALUE (PARAM_LTO_PARTITIONS))
+	    partition_size = total_size
+	      / (PARAM_VALUE (PARAM_LTO_PARTITIONS) - npartitions);
+	  else
+	    partition_size = INT_MAX;
+
+	  if (partition_size < PARAM_VALUE (MIN_PARTITION_SIZE))
+	    partition_size = PARAM_VALUE (MIN_PARTITION_SIZE);
+	  npartitions ++;
+	}
+    }
+  for (vnode = varpool_nodes; vnode; vnode = vnode->next)
+    if (partition_varpool_node_p (vnode) && !vnode->aux)
+      add_varpool_node_to_partition (partition, vnode);
+  free (order);
+}
+
 /* Promote variable VNODE to be static.  */
 
 static bool
@@ -1977,7 +2247,10 @@  do_whole_program_analysis (void)
   /* We are about to launch the final LTRANS phase, stop the WPA timer.  */
   timevar_pop (TV_WHOPR_WPA);
 
-  lto_1_to_1_map ();
+  if (flag_lto_partition_1to1)
+    lto_1_to_1_map ();
+  else
+    lto_balanced_map ();
 
   if (!quiet_flag)
     {
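
For reviewers skimming the diff, the grow-and-unwind strategy of lto_balanced_map can be summarized by a standalone sketch.  This is illustrative C, not GCC code: the node array index stands in for the reverse postorder, edge costs are unweighted, variables are ignored, and the 3/4-5/4 size window and per-partition size rebalancing are simplified away.

```c
/* Sketch of the grow-and-unwind partitioning idea: append nodes in a fixed
   order, track the internal/boundary edge ratio of the growing partition,
   and when it overflows twice the target size, roll back to the best cut
   seen so far (mirroring undo_partition) and start a new partition.  */

#include <stdio.h>

struct edge { int u, v; };

/* Write into PART_OF[node] the partition index each node lands in;
   return the number of partitions created.  */
static int
balanced_partition (const int *size, int n_nodes,
		    const struct edge *edges, int n_edges,
		    int n_partitions, int min_size, int *part_of)
{
  int total = 0, i, target, cur_start = 0, insns = 0, npart = 0;
  int best_len = 0;
  double best_ratio = -1.0;

  for (i = 0; i < n_nodes; i++)
    total += size[i];
  target = total / n_partitions;
  if (target < min_size)
    target = min_size;

  for (i = 0; i < n_nodes; i++)
    {
      int j, internal = 0, external = 0;
      double ratio;

      insns += size[i];
      /* Count edges fully inside vs. crossing the open partition
	 [cur_start, i]; GCC maintains this cost incrementally instead.  */
      for (j = 0; j < n_edges; j++)
	{
	  int u_in = edges[j].u >= cur_start && edges[j].u <= i;
	  int v_in = edges[j].v >= cur_start && edges[j].v <= i;
	  if (u_in && v_in)
	    internal++;
	  else if (u_in || v_in)
	    external++;
	}
      ratio = external ? (double) internal / external : 1e9;
      if (insns < target || ratio > best_ratio)
	{
	  best_len = i - cur_start + 1;
	  best_ratio = ratio;
	}
      if (insns > 2 * target)
	{
	  /* Unwind insertions past the best cut and close the partition.  */
	  int k;
	  for (k = cur_start; k < cur_start + best_len; k++)
	    part_of[k] = npart;
	  i = cur_start + best_len - 1;
	  cur_start = i + 1;
	  npart++;
	  insns = 0;
	  best_len = 0;
	  best_ratio = -1.0;
	}
    }
  for (i = cur_start; i < n_nodes; i++)
    part_of[i] = npart;
  return cur_start < n_nodes ? npart + 1 : npart;
}
```

On a chain of two disconnected triples, the sketch splits exactly at the empty-boundary cut, which is the behavior the unwinding is after.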
Index: lto/Make-lang.in
===================================================================
--- lto/Make-lang.in	(revision 163809)
+++ lto/Make-lang.in	(working copy)
@@ -85,7 +85,7 @@  lto/lto.o: lto/lto.c $(CONFIG_H) $(SYSTE
 	$(CGRAPH_H) $(GGC_H) tree-ssa-operands.h $(TREE_PASS_H) \
 	langhooks.h $(VEC_H) $(BITMAP_H) pointer-set.h $(IPA_PROP_H) \
 	$(COMMON_H) debug.h $(TIMEVAR_H) $(GIMPLE_H) $(LTO_H) $(LTO_TREE_H) \
-	$(LTO_TAGS_H) $(LTO_STREAMER_H) $(SPLAY_TREE_H) gt-lto-lto.h
+	$(LTO_TAGS_H) $(LTO_STREAMER_H) $(SPLAY_TREE_H) gt-lto-lto.h $(PARAMS_H)
 lto/lto-elf.o: lto/lto-elf.c $(CONFIG_H) coretypes.h $(SYSTEM_H) \
 	toplev.h $(LTO_H) $(TM_H) $(LIBIBERTY_H) $(GGC_H) $(LTO_STREAMER_H)
 lto/lto-coff.o: lto/lto-coff.c $(CONFIG_H) coretypes.h $(SYSTEM_H) \
Index: common.opt
===================================================================
--- common.opt	(revision 163809)
+++ common.opt	(working copy)
@@ -859,6 +859,14 @@  flto
 Common Var(flag_lto)
 Enable link-time optimization.
 
+flto-partition=1to1
+Common Var(flag_lto_partition_1to1)
+Partition functions and variables at link time based on the object files they originate from.
+
+flto-partition=balanced
+Common Var(flag_lto_partition_balanced)
+Partition functions and variables at link time into approximately equally sized buckets.
+
 ; The initial value of -1 comes from Z_DEFAULT_COMPRESSION in zlib.h.
 flto-compression-level=
 Common Joined RejectNegative UInteger Var(flag_lto_compression_level) Init(-1)
Index: params.def
===================================================================
--- params.def	(revision 163809)
+++ params.def	(working copy)
@@ -838,6 +838,17 @@  DEFPARAM (PARAM_DEVIRT_TYPE_LIST_SIZE,
 	  "devirtualization",
 	  8, 0, 0)
 
+/* WHOPR partitioning configuration.  */
+
+DEFPARAM (PARAM_LTO_PARTITIONS,
+	  "lto-partitions",
+	  "Number of partitions the program should be split into",
+	  32, 0, 0)
+
+DEFPARAM (MIN_PARTITION_SIZE,
+	  "lto-min-partition",
+	  "Size of minimal partition for WHOPR (in estimated instructions)",
+	  1000, 0, 0)
 /*
 Local variables:
 mode:c
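
As a usage note for the new knobs (command lines are illustrative; the object file names are hypothetical): with this patch balanced partitioning becomes the default WHOPR mode, 1-to-1 stays available through the new flag, and the two params tune partition count and minimum partition size.

```shell
# Old per-object-file partitioning, now opt-in:
gcc -flto -flto-partition=1to1 -o app a.o b.o

# Balanced partitioning (the default with this patch), tuned to 16
# partitions of at least 2000 estimated instructions each:
gcc -flto -flto-partition=balanced \
    --param lto-partitions=16 --param lto-min-partition=2000 \
    -o app a.o b.o
```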