Patchwork [google/4_6] New fdo summary-based icache sensitive unrolling (issue6298056)

login
register
mail settings
Submitter Teresa Johnson
Date June 7, 2012, 6:42 a.m.
Message ID <20120607064220.2803060B4E@tjsboxrox.mtv.corp.google.com>
Download mbox | patch
Permalink /patch/163494/
State New
Headers show

Comments

Teresa Johnson - June 7, 2012, 6:42 a.m.
This is the google/4_6 version of the patch to add new program summary
information to the gcov profile files to use as a estimate of code size
for guiding unrolling.

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for google/4_6?

Thanks,
Teresa


--
This patch is available for review at http://codereview.appspot.com/6298056

Patch

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 188240)
+++ doc/invoke.texi	(working copy)
@@ -384,7 +384,7 @@  Objective-C and Objective-C++ Dialects}.
 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
 -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
--fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
+-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} -fprofile-generate-sampling @gol
@@ -396,8 +396,8 @@  Objective-C and Objective-C++ Dialects}.
 -freorder-blocks-and-partition -freorder-functions @gol
 -frerun-cse-after-loop -freschedule-modulo-scheduled-loops @gol
 -fripa -fripa-disallow-asm-modules -fripa-disallow-opt-mismatch @gol
--fripa-inc-path-sub=@var{path_mapping} -fripa-no-promote-always-inline-func -fripa-verbose @gol
--fripa-peel-size-limit -fripa-unroll-size-limit -frounding-math @gol
+-fripa-inc-path-sub=@var{path_mapping} -fripa-no-promote-always-inline-func @gol
+-fripa-verbose -frounding-math @gol
 -fsched2-use-superblocks -fsched-pressure @gol
 -fsched-spec-load -fsched-spec-load-dangerous @gol
 -fsched-stalled-insns-dep[=@var{n}] -fsched-stalled-insns[=@var{n}] @gol
@@ -420,7 +420,7 @@  Objective-C and Objective-C++ Dialects}.
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-pta -ftree-reassoc @gol
 -ftree-sink -ftree-sra -ftree-switch-conversion @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
--funit-at-a-time -funroll-all-loops -funroll-loops @gol
+-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
 -fwhole-program -fwpa -fuse-ld -fuse-linker-plugin @gol
@@ -8187,20 +8187,6 @@  Do not promote static functions with always inline
 Enable printing of verbose information about dynamic inter-procedural optimizations.
 This is used in conjunction with the @option{-fripa}.
 
-@item -fripa-peel-size-limit
-@opindex fripa-peel-size-limit
-Limit loop peeling of non-const non-FP loops in a LIPO compilation under estimates
-of a large code footprint. Enabled by default under @option{-fripa}. Code size
-estimation and thresholds are controlled by the @option{codesize-hotness-threshold}
-and @option{unrollpeel-codesize-threshold} parameters.
-
-@item -fripa-unroll-size-limit
-@opindex fripa-unroll-size-limit
-Limit loop unrolling of non-const non-FP loops in a LIPO compilation under estimates
-of a large code footprint. Enabled by default under @option{-fripa}. Code size
-estimation and thresholds are controlled by the @option{codesize-hotness-threshold}
-and @option{unrollpeel-codesize-threshold} parameters.
-
 @item fripa-auxiliary_module_id
 @opindex fripa-auxiliary_module_id
 Specify the global module id for an auxiliary module. This is only effective in
@@ -8545,6 +8531,14 @@  the loop is entered.  This usually makes programs
 @option{-funroll-all-loops} implies the same options as
 @option{-funroll-loops}.
 
+@item -funroll-codesize-limit
+@opindex funroll-codesize-limit
+Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fpeel-loops
 @opindex fpeel-loops
 Peels the loops for that there is enough information that they do not
@@ -8553,6 +8547,14 @@  roll much (from profile feedback).  It also turns
 
 Enabled with @option{-fprofile-use}.
 
+@item -fpeel-codesize-limit
+@opindex fpeel-codesize-limit
+Limit loop peeling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fmove-loop-invariants
 @opindex fmove-loop-invariants
 Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
@@ -8935,13 +8937,15 @@  hot loops. Its default value is 16.
 @item max-completely-peel-loop-nest-depth
 The maximum depth of a loop nest suitable for complete peeling.
 
-@item codesize-hotness-threshold
-The minimum profile count of basic blocks to look at when estimating
-the code size footprint of the call graph in a LIPO compile.
-
 @item unrollpeel-codesize-threshold
-Maximum LIPO code size footprint estimate for loop unrolling and peeling.
+Maximum profile-based code size footprint estimate for loop unrolling and
+peeling.
 
+@item unrollpeel-hotness-threshold
+Maximum ratio of total execution count to loop entry block count under which
+most profile-based code size estimates will be ignored for loop unrolling and
+peeling.
+
 @item max-unswitch-insns
 The maximum number of insns of an unswitched loop.
 
Index: gcov-io.c
===================================================================
--- gcov-io.c	(revision 188240)
+++ gcov-io.c	(working copy)
@@ -520,6 +520,7 @@  gcov_write_summary (gcov_unsigned_t tag, const str
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       gcov_write_unsigned (csum->num);
+      gcov_write_unsigned (csum->num_hot_counters);
       gcov_write_unsigned (csum->runs);
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
@@ -637,6 +638,7 @@  gcov_read_summary (struct gcov_summary *summary)
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       csum->num = gcov_read_unsigned ();
+      csum->num_hot_counters = gcov_read_unsigned ();
       csum->runs = gcov_read_unsigned ();
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
Index: gcov-io.h
===================================================================
--- gcov-io.h	(revision 188240)
+++ gcov-io.h	(working copy)
@@ -440,7 +440,7 @@  typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000)
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
 #define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2))
 #define GCOV_TAG_MODULE_INFO ((gcov_unsigned_t)0xa4000000)
 #define GCOV_TAG_PMU_LOAD_LATENCY_INFO ((gcov_unsigned_t)0xa5000000)
 #define GCOV_TAG_PMU_LOAD_LATENCY_LENGTH(filename)  \
@@ -541,6 +541,8 @@  typedef HOST_WIDEST_INT gcov_type;
 struct gcov_ctr_summary
 {
   gcov_unsigned_t num;		/* number of counters.  */
+  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
+                                      percent of sum_all.  */
   gcov_unsigned_t runs;		/* number of program runs */
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
Index: loop-unroll.c
===================================================================
--- loop-unroll.c	(revision 188240)
+++ loop-unroll.c	(working copy)
@@ -35,6 +35,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "recog.h"
 #include "target.h"
 #include "diagnostic.h"
+#include "gcov-io.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
    optimizations on innermost loops (with single exception) because
@@ -181,48 +182,73 @@  report_unroll_peel(struct loop *loop, location_t l
             "" : iter_str);
 }
 
-/* This returns a bit vector */
-typedef enum {
-  NO_LIMIT = 0,
-  LIMIT_UNROLL = 0x1,
-  LIMIT_PEEL = 0x2,
-  LIMIT_BOTH = 0x3
-} limit_type;
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates. Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For threshold >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX).  */
 
-extern int cgraph_codesize_estimate;
-
-/* Determine whether LOOP unrolling/peeling should be constrained based
-   on code footprint estimates. */
-static limit_type
-limit_code_size(struct loop *loop)
+static int
+code_size_limit_factor(struct loop *loop)
 {
   unsigned size_threshold;
-  limit_type result = NO_LIMIT;
-  int result_int = 0;
   struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
 
-  if (!flag_dyn_ipa)
-    return NO_LIMIT;
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program. A large value would indicate a large flat execution
+     profile where icache misses may be a concern.  */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
 
-  gcc_assert (cgraph_codesize_estimate >= 0);
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance.  */
 
   /* Ignore FP loops, which are more likely to benefit heavily from
      unrolling. */
   if (desc->has_fp)
-    return NO_LIMIT;
+    return 1;
 
-  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
-  if (cgraph_codesize_estimate <= (int)size_threshold)
-    return NO_LIMIT;
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling.  */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+         limited unrolling for very hot loops.  */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the profile count sum to loop entry header ratio is smaller than
+         the threshold (i.e. the loop entry is hot enough, the divisor is set
+         to 1 so the unroll/peel factor is not reduced. When it is bigger
+         than the ratio, increase the divisor by the amount this ratio
+         is over the threshold, which will quickly reduce the unroll/peel
+         factor to zero as the loop's hotness reduces.  */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+        {
+          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+          gcc_assert (limit_factor >= 1);
+        }
+      else
+        limit_factor = 1;
+    }
+  else
+    /* For appliations that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0.  */
+    limit_factor = INT_MAX;
 
-  if (flag_ripa_peel_size_limit)
-    result_int |= LIMIT_PEEL;
-
-  if (flag_ripa_unroll_size_limit)
-    result_int |= LIMIT_UNROLL;
-
-  result = (limit_type)result_int;
-  return result;
+  return limit_factor;
 }
 
 /* Compute the maximum number of times LOOP can be unrolled without exceeding
@@ -444,7 +470,6 @@  decide_unrolling_and_peeling (int flags)
   struct loop *loop;
   loop_iterator li;
   location_t locus;
-  limit_type limit;
 
   /* Scan the loops, inner ones first.  */
   FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
@@ -485,38 +510,15 @@  decide_unrolling_and_peeling (int flags)
       loop->ninsns = num_loop_insns (loop);
       loop->av_ninsns = average_num_loop_insns (loop);
 
-      /* Determine whether to limit code size growth from unrolling and
-         peeling. This is currently enabled only under LIPO (dynamic IPA)
-         where we have a partial call graph. It is not applied to loops
-         with constant trip counts, as it is easier to determine the
-         profitability of unrolling and peeling such loops. */
-      limit = limit_code_size(loop);
-      if (limit != NO_LIMIT)
-	{
-	  if (dump_file)
-            {
-	      fprintf (dump_file, ";; Due to large code size footprint estimate, limit ");
-              if (limit == (LIMIT_UNROLL|LIMIT_PEEL))
-	        fprintf (dump_file, "unrolling and peeling\n");
-              else if (limit == LIMIT_UNROLL)
-	        fprintf (dump_file, "unrolling\n");
-              else
-	        fprintf (dump_file, "peeling\n");
-            }
-	}
-
       /* Try transformations one by one in decreasing order of
 	 priority.  */
 
       decide_unroll_constant_iterations (loop, flags);
-      if (loop->lpt_decision.decision == LPT_NONE
-          && !(limit & LIMIT_UNROLL))
+      if (loop->lpt_decision.decision == LPT_NONE)
 	decide_unroll_runtime_iterations (loop, flags);
-      if (loop->lpt_decision.decision == LPT_NONE
-          && !(limit & LIMIT_UNROLL))
+      if (loop->lpt_decision.decision == LPT_NONE)
 	decide_unroll_stupid (loop, flags);
-      if (loop->lpt_decision.decision == LPT_NONE
-          && !(limit & LIMIT_PEEL))
+      if (loop->lpt_decision.decision == LPT_NONE)
 	decide_peel_simple (loop, flags);
 
       if (flag_opt_info >= OPT_INFO_MIN
@@ -1055,6 +1057,7 @@  decide_unroll_runtime_iterations (struct loop *loo
 {
   unsigned nunroll, nunroll_by_av, nunroll_branches, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL))
     {
@@ -1067,10 +1070,26 @@  decide_unroll_runtime_iterations (struct loop *loo
 	     "\n;; Considering unrolling loop with runtime "
 	     "computable number of iterations\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1461,6 +1480,7 @@  decide_peel_simple (struct loop *loop, int flags)
 {
   unsigned npeel;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_PEEL))
     {
@@ -1471,8 +1491,23 @@  decide_peel_simple (struct loop *loop, int flags)
   if (dump_file)
     fprintf (dump_file, "\n;; Considering simply peeling loop\n");
 
+  if (flag_peel_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from peeling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max peeled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* npeel = number of iterations to peel.  */
-  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
+  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
+          / loop->ninsns;
   if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
     npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
 
@@ -1614,6 +1649,7 @@  decide_unroll_stupid (struct loop *loop, int flags
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL_ALL))
     {
@@ -1624,11 +1660,26 @@  decide_unroll_stupid (struct loop *loop, int flags
   if (dump_file)
     fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av
-    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
Index: coverage.c
===================================================================
--- coverage.c	(revision 188240)
+++ coverage.c	(working copy)
@@ -558,6 +558,7 @@  read_counts_file (const char *da_file_name, unsign
 	    {
 	      struct gcov_ctr_summary *csum = &summary.ctrs[entry->ctr];
 
+	      entry->summary.num_hot_counters += csum->num_hot_counters;
 	      entry->summary.runs += csum->runs;
 	      entry->summary.sum_all += csum->sum_all;
 	      if (entry->summary.run_max < csum->run_max)
Index: common.opt
===================================================================
--- common.opt	(revision 188240)
+++ common.opt	(working copy)
@@ -1130,14 +1130,6 @@  Common Report Var(flag_ripa_no_promote_always_inli
 Don't promote always inline static functions assuming they
 will be inlined and no copy is needed.
 
-fripa-peel-size-limit
-Common Report Var(flag_ripa_peel_size_limit) Init(1) Optimization
-Limit non-const non-FP loop peeling under dynamic IPA estimates of large code footprint
-
-fripa-unroll-size-limit
-Common Report Var(flag_ripa_unroll_size_limit) Init(1) Optimization
-Limit non-const non-FP loop unrolling under dynamic IPA estimates of large code footprint
-
 fripa-inc-path-sub=
 Common Joined RejectNegative Var(lipo_inc_path_pattern)
 Substitute substring in include paths with a new string to allow reuse profile data
@@ -1646,6 +1638,10 @@  fpcc-struct-return
 Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
 Return small aggregates in memory, not registers
 
+fpeel-codesize-limit
+Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop peeling under profile estimates of large code footprint
+
 fpeel-loops
 Common Report Var(flag_peel_loops) Optimization
 Perform loop peeling
@@ -2216,6 +2212,10 @@  funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops
 
+funroll-codesize-limit
+Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop unrolling under profile estimates of large code footprint
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
Index: tree-optimize.c
===================================================================
--- tree-optimize.c	(revision 188240)
+++ tree-optimize.c	(working copy)
@@ -47,7 +47,6 @@  along with GCC; see the file COPYING3.  If not see
 #include "except.h"
 #include "plugin.h"
 #include "regset.h"	/* FIXME: For reg_obstack.  */
-#include "params.h"
 
 /* Decides if the cgraph callee edges are being cleaned up for the
    last time.  */
@@ -156,49 +155,6 @@  struct gimple_opt_pass pass_all_early_optimization
  }
 };
 
-int cgraph_codesize_estimate = -1;
-
-/* Estimate the code size from the dynamic IPA call graph. */
-static void
-compute_codesize_estimate(void)
-{
-  struct cgraph_node *node;
-  basic_block bb;
-  gimple_stmt_iterator bsi;
-  struct function *my_function;
-
-  if (!flag_dyn_ipa)
-    return;
-
-  if (cgraph_codesize_estimate >= 0)
-    return;
-  cgraph_codesize_estimate = 0;
-
-  for (node = cgraph_nodes; node; node = node->next)
-    {
-      if (node->count == 0)
-        continue;
-
-      if (!gimple_has_body_p(node->decl))
-        continue;
-
-      my_function = DECL_STRUCT_FUNCTION (node->decl);
-
-      FOR_EACH_BB_FN (bb, my_function)
-        {
-          if (bb->count < PARAM_VALUE (PARAM_CODESIZE_HOTNESS_THRESHOLD))
-            continue;
-          for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
-            {
-              gimple stmt = gsi_stmt (bsi);
-              cgraph_codesize_estimate += estimate_num_insns (stmt, &eni_size_weights);
-            }
-        }
-    }
-  if (dump_file)
-    fprintf (dump_file, "Code size estimate from cgraph: %d\n", cgraph_codesize_estimate);
-}
-
 /* Pass: cleanup the CFG just before expanding trees to RTL.
    This is just a round of label cleanups and case node grouping
    because after the tree optimizers have run such cleanups may
@@ -207,8 +163,6 @@  struct gimple_opt_pass pass_all_early_optimization
 static unsigned int
 execute_cleanup_cfg_post_optimizing (void)
 {
-  /* Estimate the code footprint for hot BBs before we enter RTL */
-  compute_codesize_estimate();
   cleanup_tree_cfg ();
   cleanup_dead_labels ();
   group_case_labels ();
Index: libgcov.c
===================================================================
--- libgcov.c	(revision 188240)
+++ libgcov.c	(working copy)
@@ -835,6 +835,101 @@  gcov_counter_array (const struct gcov_info *info,
   return result;
 }
 
+/* Used by qsort to sort gcov values in descending order.  */
+
+static int
+sort_by_reverse_gcov_value (const void *pa, const void *pb)
+{
+  const gcov_type a = *(gcov_type const *)pa;
+  const gcov_type b = *(gcov_type const *)pb;
+
+  if (b > a)
+    return 1;
+  else if (b == a)
+    return 0;
+  else
+    return -1;
+}
+
+/* Determines the number of counters required to cover a given percentage
+   of the total sum of execution counts in the summary, which is then also
+   recorded in SUM.  */
+
+static void
+gcov_compute_cutoff_values (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, index;
+  gcov_unsigned_t c_num;
+  gcov_type *value_array;
+  gcov_type cum, cum_cutoff;
+  char *cutoff_str;
+  unsigned cutoff_perc;
+
+#define CUM_CUTOFF_PERCENT_TIMES_10 999
+  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
+  if (cutoff_str && strlen (cutoff_str))
+    cutoff_perc = atoi (cutoff_str);
+  else
+    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  /* Determine the cumulative counter value at the specified cutoff
+     percentage and record the percentage for use by gcov consumers.
+     Check for overflow when sum_all is multiplied by the cutoff_perc,
+     and if so, do the divide first.  */
+  if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
+    /* Overflow, do the divide first.  */
+    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
+  else
+    /* Otherwise multiply first to get the correct value for small
+       values of sum_all.  */
+    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
+
+  /* Next, walk through all the per-object structures and save each of
+     the count values in value_array.  */
+  index = 0;
+  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
+  for (gi_ptr = __gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!((1 << t_ix) & gi_ptr->ctr_mask))
+        continue;
+
+      ci_ptr = gi_ptr->counts;
+      /* Sanity check that there are enough entries in value_array
+         for this object's counters. Gracefully handle the case when
+         there are not, in case something in the profile info is
+         corrupted.  */
+      c_num = ci_ptr->num;
+      if (index + c_num > cs_ptr->num)
+        c_num = cs_ptr->num - index;
+      /* Copy over this object's counter values.  */
+      memcpy (&value_array[index], ci_ptr->values,
+              sizeof (gcov_type) * c_num);
+      index += c_num;
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier.  */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+         sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value.  */
+  cs_ptr->num_hot_counters = c_num;
+  free (value_array);
+}
+
 /* Compute object summary recored in gcov_info INFO. The result is
    stored in OBJ_SUM. Note that the caller is responsible for
    zeroing out OBJ_SUM, otherwise the summary is accumulated.  */
@@ -1010,7 +1105,11 @@  rewrite:;
         if ((1 << t_ix) & info->ctr_mask)
           {
             if (!cs_obj->runs++)
-              cs_obj->num = cs_tobj->num;
+              {
+                cs_obj->num = cs_tobj->num;
+                if (cs_tobj->num_hot_counters > cs_obj->num_hot_counters)
+                  cs_obj->num_hot_counters = cs_tobj->num_hot_counters;
+              }
             else if (cs_obj->num != cs_tobj->num)
                goto read_mismatch;
             cs_obj->sum_all += cs_tobj->sum_all;
@@ -1019,7 +1118,11 @@  rewrite:;
             cs_obj->sum_max += cs_tobj->run_max;
 
             if (!cs_prg->runs++)
-              cs_prg->num = cs_tprg->num;
+              {
+                cs_prg->num = cs_tprg->num;
+                if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
+                  cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
+              }
             else if (cs_prg->num != cs_tprg->num)
               goto read_mismatch;
             cs_prg->sum_all += cs_tprg->sum_all;
@@ -1225,6 +1328,7 @@  gcov_exit_init (void)
          is FDO/LIPO.  */
       dump_module_info |= gi_ptr->mod_info->is_primary;
     }
+  gcov_compute_cutoff_values (&this_program);
 
   gcov_alloc_filename ();
 
Index: params.def
===================================================================
--- params.def	(revision 188240)
+++ params.def	(working copy)
@@ -340,21 +340,23 @@  DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
 	 "max-completely-peel-loop-nest-depth",
 	 "The maximum depth of a loop nest we completely peel",
 	 8, 0, 0)
-/* The minimum profile count of basic blocks to look at when estimating
- * the code size footprint of the call graph in a dynamic IPA compile. */
-DEFPARAM(PARAM_CODESIZE_HOTNESS_THRESHOLD,
-	"codesize-hotness-threshold",
-	"Minimum profile count of basic blocks counted towards dynamic IPA "
-	"code size footprint estimate",
-	10000, 0, 0)
 /* The maximum code size estimate under which loop unrolling and peeling
- * is allowed in a dynamic IPA compile. This currently applies to loops with
- * non-constant iteration counts and no floating point computations. */
+ * is allowed in a profile feedback compile. This currently applies to loops
+ * with non-constant iteration counts and no floating point computations.  */
 DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
 	"unrollpeel-codesize-threshold",
-	"Maximum dynamic IPA code size footprint estimate for loop unrolling "
+	"Maximum profile-based code size footprint estimate for loop unrolling "
 	"and peeling",
-	10000, 0, 0)
+	15000, 0, 0)
+/* The maximum ratio of total profiled execution counts to loop entry block
+   count that must be exceeded to ignore most code size limits when unrolling
+   and peeling.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+        "unrollpeel-hotness-threshold",
+        "Maximum ratio of total profiled execution count to loop entry "
+        "block count under which most codesize limits for unrolling and "
+        "peeling will be ignored",
+        100, 1, 0)
 
 DEFPARAM(PARAM_MIN_ITER_UNROLL_WITH_BRANCHES,
 	"min-iter-unroll-with-branches",
Index: gcov-dump.c
===================================================================
--- gcov-dump.c	(revision 188240)
+++ gcov-dump.c	(working copy)
@@ -494,8 +494,10 @@  tag_summary (const char *filename ATTRIBUTE_UNUSED
     {
       printf ("\n");
       print_prefix (filename, 0, 0);
-      printf ("\t\tcounts=%u, runs=%u",
-	      summary.ctrs[ix].num, summary.ctrs[ix].runs);
+      printf ("\t\tcounts=%u (num hot counts=%u), runs=%u",
+	      summary.ctrs[ix].num,
+	      summary.ctrs[ix].num_hot_counters,
+	      summary.ctrs[ix].runs);
 
       printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
 	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);