Patchwork: New fdo summary-based icache sensitive unrolling (issue6351086)

Submitter Teresa Johnson
Date July 27, 2012, 4:47 a.m.
Message ID <20120727044750.0B77E61471@tjsboxrox.mtv.corp.google.com>
Permalink /patch/173555/
State New

Comments

Teresa Johnson - July 27, 2012, 4:47 a.m.
Updated patch based on review feedback.

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Teresa


Original patch description:

Ports some patches related to improving FDO program summary information
and using it to guide loop unrolling from google branches to mainline.
The patch is enhanced to add additional summary information to aid
in determining hot/cold decisions.

The original patch description is at:
  http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and further discussion about incorporating onto mainline is at:
  http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414

Honza, can you take a look to see if this patch meets your needs?

Full description:

This patch adds new program summary information to the gcov
profile files indicating how many profiled counters compose
the majority of the program's execution time. This is used to
estimate the overall code footprint of the program.

The new profile summary information is then used to guide
codesize-based unroll and peel decisions, preventing those
optimizations from increasing code size too much when the
program may be sensitive to icache effects.
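The cutoff computation described above can be sketched as follows. This is a minimal standalone illustration of the sort-and-accumulate approach, not the libgcov code itself; the function and variable names are illustrative, and the overflow guard in the real code is omitted here:

```c
#include <stdlib.h>

/* Comparator for qsort: sort counter values in descending order.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  long long a = *(const long long *) pa;
  long long b = *(const long long *) pb;
  return (b > a) - (b < a);
}

/* Return how many of the NUM counters in VALUES are needed to cover
   CUTOFF_PERC tenths of a percent (999 == 99.9%) of their total sum.
   VALUES is sorted (descending) in place.  This sketch multiplies
   first, which is exact for small sums; the real code also guards
   the multiplication against overflow.  */
static unsigned
num_hot_counters (long long *values, unsigned num, long long cutoff_perc)
{
  long long sum = 0, cum = 0, cutoff;
  unsigned i, n;

  for (i = 0; i < num; i++)
    sum += values[i];
  cutoff = sum * cutoff_perc / 1000;

  qsort (values, num, sizeof (long long), cmp_desc);
  for (n = 0; n < num && cum < cutoff; n++)
    cum += values[n];
  return n;
}
```

A small number of counters covering most of the sum suggests a concentrated, icache-friendly profile; a large number suggests a flat profile where unrolling is riskier.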

This patch also pulls in dependent portions of google/main r187660 that cache
additional loop analysis results in the niter_desc auxiliary information
hanging off the loop structure (the optimization portions of that
change are not included here, and have an outstanding review request
for mainline).

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-07-26  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_exit): Call gcov_compute_cutoff_values and merge new summary
        information.
	* gcc/doc/invoke.texi (roll much): Document new options
        -fpeel-codesize-limit and -funroll-codesize-limit, and new params
        unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
	* gcc/gcov-io.c (gcov_write_summary): Write new summary info.
	(gcov_read_summary): Read new summary info.
	* gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
        (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
        hot_cutoff_value.
	* gcc/loop-unroll.c (code_size_limit_factor): New function.
	(decide_unroll_runtime_iterations): Call code_size_limit_factor
        to control the unroll factor, and retrieve number of branches from
        niter_desc instead of via function that walks loop.
	(decide_peel_simple, decide_unroll_stupid): Ditto.
	* gcc/coverage.c (read_counts_file): Propagate new summary info.
	* gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
        function, and add guards to enable this function to work for the
        outermost loop.
	* gcc/common.opt: Add -fpeel-codesize-limit and
        -funroll-codesize-limit.
	* gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
        (num_loop_branches): Remove.
	* gcc/cfgloop.h (struct niter_desc): Added new fields to cache
        additional loop analysis information.
        (num_loop_branches): Remove.
        (analyze_loop_insns): Declare.
	* gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
        (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
	* gcc/gcov-dump.c (tag_summary): Dump new summary info.


--
This patch is available for review at http://codereview.appspot.com/6351086
Steven Bosscher - July 27, 2012, 7:31 a.m.
On Fri, Jul 27, 2012 at 6:47 AM, Teresa Johnson <tejohnson@google.com> wrote:
>         * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
>         (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
>         hot_cutoff_value.

You should also update the description for the data file at the head
of gcov-io.h:

   The data file contains the following records.
        data: {unit summary:object summary:program* function-data*}*
        unit: header int32:checksum
        function-data:  announce_function present counts
        announce_function: header int32:ident
                int32:lineno_checksum int32:cfg_checksum
        present: header int32:present
        counts: header int64:count*
        summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
        count-summary:  int32:num int32:runs int64:sum
                        int64:max int64:sum_max

You've added two fields in count-summary IIUC.

Ciao!
Steven
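With the two new fields, the count-summary record at the head of gcov-io.h would presumably become something like the following. This is a sketch inferred from the field ordering in the patch's gcov_write_summary, not text that appears in the patch:

```
count-summary:  int32:num int32:num_hot_counters int64:hot_cutoff_value
                int32:runs int64:sum int64:max int64:sum_max
```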
Teresa Johnson - July 27, 2012, 2:33 p.m.
On Fri, Jul 27, 2012 at 12:31 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Fri, Jul 27, 2012 at 6:47 AM, Teresa Johnson <tejohnson@google.com> wrote:
>>         * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
>>         (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
>>         hot_cutoff_value.
>
> You should also update the description for the data file at the head
> of gcov-io.h:
>
>    The data file contains the following records.
>         data: {unit summary:object summary:program* function-data*}*
>         unit: header int32:checksum
>         function-data:  announce_function present counts
>         announce_function: header int32:ident
>                 int32:lineno_checksum int32:cfg_checksum
>         present: header int32:present
>         counts: header int64:count*
>         summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
>         count-summary:  int32:num int32:runs int64:sum
>                         int64:max int64:sum_max
>
> You've added two fields in count-summary IIUC.

Thanks, I will update this in my next version of the patch.

Teresa

>
> Ciao!
> Steven
Teresa Johnson - July 27, 2012, 2:57 p.m.
On Fri, Jul 27, 2012 at 3:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
>> Index: libgcc/libgcov.c
>> ===================================================================
>> --- libgcc/libgcov.c    (revision 189893)
>> +++ libgcc/libgcov.c    (working copy)
>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>>    return 1;
>>  }
>>
>> +/* Used by qsort to sort gcov values in descending order.  */
>> +
>> +static int
>> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
>> +{
>> +  const gcov_type a = *(gcov_type const *)pa;
>> +  const gcov_type b = *(gcov_type const *)pb;
>> +
>> +  if (b > a)
>> +    return 1;
>> +  else if (b == a)
>> +    return 0;
>> +  else
>> +    return -1;
>> +}
>> +
>> +/* Determines the number of counters required to cover a given percentage
>> +   of the total sum of execution counts in the summary, which is then also
>> +   recorded in SUM.  */
>> +
>> +static void
>> +gcov_compute_cutoff_values (struct gcov_summary *sum)
>
> This looks like good idea to me to drive the hot/cold partitioning even if it
> is not quite accurate (you have no idea how many instructions given counter is
> guarding).

Thanks - right, I think it will be a good approximation.

>
> To reduce overhead on embedded systems, what about just doing a histogram with,
> say, 128 steps instead of dragging in qsort? This also avoids the need to
> produce the copy of all counters.

I like that suggestion. I'll need 1024 buckets though to get the 99.9% cutoff
I am using right now (or use 128 buckets plus 1 to track what would be
bucket 1024 which is roughly 99.9%). I can use something like a binary
search to locate the right bucket without a linear search or divide. And
I can keep track of the min value for each bucket in order to identify the
correct value for hot_cutoff_value.
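The bucket-lookup idea sketched here is hypothetical: the bucket count, boundary layout, and names are illustrative, not from the patch. It shows a divide-free binary search into precomputed bucket lower bounds, plus the per-bucket minimum tracking needed to recover hot_cutoff_value:

```c
#include <stddef.h>

#define NUM_BUCKETS 128

/* Per-bucket state: the counter count and the smallest value seen,
   so the final hot_cutoff_value can be recovered from the bucket
   that crosses the cumulative cutoff.  */
struct bucket
{
  unsigned count;
  long long min_value;
};

/* Locate the bucket for VALUE by binary search over NUM_BUCKETS
   precomputed lower bounds, avoiding a per-counter divide.  */
static size_t
find_bucket (const long long lower_bounds[NUM_BUCKETS], long long value)
{
  size_t lo = 0, hi = NUM_BUCKETS - 1;
  while (lo < hi)
    {
      size_t mid = lo + (hi - lo + 1) / 2;
      if (value >= lower_bounds[mid])
        lo = mid;
      else
        hi = mid - 1;
    }
  return lo;
}

/* Add VALUE to bucket B, tracking the per-bucket minimum.  */
static void
bucket_add (struct bucket *b, long long value)
{
  if (b->count == 0 || value < b->min_value)
    b->min_value = value;
  b->count++;
}
```

This replaces the qsort over a full copy of the counters with a fixed-size structure, at the cost of the preliminary pass to find the min and max counter values mentioned below.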

>> +
>> +  /* Determine the cumulative counter value at the specified cutoff
>> +     percentage and record the percentage for use by gcov consumers.
>> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
>> +     and if so, do the divide first.  */
>> +  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
>> +    /* Overflow, do the divide first.  */
>> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
>> +  else
>> +    /* Otherwise multiply first to get the correct value for small
>> +       values of sum_all.  */
>> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
>
> To further keep embedded systems (at least a bit) happier, I guess one could do
> this without generic 64-bit divide operations.  I guess 1000 can be bumped up to
> 1024, the small error is harmless here.
>
> Actually it may be easier to simply embed the histogram into the gcov summary
> so one can control the cutoff with --param in the compiler at -fprofile-use time.
> It seems reasonable to me to trade 128 values per file for the extra flexibility.

Both you and David have requested more values, so I will go ahead
and implement this suggestion in this patch. I can use the 128 + 1
bucket approach I described above to get the data for roughly every
1% plus the 99.9%. That should be enough granularity for the
optimizations (and the smallest bucket doesn't really need to be
fed back as it can be largely extrapolated from the others). This
will require feeding back 128 arrays of 2 values (num_hot_counters
and hot_cutoff_value).
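The multiply-versus-divide ordering being discussed can be isolated into a small helper. This is a sketch of the pattern from the patch, not its exact code; the patch performs the same round-trip check on gcov_type values:

```c
/* Compute sum_all * cutoff_perc / 1000 without losing precision for
   small sums and without wrapping for huge ones: the round-trip
   division detects a wrapped multiplication, in which case we divide
   first and accept a small rounding error.  (Production code would
   prefer an unsigned type or __builtin_mul_overflow, since signed
   wraparound is formally undefined behavior in C.)  */
static long long
scaled_cutoff (long long sum_all, long long cutoff_perc)
{
  if ((sum_all * cutoff_perc) / cutoff_perc != sum_all)
    return sum_all / 1000 * cutoff_perc;  /* overflow: divide first */
  return sum_all * cutoff_perc / 1000;    /* exact for small sums */
}
```

Note that dividing first truncates sum_all / 1000 to zero for sums under 1000, which is why the multiply-first branch matters for small profiles.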

>
>> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
>> +    {
>> +      if (!gi_ptr->merge[t_ix])
>> +        continue;
>> +
>> +      /* Find the appropriate index into the gcov_ctr_info array
>> +         for the counter we are currently working on based on the
>> +         existence of the merge function pointer for this object.  */
>> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
>> +        {
>> +          if (gi_ptr->merge[i])
>> +            ctr_info_ix++;
>> +        }
>> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
>> +        {
>> +          gfi_ptr = gi_ptr->functions[f_ix];
>> +
>> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
>> +            continue;
>> +
>> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
>> +          /* Sanity check that there are enough entries in value_arry
>> +            for this function's counters. Gracefully handle the case when
>> +            there are not, in case something in the profile info is
>> +            corrupted.  */
>> +          c_num = ci_ptr->num;
>> +          if (index + c_num > cs_ptr->num)
>> +            c_num = cs_ptr->num - index;
>> +          /* Copy over this function's counter values.  */
>> +          memcpy (&value_array[index], ci_ptr->values,
>> +                  sizeof (gcov_type) * c_num);
>> +          index += c_num;
>
> I wonder if the loop walking all counters can't be fused into one of the other
> loops we already have.

Not with the histogram approach as the preceding walk (in the caller, gcov_exit)
will be needed to find the min and max counter values first.

>> +        }
>> Index: gcc/doc/invoke.texi
>> ===================================================================
>> --- gcc/doc/invoke.texi (revision 189893)
>> +++ gcc/doc/invoke.texi (working copy)
>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>>  -fprefetch-loop-arrays @gol
>>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>>  -fprofile-generate=@var{path} @gol
>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>>  -ftree-switch-conversion -ftree-tail-merge @gol
>>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>>  -fwhole-program -fwpa -fuse-linker-plugin @gol
>> @@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
>>  @option{-funroll-all-loops} implies the same options as
>>  @option{-funroll-loops}.
>>
>> +@item -funroll-codesize-limit
>> +@opindex funroll-codesize-limit
>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>
> Let's handle the cutoff logic independently of the loop bits, which I can not approve.
> The patch is OK with the change suggested.

Ok, I will split this into 2 patches, one for the gcov changes only, and a
follow-on patch with my loop unroller changes.

Thanks!
Teresa

>
> Honza
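The three-regime policy behind -funroll-codesize-limit and -fpeel-codesize-limit (implemented by code_size_limit_factor in the patch below) amounts to the following. This is a simplified standalone sketch with illustrative parameter values; the real function also exempts FP loops via desc->has_fp:

```c
#include <limits.h>

/* Divisor applied to the max-unrolled/peeled-insns params:
   1) footprint <= threshold: no limiting;
   2) threshold < footprint < 2*threshold: limit according to how cold
      the loop entry is relative to the whole program's profile weight;
   3) footprint >= 2*threshold: effectively disable unrolling/peeling.  */
static int
limit_factor (unsigned num_hot_counters, unsigned size_threshold,
              long long sum_all, long long header_count,
              long long hotness_threshold)
{
  long long ratio;

  if (num_hot_counters <= size_threshold)
    return 1;
  if (num_hot_counters >= size_threshold * 2)
    return INT_MAX;

  ratio = sum_all / header_count;   /* program weight vs. loop entry */
  return ratio > hotness_threshold ? (int) (ratio / hotness_threshold) : 1;
}
```

The returned divisor is then applied to PARAM_MAX_UNROLLED_INSNS and friends, so a divisor of 1 leaves unrolling untouched while INT_MAX drives the allowed unroll/peel factor to zero.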

Patch

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189893)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,120 @@  gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Used by qsort to sort gcov values in descending order.  */
+
+static int
+sort_by_reverse_gcov_value (const void *pa, const void *pb)
+{
+  const gcov_type a = *(gcov_type const *)pa;
+  const gcov_type b = *(gcov_type const *)pb;
+
+  if (b > a)
+    return 1;
+  else if (b == a)
+    return 0;
+  else
+    return -1;
+}
+
+/* Determines the number of counters required to cover a given percentage
+   of the total sum of execution counts in the summary, which is then also
+   recorded in SUM.  */
+
+static void
+gcov_compute_cutoff_values (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, i, ctr_info_ix, index;
+  gcov_unsigned_t c_num;
+  gcov_type *value_array;
+  gcov_type cum, cum_cutoff;
+  char *cutoff_str;
+  unsigned cutoff_perc;
+
+#define CUM_CUTOFF_PERCENT_TIMES_10 999
+  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
+  if (cutoff_str && strlen (cutoff_str))
+    cutoff_perc = atoi (cutoff_str);
+  else
+    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  /* Determine the cumulative counter value at the specified cutoff
+     percentage and record the percentage for use by gcov consumers.
+     Check for overflow when sum_all is multiplied by the cutoff_perc,
+     and if so, do the divide first.  */
+  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
+    /* Overflow, do the divide first.  */
+    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
+  else
+    /* Otherwise multiply first to get the correct value for small
+       values of sum_all.  */
+    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
+
+  /* Next, walk through all the per-object structures and save each of
+     the count values in value_array.  */
+  index = 0;
+  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
+        {
+          if (gi_ptr->merge[i])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          /* Sanity check that there are enough entries in value_array
+            for this function's counters. Gracefully handle the case when
+            there are not, in case something in the profile info is
+            corrupted.  */
+          c_num = ci_ptr->num;
+          if (index + c_num > cs_ptr->num)
+            c_num = cs_ptr->num - index;
+          /* Copy over this function's counter values.  */
+          memcpy (&value_array[index], ci_ptr->values,
+                  sizeof (gcov_type) * c_num);
+          index += c_num;
+        }
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier.  */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+         sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value,
+     as well as the counter value that caused cum to reach the cutoff.  */
+  cs_ptr->num_hot_counters = c_num;
+  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
+  free (value_array);
+}
+
 /* Dump the coverage counts. We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum. We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +461,7 @@  gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_cutoff_values (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified. */
@@ -598,7 +713,12 @@  gcov_exit (void)
 	  if (gi_ptr->merge[t_ix])
 	    {
 	      if (!cs_prg->runs++)
-		cs_prg->num = cs_tprg->num;
+                cs_prg->num = cs_tprg->num;
+              if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
+              {
+                cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
+                cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
+              }
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 189893)
+++ gcc/doc/invoke.texi	(working copy)
@@ -385,7 +385,7 @@  Objective-C and Objective-C++ Dialects}.
 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
 -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
--fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
+-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
@@ -417,7 +417,7 @@  Objective-C and Objective-C++ Dialects}.
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
--funit-at-a-time -funroll-all-loops -funroll-loops @gol
+-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
 -fwhole-program -fwpa -fuse-linker-plugin @gol
@@ -8527,6 +8527,14 @@  the loop is entered.  This usually makes programs
 @option{-funroll-all-loops} implies the same options as
 @option{-funroll-loops}.
 
+@item -funroll-codesize-limit
+@opindex funroll-codesize-limit
+Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fpeel-loops
 @opindex fpeel-loops
 Peels loops for which there is enough information that they do not
@@ -8535,6 +8543,14 @@  roll much (from profile feedback).  It also turns
 
 Enabled with @option{-fprofile-use}.
 
+@item -fpeel-codesize-limit
+@opindex fpeel-codesize-limit
+Limit loop peeling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fmove-loop-invariants
 @opindex fmove-loop-invariants
 Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
@@ -8886,6 +8902,15 @@  The maximum number of iterations of a loop to be s
 @item max-completely-peel-loop-nest-depth
 The maximum depth of a loop nest suitable for complete peeling.
 
+@item unrollpeel-codesize-threshold
+Maximum profile-based code size footprint estimate for loop unrolling and
+peeling.
+
+@item unrollpeel-hotness-threshold
+Maximum ratio of total execution count to loop entry block count under which
+most profile-based code size estimates will be ignored for loop unrolling and
+peeling.
+
 @item max-unswitch-insns
 The maximum number of insns of an unswitched loop.
 
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189893)
+++ gcc/gcov-io.c	(working copy)
@@ -376,6 +376,8 @@  gcov_write_summary (gcov_unsigned_t tag, const str
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       gcov_write_unsigned (csum->num);
+      gcov_write_unsigned (csum->num_hot_counters);
+      gcov_write_counter (csum->hot_cutoff_value);
       gcov_write_unsigned (csum->runs);
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
@@ -495,6 +497,8 @@  gcov_read_summary (struct gcov_summary *summary)
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       csum->num = gcov_read_unsigned ();
+      csum->num_hot_counters = gcov_read_unsigned ();
+      csum->hot_cutoff_value = gcov_read_counter ();
       csum->runs = gcov_read_unsigned ();
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189893)
+++ gcc/gcov-io.h	(working copy)
@@ -310,7 +310,7 @@  typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
 #define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
 
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS 	0  /* Arc transitions.  */
@@ -393,6 +393,10 @@  typedef HOST_WIDEST_INT gcov_type;
 struct gcov_ctr_summary
 {
   gcov_unsigned_t num;		/* number of counters.  */
+  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
+                                      percent of sum_all.  */
+  gcov_type hot_cutoff_value;   /* value of smallest counter included in
+                                   num_hot_counters.  */
   gcov_unsigned_t runs;		/* number of program runs */
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c	(revision 189893)
+++ gcc/loop-unroll.c	(working copy)
@@ -32,6 +32,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 #include "dumpfile.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
@@ -151,6 +152,75 @@  static void combine_var_copies_in_loop_exit (struc
 					     basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates. Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX).  */
+
+static int
+code_size_limit_factor(struct loop *loop)
+{
+  unsigned size_threshold;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program. A large value would indicate a large flat execution
+     profile where icache misses may be a concern.  */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance.  */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling. */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling.  */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+         limited unrolling for very hot loops.  */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the ratio of profile count sum to loop entry header count is
+         smaller than the threshold (i.e. the loop entry is hot enough), the
+         divisor is set to 1 so the unroll/peel factor is not reduced.  When
+         the ratio is bigger than the threshold, increase the divisor by the
+         amount this ratio is over the threshold, which will quickly reduce
+         the unroll/peel factor to zero as the loop's hotness reduces.  */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+        {
+          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+          gcc_assert (limit_factor >= 1);
+        }
+      else
+        limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0.  */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
 void
 unroll_and_peel_loops (int flags)
@@ -803,6 +873,7 @@  decide_unroll_runtime_iterations (struct loop *loo
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL))
     {
@@ -815,10 +886,26 @@  decide_unroll_runtime_iterations (struct loop *loo
 	     "\n;; Considering unrolling loop with runtime "
 	     "computable number of iterations\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+        {
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+        }
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1196,6 +1283,7 @@  decide_peel_simple (struct loop *loop, int flags)
 {
   unsigned npeel;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_PEEL))
     {
@@ -1206,8 +1294,23 @@  decide_peel_simple (struct loop *loop, int flags)
   if (dump_file)
     fprintf (dump_file, "\n;; Considering simply peeling loop\n");
 
+  if (flag_peel_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from peeling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max peeled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* npeel = number of iterations to peel.  */
-  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
+  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
+          / loop->ninsns;
   if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
     npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
 
@@ -1232,7 +1335,7 @@  decide_peel_simple (struct loop *loop, int flags)
 
   /* Do not simply peel loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
 	fprintf (dump_file, ";; Not peeling, contains branches\n");
@@ -1349,6 +1452,7 @@  decide_unroll_stupid (struct loop *loop, int flags
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL_ALL))
     {
@@ -1359,11 +1463,26 @@  decide_unroll_stupid (struct loop *loop, int flags
   if (dump_file)
     fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av
-    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1393,7 +1512,7 @@  decide_unroll_stupid (struct loop *loop, int flags
 
   /* Do not unroll loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
 	fprintf (dump_file, ";; Not unrolling, contains branches\n");
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189893)
+++ gcc/coverage.c	(working copy)
@@ -247,6 +247,14 @@  read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+              if (summary.ctrs[ix].num_hot_counters
+                  < sum.ctrs[ix].num_hot_counters)
+                {
+		  summary.ctrs[ix].num_hot_counters
+                      = sum.ctrs[ix].num_hot_counters;
+		  summary.ctrs[ix].hot_cutoff_value
+                      = sum.ctrs[ix].hot_cutoff_value;
+                }
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
Index: gcc/loop-iv.c
===================================================================
--- gcc/loop-iv.c	(revision 189893)
+++ gcc/loop-iv.c	(working copy)
@@ -2970,8 +2970,12 @@  get_simple_loop_desc (struct loop *loop)
   /* At least desc->infinite is not always initialized by
      find_simple_loop_exit.  */
   desc = XCNEW (struct niter_desc);
-  iv_analysis_loop_init (loop);
-  find_simple_exit (loop, desc);
+  if (loop->latch != EXIT_BLOCK_PTR)
+    {
+      iv_analysis_loop_init (loop);
+      find_simple_exit (loop, desc);
+    }
+  analyze_loop_insns (loop, desc);
   loop->aux = desc;
 
   if (desc->simple_p && (desc->assumptions || desc->infinite))
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 189893)
+++ gcc/common.opt	(working copy)
@@ -1551,6 +1551,10 @@  fpcc-struct-return
 Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
 Return small aggregates in memory, not registers
 
+fpeel-codesize-limit
+Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop peeling under profile estimates of large code footprint
+
 fpeel-loops
 Common Report Var(flag_peel_loops) Optimization
 Perform loop peeling
@@ -2112,6 +2116,10 @@  funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops
 
+funroll-codesize-limit
+Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop unrolling under profile estimates of large code footprint
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
Index: gcc/cfgloop.c
===================================================================
--- gcc/cfgloop.c	(revision 189893)
+++ gcc/cfgloop.c	(working copy)
@@ -1154,24 +1154,98 @@  get_loop_exit_edges (const struct loop *loop)
   return edges;
 }
 
-/* Counts the number of conditional branches inside LOOP.  */
+/* Determine whether INSN contains a floating point set.  */
 
-unsigned
-num_loop_branches (const struct loop *loop)
+static bool
+insn_has_fp_set (rtx insn)
 {
-  unsigned i, n;
-  basic_block * body;
+  int i;
+  rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == SET)
+    return FLOAT_MODE_P (GET_MODE (SET_DEST (pat)));
+  else if (GET_CODE (pat) == PARALLEL)
+    {
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+        {
+          rtx sub = XVECEXP (pat, 0, i);
+          if (GET_CODE (sub) == SET
+              && FLOAT_MODE_P (GET_MODE (SET_DEST (sub))))
+            return true;
+        }
+      /* Fall through: no FP set found in the PARALLEL.  */
+    }
+  return false;
+}
 
-  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
+/* Analyzes the instructions inside LOOP, updating DESC.  Currently counts
+   the number of conditional branch instructions, calls and fp instructions,
+   as well as the average number of branches executed per iteration.  */
 
+void
+analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
+{
+  unsigned i, nbranch;
+  gcov_type weighted_nbranch;
+  bool has_call, has_fp;
+  basic_block * body, bb;
+  rtx insn;
+  gcov_type header_count = loop->header->count;
+
+  nbranch = weighted_nbranch = 0;
+  has_call = has_fp = false;
+
   body = get_loop_body (loop);
-  n = 0;
   for (i = 0; i < loop->num_nodes; i++)
-    if (EDGE_COUNT (body[i]->succs) >= 2)
-      n++;
+    {
+      bb = body[i];
+
+      if (EDGE_COUNT (bb->succs) >= 2)
+        {
+          nbranch++;
+
+          /* If this block is executed less frequently than the header (loop
+             entry), then it is weighted based on its execution count, which
+             will be turned into a ratio compared to the loop header below.  */
+          if (bb->count < header_count)
+            weighted_nbranch += bb->count;
+
+          /* When it is executed more frequently than the header (i.e. it is
+             in a nested inner loop), simply weight the branch the same as the
+             header execution count, so that it will contribute 1 branch to
+             the ratio computed below.  */
+          else
+            weighted_nbranch += header_count;
+        }
+
+      /* No need to iterate through the instructions below if
+         both flags have already been set.  */
+      if (has_call && has_fp)
+        continue;
+
+      FOR_BB_INSNS (bb, insn)
+        {
+          if (!INSN_P (insn))
+            continue;
+
+          if (!has_call)
+            has_call = CALL_P (insn);
+
+          if (!has_fp)
+            has_fp = insn_has_fp_set (insn);
+        }
+    }
   free (body);
 
-  return n;
+  desc->num_branches = nbranch;
+  /* Now divide the weights computed above by the loop header execution count,
+     to compute the average number of branches through the loop. By adding
+     header_count/2 to the numerator we round to nearest with integer
+     division.  */
+  if (header_count != 0)
+    desc->av_num_branches
+        = (weighted_nbranch + header_count/2) / header_count;
+  else
+    desc->av_num_branches = 0;
+  desc->has_call = has_call;
+  desc->has_fp = has_fp;
 }
 
 /* Adds basic block BB to LOOP.  */
Index: gcc/cfgloop.h
===================================================================
--- gcc/cfgloop.h	(revision 189893)
+++ gcc/cfgloop.h	(working copy)
@@ -255,7 +255,6 @@  extern basic_block *get_loop_body_in_custom_order
 
 extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
 edge single_exit (const struct loop *);
-extern unsigned num_loop_branches (const struct loop *);
 
 extern edge loop_preheader_edge (const struct loop *);
 extern edge loop_latch_edge (const struct loop *);
@@ -366,7 +365,8 @@  struct rtx_iv
 };
 
 /* The description of an exit from the loop and of the number of iterations
-   till we take the exit.  */
+   till we take the exit.  Also includes other information used primarily
+   by the loop unroller.  */
 
 struct niter_desc
 {
@@ -407,6 +407,18 @@  struct niter_desc
 
   /* The number of iterations of the loop.  */
   rtx niter_expr;
+
+  /* The number of branches in the loop.  */
+  unsigned num_branches;
+
+  /* The number of executed branches per iteration.  */
+  unsigned av_num_branches;
+
+  /* Whether the loop contains a call instruction.  */
+  bool has_call;
+
+  /* Whether the loop contains fp instructions.  */
+  bool has_fp;
 };
 
 extern void iv_analysis_loop_init (struct loop *);
@@ -420,6 +432,7 @@  extern void iv_analysis_done (void);
 
 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 extern void free_simple_loop_desc (struct loop *loop);
+void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
 
 static inline struct niter_desc *
 simple_loop_desc (struct loop *loop)
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 189893)
+++ gcc/params.def	(working copy)
@@ -311,6 +311,23 @@  DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
 	 "max-completely-peel-loop-nest-depth",
 	 "The maximum depth of a loop nest we completely peel",
 	 8, 0, 0)
+/* The maximum code size estimate under which loop unrolling and peeling
+   is allowed in a profile feedback compile.  This currently applies to loops
+   with non-constant iteration counts and no floating point computations.  */
+DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
+        "unrollpeel-codesize-threshold",
+        "Maximum profile-based code size footprint estimate for loop unrolling "
+        "and peeling",
+        15000, 0, 0)
+/* The maximum ratio of total profiled execution counts to loop entry block
+   count that must be exceeded to ignore most code size limits when unrolling
+   and peeling.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+        "unrollpeel-hotness-threshold",
+        "Maximum ratio of total profiled execution count to loop entry "
+        "block count under which most codesize limits for unrolling and "
+        "peeling will be ignored",
+        100, 1, 0)
 
 /* The maximum number of insns of an unswitched loop.  */
 DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189893)
+++ gcc/gcov-dump.c	(working copy)
@@ -456,8 +456,12 @@  tag_summary (const char *filename ATTRIBUTE_UNUSED
     {
       printf ("\n");
       print_prefix (filename, 0, 0);
-      printf ("\t\tcounts=%u, runs=%u",
-	      summary.ctrs[ix].num, summary.ctrs[ix].runs);
+      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
+              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
+	      summary.ctrs[ix].num,
+	      summary.ctrs[ix].num_hot_counters,
+	      (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
+	      summary.ctrs[ix].runs);
 
       printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
 	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);