Patchwork: New fdo summary-based icache sensitive unrolling (issue6351086)

Submitter Teresa Johnson
Date July 11, 2012, 5:42 p.m.
Message ID <20120711174300.29CB761419@tjsboxrox.mtv.corp.google.com>
Permalink /patch/170496/
State New

Comments

Teresa Johnson - July 11, 2012, 5:42 p.m.
Ports some patches related to improving FDO program summary information
and using it to guide loop unrolling from google branches to mainline.
The patch is enhanced to add additional summary information to aid
in determining hot/cold decisions.

The original patch description is at:
  http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and further discussion about incorporating onto mainline is at:
  http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414

Honza, can you take a look to see if this patch meets your needs?

Full description:

This patch adds new program summary information to the gcov
profile files that indicates how many profiled counts compose
the majority of the program's execution time. This is used as
an indication of the overall code size of the program.

The new profile summary information is then used to guide
codesize-based unroll and peel decisions, to prevent those
optimizations from increasing code size too much when the
program may be sensitive to icache effects.

This patch also pulls in dependent portions of google/main r187660 that cache
additional loop analysis results in the niter_desc auxiliary information
hanging off the loop structure (the optimization portions of that
change are not included here, and have an outstanding review request
for mainline).

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-07-11  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_exit): Call gcov_compute_cutoff_values and merge new summary
        information.
	* gcc/doc/invoke.texi: Document new options
        -fpeel-codesize-limit and -funroll-codesize-limit, and new params
        unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
	* gcc/gcov-io.c (gcov_write_summary): Write new summary info.
	(gcov_read_summary): Read new summary info.
	* gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
        (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
        hot_cutoff_value.
	* gcc/loop-unroll.c (code_size_limit_factor): New function.
	(decide_unroll_runtime_iterations): Call code_size_limit_factor
        to control the unroll factor, and retrieve number of branches from
        niter_desc instead of via function that walks loop.
	(decide_peel_simple, decide_unroll_stupid): Ditto.
	* gcc/coverage.c (read_counts_file): Propagate new summary info.
	* gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
        function, and add guards to enable this function to work for the
        outermost loop.
	* gcc/common.opt: Add -fpeel-codesize-limit and
        -funroll-codesize-limit.
	* gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
        (num_loop_branches): Remove.
	* gcc/cfgloop.h (struct niter_desc): Added new fields to cache
        additional loop analysis information.
        (num_loop_branches): Remove.
        (analyze_loop_insns): Declare.
	* gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
        (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
	* gcc/gcov-dump.c (tag_summary): Dump new summary info.


--
This patch is available for review at http://codereview.appspot.com/6351086
Teresa Johnson - July 18, 2012, 3:48 p.m.
Ping (retrying ping in plain text mode so that it goes through properly).

Thanks,
Teresa

On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Index: libgcc/libgcov.c
> ===================================================================
> --- libgcc/libgcov.c    (revision 189413)
> +++ libgcc/libgcov.c    (working copy)
> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>    return 1;
>  }
>
> +/* Used by qsort to sort gcov values in descending order.  */
> +
> +static int
> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
> +{
> +  const gcov_type a = *(gcov_type const *)pa;
> +  const gcov_type b = *(gcov_type const *)pb;
> +
> +  if (b > a)
> +    return 1;
> +  else if (b == a)
> +    return 0;
> +  else
> +    return -1;
> +}
> +
> +/* Determines the number of counters required to cover a given percentage
> +   of the total sum of execution counts in the summary, which is then also
> +   recorded in SUM.  */
> +
> +static void
> +gcov_compute_cutoff_values (struct gcov_summary *sum)
> +{
> +  struct gcov_info *gi_ptr;
> +  const struct gcov_fn_info *gfi_ptr;
> +  const struct gcov_ctr_info *ci_ptr;
> +  struct gcov_ctr_summary *cs_ptr;
> +  unsigned t_ix, f_ix, i, ctr_info_ix, index;
> +  gcov_unsigned_t c_num;
> +  gcov_type *value_array;
> +  gcov_type cum, cum_cutoff;
> +  char *cutoff_str;
> +  unsigned cutoff_perc;
> +
> +#define CUM_CUTOFF_PERCENT_TIMES_10 999
> +  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
> +  if (cutoff_str && strlen (cutoff_str))
> +    cutoff_perc = atoi (cutoff_str);
> +  else
> +    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
> +
> +  /* This currently only applies to arc counters.  */
> +  t_ix = GCOV_COUNTER_ARCS;
> +
> +  /* First check if there are any counts recorded for this counter.  */
> +  cs_ptr = &(sum->ctrs[t_ix]);
> +  if (!cs_ptr->num)
> +    return;
> +
> +  /* Determine the cumulative counter value at the specified cutoff
> +     percentage and record the percentage for use by gcov consumers.
> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
> +     and if so, do the divide first.  */
> +  if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
> +    /* Overflow, do the divide first.  */
> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
> +  else
> +    /* Otherwise multiply first to get the correct value for small
> +       values of sum_all.  */
> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
> +
> +  /* Next, walk through all the per-object structures and save each of
> +     the count values in value_array.  */
> +  index = 0;
> +  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
> +    {
> +      if (!gi_ptr->merge[t_ix])
> +        continue;
> +
> +      /* Find the appropriate index into the gcov_ctr_info array
> +         for the counter we are currently working on based on the
> +         existence of the merge function pointer for this object.  */
> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
> +        {
> +          if (gi_ptr->merge[i])
> +            ctr_info_ix++;
> +        }
> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
> +        {
> +          gfi_ptr = gi_ptr->functions[f_ix];
> +
> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
> +            continue;
> +
> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
> +          /* Sanity check that there are enough entries in value_array
> +            for this function's counters. Gracefully handle the case when
> +            there are not, in case something in the profile info is
> +            corrupted.  */
> +          c_num = ci_ptr->num;
> +          if (index + c_num > cs_ptr->num)
> +            c_num = cs_ptr->num - index;
> +          /* Copy over this function's counter values.  */
> +          memcpy (&value_array[index], ci_ptr->values,
> +                  sizeof (gcov_type) * c_num);
> +          index += c_num;
> +        }
> +    }
> +
> +  /* Sort all the counter values by descending value and finally
> +     accumulate the values from hottest on down until reaching
> +     the cutoff value computed earlier.  */
> +  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
> +         sort_by_reverse_gcov_value);
> +  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
> +    cum += value_array[c_num];
> +  /* Record the number of counters required to reach the cutoff value,
> +     as well as the counter value that caused cum to reach the cutoff.  */
> +  cs_ptr->num_hot_counters = c_num;
> +  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
> +  free (value_array);
> +}
> +
>  /* Dump the coverage counts. We merge with existing counts when
>     possible, to avoid growing the .da files ad infinitum. We use this
>     program's checksum to make sure we only accumulate whole program
> @@ -347,6 +461,7 @@ gcov_exit (void)
>             }
>         }
>      }
> +  gcov_compute_cutoff_values (&this_prg);
>
>    {
>      /* Check if the level of dirs to strip off specified. */
> @@ -598,7 +713,14 @@ gcov_exit (void)
>           if (gi_ptr->merge[t_ix])
>             {
>               if (!cs_prg->runs++)
> -               cs_prg->num = cs_tprg->num;
> +                {
> +                  cs_prg->num = cs_tprg->num;
> +                  if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
> +                    {
> +                      cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
> +                      cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
> +                    }
> +                }
>               cs_prg->sum_all += cs_tprg->sum_all;
>               if (cs_prg->run_max < cs_tprg->run_max)
>                 cs_prg->run_max = cs_tprg->run_max;
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi (revision 189413)
> +++ gcc/doc/invoke.texi (working copy)
> @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>  -fprofile-generate=@var{path} @gol
> @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}.
>  -ftree-reassoc @gol
>  -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>  -fwhole-program -fwpa -fuse-linker-plugin @gol
> @@ -8505,6 +8505,14 @@ the loop is entered.  This usually makes programs
>  @option{-funroll-all-loops} implies the same options as
>  @option{-funroll-loops}.
>
> +@item -funroll-codesize-limit
> +@opindex funroll-codesize-limit
> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
> +under estimates of a large code footprint. Enabled by default with
> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
>  @item -fpeel-loops
>  @opindex fpeel-loops
>  Peels loops for which there is enough information that they do not
> @@ -8513,6 +8521,14 @@ roll much (from profile feedback).  It also turns
>
>  Enabled with @option{-fprofile-use}.
>
> +@item -fpeel-codesize-limit
> +@opindex fpeel-codesize-limit
> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
> +under estimates of a large code footprint. Enabled by default with
> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
>  @item -fmove-loop-invariants
>  @opindex fmove-loop-invariants
>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
> @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s
>  @item max-completely-peel-loop-nest-depth
>  The maximum depth of a loop nest suitable for complete peeling.
>
> +@item unrollpeel-codesize-threshold
> +Maximum profile-based code size footprint estimate for loop unrolling and
> +peeling.
> +
> +@item unrollpeel-hotness-threshold
> +Maximum ratio of total execution count to loop entry block count under which
> +most profile-based code size estimates will be ignored for loop unrolling and
> +peeling.
> +
>  @item max-unswitch-insns
>  The maximum number of insns of an unswitched loop.
>
> Index: gcc/gcov-io.c
> ===================================================================
> --- gcc/gcov-io.c       (revision 189413)
> +++ gcc/gcov-io.c       (working copy)
> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>      {
>        gcov_write_unsigned (csum->num);
> +      gcov_write_unsigned (csum->num_hot_counters);
> +      gcov_write_counter (csum->hot_cutoff_value);
>        gcov_write_unsigned (csum->runs);
>        gcov_write_counter (csum->sum_all);
>        gcov_write_counter (csum->run_max);
> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>      {
>        csum->num = gcov_read_unsigned ();
> +      csum->num_hot_counters = gcov_read_unsigned ();
> +      csum->hot_cutoff_value = gcov_read_counter ();
>        csum->runs = gcov_read_unsigned ();
>        csum->sum_all = gcov_read_counter ();
>        csum->run_max = gcov_read_counter ();
> Index: gcc/gcov-io.h
> ===================================================================
> --- gcc/gcov-io.h       (revision 189413)
> +++ gcc/gcov-io.h       (working copy)
> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
>  #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
>  #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
>  #define GCOV_TAG_SUMMARY_LENGTH  \
> -       (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
> +       (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
>
>  /* Counters that are collected.  */
>  #define GCOV_COUNTER_ARCS      0  /* Arc transitions.  */
> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
>  struct gcov_ctr_summary
>  {
>    gcov_unsigned_t num;         /* number of counters.  */
> +  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
> +                                      percent of sum_all.  */
> +  gcov_type hot_cutoff_value;   /* value of smallest counter included in
> +                                   num_hot_counters.  */
>    gcov_unsigned_t runs;                /* number of program runs */
>    gcov_type sum_all;           /* sum of all counters accumulated.  */
>    gcov_type run_max;           /* maximum value on a single run.  */
> Index: gcc/loop-unroll.c
> ===================================================================
> --- gcc/loop-unroll.c   (revision 189413)
> +++ gcc/loop-unroll.c   (working copy)
> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "hashtab.h"
>  #include "recog.h"
>  #include "target.h"
> +#include "gcov-io.h"
>
>  /* This pass performs loop unrolling and peeling.  We only perform these
>     optimizations on innermost loops (with single exception) because
> @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc
>                                              basic_block);
>  static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> +   based on code footprint estimates. Returns the codesize-based factor to be
> +   divided into the max instructions in an unrolled or peeled loop:
> +   1) For size <= threshold, do not limit (by returning 1).
> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> +      unrolled instructions according to loop hotness.
> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
> +      INT_MAX).  */
> +
> +static int
> +code_size_limit_factor (struct loop *loop)
> +{
> +  unsigned size_threshold;
> +  struct niter_desc *desc = get_simple_loop_desc (loop);
> +  gcov_type sum_to_header_ratio;
> +  int hotness_ratio_threshold;
> +  int limit_factor;
> +
> +  /* First check if the application has a large codesize footprint.
> +     This is estimated from FDO profile summary information for the
> +     program, where the num_hot_counters indicates the number of hottest
> +     counters (blocks) that compose most of the execution time of
> +     the program. A large value would indicate a large flat execution
> +     profile where icache misses may be a concern.  */
> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> +  if (!profile_info
> +      || profile_info->num_hot_counters <= size_threshold
> +      || !profile_info->sum_all)
> +    return 1;
> +
> +  /* Next, exclude some loops where unrolling/peeling may be more
> +     important to overall performance.  */
> +
> +  /* Ignore FP loops, which are more likely to benefit heavily from
> +     unrolling. */
> +  if (desc->has_fp)
> +    return 1;
> +
> +  /* Next, set the value of the codesize-based unroll factor divisor which in
> +     most loops will need to be set to a value that will reduce or eliminate
> +     unrolling/peeling.  */
> +  if (profile_info->num_hot_counters < size_threshold * 2)
> +    {
> +      /* For applications that are less than twice the codesize limit, allow
> +         limited unrolling for very hot loops.  */
> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> +      /* When the profile count sum to loop entry header ratio is smaller than
> +         the threshold (i.e. the loop entry is hot enough), the divisor is set
> +         to 1 so the unroll/peel factor is not reduced. When it is bigger
> +         than the ratio, increase the divisor by the amount this ratio
> +         is over the threshold, which will quickly reduce the unroll/peel
> +         factor to zero as the loop's hotness reduces.  */
> +      if (sum_to_header_ratio > hotness_ratio_threshold)
> +        {
> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> +          gcc_assert (limit_factor >= 1);
> +        }
> +      else
> +        limit_factor = 1;
> +    }
> +  else
> +    /* For applications that are at least twice the codesize limit, set
> +       the divisor to a large value that will force the unroll factor to 0.  */
> +    limit_factor = INT_MAX;
> +
> +  return limit_factor;
> +}
> +
>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>  void
>  unroll_and_peel_loops (int flags)
> @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>  {
>    unsigned nunroll, nunroll_by_av, i;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_UNROLL))
>      {
> @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>              "\n;; Considering unrolling loop with runtime "
>              "computable number of iterations\n");
>
> +  if (flag_unroll_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from unrolling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +        {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max unrolled insns by divisor %d\n", limit_factor);
> +        }
> +    }
> +
>    /* nunroll = total number of copies of the original loop body in
>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> +            / loop->ninsns;
> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> +                  / limit_factor / loop->av_ninsns;
>    if (nunroll > nunroll_by_av)
>      nunroll = nunroll_by_av;
>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags)
>  {
>    unsigned npeel;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_PEEL))
>      {
> @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags)
>    if (dump_file)
>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>
> +  if (flag_peel_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from peeling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +       {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max peeled insns by divisor %d\n", limit_factor);
> +       }
> +    }
> +
>    /* npeel = number of iterations to peel.  */
> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
> +          / loop->ninsns;
>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>
> @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags)
>
>    /* Do not simply peel loops with branches inside -- it increases number
>       of mispredicts.  */
> -  if (num_loop_branches (loop) > 1)
> +  if (desc->num_branches > 1)
>      {
>        if (dump_file)
>         fprintf (dump_file, ";; Not peeling, contains branches\n");
> @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>  {
>    unsigned nunroll, nunroll_by_av, i;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_UNROLL_ALL))
>      {
> @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>    if (dump_file)
>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>
> +  if (flag_unroll_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from unrolling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +       {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max unrolled insns by divisor %d\n", limit_factor);
> +       }
> +    }
> +
>    /* nunroll = total number of copies of the original loop body in
>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> -  nunroll_by_av
> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> +            / loop->ninsns;
> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> +                  / limit_factor / loop->av_ninsns;
>    if (nunroll > nunroll_by_av)
>      nunroll = nunroll_by_av;
>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>
>    /* Do not unroll loops with branches inside -- it increases number
>       of mispredicts.  */
> -  if (num_loop_branches (loop) > 1)
> +  if (desc->num_branches > 1)
>      {
>        if (dump_file)
>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
> Index: gcc/coverage.c
> ===================================================================
> --- gcc/coverage.c      (revision 189413)
> +++ gcc/coverage.c      (working copy)
> @@ -245,6 +245,14 @@ read_counts_file (void)
>           gcov_read_summary (&sum);
>           for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
>             {
> +              if (summary.ctrs[ix].num_hot_counters
> +                  < sum.ctrs[ix].num_hot_counters)
> +                {
> +                 summary.ctrs[ix].num_hot_counters
> +                      = sum.ctrs[ix].num_hot_counters;
> +                 summary.ctrs[ix].hot_cutoff_value
> +                      = sum.ctrs[ix].hot_cutoff_value;
> +                }
>               summary.ctrs[ix].runs += sum.ctrs[ix].runs;
>               summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
>               if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
> Index: gcc/loop-iv.c
> ===================================================================
> --- gcc/loop-iv.c       (revision 189413)
> +++ gcc/loop-iv.c       (working copy)
> @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop)
>    /* At least desc->infinite is not always initialized by
>       find_simple_loop_exit.  */
>    desc = XCNEW (struct niter_desc);
> -  iv_analysis_loop_init (loop);
> -  find_simple_exit (loop, desc);
> +  if (loop->latch != EXIT_BLOCK_PTR)
> +    {
> +      iv_analysis_loop_init (loop);
> +      find_simple_exit (loop, desc);
> +    }
> +  analyze_loop_insns (loop, desc);
>    loop->aux = desc;
>
>    if (desc->simple_p && (desc->assumptions || desc->infinite))
> Index: gcc/common.opt
> ===================================================================
> --- gcc/common.opt      (revision 189413)
> +++ gcc/common.opt      (working copy)
> @@ -1551,6 +1551,10 @@ fpcc-struct-return
>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>  Return small aggregates in memory, not registers
>
> +fpeel-codesize-limit
> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
> +
>  fpeel-loops
>  Common Report Var(flag_peel_loops) Optimization
>  Perform loop peeling
> @@ -2108,6 +2112,10 @@ funroll-all-loops
>  Common Report Var(flag_unroll_all_loops) Optimization
>  Perform loop unrolling for all loops
>
> +funroll-codesize-limit
> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
> +
>  ; Nonzero means that loop optimizer may assume that the induction variables
>  ; that control loops do not overflow and that the loops with nontrivial
>  ; exit condition are not infinite
> Index: gcc/cfgloop.c
> ===================================================================
> --- gcc/cfgloop.c       (revision 189413)
> +++ gcc/cfgloop.c       (working copy)
> @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop)
>    return edges;
>  }
>
> -/* Counts the number of conditional branches inside LOOP.  */
> +/* Determine if INSN is a floating point set.  */
>
> -unsigned
> -num_loop_branches (const struct loop *loop)
> +static bool
> +insn_has_fp_set (rtx insn)
>  {
> -  unsigned i, n;
> -  basic_block * body;
> +  int i;
> +  rtx pat = PATTERN (insn);
> +  if (GET_CODE (pat) == SET)
> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
> +  else if (GET_CODE (pat) == PARALLEL)
> +    {
> +      for (i = 0; i < XVECLEN (pat, 0); i++)
> +        {
> +          rtx sub = XVECEXP (pat, 0, i);
> +          if (GET_CODE (sub) == SET)
> +            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
> +        }
> +    }
> +  return false;
> +}
>
> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
> +   the number of conditional branch instructions, calls and fp instructions,
> +   as well as the average number of branches executed per iteration.  */
>
> +void
> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
> +{
> +  unsigned i, nbranch;
> +  gcov_type weighted_nbranch;
> +  bool has_call, has_fp;
> +  basic_block * body, bb;
> +  rtx insn;
> +  gcov_type header_count = loop->header->count;
> +
> +  nbranch = weighted_nbranch = 0;
> +  has_call = has_fp = false;
> +
>    body = get_loop_body (loop);
> -  n = 0;
>    for (i = 0; i < loop->num_nodes; i++)
> -    if (EDGE_COUNT (body[i]->succs) >= 2)
> -      n++;
> +    {
> +      bb = body[i];
> +
> +      if (EDGE_COUNT (bb->succs) >= 2)
> +        {
> +          nbranch++;
> +
> +          /* If this block is executed less frequently than the header (loop
> +             entry), then it is weighted based on its execution count, which
> +             will be turned into a ratio compared to the loop header below. */
> +          if (bb->count < header_count)
> +            weighted_nbranch += bb->count;
> +
> +          /* When it is executed more frequently than the header (i.e. it is
> +             in a nested inner loop), simply weight the branch the same as the
> +             header execution count, so that it will contribute 1 branch to
> +             the ratio computed below. */
> +          else
> +            weighted_nbranch += header_count;
> +        }
> +
> +      /* No need to iterate through the instructions below if
> +         both flags have already been set.  */
> +      if (has_call && has_fp)
> +        continue;
> +
> +      FOR_BB_INSNS (bb, insn)
> +        {
> +          if (!INSN_P (insn))
> +            continue;
> +
> +          if (!has_call)
> +            has_call = CALL_P (insn);
> +
> +          if (!has_fp)
> +            has_fp = insn_has_fp_set (insn);
> +        }
> +    }
>    free (body);
>
> -  return n;
> +  desc->num_branches = nbranch;
> +  /* Now divide the weights computed above by the loop header execution count,
> +     to compute the average number of branches through the loop. By adding
> +     header_count/2 to the numerator we round to nearest with integer
> +     division.  */
> +  if (header_count != 0)
> +    desc->av_num_branches
> +        = (weighted_nbranch + header_count/2) / header_count;
> +  else
> +    desc->av_num_branches = 0;
> +  desc->has_call = has_call;
> +  desc->has_fp = has_fp;
>  }
>
>  /* Adds basic block BB to LOOP.  */
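The `av_num_branches` computation above uses a standard trick for round-to-nearest integer division: adding half the divisor to the numerator before the truncating division. A standalone sketch of just that arithmetic (the function name is illustrative, not taken from the patch):

```c
/* Round-to-nearest integer division for non-negative operands, as used
   above for av_num_branches: adding den/2 to the numerator before the
   truncating division rounds the quotient to the nearest integer
   instead of always rounding toward zero.  */
static long long
round_nearest_div (long long num, long long den)
{
  return (num + den / 2) / den;
}
```

Without the `den / 2` adjustment, a loop whose weighted branch sum is just below a whole multiple of the header count would lose almost a full branch to truncation.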
> Index: gcc/cfgloop.h
> ===================================================================
> --- gcc/cfgloop.h       (revision 189413)
> +++ gcc/cfgloop.h       (working copy)
> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
>
>  extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
>  edge single_exit (const struct loop *);
> -extern unsigned num_loop_branches (const struct loop *);
>
>  extern edge loop_preheader_edge (const struct loop *);
>  extern edge loop_latch_edge (const struct loop *);
> @@ -366,7 +365,8 @@ struct rtx_iv
>  };
>
>  /* The description of an exit from the loop and of the number of iterations
> -   till we take the exit.  */
> +   till we take the exit. Also includes other information used primarily
> +   by the loop unroller.  */
>
>  struct niter_desc
>  {
> @@ -407,6 +407,18 @@ struct niter_desc
>
>    /* The number of iterations of the loop.  */
>    rtx niter_expr;
> +
> +  /* The number of branches in the loop.  */
> +  unsigned num_branches;
> +
> +  /* The number of executed branches per iteration.  */
> +  unsigned av_num_branches;
> +
> +  /* Whether the loop contains a call instruction.  */
> +  bool has_call;
> +
> +  /* Whether the loop contains fp instructions.  */
> +  bool has_fp;
>  };
>
>  extern void iv_analysis_loop_init (struct loop *);
> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
>
>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>  extern void free_simple_loop_desc (struct loop *loop);
> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>
>  static inline struct niter_desc *
>  simple_loop_desc (struct loop *loop)
> Index: gcc/params.def
> ===================================================================
> --- gcc/params.def      (revision 189413)
> +++ gcc/params.def      (working copy)
> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>          "max-completely-peel-loop-nest-depth",
>          "The maximum depth of a loop nest we completely peel",
>          8, 0, 0)
> +/* The maximum code size estimate under which loop unrolling and peeling
> +   is allowed in a profile feedback compile.  This currently applies to loops
> +   with non-constant iteration counts and no floating point computations.  */
> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
> +        "unrollpeel-codesize-threshold",
> +        "Maximum profile-based code size footprint estimate for loop unrolling "
> +        "and peeling",
> +        15000, 0, 0)
> +/* The maximum ratio of total profiled execution counts to loop entry block
> +   count that must be exceeded to ignore most code size limits when unrolling
> +   and peeling.  */
> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
> +        "unrollpeel-hotness-threshold",
> +        "Maximum ratio of total profiled execution count to loop entry "
> +        "block count under which most codesize limits for unrolling and "
> +        "peeling will be ignored",
> +        100, 1, 0)
>
>  /* The maximum number of insns of an unswitched loop.  */
>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
> Index: gcc/gcov-dump.c
> ===================================================================
> --- gcc/gcov-dump.c     (revision 189413)
> +++ gcc/gcov-dump.c     (working copy)
> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
>      {
>        printf ("\n");
>        print_prefix (filename, 0, 0);
> -      printf ("\t\tcounts=%u, runs=%u",
> -             summary.ctrs[ix].num, summary.ctrs[ix].runs);
> +      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
> +              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
> +             summary.ctrs[ix].num,
> +             summary.ctrs[ix].num_hot_counters,
> +             (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
> +             summary.ctrs[ix].runs);
>
>        printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
>               (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
>
> --
> This patch is available for review at http://codereview.appspot.com/6351086
Teresa Johnson - July 26, 2012, 3:57 p.m.
Ping.

Teresa

On Wed, Jul 18, 2012 at 8:48 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Ping (retrying ping in plain text mode so that it goes through properly).
>
> Thanks,
> Teresa
>
> On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote:
>> Ports some patches related to improving FDO program summary information
>> and using it to guide loop unrolling from google branches to mainline.
>> The patch is enhanced to add additional summary information to aid
>> in determining hot/cold decisions.
>>
>> The original patch description is at:
>>   http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
>> and further discussion about incorporating onto mainline is at:
>>   http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414
>>
>> Honza, can you take a look to see if this patch meets your needs?
>>
>> Full description:
>>
>> This patch adds new program summary information to the gcov
>> profile files that indicate how many profiled counts compose
>> the majority of the program's execution time. This is used to
>> provide an indication of the overall code size of the program.
>>
>> The new profile summary information is then used to guide
>> codesize based unroll and peel decisions, to prevent those
>> optimizations from increasing code size too much when the
>> program may be sensitive to icache effects.
>>
>> This patch also pulls in dependent portions of google/main r187660 that cache
>> additional loop analysis results in the niter_desc auxiliary information
>> hanging off the loop structure (the optimization portions of that
>> change are not included here, and have an outstanding review request
>> for mainline).
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>
>> Thanks,
>> Teresa
>>
>> 2012-07-11  Teresa Johnson  <tejohnson@google.com>
>>
>>         * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
>>         (gcov_compute_cutoff_values): Ditto.
>>         (gcov_exit): Call gcov_compute_cutoff_values and merge new summary
>>         information.
>>         * gcc/doc/invoke.texi (roll much): Document new options
>>         -fpeel-codesize-limit and -funroll-codesize-limit, and new params
>>         codesize-hotness-threshold and unrollpeel-hotness-threshold.
>>         * gcc/gcov-io.c (gcov_write_summary): Write new summary info.
>>         (gcov_read_summary): Read new summary info.
>>         * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
>>         (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
>>         hot_cutoff_value.
>>         * gcc/loop-unroll.c (code_size_limit_factor): New function.
>>         (decide_unroll_runtime_iterations): Call code_size_limit_factor
>>         to control the unroll factor, and retrieve number of branches from
>>         niter_desc instead of via function that walks loop.
>>         (decide_peel_simple, decide_unroll_stupid): Ditto.
>>         * gcc/coverage.c (read_counts_file): Propagate new summary info.
>>         * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
>>         function, and add guards to enable this function to work for the
>>         outermost loop.
>>         * gcc/common.opt: Add -fpeel-codesize-limit and
>>         -funroll-codesize-limit.
>>         * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
>>         (num_loop_branches): Remove.
>>         * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
>>         additional loop analysis information.
>>         (num_loop_branches): Remove.
>>         (analyze_loop_insns): Declare.
>>         * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
>>         (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
>>         * gcc/gcov-dump.c (tag_summary): Dump new summary info.
>>
>> Index: libgcc/libgcov.c
>> ===================================================================
>> --- libgcc/libgcov.c    (revision 189413)
>> +++ libgcc/libgcov.c    (working copy)
>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>>    return 1;
>>  }
>>
>> +/* Used by qsort to sort gcov values in descending order.  */
>> +
>> +static int
>> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
>> +{
>> +  const gcov_type a = *(gcov_type const *)pa;
>> +  const gcov_type b = *(gcov_type const *)pb;
>> +
>> +  if (b > a)
>> +    return 1;
>> +  else if (b == a)
>> +    return 0;
>> +  else
>> +    return -1;
>> +}
>> +
>> +/* Determines the number of counters required to cover a given percentage
>> +   of the total sum of execution counts in the summary, which is then also
>> +   recorded in SUM.  */
>> +
>> +static void
>> +gcov_compute_cutoff_values (struct gcov_summary *sum)
>> +{
>> +  struct gcov_info *gi_ptr;
>> +  const struct gcov_fn_info *gfi_ptr;
>> +  const struct gcov_ctr_info *ci_ptr;
>> +  struct gcov_ctr_summary *cs_ptr;
>> +  unsigned t_ix, f_ix, i, ctr_info_ix, index;
>> +  gcov_unsigned_t c_num;
>> +  gcov_type *value_array;
>> +  gcov_type cum, cum_cutoff;
>> +  char *cutoff_str;
>> +  unsigned cutoff_perc;
>> +
>> +#define CUM_CUTOFF_PERCENT_TIMES_10 999
>> +  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
>> +  if (cutoff_str && strlen (cutoff_str))
>> +    cutoff_perc = atoi (cutoff_str);
>> +  else
>> +    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
>> +
>> +  /* This currently only applies to arc counters.  */
>> +  t_ix = GCOV_COUNTER_ARCS;
>> +
>> +  /* First check if there are any counts recorded for this counter.  */
>> +  cs_ptr = &(sum->ctrs[t_ix]);
>> +  if (!cs_ptr->num)
>> +    return;
>> +
>> +  /* Determine the cumulative counter value at the specified cutoff
>> +     percentage and record the percentage for use by gcov consumers.
>> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
>> +     and if so, do the divide first.  */
>> +  if (cs_ptr->sum_all * cutoff_perc < cs_ptr->sum_all)
>> +    /* Overflow, do the divide first.  */
>> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
>> +  else
>> +    /* Otherwise multiply first to get the correct value for small
>> +       values of sum_all.  */
>> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
>> +
>> +  /* Next, walk through all the per-object structures and save each of
>> +     the count values in value_array.  */
>> +  index = 0;
>> +  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
>> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
>> +    {
>> +      if (!gi_ptr->merge[t_ix])
>> +        continue;
>> +
>> +      /* Find the appropriate index into the gcov_ctr_info array
>> +         for the counter we are currently working on based on the
>> +         existence of the merge function pointer for this object.  */
>> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
>> +        {
>> +          if (gi_ptr->merge[i])
>> +            ctr_info_ix++;
>> +        }
>> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
>> +        {
>> +          gfi_ptr = gi_ptr->functions[f_ix];
>> +
>> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
>> +            continue;
>> +
>> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
>> +          /* Sanity check that there are enough entries in value_array
>> +             for this function's counters.  Gracefully handle the case when
>> +             there are not, in case something in the profile info is
>> +             corrupted.  */
>> +          c_num = ci_ptr->num;
>> +          if (index + c_num > cs_ptr->num)
>> +            c_num = cs_ptr->num - index;
>> +          /* Copy over this function's counter values.  */
>> +          memcpy (&value_array[index], ci_ptr->values,
>> +                  sizeof (gcov_type) * c_num);
>> +          index += c_num;
>> +        }
>> +    }
>> +
>> +  /* Sort all the counter values by descending value and finally
>> +     accumulate the values from hottest on down until reaching
>> +     the cutoff value computed earlier.  */
>> +  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
>> +         sort_by_reverse_gcov_value);
>> +  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
>> +    cum += value_array[c_num];
>> +  /* Record the number of counters required to reach the cutoff value,
>> +     as well as the counter value that caused cum to reach the cutoff.  */
>> +  cs_ptr->num_hot_counters = c_num;
>> +  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
>> +  free (value_array);
>> +}
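The heart of gcov_compute_cutoff_values above is: sum all arc counters, sort them in descending order, then accumulate from the hottest down until a chosen fraction of sum_all (in tenths of a percent) is covered. A minimal standalone sketch of that accumulation — the names and the flat input array are illustrative assumptions, not the patch's actual data structures, and the overflow-guarded multiply is simplified away:

```c
#include <stdlib.h>

/* Descending comparator for 64-bit counter values; the subtraction-free
   form avoids overflow and still returns negative/zero/positive.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  long long a = *(const long long *) pa;
  long long b = *(const long long *) pb;
  return (b > a) - (b < a);
}

/* Return how many of the hottest counters in VALS are needed to cover
   CUTOFF_TIMES_10 tenths of a percent of the total, storing the smallest
   included value in *CUTOFF_VALUE.  Sorts VALS in place.  */
static unsigned
hot_counters (long long *vals, unsigned n, unsigned cutoff_times_10,
              long long *cutoff_value)
{
  long long sum = 0, cum = 0, target;
  unsigned i;

  for (i = 0; i < n; i++)
    sum += vals[i];
  target = sum * cutoff_times_10 / 1000;  /* assumes no overflow here */

  qsort (vals, n, sizeof (long long), cmp_desc);
  for (i = 0; i < n && cum < target; i++)
    cum += vals[i];

  *cutoff_value = i > 0 ? vals[i - 1] : 0;
  return i;
}
```

With a sharply peaked profile like {90, 5, 3, 2} and a 90% cutoff, a single counter suffices; at the default 99.9% cutoff nearly all counters are needed, which is exactly the "large flat profile" signal the unroller later consumes.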
>> +
>>  /* Dump the coverage counts. We merge with existing counts when
>>     possible, to avoid growing the .da files ad infinitum. We use this
>>     program's checksum to make sure we only accumulate whole program
>> @@ -347,6 +461,7 @@ gcov_exit (void)
>>             }
>>         }
>>      }
>> +  gcov_compute_cutoff_values (&this_prg);
>>
>>    {
>>      /* Check if the level of dirs to strip off specified. */
>> @@ -598,7 +713,14 @@ gcov_exit (void)
>>           if (gi_ptr->merge[t_ix])
>>             {
>>               if (!cs_prg->runs++)
>> -               cs_prg->num = cs_tprg->num;
>> +                {
>> +                  cs_prg->num = cs_tprg->num;
>> +                  if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
>> +                    {
>> +                      cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
>> +                      cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
>> +                    }
>> +                }
>>               cs_prg->sum_all += cs_tprg->sum_all;
>>               if (cs_prg->run_max < cs_tprg->run_max)
>>                 cs_prg->run_max = cs_tprg->run_max;
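When folding per-run summaries together in gcov_exit (and again when reading them back in coverage.c's read_counts_file), the patch keeps the hot-counter data from whichever run reported the larger num_hot_counters, i.e. the flattest observed profile, and carries that run's cutoff value along with it. A hedged sketch of that merge rule, using a simplified stand-in struct rather than the real gcov_ctr_summary:

```c
/* Simplified stand-in for the two new gcov_ctr_summary fields.  */
struct hot_info
{
  unsigned num_hot_counters;
  long long hot_cutoff_value;
};

/* Fold one run's hot-counter info into the accumulated program summary:
   the run needing more counters to cover the cutoff (a flatter, larger
   footprint) wins, and its cutoff value travels with it.  */
static void
merge_hot_info (struct hot_info *prg, const struct hot_info *run)
{
  if (run->num_hot_counters > prg->num_hot_counters)
    {
      prg->num_hot_counters = run->num_hot_counters;
      prg->hot_cutoff_value = run->hot_cutoff_value;
    }
}
```

Taking the maximum rather than an average makes the estimate conservative: if any run exercised a large flat footprint, unrolling stays limited.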
>> Index: gcc/doc/invoke.texi
>> ===================================================================
>> --- gcc/doc/invoke.texi (revision 189413)
>> +++ gcc/doc/invoke.texi (working copy)
>> @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>>  -fprefetch-loop-arrays @gol
>>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>>  -fprofile-generate=@var{path} @gol
>> @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -ftree-reassoc @gol
>>  -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
>>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>>  -fwhole-program -fwpa -fuse-linker-plugin @gol
>> @@ -8505,6 +8505,14 @@ the loop is entered.  This usually makes programs
>>  @option{-funroll-all-loops} implies the same options as
>>  @option{-funroll-loops}.
>>
>> +@item -funroll-codesize-limit
>> +@opindex funroll-codesize-limit
>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>> +
>>  @item -fpeel-loops
>>  @opindex fpeel-loops
>>  Peels loops for which there is enough information that they do not
>> @@ -8513,6 +8521,14 @@ roll much (from profile feedback).  It also turns
>>
>>  Enabled with @option{-fprofile-use}.
>>
>> +@item -fpeel-codesize-limit
>> +@opindex fpeel-codesize-limit
>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>> +
>>  @item -fmove-loop-invariants
>>  @opindex fmove-loop-invariants
>>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
>> @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s
>>  @item max-completely-peel-loop-nest-depth
>>  The maximum depth of a loop nest suitable for complete peeling.
>>
>> +@item unrollpeel-codesize-threshold
>> +Maximum profile-based code size footprint estimate for loop unrolling and
>> +peeling.
>> +
>> +@item unrollpeel-hotness-threshold
>> +Maximum ratio of total execution count to loop entry block count under which
>> +most profile-based code size estimates will be ignored for loop unrolling and
>> +peeling.
>> +
>>  @item max-unswitch-insns
>>  The maximum number of insns of an unswitched loop.
>>
>> Index: gcc/gcov-io.c
>> ===================================================================
>> --- gcc/gcov-io.c       (revision 189413)
>> +++ gcc/gcov-io.c       (working copy)
>> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
>>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>>      {
>>        gcov_write_unsigned (csum->num);
>> +      gcov_write_unsigned (csum->num_hot_counters);
>> +      gcov_write_counter (csum->hot_cutoff_value);
>>        gcov_write_unsigned (csum->runs);
>>        gcov_write_counter (csum->sum_all);
>>        gcov_write_counter (csum->run_max);
>> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
>>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>>      {
>>        csum->num = gcov_read_unsigned ();
>> +      csum->num_hot_counters = gcov_read_unsigned ();
>> +      csum->hot_cutoff_value = gcov_read_counter ();
>>        csum->runs = gcov_read_unsigned ();
>>        csum->sum_all = gcov_read_counter ();
>>        csum->run_max = gcov_read_counter ();
>> Index: gcc/gcov-io.h
>> ===================================================================
>> --- gcc/gcov-io.h       (revision 189413)
>> +++ gcc/gcov-io.h       (working copy)
>> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
>>  #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
>>  #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
>>  #define GCOV_TAG_SUMMARY_LENGTH  \
>> -       (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
>> +       (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
>>
>>  /* Counters that are collected.  */
>>  #define GCOV_COUNTER_ARCS      0  /* Arc transitions.  */
>> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
>>  struct gcov_ctr_summary
>>  {
>>    gcov_unsigned_t num;         /* number of counters.  */
>> +  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
>> +                                      percent of sum_all.  */
>> +  gcov_type hot_cutoff_value;   /* value of smallest counter included in
>> +                                   num_hot_counters.  */
>>    gcov_unsigned_t runs;                /* number of program runs */
>>    gcov_type sum_all;           /* sum of all counters accumulated.  */
>>    gcov_type run_max;           /* maximum value on a single run.  */
>> Index: gcc/loop-unroll.c
>> ===================================================================
>> --- gcc/loop-unroll.c   (revision 189413)
>> +++ gcc/loop-unroll.c   (working copy)
>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "hashtab.h"
>>  #include "recog.h"
>>  #include "target.h"
>> +#include "gcov-io.h"
>>
>>  /* This pass performs loop unrolling and peeling.  We only perform these
>>     optimizations on innermost loops (with single exception) because
>> @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc
>>                                              basic_block);
>>  static rtx get_expansion (struct var_to_expand *);
>>
>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
>> +   based on code footprint estimates. Returns the codesize-based factor to be
>> +   divided into the max instructions in an unrolled or peeled loop:
>> +   1) For size <= threshold, do not limit (by returning 1).
>> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
>> +      unrolled instructions according to loop hotness.
>> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
>> +      INT_MAX).  */
>> +
>> +static int
>> +code_size_limit_factor (struct loop *loop)
>> +{
>> +  unsigned size_threshold;
>> +  struct niter_desc *desc = get_simple_loop_desc (loop);
>> +  gcov_type sum_to_header_ratio;
>> +  int hotness_ratio_threshold;
>> +  int limit_factor;
>> +
>> +  /* First check if the application has a large codesize footprint.
>> +     This is estimated from FDO profile summary information for the
>> +     program, where the num_hot_counters indicates the number of hottest
>> +     counters (blocks) that compose most of the execution time of
>> +     the program. A large value would indicate a large flat execution
>> +     profile where icache misses may be a concern.  */
>> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
>> +  if (!profile_info
>> +      || profile_info->num_hot_counters <= size_threshold
>> +      || !profile_info->sum_all)
>> +    return 1;
>> +
>> +  /* Next, exclude some loops where unrolling/peeling may be more
>> +     important to overall performance.  */
>> +
>> +  /* Ignore FP loops, which are more likely to benefit heavily from
>> +     unrolling. */
>> +  if (desc->has_fp)
>> +    return 1;
>> +
>> +  /* Next, set the value of the codesize-based unroll factor divisor which in
>> +     most loops will need to be set to a value that will reduce or eliminate
>> +     unrolling/peeling.  */
>> +  if (profile_info->num_hot_counters < size_threshold * 2)
>> +    {
>> +      /* For applications that are less than twice the codesize limit, allow
>> +         limited unrolling for very hot loops.  */
>> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
>> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
>> +      /* When the ratio of the profiled count sum to the loop entry header
>> +         count is smaller than the threshold (i.e. the loop entry is hot
>> +         enough), the divisor is set to 1 so the unroll/peel factor is not
>> +         reduced.  When the ratio exceeds the threshold, increase the
>> +         divisor by the amount it is over the threshold, which quickly
>> +         reduces the unroll/peel factor to zero as the loop's hotness
>> +         decreases.  */
>> +      if (sum_to_header_ratio > hotness_ratio_threshold)
>> +        {
>> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
>> +          gcc_assert (limit_factor >= 1);
>> +        }
>> +      else
>> +        limit_factor = 1;
>> +    }
>> +  else
>> +    /* For applications that are at least twice the codesize limit, set
>> +       the divisor to a large value that will force the unroll factor to 0.  */
>> +    limit_factor = INT_MAX;
>> +
>> +  return limit_factor;
>> +}
>> +
>>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>>  void
>>  unroll_and_peel_loops (int flags)
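code_size_limit_factor above implements three regimes keyed off the estimated code footprint. A standalone sketch of just the arithmetic, with the parameters passed in explicitly; the real function additionally bails out early for FP loops and reads these values from profile_info, the loop structure, and the params machinery:

```c
#include <limits.h>

/* Divisor applied to the max-unrolled/peeled-insns params:
   1) footprint <= threshold: no limiting (divisor 1);
   2) threshold < footprint < 2*threshold: scale the divisor by how far
      the whole-program count sum exceeds hotness_threshold times the
      loop entry count, so colder loops are limited more aggressively;
   3) footprint >= 2*threshold: effectively disable (INT_MAX).  */
static int
limit_factor_sketch (unsigned num_hot, unsigned size_threshold,
                     long long sum_all, long long header_count,
                     int hotness_threshold)
{
  long long ratio;

  if (num_hot <= size_threshold)
    return 1;
  if (num_hot >= 2 * size_threshold)
    return INT_MAX;

  ratio = sum_all / header_count;
  return ratio > hotness_threshold ? (int) (ratio / hotness_threshold) : 1;
}
```

With the default params (unrollpeel-codesize-threshold 15000, unrollpeel-hotness-threshold 100), a loop whose entry count is at least 1% of sum_all keeps a divisor of 1 even in the middle regime.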
>> @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>>  {
>>    unsigned nunroll, nunroll_by_av, i;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_UNROLL))
>>      {
>> @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>>              "\n;; Considering unrolling loop with runtime "
>>              "computable number of iterations\n");
>>
>> +  if (flag_unroll_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from unrolling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +        {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>> +        }
>> +    }
>> +
>>    /* nunroll = total number of copies of the original loop body in
>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>> +            / loop->ninsns;
>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>> +                  / limit_factor / loop->av_ninsns;
>>    if (nunroll > nunroll_by_av)
>>      nunroll = nunroll_by_av;
>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>> @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>  {
>>    unsigned npeel;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_PEEL))
>>      {
>> @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags)
>>    if (dump_file)
>>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>>
>> +  if (flag_peel_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from peeling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +       {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max peeled insns by divisor %d\n", limit_factor);
>> +       }
>> +    }
>> +
>>    /* npeel = number of iterations to peel.  */
>> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
>> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
>> +          / loop->ninsns;
>>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>>
>> @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>
>>    /* Do not simply peel loops with branches inside -- it increases number
>>       of mispredicts.  */
>> -  if (num_loop_branches (loop) > 1)
>> +  if (desc->num_branches > 1)
>>      {
>>        if (dump_file)
>>         fprintf (dump_file, ";; Not peeling, contains branches\n");
>> @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>  {
>>    unsigned nunroll, nunroll_by_av, i;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_UNROLL_ALL))
>>      {
>> @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>>    if (dump_file)
>>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>>
>> +  if (flag_unroll_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from unrolling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +       {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>> +       }
>> +    }
>> +
>>    /* nunroll = total number of copies of the original loop body in
>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>> -  nunroll_by_av
>> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>> +            / loop->ninsns;
>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>> +                  / limit_factor / loop->av_ninsns;
>>    if (nunroll > nunroll_by_av)
>>      nunroll = nunroll_by_av;
>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>> @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>
>>    /* Do not unroll loops with branches inside -- it increases number
>>       of mispredicts.  */
>> -  if (num_loop_branches (loop) > 1)
>> +  if (desc->num_branches > 1)
>>      {
>>        if (dump_file)
>>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
>> Index: gcc/coverage.c
>> ===================================================================
>> --- gcc/coverage.c      (revision 189413)
>> +++ gcc/coverage.c      (working copy)
>> @@ -245,6 +245,14 @@ read_counts_file (void)
>>           gcov_read_summary (&sum);
>>           for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
>>             {
>> +              if (summary.ctrs[ix].num_hot_counters
>> +                  < sum.ctrs[ix].num_hot_counters)
>> +                {
>> +                 summary.ctrs[ix].num_hot_counters
>> +                      = sum.ctrs[ix].num_hot_counters;
>> +                 summary.ctrs[ix].hot_cutoff_value
>> +                      = sum.ctrs[ix].hot_cutoff_value;
>> +                }
>>               summary.ctrs[ix].runs += sum.ctrs[ix].runs;
>>               summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
>>               if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
>> Index: gcc/loop-iv.c
>> ===================================================================
>> --- gcc/loop-iv.c       (revision 189413)
>> +++ gcc/loop-iv.c       (working copy)
>> @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop)
>>    /* At least desc->infinite is not always initialized by
>>       find_simple_loop_exit.  */
>>    desc = XCNEW (struct niter_desc);
>> -  iv_analysis_loop_init (loop);
>> -  find_simple_exit (loop, desc);
>> +  if (loop->latch != EXIT_BLOCK_PTR)
>> +    {
>> +      iv_analysis_loop_init (loop);
>> +      find_simple_exit (loop, desc);
>> +    }
>> +  analyze_loop_insns (loop, desc);
>>    loop->aux = desc;
>>
>>    if (desc->simple_p && (desc->assumptions || desc->infinite))
>> Index: gcc/common.opt
>> ===================================================================
>> --- gcc/common.opt      (revision 189413)
>> +++ gcc/common.opt      (working copy)
>> @@ -1551,6 +1551,10 @@ fpcc-struct-return
>>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>>  Return small aggregates in memory, not registers
>>
>> +fpeel-codesize-limit
>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
>> +
>>  fpeel-loops
>>  Common Report Var(flag_peel_loops) Optimization
>>  Perform loop peeling
>> @@ -2108,6 +2112,10 @@ funroll-all-loops
>>  Common Report Var(flag_unroll_all_loops) Optimization
>>  Perform loop unrolling for all loops
>>
>> +funroll-codesize-limit
>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
>> +
>>  ; Nonzero means that loop optimizer may assume that the induction variables
>>  ; that control loops do not overflow and that the loops with nontrivial
>>  ; exit condition are not infinite
>> Index: gcc/cfgloop.c
>> ===================================================================
>> --- gcc/cfgloop.c       (revision 189413)
>> +++ gcc/cfgloop.c       (working copy)
>> @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop)
>>    return edges;
>>  }
>>
>> -/* Counts the number of conditional branches inside LOOP.  */
>> +/* Determine if INSN is a floating point set.  */
>>
>> -unsigned
>> -num_loop_branches (const struct loop *loop)
>> +static bool
>> +insn_has_fp_set(rtx insn)
>>  {
>> -  unsigned i, n;
>> -  basic_block * body;
>> +  int i;
>> +  rtx pat = PATTERN(insn);
>> +  if (GET_CODE (pat) == SET)
>> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
>> +  else if (GET_CODE (pat) == PARALLEL)
>> +    {
>> +      for (i = 0; i < XVECLEN (pat, 0); i++)
>> +        {
>> +          rtx sub = XVECEXP (pat, 0, i);
>> +          if (GET_CODE (sub) == SET
>> +              && FLOAT_MODE_P (GET_MODE (SET_DEST (sub))))
>> +            return true;
>> +        }
>> +    }
>> +  return false;
>> +}
>>
>> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
>> +   the number of conditional branch instructions, calls and fp instructions,
>> +   as well as the average number of branches executed per iteration.  */
>>
>> +void
>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
>> +{
>> +  unsigned i, nbranch;
>> +  gcov_type weighted_nbranch;
>> +  bool has_call, has_fp;
>> +  basic_block * body, bb;
>> +  rtx insn;
>> +  gcov_type header_count = loop->header->count;
>> +
>> +  nbranch = weighted_nbranch = 0;
>> +  has_call = has_fp = false;
>> +
>>    body = get_loop_body (loop);
>> -  n = 0;
>>    for (i = 0; i < loop->num_nodes; i++)
>> -    if (EDGE_COUNT (body[i]->succs) >= 2)
>> -      n++;
>> +    {
>> +      bb = body[i];
>> +
>> +      if (EDGE_COUNT (bb->succs) >= 2)
>> +        {
>> +          nbranch++;
>> +
>> +          /* If this block is executed less frequently than the header (loop
>> +             entry), then it is weighted based on its execution count, which
>> +             will be turned into a ratio compared to the loop header below. */
>> +          if (bb->count < header_count)
>> +            weighted_nbranch += bb->count;
>> +
>> +          /* When it is executed more frequently than the header (i.e. it is
>> +             in a nested inner loop), simply weight the branch the same as the
>> +             header execution count, so that it will contribute 1 branch to
>> +             the ratio computed below. */
>> +          else
>> +            weighted_nbranch += header_count;
>> +        }
>> +
>> +      /* No need to iterate through the instructions below if
>> +         both flags have already been set.  */
>> +      if (has_call && has_fp)
>> +        continue;
>> +
>> +      FOR_BB_INSNS (bb, insn)
>> +        {
>> +          if (!INSN_P (insn))
>> +            continue;
>> +
>> +          if (!has_call)
>> +            has_call = CALL_P (insn);
>> +
>> +          if (!has_fp)
>> +            has_fp = insn_has_fp_set (insn);
>> +        }
>> +    }
>>    free (body);
>>
>> -  return n;
>> +  desc->num_branches = nbranch;
>> +  /* Now divide the weights computed above by the loop header execution count,
>> +     to compute the average number of branches through the loop. By adding
>> +     header_count/2 to the numerator we round to nearest with integer
>> +     division.  */
>> +  if (header_count  != 0)
>> +    desc->av_num_branches
>> +        = (weighted_nbranch + header_count/2) / header_count;
>> +  else
>> +    desc->av_num_branches = 0;
>> +  desc->has_call = has_call;
>> +  desc->has_fp = has_fp;
>>  }
>>
>>  /* Adds basic block BB to LOOP.  */
>> Index: gcc/cfgloop.h
>> ===================================================================
>> --- gcc/cfgloop.h       (revision 189413)
>> +++ gcc/cfgloop.h       (working copy)
>> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
>>
>>  extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
>>  edge single_exit (const struct loop *);
>> -extern unsigned num_loop_branches (const struct loop *);
>>
>>  extern edge loop_preheader_edge (const struct loop *);
>>  extern edge loop_latch_edge (const struct loop *);
>> @@ -366,7 +365,8 @@ struct rtx_iv
>>  };
>>
>>  /* The description of an exit from the loop and of the number of iterations
>> -   till we take the exit.  */
>> +   till we take the exit. Also includes other information used primarily
>> +   by the loop unroller.  */
>>
>>  struct niter_desc
>>  {
>> @@ -407,6 +407,18 @@ struct niter_desc
>>
>>    /* The number of iterations of the loop.  */
>>    rtx niter_expr;
>> +
>> +  /* The number of branches in the loop.  */
>> +  unsigned num_branches;
>> +
>> +  /* The number of executed branches per iteration.  */
>> +  unsigned av_num_branches;
>> +
>> +  /* Whether the loop contains a call instruction.  */
>> +  bool has_call;
>> +
>> +  /* Whether the loop contains fp instructions.  */
>> +  bool has_fp;
>>  };
>>
>>  extern void iv_analysis_loop_init (struct loop *);
>> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
>>
>>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>>  extern void free_simple_loop_desc (struct loop *loop);
>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>>
>>  static inline struct niter_desc *
>>  simple_loop_desc (struct loop *loop)
>> Index: gcc/params.def
>> ===================================================================
>> --- gcc/params.def      (revision 189413)
>> +++ gcc/params.def      (working copy)
>> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>>          "max-completely-peel-loop-nest-depth",
>>          "The maximum depth of a loop nest we completely peel",
>>          8, 0, 0)
>> +/* The maximum code size estimate under which loop unrolling and peeling
>> + * is allowed in a profile feedback compile. This currently applies to loops
>> + * with non-constant iteration counts and no floating point computations.  */
>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
>> +        "unrollpeel-codesize-threshold",
>> +        "Maximum profile-based code size footprint estimate for loop unrolling "
>> +        "and peeling",
>> +        15000, 0, 0)
>> +/* The maximum ratio of total profiled execution counts to loop entry block
>> +   count that must be exceeded to ignore most code size limits when unrolling
>> +   and peeling.  */
>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
>> +        "unrollpeel-hotness-threshold",
>> +        "Maximum ratio of total profiled execution count to loop entry "
>> +        "block count under which most codesize limits for unrolling and "
>> +        "peeling will be ignored",
>> +        100, 1, 0)
>>
>>  /* The maximum number of insns of an unswitched loop.  */
>>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
>> Index: gcc/gcov-dump.c
>> ===================================================================
>> --- gcc/gcov-dump.c     (revision 189413)
>> +++ gcc/gcov-dump.c     (working copy)
>> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
>>      {
>>        printf ("\n");
>>        print_prefix (filename, 0, 0);
>> -      printf ("\t\tcounts=%u, runs=%u",
>> -             summary.ctrs[ix].num, summary.ctrs[ix].runs);
>> +      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
>> +              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
>> +             summary.ctrs[ix].num,
>> +             summary.ctrs[ix].num_hot_counters,
>> +             (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
>> +             summary.ctrs[ix].runs);
>>
>>        printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
>>               (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
>>
>> --
>> This patch is available for review at http://codereview.appspot.com/6351086
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Patch

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189413)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,120 @@  gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Used by qsort to sort gcov values in descending order.  */
+
+static int
+sort_by_reverse_gcov_value (const void *pa, const void *pb)
+{
+  const gcov_type a = *(gcov_type const *)pa;
+  const gcov_type b = *(gcov_type const *)pb;
+
+  if (b > a)
+    return 1;
+  else if (b == a)
+    return 0;
+  else
+    return -1;
+}
+
+/* Determines the number of counters required to cover a given percentage
+   of the total sum of execution counts in the summary, which is then also
+   recorded in SUM.  */
+
+static void
+gcov_compute_cutoff_values (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, i, ctr_info_ix, index;
+  gcov_unsigned_t c_num;
+  gcov_type *value_array;
+  gcov_type cum, cum_cutoff;
+  char *cutoff_str;
+  unsigned cutoff_perc;
+
+#define CUM_CUTOFF_PERCENT_TIMES_10 999
+  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
+  if (cutoff_str && strlen (cutoff_str))
+    cutoff_perc = atoi (cutoff_str);
+  else
+    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  /* Determine the cumulative counter value at the specified cutoff
+     percentage and record the percentage for use by gcov consumers.
+     Check for overflow when sum_all is multiplied by the cutoff_perc,
+     and if so, do the divide first.  */
+  if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
+    /* Overflow, do the divide first.  */
+    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
+  else
+    /* Otherwise multiply first to get the correct value for small
+       values of sum_all.  */
+    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
+
+  /* Next, walk through all the per-object structures and save each of
+     the count values in value_array.  */
+  index = 0;
+  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
+        {
+          if (gi_ptr->merge[i])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          /* Sanity check that there are enough entries in value_array
+             for this function's counters.  Gracefully handle the case
+             when there are not, in case something in the profile info
+             is corrupted.  */
+          c_num = ci_ptr->num;
+          if (index + c_num > cs_ptr->num)
+            c_num = cs_ptr->num - index;
+          /* Copy over this function's counter values.  */
+          memcpy (&value_array[index], ci_ptr->values,
+                  sizeof (gcov_type) * c_num);
+          index += c_num;
+        }
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier.  */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+         sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value,
+     as well as the counter value that caused cum to reach the cutoff.  */
+  cs_ptr->num_hot_counters = c_num;
+  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
+  free (value_array);
+}
+
 /* Dump the coverage counts. We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum. We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +461,7 @@  gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_cutoff_values (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified. */
@@ -598,7 +713,14 @@  gcov_exit (void)
 	  if (gi_ptr->merge[t_ix])
 	    {
 	      if (!cs_prg->runs++)
-		cs_prg->num = cs_tprg->num;
+                {
+                  cs_prg->num = cs_tprg->num;
+                  if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
+                    {
+                      cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
+                      cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
+                    }
+                }
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
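The sort-and-accumulate step that gcov_compute_cutoff_values performs can be sketched as a standalone C function. This is only an illustrative model: the names (compute_cutoff, cmp_desc) and the flat counts array are inventions of the sketch, whereas the real code gathers the values from each object's gcov_ctr_info lists before sorting.

```c
#include <assert.h>
#include <stdlib.h>

typedef long long gcov_type;

/* qsort comparator: descending order, mirroring
   sort_by_reverse_gcov_value in the patch.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  const gcov_type a = *(const gcov_type *) pa;
  const gcov_type b = *(const gcov_type *) pb;
  if (b > a)
    return 1;
  if (b == a)
    return 0;
  return -1;
}

/* Simplified model of gcov_compute_cutoff_values: given COUNTS of size N,
   determine how many of the hottest counters are needed to cover
   CUTOFF_TIMES_10 tenths of a percent of the total execution count, and
   the value of the coldest counter included in that set.  */
static void
compute_cutoff (gcov_type *counts, unsigned n, unsigned cutoff_times_10,
                unsigned *num_hot, gcov_type *cutoff_value)
{
  gcov_type sum_all = 0, cum = 0, cum_cutoff;
  unsigned i, c;

  for (i = 0; i < n; i++)
    sum_all += counts[i];

  /* Same overflow guard as the patch: divide first if the product wraps.  */
  if (sum_all * cutoff_times_10 < sum_all)
    cum_cutoff = sum_all / 1000 * cutoff_times_10;
  else
    cum_cutoff = sum_all * cutoff_times_10 / 1000;

  /* Sort hottest-first, then accumulate until reaching the cutoff.  */
  qsort (counts, n, sizeof (gcov_type), cmp_desc);
  for (c = 0; c < n && cum < cum_cutoff; c++)
    cum += counts[c];

  *num_hot = c;
  *cutoff_value = c > 0 ? counts[c - 1] : 0;
}
```

For example, with counts {1000, 10, 5, 1} and the default 99.9% cutoff, three counters are needed (cum reaches 1015 >= 1014) and the cutoff value is 5.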
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 189413)
+++ gcc/doc/invoke.texi	(working copy)
@@ -384,7 +384,7 @@  Objective-C and Objective-C++ Dialects}.
 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
 -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
--fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
+-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
@@ -416,7 +416,7 @@  Objective-C and Objective-C++ Dialects}.
 -ftree-reassoc @gol
 -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
--funit-at-a-time -funroll-all-loops -funroll-loops @gol
+-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
 -fwhole-program -fwpa -fuse-linker-plugin @gol
@@ -8505,6 +8505,14 @@  the loop is entered.  This usually makes programs
 @option{-funroll-all-loops} implies the same options as
 @option{-funroll-loops}.
 
+@item -funroll-codesize-limit
+@opindex funroll-codesize-limit
+Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fpeel-loops
 @opindex fpeel-loops
 Peels loops for which there is enough information that they do not
@@ -8513,6 +8521,14 @@  roll much (from profile feedback).  It also turns
 
 Enabled with @option{-fprofile-use}.
 
+@item -fpeel-codesize-limit
+@opindex fpeel-codesize-limit
+Limit loop peeling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fmove-loop-invariants
 @opindex fmove-loop-invariants
 Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
@@ -8864,6 +8880,15 @@  The maximum number of iterations of a loop to be s
 @item max-completely-peel-loop-nest-depth
 The maximum depth of a loop nest suitable for complete peeling.
 
+@item unrollpeel-codesize-threshold
+Maximum profile-based code size footprint estimate for loop unrolling and
+peeling.
+
+@item unrollpeel-hotness-threshold
+Maximum ratio of total execution count to loop entry block count under which
+most profile-based code size estimates will be ignored for loop unrolling and
+peeling.
+
 @item max-unswitch-insns
 The maximum number of insns of an unswitched loop.
 
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189413)
+++ gcc/gcov-io.c	(working copy)
@@ -376,6 +376,8 @@  gcov_write_summary (gcov_unsigned_t tag, const str
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       gcov_write_unsigned (csum->num);
+      gcov_write_unsigned (csum->num_hot_counters);
+      gcov_write_counter (csum->hot_cutoff_value);
       gcov_write_unsigned (csum->runs);
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
@@ -495,6 +497,8 @@  gcov_read_summary (struct gcov_summary *summary)
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       csum->num = gcov_read_unsigned ();
+      csum->num_hot_counters = gcov_read_unsigned ();
+      csum->hot_cutoff_value = gcov_read_counter ();
       csum->runs = gcov_read_unsigned ();
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189413)
+++ gcc/gcov-io.h	(working copy)
@@ -310,7 +310,7 @@  typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
 #define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
 
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS 	0  /* Arc transitions.  */
@@ -393,6 +393,10 @@  typedef HOST_WIDEST_INT gcov_type;
 struct gcov_ctr_summary
 {
   gcov_unsigned_t num;		/* number of counters.  */
+  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
+                                      percent of sum_all.  */
+  gcov_type hot_cutoff_value;   /* value of smallest counter included in
+                                   num_hot_counters.  */
   gcov_unsigned_t runs;		/* number of program runs */
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
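The GCOV_TAG_SUMMARY_LENGTH change above follows from the struct layout: in the gcov file format each gcov_unsigned_t field occupies one 4-byte word and each gcov_type counter two. A tiny sketch (the helper name is illustrative) checks the arithmetic for the old layout (num, runs, plus sum_all, run_max, sum_max) and the new one (adding num_hot_counters and hot_cutoff_value):

```c
#include <assert.h>

/* Length of one gcov_ctr_summary in 4-byte gcov words: each
   gcov_unsigned_t field takes one word, each gcov_type counter two.  */
static unsigned
ctr_summary_words (unsigned n_unsigned_fields, unsigned n_counter_fields)
{
  return n_unsigned_fields + n_counter_fields * 2;
}
```

This yields 8 words per counter summary before the patch and 11 after, matching the change from (2 + 3 * 2) to (3 + 4 * 2).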
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c	(revision 189413)
+++ gcc/loop-unroll.c	(working copy)
@@ -32,6 +32,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
    optimizations on innermost loops (with single exception) because
@@ -150,6 +151,75 @@  static void combine_var_copies_in_loop_exit (struc
 					     basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates. Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX).  */
+
+static int
+code_size_limit_factor(struct loop *loop)
+{
+  unsigned size_threshold;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program. A large value would indicate a large flat execution
+     profile where icache misses may be a concern.  */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance.  */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling. */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling.  */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+         limited unrolling for very hot loops.  */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the ratio of the total profiled count sum to the loop entry
+         header count is smaller than the threshold (i.e. the loop entry
+         is hot enough), the divisor is set to 1 so the unroll/peel factor
+         is not reduced.  When it exceeds the threshold, increase the
+         divisor by the amount this ratio is over the threshold, which
+         quickly reduces the unroll/peel factor to zero as the loop's
+         hotness decreases.  */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+        {
+          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+          gcc_assert (limit_factor >= 1);
+        }
+      else
+        limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0.  */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
 void
 unroll_and_peel_loops (int flags)
@@ -802,6 +872,7 @@  decide_unroll_runtime_iterations (struct loop *loo
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL))
     {
@@ -814,10 +885,26 @@  decide_unroll_runtime_iterations (struct loop *loo
 	     "\n;; Considering unrolling loop with runtime "
 	     "computable number of iterations\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+        {
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+        }
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1195,6 +1282,7 @@  decide_peel_simple (struct loop *loop, int flags)
 {
   unsigned npeel;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_PEEL))
     {
@@ -1205,8 +1293,23 @@  decide_peel_simple (struct loop *loop, int flags)
   if (dump_file)
     fprintf (dump_file, "\n;; Considering simply peeling loop\n");
 
+  if (flag_peel_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from peeling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max peeled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* npeel = number of iterations to peel.  */
-  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
+  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
+          / loop->ninsns;
   if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
     npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
 
@@ -1231,7 +1334,7 @@  decide_peel_simple (struct loop *loop, int flags)
 
   /* Do not simply peel loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
 	fprintf (dump_file, ";; Not peeling, contains branches\n");
@@ -1348,6 +1451,7 @@  decide_unroll_stupid (struct loop *loop, int flags
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL_ALL))
     {
@@ -1358,11 +1462,26 @@  decide_unroll_stupid (struct loop *loop, int flags
   if (dump_file)
     fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av
-    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1392,7 +1511,7 @@  decide_unroll_stupid (struct loop *loop, int flags
 
   /* Do not unroll loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
 	fprintf (dump_file, ";; Not unrolling, contains branches\n");
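The decision logic of code_size_limit_factor can be modeled standalone. In this sketch the inputs that the real code reads from profile_info, the loop structure, and the unrollpeel-codesize-threshold / unrollpeel-hotness-threshold params are passed as plain arguments, and a loop entry count of zero is treated as the coldest case (an assumption of the sketch, not the patch):

```c
#include <assert.h>
#include <limits.h>

typedef long long gcov_type;

/* Simplified model of code_size_limit_factor: return the divisor to be
   applied to the max-unrolled/peeled-insns params.  NUM_HOT is the
   profile's hot-counter footprint estimate, SIZE_THRESHOLD the codesize
   threshold, SUM_ALL the total profiled count, HEADER_COUNT the loop
   entry count, HOTNESS_THRESHOLD the hotness ratio threshold, HAS_FP
   whether the loop contains floating point sets.  */
static int
limit_factor (gcov_type num_hot, gcov_type size_threshold,
              gcov_type sum_all, gcov_type header_count,
              gcov_type hotness_threshold, int has_fp)
{
  gcov_type ratio;

  /* Small footprint, no profile data, or an FP loop: do not limit.  */
  if (num_hot <= size_threshold || sum_all == 0 || has_fp)
    return 1;

  /* At least twice the codesize limit: force the unroll factor to 0.  */
  if (num_hot >= size_threshold * 2)
    return INT_MAX;

  /* In between: scale the divisor by how much colder the loop entry is
     than the hotness threshold allows.  */
  if (header_count == 0)
    return INT_MAX;
  ratio = sum_all / header_count;
  if (ratio > hotness_threshold)
    return (int) (ratio / hotness_threshold);
  return 1;
}
```

So a loop whose entry count is within the hotness threshold of sum_all keeps its full unroll budget, while colder loops in a large-footprint program see the budget shrink in proportion to how far their hotness ratio exceeds the threshold.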
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189413)
+++ gcc/coverage.c	(working copy)
@@ -245,6 +245,14 @@  read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+              if (summary.ctrs[ix].num_hot_counters
+                  < sum.ctrs[ix].num_hot_counters)
+                {
+		  summary.ctrs[ix].num_hot_counters
+                      = sum.ctrs[ix].num_hot_counters;
+		  summary.ctrs[ix].hot_cutoff_value
+                      = sum.ctrs[ix].hot_cutoff_value;
+                }
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
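The merge rule added to read_counts_file keeps the hot-counter pair from whichever summary reports the larger footprint while runs and sum_all accumulate. A minimal standalone mirror of that step (the struct and function names here are inventions of the sketch; the real code operates on struct gcov_ctr_summary in place):

```c
#include <assert.h>

typedef long long gcov_type;

/* Stand-in for the gcov_ctr_summary fields touched by the merge.  */
struct ctr_sum
{
  unsigned num_hot_counters;
  gcov_type hot_cutoff_value;
  unsigned runs;
  gcov_type sum_all;
};

/* Accumulate runs and sum_all, but take the hot-counter pair from
   whichever summary reports the larger (more pessimistic) footprint.  */
static void
merge_ctr_summary (struct ctr_sum *dst, const struct ctr_sum *src)
{
  if (dst->num_hot_counters < src->num_hot_counters)
    {
      dst->num_hot_counters = src->num_hot_counters;
      dst->hot_cutoff_value = src->hot_cutoff_value;
    }
  dst->runs += src->runs;
  dst->sum_all += src->sum_all;
}
```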
Index: gcc/loop-iv.c
===================================================================
--- gcc/loop-iv.c	(revision 189413)
+++ gcc/loop-iv.c	(working copy)
@@ -2969,8 +2969,12 @@  get_simple_loop_desc (struct loop *loop)
   /* At least desc->infinite is not always initialized by
      find_simple_loop_exit.  */
   desc = XCNEW (struct niter_desc);
-  iv_analysis_loop_init (loop);
-  find_simple_exit (loop, desc);
+  if (loop->latch != EXIT_BLOCK_PTR)
+    {
+      iv_analysis_loop_init (loop);
+      find_simple_exit (loop, desc);
+    }
+  analyze_loop_insns (loop, desc);
   loop->aux = desc;
 
   if (desc->simple_p && (desc->assumptions || desc->infinite))
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 189413)
+++ gcc/common.opt	(working copy)
@@ -1551,6 +1551,10 @@  fpcc-struct-return
 Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
 Return small aggregates in memory, not registers
 
+fpeel-codesize-limit
+Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop peeling under profile estimates of large code footprint
+
 fpeel-loops
 Common Report Var(flag_peel_loops) Optimization
 Perform loop peeling
@@ -2108,6 +2112,10 @@  funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops
 
+funroll-codesize-limit
+Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop unrolling under profile estimates of large code footprint
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
Index: gcc/cfgloop.c
===================================================================
--- gcc/cfgloop.c	(revision 189413)
+++ gcc/cfgloop.c	(working copy)
@@ -1155,24 +1155,98 @@  get_loop_exit_edges (const struct loop *loop)
   return edges;
 }
 
-/* Counts the number of conditional branches inside LOOP.  */
+/* Return true if INSN contains a SET whose destination has a floating-point
+   mode.  */
 
-unsigned
-num_loop_branches (const struct loop *loop)
+static bool
+insn_has_fp_set (rtx insn)
 {
-  unsigned i, n;
-  basic_block * body;
+  int i;
+  rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == SET)
+    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
+  else if (GET_CODE (pat) == PARALLEL)
+    {
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+        {
+          rtx sub = XVECEXP (pat, 0, i);
+          if (GET_CODE (sub) == SET
+              && FLOAT_MODE_P (GET_MODE (SET_DEST (sub))))
+            return true;
+        }
+    }
+  return false;
+}
 
-  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
+/* Analyzes the instructions inside LOOP, updating DESC.  Currently counts
+   the number of conditional branch instructions, calls and fp instructions,
+   as well as the average number of branches executed per iteration.  */
 
+void
+analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
+{
+  unsigned i, nbranch;
+  gcov_type weighted_nbranch;
+  bool has_call, has_fp;
+  basic_block *body, bb;
+  rtx insn;
+  gcov_type header_count = loop->header->count;
+
+  nbranch = weighted_nbranch = 0;
+  has_call = has_fp = false;
+
   body = get_loop_body (loop);
-  n = 0;
   for (i = 0; i < loop->num_nodes; i++)
-    if (EDGE_COUNT (body[i]->succs) >= 2)
-      n++;
+    {
+      bb = body[i];
+
+      if (EDGE_COUNT (bb->succs) >= 2)
+        {
+          nbranch++;
+
+          /* If this block is executed less frequently than the header (loop
+             entry), then it is weighted based on its execution count, which
+             will be turned into a ratio compared to the loop header below.  */
+          if (bb->count < header_count)
+            weighted_nbranch += bb->count;
+
+          /* When it is executed more frequently than the header (i.e. it is
+             in a nested inner loop), simply weight the branch the same as the
+             header execution count, so that it will contribute 1 branch to
+             the ratio computed below.  */
+          else
+            weighted_nbranch += header_count;
+        }
+
+      /* No need to iterate through the instructions below if
+         both flags have already been set.  */
+      if (has_call && has_fp)
+        continue;
+
+      FOR_BB_INSNS (bb, insn)
+        {
+          if (!INSN_P (insn))
+            continue;
+
+          if (!has_call)
+            has_call = CALL_P (insn);
+
+          if (!has_fp)
+            has_fp = insn_has_fp_set (insn);
+        }
+    }
   free (body);
 
-  return n;
+  desc->num_branches = nbranch;
+  /* Now divide the weights computed above by the loop header execution count,
+     to compute the average number of branches executed per iteration.  By
+     adding header_count / 2 to the numerator we round to nearest with
+     integer division.  */
+  if (header_count != 0)
+    desc->av_num_branches
+      = (weighted_nbranch + header_count / 2) / header_count;
+  else
+    desc->av_num_branches = 0;
+  desc->has_call = has_call;
+  desc->has_fp = has_fp;
 }
 
 /* Adds basic block BB to LOOP.  */
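The averaging done in analyze_loop_insns above is worth seeing in isolation: each branch block contributes min (its count, header count) to the weighted sum, and adding header_count / 2 before the division rounds to nearest. A minimal stand-alone sketch with plain arrays instead of real basic blocks (the function and its inputs are simplifications, not GCC code):

```c
#include <assert.h>

typedef long long gcov_type;

/* Average number of branches executed per loop iteration, given the
   execution counts of the branch blocks and of the loop header.  Counts
   above the header count (nested inner loops) are capped so each such
   block contributes exactly one branch to the average.  */
static unsigned
av_branches (const gcov_type *block_counts, unsigned n, gcov_type header_count)
{
  gcov_type weighted = 0;
  unsigned i;

  if (header_count == 0)
    return 0;
  for (i = 0; i < n; i++)
    weighted += block_counts[i] < header_count
                ? block_counts[i] : header_count;
  /* Round to nearest with integer division.  */
  return (unsigned) ((weighted + header_count / 2) / header_count);
}
```

With a header count of 100, branch blocks with counts {100, 40} average 1.4 branches per iteration and round down to 1, while {100, 60} rounds up to 2.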
Index: gcc/cfgloop.h
===================================================================
--- gcc/cfgloop.h	(revision 189413)
+++ gcc/cfgloop.h	(working copy)
@@ -255,7 +255,6 @@  extern basic_block *get_loop_body_in_custom_order
 
 extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
 edge single_exit (const struct loop *);
-extern unsigned num_loop_branches (const struct loop *);
 
 extern edge loop_preheader_edge (const struct loop *);
 extern edge loop_latch_edge (const struct loop *);
@@ -366,7 +365,8 @@  struct rtx_iv
 };
 
 /* The description of an exit from the loop and of the number of iterations
-   till we take the exit.  */
+   till we take the exit.  Also includes other information used primarily
+   by the loop unroller.  */
 
 struct niter_desc
 {
@@ -407,6 +407,18 @@  struct niter_desc
 
   /* The number of iterations of the loop.  */
   rtx niter_expr;
+
+  /* The number of branches in the loop.  */
+  unsigned num_branches;
+
+  /* The number of executed branches per iteration.  */
+  unsigned av_num_branches;
+
+  /* Whether the loop contains a call instruction.  */
+  bool has_call;
+
+  /* Whether the loop contains fp instructions.  */
+  bool has_fp;
 };
 
 extern void iv_analysis_loop_init (struct loop *);
@@ -420,6 +432,7 @@  extern void iv_analysis_done (void);
 
 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 extern void free_simple_loop_desc (struct loop *loop);
+extern void analyze_loop_insns (const struct loop *, struct niter_desc *);
 
 static inline struct niter_desc *
 simple_loop_desc (struct loop *loop)
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 189413)
+++ gcc/params.def	(working copy)
@@ -311,6 +311,23 @@  DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
 	 "max-completely-peel-loop-nest-depth",
 	 "The maximum depth of a loop nest we completely peel",
 	 8, 0, 0)
+/* The maximum code size estimate under which loop unrolling and peeling
+   are allowed in a profile feedback compile.  This currently applies to loops
+   with non-constant iteration counts and no floating point computations.  */
+DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
+        "unrollpeel-codesize-threshold",
+        "Maximum profile-based code size footprint estimate for loop unrolling "
+        "and peeling",
+        15000, 0, 0)
+/* The maximum ratio of total profiled execution count to loop entry block
+   count under which most code size limits for unrolling and peeling are
+   ignored.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+        "unrollpeel-hotness-threshold",
+        "Maximum ratio of total profiled execution count to loop entry "
+        "block count under which most codesize limits for unrolling and "
+        "peeling will be ignored",
+        100, 1, 0)
 
 /* The maximum number of insns of an unswitched loop.  */
 DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
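The loop-unroll.c use site of these two params is not part of this excerpt, so the following is only a hypothetical sketch of how they could interact, based on their help strings: a sufficiently hot loop (ratio of total profiled count to loop entry count below unrollpeel-hotness-threshold) bypasses the size limit, otherwise the footprint estimate must stay under unrollpeel-codesize-threshold. The function name and exact conditions are assumptions, not the patch's actual logic:

```c
#include <assert.h>
#include <stdbool.h>

typedef long long gcov_type;

/* Hypothetical gate for codesize-limited unrolling/peeling.  NUM_HOT_COUNTERS
   is the summary-based footprint estimate; SUM_ALL is the total profiled
   execution count; the thresholds correspond to the two new params.  */
static bool
codesize_permits_unroll (gcov_type num_hot_counters,
                         gcov_type sum_all,
                         gcov_type loop_entry_count,
                         gcov_type codesize_threshold,
                         gcov_type hotness_threshold)
{
  /* A loop that accounts for a large share of the program's execution has a
     small total-to-entry ratio; below the hotness threshold, ignore the
     code size limit entirely.  */
  if (loop_entry_count > 0
      && sum_all / loop_entry_count < hotness_threshold)
    return true;
  /* Otherwise only unroll when the estimated footprint is small enough that
     icache pressure is unlikely.  */
  return num_hot_counters < codesize_threshold;
}
```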
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189413)
+++ gcc/gcov-dump.c	(working copy)
@@ -456,8 +456,12 @@  tag_summary (const char *filename ATTRIBUTE_UNUSED
     {
       printf ("\n");
       print_prefix (filename, 0, 0);
-      printf ("\t\tcounts=%u, runs=%u",
-	      summary.ctrs[ix].num, summary.ctrs[ix].runs);
+      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
+              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
+	      summary.ctrs[ix].num,
+	      summary.ctrs[ix].num_hot_counters,
+	      (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
+	      summary.ctrs[ix].runs);
 
       printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
 	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
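The libgcov.c side that produces num_hot_counters and hot_cutoff_value (sort_by_reverse_gcov_value and gcov_compute_cutoff_values from the ChangeLog) is not shown in this excerpt. A simplified sketch of what such a cutoff computation looks like, where the percentage parameter and the exact accounting are assumptions: sort the counters hottest-first, accumulate until a given fraction of the total is covered, and record how many counters that took and the smallest value among them.

```c
#include <assert.h>
#include <stdlib.h>

typedef long long gcov_type;

/* qsort comparator: descending by counter value.  */
static int
cmp_desc (const void *a, const void *b)
{
  gcov_type va = *(const gcov_type *) a, vb = *(const gcov_type *) b;
  return va < vb ? 1 : va > vb ? -1 : 0;
}

/* Return how many of the hottest counters are needed to cover
   CUTOFF_PERCENT of the total, storing the smallest such counter value
   (the "hot cutoff") in *CUTOFF_VALUE.  Sorts COUNTS in place.  */
static unsigned
compute_cutoff (gcov_type *counts, unsigned n, unsigned cutoff_percent,
                gcov_type *cutoff_value)
{
  gcov_type total = 0, running = 0;
  unsigned i;

  for (i = 0; i < n; i++)
    total += counts[i];
  qsort (counts, n, sizeof *counts, cmp_desc);
  *cutoff_value = 0;
  for (i = 0; i < n; i++)
    {
      running += counts[i];
      *cutoff_value = counts[i];
      if (running * 100 >= total * (gcov_type) cutoff_percent)
        return i + 1;
    }
  return n;
}
```

For counts {10, 50, 10, 30} and an 80% cutoff, the two hottest counters (50 and 30) cover exactly 80 of the 100 total, giving num_hot_counters = 2 and a cutoff value of 30.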