Patchwork New fdo summary-based icache sensitive unrolling (issue6351086)

Submitter Teresa Johnson
Date Nov. 20, 2012, 12:58 a.m.
Message ID <CAAe5K+XH+HJJzFTFjdV9eb0shyTB5Vk_foKEh+Pgas5LJ5PGDw@mail.gmail.com>
Permalink /patch/200225/
State New

Comments

Teresa Johnson - Nov. 20, 2012, 12:58 a.m.
This patch was proposed a while back, and the new working set program
summary was split out and committed separately. I would like to
see if the loop unroller part of the patch can now go in. For the
previous versions of the patch and related discussion, see:

http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html

Use the new working set information in the FDO program summary for
codesize based unroll and peel decisions, to prevent those
optimizations from increasing code size too much when the
program may be sensitive to icache effects.

As part of this, cache additional loop analysis results in the niter_desc
auxiliary information hanging off the loop structure to reduce
redundant analyses during unrolling.
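The limiting scheme described above can be sketched as a small standalone model. This is not the patch's code; all names, types, and parameter values here are illustrative stand-ins for the patch's code_size_limit_factor and the params it reads:

```c
#include <assert.h>
#include <limits.h>

/* Sketch of the codesize-based divisor applied to the max-insns params:
   return 1 (no limiting) for small footprints and FP loops, INT_MAX
   (disable unrolling/peeling) at twice the size threshold, and a
   hotness-scaled divisor in between.  */
static int
limit_factor (unsigned num_hot_counters, unsigned size_threshold,
              long sum_to_header_ratio, int hotness_threshold,
              int loop_has_fp)
{
  if (num_hot_counters <= size_threshold)
    return 1;                    /* Footprint small: do not limit.  */
  if (loop_has_fp)
    return 1;                    /* FP loops gain most from unrolling.  */
  if (num_hot_counters >= size_threshold * 2)
    return INT_MAX;              /* Very large footprint: disable.  */
  /* Between 1x and 2x the threshold: limit colder loops harder.  The
     ratio of total profile weight to the loop header count grows as
     the loop entry gets colder.  */
  if (sum_to_header_ratio > hotness_threshold)
    return (int) (sum_to_header_ratio / hotness_threshold);
  return 1;
}
```

The resulting factor is divided into the max-unrolled-insns style limits, so a very hot loop in a moderately large program can still unroll while colder loops are suppressed.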

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-11-19  Teresa Johnson  <tejohnson@google.com>

        * gcc/doc/invoke.texi: Document new options
        -fpeel-codesize-limit and -funroll-codesize-limit, and new params
        unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
        * gcc/loop-unroll.c (code_size_limit_factor): New function.
        (decide_unroll_runtime_iterations): Call code_size_limit_factor
        to control the unroll factor, and retrieve number of branches from
        niter_desc instead of via function that walks loop.
        (decide_peel_simple, decide_unroll_stupid): Ditto.
        * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
        function, and add guards to enable this function to work for the
        outermost loop.
        * gcc/common.opt: Add -fpeel-codesize-limit and
        -funroll-codesize-limit.
        * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
        (num_loop_branches): Remove.
        * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
        additional loop analysis information.
        (num_loop_branches): Remove.
        (analyze_loop_insns): Declare.
        * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
        (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.


On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote:
> Updated patch based on review feedback.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>
> Teresa
>
>
> Original patch description:
>
> Ports some patches related to improving FDO program summary information
> and using it to guide loop unrolling from google branches to mainline.
> The patch is enhanced to add additional summary information to aid
> in determining hot/cold decisions.
>
> The original patch description is at:
>   http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
> and further discussion about incorporating onto mainline is at:
>   http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414
>
> Honza, can you take a look to see if this patch meets your needs?
>
> Full description:
>
> This patch adds new program summary information to the gcov
> profile files that indicate how many profiled counts compose
> the majority of the program's execution time. This is used to
> provide an indication of the overall code size of the program.
>
> The new profile summary information is then used to guide
> codesize based unroll and peel decisions, to prevent those
> optimizations from increasing code size too much when the
> program may be sensitive to icache effects.
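A minimal sketch of that working-set computation follows. It is standalone C with plain types instead of the gcov ones, and it assumes the counter sum does not overflow (the patch itself guards that by reordering the multiply and divide); the 999 default mirrors the patch's CUM_CUTOFF_PERCENT_TIMES_10:

```c
#include <assert.h>
#include <stdlib.h>

/* Sort counters in descending order (mirrors the shape of the patch's
   sort_by_reverse_gcov_value).  */
static int
cmp_desc (const void *pa, const void *pb)
{
  long long a = *(const long long *) pa;
  long long b = *(const long long *) pb;
  return (b > a) - (b < a);
}

/* Return how many of the N counters are needed to cover CUTOFF_PERMILLE
   tenths of a percent of the total (999 = 99.9%).  VALUES is sorted in
   place.  A large result indicates a flat profile where icache misses
   may be a concern.  */
static unsigned
num_hot_counters (long long *values, unsigned n, unsigned cutoff_permille)
{
  long long sum = 0, cum = 0, cutoff;
  unsigned i, c;
  for (i = 0; i < n; i++)
    sum += values[i];
  cutoff = sum * cutoff_permille / 1000;   /* Assumes no overflow.  */
  qsort (values, n, sizeof (long long), cmp_desc);
  for (c = 0; c < n && cum < cutoff; c++)
    cum += values[c];
  return c;
}
```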
>
> This patch also pulls in dependent portions of google/main r187660 that cache
> additional loop analysis results in the niter_desc auxiliary information
> hanging off the loop structure (the optimization portions of that
> change are not included here, and have an outstanding review request
> for mainline).
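Among the cached analysis results is an average number of branches executed per iteration. A hedged sketch of that computation (counts and names are illustrative, not the patch's interfaces): branch blocks executed less often than the loop header contribute their own count, blocks executed more often (nested inner loops) are capped at the header count, and the weighted sum is divided by the header count with round-to-nearest:

```c
#include <assert.h>

/* Average branches per loop iteration, given the profile counts of the
   basic blocks that end in a conditional branch and the loop header
   count.  Adding header_count/2 before dividing rounds to nearest.  */
static long long
av_num_branches (const long long *branch_bb_counts, unsigned n,
                 long long header_count)
{
  long long weighted = 0;
  unsigned i;
  if (header_count == 0)
    return 0;
  for (i = 0; i < n; i++)
    weighted += branch_bb_counts[i] < header_count
                ? branch_bb_counts[i] : header_count;
  return (weighted + header_count / 2) / header_count;
}
```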
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>
> Thanks,
> Teresa
>
> 2012-07-26  Teresa Johnson  <tejohnson@google.com>
>
>         * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
>         (gcov_compute_cutoff_values): Ditto.
>         (gcov_exit): Call gcov_compute_cutoff_values and merge new summary
>         information.
>         * gcc/doc/invoke.texi: Document new options
>         -fpeel-codesize-limit and -funroll-codesize-limit, and new params
>         unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
>         * gcc/gcov-io.c (gcov_write_summary): Write new summary info.
>         (gcov_read_summary): Read new summary info.
>         * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
>         (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
>         hot_cutoff_value.
>         * gcc/loop-unroll.c (code_size_limit_factor): New function.
>         (decide_unroll_runtime_iterations): Call code_size_limit_factor
>         to control the unroll factor, and retrieve number of branches from
>         niter_desc instead of via function that walks loop.
>         (decide_peel_simple, decide_unroll_stupid): Ditto.
>         * gcc/coverage.c (read_counts_file): Propagate new summary info.
>         * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
>         function, and add guards to enable this function to work for the
>         outermost loop.
>         * gcc/common.opt: Add -fpeel-codesize-limit and
>         -funroll-codesize-limit.
>         * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
>         (num_loop_branches): Remove.
>         * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
>         additional loop analysis information.
>         (num_loop_branches): Remove.
>         (analyze_loop_insns): Declare.
>         * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
>         (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
>         * gcc/gcov-dump.c (tag_summary): Dump new summary info.
>
> Index: libgcc/libgcov.c
> ===================================================================
> --- libgcc/libgcov.c    (revision 189893)
> +++ libgcc/libgcov.c    (working copy)
> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>    return 1;
>  }
>
> +/* Used by qsort to sort gcov values in descending order.  */
> +
> +static int
> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
> +{
> +  const gcov_type a = *(gcov_type const *)pa;
> +  const gcov_type b = *(gcov_type const *)pb;
> +
> +  if (b > a)
> +    return 1;
> +  else if (b == a)
> +    return 0;
> +  else
> +    return -1;
> +}
> +
> +/* Determines the number of counters required to cover a given percentage
> +   of the total sum of execution counts in the summary, which is then also
> +   recorded in SUM.  */
> +
> +static void
> +gcov_compute_cutoff_values (struct gcov_summary *sum)
> +{
> +  struct gcov_info *gi_ptr;
> +  const struct gcov_fn_info *gfi_ptr;
> +  const struct gcov_ctr_info *ci_ptr;
> +  struct gcov_ctr_summary *cs_ptr;
> +  unsigned t_ix, f_ix, i, ctr_info_ix, index;
> +  gcov_unsigned_t c_num;
> +  gcov_type *value_array;
> +  gcov_type cum, cum_cutoff;
> +  char *cutoff_str;
> +  unsigned cutoff_perc;
> +
> +#define CUM_CUTOFF_PERCENT_TIMES_10 999
> +  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
> +  if (cutoff_str && strlen (cutoff_str))
> +    cutoff_perc = atoi (cutoff_str);
> +  else
> +    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
> +
> +  /* This currently only applies to arc counters.  */
> +  t_ix = GCOV_COUNTER_ARCS;
> +
> +  /* First check if there are any counts recorded for this counter.  */
> +  cs_ptr = &(sum->ctrs[t_ix]);
> +  if (!cs_ptr->num)
> +    return;
> +
> +  /* Determine the cumulative counter value at the specified cutoff
> +     percentage and record the percentage for use by gcov consumers.
> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
> +     and if so, do the divide first.  */
> +  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
> +    /* Overflow, do the divide first.  */
> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
> +  else
> +    /* Otherwise multiply first to get the correct value for small
> +       values of sum_all.  */
> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
> +
> +  /* Next, walk through all the per-object structures and save each of
> +     the count values in value_array.  */
> +  index = 0;
> +  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
> +    {
> +      if (!gi_ptr->merge[t_ix])
> +        continue;
> +
> +      /* Find the appropriate index into the gcov_ctr_info array
> +         for the counter we are currently working on based on the
> +         existence of the merge function pointer for this object.  */
> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
> +        {
> +          if (gi_ptr->merge[i])
> +            ctr_info_ix++;
> +        }
> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
> +        {
> +          gfi_ptr = gi_ptr->functions[f_ix];
> +
> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
> +            continue;
> +
> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
> +          /* Sanity check that there are enough entries in value_array
> +            for this function's counters. Gracefully handle the case when
> +            there are not, in case something in the profile info is
> +            corrupted.  */
> +          c_num = ci_ptr->num;
> +          if (index + c_num > cs_ptr->num)
> +            c_num = cs_ptr->num - index;
> +          /* Copy over this function's counter values.  */
> +          memcpy (&value_array[index], ci_ptr->values,
> +                  sizeof (gcov_type) * c_num);
> +          index += c_num;
> +        }
> +    }
> +
> +  /* Sort all the counter values by descending value and finally
> +     accumulate the values from hottest on down until reaching
> +     the cutoff value computed earlier.  */
> +  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
> +         sort_by_reverse_gcov_value);
> +  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
> +    cum += value_array[c_num];
> +  /* Record the number of counters required to reach the cutoff value,
> +     as well as the counter value that caused cum to reach the cutoff.  */
> +  cs_ptr->num_hot_counters = c_num;
> +  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
> +  free (value_array);
> +}
> +
>  /* Dump the coverage counts. We merge with existing counts when
>     possible, to avoid growing the .da files ad infinitum. We use this
>     program's checksum to make sure we only accumulate whole program
> @@ -347,6 +461,7 @@ gcov_exit (void)
>             }
>         }
>      }
> +  gcov_compute_cutoff_values (&this_prg);
>
>    {
>      /* Check if the level of dirs to strip off specified. */
> @@ -598,7 +713,12 @@ gcov_exit (void)
>           if (gi_ptr->merge[t_ix])
>             {
>               if (!cs_prg->runs++)
> -               cs_prg->num = cs_tprg->num;
> +                cs_prg->num = cs_tprg->num;
> +              if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
> +              {
> +                cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
> +                cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
> +              }
>               cs_prg->sum_all += cs_tprg->sum_all;
>               if (cs_prg->run_max < cs_tprg->run_max)
>                 cs_prg->run_max = cs_tprg->run_max;
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi (revision 189893)
> +++ gcc/doc/invoke.texi (working copy)
> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>  -fprofile-generate=@var{path} @gol
> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>  -ftree-switch-conversion -ftree-tail-merge @gol
>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>  -fwhole-program -fwpa -fuse-linker-plugin @gol
> @@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
>  @option{-funroll-all-loops} implies the same options as
>  @option{-funroll-loops}.
>
> +@item -funroll-codesize-limit
> +@opindex funroll-codesize-limit
> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
> +under estimates of a large code footprint. Enabled by default with
> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
>  @item -fpeel-loops
>  @opindex fpeel-loops
>  Peels loops for which there is enough information that they do not
> @@ -8535,6 +8543,14 @@ roll much (from profile feedback).  It also turns
>
>  Enabled with @option{-fprofile-use}.
>
> +@item -fpeel-codesize-limit
> +@opindex fpeel-codesize-limit
> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
> +under estimates of a large code footprint. Enabled by default with
> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
>  @item -fmove-loop-invariants
>  @opindex fmove-loop-invariants
>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s
>  @item max-completely-peel-loop-nest-depth
>  The maximum depth of a loop nest suitable for complete peeling.
>
> +@item unrollpeel-codesize-threshold
> +Maximum profile-based code size footprint estimate for loop unrolling and
> +peeling.
> +
> +@item unrollpeel-hotness-threshold
> +Maximum ratio of total execution count to loop entry block count under which
> +most profile-based code size estimates will be ignored for loop unrolling and
> +peeling.
> +
>  @item max-unswitch-insns
>  The maximum number of insns of an unswitched loop.
>
> Index: gcc/gcov-io.c
> ===================================================================
> --- gcc/gcov-io.c       (revision 189893)
> +++ gcc/gcov-io.c       (working copy)
> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>      {
>        gcov_write_unsigned (csum->num);
> +      gcov_write_unsigned (csum->num_hot_counters);
> +      gcov_write_counter (csum->hot_cutoff_value);
>        gcov_write_unsigned (csum->runs);
>        gcov_write_counter (csum->sum_all);
>        gcov_write_counter (csum->run_max);
> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>      {
>        csum->num = gcov_read_unsigned ();
> +      csum->num_hot_counters = gcov_read_unsigned ();
> +      csum->hot_cutoff_value = gcov_read_counter ();
>        csum->runs = gcov_read_unsigned ();
>        csum->sum_all = gcov_read_counter ();
>        csum->run_max = gcov_read_counter ();
> Index: gcc/gcov-io.h
> ===================================================================
> --- gcc/gcov-io.h       (revision 189893)
> +++ gcc/gcov-io.h       (working copy)
> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
>  #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
>  #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
>  #define GCOV_TAG_SUMMARY_LENGTH  \
> -       (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
> +       (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
>
>  /* Counters that are collected.  */
>  #define GCOV_COUNTER_ARCS      0  /* Arc transitions.  */
> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
>  struct gcov_ctr_summary
>  {
>    gcov_unsigned_t num;         /* number of counters.  */
> +  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
> +                                      percent of sum_all.  */
> +  gcov_type hot_cutoff_value;   /* value of smallest counter included in
> +                                   num_hot_counters.  */
>    gcov_unsigned_t runs;                /* number of program runs */
>    gcov_type sum_all;           /* sum of all counters accumulated.  */
>    gcov_type run_max;           /* maximum value on a single run.  */
> Index: gcc/loop-unroll.c
> ===================================================================
> --- gcc/loop-unroll.c   (revision 189893)
> +++ gcc/loop-unroll.c   (working copy)
> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "hashtab.h"
>  #include "recog.h"
>  #include "target.h"
> +#include "gcov-io.h"
>  #include "dumpfile.h"
>
>  /* This pass performs loop unrolling and peeling.  We only perform these
> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc
>                                              basic_block);
>  static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> +   based on code footprint estimates. Returns the codesize-based factor to be
> +   divided into the max instructions in an unrolled or peeled loop:
> +   1) For size <= threshold, do not limit (by returning 1).
> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> +      unrolled instructions according to loop hotness.
> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
> +      INT_MAX).  */
> +
> +static int
> +code_size_limit_factor (struct loop *loop)
> +{
> +  unsigned size_threshold;
> +  struct niter_desc *desc = get_simple_loop_desc (loop);
> +  gcov_type sum_to_header_ratio;
> +  int hotness_ratio_threshold;
> +  int limit_factor;
> +
> +  /* First check if the application has a large codesize footprint.
> +     This is estimated from FDO profile summary information for the
> +     program, where the num_hot_counters indicates the number of hottest
> +     counters (blocks) that compose most of the execution time of
> +     the program. A large value would indicate a large flat execution
> +     profile where icache misses may be a concern.  */
> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> +  if (!profile_info
> +      || profile_info->num_hot_counters <= size_threshold
> +      || !profile_info->sum_all)
> +    return 1;
> +
> +  /* Next, exclude some loops where unrolling/peeling may be more
> +     important to overall performance.  */
> +
> +  /* Ignore FP loops, which are more likely to benefit heavily from
> +     unrolling. */
> +  if (desc->has_fp)
> +    return 1;
> +
> +  /* Next, set the value of the codesize-based unroll factor divisor which in
> +     most loops will need to be set to a value that will reduce or eliminate
> +     unrolling/peeling.  */
> +  if (profile_info->num_hot_counters < size_threshold * 2)
> +    {
> +      /* For applications that are less than twice the codesize limit, allow
> +         limited unrolling for very hot loops.  */
> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> +      /* When the profile count sum to loop entry header ratio is smaller than
> +         the threshold (i.e. the loop entry is hot enough), the divisor is set
> +         to 1 so the unroll/peel factor is not reduced. When it is bigger
> +         than the threshold, increase the divisor by the amount this ratio
> +         is over the threshold, which will quickly reduce the unroll/peel
> +         factor to zero as the loop's hotness reduces.  */
> +      if (sum_to_header_ratio > hotness_ratio_threshold)
> +        {
> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> +          gcc_assert (limit_factor >= 1);
> +        }
> +      else
> +        limit_factor = 1;
> +    }
> +  else
> +    /* For applications that are at least twice the codesize limit, set
> +       the divisor to a large value that will force the unroll factor to 0.  */
> +    limit_factor = INT_MAX;
> +
> +  return limit_factor;
> +}
> +
>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>  void
>  unroll_and_peel_loops (int flags)
> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>  {
>    unsigned nunroll, nunroll_by_av, i;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_UNROLL))
>      {
> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>              "\n;; Considering unrolling loop with runtime "
>              "computable number of iterations\n");
>
> +  if (flag_unroll_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from unrolling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +        {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max unrolled insns by divisor %d\n", limit_factor);
> +        }
> +    }
> +
>    /* nunroll = total number of copies of the original loop body in
>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> +            / loop->ninsns;
> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> +                  / limit_factor / loop->av_ninsns;
>    if (nunroll > nunroll_by_av)
>      nunroll = nunroll_by_av;
>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags)
>  {
>    unsigned npeel;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_PEEL))
>      {
> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags)
>    if (dump_file)
>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>
> +  if (flag_peel_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from peeling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +       {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max peeled insns by divisor %d\n", limit_factor);
> +       }
> +    }
> +
>    /* npeel = number of iterations to peel.  */
> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
> +          / loop->ninsns;
>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>
> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags)
>
>    /* Do not simply peel loops with branches inside -- it increases number
>       of mispredicts.  */
> -  if (num_loop_branches (loop) > 1)
> +  if (desc->num_branches > 1)
>      {
>        if (dump_file)
>         fprintf (dump_file, ";; Not peeling, contains branches\n");
> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>  {
>    unsigned nunroll, nunroll_by_av, i;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_UNROLL_ALL))
>      {
> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>    if (dump_file)
>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>
> +  if (flag_unroll_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from unrolling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +       {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max unrolled insns by divisor %d\n", limit_factor);
> +       }
> +    }
> +
>    /* nunroll = total number of copies of the original loop body in
>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> -  nunroll_by_av
> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> +            / loop->ninsns;
> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> +                  / limit_factor / loop->av_ninsns;
>    if (nunroll > nunroll_by_av)
>      nunroll = nunroll_by_av;
>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>
>    /* Do not unroll loops with branches inside -- it increases number
>       of mispredicts.  */
> -  if (num_loop_branches (loop) > 1)
> +  if (desc->num_branches > 1)
>      {
>        if (dump_file)
>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
> Index: gcc/coverage.c
> ===================================================================
> --- gcc/coverage.c      (revision 189893)
> +++ gcc/coverage.c      (working copy)
> @@ -247,6 +247,14 @@ read_counts_file (void)
>           gcov_read_summary (&sum);
>           for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
>             {
> +              if (summary.ctrs[ix].num_hot_counters
> +                  < sum.ctrs[ix].num_hot_counters)
> +                {
> +                 summary.ctrs[ix].num_hot_counters
> +                      = sum.ctrs[ix].num_hot_counters;
> +                 summary.ctrs[ix].hot_cutoff_value
> +                      = sum.ctrs[ix].hot_cutoff_value;
> +                }
>               summary.ctrs[ix].runs += sum.ctrs[ix].runs;
>               summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
>               if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
> Index: gcc/loop-iv.c
> ===================================================================
> --- gcc/loop-iv.c       (revision 189893)
> +++ gcc/loop-iv.c       (working copy)
> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop)
>    /* At least desc->infinite is not always initialized by
>       find_simple_loop_exit.  */
>    desc = XCNEW (struct niter_desc);
> -  iv_analysis_loop_init (loop);
> -  find_simple_exit (loop, desc);
> +  if (loop->latch != EXIT_BLOCK_PTR)
> +    {
> +      iv_analysis_loop_init (loop);
> +      find_simple_exit (loop, desc);
> +    }
> +  analyze_loop_insns (loop, desc);
>    loop->aux = desc;
>
>    if (desc->simple_p && (desc->assumptions || desc->infinite))
> Index: gcc/common.opt
> ===================================================================
> --- gcc/common.opt      (revision 189893)
> +++ gcc/common.opt      (working copy)
> @@ -1551,6 +1551,10 @@ fpcc-struct-return
>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>  Return small aggregates in memory, not registers
>
> +fpeel-codesize-limit
> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
> +
>  fpeel-loops
>  Common Report Var(flag_peel_loops) Optimization
>  Perform loop peeling
> @@ -2112,6 +2116,10 @@ funroll-all-loops
>  Common Report Var(flag_unroll_all_loops) Optimization
>  Perform loop unrolling for all loops
>
> +funroll-codesize-limit
> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
> +
>  ; Nonzero means that loop optimizer may assume that the induction variables
>  ; that control loops do not overflow and that the loops with nontrivial
>  ; exit condition are not infinite
> Index: gcc/cfgloop.c
> ===================================================================
> --- gcc/cfgloop.c       (revision 189893)
> +++ gcc/cfgloop.c       (working copy)
> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop)
>    return edges;
>  }
>
> -/* Counts the number of conditional branches inside LOOP.  */
> +/* Determine if INSN is a floating point set.  */
>
> -unsigned
> -num_loop_branches (const struct loop *loop)
> +static bool
> +insn_has_fp_set (rtx insn)
>  {
> -  unsigned i, n;
> -  basic_block * body;
> +  int i;
> +  rtx pat = PATTERN(insn);
> +  if (GET_CODE (pat) == SET)
> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
> +  else if (GET_CODE (pat) == PARALLEL)
> +    {
> +      for (i = 0; i < XVECLEN (pat, 0); i++)
> +        {
> +          rtx sub = XVECEXP (pat, 0, i);
> +          if (GET_CODE (sub) == SET)
> +            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
> +        }
> +    }
> +  return false;
> +}
>
> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
> +   the number of conditional branch instructions, calls and fp instructions,
> +   as well as the average number of branches executed per iteration.  */
>
> +void
> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
> +{
> +  unsigned i, nbranch;
> +  gcov_type weighted_nbranch;
> +  bool has_call, has_fp;
> +  basic_block * body, bb;
> +  rtx insn;
> +  gcov_type header_count = loop->header->count;
> +
> +  nbranch = weighted_nbranch = 0;
> +  has_call = has_fp = false;
> +
>    body = get_loop_body (loop);
> -  n = 0;
>    for (i = 0; i < loop->num_nodes; i++)
> -    if (EDGE_COUNT (body[i]->succs) >= 2)
> -      n++;
> +    {
> +      bb = body[i];
> +
> +      if (EDGE_COUNT (bb->succs) >= 2)
> +        {
> +          nbranch++;
> +
> +          /* If this block is executed less frequently than the header (loop
> +             entry), then it is weighted based on its execution count, which
> +             will be turned into a ratio compared to the loop header below. */
> +          if (bb->count < header_count)
> +            weighted_nbranch += bb->count;
> +
> +          /* When it is executed more frequently than the header (i.e. it is
> +             in a nested inner loop), simply weight the branch the same as the
> +             header execution count, so that it will contribute 1 branch to
> +             the ratio computed below. */
> +          else
> +            weighted_nbranch += header_count;
> +        }
> +
> +      /* No need to iterate through the instructions below if
> +         both flags have already been set.  */
> +      if (has_call && has_fp)
> +        continue;
> +
> +      FOR_BB_INSNS (bb, insn)
> +        {
> +          if (!INSN_P (insn))
> +            continue;
> +
> +          if (!has_call)
> +            has_call = CALL_P (insn);
> +
> +          if (!has_fp)
> +            has_fp = insn_has_fp_set (insn);
> +        }
> +    }
>    free (body);
>
> -  return n;
> +  desc->num_branches = nbranch;
> +  /* Now divide the weights computed above by the loop header execution count,
> +     to compute the average number of branches through the loop. By adding
> +     header_count/2 to the numerator we round to nearest with integer
> +     division.  */
> +  if (header_count != 0)
> +    desc->av_num_branches
> +        = (weighted_nbranch + header_count/2) / header_count;
> +  else
> +    desc->av_num_branches = 0;
> +  desc->has_call = has_call;
> +  desc->has_fp = has_fp;
>  }
>
>  /* Adds basic block BB to LOOP.  */
> Index: gcc/cfgloop.h
> ===================================================================
> --- gcc/cfgloop.h       (revision 189893)
> +++ gcc/cfgloop.h       (working copy)
> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
>
>  extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
>  edge single_exit (const struct loop *);
> -extern unsigned num_loop_branches (const struct loop *);
>
>  extern edge loop_preheader_edge (const struct loop *);
>  extern edge loop_latch_edge (const struct loop *);
> @@ -366,7 +365,8 @@ struct rtx_iv
>  };
>
>  /* The description of an exit from the loop and of the number of iterations
> -   till we take the exit.  */
> +   till we take the exit. Also includes other information used primarily
> +   by the loop unroller.  */
>
>  struct niter_desc
>  {
> @@ -407,6 +407,18 @@ struct niter_desc
>
>    /* The number of iterations of the loop.  */
>    rtx niter_expr;
> +
> +  /* The number of branches in the loop.  */
> +  unsigned num_branches;
> +
> +  /* The number of executed branches per iteration.  */
> +  unsigned av_num_branches;
> +
> +  /* Whether the loop contains a call instruction.  */
> +  bool has_call;
> +
> +  /* Whether the loop contains fp instructions.  */
> +  bool has_fp;
>  };
>
>  extern void iv_analysis_loop_init (struct loop *);
> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
>
>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>  extern void free_simple_loop_desc (struct loop *loop);
> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>
>  static inline struct niter_desc *
>  simple_loop_desc (struct loop *loop)
> Index: gcc/params.def
> ===================================================================
> --- gcc/params.def      (revision 189893)
> +++ gcc/params.def      (working copy)
> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>          "max-completely-peel-loop-nest-depth",
>          "The maximum depth of a loop nest we completely peel",
>          8, 0, 0)
> +/* The maximum code size estimate under which loop unrolling and peeling
> + * is allowed in a profile feedback compile. This currently applies to loops
> + * with non-constant iteration counts and no floating point computations.  */
> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
> +        "unrollpeel-codesize-threshold",
> +        "Maximum profile-based code size footprint estimate for loop unrolling "
> +        "and peeling",
> +        15000, 0, 0)
> +/* The maximum ratio of total profiled execution counts to loop entry block
> +   count that must be exceeded to ignore most code size limits when unrolling
> +   and peeling.  */
> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
> +        "unrollpeel-hotness-threshold",
> +        "Maximum ratio of total profiled execution count to loop entry "
> +        "block count under which most codesize limits for unrolling and "
> +        "peeling will be ignored",
> +        100, 1, 0)
>
>  /* The maximum number of insns of an unswitched loop.  */
>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
> Index: gcc/gcov-dump.c
> ===================================================================
> --- gcc/gcov-dump.c     (revision 189893)
> +++ gcc/gcov-dump.c     (working copy)
> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
>      {
>        printf ("\n");
>        print_prefix (filename, 0, 0);
> -      printf ("\t\tcounts=%u, runs=%u",
> -             summary.ctrs[ix].num, summary.ctrs[ix].runs);
> +      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
> +              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
> +             summary.ctrs[ix].num,
> +             summary.ctrs[ix].num_hot_counters,
> +             (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
> +             summary.ctrs[ix].runs);
>
>        printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
>               (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
>
> --
> This patch is available for review at http://codereview.appspot.com/6351086
Teresa Johnson - Dec. 11, 2012, 5:13 p.m.
Ping.
Teresa

On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote:
> This patch was proposed a while back, and the new working set program
> summary was split out and committed separately. I would like to
> see if the loop unroller part of the patch can now go in. For the
> previous versions of the patch and related discussion, see:
>
> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
> and
> http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html
>
> Use the new working set information in the FDO program summary for
> codesize based unroll and peel decisions, to prevent those
> optimizations from increasing code size too much when the
> program may be sensitive to icache effects.
>
> As part of this, cache additional loop analysis results in the niter_desc
> auxiliary information hanging off the loop structure to reduce
> redundant analyses during unrolling.
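The tiered decision described above (no limiting for small-footprint programs, hotness-scaled limiting in between, unrolling/peeling disabled for very large footprints) can be sketched as a standalone function. This is a simplified illustration, not the GCC internals; the function name is made up here, and the default values 15000 and 100 merely mirror the unrollpeel-codesize-threshold and unrollpeel-hotness-threshold params introduced by the patch:

```c
#include <limits.h>

/* Simplified sketch of the divisor computed by the patch's
   code_size_limit_factor.  num_hot_counters approximates the program's
   hot code footprint (from the FDO working set summary); size_threshold
   plays the role of unrollpeel-codesize-threshold, and
   hotness_ratio_threshold the role of unrollpeel-hotness-threshold.  */
int
limit_factor_sketch (unsigned num_hot_counters, unsigned size_threshold,
                     long long sum_all, long long header_count,
                     long long hotness_ratio_threshold)
{
  /* 1) Small estimated footprint: do not limit unrolling/peeling.  */
  if (num_hot_counters <= size_threshold)
    return 1;

  /* 3) At twice the threshold or more (or no profile count on the loop
     entry), return a huge divisor that forces the unroll factor to 0.  */
  if (num_hot_counters >= size_threshold * 2 || header_count <= 0)
    return INT_MAX;

  /* 2) In between: very hot loops (small sum_all / header_count ratio)
     escape limiting; colder loops get a proportionally larger divisor.  */
  long long ratio = sum_all / header_count;
  return ratio > hotness_ratio_threshold
         ? (int) (ratio / hotness_ratio_threshold) : 1;
}
```

The returned value divides the PARAM_MAX_UNROLLED_INSNS-style budgets, so a divisor of 1 leaves unrolling untouched while INT_MAX effectively disables it.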
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>
> Thanks,
> Teresa
>
> 2012-11-19  Teresa Johnson  <tejohnson@google.com>
>
>         * gcc/doc/invoke.texi: Document new options
>         -fpeel-codesize-limit and -funroll-codesize-limit, and new params
>         codesize-hotness-threshold and unrollpeel-hotness-threshold.
>         * gcc/loop-unroll.c (code_size_limit_factor): New function.
>         (decide_unroll_runtime_iterations): Call code_size_limit_factor
>         to control the unroll factor, and retrieve number of branches from
>         niter_desc instead of via function that walks loop.
>         (decide_peel_simple, decide_unroll_stupid): Ditto.
>         * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
>         function, and add guards to enable this function to work for the
>         outermost loop.
>         * gcc/common.opt: Add -fpeel-codesize-limit and
>         -funroll-codesize-limit.
>         * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
>         (num_loop_branches): Remove.
>         * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
>         additional loop analysis information.
>         (num_loop_branches): Remove.
>         (analyze_loop_insns): Declare.
>         * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
>         (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
>
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi     (revision 193626)
> +++ doc/invoke.texi     (working copy)
> @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays -fprofile-report @gol
>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>  -fprofile-generate=@var{path} @gol
> @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}.
>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>  -ftree-switch-conversion -ftree-tail-merge @gol
>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>  -fwhole-program -fwpa -fuse-linker-plugin @gol
> @@ -8771,6 +8771,14 @@ the loop is entered.  This usually makes programs
>  @option{-funroll-all-loops} implies the same options as
>  @option{-funroll-loops}.
>
> +@item -funroll-codesize-limit
> +@opindex funroll-codesize-limit
> +Limit loop unrolling of non-const non-FP loops in a profile feedback
> +compilation under estimates of a large code footprint. Enabled by default
> +with @option{-fprofile-use}. Code size and execution weight thresholds are
> +controlled by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
>  @item -fpeel-loops
>  @opindex fpeel-loops
>  Peels loops for which there is enough information that they do not
> @@ -8779,6 +8787,14 @@ roll much (from profile feedback).  It also turns
>
>  Enabled with @option{-fprofile-use}.
>
> +@item -fpeel-codesize-limit
> +@opindex fpeel-codesize-limit
> +Limit loop peeling of non-const non-FP loops in a profile feedback
> +compilation under estimates of a large code footprint. Enabled by default
> +with @option{-fprofile-use}. Code size and execution weight thresholds are
> +controlled by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
>  @item -fmove-loop-invariants
>  @opindex fmove-loop-invariants
>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
> @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s
>  @item max-completely-peel-loop-nest-depth
>  The maximum depth of a loop nest suitable for complete peeling.
>
> +@item unrollpeel-codesize-threshold
> +Maximum profile-based code size footprint estimate for loop unrolling and
> +peeling.
> +
> +@item unrollpeel-hotness-threshold
> +Maximum ratio of total execution count to loop entry block count under which
> +most profile-based code size estimates will be ignored for loop unrolling and
> +peeling.
> +
>  @item max-unswitch-insns
>  The maximum number of insns of an unswitched loop.
>
> Index: loop-unroll.c
> ===================================================================
> --- loop-unroll.c       (revision 193626)
> +++ loop-unroll.c       (working copy)
> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "hashtab.h"
>  #include "recog.h"
>  #include "target.h"
> +#include "gcov-io.h"
>  #include "dumpfile.h"
>
>  /* This pass performs loop unrolling and peeling.  We only perform these
> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc
>                                              basic_block);
>  static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> +   based on code footprint estimates. Returns the codesize-based factor to be
> +   divided into the max instructions in an unrolled or peeled loop:
> +   1) For size <= threshold, do not limit (by returning 1).
> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> +      unrolled instructions according to loop hotness.
> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
> +      INT_MAX).  */
> +
> +static int
> +code_size_limit_factor(struct loop *loop)
> +{
> +  unsigned size_threshold, num_hot_counters;
> +  struct niter_desc *desc = get_simple_loop_desc (loop);
> +  gcov_type sum_to_header_ratio;
> +  int hotness_ratio_threshold;
> +  int limit_factor;
> +  gcov_working_set_t *ws;
> +
> +  ws = find_working_set(999);
> +  if (! ws)
> +    return 1;
> +  num_hot_counters = ws->num_counters;
> +
> +  /* First check if the application has a large codesize footprint.
> +     This is estimated from FDO profile summary information for the
> +     program, where the num_hot_counters indicates the number of hottest
> +     counters (blocks) that compose most of the execution time of
> +     the program. A large value would indicate a large flat execution
> +     profile where icache misses may be a concern.  */
> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> +  if (!profile_info
> +      || num_hot_counters <= size_threshold
> +      || !profile_info->sum_all)
> +    return 1;
> +
> +  /* Next, exclude some loops where unrolling/peeling may be more
> +     important to overall performance.  */
> +
> +  /* Ignore FP loops, which are more likely to benefit heavily from
> +     unrolling. */
> +  if (desc->has_fp)
> +    return 1;
> +
> +  /* Next, set the value of the codesize-based unroll factor divisor which in
> +     most loops will need to be set to a value that will reduce or eliminate
> +     unrolling/peeling.  */
> +  if (num_hot_counters < size_threshold * 2
> +      && loop->header->count > 0)
> +    {
> +      /* For applications that are less than twice the codesize limit, allow
> +         limited unrolling for very hot loops.  */
> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> +      hotness_ratio_threshold
> +        = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> +      /* When the profile count sum to loop entry header ratio is smaller than
> +         the threshold (i.e. the loop entry is hot enough), the divisor is set
> +         to 1 so the unroll/peel factor is not reduced. When it is bigger
> +         than the ratio, increase the divisor by the amount this ratio
> +         is over the threshold, which will quickly reduce the unroll/peel
> +         factor to zero as the loop's hotness reduces.  */
> +      if (sum_to_header_ratio > hotness_ratio_threshold)
> +        {
> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> +          gcc_assert (limit_factor >= 1);
> +        }
> +      else
> +        limit_factor = 1;
> +    }
> +  else
> +    /* For applications that are at least twice the codesize limit, set
> +       the divisor to a large value that will force the unroll factor to 0.  */
> +    limit_factor = INT_MAX;
> +
> +  return limit_factor;
> +}
> +
>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>  void
>  unroll_and_peel_loops (int flags)
> @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>  {
>    unsigned nunroll, nunroll_by_av, i;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>    double_int iterations;
>
>    if (!(flags & UAP_UNROLL))
> @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>              "\n;; Considering unrolling loop with runtime "
>              "computable number of iterations\n");
>
> +  if (flag_unroll_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from unrolling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +        {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max unrolled insns by divisor %d\n", limit_factor);
> +        }
> +    }
> +
>    /* nunroll = total number of copies of the original loop body in
>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> +            / loop->ninsns;
> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> +                  / limit_factor / loop->av_ninsns;
>    if (nunroll > nunroll_by_av)
>      nunroll = nunroll_by_av;
>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags)
>  {
>    unsigned npeel;
>    double_int iterations;
> +  struct niter_desc *desc;
> +  int limit_factor = 1;
>
>    if (!(flags & UAP_PEEL))
>      {
> @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags)
>    if (dump_file)
>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>
> +  if (flag_peel_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from peeling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +       {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max peeled insns by divisor %d\n", limit_factor);
> +       }
> +    }
> +
>    /* npeel = number of iterations to peel.  */
> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
> +          / loop->ninsns;
>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>
> @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags)
>      }
>
>    /* Do not simply peel loops with branches inside -- it increases number
> -     of mispredicts.
> +     of mispredicts.
>       Exception is when we do have profile and we however have good chance
>       to peel proper number of iterations loop will iterate in practice.
>       TODO: this heuristic needs tunning; while for complette unrolling
> @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags)
>       peeling it is not the case.  Also a function call inside loop is
>       also branch from branch prediction POV (and probably better reason
>       to not unroll/peel).  */
> -  if (num_loop_branches (loop) > 1
> +  desc = get_simple_loop_desc (loop);
> +  if (desc->num_branches > 1
>        && profile_status != PROFILE_READ)
>      {
>        if (dump_file)
> @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>  {
>    unsigned nunroll, nunroll_by_av, i;
>    struct niter_desc *desc;
> +  int limit_factor = 1;
>    double_int iterations;
>
>    if (!(flags & UAP_UNROLL_ALL))
> @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>    if (dump_file)
>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>
> +  if (flag_unroll_codesize_limit)
> +    {
> +      /* Determine whether to limit code size growth from unrolling,
> +         using FDO profile summary information that gives an
> +         estimated number of executed blocks.  */
> +      limit_factor = code_size_limit_factor (loop);
> +      if (dump_file && limit_factor > 1)
> +       {
> +          fprintf (dump_file,
> +                   ";; Due to large code size footprint estimate, limit "
> +                   "max unrolled insns by divisor %d\n", limit_factor);
> +       }
> +    }
> +
>    /* nunroll = total number of copies of the original loop body in
>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
> -  nunroll_by_av
> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
> +            / loop->ninsns;
> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
> +                  / limit_factor / loop->av_ninsns;
>    if (nunroll > nunroll_by_av)
>      nunroll = nunroll_by_av;
>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
> @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>       of mispredicts.
>       TODO: this heuristic needs tunning; call inside the loop body
>       is also relatively good reason to not unroll.  */
> -  if (num_loop_branches (loop) > 1)
> +  if (desc->num_branches > 1)
>      {
>        if (dump_file)
>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
> Index: loop-iv.c
> ===================================================================
> --- loop-iv.c   (revision 193626)
> +++ loop-iv.c   (working copy)
> @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop)
>    /* At least desc->infinite is not always initialized by
>       find_simple_loop_exit.  */
>    desc = XCNEW (struct niter_desc);
> -  iv_analysis_loop_init (loop);
> -  find_simple_exit (loop, desc);
> +  if (loop->latch != EXIT_BLOCK_PTR)
> +    {
> +      iv_analysis_loop_init (loop);
> +      find_simple_exit (loop, desc);
> +    }
> +  analyze_loop_insns (loop, desc);
>    loop->aux = desc;
>
>    if (desc->simple_p && (desc->assumptions || desc->infinite))
> Index: common.opt
> ===================================================================
> --- common.opt  (revision 193626)
> +++ common.opt  (working copy)
> @@ -1573,6 +1573,10 @@ fpcc-struct-return
>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>  Return small aggregates in memory, not registers
>
> +fpeel-codesize-limit
> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
> +
>  fpeel-loops
>  Common Report Var(flag_peel_loops) Optimization
>  Perform loop peeling
> @@ -2138,6 +2142,10 @@ funroll-all-loops
>  Common Report Var(flag_unroll_all_loops) Optimization
>  Perform loop unrolling for all loops
>
> +funroll-codesize-limit
> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
> +
>  ; Nonzero means that loop optimizer may assume that the induction variables
>  ; that control loops do not overflow and that the loops with nontrivial
>  ; exit condition are not infinite
> Index: cfgloop.c
> ===================================================================
> --- cfgloop.c   (revision 193626)
> +++ cfgloop.c   (working copy)
> @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop)
>    return edges;
>  }
>
> -/* Counts the number of conditional branches inside LOOP.  */
> +/* Determine if INSN is a floating point set.  */
>
> -unsigned
> -num_loop_branches (const struct loop *loop)
> +static bool
> +insn_has_fp_set(rtx insn)
>  {
> -  unsigned i, n;
> -  basic_block * body;
> +  int i;
> +  rtx pat = PATTERN(insn);
> +  if (GET_CODE (pat) == SET)
> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
> +  else if (GET_CODE (pat) == PARALLEL)
> +    {
> +      for (i = 0; i < XVECLEN (pat, 0); i++)
> +        {
> +          rtx sub = XVECEXP (pat, 0, i);
> +          if (GET_CODE (sub) == SET
> +              && FLOAT_MODE_P (GET_MODE (SET_DEST (sub))))
> +            return true;
> +        }
> +    }
> +  return false;
> +}
>
> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
> +   the number of conditional branch instructions, calls and fp instructions,
> +   as well as the average number of branches executed per iteration.  */
>
> +void
> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
> +{
> +  unsigned i, nbranch;
> +  gcov_type weighted_nbranch;
> +  bool has_call, has_fp;
> +  basic_block * body, bb;
> +  rtx insn;
> +  gcov_type header_count = loop->header->count;
> +
> +  nbranch = weighted_nbranch = 0;
> +  has_call = has_fp = false;
> +
>    body = get_loop_body (loop);
> -  n = 0;
>    for (i = 0; i < loop->num_nodes; i++)
> -    if (EDGE_COUNT (body[i]->succs) >= 2)
> -      n++;
> +    {
> +      bb = body[i];
> +
> +      if (EDGE_COUNT (bb->succs) >= 2)
> +        {
> +          nbranch++;
> +
> +          /* If this block is executed less frequently than the header (loop
> +             entry), then it is weighted based on its execution count, which
> +             will be turned into a ratio compared to the loop header below. */
> +          if (bb->count < header_count)
> +            weighted_nbranch += bb->count;
> +
> +          /* When it is executed more frequently than the header (i.e. it is
> +             in a nested inner loop), simply weight the branch the same as the
> +             header execution count, so that it will contribute 1 branch to
> +             the ratio computed below. */
> +          else
> +            weighted_nbranch += header_count;
> +        }
> +
> +      /* No need to iterate through the instructions below if
> +         both flags have already been set.  */
> +      if (has_call && has_fp)
> +        continue;
> +
> +      FOR_BB_INSNS (bb, insn)
> +        {
> +          if (!INSN_P (insn))
> +            continue;
> +
> +          if (!has_call)
> +            has_call = CALL_P (insn);
> +
> +          if (!has_fp)
> +            has_fp = insn_has_fp_set (insn);
> +        }
> +    }
>    free (body);
>
> -  return n;
> +  desc->num_branches = nbranch;
> +  /* Now divide the weights computed above by the loop header execution count,
> +     to compute the average number of branches through the loop. By adding
> +     header_count/2 to the numerator we round to nearest with integer
> +     division.  */
> +  if (header_count != 0)
> +    desc->av_num_branches
> +        = (weighted_nbranch + header_count/2) / header_count;
> +  else
> +    desc->av_num_branches = 0;
> +  desc->has_call = has_call;
> +  desc->has_fp = has_fp;
>  }
>
>  /* Adds basic block BB to LOOP.  */
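Aside for readers of the cfgloop.c hunk above: the `(weighted_nbranch + header_count/2) / header_count` expression used for av_num_branches is ordinary round-to-nearest integer division. A minimal standalone illustration (the function name here is illustrative, not from the patch):

```c
/* Round-to-nearest integer division: adding half the divisor before the
   truncating division rounds .5 cases up, mirroring how av_num_branches
   is derived from the weighted branch counts.  A zero divisor (loop
   header never executed) yields 0, as in the patch.  */
long long
div_round_nearest (long long numerator, long long divisor)
{
  return divisor != 0 ? (numerator + divisor / 2) / divisor : 0;
}
```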
> Index: cfgloop.h
> ===================================================================
> --- cfgloop.h   (revision 193626)
> +++ cfgloop.h   (working copy)
> @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order
>  extern vec<edge> get_loop_exit_edges (const struct loop *);
>  extern edge single_exit (const struct loop *);
>  extern edge single_likely_exit (struct loop *loop);
> -extern unsigned num_loop_branches (const struct loop *);
>
>  extern edge loop_preheader_edge (const struct loop *);
>  extern edge loop_latch_edge (const struct loop *);
> @@ -365,7 +364,8 @@ struct rtx_iv
>  };
>
>  /* The description of an exit from the loop and of the number of iterations
> -   till we take the exit.  */
> +   till we take the exit. Also includes other information used primarily
> +   by the loop unroller.  */
>
>  struct niter_desc
>  {
> @@ -403,6 +403,18 @@ struct niter_desc
>
>    /* The number of iterations of the loop.  */
>    rtx niter_expr;
> +
> +  /* The number of branches in the loop.  */
> +  unsigned num_branches;
> +
> +  /* The number of executed branches per iteration.  */
> +  unsigned av_num_branches;
> +
> +  /* Whether the loop contains a call instruction.  */
> +  bool has_call;
> +
> +  /* Whether the loop contains fp instructions.  */
> +  bool has_fp;
>  };
>
>  extern void iv_analysis_loop_init (struct loop *);
> @@ -416,6 +428,7 @@ extern void iv_analysis_done (void);
>
>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>  extern void free_simple_loop_desc (struct loop *loop);
> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>
>  static inline struct niter_desc *
>  simple_loop_desc (struct loop *loop)
> Index: params.def
> ===================================================================
> --- params.def  (revision 193626)
> +++ params.def  (working copy)
> @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>          "max-completely-peel-loop-nest-depth",
>          "The maximum depth of a loop nest we completely peel",
>          8, 0, 0)
> +/* The maximum code size estimate under which loop unrolling and peeling
> + * is allowed in a profile feedback compile. This currently applies to loops
> + * with non-constant iteration counts and no floating point computations.  */
> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
> +        "unrollpeel-codesize-threshold",
> +        "Maximum profile-based code size footprint estimate for loop unrolling "
> +        "and peeling",
> +        15000, 0, 0)
> +/* The maximum ratio of total profiled execution counts to loop entry block
> +   count that must be exceeded to ignore most code size limits when unrolling
> +   and peeling.  */
> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
> +        "unrollpeel-hotness-threshold",
> +        "Maximum ratio of total profiled execution count to loop entry "
> +        "block count under which most codesize limits for unrolling and "
> +        "peeling will be ignored",
> +        100, 1, 0)
>
>  /* The maximum number of insns of an unswitched loop.  */
>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
>
> On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote:
>> Updated patch based on review feedback.
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>
>> Teresa
>>
>>
>> Original patch description:
>>
>> Ports some patches related to improving FDO program summary information
>> and using it to guide loop unrolling from google branches to mainline.
>> The patch is enhanced to add additional summary information to aid
>> in determining hot/cold decisions.
>>
>> The original patch description is at:
>>   http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
>> and further discussion about incorporating onto mainline is at:
>>   http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414
>>
>> Honza, can you take a look to see if this patch meets your needs?
>>
>> Full description:
>>
>> This patch adds new program summary information to the gcov
>> profile files that indicate how many profiled counts compose
>> the majority of the program's execution time. This is used to
>> provide an indication of the overall code size of the program.
>>
>> The new profile summary information is then used to guide
>> codesize based unroll and peel decisions, to prevent those
>> optimizations from increasing code size too much when the
>> program may be sensitive to icache effects.
>>
>> This patch also pulls in dependent portions of google/main r187660 that cache
>> additional loop analysis results in the niter_desc auxiliary information
>> hanging off the loop structure (the optimization portions of that
>> change are not included here, and have an outstanding review request
>> for mainline).
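The niter_desc caching mentioned above follows the usual aux-pointer pattern: compute the loop analysis once, stash it on the loop, and let every later unroll/peel decision reuse it instead of re-walking the loop body. A rough sketch under assumed, simplified types (these structs and names are illustrative, not the real cfgloop.h definitions):

```c
#include <stdlib.h>

/* Illustrative stand-ins for struct loop / struct niter_desc.  */
struct desc_sketch { unsigned num_branches; };
struct loop_sketch { struct desc_sketch *aux; };

int analyses_run = 0;   /* counts how often the "analysis" actually runs */

/* Compute-once accessor, in the spirit of get_simple_loop_desc: the
   first call runs the analysis and caches it on loop->aux; subsequent
   calls return the cached result without re-analyzing.  */
struct desc_sketch *
get_desc_sketch (struct loop_sketch *loop)
{
  if (loop->aux)
    return loop->aux;                 /* cache hit: no re-analysis */
  struct desc_sketch *d = calloc (1, sizeof *d);
  d->num_branches = 2;                /* stand-in for analyze_loop_insns */
  analyses_run++;
  loop->aux = d;
  return d;
}
```

This is why the patch can replace the per-decision num_loop_branches walk with a field read: decide_peel_simple and decide_unroll_stupid both consult the same cached descriptor.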
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>
>> Thanks,
>> Teresa
>>
>> 2012-07-26  Teresa Johnson  <tejohnson@google.com>
>>
>>         * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
>>         (gcov_compute_cutoff_values): Ditto.
>>         (gcov_exit): Call gcov_compute_cutoff_values and merge new summary
>>         information.
>>         * gcc/doc/invoke.texi (roll much): Document new options
>>         -fpeel-codesize-limit and -funroll-codesize-limit, and new params
>>         codesize-hotness-threshold and unrollpeel-hotness-threshold.
>>         * gcc/gcov-io.c (gcov_write_summary): Write new summary info.
>>         (gcov_read_summary): Read new summary info.
>>         * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
>>         (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
>>         hot_cutoff_value.
>>         * gcc/loop-unroll.c (code_size_limit_factor): New function.
>>         (decide_unroll_runtime_iterations): Call code_size_limit_factor
>>         to control the unroll factor, and retrieve number of branches from
>>         niter_desc instead of via function that walks loop.
>>         (decide_peel_simple, decide_unroll_stupid): Ditto.
>>         * gcc/coverage.c (read_counts_file): Propagate new summary info.
>>         * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
>>         function, and add guards to enable this function to work for the
>>         outermost loop.
>>         * gcc/common.opt: Add -fpeel-codesize-limit and
>>         -funroll-codesize-limit.
>>         * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
>>         (num_loop_branches): Remove.
>>         * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
>>         additional loop analysis information.
>>         (num_loop_branches): Remove.
>>         (analyze_loop_insns): Declare.
>>         * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
>>         (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
>>         * gcc/gcov-dump.c (tag_summary): Dump new summary info.
>>
>> Index: libgcc/libgcov.c
>> ===================================================================
>> --- libgcc/libgcov.c    (revision 189893)
>> +++ libgcc/libgcov.c    (working copy)
>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>>    return 1;
>>  }
>>
>> +/* Used by qsort to sort gcov values in descending order.  */
>> +
>> +static int
>> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
>> +{
>> +  const gcov_type a = *(gcov_type const *)pa;
>> +  const gcov_type b = *(gcov_type const *)pb;
>> +
>> +  if (b > a)
>> +    return 1;
>> +  else if (b == a)
>> +    return 0;
>> +  else
>> +    return -1;
>> +}
>> +
>> +/* Determines the number of counters required to cover a given percentage
>> +   of the total sum of execution counts in the summary, which is then also
>> +   recorded in SUM.  */
>> +
>> +static void
>> +gcov_compute_cutoff_values (struct gcov_summary *sum)
>> +{
>> +  struct gcov_info *gi_ptr;
>> +  const struct gcov_fn_info *gfi_ptr;
>> +  const struct gcov_ctr_info *ci_ptr;
>> +  struct gcov_ctr_summary *cs_ptr;
>> +  unsigned t_ix, f_ix, i, ctr_info_ix, index;
>> +  gcov_unsigned_t c_num;
>> +  gcov_type *value_array;
>> +  gcov_type cum, cum_cutoff;
>> +  char *cutoff_str;
>> +  unsigned cutoff_perc;
>> +
>> +#define CUM_CUTOFF_PERCENT_TIMES_10 999
>> +  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
>> +  if (cutoff_str && strlen (cutoff_str))
>> +    cutoff_perc = atoi (cutoff_str);
>> +  else
>> +    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
>> +
>> +  /* This currently only applies to arc counters.  */
>> +  t_ix = GCOV_COUNTER_ARCS;
>> +
>> +  /* First check if there are any counts recorded for this counter.  */
>> +  cs_ptr = &(sum->ctrs[t_ix]);
>> +  if (!cs_ptr->num)
>> +    return;
>> +
>> +  /* Determine the cumulative counter value at the specified cutoff
>> +     percentage and record the percentage for use by gcov consumers.
>> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
>> +     and if so, do the divide first.  */
>> +  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
>> +    /* Overflow, do the divide first.  */
>> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
>> +  else
>> +    /* Otherwise multiply first to get the correct value for small
>> +       values of sum_all.  */
>> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
>> +
>> +  /* Next, walk through all the per-object structures and save each of
>> +     the count values in value_array.  */
>> +  index = 0;
>> +  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
>> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
>> +    {
>> +      if (!gi_ptr->merge[t_ix])
>> +        continue;
>> +
>> +      /* Find the appropriate index into the gcov_ctr_info array
>> +         for the counter we are currently working on based on the
>> +         existence of the merge function pointer for this object.  */
>> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
>> +        {
>> +          if (gi_ptr->merge[i])
>> +            ctr_info_ix++;
>> +        }
>> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
>> +        {
>> +          gfi_ptr = gi_ptr->functions[f_ix];
>> +
>> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
>> +            continue;
>> +
>> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
>> +          /* Sanity check that there are enough entries in value_array
>> +            for this function's counters. Gracefully handle the case when
>> +            there are not, in case something in the profile info is
>> +            corrupted.  */
>> +          c_num = ci_ptr->num;
>> +          if (index + c_num > cs_ptr->num)
>> +            c_num = cs_ptr->num - index;
>> +          /* Copy over this function's counter values.  */
>> +          memcpy (&value_array[index], ci_ptr->values,
>> +                  sizeof (gcov_type) * c_num);
>> +          index += c_num;
>> +        }
>> +    }
>> +
>> +  /* Sort all the counter values by descending value and finally
>> +     accumulate the values from hottest on down until reaching
>> +     the cutoff value computed earlier.  */
>> +  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
>> +         sort_by_reverse_gcov_value);
>> +  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
>> +    cum += value_array[c_num];
>> +  /* Record the number of counters required to reach the cutoff value,
>> +     as well as the counter value that caused cum to reach the cutoff.  */
>> +  cs_ptr->num_hot_counters = c_num;
>> +  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
>> +  free (value_array);
>> +}
>> +
>>  /* Dump the coverage counts. We merge with existing counts when
>>     possible, to avoid growing the .da files ad infinitum. We use this
>>     program's checksum to make sure we only accumulate whole program
>> @@ -347,6 +461,7 @@ gcov_exit (void)
>>             }
>>         }
>>      }
>> +  gcov_compute_cutoff_values (&this_prg);
>>
>>    {
>>      /* Check if the level of dirs to strip off specified. */
>> @@ -598,7 +713,12 @@ gcov_exit (void)
>>           if (gi_ptr->merge[t_ix])
>>             {
>>               if (!cs_prg->runs++)
>> -               cs_prg->num = cs_tprg->num;
>> +                cs_prg->num = cs_tprg->num;
>> +              if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
>> +              {
>> +                cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
>> +                cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
>> +              }
>>               cs_prg->sum_all += cs_tprg->sum_all;
>>               if (cs_prg->run_max < cs_tprg->run_max)
>>                 cs_prg->run_max = cs_tprg->run_max;
>> Index: gcc/doc/invoke.texi
>> ===================================================================
>> --- gcc/doc/invoke.texi (revision 189893)
>> +++ gcc/doc/invoke.texi (working copy)
>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>>  -fprefetch-loop-arrays @gol
>>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>>  -fprofile-generate=@var{path} @gol
>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>>  -ftree-switch-conversion -ftree-tail-merge @gol
>>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>>  -fwhole-program -fwpa -fuse-linker-plugin @gol
>> @@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
>>  @option{-funroll-all-loops} implies the same options as
>>  @option{-funroll-loops}.
>>
>> +@item -funroll-codesize-limit
>> +@opindex funroll-codesize-limit
>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>> +
>>  @item -fpeel-loops
>>  @opindex fpeel-loops
>>  Peels loops for which there is enough information that they do not
>> @@ -8535,6 +8543,14 @@ roll much (from profile feedback).  It also turns
>>
>>  Enabled with @option{-fprofile-use}.
>>
>> +@item -fpeel-codesize-limit
>> +@opindex fpeel-codesize-limit
>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>> +
>>  @item -fmove-loop-invariants
>>  @opindex fmove-loop-invariants
>>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
>> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s
>>  @item max-completely-peel-loop-nest-depth
>>  The maximum depth of a loop nest suitable for complete peeling.
>>
>> +@item unrollpeel-codesize-threshold
>> +Maximum profile-based code size footprint estimate for loop unrolling and
>> +peeling.
>> +
>> +@item unrollpeel-hotness-threshold
>> +Maximum ratio of total execution count to loop entry block count under which
>> +most profile-based code size estimates will be ignored for loop unrolling and
>> +peeling.
>> +
>>  @item max-unswitch-insns
>>  The maximum number of insns of an unswitched loop.
>>
>> Index: gcc/gcov-io.c
>> ===================================================================
>> --- gcc/gcov-io.c       (revision 189893)
>> +++ gcc/gcov-io.c       (working copy)
>> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
>>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>>      {
>>        gcov_write_unsigned (csum->num);
>> +      gcov_write_unsigned (csum->num_hot_counters);
>> +      gcov_write_counter (csum->hot_cutoff_value);
>>        gcov_write_unsigned (csum->runs);
>>        gcov_write_counter (csum->sum_all);
>>        gcov_write_counter (csum->run_max);
>> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
>>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>>      {
>>        csum->num = gcov_read_unsigned ();
>> +      csum->num_hot_counters = gcov_read_unsigned ();
>> +      csum->hot_cutoff_value = gcov_read_counter ();
>>        csum->runs = gcov_read_unsigned ();
>>        csum->sum_all = gcov_read_counter ();
>>        csum->run_max = gcov_read_counter ();
>> Index: gcc/gcov-io.h
>> ===================================================================
>> --- gcc/gcov-io.h       (revision 189893)
>> +++ gcc/gcov-io.h       (working copy)
>> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
>>  #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
>>  #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
>>  #define GCOV_TAG_SUMMARY_LENGTH  \
>> -       (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
>> +       (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
>>
>>  /* Counters that are collected.  */
>>  #define GCOV_COUNTER_ARCS      0  /* Arc transitions.  */
>> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
>>  struct gcov_ctr_summary
>>  {
>>    gcov_unsigned_t num;         /* number of counters.  */
>> +  gcov_unsigned_t num_hot_counters; /* number of counters to reach a given
>> +                                      percent of sum_all.  */
>> +  gcov_type hot_cutoff_value;   /* value of smallest counter included in
>> +                                   num_hot_counters.  */
>>    gcov_unsigned_t runs;                /* number of program runs */
>>    gcov_type sum_all;           /* sum of all counters accumulated.  */
>>    gcov_type run_max;           /* maximum value on a single run.  */
>> Index: gcc/loop-unroll.c
>> ===================================================================
>> --- gcc/loop-unroll.c   (revision 189893)
>> +++ gcc/loop-unroll.c   (working copy)
>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "hashtab.h"
>>  #include "recog.h"
>>  #include "target.h"
>> +#include "gcov-io.h"
>>  #include "dumpfile.h"
>>
>>  /* This pass performs loop unrolling and peeling.  We only perform these
>> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc
>>                                              basic_block);
>>  static rtx get_expansion (struct var_to_expand *);
>>
>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
>> +   based on code footprint estimates. Returns the codesize-based factor to be
>> +   divided into the max instructions in an unrolled or peeled loop:
>> +   1) For size <= threshold, do not limit (by returning 1).
>> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
>> +      unrolled instructions according to loop hotness.
>> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
>> +      INT_MAX).  */
>> +
>> +static int
>> +code_size_limit_factor (struct loop *loop)
>> +{
>> +  unsigned size_threshold;
>> +  struct niter_desc *desc = get_simple_loop_desc (loop);
>> +  gcov_type sum_to_header_ratio;
>> +  int hotness_ratio_threshold;
>> +  int limit_factor;
>> +
>> +  /* First check if the application has a large codesize footprint.
>> +     This is estimated from FDO profile summary information for the
>> +     program, where the num_hot_counters indicates the number of hottest
>> +     counters (blocks) that compose most of the execution time of
>> +     the program. A large value would indicate a large flat execution
>> +     profile where icache misses may be a concern.  */
>> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
>> +  if (!profile_info
>> +      || profile_info->num_hot_counters <= size_threshold
>> +      || !profile_info->sum_all || !loop->header->count)
>> +    return 1;
>> +
>> +  /* Next, exclude some loops where unrolling/peeling may be more
>> +     important to overall performance.  */
>> +
>> +  /* Ignore FP loops, which are more likely to benefit heavily from
>> +     unrolling. */
>> +  if (desc->has_fp)
>> +    return 1;
>> +
>> +  /* Next, set the value of the codesize-based unroll factor divisor which in
>> +     most loops will need to be set to a value that will reduce or eliminate
>> +     unrolling/peeling.  */
>> +  if (profile_info->num_hot_counters < size_threshold * 2)
>> +    {
>> +      /* For applications that are less than twice the codesize limit, allow
>> +         limited unrolling for very hot loops.  */
>> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
>> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
>> +      /* When the profile count sum to loop entry header ratio is smaller than
>> +         the threshold (i.e. the loop entry is hot enough), the divisor is set
>> +         to 1 so the unroll/peel factor is not reduced. When it is bigger
>> +         than the ratio, increase the divisor by the amount this ratio
>> +         is over the threshold, which will quickly reduce the unroll/peel
>> +         factor to zero as the loop's hotness reduces.  */
>> +      if (sum_to_header_ratio > hotness_ratio_threshold)
>> +        {
>> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
>> +          gcc_assert (limit_factor >= 1);
>> +        }
>> +      else
>> +        limit_factor = 1;
>> +    }
>> +  else
>> +    /* For applications that are at least twice the codesize limit, set
>> +       the divisor to a large value that will force the unroll factor to 0.  */
>> +    limit_factor = INT_MAX;
>> +
>> +  return limit_factor;
>> +}
>> +
>>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>>  void
>>  unroll_and_peel_loops (int flags)
>> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>>  {
>>    unsigned nunroll, nunroll_by_av, i;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_UNROLL))
>>      {
>> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>>              "\n;; Considering unrolling loop with runtime "
>>              "computable number of iterations\n");
>>
>> +  if (flag_unroll_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from unrolling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +        {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>> +        }
>> +    }
>> +
>>    /* nunroll = total number of copies of the original loop body in
>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>> +            / loop->ninsns;
>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>> +                  / limit_factor / loop->av_ninsns;
>>    if (nunroll > nunroll_by_av)
>>      nunroll = nunroll_by_av;
>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>  {
>>    unsigned npeel;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_PEEL))
>>      {
>> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags)
>>    if (dump_file)
>>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>>
>> +  if (flag_peel_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from peeling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +       {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max peeled insns by divisor %d\n", limit_factor);
>> +       }
>> +    }
>> +
>>    /* npeel = number of iterations to peel.  */
>> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
>> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
>> +          / loop->ninsns;
>>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>>
>> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>
>>    /* Do not simply peel loops with branches inside -- it increases number
>>       of mispredicts.  */
>> -  if (num_loop_branches (loop) > 1)
>> +  if (desc->num_branches > 1)
>>      {
>>        if (dump_file)
>>         fprintf (dump_file, ";; Not peeling, contains branches\n");
>> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>  {
>>    unsigned nunroll, nunroll_by_av, i;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_UNROLL_ALL))
>>      {
>> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>>    if (dump_file)
>>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>>
>> +  if (flag_unroll_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from unrolling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +       {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>> +       }
>> +    }
>> +
>>    /* nunroll = total number of copies of the original loop body in
>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>> -  nunroll_by_av
>> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>> +            / loop->ninsns;
>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>> +                  / limit_factor / loop->av_ninsns;
>>    if (nunroll > nunroll_by_av)
>>      nunroll = nunroll_by_av;
>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>
>>    /* Do not unroll loops with branches inside -- it increases number
>>       of mispredicts.  */
>> -  if (num_loop_branches (loop) > 1)
>> +  if (desc->num_branches > 1)
>>      {
>>        if (dump_file)
>>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
>> Index: gcc/coverage.c
>> ===================================================================
>> --- gcc/coverage.c      (revision 189893)
>> +++ gcc/coverage.c      (working copy)
>> @@ -247,6 +247,14 @@ read_counts_file (void)
>>           gcov_read_summary (&sum);
>>           for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
>>             {
>> +              if (summary.ctrs[ix].num_hot_counters
>> +                  < sum.ctrs[ix].num_hot_counters)
>> +                {
>> +                 summary.ctrs[ix].num_hot_counters
>> +                      = sum.ctrs[ix].num_hot_counters;
>> +                 summary.ctrs[ix].hot_cutoff_value
>> +                      = sum.ctrs[ix].hot_cutoff_value;
>> +                }
>>               summary.ctrs[ix].runs += sum.ctrs[ix].runs;
>>               summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
>>               if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
>> Index: gcc/loop-iv.c
>> ===================================================================
>> --- gcc/loop-iv.c       (revision 189893)
>> +++ gcc/loop-iv.c       (working copy)
>> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop)
>>    /* At least desc->infinite is not always initialized by
>>       find_simple_loop_exit.  */
>>    desc = XCNEW (struct niter_desc);
>> -  iv_analysis_loop_init (loop);
>> -  find_simple_exit (loop, desc);
>> +  if (loop->latch != EXIT_BLOCK_PTR)
>> +    {
>> +      iv_analysis_loop_init (loop);
>> +      find_simple_exit (loop, desc);
>> +    }
>> +  analyze_loop_insns (loop, desc);
>>    loop->aux = desc;
>>
>>    if (desc->simple_p && (desc->assumptions || desc->infinite))
>> Index: gcc/common.opt
>> ===================================================================
>> --- gcc/common.opt      (revision 189893)
>> +++ gcc/common.opt      (working copy)
>> @@ -1551,6 +1551,10 @@ fpcc-struct-return
>>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>>  Return small aggregates in memory, not registers
>>
>> +fpeel-codesize-limit
>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
>> +
>>  fpeel-loops
>>  Common Report Var(flag_peel_loops) Optimization
>>  Perform loop peeling
>> @@ -2112,6 +2116,10 @@ funroll-all-loops
>>  Common Report Var(flag_unroll_all_loops) Optimization
>>  Perform loop unrolling for all loops
>>
>> +funroll-codesize-limit
>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
>> +
>>  ; Nonzero means that loop optimizer may assume that the induction variables
>>  ; that control loops do not overflow and that the loops with nontrivial
>>  ; exit condition are not infinite
>> Index: gcc/cfgloop.c
>> ===================================================================
>> --- gcc/cfgloop.c       (revision 189893)
>> +++ gcc/cfgloop.c       (working copy)
>> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop)
>>    return edges;
>>  }
>>
>> -/* Counts the number of conditional branches inside LOOP.  */
>> +/* Determine if INSN is a floating point set.  */
>>
>> -unsigned
>> -num_loop_branches (const struct loop *loop)
>> +static bool
>> +insn_has_fp_set (rtx insn)
>>  {
>> -  unsigned i, n;
>> -  basic_block * body;
>> +  int i;
>> +  rtx pat = PATTERN (insn);
>> +  if (GET_CODE (pat) == SET)
>> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
>> +  else if (GET_CODE (pat) == PARALLEL)
>> +    {
>> +      for (i = 0; i < XVECLEN (pat, 0); i++)
>> +        {
>> +          rtx sub = XVECEXP (pat, 0, i);
>> +          if (GET_CODE (sub) == SET)
>> +            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
>> +        }
>> +    }
>> +  return false;
>> +}
>>
>> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
>> +   the number of conditional branch instructions, calls and fp instructions,
>> +   as well as the average number of branches executed per iteration.  */
>>
>> +void
>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
>> +{
>> +  unsigned i, nbranch;
>> +  gcov_type weighted_nbranch;
>> +  bool has_call, has_fp;
>> +  basic_block * body, bb;
>> +  rtx insn;
>> +  gcov_type header_count = loop->header->count;
>> +
>> +  nbranch = weighted_nbranch = 0;
>> +  has_call = has_fp = false;
>> +
>>    body = get_loop_body (loop);
>> -  n = 0;
>>    for (i = 0; i < loop->num_nodes; i++)
>> -    if (EDGE_COUNT (body[i]->succs) >= 2)
>> -      n++;
>> +    {
>> +      bb = body[i];
>> +
>> +      if (EDGE_COUNT (bb->succs) >= 2)
>> +        {
>> +          nbranch++;
>> +
>> +          /* If this block is executed less frequently than the header (loop
>> +             entry), then it is weighted based on its execution count, which
>> +             will be turned into a ratio compared to the loop header below. */
>> +          if (bb->count < header_count)
>> +            weighted_nbranch += bb->count;
>> +
>> +          /* When it is executed more frequently than the header (i.e. it is
>> +             in a nested inner loop), simply weight the branch the same as the
>> +             header execution count, so that it will contribute 1 branch to
>> +             the ratio computed below. */
>> +          else
>> +            weighted_nbranch += header_count;
>> +        }
>> +
>> +      /* No need to iterate through the instructions below if
>> +         both flags have already been set.  */
>> +      if (has_call && has_fp)
>> +        continue;
>> +
>> +      FOR_BB_INSNS (bb, insn)
>> +        {
>> +          if (!INSN_P (insn))
>> +            continue;
>> +
>> +          if (!has_call)
>> +            has_call = CALL_P (insn);
>> +
>> +          if (!has_fp)
>> +            has_fp = insn_has_fp_set (insn);
>> +        }
>> +    }
>>    free (body);
>>
>> -  return n;
>> +  desc->num_branches = nbranch;
>> +  /* Now divide the weights computed above by the loop header execution count,
>> +     to compute the average number of branches through the loop. By adding
>> +     header_count/2 to the numerator we round to nearest with integer
>> +     division.  */
>> +  if (header_count != 0)
>> +    desc->av_num_branches
>> +        = (weighted_nbranch + header_count/2) / header_count;
>> +  else
>> +    desc->av_num_branches = 0;
>> +  desc->has_call = has_call;
>> +  desc->has_fp = has_fp;
>>  }
>>
>>  /* Adds basic block BB to LOOP.  */
>> Index: gcc/cfgloop.h
>> ===================================================================
>> --- gcc/cfgloop.h       (revision 189893)
>> +++ gcc/cfgloop.h       (working copy)
>> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
>>
>>  extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
>>  edge single_exit (const struct loop *);
>> -extern unsigned num_loop_branches (const struct loop *);
>>
>>  extern edge loop_preheader_edge (const struct loop *);
>>  extern edge loop_latch_edge (const struct loop *);
>> @@ -366,7 +365,8 @@ struct rtx_iv
>>  };
>>
>>  /* The description of an exit from the loop and of the number of iterations
>> -   till we take the exit.  */
>> +   till we take the exit. Also includes other information used primarily
>> +   by the loop unroller.  */
>>
>>  struct niter_desc
>>  {
>> @@ -407,6 +407,18 @@ struct niter_desc
>>
>>    /* The number of iterations of the loop.  */
>>    rtx niter_expr;
>> +
>> +  /* The number of branches in the loop.  */
>> +  unsigned num_branches;
>> +
>> +  /* The number of executed branches per iteration.  */
>> +  unsigned av_num_branches;
>> +
>> +  /* Whether the loop contains a call instruction.  */
>> +  bool has_call;
>> +
>> +  /* Whether the loop contains fp instructions.  */
>> +  bool has_fp;
>>  };
>>
>>  extern void iv_analysis_loop_init (struct loop *);
>> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
>>
>>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>>  extern void free_simple_loop_desc (struct loop *loop);
>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>>
>>  static inline struct niter_desc *
>>  simple_loop_desc (struct loop *loop)
>> Index: gcc/params.def
>> ===================================================================
>> --- gcc/params.def      (revision 189893)
>> +++ gcc/params.def      (working copy)
>> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>>          "max-completely-peel-loop-nest-depth",
>>          "The maximum depth of a loop nest we completely peel",
>>          8, 0, 0)
>> +/* The maximum code size estimate under which loop unrolling and peeling
>> + * is allowed in a profile feedback compile. This currently applies to loops
>> + * with non-constant iteration counts and no floating point computations.  */
>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
>> +        "unrollpeel-codesize-threshold",
>> +        "Maximum profile-based code size footprint estimate for loop unrolling "
>> +        "and peeling",
>> +        15000, 0, 0)
>> +/* The maximum ratio of total profiled execution counts to loop entry block
>> +   count that must be exceeded to ignore most code size limits when unrolling
>> +   and peeling.  */
>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
>> +        "unrollpeel-hotness-threshold",
>> +        "Maximum ratio of total profiled execution count to loop entry "
>> +        "block count under which most codesize limits for unrolling and "
>> +        "peeling will be ignored",
>> +        100, 1, 0)
>>
>>  /* The maximum number of insns of an unswitched loop.  */
>>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
>> Index: gcc/gcov-dump.c
>> ===================================================================
>> --- gcc/gcov-dump.c     (revision 189893)
>> +++ gcc/gcov-dump.c     (working copy)
>> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
>>      {
>>        printf ("\n");
>>        print_prefix (filename, 0, 0);
>> -      printf ("\t\tcounts=%u, runs=%u",
>> -             summary.ctrs[ix].num, summary.ctrs[ix].runs);
>> +      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
>> +              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
>> +             summary.ctrs[ix].num,
>> +             summary.ctrs[ix].num_hot_counters,
>> +             (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
>> +             summary.ctrs[ix].runs);
>>
>>        printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
>>               (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
>>
>> --
>> This patch is available for review at http://codereview.appspot.com/6351086
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Markus Trippelsdorf - Dec. 11, 2012, 6:03 p.m.
On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote:
> Ping.
> Teresa
> 
> On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote:
> > This patch was proposed awhile back, and the new working set program
> > summary was split out and committed separately. I would like to
> > see if the loop unroller part of the patch can now go in. For the
> > previous versions of the patch and related discussion, see:
> >
> > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
> > and
> > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html
> >
> > Use the new working set information in the FDO program summary for
> > codesize based unroll and peel decisions, to prevent those
> > optimizations from increasing code size too much when the
> > program may be sensitive to icache effects.
> >
> > As part of this, cache additional loop analysis results in the niter_desc
> > auxiliary information hanging off the loop structure to reduce
> > redundant analyses during unrolling.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

I guess this is the missing piece that would fix the Firefox lto/pgo
code-size issue:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144
Jan Hubicka - Dec. 11, 2012, 6:12 p.m.
> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote:
> > Ping.
> > Teresa
> > 
> > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote:
> > > This patch was proposed awhile back, and the new working set program
> > > summary was split out and committed separately. I would like to
> > > see if the loop unroller part of the patch can now go in. For the
> > > previous versions of the patch and related discussion, see:
> > >
> > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
> > > and
> > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html
> > >
> > > Use the new working set information in the FDO program summary for
> > > codesize based unroll and peel decisions, to prevent those
> > > optimizations from increasing code size too much when the
> > > program may be sensitive to icache effects.
> > >
> > > As part of this, cache additional loop analysis results in the niter_desc
> > > auxiliary information hanging off the loop structure to reduce
> > > redundant analyses during unrolling.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
> 
> I guess this is the missing piece that would fix the Firefox lto/pgo
> code-size issue:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144

It won't help in this case - the problem is that too much of code is identified as hot,
similar as for spec2k GCC benchmark.  Teresa - do you have any ideas what to do here
short of re-adding the BB ratio parameter and combining both heuristics for 4.8?

Honza
> 
> -- 
> Markus
Teresa Johnson - Dec. 11, 2012, 7:28 p.m.
On Tue, Dec 11, 2012 at 10:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote:
>> > Ping.
>> > Teresa
>> >
>> > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote:
>> > > This patch was proposed awhile back, and the new working set program
>> > > summary was split out and committed separately. I would like to
>> > > see if the loop unroller part of the patch can now go in. For the
>> > > previous versions of the patch and related discussion, see:
>> > >
>> > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
>> > > and
>> > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html
>> > >
>> > > Use the new working set information in the FDO program summary for
>> > > codesize based unroll and peel decisions, to prevent those
>> > > optimizations from increasing code size too much when the
>> > > program may be sensitive to icache effects.
>> > >
>> > > As part of this, cache additional loop analysis results in the niter_desc
>> > > auxiliary information hanging off the loop structure to reduce
>> > > redundant analyses during unrolling.
>> > >
>> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>
>> I guess this is the missing piece that would fix the Firefox lto/pgo
>> code-size issue:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144
>
> It won't help in this case - the problem is that too much of code is identified as hot,
> similar as for spec2k GCC benchmark.  Teresa - do you have any ideas what to do here
> short of re-adding the BB ratio parameter and combining both heuristics for 4.8?

Right, I doubt this patch would help. I looked through the bugzilla
log and it is most likely inlining increases. What was the size of the
gcc lto/pgo binary before the change to use the histogram? Was it
close to the gcc 4.7 lto/pgo size? In that case that is a very large
increase, ~25%.

Markus, could you attach to the bug one of the gcda files so that I
can see the program summary and figure out how far off the old hot bb
threshold is from the new histogram-based one? Also, it would be good
to see the -fdump-ipa-inline dumps before and after the regression (if
necessary, the before one could be from 4_7).

I'll update the bug with the same, so that we can track the discussion there.

Thanks,
Teresa

>
> Honza
>>
>> --
>> Markus



--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson - Jan. 2, 2013, 3:55 p.m.
Ping.
Teresa

On Tue, Dec 11, 2012 at 9:13 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Ping.
> Teresa
>
> On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote:
>> This patch was proposed awhile back, and the new working set program
>> summary was split out and committed separately. I would like to
>> see if the loop unroller part of the patch can now go in. For the
>> previous versions of the patch and related discussion, see:
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
>> and
>> http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html
>>
>> Use the new working set information in the FDO program summary for
>> codesize based unroll and peel decisions, to prevent those
>> optimizations from increasing code size too much when the
>> program may be sensitive to icache effects.
>>
>> As part of this, cache additional loop analysis results in the niter_desc
>> auxiliary information hanging off the loop structure to reduce
>> redundant analyses during unrolling.
>>
>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>
>> Thanks,
>> Teresa
>>
>> 2012-11-19  Teresa Johnson  <tejohnson@google.com>
>>
>>         * gcc/doc/invoke.texi: Document new options
>>         -fpeel-codesize-limit and -funroll-codesize-limit, and new params
>>         unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
>>         * gcc/loop-unroll.c (code_size_limit_factor): New function.
>>         (decide_unroll_runtime_iterations): Call code_size_limit_factor
>>         to control the unroll factor, and retrieve number of branches from
>>         niter_desc instead of via function that walks loop.
>>         (decide_peel_simple, decide_unroll_stupid): Ditto.
>>         * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
>>         function, and add guards to enable this function to work for the
>>         outermost loop.
>>         * gcc/common.opt: Add -fpeel-codesize-limit and
>>         -funroll-codesize-limit.
>>         * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
>>         (num_loop_branches): Remove.
>>         * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
>>         additional loop analysis information.
>>         (num_loop_branches): Remove.
>>         (analyze_loop_insns): Declare.
>>         * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
>>         (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
>>
>> Index: doc/invoke.texi
>> ===================================================================
>> --- doc/invoke.texi     (revision 193626)
>> +++ doc/invoke.texi     (working copy)
>> @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>>  -fprefetch-loop-arrays -fprofile-report @gol
>>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>>  -fprofile-generate=@var{path} @gol
>> @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}.
>>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>>  -ftree-switch-conversion -ftree-tail-merge @gol
>>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>>  -fwhole-program -fwpa -fuse-linker-plugin @gol
>> @@ -8771,6 +8771,14 @@ the loop is entered.  This usually makes programs
>>  @option{-funroll-all-loops} implies the same options as
>>  @option{-funroll-loops}.
>>
>> +@item -funroll-codesize-limit
>> +@opindex funroll-codesize-limit
>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>> +
>>  @item -fpeel-loops
>>  @opindex fpeel-loops
>>  Peels loops for which there is enough information that they do not
>> @@ -8779,6 +8787,14 @@ roll much (from profile feedback).  It also turns
>>
>>  Enabled with @option{-fprofile-use}.
>>
>> +@item -fpeel-codesize-limit
>> +@opindex fpeel-codesize-limit
>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
>> +under estimates of a large code footprint. Enabled by default with
>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>> +by the @option{unrollpeel-codesize-threshold} and
>> +@option{unrollpeel-hotness-threshold} parameters.
>> +
>>  @item -fmove-loop-invariants
>>  @opindex fmove-loop-invariants
>>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
>> @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s
>>  @item max-completely-peel-loop-nest-depth
>>  The maximum depth of a loop nest suitable for complete peeling.
>>
>> +@item unrollpeel-codesize-threshold
>> +Maximum profile-based code size footprint estimate for loop unrolling and
>> +peeling.
>> +
>> +@item unrollpeel-hotness-threshold
>> +Maximum ratio of total execution count to loop entry block count under which
>> +most profile-based code size estimates will be ignored for loop unrolling and
>> +peeling.
>> +
>>  @item max-unswitch-insns
>>  The maximum number of insns of an unswitched loop.
>>
>> Index: loop-unroll.c
>> ===================================================================
>> --- loop-unroll.c       (revision 193626)
>> +++ loop-unroll.c       (working copy)
>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "hashtab.h"
>>  #include "recog.h"
>>  #include "target.h"
>> +#include "gcov-io.h"
>>  #include "dumpfile.h"
>>
>>  /* This pass performs loop unrolling and peeling.  We only perform these
>> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc
>>                                              basic_block);
>>  static rtx get_expansion (struct var_to_expand *);
>>
>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
>> +   based on code footprint estimates. Returns the codesize-based factor to be
>> +   divided into the max instructions in an unrolled or peeled loop:
>> +   1) For size <= threshold, do not limit (by returning 1).
>> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
>> +      unrolled instructions according to loop hotness.
>> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
>> +      INT_MAX).  */
>> +
>> +static int
>> +code_size_limit_factor (struct loop *loop)
>> +{
>> +  unsigned size_threshold, num_hot_counters;
>> +  struct niter_desc *desc = get_simple_loop_desc (loop);
>> +  gcov_type sum_to_header_ratio;
>> +  int hotness_ratio_threshold;
>> +  int limit_factor;
>> +  gcov_working_set_t *ws;
>> +
>> +  ws = find_working_set (999);
>> +  if (! ws)
>> +    return 1;
>> +  num_hot_counters = ws->num_counters;
>> +
>> +  /* First check if the application has a large codesize footprint.
>> +     This is estimated from FDO profile summary information for the
>> +     program, where the num_hot_counters indicates the number of hottest
>> +     counters (blocks) that compose most of the execution time of
>> +     the program. A large value would indicate a large flat execution
>> +     profile where icache misses may be a concern.  */
>> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
>> +  if (!profile_info
>> +      || num_hot_counters <= size_threshold
>> +      || !profile_info->sum_all)
>> +    return 1;
>> +
>> +  /* Next, exclude some loops where unrolling/peeling may be more
>> +     important to overall performance.  */
>> +
>> +  /* Ignore FP loops, which are more likely to benefit heavily from
>> +     unrolling. */
>> +  if (desc->has_fp)
>> +    return 1;
>> +
>> +  /* Next, set the value of the codesize-based unroll factor divisor which in
>> +     most loops will need to be set to a value that will reduce or eliminate
>> +     unrolling/peeling.  */
>> +  if (num_hot_counters < size_threshold * 2
>> +      && loop->header->count > 0)
>> +    {
>> +      /* For applications that are less than twice the codesize limit, allow
>> +         limited unrolling for very hot loops.  */
>> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
>> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
>> +      /* When the profile count sum to loop entry header ratio is smaller than
>> +         the threshold (i.e. the loop entry is hot enough), the divisor is set
>> +         to 1 so the unroll/peel factor is not reduced. When it is bigger
>> +         than the ratio, increase the divisor by the amount this ratio
>> +         is over the threshold, which will quickly reduce the unroll/peel
>> +         factor to zero as the loop's hotness reduces.  */
>> +      if (sum_to_header_ratio > hotness_ratio_threshold)
>> +        {
>> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
>> +          gcc_assert (limit_factor >= 1);
>> +        }
>> +      else
>> +        limit_factor = 1;
>> +    }
>> +  else
>> +    /* For applications that are at least twice the codesize limit, set
>> +       the divisor to a large value that will force the unroll factor to 0.  */
>> +    limit_factor = INT_MAX;
>> +
>> +  return limit_factor;
>> +}
>> +
>>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>>  void
>>  unroll_and_peel_loops (int flags)
>> @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>>  {
>>    unsigned nunroll, nunroll_by_av, i;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>    double_int iterations;
>>
>>    if (!(flags & UAP_UNROLL))
>> @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>>              "\n;; Considering unrolling loop with runtime "
>>              "computable number of iterations\n");
>>
>> +  if (flag_unroll_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from unrolling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +        {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>> +        }
>> +    }
>> +
>>    /* nunroll = total number of copies of the original loop body in
>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>> +            / loop->ninsns;
>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>> +                  / limit_factor / loop->av_ninsns;
>>    if (nunroll > nunroll_by_av)
>>      nunroll = nunroll_by_av;
>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>> @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags)
>>  {
>>    unsigned npeel;
>>    double_int iterations;
>> +  struct niter_desc *desc;
>> +  int limit_factor = 1;
>>
>>    if (!(flags & UAP_PEEL))
>>      {
>> @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags)
>>    if (dump_file)
>>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>>
>> +  if (flag_peel_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from peeling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +       {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max peeled insns by divisor %d\n", limit_factor);
>> +       }
>> +    }
>> +
>>    /* npeel = number of iterations to peel.  */
>> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
>> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
>> +          / loop->ninsns;
>>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>>
>> @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>      }
>>
>>    /* Do not simply peel loops with branches inside -- it increases number
>> -     of mispredicts.
>> +     of mispredicts.
>>       Exception is when we do have profile and we however have good chance
>>       to peel proper number of iterations loop will iterate in practice.
>>       TODO: this heuristic needs tunning; while for complette unrolling
>> @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags)
>>       peeling it is not the case.  Also a function call inside loop is
>>       also branch from branch prediction POV (and probably better reason
>>       to not unroll/peel).  */
>> -  if (num_loop_branches (loop) > 1
>> +  desc = get_simple_loop_desc (loop);
>> +  if (desc->num_branches > 1
>>        && profile_status != PROFILE_READ)
>>      {
>>        if (dump_file)
>> @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>  {
>>    unsigned nunroll, nunroll_by_av, i;
>>    struct niter_desc *desc;
>> +  int limit_factor = 1;
>>    double_int iterations;
>>
>>    if (!(flags & UAP_UNROLL_ALL))
>> @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>>    if (dump_file)
>>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>>
>> +  if (flag_unroll_codesize_limit)
>> +    {
>> +      /* Determine whether to limit code size growth from unrolling,
>> +         using FDO profile summary information that gives an
>> +         estimated number of executed blocks.  */
>> +      limit_factor = code_size_limit_factor (loop);
>> +      if (dump_file && limit_factor > 1)
>> +       {
>> +          fprintf (dump_file,
>> +                   ";; Due to large code size footprint estimate, limit "
>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>> +       }
>> +    }
>> +
>>    /* nunroll = total number of copies of the original loop body in
>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>> -  nunroll_by_av
>> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>> +            / loop->ninsns;
>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>> +                  / limit_factor / loop->av_ninsns;
>>    if (nunroll > nunroll_by_av)
>>      nunroll = nunroll_by_av;
>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>> @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>       of mispredicts.
>>       TODO: this heuristic needs tunning; call inside the loop body
>>       is also relatively good reason to not unroll.  */
>> -  if (num_loop_branches (loop) > 1)
>> +  if (desc->num_branches > 1)
>>      {
>>        if (dump_file)
>>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
>> Index: loop-iv.c
>> ===================================================================
>> --- loop-iv.c   (revision 193626)
>> +++ loop-iv.c   (working copy)
>> @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop)
>>    /* At least desc->infinite is not always initialized by
>>       find_simple_loop_exit.  */
>>    desc = XCNEW (struct niter_desc);
>> -  iv_analysis_loop_init (loop);
>> -  find_simple_exit (loop, desc);
>> +  if (loop->latch != EXIT_BLOCK_PTR)
>> +    {
>> +      iv_analysis_loop_init (loop);
>> +      find_simple_exit (loop, desc);
>> +    }
>> +  analyze_loop_insns (loop, desc);
>>    loop->aux = desc;
>>
>>    if (desc->simple_p && (desc->assumptions || desc->infinite))
>> Index: common.opt
>> ===================================================================
>> --- common.opt  (revision 193626)
>> +++ common.opt  (working copy)
>> @@ -1573,6 +1573,10 @@ fpcc-struct-return
>>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>>  Return small aggregates in memory, not registers
>>
>> +fpeel-codesize-limit
>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
>> +
>>  fpeel-loops
>>  Common Report Var(flag_peel_loops) Optimization
>>  Perform loop peeling
>> @@ -2138,6 +2142,10 @@ funroll-all-loops
>>  Common Report Var(flag_unroll_all_loops) Optimization
>>  Perform loop unrolling for all loops
>>
>> +funroll-codesize-limit
>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
>> +
>>  ; Nonzero means that loop optimizer may assume that the induction variables
>>  ; that control loops do not overflow and that the loops with nontrivial
>>  ; exit condition are not infinite
>> Index: cfgloop.c
>> ===================================================================
>> --- cfgloop.c   (revision 193626)
>> +++ cfgloop.c   (working copy)
>> @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop)
>>    return edges;
>>  }
>>
>> -/* Counts the number of conditional branches inside LOOP.  */
>> +/* Determine if INSN is a floating point set.  */
>>
>> -unsigned
>> -num_loop_branches (const struct loop *loop)
>> +static bool
>> +insn_has_fp_set (rtx insn)
>>  {
>> -  unsigned i, n;
>> -  basic_block * body;
>> +  int i;
>> +  rtx pat = PATTERN (insn);
>> +  if (GET_CODE (pat) == SET)
>> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
>> +  else if (GET_CODE (pat) == PARALLEL)
>> +    {
>> +      for (i = 0; i < XVECLEN (pat, 0); i++)
>> +        {
>> +          rtx sub = XVECEXP (pat, 0, i);
>> +          if (GET_CODE (sub) == SET)
>> +            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
>> +        }
>> +    }
>> +  return false;
>> +}
>>
>> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
>> +   the number of conditional branch instructions, calls and fp instructions,
>> +   as well as the average number of branches executed per iteration.  */
>>
>> +void
>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
>> +{
>> +  unsigned i, nbranch;
>> +  gcov_type weighted_nbranch;
>> +  bool has_call, has_fp;
>> +  basic_block * body, bb;
>> +  rtx insn;
>> +  gcov_type header_count = loop->header->count;
>> +
>> +  nbranch = weighted_nbranch = 0;
>> +  has_call = has_fp = false;
>> +
>>    body = get_loop_body (loop);
>> -  n = 0;
>>    for (i = 0; i < loop->num_nodes; i++)
>> -    if (EDGE_COUNT (body[i]->succs) >= 2)
>> -      n++;
>> +    {
>> +      bb = body[i];
>> +
>> +      if (EDGE_COUNT (bb->succs) >= 2)
>> +        {
>> +          nbranch++;
>> +
>> +          /* If this block is executed less frequently than the header (loop
>> +             entry), then it is weighted based on its execution count, which
>> +             will be turned into a ratio compared to the loop header below. */
>> +          if (bb->count < header_count)
>> +            weighted_nbranch += bb->count;
>> +
>> +          /* When it is executed more frequently than the header (i.e. it is
>> +             in a nested inner loop), simply weight the branch the same as the
>> +             header execution count, so that it will contribute 1 branch to
>> +             the ratio computed below. */
>> +          else
>> +            weighted_nbranch += header_count;
>> +        }
>> +
>> +      /* No need to iterate through the instructions below if
>> +         both flags have already been set.  */
>> +      if (has_call && has_fp)
>> +        continue;
>> +
>> +      FOR_BB_INSNS (bb, insn)
>> +        {
>> +          if (!INSN_P (insn))
>> +            continue;
>> +
>> +          if (!has_call)
>> +            has_call = CALL_P (insn);
>> +
>> +          if (!has_fp)
>> +            has_fp = insn_has_fp_set (insn);
>> +        }
>> +    }
>>    free (body);
>>
>> -  return n;
>> +  desc->num_branches = nbranch;
>> +  /* Now divide the weights computed above by the loop header execution count,
>> +     to compute the average number of branches through the loop. By adding
>> +     header_count/2 to the numerator we round to nearest with integer
>> +     division.  */
>> +  if (header_count != 0)
>> +    desc->av_num_branches
>> +        = (weighted_nbranch + header_count/2) / header_count;
>> +  else
>> +    desc->av_num_branches = 0;
>> +  desc->has_call = has_call;
>> +  desc->has_fp = has_fp;
>>  }
>>
>>  /* Adds basic block BB to LOOP.  */
>> Index: cfgloop.h
>> ===================================================================
>> --- cfgloop.h   (revision 193626)
>> +++ cfgloop.h   (working copy)
>> @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order
>>  extern vec<edge> get_loop_exit_edges (const struct loop *);
>>  extern edge single_exit (const struct loop *);
>>  extern edge single_likely_exit (struct loop *loop);
>> -extern unsigned num_loop_branches (const struct loop *);
>>
>>  extern edge loop_preheader_edge (const struct loop *);
>>  extern edge loop_latch_edge (const struct loop *);
>> @@ -365,7 +364,8 @@ struct rtx_iv
>>  };
>>
>>  /* The description of an exit from the loop and of the number of iterations
>> -   till we take the exit.  */
>> +   till we take the exit. Also includes other information used primarily
>> +   by the loop unroller.  */
>>
>>  struct niter_desc
>>  {
>> @@ -403,6 +403,18 @@ struct niter_desc
>>
>>    /* The number of iterations of the loop.  */
>>    rtx niter_expr;
>> +
>> +  /* The number of branches in the loop.  */
>> +  unsigned num_branches;
>> +
>> +  /* The number of executed branches per iteration.  */
>> +  unsigned av_num_branches;
>> +
>> +  /* Whether the loop contains a call instruction.  */
>> +  bool has_call;
>> +
>> +  /* Whether the loop contains fp instructions.  */
>> +  bool has_fp;
>>  };
>>
>>  extern void iv_analysis_loop_init (struct loop *);
>> @@ -416,6 +428,7 @@ extern void iv_analysis_done (void);
>>
>>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>>  extern void free_simple_loop_desc (struct loop *loop);
>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>>
>>  static inline struct niter_desc *
>>  simple_loop_desc (struct loop *loop)
>> Index: params.def
>> ===================================================================
>> --- params.def  (revision 193626)
>> +++ params.def  (working copy)
>> @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>>          "max-completely-peel-loop-nest-depth",
>>          "The maximum depth of a loop nest we completely peel",
>>          8, 0, 0)
>> +/* The maximum code size estimate under which loop unrolling and peeling
>> + * is allowed in a profile feedback compile. This currently applies to loops
>> + * with non-constant iteration counts and no floating point computations.  */
>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
>> +        "unrollpeel-codesize-threshold",
>> +        "Maximum profile-based code size footprint estimate for loop unrolling "
>> +        "and peeling",
>> +        15000, 0, 0)
>> +/* The maximum ratio of total profiled execution counts to loop entry block
>> +   count that must be exceeded to ignore most code size limits when unrolling
>> +   and peeling.  */
>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
>> +        "unrollpeel-hotness-threshold",
>> +        "Maximum ratio of total profiled execution count to loop entry "
>> +        "block count under which most codesize limits for unrolling and "
>> +        "peeling will be ignored",
>> +        100, 1, 0)
>>
>>  /* The maximum number of insns of an unswitched loop.  */
>>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
>>
>> On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote:
>>> Updated patch based on review feedback.
>>>
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>>
>>> Teresa
>>>
>>>
>>> Original patch description:
>>>
>>> Ports some patches related to improving FDO program summary information
>>> and using it to guide loop unrolling from google branches to mainline.
>>> The patch is enhanced to add additional summary information to aid
>>> in determining hot/cold decisions.
>>>
>>> The original patch description is at:
>>>   http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
>>> and further discussion about incorporating onto mainline is at:
>>>   http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414
>>>
>>> Honza, can you take a look to see if this patch meets your needs?
>>>
>>> Full description:
>>>
>>> This patch adds new program summary information to the gcov
>>> profile files that indicate how many profiled counts compose
>>> the majority of the program's execution time. This is used to
>>> provide an indication of the overall code size of the program.
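The cutoff computation behind this summary can be sketched in a standalone form. This mirrors the patch's gcov_compute_cutoff_values logic under simplifying assumptions: the function name, the flat input array, and the tenths-of-a-percent cutoff argument are illustrative stand-ins, not the library's API.

```c
#include <stdlib.h>
#include <assert.h>

typedef long long gcov_type;

/* qsort comparator: sort gcov values in descending order.  */
static int
sort_by_reverse_value (const void *pa, const void *pb)
{
  gcov_type a = *(const gcov_type *) pa;
  gcov_type b = *(const gcov_type *) pb;
  return (b > a) - (b < a);
}

/* Count how many of the hottest counters in VALUES are needed for their
   cumulative sum to reach CUTOFF_PERC tenths of a percent of the total.
   Sorts VALUES in place.  */
static unsigned
num_hot_counters (gcov_type *values, unsigned n, unsigned cutoff_perc)
{
  gcov_type sum_all = 0, cum = 0, cum_cutoff;
  unsigned i, c;

  for (i = 0; i < n; i++)
    sum_all += values[i];

  /* Guard against overflow: divide first if the multiply would wrap,
     otherwise multiply first to keep precision for small sums.  */
  if ((sum_all * cutoff_perc) / cutoff_perc != sum_all)
    cum_cutoff = sum_all / 1000 * cutoff_perc;
  else
    cum_cutoff = (sum_all * cutoff_perc) / 1000;

  qsort (values, n, sizeof (gcov_type), sort_by_reverse_value);
  for (c = 0; c < n && cum < cum_cutoff; c++)
    cum += values[c];
  return c;
}
```

With a 90% cutoff (900 tenths of a percent), a profile dominated by one counter reports a small hot-counter count, while a flat profile reports a large one.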
>>>
>>> The new profile summary information is then used to guide
>>> codesize based unroll and peel decisions, to prevent those
>>> optimizations from increasing code size too much when the
>>> program may be sensitive to icache effects.
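The way the summary throttles unrolling and peeling can be sketched as a pure function of the profile numbers. This is a simplified rendering of the patch's code_size_limit_factor: the two #defined constants stand in for the unrollpeel-codesize-threshold and unrollpeel-hotness-threshold params, and the flat argument list replaces the profile_info/loop structures.

```c
#include <limits.h>
#include <assert.h>

typedef long long gcov_type;

/* Stand-ins for the tunable params (assumed defaults).  */
#define SIZE_THRESHOLD 15000
#define HOTNESS_RATIO_THRESHOLD 100

/* Divisor applied to the max-unrolled/peeled-insns limits:
   1 leaves them unchanged, INT_MAX effectively disables the transform.  */
static int
code_size_limit_factor (gcov_type num_hot_counters, gcov_type sum_all,
                        gcov_type header_count)
{
  /* Small footprint, or no usable profile: do not limit.  */
  if (num_hot_counters <= SIZE_THRESHOLD || sum_all == 0 || header_count == 0)
    return 1;

  /* Footprint at least twice the limit: disable unrolling/peeling.  */
  if (num_hot_counters >= SIZE_THRESHOLD * 2)
    return INT_MAX;

  /* In between: scale the divisor by how cold the loop entry is
     relative to the whole program's execution count.  */
  gcov_type ratio = sum_all / header_count;
  if (ratio > HOTNESS_RATIO_THRESHOLD)
    return (int) (ratio / HOTNESS_RATIO_THRESHOLD);
  return 1;
}
```

So a very hot loop entry keeps the full unroll budget even in a large program, and the budget shrinks roughly linearly as the loop cools.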
>>>
>>> This patch also pulls in dependent portions of google/main r187660 that cache
>>> additional loop analysis results in the niter_desc auxiliary information
>>> hanging off the loop structure (the optimization portions of that
>>> change are not included here, and have an outstanding review request
>>> for mainline).
>>>
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?
>>>
>>> Thanks,
>>> Teresa
>>>
>>> 2012-07-26  Teresa Johnson  <tejohnson@google.com>
>>>
>>>         * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
>>>         (gcov_compute_cutoff_values): Ditto.
>>>         (gcov_exit): Call gcov_compute_cutoff_values and merge new summary
>>>         information.
>>>         * gcc/doc/invoke.texi: Document new options
>>>         -fpeel-codesize-limit and -funroll-codesize-limit, and new params
>>>         codesize-hotness-threshold and unrollpeel-hotness-threshold.
>>>         * gcc/gcov-io.c (gcov_write_summary): Write new summary info.
>>>         (gcov_read_summary): Read new summary info.
>>>         * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
>>>         (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
>>>         hot_cutoff_value.
>>>         * gcc/loop-unroll.c (code_size_limit_factor): New function.
>>>         (decide_unroll_runtime_iterations): Call code_size_limit_factor
>>>         to control the unroll factor, and retrieve number of branches from
>>>         niter_desc instead of via function that walks loop.
>>>         (decide_peel_simple, decide_unroll_stupid): Ditto.
>>>         * gcc/coverage.c (read_counts_file): Propagate new summary info.
>>>         * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
>>>         function, and add guards to enable this function to work for the
>>>         outermost loop.
>>>         * gcc/common.opt: Add -fpeel-codesize-limit and
>>>         -funroll-codesize-limit.
>>>         * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
>>>         (num_loop_branches): Remove.
>>>         * gcc/cfgloop.h (struct niter_desc): Added new fields to cache
>>>         additional loop analysis information.
>>>         (num_loop_branches): Remove.
>>>         (analyze_loop_insns): Declare.
>>>         * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
>>>         (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
>>>         * gcc/gcov-dump.c (tag_summary): Dump new summary info.
>>>
>>> Index: libgcc/libgcov.c
>>> ===================================================================
>>> --- libgcc/libgcov.c    (revision 189893)
>>> +++ libgcc/libgcov.c    (working copy)
>>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
>>>    return 1;
>>>  }
>>>
>>> +/* Used by qsort to sort gcov values in descending order.  */
>>> +
>>> +static int
>>> +sort_by_reverse_gcov_value (const void *pa, const void *pb)
>>> +{
>>> +  const gcov_type a = *(gcov_type const *)pa;
>>> +  const gcov_type b = *(gcov_type const *)pb;
>>> +
>>> +  if (b > a)
>>> +    return 1;
>>> +  else if (b == a)
>>> +    return 0;
>>> +  else
>>> +    return -1;
>>> +}
>>> +
>>> +/* Determines the number of counters required to cover a given percentage
>>> +   of the total sum of execution counts in the summary, which is then also
>>> +   recorded in SUM.  */
>>> +
>>> +static void
>>> +gcov_compute_cutoff_values (struct gcov_summary *sum)
>>> +{
>>> +  struct gcov_info *gi_ptr;
>>> +  const struct gcov_fn_info *gfi_ptr;
>>> +  const struct gcov_ctr_info *ci_ptr;
>>> +  struct gcov_ctr_summary *cs_ptr;
>>> +  unsigned t_ix, f_ix, i, ctr_info_ix, index;
>>> +  gcov_unsigned_t c_num;
>>> +  gcov_type *value_array;
>>> +  gcov_type cum, cum_cutoff;
>>> +  char *cutoff_str;
>>> +  unsigned cutoff_perc;
>>> +
>>> +#define CUM_CUTOFF_PERCENT_TIMES_10 999
>>> +  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
>>> +  if (cutoff_str && strlen (cutoff_str))
>>> +    cutoff_perc = atoi (cutoff_str);
>>> +  else
>>> +    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
>>> +
>>> +  /* This currently only applies to arc counters.  */
>>> +  t_ix = GCOV_COUNTER_ARCS;
>>> +
>>> +  /* First check if there are any counts recorded for this counter.  */
>>> +  cs_ptr = &(sum->ctrs[t_ix]);
>>> +  if (!cs_ptr->num)
>>> +    return;
>>> +
>>> +  /* Determine the cumulative counter value at the specified cutoff
>>> +     percentage and record the percentage for use by gcov consumers.
>>> +     Check for overflow when sum_all is multiplied by the cutoff_perc,
>>> +     and if so, do the divide first.  */
>>> +  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
>>> +    /* Overflow, do the divide first.  */
>>> +    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
>>> +  else
>>> +    /* Otherwise multiply first to get the correct value for small
>>> +       values of sum_all.  */
>>> +    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
>>> +
>>> +  /* Next, walk through all the per-object structures and save each of
>>> +     the count values in value_array.  */
>>> +  index = 0;
>>> +  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
>>> +  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
>>> +    {
>>> +      if (!gi_ptr->merge[t_ix])
>>> +        continue;
>>> +
>>> +      /* Find the appropriate index into the gcov_ctr_info array
>>> +         for the counter we are currently working on based on the
>>> +         existence of the merge function pointer for this object.  */
>>> +      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
>>> +        {
>>> +          if (gi_ptr->merge[i])
>>> +            ctr_info_ix++;
>>> +        }
>>> +      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
>>> +        {
>>> +          gfi_ptr = gi_ptr->functions[f_ix];
>>> +
>>> +          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
>>> +            continue;
>>> +
>>> +          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
>>> +          /* Sanity check that there are enough entries in value_array
>>> +            for this function's counters. Gracefully handle the case when
>>> +            there are not, in case something in the profile info is
>>> +            corrupted.  */
>>> +          c_num = ci_ptr->num;
>>> +          if (index + c_num > cs_ptr->num)
>>> +            c_num = cs_ptr->num - index;
>>> +          /* Copy over this function's counter values.  */
>>> +          memcpy (&value_array[index], ci_ptr->values,
>>> +                  sizeof (gcov_type) * c_num);
>>> +          index += c_num;
>>> +        }
>>> +    }
>>> +
>>> +  /* Sort all the counter values by descending value and finally
>>> +     accumulate the values from hottest on down until reaching
>>> +     the cutoff value computed earlier.  */
>>> +  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
>>> +         sort_by_reverse_gcov_value);
>>> +  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
>>> +    cum += value_array[c_num];
>>> +  /* Record the number of counters required to reach the cutoff value,
>>> +     as well as the counter value that caused cum to reach the cutoff.  */
>>> +  cs_ptr->num_hot_counters = c_num;
>>> +  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
>>> +  free (value_array);
>>> +}
>>> +
>>>  /* Dump the coverage counts. We merge with existing counts when
>>>     possible, to avoid growing the .da files ad infinitum. We use this
>>>     program's checksum to make sure we only accumulate whole program
>>> @@ -347,6 +461,7 @@ gcov_exit (void)
>>>             }
>>>         }
>>>      }
>>> +  gcov_compute_cutoff_values (&this_prg);
>>>
>>>    {
>>>      /* Check if the level of dirs to strip off specified. */
>>> @@ -598,7 +713,12 @@ gcov_exit (void)
>>>           if (gi_ptr->merge[t_ix])
>>>             {
>>>               if (!cs_prg->runs++)
>>> -               cs_prg->num = cs_tprg->num;
>>> +                cs_prg->num = cs_tprg->num;
>>> +              if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
>>> +              {
>>> +                cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
>>> +                cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
>>> +              }
>>>               cs_prg->sum_all += cs_tprg->sum_all;
>>>               if (cs_prg->run_max < cs_tprg->run_max)
>>>                 cs_prg->run_max = cs_tprg->run_max;
>>> Index: gcc/doc/invoke.texi
>>> ===================================================================
>>> --- gcc/doc/invoke.texi (revision 189893)
>>> +++ gcc/doc/invoke.texi (working copy)
>>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
>>>  -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
>>>  -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
>>>  -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
>>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
>>>  -fprefetch-loop-arrays @gol
>>>  -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
>>>  -fprofile-generate=@var{path} @gol
>>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
>>>  -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
>>>  -ftree-switch-conversion -ftree-tail-merge @gol
>>>  -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
>>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol
>>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
>>>  -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
>>>  -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
>>>  -fwhole-program -fwpa -fuse-linker-plugin @gol
>>> @@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
>>>  @option{-funroll-all-loops} implies the same options as
>>>  @option{-funroll-loops}.
>>>
>>> +@item -funroll-codesize-limit
>>> +@opindex funroll-codesize-limit
>>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
>>> +under estimates of a large code footprint. Enabled by default with
>>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>>> +by the @option{unrollpeel-codesize-threshold} and
>>> +@option{unrollpeel-hotness-threshold} parameters.
>>> +
>>>  @item -fpeel-loops
>>>  @opindex fpeel-loops
>>>  Peels loops for which there is enough information that they do not
>>> @@ -8535,6 +8543,14 @@ roll much (from profile feedback).  It also turns
>>>
>>>  Enabled with @option{-fprofile-use}.
>>>
>>> +@item -fpeel-codesize-limit
>>> +@opindex fpeel-codesize-limit
>>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
>>> +under estimates of a large code footprint. Enabled by default with
>>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled
>>> +by the @option{unrollpeel-codesize-threshold} and
>>> +@option{unrollpeel-hotness-threshold} parameters.
>>> +
>>>  @item -fmove-loop-invariants
>>>  @opindex fmove-loop-invariants
>>>  Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
>>> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s
>>>  @item max-completely-peel-loop-nest-depth
>>>  The maximum depth of a loop nest suitable for complete peeling.
>>>
>>> +@item unrollpeel-codesize-threshold
>>> +Maximum profile-based code size footprint estimate for loop unrolling and
>>> +peeling.
>>> +
>>> +@item unrollpeel-hotness-threshold
>>> +Maximum ratio of total execution count to loop entry block count under which
>>> +most profile-based code size estimates will be ignored for loop unrolling and
>>> +peeling.
>>> +
>>>  @item max-unswitch-insns
>>>  The maximum number of insns of an unswitched loop.
>>>
>>> Index: gcc/gcov-io.c
>>> ===================================================================
>>> --- gcc/gcov-io.c       (revision 189893)
>>> +++ gcc/gcov-io.c       (working copy)
>>> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
>>>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>>>      {
>>>        gcov_write_unsigned (csum->num);
>>> +      gcov_write_unsigned (csum->num_hot_counters);
>>> +      gcov_write_counter (csum->hot_cutoff_value);
>>>        gcov_write_unsigned (csum->runs);
>>>        gcov_write_counter (csum->sum_all);
>>>        gcov_write_counter (csum->run_max);
>>> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
>>>    for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
>>>      {
>>>        csum->num = gcov_read_unsigned ();
>>> +      csum->num_hot_counters = gcov_read_unsigned ();
>>> +      csum->hot_cutoff_value = gcov_read_counter ();
>>>        csum->runs = gcov_read_unsigned ();
>>>        csum->sum_all = gcov_read_counter ();
>>>        csum->run_max = gcov_read_counter ();
>>> Index: gcc/gcov-io.h
>>> ===================================================================
>>> --- gcc/gcov-io.h       (revision 189893)
>>> +++ gcc/gcov-io.h       (working copy)
>>> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
>>>  #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
>>>  #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
>>>  #define GCOV_TAG_SUMMARY_LENGTH  \
>>> -       (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
>>> +       (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
>>>
>>>  /* Counters that are collected.  */
>>>  #define GCOV_COUNTER_ARCS      0  /* Arc transitions.  */
>>> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
>>>  struct gcov_ctr_summary
>>>  {
>>>    gcov_unsigned_t num;         /* number of counters.  */
>>> +  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
>>> +                                      percent of sum_all.  */
>>> +  gcov_type hot_cutoff_value;   /* value of smallest counter included in
>>> +                                   num_hot_counters.  */
>>>    gcov_unsigned_t runs;                /* number of program runs */
>>>    gcov_type sum_all;           /* sum of all counters accumulated.  */
>>>    gcov_type run_max;           /* maximum value on a single run.  */
>>> Index: gcc/loop-unroll.c
>>> ===================================================================
>>> --- gcc/loop-unroll.c   (revision 189893)
>>> +++ gcc/loop-unroll.c   (working copy)
>>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
>>>  #include "hashtab.h"
>>>  #include "recog.h"
>>>  #include "target.h"
>>> +#include "gcov-io.h"
>>>  #include "dumpfile.h"
>>>
>>>  /* This pass performs loop unrolling and peeling.  We only perform these
>>> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc
>>>                                              basic_block);
>>>  static rtx get_expansion (struct var_to_expand *);
>>>
>>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
>>> +   based on code footprint estimates. Returns the codesize-based factor to be
>>> +   divided into the max instructions in an unrolled or peeled loop:
>>> +   1) For size <= threshold, do not limit (by returning 1).
>>> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
>>> +      unrolled instructions according to loop hotness.
>>> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
>>> +      INT_MAX).  */
>>> +
>>> +static int
>>> +code_size_limit_factor (struct loop *loop)
>>> +{
>>> +  unsigned size_threshold;
>>> +  struct niter_desc *desc = get_simple_loop_desc (loop);
>>> +  gcov_type sum_to_header_ratio;
>>> +  int hotness_ratio_threshold;
>>> +  int limit_factor;
>>> +
>>> +  /* First check if the application has a large codesize footprint.
>>> +     This is estimated from FDO profile summary information for the
>>> +     program, where num_hot_counters indicates the number of hottest
>>> +     counters (blocks) that compose most of the execution time of
>>> +     the program. A large value would indicate a large flat execution
>>> +     profile where icache misses may be a concern.  */
>>> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
>>> +  if (!profile_info
>>> +      || profile_info->num_hot_counters <= size_threshold
>>> +      || !profile_info->sum_all)
>>> +    return 1;
>>> +
>>> +  /* Next, exclude some loops where unrolling/peeling may be more
>>> +     important to overall performance.  */
>>> +
>>> +  /* Ignore FP loops, which are more likely to benefit heavily from
>>> +     unrolling. */
>>> +  if (desc->has_fp)
>>> +    return 1;
>>> +
>>> +  /* Next, compute the codesize-based unroll factor divisor, which for
>>> +     most loops will take a value that reduces or eliminates
>>> +     unrolling/peeling.  */
>>> +  if (profile_info->num_hot_counters < size_threshold * 2)
>>> +    {
>>> +      /* For applications that are less than twice the codesize limit, allow
>>> +         limited unrolling for very hot loops.  */
>>> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
>>> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
>>> +      /* When the ratio of the profile count sum to the loop entry count
>>> +         is smaller than the threshold (i.e. the loop entry is hot enough),
>>> +         the divisor is set to 1 so the unroll/peel factor is not reduced.
>>> +         When it is bigger than the threshold, increase the divisor by the
>>> +         amount this ratio is over the threshold, which quickly reduces the
>>> +         unroll/peel factor to zero as the loop's hotness reduces.  */
>>> +      if (sum_to_header_ratio > hotness_ratio_threshold)
>>> +        {
>>> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
>>> +          gcc_assert (limit_factor >= 1);
>>> +        }
>>> +      else
>>> +        limit_factor = 1;
>>> +    }
>>> +  else
>>> +    /* For applications that are at least twice the codesize limit, set
>>> +       the divisor to a large value that will force the unroll factor to 0.  */
>>> +    limit_factor = INT_MAX;
>>> +
>>> +  return limit_factor;
>>> +}
>>> +
>>>  /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
>>>  void
>>>  unroll_and_peel_loops (int flags)
>>> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo
>>>  {
>>>    unsigned nunroll, nunroll_by_av, i;
>>>    struct niter_desc *desc;
>>> +  int limit_factor = 1;
>>>
>>>    if (!(flags & UAP_UNROLL))
>>>      {
>>> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo
>>>              "\n;; Considering unrolling loop with runtime "
>>>              "computable number of iterations\n");
>>>
>>> +  if (flag_unroll_codesize_limit)
>>> +    {
>>> +      /* Determine whether to limit code size growth from unrolling,
>>> +         using FDO profile summary information that gives an
>>> +         estimated number of executed blocks.  */
>>> +      limit_factor = code_size_limit_factor (loop);
>>> +      if (dump_file && limit_factor > 1)
>>> +        {
>>> +          fprintf (dump_file,
>>> +                   ";; Due to large code size footprint estimate, limit "
>>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>>> +        }
>>> +    }
>>> +
>>>    /* nunroll = total number of copies of the original loop body in
>>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>>> -  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>>> +            / loop->ninsns;
>>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>>> +                  / limit_factor / loop->av_ninsns;
>>>    if (nunroll > nunroll_by_av)
>>>      nunroll = nunroll_by_av;
>>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>>> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>>  {
>>>    unsigned npeel;
>>>    struct niter_desc *desc;
>>> +  int limit_factor = 1;
>>>
>>>    if (!(flags & UAP_PEEL))
>>>      {
>>> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags)
>>>    if (dump_file)
>>>      fprintf (dump_file, "\n;; Considering simply peeling loop\n");
>>>
>>> +  if (flag_peel_codesize_limit)
>>> +    {
>>> +      /* Determine whether to limit code size growth from peeling,
>>> +         using FDO profile summary information that gives an
>>> +         estimated number of executed blocks.  */
>>> +      limit_factor = code_size_limit_factor (loop);
>>> +      if (dump_file && limit_factor > 1)
>>> +       {
>>> +          fprintf (dump_file,
>>> +                   ";; Due to large code size footprint estimate, limit "
>>> +                   "max peeled insns by divisor %d\n", limit_factor);
>>> +       }
>>> +    }
>>> +
>>>    /* npeel = number of iterations to peel.  */
>>> -  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
>>> +  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
>>> +          / loop->ninsns;
>>>    if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
>>>      npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
>>>
>>> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags)
>>>
>>>    /* Do not simply peel loops with branches inside -- it increases number
>>>       of mispredicts.  */
>>> -  if (num_loop_branches (loop) > 1)
>>> +  if (desc->num_branches > 1)
>>>      {
>>>        if (dump_file)
>>>         fprintf (dump_file, ";; Not peeling, contains branches\n");
>>> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>>  {
>>>    unsigned nunroll, nunroll_by_av, i;
>>>    struct niter_desc *desc;
>>> +  int limit_factor = 1;
>>>
>>>    if (!(flags & UAP_UNROLL_ALL))
>>>      {
>>> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags
>>>    if (dump_file)
>>>      fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
>>>
>>> +  if (flag_unroll_codesize_limit)
>>> +    {
>>> +      /* Determine whether to limit code size growth from unrolling,
>>> +         using FDO profile summary information that gives an
>>> +         estimated number of executed blocks.  */
>>> +      limit_factor = code_size_limit_factor (loop);
>>> +      if (dump_file && limit_factor > 1)
>>> +       {
>>> +          fprintf (dump_file,
>>> +                   ";; Due to large code size footprint estimate, limit "
>>> +                   "max unrolled insns by divisor %d\n", limit_factor);
>>> +       }
>>> +    }
>>> +
>>>    /* nunroll = total number of copies of the original loop body in
>>>       unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
>>> -  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
>>> -  nunroll_by_av
>>> -    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
>>> +  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
>>> +            / loop->ninsns;
>>> +  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
>>> +                  / limit_factor / loop->av_ninsns;
>>>    if (nunroll > nunroll_by_av)
>>>      nunroll = nunroll_by_av;
>>>    if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
>>> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags
>>>
>>>    /* Do not unroll loops with branches inside -- it increases number
>>>       of mispredicts.  */
>>> -  if (num_loop_branches (loop) > 1)
>>> +  if (desc->num_branches > 1)
>>>      {
>>>        if (dump_file)
>>>         fprintf (dump_file, ";; Not unrolling, contains branches\n");
>>> Index: gcc/coverage.c
>>> ===================================================================
>>> --- gcc/coverage.c      (revision 189893)
>>> +++ gcc/coverage.c      (working copy)
>>> @@ -247,6 +247,14 @@ read_counts_file (void)
>>>           gcov_read_summary (&sum);
>>>           for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
>>>             {
>>> +              if (summary.ctrs[ix].num_hot_counters
>>> +                  < sum.ctrs[ix].num_hot_counters)
>>> +                {
>>> +                 summary.ctrs[ix].num_hot_counters
>>> +                      = sum.ctrs[ix].num_hot_counters;
>>> +                 summary.ctrs[ix].hot_cutoff_value
>>> +                      = sum.ctrs[ix].hot_cutoff_value;
>>> +                }
>>>               summary.ctrs[ix].runs += sum.ctrs[ix].runs;
>>>               summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
>>>               if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
>>> Index: gcc/loop-iv.c
>>> ===================================================================
>>> --- gcc/loop-iv.c       (revision 189893)
>>> +++ gcc/loop-iv.c       (working copy)
>>> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop)
>>>    /* At least desc->infinite is not always initialized by
>>>       find_simple_loop_exit.  */
>>>    desc = XCNEW (struct niter_desc);
>>> -  iv_analysis_loop_init (loop);
>>> -  find_simple_exit (loop, desc);
>>> +  if (loop->latch != EXIT_BLOCK_PTR)
>>> +    {
>>> +      iv_analysis_loop_init (loop);
>>> +      find_simple_exit (loop, desc);
>>> +    }
>>> +  analyze_loop_insns (loop, desc);
>>>    loop->aux = desc;
>>>
>>>    if (desc->simple_p && (desc->assumptions || desc->infinite))
>>> Index: gcc/common.opt
>>> ===================================================================
>>> --- gcc/common.opt      (revision 189893)
>>> +++ gcc/common.opt      (working copy)
>>> @@ -1551,6 +1551,10 @@ fpcc-struct-return
>>>  Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
>>>  Return small aggregates in memory, not registers
>>>
>>> +fpeel-codesize-limit
>>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
>>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint
>>> +
>>>  fpeel-loops
>>>  Common Report Var(flag_peel_loops) Optimization
>>>  Perform loop peeling
>>> @@ -2112,6 +2116,10 @@ funroll-all-loops
>>>  Common Report Var(flag_unroll_all_loops) Optimization
>>>  Perform loop unrolling for all loops
>>>
>>> +funroll-codesize-limit
>>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
>>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint
>>> +
>>>  ; Nonzero means that loop optimizer may assume that the induction variables
>>>  ; that control loops do not overflow and that the loops with nontrivial
>>>  ; exit condition are not infinite
>>> Index: gcc/cfgloop.c
>>> ===================================================================
>>> --- gcc/cfgloop.c       (revision 189893)
>>> +++ gcc/cfgloop.c       (working copy)
>>> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop)
>>>    return edges;
>>>  }
>>>
>>> -/* Counts the number of conditional branches inside LOOP.  */
>>> +/* Determine if INSN is a floating point set.  */
>>>
>>> -unsigned
>>> -num_loop_branches (const struct loop *loop)
>>> +static bool
>>> +insn_has_fp_set (rtx insn)
>>>  {
>>> -  unsigned i, n;
>>> -  basic_block * body;
>>> +  int i;
>>> +  rtx pat = PATTERN (insn);
>>> +  if (GET_CODE (pat) == SET)
>>> +    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
>>> +  else if (GET_CODE (pat) == PARALLEL)
>>> +    {
>>> +      for (i = 0; i < XVECLEN (pat, 0); i++)
>>> +        {
>>> +          rtx sub = XVECEXP (pat, 0, i);
>>> +          if (GET_CODE (sub) == SET)
>>> +            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
>>> +        }
>>> +    }
>>> +  return false;
>>> +}
>>>
>>> -  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
>>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
>>> +   the number of conditional branch instructions, calls and fp instructions,
>>> +   as well as the average number of branches executed per iteration.  */
>>>
>>> +void
>>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
>>> +{
>>> +  unsigned i, nbranch;
>>> +  gcov_type weighted_nbranch;
>>> +  bool has_call, has_fp;
>>> +  basic_block * body, bb;
>>> +  rtx insn;
>>> +  gcov_type header_count = loop->header->count;
>>> +
>>> +  nbranch = weighted_nbranch = 0;
>>> +  has_call = has_fp = false;
>>> +
>>>    body = get_loop_body (loop);
>>> -  n = 0;
>>>    for (i = 0; i < loop->num_nodes; i++)
>>> -    if (EDGE_COUNT (body[i]->succs) >= 2)
>>> -      n++;
>>> +    {
>>> +      bb = body[i];
>>> +
>>> +      if (EDGE_COUNT (bb->succs) >= 2)
>>> +        {
>>> +          nbranch++;
>>> +
>>> +          /* If this block is executed less frequently than the header (loop
>>> +             entry), then it is weighted based on its execution count, which
>>> +             will be turned into a ratio compared to the loop header below. */
>>> +          if (bb->count < header_count)
>>> +            weighted_nbranch += bb->count;
>>> +
>>> +          /* When it is executed more frequently than the header (i.e. it is
>>> +             in a nested inner loop), simply weight the branch the same as the
>>> +             header execution count, so that it will contribute 1 branch to
>>> +             the ratio computed below. */
>>> +          else
>>> +            weighted_nbranch += header_count;
>>> +        }
>>> +
>>> +      /* No need to iterate through the instructions below if
>>> +         both flags have already been set.  */
>>> +      if (has_call && has_fp)
>>> +        continue;
>>> +
>>> +      FOR_BB_INSNS (bb, insn)
>>> +        {
>>> +          if (!INSN_P (insn))
>>> +            continue;
>>> +
>>> +          if (!has_call)
>>> +            has_call = CALL_P (insn);
>>> +
>>> +          if (!has_fp)
>>> +            has_fp = insn_has_fp_set (insn);
>>> +        }
>>> +    }
>>>    free (body);
>>>
>>> -  return n;
>>> +  desc->num_branches = nbranch;
>>> +  /* Now divide the weights computed above by the loop header execution count,
>>> +     to compute the average number of branches through the loop. By adding
>>> +     header_count/2 to the numerator we round to nearest with integer
>>> +     division.  */
>>> +  if (header_count != 0)
>>> +    desc->av_num_branches
>>> +        = (weighted_nbranch + header_count/2) / header_count;
>>> +  else
>>> +    desc->av_num_branches = 0;
>>> +  desc->has_call = has_call;
>>> +  desc->has_fp = has_fp;
>>>  }
>>>
>>>  /* Adds basic block BB to LOOP.  */
>>> Index: gcc/cfgloop.h
>>> ===================================================================
>>> --- gcc/cfgloop.h       (revision 189893)
>>> +++ gcc/cfgloop.h       (working copy)
>>> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
>>>
>>>  extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
>>>  edge single_exit (const struct loop *);
>>> -extern unsigned num_loop_branches (const struct loop *);
>>>
>>>  extern edge loop_preheader_edge (const struct loop *);
>>>  extern edge loop_latch_edge (const struct loop *);
>>> @@ -366,7 +365,8 @@ struct rtx_iv
>>>  };
>>>
>>>  /* The description of an exit from the loop and of the number of iterations
>>> -   till we take the exit.  */
>>> +   till we take the exit. Also includes other information used primarily
>>> +   by the loop unroller.  */
>>>
>>>  struct niter_desc
>>>  {
>>> @@ -407,6 +407,18 @@ struct niter_desc
>>>
>>>    /* The number of iterations of the loop.  */
>>>    rtx niter_expr;
>>> +
>>> +  /* The number of branches in the loop.  */
>>> +  unsigned num_branches;
>>> +
>>> +  /* The number of executed branches per iteration.  */
>>> +  unsigned av_num_branches;
>>> +
>>> +  /* Whether the loop contains a call instruction.  */
>>> +  bool has_call;
>>> +
>>> +  /* Whether the loop contains fp instructions.  */
>>> +  bool has_fp;
>>>  };
>>>
>>>  extern void iv_analysis_loop_init (struct loop *);
>>> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
>>>
>>>  extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
>>>  extern void free_simple_loop_desc (struct loop *loop);
>>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
>>>
>>>  static inline struct niter_desc *
>>>  simple_loop_desc (struct loop *loop)
>>> Index: gcc/params.def
>>> ===================================================================
>>> --- gcc/params.def      (revision 189893)
>>> +++ gcc/params.def      (working copy)
>>> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
>>>          "max-completely-peel-loop-nest-depth",
>>>          "The maximum depth of a loop nest we completely peel",
>>>          8, 0, 0)
>>> +/* The maximum code size estimate under which loop unrolling and peeling
>>> + * is allowed in a profile feedback compile. This currently applies to loops
>>> + * with non-constant iteration counts and no floating point computations.  */
>>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
>>> +        "unrollpeel-codesize-threshold",
>>> +        "Maximum profile-based code size footprint estimate for loop unrolling "
>>> +        "and peeling",
>>> +        15000, 0, 0)
>>> +/* The maximum ratio of total profiled execution counts to loop entry block
>>> +   count that must be exceeded to ignore most code size limits when unrolling
>>> +   and peeling.  */
>>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
>>> +        "unrollpeel-hotness-threshold",
>>> +        "Maximum ratio of total profiled execution count to loop entry "
>>> +        "block count under which most codesize limits for unrolling and "
>>> +        "peeling will be ignored",
>>> +        100, 1, 0)
>>>
>>>  /* The maximum number of insns of an unswitched loop.  */
>>>  DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
>>> Index: gcc/gcov-dump.c
>>> ===================================================================
>>> --- gcc/gcov-dump.c     (revision 189893)
>>> +++ gcc/gcov-dump.c     (working copy)
>>> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
>>>      {
>>>        printf ("\n");
>>>        print_prefix (filename, 0, 0);
>>> -      printf ("\t\tcounts=%u, runs=%u",
>>> -             summary.ctrs[ix].num, summary.ctrs[ix].runs);
>>> +      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
>>> +              HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
>>> +             summary.ctrs[ix].num,
>>> +             summary.ctrs[ix].num_hot_counters,
>>> +             (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
>>> +             summary.ctrs[ix].runs);
>>>
>>>        printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
>>>               (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);
>>>
>>> --
>>> This patch is available for review at http://codereview.appspot.com/6351086

Patch

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi     (revision 193626)
+++ doc/invoke.texi     (working copy)
@@ -389,7 +389,7 @@  Objective-C and Objective-C++ Dialects}.
 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
 -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
--fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
+-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays -fprofile-report @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
@@ -421,7 +421,7 @@  Objective-C and Objective-C++ Dialects}.
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
--funit-at-a-time -funroll-all-loops -funroll-loops @gol
+-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
 -fwhole-program -fwpa -fuse-linker-plugin @gol
@@ -8771,6 +8771,14 @@  the loop is entered.  This usually makes programs
 @option{-funroll-all-loops} implies the same options as
 @option{-funroll-loops}.

+@item -funroll-codesize-limit
+@opindex funroll-codesize-limit
+Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fpeel-loops
 @opindex fpeel-loops
 Peels loops for which there is enough information that they do not
@@ -8779,6 +8787,14 @@  roll much (from profile feedback).  It also turns

 Enabled with @option{-fprofile-use}.

+@item -fpeel-codesize-limit
+@opindex fpeel-codesize-limit
+Limit loop peeling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint. Enabled by default with
+@option{-fprofile-use}. Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fmove-loop-invariants
 @opindex fmove-loop-invariants
 Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
@@ -9142,6 +9158,15 @@  The maximum number of iterations of a loop to be s
 @item max-completely-peel-loop-nest-depth
 The maximum depth of a loop nest suitable for complete peeling.

+@item unrollpeel-codesize-threshold
+Maximum profile-based code size footprint estimate for loop unrolling and
+peeling.
+
+@item unrollpeel-hotness-threshold
+Maximum ratio of total execution count to loop entry block count under which
+most profile-based code size estimates will be ignored for loop unrolling and
+peeling.
+
 @item max-unswitch-insns
 The maximum number of insns of an unswitched loop.

Index: loop-unroll.c
===================================================================
--- loop-unroll.c       (revision 193626)
+++ loop-unroll.c       (working copy)
@@ -32,6 +32,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 #include "dumpfile.h"

 /* This pass performs loop unrolling and peeling.  We only perform these
@@ -147,6 +148,82 @@  static void combine_var_copies_in_loop_exit (struc
                                             basic_block);
 static rtx get_expansion (struct var_to_expand *);

+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates. Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX).  */
+
+static int
+code_size_limit_factor (struct loop *loop)
+{
+  unsigned size_threshold, num_hot_counters;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+  gcov_working_set_t *ws;
+
+  ws = find_working_set (999);
+  if (! ws)
+    return 1;
+  num_hot_counters = ws->num_counters;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program. A large value would indicate a large flat execution
+     profile where icache misses may be a concern.  */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance.  */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling. */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling.  */
+  if (num_hot_counters < size_threshold * 2
+      && loop->header->count > 0)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+         limited unrolling for very hot loops.  */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the profile count sum to loop entry header ratio is smaller than
+         the threshold (i.e. the loop entry is hot enough), the divisor is set
+         to 1 so the unroll/peel factor is not reduced. When it is bigger
+         than the ratio, increase the divisor by the amount this ratio
+         is over the threshold, which will quickly reduce the unroll/peel
+         factor to zero as the loop's hotness reduces.  */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+        {
+          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+          gcc_assert (limit_factor >= 1);
+        }
+      else
+        limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0.  */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
 void
 unroll_and_peel_loops (int flags)
@@ -822,6 +899,7 @@  decide_unroll_runtime_iterations (struct loop *loo
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
   double_int iterations;

   if (!(flags & UAP_UNROLL))
@@ -835,10 +913,26 @@  decide_unroll_runtime_iterations (struct loop *loo
             "\n;; Considering unrolling loop with runtime "
             "computable number of iterations\n");

+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+        {
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+        }
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1229,6 +1323,8 @@  decide_peel_simple (struct loop *loop, int flags)
 {
   unsigned npeel;
   double_int iterations;
+  struct niter_desc *desc;
+  int limit_factor = 1;

   if (!(flags & UAP_PEEL))
     {
@@ -1239,8 +1335,23 @@  decide_peel_simple (struct loop *loop, int flags)
   if (dump_file)
     fprintf (dump_file, "\n;; Considering simply peeling loop\n");

+  if (flag_peel_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from peeling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+       {
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max peeled insns by divisor %d\n", limit_factor);
+       }
+    }
+
   /* npeel = number of iterations to peel.  */
-  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
+  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
+          / loop->ninsns;
   if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
     npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);

@@ -1253,7 +1364,7 @@  decide_peel_simple (struct loop *loop, int flags)
     }

   /* Do not simply peel loops with branches inside -- it increases number
-     of mispredicts.
+     of mispredicts.
      Exception is when we do have profile and we however have good chance
      to peel proper number of iterations loop will iterate in practice.
      TODO: this heuristic needs tunning; while for complette unrolling
@@ -1261,7 +1372,8 @@  decide_peel_simple (struct loop *loop, int flags)
      peeling it is not the case.  Also a function call inside loop is
      also branch from branch prediction POV (and probably better reason
      to not unroll/peel).  */
-  if (num_loop_branches (loop) > 1
+  desc = get_simple_loop_desc (loop);
+  if (desc->num_branches > 1
       && profile_status != PROFILE_READ)
     {
       if (dump_file)
@@ -1387,6 +1499,7 @@  decide_unroll_stupid (struct loop *loop, int flags
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
   double_int iterations;

   if (!(flags & UAP_UNROLL_ALL))
@@ -1398,11 +1511,26 @@  decide_unroll_stupid (struct loop *loop, int flags
   if (dump_file)
     fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");

+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+       {
+          fprintf (dump_file,
+                   ";; Due to large code size footprint estimate, limit "
+                   "max unrolled insns by divisor %d\n", limit_factor);
+       }
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av
-    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+            / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+                  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1434,7 +1562,7 @@  decide_unroll_stupid (struct loop *loop, int flags
      of mispredicts.
      TODO: this heuristic needs tunning; call inside the loop body
      is also relatively good reason to not unroll.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
        fprintf (dump_file, ";; Not unrolling, contains branches\n");
Index: loop-iv.c
===================================================================
--- loop-iv.c   (revision 193626)
+++ loop-iv.c   (working copy)
@@ -3006,8 +3006,12 @@  get_simple_loop_desc (struct loop *loop)
   /* At least desc->infinite is not always initialized by
      find_simple_loop_exit.  */
   desc = XCNEW (struct niter_desc);
-  iv_analysis_loop_init (loop);
-  find_simple_exit (loop, desc);
+  if (loop->latch != EXIT_BLOCK_PTR)
+    {
+      iv_analysis_loop_init (loop);
+      find_simple_exit (loop, desc);
+    }
+  analyze_loop_insns (loop, desc);
   loop->aux = desc;

   if (desc->simple_p && (desc->assumptions || desc->infinite))
Index: common.opt
===================================================================
--- common.opt  (revision 193626)
+++ common.opt  (working copy)
@@ -1573,6 +1573,10 @@  fpcc-struct-return
 Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
 Return small aggregates in memory, not registers

+fpeel-codesize-limit
+Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop peeling under profile estimates of large code footprint
+
 fpeel-loops
 Common Report Var(flag_peel_loops) Optimization
 Perform loop peeling
@@ -2138,6 +2142,10 @@  funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops

+funroll-codesize-limit
+Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop unrolling under profile estimates of large code footprint
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
Index: cfgloop.c
===================================================================
--- cfgloop.c   (revision 193626)
+++ cfgloop.c   (working copy)
@@ -1148,24 +1148,98 @@  get_loop_exit_edges (const struct loop *loop)
   return edges;
 }

-/* Counts the number of conditional branches inside LOOP.  */
+/* Determine if INSN is a floating point set.  */

-unsigned
-num_loop_branches (const struct loop *loop)
+static bool
+insn_has_fp_set (rtx insn)
 {
-  unsigned i, n;
-  basic_block * body;
+  int i;
+  rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == SET)
+    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
+  else if (GET_CODE (pat) == PARALLEL)
+    {
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+        {
+          rtx sub = XVECEXP (pat, 0, i);
+          if (GET_CODE (sub) == SET)
+            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
+        }
+    }
+  return false;
+}

-  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
+/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts
+   the number of conditional branch instructions, calls and fp instructions,
+   as well as the average number of branches executed per iteration.  */

+void
+analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
+{
+  unsigned i, nbranch;
+  gcov_type weighted_nbranch;
+  bool has_call, has_fp;
+  basic_block * body, bb;
+  rtx insn;
+  gcov_type header_count = loop->header->count;
+
+  nbranch = weighted_nbranch = 0;
+  has_call = has_fp = false;
+
   body = get_loop_body (loop);
-  n = 0;
   for (i = 0; i < loop->num_nodes; i++)
-    if (EDGE_COUNT (body[i]->succs) >= 2)
-      n++;
+    {
+      bb = body[i];
+
+      if (EDGE_COUNT (bb->succs) >= 2)
+        {
+          nbranch++;
+
+          /* If this block is executed less frequently than the header (loop
+             entry), then it is weighted based on its execution count, which
+             will be turned into a ratio compared to the loop header below. */
+          if (bb->count < header_count)
+            weighted_nbranch += bb->count;
+
+          /* When it is executed more frequently than the header (i.e. it is
+             in a nested inner loop), simply weight the branch the same as the
+             header execution count, so that it will contribute 1 branch to
+             the ratio computed below. */
+          else
+            weighted_nbranch += header_count;
+        }
+
+      /* No need to iterate through the instructions below if
+         both flags have already been set.  */
+      if (has_call && has_fp)
+        continue;
+
+      FOR_BB_INSNS (bb, insn)
+        {
+          if (!INSN_P (insn))
+            continue;
+
+          if (!has_call)
+            has_call = CALL_P (insn);
+
+          if (!has_fp)
+            has_fp = insn_has_fp_set (insn);
+        }
+    }
   free (body);

-  return n;
+  desc->num_branches = nbranch;
+  /* Now divide the weights computed above by the loop header execution count,
+     to compute the average number of branches through the loop. By adding
+     header_count/2 to the numerator we round to nearest with integer
+     division.  */
+  if (header_count != 0)
+    desc->av_num_branches
+        = (weighted_nbranch + header_count/2) / header_count;
+  else
+    desc->av_num_branches = 0;
+  desc->has_call = has_call;
+  desc->has_fp = has_fp;
 }

 /* Adds basic block BB to LOOP.  */
Index: cfgloop.h
===================================================================
--- cfgloop.h   (revision 193626)
+++ cfgloop.h   (working copy)
@@ -252,7 +252,6 @@  extern basic_block *get_loop_body_in_custom_order
 extern vec<edge> get_loop_exit_edges (const struct loop *);
 extern edge single_exit (const struct loop *);
 extern edge single_likely_exit (struct loop *loop);
-extern unsigned num_loop_branches (const struct loop *);

 extern edge loop_preheader_edge (const struct loop *);
 extern edge loop_latch_edge (const struct loop *);
@@ -365,7 +364,8 @@  struct rtx_iv
 };

 /* The description of an exit from the loop and of the number of iterations
-   till we take the exit.  */
+   till we take the exit. Also includes other information used primarily
+   by the loop unroller.  */

 struct niter_desc
 {
@@ -403,6 +403,18 @@  struct niter_desc

   /* The number of iterations of the loop.  */
   rtx niter_expr;
+
+  /* The number of branches in the loop.  */
+  unsigned num_branches;
+
+  /* The number of executed branches per iteration.  */
+  unsigned av_num_branches;
+
+  /* Whether the loop contains a call instruction.  */
+  bool has_call;
+
+  /* Whether the loop contains fp instructions.  */
+  bool has_fp;
 };

 extern void iv_analysis_loop_init (struct loop *);
@@ -416,6 +428,7 @@  extern void iv_analysis_done (void);

 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 extern void free_simple_loop_desc (struct loop *loop);
+void analyze_loop_insns (const struct loop *, struct niter_desc *desc);

 static inline struct niter_desc *
 simple_loop_desc (struct loop *loop)
Index: params.def
===================================================================
--- params.def  (revision 193626)
+++ params.def  (working copy)
@@ -321,6 +321,23 @@  DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
         "max-completely-peel-loop-nest-depth",
         "The maximum depth of a loop nest we completely peel",
         8, 0, 0)
+/* The maximum code size estimate under which loop unrolling and peeling
+ * is allowed in a profile feedback compile. This currently applies to loops
+ * with non-constant iteration counts and no floating point computations.  */
+DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
+        "unrollpeel-codesize-threshold",
+        "Maximum profile-based code size footprint estimate for loop unrolling "
+        "and peeling",
+        15000, 0, 0)
+/* The maximum ratio of total profiled execution counts to loop entry block
+   count that must be exceeded to ignore most code size limits when unrolling
+   and peeling.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+        "unrollpeel-hotness-threshold",
+        "Maximum ratio of total profiled execution count to loop entry "
+        "block count under which most codesize limits for unrolling and "
+        "peeling will be ignored",
+        100, 1, 0)

 /* The maximum number of insns of an unswitched loop.  */