Message ID | CAAe5K+XH+HJJzFTFjdV9eb0shyTB5Vk_foKEh+Pgas5LJ5PGDw@mail.gmail.com
---|---
State | New
Ping. Teresa On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > This patch was proposed awhile back, and the new working set program > summary was split out and committed separately. I would like to > see if the loop unroller part of the patch can now go in. For the > previous versions of the patch and related discussion, see: > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > Use the new working set information in the FDO program summary for > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > As part of this, cache additional loop analysis results in the niter_desc > auxiliary information hanging off the loop structure to reduce > redundant analyses during unrolling. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-11-19 Teresa Johnson <tejohnson@google.com> > > * gcc/doc/invoke.texi: Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. 
> (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > > Index: doc/invoke.texi > =================================================================== > --- doc/invoke.texi (revision 193626) > +++ doc/invoke.texi (working copy) > @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops > -fpredictive-commoning @gol > -fprefetch-loop-arrays -fprofile-report @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. > -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol > -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. 
Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
> @item -fpeel-loops
> @opindex fpeel-loops
> Peels loops for which there is enough information that they do not
> @@ -8779,6 +8787,14 @@ roll much (from profile feedback).  It also turns
>
> Enabled with @option{-fprofile-use}.
>
> +@item -fpeel-codesize-limit
> +@opindex fpeel-codesize-limit
> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation
> +under estimates of a large code footprint.  Enabled by default with
> +@option{-fprofile-use}.  Code size and execution weight thresholds are controlled
> +by the @option{unrollpeel-codesize-threshold} and
> +@option{unrollpeel-hotness-threshold} parameters.
> +
> @item -fmove-loop-invariants
> @opindex fmove-loop-invariants
> Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
> @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s
> @item max-completely-peel-loop-nest-depth
> The maximum depth of a loop nest suitable for complete peeling.
>
> +@item unrollpeel-codesize-threshold
> +Maximum profile-based code size footprint estimate for loop unrolling and
> +peeling.
> +
> +@item unrollpeel-hotness-threshold
> +Maximum ratio of total execution count to loop entry block count under which
> +most profile-based code size estimates will be ignored for loop unrolling and
> +peeling.
> +
> @item max-unswitch-insns
> The maximum number of insns of an unswitched loop.
>
> Index: loop-unroll.c
> ===================================================================
> --- loop-unroll.c (revision 193626)
> +++ loop-unroll.c (working copy)
> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.
If not see
> #include "hashtab.h"
> #include "recog.h"
> #include "target.h"
> +#include "gcov-io.h"
> #include "dumpfile.h"
>
> /* This pass performs loop unrolling and peeling.  We only perform these
> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc
>    basic_block);
> static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> +   based on code footprint estimates.  Returns the codesize-based factor to be
> +   divided into the max instructions in an unrolled or peeled loop:
> +   1) For size <= threshold, do not limit (by returning 1).
> +   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> +   unrolled instructions according to loop hotness.
> +   3) For size >= 2*threshold, disable unrolling/peeling (by returning
> +   INT_MAX).  */
> +
> +static int
> +code_size_limit_factor (struct loop *loop)
> +{
> +  unsigned size_threshold, num_hot_counters;
> +  struct niter_desc *desc = get_simple_loop_desc (loop);
> +  gcov_type sum_to_header_ratio;
> +  int hotness_ratio_threshold;
> +  int limit_factor;
> +  gcov_working_set_t *ws;
> +
> +  ws = find_working_set (999);
> +  if (! ws)
> +    return 1;
> +  num_hot_counters = ws->num_counters;
> +
> +  /* First check if the application has a large codesize footprint.
> +     This is estimated from FDO profile summary information for the
> +     program, where the num_hot_counters indicates the number of hottest
> +     counters (blocks) that compose most of the execution time of
> +     the program.  A large value would indicate a large flat execution
> +     profile where icache misses may be a concern.  */
> +  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> +  if (!profile_info
> +      || num_hot_counters <= size_threshold
> +      || !profile_info->sum_all)
> +    return 1;
> +
> +  /* Next, exclude some loops where unrolling/peeling may be more
> +     important to overall performance.
*/
> +
> +  /* Ignore FP loops, which are more likely to benefit heavily from
> +     unrolling.  */
> +  if (desc->has_fp)
> +    return 1;
> +
> +  /* Next, set the value of the codesize-based unroll factor divisor which in
> +     most loops will need to be set to a value that will reduce or eliminate
> +     unrolling/peeling.  */
> +  if (num_hot_counters < size_threshold * 2
> +      && loop->header->count > 0)
> +    {
> +      /* For applications that are less than twice the codesize limit, allow
> +         limited unrolling for very hot loops.  */
> +      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> +      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> +      /* When the profile count sum to loop entry header ratio is smaller than
> +         the threshold (i.e. the loop entry is hot enough), the divisor is set
> +         to 1 so the unroll/peel factor is not reduced.  When it is bigger
> +         than the ratio, increase the divisor by the amount this ratio
> +         is over the threshold, which will quickly reduce the unroll/peel
> +         factor to zero as the loop's hotness reduces.  */
> +      if (sum_to_header_ratio > hotness_ratio_threshold)
> +        {
> +          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> +          gcc_assert (limit_factor >= 1);
> +        }
> +      else
> +        limit_factor = 1;
> +    }
> +  else
> +    /* For applications that are at least twice the codesize limit, set
> +       the divisor to a large value that will force the unroll factor to 0.  */
> +    limit_factor = INT_MAX;
> +
> +  return limit_factor;
> +}
> +
> /* Unroll and/or peel (depending on FLAGS) LOOPS.
*/ > void > unroll_and_peel_loops (int flags) > @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > double_int iterations; > > if (!(flags & UAP_UNROLL)) > @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / > loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > double_int iterations; > + struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) > } > > /* Do not simply peel loops with branches inside -- it increases number > - of mispredicts. > + of mispredicts. > Exception is when we do have profile and we however have good chance > to peel proper number of iterations loop will iterate in practice. 
> TODO: this heuristic needs tunning; while for complette unrolling > @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) > peeling it is not the case. Also a function call inside loop is > also branch from branch prediction POV (and probably better reason > to not unroll/peel). */ > - if (num_loop_branches (loop) > 1 > + desc = get_simple_loop_desc (loop); > + if (desc->num_branches > 1 > && profile_status != PROFILE_READ) > { > if (dump_file) > @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > double_int iterations; > > if (!(flags & UAP_UNROLL_ALL)) > @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags > of mispredicts. > TODO: this heuristic needs tunning; call inside the loop body > is also relatively good reason to not unroll. */ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: loop-iv.c > =================================================================== > --- loop-iv.c (revision 193626) > +++ loop-iv.c (working copy) > @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: common.opt > =================================================================== > --- common.opt (revision 193626) > +++ common.opt (working copy) > @@ -1573,6 +1573,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large > code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2138,6 +2142,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of > large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: cfgloop.c > =================================================================== > --- cfgloop.c (revision 193626) > +++ cfgloop.c (working copy) > @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: cfgloop.h > =================================================================== > --- cfgloop.h (revision 193626) > +++ cfgloop.h (working copy) > @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order > extern vec<edge> get_loop_exit_edges (const struct loop *); > extern edge single_exit (const struct loop *); > extern edge single_likely_exit (struct loop *loop); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -365,7 +364,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -403,6 +403,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: params.def > =================================================================== > --- params.def (revision 193626) > +++ params.def (working copy) > @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop > unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > > On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: >> Updated patch based on review feedback. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Teresa >> >> >> Original patch description: >> >> Ports some patches related to improving FDO program summary information >> and using it to guide loop unrolling from google branches to mainline. >> The patch is enhanced to add additional summary information to aid >> in determining hot/cold decisions. >> >> The original patch description is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and further discussion about incorporating onto mainline is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >> >> Honza, can you take a look to see if this patch meets your needs? >> >> Full description: >> >> This patch adds new program summary information to the gcov >> profile files that indicate how many profiled counts compose >> the majority of the program's execution time. This is used to >> provide an indication of the overall code size of the program. >> >> The new profile summary information is then used to guide >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> This patch also pulls in dependent portions of google/main r187660 that cache >> additional loop analysis results in the niter_desc auxiliary information >> hanging off the loop structure (the optimization portions of that >> change are not included here, and have an outstanding review request >> for mainline). >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-07-26 Teresa Johnson <tejohnson@google.com> >> >> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >> (gcov_compute_cutoff_values): Ditto. 
>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >> information. >> * gcc/doc/invoke.texi (roll much): Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >> (gcov_read_summary): Read new summary info. >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/coverage.c (read_counts_file): Propagate new summary info. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >> >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189893) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. 
*/ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) >> +{ >> + struct gcov_info *gi_ptr; >> + const struct gcov_fn_info *gfi_ptr; >> + const struct gcov_ctr_info *ci_ptr; >> + struct gcov_ctr_summary *cs_ptr; >> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >> + gcov_unsigned_t c_num; >> + gcov_type *value_array; >> + gcov_type cum, cum_cutoff; >> + char *cutoff_str; >> + unsigned cutoff_perc; >> + >> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >> + if (cutoff_str && strlen (cutoff_str)) >> + cutoff_perc = atoi (cutoff_str); >> + else >> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >> + >> + /* This currently only applies to arc counters. */ >> + t_ix = GCOV_COUNTER_ARCS; >> + >> + /* First check if there are any counts recorded for this counter. */ >> + cs_ptr = &(sum->ctrs[t_ix]); >> + if (!cs_ptr->num) >> + return; >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. >> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. 
*/ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >> + >> + /* Next, walk through all the per-object structures and save each of >> + the count values in value_array. */ >> + index = 0; >> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. */ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_arry >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; >> + } >> + } >> + >> + /* Sort all the counter values by descending value and finally >> + accumulate the values from hottest on down until reaching >> + the cutoff value computed earlier. */ >> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >> + sort_by_reverse_gcov_value); >> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >> + cum += value_array[c_num]; >> + /* Record the number of counters required to reach the cutoff value, >> + as well as the counter value that caused cum to reach the cutoff. 
*/ >> + cs_ptr->num_hot_counters = c_num; >> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >> + free (value_array); >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +461,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_cutoff_values (&this_prg); >> >> { >> /* Check if the level of dirs to strip off specified. */ >> @@ -598,7 +713,12 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + cs_prg->num = cs_tprg->num; >> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >> + { >> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >> + } >> cs_prg->sum_all += cs_tprg->sum_all; >> if (cs_prg->run_max < cs_tprg->run_max) >> cs_prg->run_max = cs_tprg->run_max; >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189893) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled >> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. >> >> Index: gcc/gcov-io.c >> =================================================================== >> --- gcc/gcov-io.c (revision 189893) >> +++ gcc/gcov-io.c (working copy) >> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> gcov_write_unsigned (csum->num); >> + gcov_write_unsigned (csum->num_hot_counters); >> + gcov_write_counter (csum->hot_cutoff_value); >> gcov_write_unsigned (csum->runs); >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> csum->num = gcov_read_unsigned (); >> + csum->num_hot_counters = gcov_read_unsigned (); >> + csum->hot_cutoff_value = gcov_read_counter (); >> csum->runs = gcov_read_unsigned (); >> csum->sum_all = gcov_read_counter (); >> csum->run_max = gcov_read_counter (); >> Index: gcc/gcov-io.h >> =================================================================== >> --- gcc/gcov-io.h (revision 189893) >> +++ gcc/gcov-io.h (working copy) >> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >> #define GCOV_TAG_PROGRAM_SUMMARY 
((gcov_unsigned_t)0xa3000000) >> #define GCOV_TAG_SUMMARY_LENGTH \ >> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >> >> /* Counters that are collected. */ >> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >> struct gcov_ctr_summary >> { >> gcov_unsigned_t num; /* number of counters. */ >> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >> + percent of sum_all. */ >> + gcov_type hot_cutoff_value; /* value of smallest counter included in >> + num_hot_counters. */ >> gcov_unsigned_t runs; /* number of program runs */ >> gcov_type sum_all; /* sum of all counters accumulated. */ >> gcov_type run_max; /* maximum value on a single run. */ >> Index: gcc/loop-unroll.c >> =================================================================== >> --- gcc/loop-unroll.c (revision 189893) >> +++ gcc/loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> #include "dumpfile.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning 
INT_MAX). */ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. */ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || profile_info->num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (profile_info->num_hot_counters < size_threshold * 2) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough), the divisor is set >> + to 1 so the unroll/peel factor is not reduced. 
When it is bigger >> + than the ratio, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For applications that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL)) >> { >> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL_ALL)) >> { >> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: gcc/coverage.c >> =================================================================== >> --- gcc/coverage.c (revision 189893) >> +++ gcc/coverage.c (working copy) >> @@ -247,6 +247,14 @@ read_counts_file (void) >> gcov_read_summary (&sum); >> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >> { >> + if (summary.ctrs[ix].num_hot_counters >> + < sum.ctrs[ix].num_hot_counters) >> + { >> + summary.ctrs[ix].num_hot_counters >> + = sum.ctrs[ix].num_hot_counters; >> + summary.ctrs[ix].hot_cutoff_value >> + = sum.ctrs[ix].hot_cutoff_value; >> + } >> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >> Index: gcc/loop-iv.c >> =================================================================== >> --- gcc/loop-iv.c (revision 189893) >> +++ gcc/loop-iv.c (working copy) >> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: gcc/common.opt >> =================================================================== >> --- gcc/common.opt (revision 189893) >> +++ gcc/common.opt (working copy) >> @@ -1551,6 +1551,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2112,6 +2116,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: gcc/cfgloop.c >> =================================================================== >> --- gcc/cfgloop.c (revision 189893) >> +++ gcc/cfgloop.c (working copy) >> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: gcc/cfgloop.h >> =================================================================== >> --- gcc/cfgloop.h (revision 189893) >> +++ gcc/cfgloop.h (working copy) >> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -366,7 +365,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -407,6 +407,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. 
*/ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: gcc/params.def >> =================================================================== >> --- gcc/params.def (revision 189893) >> +++ gcc/params.def (working copy) >> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> Index: gcc/gcov-dump.c >> =================================================================== >> --- gcc/gcov-dump.c (revision 189893) >> +++ gcc/gcov-dump.c (working copy) >> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >> { >> printf ("\n"); >> print_prefix (filename, 0, 0); >> - printf ("\t\tcounts=%u, runs=%u", >> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >> + summary.ctrs[ix].num, >> + summary.ctrs[ix].num_hot_counters, >> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >> + summary.ctrs[ix].runs); >> >> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >> >> -- >> This patch is available for review at http://codereview.appspot.com/6351086 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: > Ping. > Teresa > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > > This patch was proposed awhile back, and the new working set program > > summary was split out and committed separately. I would like to > > see if the loop unroller part of the patch can now go in. For the > > previous versions of the patch and related discussion, see: > > > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > > and > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > > > Use the new working set information in the FDO program summary for > > codesize based unroll and peel decisions, to prevent those > > optimizations from increasing code size too much when the > > program may be sensitive to icache effects. > > > > As part of this, cache additional loop analysis results in the niter_desc > > auxiliary information hanging off the loop structure to reduce > > redundant analyses during unrolling. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? I guess this is the missing piece that would fix the Firefox lto/pgo code-size issue: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144
> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: > > Ping. > > Teresa > > > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > > > This patch was proposed awhile back, and the new working set program > > > summary was split out and committed separately. I would like to > > > see if the loop unroller part of the patch can now go in. For the > > > previous versions of the patch and related discussion, see: > > > > > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > > > and > > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > > > > > Use the new working set information in the FDO program summary for > > > codesize based unroll and peel decisions, to prevent those > > > optimizations from increasing code size too much when the > > > program may be sensitive to icache effects. > > > > > > As part of this, cache additional loop analysis results in the niter_desc > > > auxiliary information hanging off the loop structure to reduce > > > redundant analyses during unrolling. > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > I guess this is the missing piece that would fix the Firefox lto/pgo > code-size issue: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 It won't help in this case - the problem is that too much of code is identified as hot, similar as for spec2k GCC benchmark. Teresa - do you have any ideas what to do here short of re-adding the BB ratio parameter and combining both heuristics for 4.8? Honza > > -- > Markus
On Tue, Dec 11, 2012 at 10:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: >> > Ping. >> > Teresa >> > >> > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: >> > > This patch was proposed awhile back, and the new working set program >> > > summary was split out and committed separately. I would like to >> > > see if the loop unroller part of the patch can now go in. For the >> > > previous versions of the patch and related discussion, see: >> > > >> > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> > > and >> > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html >> > > >> > > Use the new working set information in the FDO program summary for >> > > codesize based unroll and peel decisions, to prevent those >> > > optimizations from increasing code size too much when the >> > > program may be sensitive to icache effects. >> > > >> > > As part of this, cache additional loop analysis results in the niter_desc >> > > auxiliary information hanging off the loop structure to reduce >> > > redundant analyses during unrolling. >> > > >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> I guess this is the missing piece that would fix the Firefox lto/pgo >> code-size issue: >> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 > > It won't help in this case - the problem is that too much of code is identified as hot, > similar as for spec2k GCC benchmark. Teresa - do you have any ideas what to do here > short of re-adding the BB ratio parameter and combining both heuristics for 4.8? Right, I doubt this patch would help. I looked through the bugzilla log and it is most likely inlining increases. What was the size of the gcc lto/pgo binary before the change to use the histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a very large increase, ~25%. 
Markus, could you attach to the bug one of the gcda files so that I can see the program summary and figure out how far off the old hot bb threshold is from the new histogram-based one? Also, it would be good to see the -fdump-ipa-inline dumps before and after the regression (if necessary, the before one could be from 4_7). I'll update the bug with the same, so that we can track the discussion there. Thanks, Teresa > > Honza >> >> -- >> Markus -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping. Teresa On Tue, Dec 11, 2012 at 9:13 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ping. > Teresa > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: >> This patch was proposed awhile back, and the new working set program >> summary was split out and committed separately. I would like to >> see if the loop unroller part of the patch can now go in. For the >> previous versions of the patch and related discussion, see: >> >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and >> http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html >> >> Use the new working set information in the FDO program summary for >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> As part of this, cache additional loop analysis results in the niter_desc >> auxiliary information hanging off the loop structure to reduce >> redundant analyses during unrolling. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-11-19 Teresa Johnson <tejohnson@google.com> >> >> * gcc/doc/invoke.texi: Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. 
>> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> >> Index: doc/invoke.texi >> =================================================================== >> --- doc/invoke.texi (revision 193626) >> +++ doc/invoke.texi (working copy) >> @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops >> -fpredictive-commoning @gol >> -fprefetch-loop-arrays -fprofile-report @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. >> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. 
>> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8779,6 +8787,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >> @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop.
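To make the interaction of the two new params concrete, here is a minimal standalone sketch of the intended divisor selection (hypothetical names and simplified types; the actual logic lives in code_size_limit_factor in loop-unroll.c below):

```c
#include <limits.h>

/* Illustrative sketch (not the patch's exact code) of how the two new
   params combine.  NUM_HOT_COUNTERS is the footprint estimate from the
   FDO working set, SIZE_THRESHOLD is unrollpeel-codesize-threshold, and
   HOTNESS_THRESHOLD is unrollpeel-hotness-threshold.  Returns the
   divisor applied to the max-unrolled/peeled-insns params.  */
static int
limit_factor_sketch (unsigned num_hot_counters, unsigned size_threshold,
                     long long sum_all, long long header_count,
                     long long hotness_threshold)
{
  long long sum_to_header_ratio;

  /* Small footprint: no limiting at all.  */
  if (num_hot_counters <= size_threshold)
    return 1;

  /* Very large footprint (or no usable header count): effectively
     disable unrolling/peeling by returning a huge divisor.  */
  if (num_hot_counters >= size_threshold * 2 || header_count <= 0)
    return INT_MAX;

  /* In between: the hotter the loop entry relative to the program's
     total profiled count, the smaller the divisor.  */
  sum_to_header_ratio = sum_all / header_count;
  if (sum_to_header_ratio > hotness_threshold)
    return (int) (sum_to_header_ratio / hotness_threshold);
  return 1;
}
```

For example, with the default thresholds (15000 and 100), a loop whose entry count is 1/100000 of the total profiled count in a program estimated at 20000 hot counters would have its unroll/peel budget divided by 1000.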
>> >> Index: loop-unroll.c >> =================================================================== >> --- loop-unroll.c (revision 193626) >> +++ loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> #include "dumpfile.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). */ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold, num_hot_counters; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + gcov_working_set_t *ws; >> + >> + ws = find_working_set(999); >> + if (! ws) >> + return 1; >> + num_hot_counters = ws->num_counters; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern.
*/ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (num_hot_counters < size_threshold * 2 >> + && loop->header->count > 0) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough), the divisor is set >> + to 1 so the unroll/peel factor is not reduced. When it is bigger >> + than the threshold, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For applications that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS.
*/ >> void >> unroll_and_peel_loops (int flags) >> @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> double_int iterations; >> >> if (!(flags & UAP_UNROLL)) >> @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / >> loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> double_int iterations; >> + struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) >> } >> >> /* Do not simply peel loops with branches inside -- it increases number >> - of mispredicts. >> + of mispredicts. >> Exception is when we do have profile and we however have good chance >> to peel proper number of iterations loop will iterate in practice. 
>> TODO: this heuristic needs tunning; while for complette unrolling >> @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) >> peeling it is not the case. Also a function call inside loop is >> also branch from branch prediction POV (and probably better reason >> to not unroll/peel). */ >> - if (num_loop_branches (loop) > 1 >> + desc = get_simple_loop_desc (loop); >> + if (desc->num_branches > 1 >> && profile_status != PROFILE_READ) >> { >> if (dump_file) >> @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> double_int iterations; >> >> if (!(flags & UAP_UNROLL_ALL)) >> @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> of mispredicts. >> TODO: this heuristic needs tunning; call inside the loop body >> is also relatively good reason to not unroll. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: loop-iv.c >> =================================================================== >> --- loop-iv.c (revision 193626) >> +++ loop-iv.c (working copy) >> @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: common.opt >> =================================================================== >> --- common.opt (revision 193626) >> +++ common.opt (working copy) >> @@ -1573,6 +1573,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large >> code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2138,6 +2142,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of >> large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: cfgloop.c >> =================================================================== >> --- cfgloop.c (revision 193626) >> +++ cfgloop.c (working copy) >> @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: cfgloop.h >> =================================================================== >> --- cfgloop.h (revision 193626) >> +++ cfgloop.h (working copy) >> @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order >> extern vec<edge> get_loop_exit_edges (const struct loop *); >> extern edge single_exit (const struct loop *); >> extern edge single_likely_exit (struct loop *loop); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -365,7 +364,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -403,6 +403,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. 
*/ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. */ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: params.def >> =================================================================== >> --- params.def (revision 193626) >> +++ params.def (working copy) >> @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop >> unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> >> On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: >>> Updated patch based on review feedback. >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Teresa >>> >>> >>> Original patch description: >>> >>> Ports some patches related to improving FDO program summary information >>> and using it to guide loop unrolling from google branches to mainline. >>> The patch is enhanced to add additional summary information to aid >>> in determining hot/cold decisions. >>> >>> The original patch description is at: >>> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >>> and further discussion about incorporating onto mainline is at: >>> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >>> >>> Honza, can you take a look to see if this patch meets your needs? >>> >>> Full description: >>> >>> This patch adds new program summary information to the gcov >>> profile files that indicate how many profiled counts compose >>> the majority of the program's execution time. This is used to >>> provide an indication of the overall code size of the program. >>> >>> The new profile summary information is then used to guide >>> codesize based unroll and peel decisions, to prevent those >>> optimizations from increasing code size too much when the >>> program may be sensitive to icache effects. 
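As a rough standalone sketch of the working-set computation the description above relies on (illustrative names and types; the real code in libgcov.c walks the per-object counter arrays and honors the GCOV_HOTCODE_CUTOFF_TIMES_10 environment override):

```c
#include <stdlib.h>

typedef long long gcov_t;  /* stands in for gcov_type */

/* qsort comparator for descending order, written without subtraction
   to avoid overflow (mirrors sort_by_reverse_gcov_value).  */
static int
cmp_desc (const void *pa, const void *pb)
{
  const gcov_t a = *(const gcov_t *) pa;
  const gcov_t b = *(const gcov_t *) pb;
  return (b > a) - (b < a);
}

/* Illustrative sketch of the num_hot_counters computation: sort the N
   counter values hottest-first and count how many are needed before
   the running sum reaches CUM_CUTOFF (e.g. 99.9% of sum_all).
   VALUES is sorted in place.  */
static unsigned
num_hot_counters_sketch (gcov_t *values, unsigned n, gcov_t cum_cutoff)
{
  gcov_t cum = 0;
  unsigned c_num;

  qsort (values, n, sizeof (gcov_t), cmp_desc);
  for (c_num = 0; c_num < n && cum < cum_cutoff; c_num++)
    cum += values[c_num];
  return c_num;
}
```

A small result here (few counters cover nearly all of sum_all) suggests a tight, peaked profile; a result close to n suggests the large flat profile that the codesize limits are meant to catch.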
>>> >>> This patch also pulls in dependent portions of google/main r187660 that cache >>> additional loop analysis results in the niter_desc auxiliary information >>> hanging off the loop structure (the optimization portions of that >>> change are not included here, and have an outstanding review request >>> for mainline). >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Thanks, >>> Teresa >>> >>> 2012-07-26 Teresa Johnson <tejohnson@google.com> >>> >>> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >>> (gcov_compute_cutoff_values): Ditto. >>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >>> information. >>> * gcc/doc/invoke.texi (roll much): Document new options >>> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >>> codesize-hotness-threshold and unrollpeel-hotness-threshold. >>> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >>> (gcov_read_summary): Read new summary info. >>> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >>> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >>> hot_cutoff_value. >>> * gcc/loop-unroll.c (code_size_limit_factor): New function. >>> (decide_unroll_runtime_iterations): Call code_size_limit_factor >>> to control the unroll factor, and retrieve number of branches from >>> niter_desc instead of via function that walks loop. >>> (decide_peel_simple, decide_unroll_stupid): Ditto. >>> * gcc/coverage.c (read_counts_file): Propagate new summary info. >>> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >>> function, and add guards to enable this function to work for the >>> outermost loop. >>> * gcc/common.opt: Add -fpeel-codesize-limit and >>> -funroll-codesize-limit. >>> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >>> (num_loop_branches): Remove. 
>>> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >>> additional loop analysis information. >>> (num_loop_branches): Remove. >>> (analyze_loop_insns): Declare. >>> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >>> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >>> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >>> >>> Index: libgcc/libgcov.c >>> =================================================================== >>> --- libgcc/libgcov.c (revision 189893) >>> +++ libgcc/libgcov.c (working copy) >>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >>> return 1; >>> } >>> >>> +/* Used by qsort to sort gcov values in descending order. */ >>> + >>> +static int >>> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >>> +{ >>> + const gcov_type a = *(gcov_type const *)pa; >>> + const gcov_type b = *(gcov_type const *)pb; >>> + >>> + if (b > a) >>> + return 1; >>> + else if (b == a) >>> + return 0; >>> + else >>> + return -1; >>> +} >>> + >>> +/* Determines the number of counters required to cover a given percentage >>> + of the total sum of execution counts in the summary, which is then also >>> + recorded in SUM. */ >>> + >>> +static void >>> +gcov_compute_cutoff_values (struct gcov_summary *sum) >>> +{ >>> + struct gcov_info *gi_ptr; >>> + const struct gcov_fn_info *gfi_ptr; >>> + const struct gcov_ctr_info *ci_ptr; >>> + struct gcov_ctr_summary *cs_ptr; >>> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >>> + gcov_unsigned_t c_num; >>> + gcov_type *value_array; >>> + gcov_type cum, cum_cutoff; >>> + char *cutoff_str; >>> + unsigned cutoff_perc; >>> + >>> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >>> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >>> + if (cutoff_str && strlen (cutoff_str)) >>> + cutoff_perc = atoi (cutoff_str); >>> + else >>> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >>> + >>> + /* This currently only applies to arc counters. 
*/ >>> + t_ix = GCOV_COUNTER_ARCS; >>> + >>> + /* First check if there are any counts recorded for this counter. */ >>> + cs_ptr = &(sum->ctrs[t_ix]); >>> + if (!cs_ptr->num) >>> + return; >>> + >>> + /* Determine the cumulative counter value at the specified cutoff >>> + percentage and record the percentage for use by gcov consumers. >>> + Check for overflow when sum_all is multiplied by the cutoff_perc, >>> + and if so, do the divide first. */ >>> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >>> + /* Overflow, do the divide first. */ >>> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >>> + else >>> + /* Otherwise multiply first to get the correct value for small >>> + values of sum_all. */ >>> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >>> + >>> + /* Next, walk through all the per-object structures and save each of >>> + the count values in value_array. */ >>> + index = 0; >>> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >>> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >>> + { >>> + if (!gi_ptr->merge[t_ix]) >>> + continue; >>> + >>> + /* Find the appropriate index into the gcov_ctr_info array >>> + for the counter we are currently working on based on the >>> + existence of the merge function pointer for this object. */ >>> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >>> + { >>> + if (gi_ptr->merge[i]) >>> + ctr_info_ix++; >>> + } >>> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >>> + { >>> + gfi_ptr = gi_ptr->functions[f_ix]; >>> + >>> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >>> + continue; >>> + >>> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >>> + /* Sanity check that there are enough entries in value_arry >>> + for this function's counters. Gracefully handle the case when >>> + there are not, in case something in the profile info is >>> + corrupted. 
*/ >>> + c_num = ci_ptr->num; >>> + if (index + c_num > cs_ptr->num) >>> + c_num = cs_ptr->num - index; >>> + /* Copy over this function's counter values. */ >>> + memcpy (&value_array[index], ci_ptr->values, >>> + sizeof (gcov_type) * c_num); >>> + index += c_num; >>> + } >>> + } >>> + >>> + /* Sort all the counter values by descending value and finally >>> + accumulate the values from hottest on down until reaching >>> + the cutoff value computed earlier. */ >>> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >>> + sort_by_reverse_gcov_value); >>> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >>> + cum += value_array[c_num]; >>> + /* Record the number of counters required to reach the cutoff value, >>> + as well as the counter value that caused cum to reach the cutoff. */ >>> + cs_ptr->num_hot_counters = c_num; >>> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >>> + free (value_array); >>> +} >>> + >>> /* Dump the coverage counts. We merge with existing counts when >>> possible, to avoid growing the .da files ad infinitum. We use this >>> program's checksum to make sure we only accumulate whole program >>> @@ -347,6 +461,7 @@ gcov_exit (void) >>> } >>> } >>> } >>> + gcov_compute_cutoff_values (&this_prg); >>> >>> { >>> /* Check if the level of dirs to strip off specified. 
*/ >>> @@ -598,7 +713,12 @@ gcov_exit (void) >>> if (gi_ptr->merge[t_ix]) >>> { >>> if (!cs_prg->runs++) >>> - cs_prg->num = cs_tprg->num; >>> + cs_prg->num = cs_tprg->num; >>> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >>> + { >>> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >>> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >>> + } >>> cs_prg->sum_all += cs_tprg->sum_all; >>> if (cs_prg->run_max < cs_tprg->run_max) >>> cs_prg->run_max = cs_tprg->run_max; >>> Index: gcc/doc/invoke.texi >>> =================================================================== >>> --- gcc/doc/invoke.texi (revision 189893) >>> +++ gcc/doc/invoke.texi (working copy) >>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >>> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >>> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >>> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >>> -fprefetch-loop-arrays @gol >>> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >>> -fprofile-generate=@var{path} @gol >>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. >>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >>> -ftree-switch-conversion -ftree-tail-merge @gol >>> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >>> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >>> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >>> -fwhole-program -fwpa -fuse-linker-plugin @gol >>> @@ -8527,6 +8527,14 @@ the loop is entered. 
This usually makes programs >>> @option{-funroll-all-loops} implies the same options as >>> @option{-funroll-loops}. >>> >>> +@item -funroll-codesize-limit >>> +@opindex funroll-codesize-limit >>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >>> +under estimates of a large code footprint. Enabled by default with >>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >>> +by the @option{unrollpeel-codesize-threshold} and >>> +@option{unrollpeel-hotness-threshold} parameters. >>> + >>> @item -fpeel-loops >>> @opindex fpeel-loops >>> Peels loops for which there is enough information that they do not >>> @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns >>> >>> Enabled with @option{-fprofile-use}. >>> >>> +@item -fpeel-codesize-limit >>> +@opindex fpeel-codesize-limit >>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >>> +under estimates of a large code footprint. Enabled by default with >>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >>> +by the @option{unrollpeel-codesize-threshold} and >>> +@option{unrollpeel-hotness-threshold} parameters. >>> + >>> @item -fmove-loop-invariants >>> @opindex fmove-loop-invariants >>> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >>> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s >>> @item max-completely-peel-loop-nest-depth >>> The maximum depth of a loop nest suitable for complete peeling. >>> >>> +@item unrollpeel-codesize-threshold >>> +Maximum profile-based code size footprint estimate for loop unrolling and >>> +peeling. >>> + >>> +@item unrollpeel-hotness-threshold >>> +Maximum ratio of total execution count to loop entry block count under which >>> +most profile-based code size estimates will be ignored for loop unrolling and >>> +peeling. 
>>> + >>> @item max-unswitch-insns >>> The maximum number of insns of an unswitched loop. >>> >>> Index: gcc/gcov-io.c >>> =================================================================== >>> --- gcc/gcov-io.c (revision 189893) >>> +++ gcc/gcov-io.c (working copy) >>> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >>> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >>> { >>> gcov_write_unsigned (csum->num); >>> + gcov_write_unsigned (csum->num_hot_counters); >>> + gcov_write_counter (csum->hot_cutoff_value); >>> gcov_write_unsigned (csum->runs); >>> gcov_write_counter (csum->sum_all); >>> gcov_write_counter (csum->run_max); >>> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >>> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >>> { >>> csum->num = gcov_read_unsigned (); >>> + csum->num_hot_counters = gcov_read_unsigned (); >>> + csum->hot_cutoff_value = gcov_read_counter (); >>> csum->runs = gcov_read_unsigned (); >>> csum->sum_all = gcov_read_counter (); >>> csum->run_max = gcov_read_counter (); >>> Index: gcc/gcov-io.h >>> =================================================================== >>> --- gcc/gcov-io.h (revision 189893) >>> +++ gcc/gcov-io.h (working copy) >>> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >>> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >>> #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) >>> #define GCOV_TAG_SUMMARY_LENGTH \ >>> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >>> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >>> >>> /* Counters that are collected. */ >>> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >>> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >>> struct gcov_ctr_summary >>> { >>> gcov_unsigned_t num; /* number of counters. */ >>> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >>> + percent of sum_all. 
*/ >>> + gcov_type hot_cutoff_value; /* value of smallest counter included in >>> + num_hot_counters. */ >>> gcov_unsigned_t runs; /* number of program runs */ >>> gcov_type sum_all; /* sum of all counters accumulated. */ >>> gcov_type run_max; /* maximum value on a single run. */ >>> Index: gcc/loop-unroll.c >>> =================================================================== >>> --- gcc/loop-unroll.c (revision 189893) >>> +++ gcc/loop-unroll.c (working copy) >>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >>> #include "hashtab.h" >>> #include "recog.h" >>> #include "target.h" >>> +#include "gcov-io.h" >>> #include "dumpfile.h" >>> >>> /* This pass performs loop unrolling and peeling. We only perform these >>> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc >>> basic_block); >>> static rtx get_expansion (struct var_to_expand *); >>> >>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >>> + based on code footprint estimates. Returns the codesize-based factor to be >>> + divided into the max instructions in an unrolled or peeled loop: >>> + 1) For size <= threshold, do not limit (by returning 1). >>> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >>> + unrolled instructions according to loop hotness. >>> + 3) For threshold >= 2*threshold, disable unrolling/peeling (by returning >>> + INT_MAX). */ >>> + >>> +static int >>> +code_size_limit_factor(struct loop *loop) >>> +{ >>> + unsigned size_threshold; >>> + struct niter_desc *desc = get_simple_loop_desc (loop); >>> + gcov_type sum_to_header_ratio; >>> + int hotness_ratio_threshold; >>> + int limit_factor; >>> + >>> + /* First check if the application has a large codesize footprint. >>> + This is estimated from FDO profile summary information for the >>> + program, where the num_hot_counters indicates the number of hottest >>> + counters (blocks) that compose most of the execution time of >>> + the program. 
A large value would indicate a large flat execution >>> + profile where icache misses may be a concern. */ >>> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >>> + if (!profile_info >>> + || profile_info->num_hot_counters <= size_threshold >>> + || !profile_info->sum_all) >>> + return 1; >>> + >>> + /* Next, exclude some loops where unrolling/peeling may be more >>> + important to overall performance. */ >>> + >>> + /* Ignore FP loops, which are more likely to benefit heavily from >>> + unrolling. */ >>> + if (desc->has_fp) >>> + return 1; >>> + >>> + /* Next, set the value of the codesize-based unroll factor divisor which in >>> + most loops will need to be set to a value that will reduce or eliminate >>> + unrolling/peeling. */ >>> + if (profile_info->num_hot_counters < size_threshold * 2) >>> + { >>> + /* For applications that are less than twice the codesize limit, allow >>> + limited unrolling for very hot loops. */ >>> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >>> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >>> + /* When the profile count sum to loop entry header ratio is smaller than >>> + the threshold (i.e. the loop entry is hot enough, the divisor is set >>> + to 1 so the unroll/peel factor is not reduced. When it is bigger >>> + than the ratio, increase the divisor by the amount this ratio >>> + is over the threshold, which will quickly reduce the unroll/peel >>> + factor to zero as the loop's hotness reduces. */ >>> + if (sum_to_header_ratio > hotness_ratio_threshold) >>> + { >>> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >>> + gcc_assert (limit_factor >= 1); >>> + } >>> + else >>> + limit_factor = 1; >>> + } >>> + else >>> + /* For appliations that are at least twice the codesize limit, set >>> + the divisor to a large value that will force the unroll factor to 0. 
*/ >>> + limit_factor = INT_MAX; >>> + >>> + return limit_factor; >>> +} >>> + >>> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >>> void >>> unroll_and_peel_loops (int flags) >>> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo >>> { >>> unsigned nunroll, nunroll_by_av, i; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_UNROLL)) >>> { >>> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo >>> "\n;; Considering unrolling loop with runtime " >>> "computable number of iterations\n"); >>> >>> + if (flag_unroll_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from unrolling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max unrolled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* nunroll = total number of copies of the original loop body in >>> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >>> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >>> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >>> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >>> + / loop->ninsns; >>> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >>> + / limit_factor / loop->av_ninsns; >>> if (nunroll > nunroll_by_av) >>> nunroll = nunroll_by_av; >>> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >>> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> { >>> unsigned npeel; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_PEEL)) >>> { >>> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) >>> if (dump_file) >>> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >>> >>> + if (flag_peel_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from peeling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max peeled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* npeel = number of iterations to peel. */ >>> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >>> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >>> + / loop->ninsns; >>> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >>> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >>> >>> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> >>> /* Do not simply peel loops with branches inside -- it increases number >>> of mispredicts. 
*/ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not peeling, contains branches\n"); >>> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> { >>> unsigned nunroll, nunroll_by_av, i; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_UNROLL_ALL)) >>> { >>> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags >>> if (dump_file) >>> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >>> >>> + if (flag_unroll_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from unrolling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max unrolled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* nunroll = total number of copies of the original loop body in >>> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >>> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >>> - nunroll_by_av >>> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >>> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >>> + / loop->ninsns; >>> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >>> + / limit_factor / loop->av_ninsns; >>> if (nunroll > nunroll_by_av) >>> nunroll = nunroll_by_av; >>> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >>> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> >>> /* Do not unroll loops with branches inside -- it increases number >>> of mispredicts. 
*/ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >>> Index: gcc/coverage.c >>> =================================================================== >>> --- gcc/coverage.c (revision 189893) >>> +++ gcc/coverage.c (working copy) >>> @@ -247,6 +247,14 @@ read_counts_file (void) >>> gcov_read_summary (&sum); >>> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >>> { >>> + if (summary.ctrs[ix].num_hot_counters >>> + < sum.ctrs[ix].num_hot_counters) >>> + { >>> + summary.ctrs[ix].num_hot_counters >>> + = sum.ctrs[ix].num_hot_counters; >>> + summary.ctrs[ix].hot_cutoff_value >>> + = sum.ctrs[ix].hot_cutoff_value; >>> + } >>> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >>> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >>> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >>> Index: gcc/loop-iv.c >>> =================================================================== >>> --- gcc/loop-iv.c (revision 189893) >>> +++ gcc/loop-iv.c (working copy) >>> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) >>> /* At least desc->infinite is not always initialized by >>> find_simple_loop_exit. 
*/ >>> desc = XCNEW (struct niter_desc); >>> - iv_analysis_loop_init (loop); >>> - find_simple_exit (loop, desc); >>> + if (loop->latch != EXIT_BLOCK_PTR) >>> + { >>> + iv_analysis_loop_init (loop); >>> + find_simple_exit (loop, desc); >>> + } >>> + analyze_loop_insns (loop, desc); >>> loop->aux = desc; >>> >>> if (desc->simple_p && (desc->assumptions || desc->infinite)) >>> Index: gcc/common.opt >>> =================================================================== >>> --- gcc/common.opt (revision 189893) >>> +++ gcc/common.opt (working copy) >>> @@ -1551,6 +1551,10 @@ fpcc-struct-return >>> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >>> Return small aggregates in memory, not registers >>> >>> +fpeel-codesize-limit >>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >>> + >>> fpeel-loops >>> Common Report Var(flag_peel_loops) Optimization >>> Perform loop peeling >>> @@ -2112,6 +2116,10 @@ funroll-all-loops >>> Common Report Var(flag_unroll_all_loops) Optimization >>> Perform loop unrolling for all loops >>> >>> +funroll-codesize-limit >>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >>> + >>> ; Nonzero means that loop optimizer may assume that the induction variables >>> ; that control loops do not overflow and that the loops with nontrivial >>> ; exit condition are not infinite >>> Index: gcc/cfgloop.c >>> =================================================================== >>> --- gcc/cfgloop.c (revision 189893) >>> +++ gcc/cfgloop.c (working copy) >>> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) >>> return edges; >>> } >>> >>> -/* Counts the number of conditional branches inside LOOP. */ >>> +/* Determine if INSN is a floating point set. 
*/ >>> >>> -unsigned >>> -num_loop_branches (const struct loop *loop) >>> +static bool >>> +insn_has_fp_set(rtx insn) >>> { >>> - unsigned i, n; >>> - basic_block * body; >>> + int i; >>> + rtx pat = PATTERN(insn); >>> + if (GET_CODE (pat) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >>> + else if (GET_CODE (pat) == PARALLEL) >>> + { >>> + for (i = 0; i < XVECLEN (pat, 0); i++) >>> + { >>> + rtx sub = XVECEXP (pat, 0, i); >>> + if (GET_CODE (sub) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >>> + } >>> + } >>> + return false; >>> +} >>> >>> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >>> + the number of conditional branch instructions, calls and fp instructions, >>> + as well as the average number of branches executed per iteration. */ >>> >>> +void >>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >>> +{ >>> + unsigned i, nbranch; >>> + gcov_type weighted_nbranch; >>> + bool has_call, has_fp; >>> + basic_block * body, bb; >>> + rtx insn; >>> + gcov_type header_count = loop->header->count; >>> + >>> + nbranch = weighted_nbranch = 0; >>> + has_call = has_fp = false; >>> + >>> body = get_loop_body (loop); >>> - n = 0; >>> for (i = 0; i < loop->num_nodes; i++) >>> - if (EDGE_COUNT (body[i]->succs) >= 2) >>> - n++; >>> + { >>> + bb = body[i]; >>> + >>> + if (EDGE_COUNT (bb->succs) >= 2) >>> + { >>> + nbranch++; >>> + >>> + /* If this block is executed less frequently than the header (loop >>> + entry), then it is weighted based on its execution count, which >>> + will be turned into a ratio compared to the loop header below. */ >>> + if (bb->count < header_count) >>> + weighted_nbranch += bb->count; >>> + >>> + /* When it is executed more frequently than the header (i.e. 
it is >>> + in a nested inner loop), simply weight the branch the same as the >>> + header execution count, so that it will contribute 1 branch to >>> + the ratio computed below. */ >>> + else >>> + weighted_nbranch += header_count; >>> + } >>> + >>> + /* No need to iterate through the instructions below if >>> + both flags have already been set. */ >>> + if (has_call && has_fp) >>> + continue; >>> + >>> + FOR_BB_INSNS (bb, insn) >>> + { >>> + if (!INSN_P (insn)) >>> + continue; >>> + >>> + if (!has_call) >>> + has_call = CALL_P (insn); >>> + >>> + if (!has_fp) >>> + has_fp = insn_has_fp_set (insn); >>> + } >>> + } >>> free (body); >>> >>> - return n; >>> + desc->num_branches = nbranch; >>> + /* Now divide the weights computed above by the loop header execution count, >>> + to compute the average number of branches through the loop. By adding >>> + header_count/2 to the numerator we round to nearest with integer >>> + division. */ >>> + if (header_count != 0) >>> + desc->av_num_branches >>> + = (weighted_nbranch + header_count/2) / header_count; >>> + else >>> + desc->av_num_branches = 0; >>> + desc->has_call = has_call; >>> + desc->has_fp = has_fp; >>> } >>> >>> /* Adds basic block BB to LOOP. */ >>> Index: gcc/cfgloop.h >>> =================================================================== >>> --- gcc/cfgloop.h (revision 189893) >>> +++ gcc/cfgloop.h (working copy) >>> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >>> >>> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >>> edge single_exit (const struct loop *); >>> -extern unsigned num_loop_branches (const struct loop *); >>> >>> extern edge loop_preheader_edge (const struct loop *); >>> extern edge loop_latch_edge (const struct loop *); >>> @@ -366,7 +365,8 @@ struct rtx_iv >>> }; >>> >>> /* The description of an exit from the loop and of the number of iterations >>> - till we take the exit. */ >>> + till we take the exit. 
Also includes other information used primarily >>> + by the loop unroller. */ >>> >>> struct niter_desc >>> { >>> @@ -407,6 +407,18 @@ struct niter_desc >>> >>> /* The number of iterations of the loop. */ >>> rtx niter_expr; >>> + >>> + /* The number of branches in the loop. */ >>> + unsigned num_branches; >>> + >>> + /* The number of executed branches per iteration. */ >>> + unsigned av_num_branches; >>> + >>> + /* Whether the loop contains a call instruction. */ >>> + bool has_call; >>> + >>> + /* Whether the loop contains fp instructions. */ >>> + bool has_fp; >>> }; >>> >>> extern void iv_analysis_loop_init (struct loop *); >>> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >>> >>> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >>> extern void free_simple_loop_desc (struct loop *loop); >>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >>> >>> static inline struct niter_desc * >>> simple_loop_desc (struct loop *loop) >>> Index: gcc/params.def >>> =================================================================== >>> --- gcc/params.def (revision 189893) >>> +++ gcc/params.def (working copy) >>> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >>> "max-completely-peel-loop-nest-depth", >>> "The maximum depth of a loop nest we completely peel", >>> 8, 0, 0) >>> +/* The maximum code size estimate under which loop unrolling and peeling >>> + * is allowed in a profile feedback compile. This currently applies to loops >>> + * with non-constant iteration counts and no floating point computations. */ >>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >>> + "unrollpeel-codesize-threshold", >>> + "Maximum profile-based code size footprint estimate for loop unrolling " >>> + "and peeling", >>> + 15000, 0, 0) >>> +/* The maximum ratio of total profiled execution counts to loop entry block >>> + count that must be exceeded to ignore most code size limits when unrolling >>> + and peeling. 
*/ >>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >>> + "unrollpeel-hotness-threshold", >>> + "Maximum ratio of total profiled execution count to loop entry " >>> + "block count under which most codesize limits for unrolling and " >>> + "peeling will be ignored", >>> + 100, 1, 0) >>> >>> /* The maximum number of insns of an unswitched loop. */ >>> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >>> Index: gcc/gcov-dump.c >>> =================================================================== >>> --- gcc/gcov-dump.c (revision 189893) >>> +++ gcc/gcov-dump.c (working copy) >>> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >>> { >>> printf ("\n"); >>> print_prefix (filename, 0, 0); >>> - printf ("\t\tcounts=%u, runs=%u", >>> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >>> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >>> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >>> + summary.ctrs[ix].num, >>> + summary.ctrs[ix].num_hot_counters, >>> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >>> + summary.ctrs[ix].runs); >>> >>> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >>> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >>> >>> -- >>> This patch is available for review at http://codereview.appspot.com/6351086 >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Index: doc/invoke.texi =================================================================== --- doc/invoke.texi (revision 193626) +++ doc/invoke.texi (working copy) @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol -fprefetch-loop-arrays -fprofile-report @gol -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol -fprofile-generate=@var{path} @gol @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol -ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol --funit-at-a-time -funroll-all-loops -funroll-loops @gol +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol -fwhole-program -fwpa -fuse-linker-plugin @gol @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs @option{-funroll-all-loops} implies the same options as @option{-funroll-loops}. +@item -funroll-codesize-limit +@opindex funroll-codesize-limit +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. 
+ @item -fpeel-loops @opindex fpeel-loops Peels loops for which there is enough information that they do not @@ -8779,6 +8787,14 @@ roll much (from profile feedback). It also turns Enabled with @option{-fprofile-use}. +@item -fpeel-codesize-limit +@opindex fpeel-codesize-limit +Limit loop peeling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fmove-loop-invariants @opindex fmove-loop-invariants Enables the loop invariant motion pass in the RTL loop optimizer. Enabled @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s @item max-completely-peel-loop-nest-depth The maximum depth of a loop nest suitable for complete peeling. +@item unrollpeel-codesize-threshold +Maximum profile-based code size footprint estimate for loop unrolling and +peeling. + +@item unrollpeel-hotness-threshold +Maximum ratio of total execution count to loop entry block count under which +most profile-based code size estimates will be ignored for loop unrolling and +peeling. + @item max-unswitch-insns The maximum number of insns of an unswitched loop. Index: loop-unroll.c =================================================================== --- loop-unroll.c (revision 193626) +++ loop-unroll.c (working copy) @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see #include "hashtab.h" #include "recog.h" #include "target.h" +#include "gcov-io.h" #include "dumpfile.h" /* This pass performs loop unrolling and peeling.
We only perform these @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc basic_block); static rtx get_expansion (struct var_to_expand *); +/* Determine whether and how much LOOP unrolling/peeling should be constrained + based on code footprint estimates. Returns the codesize-based factor to be + divided into the max instructions in an unrolled or peeled loop: + 1) For size <= threshold, do not limit (by returning 1). + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or + unrolled instructions according to loop hotness. + 3) For size >= 2*threshold, disable unrolling/peeling (by returning + INT_MAX). */ + +static int +code_size_limit_factor (struct loop *loop) +{ + unsigned size_threshold, num_hot_counters; + struct niter_desc *desc = get_simple_loop_desc (loop); + gcov_type sum_to_header_ratio; + int hotness_ratio_threshold; + int limit_factor; + gcov_working_set_t *ws; + + ws = find_working_set (999); + if (!ws) + return 1; + num_hot_counters = ws->num_counters; + + /* First check if the application has a large codesize footprint. + This is estimated from FDO profile summary information for the + program, where the num_hot_counters indicates the number of hottest + counters (blocks) that compose most of the execution time of + the program. A large value would indicate a large flat execution + profile where icache misses may be a concern. */ + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); + if (!profile_info + || num_hot_counters <= size_threshold + || !profile_info->sum_all) + return 1; + + /* Next, exclude some loops where unrolling/peeling may be more + important to overall performance. */ + + /* Ignore FP loops, which are more likely to benefit heavily from + unrolling. */ + if (desc->has_fp) + return 1; + + /* Next, set the value of the codesize-based unroll factor divisor which in + most loops will need to be set to a value that will reduce or eliminate + unrolling/peeling.
*/ + if (num_hot_counters < size_threshold * 2 + && loop->header->count > 0) + { + /* For applications that are less than twice the codesize limit, allow + limited unrolling for very hot loops. */ + sum_to_header_ratio = profile_info->sum_all / loop->header->count; + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); + /* When the profile count sum to loop entry header ratio is smaller than + the threshold (i.e. the loop entry is hot enough), the divisor is set + to 1 so the unroll/peel factor is not reduced. When it is bigger + than the threshold, increase the divisor by the amount this ratio + is over the threshold, which will quickly reduce the unroll/peel + factor to zero as the loop's hotness reduces. */ + if (sum_to_header_ratio > hotness_ratio_threshold) + { + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; + gcc_assert (limit_factor >= 1); + } + else + limit_factor = 1; + } + else + /* For applications that are at least twice the codesize limit, set + the divisor to a large value that will force the unroll factor to 0. */ + limit_factor = INT_MAX; + + return limit_factor; +} + /* Unroll and/or peel (depending on FLAGS) LOOPS. */ void unroll_and_peel_loops (int flags) @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; double_int iterations; if (!(flags & UAP_UNROLL)) @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo "\n;; Considering unrolling loop with runtime " "computable number of iterations\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks.
*/ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) { unsigned npeel; double_int iterations; + struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_PEEL)) { @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) if (dump_file) fprintf (dump_file, "\n;; Considering simply peeling loop\n"); + if (flag_peel_codesize_limit) + { + /* Determine whether to limit code size growth from peeling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max peeled insns by divisor %d\n", limit_factor); + } + } + /* npeel = number of iterations to peel. 
*/ - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor + / loop->ninsns; if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) } /* Do not simply peel loops with branches inside -- it increases number - of mispredicts. + of mispredicts. Exception is when we do have profile and we however have good chance to peel proper number of iterations loop will iterate in practice. TODO: this heuristic needs tunning; while for complette unrolling @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) peeling it is not the case. Also a function call inside loop is also branch from branch prediction POV (and probably better reason to not unroll/peel). */ - if (num_loop_branches (loop) > 1 + desc = get_simple_loop_desc (loop); + if (desc->num_branches > 1 && profile_status != PROFILE_READ) { if (dump_file) @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; double_int iterations; if (!(flags & UAP_UNROLL_ALL)) @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags if (dump_file) fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags of mispredicts. TODO: this heuristic needs tunning; call inside the loop body is also relatively good reason to not unroll. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not unrolling, contains branches\n"); Index: loop-iv.c =================================================================== --- loop-iv.c (revision 193626) +++ loop-iv.c (working copy) @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) /* At least desc->infinite is not always initialized by find_simple_loop_exit. 
*/ desc = XCNEW (struct niter_desc); - iv_analysis_loop_init (loop); - find_simple_exit (loop, desc); + if (loop->latch != EXIT_BLOCK_PTR) + { + iv_analysis_loop_init (loop); + find_simple_exit (loop, desc); + } + analyze_loop_insns (loop, desc); loop->aux = desc; if (desc->simple_p && (desc->assumptions || desc->infinite)) Index: common.opt =================================================================== --- common.opt (revision 193626) +++ common.opt (working copy) @@ -1573,6 +1573,10 @@ fpcc-struct-return Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) Return small aggregates in memory, not registers +fpeel-codesize-limit +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop peeling under profile estimates of large code footprint + fpeel-loops Common Report Var(flag_peel_loops) Optimization Perform loop peeling @@ -2138,6 +2142,10 @@ funroll-all-loops Common Report Var(flag_unroll_all_loops) Optimization Perform loop unrolling for all loops +funroll-codesize-limit +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop unrolling under profile estimates of large code footprint + ; Nonzero means that loop optimizer may assume that the induction variables ; that control loops do not overflow and that the loops with nontrivial ; exit condition are not infinite Index: cfgloop.c =================================================================== --- cfgloop.c (revision 193626) +++ cfgloop.c (working copy) @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) return edges; } -/* Counts the number of conditional branches inside LOOP. */ +/* Determine if INSN is a floating point set. 
+   */
-unsigned
-num_loop_branches (const struct loop *loop)
+static bool
+insn_has_fp_set(rtx insn)
 {
-  unsigned i, n;
-  basic_block * body;
+  int i;
+  rtx pat = PATTERN(insn);
+  if (GET_CODE (pat) == SET)
+    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
+  else if (GET_CODE (pat) == PARALLEL)
+    {
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+        {
+          rtx sub = XVECEXP (pat, 0, i);
+          if (GET_CODE (sub) == SET)
+            return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
+        }
+    }
+  return false;
+}

-  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
+/* Analyzes the instructions inside LOOP, updating the DESC.  Currently counts
+   the number of conditional branch instructions, calls and fp instructions,
+   as well as the average number of branches executed per iteration.  */
+void
+analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
+{
+  unsigned i, nbranch;
+  gcov_type weighted_nbranch;
+  bool has_call, has_fp;
+  basic_block * body, bb;
+  rtx insn;
+  gcov_type header_count = loop->header->count;
+
+  nbranch = weighted_nbranch = 0;
+  has_call = has_fp = false;
+
   body = get_loop_body (loop);
-  n = 0;
   for (i = 0; i < loop->num_nodes; i++)
-    if (EDGE_COUNT (body[i]->succs) >= 2)
-      n++;
+    {
+      bb = body[i];
+
+      if (EDGE_COUNT (bb->succs) >= 2)
+        {
+          nbranch++;
+
+          /* If this block is executed less frequently than the header (loop
+             entry), then it is weighted based on its execution count, which
+             will be turned into a ratio compared to the loop header below.  */
+          if (bb->count < header_count)
+            weighted_nbranch += bb->count;
+
+          /* When it is executed more frequently than the header (i.e. it is
+             in a nested inner loop), simply weight the branch the same as the
+             header execution count, so that it will contribute 1 branch to
+             the ratio computed below.  */
+          else
+            weighted_nbranch += header_count;
+        }
+
+      /* No need to iterate through the instructions below if
+         both flags have already been set.
+         */
+      if (has_call && has_fp)
+        continue;
+
+      FOR_BB_INSNS (bb, insn)
+        {
+          if (!INSN_P (insn))
+            continue;
+
+          if (!has_call)
+            has_call = CALL_P (insn);
+
+          if (!has_fp)
+            has_fp = insn_has_fp_set (insn);
+        }
+    }
   free (body);
-  return n;
+  desc->num_branches = nbranch;
+
+  /* Now divide the weights computed above by the loop header execution count,
+     to compute the average number of branches through the loop.  By adding
+     header_count/2 to the numerator we round to nearest with integer
+     division.  */
+  if (header_count != 0)
+    desc->av_num_branches
+      = (weighted_nbranch + header_count/2) / header_count;
+  else
+    desc->av_num_branches = 0;
+
+  desc->has_call = has_call;
+  desc->has_fp = has_fp;
 }

 /* Adds basic block BB to LOOP.  */
Index: cfgloop.h
===================================================================
--- cfgloop.h	(revision 193626)
+++ cfgloop.h	(working copy)
@@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order
 extern vec<edge> get_loop_exit_edges (const struct loop *);
 extern edge single_exit (const struct loop *);
 extern edge single_likely_exit (struct loop *loop);
-extern unsigned num_loop_branches (const struct loop *);

 extern edge loop_preheader_edge (const struct loop *);
 extern edge loop_latch_edge (const struct loop *);
@@ -365,7 +364,8 @@ struct rtx_iv
 };

 /* The description of an exit from the loop and of the number of iterations
-   till we take the exit.  */
+   till we take the exit.  Also includes other information used primarily
+   by the loop unroller.  */
 struct niter_desc
 {
@@ -403,6 +403,18 @@ struct niter_desc

   /* The number of iterations of the loop.  */
   rtx niter_expr;
+
+  /* The number of branches in the loop.  */
+  unsigned num_branches;
+
+  /* The number of executed branches per iteration.  */
+  unsigned av_num_branches;
+
+  /* Whether the loop contains a call instruction.  */
+  bool has_call;
+
+  /* Whether the loop contains fp instructions.
+     */
+  bool has_fp;
 };

 extern void iv_analysis_loop_init (struct loop *);
@@ -416,6 +428,7 @@ extern void iv_analysis_done (void);
 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 extern void free_simple_loop_desc (struct loop *loop);

+void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
 static inline struct niter_desc *
 simple_loop_desc (struct loop *loop)
Index: params.def
===================================================================
--- params.def	(revision 193626)
+++ params.def	(working copy)
@@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
	 "max-completely-peel-loop-nest-depth",
	 "The maximum depth of a loop nest we completely peel",
	 8, 0, 0)
+/* The maximum code size estimate under which loop unrolling and peeling
+ * is allowed in a profile feedback compile.  This currently applies to loops
+ * with non-constant iteration counts and no floating point computations.  */
+DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
+	 "unrollpeel-codesize-threshold",
+	 "Maximum profile-based code size footprint estimate for loop unrolling "
+	 "and peeling",
+	 15000, 0, 0)
+
+/* The maximum ratio of total profiled execution counts to loop entry block
+   count that must be exceeded to ignore most code size limits when unrolling
+   and peeling.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+	 "unrollpeel-hotness-threshold",
+	 "Maximum ratio of total profiled execution count to loop entry "
+	 "block count under which most codesize limits for unrolling and "
+	 "peeling will be ignored",
+	 100, 1, 0)

 /* The maximum number of insns of an unswitched loop.  */