Message ID | mpttv7is8fp.fsf@arm.com |
---|---|
State | New |
Headers | show |
Series | Optionally pick the cheapest loop_vec_info | expand |
On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford <Richard.Sandiford@arm.com> wrote: > > This patch adds a mode in which the vectoriser tries each available > base vector mode and picks the one with the lowest cost. For now > the behaviour is behind a default-off --param, but a later patch > enables it by default for SVE. > > The patch keeps the current behaviour of preferring a VF of > loop->simdlen over any larger or smaller VF, regardless of costs > or target preferences. Can you avoid using a --param for this? Instead I'd suggest to amend the vectorize_modes target hook to return some flags like VECT_FIRST_MODE_WINS. We'd eventually want to make the target able to say do-not-vectorize-epiloges-of-MODE (I think we may not want to vectorize SSE vectorized loop epilogues with MMX-with-SSE or GPRs for example). I guess for the latter we'd use a new target hook. Otherwise looks reasonable. Richard. > > 2019-11-05 Richard Sandiford <richard.sandiford@arm.com> > > gcc/ > * params.def (vect-compare-loop-costs): New param. > * doc/invoke.texi: Document it. > * tree-vectorizer.h (_loop_vec_info::vec_outside_cost) > (_loop_vec_info::vec_inside_cost): New member variables. > * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them. > (vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions. > (vect_analyze_loop): When the new parameter allows, try vectorizing > the loop with each available vector mode and picking the one with > the lowest cost. > (vect_estimate_min_profitable_iters): Record the computed costs > in the loop_vec_info. > > Index: gcc/params.def > =================================================================== > --- gcc/params.def 2019-10-31 17:15:25.470517368 +0000 > +++ gcc/params.def 2019-11-05 14:19:58.781197820 +0000 > @@ -661,6 +661,13 @@ DEFPARAM(PARAM_VECT_MAX_PEELING_FOR_ALIG > "Maximum number of loop peels to enhance alignment of data references in a loop.", > -1, -1, 64) > > +DEFPARAM(PARAM_VECT_COMPARE_LOOP_COSTS, > + "vect-compare-loop-costs", > + "Whether to try vectorizing a loop using each supported" > + " combination of vector types and picking the version with the" > + " lowest cost.", > + 0, 0, 1) > + > DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS, > "max-cselib-memory-locations", > "The maximum memory locations recorded by cselib.", > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi 2019-11-04 21:13:57.611756365 +0000 > +++ gcc/doc/invoke.texi 2019-11-05 14:19:58.777197850 +0000 > @@ -11563,6 +11563,12 @@ doing loop versioning for alias in the v > The maximum number of loop peels to enhance access alignment > for vectorizer. Value -1 means no limit. > > +@item vect-compare-loop-costs > +Whether to try vectorizing a loop using each supported combination of > +vector types and picking the version with the lowest cost. This parameter > +has no effect when @option{-fno-vect-cost-model} or > +@option{-fvect-cost-model=unlimited} are used. > + > @item max-iterations-to-track > The maximum number of iterations of a loop the brute-force algorithm > for analysis of the number of iterations of the loop tries to evaluate. > Index: gcc/tree-vectorizer.h > =================================================================== > --- gcc/tree-vectorizer.h 2019-11-05 14:19:33.829371745 +0000 > +++ gcc/tree-vectorizer.h 2019-11-05 14:19:58.781197820 +0000 > @@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve > /* Cost of a single scalar iteration. */ > int single_scalar_iteration_cost; > > + /* The cost of the vector prologue and epilogue, including peeled > + iterations and set-up code. */ > + int vec_outside_cost; > + > + /* The cost of the vector loop body. */ > + int vec_inside_cost; > + > /* Is the loop vectorizable? */ > bool vectorizable; > > Index: gcc/tree-vect-loop.c > =================================================================== > --- gcc/tree-vect-loop.c 2019-11-05 14:19:33.829371745 +0000 > +++ gcc/tree-vect-loop.c 2019-11-05 14:19:58.781197820 +0000 > @@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo > scan_map (NULL), > slp_unrolling_factor (1), > single_scalar_iteration_cost (0), > + vec_outside_cost (0), > + vec_inside_cost (0), > vectorizable (false), > can_fully_mask_p (true), > fully_masked_p (false), > @@ -2373,6 +2375,80 @@ vect_analyze_loop_2 (loop_vec_info loop_ > goto start_over; > } > > +/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears > + to be better than vectorizing it using OLD_LOOP_VINFO. Assume that > + OLD_LOOP_VINFO is better unless something specifically indicates > + otherwise. > + > + Note that this deliberately isn't a partial order. */ > + > +static bool > +vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, > + loop_vec_info old_loop_vinfo) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo); > + gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop); > + > + poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo); > + poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo); > + > + /* Always prefer a VF of loop->simdlen over any other VF. */ > + if (loop->simdlen) > + { > + bool new_simdlen_p = known_eq (new_vf, loop->simdlen); > + bool old_simdlen_p = known_eq (old_vf, loop->simdlen); > + if (new_simdlen_p != old_simdlen_p) > + return new_simdlen_p; > + } > + > + /* Limit the VFs to what is likely to be the maximum number of iterations, > + to handle cases in which at least one loop_vinfo is fully-masked. */ > + HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop); > + if (estimated_max_niter != -1) > + { > + if (known_le (estimated_max_niter, new_vf)) > + new_vf = estimated_max_niter; > + if (known_le (estimated_max_niter, old_vf)) > + old_vf = estimated_max_niter; > + } > + > + /* Check whether the (fractional) cost per scalar iteration is lower > + or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf. */ > + poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost > + * poly_widest_int (old_vf)); > + poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost > + * poly_widest_int (new_vf)); > + if (maybe_lt (rel_old, rel_new)) > + return false; > + if (known_lt (rel_new, rel_old)) > + return true; > + > + /* If there's nothing to choose between the loop bodies, see whether > + there's a difference in the prologue and epilogue costs. */ > + if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost) > + return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost; > + > + return false; > +} > + > +/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO. Return > + true if we should. */ > + > +static bool > +vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo, > + loop_vec_info old_loop_vinfo) > +{ > + if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo)) > + return false; > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "***** Preferring vector mode %s to vector mode %s\n", > + GET_MODE_NAME (new_loop_vinfo->vector_mode), > + GET_MODE_NAME (old_loop_vinfo->vector_mode)); > + return true; > +} > + > /* Function vect_analyze_loop. > > Apply a set of analyses on LOOP, and create a loop_vec_info struct > @@ -2408,6 +2484,8 @@ vect_analyze_loop (class loop *loop, vec > machine_mode next_vector_mode = VOIDmode; > poly_uint64 lowest_th = 0; > unsigned vectorized_loops = 0; > + bool pick_lowest_cost_p = (PARAM_VALUE (PARAM_VECT_COMPARE_LOOP_COSTS) > + && !unlimited_cost_model (loop)); > > bool vect_epilogues = false; > opt_result res = opt_result::success (); > @@ -2428,6 +2506,34 @@ vect_analyze_loop (class loop *loop, vec > > bool fatal = false; > > + /* When pick_lowest_cost_p is true, we should in principle iterate > + over all the loop_vec_infos that LOOP_VINFO could replace and > + try to vectorize LOOP_VINFO under the same conditions. > + E.g. when trying to replace an epilogue loop, we should vectorize > + LOOP_VINFO as an epilogue loop with the same VF limit. When trying > + to replace the main loop, we should vectorize LOOP_VINFO as a main > + loop too. > + > + However, autovectorize_vector_modes is usually sorted as follows: > + > + - Modes that naturally produce lower VFs usually follow modes that > + naturally produce higher VFs. > + > + - When modes naturally produce the same VF, maskable modes > + usually follow unmaskable ones, so that the maskable mode > + can be used to vectorize the epilogue of the unmaskable mode. > + > + This order is preferred because it leads to the maximum > + epilogue vectorization opportunities. Targets should only use > + a different order if they want to make wide modes available while > + disparaging them relative to earlier, smaller modes. The assumption > + in that case is that the wider modes are more expensive in some > + way that isn't reflected directly in the costs. > + > + There should therefore be few interesting cases in which > + LOOP_VINFO fails when treated as an epilogue loop, succeeds when > + treated as a standalone loop, and ends up being genuinely cheaper > + than FIRST_LOOP_VINFO. */ > if (vect_epilogues) > LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; > > @@ -2475,13 +2581,34 @@ vect_analyze_loop (class loop *loop, vec > LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; > simdlen = 0; > } > + else if (pick_lowest_cost_p && first_loop_vinfo) > + { > + /* Keep trying to roll back vectorization attempts while the > + loop_vec_infos they produced were worse than this one. */ > + vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos; > + while (!vinfos.is_empty () > + && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ())) > + { > + gcc_assert (vect_epilogues); > + delete vinfos.pop (); > + } > + if (vinfos.is_empty () > + && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo)) > + { > + delete first_loop_vinfo; > + first_loop_vinfo = opt_loop_vec_info::success (NULL); > + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; > + } > + } > > if (first_loop_vinfo == NULL) > { > first_loop_vinfo = loop_vinfo; > lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); > } > - else if (vect_epilogues) > + else if (vect_epilogues > + /* For now only allow one epilogue loop. */ > + && first_loop_vinfo->epilogue_vinfos.is_empty ()) > { > first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo); > poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); > @@ -2501,12 +2628,14 @@ vect_analyze_loop (class loop *loop, vec > && loop->inner == NULL > && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) > && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo) > - /* For now only allow one epilogue loop. */ > - && first_loop_vinfo->epilogue_vinfos.is_empty ()); > + /* For now only allow one epilogue loop, but allow > + pick_lowest_cost_p to replace it. */ > + && (first_loop_vinfo->epilogue_vinfos.is_empty () > + || pick_lowest_cost_p)); > > /* Commit to first_loop_vinfo if we have no reason to try > alternatives. */ > - if (!simdlen && !vect_epilogues) > + if (!simdlen && !vect_epilogues && !pick_lowest_cost_p) > break; > } > else > @@ -3454,7 +3583,11 @@ vect_estimate_min_profitable_iters (loop > &vec_inside_cost, &vec_epilogue_cost); > > vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost); > - > + > + /* Stash the costs so that we can compare two loop_vec_infos. */ > + loop_vinfo->vec_inside_cost = vec_inside_cost; > + loop_vinfo->vec_outside_cost = vec_outside_cost; > + > if (dump_enabled_p ()) > { > dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");
Richard Biener <richard.guenther@gmail.com> writes: > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford > <Richard.Sandiford@arm.com> wrote: >> >> This patch adds a mode in which the vectoriser tries each available >> base vector mode and picks the one with the lowest cost. For now >> the behaviour is behind a default-off --param, but a later patch >> enables it by default for SVE. >> >> The patch keeps the current behaviour of preferring a VF of >> loop->simdlen over any larger or smaller VF, regardless of costs >> or target preferences. > > Can you avoid using a --param for this? Instead I'd suggest to > amend the vectorize_modes target hook to return some > flags like VECT_FIRST_MODE_WINS. We'd eventually want > to make the target able to say do-not-vectorize-epiloges-of-MODE > (I think we may not want to vectorize SSE vectorized loop > epilogues with MMX-with-SSE or GPRs for example). I guess > for the latter we'd use a new target hook. The reason for using a --param was that I wanted a way of turning this on and off on the command line, so that users can experiment with it if necessary. E.g. enabling the --param could be a viable alternative to -mprefix-* in some cases. Disabling it would be a way of working around a bad cost model decision without going all the way to -fno-vect-cost-model. These kinds of --params can become useful workarounds until an optimisation bug is fixed. Thanks, Richard
On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Richard Biener <richard.guenther@gmail.com> writes: > > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford > > <Richard.Sandiford@arm.com> wrote: > >> > >> This patch adds a mode in which the vectoriser tries each available > >> base vector mode and picks the one with the lowest cost. For now > >> the behaviour is behind a default-off --param, but a later patch > >> enables it by default for SVE. > >> > >> The patch keeps the current behaviour of preferring a VF of > >> loop->simdlen over any larger or smaller VF, regardless of costs > >> or target preferences. > > > > Can you avoid using a --param for this? Instead I'd suggest to > > amend the vectorize_modes target hook to return some > > flags like VECT_FIRST_MODE_WINS. We'd eventually want > > to make the target able to say do-not-vectorize-epiloges-of-MODE > > (I think we may not want to vectorize SSE vectorized loop > > epilogues with MMX-with-SSE or GPRs for example). I guess > > for the latter we'd use a new target hook. > > The reason for using a --param was that I wanted a way of turning > this on and off on the command line, so that users can experiment > with it if necessary. E.g. enabling the --param could be a viable > alternative to -mprefix-* in some cases. Disabling it would be > a way of working around a bad cost model decision without going > all the way to -fno-vect-cost-model. > > These kinds of --params can become useful workarounds until an > optimisation bug is fixed. I'm arguing that the default depends on the actual ISAs so there isn't a one-fits all and given we have OMP SIMD and target cloning for multiple ISAs this looks like a wrong approach. For sure the target can use its own switches to override defaults here, or alternatively we might want to have a #pragma GCC simdlen mimicing OMP behavior here. Richard. > > Thanks, > Richard
Richard Biener <richard.guenther@gmail.com> writes: > On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> Richard Biener <richard.guenther@gmail.com> writes: >> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford >> > <Richard.Sandiford@arm.com> wrote: >> >> >> >> This patch adds a mode in which the vectoriser tries each available >> >> base vector mode and picks the one with the lowest cost. For now >> >> the behaviour is behind a default-off --param, but a later patch >> >> enables it by default for SVE. >> >> >> >> The patch keeps the current behaviour of preferring a VF of >> >> loop->simdlen over any larger or smaller VF, regardless of costs >> >> or target preferences. >> > >> > Can you avoid using a --param for this? Instead I'd suggest to >> > amend the vectorize_modes target hook to return some >> > flags like VECT_FIRST_MODE_WINS. We'd eventually want >> > to make the target able to say do-not-vectorize-epiloges-of-MODE >> > (I think we may not want to vectorize SSE vectorized loop >> > epilogues with MMX-with-SSE or GPRs for example). I guess >> > for the latter we'd use a new target hook. >> >> The reason for using a --param was that I wanted a way of turning >> this on and off on the command line, so that users can experiment >> with it if necessary. E.g. enabling the --param could be a viable >> alternative to -mprefix-* in some cases. Disabling it would be >> a way of working around a bad cost model decision without going >> all the way to -fno-vect-cost-model. >> >> These kinds of --params can become useful workarounds until an >> optimisation bug is fixed. > > I'm arguing that the default depends on the actual ISAs so there isn't > a one-fits all and given we have OMP SIMD and target cloning for > multiple ISAs this looks like a wrong approach. For sure the > target can use its own switches to override defaults here, or alternatively > we might want to have a #pragma GCC simdlen mimicing OMP behavior > here. I agree there's no one-size-fits-all choice here, but that's true for other --params too. The problem with using target switches is that we have to explain them and to keep accepting them "forever" (or at least with a long deprecation period). Whereas the --param was just something that people could play with or perhaps use to work around problems temporarily. It would come with no guarantees attached. And what the --param did applied to any targets that support multiple modes, regardless of what the targets do by default. All that said, here's a version that returns the bitmask you suggested. I ended up making the flag select the new behaviour and 0 select the current behaviour, rather than have a flag for "first mode wins". Tested as before. Thanks, Richard 2019-11-07 Richard Sandiford <richard.sandiford@arm.com> gcc/ * target.h (VECT_COMPARE_COSTS): New constant. * target.def (autovectorize_vector_modes): Return a bitmask of flags. * doc/tm.texi: Regenerate. * targhooks.h (default_autovectorize_vector_modes): Update accordingly. * targhooks.c (default_autovectorize_vector_modes): Likewise. * config/aarch64/aarch64.c (aarch64_autovectorize_vector_modes): Likewise. * config/arc/arc.c (arc_autovectorize_vector_modes): Likewise. * config/arm/arm.c (arm_autovectorize_vector_modes): Likewise. * config/i386/i386.c (ix86_autovectorize_vector_modes): Likewise. * config/mips/mips.c (mips_autovectorize_vector_modes): Likewise. * tree-vectorizer.h (_loop_vec_info::vec_outside_cost) (_loop_vec_info::vec_inside_cost): New member variables. * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them. (vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions. (vect_analyze_loop): When autovectorize_vector_modes returns VECT_COMPARE_COSTS, try vectorizing the loop with each available vector mode and picking the one with the lowest cost. (vect_estimate_min_profitable_iters): Record the computed costs in the loop_vec_info. Index: gcc/target.h =================================================================== --- gcc/target.h 2019-11-07 15:11:15.831017985 +0000 +++ gcc/target.h 2019-11-07 16:52:30.037198353 +0000 @@ -218,6 +218,14 @@ enum omp_device_kind_arch_isa { omp_device_isa }; +/* Flags returned by TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES: + + VECT_COMPARE_COSTS + Tells the loop vectorizer to try all the provided modes and + pick the one with the lowest cost. By default the vectorizer + will choose the first mode that works. */ +const unsigned int VECT_COMPARE_COSTS = 1U << 0; + /* The target structure. This holds all the backend hooks. */ #define DEFHOOKPOD(NAME, DOC, TYPE, INIT) TYPE NAME; #define DEFHOOK(NAME, DOC, TYPE, PARAMS, INIT) TYPE (* NAME) PARAMS; Index: gcc/target.def =================================================================== --- gcc/target.def 2019-11-07 15:11:15.819018071 +0000 +++ gcc/target.def 2019-11-07 16:52:30.037198353 +0000 @@ -1925,10 +1925,20 @@ element mode.\n\ If @var{all} is true, add suitable vector modes even when they are generally\n\ not expected to be worthwhile.\n\ \n\ +The hook returns a bitmask of flags that control how the modes in\n\ +@var{modes} are used. The flags are:\n\ +@table @code\n\ +@item VECT_COMPARE_COSTS\n\ +Tells the loop vectorizer to try all the provided modes and pick the one\n\ +with the lowest cost. By default the vectorizer will choose the first\n\ +mode that works.\n\ +@end table\n\ +\n\ The hook does not need to do anything if the vector returned by\n\ @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant\n\ -for autovectorization. The default implementation does nothing.", - void, +for autovectorization. The default implementation adds no modes and\n\ +returns 0.", + unsigned int, (vector_modes *modes, bool all), default_autovectorize_vector_modes) Index: gcc/doc/tm.texi =================================================================== --- gcc/doc/tm.texi 2019-11-07 15:11:15.779018354 +0000 +++ gcc/doc/tm.texi 2019-11-07 16:52:30.037198353 +0000 @@ -6016,7 +6016,7 @@ against lower halves of vectors recursiv reached. The default is @var{mode} which means no splitting. @end deftypefn -@deftypefn {Target Hook} void TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all}) +@deftypefn {Target Hook} {unsigned int} TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all}) If using the mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is not the only approach worth considering, this hook should add one mode to @var{modes} for each useful alternative approach. These modes are then @@ -6032,9 +6032,19 @@ element mode. If @var{all} is true, add suitable vector modes even when they are generally not expected to be worthwhile. +The hook returns a bitmask of flags that control how the modes in +@var{modes} are used. The flags are: +@table @code +@item VECT_COMPARE_COSTS +Tells the loop vectorizer to try all the provided modes and pick the one +with the lowest cost. By default the vectorizer will choose the first +mode that works. +@end table + The hook does not need to do anything if the vector returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant -for autovectorization. The default implementation does nothing. +for autovectorization. The default implementation adds no modes and +returns 0. @end deftypefn @deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_RELATED_MODE (machine_mode @var{vector_mode}, scalar_mode @var{element_mode}, poly_uint64 @var{nunits}) Index: gcc/targhooks.h =================================================================== --- gcc/targhooks.h 2019-11-07 15:11:15.831017985 +0000 +++ gcc/targhooks.h 2019-11-07 16:52:30.041198324 +0000 @@ -113,7 +113,7 @@ default_builtin_support_vector_misalignm int, bool); extern machine_mode default_preferred_simd_mode (scalar_mode mode); extern machine_mode default_split_reduction (machine_mode); -extern void default_autovectorize_vector_modes (vector_modes *, bool); +extern unsigned int default_autovectorize_vector_modes (vector_modes *, bool); extern opt_machine_mode default_vectorize_related_mode (machine_mode, scalar_mode, poly_uint64); Index: gcc/targhooks.c =================================================================== --- gcc/targhooks.c 2019-11-07 15:11:15.831017985 +0000 +++ gcc/targhooks.c 2019-11-07 16:52:30.041198324 +0000 @@ -1301,9 +1301,10 @@ default_split_reduction (machine_mode mo /* By default only the preferred vector mode is tried. */ -void +unsigned int default_autovectorize_vector_modes (vector_modes *, bool) { + return 0; } /* The default implementation of TARGET_VECTORIZE_RELATED_MODE. */ Index: gcc/config/aarch64/aarch64.c =================================================================== --- gcc/config/aarch64/aarch64.c 2019-11-07 15:11:19.442992405 +0000 +++ gcc/config/aarch64/aarch64.c 2019-11-07 16:52:30.021198461 +0000 @@ -15949,7 +15949,7 @@ aarch64_preferred_simd_mode (scalar_mode /* Return a list of possible vector sizes for the vectorizer to iterate over. */ -static void +static unsigned int aarch64_autovectorize_vector_modes (vector_modes *modes, bool) { if (TARGET_SVE) @@ -15975,6 +15975,8 @@ aarch64_autovectorize_vector_modes (vect TODO: We could similarly support limited forms of V2QImode and V2HImode for this case. */ modes->safe_push (V2SImode); + + return 0; } /* Implement TARGET_MANGLE_TYPE. */ Index: gcc/config/arc/arc.c =================================================================== --- gcc/config/arc/arc.c 2019-11-07 15:11:15.599019628 +0000 +++ gcc/config/arc/arc.c 2019-11-07 16:52:30.021198461 +0000 @@ -609,7 +609,7 @@ arc_preferred_simd_mode (scalar_mode mod /* Implements target hook TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES. */ -static void +static unsigned int arc_autovectorize_vector_modes (vector_modes *modes, bool) { if (TARGET_PLUS_QMACW) @@ -617,6 +617,7 @@ arc_autovectorize_vector_modes (vector_m modes->quick_push (V4HImode); modes->quick_push (V2HImode); } + return 0; } Index: gcc/config/arm/arm.c =================================================================== --- gcc/config/arm/arm.c 2019-11-07 15:11:15.683019033 +0000 +++ gcc/config/arm/arm.c 2019-11-07 16:52:30.029198406 +0000 @@ -289,7 +289,7 @@ static bool arm_builtin_support_vector_m static void arm_conditional_register_usage (void); static enum flt_eval_method arm_excess_precision (enum excess_precision_type); static reg_class_t arm_preferred_rename_class (reg_class_t rclass); -static void arm_autovectorize_vector_modes (vector_modes *, bool); +static unsigned int arm_autovectorize_vector_modes (vector_modes *, bool); static int arm_default_branch_cost (bool, bool); static int arm_cortex_a5_branch_cost (bool, bool); static int arm_cortex_m_branch_cost (bool, bool); @@ -29015,7 +29015,7 @@ arm_vector_alignment (const_tree type) return align; } -static void +static unsigned int arm_autovectorize_vector_modes (vector_modes *modes, bool) { if (!TARGET_NEON_VECTORIZE_DOUBLE) @@ -29023,6 +29023,7 @@ arm_autovectorize_vector_modes (vector_m modes->safe_push (V16QImode); modes->safe_push (V8QImode); } + return 0; } static bool Index: gcc/config/i386/i386.c =================================================================== --- gcc/config/i386/i386.c 2019-11-07 15:11:15.715018807 +0000 +++ gcc/config/i386/i386.c 2019-11-07 16:52:30.033198382 +0000 @@ -21385,7 +21385,7 @@ ix86_preferred_simd_mode (scalar_mode mo vectors. If AVX512F is enabled then try vectorizing with 512bit, 256bit and 128bit vectors. */ -static void +static unsigned int ix86_autovectorize_vector_modes (vector_modes *modes, bool all) { if (TARGET_AVX512F && !TARGET_PREFER_AVX256) @@ -21415,6 +21415,8 @@ ix86_autovectorize_vector_modes (vector_ if (TARGET_MMX_WITH_SSE) modes->safe_push (V8QImode); + + return 0; } /* Implemenation of targetm.vectorize.get_mask_mode. */ Index: gcc/config/mips/mips.c =================================================================== --- gcc/config/mips/mips.c 2019-11-07 15:11:15.755018524 +0000 +++ gcc/config/mips/mips.c 2019-11-07 16:52:30.037198353 +0000 @@ -13455,11 +13455,12 @@ mips_preferred_simd_mode (scalar_mode mo /* Implement TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES. */ -static void +static unsigned int mips_autovectorize_vector_modes (vector_modes *modes, bool) { if (ISA_HAS_MSA) modes->safe_push (V16QImode); + return 0; } /* Implement TARGET_INIT_LIBFUNCS. */ Index: gcc/tree-vectorizer.h =================================================================== --- gcc/tree-vectorizer.h 2019-11-07 16:52:25.000000000 +0000 +++ gcc/tree-vectorizer.h 2019-11-07 16:52:30.041198324 +0000 @@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve /* Cost of a single scalar iteration. */ int single_scalar_iteration_cost; + /* The cost of the vector prologue and epilogue, including peeled + iterations and set-up code. */ + int vec_outside_cost; + + /* The cost of the vector loop body. */ + int vec_inside_cost; + /* Is the loop vectorizable? */ bool vectorizable; Index: gcc/tree-vect-loop.c =================================================================== --- gcc/tree-vect-loop.c 2019-11-07 16:52:25.000000000 +0000 +++ gcc/tree-vect-loop.c 2019-11-07 16:52:30.041198324 +0000 @@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo scan_map (NULL), slp_unrolling_factor (1), single_scalar_iteration_cost (0), + vec_outside_cost (0), + vec_inside_cost (0), vectorizable (false), can_fully_mask_p (true), fully_masked_p (false), @@ -2375,6 +2377,80 @@ vect_analyze_loop_2 (loop_vec_info loop_ goto start_over; } +/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears + to be better than vectorizing it using OLD_LOOP_VINFO. Assume that + OLD_LOOP_VINFO is better unless something specifically indicates + otherwise. + + Note that this deliberately isn't a partial order. */ + +static bool +vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, + loop_vec_info old_loop_vinfo) +{ + struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo); + gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop); + + poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo); + poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo); + + /* Always prefer a VF of loop->simdlen over any other VF. */ + if (loop->simdlen) + { + bool new_simdlen_p = known_eq (new_vf, loop->simdlen); + bool old_simdlen_p = known_eq (old_vf, loop->simdlen); + if (new_simdlen_p != old_simdlen_p) + return new_simdlen_p; + } + + /* Limit the VFs to what is likely to be the maximum number of iterations, + to handle cases in which at least one loop_vinfo is fully-masked. */ + HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop); + if (estimated_max_niter != -1) + { + if (known_le (estimated_max_niter, new_vf)) + new_vf = estimated_max_niter; + if (known_le (estimated_max_niter, old_vf)) + old_vf = estimated_max_niter; + } + + /* Check whether the (fractional) cost per scalar iteration is lower + or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf. */ + poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost + * poly_widest_int (old_vf)); + poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost + * poly_widest_int (new_vf)); + if (maybe_lt (rel_old, rel_new)) + return false; + if (known_lt (rel_new, rel_old)) + return true; + + /* If there's nothing to choose between the loop bodies, see whether + there's a difference in the prologue and epilogue costs. */ + if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost) + return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost; + + return false; +} + +/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO. Return + true if we should. */ + +static bool +vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo, + loop_vec_info old_loop_vinfo) +{ + if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo)) + return false; + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Preferring vector mode %s to vector mode %s\n", + GET_MODE_NAME (new_loop_vinfo->vector_mode), + GET_MODE_NAME (old_loop_vinfo->vector_mode)); + return true; +} + /* Function vect_analyze_loop. Apply a set of analyses on LOOP, and create a loop_vec_info struct @@ -2386,8 +2462,9 @@ vect_analyze_loop (class loop *loop, vec auto_vector_modes vector_modes; /* Autodetect first vector size we try. */ - targetm.vectorize.autovectorize_vector_modes (&vector_modes, - loop->simdlen != 0); + unsigned int autovec_flags + = targetm.vectorize.autovectorize_vector_modes (&vector_modes, + loop->simdlen != 0); unsigned int mode_i = 0; DUMP_VECT_SCOPE ("analyze_loop_nest"); @@ -2410,6 +2487,8 @@ vect_analyze_loop (class loop *loop, vec machine_mode next_vector_mode = VOIDmode; poly_uint64 lowest_th = 0; unsigned vectorized_loops = 0; + bool pick_lowest_cost_p = ((autovec_flags & VECT_COMPARE_COSTS) + && !unlimited_cost_model (loop)); bool vect_epilogues = false; opt_result res = opt_result::success (); @@ -2430,6 +2509,34 @@ vect_analyze_loop (class loop *loop, vec bool fatal = false; + /* When pick_lowest_cost_p is true, we should in principle iterate + over all the loop_vec_infos that LOOP_VINFO could replace and + try to vectorize LOOP_VINFO under the same conditions. + E.g. when trying to replace an epilogue loop, we should vectorize + LOOP_VINFO as an epilogue loop with the same VF limit. When trying + to replace the main loop, we should vectorize LOOP_VINFO as a main + loop too. + + However, autovectorize_vector_modes is usually sorted as follows: + + - Modes that naturally produce lower VFs usually follow modes that + naturally produce higher VFs. + + - When modes naturally produce the same VF, maskable modes + usually follow unmaskable ones, so that the maskable mode + can be used to vectorize the epilogue of the unmaskable mode. + + This order is preferred because it leads to the maximum + epilogue vectorization opportunities. Targets should only use + a different order if they want to make wide modes available while + disparaging them relative to earlier, smaller modes. The assumption + in that case is that the wider modes are more expensive in some + way that isn't reflected directly in the costs. + + There should therefore be few interesting cases in which + LOOP_VINFO fails when treated as an epilogue loop, succeeds when + treated as a standalone loop, and ends up being genuinely cheaper + than FIRST_LOOP_VINFO. */ if (vect_epilogues) LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; @@ -2477,13 +2584,34 @@ vect_analyze_loop (class loop *loop, vec LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; simdlen = 0; } + else if (pick_lowest_cost_p && first_loop_vinfo) + { + /* Keep trying to roll back vectorization attempts while the + loop_vec_infos they produced were worse than this one. */ + vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos; + while (!vinfos.is_empty () + && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ())) + { + gcc_assert (vect_epilogues); + delete vinfos.pop (); + } + if (vinfos.is_empty () + && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo)) + { + delete first_loop_vinfo; + first_loop_vinfo = opt_loop_vec_info::success (NULL); + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; + } + } if (first_loop_vinfo == NULL) { first_loop_vinfo = loop_vinfo; lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); } - else if (vect_epilogues) + else if (vect_epilogues + /* For now only allow one epilogue loop. */ + && first_loop_vinfo->epilogue_vinfos.is_empty ()) { first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo); poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); @@ -2503,12 +2631,14 @@ vect_analyze_loop (class loop *loop, vec && loop->inner == NULL && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo) - /* For now only allow one epilogue loop. */ - && first_loop_vinfo->epilogue_vinfos.is_empty ()); + /* For now only allow one epilogue loop, but allow + pick_lowest_cost_p to replace it. */ + && (first_loop_vinfo->epilogue_vinfos.is_empty () + || pick_lowest_cost_p)); /* Commit to first_loop_vinfo if we have no reason to try alternatives. */ - if (!simdlen && !vect_epilogues) + if (!simdlen && !vect_epilogues && !pick_lowest_cost_p) break; } else @@ -3467,7 +3597,11 @@ vect_estimate_min_profitable_iters (loop &vec_inside_cost, &vec_epilogue_cost); vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost); - + + /* Stash the costs so that we can compare two loop_vec_infos. */ + loop_vinfo->vec_inside_cost = vec_inside_cost; + loop_vinfo->vec_outside_cost = vec_outside_cost; + if (dump_enabled_p ()) { dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");
On Thu, Nov 7, 2019 at 6:15 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Richard Biener <richard.guenther@gmail.com> writes: > > On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford > > <richard.sandiford@arm.com> wrote: > >> > >> Richard Biener <richard.guenther@gmail.com> writes: > >> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford > >> > <Richard.Sandiford@arm.com> wrote: > >> >> > >> >> This patch adds a mode in which the vectoriser tries each available > >> >> base vector mode and picks the one with the lowest cost. For now > >> >> the behaviour is behind a default-off --param, but a later patch > >> >> enables it by default for SVE. > >> >> > >> >> The patch keeps the current behaviour of preferring a VF of > >> >> loop->simdlen over any larger or smaller VF, regardless of costs > >> >> or target preferences. > >> > > >> > Can you avoid using a --param for this? Instead I'd suggest to > >> > amend the vectorize_modes target hook to return some > >> > flags like VECT_FIRST_MODE_WINS. We'd eventually want > >> > to make the target able to say do-not-vectorize-epiloges-of-MODE > >> > (I think we may not want to vectorize SSE vectorized loop > >> > epilogues with MMX-with-SSE or GPRs for example). I guess > >> > for the latter we'd use a new target hook. > >> > >> The reason for using a --param was that I wanted a way of turning > >> this on and off on the command line, so that users can experiment > >> with it if necessary. E.g. enabling the --param could be a viable > >> alternative to -mprefix-* in some cases. Disabling it would be > >> a way of working around a bad cost model decision without going > >> all the way to -fno-vect-cost-model. > >> > >> These kinds of --params can become useful workarounds until an > >> optimisation bug is fixed. > > > > I'm arguing that the default depends on the actual ISAs so there isn't > > a one-fits all and given we have OMP SIMD and target cloning for > > multiple ISAs this looks like a wrong approach. For sure the > > target can use its own switches to override defaults here, or alternatively > > we might want to have a #pragma GCC simdlen mimicing OMP behavior > > here. > > I agree there's no one-size-fits-all choice here, but that's true for > other --params too. The problem with using target switches is that we > have to explain them and to keep accepting them "forever" (or at least > with a long deprecation period). Fortunately next week you'll be able to add target specific --params to your targets .opt file ;) > Whereas the --param was just something > that people could play with or perhaps use to work around problems > temporarily. It would come with no guarantees attached. And what the > --param did applied to any targets that support multiple modes, > regardless of what the targets do by default. > > All that said, here's a version that returns the bitmask you suggested. > I ended up making the flag select the new behaviour and 0 select the > current behaviour, rather than have a flag for "first mode wins". > Tested as before. OK. Thanks, Richard. > Thanks, > Richard > > > 2019-11-07 Richard Sandiford <richard.sandiford@arm.com> > > gcc/ > * target.h (VECT_COMPARE_COSTS): New constant. > * target.def (autovectorize_vector_modes): Return a bitmask of flags. > * doc/tm.texi: Regenerate. > * targhooks.h (default_autovectorize_vector_modes): Update accordingly. > * targhooks.c (default_autovectorize_vector_modes): Likewise. > * config/aarch64/aarch64.c (aarch64_autovectorize_vector_modes): > Likewise. > * config/arc/arc.c (arc_autovectorize_vector_modes): Likewise. > * config/arm/arm.c (arm_autovectorize_vector_modes): Likewise. > * config/i386/i386.c (ix86_autovectorize_vector_modes): Likewise. > * config/mips/mips.c (mips_autovectorize_vector_modes): Likewise. > * tree-vectorizer.h (_loop_vec_info::vec_outside_cost) > (_loop_vec_info::vec_inside_cost): New member variables. > * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them. > (vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions. > (vect_analyze_loop): When autovectorize_vector_modes returns > VECT_COMPARE_COSTS, try vectorizing the loop with each available > vector mode and picking the one with the lowest cost. > (vect_estimate_min_profitable_iters): Record the computed costs > in the loop_vec_info. > > Index: gcc/target.h > =================================================================== > --- gcc/target.h 2019-11-07 15:11:15.831017985 +0000 > +++ gcc/target.h 2019-11-07 16:52:30.037198353 +0000 > @@ -218,6 +218,14 @@ enum omp_device_kind_arch_isa { > omp_device_isa > }; > > +/* Flags returned by TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES: > + > + VECT_COMPARE_COSTS > + Tells the loop vectorizer to try all the provided modes and > + pick the one with the lowest cost. By default the vectorizer > + will choose the first mode that works. */ > +const unsigned int VECT_COMPARE_COSTS = 1U << 0; > + > /* The target structure. This holds all the backend hooks. */ > #define DEFHOOKPOD(NAME, DOC, TYPE, INIT) TYPE NAME; > #define DEFHOOK(NAME, DOC, TYPE, PARAMS, INIT) TYPE (* NAME) PARAMS; > Index: gcc/target.def > =================================================================== > --- gcc/target.def 2019-11-07 15:11:15.819018071 +0000 > +++ gcc/target.def 2019-11-07 16:52:30.037198353 +0000 > @@ -1925,10 +1925,20 @@ element mode.\n\ > If @var{all} is true, add suitable vector modes even when they are generally\n\ > not expected to be worthwhile.\n\ > \n\ > +The hook returns a bitmask of flags that control how the modes in\n\ > +@var{modes} are used. The flags are:\n\ > +@table @code\n\ > +@item VECT_COMPARE_COSTS\n\ > +Tells the loop vectorizer to try all the provided modes and pick the one\n\ > +with the lowest cost. By default the vectorizer will choose the first\n\ > +mode that works.\n\ > +@end table\n\ > +\n\ > The hook does not need to do anything if the vector returned by\n\ > @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant\n\ > -for autovectorization. The default implementation does nothing.", > - void, > +for autovectorization. The default implementation adds no modes and\n\ > +returns 0.", > + unsigned int, > (vector_modes *modes, bool all), > default_autovectorize_vector_modes) > > Index: gcc/doc/tm.texi > =================================================================== > --- gcc/doc/tm.texi 2019-11-07 15:11:15.779018354 +0000 > +++ gcc/doc/tm.texi 2019-11-07 16:52:30.037198353 +0000 > @@ -6016,7 +6016,7 @@ against lower halves of vectors recursiv > reached. The default is @var{mode} which means no splitting. > @end deftypefn > > -@deftypefn {Target Hook} void TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all}) > +@deftypefn {Target Hook} {unsigned int} TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all}) > If using the mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} > is not the only approach worth considering, this hook should add one mode to > @var{modes} for each useful alternative approach. These modes are then > @@ -6032,9 +6032,19 @@ element mode. > If @var{all} is true, add suitable vector modes even when they are generally > not expected to be worthwhile. > > +The hook returns a bitmask of flags that control how the modes in > +@var{modes} are used. The flags are: > +@table @code > +@item VECT_COMPARE_COSTS > +Tells the loop vectorizer to try all the provided modes and pick the one > +with the lowest cost. By default the vectorizer will choose the first > +mode that works. > +@end table > + > The hook does not need to do anything if the vector returned by > @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant > -for autovectorization. The default implementation does nothing. > +for autovectorization. The default implementation adds no modes and > +returns 0. > @end deftypefn > > @deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_RELATED_MODE (machine_mode @var{vector_mode}, scalar_mode @var{element_mode}, poly_uint64 @var{nunits}) > Index: gcc/targhooks.h > =================================================================== > --- gcc/targhooks.h 2019-11-07 15:11:15.831017985 +0000 > +++ gcc/targhooks.h 2019-11-07 16:52:30.041198324 +0000 > @@ -113,7 +113,7 @@ default_builtin_support_vector_misalignm > int, bool); > extern machine_mode default_preferred_simd_mode (scalar_mode mode); > extern machine_mode default_split_reduction (machine_mode); > -extern void default_autovectorize_vector_modes (vector_modes *, bool); > +extern unsigned int default_autovectorize_vector_modes (vector_modes *, bool); > extern opt_machine_mode default_vectorize_related_mode (machine_mode, > scalar_mode, > poly_uint64); > Index: gcc/targhooks.c > =================================================================== > --- gcc/targhooks.c 2019-11-07 15:11:15.831017985 +0000 > +++ gcc/targhooks.c 2019-11-07 16:52:30.041198324 +0000 > @@ -1301,9 +1301,10 @@ default_split_reduction (machine_mode mo > > /* By default only the preferred vector mode is tried. */ > > -void > +unsigned int > default_autovectorize_vector_modes (vector_modes *, bool) > { > + return 0; > } > > /* The default implementation of TARGET_VECTORIZE_RELATED_MODE. */ > Index: gcc/config/aarch64/aarch64.c > =================================================================== > --- gcc/config/aarch64/aarch64.c 2019-11-07 15:11:19.442992405 +0000 > +++ gcc/config/aarch64/aarch64.c 2019-11-07 16:52:30.021198461 +0000 > @@ -15949,7 +15949,7 @@ aarch64_preferred_simd_mode (scalar_mode > > /* Return a list of possible vector sizes for the vectorizer > to iterate over. */ > -static void > +static unsigned int > aarch64_autovectorize_vector_modes (vector_modes *modes, bool) > { > if (TARGET_SVE) > @@ -15975,6 +15975,8 @@ aarch64_autovectorize_vector_modes (vect > TODO: We could similarly support limited forms of V2QImode and V2HImode > for this case. */ > modes->safe_push (V2SImode); > + > + return 0; > } > > /* Implement TARGET_MANGLE_TYPE. */ > Index: gcc/config/arc/arc.c > =================================================================== > --- gcc/config/arc/arc.c 2019-11-07 15:11:15.599019628 +0000 > +++ gcc/config/arc/arc.c 2019-11-07 16:52:30.021198461 +0000 > @@ -609,7 +609,7 @@ arc_preferred_simd_mode (scalar_mode mod > /* Implements target hook > TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES. */ > > -static void > +static unsigned int > arc_autovectorize_vector_modes (vector_modes *modes, bool) > { > if (TARGET_PLUS_QMACW) > @@ -617,6 +617,7 @@ arc_autovectorize_vector_modes (vector_m > modes->quick_push (V4HImode); > modes->quick_push (V2HImode); > } > + return 0; > } > > > Index: gcc/config/arm/arm.c > =================================================================== > --- gcc/config/arm/arm.c 2019-11-07 15:11:15.683019033 +0000 > +++ gcc/config/arm/arm.c 2019-11-07 16:52:30.029198406 +0000 > @@ -289,7 +289,7 @@ static bool arm_builtin_support_vector_m > static void arm_conditional_register_usage (void); > static enum flt_eval_method arm_excess_precision (enum excess_precision_type); > static reg_class_t arm_preferred_rename_class (reg_class_t rclass); > -static void arm_autovectorize_vector_modes (vector_modes *, bool); > +static unsigned int arm_autovectorize_vector_modes (vector_modes *, bool); > static int arm_default_branch_cost (bool, bool); > static int arm_cortex_a5_branch_cost (bool, bool); > static int arm_cortex_m_branch_cost (bool, bool); > @@ -29015,7 +29015,7 @@ arm_vector_alignment (const_tree type) > return align; > } > > -static void > +static unsigned int > arm_autovectorize_vector_modes (vector_modes *modes, bool) > { > if (!TARGET_NEON_VECTORIZE_DOUBLE) > @@ -29023,6 +29023,7 @@ arm_autovectorize_vector_modes (vector_m > modes->safe_push (V16QImode); > modes->safe_push (V8QImode); > } > + return 0; > } > > static bool > Index: gcc/config/i386/i386.c > =================================================================== > --- gcc/config/i386/i386.c 2019-11-07 15:11:15.715018807 +0000 > +++ gcc/config/i386/i386.c 2019-11-07 16:52:30.033198382 +0000 > @@ -21385,7 +21385,7 @@ ix86_preferred_simd_mode (scalar_mode mo > vectors. If AVX512F is enabled then try vectorizing with 512bit, > 256bit and 128bit vectors. */ > > -static void > +static unsigned int > ix86_autovectorize_vector_modes (vector_modes *modes, bool all) > { > if (TARGET_AVX512F && !TARGET_PREFER_AVX256) > @@ -21415,6 +21415,8 @@ ix86_autovectorize_vector_modes (vector_ > > if (TARGET_MMX_WITH_SSE) > modes->safe_push (V8QImode); > + > + return 0; > } > > /* Implemenation of targetm.vectorize.get_mask_mode. */ > Index: gcc/config/mips/mips.c > =================================================================== > --- gcc/config/mips/mips.c 2019-11-07 15:11:15.755018524 +0000 > +++ gcc/config/mips/mips.c 2019-11-07 16:52:30.037198353 +0000 > @@ -13455,11 +13455,12 @@ mips_preferred_simd_mode (scalar_mode mo > > /* Implement TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES. */ > > -static void > +static unsigned int > mips_autovectorize_vector_modes (vector_modes *modes, bool) > { > if (ISA_HAS_MSA) > modes->safe_push (V16QImode); > + return 0; > } > > /* Implement TARGET_INIT_LIBFUNCS. */ > Index: gcc/tree-vectorizer.h > =================================================================== > --- gcc/tree-vectorizer.h 2019-11-07 16:52:25.000000000 +0000 > +++ gcc/tree-vectorizer.h 2019-11-07 16:52:30.041198324 +0000 > @@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve > /* Cost of a single scalar iteration. */ > int single_scalar_iteration_cost; > > + /* The cost of the vector prologue and epilogue, including peeled > + iterations and set-up code. */ > + int vec_outside_cost; > + > + /* The cost of the vector loop body. */ > + int vec_inside_cost; > + > /* Is the loop vectorizable? */ > bool vectorizable; > > Index: gcc/tree-vect-loop.c > =================================================================== > --- gcc/tree-vect-loop.c 2019-11-07 16:52:25.000000000 +0000 > +++ gcc/tree-vect-loop.c 2019-11-07 16:52:30.041198324 +0000 > @@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo > scan_map (NULL), > slp_unrolling_factor (1), > single_scalar_iteration_cost (0), > + vec_outside_cost (0), > + vec_inside_cost (0), > vectorizable (false), > can_fully_mask_p (true), > fully_masked_p (false), > @@ -2375,6 +2377,80 @@ vect_analyze_loop_2 (loop_vec_info loop_ > goto start_over; > } > > +/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears > + to be better than vectorizing it using OLD_LOOP_VINFO. Assume that > + OLD_LOOP_VINFO is better unless something specifically indicates > + otherwise. > + > + Note that this deliberately isn't a partial order. */ > + > +static bool > +vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, > + loop_vec_info old_loop_vinfo) > +{ > + struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo); > + gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop); > + > + poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo); > + poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo); > + > + /* Always prefer a VF of loop->simdlen over any other VF. */ > + if (loop->simdlen) > + { > + bool new_simdlen_p = known_eq (new_vf, loop->simdlen); > + bool old_simdlen_p = known_eq (old_vf, loop->simdlen); > + if (new_simdlen_p != old_simdlen_p) > + return new_simdlen_p; > + } > + > + /* Limit the VFs to what is likely to be the maximum number of iterations, > + to handle cases in which at least one loop_vinfo is fully-masked. */ > + HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop); > + if (estimated_max_niter != -1) > + { > + if (known_le (estimated_max_niter, new_vf)) > + new_vf = estimated_max_niter; > + if (known_le (estimated_max_niter, old_vf)) > + old_vf = estimated_max_niter; > + } > + > + /* Check whether the (fractional) cost per scalar iteration is lower > + or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf. */ > + poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost > + * poly_widest_int (old_vf)); > + poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost > + * poly_widest_int (new_vf)); > + if (maybe_lt (rel_old, rel_new)) > + return false; > + if (known_lt (rel_new, rel_old)) > + return true; > + > + /* If there's nothing to choose between the loop bodies, see whether > + there's a difference in the prologue and epilogue costs. */ > + if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost) > + return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost; > + > + return false; > +} > + > +/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO. Return > + true if we should. */ > + > +static bool > +vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo, > + loop_vec_info old_loop_vinfo) > +{ > + if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo)) > + return false; > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "***** Preferring vector mode %s to vector mode %s\n", > + GET_MODE_NAME (new_loop_vinfo->vector_mode), > + GET_MODE_NAME (old_loop_vinfo->vector_mode)); > + return true; > +} > + > /* Function vect_analyze_loop. > > Apply a set of analyses on LOOP, and create a loop_vec_info struct > @@ -2386,8 +2462,9 @@ vect_analyze_loop (class loop *loop, vec > auto_vector_modes vector_modes; > > /* Autodetect first vector size we try. */ > - targetm.vectorize.autovectorize_vector_modes (&vector_modes, > - loop->simdlen != 0); > + unsigned int autovec_flags > + = targetm.vectorize.autovectorize_vector_modes (&vector_modes, > + loop->simdlen != 0); > unsigned int mode_i = 0; > > DUMP_VECT_SCOPE ("analyze_loop_nest"); > @@ -2410,6 +2487,8 @@ vect_analyze_loop (class loop *loop, vec > machine_mode next_vector_mode = VOIDmode; > poly_uint64 lowest_th = 0; > unsigned vectorized_loops = 0; > + bool pick_lowest_cost_p = ((autovec_flags & VECT_COMPARE_COSTS) > + && !unlimited_cost_model (loop)); > > bool vect_epilogues = false; > opt_result res = opt_result::success (); > @@ -2430,6 +2509,34 @@ vect_analyze_loop (class loop *loop, vec > > bool fatal = false; > > + /* When pick_lowest_cost_p is true, we should in principle iterate > + over all the loop_vec_infos that LOOP_VINFO could replace and > + try to vectorize LOOP_VINFO under the same conditions. > + E.g. when trying to replace an epilogue loop, we should vectorize > + LOOP_VINFO as an epilogue loop with the same VF limit. When trying > + to replace the main loop, we should vectorize LOOP_VINFO as a main > + loop too. > + > + However, autovectorize_vector_modes is usually sorted as follows: > + > + - Modes that naturally produce lower VFs usually follow modes that > + naturally produce higher VFs. > + > + - When modes naturally produce the same VF, maskable modes > + usually follow unmaskable ones, so that the maskable mode > + can be used to vectorize the epilogue of the unmaskable mode. > + > + This order is preferred because it leads to the maximum > + epilogue vectorization opportunities. Targets should only use > + a different order if they want to make wide modes available while > + disparaging them relative to earlier, smaller modes. The assumption > + in that case is that the wider modes are more expensive in some > + way that isn't reflected directly in the costs. > + > + There should therefore be few interesting cases in which > + LOOP_VINFO fails when treated as an epilogue loop, succeeds when > + treated as a standalone loop, and ends up being genuinely cheaper > + than FIRST_LOOP_VINFO. */ > if (vect_epilogues) > LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; > > @@ -2477,13 +2584,34 @@ vect_analyze_loop (class loop *loop, vec > LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; > simdlen = 0; > } > + else if (pick_lowest_cost_p && first_loop_vinfo) > + { > + /* Keep trying to roll back vectorization attempts while the > + loop_vec_infos they produced were worse than this one. */ > + vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos; > + while (!vinfos.is_empty () > + && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ())) > + { > + gcc_assert (vect_epilogues); > + delete vinfos.pop (); > + } > + if (vinfos.is_empty () > + && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo)) > + { > + delete first_loop_vinfo; > + first_loop_vinfo = opt_loop_vec_info::success (NULL); > + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; > + } > + } > > if (first_loop_vinfo == NULL) > { > first_loop_vinfo = loop_vinfo; > lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); > } > - else if (vect_epilogues) > + else if (vect_epilogues > + /* For now only allow one epilogue loop. */ > + && first_loop_vinfo->epilogue_vinfos.is_empty ()) > { > first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo); > poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); > @@ -2503,12 +2631,14 @@ vect_analyze_loop (class loop *loop, vec > && loop->inner == NULL > && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) > && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo) > - /* For now only allow one epilogue loop. */ > - && first_loop_vinfo->epilogue_vinfos.is_empty ()); > + /* For now only allow one epilogue loop, but allow > + pick_lowest_cost_p to replace it. */ > + && (first_loop_vinfo->epilogue_vinfos.is_empty () > + || pick_lowest_cost_p)); > > /* Commit to first_loop_vinfo if we have no reason to try > alternatives. */ > - if (!simdlen && !vect_epilogues) > + if (!simdlen && !vect_epilogues && !pick_lowest_cost_p) > break; > } > else > @@ -3467,7 +3597,11 @@ vect_estimate_min_profitable_iters (loop > &vec_inside_cost, &vec_epilogue_cost); > > vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost); > - > + > + /* Stash the costs so that we can compare two loop_vec_infos. */ > + loop_vinfo->vec_inside_cost = vec_inside_cost; > + loop_vinfo->vec_outside_cost = vec_outside_cost; > + > if (dump_enabled_p ()) > { > dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");
Richard Biener <richard.guenther@gmail.com> writes: > On Thu, Nov 7, 2019 at 6:15 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> Richard Biener <richard.guenther@gmail.com> writes: >> > On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford >> > <richard.sandiford@arm.com> wrote: >> >> >> >> Richard Biener <richard.guenther@gmail.com> writes: >> >> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford >> >> > <Richard.Sandiford@arm.com> wrote: >> >> >> >> >> >> This patch adds a mode in which the vectoriser tries each available >> >> >> base vector mode and picks the one with the lowest cost. For now >> >> >> the behaviour is behind a default-off --param, but a later patch >> >> >> enables it by default for SVE. >> >> >> >> >> >> The patch keeps the current behaviour of preferring a VF of >> >> >> loop->simdlen over any larger or smaller VF, regardless of costs >> >> >> or target preferences. >> >> > >> >> > Can you avoid using a --param for this? Instead I'd suggest to >> >> > amend the vectorize_modes target hook to return some >> >> > flags like VECT_FIRST_MODE_WINS. We'd eventually want >> >> > to make the target able to say do-not-vectorize-epiloges-of-MODE >> >> > (I think we may not want to vectorize SSE vectorized loop >> >> > epilogues with MMX-with-SSE or GPRs for example). I guess >> >> > for the latter we'd use a new target hook. >> >> >> >> The reason for using a --param was that I wanted a way of turning >> >> this on and off on the command line, so that users can experiment >> >> with it if necessary. E.g. enabling the --param could be a viable >> >> alternative to -mprefix-* in some cases. Disabling it would be >> >> a way of working around a bad cost model decision without going >> >> all the way to -fno-vect-cost-model. >> >> >> >> These kinds of --params can become useful workarounds until an >> >> optimisation bug is fixed. >> > >> > I'm arguing that the default depends on the actual ISAs so there isn't >> > a one-fits all and given we have OMP SIMD and target cloning for >> > multiple ISAs this looks like a wrong approach. For sure the >> > target can use its own switches to override defaults here, or alternatively >> > we might want to have a #pragma GCC simdlen mimicing OMP behavior >> > here. >> >> I agree there's no one-size-fits-all choice here, but that's true for >> other --params too. The problem with using target switches is that we >> have to explain them and to keep accepting them "forever" (or at least >> with a long deprecation period). > > Fortunately next week you'll be able to add target specific --params > to your targets .opt file ;) Nice! That definitely sounds like a good compromise. :-) I'll hold off on 6/6 until Martin's patches have gone in. There are a couple of other SVE things that would benefit from that too. Thanks, Richard
Index: gcc/params.def =================================================================== --- gcc/params.def 2019-10-31 17:15:25.470517368 +0000 +++ gcc/params.def 2019-11-05 14:19:58.781197820 +0000 @@ -661,6 +661,13 @@ DEFPARAM(PARAM_VECT_MAX_PEELING_FOR_ALIG "Maximum number of loop peels to enhance alignment of data references in a loop.", -1, -1, 64) +DEFPARAM(PARAM_VECT_COMPARE_LOOP_COSTS, + "vect-compare-loop-costs", + "Whether to try vectorizing a loop using each supported" + " combination of vector types and picking the version with the" + " lowest cost.", + 0, 0, 1) + DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS, "max-cselib-memory-locations", "The maximum memory locations recorded by cselib.", Index: gcc/doc/invoke.texi =================================================================== --- gcc/doc/invoke.texi 2019-11-04 21:13:57.611756365 +0000 +++ gcc/doc/invoke.texi 2019-11-05 14:19:58.777197850 +0000 @@ -11563,6 +11563,12 @@ doing loop versioning for alias in the v The maximum number of loop peels to enhance access alignment for vectorizer. Value -1 means no limit. +@item vect-compare-loop-costs +Whether to try vectorizing a loop using each supported combination of +vector types and picking the version with the lowest cost. This parameter +has no effect when @option{-fno-vect-cost-model} or +@option{-fvect-cost-model=unlimited} are used. + @item max-iterations-to-track The maximum number of iterations of a loop the brute-force algorithm for analysis of the number of iterations of the loop tries to evaluate. Index: gcc/tree-vectorizer.h =================================================================== --- gcc/tree-vectorizer.h 2019-11-05 14:19:33.829371745 +0000 +++ gcc/tree-vectorizer.h 2019-11-05 14:19:58.781197820 +0000 @@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve /* Cost of a single scalar iteration. */ int single_scalar_iteration_cost; + /* The cost of the vector prologue and epilogue, including peeled + iterations and set-up code. */ + int vec_outside_cost; + + /* The cost of the vector loop body. */ + int vec_inside_cost; + /* Is the loop vectorizable? */ bool vectorizable; Index: gcc/tree-vect-loop.c =================================================================== --- gcc/tree-vect-loop.c 2019-11-05 14:19:33.829371745 +0000 +++ gcc/tree-vect-loop.c 2019-11-05 14:19:58.781197820 +0000 @@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo scan_map (NULL), slp_unrolling_factor (1), single_scalar_iteration_cost (0), + vec_outside_cost (0), + vec_inside_cost (0), vectorizable (false), can_fully_mask_p (true), fully_masked_p (false), @@ -2373,6 +2375,80 @@ vect_analyze_loop_2 (loop_vec_info loop_ goto start_over; } +/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears + to be better than vectorizing it using OLD_LOOP_VINFO. Assume that + OLD_LOOP_VINFO is better unless something specifically indicates + otherwise. + + Note that this deliberately isn't a partial order. */ + +static bool +vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, + loop_vec_info old_loop_vinfo) +{ + struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo); + gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop); + + poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo); + poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo); + + /* Always prefer a VF of loop->simdlen over any other VF. */ + if (loop->simdlen) + { + bool new_simdlen_p = known_eq (new_vf, loop->simdlen); + bool old_simdlen_p = known_eq (old_vf, loop->simdlen); + if (new_simdlen_p != old_simdlen_p) + return new_simdlen_p; + } + + /* Limit the VFs to what is likely to be the maximum number of iterations, + to handle cases in which at least one loop_vinfo is fully-masked. */ + HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop); + if (estimated_max_niter != -1) + { + if (known_le (estimated_max_niter, new_vf)) + new_vf = estimated_max_niter; + if (known_le (estimated_max_niter, old_vf)) + old_vf = estimated_max_niter; + } + + /* Check whether the (fractional) cost per scalar iteration is lower + or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf. */ + poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost + * poly_widest_int (old_vf)); + poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost + * poly_widest_int (new_vf)); + if (maybe_lt (rel_old, rel_new)) + return false; + if (known_lt (rel_new, rel_old)) + return true; + + /* If there's nothing to choose between the loop bodies, see whether + there's a difference in the prologue and epilogue costs. */ + if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost) + return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost; + + return false; +} + +/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO. Return + true if we should. */ + +static bool +vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo, + loop_vec_info old_loop_vinfo) +{ + if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo)) + return false; + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Preferring vector mode %s to vector mode %s\n", + GET_MODE_NAME (new_loop_vinfo->vector_mode), + GET_MODE_NAME (old_loop_vinfo->vector_mode)); + return true; +} + /* Function vect_analyze_loop. Apply a set of analyses on LOOP, and create a loop_vec_info struct @@ -2408,6 +2484,8 @@ vect_analyze_loop (class loop *loop, vec machine_mode next_vector_mode = VOIDmode; poly_uint64 lowest_th = 0; unsigned vectorized_loops = 0; + bool pick_lowest_cost_p = (PARAM_VALUE (PARAM_VECT_COMPARE_LOOP_COSTS) + && !unlimited_cost_model (loop)); bool vect_epilogues = false; opt_result res = opt_result::success (); @@ -2428,6 +2506,34 @@ vect_analyze_loop (class loop *loop, vec bool fatal = false; + /* When pick_lowest_cost_p is true, we should in principle iterate + over all the loop_vec_infos that LOOP_VINFO could replace and + try to vectorize LOOP_VINFO under the same conditions. + E.g. when trying to replace an epilogue loop, we should vectorize + LOOP_VINFO as an epilogue loop with the same VF limit. When trying + to replace the main loop, we should vectorize LOOP_VINFO as a main + loop too. + + However, autovectorize_vector_modes is usually sorted as follows: + + - Modes that naturally produce lower VFs usually follow modes that + naturally produce higher VFs. + + - When modes naturally produce the same VF, maskable modes + usually follow unmaskable ones, so that the maskable mode + can be used to vectorize the epilogue of the unmaskable mode. + + This order is preferred because it leads to the maximum + epilogue vectorization opportunities. Targets should only use + a different order if they want to make wide modes available while + disparaging them relative to earlier, smaller modes. The assumption + in that case is that the wider modes are more expensive in some + way that isn't reflected directly in the costs. + + There should therefore be few interesting cases in which + LOOP_VINFO fails when treated as an epilogue loop, succeeds when + treated as a standalone loop, and ends up being genuinely cheaper + than FIRST_LOOP_VINFO. */ if (vect_epilogues) LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo; @@ -2475,13 +2581,34 @@ vect_analyze_loop (class loop *loop, vec LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; simdlen = 0; } + else if (pick_lowest_cost_p && first_loop_vinfo) + { + /* Keep trying to roll back vectorization attempts while the + loop_vec_infos they produced were worse than this one. */ + vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos; + while (!vinfos.is_empty () + && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ())) + { + gcc_assert (vect_epilogues); + delete vinfos.pop (); + } + if (vinfos.is_empty () + && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo)) + { + delete first_loop_vinfo; + first_loop_vinfo = opt_loop_vec_info::success (NULL); + LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL; + } + } if (first_loop_vinfo == NULL) { first_loop_vinfo = loop_vinfo; lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); } - else if (vect_epilogues) + else if (vect_epilogues + /* For now only allow one epilogue loop. */ + && first_loop_vinfo->epilogue_vinfos.is_empty ()) { first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo); poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); @@ -2501,12 +2628,14 @@ vect_analyze_loop (class loop *loop, vec && loop->inner == NULL && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo) - /* For now only allow one epilogue loop. */ - && first_loop_vinfo->epilogue_vinfos.is_empty ()); + /* For now only allow one epilogue loop, but allow + pick_lowest_cost_p to replace it. */ + && (first_loop_vinfo->epilogue_vinfos.is_empty () + || pick_lowest_cost_p)); /* Commit to first_loop_vinfo if we have no reason to try alternatives. */ - if (!simdlen && !vect_epilogues) + if (!simdlen && !vect_epilogues && !pick_lowest_cost_p) break; } else @@ -3454,7 +3583,11 @@ vect_estimate_min_profitable_iters (loop &vec_inside_cost, &vec_epilogue_cost); vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost); - + + /* Stash the costs so that we can compare two loop_vec_infos. */ + loop_vinfo->vec_inside_cost = vec_inside_cost; + loop_vinfo->vec_outside_cost = vec_outside_cost; + if (dump_enabled_p ()) { dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");