diff mbox series

[1/3,vect] Add main vectorized loop unrolling

Message ID 27777876-4201-5e86-bf9a-063143d38641@arm.com
State New
Headers show
Series Enable vector unrolling of main loop | expand

Commit Message

Andre Vieira (lists) Sept. 17, 2021, 3:31 p.m. UTC
Hi all,

This patch adds the ability to define a target hook to unroll the main 
vectorized loop. It also introduces --param's vect-unroll and 
vect-unroll-reductions to control this through a command-line. I found 
this useful to experiment and believe can help when tuning, so I decided 
to leave it in.
We only unroll the main loop and have disabled unrolling epilogues for 
now. We also do not support unrolling of any loop that has a negative 
step and we do not support unrolling a loop with any reduction other 
than a TREE_CODE_REDUCTION.

Bootstrapped and regression tested on aarch64-linux-gnu as part of the 
series.

gcc/ChangeLog:

         * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR
         and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
         * doc/tm.texi.in: Add entries for target hooks above.
         * params.opt: Add vect-unroll and vect-unroll-reductions 
parameters.
         * target.def: Define hooks TARGET_VECTORIZE_UNROLL_FACTOR
         and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
         * targhooks.c (default_add_stmt_cost_for_unroll): New.
         (default_unroll_factor): Likewise.
         * targhooks.h (default_add_stmt_cost_for_unroll): Likewise.
         (default_unroll_factor): Likewise.
         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
         par_unrolling_factor.
         (vect_update_vf_for_slp): Use unrolling factor to update 
vectorization
         factor.
         (vect_determine_partial_vectors_and_peeling): Account for 
unrolling.
         (vect_determine_unroll_factor): Determine how much to unroll 
vectorized
         main loop.
         (vect_analyze_loop_2): Call vect_determine_unroll_factor.
         (vect_analyze_loop): Allow for epilogue vectorization when 
unrolling
         and rewalk vector_mode array for the epilogues.
         (vectorizable_reduction): Disable single_defuse_cycle when 
unrolling.
         * tree-vectorizer.h (vect_unroll_value): Declare 
par_unrolling_factor
         as a member of loop_vec_info.

Comments

Richard Biener Sept. 21, 2021, 12:30 p.m. UTC | #1
On Fri, 17 Sep 2021, Andre Vieira (lists) wrote:

> Hi all,
> 
> This patch adds the ability to define a target hook to unroll the main
> vectorized loop. It also introduces --param's vect-unroll and
> vect-unroll-reductions to control this through a command-line. I found this
> useful to experiment and believe can help when tuning, so I decided to leave
> it in.
> We only unroll the main loop and have disabled unrolling epilogues for now. We
> also do not support unrolling of any loop that has a negative step and we do
> not support unrolling a loop with any reduction other than a
> TREE_CODE_REDUCTION.
> 
> Bootstrapped and regression tested on aarch64-linux-gnu as part of the series.

I wonder why we want to change the vector modes used for the epilogue,
we're either making it more likely to need to fall through to the
scalar epilogue or require another vectorized epilogue.

That said, for simplicity I'd only change the VF of the main loop.

There I wonder why you need to change vect_update_vf_for_slp or
vect_determine_partial_vectors_and_peeling and why it's not enough
to adjust the VF in a single place, I'd do that here:

  /* This is the point where we can re-start analysis with SLP forced off.  
*/
start_over:

  /* Now the vectorization factor is final.  */
  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
  gcc_assert (known_ne (vectorization_factor, 0U));

---->  call vect_update_vf_for_unroll ()

note there's also loop->unroll (from #pragma GCC unroll) which we
could include in what you look at in vect_unroll_value.

I don't like add_stmt_cost_for_unroll - how should a target go
and decide based on what it is fed?  You could as well feed it
the scalar body or the vinfo so it can get a shot at all
the vectorizers meta data - but feeding it individual stmt_infos
does not add any meaningful abstraction and thus what's the
point?

I _think_ what would make some sense is when we actually cost
the vector body (with the not unrolled VF) ask the target
"well, how about unrolling this?" because there it has the
chance to look at the actual vector stmts produced (in "cost form").
And if the target answers "yeah - go ahead and try x4" we signal
that to the iteration and have "mode N with x4 unroll" validated and
costed.

So instead of a new target hook amend the finish_cost hook to
produce a suggested unroll value and cost both the unrolled and
not unrolled body.

Sorry for steering in a different direction ;)

Thanks,
Richard.



> gcc/ChangeLog:
> 
>         * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR
>         and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
>         * doc/tm.texi.in: Add entries for target hooks above.
>         * params.opt: Add vect-unroll and vect-unroll-reductions 
> parameters.
>         * target.def: Define hooks TARGET_VECTORIZE_UNROLL_FACTOR
>         and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
>         * targhooks.c (default_add_stmt_cost_for_unroll): New.
>         (default_unroll_factor): Likewise.
>         * targhooks.h (default_add_stmt_cost_for_unroll): Likewise.
>         (default_unroll_factor): Likewise.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
>         par_unrolling_factor.
>         (vect_update_vf_for_slp): Use unrolling factor to update 
> vectorization
>         factor.
>         (vect_determine_partial_vectors_and_peeling): Account for 
> unrolling.
>         (vect_determine_unroll_factor): Determine how much to unroll
> vectorized
>         main loop.
>         (vect_analyze_loop_2): Call vect_determine_unroll_factor.
>         (vect_analyze_loop): Allow for epilogue vectorization when 
> unrolling
>         and rewalk vector_mode array for the epilogues.
>         (vectorizable_reduction): Disable single_defuse_cycle when 
> unrolling.
>         * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
>         as a member of loop_vec_info.
>
Andre Vieira (lists) Sept. 21, 2021, 4:34 p.m. UTC | #2
Hi Richi,

Thanks for the review, see below some questions.

On 21/09/2021 13:30, Richard Biener wrote:
> On Fri, 17 Sep 2021, Andre Vieira (lists) wrote:
>
>> Hi all,
>>
>> This patch adds the ability to define a target hook to unroll the main
>> vectorized loop. It also introduces --param's vect-unroll and
>> vect-unroll-reductions to control this through a command-line. I found this
>> useful to experiment and believe can help when tuning, so I decided to leave
>> it in.
>> We only unroll the main loop and have disabled unrolling epilogues for now. We
>> also do not support unrolling of any loop that has a negative step and we do
>> not support unrolling a loop with any reduction other than a
>> TREE_CODE_REDUCTION.
>>
>> Bootstrapped and regression tested on aarch64-linux-gnu as part of the series.
> I wonder why we want to change the vector modes used for the epilogue,
> we're either making it more likely to need to fall through to the
> scalar epilogue or require another vectorized epilogue.
I don't quite understand what you mean by change the vector modes for 
the epilogue. I don't think we do.
If you are referring to:
       /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
       if (loop_vinfo->par_unrolling_factor > 1)
         {
           next_vector_mode = vector_modes[0];
           mode_i = 1;

           if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
                  "***** Re-trying analysis with vector mode"
                  " %s for epilogue with partial vectors.\n",
                  GET_MODE_NAME (next_vector_mode));
           continue;
         }

That just forces trying the vector modes we've tried before. Though I 
might need to revisit this now I think about it. I'm afraid it might be 
possible for this to generate an epilogue with a vf that is not lower 
than that of the main loop, but I'd need to think about this again.

Either way I don't think this changes the vector modes used for the 
epilogue. But maybe I'm just missing your point here.
> That said, for simplicity I'd only change the VF of the main loop.
>
> There I wonder why you need to change vect_update_vf_for_slp or
> vect_determine_partial_vectors_and_peeling and why it's not enough
> to adjust the VF in a single place, I'd do that here:
>
>    /* This is the point where we can re-start analysis with SLP forced off.
> */
> start_over:
>
>    /* Now the vectorization factor is final.  */
>    poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    gcc_assert (known_ne (vectorization_factor, 0U));
>
> ---->  call vect_update_vf_for_unroll ()
I can move it there, it would indeed remove the need for the change to 
vect_update_vf_for_slp, the change to 
vect_determine_partial_vectors_and_peeling would still be required I 
think. It is meant to disable using partial vectors in an unrolled loop.
> note there's also loop->unroll (from #pragma GCC unroll) which we
> could include in what you look at in vect_unroll_value.
>
> I don't like add_stmt_cost_for_unroll - how should a target go
> and decide based on what it is fed?  You could as well feed it
> the scalar body or the vinfo so it can get a shot at all
> the vectorizers meta data - but feeding it individual stmt_infos
> does not add any meaningful abstraction and thus what's the
> point?
I am still working on tuning our backend hook, but the way it works is 
it estimates how many load, store and general ops are required for the 
vectorized loop based on these.
> I _think_ what would make some sense is when we actually cost
> the vector body (with the not unrolled VF) ask the target
> "well, how about unrolling this?" because there it has the
> chance to look at the actual vector stmts produced (in "cost form").
> And if the target answers "yeah - go ahead and try x4" we signal
> that to the iteration and have "mode N with x4 unroll" validated and
> costed.
>
> So instead of a new target hook amend the finish_cost hook to
> produce a suggested unroll value and cost both the unrolled and
> not unrolled body.
>
> Sorry for steering in a different direction ;)
The reason we decided to do this early and not after cost is because 
'vect_prune_runtime_alias_test_list' and 
'vect_enhance_data_refs_alignment' require the VF and if you suddenly 
raise that the alias analysis could become invalid.

An initial implementation did do it later for that very reason that we 
could reuse the cost calculations and AArch64 already computed these 
'ops' after Richard Sandiford's patches.
But yeah ... the above kinda led me to rewrite it this way.

>
> Thanks,
> Richard.
>
>
>
>> gcc/ChangeLog:
>>
>>          * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR
>>          and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
>>          * doc/tm.texi.in: Add entries for target hooks above.
>>          * params.opt: Add vect-unroll and vect-unroll-reductions
>> parameters.
>>          * target.def: Define hooks TARGET_VECTORIZE_UNROLL_FACTOR
>>          and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
>>          * targhooks.c (default_add_stmt_cost_for_unroll): New.
>>          (default_unroll_factor): Likewise.
>>          * targhooks.h (default_add_stmt_cost_for_unroll): Likewise.
>>          (default_unroll_factor): Likewise.
>>          * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
>>          par_unrolling_factor.
>>          (vect_update_vf_for_slp): Use unrolling factor to update
>> vectorization
>>          factor.
>>          (vect_determine_partial_vectors_and_peeling): Account for
>> unrolling.
>>          (vect_determine_unroll_factor): Determine how much to unroll
>> vectorized
>>          main loop.
>>          (vect_analyze_loop_2): Call vect_determine_unroll_factor.
>>          (vect_analyze_loop): Allow for epilogue vectorization when
>> unrolling
>>          and rewalk vector_mode array for the epilogues.
>>          (vectorizable_reduction): Disable single_defuse_cycle when
>> unrolling.
>>          * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
>>          as a member of loop_vec_info.
>>
Richard Biener Sept. 22, 2021, 6:14 a.m. UTC | #3
On Tue, 21 Sep 2021, Andre Vieira (lists) wrote:

> Hi Richi,
> 
> Thanks for the review, see below some questions.
> 
> On 21/09/2021 13:30, Richard Biener wrote:
> > On Fri, 17 Sep 2021, Andre Vieira (lists) wrote:
> >
> >> Hi all,
> >>
> >> This patch adds the ability to define a target hook to unroll the main
> >> vectorized loop. It also introduces --param's vect-unroll and
> >> vect-unroll-reductions to control this through a command-line. I found this
> >> useful to experiment and believe can help when tuning, so I decided to
> >> leave
> >> it in.
> >> We only unroll the main loop and have disabled unrolling epilogues for now.
> >> We
> >> also do not support unrolling of any loop that has a negative step and we
> >> do
> >> not support unrolling a loop with any reduction other than a
> >> TREE_CODE_REDUCTION.
> >>
> >> Bootstrapped and regression tested on aarch64-linux-gnu as part of the
> >> series.
> > I wonder why we want to change the vector modes used for the epilogue,
> > we're either making it more likely to need to fall through to the
> > scalar epilogue or require another vectorized epilogue.
> I don't quite understand what you mean by change the vector modes for the
> epilogue. I don't think we do.
> If you are referring to:
>       /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
>       if (loop_vinfo->par_unrolling_factor > 1)
>         {
>           next_vector_mode = vector_modes[0];
>           mode_i = 1;
> 
>           if (dump_enabled_p ())
>         dump_printf_loc (MSG_NOTE, vect_location,
>                  "***** Re-trying analysis with vector mode"
>                  " %s for epilogue with partial vectors.\n",
>                  GET_MODE_NAME (next_vector_mode));
>           continue;
>         }
> 
> That just forces trying the vector modes we've tried before. Though I might
> need to revisit this now I think about it. I'm afraid it might be possible for
> this to generate an epilogue with a vf that is not lower than that of the main
> loop, but I'd need to think about this again.
> 
> Either way I don't think this changes the vector modes used for the epilogue.
> But maybe I'm just missing your point here.

Yes, I was refering to the above which suggests that when we vectorize
the main loop with V4SF but unroll then we try vectorizing the
epilogue with V4SF as well (but not unrolled).  I think that's
premature (not sure if you try V8SF if the main loop was V4SF but
unrolled 4 times).

> > That said, for simplicity I'd only change the VF of the main loop.
> >
> > There I wonder why you need to change vect_update_vf_for_slp or
> > vect_determine_partial_vectors_and_peeling and why it's not enough
> > to adjust the VF in a single place, I'd do that here:
> >
> >    /* This is the point where we can re-start analysis with SLP forced off.
> > */
> > start_over:
> >
> >    /* Now the vectorization factor is final.  */
> >    poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> >    gcc_assert (known_ne (vectorization_factor, 0U));
> >
> > ---->  call vect_update_vf_for_unroll ()
> I can move it there, it would indeed remove the need for the change to
> vect_update_vf_for_slp, the change to
> vect_determine_partial_vectors_and_peeling would still be required I think. It
> is meant to disable using partial vectors in an unrolled loop.

Why would we disable the use of partial vectors in an unrolled loop?

> > note there's also loop->unroll (from #pragma GCC unroll) which we
> > could include in what you look at in vect_unroll_value.
> >
> > I don't like add_stmt_cost_for_unroll - how should a target go
> > and decide based on what it is fed?  You could as well feed it
> > the scalar body or the vinfo so it can get a shot at all
> > the vectorizers meta data - but feeding it individual stmt_infos
> > does not add any meaningful abstraction and thus what's the
> > point?
> I am still working on tuning our backend hook, but the way it works is it
> estimates how many load, store and general ops are required for the vectorized
> loop based on these.
> > I _think_ what would make some sense is when we actually cost
> > the vector body (with the not unrolled VF) ask the target
> > "well, how about unrolling this?" because there it has the
> > chance to look at the actual vector stmts produced (in "cost form").
> > And if the target answers "yeah - go ahead and try x4" we signal
> > that to the iteration and have "mode N with x4 unroll" validated and
> > costed.
> >
> > So instead of a new target hook amend the finish_cost hook to
> > produce a suggested unroll value and cost both the unrolled and
> > not unrolled body.
> >
> > Sorry for steering in a different direction ;)
> The reason we decided to do this early and not after cost is because
> 'vect_prune_runtime_alias_test_list' and 'vect_enhance_data_refs_alignment'
> require the VF and if you suddenly raise that the alias analysis could become
> invalid.

Sure but I'm suggesting you keep the not unrolled body as one way of
costed vectorization but then if the target says "try unrolling"
re-do the analysis with the same mode but a larger VF.  Just like
we iterate over vector modes you'll now iterate over pairs of
vector mode + VF (unroll factor).  It's not about re-using the costing
it's about using costing that is actually relevant and also to avoid
targets inventing two distinct separate costings - a target (powerpc)
might already compute load/store density and other stuff for the main
costing so it should have an idea whether doubling or triplicating is OK.

Richard.

> An initial implementation did do it later for that very reason that we could
> reuse the cost calculations and AArch64 already computed these 'ops' after
> Richard Sandiford's patches.
> But yeah ... the above kinda led me to rewrite it this way.
> 
> >
> > Thanks,
> > Richard.
> >
> >
> >
> >> gcc/ChangeLog:
> >>
> >>          * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR
> >>          and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
> >>          * doc/tm.texi.in: Add entries for target hooks above.
> >>          * params.opt: Add vect-unroll and vect-unroll-reductions
> >> parameters.
> >>          * target.def: Define hooks TARGET_VECTORIZE_UNROLL_FACTOR
> >>          and TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL.
> >>          * targhooks.c (default_add_stmt_cost_for_unroll): New.
> >>          (default_unroll_factor): Likewise.
> >>          * targhooks.h (default_add_stmt_cost_for_unroll): Likewise.
> >>          (default_unroll_factor): Likewise.
> >>          * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
> >>          par_unrolling_factor.
> >>          (vect_update_vf_for_slp): Use unrolling factor to update
> >> vectorization
> >>          factor.
> >>          (vect_determine_partial_vectors_and_peeling): Account for
> >> unrolling.
> >>         (vect_determine_unroll_factor): Determine how much to unroll
> >> vectorized
> >>          main loop.
> >>          (vect_analyze_loop_2): Call vect_determine_unroll_factor.
> >>          (vect_analyze_loop): Allow for epilogue vectorization when
> >> unrolling
> >>          and rewalk vector_mode array for the epilogues.
> >>          (vectorizable_reduction): Disable single_defuse_cycle when
> >> unrolling.
> >>          * tree-vectorizer.h (vect_unroll_value): Declare
> >>  par_unrolling_factor
> >>          as a member of loop_vec_info.
> >>
>
Andre Vieira (lists) Sept. 30, 2021, 8:52 a.m. UTC | #4
Hi,


>> That just forces trying the vector modes we've tried before. Though I might
>> need to revisit this now I think about it. I'm afraid it might be possible for
>> this to generate an epilogue with a vf that is not lower than that of the main
>> loop, but I'd need to think about this again.
>>
>> Either way I don't think this changes the vector modes used for the epilogue.
>> But maybe I'm just missing your point here.
> Yes, I was refering to the above which suggests that when we vectorize
> the main loop with V4SF but unroll then we try vectorizing the
> epilogue with V4SF as well (but not unrolled).  I think that's
> premature (not sure if you try V8SF if the main loop was V4SF but
> unrolled 4 times).

My main motivation for this was because I had a SVE loop that vectorized 
with both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided 
to unroll V8HI by two and skipped using VNx8HI as a predicated epilogue 
which would've been the best choice.

So that is why I decided to just 'reset' the vector_mode selection. In a 
scenario where you only have the traditional vector modes it might make 
less sense.

Just realized I still didn't add any check to make sure the epilogue has 
a lower VF than the previous loop, though I'm still not sure that could 
happen. I'll go look at where to add that if you agree with this.

>> I can move it there, it would indeed remove the need for the change to
>> vect_update_vf_for_slp, the change to
>> vect_determine_partial_vectors_and_peeling would still be required I think. It
>> is meant to disable using partial vectors in an unrolled loop.
> Why would we disable the use of partial vectors in an unrolled loop?
The motivation behind that is that the overhead caused by generating 
predicates for each iteration will likely be too much for it to be 
profitable to unroll. On top of that, when dealing with low iteration 
count loops, if executing one predicated iteration would be enough we 
now still need to execute all other unrolled predicated iterations, 
whereas if we keep them unrolled we skip the unrolled loops.
> Sure but I'm suggesting you keep the not unrolled body as one way of
> costed vectorization but then if the target says "try unrolling"
> re-do the analysis with the same mode but a larger VF.  Just like
> we iterate over vector modes you'll now iterate over pairs of
> vector mode + VF (unroll factor).  It's not about re-using the costing
> it's about using costing that is actually relevant and also to avoid
> targets inventing two distinct separate costings - a target (powerpc)
> might already compute load/store density and other stuff for the main
> costing so it should have an idea whether doubling or triplicating is OK.
>
> Richard.
Sounds good! I changed the patch to determine the unrolling factor 
later, after all analysis has been done and retry analysis if an 
unrolling factor larger than 1 has been chosen for this loop and 
vector_mode.

gcc/ChangeLog:

         * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
         * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
         * params.opt: Add vect-unroll and vect-unroll-reductions 
parameters.
         * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.
         * targhooks.c (default_unroll_factor): New.
         * targhooks.h (default_unroll_factor): Likewise.
         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
         par_unrolling_factor.
         (vect_determine_partial_vectors_and_peeling): Account for 
unrolling.
         (vect_determine_unroll_factor): New.
         (vect_try_unrolling): New.
         (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
         retrying a loop_vinfo as a main loop.
         (vect_analyze_loop): Call vect_try_unrolling when vectorizing 
main loops.
         (vect_analyze_loop): Allow for epilogue vectorization when 
unrolling
         and rewalk vector_mode warray for the epilogues.
         (vectorizable_reduction): Disable single_defuse_cycle when 
unrolling.
         * tree-vectorizer.h (vect_unroll_value): Declare 
par_unrolling_factor
         as a member of loop_vec_info.
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index be8148583d8571b0d035b1938db9d056bfd213a8..71ee33a200fcbd37ccd5380321df507ae1e8961f 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6289,6 +6289,12 @@ allocated by TARGET_VECTORIZE_INIT_COST.  The default releases the
 accumulator.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_VECTORIZE_UNROLL_FACTOR (class vec_info *@var{vinfo})
+This hook should return the desired vector unrolling factor for a loop with
+@var{vinfo}. The default returns one, which means no unrolling will be
+performed.
+@end deftypefn
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d088eee4afecdbb5575b0f4f796ac344e4449155..3b3051f565ccbf88b07ee4f9f28e53cf6048d2e0 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4198,6 +4198,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_DESTROY_COST_DATA
 
+@hook TARGET_VECTORIZE_UNROLL_FACTOR
+
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
 @hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/params.opt b/gcc/params.opt
index 658ca0288519e5f8185da67535dc42517c24d21c..d6c625c0a78e3aa21837c96a6757a57337cce22f 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1137,4 +1137,12 @@ Controls how loop vectorizer uses partial vectors.  0 means never, 1 means only
 Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) IntegerRange(1, 10000) Param Optimization
 The maximum factor which the loop vectorizer applies to the cost of statements in an inner loop relative to the loop being vectorized.
 
+-param=vect-unroll=
+Common Joined UInteger Var(param_vect_unroll) Init(0) IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops.  Also see vect-unroll-reductions.
+
+-param=vect-unroll-reductions=
+Common Joined UInteger Var(param_vect_unroll_reductions) Init(0) IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops that contain associative reductions.  0 means that such loops should be unrolled vect-unroll times.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/target.def b/gcc/target.def
index bfa819609c21bd71c0cc585c01dba42534453f47..8f48a453f3ff886381119291413514fae0e666ec 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2095,6 +2095,16 @@ accumulator.",
  (void *data),
  default_destroy_cost_data)
 
+/* Function to determine unroll factor for vectorization.  */
+DEFHOOK
+(unroll_factor,
+ "This hook should return the desired vector unrolling factor for a loop with\n\
+@var{vinfo}. The default returns one, which means no unrolling will be\n\
+performed.",
+ unsigned,
+ (class vec_info *vinfo),
+ default_unroll_factor)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..237d7f52b7f9ec903fbc265c98b153cc174b7457 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -125,6 +125,7 @@ extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
 				       enum vect_cost_model_location);
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
+extern unsigned default_unroll_factor (class vec_info *);
 
 /* OpenACC hooks.  */
 extern bool default_goacc_validate_dims (tree, int [], int, unsigned);
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..826cbe6abb79c6b17f531cfb8332221c362dc500 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1535,6 +1535,14 @@ default_destroy_cost_data (void *data)
   free (data);
 }
 
+/* By default, return a vector unroll factor of one, meaning no unrolling will
+   be performed.  */
+unsigned
+default_unroll_factor (class vec_info *vinfo ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
 /* Determine whether or not a pointer mode is valid. Assume defaults
    of ptr_mode or Pmode - can be overridden.  */
 bool
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5a5b8da2e771a1dd204f22a6447eba96bb3b352c..39b50010e081dd3f03cb37c9a55b2bba49981ec8 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -365,6 +365,24 @@ vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
   if (known_le (vectorization_factor, 1U))
     return opt_result::failure_at (vect_location,
 				   "not vectorized: unsupported data-type\n");
+  /* Apply unrolling factor, this was determined by
+     vect_determine_unroll_factor the first time we ran the analysis for this
+     vector mode.  */
+  if (loop_vinfo->par_unrolling_factor > 1)
+    {
+      unsigned unrolling_factor = loop_vinfo->par_unrolling_factor;
+      while (unrolling_factor > 1)
+	{
+	  poly_uint64 candidate_factor = vectorization_factor * unrolling_factor;
+	  if (estimated_poly_value (candidate_factor, POLY_VALUE_MAX)
+	      <= (HOST_WIDE_INT) LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))
+	    {
+	      vectorization_factor = candidate_factor;
+	      break;
+	    }
+	  unrolling_factor /= 2;
+	}
+    }
   LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
   return opt_result::success ();
 }
@@ -828,6 +846,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    par_unrolling_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -2128,10 +2147,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 vectors to the epilogue, with the main loop continuing to operate
 	 on full vectors.
 
+	 If we are unrolling we also do not want to use partial vectors. This
+	 is to avoid the overhead of generating multiple masks and also to
+	 avoid having to execute entire iterations of FALSE masked instructions
+	 when dealing with one or less full iterations.
+
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->par_unrolling_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2879,6 +2904,116 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we should unroll this loop and ask target how much to
+   unroll by.  */
+
+static opt_loop_vec_info
+vect_determine_unroll_factor (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  bool seen_reduction_p = false;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  || !STMT_VINFO_RELEVANT_P (stmt_info)
+	  || stmt_info->vectype == NULL_TREE)
+	continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+	 vectorization will succeed due to the way we deal with negative steps
+	 in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+	  && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	{
+	  dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+	  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+	  if (TREE_CODE (step) == INTEGER_CST
+	      && tree_int_cst_compare (step, size_zero_node) < 0)
+	    {
+	      return opt_loop_vec_info::failure_at
+		(vect_location, "could not unroll due to negative step\n");
+	    }
+	}
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+	{
+	  auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+	  if (STMT_VINFO_REDUC_TYPE (red_info) == TREE_CODE_REDUCTION)
+	    seen_reduction_p = true;
+	  else
+	    {
+	      return opt_loop_vec_info::failure_at
+		(vect_location, "could not unroll loop with reduction due to "
+				"non TREE_CODE_REDUCTION\n");
+	    }
+	}
+    }
+
+  unsigned int unrolling_factor = 1;
+  if (maybe_gt (vectorization_factor, 1U))
+    unrolling_factor = vect_unroll_value (loop_vinfo, seen_reduction_p);
+
+  opt_loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_analyze_loop_form (loop_vinfo->loop,
+							  loop_vinfo->shared));
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+  unrolled_vinfo->par_unrolling_factor = unrolling_factor;
+  return unrolled_vinfo;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (opt_loop_vec_info loop_vinfo, unsigned *n_stmts)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  opt_loop_vec_info unrolled_vinfo = vect_determine_unroll_factor (loop_vinfo);
+  if (unrolled_vinfo)
+    {
+      if (unrolled_vinfo->par_unrolling_factor > 1)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** unrolling factor %d chosen for vector mode %s,"
+			     "re-trying analysis...\n",
+			     unrolled_vinfo->par_unrolling_factor,
+			     GET_MODE_NAME (unrolled_vinfo->vector_mode));
+	  bool unrolling_fatal = false;
+	  if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts)
+	      && known_ne (loop_vinfo->vectorization_factor,
+			   unrolled_vinfo->vectorization_factor))
+	    {
+
+	      loop_vinfo = unrolled_vinfo;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "unrolling succeeded with factor = %d\n",
+				 loop_vinfo->par_unrolling_factor);
+
+	    }
+	  else
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "unrolling failed with factor = %d\n",
+				 unrolled_vinfo->par_unrolling_factor);
+	    }
+	}
+      else
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "target determined unrolling is not profitable.\n");
+    }
+  loop_vinfo->loop->aux = NULL;
+  return loop_vinfo;
+}
+
 /* If LOOP_VINFO is already a main loop, return it unmodified.  Otherwise
    try to reanalyze it as a main loop.  Return the loop_vinfo on success
    and null on failure.  */
@@ -2904,6 +3039,8 @@ vect_reanalyze_as_main_loop (loop_vec_info loop_vinfo, unsigned int *n_stmts)
   bool fatal = false;
   bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts);
   loop->aux = NULL;
+  main_loop_vinfo = vect_try_unrolling (main_loop_vinfo, n_stmts);
+
   if (!res)
     {
       if (dump_enabled_p ())
@@ -3038,6 +3175,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (res)
 	{
+	  /* Only try unrolling main loops.  */
+	  if (!LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo))
+	    loop_vinfo = vect_try_unrolling (loop_vinfo, &n_stmts);
+
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
 	  vectorized_loops++;
 
@@ -3062,7 +3203,14 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		  gcc_assert (vect_epilogues);
 		  delete vinfos.pop ();
 		}
+	      /* Check if we may want to replace the current first_loop_vinfo
+		 with the new loop, but only if they have different vector
+		 modes.  If they have the same vector mode this means the main
+		 loop is an unrolled loop and we are trying to vectorize the
+		 epilogue using the same vector mode but with a lower
+		 vectorization factor.  */
 	      if (vinfos.is_empty ()
+		  && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
 		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
 		{
 		  loop_vec_info main_loop_vinfo
@@ -3153,13 +3301,32 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       /* Handle the case that the original loop can use partial
 	 vectorization, but want to only adopt it for the epilogue.
-	 The retry should be in the same mode as original.  */
+	 The retry should be in the same mode as original.
+	 Also handle the case where we have unrolled the main loop and want to
+	 retry all vector modes again for the epilogues, since the VF is now
+	 at least twice as high as the current vector mode.  */
       if (vect_epilogues
 	  && loop_vinfo
-	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	  && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      || loop_vinfo->par_unrolling_factor > 1))
 	{
-	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		       || loop_vinfo->par_unrolling_factor > 1)
 		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+	  if (loop_vinfo->par_unrolling_factor > 1)
+	    {
+	      next_vector_mode = vector_modes[0];
+	      mode_i = 1;
+
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "***** Re-trying analysis with vector mode"
+				 " %s for epilogues after unrolling.\n",
+				 GET_MODE_NAME (next_vector_mode));
+	      continue;
+	    }
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "***** Re-trying analysis with same vector mode"
@@ -7212,7 +7379,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->par_unrolling_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index c4c5678e7f1abafc25c465319dbacf3ef50f0ae9..6a8b6315a8d3337b2f7c8e17fdfec82565bfaab7 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -621,6 +621,11 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that we've unrolled the vector loop in order
+     to promote more ILP.  This value is folded into vectorization_factor
+     (and therefore exactly divides vectorization_factor).  */
+  unsigned int par_unrolling_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1822,6 +1827,20 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
 	  && th >= vect_vf_for_cost (loop_vinfo));
 }
 
+/* Return the number of times that we should unroll general
+   reduction-free loops.  */
+
+inline unsigned int
+vect_unroll_value (loop_vec_info loop_vinfo, bool seen_reduction_p)
+{
+  if (seen_reduction_p && param_vect_unroll_reductions >= 1)
+    return param_vect_unroll_reductions;
+  if (param_vect_unroll >= 1)
+    return param_vect_unroll;
+  else
+    return targetm.vectorize.unroll_factor (loop_vinfo);
+}
+
 /* Source location + hotness information. */
 extern dump_user_location_t vect_location;
Richard Biener Oct. 1, 2021, 8:19 a.m. UTC | #5
On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> 
> >> That just forces trying the vector modes we've tried before. Though I might
> >> need to revisit this now I think about it. I'm afraid it might be possible
> >> for
> >> this to generate an epilogue with a vf that is not lower than that of the
> >> main
> >> loop, but I'd need to think about this again.
> >>
> >> Either way I don't think this changes the vector modes used for the
> >> epilogue.
> >> But maybe I'm just missing your point here.
> > Yes, I was refering to the above which suggests that when we vectorize
> > the main loop with V4SF but unroll then we try vectorizing the
> > epilogue with V4SF as well (but not unrolled).  I think that's
> > premature (not sure if you try V8SF if the main loop was V4SF but
> > unrolled 4 times).
> 
> My main motivation for this was because I had a SVE loop that vectorized with
> both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
> V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
> been the best choice.

I see, yes - for fully predicated epilogues it makes sense to consider
the same vector mode as for the main loop anyways (independent on
whether we're unrolling or not).  One could argue that with an
unrolled V4SImode main loop a predicated V8SImode epilogue would also
be a good match (but then somehow costing favored the unrolled V4SI
over the V8SI for the main loop...).

> So that is why I decided to just 'reset' the vector_mode selection. In a
> scenario where you only have the traditional vector modes it might make less
> sense.
> 
> Just realized I still didn't add any check to make sure the epilogue has a
> lower VF than the previous loop, though I'm still not sure that could happen.
> I'll go look at where to add that if you agree with this.

As said above, it only needs a lower VF in case the epilogue is not
fully masked - otherwise the same VF would be OK.

> >> I can move it there, it would indeed remove the need for the change to
> >> vect_update_vf_for_slp, the change to
> >> vect_determine_partial_vectors_and_peeling would still be required I think.
> >> It
> >> is meant to disable using partial vectors in an unrolled loop.
> > Why would we disable the use of partial vectors in an unrolled loop?
> The motivation behind that is that the overhead caused by generating
> predicates for each iteration will likely be too much for it to be profitable
> to unroll. On top of that, when dealing with low iteration count loops, if
> executing one predicated iteration would be enough we now still need to
> execute all other unrolled predicated iterations, whereas if we keep them
> unrolled we skip the unrolled loops.

OK, I guess we're not factoring in costs when deciding on predication
but go for it if it's gernally enabled and possible.

With the proposed scheme we'd then cost the predicated not unrolled
loop against a not predicated unrolled loop which might be a bit
apples vs. oranges also because the target made the unroll decision
based on the data it collected for the predicated loop.

> > Sure but I'm suggesting you keep the not unrolled body as one way of
> > costed vectorization but then if the target says "try unrolling"
> > re-do the analysis with the same mode but a larger VF.  Just like
> > we iterate over vector modes you'll now iterate over pairs of
> > vector mode + VF (unroll factor).  It's not about re-using the costing
> > it's about using costing that is actually relevant and also to avoid
> > targets inventing two distinct separate costings - a target (powerpc)
> > might already compute load/store density and other stuff for the main
> > costing so it should have an idea whether doubling or triplicating is OK.
> >
> > Richard.
> Sounds good! I changed the patch to determine the unrolling factor later,
> after all analysis has been done and retry analysis if an unrolling factor
> larger than 1 has been chosen for this loop and vector_mode.
> 
> gcc/ChangeLog:
> 
>         * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
>         * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
>         * params.opt: Add vect-unroll and vect-unroll-reductions 
> parameters.

What's the reason to add the --params?  It looks like this makes
us unroll with a static number short-cutting the target.

IMHO that's never going to be a great thing - but what we could do
is look at loop->unroll and try to honor that (factoring in that
the vectorization factor is already the times we unroll).

So I'd leave those params out for now, the user would have a much
more fine-grained way to control this with the unroll pragma.

Adding a max-vect-unroll parameter would be another thing but that
would apply after the targets or pragma decision.

>         * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.

I still do not like the new target hook - as said I'd like to
make you have the finis_cost hook allow the target to specify
a suggested unroll factor instead because that's the point where
it has all the info.

Thanks,
Richard.

>         * targhooks.c (default_unroll_factor): New.
>         * targhooks.h (default_unroll_factor): Likewise.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
>         par_unrolling_factor.
>         (vect_determine_partial_vectors_and_peeling): Account for 
> unrolling.
>         (vect_determine_unroll_factor): New.
>         (vect_try_unrolling): New.
>         (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
>         retrying a loop_vinfo as a main loop.
>         (vect_analyze_loop): Call vect_try_unrolling when vectorizing 
> main loops.
>         (vect_analyze_loop): Allow for epilogue vectorization when unrolling
>         and rewalk vector_mode warray for the epilogues.
>         (vectorizable_reduction): Disable single_defuse_cycle when 
> unrolling.
>         * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
>         as a member of loop_vec_info.
>
Richard Sandiford Oct. 4, 2021, 4:30 p.m. UTC | #6
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>> 
>> 
>> >> That just forces trying the vector modes we've tried before. Though I might
>> >> need to revisit this now I think about it. I'm afraid it might be possible
>> >> for
>> >> this to generate an epilogue with a vf that is not lower than that of the
>> >> main
>> >> loop, but I'd need to think about this again.
>> >>
>> >> Either way I don't think this changes the vector modes used for the
>> >> epilogue.
>> >> But maybe I'm just missing your point here.
>> > Yes, I was refering to the above which suggests that when we vectorize
>> > the main loop with V4SF but unroll then we try vectorizing the
>> > epilogue with V4SF as well (but not unrolled).  I think that's
>> > premature (not sure if you try V8SF if the main loop was V4SF but
>> > unrolled 4 times).
>> 
>> My main motivation for this was because I had a SVE loop that vectorized with
>> both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
>> V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
>> been the best choice.
>
> I see, yes - for fully predicated epilogues it makes sense to consider
> the same vector mode as for the main loop anyways (independent on
> whether we're unrolling or not).  One could argue that with an
> unrolled V4SImode main loop a predicated V8SImode epilogue would also
> be a good match (but then somehow costing favored the unrolled V4SI
> over the V8SI for the main loop...).
>
>> So that is why I decided to just 'reset' the vector_mode selection. In a
>> scenario where you only have the traditional vector modes it might make less
>> sense.
>> 
>> Just realized I still didn't add any check to make sure the epilogue has a
>> lower VF than the previous loop, though I'm still not sure that could happen.
>> I'll go look at where to add that if you agree with this.
>
> As said above, it only needs a lower VF in case the epilogue is not
> fully masked - otherwise the same VF would be OK.
>
>> >> I can move it there, it would indeed remove the need for the change to
>> >> vect_update_vf_for_slp, the change to
>> >> vect_determine_partial_vectors_and_peeling would still be required I think.
>> >> It
>> >> is meant to disable using partial vectors in an unrolled loop.
>> > Why would we disable the use of partial vectors in an unrolled loop?
>> The motivation behind that is that the overhead caused by generating
>> predicates for each iteration will likely be too much for it to be profitable
>> to unroll. On top of that, when dealing with low iteration count loops, if
>> executing one predicated iteration would be enough we now still need to
>> execute all other unrolled predicated iterations, whereas if we keep them
>> unrolled we skip the unrolled loops.
>
> OK, I guess we're not factoring in costs when deciding on predication
> but go for it if it's gernally enabled and possible.

Yeah.  That's mostly be design in SVE's case, but I can see that it might
need tweaking for other targets.

I don't think that's the really the “problem” (if it is a problem)
for the unroll decision though.  The “correct” way to unroll SVE loops,
which we're hoping to add at some point, is to predicate only the final
vector in each unrolled iteration.  So if the unroll factor is 4, say,
the unrolled iterations would be:

  - unrolled 4x
    - unpredicated vector 0
    - unpredicated vector 1
    - unpredicated vector 2
    - predicated vector 3
 
The epilogue loop would then be predicated and execute at most
3 times.  Alternatively, there could be two non-looping epilogues:

  - unrolled 2x
    - unpredicated vector 0
    - predicated vector 1
  
  - not unrolled
    - predicated vector 0

This ensures that every executed vector operation does at least
some useful work.

Like Andre says, if we predicate as things stand, we'd have:

  - unrolled 4x
    - predicated vector 0
    - predicated vector 1
    - predicated vector 2
    - predicated vector 3

where the code could end up executing 4 vector operations per
scalar operation even if the final vector iteration only needs
to process 2 elements.  I don't think that's a costing issue,
it's just building in redundancy that doesn't need to be there.

Thanks,
Richard
Andre Vieira (lists) Oct. 12, 2021, 10:35 a.m. UTC | #7
Hi Richi,

I think this is what you meant, I now hide all the unrolling cost 
calculations in the existing target hooks for costs. I did need to 
adjust 'finish_cost' to take the loop_vinfo so the target's 
implementations are able to set the newly renamed 'suggested_unroll_factor'.

Also added the checks for the epilogue's VF.

Is this more like what you had in mind?


gcc/ChangeLog:

         * config/aarch64/aarch64.c (aarch64_finish_cost): Add class 
vec_info parameter.
         * config/i386/i386.c (ix86_finish_cost): Likewise.
         * config/rs6000/rs6000.c (rs6000_finish_cost): Likewise.
         * doc/tm.texi: Document changes to TARGET_VECTORIZE_FINISH_COST.
         * target.def: Add class vec_info parameter to finish_cost.
         * targhooks.c (default_finish_cost): Likewise.
         * targhooks.h (default_finish_cost): Likewise.
         * tree-vect-loop.c (vect_determine_vectorization_factor): Use 
suggested_unroll_factor
         to increase vectorization_factor if possible.
         (_loop_vec_info::_loop_vec_info): Add suggested_unroll_factor 
member.
         (vect_compute_single_scalar_iteration_cost): Adjust call to 
finish_cost.
         (vect_determine_partial_vectors_and_peeling): Ensure unrolled 
loop is not predicated.
         (vect_determine_unroll_factor): New.
         (vect_try_unrolling): New.
         (vect_reanalyze_as_main_loop): Also try to unroll when 
reanalyzing as main loop.
         (vect_analyze_loop): Add call to vect_try_unrolling and check 
to ensure epilogue
         is either a smaller VF than main loop or uses partial vectors 
and might be of equal
         VF.
         (vect_estimate_min_profitable_iters): Adjust call to finish_cost.
         (vectorizable_reduction): Make sure to not use 
single_defuse_cyle when unrolling.
         * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Adjust 
call to finish_cost.
         * tree-vectorizer.h (finish_cost): Change to pass new class 
vec_info parameter.

On 01/10/2021 09:19, Richard Biener wrote:
> On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>>
>>>> That just forces trying the vector modes we've tried before. Though I might
>>>> need to revisit this now I think about it. I'm afraid it might be possible
>>>> for
>>>> this to generate an epilogue with a vf that is not lower than that of the
>>>> main
>>>> loop, but I'd need to think about this again.
>>>>
>>>> Either way I don't think this changes the vector modes used for the
>>>> epilogue.
>>>> But maybe I'm just missing your point here.
>>> Yes, I was refering to the above which suggests that when we vectorize
>>> the main loop with V4SF but unroll then we try vectorizing the
>>> epilogue with V4SF as well (but not unrolled).  I think that's
>>> premature (not sure if you try V8SF if the main loop was V4SF but
>>> unrolled 4 times).
>> My main motivation for this was because I had a SVE loop that vectorized with
>> both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
>> V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
>> been the best choice.
> I see, yes - for fully predicated epilogues it makes sense to consider
> the same vector mode as for the main loop anyways (independent on
> whether we're unrolling or not).  One could argue that with an
> unrolled V4SImode main loop a predicated V8SImode epilogue would also
> be a good match (but then somehow costing favored the unrolled V4SI
> over the V8SI for the main loop...).
>
>> So that is why I decided to just 'reset' the vector_mode selection. In a
>> scenario where you only have the traditional vector modes it might make less
>> sense.
>>
>> Just realized I still didn't add any check to make sure the epilogue has a
>> lower VF than the previous loop, though I'm still not sure that could happen.
>> I'll go look at where to add that if you agree with this.
> As said above, it only needs a lower VF in case the epilogue is not
> fully masked - otherwise the same VF would be OK.
>
>>>> I can move it there, it would indeed remove the need for the change to
>>>> vect_update_vf_for_slp, the change to
>>>> vect_determine_partial_vectors_and_peeling would still be required I think.
>>>> It
>>>> is meant to disable using partial vectors in an unrolled loop.
>>> Why would we disable the use of partial vectors in an unrolled loop?
>> The motivation behind that is that the overhead caused by generating
>> predicates for each iteration will likely be too much for it to be profitable
>> to unroll. On top of that, when dealing with low iteration count loops, if
>> executing one predicated iteration would be enough we now still need to
>> execute all other unrolled predicated iterations, whereas if we keep them
>> unrolled we skip the unrolled loops.
> OK, I guess we're not factoring in costs when deciding on predication
> but go for it if it's gernally enabled and possible.
>
> With the proposed scheme we'd then cost the predicated not unrolled
> loop against a not predicated unrolled loop which might be a bit
> apples vs. oranges also because the target made the unroll decision
> based on the data it collected for the predicated loop.
>
>>> Sure but I'm suggesting you keep the not unrolled body as one way of
>>> costed vectorization but then if the target says "try unrolling"
>>> re-do the analysis with the same mode but a larger VF.  Just like
>>> we iterate over vector modes you'll now iterate over pairs of
>>> vector mode + VF (unroll factor).  It's not about re-using the costing
>>> it's about using costing that is actually relevant and also to avoid
>>> targets inventing two distinct separate costings - a target (powerpc)
>>> might already compute load/store density and other stuff for the main
>>> costing so it should have an idea whether doubling or triplicating is OK.
>>>
>>> Richard.
>> Sounds good! I changed the patch to determine the unrolling factor later,
>> after all analysis has been done and retry analysis if an unrolling factor
>> larger than 1 has been chosen for this loop and vector_mode.
>>
>> gcc/ChangeLog:
>>
>>          * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
>>          * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
>>          * params.opt: Add vect-unroll and vect-unroll-reductions
>> parameters.
> What's the reason to add the --params?  It looks like this makes
> us unroll with a static number short-cutting the target.
>
> IMHO that's never going to be a great thing - but what we could do
> is look at loop->unroll and try to honor that (factoring in that
> the vectorization factor is already the times we unroll).
>
> So I'd leave those params out for now, the user would have a much
> more fine-grained way to control this with the unroll pragma.
>
> Adding a max-vect-unroll parameter would be another thing but that
> would apply after the targets or pragma decision.
>
>>          * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.
> I still do not like the new target hook - as said I'd like to
> make you have the finis_cost hook allow the target to specify
> a suggested unroll factor instead because that's the point where
> it has all the info.
>
> Thanks,
> Richard.
>
>>          * targhooks.c (default_unroll_factor): New.
>>          * targhooks.h (default_unroll_factor): Likewise.
>>          * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
>>          par_unrolling_factor.
>>          (vect_determine_partial_vectors_and_peeling): Account for
>> unrolling.
>>          (vect_determine_unroll_factor): New.
>>          (vect_try_unrolling): New.
>>          (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
>>          retrying a loop_vinfo as a main loop.
>>          (vect_analyze_loop): Call vect_try_unrolling when vectorizing
>> main loops.
>>          (vect_analyze_loop): Allow for epilogue vectorization when unrolling
>>          and rewalk vector_mode warray for the epilogues.
>>          (vectorizable_reduction): Disable single_defuse_cycle when
>> unrolling.
>>          * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
>>          as a member of loop_vec_info.
>>
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 36519ccc5a58abab483c38d0a6c5f039592bfc7f..e6ccb66ba41895c4583a959d03ac3f0f173adae6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15972,8 +15972,9 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 
 /* Implement TARGET_VECTORIZE_FINISH_COST.  */
 static void
-aarch64_finish_cost (void *data, unsigned *prologue_cost,
-		     unsigned *body_cost, unsigned *epilogue_cost)
+aarch64_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+		     unsigned *prologue_cost, unsigned *body_cost,
+		     unsigned *epilogue_cost)
 {
   auto *costs = static_cast<aarch64_vector_costs *> (data);
   *prologue_cost = costs->region[vect_prologue];
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index afc2674d49da370ae0f5ef277df7e9954f303b8e..de7bb9fe62fcec53ee40a4798f24c6ccd4584736 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23048,8 +23048,9 @@ ix86_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-ix86_finish_cost (void *data, unsigned *prologue_cost,
-		  unsigned *body_cost, unsigned *epilogue_cost)
+ix86_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+		  unsigned *prologue_cost, unsigned *body_cost,
+		  unsigned *epilogue_cost)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad81dfb316dff00cde810d6b1edd31fa49d5c1e8..6f674b6426284dbf9b9f8fdd85515cf9702adff6 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5551,8 +5551,9 @@ rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-rs6000_finish_cost (void *data, unsigned *prologue_cost,
-		    unsigned *body_cost, unsigned *epilogue_cost)
+rs6000_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+		    unsigned *prologue_cost, unsigned *body_cost,
+		    unsigned *epilogue_cost)
 {
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index be8148583d8571b0d035b1938db9d056bfd213a8..05ddd4c58a3711dd949b28da3e61fb49d8175257 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6276,7 +6276,7 @@ return value should be viewed as a tentative cost that may later be
 revised.
 @end deftypefn
 
-@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
+@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (class vec_info *@var{vinfo}, void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
 This hook should complete calculations of the cost of vectorizing a loop
 or basic block based on @var{data}, and return the prologue, body, and
 epilogue costs as unsigned integers.  The default returns the value of
diff --git a/gcc/target.def b/gcc/target.def
index bfa819609c21bd71c0cc585c01dba42534453f47..f0be0e10a9225dd75b013535d8e42c1d1bfe8f50 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2081,8 +2081,8 @@ or basic block based on @var{data}, and return the prologue, body, and\n\
 epilogue costs as unsigned integers.  The default returns the value of\n\
 the three accumulators.",
  void,
- (void *data, unsigned *prologue_cost, unsigned *body_cost,
-  unsigned *epilogue_cost),
+ (class vec_info *vinfo, void *data, unsigned *prologue_cost,
+  unsigned *body_cost, unsigned *epilogue_cost),
  default_finish_cost)
 
 /* Function to delete target-specific cost modeling data.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..6fd1fade49cfe00295afd52aee7a34931bb48b92 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -123,7 +123,8 @@ extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
 				       enum vect_cost_for_stmt,
 				       class _stmt_vec_info *, tree, int,
 				       enum vect_cost_model_location);
-extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
+extern void default_finish_cost (class vec_info *, void *, unsigned *,
+				 unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..0a3ecfa76406152ce79aaf19c5a2cc8b652936ff 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1518,8 +1518,9 @@ default_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* By default, the cost model just returns the accumulated costs.  */
 
 void
-default_finish_cost (void *data, unsigned *prologue_cost,
-		     unsigned *body_cost, unsigned *epilogue_cost)
+default_finish_cost (class vec_info *vinfo, void *data,
+		     unsigned *prologue_cost, unsigned *body_cost,
+		     unsigned *epilogue_cost)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5a5b8da2e771a1dd204f22a6447eba96bb3b352c..50256cb6cb478246e3402162391096cbbc7fde94 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -365,6 +365,24 @@ vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
   if (known_le (vectorization_factor, 1U))
     return opt_result::failure_at (vect_location,
 				   "not vectorized: unsupported data-type\n");
+  /* Apply unrolling factor, this was determined by
+     vect_determine_unroll_factor the first time we ran the analyzis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    {
+      unsigned unrolling_factor = loop_vinfo->suggested_unroll_factor;
+      while (unrolling_factor > 1)
+	{
+	  poly_uint64 candidate_factor = vectorization_factor * unrolling_factor;
+	  if (estimated_poly_value (candidate_factor, POLY_VALUE_MAX)
+	      <= (HOST_WIDE_INT) LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))
+	    {
+	      vectorization_factor = candidate_factor;
+	      break;
+	    }
+	  unrolling_factor /= 2;
+	}
+    }
   LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
   return opt_result::success ();
 }
@@ -828,6 +846,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    suggested_unroll_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1301,7 +1320,7 @@ vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
 			  si->kind, si->stmt_info, si->vectype,
 			  si->misalign, si->where);
   unsigned prologue_cost = 0, body_cost = 0, epilogue_cost = 0;
-  finish_cost (target_cost_data, &prologue_cost, &body_cost,
+  finish_cost (NULL, target_cost_data, &prologue_cost, &body_cost,
 	       &epilogue_cost);
   destroy_cost_data (target_cost_data);
   LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo)
@@ -2128,10 +2147,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 vectors to the epilogue, with the main loop continuing to operate
 	 on full vectors.
 
+	 If we are unrolling we also do not want to use partial vectors. This
+	 is to avoid the overhead of generating multiple masks and also to
+	 avoid having to execute entire iterations of FALSE masked instructions
+	 when dealing with one or less full iterations.
+
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->suggested_unroll_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2879,6 +2904,121 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we should unroll this loop and ask target how much to
+   unroll by.  */
+
+static opt_loop_vec_info
+vect_determine_unroll_factor (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  bool seen_reduction_p = false;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  || !STMT_VINFO_RELEVANT_P (stmt_info)
+	  || stmt_info->vectype == NULL_TREE)
+	continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+	 vectorization will succeed due to the way we deal with negative steps
+	 in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+	  && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	{
+	  dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+	  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+	  if (TREE_CODE (step) == INTEGER_CST
+	      && tree_int_cst_compare (step, size_zero_node) < 0)
+	    {
+	      return opt_loop_vec_info::failure_at
+		(vect_location, "could not unroll due to negative step\n");
+	    }
+	}
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+	{
+	  auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+	  if (STMT_VINFO_REDUC_TYPE (red_info) == TREE_CODE_REDUCTION)
+	    seen_reduction_p = true;
+	  else
+	    {
+	      return opt_loop_vec_info::failure_at
+		(vect_location, "could not unroll loop with reduction due to "
+				"non TREE_CODE_REDUCTION\n");
+	    }
+	}
+    }
+
+  if (known_le (vectorization_factor, 1U))
+    return opt_loop_vec_info::failure_at (vect_location,
+					  "will not unroll loop with a VF of 1"
+					  "or less\n");
+
+  opt_loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_analyze_loop_form (loop_vinfo->loop,
+							  loop_vinfo->shared));
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+  /* Use the suggested_unrolling_factor that was set during the target's
+     TARGET_VECTORIZE_FINISH_COST hook.  */
+  unrolled_vinfo->suggested_unroll_factor = loop_vinfo->suggested_unroll_factor;
+  return unrolled_vinfo;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (opt_loop_vec_info loop_vinfo, unsigned *n_stmts)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  opt_loop_vec_info unrolled_vinfo = vect_determine_unroll_factor (loop_vinfo);
+  /* Reset unrolling factor, in case we decide to not unroll.  */
+  loop_vinfo->suggested_unroll_factor = 1;
+  if (unrolled_vinfo)
+    {
+      if (unrolled_vinfo->suggested_unroll_factor > 1)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** unrolling factor %d chosen for vector mode %s,"
+			     "re-trying analyzis...\n",
+			     unrolled_vinfo->suggested_unroll_factor,
+			     GET_MODE_NAME (unrolled_vinfo->vector_mode));
+	  bool unrolling_fatal = false;
+	  if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts)
+	      && known_ne (loop_vinfo->vectorization_factor,
+			   unrolled_vinfo->vectorization_factor))
+	    {
+
+	      loop_vinfo = unrolled_vinfo;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "unrolling succeeded with factor = %d\n",
+				 loop_vinfo->suggested_unroll_factor);
+
+	    }
+	  else
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "unrolling failed with factor = %d\n",
+				 unrolled_vinfo->suggested_unroll_factor);
+	    }
+	}
+      else
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "target determined unrolling is not profitable.\n");
+    }
+  loop_vinfo->loop->aux = NULL;
+  return loop_vinfo;
+}
+
 /* If LOOP_VINFO is already a main loop, return it unmodified.  Otherwise
    try to reanalyze it as a main loop.  Return the loop_vinfo on success
    and null on failure.  */
@@ -2904,6 +3044,8 @@ vect_reanalyze_as_main_loop (loop_vec_info loop_vinfo, unsigned int *n_stmts)
   bool fatal = false;
   bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts);
   loop->aux = NULL;
+  main_loop_vinfo = vect_try_unrolling (main_loop_vinfo, n_stmts);
+
   if (!res)
     {
       if (dump_enabled_p ())
@@ -3038,6 +3180,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (res)
 	{
+	  /* Only try unrolling main loops.  */
+	  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	    loop_vinfo = vect_try_unrolling (loop_vinfo, &n_stmts);
+
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
 	  vectorized_loops++;
 
@@ -3056,13 +3202,26 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      /* Keep trying to roll back vectorization attempts while the
 		 loop_vec_infos they produced were worse than this one.  */
 	      vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
+	      poly_uint64 vinfo_vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+	      poly_uint64 first_vinfo_vf
+		= LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
 	      while (!vinfos.is_empty ()
+		     && (known_lt (vinfo_vf, first_vinfo_vf)
+			 || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+			     && maybe_eq (vinfo_vf, first_vinfo_vf)))
 		     && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
 		{
 		  gcc_assert (vect_epilogues);
 		  delete vinfos.pop ();
 		}
+	      /* Check if we may want to replace the current first_loop_vinfo
+		 with the new loop, but only if they have different vector
+		 modes.  If they have the same vector mode this means the main
+		 loop is an unrolled loop and we are trying to vectorize the
+		 epilogue using the same vector mode but with a lower
+		 vectorization factor.  */
 	      if (vinfos.is_empty ()
+		  && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
 		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
 		{
 		  loop_vec_info main_loop_vinfo
@@ -3105,14 +3264,34 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		   /* For now only allow one epilogue loop.  */
 		   && first_loop_vinfo->epilogue_vinfos.is_empty ())
 	    {
-	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
-	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-	      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
-			  || maybe_ne (lowest_th, 0U));
-	      /* Keep track of the known smallest versioning
-		 threshold.  */
-	      if (ordered_p (lowest_th, th))
-		lowest_th = ordered_min (lowest_th, th);
+	      /* Ensure the epilogue has a smaller VF than the main loop or
+		 uses predication and has the same VF.  */
+	      if (known_lt (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))
+		  || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+		      && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+				   LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))))
+		{
+		  first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
+		  poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		  gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			      || maybe_ne (lowest_th, 0U));
+		  /* Keep track of the known smallest versioning
+		     threshold.  */
+		  if (ordered_p (lowest_th, th))
+		    lowest_th = ordered_min (lowest_th, th);
+		}
+	      else
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				     "***** Will not use %s mode as an"
+				     " epilogue, since it leads to an higher"
+				     " vectorization factor than main loop\n",
+				     GET_MODE_NAME (loop_vinfo->vector_mode));
+		  delete loop_vinfo;
+		  loop_vinfo = opt_loop_vec_info::success (NULL);
+		}
 	    }
 	  else
 	    {
@@ -3153,13 +3332,32 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       /* Handle the case that the original loop can use partial
 	 vectorization, but want to only adopt it for the epilogue.
-	 The retry should be in the same mode as original.  */
+	 The retry should be in the same mode as original.
+	 Also handle the case where we have unrolled the main loop and want to
+	 retry all vector modes again for the epilogues, since the VF is now
+	 at least twice as high as the current vector mode.  */
       if (vect_epilogues
 	  && loop_vinfo
-	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	  && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      || loop_vinfo->suggested_unroll_factor > 1))
 	{
-	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		       || loop_vinfo->suggested_unroll_factor > 1)
 		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+	  if (loop_vinfo->suggested_unroll_factor > 1)
+	    {
+	      next_vector_mode = vector_modes[0];
+	      mode_i = 1;
+
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "***** Re-trying analysis with vector mode"
+				 " %s for epilogues after unrolling.\n",
+				 GET_MODE_NAME (next_vector_mode));
+	      continue;
+	    }
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "***** Re-trying analysis with same vector mode"
@@ -4222,8 +4420,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* Complete the target-specific cost calculations.  */
-  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost,
-	       &vec_inside_cost, &vec_epilogue_cost);
+  finish_cost (loop_vinfo, LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+	       &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
 
@@ -7212,7 +7410,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 024a1c38a2342246d7891db1de5f1d6e6458d5dd..dce8b953d306b90185ffe75c637f1fdb998aa953 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5405,7 +5405,8 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
       while (si < li_scalar_costs.length ()
 	     && li_scalar_costs[si].first == sl);
       unsigned dummy;
-      finish_cost (scalar_target_cost_data, &dummy, &scalar_cost, &dummy);
+      finish_cost (bb_vinfo, scalar_target_cost_data, &dummy, &scalar_cost,
+		   &dummy);
       destroy_cost_data (scalar_target_cost_data);
 
       /* Complete the target-specific vector cost calculation.  */
@@ -5418,7 +5419,7 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
 	}
       while (vi < li_vector_costs.length ()
 	     && li_vector_costs[vi].first == vl);
-      finish_cost (vect_target_cost_data, &vec_prologue_cost,
+      finish_cost (bb_vinfo, vect_target_cost_data, &vec_prologue_cost,
 		   &vec_inside_cost, &vec_epilogue_cost);
       destroy_cost_data (vect_target_cost_data);
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index c4c5678e7f1abafc25c465319dbacf3ef50f0ae9..e91fb6691857cbcc0b1c087d6de35164a7c75e48 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -621,6 +621,13 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that the target suggested we unroll the vector loop
+     in order to promote more ILP.  This value will be used to re-analyze the
+     loop for vectorization and if successful the value will be folded into
+     vectorization_factor (and therefore exactly divides
+     vectorization_factor).  */
+  unsigned int suggested_unroll_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1570,10 +1577,10 @@ add_stmt_cost (vec_info *vinfo, void *data, stmt_info_for_cost *i)
 /* Alias targetm.vectorize.finish_cost.  */
 
 static inline void
-finish_cost (void *data, unsigned *prologue_cost,
+finish_cost (class vec_info *vinfo, void *data, unsigned *prologue_cost,
 	     unsigned *body_cost, unsigned *epilogue_cost)
 {
-  targetm.vectorize.finish_cost (data, prologue_cost, body_cost, epilogue_cost);
+  targetm.vectorize.finish_cost (vinfo, data, prologue_cost, body_cost, epilogue_cost);
 }
 
 /* Alias targetm.vectorize.destroy_cost_data.  */
Richard Biener Oct. 15, 2021, 8:48 a.m. UTC | #8
On Tue, 12 Oct 2021, Andre Vieira (lists) wrote:

> Hi Richi,
> 
> I think this is what you meant, I now hide all the unrolling cost calculations
> in the existing target hooks for costs. I did need to adjust 'finish_cost' to
> take the loop_vinfo so the target's implementations are able to set the newly
> renamed 'suggested_unroll_factor'.
> 
> Also added the checks for the epilogue's VF.
> 
> Is this more like what you had in mind?

Not exactly (sorry..).  For the target hook I think we don't want to
pass vec_info but instead another output parameter like the existing
ones.

vect_estimate_min_profitable_iters should then via
vect_analyze_loop_costing and vect_analyze_loop_2 report the unroll
suggestion to vect_analyze_loop which should then, if the suggestion
was > 1, instead of iterating to the next vector mode run again
with a fixed VF (old VF times suggested unroll factor - there's
min_vf in vect_analyze_loop_2 which we should adjust to
the old VF times two for example and maybe store the suggested
factor as hint) - if it succeeds the result will end up in the
list of considered modes (where we now may have more than one
entry for the same mode but a different VF), we probably want to
only consider more unrolling once.

For simplicity I'd probably set min_vf = max_vf = old VF * suggested 
factor, thus take the targets request literally.

Richard.

> 
> gcc/ChangeLog:
> 
>         * config/aarch64/aarch64.c (aarch64_finish_cost): Add class vec_info
> parameter.
>         * config/i386/i386.c (ix86_finish_cost): Likewise.
>         * config/rs6000/rs6000.c (rs6000_finish_cost): Likewise.
>         * doc/tm.texi: Document changes to TARGET_VECTORIZE_FINISH_COST.
>         * target.def: Add class vec_info parameter to finish_cost.
>         * targhooks.c (default_finish_cost): Likewise.
>         * targhooks.h (default_finish_cost): Likewise.
>         * tree-vect-loop.c (vect_determine_vectorization_factor): Use 
> suggested_unroll_factor
>         to increase vectorization_factor if possible.
>         (_loop_vec_info::_loop_vec_info): Add suggested_unroll_factor 
> member.
>         (vect_compute_single_scalar_iteration_cost): Adjust call to
> finish_cost.
>         (vect_determine_partial_vectors_and_peeling): Ensure unrolled loop is
> not predicated.
>         (vect_determine_unroll_factor): New.
>         (vect_try_unrolling): New.
>         (vect_reanalyze_as_main_loop): Also try to unroll when 
> reanalyzing as main loop.
>         (vect_analyze_loop): Add call to vect_try_unrolling and check to
> ensure epilogue
>         is either a smaller VF than main loop or uses partial vectors and
> might be of equal
>         VF.
>         (vect_estimate_min_profitable_iters): Adjust call to finish_cost.
>         (vectorizable_reduction): Make sure to not use 
> single_defuse_cyle when unrolling.
>         * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Adjust call to
> finish_cost.
>         * tree-vectorizer.h (finish_cost): Change to pass new class vec_info
> parameter.
> 
> On 01/10/2021 09:19, Richard Biener wrote:
> > On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >>
> >>>> That just forces trying the vector modes we've tried before. Though I
> >>>> might
> >>>> need to revisit this now I think about it. I'm afraid it might be
> >>>> possible
> >>>> for
> >>>> this to generate an epilogue with a vf that is not lower than that of the
> >>>> main
> >>>> loop, but I'd need to think about this again.
> >>>>
> >>>> Either way I don't think this changes the vector modes used for the
> >>>> epilogue.
> >>>> But maybe I'm just missing your point here.
> >>> Yes, I was refering to the above which suggests that when we vectorize
> >>> the main loop with V4SF but unroll then we try vectorizing the
> >>> epilogue with V4SF as well (but not unrolled).  I think that's
> >>> premature (not sure if you try V8SF if the main loop was V4SF but
> >>> unrolled 4 times).
> >> My main motivation for this was because I had a SVE loop that vectorized
> >> with
> >> both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
> >> V8HI by two and skipped using VNx8HI as a predicated epilogue which
> >> would've
> >> been the best choice.
> > I see, yes - for fully predicated epilogues it makes sense to consider
> > the same vector mode as for the main loop anyways (independent on
> > whether we're unrolling or not).  One could argue that with an
> > unrolled V4SImode main loop a predicated V8SImode epilogue would also
> > be a good match (but then somehow costing favored the unrolled V4SI
> > over the V8SI for the main loop...).
> >
> >> So that is why I decided to just 'reset' the vector_mode selection. In a
> >> scenario where you only have the traditional vector modes it might make
> >> less
> >> sense.
> >>
> >> Just realized I still didn't add any check to make sure the epilogue has a
> >> lower VF than the previous loop, though I'm still not sure that could
> >> happen.
> >> I'll go look at where to add that if you agree with this.
> > As said above, it only needs a lower VF in case the epilogue is not
> > fully masked - otherwise the same VF would be OK.
> >
> >>>> I can move it there, it would indeed remove the need for the change to
> >>>> vect_update_vf_for_slp, the change to
> >>>> vect_determine_partial_vectors_and_peeling would still be required I
> >>>> think.
> >>>> It
> >>>> is meant to disable using partial vectors in an unrolled loop.
> >>> Why would we disable the use of partial vectors in an unrolled loop?
> >> The motivation behind that is that the overhead caused by generating
> >> predicates for each iteration will likely be too much for it to be
> >> profitable
> >> to unroll. On top of that, when dealing with low iteration count loops, if
> >> executing one predicated iteration would be enough we now still need to
> >> execute all other unrolled predicated iterations, whereas if we keep them
> >> unrolled we skip the unrolled loops.
> > OK, I guess we're not factoring in costs when deciding on predication
> > but go for it if it's gernally enabled and possible.
> >
> > With the proposed scheme we'd then cost the predicated not unrolled
> > loop against a not predicated unrolled loop which might be a bit
> > apples vs. oranges also because the target made the unroll decision
> > based on the data it collected for the predicated loop.
> >
> >>> Sure but I'm suggesting you keep the not unrolled body as one way of
> >>> costed vectorization but then if the target says "try unrolling"
> >>> re-do the analysis with the same mode but a larger VF.  Just like
> >>> we iterate over vector modes you'll now iterate over pairs of
> >>> vector mode + VF (unroll factor).  It's not about re-using the costing
> >>> it's about using costing that is actually relevant and also to avoid
> >>> targets inventing two distinct separate costings - a target (powerpc)
> >>> might already compute load/store density and other stuff for the main
> >>> costing so it should have an idea whether doubling or triplicating is OK.
> >>>
> >>> Richard.
> >> Sounds good! I changed the patch to determine the unrolling factor later,
> >> after all analysis has been done and retry analysis if an unrolling factor
> >> larger than 1 has been chosen for this loop and vector_mode.
> >>
> >> gcc/ChangeLog:
> >>
> >>          * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
> >>          * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
> >>          * params.opt: Add vect-unroll and vect-unroll-reductions
> >> parameters.
> > What's the reason to add the --params?  It looks like this makes
> > us unroll with a static number short-cutting the target.
> >
> > IMHO that's never going to be a great thing - but what we could do
> > is look at loop->unroll and try to honor that (factoring in that
> > the vectorization factor is already the times we unroll).
> >
> > So I'd leave those params out for now, the user would have a much
> > more fine-grained way to control this with the unroll pragma.
> >
> > Adding a max-vect-unroll parameter would be another thing but that
> > would apply after the targets or pragma decision.
> >
> >>          * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.
> > I still do not like the new target hook - as said I'd like to
> > make you have the finis_cost hook allow the target to specify
> > a suggested unroll factor instead because that's the point where
> > it has all the info.
> >
> > Thanks,
> > Richard.
> >
> >>          * targhooks.c (default_unroll_factor): New.
> >>          * targhooks.h (default_unroll_factor): Likewise.
> >>          * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
> >>          par_unrolling_factor.
> >>          (vect_determine_partial_vectors_and_peeling): Account for
> >> unrolling.
> >>          (vect_determine_unroll_factor): New.
> >>          (vect_try_unrolling): New.
> >>          (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
> >>          retrying a loop_vinfo as a main loop.
> >>          (vect_analyze_loop): Call vect_try_unrolling when vectorizing
> >> main loops.
> >>          (vect_analyze_loop): Allow for epilogue vectorization when
> >>  unrolling
> >>          and rewalk vector_mode warray for the epilogues.
> >>          (vectorizable_reduction): Disable single_defuse_cycle when
> >> unrolling.
> >>          * tree-vectorizer.h (vect_unroll_value): Declare
> >>  par_unrolling_factor
> >>          as a member of loop_vec_info.
> >>
> 
>
Andre Vieira (lists) Oct. 20, 2021, 1:29 p.m. UTC | #9
On 15/10/2021 09:48, Richard Biener wrote:
> On Tue, 12 Oct 2021, Andre Vieira (lists) wrote:
>
>> Hi Richi,
>>
>> I think this is what you meant, I now hide all the unrolling cost calculations
>> in the existing target hooks for costs. I did need to adjust 'finish_cost' to
>> take the loop_vinfo so the target's implementations are able to set the newly
>> renamed 'suggested_unroll_factor'.
>>
>> Also added the checks for the epilogue's VF.
>>
>> Is this more like what you had in mind?
> Not exactly (sorry..).  For the target hook I think we don't want to
> pass vec_info but instead another output parameter like the existing
> ones.
>
> vect_estimate_min_profitable_iters should then via
> vect_analyze_loop_costing and vect_analyze_loop_2 report the unroll
> suggestion to vect_analyze_loop which should then, if the suggestion
> was > 1, instead of iterating to the next vector mode run again
> with a fixed VF (old VF times suggested unroll factor - there's
> min_vf in vect_analyze_loop_2 which we should adjust to
> the old VF times two for example and maybe store the suggested
> factor as hint) - if it succeeds the result will end up in the
> list of considered modes (where we now may have more than one
> entry for the same mode but a different VF), we probably want to
> only consider more unrolling once.
>
> For simplicity I'd probably set min_vf = max_vf = old VF * suggested
> factor, thus take the targets request literally.
>
> Richard.

Hi,

I now pass an output parameter to finish_costs and route it through the 
various calls up to vect_analyze_loop.  I tried to rework 
vect_determine_vectorization_factor and noticed that merely setting 
min_vf and max_vf is not enough, we only use these to check whether the 
vectorization factor is within range, well actually we only use max_vf 
at that stage. We only seem to use 'min_vf' to make sure the 
data_references are valid.  I am not sure my changes are the most 
appropriate here, for instance I am pretty sure the checks for max and 
min vf I added in vect_determine_vectorization_factor are currently 
superfluous as they will pass by design, but thought they might be good 
future proofing?

Also I changed how we compare against max_vf, rather than relying on the 
'MAX_VECTORIZATION' I decided to use the estimated_poly_value with 
POLY_VALUE_MAX, to be able to bound it further in case we have knowledge 
of the VL. I am not entirely about the validity of this change, maybe we 
are better off keeping the MAX_VECTORIZATION in place and not making any 
changes to max_vf for unrolling.

What do you think?
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 36519ccc5a58abab483c38d0a6c5f039592bfc7f..9b1e01e9b62050d7e34bc55454771e40bdbdb4cb 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15972,8 +15972,8 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 
 /* Implement TARGET_VECTORIZE_FINISH_COST.  */
 static void
-aarch64_finish_cost (void *data, unsigned *prologue_cost,
-		     unsigned *body_cost, unsigned *epilogue_cost)
+aarch64_finish_cost (void *data, unsigned *prologue_cost, unsigned *body_cost,
+		     unsigned *epilogue_cost, unsigned *suggested_unroll_factor)
 {
   auto *costs = static_cast<aarch64_vector_costs *> (data);
   *prologue_cost = costs->region[vect_prologue];
@@ -15984,6 +15984,9 @@ aarch64_finish_cost (void *data, unsigned *prologue_cost,
       && costs->vec_flags
       && aarch64_use_new_vector_costs_p ())
     *body_cost = aarch64_adjust_body_cost (costs, *body_cost);
+
+  if(suggested_unroll_factor)
+    *suggested_unroll_factor = 1;
 }
 
 /* Implement TARGET_VECTORIZE_DESTROY_COST_DATA.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index afc2674d49da370ae0f5ef277df7e9954f303b8e..a48e43879512793907fef946c1575c3ed7f68092 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23048,13 +23048,15 @@ ix86_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-ix86_finish_cost (void *data, unsigned *prologue_cost,
-		  unsigned *body_cost, unsigned *epilogue_cost)
+ix86_finish_cost (void *data, unsigned *prologue_cost, unsigned *body_cost,
+		  unsigned *epilogue_cost, unsigned *suggested_unroll_factor)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
   *body_cost     = cost[vect_body];
   *epilogue_cost = cost[vect_epilogue];
+  if (suggested_unroll_factor)
+    *suggested_unroll_factor = 1;
 }
 
 /* Implement targetm.vectorize.destroy_cost_data.  */
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad81dfb316dff00cde810d6b1edd31fa49d5c1e8..59d30ad6fcd1758383c52e34a0f90a126c501ec3 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5551,8 +5551,8 @@ rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-rs6000_finish_cost (void *data, unsigned *prologue_cost,
-		    unsigned *body_cost, unsigned *epilogue_cost)
+rs6000_finish_cost (void *data, unsigned *prologue_cost, unsigned *body_cost,
+		    unsigned *epilogue_cost, unsigned *suggested_unroll_factor)
 {
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
@@ -5578,6 +5578,8 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   *prologue_cost = cost_data->cost[vect_prologue];
   *body_cost     = cost_data->cost[vect_body];
   *epilogue_cost = cost_data->cost[vect_epilogue];
+  if (suggested_unroll_factor)
+    *suggested_unroll_factor = 1;
 }
 
 /* Implement targetm.vectorize.destroy_cost_data.  */
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index be8148583d8571b0d035b1938db9d056bfd213a8..c584260b02c3e8d4fcd7b31c38321d5f81a71428 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6276,11 +6276,12 @@ return value should be viewed as a tentative cost that may later be
 revised.
 @end deftypefn
 
-@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
+@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost}, unsigned *@var{suggested_unroll_factor})
 This hook should complete calculations of the cost of vectorizing a loop
 or basic block based on @var{data}, and return the prologue, body, and
-epilogue costs as unsigned integers.  The default returns the value of
-the three accumulators.
+epilogue costs as unsigned integers.  It also asks the backend whether it
+has a @var{suggested_unroll_factor}.  The default returns the value of
+the three cost accumulators.
 @end deftypefn
 
 @deftypefn {Target Hook} void TARGET_VECTORIZE_DESTROY_COST_DATA (void *@var{data})
diff --git a/gcc/target.def b/gcc/target.def
index bfa819609c21bd71c0cc585c01dba42534453f47..df0f170ff3378671e802d82a8bce8e153d8cf8fe 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2078,11 +2078,12 @@ DEFHOOK
 (finish_cost,
  "This hook should complete calculations of the cost of vectorizing a loop\n\
 or basic block based on @var{data}, and return the prologue, body, and\n\
-epilogue costs as unsigned integers.  The default returns the value of\n\
-the three accumulators.",
+epilogue costs as unsigned integers.  It also asks the backend whether it\n\
+has a @var{suggested_unroll_factor}.  The default returns the value of\n\
+the three cost accumulators.",
  void,
  (void *data, unsigned *prologue_cost, unsigned *body_cost,
-  unsigned *epilogue_cost),
+  unsigned *epilogue_cost, unsigned *suggested_unroll_factor),
  default_finish_cost)
 
 /* Function to delete target-specific cost modeling data.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..b9697c366876fe5a8c444ffcf58bdc6b5c33b0ad 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -123,7 +123,8 @@ extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
 				       enum vect_cost_for_stmt,
 				       class _stmt_vec_info *, tree, int,
 				       enum vect_cost_model_location);
-extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
+extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *,
+				 unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..8552d9a0f144e7bcee3f2653f2ea84ea677f80a2 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1518,13 +1518,18 @@ default_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* By default, the cost model just returns the accumulated costs.  */
 
 void
-default_finish_cost (void *data, unsigned *prologue_cost,
-		     unsigned *body_cost, unsigned *epilogue_cost)
+default_finish_cost (void *data,
+		     unsigned *prologue_cost, unsigned *body_cost,
+		     unsigned *epilogue_cost,
+		     unsigned *suggested_unroll_factor)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
   *body_cost     = cost[vect_body];
   *epilogue_cost = cost[vect_epilogue];
+  /* Do not unroll.  */
+  if (suggested_unroll_factor)
+    *suggested_unroll_factor = 1;
 }
 
 /* Free the cost data.  */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5a5b8da2e771a1dd204f22a6447eba96bb3b352c..1bfe2e4f989143f4415c6c5b4a0b902ef1e00d66 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -153,7 +153,8 @@ along with GCC; see the file COPYING3.  If not see
    http://gcc.gnu.org/projects/tree-ssa/vectorization.html
 */
 
-static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
+						unsigned *);
 static stmt_vec_info vect_is_simple_reduction (loop_vec_info, stmt_vec_info,
 					       bool *, bool *);
 
@@ -277,7 +278,8 @@ vect_determine_vf_for_stmt (vec_info *vinfo,
 */
 
 static opt_result
-vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
+vect_determine_vectorization_factor (loop_vec_info loop_vinfo,
+				     poly_uint64 min_vf)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
@@ -354,6 +356,28 @@ vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
         }
     }
 
+  /* Apply the suggested unrolling factor, this was determined by the backend
+     during finish_cost the first time we ran the analyzis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    {
+      poly_uint64 unrolled_vf
+	= vectorization_factor * loop_vinfo->suggested_unroll_factor;
+      unsigned HOST_WIDE_INT max_vf = estimated_poly_value (unrolled_vf,
+							    POLY_VALUE_MAX);
+      /* Make sure the unrolled vectorization factor fits the min and max
+	  vectorization factor.  */
+       if (max_vf <= LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo)
+	  && known_le (unrolled_vf, min_vf))
+	vectorization_factor = unrolled_vf;
+      else if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Unrolling failed due to unroll factor not fitting in"
+			 " range of min and max vectorization factor:"
+			 " [%d, %d]\n",
+			 min_vf, max_vf);
+    }
+
   /* TODO: Analyze cost. Decide if worth while to vectorize.  */
   if (dump_enabled_p ())
     {
@@ -828,6 +852,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    suggested_unroll_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1829,7 +1854,8 @@ vect_known_niters_smaller_than_vf (loop_vec_info loop_vinfo)
    definitely no, or -1 if it's worth retrying.  */
 
 static int
-vect_analyze_loop_costing (loop_vec_info loop_vinfo)
+vect_analyze_loop_costing (loop_vec_info loop_vinfo,
+			   unsigned *suggested_unroll_factor)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
@@ -1863,7 +1889,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
 
   int min_profitable_iters, min_profitable_estimate;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
+				      &min_profitable_estimate,
+				      suggested_unroll_factor);
 
   if (min_profitable_iters < 0)
     {
@@ -2128,10 +2155,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 vectors to the epilogue, with the main loop continuing to operate
 	 on full vectors.
 
+	 If we are unrolling we also do not want to use partial vectors. This
+	 is to avoid the overhead of generating multiple masks and also to
+	 avoid having to execute entire iterations of FALSE masked instructions
+	 when dealing with one or less full iterations.
+
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->suggested_unroll_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2198,13 +2231,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
    for it.  The different analyses will record information in the
    loop_vec_info struct.  */
 static opt_result
-vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
+vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts,
+		     unsigned *suggested_unroll_factor,
+		     poly_uint64 min_vf = 2)
 {
   opt_result ok = opt_result::success ();
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
-  poly_uint64 min_vf = 2;
   loop_vec_info orig_loop_vinfo = NULL;
+  if (*suggested_unroll_factor > 1)
+    max_vf = estimated_poly_value (min_vf, POLY_VALUE_MAX);
 
   /* If we are dealing with an epilogue then orig_loop_vinfo points to the
      loop_vec_info of the first vectorized loop.  */
@@ -2308,11 +2344,12 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
       return ok;
     }
   if (max_vf != MAX_VECTORIZATION_FACTOR
-      && maybe_lt (max_vf, min_vf))
+      && loop_vinfo->suggested_unroll_factor == 1
+      && max_vf < estimated_poly_value (min_vf, POLY_VALUE_MAX))
     return opt_result::failure_at (vect_location, "bad data dependence.\n");
   LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo) = max_vf;
 
-  ok = vect_determine_vectorization_factor (loop_vinfo);
+  ok = vect_determine_vectorization_factor (loop_vinfo, min_vf);
   if (!ok)
     {
       if (dump_enabled_p ())
@@ -2321,7 +2358,9 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
       return ok;
     }
   if (max_vf != MAX_VECTORIZATION_FACTOR
-      && maybe_lt (max_vf, LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
+      && loop_vinfo->suggested_unroll_factor == 1
+      && max_vf < estimated_poly_value (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+					POLY_VALUE_MAX))
     return opt_result::failure_at (vect_location, "bad data dependence.\n");
 
   /* Compute the scalar iteration cost.  */
@@ -2547,7 +2586,7 @@ start_over:
     return ok;
 
   /* Check the costings of the loop make vectorizing worthwhile.  */
-  res = vect_analyze_loop_costing (loop_vinfo);
+  res = vect_analyze_loop_costing (loop_vinfo, suggested_unroll_factor);
   if (res < 0)
     {
       ok = opt_result::failure_at (vect_location,
@@ -2879,6 +2918,122 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we can unroll this loop.  */
+
+static bool
+vect_can_unroll (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  if (known_le (vectorization_factor, 1U))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "will not unroll loop with a VF of 1 or less\n");
+      return false;
+    }
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  || !STMT_VINFO_RELEVANT_P (stmt_info)
+	  || stmt_info->vectype == NULL_TREE)
+	continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+	 vectorization will succeed due to the way we deal with negative steps
+	 in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+	  && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	{
+	  dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+	  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+	  if (TREE_CODE (step) == INTEGER_CST
+	      && tree_int_cst_compare (step, size_zero_node) < 0)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "could not unroll due to negative step\n");
+	      return false;
+	    }
+	}
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+	{
+	  auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+	  if (STMT_VINFO_REDUC_TYPE (red_info) != TREE_CODE_REDUCTION)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "could not unroll loop with reduction due to "
+				"non TREE_CODE_REDUCTION\n");
+	      return false;
+	    }
+	}
+    }
+
+  return true;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (loop_vec_info loop_vinfo, unsigned *n_stmts,
+		    unsigned suggested_unroll_factor)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  if (suggested_unroll_factor == 1)
+    return opt_loop_vec_info::failure_at (vect_location,
+					  "*** Target determined unrolling is"
+					  " not profitable.\n");
+
+  if (!vect_can_unroll (loop_vinfo))
+    return opt_loop_vec_info::failure_at (vect_location,
+					  "*** Can not unroll this loop.\n");
+
+  loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_analyze_loop_form (loop_vinfo->loop,
+							  loop_vinfo->shared));
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+
+  /* Use the suggested_unrolling_factor that was returned at the target's
+     TARGET_VECTORIZE_FINISH_COST hook.  */
+  unrolled_vinfo->suggested_unroll_factor = suggested_unroll_factor;
+  poly_uint64 unrolled_vf
+    = LOOP_VINFO_VECT_FACTOR (loop_vinfo) * suggested_unroll_factor;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "***** unrolling factor %d chosen for vector mode %s,"
+		     "re-trying analyzis...\n",
+		     suggested_unroll_factor,
+		     GET_MODE_NAME (unrolled_vinfo->vector_mode));
+  bool unrolling_fatal = false;
+  if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts,
+			   &suggested_unroll_factor,
+			   unrolled_vf)
+      && known_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+		   LOOP_VINFO_VECT_FACTOR (unrolled_vinfo)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "unrolling succeeded with factor = %d\n",
+			 suggested_unroll_factor);
+      unrolled_vinfo->loop->aux = NULL;
+      return opt_loop_vec_info::success (unrolled_vinfo);
+    }
+
+  loop_vinfo->loop->aux = NULL;
+  return opt_loop_vec_info::failure_at (vect_location,
+					"unrolling failed with factor = %d\n",
+					suggested_unroll_factor);
+}
+
 /* If LOOP_VINFO is already a main loop, return it unmodified.  Otherwise
    try to reanalyze it as a main loop.  Return the loop_vinfo on success
    and null on failure.  */
@@ -2902,8 +3057,16 @@ vect_reanalyze_as_main_loop (loop_vec_info loop_vinfo, unsigned int *n_stmts)
   main_loop_vinfo->vector_mode = loop_vinfo->vector_mode;
 
   bool fatal = false;
-  bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts);
+  unsigned suggested_unroll_factor = 1;
+  bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts,
+				  &suggested_unroll_factor);
   loop->aux = NULL;
+  opt_loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_try_unrolling (main_loop_vinfo, n_stmts,
+						      suggested_unroll_factor));
+  if (unrolled_vinfo)
+    main_loop_vinfo = unrolled_vinfo;
+
   if (!res)
     {
       if (dump_enabled_p ())
@@ -2960,6 +3123,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   bool vect_epilogues = false;
   opt_result res = opt_result::success ();
   unsigned HOST_WIDE_INT simdlen = loop->simdlen;
+  unsigned suggested_unroll_factor = 1;
   while (1)
     {
       /* Check the CFG characteristics of the loop (nesting, entry/exit).  */
@@ -3007,7 +3171,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       if (vect_epilogues)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
-      res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts);
+      res = vect_analyze_loop_2 (loop_vinfo, fatal, &n_stmts,
+				 &suggested_unroll_factor);
       if (mode_i == 0)
 	autodetected_vector_mode = loop_vinfo->vector_mode;
       if (dump_enabled_p ())
@@ -3038,6 +3203,18 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (res)
 	{
+	  /* Only try unrolling main loops.  */
+	  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	    {
+	      opt_loop_vec_info unrolled_vinfo =
+		vect_try_unrolling (loop_vinfo, &n_stmts,
+				    suggested_unroll_factor);
+	      if (unrolled_vinfo)
+		loop_vinfo = unrolled_vinfo;
+	      /* Reset suggested_unroll_factor for next loop_vinfo.  */
+	      suggested_unroll_factor = 1;
+	    }
+
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
 	  vectorized_loops++;
 
@@ -3056,13 +3233,26 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      /* Keep trying to roll back vectorization attempts while the
 		 loop_vec_infos they produced were worse than this one.  */
 	      vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
+	      poly_uint64 vinfo_vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+	      poly_uint64 first_vinfo_vf
+		= LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
 	      while (!vinfos.is_empty ()
+		     && (known_lt (vinfo_vf, first_vinfo_vf)
+			 || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+			     && maybe_eq (vinfo_vf, first_vinfo_vf)))
 		     && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
 		{
 		  gcc_assert (vect_epilogues);
 		  delete vinfos.pop ();
 		}
+	      /* Check if we may want to replace the current first_loop_vinfo
+		 with the new loop, but only if they have different vector
+		 modes.  If they have the same vector mode this means the main
+		 loop is an unrolled loop and we are trying to vectorize the
+		 epilogue using the same vector mode but with a lower
+		 vectorization factor.  */
 	      if (vinfos.is_empty ()
+		  && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
 		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
 		{
 		  loop_vec_info main_loop_vinfo
@@ -3105,14 +3295,34 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		   /* For now only allow one epilogue loop.  */
 		   && first_loop_vinfo->epilogue_vinfos.is_empty ())
 	    {
-	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
-	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-	      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
-			  || maybe_ne (lowest_th, 0U));
-	      /* Keep track of the known smallest versioning
-		 threshold.  */
-	      if (ordered_p (lowest_th, th))
-		lowest_th = ordered_min (lowest_th, th);
+	      /* Ensure the epilogue has a smaller VF than the main loop or
+		 uses predication and has the same VF.  */
+	      if (known_lt (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))
+		  || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+		      && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+				   LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))))
+		{
+		  first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
+		  poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		  gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			      || maybe_ne (lowest_th, 0U));
+		  /* Keep track of the known smallest versioning
+		     threshold.  */
+		  if (ordered_p (lowest_th, th))
+		    lowest_th = ordered_min (lowest_th, th);
+		}
+	      else
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				     "***** Will not use %s mode as an"
+				     " epilogue, since it leads to an higher"
+				     " vectorization factor than main loop\n",
+				     GET_MODE_NAME (loop_vinfo->vector_mode));
+		  delete loop_vinfo;
+		  loop_vinfo = opt_loop_vec_info::success (NULL);
+		}
 	    }
 	  else
 	    {
@@ -3153,13 +3363,32 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       /* Handle the case that the original loop can use partial
 	 vectorization, but want to only adopt it for the epilogue.
-	 The retry should be in the same mode as original.  */
+	 The retry should be in the same mode as original.
+	 Also handle the case where we have unrolled the main loop and want to
+	 retry all vector modes again for the epilogues, since the VF is now
+	 at least twice as high as the current vector mode.  */
       if (vect_epilogues
 	  && loop_vinfo
-	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	  && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      || loop_vinfo->suggested_unroll_factor > 1))
 	{
-	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		       || loop_vinfo->suggested_unroll_factor > 1)
 		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+	  if (loop_vinfo->suggested_unroll_factor > 1)
+	    {
+	      next_vector_mode = vector_modes[0];
+	      mode_i = 1;
+
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "***** Re-trying analysis with vector mode"
+				 " %s for epilogues after unrolling.\n",
+				 GET_MODE_NAME (next_vector_mode));
+	      continue;
+	    }
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "***** Re-trying analysis with same vector mode"
@@ -3862,7 +4091,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
 static void
 vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 				    int *ret_min_profitable_niters,
-				    int *ret_min_profitable_estimate)
+				    int *ret_min_profitable_estimate,
+				    unsigned *suggested_unroll_factor)
 {
   int min_profitable_iters;
   int min_profitable_estimate;
@@ -4222,8 +4452,9 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* Complete the target-specific cost calculations.  */
-  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost,
-	       &vec_inside_cost, &vec_epilogue_cost);
+  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+	       &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost,
+	       suggested_unroll_factor);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
 
@@ -7212,7 +7443,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 024a1c38a2342246d7891db1de5f1d6e6458d5dd..a8a6c6a19ed4c98144f9097467c59386fdbe8233 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5418,8 +5418,8 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
 	}
       while (vi < li_vector_costs.length ()
 	     && li_vector_costs[vi].first == vl);
-      finish_cost (vect_target_cost_data, &vec_prologue_cost,
-		   &vec_inside_cost, &vec_epilogue_cost);
+      finish_cost (vect_target_cost_data, &vec_prologue_cost, &vec_inside_cost,
+		   &vec_epilogue_cost);
       destroy_cost_data (vect_target_cost_data);
 
       vec_outside_cost = vec_prologue_cost + vec_epilogue_cost;
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index c4c5678e7f1abafc25c465319dbacf3ef50f0ae9..8b182cd34e7d6a8d9e55a9c1003900b8216a952f 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -621,6 +621,13 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that the target suggested we unroll the vector loop
+     in order to promote more ILP.  This value will be used to re-analyze the
+     loop for vectorization and if successful the value will be folded into
+     vectorization_factor (and therefore exactly divides
+     vectorization_factor).  */
+  unsigned int suggested_unroll_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1571,9 +1578,11 @@ add_stmt_cost (vec_info *vinfo, void *data, stmt_info_for_cost *i)
 
 static inline void
 finish_cost (void *data, unsigned *prologue_cost,
-	     unsigned *body_cost, unsigned *epilogue_cost)
+	     unsigned *body_cost, unsigned *epilogue_cost,
+	     unsigned *suggested_unroll_factor = NULL)
 {
-  targetm.vectorize.finish_cost (data, prologue_cost, body_cost, epilogue_cost);
+  targetm.vectorize.finish_cost (data, prologue_cost, body_cost, epilogue_cost,
+				 suggested_unroll_factor);
 }
 
 /* Alias targetm.vectorize.destroy_cost_data.  */
Richard Biener Oct. 21, 2021, 12:14 p.m. UTC | #10
On Wed, 20 Oct 2021, Andre Vieira (lists) wrote:

> On 15/10/2021 09:48, Richard Biener wrote:
> > On Tue, 12 Oct 2021, Andre Vieira (lists) wrote:
> >
> >> Hi Richi,
> >>
> >> I think this is what you meant, I now hide all the unrolling cost
> >> calculations
> >> in the existing target hooks for costs. I did need to adjust 'finish_cost'
> >> to
> >> take the loop_vinfo so the target's implementations are able to set the
> >> newly
> >> renamed 'suggested_unroll_factor'.
> >>
> >> Also added the checks for the epilogue's VF.
> >>
> >> Is this more like what you had in mind?
> > Not exactly (sorry..).  For the target hook I think we don't want to
> > pass vec_info but instead another output parameter like the existing
> > ones.
> >
> > vect_estimate_min_profitable_iters should then via
> > vect_analyze_loop_costing and vect_analyze_loop_2 report the unroll
> > suggestion to vect_analyze_loop which should then, if the suggestion
> > was > 1, instead of iterating to the next vector mode run again
> > with a fixed VF (old VF times suggested unroll factor - there's
> > min_vf in vect_analyze_loop_2 which we should adjust to
> > the old VF times two for example and maybe store the suggested
> > factor as hint) - if it succeeds the result will end up in the
> > list of considered modes (where we now may have more than one
> > entry for the same mode but a different VF), we probably want to
> > only consider more unrolling once.
> >
> > For simplicity I'd probably set min_vf = max_vf = old VF * suggested
> > factor, thus take the targets request literally.
> >
> > Richard.
> 
> Hi,
> 
> I now pass an output parameter to finish_costs and route it through the
> various calls up to vect_analyze_loop.  I tried to rework
> vect_determine_vectorization_factor and noticed that merely setting min_vf and
> max_vf is not enough, we only use these to check whether the vectorization
> factor is within range, well actually we only use max_vf at that stage. We
> only seem to use 'min_vf' to make sure the data_references are valid.  I am
> not sure my changes are the most appropriate here, for instance I am pretty
> sure the checks for max and min vf I added in
> vect_determine_vectorization_factor are currently superfluous as they will
> pass by design, but thought they might be good future proofing?

Ah, ok - well, max_vf is determined by dependence analysis so we do
have to check that.  I think that

+         && known_le (unrolled_vf, min_vf))

is superfluous.  So if we use min_vf as passed to vect_analyze_loop_2
to carry the suggested unroll factor then the changes look reasonable.

What's

   if (max_vf != MAX_VECTORIZATION_FACTOR
-      && maybe_lt (max_vf, min_vf))
+      && loop_vinfo->suggested_unroll_factor == 1
+      && max_vf < estimated_poly_value (min_vf, POLY_VALUE_MAX))
     return opt_result::failure_at (vect_location, "bad data 
dependence.\n");
   LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo) = max_vf;

supposed to be though?  Why can we allow the unroll case to
get past this point?  I suppose initializing max_vf to
MAX_VECTORIZATION_FACTOR doesn't play well with poly-ints?
That is, how does estimated_poly_value (min_vf, POLY_VALUE_MAX)
differ here?  I would have expected that we can drop the
max_vf != MAX_VECTORIZATION_FACTOR part but otherwise leave
the test the same?

Note you adjust the vectorization factor in 
vect_determine_vectorization_factor but I think you have to delay
that until after vect_update_vf_for_slp which means doing it
before the start_over: label.

@@ -3038,6 +3203,18 @@ vect_analyze_loop (class loop *loop, 
vec_info_shared *shared)

       if (res)
        {
+         /* Only try unrolling main loops.  */
+         if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+           {
+             opt_loop_vec_info unrolled_vinfo =
+               vect_try_unrolling (loop_vinfo, &n_stmts,
+                                   suggested_unroll_factor);
+             if (unrolled_vinfo)
+               loop_vinfo = unrolled_vinfo;
+             /* Reset suggested_unroll_factor for next loop_vinfo.  */
+             suggested_unroll_factor = 1;
+           }

so it looks like this eventually will leak 'loop_vinfo' but it
also looks like you throw away the not unrolled analysis and costing.
I was suggesting to simply add the unrolled variant to the
alternatives considered - the flow of vect_analyze_loop is already
quite complicated and the changes following this hunk do not help :/
(and I'm not sure what vect_reanalyze_as_main_loop is supposed to do
or why we need to consider unrolling there as well).  If we
don't want to globally consider the unrolled variant maybe
we can at least decide between the unrolled and not unrolled
variant in vect_try_unrolling?  But of course only the whole
series, (unrolled) main + vectorized epilogues, determine the
final cost.

That said, the overall flow is OK now, some details about the
max_vf check and where to compute the unrolled VF needs to be
fleshed out.  And then there's the main analysis loop which,
frankly, is a mess right now, even before your patch :/

I'm thinking of rewriting the analysis loop in vect_analyze_loop
to use a worklist initially seeded by the vector_modes[] but
that we can push things like as-main-loop, unrolled and
epilogue analysis to.  Maybe have the worklist specify
pairs of mode and kind or tuples of mode, min-VF and kind where
'kind' is as-main/epilogue/unroll (though maybe 'kind' is
redundant there).  Possibly as preparatory step.

> Also I changed how we compare against max_vf, rather than relying on the
> 'MAX_VECTORIZATION' I decided to use the estimated_poly_value with
> POLY_VALUE_MAX, to be able to bound it further in case we have knowledge of
> the VL. I am not entirely about the validity of this change, maybe we are
> better off keeping the MAX_VECTORIZATION in place and not making any changes
> to max_vf for unrolling.

Yeah, I guess that would simplify things.  The only bit we really
want to make sure is that we don't re-analyze with exactly the same
VF, so

+  if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts,
+                          &suggested_unroll_factor,
+                          unrolled_vf)
+      && known_ne (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+                  LOOP_VINFO_VECT_FACTOR (unrolled_vinfo)))

the VF check here should be always true.  I was initially considering
to take a target hit of unrolling by 4 as to also allow two times
unrolling if max_vf constrains us this way but of course never unroll
more than 4 times.  But we can play with adjustments here later
(also considering #pragma unroll, etc.)

Richard, do you have any ideas on how to improve the iteration
mess we have?

Richard.
Richard Sandiford Oct. 22, 2021, 10:12 a.m. UTC | #11
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> On 15/10/2021 09:48, Richard Biener wrote:
>> On Tue, 12 Oct 2021, Andre Vieira (lists) wrote:
>>
>>> Hi Richi,
>>>
>>> I think this is what you meant, I now hide all the unrolling cost calculations
>>> in the existing target hooks for costs. I did need to adjust 'finish_cost' to
>>> take the loop_vinfo so the target's implementations are able to set the newly
>>> renamed 'suggested_unroll_factor'.
>>>
>>> Also added the checks for the epilogue's VF.
>>>
>>> Is this more like what you had in mind?
>> Not exactly (sorry..).  For the target hook I think we don't want to
>> pass vec_info but instead another output parameter like the existing
>> ones.
>>
>> vect_estimate_min_profitable_iters should then via
>> vect_analyze_loop_costing and vect_analyze_loop_2 report the unroll
>> suggestion to vect_analyze_loop which should then, if the suggestion
>> was > 1, instead of iterating to the next vector mode run again
>> with a fixed VF (old VF times suggested unroll factor - there's
>> min_vf in vect_analyze_loop_2 which we should adjust to
>> the old VF times two for example and maybe store the suggested
>> factor as hint) - if it succeeds the result will end up in the
>> list of considered modes (where we now may have more than one
>> entry for the same mode but a different VF), we probably want to
>> only consider more unrolling once.
>>
>> For simplicity I'd probably set min_vf = max_vf = old VF * suggested
>> factor, thus take the targets request literally.
>>
>> Richard.
>
> Hi,
>
> I now pass an output parameter to finish_costs and route it through the 
> various calls up to vect_analyze_loop.  I tried to rework 
> vect_determine_vectorization_factor and noticed that merely setting 
> min_vf and max_vf is not enough, we only use these to check whether the 
> vectorization factor is within range, well actually we only use max_vf 
> at that stage. We only seem to use 'min_vf' to make sure the 
> data_references are valid.  I am not sure my changes are the most 
> appropriate here, for instance I am pretty sure the checks for max and 
> min vf I added in vect_determine_vectorization_factor are currently 
> superfluous as they will pass by design, but thought they might be good 
> future proofing?
>
> Also I changed how we compare against max_vf, rather than relying on the 
> 'MAX_VECTORIZATION' I decided to use the estimated_poly_value with 
> POLY_VALUE_MAX, to be able to bound it further in case we have knowledge 
> of the VL. I am not entirely about the validity of this change, maybe we 
> are better off keeping the MAX_VECTORIZATION in place and not making any 
> changes to max_vf for unrolling.

Yeah, estimated_poly_value is just an estimate (even for POLY_VALUE_MAX)
rather than a guarantee.  We can't rely on it for correctness.

Richard
Richard Sandiford Oct. 22, 2021, 10:18 a.m. UTC | #12
Richard Biener <rguenther@suse.de> writes:
> That said, the overall flow is OK now, some details about the
> max_vf check and where to compute the unrolled VF needs to be
> fleshed out.  And then there's the main analysis loop which,
> frankly, is a mess right now, even before your patch :/

Yeah, the loop is certainly ripe for a rewrite :-)

> I'm thinking of rewriting the analysis loop in vect_analyze_loop
> to use a worklist initially seeded by the vector_modes[] but
> that we can push things like as-main-loop, unrolled and
> epilogue analysis to.  Maybe have the worklist specify
> pairs of mode and kind or tuples of mode, min-VF and kind where
> 'kind' is as-main/epilogue/unroll (though maybe 'kind' is
> redundant there).  Possibly as preparatory step.

Sounds good.  I think we can also drop some of the complexity if
we're prepared to analyse candidate replacements for the main
loop separately from candidate epilogue loops (even if the two
candidates have the same mode and VF, meaning that a lot of work
would be repeated).

Thanks,
Richard
Andre Vieira (lists) Nov. 11, 2021, 4:02 p.m. UTC | #13
Hi,

This is the rebased and reworked version of the unroll patch.  I wasn't 
entirely sure whether I should compare the costs of the unrolled 
loop_vinfo with the original loop_vinfo it was unrolled of. I did now, 
but I wasn't too sure whether it was a good idea to... Any thoughts on 
this?

Regards,

Andre


gcc/ChangeLog:

         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add 
suggested_unroll_factor parameter.
         (vect_analyze_loop_costing): Likewise.
         (vect_determine_partial_vectors_and_peeling): Don't mask an 
unrolled loop.
         (vect_analyze_loop_2): Support unrolling of loops.
         (vect_can_unroll): New function.
         (vect_try_unrolling): New function.
         (vect_analyze_loop_1): Add suggested_unroll_factor parameter 
and use it.
         (vect_analyze_loop): Call vect_try_unrolling when unrolling 
suggested.
         (vectorizable_reduction): Don't single_defuse_cycle when unrolling.
         * tree-vectorizer.h (_loop_vec_info::_loop_vec_info):  Add 
suggested_unroll_factor member.
         (vector_costs::vector_costs): Add m_suggested_unroll_factor member.
         (vector_costs::suggested_unroll_factor): New getter.
         (finish_cost): Add suggested_unroll_factor out parameter and 
set it.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index a28bb6321d76b8222bc8cfdade151ca9b4dca406..cfce7de0430c852d37f1a93e2d6a2f630694f613 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -153,7 +153,8 @@ along with GCC; see the file COPYING3.  If not see
    http://gcc.gnu.org/projects/tree-ssa/vectorization.html
 */
 
-static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
+						unsigned *);
 static stmt_vec_info vect_is_simple_reduction (loop_vec_info, stmt_vec_info,
 					       bool *, bool *);
 
@@ -828,6 +829,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    suggested_unroll_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1811,7 +1813,8 @@ vect_known_niters_smaller_than_vf (loop_vec_info loop_vinfo)
    definitely no, or -1 if it's worth retrying.  */
 
 static int
-vect_analyze_loop_costing (loop_vec_info loop_vinfo)
+vect_analyze_loop_costing (loop_vec_info loop_vinfo,
+			   unsigned *suggested_unroll_factor)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
@@ -1845,7 +1848,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
 
   int min_profitable_iters, min_profitable_estimate;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
+				      &min_profitable_estimate,
+				      suggested_unroll_factor);
 
   if (min_profitable_iters < 0)
     {
@@ -2129,10 +2133,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 vectors to the epilogue, with the main loop continuing to operate
 	 on full vectors.
 
+	 If we are unrolling we also do not want to use partial vectors. This
+	 is to avoid the overhead of generating multiple masks and also to
+	 avoid having to execute entire iterations of FALSE masked instructions
+	 when dealing with one or less full iterations.
+
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->suggested_unroll_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2199,12 +2209,12 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
    for it.  The different analyses will record information in the
    loop_vec_info struct.  */
 static opt_result
-vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
+vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
+		     unsigned *suggested_unroll_factor, poly_uint64 min_vf = 2)
 {
   opt_result ok = opt_result::success ();
   int res;
   unsigned int max_vf = MAX_VECTORIZATION_FACTOR;
-  poly_uint64 min_vf = 2;
   loop_vec_info orig_loop_vinfo = NULL;
 
   /* If we are dealing with an epilogue then orig_loop_vinfo points to the
@@ -2359,6 +2369,26 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
      set of rgroups.  */
   gcc_assert (LOOP_VINFO_MASKS (loop_vinfo).is_empty ());
 
+  /* Apply the suggested unrolling factor, this was determined by the backend
+     during finish_cost the first time we ran the analyzis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    {
+      poly_uint64 unrolled_vf
+	= LOOP_VINFO_VECT_FACTOR (loop_vinfo) * loop_vinfo->suggested_unroll_factor;
+      /* Make sure the unrolled vectorization factor is less than the max
+	  vectorization factor.  */
+      unsigned HOST_WIDE_INT max_vf = LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo);
+      if (max_vf == MAX_VECTORIZATION_FACTOR || known_le (unrolled_vf, max_vf))
+	LOOP_VINFO_VECT_FACTOR (loop_vinfo) = unrolled_vf;
+      else
+	return opt_result::failure_at (vect_location,
+				       "unrolling failed: unrolled"
+				       " vectorization factor larger than"
+				       " maximum vectorization factor: %d\n",
+				       LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo));
+    }
+
   /* This is the point where we can re-start analysis with SLP forced off.  */
 start_over:
 
@@ -2550,7 +2580,7 @@ start_over:
     return ok;
 
   /* Check the costings of the loop make vectorizing worthwhile.  */
-  res = vect_analyze_loop_costing (loop_vinfo);
+  res = vect_analyze_loop_costing (loop_vinfo, suggested_unroll_factor);
   if (res < 0)
     {
       ok = opt_result::failure_at (vect_location,
@@ -2940,6 +2970,115 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we can unroll this loop.  */
+
+static bool
+vect_can_unroll (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  if (known_le (vectorization_factor, 1U))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "will not unroll loop with a VF of 1 or less\n");
+      return false;
+    }
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  || !STMT_VINFO_RELEVANT_P (stmt_info)
+	  || stmt_info->vectype == NULL_TREE)
+	continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+	 vectorization will succeed due to the way we deal with negative steps
+	 in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+	  && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	{
+	  dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+	  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+	  if (TREE_CODE (step) == INTEGER_CST
+	      && tree_int_cst_compare (step, size_zero_node) < 0)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "could not unroll due to negative step\n");
+	      return false;
+	    }
+	}
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+	{
+	  auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+	  if (STMT_VINFO_REDUC_TYPE (red_info) != TREE_CODE_REDUCTION)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "could not unroll loop with reduction due to "
+				 "non TREE_CODE_REDUCTION\n");
+	      return false;
+	    }
+	}
+    }
+
+  return true;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (loop_vec_info loop_vinfo,
+		    const vect_loop_form_info *loop_form_info,
+		    unsigned suggested_unroll_factor)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  if (!vect_can_unroll (loop_vinfo))
+    return opt_loop_vec_info::failure_at (vect_location,
+					  "*** Can not unroll this loop.\n");
+
+  loop_vec_info unrolled_vinfo
+    = vect_create_loop_vinfo (loop_vinfo->loop, loop_vinfo->shared,
+			      loop_form_info, NULL);
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+
+  /* Use the suggested_unrolling_factor that was returned at the target's
+     TARGET_VECTORIZE_FINISH_COST hook.  */
+  unrolled_vinfo->suggested_unroll_factor = suggested_unroll_factor;
+  poly_uint64 unrolled_vf
+    = LOOP_VINFO_VECT_FACTOR (loop_vinfo) * suggested_unroll_factor;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "***** unrolling factor %d chosen for vector mode %s,"
+		     "re-trying analyzis...\n",
+		     suggested_unroll_factor,
+		     GET_MODE_NAME (unrolled_vinfo->vector_mode));
+  bool unrolling_fatal = false;
+  if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal,
+			   &suggested_unroll_factor, unrolled_vf))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "unrolling succeeded with factor = %d\n",
+			 suggested_unroll_factor);
+      unrolled_vinfo->loop->aux = NULL;
+      return opt_loop_vec_info::success (unrolled_vinfo);
+    }
+
+  loop_vinfo->loop->aux = NULL;
+  return opt_loop_vec_info::failure_at (vect_location,
+					"unrolling failed with factor = %d\n",
+					suggested_unroll_factor);
+}
+
 /* Analyze LOOP with VECTOR_MODES[MODE_I] and as epilogue if MAIN_LOOP_VINFO is
    not NULL.  Set AUTODETECTED_VECTOR_MODE if VOIDmode and advance
    MODE_I to the next mode useful to analyze.
@@ -2951,7 +3090,7 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
 		     loop_vec_info main_loop_vinfo,
 		     const vector_modes &vector_modes, unsigned &mode_i,
 		     machine_mode &autodetected_vector_mode,
-		     bool &fatal)
+		     bool &fatal, unsigned int *suggested_unroll_factor = NULL)
 {
   loop_vec_info loop_vinfo
     = vect_create_loop_vinfo (loop, shared, loop_form_info, main_loop_vinfo);
@@ -2960,7 +3099,8 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
   loop_vinfo->vector_mode = vector_mode;
 
   /* Run the main analysis.  */
-  opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal);
+  opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal,
+					suggested_unroll_factor);
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "***** Analysis %s with vector mode %s\n",
@@ -3081,15 +3221,40 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
     {
       unsigned int loop_vinfo_i = mode_i;
       bool fatal;
+      unsigned int suggested_unroll_factor = 1;
       opt_loop_vec_info loop_vinfo
 	= vect_analyze_loop_1 (loop, shared, &loop_form_info,
 			       NULL, vector_modes, mode_i,
-			       autodetected_vector_mode, fatal);
+			       autodetected_vector_mode, fatal,
+			       &suggested_unroll_factor);
       if (fatal)
 	break;
 
       if (loop_vinfo)
 	{
+	  if (suggested_unroll_factor > 1)
+	    {
+	      opt_loop_vec_info unrolled_loop_vinfo
+		= vect_try_unrolling (loop_vinfo, &loop_form_info,
+				      suggested_unroll_factor);
+	      if (unrolled_loop_vinfo)
+		{
+		  if (vect_joust_loop_vinfos (unrolled_loop_vinfo, loop_vinfo))
+		    {
+		      delete loop_vinfo;
+		      loop_vinfo = unrolled_loop_vinfo;
+		    }
+		  else
+		    {
+		      if (dump_enabled_p ())
+			dump_printf_loc (MSG_NOTE, vect_location,
+					 "***** Original loop chosen over"
+					 " unrolled due to costs.");
+		      delete unrolled_loop_vinfo;
+		    }
+		}
+	    }
+
 	  /* Once we hit the desired simdlen for the first time,
 	     discard any previous attempts.  */
 	  if (simdlen
@@ -3158,10 +3323,33 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   /* Now analyze first_loop_vinfo for epilogue vectorization.  */
   poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 
+  if (first_loop_vinfo->suggested_unroll_factor > 1)
+    {
+      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with first vector mode"
+			     " %s for epilogue with partial vectors of"
+			     " unrolled first loop.\n",
+			     GET_MODE_NAME (vector_modes[0]));
+	  mode_i = 0;
+	}
+      else
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue of unrolled first loop.\n",
+			     GET_MODE_NAME (first_loop_vinfo->vector_mode));
+	  mode_i = first_loop_i;
+	}
+    }
+
   /* Handle the case that the original loop can use partial
      vectorization, but want to only adopt it for the epilogue.
      The retry should be in the same mode as original.  */
-  if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
+  else if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
     {
       gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo)
 		  && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo));
@@ -3182,6 +3370,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   /* ???  If first_loop_vinfo was using VOIDmode then we probably
      want to instead search for the corresponding mode in vector_modes[].  */
 
+  poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
   while (1)
     {
       bool fatal;
@@ -3193,6 +3382,22 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       if (fatal)
 	break;
 
+      if (loop_vinfo
+	  && known_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), first_vinfo_vf)
+	  && !(LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	       && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    first_vinfo_vf)))
+	{
+	    if (dump_enabled_p ())
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "***** Will not use %s mode as an"
+			       " epilogue, since it leads to an higher"
+			       " vectorization factor than main loop\n",
+			       GET_MODE_NAME (loop_vinfo->vector_mode));
+	    delete loop_vinfo;
+	    loop_vinfo = opt_loop_vec_info::success (NULL);
+	}
+
       if (loop_vinfo)
 	{
 	  if (pick_lowest_cost_p)
@@ -3905,7 +4110,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
 static void
 vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 				    int *ret_min_profitable_niters,
-				    int *ret_min_profitable_estimate)
+				    int *ret_min_profitable_estimate,
+				    unsigned *suggested_unroll_factor)
 {
   int min_profitable_iters;
   int min_profitable_estimate;
@@ -4265,8 +4471,9 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* Complete the target-specific cost calculations.  */
-  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost,
-	       &vec_inside_cost, &vec_epilogue_cost);
+  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+	       &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost,
+	       suggested_unroll_factor);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
 
@@ -7255,7 +7462,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index b552e9dccce5bce6a3bbcf5d531e7ccefa719b9a..238e0b4bf21871c518da73a79320f34e55c9201c 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -624,6 +624,13 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that the target suggested we unroll the vector loop
+     in order to promote more ILP.  This value will be used to re-analyze the
+     loop for vectorization and if successful the value will be folded into
+     vectorization_factor (and therefore exactly divides
+     vectorization_factor).  */
+  unsigned int suggested_unroll_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1430,6 +1437,7 @@ public:
   unsigned int prologue_cost () const;
   unsigned int body_cost () const;
   unsigned int epilogue_cost () const;
+  unsigned int suggested_unroll_factor () const;
 
 protected:
   unsigned int record_stmt_cost (stmt_vec_info, vect_cost_model_location,
@@ -1447,6 +1455,9 @@ protected:
   /* The costs of the three regions, indexed by vect_cost_model_location.  */
   unsigned int m_costs[3];
 
+  /* The suggested unrolling factor determined at finish_cost.  */
+  unsigned int m_suggested_unroll_factor;
+
   /* True if finish_cost has been called.  */
   bool m_finished;
 };
@@ -1459,6 +1470,7 @@ vector_costs::vector_costs (vec_info *vinfo, bool costing_for_scalar)
   : m_vinfo (vinfo),
     m_costing_for_scalar (costing_for_scalar),
     m_costs (),
+    m_suggested_unroll_factor(1),
     m_finished (false)
 {
 }
@@ -1490,6 +1502,15 @@ vector_costs::epilogue_cost () const
   return m_costs[vect_epilogue];
 }
 
+/* Return the suggested unroll factor.  */
+
+inline unsigned int
+vector_costs::suggested_unroll_factor () const
+{
+  gcc_checking_assert (m_finished);
+  return m_suggested_unroll_factor;
+}
+
 #define VECT_MAX_COST 1000
 
 /* The maximum number of intermediate steps required in multi-step type
@@ -1665,12 +1686,15 @@ add_stmt_cost (vector_costs *costs, stmt_info_for_cost *i)
 
 static inline void
 finish_cost (vector_costs *costs, unsigned *prologue_cost,
-	     unsigned *body_cost, unsigned *epilogue_cost)
+	     unsigned *body_cost, unsigned *epilogue_cost,
+	     unsigned *suggested_unroll_factor = NULL)
 {
   costs->finish_cost ();
   *prologue_cost = costs->prologue_cost ();
   *body_cost = costs->body_cost ();
   *epilogue_cost = costs->epilogue_cost ();
+  if (suggested_unroll_factor)
+    *suggested_unroll_factor = costs->suggested_unroll_factor ();
 }
 
 inline void
Richard Biener Nov. 12, 2021, 1:12 p.m. UTC | #14
On Thu, 11 Nov 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> This is the rebased and reworked version of the unroll patch.  I wasn't
> entirely sure whether I should compare the costs of the unrolled loop_vinfo
> with the original loop_vinfo it was unrolled of. I did now, but I wasn't too
> sure whether it was a good idea to... Any thoughts on this?

+  /* Apply the suggested unrolling factor, this was determined by the 
backend
+     during finish_cost the first time we ran the analyzis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    {
+      poly_uint64 unrolled_vf
+       = LOOP_VINFO_VECT_FACTOR (loop_vinfo) * 
loop_vinfo->suggested_unroll_factor;
+      /* Make sure the unrolled vectorization factor is less than the max
+         vectorization factor.  */
+      unsigned HOST_WIDE_INT max_vf = LOOP_VINFO_MAX_VECT_FACTOR 
(loop_vinfo);
+      if (max_vf == MAX_VECTORIZATION_FACTOR || known_le (unrolled_vf, 
max_vf))
+       LOOP_VINFO_VECT_FACTOR (loop_vinfo) = unrolled_vf;
+      else
+       return opt_result::failure_at (vect_location,
+                                      "unrolling failed: unrolled"
+                                      " vectorization factor larger than"
+                                      " maximum vectorization factor: 
%d\n",
+                                      LOOP_VINFO_MAX_VECT_FACTOR 
(loop_vinfo));
+    }
+
   /* This is the point where we can re-start analysis with SLP forced 
off.  */
 start_over:

So we're honoring suggested_unroll_factor here but you still have the
now unused hunk

+vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
+                    unsigned *suggested_unroll_factor, poly_uint64 min_vf 
= 2)
 {

I also wonder whether vect_analyze_loop_2 could at least prune
suggested_unroll_factor as set by vect_analyze_loop_costing with its
knowledge of max_vf itself?  That would avoid using the at the moment
unused LOOP_VINFO_MAX_VECT_FACTOR?

I think all the things you do in vect_can_unroll should be figured
out with the re-analysis, and I'd just amend vect_analyze_loop_1
with a suggested unroll factor parameter like it has main_loop_vinfo
for the epilogue case.  The main loop adjustment would the be in the

          if (first_loop_vinfo == NULL)
            {
              first_loop_vinfo = loop_vinfo;
              first_loop_i = loop_vinfo_i;
              first_loop_next_i = mode_i;
            }

spot only, adding

if (loop_vinfo->suggested_unroll_factor != 1)
  {
    suggested_unroll_factor = loop_vinfo->suggested_unroll_factor;
    mode_i = first_loop_i;
    if (dump)
      dump_print ("Trying unrolling by %d\n");
    continue;
  }

and a reset of suggested_unroll_factor after the vect_analyze_loop_1
call?  (that's basically pushing another analysis case to the
poor-mans "worklist")

Richard.

> Regards,
> 
> Andre
> 
> 
> gcc/ChangeLog:
> 
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add
> suggested_unroll_factor parameter.
>         (vect_analyze_loop_costing): Likewise.
>         (vect_determine_partial_vectors_and_peeling): Don't mask an 
> unrolled loop.
>         (vect_analyze_loop_2): Support unrolling of loops.
>         (vect_can_unroll): New function.
>         (vect_try_unrolling): New function.
>         (vect_analyze_loop_1): Add suggested_unroll_factor parameter 
> and use it.
>         (vect_analyze_loop): Call vect_try_unrolling when unrolling suggested.
>         (vectorizable_reduction): Don't single_defuse_cycle when unrolling.
>         * tree-vectorizer.h (_loop_vec_info::_loop_vec_info):  Add 
> suggested_unroll_factor member.
>         (vector_costs::vector_costs): Add m_suggested_unroll_factor member.
>         (vector_costs::suggested_unroll_factor): New getter.
>         (finish_cost): Add suggested_unroll_factor out parameter and 
> set it.
> 
>
Andre Vieira (lists) Nov. 22, 2021, 11:41 a.m. UTC | #15
On 12/11/2021 13:12, Richard Biener wrote:
> On Thu, 11 Nov 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>> This is the rebased and reworked version of the unroll patch.  I wasn't
>> entirely sure whether I should compare the costs of the unrolled loop_vinfo
>> with the original loop_vinfo it was unrolled of. I did now, but I wasn't too
>> sure whether it was a good idea to... Any thoughts on this?
> +  /* Apply the suggested unrolling factor, this was determined by the
> backend
> +     during finish_cost the first time we ran the analyzis for this
> +     vector mode.  */
> +  if (loop_vinfo->suggested_unroll_factor > 1)
> +    {
> +      poly_uint64 unrolled_vf
> +       = LOOP_VINFO_VECT_FACTOR (loop_vinfo) *
> loop_vinfo->suggested_unroll_factor;
> +      /* Make sure the unrolled vectorization factor is less than the max
> +         vectorization factor.  */
> +      unsigned HOST_WIDE_INT max_vf = LOOP_VINFO_MAX_VECT_FACTOR
> (loop_vinfo);
> +      if (max_vf == MAX_VECTORIZATION_FACTOR || known_le (unrolled_vf,
> max_vf))
> +       LOOP_VINFO_VECT_FACTOR (loop_vinfo) = unrolled_vf;
> +      else
> +       return opt_result::failure_at (vect_location,
> +                                      "unrolling failed: unrolled"
> +                                      " vectorization factor larger than"
> +                                      " maximum vectorization factor:
> %d\n",
> +                                      LOOP_VINFO_MAX_VECT_FACTOR
> (loop_vinfo));
> +    }
> +
>     /* This is the point where we can re-start analysis with SLP forced
> off.  */
>   start_over:
>
> So we're honoring suggested_unroll_factor here but you still have the
> now unused hunk
>
> +vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
> +                    unsigned *suggested_unroll_factor, poly_uint64 min_vf
> = 2)
>   {
>
> I also wonder whether vect_analyze_loop_2 could at least prune
> suggested_unroll_factor as set by vect_analyze_loop_costing with its
> knowledge of max_vf itself?  That would avoid using the at the moment
> unused LOOP_VINFO_MAX_VECT_FACTOR?
>
> I think all the things you do in vect_can_unroll should be figured
> out with the re-analysis, and I'd just amend vect_analyze_loop_1
> with a suggested unroll factor parameter like it has main_loop_vinfo
> for the epilogue case.  The main loop adjustment would the be in the
>
>            if (first_loop_vinfo == NULL)
>              {
>                first_loop_vinfo = loop_vinfo;
>                first_loop_i = loop_vinfo_i;
>                first_loop_next_i = mode_i;
>              }
Sounds good.
>
> spot only, adding
>
> if (loop_vinfo->suggested_unroll_factor != 1)
>    {
>      suggested_unroll_factor = loop_vinfo->suggested_unroll_factor;
>      mode_i = first_loop_i;
>      if (dump)
>        dump_print ("Trying unrolling by %d\n");
>      continue;
>    }
Not quite like this because of how we need to keep the suggestion given 
at finish_costs, put into suggested_unroll_factor, separate from how we 
tell vect_analyze_loop_1 that we are now vectorizing an unrolled vector 
loop, which we do by writing to loop_vinfo->suggested_unroll_factor. 
Perhaps I should renamed the latter, to avoid confusion? Let me know if 
you think that would help and in the mean-time this is what the patch 
looks like now. I'll follow up with a ChangeLog when we settle on the 
name & structure.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index a28bb6321d76b8222bc8cfdade151ca9b4dca406..c84f1df9cd9a1325135defcbe1d101642a867373 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -153,7 +153,8 @@ along with GCC; see the file COPYING3.  If not see
    http://gcc.gnu.org/projects/tree-ssa/vectorization.html
 */
 
-static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *);
+static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *,
+						unsigned *);
 static stmt_vec_info vect_is_simple_reduction (loop_vec_info, stmt_vec_info,
 					       bool *, bool *);
 
@@ -828,6 +829,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    suggested_unroll_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1811,7 +1813,8 @@ vect_known_niters_smaller_than_vf (loop_vec_info loop_vinfo)
    definitely no, or -1 if it's worth retrying.  */
 
 static int
-vect_analyze_loop_costing (loop_vec_info loop_vinfo)
+vect_analyze_loop_costing (loop_vec_info loop_vinfo,
+			   unsigned *suggested_unroll_factor)
 {
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
@@ -1845,7 +1848,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
 
   int min_profitable_iters, min_profitable_estimate;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
-				      &min_profitable_estimate);
+				      &min_profitable_estimate,
+				      suggested_unroll_factor);
 
   if (min_profitable_iters < 0)
     {
@@ -2129,10 +2133,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 vectors to the epilogue, with the main loop continuing to operate
 	 on full vectors.
 
+	 If we are unrolling we also do not want to use partial vectors. This
+	 is to avoid the overhead of generating multiple masks and also to
+	 avoid having to execute entire iterations of FALSE masked instructions
+	 when dealing with one or less full iterations.
+
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->suggested_unroll_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2199,7 +2209,8 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
    for it.  The different analyses will record information in the
    loop_vec_info struct.  */
 static opt_result
-vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
+vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
+		     unsigned *suggested_unroll_factor)
 {
   opt_result ok = opt_result::success ();
   int res;
@@ -2359,6 +2370,12 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal)
      set of rgroups.  */
   gcc_assert (LOOP_VINFO_MASKS (loop_vinfo).is_empty ());
 
+  /* Apply the suggested unrolling factor, this was determined by the backend
+     during finish_cost the first time we ran the analyzis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    LOOP_VINFO_VECT_FACTOR (loop_vinfo) *= loop_vinfo->suggested_unroll_factor;
+
   /* This is the point where we can re-start analysis with SLP forced off.  */
 start_over:
 
@@ -2550,7 +2567,7 @@ start_over:
     return ok;
 
   /* Check the costings of the loop make vectorizing worthwhile.  */
-  res = vect_analyze_loop_costing (loop_vinfo);
+  res = vect_analyze_loop_costing (loop_vinfo, suggested_unroll_factor);
   if (res < 0)
     {
       ok = opt_result::failure_at (vect_location,
@@ -2951,7 +2968,7 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
 		     loop_vec_info main_loop_vinfo,
 		     const vector_modes &vector_modes, unsigned &mode_i,
 		     machine_mode &autodetected_vector_mode,
-		     bool &fatal)
+		     bool &fatal, unsigned int *suggested_unroll_factor = NULL)
 {
   loop_vec_info loop_vinfo
     = vect_create_loop_vinfo (loop, shared, loop_form_info, main_loop_vinfo);
@@ -2959,8 +2976,18 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
   machine_mode vector_mode = vector_modes[mode_i];
   loop_vinfo->vector_mode = vector_mode;
 
+  /* Don't ask for a suggested unroll factor for an already unrolled loop
+     vinfo and reset the value for the next analysis.  */
+  if (suggested_unroll_factor && *suggested_unroll_factor > 1)
+    {
+      loop_vinfo->suggested_unroll_factor = *suggested_unroll_factor;
+      *suggested_unroll_factor = 1;
+      suggested_unroll_factor = NULL;
+    }
+
   /* Run the main analysis.  */
-  opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal);
+  opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal,
+					suggested_unroll_factor);
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "***** Analysis %s with vector mode %s\n",
@@ -3072,6 +3099,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   unsigned int first_loop_i = 0;
   unsigned int first_loop_next_i = 0;
   unsigned HOST_WIDE_INT simdlen = loop->simdlen;
+  unsigned int suggested_unroll_factor = 1;
 
   /* First determine the main loop vectorization mode, either the first
      one that works, starting with auto-detecting the vector mode and then
@@ -3084,7 +3112,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       opt_loop_vec_info loop_vinfo
 	= vect_analyze_loop_1 (loop, shared, &loop_form_info,
 			       NULL, vector_modes, mode_i,
-			       autodetected_vector_mode, fatal);
+			       autodetected_vector_mode, fatal,
+			       &suggested_unroll_factor);
       if (fatal)
 	break;
 
@@ -3112,6 +3141,16 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      first_loop_vinfo = loop_vinfo;
 	      first_loop_i = loop_vinfo_i;
 	      first_loop_next_i = mode_i;
+	      if (suggested_unroll_factor > 1)
+		{
+		  mode_i = first_loop_i;
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				     "***** Retryng analysis for unrolling"
+				     " with unroll factor %d.\n",
+				     suggested_unroll_factor);
+		  continue;
+		}
 	    }
 	  else
 	    {
@@ -3158,10 +3197,33 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   /* Now analyze first_loop_vinfo for epilogue vectorization.  */
   poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 
+  if (first_loop_vinfo->suggested_unroll_factor > 1)
+    {
+      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with first vector mode"
+			     " %s for epilogue with partial vectors of"
+			     " unrolled first loop.\n",
+			     GET_MODE_NAME (vector_modes[0]));
+	  mode_i = 0;
+	}
+      else
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue of unrolled first loop.\n",
+			     GET_MODE_NAME (first_loop_vinfo->vector_mode));
+	  mode_i = first_loop_i;
+	}
+    }
+
   /* Handle the case that the original loop can use partial
      vectorization, but want to only adopt it for the epilogue.
      The retry should be in the same mode as original.  */
-  if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
+  else if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
     {
       gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo)
 		  && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo));
@@ -3182,6 +3244,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   /* ???  If first_loop_vinfo was using VOIDmode then we probably
      want to instead search for the corresponding mode in vector_modes[].  */
 
+  poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
   while (1)
     {
       bool fatal;
@@ -3193,6 +3256,22 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       if (fatal)
 	break;
 
+      if (loop_vinfo
+	  && known_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), first_vinfo_vf)
+	  && !(LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	       && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    first_vinfo_vf)))
+	{
+	    if (dump_enabled_p ())
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "***** Will not use %s mode as an"
+			       " epilogue, since it leads to an higher"
+			       " vectorization factor than main loop\n",
+			       GET_MODE_NAME (loop_vinfo->vector_mode));
+	    delete loop_vinfo;
+	    loop_vinfo = opt_loop_vec_info::success (NULL);
+	}
+
       if (loop_vinfo)
 	{
 	  if (pick_lowest_cost_p)
@@ -3905,7 +3984,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
 static void
 vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 				    int *ret_min_profitable_niters,
-				    int *ret_min_profitable_estimate)
+				    int *ret_min_profitable_estimate,
+				    unsigned *suggested_unroll_factor)
 {
   int min_profitable_iters;
   int min_profitable_estimate;
@@ -4265,8 +4345,23 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* Complete the target-specific cost calculations.  */
-  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost,
-	       &vec_inside_cost, &vec_epilogue_cost);
+  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+	       &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost,
+	       suggested_unroll_factor);
+
+  if (suggested_unroll_factor && *suggested_unroll_factor > 1
+      && LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo) != MAX_VECTORIZATION_FACTOR
+      && !known_le (LOOP_VINFO_VECT_FACTOR (loop_vinfo) *
+		    *suggested_unroll_factor,
+		    LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't unroll as unrolled vectorization factor larger"
+			 " than maximum vectorization factor: %d\n",
+			 LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo));
+      *suggested_unroll_factor = 1;
+    }
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
 
@@ -7255,7 +7350,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index b552e9dccce5bce6a3bbcf5d531e7ccefa719b9a..238e0b4bf21871c518da73a79320f34e55c9201c 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -624,6 +624,13 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that the target suggested we unroll the vector loop
+     in order to promote more ILP.  This value will be used to re-analyze the
+     loop for vectorization and if successful the value will be folded into
+     vectorization_factor (and therefore exactly divides
+     vectorization_factor).  */
+  unsigned int suggested_unroll_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1430,6 +1437,7 @@ public:
   unsigned int prologue_cost () const;
   unsigned int body_cost () const;
   unsigned int epilogue_cost () const;
+  unsigned int suggested_unroll_factor () const;
 
 protected:
   unsigned int record_stmt_cost (stmt_vec_info, vect_cost_model_location,
@@ -1447,6 +1455,9 @@ protected:
   /* The costs of the three regions, indexed by vect_cost_model_location.  */
   unsigned int m_costs[3];
 
+  /* The suggested unrolling factor determined at finish_cost.  */
+  unsigned int m_suggested_unroll_factor;
+
   /* True if finish_cost has been called.  */
   bool m_finished;
 };
@@ -1459,6 +1470,7 @@ vector_costs::vector_costs (vec_info *vinfo, bool costing_for_scalar)
   : m_vinfo (vinfo),
     m_costing_for_scalar (costing_for_scalar),
     m_costs (),
+    m_suggested_unroll_factor(1),
     m_finished (false)
 {
 }
@@ -1490,6 +1502,15 @@ vector_costs::epilogue_cost () const
   return m_costs[vect_epilogue];
 }
 
+/* Return the suggested unroll factor.  */
+
+inline unsigned int
+vector_costs::suggested_unroll_factor () const
+{
+  gcc_checking_assert (m_finished);
+  return m_suggested_unroll_factor;
+}
+
 #define VECT_MAX_COST 1000
 
 /* The maximum number of intermediate steps required in multi-step type
@@ -1665,12 +1686,15 @@ add_stmt_cost (vector_costs *costs, stmt_info_for_cost *i)
 
 static inline void
 finish_cost (vector_costs *costs, unsigned *prologue_cost,
-	     unsigned *body_cost, unsigned *epilogue_cost)
+	     unsigned *body_cost, unsigned *epilogue_cost,
+	     unsigned *suggested_unroll_factor = NULL)
 {
   costs->finish_cost ();
   *prologue_cost = costs->prologue_cost ();
   *body_cost = costs->body_cost ();
   *epilogue_cost = costs->epilogue_cost ();
+  if (suggested_unroll_factor)
+    *suggested_unroll_factor = costs->suggested_unroll_factor ();
 }
 
 inline void
Richard Biener Nov. 22, 2021, 12:39 p.m. UTC | #16
On Mon, 22 Nov 2021, Andre Vieira (lists) wrote:

> 
> On 12/11/2021 13:12, Richard Biener wrote:
> > On Thu, 11 Nov 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> This is the rebased and reworked version of the unroll patch.  I wasn't
> >> entirely sure whether I should compare the costs of the unrolled loop_vinfo
> >> with the original loop_vinfo it was unrolled of. I did now, but I wasn't
> >> too
> >> sure whether it was a good idea to... Any thoughts on this?
> > +  /* Apply the suggested unrolling factor, this was determined by the
> > backend
> > +     during finish_cost the first time we ran the analyzis for this
> > +     vector mode.  */
> > +  if (loop_vinfo->suggested_unroll_factor > 1)
> > +    {
> > +      poly_uint64 unrolled_vf
> > +       = LOOP_VINFO_VECT_FACTOR (loop_vinfo) *
> > loop_vinfo->suggested_unroll_factor;
> > +      /* Make sure the unrolled vectorization factor is less than the max
> > +         vectorization factor.  */
> > +      unsigned HOST_WIDE_INT max_vf = LOOP_VINFO_MAX_VECT_FACTOR
> > (loop_vinfo);
> > +      if (max_vf == MAX_VECTORIZATION_FACTOR || known_le (unrolled_vf,
> > max_vf))
> > +       LOOP_VINFO_VECT_FACTOR (loop_vinfo) = unrolled_vf;
> > +      else
> > +       return opt_result::failure_at (vect_location,
> > +                                      "unrolling failed: unrolled"
> > +                                      " vectorization factor larger than"
> > +                                      " maximum vectorization factor:
> > %d\n",
> > +                                      LOOP_VINFO_MAX_VECT_FACTOR
> > (loop_vinfo));
> > +    }
> > +
> >     /* This is the point where we can re-start analysis with SLP forced
> > off.  */
> >   start_over:
> >
> > So we're honoring suggested_unroll_factor here but you still have the
> > now unused hunk
> >
> > +vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal,
> > +                    unsigned *suggested_unroll_factor, poly_uint64 min_vf
> > = 2)
> >   {
> >
> > I also wonder whether vect_analyze_loop_2 could at least prune
> > suggested_unroll_factor as set by vect_analyze_loop_costing with its
> > knowledge of max_vf itself?  That would avoid using the at the moment
> > unused LOOP_VINFO_MAX_VECT_FACTOR?
> >
> > I think all the things you do in vect_can_unroll should be figured
> > out with the re-analysis, and I'd just amend vect_analyze_loop_1
> > with a suggested unroll factor parameter like it has main_loop_vinfo
> > for the epilogue case.  The main loop adjustment would the be in the
> >
> >            if (first_loop_vinfo == NULL)
> >              {
> >                first_loop_vinfo = loop_vinfo;
> >                first_loop_i = loop_vinfo_i;
> >                first_loop_next_i = mode_i;
> >              }
> Sounds good.
> >
> > spot only, adding
> >
> > if (loop_vinfo->suggested_unroll_factor != 1)
> >    {
> >      suggested_unroll_factor = loop_vinfo->suggested_unroll_factor;
> >      mode_i = first_loop_i;
> >      if (dump)
> >        dump_print ("Trying unrolling by %d\n");
> >      continue;
> >    }
> Not quite like this because of how we need to keep the suggestion given at
> finish_costs, put into suggested_unroll_factor, separate from how we tell
> vect_analyze_loop_1 that we are now vectorizing an unrolled vector loop, which
> we do by writing to loop_vinfo->suggested_unroll_factor. Perhaps I should
> renamed the latter, to avoid confusion? Let me know if you think that would
> help and in the mean-time this is what the patch looks like now. I'll follow
> up with a ChangeLog when we settle on the name & structure.

I think the patch looks OK, I'm only wondering about

+  if (first_loop_vinfo->suggested_unroll_factor > 1)
+    {
+      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
+       {
+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "***** Re-trying analysis with first vector
mode"
+                            " %s for epilogue with partial vectors of"
+                            " unrolled first loop.\n",
+                            GET_MODE_NAME (vector_modes[0]));
+         mode_i = 0;

and the later done check for bigger VF than main loop - why would
we re-start at 0 rather than at the old mode?  Maybe we want to
remember the iterator value we started at when arriving at the
main loop mode?  So if we analyzed successfully with mode_i == 2,
then sucessfully at mode_i == 4 which suggested an unroll of 2,
re-start at the mode_i we continued after the mode_i == 2
successful analysis?  To just consider the "simple" case of
AVX vs SSE it IMHO doesn't make much sense to succeed with
AVX V4DF, succeed with SSE V2DF and figure it's better than V4DF AVX
but get a suggestion of 2 times unroll and then re-try AVX V4DF
just to re-compute that yes, it's worse than SSE V2DF?  You
are probably thinking of SVE vs ADVSIMD here but do we need to
start at 0?  Adding a comment to the code would be nice.

Thanks,
Richard.
Andre Vieira (lists) Nov. 24, 2021, 9:46 a.m. UTC | #17
On 22/11/2021 12:39, Richard Biener wrote:
> +  if (first_loop_vinfo->suggested_unroll_factor > 1)
> +    {
> +      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_NOTE, vect_location,
> +                            "***** Re-trying analysis with first vector
> mode"
> +                            " %s for epilogue with partial vectors of"
> +                            " unrolled first loop.\n",
> +                            GET_MODE_NAME (vector_modes[0]));
> +         mode_i = 0;
>
> and the later done check for bigger VF than main loop - why would
> we re-start at 0 rather than at the old mode?  Maybe we want to
> remember the iterator value we started at when arriving at the
> main loop mode?  So if we analyzed successfully with mode_i == 2,
> then sucessfully at mode_i == 4 which suggested an unroll of 2,
> re-start at the mode_i we continued after the mode_i == 2
> successful analysis?  To just consider the "simple" case of
> AVX vs SSE it IMHO doesn't make much sense to succeed with
> AVX V4DF, succeed with SSE V2DF and figure it's better than V4DF AVX
> but get a suggestion of 2 times unroll and then re-try AVX V4DF
> just to re-compute that yes, it's worse than SSE V2DF?  You
> are probably thinking of SVE vs ADVSIMD here but do we need to
> start at 0?  Adding a comment to the code would be nice.
>
> Thanks,

I was indeed thinking SVE vs Advanced SIMD where we end up having to 
compare different vectorization strategies, which will have different 
costs depending. The hypothetical case, as in I don't think I've come 
across one, is where if we decide to vectorize the main loop for V8QI 
and unroll 2x, yielding a VF of 16, we may then want to then use a 
predicated VNx16QI epilogue. Though the question here is whether it is 
possible for an Advanced SIMD V8QI vectorization to beat V16QI but a SVE 
predicated VNx16QI to beat a VNx8QI for the same loop.  Might be good to 
get Sandiford's opinion on this.

I do think that initially I was more concerned with skipping a VNx8QI 
after selecting a V8QI but I just checked and Advanced SIMD modes are 
listed before SVE for (among others) this reason.

Regards,
Andre
Richard Biener Nov. 24, 2021, 11 a.m. UTC | #18
On Wed, 24 Nov 2021, Andre Vieira (lists) wrote:

> 
> On 22/11/2021 12:39, Richard Biener wrote:
> > +  if (first_loop_vinfo->suggested_unroll_factor > 1)
> > +    {
> > +      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
> > +       {
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_NOTE, vect_location,
> > +                            "***** Re-trying analysis with first vector
> > mode"
> > +                            " %s for epilogue with partial vectors of"
> > +                            " unrolled first loop.\n",
> > +                            GET_MODE_NAME (vector_modes[0]));
> > +         mode_i = 0;
> >
> > and the later done check for bigger VF than main loop - why would
> > we re-start at 0 rather than at the old mode?  Maybe we want to
> > remember the iterator value we started at when arriving at the
> > main loop mode?  So if we analyzed successfully with mode_i == 2,
> > then sucessfully at mode_i == 4 which suggested an unroll of 2,
> > re-start at the mode_i we continued after the mode_i == 2
> > successful analysis?  To just consider the "simple" case of
> > AVX vs SSE it IMHO doesn't make much sense to succeed with
> > AVX V4DF, succeed with SSE V2DF and figure it's better than V4DF AVX
> > but get a suggestion of 2 times unroll and then re-try AVX V4DF
> > just to re-compute that yes, it's worse than SSE V2DF?  You
> > are probably thinking of SVE vs ADVSIMD here but do we need to
> > start at 0?  Adding a comment to the code would be nice.
> >
> > Thanks,
> 
> I was indeed thinking SVE vs Advanced SIMD where we end up having to compare
> different vectorization strategies, which will have different costs depending.
> The hypothetical case, as in I don't think I've come across one, is where if
> we decide to vectorize the main loop for V8QI and unroll 2x, yielding a VF of
> 16, we may then want to then use a predicated VNx16QI epilogue.

But this isn't the epilogue handling ...

> Though the
> question here is whether it is possible for an Advanced SIMD V8QI
> vectorization to beat V16QI but a SVE predicated VNx16QI to beat a VNx8QI for
> the same loop.  Might be good to get Sandiford's opinion on this.
> 
> I do think that initially I was more concerned with skipping a VNx8QI after
> selecting a V8QI but I just checked and Advanced SIMD modes are listed before
> SVE for (among others) this reason.
> 
> Regards,
> Andre
> 
> 
>
Andre Vieira (lists) Nov. 25, 2021, 10:40 a.m. UTC | #19
On 24/11/2021 11:00, Richard Biener wrote:
> On Wed, 24 Nov 2021, Andre Vieira (lists) wrote:
>
>> On 22/11/2021 12:39, Richard Biener wrote:
>>> +  if (first_loop_vinfo->suggested_unroll_factor > 1)
>>> +    {
>>> +      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
>>> +       {
>>> +         if (dump_enabled_p ())
>>> +           dump_printf_loc (MSG_NOTE, vect_location,
>>> +                            "***** Re-trying analysis with first vector
>>> mode"
>>> +                            " %s for epilogue with partial vectors of"
>>> +                            " unrolled first loop.\n",
>>> +                            GET_MODE_NAME (vector_modes[0]));
>>> +         mode_i = 0;
>>>
>>> and the later done check for bigger VF than main loop - why would
>>> we re-start at 0 rather than at the old mode?  Maybe we want to
>>> remember the iterator value we started at when arriving at the
>>> main loop mode?  So if we analyzed successfully with mode_i == 2,
>>> then sucessfully at mode_i == 4 which suggested an unroll of 2,
>>> re-start at the mode_i we continued after the mode_i == 2
>>> successful analysis?  To just consider the "simple" case of
>>> AVX vs SSE it IMHO doesn't make much sense to succeed with
>>> AVX V4DF, succeed with SSE V2DF and figure it's better than V4DF AVX
>>> but get a suggestion of 2 times unroll and then re-try AVX V4DF
>>> just to re-compute that yes, it's worse than SSE V2DF?  You
>>> are probably thinking of SVE vs ADVSIMD here but do we need to
>>> start at 0?  Adding a comment to the code would be nice.
>>>
>>> Thanks,
>> I was indeed thinking SVE vs Advanced SIMD where we end up having to compare
>> different vectorization strategies, which will have different costs depending.
>> The hypothetical case, as in I don't think I've come across one, is where if
>> we decide to vectorize the main loop for V8QI and unroll 2x, yielding a VF of
>> 16, we may then want to then use a predicated VNx16QI epilogue.
> But this isn't the epilogue handling ...
Am I misunderstanding the code here? To me it looks like this is picking 
what mode_i to start the 'while (1)' loop does the loop analysis for the 
epilogues?
Richard Biener Nov. 25, 2021, 12:46 p.m. UTC | #20
On Thu, 25 Nov 2021, Andre Vieira (lists) wrote:

> 
> On 24/11/2021 11:00, Richard Biener wrote:
> > On Wed, 24 Nov 2021, Andre Vieira (lists) wrote:
> >
> >> On 22/11/2021 12:39, Richard Biener wrote:
> >>> +  if (first_loop_vinfo->suggested_unroll_factor > 1)
> >>> +    {
> >>> +      if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
> >>> +       {
> >>> +         if (dump_enabled_p ())
> >>> +           dump_printf_loc (MSG_NOTE, vect_location,
> >>> +                            "***** Re-trying analysis with first vector
> >>> mode"
> >>> +                            " %s for epilogue with partial vectors of"
> >>> +                            " unrolled first loop.\n",
> >>> +                            GET_MODE_NAME (vector_modes[0]));
> >>> +         mode_i = 0;
> >>>
> >>> and the later done check for bigger VF than main loop - why would
> >>> we re-start at 0 rather than at the old mode?  Maybe we want to
> >>> remember the iterator value we started at when arriving at the
> >>> main loop mode?  So if we analyzed successfully with mode_i == 2,
> >>> then sucessfully at mode_i == 4 which suggested an unroll of 2,
> >>> re-start at the mode_i we continued after the mode_i == 2
> >>> successful analysis?  To just consider the "simple" case of
> >>> AVX vs SSE it IMHO doesn't make much sense to succeed with
> >>> AVX V4DF, succeed with SSE V2DF and figure it's better than V4DF AVX
> >>> but get a suggestion of 2 times unroll and then re-try AVX V4DF
> >>> just to re-compute that yes, it's worse than SSE V2DF?  You
> >>> are probably thinking of SVE vs ADVSIMD here but do we need to
> >>> start at 0?  Adding a comment to the code would be nice.
> >>>
> >>> Thanks,
> >> I was indeed thinking SVE vs Advanced SIMD where we end up having to
> >> compare
> >> different vectorization strategies, which will have different costs
> >> depending.
> >> The hypothetical case, as in I don't think I've come across one, is where
> >> if
> >> we decide to vectorize the main loop for V8QI and unroll 2x, yielding a VF
> >> of
> >> 16, we may then want to then use a predicated VNx16QI epilogue.
> > But this isn't the epilogue handling ...
> Am I misunderstanding the code here? To me it looks like this is picking what
> mode_i to start the 'while (1)' loop does the loop analysis for the epilogues?

Oops, my fault, yes, it does.  I would suggest to refactor things so
that the mode_i = first_loop_i case is there only once.  I also wonder
if all the argument about starting at 0 doesn't apply to the
not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well?  So
what's the reason to differ here?  So in the end I'd just change
the existing

  if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
    {

to

  if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)
      || first_loop_vinfo->suggested_unroll_factor > 1)
    {

and maybe revisit this when we have an actual testcase showing that
doing sth else has a positive effect?

Thanks,
Richard.
Andre Vieira (lists) Nov. 30, 2021, 11:36 a.m. UTC | #21
On 25/11/2021 12:46, Richard Biener wrote:
> Oops, my fault, yes, it does.  I would suggest to refactor things so
> that the mode_i = first_loop_i case is there only once.  I also wonder
> if all the argument about starting at 0 doesn't apply to the
> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well?  So
> what's the reason to differ here?  So in the end I'd just change
> the existing
>
>    if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
>      {
>
> to
>
>    if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)
>        || first_loop_vinfo->suggested_unroll_factor > 1)
>      {
>
> and maybe revisit this when we have an actual testcase showing that
> doing sth else has a positive effect?
>
> Thanks,
> Richard.

So I had a quick chat with Richard Sandiford and he is suggesting 
resetting mode_i to 0 for all cases.

He pointed out that for some tunings the SVE mode might come after the 
NEON mode, which means that even for not-unrolled loop_vinfos we could 
end up with a suboptimal choice of mode for the epilogue. I.e. it could 
be that we pick V16QI for main vectorization, but that's VNx16QI + 1 in 
the array, so we'd not try VNx16QI for the epilogue.

This would simplify the mode selecting cases, by just simply restarting 
at mode_i in all epilogue cases. Is that something you'd be OK?

Regards,
Andre
Richard Biener Nov. 30, 2021, 1:56 p.m. UTC | #22
On Tue, 30 Nov 2021, Andre Vieira (lists) wrote:

> 
> On 25/11/2021 12:46, Richard Biener wrote:
> > Oops, my fault, yes, it does.  I would suggest to refactor things so
> > that the mode_i = first_loop_i case is there only once.  I also wonder
> > if all the argument about starting at 0 doesn't apply to the
> > not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well?  So
> > what's the reason to differ here?  So in the end I'd just change
> > the existing
> >
> >    if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
> >      {
> >
> > to
> >
> >    if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)
> >        || first_loop_vinfo->suggested_unroll_factor > 1)
> >      {
> >
> > and maybe revisit this when we have an actual testcase showing that
> > doing sth else has a positive effect?
> >
> > Thanks,
> > Richard.
> 
> So I had a quick chat with Richard Sandiford and he is suggesting resetting
> mode_i to 0 for all cases.
> 
> He pointed out that for some tunings the SVE mode might come after the NEON
> mode, which means that even for not-unrolled loop_vinfos we could end up with
> a suboptimal choice of mode for the epilogue. I.e. it could be that we pick
> V16QI for main vectorization, but that's VNx16QI + 1 in the array, so we'd not
> try VNx16QI for the epilogue.
> 
> This would simplify the mode selecting cases, by just simply restarting at
> mode_i in all epilogue cases. Is that something you'd be OK?

Works for me with an updated comment.  Even better with showing a
testcase exercising such tuning.

Richard.
Andre Vieira (lists) Jan. 10, 2022, 6:31 p.m. UTC | #23
Hi,

I don't think I ever ended up posting the rebased version on top of the 
epilogue mode patch. So here it is, I think I had a conditional OK if I 
split the epilogue mode patch, but just want to double check this is OK 
for trunk?


gcc/ChangeLog:

         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Pass 
new argument
         suggested_unroll_factor.
         (vect_analyze_loop_costing): Likewise.
         (_loop_vec_info::_loop_vec_info): Initialize new member 
suggested_unroll_factor.
         (vect_determine_partial_vectors_and_peeling): Make epilogue of 
unrolled
         main loop use partial vectors.
         (vect_analyze_loop_2): Pass and use new argument 
suggested_unroll_factor.
         (vect_analyze_loop_1): Likewise.
         (vect_analyze_loop): Change to intialize local 
suggested_unroll_factor and use it.
         (vectorizable_reduction): Don't use single_defuse_cycle when 
unrolling.
         * tree-vectorizer.h (_loop_vec_info::_loop_vec_info): Add new 
member suggested_unroll_factor.
         (vector_costs::vector_costs): Add new member 
m_suggested_unroll_factor.
         (vector_costs::suggested_unroll_factor): New getter function.
         (finish_cost): Set return argument suggested_unroll_factor.



Regards,
Andre

On 30/11/2021 13:56, Richard Biener wrote:
> On Tue, 30 Nov 2021, Andre Vieira (lists) wrote:
>
>> On 25/11/2021 12:46, Richard Biener wrote:
>>> Oops, my fault, yes, it does.  I would suggest to refactor things so
>>> that the mode_i = first_loop_i case is there only once.  I also wonder
>>> if all the argument about starting at 0 doesn't apply to the
>>> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well?  So
>>> what's the reason to differ here?  So in the end I'd just change
>>> the existing
>>>
>>>     if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
>>>       {
>>>
>>> to
>>>
>>>     if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)
>>>         || first_loop_vinfo->suggested_unroll_factor > 1)
>>>       {
>>>
>>> and maybe revisit this when we have an actual testcase showing that
>>> doing sth else has a positive effect?
>>>
>>> Thanks,
>>> Richard.
>> So I had a quick chat with Richard Sandiford and he is suggesting resetting
>> mode_i to 0 for all cases.
>>
>> He pointed out that for some tunings the SVE mode might come after the NEON
>> mode, which means that even for not-unrolled loop_vinfos we could end up with
>> a suboptimal choice of mode for the epilogue. I.e. it could be that we pick
>> V16QI for main vectorization, but that's VNx16QI + 1 in the array, so we'd not
>> try VNx16QI for the epilogue.
>>
>> This would simplify the mode selecting cases, by just simply restarting at
>> mode_i in all epilogue cases. Is that something you'd be OK?
> Works for me with an updated comment.  Even better with showing a
> testcase exercising such tuning.
>
> Richard.
Hi,

I don't think I ever ended up posting the rebased version on top of the epilogue mode patch. So here it is.

OK for trunk?


gcc/ChangeLog:

        * tree-vect-loop.c (vect_estimate_min_profitable_iters): Pass new argument
        suggested_unroll_factor.
        (vect_analyze_loop_costing): Likewise.
        (_loop_vec_info::_loop_vec_info): Initialize new member suggested_unroll_factor.
        (vect_determine_partial_vectors_and_peeling): Make epilogue of unrolled
        main loop use partial vectors.
        (vect_analyze_loop_2): Pass and use new argument suggested_unroll_factor.
        (vect_analyze_loop_1): Likewise.
        (vect_analyze_loop): Change to intialize local suggested_unroll_factor and use it.
        (vectorizable_reduction): Don't use single_defuse_cycle when unrolling.
        * tree-vectorizer.h (_loop_vec_info::_loop_vec_info): Add new member suggested_unroll_factor.
        (vector_costs::vector_costs): Add new member m_suggested_unroll_factor.
        (vector_costs::suggested_unroll_factor): New getter function.
        (finish_cost): Set return argument suggested_unroll_factor.



Regards,
Andre
Richard Biener Jan. 11, 2022, 7:14 a.m. UTC | #24
On Mon, 10 Jan 2022, Andre Vieira (lists) wrote:

> Hi,
> 
> I don't think I ever ended up posting the rebased version on top of the
> epilogue mode patch. So here it is, I think I had a conditional OK if I split
> the epilogue mode patch, but just want to double check this is OK for trunk?

Yes, I think I acked this.

Richard.

> 
> gcc/ChangeLog:
> 
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Pass new
> argument
>         suggested_unroll_factor.
>         (vect_analyze_loop_costing): Likewise.
>         (_loop_vec_info::_loop_vec_info): Initialize new member 
> suggested_unroll_factor.
>         (vect_determine_partial_vectors_and_peeling): Make epilogue of
> unrolled
>         main loop use partial vectors.
>         (vect_analyze_loop_2): Pass and use new argument 
> suggested_unroll_factor.
>         (vect_analyze_loop_1): Likewise.
>         (vect_analyze_loop): Change to intialize local 
> suggested_unroll_factor and use it.
>         (vectorizable_reduction): Don't use single_defuse_cycle when
> unrolling.
>         * tree-vectorizer.h (_loop_vec_info::_loop_vec_info): Add new member
> suggested_unroll_factor.
>         (vector_costs::vector_costs): Add new member
> m_suggested_unroll_factor.
>         (vector_costs::suggested_unroll_factor): New getter function.
>         (finish_cost): Set return argument suggested_unroll_factor.
> 
> 
> 
> Regards,
> Andre
> 
> On 30/11/2021 13:56, Richard Biener wrote:
> > On Tue, 30 Nov 2021, Andre Vieira (lists) wrote:
> >
> >> On 25/11/2021 12:46, Richard Biener wrote:
> >>> Oops, my fault, yes, it does.  I would suggest to refactor things so
> >>> that the mode_i = first_loop_i case is there only once.  I also wonder
> >>> if all the argument about starting at 0 doesn't apply to the
> >>> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well?  So
> >>> what's the reason to differ here?  So in the end I'd just change
> >>> the existing
> >>>
> >>>     if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo))
> >>>       {
> >>>
> >>> to
> >>>
> >>>     if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)
> >>>         || first_loop_vinfo->suggested_unroll_factor > 1)
> >>>       {
> >>>
> >>> and maybe revisit this when we have an actual testcase showing that
> >>> doing sth else has a positive effect?
> >>>
> >>> Thanks,
> >>> Richard.
> >> So I had a quick chat with Richard Sandiford and he is suggesting resetting
> >> mode_i to 0 for all cases.
> >>
> >> He pointed out that for some tunings the SVE mode might come after the NEON
> >> mode, which means that even for not-unrolled loop_vinfos we could end up
> >> with
> >> a suboptimal choice of mode for the epilogue. I.e. it could be that we pick
> >> V16QI for main vectorization, but that's VNx16QI + 1 in the array, so we'd
> >> not
> >> try VNx16QI for the epilogue.
> >>
> >> This would simplify the mode selecting cases, by just simply restarting at
> >> mode_i in all epilogue cases. Is that something you'd be OK?
> > Works for me with an updated comment.  Even better with showing a
> > testcase exercising such tuning.
> >
> > Richard.
>
diff mbox series

Patch

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index f68f42638a112bed8396fd634bd3fd3c44ce848a..3bc9694d2162055d3db165ef888f35deb676548b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6283,6 +6283,19 @@  allocated by TARGET_VECTORIZE_INIT_COST.  The default releases the
 accumulator.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL (class vec_info *@var{vinfo}, class _stmt_vec_info *@var{stmt_info}, void *@var{data})
+This hook should update the target-specific @var{data} relative
+relative to the statement represented by @var{stmt_vinfo} to be used
+later to determine the unrolling factor for this loop using the current
+vectorization factor.
+@end deftypefn
+
+@deftypefn {Target Hook} unsigned TARGET_VECTORIZE_UNROLL_FACTOR (class vec_info *@var{vinfo}, void *@var{data})
+This hook should return the desired vector unrolling factor for a loop with
+@var{vinfo} based on the target-specific @var{data}. The default returns one,
+which means no unrolling will be performed.
+@end deftypefn
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index fdf16b901c537e6a02f630a80a2213d2dcb6d5d6..40f4cb02c34f575439f35070301855ddaf82a21a 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4195,6 +4195,10 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_DESTROY_COST_DATA
 
+@hook TARGET_VECTORIZE_ADD_STMT_COST_FOR_UNROLL
+
+@hook TARGET_VECTORIZE_UNROLL_FACTOR
+
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
 @hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/params.opt b/gcc/params.opt
index f414dc1a61cfa9d5b9ded75e96560fc1f73041a5..00f92d4484797df0dbbad052f45205469cbb2c49 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1117,4 +1117,12 @@  Controls how loop vectorizer uses partial vectors.  0 means never, 1 means only
 Common Joined UInteger Var(param_vect_inner_loop_cost_factor) Init(50) IntegerRange(1, 10000) Param Optimization
 The maximum factor which the loop vectorizer applies to the cost of statements in an inner loop relative to the loop being vectorized.
 
+-param=vect-unroll=
+Common Joined UInteger Var(param_vect_unroll) Init(0) IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops.  Also see vect-unroll-reductions.
+
+-param=vect-unroll-reductions=
+Common Joined UInteger Var(param_vect_unroll_reductions) Init(0) IntegerRange(0, 32) Param Optimization
+Controls how many times the vectorizer tries to unroll loops that contain associative reductions.  0 means that such loops should be unrolled vect-unroll times.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/target.def b/gcc/target.def
index 28a34f1d51b5abb41c537b9cd327ca59f1f9260f..0eac529f17bd981b6494fe613117f28803a02390 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2095,6 +2095,28 @@  accumulator.",
  (void *data),
  default_destroy_cost_data)
 
+/* Target function to record cost approximation to be used by
+   TARGET_VECTORIZE_UNROLL_FACTOR.  */
+DEFHOOK
+(add_stmt_cost_for_unroll,
+ "This hook should update the target-specific @var{data} relative\n\
+relative to the statement represented by @var{stmt_vinfo} to be used\n\
+later to determine the unrolling factor for this loop using the current\n\
+vectorization factor.",
+ void,
+ (class vec_info *vinfo, class _stmt_vec_info *stmt_info, void *data),
+ default_add_stmt_cost_for_unroll)
+
+/* Function to determine unroll factor for vectorization.  */
+DEFHOOK
+(unroll_factor,
+ "This hook should return the desired vector unrolling factor for a loop with\n\
+@var{vinfo} based on the target-specific @var{data}. The default returns one,\n\
+which means no unrolling will be performed.",
+ unsigned,
+ (class vec_info *vinfo, void *data),
+ default_unroll_factor)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..d285c24d6d398cfabb58c291fd2dcbfa6e1bd8f6 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -125,6 +125,9 @@  extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
 				       enum vect_cost_model_location);
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
+extern void default_add_stmt_cost_for_unroll (class vec_info *,
+					      class _stmt_vec_info *, void *);
+extern unsigned default_unroll_factor (class vec_info *, void *);
 
 /* OpenACC hooks.  */
 extern bool default_goacc_validate_dims (tree, int [], int, unsigned);
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..9bc7e80e5a67129633dab99a871b6babff65de97 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1535,6 +1535,26 @@  default_destroy_cost_data (void *data)
   free (data);
 }
 
+/* By default, we do not perform unrolling so this function does not need
+   to do anything.  */
+void
+default_add_stmt_cost_for_unroll (class vec_info *vinfo ATTRIBUTE_UNUSED,
+				  class _stmt_vec_info *stmt_info
+				  ATTRIBUTE_UNUSED,
+				  void *data ATTRIBUTE_UNUSED)
+{
+}
+
+
+/* By default, return a vector unroll factor of one, meaning no unrolling will
+   be performed.  */
+unsigned
+default_unroll_factor (class vec_info *vinfo ATTRIBUTE_UNUSED,
+		       void *data ATTRIBUTE_UNUSED)
+{
+  return 1;
+}
+
 /* Determine whether or not a pointer mode is valid. Assume defaults
    of ptr_mode or Pmode - can be overridden.  */
 bool
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 0c8d992624b59ddd056aff594738305d6be5afa8..14f8150d7c262b9422784e0e997ca4387664a20a 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -828,6 +828,7 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    par_unrolling_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1594,6 +1595,7 @@  vect_update_vf_for_slp (loop_vec_info loop_vinfo)
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "Loop contains only SLP stmts\n");
       vectorization_factor = LOOP_VINFO_SLP_UNROLLING_FACTOR (loop_vinfo);
+      vectorization_factor *= loop_vinfo->par_unrolling_factor;
     }
   else
     {
@@ -2131,7 +2133,8 @@  vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->par_unrolling_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2192,6 +2195,101 @@  vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
   return opt_result::success ();
 }
 
+
+static poly_uint64
+vect_determine_unroll_factor (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  bool seen_reduction_p = false;
+  bool can_unroll_p = !LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  if (!can_unroll_p)
+    return vectorization_factor;
+
+  DUMP_VECT_SCOPE ("vect_determine_unroll_factor");
+
+  void *target_cost_data = init_cost (loop_vinfo->loop, true);
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  || !STMT_VINFO_RELEVANT_P (stmt_info)
+	  || stmt_info->vectype == NULL_TREE)
+	continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+	 vectorization will succeed due to the way we deal with negative steps
+	 in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+	  && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	{
+	  dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+	  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+	  if (TREE_CODE (step) == INTEGER_CST
+	      && tree_int_cst_compare (step, size_zero_node) < 0)
+	    {
+	      can_unroll_p = false;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "could not unroll due to negative step\n");
+	      break;
+	    }
+	}
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+	{
+	  auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+	  if (STMT_VINFO_REDUC_TYPE (red_info) == TREE_CODE_REDUCTION)
+	    seen_reduction_p = true;
+	  else
+	    {
+	      can_unroll_p = false;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "could not unroll due to unsupported "
+				 "reduction\n");
+	      break;
+	    }
+	}
+
+      targetm.vectorize.add_stmt_cost_for_unroll (loop_vinfo, stmt_info,
+						  target_cost_data);
+    }
+
+  if (!can_unroll_p)
+    {
+      return vectorization_factor;
+    }
+
+  unsigned int unrolling_factor = 1;
+  if (maybe_gt (vectorization_factor, 1U))
+    unrolling_factor = vect_unroll_value (loop_vinfo, seen_reduction_p,
+					  target_cost_data);
+
+
+  destroy_cost_data (target_cost_data);
+
+  while (unrolling_factor > 1)
+    {
+      poly_uint64 candidate_factor = vectorization_factor * unrolling_factor;
+      if (estimated_poly_value (candidate_factor, POLY_VALUE_MAX)
+	  <= (HOST_WIDE_INT) LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))
+	{
+	  vectorization_factor = candidate_factor;
+	  break;
+	}
+      unrolling_factor /= 2;
+    }
+  loop_vinfo->par_unrolling_factor = unrolling_factor;
+  LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "unrolling factor = %d\n",
+		     unrolling_factor);
+
+  return vectorization_factor;
+}
+
 /* Function vect_analyze_loop_2.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -2320,6 +2418,8 @@  vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts)
 			 "can't determine vectorization factor.\n");
       return ok;
     }
+
+  vect_determine_unroll_factor (loop_vinfo);
   if (max_vf != MAX_VECTORIZATION_FACTOR
       && maybe_lt (max_vf, LOOP_VINFO_VECT_FACTOR (loop_vinfo)))
     return opt_result::failure_at (vect_location, "bad data dependence.\n");
@@ -3062,7 +3162,14 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		  gcc_assert (vect_epilogues);
 		  delete vinfos.pop ();
 		}
+	      /* Check if we may want to replace the current first_loop_vinfo
+		 with the new loop, but only if they have different vector
+		 modes.  If they have the same vector mode this means the main
+		 loop is an unrolled loop and we are trying to vectorize the
+		 epilogue using the same vector mode but with a lower
+		 vectorization factor.  */
 	      if (vinfos.is_empty ()
+		  && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
 		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
 		{
 		  loop_vec_info main_loop_vinfo
@@ -3156,10 +3263,26 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	 The retry should be in the same mode as original.  */
       if (vect_epilogues
 	  && loop_vinfo
-	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	  && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      || loop_vinfo->par_unrolling_factor > 1))
 	{
-	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		       || loop_vinfo->par_unrolling_factor > 1)
 		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+	  if (loop_vinfo->par_unrolling_factor > 1)
+	    {
+	      next_vector_mode = vector_modes[0];
+	      mode_i = 1;
+
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "***** Re-trying analysis with vector mode"
+				 " %s for epilogue with partial vectors.\n",
+				 GET_MODE_NAME (next_vector_mode));
+	      continue;
+	    }
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "***** Re-trying analysis with same vector mode"
@@ -7212,7 +7335,8 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->par_unrolling_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 9c2c29d61fae5e651a112b103482131e3d646fb6..b51e82a0663a391a096480bff03a2191bc11dcf4 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -620,6 +620,11 @@  public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that we've unrolled the vector loop in order
+     to promote more ILP.  This value is folded into vectorization_factor
+     (and therefore exactly divides vectorization_factor).  */
+  unsigned int par_unrolling_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1810,6 +1815,20 @@  vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
 	  && th >= vect_vf_for_cost (loop_vinfo));
 }
 
+/* Return the number of times that we should unroll general
+   reduction-free loops.  */
+
+inline unsigned int
+vect_unroll_value (loop_vec_info loop_vinfo, bool seen_reduction_p, void *data)
+{
+  if (seen_reduction_p && param_vect_unroll_reductions >= 1)
+    return param_vect_unroll_reductions;
+  if (param_vect_unroll >= 1)
+    return param_vect_unroll;
+  else
+    return targetm.vectorize.unroll_factor (loop_vinfo, data);
+}
+
 /* Source location + hotness information. */
 extern dump_user_location_t vect_location;