diff mbox series

[v4] vect/rs6000: Support vector with length cost modeling

Message ID f9ca7491-d8f3-c8e3-5a55-1fbbe71a0d51@linux.ibm.com
State New
Headers show
Series [v4] vect/rs6000: Support vector with length cost modeling | expand

Commit Message

Kewen.Lin July 27, 2020, 3:58 a.m. UTC
Hi Richard,

Thanks for the review again!

on 2020/7/25 上午12:21, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> 
> Thanks, the rearrangement of the existing code looks good.  Could you
> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
> out into separate patches?
> 

Splitted to https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html.

errr... that subject should be with prefix "[PATCH] vect:".

[snip ...] 
(Some comments in the snipped content will be done in v4)

>> +	 here.  */
>> +
>> +      /* For now we only operate length-based partial vectors on Power,
>> +	 which has constant VF all the time, we need some tweakings below
>> +	 if it doesn't hold in future.  */
>> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
> 
> Where do you rely on this?  There didn't seem to be any obvious
> to_constant uses.  Since this is “only” a cost calculation, we should
> be using assumed_vf.

Sorry for the confusion.  This was intended for the poly things like
VF or nitems_per_ctrl which isn't constant during compilation time,
then get people's attention on the possible runtime cost on things like
scaling up for nitems_step etc.  But I just realized that the computations
like the multiply with another constant can operate on the coefficient,
it looks there is no runtime cost then?  If so, I think I thought too
much before.  ;-)

>> -      prologue_cost_vec.release ();
>> -      epilogue_cost_vec.release ();
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
>> +			    NULL, NULL_TREE, 0, vect_prologue);
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
>> +			    NULL, NULL_TREE, 0, vect_body);
> 
> IMO this seems to be reproducing too much of the functions that you
> referred to.  And the danger with that is that they could easily
> get out of sync later.

Good point!  The original intention was to model as possible as we can,
to avoid some bad decision due to some unmodeled pieces, like the case
the loop body is small and some computation become nonnegligible.
The unsync risks seems also applied for other codes.  How about adding
some "note" comments in those functions?

The updated v4 is attached by addressing your comments as well as Segher's
comments.

BR,
Kewen
-----

gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
	function.
	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
	function comment.
	* tree-vect-stmts.c (vect_gen_len): Likewise.

Comments

Richard Sandiford July 27, 2020, 1:40 p.m. UTC | #1
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for the review again!
>
> on 2020/7/25 上午12:21, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> 
>> Thanks, the rearrangement of the existing code looks good.  Could you
>> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
>> out into separate patches?
>> 
>
> Splitted to https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html.
>
> errr... that subject should be with prefix "[PATCH] vect:".
>
> [snip ...] 
> (Some comments in the snipped content will be done in v4)
>
>>> +	 here.  */
>>> +
>>> +      /* For now we only operate length-based partial vectors on Power,
>>> +	 which has constant VF all the time, we need some tweakings below
>>> +	 if it doesn't hold in future.  */
>>> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
>> 
>> Where do you rely on this?  There didn't seem to be any obvious
>> to_constant uses.  Since this is “only” a cost calculation, we should
>> be using assumed_vf.
>
> Sorry for the confusion.  This was intended for the poly things like
> VF or nitems_per_ctrl which isn't constant during compilation time,
> then get people's attention on the possible runtime cost on things like
> scaling up for nitems_step etc.  But I just realized that the computations
> like the multiply with another constant can operate on the coefficient,
> it looks there is no runtime cost then?  If so, I think I thought too
> much before.  ;-)
>
>>> -      prologue_cost_vec.release ();
>>> -      epilogue_cost_vec.release ();
>>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
>>> +			    NULL, NULL_TREE, 0, vect_prologue);
>>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
>>> +			    NULL, NULL_TREE, 0, vect_body);
>> 
>> IMO this seems to be reproducing too much of the functions that you
>> referred to.  And the danger with that is that they could easily
>> get out of sync later.
>
> Good point!  The original intention was to model as possible as we can,
> to avoid some bad decision due to some unmodeled pieces, like the case
> the loop body is small and some computation become nonnegligible.
> The unsync risks seems also applied for other codes.  How about adding
> some "note" comments in those functions?
>
> The updated v4 is attached by addressing your comments as well as Segher's
> comments.
>
> BR,
> Kewen
> -----
>
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
> 	function.
> 	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
> 	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
> 	modeling for vector with length.
> 	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
> 	function comment.
> 	* tree-vect-stmts.c (vect_gen_len): Likewise.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 009afc5f894..86ef584e09b 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>  
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += num_vectors_m1 + 1;
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      rs6000_adjust_vect_cost_per_loop (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 490e7befea3..9d0e3fc525e 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>  
>     This means that we cannot guarantee that such an induction variable
>     would ever hit a value that produces a set of all-false masks or zero
> -   lengths for RGC.  */
> +   lengths for RGC.
> +
> +   Note that please check cost modeling whether needed to be updated in
> +   function vect_estimate_min_profitable_iters if any changes.  */
>  
>  static tree
>  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 06cde4b1da3..a00160a7f2d 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>        (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>  			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
>      }
> +  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      /* Referring to the functions vect_set_loop_condition_partial_vectors
> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required. Although
> +	 there are some possible optimizations, we consider the worst case
> +	 here.  */
> +
> +      /* For wrap around checking.  */
> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);
> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus cost relative to single
> +	 scalar_stmt. For now we only use length-based partial vectors on
> +	 Power, target specific cost tweaking may be needed for other
> +	 ports in future.  */
> +      unsigned int min_max_cost = 2;
> +      unsigned int shift_cost = 1, minus_cost = 1;

Please instead add a scalar_min_max to vect_cost_for_stmt, and use
scalar_stmt for shift and minus.  There shouldn't be any Power things
hard-coded into target-independent code.

> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prologue_cnt = 0;
> +      unsigned int body_cnt = 0;

Maybe _stmts instead of _cnt, to make it clearer what we're counting.
At first it looked like this was actually the raw cost.  I guess with
the above change we'd also want separate counters for min_max.

> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  {
> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
> +
> +	    /* May need one shift for nitems_total computation.  */
> +	    if (nitems != 1 && !niters_known_p)
> +	      prologue_cnt += shift_cost;
> +
> +	    /* Need to handle wrap around.  */
> +	    if (iv_limit == -1
> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +		    > compare_precision))
> +	      prologue_cnt += (min_max_cost + minus_cost);

I think we should instead split this condition out into a subroutine
that both this function and vect_set_loop_condition_partial_vectors
can use.  It would take the loop_vec_info and rgroup_controls as
arguments.

That means that we might recompute things a few more times, but this
shouldn't be performance-critical.

Thanks,
Richard
Kewen.Lin July 28, 2020, 8:36 a.m. UTC | #2
Hi Richard,

Thanks for the comments!

on 2020/7/27 下午9:40, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:

[snip]

>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>> +      bool need_iterate_p
>> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
>> +
>> +      /* Init min/max, shift and minus cost relative to single
>> +	 scalar_stmt. For now we only use length-based partial vectors on
>> +	 Power, target specific cost tweaking may be needed for other
>> +	 ports in future.  */
>> +      unsigned int min_max_cost = 2;
>> +      unsigned int shift_cost = 1, minus_cost = 1;
> 
> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> scalar_stmt for shift and minus.  There shouldn't be any Power things
> hard-coded into target-independent code.
> 

Agree!  It's not good to leave them there.  I thought to wait and see
if other targets which support vector with length can reuse this, or
move it to target specific codes then if not sharable.  But anyway
it looks not good, let's fix it.

I had some concerns on vect_cost_for_stmt way, since it seems to allow
more computations similar to min/max to be added like this, in long
term it probably leads to the situtation like: scalar_min_max,
scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it
seems not good to maintain.

I noticed that i386 port ix86_add_stmt_cost will check stmt_info->stmt,
whether is assignment and the subcode of the expression, it provides the
chance to check the statement more fine-grain, not just as normal
scalar_stmt/vector_stmt.

For the case here, we don't have the stmt_info, but we know the type
of computation(expression), how about to extend the hook add_stmt_cost
with one extra tree_code type argument, by default it can be some
unmeaningful code, for some needs like here, we specify the tree_code
as the code of computation, like {MIN,MAX}_EXPR, then target specific
add_stmt_cost can check this tree_code and adjust the cost accordingly.

>> +      /* Init cost relative to single scalar_stmt.  */
>> +      unsigned int prologue_cnt = 0;
>> +      unsigned int body_cnt = 0;
> 
> Maybe _stmts instead of _cnt, to make it clearer what we're counting.
> At first it looked like this was actually the raw cost.  I guess with
> the above change we'd also want separate counters for min_max.
> 

Good point, will fix it in v5.

>> +      rgroup_controls *rgc;
>> +      unsigned int num_vectors_m1;
>> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
>> +	if (rgc->type)
>> +	  {
>> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
>> +
>> +	    /* May need one shift for nitems_total computation.  */
>> +	    if (nitems != 1 && !niters_known_p)
>> +	      prologue_cnt += shift_cost;
>> +
>> +	    /* Need to handle wrap around.  */
>> +	    if (iv_limit == -1
>> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
>> +		    > compare_precision))
>> +	      prologue_cnt += (min_max_cost + minus_cost);
> 
> I think we should instead split this condition out into a subroutine
> that both this function and vect_set_loop_condition_partial_vectors
> can use.  It would take the loop_vec_info and rgroup_controls as
> arguments.

OK, will split it in v5.

BR,
Kewen
Richard Sandiford July 31, 2020, 11:03 a.m. UTC | #3
"Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>>> +      bool need_iterate_p
>>> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
>>> +
>>> +      /* Init min/max, shift and minus cost relative to single
>>> +	 scalar_stmt. For now we only use length-based partial vectors on
>>> +	 Power, target specific cost tweaking may be needed for other
>>> +	 ports in future.  */
>>> +      unsigned int min_max_cost = 2;
>>> +      unsigned int shift_cost = 1, minus_cost = 1;
>> 
>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
>> scalar_stmt for shift and minus.  There shouldn't be any Power things
>> hard-coded into target-independent code.
>> 
>
> Agree!  It's not good to leave them there.  I thought to wait and see
> if other targets which support vector with length can reuse this, or
> move it to target specific codes then if not sharable.  But anyway
> it looks not good, let's fix it.
>
> I had some concerns on vect_cost_for_stmt way, since it seems to allow
> more computations similar to min/max to be added like this, in long
> term it probably leads to the situtation like: scalar_min_max,
> scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it
> seems not good to maintain.

I guess doing that doesn't seem so bad to me :-)  I think it's been
a recurring problem that the current classification isn't fine-grained
enough for some cases.

> I noticed that i386 port ix86_add_stmt_cost will check stmt_info->stmt,
> whether is assignment and the subcode of the expression, it provides the
> chance to check the statement more fine-grain, not just as normal
> scalar_stmt/vector_stmt.
>
> For the case here, we don't have the stmt_info, but we know the type
> of computation(expression), how about to extend the hook add_stmt_cost
> with one extra tree_code type argument, by default it can be some
> unmeaningful code, for some needs like here, we specify the tree_code
> as the code of computation, like {MIN,MAX}_EXPR, then target specific
> add_stmt_cost can check this tree_code and adjust the cost accordingly.

If we do that, I guess we should “promote” code_helper out of
gimple-match.h and use that instead, so that we can handle
internal and built-in functions too.

Would like to hear Richard's opinion on the best way forward here.

Thanks,
Richard
Richard Biener July 31, 2020, 11:20 a.m. UTC | #4
On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> >>> +      bool need_iterate_p
> >>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
> >>> +
> >>> +      /* Init min/max, shift and minus cost relative to single
> >>> +    scalar_stmt. For now we only use length-based partial vectors on
> >>> +    Power, target specific cost tweaking may be needed for other
> >>> +    ports in future.  */
> >>> +      unsigned int min_max_cost = 2;
> >>> +      unsigned int shift_cost = 1, minus_cost = 1;
> >>
> >> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> >> scalar_stmt for shift and minus.  There shouldn't be any Power things
> >> hard-coded into target-independent code.
> >>
> >
> > Agree!  It's not good to leave them there.  I thought to wait and see
> > if other targets which support vector with length can reuse this, or
> > move it to target specific codes then if not sharable.  But anyway
> > it looks not good, let's fix it.

In other generic places like this we simply use three generic scalar_stmt
costs.  At least I don't see how min_max should be different from it
when shift can be the same as minus.  Note this is also how we treat
vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
vector cost is one vector_stmt.  As you say below the add_stmt_cost
hook can adjust based on the actual GIMPLE stmt -- if one is
available (which indeed it isn't here).

I'm somewhat lacking context here as well - we actually GIMPLE
code-generate the min/max/shift/minus and only the eventual
AND is defered to the target, right?

> > I had some concerns on vect_cost_for_stmt way, since it seems to allow
> > more computations similar to min/max to be added like this, in long
> > term it probably leads to the situtation like: scalar_min_max,
> > scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it
> > seems not good to maintain.
>
> I guess doing that doesn't seem so bad to me :-)  I think it's been
> a recurring problem that the current classification isn't fine-grained
> enough for some cases.

But we eventually want to get rid of this classification enum in favor
of the add_stmt_cost hook ...

> > I noticed that i386 port ix86_add_stmt_cost will check stmt_info->stmt,
> > whether is assignment and the subcode of the expression, it provides the
> > chance to check the statement more fine-grain, not just as normal
> > scalar_stmt/vector_stmt.
> >
> > For the case here, we don't have the stmt_info, but we know the type
> > of computation(expression), how about to extend the hook add_stmt_cost
> > with one extra tree_code type argument, by default it can be some
> > unmeaningful code, for some needs like here, we specify the tree_code
> > as the code of computation, like {MIN,MAX}_EXPR, then target specific
> > add_stmt_cost can check this tree_code and adjust the cost accordingly.
>
> If we do that, I guess we should “promote” code_helper out of
> gimple-match.h and use that instead, so that we can handle
> internal and built-in functions too.
>
> Would like to hear Richard's opinion on the best way forward here.

I'd say defer this to a later patch and for now simply cost one scalar
stmt for MIN/MAX.  I agree that if we add a tree_code we want a
code_helper instead.  Note that I want to eventually have a
full SLP tree for the final code generation where all info should be
there (including SLP nodes for those min/max ops) and which the
target could traverse.  But I'm not sure if I can make enough progress
on that SLP-only thing for GCC 11 even...

Richard.

> Thanks,
> Richard
Kewen.Lin July 31, 2020, 12:37 p.m. UTC | #5
Hi Richards,

on 2020/7/31 下午7:20, Richard Biener wrote:
> On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>>>>> +      bool need_iterate_p
>>>>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>>>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
>>>>> +
>>>>> +      /* Init min/max, shift and minus cost relative to single
>>>>> +    scalar_stmt. For now we only use length-based partial vectors on
>>>>> +    Power, target specific cost tweaking may be needed for other
>>>>> +    ports in future.  */
>>>>> +      unsigned int min_max_cost = 2;
>>>>> +      unsigned int shift_cost = 1, minus_cost = 1;
>>>>
>>>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
>>>> scalar_stmt for shift and minus.  There shouldn't be any Power things
>>>> hard-coded into target-independent code.
>>>>
>>>
>>> Agree!  It's not good to leave them there.  I thought to wait and see
>>> if other targets which support vector with length can reuse this, or
>>> move it to target specific codes then if not sharable.  But anyway
>>> it looks not good, let's fix it.
> 
> In other generic places like this we simply use three generic scalar_stmt
> costs.  At least I don't see how min_max should be different from it
> when shift can be the same as minus.  Note this is also how we treat

Yeah, normally they (min/max/minus/shift) are taken as scalar_stmt, excepting
for fine-grain tuning like i386 port, they will use the same cost.  On Power9,
to implement min/max it takes double cycles of the normal scalar operations
like add/shift, I was trying to model it more fine-grained since we probably
generate a few min/max here, if the loop body cost is small, I was worried
the decision isn't good enough.  But yeah, in other generic places, the small
loop could also suffer this similar off, they are the same essentially.

> vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
> vector cost is one vector_stmt.  As you say below the add_stmt_cost
> hook can adjust based on the actual GIMPLE stmt -- if one is
> available (which indeed it isn't here).
> 
> I'm somewhat lacking context here as well - we actually GIMPLE
> code-generate the min/max/shift/minus and only the eventual
> AND is defered to the target, right?
> 

Yes, min/max/shift/minus are all GIMPLE code, targets like Power
will have its target specific cost for shift which moves length
to high bits 0:7.

One typical case is as below:

  <bb 3> [local count: 105119324]:
  _26 = n_11(D) * 4;
  _37 = MAX_EXPR <_26, 16>;
  _38 = _37 + 18446744073709551600;
  _40 = MIN_EXPR <_26, 16>;

  <bb 4> [local count: 630715945]:
  # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
  # loop_len_30 = PHI <_40(3), _44(4)>
  _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
  vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
  vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
  _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
  vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
  vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
  vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
  vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
  _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
  .LEN_STORE (_2, 4B, loop_len_30, vect_28);
  _42 = MIN_EXPR <ivtmp_35, _38>;
  _43 = _38 - _42;
  _44 = MIN_EXPR <_43, 16>;
  ivtmp_36 = ivtmp_35 + 16;
  if (ivtmp_35 < _38)
    goto <bb 4>; [83.33%]
  else
    goto <bb 5>; [16.67%]


>>> I had some concerns on vect_cost_for_stmt way, since it seems to allow
>>> more computations similar to min/max to be added like this, in long
>>> term it probably leads to the situtation like: scalar_min_max,
>>> scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it
>>> seems not good to maintain.
>>
>> I guess doing that doesn't seem so bad to me :-)  I think it's been
>> a recurring problem that the current classification isn't fine-grained
>> enough for some cases.
> 
> But we eventually want to get rid of this classification enum in favor
> of the add_stmt_cost hook ...
> 

Nice, sounds like each target has to handle it fine-grain.  :) 
IIUC, the current modeling doesn't consider the instruction dependency and
execution resource etc. like scheduling, even if all costs are fine-grained
enough, the decision could be sub-optimal.

>>> I noticed that i386 port ix86_add_stmt_cost will check stmt_info->stmt,
>>> whether is assignment and the subcode of the expression, it provides the
>>> chance to check the statement more fine-grain, not just as normal
>>> scalar_stmt/vector_stmt.
>>>
>>> For the case here, we don't have the stmt_info, but we know the type
>>> of computation(expression), how about to extend the hook add_stmt_cost
>>> with one extra tree_code type argument, by default it can be some
>>> unmeaningful code, for some needs like here, we specify the tree_code
>>> as the code of computation, like {MIN,MAX}_EXPR, then target specific
>>> add_stmt_cost can check this tree_code and adjust the cost accordingly.
>>
>> If we do that, I guess we should “promote” code_helper out of
>> gimple-match.h and use that instead, so that we can handle
>> internal and built-in functions too.
>>
>> Would like to hear Richard's opinion on the best way forward here.
> 
> I'd say defer this to a later patch and for now simply cost one scalar
> stmt for MIN/MAX.  I agree that if we add a tree_code we want a
> code_helper instead.  Note that I want to eventually have a
> full SLP tree for the final code generation where all info should be
> there (including SLP nodes for those min/max ops) and which the
> target could traverse.  But I'm not sure if I can make enough progress
> on that SLP-only thing for GCC 11 even...
> 

OK, I'm fine to take MIN/MAX as scalar_stmt here.  Thank both of you!
This new SLP framework looks very promising and powerful.  :-)


BR,
Kewen
Richard Biener July 31, 2020, 1:01 p.m. UTC | #6
On Fri, Jul 31, 2020 at 2:37 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richards,
>
> on 2020/7/31 下午7:20, Richard Biener wrote:
> > On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> >>>>> +      bool need_iterate_p
> >>>>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >>>>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
> >>>>> +
> >>>>> +      /* Init min/max, shift and minus cost relative to single
> >>>>> +    scalar_stmt. For now we only use length-based partial vectors on
> >>>>> +    Power, target specific cost tweaking may be needed for other
> >>>>> +    ports in future.  */
> >>>>> +      unsigned int min_max_cost = 2;
> >>>>> +      unsigned int shift_cost = 1, minus_cost = 1;
> >>>>
> >>>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> >>>> scalar_stmt for shift and minus.  There shouldn't be any Power things
> >>>> hard-coded into target-independent code.
> >>>>
> >>>
> >>> Agree!  It's not good to leave them there.  I thought to wait and see
> >>> if other targets which support vector with length can reuse this, or
> >>> move it to target specific codes then if not sharable.  But anyway
> >>> it looks not good, let's fix it.
> >
> > In other generic places like this we simply use three generic scalar_stmt
> > costs.  At least I don't see how min_max should be different from it
> > when shift can be the same as minus.  Note this is also how we treat
>
> Yeah, normally they (min/max/minus/shift) are taken as scalar_stmt, excepting
> for fine-grain tuning like i386 port, they will use the same cost.  On Power9,
> to implement min/max it takes double cycles of the normal scalar operations
> like add/shift, I was trying to model it more fine-grained since we probably
> generate a few min/max here, if the loop body cost is small, I was worried
> the decision isn't good enough.  But yeah, in other generic places, the small
> loop could also suffer this similar off, they are the same essentially.
>
> > vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
> > vector cost is one vector_stmt.  As you say below the add_stmt_cost
> > hook can adjust based on the actual GIMPLE stmt -- if one is
> > available (which indeed it isn't here).
> >
> > I'm somewhat lacking context here as well - we actually GIMPLE
> > code-generate the min/max/shift/minus and only the eventual
> > AND is defered to the target, right?
> >
>
> Yes, min/max/shift/minus are all GIMPLE code, targets like Power
> will have its target specific cost for shift which moves length
> to high bits 0:7.
>
> One typical case is as below:
>
>   <bb 3> [local count: 105119324]:
>   _26 = n_11(D) * 4;
>   _37 = MAX_EXPR <_26, 16>;
>   _38 = _37 + 18446744073709551600;
>   _40 = MIN_EXPR <_26, 16>;
>
>   <bb 4> [local count: 630715945]:
>   # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
>   # loop_len_30 = PHI <_40(3), _44(4)>
>   _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
>   vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
>   vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
>   _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
>   vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
>   vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
>   vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
>   vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
>   _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
>   .LEN_STORE (_2, 4B, loop_len_30, vect_28);
>   _42 = MIN_EXPR <ivtmp_35, _38>;
>   _43 = _38 - _42;
>   _44 = MIN_EXPR <_43, 16>;
>   ivtmp_36 = ivtmp_35 + 16;
>   if (ivtmp_35 < _38)
>     goto <bb 4>; [83.33%]
>   else
>     goto <bb 5>; [16.67%]
>
>
> >>> I had some concerns on vect_cost_for_stmt way, since it seems to allow
> >>> more computations similar to min/max to be added like this, in long
> >>> term it probably leads to the situtation like: scalar_min_max,
> >>> scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it
> >>> seems not good to maintain.
> >>
> >> I guess doing that doesn't seem so bad to me :-)  I think it's been
> >> a recurring problem that the current classification isn't fine-grained
> >> enough for some cases.
> >
> > But we eventually want to get rid of this classification enum in favor
> > of the add_stmt_cost hook ...
> >
>
> Nice, sounds like each target has to handle it fine-grain.  :)
> IIUC, the current modeling doesn't consider the instruction dependency and
> execution resource etc. like scheduling, even if all costs are fine-grained
> enough, the decision could be sub-optimal.

That's what the finish_cost hook is for - the target can collect
all stmts during add_stmt and then apply adjustments at the end
(IIRC power does already for shift operation resource constraints).

> >>> I noticed that i386 port ix86_add_stmt_cost will check stmt_info->stmt,
> >>> whether is assignment and the subcode of the expression, it provides the
> >>> chance to check the statement more fine-grain, not just as normal
> >>> scalar_stmt/vector_stmt.
> >>>
> >>> For the case here, we don't have the stmt_info, but we know the type
> >>> of computation(expression), how about to extend the hook add_stmt_cost
> >>> with one extra tree_code type argument, by default it can be some
> >>> unmeaningful code, for some needs like here, we specify the tree_code
> >>> as the code of computation, like {MIN,MAX}_EXPR, then target specific
> >>> add_stmt_cost can check this tree_code and adjust the cost accordingly.
> >>
> >> If we do that, I guess we should “promote” code_helper out of
> >> gimple-match.h and use that instead, so that we can handle
> >> internal and built-in functions too.
> >>
> >> Would like to hear Richard's opinion on the best way forward here.
> >
> > I'd say defer this to a later patch and for now simply cost one scalar
> > stmt for MIN/MAX.  I agree that if we add a tree_code we want a
> > code_helper instead.  Note that I want to eventually have a
> > full SLP tree for the final code generation where all info should be
> > there (including SLP nodes for those min/max ops) and which the
> > target could traverse.  But I'm not sure if I can make enough progress
> > on that SLP-only thing for GCC 11 even...
> >
>
> OK, I'm fine to take MIN/MAX as scalar_stmt here.  Thank both of you!
> This new SLP framework looks very promising and powerful.  :-)

As do all things that are still on paper^Win my head only ;)

Richard.

>
> BR,
> Kewen
Kewen.Lin July 31, 2020, 1:21 p.m. UTC | #7
on 2020/7/31 下午9:01, Richard Biener wrote:
> On Fri, Jul 31, 2020 at 2:37 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richards,
>>
>> on 2020/7/31 下午7:20, Richard Biener wrote:
>>> On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
>>> <richard.sandiford@arm.com> wrote:
>>>>
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>>>>>>> +      bool need_iterate_p
>>>>>>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>>>>>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
>>>>>>> +
>>>>>>> +      /* Init min/max, shift and minus cost relative to single
>>>>>>> +    scalar_stmt. For now we only use length-based partial vectors on
>>>>>>> +    Power, target specific cost tweaking may be needed for other
>>>>>>> +    ports in future.  */
>>>>>>> +      unsigned int min_max_cost = 2;
>>>>>>> +      unsigned int shift_cost = 1, minus_cost = 1;
>>>>>>
>>>>>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
>>>>>> scalar_stmt for shift and minus.  There shouldn't be any Power things
>>>>>> hard-coded into target-independent code.
>>>>>>
>>>>>
>>>>> Agree!  It's not good to leave them there.  I thought to wait and see
>>>>> if other targets which support vector with length can reuse this, or
>>>>> move it to target specific codes then if not sharable.  But anyway
>>>>> it looks not good, let's fix it.
>>>
>>> In other generic places like this we simply use three generic scalar_stmt
>>> costs.  At least I don't see how min_max should be different from it
>>> when shift can be the same as minus.  Note this is also how we treat
>>
>> Yeah, normally they (min/max/minus/shift) are taken as scalar_stmt, excepting
>> for fine-grain tuning like i386 port, they will use the same cost.  On Power9,
>> to implement min/max it takes double cycles of the normal scalar operations
>> like add/shift, I was trying to model it more fine-grained since we probably
>> generate a few min/max here, if the loop body cost is small, I was worried
>> the decision isn't good enough.  But yeah, in other generic places, the small
>> loop could also suffer this similar off, they are the same essentially.
>>
>>> vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
>>> vector cost is one vector_stmt.  As you say below the add_stmt_cost
>>> hook can adjust based on the actual GIMPLE stmt -- if one is
>>> available (which indeed it isn't here).
>>>
>>> I'm somewhat lacking context here as well - we actually GIMPLE
>>> code-generate the min/max/shift/minus and only the eventual
>>> AND is defered to the target, right?
>>>
>>
>> Yes, min/max/shift/minus are all GIMPLE code, targets like Power
>> will have its target specific cost for shift which moves length
>> to high bits 0:7.
>>
>> One typical case is as below:
>>
>>   <bb 3> [local count: 105119324]:
>>   _26 = n_11(D) * 4;
>>   _37 = MAX_EXPR <_26, 16>;
>>   _38 = _37 + 18446744073709551600;
>>   _40 = MIN_EXPR <_26, 16>;
>>
>>   <bb 4> [local count: 630715945]:
>>   # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
>>   # loop_len_30 = PHI <_40(3), _44(4)>
>>   _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
>>   vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
>>   vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
>>   _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
>>   vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
>>   vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
>>   vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
>>   vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
>>   _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
>>   .LEN_STORE (_2, 4B, loop_len_30, vect_28);
>>   _42 = MIN_EXPR <ivtmp_35, _38>;
>>   _43 = _38 - _42;
>>   _44 = MIN_EXPR <_43, 16>;
>>   ivtmp_36 = ivtmp_35 + 16;
>>   if (ivtmp_35 < _38)
>>     goto <bb 4>; [83.33%]
>>   else
>>     goto <bb 5>; [16.67%]
>>
>>
>>>>> I had some concerns on vect_cost_for_stmt way, since it seems to allow
>>>>> more computations similar to min/max to be added like this, in long
>>>>> term it probably leads to the situtation like: scalar_min_max,
>>>>> scalar_div_expr, scalar_mul_expr ... an enum (cost types) bloat, it
>>>>> seems not good to maintain.
>>>>
>>>> I guess doing that doesn't seem so bad to me :-)  I think it's been
>>>> a recurring problem that the current classification isn't fine-grained
>>>> enough for some cases.
>>>
>>> But we eventually want to get rid of this classification enum in favor
>>> of the add_stmt_cost hook ...
>>>
>>
>> Nice, sounds like each target has to handle it fine-grain.  :)
>> IIUC, the current modeling doesn't consider the instruction dependency and
>> execution resource etc. like scheduling, even if all costs are fine-grained
>> enough, the decision could be sub-optimal.
> 
> That's what the finish_cost hook is for - the target can collect
> all stmts during add_stmt and then apply adjustments at the end
> (IIRC power does already for shift operation resource constraints).
> 

oh, forgot that we have that hook, though I just imagined some generic codes
would do this.  Good to know we have a way to make it.  Nice!

BR,
Kewen
diff mbox series

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..86ef584e09b 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@  rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += num_vectors_m1 + 1;
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@  rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      rs6000_adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 490e7befea3..9d0e3fc525e 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -412,7 +412,10 @@  vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    This means that we cannot guarantee that such an induction variable
    would ever hit a value that produces a set of all-false masks or zero
-   lengths for RGC.  */
+   lengths for RGC.
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 06cde4b1da3..a00160a7f2d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3798,6 +3798,70 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
 			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* Referring to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimizations, we consider the worst case
+	 here.  */
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+	 scalar_stmt. For now we only use length-based partial vectors on
+	 Power, target specific cost tweaking may be needed for other
+	 ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prologue_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+	    /* May need one shift for nitems_total computation.  */
+	    if (nitems != 1 && !niters_known_p)
+	      prologue_cnt += shift_cost;
+
+	    /* Need to handle wrap around.  */
+	    if (iv_limit == -1
+		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
+		    > compare_precision))
+	      prologue_cnt += (min_max_cost + minus_cost);
+
+	    /* Need to handle batch limit excepting for the 1st one.  */
+	    prologue_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+	    /* Need to set up lengths in prologue, only one MIN required
+	       since start index is zero.  */
+	    prologue_cnt += min_max_cost * num_vectors;
+
+	    /* Need to update lengths in body for next iteration.  */
+	    if (need_iterate_p)
+	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	  }
+
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prologue_cnt,
+			    scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
      following ways:
@@ -3932,8 +3996,8 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3960,7 +4024,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -4015,6 +4079,10 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4032,7 +4100,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4044,7 +4112,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 31af46ae19c..8550a252f44 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12090,7 +12090,10 @@  vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
    min_of_start_and_end = min (START_INDEX, END_INDEX);
    left_len = END_INDEX - min_of_start_and_end;
    rhs = min (left_len, LEN_LIMIT);
-   LEN = rhs;  */
+   LEN = rhs;
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 gimple_seq
 vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)