
[v2] vect/rs6000: Support vector with length cost modeling

Message ID a06e714e-04c3-8a2f-fa1d-02a72aecf7f4@linux.ibm.com
State: New
Series: [v2] vect/rs6000: Support vector with length cost modeling

Commit Message

Kewen.Lin July 22, 2020, 1:26 a.m. UTC
Hi Richard,

on 2020/7/21 3:57 PM, Richard Biener wrote:
> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> This patch adds cost modeling for vector with length; it
>> mainly follows what we generate for vector with length in the
>> functions vect_set_loop_controls_directly and vect_gen_len in
>> the worst case.
>>
>> For Power, the length is expected to be in bits 0-7 (the high
>> bits), so we have to model the cost of shifting it into place.
>> To spare other targets this extra cost, I used a target hook to
>> describe it; I'm not sure whether that's the correct way.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1.
>>
>> Any comments/suggestions are highly appreciated!
> 
> I don't like the introduction of an extra target hook for this.  All
> vectorizer cost modeling should ideally go through
> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
> not per stmt then either init_cost or finish_cost is appropriate.
> Currently init_cost only gets a struct loop while we should
> probably give it a vec_info * parameter so targets can
> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
> 

Thanks!  Nice, your suggested way looks better.  I've removed the hook
and taken care of it in finish_cost.  The updated v2 is attached.
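
For reference, the hook shapes involved look roughly like this
(abbreviated; finish_cost as implemented by this patch, init_cost
still taking only the loop, which is what the suggested vec_info *
parameter would change):

  void *targetm.vectorize.init_cost (class loop *loop_info);
  void targetm.vectorize.finish_cost (void *data, unsigned *prologue_cost,
				      unsigned *body_cost,
				      unsigned *epilogue_cost);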

Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1.

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost): New function.
	(rs6000_finish_cost): Call function adjust_vect_cost.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.

Comments

Richard Biener July 22, 2020, 6:38 a.m. UTC | #1
On Wed, Jul 22, 2020 at 3:26 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richard,
>
> on 2020/7/21 3:57 PM, Richard Biener wrote:
> > On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> This patch adds cost modeling for vector with length; it
> >> mainly follows what we generate for vector with length in the
> >> functions vect_set_loop_controls_directly and vect_gen_len in
> >> the worst case.
> >>
> >> For Power, the length is expected to be in bits 0-7 (the high
> >> bits), so we have to model the cost of shifting it into place.
> >> To spare other targets this extra cost, I used a target hook to
> >> describe it; I'm not sure whether that's the correct way.
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
> >> param vect-partial-vector-usage=1.
> >>
> >> Any comments/suggestions are highly appreciated!
> >
> > I don't like the introduction of an extra target hook for this.  All
> > vectorizer cost modeling should ideally go through
> > init_cost/add_stmt_cost/finish_cost.  If the extra costing is
> > not per stmt then either init_cost or finish_cost is appropriate.
> > Currently init_cost only gets a struct loop while we should
> > probably give it a vec_info * parameter so targets can
> > check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
> >
>
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.

LGTM (assuming the first larger hunk is mostly re-indenting
under LOOP_VINFO_USING_PARTIAL_VECTORS_P).

Thanks,
Richard.

> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * config/rs6000/rs6000.c (adjust_vect_cost): New function.
>         (rs6000_finish_cost): Call function adjust_vect_cost.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>         modeling for vector with length.
Kewen.Lin July 22, 2020, 7:08 a.m. UTC | #2
Hi Richard,

on 2020/7/22 2:38 PM, Richard Biener wrote:
> On Wed, Jul 22, 2020 at 3:26 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richard,
>>
>> on 2020/7/21 3:57 PM, Richard Biener wrote:
>>> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
> >>>> This patch adds cost modeling for vector with length; it
> >>>> mainly follows what we generate for vector with length in the
> >>>> functions vect_set_loop_controls_directly and vect_gen_len in
> >>>> the worst case.
> >>>>
> >>>> For Power, the length is expected to be in bits 0-7 (the high
> >>>> bits), so we have to model the cost of shifting it into place.
> >>>> To spare other targets this extra cost, I used a target hook to
> >>>> describe it; I'm not sure whether that's the correct way.
>>>>
>>>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>>>> param vect-partial-vector-usage=1.
>>>>
>>>> Any comments/suggestions are highly appreciated!
>>>
>>> I don't like the introduction of an extra target hook for this.  All
>>> vectorizer cost modeling should ideally go through
>>> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
>>> not per stmt then either init_cost or finish_cost is appropriate.
>>> Currently init_cost only gets a struct loop while we should
>>> probably give it a vec_info * parameter so targets can
>>> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>>>
>>
>> Thanks!  Nice, your suggested way looks better.  I've removed the hook
>> and taken care of it in finish_cost.  The updated v2 is attached.
>>
>> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1.
> 
> LGTM (assuming the first larger hunk is mostly re-indenting
> under LOOP_VINFO_USING_PARTIAL_VECTORS_P).

Thanks for the review!  Yes, for the original LOOP_VINFO_FULLY_MASKED_P
hunk, this patch moves the handling of gap peeling so that it is shared
between masking and length, and re-indents the remaining masking-specific
code under an inner LOOP_VINFO_FULLY_MASKED_P check.  The length-specific
code goes into the else arm.  It shouldn't change anything for masking;
I'll run aarch64 regression testing to make sure.  :)
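
In outline, the restructured hunk looks like this (a simplified
sketch of the shape in v2):

  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;

      /* Shared between masking and length: cost the single extra
	 epilogue iteration needed when peeling for gaps.  */
      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
	{
	  peel_iters_epilogue += 1;
	  /* ... plus its scalar iteration costs ...  */
	}

      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
	{
	  /* Masking-specific costing, re-indented but unchanged.  */
	}
      else
	{
	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
	  /* New length-specific costing.  */
	}
    }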

BR,
Kewen
Richard Sandiford July 22, 2020, 9:11 a.m. UTC | #3
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/21 3:57 PM, Richard Biener wrote:
>> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi,
>>>
>>> This patch adds cost modeling for vector with length; it
>>> mainly follows what we generate for vector with length in the
>>> functions vect_set_loop_controls_directly and vect_gen_len in
>>> the worst case.
>>>
>>> For Power, the length is expected to be in bits 0-7 (the high
>>> bits), so we have to model the cost of shifting it into place.
>>> To spare other targets this extra cost, I used a target hook to
>>> describe it; I'm not sure whether that's the correct way.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>>> param vect-partial-vector-usage=1.
>>>
>>> Any comments/suggestions are highly appreciated!
>> 
>> I don't like the introduction of an extra target hook for this.  All
>> vectorizer cost modeling should ideally go through
>> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
>> not per stmt then either init_cost or finish_cost is appropriate.
>> Currently init_cost only gets a struct loop while we should
>> probably give it a vec_info * parameter so targets can
>> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>> 
>
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (adjust_vect_cost): New function.
> 	(rs6000_finish_cost): Call function adjust_vect_cost.
> 	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
> 	modeling for vector with length.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 5a4f07d5810..f2724e792c9 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>  
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += (num_vectors_m1 + 1);
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index e933441b922..99e1fd7bdd0 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>       TODO: Build an expression that represents peel_iters for prologue and
>       epilogue to be used in a run-time test.  */
>  
> -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
>      {
>        peel_iters_prologue = 0;
>        peel_iters_epilogue = 0;
> @@ -3663,45 +3663,145 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>  	  peel_iters_epilogue += 1;
>  	  stmt_info_for_cost *si;
>  	  int j;
> -	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
> -			    j, si)
> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
> +			    si)
>  	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
>  				  si->kind, si->stmt_info, si->vectype,
>  				  si->misalign, vect_epilogue);
>  	}
>  
> -      /* Calculate how many masks we need to generate.  */
> -      unsigned int num_masks = 0;
> -      rgroup_controls *rgm;
> -      unsigned int num_vectors_m1;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> -	if (rgm->type)
> -	  num_masks += num_vectors_m1 + 1;
> -      gcc_assert (num_masks > 0);
> -
> -      /* In the worst case, we need to generate each mask in the prologue
> -	 and in the loop body.  One of the loop body mask instructions
> -	 replaces the comparison in the scalar loop, and since we don't
> -	 count the scalar comparison against the scalar body, we shouldn't
> -	 count that vector instruction against the vector body either.
> -
> -	 Sometimes we can use unpacks instead of generating prologue
> -	 masks and sometimes the prologue mask will fold to a constant,
> -	 so the actual prologue cost might be smaller.  However, it's
> -	 simpler and safer to use the worst-case cost; if this ends up
> -	 being the tie-breaker between vectorizing or not, then it's
> -	 probably better not to vectorize.  */
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, num_masks, vector_stmt,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, num_masks - 1, vector_stmt,
> -			    NULL, NULL_TREE, 0, vect_body);
> -    }
> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> -    {
> -      peel_iters_prologue = 0;
> -      peel_iters_epilogue = 0;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +	{
> +	  /* Calculate how many masks we need to generate.  */
> +	  unsigned int num_masks = 0;
> +	  rgroup_controls *rgm;
> +	  unsigned int num_vectors_m1;
> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> +	    if (rgm->type)
> +	      num_masks += num_vectors_m1 + 1;
> +	  gcc_assert (num_masks > 0);
> +
> +	  /* In the worst case, we need to generate each mask in the prologue
> +	     and in the loop body.  One of the loop body mask instructions
> +	     replaces the comparison in the scalar loop, and since we don't
> +	     count the scalar comparison against the scalar body, we shouldn't
> +	     count that vector instruction against the vector body either.
> +
> +	     Sometimes we can use unpacks instead of generating prologue
> +	     masks and sometimes the prologue mask will fold to a constant,
> +	     so the actual prologue cost might be smaller.  However, it's
> +	     simpler and safer to use the worst-case cost; if this ends up
> +	     being the tie-breaker between vectorizing or not, then it's
> +	     probably better not to vectorize.  */
> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
> +				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
> +				vector_stmt, NULL, NULL_TREE, 0, vect_body);
> +	}
> +      else
> +	{
> +	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
> +
> +	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
> +	  if (npeel < 0)
> +	    {
> +	      peel_iters_prologue = assumed_vf / 2;
> +	      /* See below, if peeled iterations are unknown, count a taken
> +		 branch and a not taken branch per peeled loop.  */
> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				    cond_branch_taken, NULL, NULL_TREE, 0,
> +				    vect_prologue);
> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				    cond_branch_not_taken, NULL, NULL_TREE, 0,
> +				    vect_prologue);
> +	    }
> +	  else
> +	    {
> +	      peel_iters_prologue = npeel;
> +	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +		/* See vect_get_known_peeling_cost, if peeled iterations are
> +		   known but number of scalar loop iterations are unknown, count
> +		   a taken branch per peeled loop.  */
> +		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				      cond_branch_taken, NULL, NULL_TREE, 0,
> +				      vect_prologue);
> +	    }

I think it'd be good to avoid duplicating this.  How about the
following structure?

  if (vect_use_loop_mask_for_alignment_p (…))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    {
      … // A
    }
  else
    {
      …vect_get_known_peeling_cost stuff…
    }

but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:

  LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0

for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
whatever value we'd normally use.  Then wrap:

      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
			    NULL, NULL_TREE, 0, vect_epilogue);
      (void) add_stmt_cost (loop_vinfo,
			    target_cost_data, 1, cond_branch_not_taken,
			    NULL, NULL_TREE, 0, vect_epilogue);

in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
stuff in A conditional on peel_iters_epilogue != 0.

This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
code:

      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
	{
	  /* We need to peel exactly one iteration.  */
	  peel_iters_epilogue += 1;
	  stmt_info_for_cost *si;
	  int j;
	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
			    j, si)
	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
				  si->kind, si->stmt_info, si->vectype,
				  si->misalign, vect_epilogue);
	}

Then, after the above, have:

  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    …add costs for mask overhead…
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    …add costs for lengths overhead…

So we'd have one block of code for estimating the prologue and epilogue
peeling cost, and a separate block of code for the loop control overhead.
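
Putting those pieces together, the overall shape would be roughly
(an untested sketch; A stands for the same block as above):

  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    {
      … // A, but with peel_iters_epilogue forced to
	// LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0
	// when LOOP_VINFO_USING_PARTIAL_VECTORS_P, the epilogue
	// branch costs guarded by !LOOP_VINFO_USING_PARTIAL_VECTORS_P,
	// and the other vect_epilogue costs guarded by
	// peel_iters_epilogue != 0
    }
  else
    {
      … // vect_get_known_peeling_cost stuff, with the same
	// peel_iters_epilogue treatment
    }

  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    … // add costs for mask overhead
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    … // add costs for lengths overhead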

Thanks,
Richard
Segher Boessenkool July 22, 2020, 5:49 p.m. UTC | #4
Hi!

On Wed, Jul 22, 2020 at 09:26:39AM +0800, Kewen.Lin wrote:
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{

Maybe call it rs6000_adjust_vect_cost?  For consistency, but also
because it could (in the future) collide with a global function of
the same name (it is a very non-specific name).

> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += (num_vectors_m1 + 1);

That doesn't need parentheses.
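
I.e. just:

	  shift_cnt += num_vectors_m1 + 1;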

>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }

^^^ consistency :-)

The rs6000 parts are fine for trunk, thanks!


Segher
Kewen.Lin July 27, 2020, 3:44 a.m. UTC | #5
Hi Segher,

Thanks for the comments!

on 2020/7/23 1:49 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jul 22, 2020 at 09:26:39AM +0800, Kewen.Lin wrote:
>> +/* For some target specific vectorization cost which can't be handled per stmt,
>> +   we check the requisite conditions and adjust the vectorization cost
>> +   accordingly if satisfied.  One typical example is to model shift cost for
>> +   vector with length by counting number of required lengths under condition
>> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
>> +
>> +static void
>> +adjust_vect_cost (rs6000_cost_data *data)
>> +{
> 
> Maybe call it rs6000_adjust_vect_cost?  For consistency, but also
> because it could (in the future) collide with a global function of
> the same name (it is a very non-specific name).

Done in v4, used rs6000_adjust_vect_cost_per_loop.

> 
>> +	  /* Each length needs one shift to fill into bits 0-7.  */
>> +	  shift_cnt += (num_vectors_m1 + 1);
> 
> That doesn't need parentheses.

Done in v4.

> 
>>    if (cost_data->loop_info)
>> -    rs6000_density_test (cost_data);
>> +    {
>> +      adjust_vect_cost (cost_data);
>> +      rs6000_density_test (cost_data);
>> +    }
> 
> ^^^ consistency :-)
> 
> The rs6000 parts are fine for trunk, thanks!

Thanks!

BR,
Kewen

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5a4f07d5810..f2724e792c9 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@  rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@  rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..99e1fd7bdd0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3652,7 +3652,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       peel_iters_prologue = 0;
       peel_iters_epilogue = 0;
@@ -3663,45 +3663,145 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	  peel_iters_epilogue += 1;
 	  stmt_info_for_cost *si;
 	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
 	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
 				  si->kind, si->stmt_info, si->vectype,
 				  si->misalign, vect_epilogue);
 	}
 
-      /* Calculate how many masks we need to generate.  */
-      unsigned int num_masks = 0;
-      rgroup_controls *rgm;
-      unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
-	if (rgm->type)
-	  num_masks += num_vectors_m1 + 1;
-      gcc_assert (num_masks > 0);
-
-      /* In the worst case, we need to generate each mask in the prologue
-	 and in the loop body.  One of the loop body mask instructions
-	 replaces the comparison in the scalar loop, and since we don't
-	 count the scalar comparison against the scalar body, we shouldn't
-	 count that vector instruction against the vector body either.
-
-	 Sometimes we can use unpacks instead of generating prologue
-	 masks and sometimes the prologue mask will fold to a constant,
-	 so the actual prologue cost might be smaller.  However, it's
-	 simpler and safer to use the worst-case cost; if this ends up
-	 being the tie-breaker between vectorizing or not, then it's
-	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
-    }
-  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
-    {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  /* Calculate how many masks we need to generate.  */
+	  unsigned int num_masks = 0;
+	  rgroup_controls *rgm;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+	    if (rgm->type)
+	      num_masks += num_vectors_m1 + 1;
+	  gcc_assert (num_masks > 0);
+
+	  /* In the worst case, we need to generate each mask in the prologue
+	     and in the loop body.  One of the loop body mask instructions
+	     replaces the comparison in the scalar loop, and since we don't
+	     count the scalar comparison against the scalar body, we shouldn't
+	     count that vector instruction against the vector body either.
+
+	     Sometimes we can use unpacks instead of generating prologue
+	     masks and sometimes the prologue mask will fold to a constant,
+	     so the actual prologue cost might be smaller.  However, it's
+	     simpler and safer to use the worst-case cost; if this ends up
+	     being the tie-breaker between vectorizing or not, then it's
+	     probably better not to vectorize.  */
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+				vector_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
+      else
+	{
+	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+
+	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
+	  if (npeel < 0)
+	    {
+	      peel_iters_prologue = assumed_vf / 2;
+	      /* See below, if peeled iterations are unknown, count a taken
+		 branch and a not taken branch per peeled loop.  */
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_not_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	    }
+	  else
+	    {
+	      peel_iters_prologue = npeel;
+	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+		/* See vect_get_known_peeling_cost, if peeled iterations are
+		   known but number of scalar loop iterations are unknown, count
+		   a taken branch per peeled loop.  */
+		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				      cond_branch_taken, NULL, NULL_TREE, 0,
+				      vect_prologue);
+	    }
+
+	  stmt_info_for_cost *si;
+	  int j;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
+	    (void) add_stmt_cost (loop_vinfo, target_cost_data,
+				  si->count * peel_iters_prologue, si->kind,
+				  si->stmt_info, si->vectype, si->misalign,
+				  vect_prologue);
+
+	  /* Refer to the functions vect_set_loop_condition_partial_vectors
+	     and vect_set_loop_controls_directly, we need to generate each
+	     length in the prologue and in the loop body if required.  Although
+	     there are some possible optimization, we consider the worst case
+	     here.  */
+
+	  /* For now we only operate length-based partial vectors on Power,
+	     which has constant VF all the time, we need some tweakings below
+	     if it doesn't hold in future.  */
+	  gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+	  /* For wrap around checking.  */
+	  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  unsigned int compare_precision = TYPE_PRECISION (compare_type);
+	  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+	  bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+	  bool need_iterate_p
+	    = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+	  /* Init min/max, shift and minus cost relative to single scalar_stmt.
+	     For now we only use length-based partial vectors on Power, target
+	     specific cost tweaking may be needed for other ports in future.  */
+	  unsigned int min_max_cost = 2;
+	  unsigned int shift_cost = 1, minus_cost = 1;
+
+	  /* Init cost relative to single scalar_stmt.  */
+	  unsigned int prol_cnt = 0;
+	  unsigned int body_cnt = 0;
+
+	  rgroup_controls *rgc;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	    if (rgc->type)
+	      {
+		unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+		/* Need one shift for niters_total computation.  */
+		if (!niters_known_p && nitems != 1)
+		  prol_cnt += shift_cost;
+
+		/* Need to handle wrap around.  */
+		if (iv_limit == -1
+		    || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+			> compare_precision))
+		  prol_cnt += (min_max_cost + minus_cost);
+
+		/* Need to handle batch limit excepting for the 1st one.  */
+		prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+		unsigned int num_vectors = num_vectors_m1 + 1;
+		/* Need to set up lengths in prologue, only one MIN required
+		   since start index is zero.  */
+		prol_cnt += min_max_cost * num_vectors;
+
+		/* Need to update lengths in body for next iteration.  */
+		if (need_iterate_p)
+		  body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	      }
+
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
     }
   else if (npeel < 0)
     {
@@ -3913,8 +4013,8 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4041,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4096,10 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4117,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4129,7 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost