diff mbox series

[5/7,v6] vect: Support vector load/store with length in vectorizer

Message ID a3d93431-8f55-9fa7-b2d6-537fb2610d0d@linux.ibm.com
State New
Headers show
Series None | expand

Commit Message

Kewen.Lin June 29, 2020, 6:33 a.m. UTC
Hi,

v6 changes against v5:
  - As len_load/store optab changes, added function can_vec_len_load_store_p
    and vect_get_same_size_vec_for_len.
  - Updated several places like vectoriable_load/store for optab changes.

v5 changes against v4:
  - Updated the conditions of clearing LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P
    in vectorizable_condition (which fixed the aarch reg failure).
  - Rebased and updated some macro and function names as the
    renaming/refactoring patch.
  - Updated some comments and dumpings.

v4 changes against v3:
  - split out some renaming and refactoring.
  - use QImode for length.
  - update the iv type determination.
  - introduce factor into rgroup_controls.
  - use using_partial_vectors_p for both approaches.

Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu P9.
Even with explicit vect-with-length-scope settings 1/2, I didn't find
any remarkable failures (only some trivial test case issues).

Is it ok for trunk?

BR,
Kewen
----
gcc/ChangeLog

	* doc/invoke.texi (vect-with-length-scope): Document new option.
	* optabs-query.c (can_vec_len_load_store_p): New function.
	* optabs-query.h (can_vec_len_load_store_p): New declare.
	* params.opt (vect-with-length-scope): New.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Add the
	handlings for vectorization using length-based partial vectors, call
	vect_gen_len for length generation.
	(vect_set_loop_condition_partial_vectors): Add the handlings for
	vectorization using length-based partial vectors.
	(vect_do_peeling): Allow remaining eiters less than epilogue vf for
	LOOP_VINFO_USING_PARTIAL_VECTORS_P.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Init
	epil_using_partial_vectors_p.
	(_loop_vec_info::~_loop_vec_info): Call release_vec_loop_controls
	for lengths destruction.
	(vect_verify_loop_lens): New function.
	(vect_analyze_loop_2): Add the check to allow only one vectorization
	approach using partial vectorization at the same time.  Check
	loop-wide reasons using length-based partial vectors decision.  Mark
	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P if the epilogue is
	considerable to use length-based approach.  Call
	release_vec_loop_controls for lengths destruction.
	(vect_analyze_loop): Add handlings for epilogue of loop when it's
	marked to use vectorization using partial vectors.
	(vect_estimate_min_profitable_iters): Adjust for loop vectorization
	using length-based partial vectors.
	(vect_record_loop_mask): Init factor to 1 for vectorization using
	mask-based partial vectors.
	(vect_record_loop_len): New function.
	(vect_get_loop_len): New function.
	* tree-vect-stmts.c (check_load_store_for_partial_vectors): Add
	checks for vectorization using length-based partial vectors.
	(vect_get_same_size_vec_for_len): New function.
	(vectorizable_store): Add handlings when using length-based partial
	vectors.
	(vectorizable_load): Likewise.
	(vectorizable_condition): Add some checks to disable vectorization
	using partial vectors for reduction.
	(vect_gen_len): New function.
	* tree-vectorizer.h (struct rgroup_controls): Add field factor
	mainly for length-based partial vectors.
	(vec_loop_lens): New typedef.
	(_loop_vec_info): Add lens and epil_using_partial_vectors_p.
	(LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P): New macro.
	(LOOP_VINFO_LENS): Likewise.
	(LOOP_VINFO_FULLY_WITH_LENGTH_P): Likewise.
	(vect_record_loop_len): New declare.
	(vect_get_loop_len): Likewise.
	(vect_gen_len): Likewise.

Comments

Richard Sandiford June 30, 2020, 7:53 p.m. UTC | #1
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 06a04e3d7dd..284c15705ea 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -13389,6 +13389,13 @@ by the copy loop headers pass.
>  @item vect-epilogues-nomask
>  Enable loop epilogue vectorization using smaller vector size.
>  
> +@item vect-with-length-scope

In principle there's nothing length-specific about this option.
We could do the same for masks or for any future loop control
mechanism.  So how about vect-partial-vector-usage instead?

> +Control the scope of vector memory access with length exploitation.  0 means we
> +don't expliot any vector memory access with length, 1 means we only exploit
> +vector memory access with length for those loops whose iteration number are
> +less than VF, such as very small loop or epilogue, 2 means we want to exploit
> +vector memory access with length for any loops if possible.

Maybe:

  Controls when the loop vectorizer considers using partial vector loads
  and stores as an alternative to falling back to scalar code.  0 stops
  the vectorizer from ever using partial vector loads and stores.  1 allows
  partial vector loads and stores if vectorization removes the need for the
  code to iterate.  2 allows partial vector loads and stores in all loops.
  The parameter only has an effect on targets that support partial
  vector loads and stores.
  
> diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
> index 215d68e4225..9c351759204 100644
> --- a/gcc/optabs-query.c
> +++ b/gcc/optabs-query.c
> @@ -606,6 +606,60 @@ can_vec_mask_load_store_p (machine_mode mode,
>    return false;
>  }
>  
> +/* Return true if target supports vector load/store with length for vector
> +   mode MODE.  There are two flavors for vector load/store with length, one
> +   is to measure length with bytes, the other is to measure length with lanes.
> +   As len_{load,store} optabs point out, for the flavor with bytes, we use
> +   VnQI to wrap the other supportable same size vector modes.  Here the
> +   pointer FACTOR is to indicate that it is using VnQI to wrap if its value
> +   more than 1 and how many bytes for one element of wrapped vector mode.  */
> +
> +bool
> +can_vec_len_load_store_p (machine_mode mode, bool is_load, unsigned int *factor)
> +{
> +  optab op = is_load ? len_load_optab : len_store_optab;
> +  gcc_assert (VECTOR_MODE_P (mode));
> +
> +  /* Check if length in lanes supported for this mode directly.  */
> +  if (direct_optab_handler (op, mode))
> +    {
> +      *factor = 1;
> +      return true;
> +    }
> +
> +  /* Check if length in bytes supported for VnQI with the same vector size.  */
> +  scalar_mode emode = QImode;
> +  poly_uint64 esize = GET_MODE_SIZE (emode);

This is always equal to 1, so…

> +  poly_uint64 vsize = GET_MODE_SIZE (mode);
> +  poly_uint64 nunits;
> +
> +  /* To get how many nunits it would have if the element is QImode.  */
> +  if (multiple_p (vsize, esize, &nunits))
> +    {

…we can just set nunits to GET_MODE_SIZE (mode).

> +      machine_mode vmode;
> +      /* Check whether the related VnQI vector mode exists, as well as
> +	 optab supported.  */
> +      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
> +	  && direct_optab_handler (op, vmode))
> +	{
> +	  unsigned int mul;
> +	  scalar_mode orig_emode = GET_MODE_INNER (mode);
> +	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
> +
> +	  if (constant_multiple_p (orig_esize, esize, &mul))
> +	    *factor = mul;
> +	  else
> +	    gcc_unreachable ();

This is just:

	  *factor = GET_MODE_UNIT_SIZE (mode);

However, I think it would be better to return the vector mode that the
load or store should use, instead of this factor.  That way we can reuse
it when generating the load and store statements.

So maybe call the function get_len_load_store_mode and return an
opt_machine_mode.

> +
> +	  return true;
> +	}
> +    }
> +  else
> +    gcc_unreachable ();
> +
> +  return false;
> +}
> +
>  /* Return true if there is a compare_and_swap pattern.  */
>  
>  bool
> […]
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 9b564bb046c..daa6e8a2beb 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>  
> +-param=vect-with-length-scope=
> +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
> +Control the vector with length exploitation scope.

Think this should be a bit more descriptive, at least saying what the
three values are (but in a more abbreviated form than the .texi above).

I think the default should be 2, with targets actively turning it down
where necessary.  That way, the decision to turn it down is more likely
to have a comment explaining why.

> […]
> @@ -422,10 +423,20 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  {
>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>    tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);

IMO just “use_masks_p” would be more readable.  Same later on.

> +
>    tree ctrl_type = rgc->type;
> -  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
> +  /* Scale up nscalars per iteration with factor.  */
> +  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;

Maybe “scaled_nscalars_per_iter”?  Not sure the comment really adds
anything here.

Or maybe “nitems_per_iter”, to keep the names shorter?

>    poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);

Maybe worth inserting a scaled_nscalars_per_ctrl or nitems_per_ctrl
here, since it's used in two places below (length_limit and as
batch_nscalars_ft).

>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  tree length_limit = NULL_TREE;
> +  /* For length, we need length_limit to check length in range.  */
> +  if (!vect_for_masking)
> +    {
> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
> +      length_limit = build_int_cst (compare_type, len_limit);
> +    }
>  
>    /* Calculate the maximum number of scalar values that the rgroup
>       handles in total, the number that it handles for each iteration
> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>    tree nscalars_total = niters;
>    tree nscalars_step = build_int_cst (iv_type, vf);
>    tree nscalars_skip = niters_skip;
> -  if (nscalars_per_iter != 1)
> +  if (nscalars_per_iter_ft != 1)
>      {
>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>  	 these multiplications don't overflow.  */
> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>  				     nscalars_total, compare_factor);
>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  	     NSCALARS_SKIP to that cannot overflow.  */
>  	  tree const_limit = build_int_cst (compare_type,
>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> -					    * nscalars_per_iter);
> +					    * nscalars_per_iter_ft);
>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>  				      nscalars_total, const_limit);
>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,

It looks odd that we don't need to adjust the other nscalars_* values too.
E.g. the above seems to be comparing an unscaled nscalars_total with
a scaled nscalars_per_iter.  I think the units ought to “agree”,
both here and in the rest of the function.

> […]
> @@ -617,16 +638,32 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>  				      init_ctrl, unskipped_mask);
>  	  else
>  	    init_ctrl = unskipped_mask;
> +	  gcc_assert (vect_for_masking);

I think this ought to go at the beginning of the { … } block,
rather than the end.

>  	}
>  
> +      /* First iteration is full.  */

This comment belongs inside the “if”.

>        if (!init_ctrl)
> -	/* First iteration is full.  */
> -	init_ctrl = build_minus_one_cst (ctrl_type);
> +	{
> +	  if (vect_for_masking)
> +	    init_ctrl = build_minus_one_cst (ctrl_type);
> +	  else
> +	    init_ctrl = length_limit;
> +	}
>  
> […]
> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    if (vect_epilogues
>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && prolog_peeling >= 0
> -      && known_eq (vf, lowest_vf))
> +      && known_eq (vf, lowest_vf)
> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>      {
>        unsigned HOST_WIDE_INT eiters
>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)

I'm still not really convinced that this check is right.  It feels
like it's hiding a problem elsewhere.

> […]
> @@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>    return true;
>  }
>  
> +/* Check whether we can use vector access with length based on precison
> +   comparison.  So far, to keep it simple, we only allow the case that the
> +   precision of the target supported length is larger than the precision
> +   required by loop niters.  */
> +
> +static bool
> +vect_verify_loop_lens (loop_vec_info loop_vinfo)
> +{
> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +
> +  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
> +    return false;
> +
> +  /* The one which has the largest NV should have max bytes per iter.  */
> +  rgroup_controls *rgl = &(*lens)[lens->length () - 1];

“lens->last ()”.  Using a reference feels more natural here.

> +
> +  /* Work out how many bits we need to represent the length limit.  */
> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;

I think this breaks the abstraction.  There's no guarantee that the
factor is the same for each rgroup_control, so there's no guarantee
that the maximum bytes per iter comes the last entry.  (Also, it'd
be better to avoid talking about bytes if we're trying to be general.)
I think we should take the maximum of each entry instead.

> +  unsigned int min_ni_prec
> +    = vect_min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
> +
> +  /* Now use the maximum of below precisions for one suitable IV type:
> +     - the IV's natural precision
> +     - the precision needed to hold: the maximum number of scalar
> +       iterations multiplied by the scale factor (min_ni_prec above)
> +     - the Pmode precision
> +  */
> +
> +  /* If min_ni_width is less than the precision of the current niters,

min_ni_prec

> +     we perfer to still use the niters type.  */
> +  unsigned int ni_prec
> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
> +
> +  unsigned int required_prec = ni_prec;
> +  if (required_prec < pmode_prec)
> +    required_prec = pmode_prec;
> +
> +  tree iv_type = NULL_TREE;
> +  if (min_ni_prec > required_prec)
> +    {

Do we need this condition?  Looks like we could just do:

  min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
  min_ni_prec = MAX (min_ni_prec, ni_prec);

and then run the loop below.

> +      opt_scalar_int_mode tmode_iter;
> +      unsigned standard_bits = 0;
> +      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
> +      {
> +	scalar_mode tmode = tmode_iter.require ();
> +	unsigned int tbits = GET_MODE_BITSIZE (tmode);
> +
> +	/* ??? Do we really want to construct one IV whose precision exceeds
> +	   BITS_PER_WORD?  */
> +	if (tbits > BITS_PER_WORD)
> +	  break;
> +
> +	/* Find the first available standard integral type.  */
> +	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
> +	  {
> +	    standard_bits = tbits;
> +	    break;
> +	  }
> +      }
> +      if (standard_bits != 0)
> +	iv_type = build_nonstandard_integer_type (standard_bits, true);

I don't think there's any need for “standard_bits” here, we can just
set “iv_type” directly before breaking.

> +    }
> +  else
> +    iv_type = build_nonstandard_integer_type (required_prec, true);
> +
> +  if (!iv_type)
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't vectorize with length-based partial vectors"
> +			 " due to no suitable iv type.\n");
> +      return false;
> +    }
> +
> +  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
> +  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
> +
> +  return true;
> +}
> +
>  /* Calculate the cost of one scalar iteration of the loop.  */
>  static void
>  vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
> @@ -2170,11 +2254,64 @@ start_over:
>        return ok;
>      }
>  
> -  /* Decide whether to use a fully-masked loop for this vectorization
> -     factor.  */
> -  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> -    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -       && vect_verify_full_masking (loop_vinfo));
> +  /* For now, we don't expect to mix both masking and length approaches for one
> +     loop, disable it if both are recorded.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
> +      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't vectorize a loop with partial vectors"
> +			 " because we don't expect to mix different"
> +			 " approaches with partial vectors for the"
> +			 " same loop.\n");
> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +
> +  /* Decide whether to vectorize a loop with partial vectors for
> +     this vectorization factor.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      /* Decide whether to use fully-masked approach.  */
> +      if (vect_verify_full_masking (loop_vinfo))
> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +      /* Decide whether to use length-based approach.  */
> +      else if (vect_verify_loop_lens (loop_vinfo))
> +	{
> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
> +	    {
> +	      if (dump_enabled_p ())
> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +				 "can't vectorize this loop with length-based"
> +				 " partial vectors approach becuase peeling"
> +				 " for alignment or gaps is required.\n");
> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	    }

Why are these peeling cases necessary?  Peeling for gaps should
just mean subtracting one scalar iteration from the iteration count
and shouldn't otherwise affect the main loop.  Similarly, peeling for
alignment can be handled in the normal way, with a scalar prologue loop.

> +	  else if (param_vect_with_length_scope == 0)
> +	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;

As above, I don't think this should be length-specific.  Same for the
== 1 handling, which we could do afterwards.

> +	  /* The epilogue and other known niters less than VF
> +	    cases can still use vector access with length fully.  */
> +	  else if (param_vect_with_length_scope == 1
> +		   && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +		   && !vect_known_niters_smaller_than_vf (loop_vinfo))
> +	    {
> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +	    }
> +	  else
> +	    {
> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> +	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;

Think it's better to leave this last line out, otherwise it raises
the question why we don't set it to false elsewhere as well.

> +	    }
> +	}
> +      else
> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +  else
> +    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +
>    if (dump_enabled_p ())
>      {
>        if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> @@ -2183,6 +2320,15 @@ start_over:
>        else
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "not using a fully-masked loop.\n");
> +
> +      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "using length-based partial"
> +			 " vectors for loop fully.\n");
> +      else
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "not using length-based partial"
> +			 " vectors for loop fully.\n");

Think just one message for all three cases is better, perhaps with

  "operating only on full vectors.\n"

instead of "not using a fully-masked loop.\n".  Might need some
testsuite updates though -- probably worth splitting the wording
change out into a separate patch if so.

>      }
>  
>    /* If epilog loop is required because of data accesses with gaps,
> @@ -8249,6 +8423,63 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
>    return mask;
>  }
>  
> +/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
> +   lengths for vector access with length that each control a vector of type
> +   VECTYPE.  FACTOR is only meaningful for length in bytes, and to indicate
> +   how many bytes for each element (lane).  */

Maybe:

/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
   lengths for controlling an operation on VECTYPE.  The operation splits
   each element of VECTYPE into FACTOR separate subelements, measuring
   the length as a number of these subelements.  */

> +
> +void
> +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> +		      unsigned int nvectors, tree vectype, unsigned int factor)
> +{
> +  gcc_assert (nvectors != 0);
> +  if (lens->length () < nvectors)
> +    lens->safe_grow_cleared (nvectors);
> +  rgroup_controls *rgl = &(*lens)[nvectors - 1];
> +
> +  /* The number of scalars per iteration, scalar occupied bytes and
> +     the number of vectors are both compile-time constants.  */
> +  unsigned int nscalars_per_iter
> +    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
> +		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
> +
> +  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
> +    {
> +      rgl->max_nscalars_per_iter = nscalars_per_iter;
> +      rgl->type = vectype;
> +      rgl->factor = factor;
> +    }

This is dangerous because it ignores “factor” otherwise, and ignores
the previous factor if we overwrite it.

I think instead we should have:

  /* For now, we only support cases in which all loads and stores fall back
     to VnQI or none do.  */
  gcc_assert (!rgl->max_nscalars_per_iter
	      || (rgl->factor == 1 && factor == 1)
	      || (rgl->max_nscalars_per_iter * rgl->factor
		  == nscalars_per_iter * factor));

before changing rgl.

> […]
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index cdd6f6c5e5d..e0ffbab1d02 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1742,29 +1742,56 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        return;
>      }
>  
> -  machine_mode mask_mode;
> -  if (!VECTOR_MODE_P (vecmode)
> -      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
> -      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
> +  if (!VECTOR_MODE_P (vecmode))
>      {
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -			 "can't use a fully-masked loop because the target"
> -			 " doesn't have the appropriate masked load or"
> -			 " store.\n");
> +			 "can't operate on partial vectors because of"
> +			 " the unexpected mode.\n");

Maybe: “can't operate on partial vectors when emulating vector operations”

>        LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>        return;
>      }
> -  /* We might load more scalars than we need for permuting SLP loads.
> -     We checked in get_group_load_store_type that the extra elements
> -     don't leak into a new vector.  */
> +
>    poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>    unsigned int nvectors;
> -  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
> -    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
> -  else
> -    gcc_unreachable ();
> +
> +  machine_mode mask_mode;
> +  bool using_partial_vectors_p = false;
> +  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
> +      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
> +    {
> +      /* We might load more scalars than we need for permuting SLP loads.
> +	 We checked in get_group_load_store_type that the extra elements
> +	 don't leak into a new vector.  */
> +      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))

Please split this out into a lambda that returns the number of vectors,
and keep the comment with it.  That way we can use it here and below.

> +	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
> +			       scalar_mask);
> +      else
> +	gcc_unreachable ();
> +      using_partial_vectors_p = true;
> +    }
> +
> +  unsigned int factor;
> +  if (can_vec_len_load_store_p (vecmode, is_load, &factor))
> +    {
> +      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
> +	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
> +      else
> +	gcc_unreachable ();
> +      using_partial_vectors_p = true;
> +    }
> +
> +  if (!using_partial_vectors_p)
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "can't operate on partial vectors because the"
> +			 " target doesn't have the appropriate partial"
> +			 "vectorization load or store.\n");

missing space between “partial” and “vectorization”.

> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
>  }
>  
>  /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
> @@ -6936,6 +6963,28 @@ vectorizable_scan_store (vec_info *vinfo,
>    return true;
>  }
>  
> +/* For the vector type VTYPE, return the same size vector type with
> +   QImode element, which is mainly for vector load/store with length
> +   in bytes.  */
> +
> +static tree
> +vect_get_same_size_vec_for_len (tree vtype)
> +{
> +  gcc_assert (VECTOR_TYPE_P (vtype));
> +  machine_mode v_mode = TYPE_MODE (vtype);
> +  gcc_assert (GET_MODE_INNER (v_mode) != QImode);
> +
> +  /* Obtain new element counts with QImode.  */
> +  poly_uint64 vsize = GET_MODE_SIZE (v_mode);
> +  poly_uint64 esize = GET_MODE_SIZE (QImode);
> +  poly_uint64 nelts = exact_div (vsize, esize);
> +
> +  /* Build element type with QImode.  */
> +  unsigned int eprec = GET_MODE_PRECISION (QImode);
> +  tree etype = build_nonstandard_integer_type (eprec, 1);
> +
> +  return build_vector_type (etype, nelts);
> +}

As mentioned above, I think we should be getting the mode of
the vector from get_len_load_store_mode.

> […]
> @@ -7911,10 +7968,16 @@ vectorizable_store (vec_info *vinfo,
>  	      unsigned HOST_WIDE_INT align;
>  
>  	      tree final_mask = NULL_TREE;
> +	      tree final_len = NULL_TREE;
>  	      if (loop_masks)
>  		final_mask = vect_get_loop_mask (gsi, loop_masks,
>  						 vec_num * ncopies,
>  						 vectype, vec_num * j + i);
> +	      else if (loop_lens)
> +		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
> +					       vec_num * ncopies,
> +					       vec_num * j + i);
> +

I don't think we need this “final_len”.  Unlike for masks, we only have
a single length, and can calculate it in the “if” statement below.

>  	      if (vec_mask)
>  		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
>  						      vec_mask, gsi);
> @@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
>  		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>  		  new_stmt = call;
>  		}
> +	      else if (final_len)
> +		{
> +		  align = least_bit_hwi (misalign | align);
> +		  tree ptr = build_int_cst (ref_type, align);
> +		  tree vtype = TREE_TYPE (vec_oprnd);

Couldn't you just reuse “vectype”?  Worth a comment if not.

> +		  /* Need conversion if it's wrapped with VnQI.  */
> +		  if (!direct_optab_handler (len_store_optab,
> +					     TYPE_MODE (vtype)))

I think this should use get_len_load_store_mode rather than querying
the optab directly.

> +		    {
> +		      tree new_vtype = vect_get_same_size_vec_for_len (vtype);
> +		      tree var
> +			= vect_get_new_ssa_name (new_vtype, vect_simple_var);
> +		      vec_oprnd
> +			= build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
> +		      gassign *new_stmt
> +			= gimple_build_assign (var, VIEW_CONVERT_EXPR,
> +					       vec_oprnd);
> +		      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
> +						   gsi);
> +		      vec_oprnd = var;
> +		    }
> +		  gcall *call
> +		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
> +						  ptr, final_len, vec_oprnd);
> +		  gimple_call_set_nothrow (call, true);
> +		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
> +		  new_stmt = call;
> +		}
>  	      else
>  		{
>  		  data_ref = fold_build2 (MEM_REF, vectype,
> @@ -8531,6 +8622,7 @@ vectorizable_load (vec_info *vinfo,
>        tree dr_offset;
>  
>        gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
> +      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));

Might as well just change the existing assert to
!LOOP_VINFO_USING_PARTIAL_VECTORS_P.

Same comments for the load code.

> […]
> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>  	  return false;
>  	}
>  
> -      if (loop_vinfo
> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
> -			       ncopies * vec_num, vectype, NULL);
> +      if (loop_vinfo && for_reduction
> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +	{
> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
> +				   ncopies * vec_num, vectype, NULL);
> +	  /* Using partial vectors can introduce inactive lanes in the last
> +	     iteration, since full vector of condition results are operated,
> +	     it's unsafe here.  But if we can AND the condition mask with
> +	     loop mask, it would be safe then.  */
> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
> +	    {
> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
> +		{
> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +		}
> +	    }
> +	  else
> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	}
>  
>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,

I don't understand this part.

> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>    *nunits_vectype_out = nunits_vectype;
>    return opt_result::success ();
>  }
> +
> +/* Generate and return statement sequence that sets vector length LEN that is:
> +
> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
> +   left_len = END_INDEX - min_of_start_and_end;
> +   rhs = min (left_len, LEN_LIMIT);
> +   LEN = rhs;
> +
> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
> +   means if we have left_len in bytes larger than 255, it can't be saturated to
> +   vector limit (vector size).  One target hook can be provided if other ports
> +   don't suffer this.
> +*/

Should be no line break before the */

Personally I think it'd be better to drop the TODO.  This isn't the only
place that would need to change if we allowed out-of-range lengths,
whereas the comment might give the impression that it is.

> +
> +gimple_seq
> +vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
> +{
> +  gimple_seq stmts = NULL;
> +  tree len_type = TREE_TYPE (len);
> +  gcc_assert (TREE_TYPE (start_index) == len_type);
> +
> +  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
> +  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
> +  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
> +
> +  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
> +  gimple *new_stmt = gimple_build_assign (len, rhs);
> +  gimple_stmt_iterator i = gsi_last (stmts);
> +  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
> +
> +  return stmts;
> +}

It's better to build this up using gimple_build instead.

> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 6c830ad09f4..4155ffe1d49 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -417,6 +417,16 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
>     are compile-time constants but VF and nL can be variable (if the target
>     supports variable-length vectors).
>  
> +   Moreover, for some approach with partial vectors like being controlled
> +   by length (in bytes), it cares about the occupied bytes for each scalar.
> +   Provided that each scalar has factor bytes, the total number of scalar
> +   values becomes to factor * N, the above equation becomes to:
> +
> +       factor * N = factor * NS * VF = factor * NV * NL
> +
> +   factor * NS is the bytes of each scalar, factor * NL is the vector size
> +   in bytes.
> +
>     In classical vectorization, each iteration of the vector loop would
>     handle exactly VF iterations of the original scalar loop.  However,
>     in vector loops that are able to operate on partial vectors, a

As above, I think it'd be better to model the factor as splitting each
element into FACTOR pieces.  In that case I don't think we need to
describe it in this comment; a comment above the field should be enough.

> @@ -473,14 +483,19 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i)
>     first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
>     the second being indexed by the mask index 0 <= i < nV.  */
>  
> -/* The controls (like masks) needed by rgroups with nV vectors,
> +/* The controls (like masks, lengths) needed by rgroups with nV vectors,
>     according to the description above.  */

“(masks or lengths)”

>  struct rgroup_controls {
>    /* The largest nS for all rgroups that use these controls.  */
>    unsigned int max_nscalars_per_iter;
>  
> -  /* The type of control to use, based on the highest nS recorded above.
> -     For mask-based approach, it's used for mask_type.  */
> +  /* For now, it's mainly used for length-based in bytes approach, it's
> +     record the occupied bytes of each scalar.  */

Maybe:

  /* For the largest nS recorded above, the loop controls divide each scalar
     into FACTOR equal-sized pieces.  This is useful if we need to split
     element-based accesses into byte-based accesses.  */

> +  unsigned int factor;
> +
> +  /* This type is based on the highest nS recorded above.
> +     For mask-based approach, it records mask type to use.
> +     For length-based approach, it records appropriate vector type.  */

Maybe:

  /* This is a vector type with MAX_NSCALARS_PER_ITER * VF / nV elements.
     For mask-based controls, it is the type of the masks in CONTROLS.
     For length-based controls, it can be any vector type that has the
     specified number of elements; the type of the elements doesn't matter.  */

> @@ -644,6 +665,10 @@ public:
>       the vector loop can handle fewer than VF scalars.  */
>    bool using_partial_vectors_p;
>  
> +  /* True if we've decided to use partially-populated vectors for the
> +     epilogue of loop, only for length-based approach for now.  */

Don't think the bit after the comma is necessary.

> +  bool epil_using_partial_vectors_p;
> +
>    /* When we have grouped data accesses with gaps, we may introduce invalid
>       memory accesses.  We peel the last iteration of the loop to prevent
>       this.  */

Thanks,
Richard
Kewen.Lin July 1, 2020, 1:23 p.m. UTC | #2
Hi Richard,

Many thanks for your great review comments!

on 2020/7/1 上午3:53, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 06a04e3d7dd..284c15705ea 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -13389,6 +13389,13 @@ by the copy loop headers pass.
>>  @item vect-epilogues-nomask
>>  Enable loop epilogue vectorization using smaller vector size.
>>  
>> +@item vect-with-length-scope
> 
> In principle there's nothing length-specific about this option.
> We could do the same for masks or for any future loop control
> mechanism.  So how about vect-partial-vector-usage instead?
> 

Sounds good, will update it. 

[snip] 

I will also update as the comments in snipped parts (if they have)
snip some of them to have good readablity.

>> +      machine_mode vmode;
>> +      /* Check whether the related VnQI vector mode exists, as well as
>> +	 optab supported.  */
>> +      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
>> +	  && direct_optab_handler (op, vmode))
>> +	{
>> +	  unsigned int mul;
>> +	  scalar_mode orig_emode = GET_MODE_INNER (mode);
>> +	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
>> +
>> +	  if (constant_multiple_p (orig_esize, esize, &mul))
>> +	    *factor = mul;
>> +	  else
>> +	    gcc_unreachable ();
> 
> This is just:
> 
> 	  *factor = GET_MODE_UNIT_SIZE (mode);
> 
> However, I think it would be better to return the vector mode that the
> load or store should use, instead of this factor.  That way we can reuse
> it when generating the load and store statements.
> 
> So maybe call the function get_len_load_store_mode and return an
> opt_machine_mode.
> 

Will improve it.

>> diff --git a/gcc/params.opt b/gcc/params.opt
>> index 9b564bb046c..daa6e8a2beb 100644
>> --- a/gcc/params.opt
>> +++ b/gcc/params.opt
>> @@ -968,4 +968,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f
>>  Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
>>  Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
>>  
>> +-param=vect-with-length-scope=
>> +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
>> +Control the vector with length exploitation scope.
> 
> Think this should be a bit more descriptive, at least saying what the
> three values are (but in a more abbreviated form than the .texi above).
> 
> I think the default should be 2, with targets actively turning it down
> where necessary.  That way, the decision to turn it down is more likely
> to have a comment explaining why.
> 

Will update both.

>> +
>>    tree ctrl_type = rgc->type;
>> -  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
>> +  /* Scale up nscalars per iteration with factor.  */
>> +  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
> 
> Maybe “scaled_nscalars_per_iter”?  Not sure the comment really adds
> anything here.
> 
> Or maybe “nitems_per_iter”, to keep the names shorter?
> 

Will use the short one.

>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> +  tree length_limit = NULL_TREE;
>> +  /* For length, we need length_limit to check length in range.  */
>> +  if (!vect_for_masking)
>> +    {
>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>> +      length_limit = build_int_cst (compare_type, len_limit);
>> +    }
>>  
>>    /* Calculate the maximum number of scalar values that the rgroup
>>       handles in total, the number that it handles for each iteration
>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>    tree nscalars_total = niters;
>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>    tree nscalars_skip = niters_skip;
>> -  if (nscalars_per_iter != 1)
>> +  if (nscalars_per_iter_ft != 1)
>>      {
>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>  	 these multiplications don't overflow.  */
>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>  				     nscalars_total, compare_factor);
>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>  	  tree const_limit = build_int_cst (compare_type,
>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>> -					    * nscalars_per_iter);
>> +					    * nscalars_per_iter_ft);
>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>  				      nscalars_total, const_limit);
>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
> 
> It looks odd that we don't need to adjust the other nscalars_* values too.
> E.g. the above seems to be comparing an unscaled nscalars_total with
> a scaled nscalars_per_iter.  I think the units ought to “agree”,
> both here and in the rest of the function.
> 

Sorry, I didn't quite follow this comment.  Both nscalars_totoal and
nscalars_step are scaled here.  The remaining related nscalars_*
seems only nscalars_skip, but length can't support skip.

>>  	}
>>  
>> +      /* First iteration is full.  */
> 
> This comment belongs inside the “if”.
> 

Sorry, I might miss something, but isn't this applied for both?

>>        if (!init_ctrl)
>> -	/* First iteration is full.  */
>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>> +	{
>> +	  if (vect_for_masking)
>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>> +	  else
>> +	    init_ctrl = length_limit;
>> +	}
>>  
>> […]
>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>    if (vect_epilogues
>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>        && prolog_peeling >= 0
>> -      && known_eq (vf, lowest_vf))
>> +      && known_eq (vf, lowest_vf)
>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>      {
>>        unsigned HOST_WIDE_INT eiters
>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
> 
> I'm still not really convinced that this check is right.  It feels
> like it's hiding a problem elsewhere.
> 

The comments above this hunk is that:

  /* If we know the number of scalar iterations for the main loop we should
     check whether after the main loop there are enough iterations left over
     for the epilogue.  */

So it's to check the ones in loop_vinfo->epilogue_vinfos whether can be removed.
And the main work in the loop is to remove epil_info from epilogue_vinfos.

To make it simply, let's assume prolog_peeling and LOOP_VINFO_PEELING_FOR_GAPS
are zero, vf == lowest_vf.  

   eiters = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)

eiters is the remaining iterations which can't be handled in main loop with 
full (width/lanes) vectors.

For partial vectors epilogue handlings, loop_vinfo->vector_mode and 
epilogue_vinfo->vector_mode is the same (specially).

      while (!(constant_multiple_p
	       (GET_MODE_SIZE (loop_vinfo->vector_mode),
		GET_MODE_SIZE (epilogue_vinfo->vector_mode), &ratio)
	       && eiters >= lowest_vf / ratio + epilogue_gaps))

It means that the ratio is 1 (specially), the lowest_vf/ratio is still vf, 
the remaining eiters is definitely less than vf, then the loop_vinfo->epilogue_vinfos[0]
gets removed.

I think the reason why partial vectors epilogue is special here is the VF of main loop 
is the same as the VF of epilogue loop.  Normally VF of epilogue loop should be
less than VF of main loop (here it seems assuming it's multiple relationship).

>> […]
>> @@ -1072,6 +1074,88 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
>>    return true;
>>  }
>>  
>> +/* Check whether we can use vector access with length based on precison
>> +   comparison.  So far, to keep it simple, we only allow the case that the
>> +   precision of the target supported length is larger than the precision
>> +   required by loop niters.  */
>> +
>> +static bool
>> +vect_verify_loop_lens (loop_vec_info loop_vinfo)
>> +{
>> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
>> +
>> +  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
>> +    return false;
>> +
>> +  /* The one which has the largest NV should have max bytes per iter.  */
>> +  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
> 
> “lens->last ()”.  Using a reference feels more natural here.
> 

Will fix it.

>> +
>> +  /* Work out how many bits we need to represent the length limit.  */
>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
> 
> I think this breaks the abstraction.  There's no guarantee that the
> factor is the same for each rgroup_control, so there's no guarantee
> that the maximum bytes per iter comes the last entry.  (Also, it'd
> be better to avoid talking about bytes if we're trying to be general.)
> I think we should take the maximum of each entry instead.
> 

Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
"maximum elements per iter"?  Yes, the code is for length in bytes, checking
the last entry is only reasonable for it.  Will update it to check all entries
instead.

>> +     we perfer to still use the niters type.  */
>> +  unsigned int ni_prec
>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>> +
>> +  unsigned int required_prec = ni_prec;
>> +  if (required_prec < pmode_prec)
>> +    required_prec = pmode_prec;
>> +
>> +  tree iv_type = NULL_TREE;
>> +  if (min_ni_prec > required_prec)
>> +    {
> 
> Do we need this condition?  Looks like we could just do:
> 
>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>   min_ni_prec = MAX (min_ni_prec, ni_prec);
> 
> and then run the loop below.
> 

I think the assumption holds that Pmode and niters type are standard integral
type?  If so, both of them don't need the below loop to build the integer
type, but min_ni_prec needs.  Does it make sense to differentiate them?

>> +      /* Decide whether to use fully-masked approach.  */
>> +      if (vect_verify_full_masking (loop_vinfo))
>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>> +      /* Decide whether to use length-based approach.  */
>> +      else if (vect_verify_loop_lens (loop_vinfo))
>> +	{
>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>> +	    {
>> +	      if (dump_enabled_p ())
>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +				 "can't vectorize this loop with length-based"
>> +				 " partial vectors approach becuase peeling"
>> +				 " for alignment or gaps is required.\n");
>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +	    }
> 
> Why are these peeling cases necessary?  Peeling for gaps should
> just mean subtracting one scalar iteration from the iteration count
> and shouldn't otherwise affect the main loop.  Similarly, peeling for
> alignment can be handled in the normal way, with a scalar prologue loop.
> 

I was thinking to relax this later and to avoid to handle too many cases
in the first enablement patch.  Since Power hw whose level is able to support
vector with length, it supports unaligned load/store, need to construct
some cases for them.  May I postpone it a bit?  Or you prefer me to support
it here?

>> +	  else if (param_vect_with_length_scope == 0)
>> +	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
> 
> As above, I don't think this should be length-specific.  Same for the
> == 1 handling, which we could do afterwards.
> 

OK.  Will adjust it.

[snip]
>> +
>> +      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>> +	dump_printf_loc (MSG_NOTE, vect_location,
>> +			 "using length-based partial"
>> +			 " vectors for loop fully.\n");
>> +      else
>> +	dump_printf_loc (MSG_NOTE, vect_location,
>> +			 "not using length-based partial"
>> +			 " vectors for loop fully.\n");
> 
> Think just one message for all three cases is better, perhaps with
> 
>   "operating only on full vectors.\n"
> 
> instead of "not using a fully-masked loop.\n".  Might need some
> testsuite updates though -- probably worth splitting the wording
> change out into a separate patch if so.
> 

OK, will check and split it if need.

[snip]

>> +
>> +void
>> +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>> +		      unsigned int nvectors, tree vectype, unsigned int factor)
>> +{
>> +  gcc_assert (nvectors != 0);
>> +  if (lens->length () < nvectors)
>> +    lens->safe_grow_cleared (nvectors);
>> +  rgroup_controls *rgl = &(*lens)[nvectors - 1];
>> +
>> +  /* The number of scalars per iteration, scalar occupied bytes and
>> +     the number of vectors are both compile-time constants.  */
>> +  unsigned int nscalars_per_iter
>> +    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
>> +		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
>> +
>> +  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
>> +    {
>> +      rgl->max_nscalars_per_iter = nscalars_per_iter;
>> +      rgl->type = vectype;
>> +      rgl->factor = factor;
>> +    }
> 
> This is dangerous because it ignores “factor” otherwise, and ignores
> the previous factor if we overwrite it.
> 
> I think instead we should have:
> 
>   /* For now, we only support cases in which all loads and stores fall back
>      to VnQI or none do.  */
>   gcc_assert (!rgl->max_nscalars_per_iter
> 	      || (rgl->factor == 1 && factor == 1)
> 	      || (rgl->max_nscalars_per_iter * rgl->factor
> 		  == nscalars_per_iter * factor));
> 
> before changing rgl.
> 

Thanks for pointing out this!  Will guard it.

[snip]
>>  	      if (vec_mask)
>>  		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
>>  						      vec_mask, gsi);
>> @@ -7994,6 +8057,34 @@ vectorizable_store (vec_info *vinfo,
>>  		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
>>  		  new_stmt = call;
>>  		}
>> +	      else if (final_len)
>> +		{
>> +		  align = least_bit_hwi (misalign | align);
>> +		  tree ptr = build_int_cst (ref_type, align);
>> +		  tree vtype = TREE_TYPE (vec_oprnd);
> 
> Couldn't you just reuse “vectype”?  Worth a comment if not.
> 

Yeah, will replace with it.

[snip]

>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>  	  return false;
>>  	}
>>  
>> -      if (loop_vinfo
>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>> -			       ncopies * vec_num, vectype, NULL);
>> +      if (loop_vinfo && for_reduction
>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> +	{
>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>> +				   ncopies * vec_num, vectype, NULL);
>> +	  /* Using partial vectors can introduce inactive lanes in the last
>> +	     iteration, since full vector of condition results are operated,
>> +	     it's unsafe here.  But if we can AND the condition mask with
>> +	     loop mask, it would be safe then.  */
>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>> +	    {
>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>> +		{
>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +		}
>> +	    }
>> +	  else
>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> +	}
>>  
>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
> 
> I don't understand this part.

This is for the regression case on aarch64:

PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

As you mentioned before, we would expect to record masks for partial vectors reduction, 
otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
since the mask is further AND with some loop mask.  The difference looks like:

Without mask AND loop mask optimization:

  loop_mask =...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}                // unsafe since it's generate from full width.
  mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1

With mask AND loop mask optimization:

  loop_mask =...
  v1 = .MASK_LOAD (a, loop_mask)
  mask1 = v1 == {cst, ...}
  mask2 = loop_mask & mask1       
  v2 = .MASK_LOAD (b, mask2)
  vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?


The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?

> 
>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>    *nunits_vectype_out = nunits_vectype;
>>    return opt_result::success ();
>>  }
>> +
>> +/* Generate and return statement sequence that sets vector length LEN that is:
>> +
>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>> +   left_len = END_INDEX - min_of_start_and_end;
>> +   rhs = min (left_len, LEN_LIMIT);
>> +   LEN = rhs;
>> +
>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>> +   vector limit (vector size).  One target hook can be provided if other ports
>> +   don't suffer this.
>> +*/
> 
> Should be no line break before the */
> 
> Personally I think it'd be better to drop the TODO.  This isn't the only
> place that would need to change if we allowed out-of-range lengths,
> whereas the comment might give the impression that it is.
> 

Sorry I might miss something, but all undetermined lengths are generated here,
the other places you meant is doc or elsewhere?

>> +
>> +gimple_seq
>> +vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
>> +{
>> +  gimple_seq stmts = NULL;
>> +  tree len_type = TREE_TYPE (len);
>> +  gcc_assert (TREE_TYPE (start_index) == len_type);
>> +
>> +  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
>> +  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
>> +  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
>> +
>> +  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
>> +  gimple *new_stmt = gimple_build_assign (len, rhs);
>> +  gimple_stmt_iterator i = gsi_last (stmts);
>> +  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
>> +
>> +  return stmts;
>> +}
> 
> It's better to build this up using gimple_build instead.
> 

Will fix it.

[snip]

>> +  bool epil_using_partial_vectors_p;
>> +
>>    /* When we have grouped data accesses with gaps, we may introduce invalid
>>       memory accesses.  We peel the last iteration of the loop to prevent
>>       this.  */
> 
> Thanks,
> Richard
> 


BR,
Kewen
Richard Sandiford July 1, 2020, 3:17 p.m. UTC | #3
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>> +  tree length_limit = NULL_TREE;
>>> +  /* For length, we need length_limit to check length in range.  */
>>> +  if (!vect_for_masking)
>>> +    {
>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>> +    }
>>>  
>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>       handles in total, the number that it handles for each iteration
>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>    tree nscalars_total = niters;
>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>    tree nscalars_skip = niters_skip;
>>> -  if (nscalars_per_iter != 1)
>>> +  if (nscalars_per_iter_ft != 1)
>>>      {
>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>  	 these multiplications don't overflow.  */
>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>  				     nscalars_total, compare_factor);
>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>  	  tree const_limit = build_int_cst (compare_type,
>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>> -					    * nscalars_per_iter);
>>> +					    * nscalars_per_iter_ft);
>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>  				      nscalars_total, const_limit);
>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>> 
>> It looks odd that we don't need to adjust the other nscalars_* values too.
>> E.g. the above seems to be comparing an unscaled nscalars_total with
>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>> both here and in the rest of the function.
>> 
>
> Sorry, I didn't quite follow this comment.  Both nscalars_totoal and
> nscalars_step are scaled here.  The remaining related nscalars_*
> seems only nscalars_skip, but length can't support skip.

Hmm, OK.  But in that case can you update the names of the variables
to match?  It's confusing to have some nscalars_* variables actually
count scalars (and thus have “nitems” equivalents) and other nscalars_*
variables count something else (and thus effectively be nitems_* variables
themselves).

>
>>>  	}
>>>  
>>> +      /* First iteration is full.  */
>> 
>> This comment belongs inside the “if”.
>> 
>
> Sorry, I might miss something, but isn't this applied for both?

I meant it should be…

>
>>>        if (!init_ctrl)
>>> -	/* First iteration is full.  */
>>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>>> +	{
>>> +	  if (vect_for_masking)
>>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>>> +	  else
>>> +	    init_ctrl = length_limit;
>>> +	}

  if (!init_ctrl)
    {
      /* First iteration is full.  */
      if (vect_for_masking)
        init_ctrl = build_minus_one_cst (ctrl_type);
      else
        init_ctrl = length_limit;
    }

since the comment only applies to the “!init_ctrl” case.  The point
of a nonnull init_ctrl is to create cases in which the first vector
is not a full vector.

>>>  
>>> […]
>>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>>    if (vect_epilogues
>>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>>        && prolog_peeling >= 0
>>> -      && known_eq (vf, lowest_vf))
>>> +      && known_eq (vf, lowest_vf)
>>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>>      {
>>>        unsigned HOST_WIDE_INT eiters
>>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> 
>> I'm still not really convinced that this check is right.  It feels
>> like it's hiding a problem elsewhere.
>> 
>
> The comments above this hunk is that:
>
>   /* If we know the number of scalar iterations for the main loop we should
>      check whether after the main loop there are enough iterations left over
>      for the epilogue.  */
>
> So it's to check the ones in loop_vinfo->epilogue_vinfos whether can be removed.
> And the main work in the loop is to remove epil_info from epilogue_vinfos.

Oops, I think I misread it as checking loop_vinfo rather than
epilogue_vinfo.  It makes more sense now. :-)

>>> +
>>> +  /* Work out how many bits we need to represent the length limit.  */
>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>> 
>> I think this breaks the abstraction.  There's no guarantee that the
>> factor is the same for each rgroup_control, so there's no guarantee
>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>> be better to avoid talking about bytes if we're trying to be general.)
>> I think we should take the maximum of each entry instead.
>> 
>
> Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
> "maximum elements per iter"?  Yes, the code is for length in bytes, checking
> the last entry is only reasonable for it.  Will update it to check all entries
> instead.

I meant bytes, since that's what the code is effectively calculating
(at least for Power).  I.e. I think this breaks the abstraction even
if we assume the Power scheme to measuring length, since in principle
it's possible to fix different vector sizes in the same vector region.

>>> +     we perfer to still use the niters type.  */
>>> +  unsigned int ni_prec
>>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>>> +
>>> +  unsigned int required_prec = ni_prec;
>>> +  if (required_prec < pmode_prec)
>>> +    required_prec = pmode_prec;
>>> +
>>> +  tree iv_type = NULL_TREE;
>>> +  if (min_ni_prec > required_prec)
>>> +    {
>> 
>> Do we need this condition?  Looks like we could just do:
>> 
>>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>>   min_ni_prec = MAX (min_ni_prec, ni_prec);
>> 
>> and then run the loop below.
>> 
>
> I think the assumption holds that Pmode and niters type are standard integral
> type?  If so, both of them don't need the below loop to build the integer
> type, but min_ni_prec needs.  Does it make sense to differentiate them?

IMO we should handle them the same way, i.e. always use the loop.
For example, Pmode can be a partial integer mode on some targets,
so it isn't guaranteed to give a nice power-of-2 integer type.

Maybe having a special case would be worth it if this was performance-
critical code, but since it isn't, having all cases go through the same
path seems better.  It also means that the loop will get more testing
coverage.

>>> +      /* Decide whether to use fully-masked approach.  */
>>> +      if (vect_verify_full_masking (loop_vinfo))
>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>> +      /* Decide whether to use length-based approach.  */
>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>> +	{
>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>> +	    {
>>> +	      if (dump_enabled_p ())
>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> +				 "can't vectorize this loop with length-based"
>>> +				 " partial vectors approach becuase peeling"
>>> +				 " for alignment or gaps is required.\n");
>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +	    }
>> 
>> Why are these peeling cases necessary?  Peeling for gaps should
>> just mean subtracting one scalar iteration from the iteration count
>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>> alignment can be handled in the normal way, with a scalar prologue loop.
>> 
>
> I was thinking to relax this later and to avoid to handle too many cases
> in the first enablement patch.  Since Power hw whose level is able to support
> vector with length, it supports unaligned load/store, need to construct
> some cases for them.  May I postpone it a bit?  Or you prefer me to support
> it here?

I've no objection to postponing it if there are specific known
problems that make it difficult, but I think we should at least
say what they are.  On the face of it, I'm not sure why it doesn't
Just Work, since the way that we control the main loop should be
mostly orthogonal to how we handle peeled prologue iterations
and how we handle a single peeled epilogue iteration.

>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>  	  return false;
>>>  	}
>>>  
>>> -      if (loop_vinfo
>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>> -			       ncopies * vec_num, vectype, NULL);
>>> +      if (loop_vinfo && for_reduction
>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>> +	{
>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>> +				   ncopies * vec_num, vectype, NULL);
>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>> +	     iteration, since full vector of condition results are operated,
>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>> +	     loop mask, it would be safe then.  */
>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>> +	    {
>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>> +		{
>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +		}
>>> +	    }
>>> +	  else
>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>> +	}
>>>  
>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>> 
>> I don't understand this part.
>
> This is for the regression case on aarch64:
>
> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,

OK, if this is an SVE thing, it should really be a separate patch.
(And thanks for testing SVE.)

> As you mentioned before, we would expect to record masks for partial vectors reduction, 
> otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
> reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
> since the mask is further AND with some loop mask.  The difference looks like:
>
> Without mask AND loop mask optimization:
>
>   loop_mask =...
>   v1 = .MASK_LOAD (a, loop_mask)
>   mask1 = v1 == {cst, ...}                // unsafe since it's generate from full width.
>   mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
>   v2 = .MASK_LOAD (b, mask2)
>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1
>
> With mask AND loop mask optimization:
>
>   loop_mask =...
>   v1 = .MASK_LOAD (a, loop_mask)
>   mask1 = v1 == {cst, ...}
>   mask2 = loop_mask & mask1       
>   v2 = .MASK_LOAD (b, mask2)
>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?
>
>
> The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
> it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?

But in this particular test, we're doing outer loop vectorisation,
and the only elements of vres that matter are the ones selected
by loop_mask (since those are the only ones that get stored out).
So applying the loop mask to the VEC_COND_EXPR is “just” an
(important) optimisation, rather than a correctness issue.

What's causing the test to start failing with the patch?  I realise
you've probably already said, sorry, but it's been a large patch series
so it's hard to keep all the details committed to memory.

>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>    *nunits_vectype_out = nunits_vectype;
>>>    return opt_result::success ();
>>>  }
>>> +
>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>> +
>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>> +   left_len = END_INDEX - min_of_start_and_end;
>>> +   rhs = min (left_len, LEN_LIMIT);
>>> +   LEN = rhs;
>>> +
>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>> +   don't suffer this.
>>> +*/
>> 
>> Should be no line break before the */
>> 
>> Personally I think it'd be better to drop the TODO.  This isn't the only
>> place that would need to change if we allowed out-of-range lengths,
>> whereas the comment might give the impression that it is.
>> 
>
> Sorry I might miss something, but all undetermined lengths are generated here,
> the other places you meant is doc or elsewhere?

For example, we'd need to start querying the length operand of the optabs
to see what length precision the target uses, since it would be invalid
to do this optimisation for IVs that are wider than that precision.
The routine above doesn't seem the right place to do that.

It could also affect the semantics of the IFNs, if we ever added
folding rules for them.  So yeah, it boils down to this not being
a local decision for this routine -- it's tied to the optab and
IFN behaviour too.

Thanks,
Richard
Kewen.Lin July 2, 2020, 5:20 a.m. UTC | #4
Hi Richard,

on 2020/7/1 下午11:17, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>>> +  tree length_limit = NULL_TREE;
>>>> +  /* For length, we need length_limit to check length in range.  */
>>>> +  if (!vect_for_masking)
>>>> +    {
>>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>>> +    }
>>>>  
>>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>>       handles in total, the number that it handles for each iteration
>>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>    tree nscalars_total = niters;
>>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>>    tree nscalars_skip = niters_skip;
>>>> -  if (nscalars_per_iter != 1)
>>>> +  if (nscalars_per_iter_ft != 1)
>>>>      {
>>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>>  	 these multiplications don't overflow.  */
>>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>>  				     nscalars_total, compare_factor);
>>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>>  	  tree const_limit = build_int_cst (compare_type,
>>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>>> -					    * nscalars_per_iter);
>>>> +					    * nscalars_per_iter_ft);
>>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>>  				      nscalars_total, const_limit);
>>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>>>
>>> It looks odd that we don't need to adjust the other nscalars_* values too.
>>> E.g. the above seems to be comparing an unscaled nscalars_total with
>>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>>> both here and in the rest of the function.
>>>
>>
>> Sorry, I didn't quite follow this comment.  Both nscalars_totoal and
>> nscalars_step are scaled here.  The remaining related nscalars_*
>> seems only nscalars_skip, but length can't support skip.
> 
> Hmm, OK.  But in that case can you update the names of the variables
> to match?  It's confusing to have some nscalars_* variables actually
> count scalars (and thus have “nitems” equivalents) and other nscalars_*
> variables count something else (and thus effectively be nitems_* variables
> themselves).
> 

OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
to nitems_total/... (or nunits_total better?)

>>
>>>>  	}
>>>>  
>>>> +      /* First iteration is full.  */
>>>
>>> This comment belongs inside the “if”.
>>>
>>
>> Sorry, I might miss something, but isn't this applied for both?
> 
> I meant it should be…
> 
>>
>>>>        if (!init_ctrl)
>>>> -	/* First iteration is full.  */
>>>> -	init_ctrl = build_minus_one_cst (ctrl_type);
>>>> +	{
>>>> +	  if (vect_for_masking)
>>>> +	    init_ctrl = build_minus_one_cst (ctrl_type);
>>>> +	  else
>>>> +	    init_ctrl = length_limit;
>>>> +	}
> 
>   if (!init_ctrl)
>     {
>       /* First iteration is full.  */
>       if (vect_for_masking)
>         init_ctrl = build_minus_one_cst (ctrl_type);
>       else
>         init_ctrl = length_limit;
>     }
> 
> since the comment only applies to the “!init_ctrl” case.  The point
> of a nonnull init_ctrl is to create cases in which the first vector
> is not a full vector.
> 

Got it, will fix it.

>>>>  
>>>> […]
>>>> @@ -2568,7 +2608,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>>>    if (vect_epilogues
>>>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>>>        && prolog_peeling >= 0
>>>> -      && known_eq (vf, lowest_vf))
>>>> +      && known_eq (vf, lowest_vf)
>>>> +      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
>>>>      {
>>>>        unsigned HOST_WIDE_INT eiters
>>>>  	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
>>>
>>> I'm still not really convinced that this check is right.  It feels
>>> like it's hiding a problem elsewhere.
>>>
>>
>> The comments above this hunk is that:
>>
>>   /* If we know the number of scalar iterations for the main loop we should
>>      check whether after the main loop there are enough iterations left over
>>      for the epilogue.  */
>>
>> So it's to check the ones in loop_vinfo->epilogue_vinfos whether can be removed.
>> And the main work in the loop is to remove epil_info from epilogue_vinfos.
> 
> Oops, I think I misread it as checking loop_vinfo rather than
> epilogue_vinfo.  It makes more sense now. :-)
> 
>>>> +
>>>> +  /* Work out how many bits we need to represent the length limit.  */
>>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>>>
>>> I think this breaks the abstraction.  There's no guarantee that the
>>> factor is the same for each rgroup_control, so there's no guarantee
>>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>>> be better to avoid talking about bytes if we're trying to be general.)
>>> I think we should take the maximum of each entry instead.
>>>
>>
>> Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
>> "maximum elements per iter"?  Yes, the code is for length in bytes, checking
>> the last entry is only reasonable for it.  Will update it to check all entries
>> instead.
> 
> I meant bytes, since that's what the code is effectively calculating
> (at least for Power).  I.e. I think this breaks the abstraction even
> if we assume the Power scheme to measuring length, since in principle
> it's possible to fix different vector sizes in the same vector region.
> 

Sorry I didn't catch the meaning of "it's possible to fix different
vector sizes in the same vector region."  I guess if we are counting
bytes, the max nunits per iteration should come from the last entry
since the last one holds max bytes which is the result of 
max_nscalar_per_iter * factor.  But I agree that it breaks abstraction
here since it's not applied to length in lanes.

>>>> +     we perfer to still use the niters type.  */
>>>> +  unsigned int ni_prec
>>>> +    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
>>>> +  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
>>>> +  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
>>>> +
>>>> +  unsigned int required_prec = ni_prec;
>>>> +  if (required_prec < pmode_prec)
>>>> +    required_prec = pmode_prec;
>>>> +
>>>> +  tree iv_type = NULL_TREE;
>>>> +  if (min_ni_prec > required_prec)
>>>> +    {
>>>
>>> Do we need this condition?  Looks like we could just do:
>>>
>>>   min_ni_prec = MAX (min_ni_prec, GET_MODE_BITSIZE (Pmode));
>>>   min_ni_prec = MAX (min_ni_prec, ni_prec);
>>>
>>> and then run the loop below.
>>>
>>
>> I think the assumption holds that Pmode and niters type are standard integral
>> type?  If so, both of them don't need the below loop to build the integer
>> type, but min_ni_prec needs.  Does it make sense to differentiate them?
> 
> IMO we should handle them the same way, i.e. always use the loop.
> For example, Pmode can be a partial integer mode on some targets,
> so it isn't guaranteed to give a nice power-of-2 integer type.
> 
> Maybe having a special case would be worth it if this was performance-
> critical code, but since it isn't, having all cases go through the same
> path seems better.  It also means that the loop will get more testing
> coverage.
> 

Thanks for the explanation, it makes sense.  I'll fix it.

>>>> +      /* Decide whether to use fully-masked approach.  */
>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>> +      /* Decide whether to use length-based approach.  */
>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>> +	{
>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>> +	    {
>>>> +	      if (dump_enabled_p ())
>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>> +				 "can't vectorize this loop with length-based"
>>>> +				 " partial vectors approach becuase peeling"
>>>> +				 " for alignment or gaps is required.\n");
>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>> +	    }
>>>
>>> Why are these peeling cases necessary?  Peeling for gaps should
>>> just mean subtracting one scalar iteration from the iteration count
>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>
>>
>> I was thinking to relax this later and to avoid to handle too many cases
>> in the first enablement patch.  Since Power hw whose level is able to support
>> vector with length, it supports unaligned load/store, need to construct
>> some cases for them.  May I postpone it a bit?  Or you prefer me to support
>> it here?
> 
> I've no objection to postponing it if there are specific known
> problems that make it difficult, but I think we should at least
> say what they are.  On the face of it, I'm not sure why it doesn't
> Just Work, since the way that we control the main loop should be
> mostly orthogonal to how we handle peeled prologue iterations
> and how we handle a single peeled epilogue iteration.
> 

OK, I will remove it to see the impact.  By the way, do you think to
use partial vectors for prologue is something worth to trying in future?

>>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>>  	  return false;
>>>>  	}
>>>>  
>>>> -      if (loop_vinfo
>>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>> -			       ncopies * vec_num, vectype, NULL);
>>>> +      if (loop_vinfo && for_reduction
>>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>>> +	{
>>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>> +				   ncopies * vec_num, vectype, NULL);
>>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>>> +	     iteration, since full vector of condition results are operated,
>>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>>> +	     loop mask, it would be safe then.  */
>>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>>> +	    {
>>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>> +		{
>>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>> +		}
>>>> +	    }
>>>> +	  else
>>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>> +	}
>>>>  
>>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>>>
>>> I don't understand this part.
>>
>> This is for the regression case on aarch64:
>>
>> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
> 
> OK, if this is an SVE thing, it should really be a separate patch.
> (And thanks for testing SVE.)
> 
>> As you mentioned before, we would expect to record masks for partial vectors reduction, 
>> otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
>> reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
>> since the mask is further AND with some loop mask.  The difference looks like:
>>
>> Without mask AND loop mask optimization:
>>
>>   loop_mask =...
>>   v1 = .MASK_LOAD (a, loop_mask)
>>   mask1 = v1 == {cst, ...}                // unsafe since it's generate from full width.
>>   mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
>>   v2 = .MASK_LOAD (b, mask2)
>>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1
>>
>> With mask AND loop mask optimization:
>>
>>   loop_mask =...
>>   v1 = .MASK_LOAD (a, loop_mask)
>>   mask1 = v1 == {cst, ...}
>>   mask2 = loop_mask & mask1       
>>   v2 = .MASK_LOAD (b, mask2)
>>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?
>>
>>
>> The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
>> it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?
> 
> But in this particular test, we're doing outer loop vectorisation,
> and the only elements of vres that matter are the ones selected
> by loop_mask (since those are the only ones that get stored out).
> So applying the loop mask to the VEC_COND_EXPR is “just” an
> (important) optimisation, rather than a correctness issue.
>  

Thanks for the clarification.  It looks the vres is always safe since its
further usage is guard with loop mask.  Then sorry that I didn't catch why
it is one optimization for this case, is there some difference in backend
supports on this different mask for cond_expr?


> What's causing the test to start failing with the patch?  I realise
> you've probably already said, sorry, but it's been a large patch series
> so it's hard to keep all the details committed to memory.
> 

No problem, appreciate your time much!  Since length-based partial vectors
doesn't support any reduction so far, the function has the responsibility
to disable use_partial_vectors_p for it.  Without the above else-if part,
since the reduction_type is TREE_CODE_REDUCTION for this case, the else part
will stop this case to use mask-based partial vectors, but the case expects
the outer loop still able to use mask-based partial vectors.

As your clarification above, else-if looks wrong.  Probably we can change it
to check whether the current vectorization is for outer loop and the condition
stmt being handled is in the inner loop, we can allow it for partial vectors?

>>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>    *nunits_vectype_out = nunits_vectype;
>>>>    return opt_result::success ();
>>>>  }
>>>> +
>>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>>> +
>>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>>> +   left_len = END_INDEX - min_of_start_and_end;
>>>> +   rhs = min (left_len, LEN_LIMIT);
>>>> +   LEN = rhs;
>>>> +
>>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>>> +   don't suffer this.
>>>> +*/
>>>
>>> Should be no line break before the */
>>>
>>> Personally I think it'd be better to drop the TODO.  This isn't the only
>>> place that would need to change if we allowed out-of-range lengths,
>>> whereas the comment might give the impression that it is.
>>>
>>
>> Sorry I might miss something, but all undetermined lengths are generated here,
>> the other places you meant is doc or elsewhere?
> 
> For example, we'd need to start querying the length operand of the optabs
> to see what length precision the target uses, since it would be invalid
> to do this optimisation for IVs that are wider than that precision.
> The routine above doesn't seem the right place to do that.
> 

OK, but it seems it's acceptable if the IV wider than the precision since
we allows it out of range?

> It could also affect the semantics of the IFNs, if we ever added
> folding rules for them.  So yeah, it boils down to this not being
> a local decision for this routine -- it's tied to the optab and
> IFN behaviour too.
> 

Agree, thanks!

BR,
Kewen
Kewen.Lin July 7, 2020, 9:26 a.m. UTC | #5
Hi Richard,

on 2020/7/2 下午1:20, Kewen.Lin via Gcc-patches wrote:
> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
[...]
>> Hmm, OK.  But in that case can you update the names of the variables
>> to match?  It's confusing to have some nscalars_* variables actually
>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>> variables count something else (and thus effectively be nitems_* variables
>> themselves).
>>
> 
> OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
> to nitems_total/... (or nunits_total better?)
> 

Please ignore this part, I have used nitems_ for the names.  :)

>>>>> +  /* Work out how many bits we need to represent the length limit.  */
>>>>> +  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
>>>>
>>>> I think this breaks the abstraction.  There's no guarantee that the
>>>> factor is the same for each rgroup_control, so there's no guarantee
>>>> that the maximum bytes per iter comes the last entry.  (Also, it'd
>>>> be better to avoid talking about bytes if we're trying to be general.)
>>>> I think we should take the maximum of each entry instead.
>>>>
>>>
>>> Agree!  I guess the above "maximum bytes per iter" is a typo? and you meant
>>> "maximum elements per iter"?  Yes, the code is for length in bytes, checking
>>> the last entry is only reasonable for it.  Will update it to check all entries
>>> instead.
>>
>> I meant bytes, since that's what the code is effectively calculating
>> (at least for Power).  I.e. I think this breaks the abstraction even
>> if we assume the Power scheme to measuring length, since in principle
>> it's possible to fix different vector sizes in the same vector region.
>>
> 
> Sorry I didn't catch the meaning of "it's possible to fix different
> vector sizes in the same vector region."  I guess if we are counting
> bytes, the max nunits per iteration should come from the last entry
> since the last one holds max bytes which is the result of 
> max_nscalar_per_iter * factor.  But I agree that it breaks abstraction
> here since it's not applied to length in lanes.
> 

By further thought, I guessed you meant we can have different vector
sizes for the same loop in future?  Yes, the assumption doesn't hold then.

> 
>>>>> +      /* Decide whether to use fully-masked approach.  */
>>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>> +      /* Decide whether to use length-based approach.  */
>>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>>> +	{
>>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>> +	    {
>>>>> +	      if (dump_enabled_p ())
>>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>> +				 "can't vectorize this loop with length-based"
>>>>> +				 " partial vectors approach becuase peeling"
>>>>> +				 " for alignment or gaps is required.\n");
>>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +	    }
>>>>
>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>> just mean subtracting one scalar iteration from the iteration count
>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>
>>>
>>> I was thinking to relax this later and to avoid to handle too many cases
>>> in the first enablement patch.  Since Power hw whose level is able to support
>>> vector with length, it supports unaligned load/store, need to construct
>>> some cases for them.  May I postpone it a bit?  Or you prefer me to support
>>> it here?
>>
>> I've no objection to postponing it if there are specific known
>> problems that make it difficult, but I think we should at least
>> say what they are.  On the face of it, I'm not sure why it doesn't
>> Just Work, since the way that we control the main loop should be
>> mostly orthogonal to how we handle peeled prologue iterations
>> and how we handle a single peeled epilogue iteration.
>>
> 
> OK, I will remove it to see the impact.  By the way, do you think to
> use partial vectors for prologue is something worth to trying in future?
> 

I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
fail at execution during vect-partial-vector-usage=2.  So far the patch
doesn't handle any niters_skip cases.  I think if we want to support it, 
we have to add some handlings in/like what we have for masking, such as: 
mask_skip_niters, vect_prepare_for_masked_peels etc.  

Do you prefer me to extend the support in this patch series?

>>> Sorry I might miss something, but all undetermined lengths are generated here,
>>> the other places you meant is doc or elsewhere?
>>
>> For example, we'd need to start querying the length operand of the optabs
>> to see what length precision the target uses, since it would be invalid
>> to do this optimisation for IVs that are wider than that precision.
>> The routine above doesn't seem the right place to do that.
>>
> 
> OK, but it seems it's acceptable if the IV wider than the precision since
> we allows it out of range?
> 

Please ignore this question, I agree that we have to avoid that case.  Sorry that
I was misunderstanding it before.  :)

BR,
Kewen
Richard Sandiford July 7, 2020, 10:15 a.m. UTC | #6
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>>>> +  tree length_limit = NULL_TREE;
>>>>> +  /* For length, we need length_limit to check length in range.  */
>>>>> +  if (!vect_for_masking)
>>>>> +    {
>>>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>>>> +    }
>>>>>  
>>>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>>>       handles in total, the number that it handles for each iteration
>>>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>>    tree nscalars_total = niters;
>>>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>>>    tree nscalars_skip = niters_skip;
>>>>> -  if (nscalars_per_iter != 1)
>>>>> +  if (nscalars_per_iter_ft != 1)
>>>>>      {
>>>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>>>  	 these multiplications don't overflow.  */
>>>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>>>  				     nscalars_total, compare_factor);
>>>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>>>  	  tree const_limit = build_int_cst (compare_type,
>>>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>>>> -					    * nscalars_per_iter);
>>>>> +					    * nscalars_per_iter_ft);
>>>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>>>  				      nscalars_total, const_limit);
>>>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>>>>
>>>> It looks odd that we don't need to adjust the other nscalars_* values too.
>>>> E.g. the above seems to be comparing an unscaled nscalars_total with
>>>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>>>> both here and in the rest of the function.
>>>>
>>>
>>> Sorry, I didn't quite follow this comment.  Both nscalars_totoal and
>>> nscalars_step are scaled here.  The remaining related nscalars_*
>>> seems only nscalars_skip, but length can't support skip.
>> 
>> Hmm, OK.  But in that case can you update the names of the variables
>> to match?  It's confusing to have some nscalars_* variables actually
>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>> variables count something else (and thus effectively be nitems_* variables
>> themselves).
>> 
>
> OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
> to nitems_total/... (or nunits_total better?)

I agree “items” isn't great.  I was trying to avoid “units” because GCC
often uses that to mean bytes (BITS_PER_UNIT, UNITS_PER_WORD, etc.).
In this context that could be confusing, because sometimes the
“units” actually would be bytes, but not always.

>>>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>>>  	  return false;
>>>>>  	}
>>>>>  
>>>>> -      if (loop_vinfo
>>>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>>> -			       ncopies * vec_num, vectype, NULL);
>>>>> +      if (loop_vinfo && for_reduction
>>>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>>>> +	{
>>>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>>> +				   ncopies * vec_num, vectype, NULL);
>>>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>>>> +	     iteration, since full vector of condition results are operated,
>>>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>>>> +	     loop mask, it would be safe then.  */
>>>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>>>> +	    {
>>>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>>> +		{
>>>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +		}
>>>>> +	    }
>>>>> +	  else
>>>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +	}
>>>>>  
>>>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>>>>
>>>> I don't understand this part.
>>>
>>> This is for the regression case on aarch64:
>>>
>>> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
>> 
>> OK, if this is an SVE thing, it should really be a separate patch.
>> (And thanks for testing SVE.)
>> 
>>> As you mentioned before, we would expect to record masks for partial vectors reduction, 
>>> otherwise the inactive lanes would be possibly unsafe.  For this failed case, the
>>> reduction_type is TREE_CODE_REDUCTION, we won't record loop mask.  But it's still safe
>>> since the mask is further AND with some loop mask.  The difference looks like:
>>>
>>> Without mask AND loop mask optimization:
>>>
>>>   loop_mask =...
>>>   v1 = .MASK_LOAD (a, loop_mask)
>>>   mask1 = v1 == {cst, ...}                // unsafe since it's generate from full width.
>>>   mask2 = loop_mask & mask1               // safe, since it's AND with loop mask?
>>>   v2 = .MASK_LOAD (b, mask2)
>>>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe coz of mask1
>>>
>>> With mask AND loop mask optimization:
>>>
>>>   loop_mask =...
>>>   v1 = .MASK_LOAD (a, loop_mask)
>>>   mask1 = v1 == {cst, ...}
>>>   mask2 = loop_mask & mask1       
>>>   v2 = .MASK_LOAD (b, mask2)
>>>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe coz of mask2?
>>>
>>>
>>> The loop mask ANDing can make unsafe inactive lanes safe.  So the fix here is to further check
>>> it's possible to be optimized further, if it can, we can know it's safe.  Does it make sense?
>> 
>> But in this particular test, we're doing outer loop vectorisation,
>> and the only elements of vres that matter are the ones selected
>> by loop_mask (since those are the only ones that get stored out).
>> So applying the loop mask to the VEC_COND_EXPR is “just” an
>> (important) optimisation, rather than a correctness issue.
>>  
>
> Thanks for the clarification.  It looks the vres is always safe since its
> further usage is guard with loop mask.  Then sorry that I didn't catch why
> it is one optimization for this case, is there some difference in backend
> supports on this different mask for cond_expr?

No, the idea of the optimisation is to avoid cases in which we have:

    cmp_res = …compare…
    cmp_res' = cmp_res & loop_mask
    IFN_MASK_LOAD (…, cmp_res')
    z = cmp_res ? x : y

The problem here is that cmp_res and cmp_res' are live at the same time,
which prevents cmp_res and cmp_res' from being combined into a single
instruction.  It's better for the final instruction to be:

    z = cmp_res' ? x : y

so that everything uses the same comparison result.

We can't leave that to later passes because nothing in the gimple IL
indicates that only the loop_mask elements of z matter.

>> What's causing the test to start failing with the patch?  I realise
>> you've probably already said, sorry, but it's been a large patch series
>> so it's hard to keep all the details committed to memory.
>> 
>
> No problem, appreciate your time much!  Since length-based partial vectors
> doesn't support any reduction so far, the function has the responsibility
> to disable use_partial_vectors_p for it.  Without the above else-if part,
> since the reduction_type is TREE_CODE_REDUCTION for this case, the else part
> will stop this case to use mask-based partial vectors, but the case expects
> the outer loop still able to use mask-based partial vectors.
>
> As your clarification above, else-if looks wrong.  Probably we can change it
> to check whether the current vectorization is for outer loop and the condition
> stmt being handled is in the inner loop, we can allow it for partial vectors?

I think it's more whether, for outer loop vectorisation, the reduction
is a double reduction or a simple nested-cycle reduction.  Both have
a COND_EXPR in the inner loop, but the extra elements only matter for
double reductions.

There again, I don't think we actually support double reductions for
COND_EXPR reductions.

>>>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>>    *nunits_vectype_out = nunits_vectype;
>>>>>    return opt_result::success ();
>>>>>  }
>>>>> +
>>>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>>>> +
>>>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>>>> +   left_len = END_INDEX - min_of_start_and_end;
>>>>> +   rhs = min (left_len, LEN_LIMIT);
>>>>> +   LEN = rhs;
>>>>> +
>>>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>>>> +   don't suffer this.
>>>>> +*/
>>>>
>>>> Should be no line break before the */
>>>>
>>>> Personally I think it'd be better to drop the TODO.  This isn't the only
>>>> place that would need to change if we allowed out-of-range lengths,
>>>> whereas the comment might give the impression that it is.
>>>>
>>>
>>> Sorry I might miss something, but all undetermined lengths are generated here,
>>> the other places you meant is doc or elsewhere?
>> 
>> For example, we'd need to start querying the length operand of the optabs
>> to see what length precision the target uses, since it would be invalid
>> to do this optimisation for IVs that are wider than that precision.
>> The routine above doesn't seem the right place to do that.
>> 
>
> OK, but it seems it's acceptable if the IV wider than the precision since
> we allows it out of range?

For example, suppose that a target handled out-of-range values but
still had a QImode length.  If the IV was wider than QI, we'd truncate
0x100 to 0 when generating the pattern, so a full-vector access would
get truncated to an empty-vector access.

Thanks,
Richard
Richard Sandiford July 7, 2020, 10:44 a.m. UTC | #7
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> on 2020/7/2 下午1:20, Kewen.Lin via Gcc-patches wrote:
>> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>> +      /* Decide whether to use fully-masked approach.  */
>>>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>>> +      /* Decide whether to use length-based approach.  */
>>>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>>>> +	{
>>>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>>> +	    {
>>>>>> +	      if (dump_enabled_p ())
>>>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>> +				 "can't vectorize this loop with length-based"
>>>>>> +				 " partial vectors approach becuase peeling"
>>>>>> +				 " for alignment or gaps is required.\n");
>>>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>>> +	    }
>>>>>
>>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>>> just mean subtracting one scalar iteration from the iteration count
>>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>>
>>>>
>>>> I was thinking to relax this later and to avoid to handle too many cases
>>>> in the first enablement patch.  Since Power hw whose level is able to support
>>>> vector with length, it supports unaligned load/store, need to construct
>>>> some cases for them.  May I postpone it a bit?  Or you prefer me to support
>>>> it here?
>>>
>>> I've no objection to postponing it if there are specific known
>>> problems that make it difficult, but I think we should at least
>>> say what they are.  On the face of it, I'm not sure why it doesn't
>>> Just Work, since the way that we control the main loop should be
>>> mostly orthogonal to how we handle peeled prologue iterations
>>> and how we handle a single peeled epilogue iteration.
>>>
>> 
>> OK, I will remove it to see the impact.  By the way, do you think to
>> use partial vectors for prologue is something worth to trying in future?
>> 
>
> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
> fail at execution during vect-partial-vector-usage=2.  So far the patch
> doesn't handle any niters_skip cases.  I think if we want to support it, 
> we have to add some handlings in/like what we have for masking, such as: 
> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>
> Do you prefer me to extend the support in this patch series?

It's not so much whether it has to be supported now, but more why
it doesn't work now.  What was the reason for the failure?

The peeling-with-masking thing is just an optimisation, so that we
can vectorise the peeled iterations rather than falling back to
scalar code for them.  It shouldn't be needed for correctness.

Thanks,
Richard
Kewen.Lin July 8, 2020, 6:52 a.m. UTC | #8
Hi Richard,

on 2020/7/7 下午6:44, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> on 2020/7/2 下午1:20, Kewen.Lin via Gcc-patches wrote:
>>> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>>> +      /* Decide whether to use fully-masked approach.  */
>>>>>>> +      if (vect_verify_full_masking (loop_vinfo))
>>>>>>> +	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>>>>>>> +      /* Decide whether to use length-based approach.  */
>>>>>>> +      else if (vect_verify_loop_lens (loop_vinfo))
>>>>>>> +	{
>>>>>>> +	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>>>>>>> +	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
>>>>>>> +	    {
>>>>>>> +	      if (dump_enabled_p ())
>>>>>>> +		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>>> +				 "can't vectorize this loop with length-based"
>>>>>>> +				 " partial vectors approach becuase peeling"
>>>>>>> +				 " for alignment or gaps is required.\n");
>>>>>>> +	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>>>> +	    }
>>>>>>
>>>>>> Why are these peeling cases necessary?  Peeling for gaps should
>>>>>> just mean subtracting one scalar iteration from the iteration count
>>>>>> and shouldn't otherwise affect the main loop.  Similarly, peeling for
>>>>>> alignment can be handled in the normal way, with a scalar prologue loop.
>>>>>>
>>>>>
>>>>> I was thinking to relax this later and to avoid to handle too many cases
>>>>> in the first enablement patch.  Since Power hw whose level is able to support
>>>>> vector with length, it supports unaligned load/store, need to construct
>>>>> some cases for them.  May I postpone it a bit?  Or you prefer me to support
>>>>> it here?
>>>>
>>>> I've no objection to postponing it if there are specific known
>>>> problems that make it difficult, but I think we should at least
>>>> say what they are.  On the face of it, I'm not sure why it doesn't
>>>> Just Work, since the way that we control the main loop should be
>>>> mostly orthogonal to how we handle peeled prologue iterations
>>>> and how we handle a single peeled epilogue iteration.
>>>>
>>>
>>> OK, I will remove it to see the impact.  By the way, do you think to
>>> use partial vectors for prologue is something worth to trying in future?
>>>
>>
>> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
>> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
>> fail at execution during vect-partial-vector-usage=2.  So far the patch
>> doesn't handle any niters_skip cases.  I think if we want to support it, 
>> we have to add some handlings in/like what we have for masking, such as: 
>> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>>
>> Do you prefer me to extend the support in this patch series?
> 
> It's not so much whether it has to be supported now, but more why
> it doesn't work now.  What was the reason for the failure?
> 
> The peeling-with-masking thing is just an optimisation, so that we
> can vectorise the peeled iterations rather than falling back to
> scalar code for them.  It shouldn't be needed for correctness.
> 

Whoops, thanks for the clarification!  Nice, I just realized it's a way to
adopt partial vectors for prologue.  The fail case is gcc.dg/vect/vect-ifcvt-11.c.
There the first iteration is optimized out due to the known AND result of
IV 0, then it tries to peel 3 iterations, the number of remaining iterations
for vectorization body is expected to be 12.  But it still uses 15 and causes
out-of-bound access.

The below fix can fix the failure.  The justification is that we need to use
the fixed up niters after peeling prolog for the vectorization body for
partial vectors.  I'm not sure why the other cases not using partial vectors 
don't need the fixed up niters, to avoid troubles I guarded it with 
LOOP_VINFO_USING_PARTIAL_VECTORS_P explicitly.

Does it make sense?

--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8888,6 +8896,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
			     LOOP_VINFO_INT_NITERS (loop_vinfo) / lowest_vf);
	  step_vector = build_one_cst (TREE_TYPE (niters));
	}
+      else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      && !vect_use_loop_mask_for_alignment_p (loop_vinfo))
+       vect_gen_vector_loop_niters (loop_vinfo, LOOP_VINFO_NITERS (loop_vinfo),
+				    &niters_vector, &step_vector,
+				    niters_no_overflow);
       else
	vect_gen_vector_loop_niters (loop_vinfo, niters, &niters_vector,
				     &step_vector, niters_no_overflow);

BR,
Kewen
Kewen.Lin July 8, 2020, 7:01 a.m. UTC | #9
Hi Richard,

on 2020/7/7 下午6:15, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2020/7/1 下午11:17, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> on 2020/7/1 上午3:53, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>
>>>> Sorry, I didn't quite follow this comment.  Both nscalars_totoal and
>>>> nscalars_step are scaled here.  The remaining related nscalars_*
>>>> seems only nscalars_skip, but length can't support skip.
>>>
>>> Hmm, OK.  But in that case can you update the names of the variables
>>> to match?  It's confusing to have some nscalars_* variables actually
>>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>>> variables count something else (and thus effectively be nitems_* variables
>>> themselves).
>>>
>>
>> OK.  I'll update the names like nscalars_total/nscalars_step and equivalents
>> to nitems_total/... (or nunits_total better?)
> 
> I agree “items” isn't great.  I was trying to avoid “units” because GCC
> often uses that to mean bytes (BITS_PER_UNIT, UNITS_PER_WORD, etc.).
> In this context that could be confusing, because sometimes the
> “units” actually would be bytes, but not always.
> 

Got it!  Thanks!

[...]
>>> But in this particular test, we're doing outer loop vectorisation,
>>> and the only elements of vres that matter are the ones selected
>>> by loop_mask (since those are the only ones that get stored out).
>>> So applying the loop mask to the VEC_COND_EXPR is “just” an
>>> (important) optimisation, rather than a correctness issue.
>>>  
>>
>> Thanks for the clarification.  It looks the vres is always safe since its
>> further usage is guard with loop mask.  Then sorry that I didn't catch why
>> it is one optimization for this case, is there some difference in backend
>> supports on this different mask for cond_expr?
> 
> No, the idea of the optimisation is to avoid cases in which we have:
> 
>     cmp_res = …compare…
>     cmp_res' = cmp_res & loop_mask
>     IFN_MASK_LOAD (…, cmp_res')
>     z = cmp_res ? x : y
> 
> The problem here is that cmp_res and cmp_res' are live at the same time,
> which prevents cmp_res and cmp_res' from being combined into a single
> instruction.  It's better for the final instruction to be:
> 
>     z = cmp_res' ? x : y
> 
> so that everything uses the same comparison result.
> 
> We can't leave that to later passes because nothing in the gimple IL
> indicates that only the loop_mask elements of z matter.
> 

Nice, thanks for the explanation.


[...]
>>> What's causing the test to start failing with the patch?  I realise
>>> you've probably already said, sorry, but it's been a large patch series
>>> so it's hard to keep all the details committed to memory.
>>>
>>
>> No problem, appreciate your time much!  Since length-based partial vectors
>> doesn't support any reduction so far, the function has the responsibility
>> to disable use_partial_vectors_p for it.  Without the above else-if part,
>> since the reduction_type is TREE_CODE_REDUCTION for this case, the else part
>> will stop this case to use mask-based partial vectors, but the case expects
>> the outer loop still able to use mask-based partial vectors.
>>
>> As your clarification above, else-if looks wrong.  Probably we can change it
>> to check whether the current vectorization is for outer loop and the condition
>> stmt being handled is in the inner loop, we can allow it for partial vectors?
> 
> I think it's more whether, for outer loop vectorisation, the reduction
> is a double reduction or a simple nested-cycle reduction.  Both have
> a COND_EXPR in the inner loop, but the extra elements only matter for
> double reductions.
> 
> There again, I don't think we actually support double reductions for
> COND_EXPR reductions.
> 

OK.  I will send one separate patch with your suggestion on this.

[...]
>>
>> OK, but it seems it's acceptable if the IV wider than the precision since
>> we allows it out of range?
> 
> For example, suppose that a target handled out-of-range values but
> still had a QImode length.  If the IV was wider than QI, we'd truncate
> 0x100 to 0 when generating the pattern, so a full-vector access would
> get truncated to an empty-vector access.
> 

Yeah, it's so true.

BR,
Kewen
Richard Sandiford July 8, 2020, 12:50 p.m. UTC | #10
"Kewen.Lin" <linkw@linux.ibm.com> writes:
>> […]
>>> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
>>> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
>>> fail at execution during vect-partial-vector-usage=2.  So far the patch
>>> doesn't handle any niters_skip cases.  I think if we want to support it, 
>>> we have to add some handlings in/like what we have for masking, such as: 
>>> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>>>
>>> Do you prefer me to extend the support in this patch series?
>> 
>> It's not so much whether it has to be supported now, but more why
>> it doesn't work now.  What was the reason for the failure?
>> 
>> The peeling-with-masking thing is just an optimisation, so that we
>> can vectorise the peeled iterations rather than falling back to
>> scalar code for them.  It shouldn't be needed for correctness.
>> 
>
> Whoops, thanks for the clarification!  Nice, I just realized it's a way to
> adopt partial vectors for prologue.  The fail case is gcc.dg/vect/vect-ifcvt-11.c.
> There the first iteration is optimized out due to the known AND result of
> IV 0, then it tries to peel 3 iterations, the number of remaining iterations
> for vectorization body is expected to be 12.  But it still uses 15 and causes
> out-of-bound access.
>
> The below fix can fix the failure.  The justification is that we need to use
> the fixed up niters after peeling prolog for the vectorization body for
> partial vectors.  I'm not sure why the other cases not using partial vectors 
> don't need the fixed up niters, to avoid troubles I guarded it with 
> LOOP_VINFO_USING_PARTIAL_VECTORS_P explicitly.

I think the reason is that if we're peeling prologue iterations and
the total number of iterations isn't fixed, full-vector vectorisation
will “almost always” need an epilogue loop too, and in that case
niters_vector will be nonnull.

But that's not guaranteed to be true forever.  E.g. if the start
pointers have a known misalignment that require peeling a constant
number of iterations N, and if we can prove (using enhanced range/
nonzero-bits information) that the way niters is calculated means
that niter - N is a multiple of the vector size, we could peel
the prologue and not the epilogue.  In that case, what your patch
does would be correct.

So…

> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -8888,6 +8896,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
> 			     LOOP_VINFO_INT_NITERS (loop_vinfo) / lowest_vf);
> 	  step_vector = build_one_cst (TREE_TYPE (niters));
> 	}
> +      else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> +	      && !vect_use_loop_mask_for_alignment_p (loop_vinfo))
> +       vect_gen_vector_loop_niters (loop_vinfo, LOOP_VINFO_NITERS (loop_vinfo),
> +				    &niters_vector, &step_vector,
> +				    niters_no_overflow);
>        else
> 	vect_gen_vector_loop_niters (loop_vinfo, niters, &niters_vector,
> 				     &step_vector, niters_no_overflow);

…I think we should drop the LOOP_VINFO_USING_PARTIAL_VECTORS_P
condition.  Could you also add a comment above the new call saying:

   /* vect_do_peeling subtracted the number of peeled prologue
      iterations from LOOP_VINFO_NITERS.  */

It wasn't obvious to me where the update was happening when I first
looked at the code.

Very minor, but maybe also switch the last two cases round so that
“else” is the default behaviour and the “if”s are the exceptions.

OK with those changes, thanks.

Richard
Kewen.Lin July 10, 2020, 7:40 a.m. UTC | #11
Hi Richard,

on 2020/7/8 下午8:50, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> […]
>>>> I tested the updated patch with this releasing, LOOP_VINFO_PEELING_FOR_GAPS
>>>> part looks fine, but LOOP_VINFO_PEELING_FOR_ALIGNMENT caused one case to
>>>> fail at execution during vect-partial-vector-usage=2.  So far the patch
>>>> doesn't handle any niters_skip cases.  I think if we want to support it, 
>>>> we have to add some handlings in/like what we have for masking, such as: 
>>>> mask_skip_niters, vect_prepare_for_masked_peels etc.  
>>>>
>>>> Do you prefer me to extend the support in this patch series?
>>>
>>> It's not so much whether it has to be supported now, but more why
>>> it doesn't work now.  What was the reason for the failure?
>>>
>>> The peeling-with-masking thing is just an optimisation, so that we
>>> can vectorise the peeled iterations rather than falling back to
>>> scalar code for them.  It shouldn't be needed for correctness.
>>>
>>
>> Whoops, thanks for the clarification!  Nice, I just realized it's a way to
>> adopt partial vectors for prologue.  The fail case is gcc.dg/vect/vect-ifcvt-11.c.
>> There the first iteration is optimized out due to the known AND result of
>> IV 0, then it tries to peel 3 iterations, the number of remaining iterations
>> for vectorization body is expected to be 12.  But it still uses 15 and causes
>> out-of-bound access.
>>
>> The below fix can fix the failure.  The justification is that we need to use
>> the fixed up niters after peeling prolog for the vectorization body for
>> partial vectors.  I'm not sure why the other cases not using partial vectors 
>> don't need the fixed up niters, to avoid troubles I guarded it with 
>> LOOP_VINFO_USING_PARTIAL_VECTORS_P explicitly.
> 
> I think the reason is that if we're peeling prologue iterations and
> the total number of iterations isn't fixed, full-vector vectorisation
> will “almost always” need an epilogue loop too, and in that case
> niters_vector will be nonnull.
> 
> But that's not guaranteed to be true forever.  E.g. if the start
> pointers have a known misalignment that require peeling a constant
> number of iterations N, and if we can prove (using enhanced range/
> nonzero-bits information) that the way niters is calculated means
> that niter - N is a multiple of the vector size, we could peel
> the prologue and not the epilogue.  In that case, what your patch
> does would be correct.
> 

Thanks for the explanation, it makes more sense!

> So…
> 
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -8888,6 +8896,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>> 			     LOOP_VINFO_INT_NITERS (loop_vinfo) / lowest_vf);
>> 	  step_vector = build_one_cst (TREE_TYPE (niters));
>> 	}
>> +      else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
>> +	      && !vect_use_loop_mask_for_alignment_p (loop_vinfo))
>> +       vect_gen_vector_loop_niters (loop_vinfo, LOOP_VINFO_NITERS (loop_vinfo),
>> +				    &niters_vector, &step_vector,
>> +				    niters_no_overflow);
>>        else
>> 	vect_gen_vector_loop_niters (loop_vinfo, niters, &niters_vector,
>> 				     &step_vector, niters_no_overflow);
> 
> …I think we should drop the LOOP_VINFO_USING_PARTIAL_VECTORS_P
> condition.  Could you also add a comment above the new call saying:
> 
>    /* vect_do_peeling subtracted the number of peeled prologue
>       iterations from LOOP_VINFO_NITERS.  */
> 
> It wasn't obvious to me where the update was happening when I first
> looked at the code.
> 
> Very minor, but maybe also switch the last two cases round so that
> “else” is the default behaviour and the “if”s are the exceptions.
> 
> OK with those changes, thanks.

Bootstrapped/regtested on aarch64-linux-gnu and powerpc64le-linux-gnu.

Committed it via r11-1978 by incorporating your comments.  Thanks!

BR,
Kewen
diff mbox series

Patch

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 06a04e3d7dd..284c15705ea 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13389,6 +13389,13 @@  by the copy loop headers pass.
 @item vect-epilogues-nomask
 Enable loop epilogue vectorization using smaller vector size.
 
+@item vect-with-length-scope
+Control the scope of vector memory access with length exploitation.  0 means we
+don't expliot any vector memory access with length, 1 means we only exploit
+vector memory access with length for those loops whose iteration number are
+less than VF, such as very small loop or epilogue, 2 means we want to exploit
+vector memory access with length for any loops if possible.
+
 @item slp-max-insns-in-bb
 Maximum number of instructions in basic block to be
 considered for SLP vectorization.
diff --git a/gcc/optabs-query.c b/gcc/optabs-query.c
index 215d68e4225..9c351759204 100644
--- a/gcc/optabs-query.c
+++ b/gcc/optabs-query.c
@@ -606,6 +606,60 @@  can_vec_mask_load_store_p (machine_mode mode,
   return false;
 }
 
+/* Return true if target supports vector load/store with length for vector
+   mode MODE.  There are two flavors for vector load/store with length, one
+   is to measure length with bytes, the other is to measure length with lanes.
+   As len_{load,store} optabs point out, for the flavor with bytes, we use
+   VnQI to wrap the other supportable same size vector modes.  Here the
+   pointer FACTOR is to indicate that it is using VnQI to wrap if its value
+   more than 1 and how many bytes for one element of wrapped vector mode.  */
+
+bool
+can_vec_len_load_store_p (machine_mode mode, bool is_load, unsigned int *factor)
+{
+  optab op = is_load ? len_load_optab : len_store_optab;
+  gcc_assert (VECTOR_MODE_P (mode));
+
+  /* Check if length in lanes supported for this mode directly.  */
+  if (direct_optab_handler (op, mode))
+    {
+      *factor = 1;
+      return true;
+    }
+
+  /* Check if length in bytes supported for VnQI with the same vector size.  */
+  scalar_mode emode = QImode;
+  poly_uint64 esize = GET_MODE_SIZE (emode);
+  poly_uint64 vsize = GET_MODE_SIZE (mode);
+  poly_uint64 nunits;
+
+  /* To get how many nunits it would have if the element is QImode.  */
+  if (multiple_p (vsize, esize, &nunits))
+    {
+      machine_mode vmode;
+      /* Check whether the related VnQI vector mode exists, as well as
+	 optab supported.  */
+      if (related_vector_mode (mode, emode, nunits).exists (&vmode)
+	  && direct_optab_handler (op, vmode))
+	{
+	  unsigned int mul;
+	  scalar_mode orig_emode = GET_MODE_INNER (mode);
+	  poly_uint64 orig_esize = GET_MODE_SIZE (orig_emode);
+
+	  if (constant_multiple_p (orig_esize, esize, &mul))
+	    *factor = mul;
+	  else
+	    gcc_unreachable ();
+
+	  return true;
+	}
+    }
+  else
+    gcc_unreachable ();
+
+  return false;
+}
+
 /* Return true if there is a compare_and_swap pattern.  */
 
 bool
diff --git a/gcc/optabs-query.h b/gcc/optabs-query.h
index 729e1fdfc81..9db9c91994a 100644
--- a/gcc/optabs-query.h
+++ b/gcc/optabs-query.h
@@ -188,6 +188,7 @@  enum insn_code find_widening_optab_handler_and_mode (optab, machine_mode,
 						     machine_mode *);
 int can_mult_highpart_p (machine_mode, bool);
 bool can_vec_mask_load_store_p (machine_mode, machine_mode, bool);
+bool can_vec_len_load_store_p (machine_mode, bool, unsigned int *);
 bool can_compare_and_swap_p (machine_mode, bool);
 bool can_atomic_exchange_p (machine_mode, bool);
 bool can_atomic_load_p (machine_mode);
diff --git a/gcc/params.opt b/gcc/params.opt
index 9b564bb046c..daa6e8a2beb 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -968,4 +968,8 @@  Bound on number of runtime checks inserted by the vectorizer's loop versioning f
 Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization
 Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check.
 
+-param=vect-with-length-scope=
+Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization
+Control the vector with length exploitation scope.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 458a6675c47..9b9bfb88b1a 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -399,19 +399,20 @@  vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    It is known that:
 
-     NITERS * RGC->max_nscalars_per_iter
+     NITERS * RGC->max_nscalars_per_iter * RGC->factor
 
    does not overflow.  However, MIGHT_WRAP_P says whether an induction
    variable that starts at 0 and has step:
 
-     VF * RGC->max_nscalars_per_iter
+     VF * RGC->max_nscalars_per_iter * RGC->factor
 
    might overflow before hitting a value above:
 
-     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter
+     (NITERS + NITERS_SKIP) * RGC->max_nscalars_per_iter * RGC->factor
 
    This means that we cannot guarantee that such an induction variable
-   would ever hit a value that produces a set of all-false masks for RGC.  */
+   would ever hit a value that produces a set of all-false masks or zero
+   lengths for RGC.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
@@ -422,10 +423,20 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 {
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+
   tree ctrl_type = rgc->type;
-  unsigned int nscalars_per_iter = rgc->max_nscalars_per_iter;
+  /* Scale up nscalars per iteration with factor.  */
+  unsigned int nscalars_per_iter_ft = rgc->max_nscalars_per_iter * rgc->factor;
   poly_uint64 nscalars_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  tree length_limit = NULL_TREE;
+  /* For length, we need length_limit to check length in range.  */
+  if (!vect_for_masking)
+    {
+      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
+      length_limit = build_int_cst (compare_type, len_limit);
+    }
 
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
@@ -434,12 +445,12 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   tree nscalars_total = niters;
   tree nscalars_step = build_int_cst (iv_type, vf);
   tree nscalars_skip = niters_skip;
-  if (nscalars_per_iter != 1)
+  if (nscalars_per_iter_ft != 1)
     {
       /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
 	 these multiplications don't overflow.  */
-      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
-      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
+      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
+      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
       nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
 				     nscalars_total, compare_factor);
       nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
@@ -509,7 +520,7 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	     NSCALARS_SKIP to that cannot overflow.  */
 	  tree const_limit = build_int_cst (compare_type,
 					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-					    * nscalars_per_iter);
+					    * nscalars_per_iter_ft);
 	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
 				      nscalars_total, const_limit);
 	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
@@ -549,16 +560,16 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
     {
       /* Previous controls will cover BIAS scalars.  This control covers the
 	 next batch.  */
-      poly_uint64 bias = nscalars_per_ctrl * i;
+      poly_uint64 batch_nscalars_ft = nscalars_per_ctrl * rgc->factor;
+      poly_uint64 bias = batch_nscalars_ft * i;
       tree bias_tree = build_int_cst (compare_type, bias);
-      gimple *tmp_stmt;
 
       /* See whether the first iteration of the vector loop is known
 	 to have a full control.  */
       poly_uint64 const_limit;
       bool first_iteration_full
 	= (poly_int_tree_p (first_limit, &const_limit)
-	   && known_ge (const_limit, (i + 1) * nscalars_per_ctrl));
+	   && known_ge (const_limit, (i + 1) * batch_nscalars_ft));
 
       /* Rather than have a new IV that starts at BIAS and goes up to
 	 TEST_LIMIT, prefer to use the same 0-based IV for each control
@@ -598,9 +609,19 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	      end = first_limit;
 	    }
 
-	  init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
-	  tmp_stmt = vect_gen_while (init_ctrl, start, end);
-	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	  if (vect_for_masking)
+	    {
+	      init_ctrl = make_temp_ssa_name (ctrl_type, NULL, "max_mask");
+	      gimple *tmp_stmt = vect_gen_while (init_ctrl, start, end);
+	      gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	    }
+	  else
+	    {
+	      init_ctrl = make_temp_ssa_name (compare_type, NULL, "max_len");
+	      gimple_seq seq = vect_gen_len (init_ctrl, start,
+					     end, length_limit);
+	      gimple_seq_add_seq (preheader_seq, seq);
+	    }
 	}
 
       /* Now AND out the bits that are within the number of skipped
@@ -617,16 +638,32 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 				      init_ctrl, unskipped_mask);
 	  else
 	    init_ctrl = unskipped_mask;
+	  gcc_assert (vect_for_masking);
 	}
 
+      /* First iteration is full.  */
       if (!init_ctrl)
-	/* First iteration is full.  */
-	init_ctrl = build_minus_one_cst (ctrl_type);
+	{
+	  if (vect_for_masking)
+	    init_ctrl = build_minus_one_cst (ctrl_type);
+	  else
+	    init_ctrl = length_limit;
+	}
 
       /* Get the control value for the next iteration of the loop.  */
-      next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
-      gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
-      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+      if (vect_for_masking)
+	{
+	  next_ctrl = make_temp_ssa_name (ctrl_type, NULL, "next_mask");
+	  gcall *call = vect_gen_while (next_ctrl, test_index, this_test_limit);
+	  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+	}
+      else
+	{
+	  next_ctrl = make_temp_ssa_name (compare_type, NULL, "next_len");
+	  gimple_seq seq = vect_gen_len (next_ctrl, test_index, this_test_limit,
+					 length_limit);
+	  gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT);
+	}
 
       vect_set_loop_control (loop, ctrl, init_ctrl, next_ctrl);
     }
@@ -652,6 +689,7 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
   gimple_seq preheader_seq = NULL;
   gimple_seq header_seq = NULL;
 
+  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   unsigned int compare_precision = TYPE_PRECISION (compare_type);
   tree orig_niters = niters;
@@ -686,28 +724,30 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
   tree test_ctrl = NULL_TREE;
   rgroup_controls *rgc;
   unsigned int i;
-  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
-  FOR_EACH_VEC_ELT (*masks, i, rgc)
+  auto_vec<rgroup_controls> *controls = vect_for_masking
+					  ? &LOOP_VINFO_MASKS (loop_vinfo)
+					  : &LOOP_VINFO_LENS (loop_vinfo);
+  FOR_EACH_VEC_ELT (*controls, i, rgc)
     if (!rgc->controls.is_empty ())
       {
 	/* First try using permutes.  This adds a single vector
 	   instruction to the loop for each mask, but needs no extra
 	   loop invariants or IVs.  */
 	unsigned int nmasks = i + 1;
-	if ((nmasks & 1) == 0)
+	if (vect_for_masking && (nmasks & 1) == 0)
 	  {
-	    rgroup_controls *half_rgc = &(*masks)[nmasks / 2 - 1];
+	    rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
 	    if (!half_rgc->controls.is_empty ()
 		&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
 	      continue;
 	  }
 
 	/* See whether zero-based IV would ever generate all-false masks
-	   before wrapping around.  */
+	   or zero length before wrapping around.  */
+	unsigned nscalars_ft = rgc->max_nscalars_per_iter * rgc->factor;
 	bool might_wrap_p
 	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * rgc->max_nscalars_per_iter,
-				    UNSIGNED)
+	     || (wi::min_precision (iv_limit * nscalars_ft, UNSIGNED)
 		 > compare_precision));
 
 	/* Set up all controls for this group.  */
@@ -2568,7 +2608,8 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (vect_epilogues
       && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && prolog_peeling >= 0
-      && known_eq (vf, lowest_vf))
+      && known_eq (vf, lowest_vf)
+      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (epilogue_vinfo))
     {
       unsigned HOST_WIDE_INT eiters
 	= (LOOP_VINFO_INT_NITERS (loop_vinfo)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 6311e795204..1079807534b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -816,6 +816,7 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (true),
     using_partial_vectors_p (false),
+    epil_using_partial_vectors_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     no_data_dependencies (false),
@@ -898,6 +899,7 @@  _loop_vec_info::~_loop_vec_info ()
   free (bbs);
 
   release_vec_loop_controls (&masks);
+  release_vec_loop_controls (&lens);
   delete ivexpr_map;
   delete scan_map;
   epilogue_vinfos.release ();
@@ -1072,6 +1074,88 @@  vect_verify_full_masking (loop_vec_info loop_vinfo)
   return true;
 }
 
+/* Check whether we can use vector access with length based on precison
+   comparison.  So far, to keep it simple, we only allow the case that the
+   precision of the target supported length is larger than the precision
+   required by loop niters.  */
+
+static bool
+vect_verify_loop_lens (loop_vec_info loop_vinfo)
+{
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+
+  if (LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    return false;
+
+  /* The one which has the largest NV should have max bytes per iter.  */
+  rgroup_controls *rgl = &(*lens)[lens->length () - 1];
+
+  /* Work out how many bits we need to represent the length limit.  */
+  unsigned int nscalars_per_iter_ft = rgl->max_nscalars_per_iter * rgl->factor;
+  unsigned int min_ni_prec
+    = vect_min_prec_for_max_niters (loop_vinfo, nscalars_per_iter_ft);
+
+  /* Now use the maximum of below precisions for one suitable IV type:
+     - the IV's natural precision
+     - the precision needed to hold: the maximum number of scalar
+       iterations multiplied by the scale factor (min_ni_prec above)
+     - the Pmode precision
+  */
+
+  /* If min_ni_width is less than the precision of the current niters,
+     we perfer to still use the niters type.  */
+  unsigned int ni_prec
+    = TYPE_PRECISION (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)));
+  /* Prefer to use Pmode and wider IV to avoid narrow conversions.  */
+  unsigned int pmode_prec = GET_MODE_BITSIZE (Pmode);
+
+  unsigned int required_prec = ni_prec;
+  if (required_prec < pmode_prec)
+    required_prec = pmode_prec;
+
+  tree iv_type = NULL_TREE;
+  if (min_ni_prec > required_prec)
+    {
+      opt_scalar_int_mode tmode_iter;
+      unsigned standard_bits = 0;
+      FOR_EACH_MODE_IN_CLASS (tmode_iter, MODE_INT)
+      {
+	scalar_mode tmode = tmode_iter.require ();
+	unsigned int tbits = GET_MODE_BITSIZE (tmode);
+
+	/* ??? Do we really want to construct one IV whose precision exceeds
+	   BITS_PER_WORD?  */
+	if (tbits > BITS_PER_WORD)
+	  break;
+
+	/* Find the first available standard integral type.  */
+	if (tbits >= min_ni_prec && targetm.scalar_mode_supported_p (tmode))
+	  {
+	    standard_bits = tbits;
+	    break;
+	  }
+      }
+      if (standard_bits != 0)
+	iv_type = build_nonstandard_integer_type (standard_bits, true);
+    }
+  else
+    iv_type = build_nonstandard_integer_type (required_prec, true);
+
+  if (!iv_type)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize with length-based partial vectors"
+			 " due to no suitable iv type.\n");
+      return false;
+    }
+
+  LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo) = iv_type;
+  LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo) = iv_type;
+
+  return true;
+}
+
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
 vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
@@ -2170,11 +2254,64 @@  start_over:
       return ok;
     }
 
-  /* Decide whether to use a fully-masked loop for this vectorization
-     factor.  */
-  LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-    = (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-       && vect_verify_full_masking (loop_vinfo));
+  /* For now, we don't expect to mix both masking and length approaches for one
+     loop, disable it if both are recorded.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ()
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't vectorize a loop with partial vectors"
+			 " because we don't expect to mix different"
+			 " approaches with partial vectors for the"
+			 " same loop.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
+  /* Decide whether to vectorize a loop with partial vectors for
+     this vectorization factor.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      /* Decide whether to use fully-masked approach.  */
+      if (vect_verify_full_masking (loop_vinfo))
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+      /* Decide whether to use length-based approach.  */
+      else if (vect_verify_loop_lens (loop_vinfo))
+	{
+	  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	      || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't vectorize this loop with length-based"
+				 " partial vectors approach becuase peeling"
+				 " for alignment or gaps is required.\n");
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	  else if (param_vect_with_length_scope == 0)
+	    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  /* The epilogue and other known niters less than VF
+	    cases can still use vector access with length fully.  */
+	  else if (param_vect_with_length_scope == 1
+		   && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+		   && !vect_known_niters_smaller_than_vf (loop_vinfo))
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	    }
+	  else
+	    {
+	      LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
+	      LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
+	}
+      else
+	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+  else
+    LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
+
   if (dump_enabled_p ())
     {
       if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
@@ -2183,6 +2320,15 @@  start_over:
       else
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "not using a fully-masked loop.\n");
+
+      if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "using length-based partial"
+			 " vectors for loop fully.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not using length-based partial"
+			 " vectors for loop fully.\n");
     }
 
   /* If epilog loop is required because of data accesses with gaps,
@@ -2406,6 +2552,7 @@  again:
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
   /* Reset accumulated rgroup information.  */
   release_vec_loop_controls (&LOOP_VINFO_MASKS (loop_vinfo));
+  release_vec_loop_controls (&LOOP_VINFO_LENS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
@@ -2692,7 +2839,10 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		lowest_th = ordered_min (lowest_th, th);
 	    }
 	  else
-	    delete loop_vinfo;
+	    {
+	      delete loop_vinfo;
+	      loop_vinfo = opt_loop_vec_info::success (NULL);
+	    }
 
 	  /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is
 	     enabled, SIMDUID is not set, it is the innermost loop and we have
@@ -2717,6 +2867,7 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
       else
 	{
 	  delete loop_vinfo;
+	  loop_vinfo = opt_loop_vec_info::success (NULL);
 	  if (fatal)
 	    {
 	      gcc_checking_assert (first_loop_vinfo == NULL);
@@ -2724,6 +2875,23 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	    }
 	}
 
+      /* Handle the case that the original loop can use partial
+	 vectorization, but want to only adopt it for the epilogue.
+	 The retry should be in the same mode as original.  */
+      if (vect_epilogues
+	  && loop_vinfo
+	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** Re-trying analysis with same vector mode"
+			     " %s for epilogue with partial vectors.\n",
+			     GET_MODE_NAME (loop_vinfo->vector_mode));
+	  continue;
+	}
+
       if (mode_i < vector_modes.length ()
 	  && VECTOR_MODE_P (autodetected_vector_mode)
 	  && (related_vector_mode (vector_modes[mode_i],
@@ -3564,6 +3732,11 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 			    target_cost_data, num_masks - 1, vector_stmt,
 			    NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
   else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
@@ -8197,6 +8370,7 @@  vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
     {
       rgm->max_nscalars_per_iter = nscalars_per_iter;
       rgm->type = truth_type_for (vectype);
+      rgm->factor = 1;
     }
 }
 
@@ -8249,6 +8423,63 @@  vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
   return mask;
 }
 
+/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS
+   lengths for vector access with length that each control a vector of type
+   VECTYPE.  FACTOR is only meaningful for length in bytes, and to indicate
+   how many bytes for each element (lane).  */
+
+void
+vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		      unsigned int nvectors, tree vectype, unsigned int factor)
+{
+  gcc_assert (nvectors != 0);
+  if (lens->length () < nvectors)
+    lens->safe_grow_cleared (nvectors);
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* The number of scalars per iteration, scalar occupied bytes and
+     the number of vectors are both compile-time constants.  */
+  unsigned int nscalars_per_iter
+    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+
+  if (rgl->max_nscalars_per_iter < nscalars_per_iter)
+    {
+      rgl->max_nscalars_per_iter = nscalars_per_iter;
+      rgl->type = vectype;
+      rgl->factor = factor;
+    }
+}
+
+/* Given a complete set of length LENS, extract length number INDEX for an
+   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
+
+tree
+vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
+		   unsigned int nvectors, unsigned int index)
+{
+  rgroup_controls *rgl = &(*lens)[nvectors - 1];
+
+  /* Populate the rgroup's len array, if this is the first time we've
+     used it.  */
+  if (rgl->controls.is_empty ())
+    {
+      rgl->controls.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  tree len_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  gcc_assert (len_type != NULL_TREE);
+	  tree len = make_temp_ssa_name (len_type, NULL, "loop_len");
+
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (len) = gimple_build_nop ();
+	  rgl->controls[i] = len;
+	}
+    }
+
+  return rgl->controls[index];
+}
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index cdd6f6c5e5d..e0ffbab1d02 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1742,29 +1742,56 @@  check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       return;
     }
 
-  machine_mode mask_mode;
-  if (!VECTOR_MODE_P (vecmode)
-      || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
-      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+  if (!VECTOR_MODE_P (vecmode))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "can't use a fully-masked loop because the target"
-			 " doesn't have the appropriate masked load or"
-			 " store.\n");
+			 "can't operate on partial vectors because of"
+			 " the unexpected mode.\n");
       LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
       return;
     }
-  /* We might load more scalars than we need for permuting SLP loads.
-     We checked in get_group_load_store_type that the extra elements
-     don't leak into a new vector.  */
+
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
   unsigned int nvectors;
-  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
-    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype, scalar_mask);
-  else
-    gcc_unreachable ();
+
+  machine_mode mask_mode;
+  bool using_partial_vectors_p = false;
+  if (targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
+      && can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      /* We might load more scalars than we need for permuting SLP loads.
+	 We checked in get_group_load_store_type that the extra elements
+	 don't leak into a new vector.  */
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype,
+			       scalar_mask);
+      else
+	gcc_unreachable ();
+      using_partial_vectors_p = true;
+    }
+
+  unsigned int factor;
+  if (can_vec_len_load_store_p (vecmode, is_load, &factor))
+    {
+      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+      if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor);
+      else
+	gcc_unreachable ();
+      using_partial_vectors_p = true;
+    }
+
+  if (!using_partial_vectors_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't operate on partial vectors because the"
+			 " target doesn't have the appropriate partial"
+			 "vectorization load or store.\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
 }
 
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
@@ -6936,6 +6963,28 @@  vectorizable_scan_store (vec_info *vinfo,
   return true;
 }
 
+/* For the vector type VTYPE, return the same size vector type with
+   QImode element, which is mainly for vector load/store with length
+   in bytes.  */
+
+static tree
+vect_get_same_size_vec_for_len (tree vtype)
+{
+  gcc_assert (VECTOR_TYPE_P (vtype));
+  machine_mode v_mode = TYPE_MODE (vtype);
+  gcc_assert (GET_MODE_INNER (v_mode) != QImode);
+
+  /* Obtain new element counts with QImode.  */
+  poly_uint64 vsize = GET_MODE_SIZE (v_mode);
+  poly_uint64 esize = GET_MODE_SIZE (QImode);
+  poly_uint64 nelts = exact_div (vsize, esize);
+
+  /* Build element type with QImode.  */
+  unsigned int eprec = GET_MODE_PRECISION (QImode);
+  tree etype = build_nonstandard_integer_type (eprec, 1);
+
+  return build_vector_type (etype, nelts);
+}
 
 /* Function vectorizable_store.
 
@@ -7655,6 +7704,14 @@  vectorizable_store (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -7911,10 +7968,16 @@  vectorizable_store (vec_info *vinfo,
 	      unsigned HOST_WIDE_INT align;
 
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -7994,6 +8057,34 @@  vectorizable_store (vec_info *vinfo,
 		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 		  new_stmt = call;
 		}
+	      else if (final_len)
+		{
+		  align = least_bit_hwi (misalign | align);
+		  tree ptr = build_int_cst (ref_type, align);
+		  tree vtype = TREE_TYPE (vec_oprnd);
+		  /* Need conversion if it's wrapped with VnQI.  */
+		  if (!direct_optab_handler (len_store_optab,
+					     TYPE_MODE (vtype)))
+		    {
+		      tree new_vtype = vect_get_same_size_vec_for_len (vtype);
+		      tree var
+			= vect_get_new_ssa_name (new_vtype, vect_simple_var);
+		      vec_oprnd
+			= build1 (VIEW_CONVERT_EXPR, new_vtype, vec_oprnd);
+		      gassign *new_stmt
+			= gimple_build_assign (var, VIEW_CONVERT_EXPR,
+					       vec_oprnd);
+		      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt,
+						   gsi);
+		      vec_oprnd = var;
+		    }
+		  gcall *call
+		    = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr,
+						  ptr, final_len, vec_oprnd);
+		  gimple_call_set_nothrow (call, true);
+		  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+		  new_stmt = call;
+		}
 	      else
 		{
 		  data_ref = fold_build2 (MEM_REF, vectype,
@@ -8531,6 +8622,7 @@  vectorizable_load (vec_info *vinfo,
       tree dr_offset;
 
       gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+      gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
 
       if (grouped_load)
@@ -8819,6 +8911,14 @@  vectorizable_load (vec_info *vinfo,
     = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        ? &LOOP_VINFO_MASKS (loop_vinfo)
        : NULL);
+  vec_loop_lens *loop_lens
+    = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)
+       ? &LOOP_VINFO_LENS (loop_vinfo)
+       : NULL);
+
+  /* Shouldn't go with length-based approach if fully masked.  */
+  gcc_assert (!loop_lens || (loop_lens && !loop_masks));
+
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
@@ -9134,11 +9234,18 @@  vectorizable_load (vec_info *vinfo,
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      tree final_mask = NULL_TREE;
+	      tree final_len = NULL_TREE;
 	      if (loop_masks
 		  && memory_access_type != VMAT_INVARIANT)
 		final_mask = vect_get_loop_mask (gsi, loop_masks,
 						 vec_num * ncopies,
 						 vectype, vec_num * j + i);
+	      else if (loop_lens
+		  && memory_access_type != VMAT_INVARIANT)
+		final_len = vect_get_loop_len (loop_vinfo, loop_lens,
+					       vec_num * ncopies,
+					       vec_num * j + i);
+
 	      if (vec_mask)
 		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
 						      vec_mask, gsi);
@@ -9207,6 +9314,35 @@  vectorizable_load (vec_info *vinfo,
 			new_stmt = call;
 			data_ref = NULL_TREE;
 		      }
+		    else if (final_len)
+		      {
+			align = least_bit_hwi (misalign | align);
+			tree ptr = build_int_cst (ref_type, align);
+			gcall *call
+			  = gimple_build_call_internal (IFN_LEN_LOAD, 3,
+							dataref_ptr, ptr,
+							final_len);
+			gimple_call_set_nothrow (call, true);
+			new_stmt = call;
+			data_ref = NULL_TREE;
+
+			/* Need conversion if it's wrapped with VnQI.  */
+			if (!direct_optab_handler (len_load_optab,
+						   TYPE_MODE (vectype)))
+			  {
+			    tree new_vtype
+			      = vect_get_same_size_vec_for_len (vectype);
+			    tree var = vect_get_new_ssa_name (new_vtype,
+							      vect_simple_var);
+			    gimple_set_lhs (call, var);
+			    vect_finish_stmt_generation (vinfo, stmt_info, call,
+							 gsi);
+			    tree op = build1 (VIEW_CONVERT_EXPR, vectype, var);
+			    new_stmt
+			      = gimple_build_assign (vec_dest,
+						     VIEW_CONVERT_EXPR, op);
+			  }
+		      }
 		    else
 		      {
 			tree ltype = vectype;
@@ -9850,11 +9986,30 @@  vectorizable_condition (vec_info *vinfo,
 	  return false;
 	}
 
-      if (loop_vinfo
-	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-	  && reduction_type == EXTRACT_LAST_REDUCTION)
-	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
-			       ncopies * vec_num, vectype, NULL);
+      if (loop_vinfo && for_reduction
+	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  if (reduction_type == EXTRACT_LAST_REDUCTION)
+	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
+				   ncopies * vec_num, vectype, NULL);
+	  /* Using partial vectors can introduce inactive lanes in the last
+	     iteration, since full vector of condition results are operated,
+	     it's unsafe here.  But if we can AND the condition mask with
+	     loop mask, it would be safe then.  */
+	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
+	    {
+	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
+	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
+		{
+		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
+		  cond.code = invert_tree_comparison (cond.code, honor_nans);
+		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
+		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+		}
+	    }
+	  else
+	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	}
 
       STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
       vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
@@ -11910,3 +12065,36 @@  vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
   *nunits_vectype_out = nunits_vectype;
   return opt_result::success ();
 }
+
+/* Generate and return statement sequence that sets vector length LEN that is:
+
+   min_of_start_and_end = min (START_INDEX, END_INDEX);
+   left_len = END_INDEX - min_of_start_and_end;
+   rhs = min (left_len, LEN_LIMIT);
+   LEN = rhs;
+
+   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
+   means if we have left_len in bytes larger than 255, it can't be saturated to
+   vector limit (vector size).  One target hook can be provided if other ports
+   don't suffer this.
+*/
+
+gimple_seq
+vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
+{
+  gimple_seq stmts = NULL;
+  tree len_type = TREE_TYPE (len);
+  gcc_assert (TREE_TYPE (start_index) == len_type);
+
+  tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index);
+  tree left_len = fold_build2 (MINUS_EXPR, len_type, end_index, min);
+  left_len = fold_build2 (MIN_EXPR, len_type, left_len, len_limit);
+
+  tree rhs = force_gimple_operand (left_len, &stmts, true, NULL_TREE);
+  gimple *new_stmt = gimple_build_assign (len, rhs);
+  gimple_stmt_iterator i = gsi_last (stmts);
+  gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING);
+
+  return stmts;
+}
+
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6c830ad09f4..4155ffe1d49 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -417,6 +417,16 @@  is_a_helper <_bb_vec_info *>::test (vec_info *i)
    are compile-time constants but VF and nL can be variable (if the target
    supports variable-length vectors).
 
+   Moreover, for some approach with partial vectors like being controlled
+   by length (in bytes), it cares about the occupied bytes for each scalar.
+   Provided that each scalar has factor bytes, the total number of scalar
+   values becomes to factor * N, the above equation becomes to:
+
+       factor * N = factor * NS * VF = factor * NV * NL
+
+   factor * NS is the bytes of each scalar, factor * NL is the vector size
+   in bytes.
+
    In classical vectorization, each iteration of the vector loop would
    handle exactly VF iterations of the original scalar loop.  However,
    in vector loops that are able to operate on partial vectors, a
@@ -473,14 +483,19 @@  is_a_helper <_bb_vec_info *>::test (vec_info *i)
    first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
    the second being indexed by the mask index 0 <= i < nV.  */
 
-/* The controls (like masks) needed by rgroups with nV vectors,
+/* The controls (like masks, lengths) needed by rgroups with nV vectors,
    according to the description above.  */
 struct rgroup_controls {
   /* The largest nS for all rgroups that use these controls.  */
   unsigned int max_nscalars_per_iter;
 
-  /* The type of control to use, based on the highest nS recorded above.
-     For mask-based approach, it's used for mask_type.  */
+  /* For now, it's mainly used for length-based in bytes approach, it's
+     record the occupied bytes of each scalar.  */
+  unsigned int factor;
+
+  /* This type is based on the highest nS recorded above.
+     For mask-based approach, it records mask type to use.
+     For length-based approach, it records appropriate vector type.  */
   tree type;
 
   /* A vector of nV controls, in iteration order.  */
@@ -489,6 +504,8 @@  struct rgroup_controls {
 
 typedef auto_vec<rgroup_controls> vec_loop_masks;
 
+typedef auto_vec<rgroup_controls> vec_loop_lens;
+
 typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
 
 /*-----------------------------------------------------------------*/
@@ -536,6 +553,10 @@  public:
      on inactive scalars.  */
   vec_loop_masks masks;
 
+  /* The lengths that a loop with length should use to avoid operating
+     on inactive scalars.  */
+  vec_loop_lens lens;
+
   /* Set of scalar conditions that have loop mask applied.  */
   scalar_cond_masked_set_type scalar_cond_masked_set;
 
@@ -644,6 +665,10 @@  public:
      the vector loop can handle fewer than VF scalars.  */
   bool using_partial_vectors_p;
 
+  /* True if we've decided to use partially-populated vectors for the
+     epilogue of loop, only for length-based approach for now.  */
+  bool epil_using_partial_vectors_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -707,9 +732,12 @@  public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
+  (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
 #define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_LENS(L)                 (L)->lens
 #define LOOP_VINFO_MASK_SKIP_NITERS(L)     (L)->mask_skip_niters
 #define LOOP_VINFO_RGROUP_COMPARE_TYPE(L)  (L)->rgroup_compare_type
 #define LOOP_VINFO_RGROUP_IV_TYPE(L)       (L)->rgroup_iv_type
@@ -747,6 +775,10 @@  public:
   (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
    && !LOOP_VINFO_MASKS (L).is_empty ())
 
+#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L)	\
+  (LOOP_VINFO_USING_PARTIAL_VECTORS_P (L)	\
+   && !LOOP_VINFO_LENS (L).is_empty ())
+
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L)	\
   ((L)->may_misalign_stmts.length () > 0)
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIAS(L)		\
@@ -1866,6 +1898,11 @@  extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+				  tree, unsigned int);
+extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
+			       unsigned int);
+extern gimple_seq vect_gen_len (tree, tree, tree, tree);
 extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
 
 /* Drive for loop transformation stage.  */