
[vec-tails,05/10] Check if loop can be masked

Message ID 20160519194208.GF40563@msticlxl57.ims.intel.com
State New

Commit Message

Ilya Enkovich May 19, 2016, 7:42 p.m. UTC
Hi,

This patch introduces analysis to determine if loop can be masked
(compute LOOP_VINFO_CAN_BE_MASKED and LOOP_VINFO_REQUIRED_MASKS)
and compute how much masking costs.

Thanks,
Ilya
--
gcc/

2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>

	* tree-vect-loop.c: Include insn-config.h and recog.h.
	(vect_check_required_masks_widening): New.
	(vect_check_required_masks_narrowing): New.
	(vect_get_masking_iv_elems): New.
	(vect_get_masking_iv_type): New.
	(vect_get_extreme_masks): New.
	(vect_check_required_masks): New.
	(vect_analyze_loop_operations): Add vect_check_required_masks
	call to compute LOOP_VINFO_CAN_BE_MASKED.
	(vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
	LOOP_VINFO_NEED_MASKING before starting over.
	(vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
	masking cost.
	* tree-vect-stmts.c (can_mask_load_store): New.
	(vect_model_load_masking_cost): New.
	(vect_model_store_masking_cost): New.
	(vect_model_simple_masking_cost): New.
	(vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
	and masking cost.
	(vectorizable_simd_clone_call): Likewise.
	(vectorizable_store): Likewise.
	(vectorizable_load): Likewise.
	(vect_stmt_should_be_masked_for_epilogue): New.
	(vect_add_required_mask_for_stmt): New.
	(vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
	* tree-vectorizer.h (vect_model_load_masking_cost): New.
	(vect_model_store_masking_cost): New.
	(vect_model_simple_masking_cost): New.

Comments

Richard Biener June 15, 2016, 11:22 a.m. UTC | #1
On Thu, May 19, 2016 at 9:42 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hi,
>
> This patch introduces analysis to determine if loop can be masked
> (compute LOOP_VINFO_CAN_BE_MASKED and LOOP_VINFO_REQUIRED_MASKS)
> and compute how much masking costs.

Maybe in a different patch, but it looks like you assume that, say, a
division does not need masking.

Code-generation-wise we'd add a new iv starting with

 iv = { 0, 1, 2, 3 };

and the mask is computed by comparing that against {niter, niter, niter, niter}?

So if we need masks for different vector element counts we could also add
additional IVs rather than "widening"/"shortening" the comparison result.
cond-expr reduction uses this kind of IV as well, which is a chance to share
some code (eventually).
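
To make the scheme concrete, here is a rough scalar C model of it (an
illustration only, not the GIMPLE we would actually emit; a fixed VF of 4
is assumed):

  void
  masked_copy (int *dst, const int *src, unsigned niters)
  {
    for (unsigned i = 0; i < niters; i += 4)
      {
        unsigned iv[4], mask[4];
        for (int l = 0; l < 4; l++)
          iv[l] = i + l;              /* iv = { i, i+1, i+2, i+3 }  */
        for (int l = 0; l < 4; l++)
          mask[l] = iv[l] < niters;   /* compare against { niters, ... }  */
        for (int l = 0; l < 4; l++)
          if (mask[l])                /* masked load and masked store  */
            dst[i + l] = src[i + l];
      }
  }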

You look at TREE_TYPE of LOOP_VINFO_NITERS (loop_vinfo) - I don't think
this is meaningful (if then only by accident).  I think you should look at the
control IV itself, possibly its value-range, to determine the smallest possible
type to use.

Finally we have a related missed optimization opportunity, namely avoiding
peeling for gaps if we mask the last load of the group (profitability depends
on the overhead of such masking of course as it would be done in the main
vectorized loop).

Richard.

> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
>         * tree-vect-loop.c: Include insn-config.h and recog.h.
>         (vect_check_required_masks_widening): New.
>         (vect_check_required_masks_narrowing): New.
>         (vect_get_masking_iv_elems): New.
>         (vect_get_masking_iv_type): New.
>         (vect_get_extreme_masks): New.
>         (vect_check_required_masks): New.
>         (vect_analyze_loop_operations): Add vect_check_required_masks
>         call to compute LOOP_VINFO_CAN_BE_MASKED.
>         (vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
>         LOOP_VINFO_NEED_MASKING before starting over.
>         (vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
>         masking cost.
>         * tree-vect-stmts.c (can_mask_load_store): New.
>         (vect_model_load_masking_cost): New.
>         (vect_model_store_masking_cost): New.
>         (vect_model_simple_masking_cost): New.
>         (vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
>         and masking cost.
>         (vectorizable_simd_clone_call): Likewise.
>         (vectorizable_store): Likewise.
>         (vectorizable_load): Likewise.
>         (vect_stmt_should_be_masked_for_epilogue): New.
>         (vect_add_required_mask_for_stmt): New.
>         (vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
>         * tree-vectorizer.h (vect_model_load_masking_cost): New.
>         (vect_model_store_masking_cost): New.
>         (vect_model_simple_masking_cost): New.
>
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index e25a0ce..31360d3 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -31,6 +31,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "ssa.h"
>  #include "optabs-tree.h"
> +#include "insn-config.h"
> +#include "recog.h"             /* FIXME: for insn_data */
>  #include "diagnostic-core.h"
>  #include "fold-const.h"
>  #include "stor-layout.h"
> @@ -1601,6 +1603,266 @@ vect_update_vf_for_slp (loop_vec_info loop_vinfo)
>                      vectorization_factor);
>  }
>
> +/* Function vect_check_required_masks_widening.
> +
> +   Return 1 if vector mask of type MASK_TYPE can be widened
> +   to a type having REQ_ELEMS elements in a single vector.  */
> +
> +static bool
> +vect_check_required_masks_widening (loop_vec_info loop_vinfo,
> +                                   tree mask_type, unsigned req_elems)
> +{
> +  unsigned mask_elems = TYPE_VECTOR_SUBPARTS (mask_type);
> +
> +  gcc_assert (mask_elems > req_elems);
> +
> +  /* Don't convert if it requires too many intermediate steps.  */
> +  int steps = exact_log2 (mask_elems / req_elems);
> +  if (steps > MAX_INTERM_CVT_STEPS + 1)
> +    return false;
> +
> +  /* Check we have conversion support for given mask mode.  */
> +  machine_mode mode = TYPE_MODE (mask_type);
> +  insn_code icode = optab_handler (vec_unpacks_lo_optab, mode);
> +  if (icode == CODE_FOR_nothing
> +      || optab_handler (vec_unpacks_hi_optab, mode) == CODE_FOR_nothing)
> +    return false;
> +
> +  /* Make recursive call for multi-step conversion.  */
> +  if (steps > 1)
> +    {
> +      mask_elems = mask_elems >> 1;
> +      mask_type = build_truth_vector_type (mask_elems, current_vector_size);
> +      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
> +       return false;
> +
> +      if (!vect_check_required_masks_widening (loop_vinfo, mask_type,
> +                                              req_elems))
> +       return false;
> +    }
> +  else
> +    {
> +      mask_type = build_truth_vector_type (req_elems, current_vector_size);
> +      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
> +       return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* Function vect_check_required_masks_narowing.
> +
> +   Return 1 if vector mask of type MASK_TYPE can be narrowed
> +   to a type having REQ_ELEMS elements in a single vector.  */
> +
> +static bool
> +vect_check_required_masks_narrowing (loop_vec_info loop_vinfo,
> +                                    tree mask_type, unsigned req_elems)
> +{
> +  unsigned mask_elems = TYPE_VECTOR_SUBPARTS (mask_type);
> +
> +  gcc_assert (req_elems > mask_elems);
> +
> +  /* Don't convert if it requires too many intermediate steps.  */
> +  int steps = exact_log2 (req_elems / mask_elems);
> +  if (steps > MAX_INTERM_CVT_STEPS + 1)
> +    return false;
> +
> +  /* Check we have conversion support for given mask mode.  */
> +  machine_mode mode = TYPE_MODE (mask_type);
> +  insn_code icode = optab_handler (vec_pack_trunc_optab, mode);
> +  if (icode == CODE_FOR_nothing)
> +    return false;
> +
> +  /* Make recursive call for multi-step conversion.  */
> +  if (steps > 1)
> +    {
> +      mask_elems = mask_elems << 1;
> +      mask_type = build_truth_vector_type (mask_elems, current_vector_size);
> +      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
> +       return false;
> +
> +      if (!vect_check_required_masks_narrowing (loop_vinfo, mask_type,
> +                                               req_elems))
> +       return false;
> +    }
> +  else
> +    {
> +      mask_type = build_truth_vector_type (req_elems, current_vector_size);
> +      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
> +       return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* Function vect_get_masking_iv_elems.
> +
> +   Return a number of elements in IV used for loop masking.  */
> +static int
> +vect_get_masking_iv_elems (loop_vec_info loop_vinfo)
> +{
> +  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
> +  tree iv_vectype = get_vectype_for_scalar_type (iv_type);
> +
> +  /* We extend IV type in case it is not big enough to
> +     fill full vector.  */
> +  return MIN ((int)TYPE_VECTOR_SUBPARTS (iv_vectype),
> +             LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> +}
> +
> +/* Function vect_get_masking_iv_type.
> +
> +   Return a type of IV used for loop masking.  */
> +static tree
> +vect_get_masking_iv_type (loop_vec_info loop_vinfo)
> +{
> +  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
> +  tree iv_vectype = get_vectype_for_scalar_type (iv_type);
> +  unsigned vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +
> +  if (TYPE_VECTOR_SUBPARTS (iv_vectype) <= vf)
> +    return iv_vectype;
> +
> +  unsigned elem_size = current_vector_size * BITS_PER_UNIT / vf;
> +  iv_type = build_nonstandard_integer_type (elem_size, TYPE_UNSIGNED (iv_type));
> +
> +  return get_vectype_for_scalar_type (iv_type);
> +}
> +
> +/* Function vect_get_extreme_masks.
> +
> +   Determine minimum and maximum number of elements in masks
> +   required for masking a loop described by LOOP_VINFO.
> +   Computed values are returned in MIN_MASK_ELEMS and
> +   MAX_MASK_ELEMS. */
> +
> +static void
> +vect_get_extreme_masks (loop_vec_info loop_vinfo,
> +                       unsigned *min_mask_elems,
> +                       unsigned *max_mask_elems)
> +{
> +  unsigned required_masks = LOOP_VINFO_REQUIRED_MASKS (loop_vinfo);
> +  unsigned elems = 1;
> +
> +  *min_mask_elems = *max_mask_elems = vect_get_masking_iv_elems (loop_vinfo);
> +
> +  while (required_masks)
> +    {
> +      if (required_masks & 1)
> +       {
> +         if (elems < *min_mask_elems)
> +           *min_mask_elems = elems;
> +         if (elems > *max_mask_elems)
> +           *max_mask_elems = elems;
> +       }
> +      elems = elems << 1;
> +      required_masks = required_masks >> 1;
> +    }
> +}
> +
> +/* Function vect_check_required_masks.
> +
> +   For given LOOP_VINFO check all required masks can be computed
> +   and add computation cost into loop cost data.  */
> +
> +static void
> +vect_check_required_masks (loop_vec_info loop_vinfo)
> +{
> +  if (!LOOP_VINFO_REQUIRED_MASKS (loop_vinfo))
> +    return;
> +
> +  /* Firstly check we have a proper comparison to get
> +     an initial mask.  */
> +  tree iv_vectype = vect_get_masking_iv_type (loop_vinfo);
> +  unsigned iv_elems = TYPE_VECTOR_SUBPARTS (iv_vectype);
> +
> +  tree mask_type = build_same_sized_truth_vector_type (iv_vectype);
> +
> +  if (!expand_vec_cmp_expr_p (iv_vectype, mask_type))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: required vector comparison "
> +                        "is not supported.\n");
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      return;
> +    }
> +
> +  int cmp_copies  = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / iv_elems;
> +  /* Add cost of initial iv values creation.  */
> +  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), cmp_copies,
> +                scalar_to_vec, NULL, 0, vect_masking_prologue);
> +  /* Add cost of upper bound and step values creation.  It is the same
> +     for all copies.  */
> +  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), 2,
> +                scalar_to_vec, NULL, 0, vect_masking_prologue);
> +  /* Add cost of vector comparisons.  */
> +  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), cmp_copies,
> +                vector_stmt, NULL, 0, vect_masking_body);
> +  /* Add cost of iv increment.  */
> +  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), cmp_copies,
> +                vector_stmt, NULL, 0, vect_masking_body);
> +
> +
> +  /* Now check the widest and the narrowest masks.
> +     All intermediate values are obtained while
> +     computing extreme values.  */
> +  unsigned min_mask_elems = 0;
> +  unsigned max_mask_elems = 0;
> +
> +  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
> +
> +  if (min_mask_elems < iv_elems)
> +    {
> +      /* Check mask widening is available.  */
> +      if (!vect_check_required_masks_widening (loop_vinfo, mask_type,
> +                                              min_mask_elems))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: required mask widening "
> +                            "is not supported.\n");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +         return;
> +       }
> +
> +      /* Add widening cost.  We have totally (2^N - 1) vectors
> +        we need to widen per each original vector, where N is
> +        a number of conversion steps.  Each widening requires
> +        two extracts.  */
> +      int steps = exact_log2 (iv_elems / min_mask_elems);
> +      int conversions = cmp_copies * 2 * ((1 << steps) - 1);
> +      add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
> +                    conversions, vec_promote_demote,
> +                    NULL, 0, vect_masking_body);
> +    }
> +
> +  if (max_mask_elems > iv_elems)
> +    {
> +      if (!vect_check_required_masks_narrowing (loop_vinfo, mask_type,
> +                                               max_mask_elems))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: required mask narrowing "
> +                            "is not supported.\n");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +         return;
> +       }
> +
> +      /* Add narrowing cost.  We have totally (2^N - 1) vector
> +        narrowings per each resulting vector, where N is
> +        a number of conversion steps.  */
> +      int steps = exact_log2 (max_mask_elems / iv_elems);
> +      int results = cmp_copies * iv_elems / max_mask_elems;
> +      int conversions = results * ((1 << steps) - 1);
> +      add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
> +                    conversions, vec_promote_demote,
> +                    NULL, 0, vect_masking_body);
> +    }
> +}
> +
>  /* Function vect_analyze_loop_operations.
>
>     Scan the loop stmts and make sure they are all vectorizable.  */
> @@ -1759,6 +2021,12 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
>        return false;
>      }
>
> +  /* If all statements can be masked then we also need
> +     to check we may compute required masks and compute
> +     its cost.  */
> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    vect_check_required_masks (loop_vinfo);
> +
>    return true;
>  }
>
> @@ -2232,6 +2500,8 @@ again:
>    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
>    LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
>    LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
> +  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = true;
> +  LOOP_VINFO_NEED_MASKING (loop_vinfo) = false;
>
>    goto start_over;
>  }
> @@ -5424,6 +5694,7 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
>        outer_loop = loop;
>        loop = loop->inner;
>        nested_cycle = true;
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>      }
>
>    /* 1. Is vectorizable reduction?  */
> @@ -5623,6 +5894,18 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
>
>    gcc_assert (ncopies >= 1);
>
> +  if (slp_node || PURE_SLP_STMT (stmt_info) || code == COND_EXPR
> +      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) == COND_REDUCTION
> +      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
> +        == INTEGER_INDUC_COND_REDUCTION)
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: unsupported conditional "
> +                        "reduction\n");
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +    }
> +
>    vec_mode = TYPE_MODE (vectype_in);
>
>    if (code == COND_EXPR)
> @@ -5900,6 +6183,19 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
>           return false;
>         }
>      }
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    {
> +      /* Check that masking of reduction is supported.  */
> +      tree mask_vtype = build_same_sized_truth_vector_type (vectype_out);
> +      if (!expand_vec_cond_expr_p (vectype_out, mask_vtype))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: required vector conditional "
> +                            "expression is not supported.\n");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +       }
> +    }
>
>    if (!vec_stmt) /* transformation not required.  */
>      {
> @@ -5908,6 +6204,10 @@ vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
>                                          reduc_index))
>          return false;
>        STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> +
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       vect_model_simple_masking_cost (stmt_info, ncopies);
> +
>        return true;
>      }
>
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 9ab4af4..91ebe5a 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "builtins.h"
>  #include "internal-fn.h"
> +#include "tree-ssa-loop-ivopts.h"
>
>  /* For lang_hooks.types.type_for_mode.  */
>  #include "langhooks.h"
> @@ -535,6 +536,38 @@ process_use (gimple *stmt, tree use, loop_vec_info loop_vinfo, bool live_p,
>    return true;
>  }
>
> +/* Return ture if STMT can be converted to masked form.  */
> +
> +static bool
> +can_mask_load_store (gimple *stmt)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  tree vectype, mask_vectype;
> +  tree lhs, ref;
> +
> +  if (!stmt_info)
> +    return false;
> +  lhs = gimple_assign_lhs (stmt);
> +  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
> +  if (may_be_nonaddressable_p (ref))
> +    return false;
> +  vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  mask_vectype = build_same_sized_truth_vector_type (vectype);
> +  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype),
> +                                 TYPE_MODE (mask_vectype),
> +                                 gimple_assign_load_p (stmt)))
> +    {
> +      if (dump_enabled_p ())
> +       {
> +         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                          "Statement can't be masked.\n");
> +         dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
> +       }
> +
> +       return false;
> +    }
> +  return true;
> +}
>
>  /* Function vect_mark_stmts_to_be_vectorized.
>
> @@ -1193,6 +1226,52 @@ vect_get_load_cost (struct data_reference *dr, int ncopies,
>      }
>  }
>
> +/* Function vect_model_load_masking_cost.
> +
> +   Models cost for memory load masking.  */
> +
> +void
> +vect_model_load_masking_cost (stmt_vec_info stmt_info, int ncopies)
> +{
> +  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
> +    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
> +                          ncopies, vector_mask_load, stmt_info, false,
> +                          vect_masking_body);
> +  else
> +    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
> +                          ncopies, vector_load, stmt_info, false,
> +                          vect_masking_body);
> +}
> +
> +/* Function vect_model_store_masking_cost.
> +
> +   Models cost for memory store masking.  */
> +
> +void
> +vect_model_store_masking_cost (stmt_vec_info stmt_info, int ncopies)
> +{
> +  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
> +    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
> +                          ncopies, vector_mask_store, stmt_info, false,
> +                          vect_masking_body);
> +  else
> +    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
> +                          ncopies, vector_store, stmt_info, false,
> +                          vect_masking_body);
> +}
> +
> +/* Function vect_model_simple_masking_cost.
> +
> +   Models cost for statement masking.  Return estimated cost.  */
> +
> +void
> +vect_model_simple_masking_cost (stmt_vec_info stmt_info, int ncopies)
> +{
> +  add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
> +                        ncopies, vector_stmt, stmt_info, false,
> +                        vect_masking_body);
> +}
> +
>  /* Insert the new stmt NEW_STMT at *GSI or at the appropriate place in
>     the loop preheader for the vectorized stmt STMT.  */
>
> @@ -1791,6 +1870,20 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
>                && !useless_type_conversion_p (vectype, rhs_vectype)))
>      return false;
>
> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    {
> +      /* Check that mask conjuction is supported.  */
> +      optab tab;
> +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
> +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: unsupported mask operation\n");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +       }
> +    }
> +
>    if (!vec_stmt) /* transformation not required.  */
>      {
>        STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
> @@ -1799,6 +1892,15 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
>                                NULL, NULL, NULL);
>        else
>         vect_model_load_cost (stmt_info, ncopies, false, NULL, NULL, NULL);
> +
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         if (is_store)
> +           vect_model_store_masking_cost (stmt_info, ncopies);
> +         else
> +           vect_model_load_masking_cost (stmt_info, ncopies);
> +       }
> +
>        return true;
>      }
>
> @@ -2795,6 +2897,18 @@ vectorizable_simd_clone_call (gimple *stmt, gimple_stmt_iterator *gsi,
>    if (slp_node || PURE_SLP_STMT (stmt_info))
>      return false;
>
> +  /* Masked clones are not yet supported.  But we allow
> +     calls which may be just called with no mask.  */
> +  if (!(gimple_call_flags (stmt) & ECF_CONST)
> +      || (gimple_call_flags (stmt) & ECF_LOOPING_CONST_OR_PURE))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: non-const call "
> +                        "(masked calls are not supported)\n");
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +    }
> +
>    /* Process function arguments.  */
>    nargs = gimple_call_num_args (stmt);
>
> @@ -5335,6 +5449,14 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>                                  "negative step and reversing not supported.\n");
>               return false;
>             }
> +         if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +           {
> +             LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                                "cannot be masked: negative step"
> +                                " is not supported.");
> +           }
>         }
>      }
>
> @@ -5343,6 +5465,15 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        grouped_store = true;
>        first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
>        group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: grouped access"
> +                            " is not supported." );
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      }
> +
>        if (!slp
>           && !PURE_SLP_STMT (stmt_info)
>           && !STMT_VINFO_STRIDED_P (stmt_info))
> @@ -5398,6 +5529,44 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>                               "scatter index use not simple.");
>           return false;
>         }
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: gather/scatter is"
> +                            " not supported.");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +       }
> +    }
> +
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && STMT_VINFO_STRIDED_P (stmt_info))
> +    {
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: strided store is not"
> +                        " supported.\n");
> +    }
> +
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && integer_zerop (nested_in_vect_loop_p (loop, stmt)
> +                       ? STMT_VINFO_DR_STEP (stmt_info)
> +                       : DR_STEP (dr)))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: invariant store.\n");
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +    }
> +
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && !can_mask_load_store (stmt))
> +    {
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: unsupported mask store.\n");
>      }
>
>    if (!vec_stmt) /* transformation not required.  */
> @@ -5407,6 +5576,9 @@ vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        if (!PURE_SLP_STMT (stmt_info))
>         vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt,
>                                NULL, NULL, NULL);
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       vect_model_store_masking_cost (stmt_info, ncopies);
> +
>        return true;
>      }
>
> @@ -6312,6 +6484,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        grouped_load = true;
>        /* FORNOW */
>        gcc_assert (!nested_in_vect_loop && !STMT_VINFO_GATHER_SCATTER_P (stmt_info));
> +      /* Not yet supported.  */
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: grouped acces is not"
> +                            " supported.");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      }
>
>        first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
>
> @@ -6358,6 +6539,7 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>             }
>
>           LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = true;
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>         }
>
>        if (slp && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ())
> @@ -6423,6 +6605,16 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        gather_decl = vect_check_gather_scatter (stmt, loop_vinfo, &gather_base,
>                                                &gather_off, &gather_scale);
>        gcc_assert (gather_decl);
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                           "cannot be masked: gather/scatter is not"
> +                           " supported.\n");
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +       }
> +
> +
>        if (!vect_is_simple_use (gather_off, vinfo, &def_stmt, &gather_dt,
>                                &gather_off_vectype))
>         {
> @@ -6434,6 +6626,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>      }
>    else if (STMT_VINFO_STRIDED_P (stmt_info))
>      {
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       {
> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "cannot be masked: strided load is not"
> +                            " supported.\n");
> +       }
> +
>        if ((grouped_load
>            && (slp || PURE_SLP_STMT (stmt_info)))
>           && (group_size > nunits
> @@ -6485,9 +6686,35 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>                                   "\n");
>               return false;
>             }
> +         if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +           {
> +             if (dump_enabled_p ())
> +               dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                               "cannot be masked: negative step "
> +                                "for masking.\n");
> +             LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +           }
>         }
>      }
>
> +  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && integer_zerop (nested_in_vect_loop
> +                       ? STMT_VINFO_DR_STEP (stmt_info)
> +                       : DR_STEP (dr)))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "allow invariant load for masked loop.\n");
> +    }
> +  else if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +          && !can_mask_load_store (stmt))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "cannot be masked: unsupported masked load.\n");
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +    }
> +
>    if (!vec_stmt) /* transformation not required.  */
>      {
>        STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
> @@ -6495,6 +6722,9 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        if (!PURE_SLP_STMT (stmt_info))
>         vect_model_load_cost (stmt_info, ncopies, load_lanes_p,
>                               NULL, NULL, NULL);
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +       vect_model_load_masking_cost (stmt_info, ncopies);
> +
>        return true;
>      }
>
> @@ -7891,6 +8121,43 @@ vectorizable_comparison (gimple *stmt, gimple_stmt_iterator *gsi,
>    return true;
>  }
>
> +/* Return true if vector version of STMT should be masked
> +   in a vectorized loop epilogue (considering usage of the
> +   same VF as for main loop).  */
> +
> +static bool
> +vect_stmt_should_be_masked_for_epilogue (gimple *stmt)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +
> +  /* We should mask all statements accessing memory.  */
> +  if (STMT_VINFO_DATA_REF (stmt_info))
> +    return true;
> +
> +  /* We should also mask all recursions.  */
> +  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
> +      || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
> +    return true;
> +
> +  return false;
> +}
> +
> +/* Add a mask required to mask STMT to LOOP_VINFO_REQUIRED_MASKS.  */
> +
> +static void
> +vect_add_required_mask_for_stmt (gimple *stmt)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  unsigned HOST_WIDE_INT nelems = TYPE_VECTOR_SUBPARTS (vectype);
> +  int bit_no = exact_log2 (nelems);
> +
> +  gcc_assert (bit_no >= 0);
> +
> +  LOOP_VINFO_REQUIRED_MASKS (loop_vinfo) |= (1 << bit_no);
> +}
> +
>  /* Make sure the statement is vectorizable.  */
>
>  bool
> @@ -7898,6 +8165,7 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
>  {
>    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>    bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>    enum vect_relevant relevance = STMT_VINFO_RELEVANT (stmt_info);
>    bool ok;
>    tree scalar_type, vectype;
> @@ -8064,6 +8332,10 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
>        STMT_VINFO_VECTYPE (stmt_info) = vectype;
>     }
>
> +  /* Masking is not supported for SLP yet.  */
> +  if (loop_vinfo && node)
> +    LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +
>    if (STMT_VINFO_RELEVANT_P (stmt_info))
>      {
>        gcc_assert (!VECTOR_MODE_P (TYPE_MODE (gimple_expr_type (stmt))));
> @@ -8123,6 +8395,11 @@ vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
>        return false;
>      }
>
> +  if (loop_vinfo
> +      && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
> +      && vect_stmt_should_be_masked_for_epilogue (stmt))
> +    vect_add_required_mask_for_stmt (stmt);
> +
>    if (bb_vinfo)
>      return true;
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index d3450b6..86c5371 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1033,6 +1033,9 @@ extern void vect_model_store_cost (stmt_vec_info, int, bool,
>  extern void vect_model_load_cost (stmt_vec_info, int, bool, slp_tree,
>                                   stmt_vector_for_cost *,
>                                   stmt_vector_for_cost *);
> +extern void vect_model_load_masking_cost (stmt_vec_info, int);
> +extern void vect_model_store_masking_cost (stmt_vec_info, int);
> +extern void vect_model_simple_masking_cost (stmt_vec_info, int);
>  extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
>                                   enum vect_cost_for_stmt, stmt_vec_info,
>                                   int, enum vect_cost_model_location);
Jeff Law June 16, 2016, 6:26 a.m. UTC | #2
On 06/15/2016 05:22 AM, Richard Biener wrote:
>
> You look at TREE_TYPE of LOOP_VINFO_NITERS (loop_vinfo) - I don't think
> this is meaningful (if then only by accident).  I think you should look at the
> control IV itself, possibly its value-range, to determine the smallest possible
> type to use.
Can we get an IV that's created after VRP?  If so, then we have to be 
prepared for the case where there's no range information on the IV.  At 
which point I think using type min/max of the IV is probably the right 
fallback.  But I do think we should be looking at range info much more 
systematically.

I can't see how TREE_TYPE of the NITERS makes sense either.

> Finally we have a related missed optimization opportunity, namely avoiding
> peeling for gaps if we mask the last load of the group (profitability depends
> on the overhead of such masking of course as it would be done in the main
> vectorized loop).
I think that's a specific instance of a more general question -- what 
transformations can be avoided by masking, and can we generate costs to 
select between those transformations and masking.  To me that seems like a 
follow-up item rather than a requirement for this work to go forward.

Jeff
Jeff Law June 16, 2016, 7:08 a.m. UTC | #3
On 05/19/2016 01:42 PM, Ilya Enkovich wrote:
> Hi,
>
> This patch introduces analysis to determine if loop can be masked
> (compute LOOP_VINFO_CAN_BE_MASKED and LOOP_VINFO_REQUIRED_MASKS)
> and compute how much masking costs.
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>
> 	* tree-vect-loop.c: Include insn-config.h and recog.h.
> 	(vect_check_required_masks_widening): New.
> 	(vect_check_required_masks_narrowing): New.
> 	(vect_get_masking_iv_elems): New.
> 	(vect_get_masking_iv_type): New.
> 	(vect_get_extreme_masks): New.
> 	(vect_check_required_masks): New.
> 	(vect_analyze_loop_operations): Add vect_check_required_masks
> 	call to compute LOOP_VINFO_CAN_BE_MASKED.
> 	(vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
> 	LOOP_VINFO_NEED_MASKING before starting over.
> 	(vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
> 	masking cost.
> 	* tree-vect-stmts.c (can_mask_load_store): New.
> 	(vect_model_load_masking_cost): New.
> 	(vect_model_store_masking_cost): New.
> 	(vect_model_simple_masking_cost): New.
> 	(vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
> 	and masking cost.
> 	(vectorizable_simd_clone_call): Likewise.
> 	(vectorizable_store): Likewise.
> 	(vectorizable_load): Likewise.
> 	(vect_stmt_should_be_masked_for_epilogue): New.
> 	(vect_add_required_mask_for_stmt): New.
> 	(vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
> 	* tree-vectorizer.h (vect_model_load_masking_cost): New.
> 	(vect_model_store_masking_cost): New.
> 	(vect_model_simple_masking_cost): New.
>
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index e25a0ce..31360d3 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -31,6 +31,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "ssa.h"
>  #include "optabs-tree.h"
> +#include "insn-config.h"
> +#include "recog.h"		/* FIXME: for insn_data */
Ick :(


> +
> +/* Function vect_check_required_masks_narowing.
narrowing


> +
> +   Return 1 if vector mask of type MASK_TYPE can be narrowed
> +   to a type having REQ_ELEMS elements in a single vector.  */
> +
> +static bool
> +vect_check_required_masks_narrowing (loop_vec_info loop_vinfo,
> +				     tree mask_type, unsigned req_elems)
Given the common structure & duplication I can't help but wonder if a 
single function should be used for widening/narrowing.  Ultimately can't 
you swap  mask_elems/req_elems and always go narrower to wider (using a 
different optab for the two different cases)?




> +
> +/* Function vect_get_masking_iv_elems.
> +
> +   Return a number of elements in IV used for loop masking.  */
> +static int
> +vect_get_masking_iv_elems (loop_vec_info loop_vinfo)
> +{
> +  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
I'm guessing Richi's comment about what tree type you're looking at 
refers to this and similar instances.  Doesn't this give you the type of 
the number of iterations rather than the type of the iteration variable 
itself?




> +
> +  if (!expand_vec_cmp_expr_p (iv_vectype, mask_type))
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "cannot be masked: required vector comparison "
> +			 "is not supported.\n");
> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +      return;
> +    }
On a totally unrelated topic, I was speaking with David Malcolm earlier 
this week about how to turn this kind of missed optimization information 
we currently emit into dumps into something more easily consumed by users.

The general issue is that we've got customers that want to understand 
why certain optimizations fire or do not fire.  They're by far more 
interested in the vectorizer than anything else.

We have a sense that much of the information those customers desire is 
sitting in the dump files, but it's buried in there with other stuff 
that isn't generally useful to users.

So we're pondering what it might take to take these glorified fprintf 
calls and turn them into a first class diagnostic that could be emitted 
to stderr or into the dump file depending (of course) on the options 
passed to GCC.

The reason I bring this up is the hope that your team might have some 
insights based on what ICC has done in the past for its customers.

Anyway, back to the code...


> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 9ab4af4..91ebe5a 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "builtins.h"
>  #include "internal-fn.h"
> +#include "tree-ssa-loop-ivopts.h"
>
>  /* For lang_hooks.types.type_for_mode.  */
>  #include "langhooks.h"
> @@ -535,6 +536,38 @@ process_use (gimple *stmt, tree use, loop_vec_info loop_vinfo, bool live_p,
>    return true;
>  }
>
> +/* Return ture if STMT can be converted to masked form.  */
s/ture/true/


> @@ -1193,6 +1226,52 @@ vect_get_load_cost (struct data_reference *dr, int ncopies,
>      }
>  }
>
> +/* Function vect_model_load_masking_cost.
> +
> +   Models cost for memory load masking.  */
> +
> +void
> +vect_model_load_masking_cost (stmt_vec_info stmt_info, int ncopies)
> +{
> +  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
> +    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
> +			   ncopies, vector_mask_load, stmt_info, false,
> +			   vect_masking_body);
What GIMPLE_CALLs are going to appear here and in 
vect_model_store_masking_cost?  Seems like there ought to be a comment 
of some kind addressing that question.


> @@ -1791,6 +1870,20 @@ vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
>  	       && !useless_type_conversion_p (vectype, rhs_vectype)))
>      return false;
>
> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +    {
> +      /* Check that mask conjuction is supported.  */
> +      optab tab;
> +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
> +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			     "cannot be masked: unsupported mask operation\n");
> +	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
> +	}
> +    }
Should the optab querying be in optabs-query.c?

General question, it looks like we're baking into various places what 
things can or can not be masked.  Isn't that a function of the masking 
capabilities of the processor?  And if so, shouldn't we be querying 
optabs rather than just declaring and open-coding knowledge that certain 
things can't be masked?


> @@ -6312,6 +6484,15 @@ vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
>        grouped_load = true;
>        /* FORNOW */
>        gcc_assert (!nested_in_vect_loop && !STMT_VINFO_GATHER_SCATTER_P (stmt_info));
> +      /* Not yet supported.  */
> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			     "cannot be masked: grouped acces is not"
s/acces/access/

So it feels a bit like this needs some more work, particularly WRT what 
capabilities can and can not be masked.  If I'm missing something on 
that, definitely speak up, but it feels like we're embedding knowledge 
of the masking capabilities somewhere where it doesn't belong.  I'd also 
like to know more about what GIMPLE_CALLs show up in 
vect_model_{load,store}_masking_cost.



Jeff
Ilya Enkovich June 22, 2016, 3:03 p.m. UTC | #4
2016-06-16 9:26 GMT+03:00 Jeff Law <law@redhat.com>:
> On 06/15/2016 05:22 AM, Richard Biener wrote:
>>
>>
>> You look at TREE_TYPE of LOOP_VINFO_NITERS (loop_vinfo) - I don't think
>> this is meaningful (if then only by accident).  I think you should look at
>> the
>> control IV itself, possibly its value-range, to determine the smallest
>> possible
>> type to use.
>
> Can we get an IV that's created after VRP?  If so, then we have to be
> prepared for the case where there's no range information on the IV.  At
> which point I think using type min/max of the IV is probably the right
> fallback.  But I do think we should be looking at range info much more
> systematically.
>
> I can't see how TREE_TYPE of the NITERS makes sense either.

I need to build a vector {niters, ..., niters} and compare to it.  Why doesn't
it make sense to choose the same type for the IV?  I agree that choosing a
smaller type may be beneficial.  Shouldn't I look at nb_iterations_upper_bound
then to check whether NITERS can be cast to a smaller type?
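
E.g. something along these lines (only a rough sketch; the helper name is
made up, and I assume the loop's any_upper_bound / nb_iterations_upper_bound
fields and wi::min_precision are the right pieces to use):

  static tree
  narrowest_masking_iv_type (loop_vec_info loop_vinfo)
  {
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree niters_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));

    /* No recorded bound - keep the type of NITERS.  */
    if (!loop->any_upper_bound)
      return niters_type;

    unsigned prec = wi::min_precision (loop->nb_iterations_upper_bound,
                                       UNSIGNED);
    if (prec >= TYPE_PRECISION (niters_type))
      return niters_type;

    /* A real implementation would round PREC up to a convenient
       vector element size.  */
    return build_nonstandard_integer_type (prec, /*unsignedp=*/1);
  }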

Thanks,
Ilya

>
>> Finally we have a related missed optimization opportunity, namely avoiding
>> peeling for gaps if we mask the last load of the group (profitability
>> depends
>> on the overhead of such masking of course as it would be done in the main
>> vectorized loop).
>
> I think that's a specific instance of a more general question -- what
> transformations can be avoided by masking, and can we generate costs to
> select between those transformations and masking.  To me that seems like a
> follow-up item rather than a requirement for this work to go forward.
>
> Jeff
Ilya Enkovich June 22, 2016, 4:09 p.m. UTC | #5
2016-06-16 10:08 GMT+03:00 Jeff Law <law@redhat.com>:
> On 05/19/2016 01:42 PM, Ilya Enkovich wrote:
>>
>> Hi,
>>
>> This patch introduces analysis to determine if loop can be masked
>> (compute LOOP_VINFO_CAN_BE_MASKED and LOOP_VINFO_REQUIRED_MASKS)
>> and compute how much masking costs.
>>
>> Thanks,
>> Ilya
>> --
>> gcc/
>>
>> 2016-05-19  Ilya Enkovich  <ilya.enkovich@intel.com>
>>
>>         * tree-vect-loop.c: Include insn-config.h and recog.h.
>>         (vect_check_required_masks_widening): New.
>>         (vect_check_required_masks_narrowing): New.
>>         (vect_get_masking_iv_elems): New.
>>         (vect_get_masking_iv_type): New.
>>         (vect_get_extreme_masks): New.
>>         (vect_check_required_masks): New.
>>         (vect_analyze_loop_operations): Add vect_check_required_masks
>>         call to compute LOOP_VINFO_CAN_BE_MASKED.
>>         (vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
>>         LOOP_VINFO_NEED_MASKING before starting over.
>>         (vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
>>         masking cost.
>>         * tree-vect-stmts.c (can_mask_load_store): New.
>>         (vect_model_load_masking_cost): New.
>>         (vect_model_store_masking_cost): New.
>>         (vect_model_simple_masking_cost): New.
>>         (vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
>>         and masking cost.
>>         (vectorizable_simd_clone_call): Likewise.
>>         (vectorizable_store): Likewise.
>>         (vectorizable_load): Likewise.
>>         (vect_stmt_should_be_masked_for_epilogue): New.
>>         (vect_add_required_mask_for_stmt): New.
>>         (vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
>>         * tree-vectorizer.h (vect_model_load_masking_cost): New.
>>         (vect_model_store_masking_cost): New.
>>         (vect_model_simple_masking_cost): New.
>>
>>
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index e25a0ce..31360d3 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -31,6 +31,8 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "tree-pass.h"
>>  #include "ssa.h"
>>  #include "optabs-tree.h"
>> +#include "insn-config.h"
>> +#include "recog.h"             /* FIXME: for insn_data */
>
> Ick :(
>
>
>> +
>> +/* Function vect_check_required_masks_narowing.
>
> narrowing
>
>
>> +
>> +   Return 1 if vector mask of type MASK_TYPE can be narrowed
>> +   to a type having REQ_ELEMS elements in a single vector.  */
>> +
>> +static bool
>> +vect_check_required_masks_narrowing (loop_vec_info loop_vinfo,
>> +                                    tree mask_type, unsigned req_elems)
>
> Given the common structure & duplication I can't help but wonder if a single
> function should be used for widening/narrowing.  Ultimately can't you swap
> mask_elems/req_elems and always go narrower to wider (using a different
> optab for the two different cases)?

I think we can't always go in the narrower-to-wider direction because widening
uses two optabs and also because of the way insn_data is checked.

>
>
>
>
>> +
>> +/* Function vect_get_masking_iv_elems.
>> +
>> +   Return a number of elements in IV used for loop masking.  */
>> +static int
>> +vect_get_masking_iv_elems (loop_vec_info loop_vinfo)
>> +{
>> +  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
>
> I'm guessing Richi's comment about what tree type you're looking at refers
> to this and similar instances.  Doesn't this give you the type of the number
> of iterations rather than the type of the iteration variable itself?
>
>

Since I build the vector IV myself and use it to compare with NITERS, I feel
it's safe to use the type of NITERS.  Do you expect NITERS and IV types to
differ?

>
>
>  +
>>
>> +  if (!expand_vec_cmp_expr_p (iv_vectype, mask_type))
>> +    {
>> +      if (dump_enabled_p ())
>> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                        "cannot be masked: required vector comparison "
>> +                        "is not supported.\n");
>> +      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> +      return;
>> +    }
>
> On a totally unrelated topic, I was speaking with David Malcolm earlier this
> week about how to turn this kind of missed optimization information we
> currently emit into dumps into something more easily consumed by users.
>
> The general issue is that we've got customers that want to understand why
> certain optimizations fire or do not fire.  They're by far more interested
> in the vectorizer than anything else.
>
> We have a sense that much of the information those customers desire is
> sitting in the dump files, but it's buried in there with other stuff that
> isn't generally useful to users.
>
> So we're pondering what it might take to take these glorified fprintf calls
> and turn them into a first class diagnostic that could be emitted to stderr
> or into the dump file depending (of course) on the options passed to GCC.
>
> The reason I bring this up is the hope that your team might have some
> insights based on what ICC has done in the past for its customers.
>
> Anyway, back to the code...
>
>
>> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> index 9ab4af4..91ebe5a 100644
>> --- a/gcc/tree-vect-stmts.c
>> +++ b/gcc/tree-vect-stmts.c
>> @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
>>  #include "tree-vectorizer.h"
>>  #include "builtins.h"
>>  #include "internal-fn.h"
>> +#include "tree-ssa-loop-ivopts.h"
>>
>>  /* For lang_hooks.types.type_for_mode.  */
>>  #include "langhooks.h"
>> @@ -535,6 +536,38 @@ process_use (gimple *stmt, tree use, loop_vec_info
>> loop_vinfo, bool live_p,
>>    return true;
>>  }
>>
>> +/* Return ture if STMT can be converted to masked form.  */
>
> s/ture/true/
>
>
>> @@ -1193,6 +1226,52 @@ vect_get_load_cost (struct data_reference *dr, int
>> ncopies,
>>      }
>>  }
>>
>> +/* Function vect_model_load_masking_cost.
>> +
>> +   Models cost for memory load masking.  */
>> +
>> +void
>> +vect_model_load_masking_cost (stmt_vec_info stmt_info, int ncopies)
>> +{
>> +  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
>> +    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
>> +                          ncopies, vector_mask_load, stmt_info, false,
>> +                          vect_masking_body);
>
> What GIMPLE_CALLs are going to appear here and in
> vect_model_store_masking_cost?  Seems like there ought to be a comment of
> some kind addressing that question.

This checks for the MASK_LOAD case, which is an internal function call.  I'll
add a comment here.
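
E.g. roughly (only a sketch of the wording, not the final patch):

  /* A GIMPLE_CALL here is a MASK_LOAD internal-function call, i.e. the
     load is already masked; account for masking it with the masked-load
     cost, otherwise use the plain vector-load cost.  */
  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
                           ncopies, vector_mask_load, stmt_info, false,
                           vect_masking_body);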

>
>
>> @@ -1791,6 +1870,20 @@ vectorizable_mask_load_store (gimple *stmt,
>> gimple_stmt_iterator *gsi,
>>                && !useless_type_conversion_p (vectype, rhs_vectype)))
>>      return false;
>>
>> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> +    {
>> +      /* Check that mask conjuction is supported.  */
>> +      optab tab;
>> +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
>> +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) ==
>> CODE_FOR_nothing)
>> +       {
>> +         if (dump_enabled_p ())
>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                            "cannot be masked: unsupported mask
>> operation\n");
>> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>> +       }
>> +    }
>
> Should the optab querying be in optabs-query.c?

We always directly call optab_handler for simple operations.  There are dozens
of such calls in the vectorizer.

>
> General question, it looks like we're baking into various places what things
> can or can not be masked.  Isn't that a function of the masking capabilities
> of the processor?  And if so, shouldn't we be querying optabs rather than
> just declaring and open-coding knowledge that certain things can't be
> masked?
>
>
>> @@ -6312,6 +6484,15 @@ vectorizable_load (gimple *stmt,
>> gimple_stmt_iterator *gsi, gimple **vec_stmt,
>>        grouped_load = true;
>>        /* FORNOW */
>>        gcc_assert (!nested_in_vect_loop && !STMT_VINFO_GATHER_SCATTER_P
>> (stmt_info));
>> +      /* Not yet supported.  */
>> +      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>> +       {
>> +         if (dump_enabled_p ())
>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                            "cannot be masked: grouped acces is not"
>
> s/acces/access/
>
> So it feels a bit like this needs some more work, particularly WRT what
> capabilities can and can not be masked.  If I'm missing something on that,
> definitely speak up, but it feels like we're embedding knowledge of the
> masking capabilities somewhere where it doesn't belong.  I'd also like to
> know more about what GIMPLE_CALLs show up in
> vect_model_{load,store}_masking_cost.

We don't embed masking capabilities into the vectorizer.

Actually we don't depend on masking capabilities that much.  We have to mask
loads and stores and use can_mask_load_store for that, which uses the existing
optab query.  We also require masking for reductions and use VEC_COND for that
(via the existing expand_vec_cond_expr_p).  The other checks verify that we can
build the required masks.  So we actually don't expose any new processor
masking capabilities to GIMPLE, i.e. all this works on targets with no rich
masking capabilities; e.g. we can mask loops even for quite old SSE targets.
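
For example, a masked sum reduction only needs the equivalent of a
VEC_COND_EXPR, not a masked add: inactive lanes simply keep their previous
partial sum.  A scalar C model of one vector iteration (sketch only):

  static void
  masked_reduction_step (int vec_sum[4], const int vec_x[4],
                         const unsigned char mask[4])
  {
    for (int l = 0; l < 4; l++)
      /* VEC_COND_EXPR <mask, vec_sum + vec_x, vec_sum>  */
      vec_sum[l] = mask[l] ? vec_sum[l] + vec_x[l] : vec_sum[l];
  }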

Thanks,
Ilya

>
>
>
> Jeff
>
Jeff Law June 22, 2016, 5:20 p.m. UTC | #6
On 06/22/2016 09:03 AM, Ilya Enkovich wrote:
> 2016-06-16 9:26 GMT+03:00 Jeff Law <law@redhat.com>:
>> On 06/15/2016 05:22 AM, Richard Biener wrote:
>>>
>>>
>>> You look at TREE_TYPE of LOOP_VINFO_NITERS (loop_vinfo) - I don't think
>>> this is meaningful (if then only by accident).  I think you should look at
>>> the
>>> control IV itself, possibly its value-range, to determine the smallest
>>> possible
>>> type to use.
>>
>> Can we get an IV that's created after VRP?  If so, then we have to be
>> prepared for the case where there's no range information on the IV.  At
>> which point I think using type min/max of the IV is probably the right
>> fallback.  But I do think we should be looking at range info much more
>> systematically.
>>
>> I can't see how TREE_TYPE of the NITERS makes sense either.
>
> I need to build a vector {niters, ..., niters} and compare to it.  Why doesn't
> it make sense to choose the same type for the IV?  I agree that choosing a
> smaller type may be beneficial.  Shouldn't I look at nb_iterations_upper_bound
> then to check whether NITERS can be cast to a smaller type?
Isn't TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)) the type of the 
constant being used to represent the number of iterations?  That is 
independent of the type of the IV.

Though I guess your argument is that, since you're building a vector of
niters, what you indeed want is the type of that constant, not the
type of the IV.  That might be worth a comment in the code :-)
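
As an aside for readers of the thread, here is a tiny sketch of the comparison
in question (illustration only; the helper name and VF = 4 are made up).  The
vector IV and the splat of NITERS are the two operands of a single vector
compare, which is why they have to share an element type; the patch takes that
type from TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)).

  /* Illustration only: the mask is the result of comparing the vector IV
     against a splat of niters, so both sides use one element type.  */
  void
  compute_tail_mask (unsigned long i, unsigned long niters, int mask[4])
  {
    unsigned long iv_vec[4] = { i, i + 1, i + 2, i + 3 };
    for (int j = 0; j < 4; j++)
      mask[j] = iv_vec[j] < niters;     /* niters splat on the RHS  */
  }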

jeff
Jeff Law June 22, 2016, 5:42 p.m. UTC | #7
On 06/22/2016 10:09 AM, Ilya Enkovich wrote:

>> Given the common structure & duplication I can't help but wonder if a single
>> function should be used for widening/narrowing.  Ultimately can't you swap
>> mask_elems/req_elems and always go narrower to wider (using a different
>> optab for the two different cases)?
>
> I think we can't always go in the narrower-to-wider direction because widening
> uses two optabs and also because of the way insn_data is checked.
OK.  Thanks for considering.
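
For reference, a rough scalar sketch of the asymmetry being discussed (the
lane counts are made up): widening one mask consumes both vec_unpacks_lo and
vec_unpacks_hi, one per half of the result, while narrowing a pair of masks
back down needs only vec_pack_trunc.

  /* Illustration only: widening an 8-lane mask into two 4-lane masks takes
     two operations (low and high halves); narrowing the two 4-lane masks
     back into one 8-lane mask is a single pack.  */
  void
  widen_mask (const signed char m8[8], short lo[4], short hi[4])
  {
    for (int j = 0; j < 4; j++)
      lo[j] = m8[j];                    /* vec_unpacks_lo  */
    for (int j = 0; j < 4; j++)
      hi[j] = m8[j + 4];                /* vec_unpacks_hi  */
  }

  void
  narrow_masks (const short lo[4], const short hi[4], signed char m8[8])
  {
    for (int j = 0; j < 4; j++)         /* one vec_pack_trunc taking  */
      m8[j] = (signed char) lo[j];      /* both inputs at once  */
    for (int j = 0; j < 4; j++)
      m8[j + 4] = (signed char) hi[j];
  }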

>>
>> I'm guessing Richi's comment about what tree type you're looking at refers
>> to this and similar instances.  Doesn't this give you the type of the number
>> of iterations rather than the type of the iteration variable itself?
>>
>>
>
> Since I build the vector IV myself and compare it with NITERS, I feel it's
> safe to use the type of NITERS.  Do you expect the NITERS and IV types to
> differ?
Since you're comparing to NITERS, it sounds like you've got it right and 
that Richi and I have it wrong.

It's less a question of whether or not we expect NITERS and IV to have 
different types, but more a realization that there's nothing that 
inherently says they have to be the same.  They probably are the same 
most of the time, but I don't think that's something we can or should 
necessarily depend on.



>>> @@ -1791,6 +1870,20 @@ vectorizable_mask_load_store (gimple *stmt,
>>> gimple_stmt_iterator *gsi,
>>>                && !useless_type_conversion_p (vectype, rhs_vectype)))
>>>      return false;
>>>
>>> +  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
>>> +    {
>>> +      /* Check that mask conjuction is supported.  */
>>> +      optab tab;
>>> +      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
>>> +      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) ==
>>> CODE_FOR_nothing)
>>> +       {
>>> +         if (dump_enabled_p ())
>>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> +                            "cannot be masked: unsupported mask
>>> operation\n");
>>> +         LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
>>> +       }
>>> +    }
>>
>> Should the optab querying be in optabs-query.c?
>
> We always call optab_handler directly for simple operations.  There are dozens
> of such calls in the vectorizer.
OK.  I would look favorably on a change to move those queries out into 
optabs-query as a separate patch.

>
> We don't embed masking capabilities into the vectorizer.
>
> Actually we don't depend on masking capabilities that much.  We have to mask
> loads and stores and use can_mask_load_store for that, which relies on the
> existing optab query.  We also require masking for reductions and use VEC_COND
> for that (via the existing expand_vec_cond_expr_p).  The other checks only
> verify that we can build the required masks.  So we don't expose any new
> processor masking capabilities to GIMPLE, i.e. all of this works on targets
> with no rich masking capabilities, e.g. we can mask loops for quite old SSE
> targets.
OK.  I think the key here is that load/store masking already exists and
the others are either VEC_COND or checks that we can build the mask,
rather than whether the operation can be masked.  Thanks for clarifying.
jeff

Patch

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e25a0ce..31360d3 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -31,6 +31,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "ssa.h"
 #include "optabs-tree.h"
+#include "insn-config.h"
+#include "recog.h"		/* FIXME: for insn_data */
 #include "diagnostic-core.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -1601,6 +1603,266 @@  vect_update_vf_for_slp (loop_vec_info loop_vinfo)
 		     vectorization_factor);
 }
 
+/* Function vect_check_required_masks_widening.
+
+   Return 1 if vector mask of type MASK_TYPE can be widened
+   to a type having REQ_ELEMS elements in a single vector.  */
+
+static bool
+vect_check_required_masks_widening (loop_vec_info loop_vinfo,
+				    tree mask_type, unsigned req_elems)
+{
+  unsigned mask_elems = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  gcc_assert (mask_elems > req_elems);
+
+  /* Don't convert if it requires too many intermediate steps.  */
+  int steps = exact_log2 (mask_elems / req_elems);
+  if (steps > MAX_INTERM_CVT_STEPS + 1)
+    return false;
+
+  /* Check we have conversion support for given mask mode.  */
+  machine_mode mode = TYPE_MODE (mask_type);
+  insn_code icode = optab_handler (vec_unpacks_lo_optab, mode);
+  if (icode == CODE_FOR_nothing
+      || optab_handler (vec_unpacks_hi_optab, mode) == CODE_FOR_nothing)
+    return false;
+
+  /* Make recursive call for multi-step conversion.  */
+  if (steps > 1)
+    {
+      mask_elems = mask_elems >> 1;
+      mask_type = build_truth_vector_type (mask_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+
+      if (!vect_check_required_masks_widening (loop_vinfo, mask_type,
+					       req_elems))
+	return false;
+    }
+  else
+    {
+      mask_type = build_truth_vector_type (req_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+    }
+
+  return true;
+}
+
+/* Function vect_check_required_masks_narrowing.
+
+   Return 1 if vector mask of type MASK_TYPE can be narrowed
+   to a type having REQ_ELEMS elements in a single vector.  */
+
+static bool
+vect_check_required_masks_narrowing (loop_vec_info loop_vinfo,
+				     tree mask_type, unsigned req_elems)
+{
+  unsigned mask_elems = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  gcc_assert (req_elems > mask_elems);
+
+  /* Don't convert if it requires too many intermediate steps.  */
+  int steps = exact_log2 (req_elems / mask_elems);
+  if (steps > MAX_INTERM_CVT_STEPS + 1)
+    return false;
+
+  /* Check we have conversion support for given mask mode.  */
+  machine_mode mode = TYPE_MODE (mask_type);
+  insn_code icode = optab_handler (vec_pack_trunc_optab, mode);
+  if (icode == CODE_FOR_nothing)
+    return false;
+
+  /* Make recursive call for multi-step conversion.  */
+  if (steps > 1)
+    {
+      mask_elems = mask_elems << 1;
+      mask_type = build_truth_vector_type (mask_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+
+      if (!vect_check_required_masks_narrowing (loop_vinfo, mask_type,
+						req_elems))
+	return false;
+    }
+  else
+    {
+      mask_type = build_truth_vector_type (req_elems, current_vector_size);
+      if (TYPE_MODE (mask_type) != insn_data[icode].operand[0].mode)
+	return false;
+    }
+
+  return true;
+}
+
+/* Function vect_get_masking_iv_elems.
+
+   Return the number of elements in the IV used for loop masking.  */
+static int
+vect_get_masking_iv_elems (loop_vec_info loop_vinfo)
+{
+  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
+  tree iv_vectype = get_vectype_for_scalar_type (iv_type);
+
+  /* We extend IV type in case it is not big enough to
+     fill full vector.  */
+  return MIN ((int)TYPE_VECTOR_SUBPARTS (iv_vectype),
+	      LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+}
+
+/* Function vect_get_masking_iv_type.
+
+   Return the type of the IV used for loop masking.  */
+static tree
+vect_get_masking_iv_type (loop_vec_info loop_vinfo)
+{
+  tree iv_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
+  tree iv_vectype = get_vectype_for_scalar_type (iv_type);
+  unsigned vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  if (TYPE_VECTOR_SUBPARTS (iv_vectype) <= vf)
+    return iv_vectype;
+
+  unsigned elem_size = current_vector_size * BITS_PER_UNIT / vf;
+  iv_type = build_nonstandard_integer_type (elem_size, TYPE_UNSIGNED (iv_type));
+
+  return get_vectype_for_scalar_type (iv_type);
+}
+
+/* Function vect_get_extreme_masks.
+
+   Determine the minimum and maximum number of elements in the masks
+   required for masking the loop described by LOOP_VINFO.
+   The computed values are returned in MIN_MASK_ELEMS and
+   MAX_MASK_ELEMS.  */
+
+static void
+vect_get_extreme_masks (loop_vec_info loop_vinfo,
+			unsigned *min_mask_elems,
+			unsigned *max_mask_elems)
+{
+  unsigned required_masks = LOOP_VINFO_REQUIRED_MASKS (loop_vinfo);
+  unsigned elems = 1;
+
+  *min_mask_elems = *max_mask_elems = vect_get_masking_iv_elems (loop_vinfo);
+
+  while (required_masks)
+    {
+      if (required_masks & 1)
+	{
+	  if (elems < *min_mask_elems)
+	    *min_mask_elems = elems;
+	  if (elems > *max_mask_elems)
+	    *max_mask_elems = elems;
+	}
+      elems = elems << 1;
+      required_masks = required_masks >> 1;
+    }
+}
+
+/* Function vect_check_required_masks.
+
+   For the given LOOP_VINFO check that all required masks can be computed
+   and add the masking cost to the loop cost data.  */
+
+static void
+vect_check_required_masks (loop_vec_info loop_vinfo)
+{
+  if (!LOOP_VINFO_REQUIRED_MASKS (loop_vinfo))
+    return;
+
+  /* First check that we have a proper comparison to get
+     an initial mask.  */
+  tree iv_vectype = vect_get_masking_iv_type (loop_vinfo);
+  unsigned iv_elems = TYPE_VECTOR_SUBPARTS (iv_vectype);
+
+  tree mask_type = build_same_sized_truth_vector_type (iv_vectype);
+
+  if (!expand_vec_cmp_expr_p (iv_vectype, mask_type))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: required vector comparison "
+			 "is not supported.\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      return;
+    }
+
+  int cmp_copies  = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / iv_elems;
+  /* Add cost of initial iv values creation.  */
+  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), cmp_copies,
+		 scalar_to_vec, NULL, 0, vect_masking_prologue);
+  /* Add cost of upper bound and step values creation.  It is the same
+     for all copies.  */
+  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), 2,
+		 scalar_to_vec, NULL, 0, vect_masking_prologue);
+  /* Add cost of vector comparisons.  */
+  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), cmp_copies,
+		 vector_stmt, NULL, 0, vect_masking_body);
+  /* Add cost of iv increment.  */
+  add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), cmp_copies,
+		 vector_stmt, NULL, 0, vect_masking_body);
+
+
+  /* Now check the widest and the narrowest masks.
+     All intermediate values are obtained while
+     computing extreme values.  */
+  unsigned min_mask_elems = 0;
+  unsigned max_mask_elems = 0;
+
+  vect_get_extreme_masks (loop_vinfo, &min_mask_elems, &max_mask_elems);
+
+  if (min_mask_elems < iv_elems)
+    {
+      /* Check mask widening is available.  */
+      if (!vect_check_required_masks_widening (loop_vinfo, mask_type,
+					       min_mask_elems))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: required mask widening "
+			     "is not supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	  return;
+	}
+
+      /* Add widening cost.  We have (2^N - 1) vectors in total
+	 to widen per original vector, where N is the number of
+	 conversion steps.  Each widening requires two
+	 extracts.  */
+      int steps = exact_log2 (iv_elems / min_mask_elems);
+      int conversions = cmp_copies * 2 * ((1 << steps) - 1);
+      add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+		     conversions, vec_promote_demote,
+		     NULL, 0, vect_masking_body);
+    }
+
+  if (max_mask_elems > iv_elems)
+    {
+      if (!vect_check_required_masks_narrowing (loop_vinfo, mask_type,
+						max_mask_elems))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: required mask narrowing "
+			     "is not supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	  return;
+	}
+
+      /* Add narrowing cost.  We have (2^N - 1) vector narrowings
+	 in total per resulting vector, where N is the number of
+	 conversion steps.  */
+      int steps = exact_log2 (max_mask_elems / iv_elems);
+      int results = cmp_copies * iv_elems / max_mask_elems;
+      int conversions = results * ((1 << steps) - 1);
+      add_stmt_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+		     conversions, vec_promote_demote,
+		     NULL, 0, vect_masking_body);
+    }
+}
+
 /* Function vect_analyze_loop_operations.
 
    Scan the loop stmts and make sure they are all vectorizable.  */
@@ -1759,6 +2021,12 @@  vect_analyze_loop_operations (loop_vec_info loop_vinfo)
       return false;
     }
 
+  /* If all statements can be masked then we also need
+     to check that we can compute the required masks and
+     account for their cost.  */
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    vect_check_required_masks (loop_vinfo);
+
   return true;
 }
 
@@ -2232,6 +2500,8 @@  again:
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
+  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = true;
+  LOOP_VINFO_NEED_MASKING (loop_vinfo) = false;
 
   goto start_over;
 }
@@ -5424,6 +5694,7 @@  vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
       outer_loop = loop;
       loop = loop->inner;
       nested_cycle = true;
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
     }
 
   /* 1. Is vectorizable reduction?  */
@@ -5623,6 +5894,18 @@  vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 
   gcc_assert (ncopies >= 1);
 
+  if (slp_node || PURE_SLP_STMT (stmt_info) || code == COND_EXPR
+      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) == COND_REDUCTION
+      || STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
+	 == INTEGER_INDUC_COND_REDUCTION)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported conditional "
+			 "reduction\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
   vec_mode = TYPE_MODE (vectype_in);
 
   if (code == COND_EXPR)
@@ -5900,6 +6183,19 @@  vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 	  return false;
 	}
     }
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      /* Check that masking of reduction is supported.  */
+      tree mask_vtype = build_same_sized_truth_vector_type (vectype_out);
+      if (!expand_vec_cond_expr_p (vectype_out, mask_vtype))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: required vector conditional "
+			     "expression is not supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+    }
 
   if (!vec_stmt) /* transformation not required.  */
     {
@@ -5908,6 +6204,10 @@  vectorizable_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
 					 reduc_index))
         return false;
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	vect_model_simple_masking_cost (stmt_info, ncopies);
+
       return true;
     }
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 9ab4af4..91ebe5a 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -48,6 +48,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "builtins.h"
 #include "internal-fn.h"
+#include "tree-ssa-loop-ivopts.h"
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -535,6 +536,38 @@  process_use (gimple *stmt, tree use, loop_vec_info loop_vinfo, bool live_p,
   return true;
 }
 
+/* Return true if STMT can be converted to masked form.  */
+
+static bool
+can_mask_load_store (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  tree vectype, mask_vectype;
+  tree lhs, ref;
+
+  if (!stmt_info)
+    return false;
+  lhs = gimple_assign_lhs (stmt);
+  ref = (TREE_CODE (lhs) == SSA_NAME) ? gimple_assign_rhs1 (stmt) : lhs;
+  if (may_be_nonaddressable_p (ref))
+    return false;
+  vectype = STMT_VINFO_VECTYPE (stmt_info);
+  mask_vectype = build_same_sized_truth_vector_type (vectype);
+  if (!can_vec_mask_load_store_p (TYPE_MODE (vectype),
+				  TYPE_MODE (mask_vectype),
+				  gimple_assign_load_p (stmt)))
+    {
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			   "Statement can't be masked.\n");
+	  dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
+	}
+
+       return false;
+    }
+  return true;
+}
 
 /* Function vect_mark_stmts_to_be_vectorized.
 
@@ -1193,6 +1226,52 @@  vect_get_load_cost (struct data_reference *dr, int ncopies,
     }
 }
 
+/* Function vect_model_load_masking_cost.
+
+   Models cost for memory load masking.  */
+
+void
+vect_model_load_masking_cost (stmt_vec_info stmt_info, int ncopies)
+{
+  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
+    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
+			   ncopies, vector_mask_load, stmt_info, false,
+			   vect_masking_body);
+  else
+    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
+			   ncopies, vector_load, stmt_info, false,
+			   vect_masking_body);
+}
+
+/* Function vect_model_store_masking_cost.
+
+   Models cost for memory store masking.  */
+
+void
+vect_model_store_masking_cost (stmt_vec_info stmt_info, int ncopies)
+{
+  if (gimple_code (stmt_info->stmt) == GIMPLE_CALL)
+    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
+			   ncopies, vector_mask_store, stmt_info, false,
+			   vect_masking_body);
+  else
+    add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
+			   ncopies, vector_store, stmt_info, false,
+			   vect_masking_body);
+}
+
+/* Function vect_model_simple_masking_cost.
+
+   Models cost for statement masking.  */
+
+void
+vect_model_simple_masking_cost (stmt_vec_info stmt_info, int ncopies)
+{
+  add_stmt_masking_cost (stmt_info->vinfo->target_cost_data,
+			 ncopies, vector_stmt, stmt_info, false,
+			 vect_masking_body);
+}
+
 /* Insert the new stmt NEW_STMT at *GSI or at the appropriate place in
    the loop preheader for the vectorized stmt STMT.  */
 
@@ -1791,6 +1870,20 @@  vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 	       && !useless_type_conversion_p (vectype, rhs_vectype)))
     return false;
 
+  if (LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+    {
+      /* Check that mask conjunction is supported.  */
+      optab tab;
+      tab = optab_for_tree_code (BIT_AND_EXPR, vectype, optab_default);
+      if (!tab || optab_handler (tab, TYPE_MODE (vectype)) == CODE_FOR_nothing)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: unsupported mask operation\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
@@ -1799,6 +1892,15 @@  vectorizable_mask_load_store (gimple *stmt, gimple_stmt_iterator *gsi,
 			       NULL, NULL, NULL);
       else
 	vect_model_load_cost (stmt_info, ncopies, false, NULL, NULL, NULL);
+
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  if (is_store)
+	    vect_model_store_masking_cost (stmt_info, ncopies);
+	  else
+	    vect_model_load_masking_cost (stmt_info, ncopies);
+	}
+
       return true;
     }
 
@@ -2795,6 +2897,18 @@  vectorizable_simd_clone_call (gimple *stmt, gimple_stmt_iterator *gsi,
   if (slp_node || PURE_SLP_STMT (stmt_info))
     return false;
 
+  /* Masked clones are not yet supported.  But we allow
+     calls which may just be called with no mask.  */
+  if (!(gimple_call_flags (stmt) & ECF_CONST)
+      || (gimple_call_flags (stmt) & ECF_LOOPING_CONST_OR_PURE))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: non-const call "
+			 "(masked calls are not supported)\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
   /* Process function arguments.  */
   nargs = gimple_call_num_args (stmt);
 
@@ -5335,6 +5449,14 @@  vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 				 "negative step and reversing not supported.\n");
 	      return false;
 	    }
+	  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	    {
+	      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "cannot be masked: negative step"
+				 " is not supported.");
+	    }
 	}
     }
 
@@ -5343,6 +5465,15 @@  vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       grouped_store = true;
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
       group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: grouped access"
+			     " is not supported." );
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      }
+
       if (!slp
 	  && !PURE_SLP_STMT (stmt_info)
 	  && !STMT_VINFO_STRIDED_P (stmt_info))
@@ -5398,6 +5529,44 @@  vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
                              "scatter index use not simple.");
 	  return false;
 	}
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: gather/scatter is"
+			     " not supported.");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && STMT_VINFO_STRIDED_P (stmt_info))
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: strided store is not"
+			 " supported.\n");
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && integer_zerop (nested_in_vect_loop_p (loop, stmt)
+			? STMT_VINFO_DR_STEP (stmt_info)
+			: DR_STEP (dr)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: invariant store.\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && !can_mask_load_store (stmt))
+    {
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported mask store.\n");
     }
 
   if (!vec_stmt) /* transformation not required.  */
@@ -5407,6 +5576,9 @@  vectorizable_store (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       if (!PURE_SLP_STMT (stmt_info))
 	vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt,
 			       NULL, NULL, NULL);
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	vect_model_store_masking_cost (stmt_info, ncopies);
+
       return true;
     }
 
@@ -6312,6 +6484,15 @@  vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       grouped_load = true;
       /* FORNOW */
       gcc_assert (!nested_in_vect_loop && !STMT_VINFO_GATHER_SCATTER_P (stmt_info));
+      /* Not yet supported.  */
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: grouped acces is not"
+			     " supported.");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+      }
 
       first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
 
@@ -6358,6 +6539,7 @@  vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
 	    }
 
 	  LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = true;
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
 	}
 
       if (slp && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ())
@@ -6423,6 +6605,16 @@  vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       gather_decl = vect_check_gather_scatter (stmt, loop_vinfo, &gather_base,
 					       &gather_off, &gather_scale);
       gcc_assert (gather_decl);
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			    "cannot be masked: gather/scatter is not"
+			    " supported.\n");
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	}
+
+
       if (!vect_is_simple_use (gather_off, vinfo, &def_stmt, &gather_dt,
 			       &gather_off_vectype))
 	{
@@ -6434,6 +6626,15 @@  vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
     }
   else if (STMT_VINFO_STRIDED_P (stmt_info))
     {
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	{
+	  LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "cannot be masked: strided load is not"
+			     " supported.\n");
+	}
+
       if ((grouped_load
 	   && (slp || PURE_SLP_STMT (stmt_info)))
 	  && (group_size > nunits
@@ -6485,9 +6686,35 @@  vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
                                  "\n");
 	      return false;
 	    }
+	  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			        "cannot be masked: negative step "
+				 "for masking.\n");
+	      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+	    }
 	}
     }
 
+  if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && integer_zerop (nested_in_vect_loop
+			? STMT_VINFO_DR_STEP (stmt_info)
+			: DR_STEP (dr)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "allow invariant load for masked loop.\n");
+    }
+  else if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+	   && !can_mask_load_store (stmt))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "cannot be masked: unsupported masked load.\n");
+      LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
@@ -6495,6 +6722,9 @@  vectorizable_load (gimple *stmt, gimple_stmt_iterator *gsi, gimple **vec_stmt,
       if (!PURE_SLP_STMT (stmt_info))
 	vect_model_load_cost (stmt_info, ncopies, load_lanes_p,
 			      NULL, NULL, NULL);
+      if (loop_vinfo && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo))
+	vect_model_load_masking_cost (stmt_info, ncopies);
+
       return true;
     }
 
@@ -7891,6 +8121,43 @@  vectorizable_comparison (gimple *stmt, gimple_stmt_iterator *gsi,
   return true;
 }
 
+/* Return true if the vector version of STMT should be masked
+   in a vectorized loop epilogue (assuming it uses the
+   same VF as the main loop).  */
+
+static bool
+vect_stmt_should_be_masked_for_epilogue (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+  /* We should mask all statements accessing memory.  */
+  if (STMT_VINFO_DATA_REF (stmt_info))
+    return true;
+
+  /* We should also mask all reductions.  */
+  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
+      || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
+    return true;
+
+  return false;
+}
+
+/* Record the mask required to mask STMT in LOOP_VINFO_REQUIRED_MASKS.  */
+
+static void
+vect_add_required_mask_for_stmt (gimple *stmt)
+{
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+  unsigned HOST_WIDE_INT nelems = TYPE_VECTOR_SUBPARTS (vectype);
+  int bit_no = exact_log2 (nelems);
+
+  gcc_assert (bit_no >= 0);
+
+  LOOP_VINFO_REQUIRED_MASKS (loop_vinfo) |= (1 << bit_no);
+}
+
 /* Make sure the statement is vectorizable.  */
 
 bool
@@ -7898,6 +8165,7 @@  vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   enum vect_relevant relevance = STMT_VINFO_RELEVANT (stmt_info);
   bool ok;
   tree scalar_type, vectype;
@@ -8064,6 +8332,10 @@  vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
       STMT_VINFO_VECTYPE (stmt_info) = vectype;
    }
 
+  /* Masking is not supported for SLP yet.  */
+  if (loop_vinfo && node)
+    LOOP_VINFO_CAN_BE_MASKED (loop_vinfo) = false;
+
   if (STMT_VINFO_RELEVANT_P (stmt_info))
     {
       gcc_assert (!VECTOR_MODE_P (TYPE_MODE (gimple_expr_type (stmt))));
@@ -8123,6 +8395,11 @@  vect_analyze_stmt (gimple *stmt, bool *need_to_vectorize, slp_tree node)
       return false;
     }
 
+  if (loop_vinfo
+      && LOOP_VINFO_CAN_BE_MASKED (loop_vinfo)
+      && vect_stmt_should_be_masked_for_epilogue (stmt))
+    vect_add_required_mask_for_stmt (stmt);
+
   if (bb_vinfo)
     return true;
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index d3450b6..86c5371 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1033,6 +1033,9 @@  extern void vect_model_store_cost (stmt_vec_info, int, bool,
 extern void vect_model_load_cost (stmt_vec_info, int, bool, slp_tree,
 				  stmt_vector_for_cost *,
 				  stmt_vector_for_cost *);
+extern void vect_model_load_masking_cost (stmt_vec_info, int);
+extern void vect_model_store_masking_cost (stmt_vec_info, int);
+extern void vect_model_simple_masking_cost (stmt_vec_info, int);
 extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
 				  enum vect_cost_for_stmt, stmt_vec_info,
 				  int, enum vect_cost_model_location);