
[2/4,Vectorizer] Use a VEC_PERM_EXPR instead of VEC_RSHIFT_EXPR; expand appropriate VEC_PERM_EXPRs using vec_shr_optab

Message ID 54639E9F.50102@arm.com
State New

Commit Message

Alan Lawrence Nov. 12, 2014, 5:53 p.m. UTC
This makes the vectorizer use VEC_PERM_EXPRs when doing reductions via shifts, 
rather than VEC_RSHIFT_EXPR.
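
As a rough illustration (this sketch is not part of the patch, and the
function name is made up), the epilogue now reduces a vector with a sequence
of permute-and-combine steps like the following, written with GCC's generic
vector extension and assuming a V4SI add reduction on a little-endian target:

  typedef int v4si __attribute__ ((vector_size (16)));

  int
  reduce_sum (v4si x)
  {
    const v4si zero = { 0, 0, 0, 0 };
    /* Whole-vector shift by two elements: { x[2], x[3], 0, 0 }.  */
    v4si t = x + __builtin_shuffle (x, zero, (v4si) { 2, 3, 4, 5 });
    /* Whole-vector shift by one element; after the add, lane 0 holds the
       full sum (the other lanes hold partial sums we no longer need).  */
    t = t + __builtin_shuffle (t, zero, (v4si) { 1, 2, 3, 4 });
    /* Extract the scalar result from lane 0.  */
    return t[0];
  }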

VEC_RSHIFT_EXPR presently has an endianness-dependent meaning (paralleling
vec_shr_optab). While the overall destination of this patch series is to make
these endianness-neutral, this patch already feels quite big enough; hence,
here we just switch to using VEC_PERM_EXPRs whose meaning is equivalent to
that of the old VEC_RSHIFT_EXPRs. Since VEC_PERM_EXPR is endianness-neutral,
the mask we need in order to represent the meaning of the old VEC_RSHIFT_EXPR
changes according to endianness. (Patch 4 completes this journey by removing
the BYTES_BIG_ENDIAN-conditional parts; an alternative route to the same
endpoint would be to first change VEC_RSHIFT_EXPR to be
endianness-independent, and then replace it by VEC_PERM_EXPRs. I posted such
a patch to make VEC_RSHIFT_EXPR endianness-independent at
https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html, and this was what
led Richi to make his suggestion!)
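
For instance, for a V4SI vector, the old whole-vector shift right by two
elements becomes the permutation below (a sketch using GCC's generic vector
extension, not part of the patch; the selector values are those computed by
calc_vec_perm_mask_for_shift):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  shift_right_two (v4si x)
  {
    const v4si zero = { 0, 0, 0, 0 };
    /* Little-endian selector for a shift by two elements; the result is
       { x[2], x[3], 0, 0 }.  On a big-endian target the same shift is
       expressed with the selector { 6, 7, 0, 1 }, i.e. { 0, 0, x[0], x[1] }.  */
    return __builtin_shuffle (x, zero, (v4si) { 2, 3, 4, 5 });
  }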

The "trick" here is then to look for the platform handling vec_shr_optab when 
expanding vec_perm_const *if* the second vector is all constant zeroes and the 
vec_perm mask is appropriate. I felt it was best to keep this case separate from 
can_vec_perm_p, so the latter only indicates when the target platform can apply 
a given permutation to _arbitrary_input_vectors_, as can_vec_perm_p's interface 
is already complicated enough without making it also able to handle cases where 
some of the vectors-to-be-shuffled are known.
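
To make the matching concrete, here is another sketch (not part of the patch,
hypothetical function names, little-endian selectors shown): when the second
operand is known to be all zeroes, any lane taken from it contributes a zero,
so both of the selectors below describe the same whole-vector shift by two
elements, and shift_amt_for_vec_perm_mask accepts either, mapping it to a
vec_shr by 2 * 32 = 64 bits for V4SImode:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  shift_variant_a (v4si x)
  {
    const v4si zero = { 0, 0, 0, 0 };
    return __builtin_shuffle (x, zero, (v4si) { 2, 3, 4, 5 });
  }

  v4si
  shift_variant_b (v4si x)
  {
    const v4si zero = { 0, 0, 0, 0 };
    /* Lanes 6 and 7 index the (all-zero) second operand, so the result
       is identical to shift_variant_a.  */
    return __builtin_shuffle (x, zero, (v4si) { 2, 3, 6, 7 });
  }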

A nice side effect of this patch is that aarch64 targets suddenly perform
reductions via shifts even *without* implementing vec_shr_optab, because
aarch64_vectorize_vec_perm_const_ok recognizes shuffle masks matching the EXT
instruction, which can indeed be used to perform a shift :).

With patch 1, bootstrapped on x86-none-linux-gnu (more testing with patch 3).

gcc/ChangeLog:

	* optabs.c (can_vec_perm_p): Update comment, does not consider vec_shr.
	(shift_amt_for_vec_perm_mask): New.
	(expand_vec_perm_1): Use vec_shr_optab if second vector is const0_rtx
	and mask appropriate.

	* tree-vect-loop.c (calc_vec_perm_mask_for_shift): New.
	(have_whole_vector_shift): New.
	(vect_model_reduction_cost): Call have_whole_vector_shift instead of
	looking for vec_shr_optab.
	(vect_create_epilog_for_reduction): Likewise; also rename local variable
	have_whole_vector_shift to reduce_with_shift; output VEC_PERM_EXPRs
	instead of VEC_RSHIFT_EXPRs.

	* tree-vect-stmts.c (vect_gen_perm_mask_checked): Extend comment.

Comments

Richard Biener Nov. 13, 2014, 11:22 a.m. UTC | #1
On Wed, Nov 12, 2014 at 6:53 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This makes the vectorizer use VEC_PERM_EXPRs when doing reductions via
> shifts, rather than VEC_RSHIFT_EXPR.
>
> VEC_RSHIFT_EXPR presently has an endianness-dependent meaning (paralleling
> vec_shr_optab). While the overall destination of this patch series is to
> make these endianness-neutral, this patch already feels quite big enough,
> hence, here we just switch to using VEC_PERM_EXPRs that have meaning
> equivalent to the old VEC_RSHIFT_EXPRs. Since VEC_PERM_EXPR is
> endianness-neutral, this means the mask we need to represent the meaning of
> the old VEC_RSHIFT_EXPR changes according to endianness. (Patch 4 completes
> this journey by removing the BYTES_BIG_ENDIAN-conditional parts; so an
> alternative route to the same endpoint, would be to first change
> VEC_RSHIFT_EXPR to be endianness-independent, then replace it by
> VEC_PERM_EXPRs. I posted such a patch to make VEC_RSHIFT_EXPR independent
> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01475.html and this was what
> lead Richi to make his suggestion!)
>
> The "trick" here is then to look for the platform handling vec_shr_optab
> when expanding vec_perm_const *if* the second vector is all constant zeroes
> and the vec_perm mask is appropriate. I felt it was best to keep this case
> separate from can_vec_perm_p, so the latter only indicates when the target
> platform can apply a given permutation to _arbitrary_input_vectors_, as
> can_vec_perm_p's interface is already complicated enough without making it
> also able to handle cases where some of the vectors-to-be-shuffled are
> known.
>
> A nice side effect of this patch is that aarch64 targets suddenly perform
> reductions via shifts even *without* a vec_shr_optab, because
> aarch64_vectorize_vec_perm_const_ok looks for shuffle-masks for the EXT
> instruction, which can indeed be used to perform a shift :).
>
> With patch 1, bootstrapped on x86-none-linux-gnu (more testing with patch
> 3).

Ok.

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * optabs.c (can_vec_perm_p): Update comment, does not consider
> vec_shr.
>         (shift_amt_for_vec_perm_mask): New.
>         (expand_vec_perm_1): Use vec_shr_optab if second vector is
> const0_rtx
>         and mask appropriate.
>
>         * tree-vect-loop.c (calc_vec_perm_mask_for_shift): New.
>         (have_whole_vector_shift): New.
>         (vect_model_reduction_cost): Call have_whole_vector_shift instead of
>         looking for vec_shr_optab.
>         (vect_create_epilog_for_reduction): Likewise; also rename local
> variable
>         have_whole_vector_shift to reduce_with_shift; output VEC_PERM_EXPRs
>         instead of VEC_RSHIFT_EXPRs.
>
>         * tree-vect-stmts.c (vect_gen_perm_mask_checked): Extend comment.

Patch

diff --git a/gcc/optabs.c b/gcc/optabs.c
index 9452f991a6fb784c6288ad8501a412b83b14c92a..64a6a1345bf88993b9c2a045e67cfcc22c8010a4 100644
--- a/gcc/optabs.c
+++ b/gcc/optabs.c
@@ -6575,8 +6575,11 @@  vector_compare_rtx (enum tree_code tcode, tree t_op0, tree t_op1,
   return gen_rtx_fmt_ee (rcode, VOIDmode, ops[0].value, ops[1].value);
 }
 
-/* Return true if VEC_PERM_EXPR can be expanded using SIMD extensions
-   of the CPU.  SEL may be NULL, which stands for an unknown constant.  */
+/* Return true if VEC_PERM_EXPR of arbitrary input vectors can be expanded using
+   SIMD extensions of the CPU.  SEL may be NULL, which stands for an unknown
+   constant.  Note that additional permutations representing whole-vector shifts
+   may also be handled via the vec_shr optab, but only where the second input
+   vector is entirely constant zeroes; this case is not dealt with here.  */
 
 bool
 can_vec_perm_p (enum machine_mode mode, bool variable,
@@ -6629,6 +6632,36 @@  can_vec_perm_p (enum machine_mode mode, bool variable,
   return true;
 }
 
+/* Checks if vec_perm mask SEL is a constant equivalent to a shift of the first
+   vec_perm operand, assuming the second operand is a constant vector of zeroes.
+   Return the shift distance in bits if so, or NULL_RTX if the vec_perm is not a
+   shift.  */
+static rtx
+shift_amt_for_vec_perm_mask (rtx sel)
+{
+  unsigned int i, first, nelt = GET_MODE_NUNITS (GET_MODE (sel));
+  unsigned int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (GET_MODE (sel)));
+
+  if (GET_CODE (sel) != CONST_VECTOR)
+    return NULL_RTX;
+
+  first = INTVAL (CONST_VECTOR_ELT (sel, 0));
+  if (first >= 2*nelt)
+    return NULL_RTX;
+  for (i = 1; i < nelt; i++)
+    {
+      int idx = INTVAL (CONST_VECTOR_ELT (sel, i));
+      unsigned int expected = (i + first) & (2 * nelt - 1);
+      /* Indices into the second vector are all equivalent.  */
+      if (idx < 0 || (MIN (nelt, (unsigned) idx) != MIN (nelt, expected)))
+	return NULL_RTX;
+    }
+
+  if (BYTES_BIG_ENDIAN)
+    first = (2 * nelt) - first;
+  return GEN_INT (first * bitsize);
+}
+
 /* A subroutine of expand_vec_perm for expanding one vec_perm insn.  */
 
 static rtx
@@ -6657,6 +6690,17 @@  expand_vec_perm_1 (enum insn_code icode, rtx target,
   else
     {
       create_input_operand (&ops[1], v0, tmode);
+      /* See if this can be handled with a vec_shr.  We only do this if the
+         second vector is all zeroes.  */
+      enum insn_code shift_code = optab_handler (vec_shr_optab, GET_MODE (v0));
+      if (v1 == CONST0_RTX (GET_MODE (v1)) && shift_code)
+	if (rtx shift_amt = shift_amt_for_vec_perm_mask (sel))
+	  {
+	    create_convert_operand_from_type (&ops[2], shift_amt,
+					      sizetype_tab[(int) stk_sizetype]);
+	    if (maybe_expand_insn (shift_code, 3, ops))
+	      return ops[0].value;
+	  }
       create_input_operand (&ops[2], v1, tmode);
     }
 
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index a15ce14ef841d4573f7937522bcdaab8d6cb5efe..f1d327f42a5c517a01121135569dd014e49502e0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3082,6 +3082,41 @@  vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
   *ret_min_profitable_estimate = min_profitable_estimate;
 }
 
+/* Writes into SEL a mask for a vec_perm, equivalent to a vec_shr by OFFSET
+   vector elements (not bits) for a vector of mode MODE.  */
+static void
+calc_vec_perm_mask_for_shift (enum machine_mode mode, unsigned int offset,
+			      unsigned char *sel)
+{
+  unsigned int i, nelt = GET_MODE_NUNITS (mode);
+
+  for (i = 0; i < nelt; i++)
+    sel[i] = (BYTES_BIG_ENDIAN ? i - offset : i + offset) & (2*nelt - 1);
+}
+
+/* Checks whether the target supports whole-vector shifts for vectors of mode
+   MODE.  This is the case if _either_ the platform handles vec_shr_optab, _or_
+   it supports vec_perm_const with masks for all necessary shift amounts.  */
+static bool
+have_whole_vector_shift (enum machine_mode mode)
+{
+  if (optab_handler (vec_shr_optab, mode) != CODE_FOR_nothing)
+    return true;
+
+  if (direct_optab_handler (vec_perm_const_optab, mode) == CODE_FOR_nothing)
+    return false;
+
+  unsigned int i, nelt = GET_MODE_NUNITS (mode);
+  unsigned char *sel = XALLOCAVEC (unsigned char, nelt);
+
+  for (i = nelt/2; i >= 1; i/=2)
+    {
+      calc_vec_perm_mask_for_shift (mode, i, sel);
+      if (!can_vec_perm_p (mode, false, sel))
+	return false;
+    }
+  return true;
+}
 
 /* TODO: Close dependency between vect_model_*_cost and vectorizable_*
    functions. Design better to avoid maintenance issues.  */
@@ -3184,7 +3219,7 @@  vect_model_reduction_cost (stmt_vec_info stmt_info, enum tree_code reduc_code,
 	  /* We have a whole vector shift available.  */
 	  if (VECTOR_MODE_P (mode)
 	      && optab_handler (optab, mode) != CODE_FOR_nothing
-	      && optab_handler (vec_shr_optab, mode) != CODE_FOR_nothing)
+	      && have_whole_vector_shift (mode))
 	    {
 	      /* Final reduction via vector shifts and the reduction operator.
 		 Also requires scalar extract.  */
@@ -3787,7 +3822,6 @@  get_initial_def_for_reduction (gimple stmt, tree init_val,
   return init_def;
 }
 
-
 /* Function vect_create_epilog_for_reduction
 
    Create code at the loop-epilog to finalize the result of a reduction
@@ -4211,18 +4245,11 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
     }
   else
     {
-      enum tree_code shift_code = ERROR_MARK;
-      bool have_whole_vector_shift = true;
-      int bit_offset;
+      bool reduce_with_shift = have_whole_vector_shift (mode);
       int element_bitsize = tree_to_uhwi (bitsize);
       int vec_size_in_bits = tree_to_uhwi (TYPE_SIZE (vectype));
       tree vec_temp;
 
-      if (optab_handler (vec_shr_optab, mode) != CODE_FOR_nothing)
-        shift_code = VEC_RSHIFT_EXPR;
-      else
-        have_whole_vector_shift = false;
-
       /* Regardless of whether we have a whole vector shift, if we're
          emulating the operation via tree-vect-generic, we don't want
          to use it.  Only the first round of the reduction is likely
@@ -4230,18 +4257,24 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
       /* ??? It might be better to emit a reduction tree code here, so that
          tree-vect-generic can expand the first round via bit tricks.  */
       if (!VECTOR_MODE_P (mode))
-        have_whole_vector_shift = false;
+        reduce_with_shift = false;
       else
         {
           optab optab = optab_for_tree_code (code, vectype, optab_default);
           if (optab_handler (optab, mode) == CODE_FOR_nothing)
-            have_whole_vector_shift = false;
+            reduce_with_shift = false;
         }
 
-      if (have_whole_vector_shift && !slp_reduc)
+      if (reduce_with_shift && !slp_reduc)
         {
+          int nelements = vec_size_in_bits / element_bitsize;
+          unsigned char *sel = XALLOCAVEC (unsigned char, nelements);
+
+          int elt_offset;
+
+          tree zero_vec = build_zero_cst (vectype);
           /*** Case 2: Create:
-             for (offset = VS/2; offset >= element_size; offset/=2)
+             for (offset = nelements/2; offset >= 1; offset/=2)
                 {
                   Create:  va' = vec_shift <va, offset>
                   Create:  va = vop <va, va'>
@@ -4253,14 +4286,15 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
 
           vec_dest = vect_create_destination_var (scalar_dest, vectype);
           new_temp = new_phi_result;
-          for (bit_offset = vec_size_in_bits/2;
-               bit_offset >= element_bitsize;
-               bit_offset /= 2)
+          for (elt_offset = nelements / 2;
+               elt_offset >= 1;
+               elt_offset /= 2)
             {
-              tree bitpos = size_int (bit_offset);
-
-              epilog_stmt = gimple_build_assign_with_ops (shift_code,
-                                               vec_dest, new_temp, bitpos);
+              calc_vec_perm_mask_for_shift (mode, elt_offset, sel);
+              tree mask = vect_gen_perm_mask_any (vectype, sel);
+	      epilog_stmt = gimple_build_assign_with_ops (VEC_PERM_EXPR,
+							  vec_dest, new_temp,
+							  zero_vec, mask);
               new_name = make_ssa_name (vec_dest, epilog_stmt);
               gimple_assign_set_lhs (epilog_stmt, new_name);
               gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
@@ -4276,8 +4310,6 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
         }
       else
         {
-          tree rhs;
-
           /*** Case 3: Create:
              s = extract_field <v_out2, 0>
              for (offset = element_size;
@@ -4295,11 +4327,12 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
           vec_size_in_bits = tree_to_uhwi (TYPE_SIZE (vectype));
           FOR_EACH_VEC_ELT (new_phis, i, new_phi)
             {
+              int bit_offset;
               if (gimple_code (new_phi) == GIMPLE_PHI)
                 vec_temp = PHI_RESULT (new_phi);
               else
                 vec_temp = gimple_assign_lhs (new_phi);
-              rhs = build3 (BIT_FIELD_REF, scalar_type, vec_temp, bitsize,
+              tree rhs = build3 (BIT_FIELD_REF, scalar_type, vec_temp, bitsize,
                             bitsize_zero_node);
               epilog_stmt = gimple_build_assign (new_scalar_dest, rhs);
               new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 3c2d7d7926e7aece6c0aca63210a2d6d591ac1fb..db464f3fc842b14f25f5a4ae1e18aec91e59e868 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -5491,7 +5491,8 @@  vect_gen_perm_mask_any (tree vectype, const unsigned char *sel)
   return mask_vec;
 }
 
-/* Checked version of vect_gen_perm_mask_any.  Asserts can_vec_perm_p.  */
+/* Checked version of vect_gen_perm_mask_any.  Asserts can_vec_perm_p,
+   i.e. that the target supports the pattern _for arbitrary input vectors_.  */
 
 tree
 vect_gen_perm_mask_checked (tree vectype, const unsigned char *sel)