[RFC,vect] PR 65930: teach vectorizer to handle SUM reductions with sign-change casts

Message ID 780e0f8d-2b31-a9e8-1701-4a895e95d9b6@arm.com

Commit Message

Andre Vieira (lists) Aug. 23, 2019, 4:55 p.m. UTC
Hi Richard,

I have come up with a way to teach the vectorizer to handle
sign-changing reductions, restricted to SUM operations as I'm not sure
other reductions are equivalent under a change of sign.

The main idea of the approach is to recognize reductions of the form
Phi->NopConversion?->Plus/Minus-reduction->NopConversion?->Phi and then
vectorize the statements normally, with some extra workarounds to
handle the conversions.  These are mainly needed where the vectorizer
looks for uses of the result of the reduction: we now need to check the
uses of the result of the conversion instead.
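
For concreteness, a loop of the shape this targets would be something
like the following (illustrative only, along the lines of PR 65930, not
a testcase taken from the patch):

int f (unsigned int *x, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    /* The addend is unsigned, so sum is converted to unsigned for the
       addition and back to int for the store, which gives the
       Phi->NopConversion->Plus->NopConversion->Phi shape in gimple.  */
    sum += x[i];
  return sum;
}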

I am curious to know what you think of this approach.  I have
regression tested this on aarch64 and on x86_64 with AVX512, and it
shows no regressions.  On the month-old version of trunk I tested, it
even seems to make gcc.dg/vect/pr89268.c pass, where it used to fail
with an ICE complaining about a definition not dominating a use.

Initial benchmarks also show a 14% improvement on x264_r from SPEC2017
on aarch64.

Cheers,
Andre

Comments

Richard Biener Aug. 27, 2019, 1:28 p.m. UTC | #1
On Fri, 23 Aug 2019, Andre Vieira (lists) wrote:

> Hi Richard,
> 
> I have come up with a way to teach the vectorizer to handle sign-changing
> reductions, restricted to SUM operations as I'm not sure other reductions
> are equivalent under a change of sign.
> 
> The main idea of the approach is to recognize reductions of the form
> Phi->NopConversion?->Plus/Minus-reduction->NopConversion?->Phi and then
> vectorize the statements normally, with some extra workarounds to handle
> the conversions.  These are mainly needed where the vectorizer looks for
> uses of the result of the reduction: we now need to check the uses of the
> result of the conversion instead.
> 
> I am curious to know what you think of this approach.  I have regression
> tested this on aarch64 and on x86_64 with AVX512, and it shows no
> regressions.  On the month-old version of trunk I tested, it even seems to
> make gcc.dg/vect/pr89268.c pass, where it used to fail with an ICE
> complaining about a definition not dominating a use.

Aww.  Yeah, I had a half-way working patch along this line as well
and threw it away because of ugliness.

So I was hoping we could at some point refactor the reduction detection
code to use the path discovery in check_reduction_path (which is
basically a lame SCC-finding algorithm), massage the detected reduction
path, record in the reduction PHI meta-data something like
"this reduction SUMs _1, _4, _3 and _5" plus, for the conversions,
"do the reduction in SIGN", and during code generation just look at
the PHI node and the backedge def, which we'd replace.
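
For instance, for a chain like (an illustrative sketch, not output
from the patch)

  sum_1 = PHI <sum_4(latch), 0(preheader)>
  _2 = (unsigned int) sum_1;
  _3 = _2 + x_5;
  sum_4 = (int) _3;

we'd record on the PHI that the reduction sums along the path
sum_1 -> _2 -> _3 -> sum_4 and that the actual addition is to be
done in unsigned int.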

But of course I stopped short of trying that because the reduction code
is a big mess.  And I threw away the attempt that looked like yours
because I didn't want to make an even bigger mess out of it :/

On the branch throwing away the non-SLP paths I started to
refactor^Wrewrite all this but got stuck as well.  One thing I
realized on the branch is that nested cycle handling should be
more straightforward and done in a separate vectorizable_*
routine.  Not sure it simplified things a lot, but well.  Maybe
simply always building an SLP graph for reductions would also help.

Well.

Maybe you can try experimenting with amending check_reduction_path
with conversion support - from a quick look your patch wouldn't
handle

_1 = PHI <.., _4>
_2 = (unsigned) _1;
_3 = _2 + ...;
_4 = (signed) _3;

since the last stmt you expect is still a PLUS?

> Initial benchmarks also show a 14% improvement on x264_r from SPEC2017 on
> aarch64.

Interesting, on x86 IIRC I didn't see any such big effect on x264_r,
but it was the testcase where I first ran into this.

Richard.
Richard Biener Aug. 27, 2019, 1:42 p.m. UTC | #2
On Tue, 27 Aug 2019, Richard Biener wrote:

> [...]
> On the branch throwing away the non-SLP paths I started to 
> refactor^Wrewrite all this but got stuck as well.

Before you start looking: I figured all this is only in my
working tree...

Richard.

Patch

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..a346547153b6b12fd9090dd7491766986ab2f4f9 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2576,6 +2576,26 @@  report_vect_op (dump_flags_t msg_type, gimple *stmt, const char *msg)
   dump_printf_loc (msg_type, vect_location, "%s%G", msg, stmt);
 }
 
+/* Function is_nop_conversion_stmt
+
+   Return true if STMT is a gimple assign statement that performs a tree
+   nop conversion.  */
+
+bool
+is_nop_conversion_stmt (gimple *stmt)
+{
+  tree outer_t, inner_t;
+  if (!is_gimple_assign (stmt))
+    return false;
+  if (gimple_assign_rhs_code (stmt) != NOP_EXPR)
+    return false;
+
+  outer_t = TREE_TYPE (gimple_assign_lhs (stmt));
+  inner_t = TREE_TYPE (gimple_assign_rhs1 (stmt));
+
+  return tree_nop_conversion_p (outer_t, inner_t);
+}
+
 /* DEF_STMT_INFO occurs in a loop that contains a potential reduction
    operation.  Return true if the results of DEF_STMT_INFO are something
    that can be accumulated by such a reduction.  */
@@ -2649,7 +2669,9 @@  vect_is_slp_reduction (loop_vec_info loop_info, gimple *phi,
           if (flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
             {
 	      loop_use_stmt = use_stmt;
-	      nloop_uses++;
+	      /* Do not count a nop conversion as a use.  */
+	      if (!is_nop_conversion_stmt (use_stmt))
+		nloop_uses++;
             }
            else
              n_out_of_loop_uses++;
@@ -2663,18 +2685,24 @@  vect_is_slp_reduction (loop_vec_info loop_info, gimple *phi,
       if (found)
         break;
 
-      /* We reached a statement with no loop uses.  */
-      if (nloop_uses == 0)
-	return false;
-
       /* This is a loop exit phi, and we haven't reached the reduction phi.  */
       if (gimple_code (loop_use_stmt) == GIMPLE_PHI)
         return false;
 
-      if (!is_gimple_assign (loop_use_stmt)
-	  || code != gimple_assign_rhs_code (loop_use_stmt)
-	  || !flow_bb_inside_loop_p (loop, gimple_bb (loop_use_stmt)))
-        return false;
+      if (!is_gimple_assign (loop_use_stmt))
+	return false;
+
+      /* Keep moving along the def-use chain, ignoring nop conversions.  */
+      if (!is_nop_conversion_stmt (loop_use_stmt))
+	{
+	  /* We reached a statement with no loop uses.  */
+	  if (nloop_uses == 0)
+	    return false;
+
+	  else if (code != gimple_assign_rhs_code (loop_use_stmt)
+		   || !flow_bb_inside_loop_p (loop, gimple_bb (loop_use_stmt)))
+	    return false;
+	}
 
       /* Insert USE_STMT into reduction chain.  */
       use_stmt_info = loop_info->lookup_stmt (loop_use_stmt);
@@ -2693,7 +2721,9 @@  vect_is_slp_reduction (loop_vec_info loop_info, gimple *phi,
   for (unsigned i = 0; i < reduc_chain.length (); ++i)
     {
       gassign *next_stmt = as_a <gassign *> (reduc_chain[i]->stmt);
-      if (gimple_assign_rhs2 (next_stmt) == lhs)
+      if (is_nop_conversion_stmt (next_stmt))
+	continue;
+      else if (gimple_assign_rhs2 (next_stmt) == lhs)
 	{
 	  tree op = gimple_assign_rhs1 (next_stmt);
 	  stmt_vec_info def_stmt_info = loop_info->lookup_def (op);
@@ -3120,6 +3150,28 @@  vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
   gassign *def_stmt = as_a <gassign *> (def_stmt_info->stmt);
   code = orig_code = gimple_assign_rhs_code (def_stmt);
 
+  /* If the def_stmt is a nop conversion then it is not the real reduction
+     definition statement.  Follow the definition of the variable this
+     statement is converting to reach the actual reduction definition.  */
+  if (is_nop_conversion_stmt (def_stmt))
+    {
+      tree rhs = gimple_assign_rhs1 (def_stmt);
+      gimple *new_def = SSA_NAME_DEF_STMT (rhs);
+
+      if (is_gimple_assign (new_def))
+	{
+	  enum tree_code new_code = gimple_assign_rhs_code (new_def);
+	  /* Only do this for reductions where it is safe to ignore the
+	     signedness, though.  */
+	  if (new_code == PLUS_EXPR || new_code == MINUS_EXPR)
+	    {
+	      def_stmt = as_a <gassign *> (new_def);
+	      def_stmt_info = loop_info->lookup_stmt (def_stmt);
+	      code = orig_code = new_code;
+	    }
+	}
+    }
+
   if (nested_in_vect_loop && !check_reduction)
     {
       /* FIXME: Even for non-reductions code generation is funneled
@@ -4551,6 +4603,7 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs,
   tree new_phi_result;
   stmt_vec_info inner_phi = NULL;
   tree induction_index = NULL_TREE;
+  stmt_vec_info use_stmt_info;
 
   if (slp_node)
     group_size = SLP_TREE_SCALAR_STMTS (slp_node).length (); 
@@ -4798,6 +4851,18 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs,
          v_out1 = phi <VECT_DEF> 
          Store them in NEW_PHIS.  */
 
+  stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info);
+  scalar_dest = gimple_assign_lhs (orig_stmt_info->stmt);
+  if ((use_stmt_info = loop_vinfo->lookup_single_use (scalar_dest))
+      && is_nop_conversion_stmt (use_stmt_info->stmt))
+    scalar_dest = gimple_assign_lhs (use_stmt_info->stmt);
+
+  scalar_type = TREE_TYPE (scalar_dest);
+  scalar_results.create (group_size);
+  new_scalar_dest = vect_create_destination_var (scalar_dest, NULL);
+  bitsize = TYPE_SIZE (scalar_type);
+
+
   exit_bb = single_exit (loop)->dest;
   prev_phi_info = NULL;
   new_phis.create (vect_defs.length ());
@@ -4805,6 +4870,34 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs,
     {
       for (j = 0; j < ncopies; j++)
         {
+	  /* If use_stmt_info is NULL, then the scalar destination does not
+	     have a single use.  This means we could have the following case:
+	     loop:
+	       phi_r = (phi_i, loop), (initial_def, pre_header)
+	       cast_i = (int) phi_r;
+	       sum = cast_i + ...;
+	       phi_i = (unsigned int) sum;
+	     loop_exit:
+	       phi_out = (sum, loop)
+
+	     In this case, the def will currently point to the result of the
+	     cast rather than the result of the reduction, which means the loop
+	     exit phi's will be constructed using the wrong type.  For this
+	     reason we want to use the value of the reduction before the
+	     casting.  Note that we only accept reductions with sign-changing
+	     casts if they are using operations that are sign-invariant.
+	    */
+	  gimple *def_stmt;
+	  if (!use_stmt_info
+	      && !useless_type_conversion_p (TREE_TYPE (TREE_TYPE (def)),
+					     scalar_type)
+	      && tree_nop_conversion_p (TREE_TYPE (TREE_TYPE (def)),
+					scalar_type)
+	      && (def_stmt = SSA_NAME_DEF_STMT (def))
+	      && is_gimple_assign (def_stmt)
+	      && gimple_assign_rhs_code (def_stmt) == VIEW_CONVERT_EXPR)
+	    def = TREE_OPERAND (gimple_assign_rhs1 (def_stmt), 0);
+
 	  tree new_def = copy_ssa_name (def);
           phi = create_phi_node (new_def, exit_bb);
 	  stmt_vec_info phi_info = loop_vinfo->add_stmt (phi);
@@ -4863,7 +4956,6 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs,
          Otherwise (it is a regular reduction) - the tree-code and scalar-def
          are taken from STMT.  */
 
-  stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info);
   if (orig_stmt_info != stmt_info)
     {
       /* Reduction pattern  */
@@ -4877,12 +4969,6 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs,
   if (code == MINUS_EXPR) 
     code = PLUS_EXPR;
   
-  scalar_dest = gimple_assign_lhs (orig_stmt_info->stmt);
-  scalar_type = TREE_TYPE (scalar_dest);
-  scalar_results.create (group_size); 
-  new_scalar_dest = vect_create_destination_var (scalar_dest, NULL);
-  bitsize = TYPE_SIZE (scalar_type);
-
   /* In case this is a reduction in an inner-loop while vectorizing an outer
      loop - we don't need to extract a single scalar result at the end of the
      inner-loop (unless it is double reduction, i.e., the use of reduction is
@@ -5591,16 +5677,50 @@  vect_finalize_reduction:
   if (adjustment_def)
     {
       gcc_assert (!slp_reduc);
+
       if (nested_in_vect_loop)
 	{
-          new_phi = new_phis[0];
+	  new_phi = new_phis[0];
+	  new_temp = PHI_RESULT (new_phi);
+	  if (!useless_type_conversion_p (TREE_TYPE (TREE_TYPE (new_temp)),
+					  TREE_TYPE (initial_def)))
+	    {
+	      gimple_seq stmts;
+	      poly_uint64 sz
+		= GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (initial_def)));
+	      vectype
+		= get_vectype_for_scalar_type_and_size (TREE_TYPE (initial_def),
+							sz);
+
+	      gcc_assert (tree_nop_conversion_p (TREE_TYPE (TREE_TYPE (new_temp)),
+						 TREE_TYPE (initial_def)));
+
+	      new_temp = build1 (VIEW_CONVERT_EXPR, vectype, new_temp);
+	      new_temp = force_gimple_operand (unshare_expr (new_temp), &stmts,
+					       true, NULL_TREE);
+	      if (stmts)
+		gsi_insert_before (&exit_gsi, stmts, GSI_SAME_STMT);
+	    }
 	  gcc_assert (TREE_CODE (TREE_TYPE (adjustment_def)) == VECTOR_TYPE);
-	  expr = build2 (code, vectype, PHI_RESULT (new_phi), adjustment_def);
+	  expr = build2 (code, vectype, new_temp, adjustment_def);
 	  new_dest = vect_create_destination_var (scalar_dest, vectype);
 	}
       else
 	{
-          new_temp = scalar_results[0];
+	  new_temp = scalar_results[0];
+	  if (!useless_type_conversion_p (TREE_TYPE (new_temp),
+					  TREE_TYPE (initial_def)))
+	    {
+	      gimple_seq stmts;
+	      scalar_type = TREE_TYPE (initial_def);
+	      gcc_assert (tree_nop_conversion_p (TREE_TYPE (new_temp),
+						 scalar_type));
+	      new_temp = build1 (NOP_EXPR, scalar_type, new_temp);
+	      new_temp = force_gimple_operand (unshare_expr (new_temp), &stmts,
+					       true, NULL_TREE);
+	      if (stmts)
+		gsi_insert_before (&exit_gsi, stmts, GSI_SAME_STMT);
+	    }
 	  gcc_assert (TREE_CODE (TREE_TYPE (adjustment_def)) != VECTOR_TYPE);
 	  expr = build2 (code, scalar_type, new_temp, adjustment_def);
 	  new_dest = vect_create_destination_var (scalar_dest, scalar_type);
@@ -5817,17 +5937,22 @@  vect_finalize_reduction:
             continue;
         }
 
-      phis.create (3);
+      auto_vec<use_operand_p> dest_uses;
+      dest_uses.create (3);
       /* Find the loop-closed-use at the loop exit of the original scalar
          result.  (The reduction result is expected to have two immediate uses,
          one at the latch block, and one at the loop exit).  For double
          reductions we are looking for exit phis of the outer loop.  */
+
+      use_stmt_info
+	= loop_vinfo->lookup_single_use (scalar_dest);
+
       FOR_EACH_IMM_USE_FAST (use_p, imm_iter, scalar_dest)
         {
           if (!flow_bb_inside_loop_p (loop, gimple_bb (USE_STMT (use_p))))
 	    {
 	      if (!is_gimple_debug (USE_STMT (use_p)))
-		phis.safe_push (USE_STMT (use_p));
+		dest_uses.safe_push (use_p);
 	    }
           else
             {
@@ -5840,23 +5965,70 @@  vect_finalize_reduction:
                       if (!flow_bb_inside_loop_p (loop,
                                              gimple_bb (USE_STMT (phi_use_p)))
 			  && !is_gimple_debug (USE_STMT (phi_use_p)))
-                        phis.safe_push (USE_STMT (phi_use_p));
+                        dest_uses.safe_push (phi_use_p);
                     }
                 }
             }
         }
 
-      FOR_EACH_VEC_ELT (phis, i, exit_phi)
-        {
-          /* Replace the uses:  */
-          orig_name = PHI_RESULT (exit_phi);
-          scalar_result = scalar_results[k];
-          FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, orig_name)
-            FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
-              SET_USE (use_p, scalar_result);
-        }
+      scalar_result = scalar_results[k];
+      /* Not quite sure why we initially expect these PHI nodes to have a
+	 single argument.  With sign-changing reductions we sometimes see
+	 code generation that results in these PHI nodes having multiple
+	 arguments.  If that is the case we replace the actual argument within
+	 the PHI node rather than the uses of the result of the PHI node.  */
+      FOR_EACH_VEC_ELT (dest_uses, i, use_p)
+	{
+	  if (gimple_phi_num_args (USE_STMT (use_p)) > 1)
+	    {
+	      tree use = USE_FROM_PTR (use_p);
+	      if (!useless_type_conversion_p (TREE_TYPE (use),
+					      TREE_TYPE (scalar_result)))
+		{
+		  gimple_stmt_iterator gsi;
+		  gimple_seq stmts;
+		  gcc_assert (tree_nop_conversion_p (TREE_TYPE (use),
+						     TREE_TYPE
+						     (scalar_result)));
+		  gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (scalar_result));
+		  scalar_result = build1 (NOP_EXPR, TREE_TYPE (use),
+					  scalar_result);
+		  scalar_result
+		    = force_gimple_operand (unshare_expr (scalar_result),
+					    &stmts, true, NULL_TREE);
+		  gsi_insert_after (&gsi, stmts, GSI_SAME_STMT);
+		}
+	      SET_USE (use_p, scalar_result);
+	    }
+	  else
+	    {
+	      orig_name = PHI_RESULT (USE_STMT (use_p));
+	      FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, orig_name)
+		FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
+		  {
+		    tree use = USE_FROM_PTR (use_p);
+		    if (!useless_type_conversion_p (TREE_TYPE (use),
+						    TREE_TYPE (scalar_result)))
+		      {
+			gimple_stmt_iterator gsi;
+			gimple_seq stmts;
+			gcc_assert (tree_nop_conversion_p (TREE_TYPE (use),
+							   TREE_TYPE
+							   (scalar_result)));
+			gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (scalar_result));
+			scalar_result = build1 (NOP_EXPR, TREE_TYPE (use),
+						scalar_result);
+			scalar_result
+			  = force_gimple_operand (unshare_expr (scalar_result),
+						  &stmts, true, NULL_TREE);
+			gsi_insert_after (&gsi, stmts, GSI_SAME_STMT);
+		      }
+		    SET_USE (use_p, scalar_result);
+		  }
+	    }
+	}
 
-      phis.release ();
+      dest_uses.release ();
     }
 }
 
@@ -6339,6 +6511,9 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       for (unsigned k = 1; k < gimple_num_ops (reduc_stmt); ++k)
 	{
 	  tree op = gimple_op (reduc_stmt, k);
+	  if (TREE_CODE (op) == SSA_NAME
+	      && is_nop_conversion_stmt (SSA_NAME_DEF_STMT (op)))
+	    op = gimple_assign_rhs1 (SSA_NAME_DEF_STMT (op));
 	  if (op == phi_result)
 	    continue;
 	  if (k == 1 && code == COND_EXPR)
@@ -6367,9 +6542,16 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       stmt_vec_info use_stmt_info;
       if (ncopies > 1
 	  && STMT_VINFO_RELEVANT (reduc_stmt_info) <= vect_used_only_live
-	  && (use_stmt_info = loop_vinfo->lookup_single_use (phi_result))
-	  && vect_stmt_to_vectorize (use_stmt_info) == reduc_stmt_info)
-	single_defuse_cycle = true;
+	  && (use_stmt_info = loop_vinfo->lookup_single_use (phi_result)))
+	{
+	  if (is_nop_conversion_stmt (use_stmt_info->stmt))
+	    {
+	      tree lhs = gimple_assign_lhs (use_stmt_info->stmt);
+	      use_stmt_info = loop_vinfo->lookup_single_use (lhs);
+	    }
+	  if (vect_stmt_to_vectorize (use_stmt_info) == reduc_stmt_info)
+	    single_defuse_cycle = true;
+	}
 
       /* Create the destination vector  */
       scalar_dest = gimple_assign_lhs (reduc_stmt);
@@ -6512,8 +6694,13 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 					  &def_stmt_info);
       dt = dts[i];
       gcc_assert (is_simple_use);
-      if (dt == vect_reduction_def
-	  && ops[i] == reduc_def)
+
+
+      if ((dt == vect_reduction_def
+	   && ops[i] == reduc_def)
+	  || (def_stmt_info
+	      && is_nop_conversion_stmt (def_stmt_info->stmt)
+	      && gimple_assign_rhs1 (def_stmt_info->stmt) == reduc_def))
 	{
 	  reduc_index = i;
 	  continue;
@@ -6536,7 +6723,10 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 	return false;
 
       if (dt == vect_nested_cycle
-	  && ops[i] == reduc_def)
+	  && (ops[i] == reduc_def
+	      || (def_stmt_info
+		  && is_nop_conversion_stmt (def_stmt_info->stmt)
+		  && gimple_assign_rhs1 (def_stmt_info->stmt) == reduc_def)))
 	{
 	  found_nested_cycle_def = true;
 	  reduc_index = i;
@@ -6579,6 +6769,8 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
   if (!(reduc_index == -1
 	|| dts[reduc_index] == vect_reduction_def
 	|| dts[reduc_index] == vect_nested_cycle
+	|| (dts[reduc_index] == vect_internal_def
+	    && is_nop_conversion_stmt (SSA_NAME_DEF_STMT (ops[reduc_index])))
 	|| ((dts[reduc_index] == vect_internal_def
 	     || dts[reduc_index] == vect_external_def
 	     || dts[reduc_index] == vect_constant_def
@@ -6749,6 +6941,12 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
       def_arg = PHI_ARG_DEF_FROM_EDGE (reduc_def_phi,
                                        loop_preheader_edge (def_stmt_loop));
       stmt_vec_info def_arg_stmt_info = loop_vinfo->lookup_def (def_arg);
+      if (def_arg_stmt_info
+	  && is_nop_conversion_stmt (def_arg_stmt_info->stmt))
+	{
+	  tree rhs = gimple_assign_rhs1 (def_arg_stmt_info->stmt);
+	  def_arg_stmt_info = loop_vinfo->lookup_def (rhs);
+	}
       if (def_arg_stmt_info
 	  && (STMT_VINFO_DEF_TYPE (def_arg_stmt_info)
 	      == vect_double_reduction_def))
@@ -7133,12 +7331,19 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
    This only works when we see both the reduction PHI and its only consumer
    in vectorizable_reduction and there are no intermediate stmts
    participating.  */
-  stmt_vec_info use_stmt_info;
+  stmt_vec_info use_stmt_info = NULL;
   tree reduc_phi_result = gimple_phi_result (reduc_def_phi);
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && (use_stmt_info = loop_vinfo->lookup_single_use (reduc_phi_result))
-      && vect_stmt_to_vectorize (use_stmt_info) == stmt_info)
+      && (use_stmt_info = loop_vinfo->lookup_single_use (reduc_phi_result)))
+    {
+      if (is_nop_conversion_stmt (use_stmt_info->stmt))
+	{
+	  tree lhs = gimple_assign_lhs (use_stmt_info->stmt);
+	  use_stmt_info = loop_vinfo->lookup_single_use (lhs);
+	}
+    }
+  if (use_stmt_info && vect_stmt_to_vectorize (use_stmt_info) == stmt_info)
     {
       single_defuse_cycle = true;
       epilog_copies = 1;
@@ -7405,6 +7610,27 @@  vectorizable_reduction (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
   if ((!single_defuse_cycle || code == COND_EXPR) && !slp_node)
     vect_defs[0] = gimple_get_lhs ((*vec_stmt)->stmt);
 
+  for (j = 0; j < vec_num; ++j)
+    {
+      gimple_seq stmts;
+      gimple_stmt_iterator it = gsi_for_stmt (SSA_NAME_DEF_STMT (vect_defs[j]));
+      tree def_t = TREE_TYPE (vect_defs[j]);
+      tree phi_t = TREE_TYPE (PHI_RESULT (reduc_def_phi));
+      if (tree_nop_conversion_p (TREE_TYPE (def_t), phi_t))
+	{
+	  /* TODO: Not sure about slp_node here...  */
+	  poly_uint64 sz = GET_MODE_SIZE (TYPE_MODE (def_t));
+	  tree vectype = get_vectype_for_scalar_type_and_size (phi_t, sz);
+	  vect_defs[j] = fold_build1 (VIEW_CONVERT_EXPR, vectype,
+				      vect_defs[j]);
+	  vect_defs[j]
+	    = force_gimple_operand (unshare_expr (vect_defs[j]), &stmts,
+				    true, NULL_TREE);
+	  if (stmts)
+	    gsi_insert_after (&it, stmts, GSI_SAME_STMT);
+	}
+    }
+
   vect_create_epilog_for_reduction (vect_defs, stmt_info, reduc_def_phi,
 				    epilog_copies, reduc_fn, phis,
 				    double_reduc, slp_node, slp_node_instance,
@@ -8148,7 +8374,7 @@  vectorizable_live_operation (stmt_vec_info stmt_info,
   else
     {
       enum vect_def_type dt = STMT_VINFO_DEF_TYPE (stmt_info);
-      vec_lhs = vect_get_vec_def_for_operand_1 (stmt_info, dt);
+      vec_lhs = vect_get_vec_def_for_operand_1 (NULL, stmt_info, dt);
       gcc_checking_assert (ncopies == 1
 			   || !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 601a6f55fbff388c89f88d994e790aebf2bf960e..e6af73af30f63ed72ffd6df8486212a83fa57313 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1519,7 +1519,8 @@  vect_init_vector (stmt_vec_info stmt_info, tree val, tree type,
    with type DT that will be used in the vectorized stmt.  */
 
 tree
-vect_get_vec_def_for_operand_1 (stmt_vec_info def_stmt_info,
+vect_get_vec_def_for_operand_1 (stmt_vec_info stmt_vinfo,
+				stmt_vec_info def_stmt_info,
 				enum vect_def_type dt)
 {
   tree vec_oprnd;
@@ -1533,14 +1534,19 @@  vect_get_vec_def_for_operand_1 (stmt_vec_info def_stmt_info,
       /* Code should use vect_get_vec_def_for_operand.  */
       gcc_unreachable ();
 
-    /* Operand is defined by a loop header phi.  In case of nested
-       cycles we also may have uses of the backedge def.  */
+    /* Operand is defined by a loop header phi or by the reduction statement
+       itself when we are asking for the definition of the rhs of a nop
+       conversion.  In case of nested cycles we also may have uses of the
+       backedge def.  */
     case vect_reduction_def:
     case vect_double_reduction_def:
     case vect_nested_cycle:
     case vect_induction_def:
       gcc_assert (gimple_code (def_stmt_info->stmt) == GIMPLE_PHI
-		  || dt == vect_nested_cycle);
+		  || dt == vect_nested_cycle
+		  || (dt == vect_reduction_def
+		      && stmt_vinfo
+		      && is_nop_conversion_stmt (stmt_vinfo->stmt)));
       /* Fallthru.  */
 
     /* operand is defined inside the loop.  */
@@ -1616,7 +1622,7 @@  vect_get_vec_def_for_operand (tree op, stmt_vec_info stmt_vinfo, tree vectype)
       return vect_init_vector (stmt_vinfo, op, vector_type, NULL);
     }
   else
-    return vect_get_vec_def_for_operand_1 (def_stmt_info, dt);
+    return vect_get_vec_def_for_operand_1 (stmt_vinfo, def_stmt_info, dt);
 }
 
 
@@ -5359,6 +5365,17 @@  vectorizable_assignment (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
   else
     ncopies = vect_get_num_copies (loop_vinfo, vectype);
 
+  if (ncopies > 1 && is_nop_conversion_stmt (stmt_info->stmt))
+    {
+      tree lhs = gimple_assign_lhs (stmt_info->stmt);
+      stmt_vec_info reduc_info = loop_vinfo->lookup_single_use (lhs);
+      gimple *phi_def
+	= SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmt_info->stmt));
+      if (reduc_info && STMT_VINFO_REDUC_DEF (reduc_info)
+	  && gimple_code (phi_def) == GIMPLE_PHI)
+	ncopies = 1;
+    }
+
   gcc_assert (ncopies >= 1);
 
   if (!vect_is_simple_use (op, vinfo, &dt[0], &vectype_in))
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..387bca3d4433403185c7bbc4b81b3e6520dba531 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1515,7 +1515,8 @@  extern stmt_vec_info vect_finish_stmt_generation (stmt_vec_info, gimple *,
 						  gimple_stmt_iterator *);
 extern opt_result vect_mark_stmts_to_be_vectorized (loop_vec_info, bool *);
 extern tree vect_get_store_rhs (stmt_vec_info);
-extern tree vect_get_vec_def_for_operand_1 (stmt_vec_info, enum vect_def_type);
+extern tree vect_get_vec_def_for_operand_1 (stmt_vec_info, stmt_vec_info,
+					    enum vect_def_type);
 extern tree vect_get_vec_def_for_operand (tree, stmt_vec_info, tree = NULL);
 extern void vect_get_vec_defs (tree, tree, stmt_vec_info, vec<tree> *,
 			       vec<tree> *, slp_tree);
@@ -1620,6 +1621,7 @@  extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
 				   unsigned int, tree);
 extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
 				unsigned int, tree, unsigned int);
+extern bool is_nop_conversion_stmt (gimple *);
 
 /* Drive for loop transformation stage.  */
 extern class loop *vect_transform_loop (loop_vec_info);