Record leader nodes in the SLP graph (PR92516)
diff mbox series

Message ID mptimnk2pnk.fsf@arm.com
State New
Headers show
Series
  • Record leader nodes in the SLP graph (PR92516)
Related show

Commit Message

Richard Sandiford Nov. 16, 2019, 8:26 a.m. UTC
If stmt analysis failed for an SLP node, r278246 tried building the
node from scalars instead.  By that point we've already processed child
nodes (usually successfully) and might have made some of them the lead
nodes for their stmt list.  This is why we couldn't free the child nodes
when deciding to build from scalars:

  /* Don't remove and free the child nodes here, since they could be
     referenced by other structures.  The analysis and scheduling phases
     (need to) ignore child nodes of anything that isn't vect_internal_def.  */

The problem in this PR is that we (correctly) don't process the unused
child nodes again during the scheduling phase, which means that those
nodes won't become the leader again.  So some other node with same stmts
becomes the leader instead.  However, any internal-def child nodes of
this new leader won't have been processed during the analysis phase,
because we also (correctly) skip child nodes of non-leader nodes.

We talked on IRC about fixing this by sharing nodes between SLP
instances, so that it becomes a "proper" graph.  But that seems
like it could throw up some problems too, so I wanted to fix the
PR in a less invasive way first.

This patch therefore records the leader nodes during the analysis
phase and reuses that choice during scheduling.  When scheduling
a non-leader node, we first try to schedule the leader node and
then reuse its vector stmts.  At the moment, doing this means we
need to know the leader node's SLP instance, so the patch records
that in the SLP node too.

While there, I noticed that the two-operation handling returned
early without restoring the original def types.  That's really
an independent bug fix.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


2019-11-16  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	PR tree-optimization/92516
	* tree-vectorizer.h (_slp_tree::leader): New member variable.
	(_slp_tree::leader_instance): Likewise.
	* tree-vect-slp.c (vect_create_new_slp_node): Initialize them.
	(vect_slp_analyze_node_operations): When deferring to an existing
	leader, record that leader in the SLP node.
	(vect_slp_analyze_operations): When committing to keeping an
	SLP instance, record that leaders in it are themselves leaderless.
	Also record the SLP instance we should use to vectorize them.
	(vect_schedule_slp_instance): Remove the bst_map parameter.
	If the node has a leader, vectorize that first and reuse
	its vector stmts.  Restore stmt def types for two-operation SLP.
	(vect_schedule_slp): Update call accordingly, removing the bst_map.

gcc/testsuite/
	* g++.dg/vect/slp-pr92516.cc: New test.

Comments

Richard Biener Nov. 18, 2019, 10:02 a.m. UTC | #1
On Sat, 16 Nov 2019, Richard Sandiford wrote:

> If stmt analysis failed for an SLP node, r278246 tried building the
> node from scalars instead.  By that point we've already processed child
> nodes (usually successfully) and might have made some of them the lead
> nodes for their stmt list.  This is why we couldn't free the child nodes
> when deciding to build from scalars:
> 
>   /* Don't remove and free the child nodes here, since they could be
>      referenced by other structures.  The analysis and scheduling phases
>      (need to) ignore child nodes of anything that isn't vect_internal_def.  */
> 
> The problem in this PR is that we (correctly) don't process the unused
> child nodes again during the scheduling phase, which means that those
> nodes won't become the leader again.  So some other node with same stmts
> becomes the leader instead.  However, any internal-def child nodes of
> this new leader won't have been processed during the analysis phase,
> because we also (correctly) skip child nodes of non-leader nodes.
> 
> We talked on IRC about fixing this by sharing nodes between SLP
> instances, so that it becomes a "proper" graph.  But that seems
> like it could throw up some problems too, so I wanted to fix the
> PR in a less invasive way first.
> 
> This patch therefore records the leader nodes during the analysis
> phase and reuses that choice during scheduling.  When scheduling
> a non-leader node, we first try to schedule the leader node and
> then reuse its vector stmts.  At the moment, doing this means we
> need to know the leader node's SLP instance, so the patch records
> that in the SLP node too.
> 
> While there, I noticed that the two-operation handling returned
> early without restoring the original def types.  That's really
> an independent bug fix.

Can you split that out please?

For the rest I prefer the following which I am now fully testing
on x86_64-unknown-linux-gnu (vect.exp testing went fine).  As
discussed on IRC this makes the SLP graph nodes shared between
SLP instances (which it has in fact been, just "virtual" via
all the leader computations).  So SLP instances are now just
the collection of entries into the directed SLP graph.

I'm also throwing this on a SPEC build test.

Richard.

2019-11-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/92516
	* tree-vect-slp.c (vect_analyze_slp_instance): Add bst_map
	argument, hoist bst_map creation/destruction to ...
	(vect_analyze_slp): ... here, forming a true graph with
	SLP instances being the entries.
	(vect_detect_hybrid_slp_stmts): Remove wrapper.
	(vect_detect_hybrid_slp): Use one visited set for all
	graph entries.
	(vect_slp_analyze_node_operations): Simplify visited/lvisited
	to hash-sets of slp_tree.
	(vect_slp_analyze_operations): Likewise.
	(vect_bb_slp_scalar_cost): Remove wrapper.
	(vect_bb_vectorization_profitable_p): Use one visited set for
	all graph entries.
	(vect_schedule_slp_instance): Elide bst_map use.
	(vect_schedule_slp): Likewise.

	* g++.dg/vect/slp-pr92516.cc: New testcase.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 278389)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -2087,6 +2087,7 @@ calculate_unrolling_factor (poly_uint64
 
 static bool
 vect_analyze_slp_instance (vec_info *vinfo,
+			   scalar_stmts_to_slp_tree_map_t *bst_map,
 			   stmt_vec_info stmt_info, unsigned max_tree_size)
 {
   slp_instance new_instance;
@@ -2194,19 +2195,11 @@ vect_analyze_slp_instance (vec_info *vin
   /* Build the tree for the SLP instance.  */
   bool *matches = XALLOCAVEC (bool, group_size);
   unsigned npermutes = 0;
-  scalar_stmts_to_slp_tree_map_t *bst_map
-    = new scalar_stmts_to_slp_tree_map_t ();
   poly_uint64 max_nunits = nunits;
   unsigned tree_size = 0;
   node = vect_build_slp_tree (vinfo, scalar_stmts, group_size,
 			      &max_nunits, matches, &npermutes,
 			      &tree_size, bst_map);
-  /* The map keeps a reference on SLP nodes built, release that.  */
-  for (scalar_stmts_to_slp_tree_map_t::iterator it = bst_map->begin ();
-       it != bst_map->end (); ++it)
-    if ((*it).second)
-      vect_free_slp_tree ((*it).second, false);
-  delete bst_map;
   if (node != NULL)
     {
       /* If this is a reduction chain with a conversion in front
@@ -2394,7 +2387,7 @@ vect_analyze_slp_instance (vec_info *vin
 
 	  stmt_vec_info rest = vect_split_slp_store_group (stmt_info,
 							   group1_size);
-	  bool res = vect_analyze_slp_instance (vinfo, stmt_info,
+	  bool res = vect_analyze_slp_instance (vinfo, bst_map, stmt_info,
 						max_tree_size);
 	  /* If the first non-match was in the middle of a vector,
 	     skip the rest of that vector.  */
@@ -2405,7 +2398,8 @@ vect_analyze_slp_instance (vec_info *vin
 		rest = vect_split_slp_store_group (rest, const_nunits);
 	    }
 	  if (i < group_size)
-	    res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
+	    res |= vect_analyze_slp_instance (vinfo, bst_map,
+					      rest, max_tree_size);
 	  return res;
 	}
       /* Even though the first vector did not all match, we might be able to SLP
@@ -2427,9 +2421,12 @@ vect_analyze_slp (vec_info *vinfo, unsig
 
   DUMP_VECT_SCOPE ("vect_analyze_slp");
 
+  scalar_stmts_to_slp_tree_map_t *bst_map
+    = new scalar_stmts_to_slp_tree_map_t ();
+
   /* Find SLP sequences starting from groups of grouped stores.  */
   FOR_EACH_VEC_ELT (vinfo->grouped_stores, i, first_element)
-    vect_analyze_slp_instance (vinfo, first_element, max_tree_size);
+    vect_analyze_slp_instance (vinfo, bst_map, first_element, max_tree_size);
 
   if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
     {
@@ -2437,7 +2434,7 @@ vect_analyze_slp (vec_info *vinfo, unsig
 	{
 	  /* Find SLP sequences starting from reduction chains.  */
 	  FOR_EACH_VEC_ELT (loop_vinfo->reduction_chains, i, first_element)
-	    if (! vect_analyze_slp_instance (vinfo, first_element,
+	    if (! vect_analyze_slp_instance (vinfo, bst_map, first_element,
 					     max_tree_size))
 	      {
 		/* Dissolve reduction chain group.  */
@@ -2459,10 +2456,17 @@ vect_analyze_slp (vec_info *vinfo, unsig
 
       /* Find SLP sequences starting from groups of reductions.  */
       if (loop_vinfo->reductions.length () > 1)
-	vect_analyze_slp_instance (vinfo, loop_vinfo->reductions[0],
+	vect_analyze_slp_instance (vinfo, bst_map, loop_vinfo->reductions[0],
 				   max_tree_size);
     }
 
+  /* The map keeps a reference on SLP nodes built, release that.  */
+  for (scalar_stmts_to_slp_tree_map_t::iterator it = bst_map->begin ();
+       it != bst_map->end (); ++it)
+    if ((*it).second)
+      vect_free_slp_tree ((*it).second, false);
+  delete bst_map;
+
   return opt_result::success ();
 }
 
@@ -2589,13 +2593,6 @@ vect_detect_hybrid_slp_stmts (slp_tree n
 	vect_detect_hybrid_slp_stmts (child, i, stype, visited);
 }
 
-static void
-vect_detect_hybrid_slp_stmts (slp_tree node, unsigned i, slp_vect_type stype)
-{
-  hash_map<slp_tree, unsigned> visited;
-  vect_detect_hybrid_slp_stmts (node, i, stype, visited);
-}
-
 /* Helpers for vect_detect_hybrid_slp walking pattern stmt uses.  */
 
 static tree
@@ -2678,11 +2675,12 @@ vect_detect_hybrid_slp (loop_vec_info lo
   /* Then walk the SLP instance trees marking stmts with uses in
      non-SLP stmts as hybrid, also propagating hybrid down the
      SLP tree, collecting the above info on-the-fly.  */
+  hash_map<slp_tree, unsigned> visited;
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       for (unsigned i = 0; i < SLP_INSTANCE_GROUP_SIZE (instance); ++i)
 	vect_detect_hybrid_slp_stmts (SLP_INSTANCE_TREE (instance),
-				      i, pure_slp);
+				      i, pure_slp, visited);
     }
 }
 
@@ -2830,8 +2828,8 @@ vect_slp_convert_to_external (vec_info *
 static bool
 vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
 				  slp_instance node_instance,
-				  scalar_stmts_to_slp_tree_map_t *visited,
-				  scalar_stmts_to_slp_tree_map_t *lvisited,
+				  hash_set<slp_tree> &visited,
+				  hash_set<slp_tree> &lvisited,
 				  stmt_vector_for_cost *cost_vec)
 {
   int i, j;
@@ -2841,27 +2839,13 @@ vect_slp_analyze_node_operations (vec_in
     return true;
 
   /* If we already analyzed the exact same set of scalar stmts we're done.
-     We share the generated vector stmts for those.  */
-  slp_tree *leader;
-  if ((leader = visited->get (SLP_TREE_SCALAR_STMTS (node)))
-      || (leader = lvisited->get (SLP_TREE_SCALAR_STMTS (node))))
-    {
-      SLP_TREE_NUMBER_OF_VEC_STMTS (node)
-	= SLP_TREE_NUMBER_OF_VEC_STMTS (*leader);
-      /* Cope with cases in which we made a late decision to build the
-	 node from scalars.  */
-      if (SLP_TREE_DEF_TYPE (*leader) == vect_external_def
-	  && vect_slp_convert_to_external (vinfo, node, node_instance))
-	;
-      else
-	gcc_assert (SLP_TREE_DEF_TYPE (node) == SLP_TREE_DEF_TYPE (*leader));
-      return true;
-    }
-
-  /* The SLP graph is acyclic so not caching whether we failed or succeeded
+     We share the generated vector stmts for those.
+     The SLP graph is acyclic so not caching whether we failed or succeeded
      doesn't result in any issue since we throw away the lvisited set
      when we fail.  */
-  lvisited->put (SLP_TREE_SCALAR_STMTS (node).copy (), node);
+  if (visited.contains (node)
+      || lvisited.add (node))
+    return true;
 
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
     if (!vect_slp_analyze_node_operations (vinfo, child, node_instance,
@@ -2934,16 +2918,15 @@ vect_slp_analyze_operations (vec_info *v
 
   DUMP_VECT_SCOPE ("vect_slp_analyze_operations");
 
-  scalar_stmts_to_slp_tree_map_t *visited
-    = new scalar_stmts_to_slp_tree_map_t ();
+  hash_set<slp_tree> visited;
   for (i = 0; vinfo->slp_instances.iterate (i, &instance); )
     {
-      scalar_stmts_to_slp_tree_map_t lvisited;
+      hash_set<slp_tree> lvisited;
       stmt_vector_for_cost cost_vec;
       cost_vec.create (2);
       if (!vect_slp_analyze_node_operations (vinfo,
 					     SLP_INSTANCE_TREE (instance),
-					     instance, visited, &lvisited,
+					     instance, visited, lvisited,
 					     &cost_vec))
         {
 	  slp_tree node = SLP_INSTANCE_TREE (instance);
@@ -2958,16 +2941,15 @@ vect_slp_analyze_operations (vec_info *v
 	}
       else
 	{
-	  for (scalar_stmts_to_slp_tree_map_t::iterator x = lvisited.begin();
+	  for (hash_set<slp_tree>::iterator x = lvisited.begin();
 	       x != lvisited.end(); ++x)
-	    visited->put ((*x).first.copy (), (*x).second);
+	    visited.add (*x);
 	  i++;
 
 	  add_stmt_costs (vinfo->target_cost_data, &cost_vec);
 	  cost_vec.release ();
 	}
     }
-  delete visited;
 
   return !vinfo->slp_instances.is_empty ();
 }
@@ -3058,15 +3040,6 @@ vect_bb_slp_scalar_cost (basic_block bb,
     }
 }
 
-static void 
-vect_bb_slp_scalar_cost (basic_block bb,
-			 slp_tree node, vec<bool, va_heap> *life,
-			 stmt_vector_for_cost *cost_vec)
-{
-  hash_set<slp_tree> visited;
-  vect_bb_slp_scalar_cost (bb, node, life, cost_vec, visited);
-}
-
 /* Check if vectorization of the basic block is profitable.  */
 
 static bool
@@ -3081,13 +3054,14 @@ vect_bb_vectorization_profitable_p (bb_v
   /* Calculate scalar cost.  */
   stmt_vector_for_cost scalar_costs;
   scalar_costs.create (0);
+  hash_set<slp_tree> visited;
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       auto_vec<bool, 20> life;
       life.safe_grow_cleared (SLP_INSTANCE_GROUP_SIZE (instance));
       vect_bb_slp_scalar_cost (BB_VINFO_BB (bb_vinfo),
 			       SLP_INSTANCE_TREE (instance),
-			       &life, &scalar_costs);
+			       &life, &scalar_costs, visited);
     }
   void *target_cost_data = init_cost (NULL);
   add_stmt_costs (target_cost_data, &scalar_costs);
@@ -4128,8 +4102,7 @@ vect_transform_slp_perm_load (slp_tree n
 /* Vectorize SLP instance tree in postorder.  */
 
 static void
-vect_schedule_slp_instance (slp_tree node, slp_instance instance,
-			    scalar_stmts_to_slp_tree_map_t *bst_map)
+vect_schedule_slp_instance (slp_tree node, slp_instance instance)
 {
   gimple_stmt_iterator si;
   stmt_vec_info stmt_info;
@@ -4146,17 +4119,8 @@ vect_schedule_slp_instance (slp_tree nod
   if (SLP_TREE_VEC_STMTS (node).exists ())
     return;
 
-  /* See if we have already vectorized the same set of stmts and reuse their
-     vectorized stmts across instances.  */
-  if (slp_tree *leader = bst_map->get (SLP_TREE_SCALAR_STMTS (node)))
-    {
-      SLP_TREE_VEC_STMTS (node).safe_splice (SLP_TREE_VEC_STMTS (*leader));
-      return;
-    }
-
-  bst_map->put (SLP_TREE_SCALAR_STMTS (node).copy (), node);
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-    vect_schedule_slp_instance (child, instance, bst_map);
+    vect_schedule_slp_instance (child, instance);
 
   /* Push SLP node def-type to stmts.  */
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
@@ -4376,14 +4340,12 @@ vect_schedule_slp (vec_info *vinfo)
   slp_instance instance;
   unsigned int i;
 
-  scalar_stmts_to_slp_tree_map_t *bst_map
-    = new scalar_stmts_to_slp_tree_map_t ();
   slp_instances = vinfo->slp_instances;
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       slp_tree node = SLP_INSTANCE_TREE (instance);
       /* Schedule the tree of INSTANCE.  */
-      vect_schedule_slp_instance (node, instance, bst_map);
+      vect_schedule_slp_instance (node, instance);
 
       if (SLP_INSTANCE_ROOT_STMT (instance))
 	vectorize_slp_instance_root_stmt (node, instance);
@@ -4392,7 +4354,6 @@ vect_schedule_slp (vec_info *vinfo)
 	dump_printf_loc (MSG_NOTE, vect_location,
                          "vectorizing stmts using SLP.\n");
     }
-  delete bst_map;
 
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
Index: gcc/testsuite/g++.dg/vect/slp-pr92516.cc
===================================================================
--- gcc/testsuite/g++.dg/vect/slp-pr92516.cc	(nonexistent)
+++ gcc/testsuite/g++.dg/vect/slp-pr92516.cc	(working copy)
@@ -0,0 +1,43 @@
+// { dg-do compile }
+// { dg-require-effective-target c++14 }
+
+class a {
+public:
+  typedef int b;
+  operator b();
+};
+class c {
+public:
+  constexpr int m_fn1() const;
+  constexpr int d() const;
+  int e;
+  int f;
+};
+constexpr int c::m_fn1() const { return e; }
+constexpr int c::d() const { return f; }
+class g {
+public:
+  g();
+  constexpr void i(const c &) noexcept;
+  int j;
+  int k;
+  int l;
+  int m;
+};
+constexpr void g::i(const c &n) noexcept {
+  int v = l - j, h = m - k;
+  j = n.m_fn1() - v / 2;
+  k = n.d() - h / 2;
+  l = j + v;
+  m = k + h;
+}
+class o {
+  void m_fn4() const;
+  a p;
+} r;
+void o::m_fn4() const {
+  g q;
+  c t;
+  q.i(t);
+  r.p || 0;
+}
Richard Biener Nov. 18, 2019, 11:56 a.m. UTC | #2
On Mon, 18 Nov 2019, Richard Biener wrote:

> On Sat, 16 Nov 2019, Richard Sandiford wrote:
> 
> > If stmt analysis failed for an SLP node, r278246 tried building the
> > node from scalars instead.  By that point we've already processed child
> > nodes (usually successfully) and might have made some of them the lead
> > nodes for their stmt list.  This is why we couldn't free the child nodes
> > when deciding to build from scalars:
> > 
> >   /* Don't remove and free the child nodes here, since they could be
> >      referenced by other structures.  The analysis and scheduling phases
> >      (need to) ignore child nodes of anything that isn't vect_internal_def.  */
> > 
> > The problem in this PR is that we (correctly) don't process the unused
> > child nodes again during the scheduling phase, which means that those
> > nodes won't become the leader again.  So some other node with same stmts
> > becomes the leader instead.  However, any internal-def child nodes of
> > this new leader won't have been processed during the analysis phase,
> > because we also (correctly) skip child nodes of non-leader nodes.
> > 
> > We talked on IRC about fixing this by sharing nodes between SLP
> > instances, so that it becomes a "proper" graph.  But that seems
> > like it could throw up some problems too, so I wanted to fix the
> > PR in a less invasive way first.
> > 
> > This patch therefore records the leader nodes during the analysis
> > phase and reuses that choice during scheduling.  When scheduling
> > a non-leader node, we first try to schedule the leader node and
> > then reuse its vector stmts.  At the moment, doing this means we
> > need to know the leader node's SLP instance, so the patch records
> > that in the SLP node too.
> > 
> > While there, I noticed that the two-operation handling returned
> > early without restoring the original def types.  That's really
> > an independent bug fix.
> 
> Can you split that out please?
> 
> For the rest I prefer the following which I am now fully testing
> on x86_64-unknown-linux-gnu (vect.exp testing went fine).  As
> discussed on IRC this makes the SLP graph nodes shared between
> SLP instances (which it has in fact been, just "virtual" via
> all the leader computations).  So SLP instances are now just
> the collection of entries into the directed SLP graph.
> 
> I'm also throwing this on a SPEC build test.

Bootstrap & regtest went fine.  SPEC building reveals a latent
CTOR vectorization issue fixed by the pach below.

Re-testing the combined patch now.

Richard.

2019-11-18  Richard Biener  <rguenther@suse.de>

	* tree-vect-slp.c (vect_analyze_slp_instance): When a CTOR
	was vectorized with just external refs fail.

	* gcc.dg/vect/vect-ctor-1.c: New testcase.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 278389)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -2260,6 +2253,18 @@ vect_analyze_slp_instance (vec_info *vin
 	  matches[group_size / const_max_nunits * const_max_nunits] = false;
 	  vect_free_slp_tree (node, false);
 	}
+      else if (constructor
+	       && SLP_TREE_DEF_TYPE (node) != vect_internal_def)
+	{
+	  /* CONSTRUCTOR vectorization relies on a vector stmt being
+	     generated, that doesn't work for fully external ones.  */
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "Build SLP failed: CONSTRUCTOR of external "
+			     "or constant elements\n");
+	  vect_free_slp_tree (node, false);
+	  return false;
+	}
       else
 	{
 	  /* Create a new SLP instance.  */
Index: gcc/testsuite/gcc.dg/vect/vect-ctor-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-ctor-1.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/vect/vect-ctor-1.c	(working copy)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3" } */
+/* { dg-additional-options "-mavx2" { target { i?86-*-* x86_64-*-* } } } */
+
+typedef struct {
+    unsigned short mprr_2[5][16][16];
+} ImageParameters;
+int s[16][2];
+void intrapred_luma_16x16(ImageParameters *img, int s0)
+{
+  for (int j=0; j < 16; j++)
+    for (int i=0; i < 16; i++)
+      {
+	img->mprr_2[1 ][j][i]=s[j][1];
+	img->mprr_2[2 ][j][i]=s0;
+      }
+}
Richard Sandiford Nov. 20, 2019, 12:16 p.m. UTC | #3
Richard Biener <rguenther@suse.de> writes:
> On Sat, 16 Nov 2019, Richard Sandiford wrote:
>> If stmt analysis failed for an SLP node, r278246 tried building the
>> node from scalars instead.  By that point we've already processed child
>> nodes (usually successfully) and might have made some of them the lead
>> nodes for their stmt list.  This is why we couldn't free the child nodes
>> when deciding to build from scalars:
>> 
>>   /* Don't remove and free the child nodes here, since they could be
>>      referenced by other structures.  The analysis and scheduling phases
>>      (need to) ignore child nodes of anything that isn't vect_internal_def.  */
>> 
>> The problem in this PR is that we (correctly) don't process the unused
>> child nodes again during the scheduling phase, which means that those
>> nodes won't become the leader again.  So some other node with same stmts
>> becomes the leader instead.  However, any internal-def child nodes of
>> this new leader won't have been processed during the analysis phase,
>> because we also (correctly) skip child nodes of non-leader nodes.
>> 
>> We talked on IRC about fixing this by sharing nodes between SLP
>> instances, so that it becomes a "proper" graph.  But that seems
>> like it could throw up some problems too, so I wanted to fix the
>> PR in a less invasive way first.
>> 
>> This patch therefore records the leader nodes during the analysis
>> phase and reuses that choice during scheduling.  When scheduling
>> a non-leader node, we first try to schedule the leader node and
>> then reuse its vector stmts.  At the moment, doing this means we
>> need to know the leader node's SLP instance, so the patch records
>> that in the SLP node too.
>> 
>> While there, I noticed that the two-operation handling returned
>> early without restoring the original def types.  That's really
>> an independent bug fix.
>
> Can you split that out please?

Sure, done as below.  Tested on aarch64-linux-gnu and x86_64-linux-gnu.
OK to install?

> For the rest I prefer the following which I am now fully testing
> on x86_64-unknown-linux-gnu (vect.exp testing went fine).  As
> discussed on IRC this makes the SLP graph nodes shared between
> SLP instances (which it has in fact been, just "virtual" via
> all the leader computations).  So SLP instances are now just
> the collection of entries into the directed SLP graph.

Looks great.  Obviously I chickened out too early :-)

Thanks,
Richard


2019-11-19  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vect-slp.c (vect_schedule_slp_instance): Restore stmt
	def types for two-operation SLP.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2019-11-20 09:48:57.145101295 +0000
+++ gcc/tree-vect-slp.c	2019-11-20 12:11:56.818216093 +0000
@@ -4165,6 +4165,7 @@ vect_schedule_slp_instance (slp_tree nod
 
   /* Handle two-operation SLP nodes by vectorizing the group with
      both operations and then performing a merge.  */
+  bool done_p = false;
   if (SLP_TREE_TWO_OPERATORS (node))
     {
       gassign *stmt = as_a <gassign *> (stmt_info->stmt);
@@ -4235,10 +4236,11 @@ vect_schedule_slp_instance (slp_tree nod
 	    }
 	  v0.release ();
 	  v1.release ();
-	  return;
+	  done_p = true;
 	}
     }
-  vect_transform_stmt (stmt_info, &si, node, instance);
+  if (!done_p)
+    vect_transform_stmt (stmt_info, &si, node, instance);
 
   /* Restore stmt def-types.  */
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
Richard Biener Nov. 20, 2019, 1:53 p.m. UTC | #4
On Wed, 20 Nov 2019, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Sat, 16 Nov 2019, Richard Sandiford wrote:
> >> If stmt analysis failed for an SLP node, r278246 tried building the
> >> node from scalars instead.  By that point we've already processed child
> >> nodes (usually successfully) and might have made some of them the lead
> >> nodes for their stmt list.  This is why we couldn't free the child nodes
> >> when deciding to build from scalars:
> >> 
> >>   /* Don't remove and free the child nodes here, since they could be
> >>      referenced by other structures.  The analysis and scheduling phases
> >>      (need to) ignore child nodes of anything that isn't vect_internal_def.  */
> >> 
> >> The problem in this PR is that we (correctly) don't process the unused
> >> child nodes again during the scheduling phase, which means that those
> >> nodes won't become the leader again.  So some other node with same stmts
> >> becomes the leader instead.  However, any internal-def child nodes of
> >> this new leader won't have been processed during the analysis phase,
> >> because we also (correctly) skip child nodes of non-leader nodes.
> >> 
> >> We talked on IRC about fixing this by sharing nodes between SLP
> >> instances, so that it becomes a "proper" graph.  But that seems
> >> like it could throw up some problems too, so I wanted to fix the
> >> PR in a less invasive way first.
> >> 
> >> This patch therefore records the leader nodes during the analysis
> >> phase and reuses that choice during scheduling.  When scheduling
> >> a non-leader node, we first try to schedule the leader node and
> >> then reuse its vector stmts.  At the moment, doing this means we
> >> need to know the leader node's SLP instance, so the patch records
> >> that in the SLP node too.
> >> 
> >> While there, I noticed that the two-operation handling returned
> >> early without restoring the original def types.  That's really
> >> an independent bug fix.
> >
> > Can you split that out please?
> 
> Sure, done as below.  Tested on aarch64-linux-gnu and x86_64-linux-gnu.
> OK to install?

OK.

Thanks,
Richard.
 
> > For the rest I prefer the following which I am now fully testing
> > on x86_64-unknown-linux-gnu (vect.exp testing went fine).  As
> > discussed on IRC this makes the SLP graph nodes shared between
> > SLP instances (which it has in fact been, just "virtual" via
> > all the leader computations).  So SLP instances are now just
> > the collection of entries into the directed SLP graph.
> 
> Looks great.  Obviously I chickened out too early :-)
> 
> Thanks,
> Richard
> 
> 
> 2019-11-19  Richard Sandiford  <richard.sandiford@arm.com>
> 
> gcc/
> 	* tree-vect-slp.c (vect_schedule_slp_instance): Restore stmt
> 	def types for two-operation SLP.
> 
> Index: gcc/tree-vect-slp.c
> ===================================================================
> --- gcc/tree-vect-slp.c	2019-11-20 09:48:57.145101295 +0000
> +++ gcc/tree-vect-slp.c	2019-11-20 12:11:56.818216093 +0000
> @@ -4165,6 +4165,7 @@ vect_schedule_slp_instance (slp_tree nod
>  
>    /* Handle two-operation SLP nodes by vectorizing the group with
>       both operations and then performing a merge.  */
> +  bool done_p = false;
>    if (SLP_TREE_TWO_OPERATORS (node))
>      {
>        gassign *stmt = as_a <gassign *> (stmt_info->stmt);
> @@ -4235,10 +4236,11 @@ vect_schedule_slp_instance (slp_tree nod
>  	    }
>  	  v0.release ();
>  	  v1.release ();
> -	  return;
> +	  done_p = true;
>  	}
>      }
> -  vect_transform_stmt (stmt_info, &si, node, instance);
> +  if (!done_p)
> +    vect_transform_stmt (stmt_info, &si, node, instance);
>  
>    /* Restore stmt def-types.  */
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
>

Patch
diff mbox series

Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2019-11-14 15:14:29.239572543 +0000
+++ gcc/tree-vectorizer.h	2019-11-16 08:22:50.620257693 +0000
@@ -128,6 +128,11 @@  struct _slp_tree {
   vec<unsigned> load_permutation;
   /* Vectorized stmt/s.  */
   vec<stmt_vec_info> vec_stmts;
+  /* If non-null, the leader node that provides the vectorized stmts.  */
+  _slp_tree *leader;
+  /* If the node is a leader node, this is the slp_instance to use when
+     vectorizing it.  */
+  class _slp_instance *leader_instance;
   /* Number of vector stmts that are created to replace the group of scalar
      stmts. It is calculated during the transformation phase as the number of
      scalar elements in one scalar iteration (GROUP_SIZE) multiplied by VF
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2019-11-14 15:33:43.503740636 +0000
+++ gcc/tree-vect-slp.c	2019-11-16 08:22:50.612257749 +0000
@@ -129,6 +129,8 @@  vect_create_new_slp_node (vec<stmt_vec_i
   SLP_TREE_LOAD_PERMUTATION (node) = vNULL;
   SLP_TREE_TWO_OPERATORS (node) = false;
   SLP_TREE_DEF_TYPE (node) = vect_internal_def;
+  node->leader = NULL;
+  node->leader_instance = NULL;
   node->refcnt = 1;
   node->max_nunits = 1;
 
@@ -155,6 +157,8 @@  vect_create_new_slp_node (vec<tree> ops)
   SLP_TREE_LOAD_PERMUTATION (node) = vNULL;
   SLP_TREE_TWO_OPERATORS (node) = false;
   SLP_TREE_DEF_TYPE (node) = vect_external_def;
+  node->leader = NULL;
+  node->leader_instance = NULL;
   node->refcnt = 1;
   node->max_nunits = 1;
 
@@ -2764,6 +2768,7 @@  vect_slp_analyze_node_operations (vec_in
   if ((leader = visited->get (SLP_TREE_SCALAR_STMTS (node)))
       || (leader = lvisited->get (SLP_TREE_SCALAR_STMTS (node))))
     {
+      node->leader = *leader;
       SLP_TREE_NUMBER_OF_VEC_STMTS (node)
 	= SLP_TREE_NUMBER_OF_VEC_STMTS (*leader);
       /* Cope with cases in which we made a late decision to build the
@@ -2878,7 +2883,11 @@  vect_slp_analyze_operations (vec_info *v
 	{
 	  for (scalar_stmts_to_slp_tree_map_t::iterator x = lvisited.begin();
 	       x != lvisited.end(); ++x)
-	    visited->put ((*x).first.copy (), (*x).second);
+	    {
+	      visited->put ((*x).first.copy (), (*x).second);
+	      (*x).second->leader = NULL;
+	      (*x).second->leader_instance = instance;
+	    }
 	  i++;
 
 	  add_stmt_costs (vinfo->target_cost_data, &cost_vec);
@@ -4046,8 +4055,7 @@  vect_transform_slp_perm_load (slp_tree n
 /* Vectorize SLP instance tree in postorder.  */
 
 static void
-vect_schedule_slp_instance (slp_tree node, slp_instance instance,
-			    scalar_stmts_to_slp_tree_map_t *bst_map)
+vect_schedule_slp_instance (slp_tree node, slp_instance instance)
 {
   gimple_stmt_iterator si;
   stmt_vec_info stmt_info;
@@ -4066,15 +4074,18 @@  vect_schedule_slp_instance (slp_tree nod
 
   /* See if we have already vectorized the same set of stmts and reuse their
      vectorized stmts across instances.  */
-  if (slp_tree *leader = bst_map->get (SLP_TREE_SCALAR_STMTS (node)))
+  if (node->leader)
     {
-      SLP_TREE_VEC_STMTS (node).safe_splice (SLP_TREE_VEC_STMTS (*leader));
+      gcc_assert (!node->leader->leader);
+      vect_schedule_slp_instance (node->leader, node->leader->leader_instance);
+      SLP_TREE_VEC_STMTS (node).safe_splice
+	(SLP_TREE_VEC_STMTS (node->leader));
       return;
     }
+  gcc_assert (instance == node->leader_instance);
 
-  bst_map->put (SLP_TREE_SCALAR_STMTS (node).copy (), node);
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-    vect_schedule_slp_instance (child, instance, bst_map);
+    vect_schedule_slp_instance (child, instance);
 
   /* Push SLP node def-type to stmts.  */
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
@@ -4107,6 +4118,7 @@  vect_schedule_slp_instance (slp_tree nod
 
   /* Handle two-operation SLP nodes by vectorizing the group with
      both operations and then performing a merge.  */
+  bool done_p = false;
   if (SLP_TREE_TWO_OPERATORS (node))
     {
       gassign *stmt = as_a <gassign *> (stmt_info->stmt);
@@ -4177,10 +4189,11 @@  vect_schedule_slp_instance (slp_tree nod
 	    }
 	  v0.release ();
 	  v1.release ();
-	  return;
+	  done_p = true;
 	}
     }
-  vect_transform_stmt (stmt_info, &si, node, instance);
+  if (!done_p)
+    vect_transform_stmt (stmt_info, &si, node, instance);
 
   /* Restore stmt def-types.  */
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
@@ -4294,14 +4307,12 @@  vect_schedule_slp (vec_info *vinfo)
   slp_instance instance;
   unsigned int i;
 
-  scalar_stmts_to_slp_tree_map_t *bst_map
-    = new scalar_stmts_to_slp_tree_map_t ();
   slp_instances = vinfo->slp_instances;
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
       slp_tree node = SLP_INSTANCE_TREE (instance);
       /* Schedule the tree of INSTANCE.  */
-      vect_schedule_slp_instance (node, instance, bst_map);
+      vect_schedule_slp_instance (node, instance);
 
       if (SLP_INSTANCE_ROOT_STMT (instance))
 	vectorize_slp_instance_root_stmt (node, instance);
@@ -4310,7 +4321,6 @@  vect_schedule_slp (vec_info *vinfo)
 	dump_printf_loc (MSG_NOTE, vect_location,
                          "vectorizing stmts using SLP.\n");
     }
-  delete bst_map;
 
   FOR_EACH_VEC_ELT (slp_instances, i, instance)
     {
Index: gcc/testsuite/g++.dg/vect/slp-pr92516.cc
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/g++.dg/vect/slp-pr92516.cc	2019-11-16 08:22:50.612257749 +0000
@@ -0,0 +1,43 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target c++14 } */
+
+class a {
+public:
+  typedef int b;
+  operator b();
+};
+class c {
+public:
+  constexpr int m_fn1() const;
+  constexpr int d() const;
+  int e;
+  int f;
+};
+constexpr int c::m_fn1() const { return e; }
+constexpr int c::d() const { return f; }
+class g {
+public:
+  g();
+  constexpr void i(const c &) noexcept;
+  int j;
+  int k;
+  int l;
+  int m;
+};
+constexpr void g::i(const c &n) noexcept {
+  int v = l - j, h = m - k;
+  j = n.m_fn1() - v / 2;
+  k = n.d() - h / 2;
+  l = j + v;
+  m = k + h;
+}
+class o {
+  void m_fn4() const;
+  a p;
+} r;
+void o::m_fn4() const {
+  g q;
+  c t;
+  q.i(t);
+  r.p || 0;
+}