diff mbox series

[v2,3/16] middle-end Add basic SLP pattern matching scaffolding.

Message ID 20200925142753.GA13692@arm.com
State New
Headers show
Series middle-end Add support for SLP vectorization of complex number instructions. | expand

Commit Message

Tamar Christina Sept. 25, 2020, 2:27 p.m. UTC
Hi All,

This patch adds the basic infrastructure for doing pattern matching on SLP trees.
This is done immediately after the SLP tree creation because it can change the
shape of the tree in radical ways and so we would like to do it before any
analysis is performed on the tree.

A new file tree-vect-slp-patterns.c is added which contains all the code for
pattern matching on SLP trees.

This cover letter is short because the changes are heavily commented.

All pattern matchers need to implement the abstract type VectPatternMatch.
The VectSimplePatternMatch abstract class provides some default functionality
for pattern matchers that need to rebuild nodes.

The pattern matcher requires if replacing a statement in a node, that ALL
statements be replaced.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* Makefile.in (tree-vect-slp-patterns.o): New.
	* doc/passes.texi: Update documentation.
	* tree-vect-slp.c (vect_match_slp_patterns_2, vect_match_slp_patterns):
	New.
	(vect_analyze_slp_instance): Call pattern matcher.
	* tree-vectorizer.h (class VectPatternMatch, class VectPattern): New.
	* tree-vect-slp-patterns.c: New file.

--

Comments

Richard Biener Sept. 28, 2020, 12:36 p.m. UTC | #1
On Fri, 25 Sep 2020, Tamar Christina wrote:

> Hi All,
> 
> This patch adds the basic infrastructure for doing pattern matching on SLP trees.
> This is done immediately after the SLP tree creation because it can change the
> shape of the tree in radical ways and so we would like to do it before any
> analysis is performed on the tree.
> 
> A new file tree-vect-slp-patterns.c is added which contains all the code for
> pattern matching on SLP trees.
> 
> This cover letter is short because the changes are heavily commented.
> 
> All pattern matchers need to implement the abstract type VectPatternMatch.
> The VectSimplePatternMatch abstract class provides some default functionality
> for pattern matchers that need to rebuild nodes.
> 
> The pattern matcher requires if replacing a statement in a node, that ALL
> statements be replaced.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

+    gcall *build ()
+    {
+      stmt_vec_info stmt_info;
+

please define functions out-of-line (apart from the 1-liners)

+      /* We have to explicitly mark the old statement as unused because 
during
+        statement analysis the original and new pattern statement may 
require
+        different level of unrolling.  As an example add/sub when 
vectorized
+        without a pattern requires 4 copies, whereas with a COMPLEX_ADD 
pattern
+        this only requires 2 copies and the two statement will be treated 
as
+        hand unrolled.  That means that the analysis won't happen as 
it'll find
+        a mismatch.  So we don't analyze the old statement and if we end 
up
+        needing it, e.g. SLP fails then we have to quickly re-analyze it.  
*/
+      STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
+      STMT_VINFO_RELATED_STMT (call_stmt_info) = stmt_info;

so this means all uses have to be inside the pattern as otherwise
there may be even non-SLP uses.  vect_mark_pattern_stmts supports
detecting patterns of patterns, I suppose the two-phase analysis
for SLP patterns does not support this right now?

+      SLP_TREE_CODE (this->m_node) = gimple_expr_code (call_stmt);;

double ;, just make it CALL_EXPR literally (or leave it ERROR_MARK)

You seem to do in-place changing of the SLP node you match off?

@@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info *vinfo,
                              &tree_size, bst_map);
   if (node != NULL)
     {
+      /* Temporarily allow add_stmt calls again.  */
+      vinfo->stmt_vec_info_ro = false;
+
+      /* See if any patterns can be found in the constructed SLP tree
+        before we do any analysis on it.  */
+      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
+                              matches, &npermutes, &tree_size, bst_map);
+
+      /* After this no more add_stmt calls are allowed.  */
+      vinfo->stmt_vec_info_ro = true;
+

I think this is a bit early to match patterns - I'd defer it to the
point where all entries into the same SLP subgraph are analyzed,
thus somewhere at the end of vect_analyze_slp loop over all
instances and match patterns?  That way phases are more clearly
separated.

Note that fiddling with vinfo->stmt_vec_info_ro is a bit ugly,
maybe add a ->add_pattern_stmt (gimple *pattern_stmt, stmt_vec_info 
orig_stmt) variant that also sets STMT_VINFO_RELATED_STMT
but doesn't check !stmt_vec_info_ro.  That could be used from
tree-vect-patterns.c as well and we could set stmt_vec_info_ro
earlier.

+  VectPattern *pattern = patt_fn (node, vinfo);
+  uint8_t n = pattern->get_arity ();
+
+  if (group_size % n != 0)
+    {
+      delete pattern;

seems to require VectPattern allocation even upon failure, I suggest
to return NULL then to avoid excessive allocations.

+      if (!pattern->matches (stmt_infos, i))
+       {
+         /* We can only do replacements for entire groups, we must 
replace all
+            statements in a node as the argument list/children may not 
have
+            equal height then.  Operations that don't rewrite the 
arguments
+            may be safe to do, so perhaps paramatrise it.  */
+
+         found_p = false;

I find it a bit ugly to iterate over "unrolls" in the machinery
rather than the individual pattern matcher which might have an
easier and in particular cheaper job here.  Since you require
all lanes to match the same pattern anyway.   Not sure if your
later patches support say, mixing complex add with different
rotate in the same SLP node.  Note the ultimate idea in the end
is that a SLP node can, of course, be split into two [but at
this point the vector type / unroll factor is not final so
general splitting at vector boundary is not desired yet].
The split can be undone for consumers by inserting a
VEC_PERM node (which should semantically be a concat + select)

+      tree type = gimple_expr_type (STMT_VINFO_STMT (stmt_info));
+      tree vectype = get_vectype_for_scalar_type (vinfo, type, node);

use 

      tree vectype = SLP_TREE_VECTYPE (node);

generally avoid looking at scalar stmts, iff then look at 
SLP_TREE_REPRESENTATIVE - all lanes have uniform operations
applied to (but the scalar stmts may not appear to do so!  the
scalar stmts merely stand for their 'def').

+  /* Perform recursive matching, it's important to do this after matching 
things
+     in the current node as the matches here may re-order the nodes below 
it.
+     As such the pattern that needs to be subsequently match may change.  
*/
+
+  if (SLP_TREE_CHILDREN (node).exists ()) {
+    slp_tree child;
+    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+      found_rec_p |= vect_match_slp_patterns_2 (child, vinfo, group_size,
+                                               patt_fn, max_nunits, 
matches,
+                                               npermutes, tree_size, 
bst_map);
+  }


you definitely need a visited set - you are walking a graph and
nodes can appear along multiple paths!

+      vect_mark_slp_stmts_relevant (node);

that walks the whole subgraph but if you need to do anything you
at most want to touch the node itself, no?

To make patterns-of-patterns viable you need to do all parts
of the walk in post-order.  What breaks if you do ->matches/->validate
in post-order?  I think that would be more future-proof.

Otherwise this looks like an OK overall design.

Thanks for working on it!

Richard.


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* Makefile.in (tree-vect-slp-patterns.o): New.
> 	* doc/passes.texi: Update documentation.
> 	* tree-vect-slp.c (vect_match_slp_patterns_2, vect_match_slp_patterns):
> 	New.
> 	(vect_analyze_slp_instance): Call pattern matcher.
> 	* tree-vectorizer.h (class VectPatternMatch, class VectPattern): New.
> 	* tree-vect-slp-patterns.c: New file.
> 
>
Tamar Christina Sept. 28, 2020, 2:56 p.m. UTC | #2
Hi Richi,

Thanks for the review! 

Just some answers to your questions:

> -----Original Message-----
> From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Monday, September 28, 2020 1:37 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ook@ucw.cz
> Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> scaffolding.
> 
> On Fri, 25 Sep 2020, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This patch adds the basic infrastructure for doing pattern matching on SLP
> trees.
> > This is done immediately after the SLP tree creation because it can
> > change the shape of the tree in radical ways and so we would like to
> > do it before any analysis is performed on the tree.
> >
> > A new file tree-vect-slp-patterns.c is added which contains all the
> > code for pattern matching on SLP trees.
> >
> > This cover letter is short because the changes are heavily commented.
> >
> > All pattern matchers need to implement the abstract type
> VectPatternMatch.
> > The VectSimplePatternMatch abstract class provides some default
> > functionality for pattern matchers that need to rebuild nodes.
> >
> > The pattern matcher requires if replacing a statement in a node, that
> > ALL statements be replaced.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> 
> +    gcall *build ()
> +    {
> +      stmt_vec_info stmt_info;
> +
> 
> please define functions out-of-line (apart from the 1-liners)
> 
> +      /* We have to explicitly mark the old statement as unused because
> during
> +        statement analysis the original and new pattern statement may
> require
> +        different level of unrolling.  As an example add/sub when
> vectorized
> +        without a pattern requires 4 copies, whereas with a COMPLEX_ADD
> pattern
> +        this only requires 2 copies and the two statement will be
> + treated
> as
> +        hand unrolled.  That means that the analysis won't happen as
> it'll find
> +        a mismatch.  So we don't analyze the old statement and if we
> + end
> up
> +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> */
> +      STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> +      STMT_VINFO_RELATED_STMT (call_stmt_info) = stmt_info;
> 
> so this means all uses have to be inside the pattern as otherwise there may
> be even non-SLP uses.  vect_mark_pattern_stmts supports detecting
> patterns of patterns, I suppose the two-phase analysis for SLP patterns does
> not support this right now?
> 
> +      SLP_TREE_CODE (this->m_node) = gimple_expr_code (call_stmt);;
> 
> double ;, just make it CALL_EXPR literally (or leave it ERROR_MARK)
> 
> You seem to do in-place changing of the SLP node you match off?

Yes since this would allow me to change the root node as well, though
thinking about it I can probably do it by passing it as a reference which
then would allow me to re-use vect_create_new_slp_node which is
probably preferable. 

> 
> @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info *vinfo,
>                               &tree_size, bst_map);
>    if (node != NULL)
>      {
> +      /* Temporarily allow add_stmt calls again.  */
> +      vinfo->stmt_vec_info_ro = false;
> +
> +      /* See if any patterns can be found in the constructed SLP tree
> +        before we do any analysis on it.  */
> +      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
> +                              matches, &npermutes, &tree_size,
> + bst_map);
> +
> +      /* After this no more add_stmt calls are allowed.  */
> +      vinfo->stmt_vec_info_ro = true;
> +
> 
> I think this is a bit early to match patterns - I'd defer it to the point where all
> entries into the same SLP subgraph are analyzed, thus somewhere at the
> end of vect_analyze_slp loop over all instances and match patterns?  That
> way phases are more clearly separated.

That would probably work, my only worry is that the SLP analysis itself may fail and
bail out at 

	  /* If the loads and stores can be handled with load/store-lane
	     instructions do not generate this SLP instance.  */
	  if (is_a <loop_vec_info> (vinfo)
	      && loads_permuted
	      && dr && vect_store_lanes_supported (vectype, group_size, false))

Which in the initial tree may be true, but in the patterned tree may not be.  In the previous
revision of the patch you had suggested I return a boolean which can be used to cancel such
checks.  Would that be the preferred approach?

> 
> Note that fiddling with vinfo->stmt_vec_info_ro is a bit ugly, maybe add a -
> >add_pattern_stmt (gimple *pattern_stmt, stmt_vec_info
> orig_stmt) variant that also sets STMT_VINFO_RELATED_STMT but doesn't
> check !stmt_vec_info_ro.  That could be used from tree-vect-patterns.c as
> well and we could set stmt_vec_info_ro earlier.
> 
> +  VectPattern *pattern = patt_fn (node, vinfo);  uint8_t n =
> + pattern->get_arity ();
> +
> +  if (group_size % n != 0)
> +    {
> +      delete pattern;
> 
> seems to require VectPattern allocation even upon failure, I suggest to
> return NULL then to avoid excessive allocations.
> 
> +      if (!pattern->matches (stmt_infos, i))
> +       {
> +         /* We can only do replacements for entire groups, we must
> replace all
> +            statements in a node as the argument list/children may not
> have
> +            equal height then.  Operations that don't rewrite the
> arguments
> +            may be safe to do, so perhaps paramatrise it.  */
> +
> +         found_p = false;
> 
> I find it a bit ugly to iterate over "unrolls" in the machinery rather than the
> individual pattern matcher which might have an easier and in particular
> cheaper job here.  Since you require
> all lanes to match the same pattern anyway.   Not sure if your
> later patches support say, mixing complex add with different rotate in the
> same SLP node.

It does, as the constraint only applies to one pattern matcher class handling the
entire node.

An example of such case is

node 0x531a1f0 (max_nunits=2, refcnt=2)
 stmt 0 *_9 = _10;
 stmt 1 *_15 = _16;
 stmt 2 *_25 = _26;
 stmt 3 *_31 = _32;
 children 0x531a980
node 0x531a980 (max_nunits=2, refcnt=2)
 stmt 0 slp_patt_112 = .COMPLEX_ADD_ROT90 (_4, _14);
 stmt 1 slp_patt_111 = .COMPLEX_ADD_ROT90 (_12, _8);
 stmt 2 slp_patt_110 = .COMPLEX_ADD_ROT270 (_20, _30);
 stmt 3 slp_patt_109 = .COMPLEX_ADD_ROT270 (_28, _24);
 lane permutation { 0[0] 1[1] 1[2] 0[3] }
 children 0x5310680 0x530e040
node 0x5310680 (max_nunits=2, refcnt=4)
 stmt 0 _4 = *_3;
 stmt 1 _12 = *_11;
 stmt 2 _20 = *_19;
 stmt 3 _28 = *_27;
 load permutation { 0 1 2 3 }
node 0x530e040 (max_nunits=2, refcnt=2)
 stmt 0 _14 = *_13;
 stmt 1 _8 = *_7;
 stmt 2 _30 = *_29;
 stmt 3 _24 = *_23;
 load permutation { 0 1 2 3 }

though looking at the resulting assembly the code is incorrect,

.L6:
        ldr     q1, [x1, x3]
        ldr     q0, [x0, x3]
        fcadd   v0.2d, v0.2d, v1.2d, #270
        str     q0, [x2, x3]
        ldr     q1, [x5, x3]
        ldr     q0, [x6, x3]
        fcadd   v0.2d, v0.2d, v1.2d, #270
        str     q0, [x4, x3]
        add     x3, x3, 32
        cmp     x3, 1600
        bne     .L6
        ret

Which I assume is because SLP_TREE_REPRESENTATIVE is pointing to the rotate 270?

> Note the ultimate idea in the end is that a SLP node can, of
> course, be split into two [but at this point the vector type / unroll factor is not
> final so general splitting at vector boundary is not desired yet].
> The split can be undone for consumers by inserting a VEC_PERM node (which
> should semantically be a concat + select)
> 
> +      tree type = gimple_expr_type (STMT_VINFO_STMT (stmt_info));
> +      tree vectype = get_vectype_for_scalar_type (vinfo, type, node);
> 
> use
> 
>       tree vectype = SLP_TREE_VECTYPE (node);
> 
> generally avoid looking at scalar stmts, iff then look at
> SLP_TREE_REPRESENTATIVE - all lanes have uniform operations applied to
> (but the scalar stmts may not appear to do so!  the scalar stmts merely stand
> for their 'def').
> 
> +  /* Perform recursive matching, it's important to do this after
> + matching
> things
> +     in the current node as the matches here may re-order the nodes
> + below
> it.
> +     As such the pattern that needs to be subsequently match may change.
> */
> +
> +  if (SLP_TREE_CHILDREN (node).exists ()) {
> +    slp_tree child;
> +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> +      found_rec_p |= vect_match_slp_patterns_2 (child, vinfo, group_size,
> +                                               patt_fn, max_nunits,
> matches,
> +                                               npermutes, tree_size,
> bst_map);
> +  }
> 
> 
> you definitely need a visited set - you are walking a graph and nodes can
> appear along multiple paths!
> 
> +      vect_mark_slp_stmts_relevant (node);
> 
> that walks the whole subgraph but if you need to do anything you at most
> want to touch the node itself, no?
> 
> To make patterns-of-patterns viable you need to do all parts of the walk in
> post-order.  What breaks if you do ->matches/->validate in post-order?  I
> think that would be more future-proof.

You lose the ability to match the longest pattern. As an example the complex add and
complex fma patterns overlap. Right now I can try matching the fma first and then add.
But doing it in post order the fma woud never match as the subtree would be too small
and the add would always match.

Aside from that it makes it very difficult to rebuild the subtrees as the SSA names have changed (since build
Is already done in post order),
So right now I can use e.g. _3, _4 etc, however if the patterns have already been applied I would
need to know what their replacements are since build () would replace them and you lose the ability to navigate
by SSA name.

Regards,
Tamar

> 
> Otherwise this looks like an OK overall design.
> 
> Thanks for working on it!
> 
> Richard.
> 
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > 	* doc/passes.texi: Update documentation.
> > 	* tree-vect-slp.c (vect_match_slp_patterns_2,
> vect_match_slp_patterns):
> > 	New.
> > 	(vect_analyze_slp_instance): Call pattern matcher.
> > 	* tree-vectorizer.h (class VectPatternMatch, class VectPattern): New.
> > 	* tree-vect-slp-patterns.c: New file.
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imend
Tamar Christina Sept. 28, 2020, 5:24 p.m. UTC | #3
> -----Original Message-----
> From: Gcc-patches <gcc-patches-bounces@gcc.gnu.org> On Behalf Of Tamar
> Christina
> Sent: Monday, September 28, 2020 3:56 PM
> To: Richard Biener <rguenther@suse.de>
> Cc: nd <nd@arm.com>; gcc-patches@gcc.gnu.org; ook@ucw.cz
> Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> scaffolding.
> 
> Hi Richi,
> 
> Thanks for the review!
> 
> Just some answers to your questions:
> 
> > -----Original Message-----
> > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > Behalf Of Richard Biener
> > Sent: Monday, September 28, 2020 1:37 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ook@ucw.cz
> > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > scaffolding.
> >
> > On Fri, 25 Sep 2020, Tamar Christina wrote:
> >
> > > Hi All,
> > >
> > > This patch adds the basic infrastructure for doing pattern matching
> > > on SLP
> > trees.
> > > This is done immediately after the SLP tree creation because it can
> > > change the shape of the tree in radical ways and so we would like to
> > > do it before any analysis is performed on the tree.
> > >
> > > A new file tree-vect-slp-patterns.c is added which contains all the
> > > code for pattern matching on SLP trees.
> > >
> > > This cover letter is short because the changes are heavily commented.
> > >
> > > All pattern matchers need to implement the abstract type
> > VectPatternMatch.
> > > The VectSimplePatternMatch abstract class provides some default
> > > functionality for pattern matchers that need to rebuild nodes.
> > >
> > > The pattern matcher requires if replacing a statement in a node,
> > > that ALL statements be replaced.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > >
> > > Ok for master?
> >
> > +    gcall *build ()
> > +    {
> > +      stmt_vec_info stmt_info;
> > +
> >
> > please define functions out-of-line (apart from the 1-liners)
> >
> > +      /* We have to explicitly mark the old statement as unused
> > + because
> > during
> > +        statement analysis the original and new pattern statement may
> > require
> > +        different level of unrolling.  As an example add/sub when
> > vectorized
> > +        without a pattern requires 4 copies, whereas with a
> > + COMPLEX_ADD
> > pattern
> > +        this only requires 2 copies and the two statement will be
> > + treated
> > as
> > +        hand unrolled.  That means that the analysis won't happen as
> > it'll find
> > +        a mismatch.  So we don't analyze the old statement and if we
> > + end
> > up
> > +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> > */
> > +      STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> > +      STMT_VINFO_RELATED_STMT (call_stmt_info) = stmt_info;
> >
> > so this means all uses have to be inside the pattern as otherwise
> > there may be even non-SLP uses.  vect_mark_pattern_stmts supports
> > detecting patterns of patterns, I suppose the two-phase analysis for
> > SLP patterns does not support this right now?
> >
> > +      SLP_TREE_CODE (this->m_node) = gimple_expr_code (call_stmt);;
> >
> > double ;, just make it CALL_EXPR literally (or leave it ERROR_MARK)
> >
> > You seem to do in-place changing of the SLP node you match off?
> 
> Yes since this would allow me to change the root node as well, though
> thinking about it I can probably do it by passing it as a reference which then
> would allow me to re-use vect_create_new_slp_node which is probably
> preferable.
> 
> >
> > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >                               &tree_size, bst_map);
> >    if (node != NULL)
> >      {
> > +      /* Temporarily allow add_stmt calls again.  */
> > +      vinfo->stmt_vec_info_ro = false;
> > +
> > +      /* See if any patterns can be found in the constructed SLP tree
> > +        before we do any analysis on it.  */
> > +      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
> > +                              matches, &npermutes, &tree_size,
> > + bst_map);
> > +
> > +      /* After this no more add_stmt calls are allowed.  */
> > +      vinfo->stmt_vec_info_ro = true;
> > +
> >
> > I think this is a bit early to match patterns - I'd defer it to the
> > point where all entries into the same SLP subgraph are analyzed, thus
> > somewhere at the end of vect_analyze_slp loop over all instances and
> > match patterns?  That way phases are more clearly separated.
> 
> That would probably work, my only worry is that the SLP analysis itself may
> fail and bail out at
> 
> 	  /* If the loads and stores can be handled with load/store-lane
> 	     instructions do not generate this SLP instance.  */
> 	  if (is_a <loop_vec_info> (vinfo)
> 	      && loads_permuted
> 	      && dr && vect_store_lanes_supported (vectype, group_size,
> false))
> 
> Which in the initial tree may be true, but in the patterned tree may not be.
> In the previous revision of the patch you had suggested I return a boolean
> which can be used to cancel such checks.  Would that be the preferred
> approach?
> 
> >
> > Note that fiddling with vinfo->stmt_vec_info_ro is a bit ugly, maybe
> > add a -
> > >add_pattern_stmt (gimple *pattern_stmt, stmt_vec_info
> > orig_stmt) variant that also sets STMT_VINFO_RELATED_STMT but doesn't
> > check !stmt_vec_info_ro.  That could be used from tree-vect-patterns.c
> > as well and we could set stmt_vec_info_ro earlier.
> >
> > +  VectPattern *pattern = patt_fn (node, vinfo);  uint8_t n =
> > + pattern->get_arity ();
> > +
> > +  if (group_size % n != 0)
> > +    {
> > +      delete pattern;
> >
> > seems to require VectPattern allocation even upon failure, I suggest
> > to return NULL then to avoid excessive allocations.
> >
> > +      if (!pattern->matches (stmt_infos, i))
> > +       {
> > +         /* We can only do replacements for entire groups, we must
> > replace all
> > +            statements in a node as the argument list/children may
> > + not
> > have
> > +            equal height then.  Operations that don't rewrite the
> > arguments
> > +            may be safe to do, so perhaps paramatrise it.  */
> > +
> > +         found_p = false;
> >
> > I find it a bit ugly to iterate over "unrolls" in the machinery rather
> > than the individual pattern matcher which might have an easier and in
> > particular cheaper job here.  Since you require
> > all lanes to match the same pattern anyway.   Not sure if your
> > later patches support say, mixing complex add with different rotate in
> > the same SLP node.
> 
> It does, as the constraint only applies to one pattern matcher class handling
> the entire node.
> 
> An example of such case is
> 
> node 0x531a1f0 (max_nunits=2, refcnt=2)
>  stmt 0 *_9 = _10;
>  stmt 1 *_15 = _16;
>  stmt 2 *_25 = _26;
>  stmt 3 *_31 = _32;
>  children 0x531a980
> node 0x531a980 (max_nunits=2, refcnt=2)
>  stmt 0 slp_patt_112 = .COMPLEX_ADD_ROT90 (_4, _14);  stmt 1 slp_patt_111
> = .COMPLEX_ADD_ROT90 (_12, _8);  stmt 2 slp_patt_110
> = .COMPLEX_ADD_ROT270 (_20, _30);  stmt 3 slp_patt_109
> = .COMPLEX_ADD_ROT270 (_28, _24);  lane permutation { 0[0] 1[1] 1[2] 0[3] }
> children 0x5310680 0x530e040 node 0x5310680 (max_nunits=2, refcnt=4)
> stmt 0 _4 = *_3;  stmt 1 _12 = *_11;  stmt 2 _20 = *_19;  stmt 3 _28 = *_27;
> load permutation { 0 1 2 3 } node 0x530e040 (max_nunits=2, refcnt=2)  stmt 0
> _14 = *_13;  stmt 1 _8 = *_7;  stmt 2 _30 = *_29;  stmt 3 _24 = *_23;  load
> permutation { 0 1 2 3 }
> 
> though looking at the resulting assembly the code is incorrect,
> 
> .L6:
>         ldr     q1, [x1, x3]
>         ldr     q0, [x0, x3]
>         fcadd   v0.2d, v0.2d, v1.2d, #270
>         str     q0, [x2, x3]
>         ldr     q1, [x5, x3]
>         ldr     q0, [x6, x3]
>         fcadd   v0.2d, v0.2d, v1.2d, #270
>         str     q0, [x4, x3]
>         add     x3, x3, 32
>         cmp     x3, 1600
>         bne     .L6
>         ret
> 
> Which I assume is because SLP_TREE_REPRESENTATIVE is pointing to the
> rotate 270?
> 
> > Note the ultimate idea in the end is that a SLP node can, of course,
> > be split into two [but at this point the vector type / unroll factor
> > is not final so general splitting at vector boundary is not desired yet].
> > The split can be undone for consumers by inserting a VEC_PERM node
> > (which should semantically be a concat + select)
> >
> > +      tree type = gimple_expr_type (STMT_VINFO_STMT (stmt_info));
> > +      tree vectype = get_vectype_for_scalar_type (vinfo, type, node);
> >
> > use
> >
> >       tree vectype = SLP_TREE_VECTYPE (node);
> >
> > generally avoid looking at scalar stmts, iff then look at
> > SLP_TREE_REPRESENTATIVE - all lanes have uniform operations applied to
> > (but the scalar stmts may not appear to do so!  the scalar stmts
> > merely stand for their 'def').
> >
> > +  /* Perform recursive matching, it's important to do this after
> > + matching
> > things
> > +     in the current node as the matches here may re-order the nodes
> > + below
> > it.
> > +     As such the pattern that needs to be subsequently match may change.
> > */
> > +
> > +  if (SLP_TREE_CHILDREN (node).exists ()) {
> > +    slp_tree child;
> > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +      found_rec_p |= vect_match_slp_patterns_2 (child, vinfo, group_size,
> > +                                               patt_fn, max_nunits,
> > matches,
> > +                                               npermutes, tree_size,
> > bst_map);
> > +  }
> >
> >
> > you definitely need a visited set - you are walking a graph and nodes
> > can appear along multiple paths!
> >
> > +      vect_mark_slp_stmts_relevant (node);
> >
> > that walks the whole subgraph but if you need to do anything you at
> > most want to touch the node itself, no?
> >
> > To make patterns-of-patterns viable you need to do all parts of the
> > walk in post-order.  What breaks if you do ->matches/->validate in
> > post-order?  I think that would be more future-proof.
> 
> You lose the ability to match the longest pattern. As an example the complex
> add and complex fma patterns overlap. Right now I can try matching the fma
> first and then add.
> But doing it in post order the fma woud never match as the subtree would be
> too small and the add would always match.
> 

Oops, I forgot that this new version tries the same pattern over the entire tree. So it should work.
You would only lose the ability to navigate by SSA name, but it doesn't need that anyway..

I'll make that change and see.

Thanks,
Tamar

> Aside from that it makes it very difficult to rebuild the subtrees as the SSA
> names have changed (since build Is already done in post order), So right now
> I can use e.g. _3, _4 etc, however if the patterns have already been applied I
> would need to know what their replacements are since build () would
> replace them and you lose the ability to navigate by SSA name.
> 
> Regards,
> Tamar
> 
> >
> > Otherwise this looks like an OK overall design.
> >
> > Thanks for working on it!
> >
> > Richard.
> >
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > > 	* doc/passes.texi: Update documentation.
> > > 	* tree-vect-slp.c (vect_match_slp_patterns_2,
> > vect_match_slp_patterns):
> > > 	New.
> > > 	(vect_analyze_slp_instance): Call pattern matcher.
> > > 	* tree-vectorizer.h (class VectPatternMatch, class VectPattern): New.
> > > 	* tree-vect-slp-patterns.c: New file.
> > >
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imend
Richard Biener Sept. 29, 2020, 6:45 a.m. UTC | #4
On Mon, 28 Sep 2020, Tamar Christina wrote:

> 
> 
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-bounces@gcc.gnu.org> On Behalf Of Tamar
> > Christina
> > Sent: Monday, September 28, 2020 3:56 PM
> > To: Richard Biener <rguenther@suse.de>
> > Cc: nd <nd@arm.com>; gcc-patches@gcc.gnu.org; ook@ucw.cz
> > Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > scaffolding.
> > 
> > Hi Richi,
> > 
> > Thanks for the review!
> > 
> > Just some answers to your questions:
> > 
> > > -----Original Message-----
> > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > > Behalf Of Richard Biener
> > > Sent: Monday, September 28, 2020 1:37 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ook@ucw.cz
> > > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > > scaffolding.
> > >
> > > On Fri, 25 Sep 2020, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch adds the basic infrastructure for doing pattern matching
> > > > on SLP
> > > trees.
> > > > This is done immediately after the SLP tree creation because it can
> > > > change the shape of the tree in radical ways and so we would like to
> > > > do it before any analysis is performed on the tree.
> > > >
> > > > A new file tree-vect-slp-patterns.c is added which contains all the
> > > > code for pattern matching on SLP trees.
> > > >
> > > > This cover letter is short because the changes are heavily commented.
> > > >
> > > > All pattern matchers need to implement the abstract type
> > > VectPatternMatch.
> > > > The VectSimplePatternMatch abstract class provides some default
> > > > functionality for pattern matchers that need to rebuild nodes.
> > > >
> > > > The pattern matcher requires if replacing a statement in a node,
> > > > that ALL statements be replaced.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > >
> > > +    gcall *build ()
> > > +    {
> > > +      stmt_vec_info stmt_info;
> > > +
> > >
> > > please define functions out-of-line (apart from the 1-liners)
> > >
> > > +      /* We have to explicitly mark the old statement as unused
> > > + because
> > > during
> > > +        statement analysis the original and new pattern statement may
> > > require
> > > +        different level of unrolling.  As an example add/sub when
> > > vectorized
> > > +        without a pattern requires 4 copies, whereas with a
> > > + COMPLEX_ADD
> > > pattern
> > > +        this only requires 2 copies and the two statement will be
> > > + treated
> > > as
> > > +        hand unrolled.  That means that the analysis won't happen as
> > > it'll find
> > > +        a mismatch.  So we don't analyze the old statement and if we
> > > + end
> > > up
> > > +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> > > */
> > > +      STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > > +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> > > +      STMT_VINFO_RELATED_STMT (call_stmt_info) = stmt_info;
> > >
> > > so this means all uses have to be inside the pattern as otherwise
> > > there may be even non-SLP uses.  vect_mark_pattern_stmts supports
> > > detecting patterns of patterns, I suppose the two-phase analysis for
> > > SLP patterns does not support this right now?
> > >
> > > +      SLP_TREE_CODE (this->m_node) = gimple_expr_code (call_stmt);;
> > >
> > > double ;, just make it CALL_EXPR literally (or leave it ERROR_MARK)
> > >
> > > You seem to do in-place changing of the SLP node you match off?
> > 
> > Yes since this would allow me to change the root node as well, though
> > thinking about it I can probably do it by passing it as a reference which then
> > would allow me to re-use vect_create_new_slp_node which is probably
> > preferable.
> > 
> > >
> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info *vinfo,
> > >                               &tree_size, bst_map);
> > >    if (node != NULL)
> > >      {
> > > +      /* Temporarily allow add_stmt calls again.  */
> > > +      vinfo->stmt_vec_info_ro = false;
> > > +
> > > +      /* See if any patterns can be found in the constructed SLP tree
> > > +        before we do any analysis on it.  */
> > > +      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
> > > +                              matches, &npermutes, &tree_size,
> > > + bst_map);
> > > +
> > > +      /* After this no more add_stmt calls are allowed.  */
> > > +      vinfo->stmt_vec_info_ro = true;
> > > +
> > >
> > > I think this is a bit early to match patterns - I'd defer it to the
> > > point where all entries into the same SLP subgraph are analyzed, thus
> > > somewhere at the end of vect_analyze_slp loop over all instances and
> > > match patterns?  That way phases are more clearly separated.
> > 
> > That would probably work, my only worry is that the SLP analysis itself may
> > fail and bail out at
> > 
> > 	  /* If the loads and stores can be handled with load/store-lane
> > 	     instructions do not generate this SLP instance.  */
> > 	  if (is_a <loop_vec_info> (vinfo)
> > 	      && loads_permuted
> > 	      && dr && vect_store_lanes_supported (vectype, group_size,
> > false))

Ah, that piece of code.  Yeah, I'm repeatedly running into it as well - 
it's a bad hack that stands in the way all the time :/

I guess we should try moving this upward like to
vect_analyze_loop_2 right before

  /* Check the SLP opportunities in the loop, analyze and build SLP trees.  
*/
  ok = vect_analyze_slp (loop_vinfo, *n_stmts);
  if (!ok)
    return ok;

and there check whether all grouped loads and stores can be handled
with store-/load-lanes (and there are no SLP reduction chains?) in
which case do not try to attempt SLP at all.  Because the testcases
this check was supposed to change were all-load/store-lane or
all SLP so the mixed case is probably not worth special casing.

Since load-/store-lanes is an arm speciality I tried to only touch
this fragile part with a ten-foot pole ;)  CCing Richard, if he
acks the above I can produce a patch.

> > Which in the initial tree may be true, but in the patterned tree may not be.
> > In the previous revision of the patch you had suggested I return a boolean
> > which can be used to cancel such checks.  Would that be the preferred
> > approach?



> > >
> > > Note that fiddling with vinfo->stmt_vec_info_ro is a bit ugly, maybe
> > > add a -
> > > >add_pattern_stmt (gimple *pattern_stmt, stmt_vec_info
> > > orig_stmt) variant that also sets STMT_VINFO_RELATED_STMT but doesn't
> > > check !stmt_vec_info_ro.  That could be used from tree-vect-patterns.c
> > > as well and we could set stmt_vec_info_ro earlier.
> > >
> > > +  VectPattern *pattern = patt_fn (node, vinfo);  uint8_t n =
> > > + pattern->get_arity ();
> > > +
> > > +  if (group_size % n != 0)
> > > +    {
> > > +      delete pattern;
> > >
> > > seems to require VectPattern allocation even upon failure, I suggest
> > > to return NULL then to avoid excessive allocations.
> > >
> > > +      if (!pattern->matches (stmt_infos, i))
> > > +       {
> > > +         /* We can only do replacements for entire groups, we must
> > > replace all
> > > +            statements in a node as the argument list/children may
> > > + not
> > > have
> > > +            equal height then.  Operations that don't rewrite the
> > > arguments
> > > +            may be safe to do, so perhaps paramatrise it.  */
> > > +
> > > +         found_p = false;
> > >
> > > I find it a bit ugly to iterate over "unrolls" in the machinery rather
> > > than the individual pattern matcher which might have an easier and in
> > > particular cheaper job here.  Since you require
> > > all lanes to match the same pattern anyway.   Not sure if your
> > > later patches support say, mixing complex add with different rotate in
> > > the same SLP node.
> > 
> > It does, as the constraint only applies to one pattern matcher class handling
> > the entire node.
> > 
> > An example of such case is
> > 
> > node 0x531a1f0 (max_nunits=2, refcnt=2)
> >  stmt 0 *_9 = _10;
> >  stmt 1 *_15 = _16;
> >  stmt 2 *_25 = _26;
> >  stmt 3 *_31 = _32;
> >  children 0x531a980
> > node 0x531a980 (max_nunits=2, refcnt=2)
> >  stmt 0 slp_patt_112 = .COMPLEX_ADD_ROT90 (_4, _14);  stmt 1 slp_patt_111
> > = .COMPLEX_ADD_ROT90 (_12, _8);  stmt 2 slp_patt_110
> > = .COMPLEX_ADD_ROT270 (_20, _30);  stmt 3 slp_patt_109
> > = .COMPLEX_ADD_ROT270 (_28, _24);  lane permutation { 0[0] 1[1] 1[2] 0[3] }
> > children 0x5310680 0x530e040 node 0x5310680 (max_nunits=2, refcnt=4)
> > stmt 0 _4 = *_3;  stmt 1 _12 = *_11;  stmt 2 _20 = *_19;  stmt 3 _28 = *_27;
> > load permutation { 0 1 2 3 } node 0x530e040 (max_nunits=2, refcnt=2)  stmt 0
> > _14 = *_13;  stmt 1 _8 = *_7;  stmt 2 _30 = *_29;  stmt 3 _24 = *_23;  load
> > permutation { 0 1 2 3 }
> > 
> > though looking at the resulting assembly the code is incorrect,
> > 
> > .L6:
> >         ldr     q1, [x1, x3]
> >         ldr     q0, [x0, x3]
> >         fcadd   v0.2d, v0.2d, v1.2d, #270
> >         str     q0, [x2, x3]
> >         ldr     q1, [x5, x3]
> >         ldr     q0, [x6, x3]
> >         fcadd   v0.2d, v0.2d, v1.2d, #270
> >         str     q0, [x4, x3]
> >         add     x3, x3, 32
> >         cmp     x3, 1600
> >         bne     .L6
> >         ret
> > 
> > Which I assume is because SLP_TREE_REPRESENTATIVE is pointing to the
> > rotate 270?

Or maybe due to the lane permutations.

> > > Note the ultimate idea in the end is that a SLP node can, of course,
> > > be split into two [but at this point the vector type / unroll factor
> > > is not final so general splitting at vector boundary is not desired yet].
> > > The split can be undone for consumers by inserting a VEC_PERM node
> > > (which should semantically be a concat + select)
> > >
> > > +      tree type = gimple_expr_type (STMT_VINFO_STMT (stmt_info));
> > > +      tree vectype = get_vectype_for_scalar_type (vinfo, type, node);
> > >
> > > use
> > >
> > >       tree vectype = SLP_TREE_VECTYPE (node);
> > >
> > > generally avoid looking at scalar stmts, iff then look at
> > > SLP_TREE_REPRESENTATIVE - all lanes have uniform operations applied to
> > > (but the scalar stmts may not appear to do so!  the scalar stmts
> > > merely stand for their 'def').
> > >
> > > +  /* Perform recursive matching, it's important to do this after
> > > + matching
> > > things
> > > +     in the current node as the matches here may re-order the nodes
> > > + below
> > > it.
> > > +     As such the pattern that needs to be subsequently match may change.
> > > */
> > > +
> > > +  if (SLP_TREE_CHILDREN (node).exists ()) {
> > > +    slp_tree child;
> > > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > > +      found_rec_p |= vect_match_slp_patterns_2 (child, vinfo, group_size,
> > > +                                               patt_fn, max_nunits,
> > > matches,
> > > +                                               npermutes, tree_size,
> > > bst_map);
> > > +  }
> > >
> > >
> > > you definitely need a visited set - you are walking a graph and nodes
> > > can appear along multiple paths!
> > >
> > > +      vect_mark_slp_stmts_relevant (node);
> > >
> > > that walks the whole subgraph but if you need to do anything you at
> > > most want to touch the node itself, no?
> > >
> > > To make patterns-of-patterns viable you need to do all parts of the
> > > walk in post-order.  What breaks if you do ->matches/->validate in
> > > post-order?  I think that would be more future-proof.
> > 
> > You lose the ability to match the longest pattern. As an example the complex
> > add and complex fma patterns overlap. Right now I can try matching the fma
> > first and then add.
> > But doing it in post order the fma woud never match as the subtree would be
> > too small and the add would always match.
> > 
> 
> Oops, I forgot that this new version tries the same pattern over the 
> entire tree. So it should work. You would only lose the ability to 
> navigate by SSA name, but it doesn't need that anyway..
> 
> I'll make that change and see.

Thanks,
Richard.

> Thanks,
> Tamar
> 
> > Aside from that it makes it very difficult to rebuild the subtrees as the SSA
> > names have changed (since build Is already done in post order), So right now
> > I can use e.g. _3, _4 etc, however if the patterns have already been applied I
> > would need to know what their replacements are since build () would
> > replace them and you lose the ability to navigate by SSA name.
> > 
> > Regards,
> > Tamar
> > 
> > >
> > > Otherwise this looks like an OK overall design.
> > >
> > > Thanks for working on it!
> > >
> > > Richard.
> > >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > > > 	* doc/passes.texi: Update documentation.
> > > > 	* tree-vect-slp.c (vect_match_slp_patterns_2,
> > > vect_match_slp_patterns):
> > > > 	New.
> > > > 	(vect_analyze_slp_instance): Call pattern matcher.
> > > > 	* tree-vectorizer.h (class VectPatternMatch, class VectPattern): New.
> > > > 	* tree-vect-slp-patterns.c: New file.
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > Nuernberg, Germany; GF: Felix Imend
>
Richard Sandiford Sept. 29, 2020, 9:25 a.m. UTC | #5
Richard Biener <rguenther@suse.de> writes:
>> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info *vinfo,
>> > >                               &tree_size, bst_map);
>> > >    if (node != NULL)
>> > >      {
>> > > +      /* Temporarily allow add_stmt calls again.  */
>> > > +      vinfo->stmt_vec_info_ro = false;
>> > > +
>> > > +      /* See if any patterns can be found in the constructed SLP tree
>> > > +        before we do any analysis on it.  */
>> > > +      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
>> > > +                              matches, &npermutes, &tree_size,
>> > > + bst_map);
>> > > +
>> > > +      /* After this no more add_stmt calls are allowed.  */
>> > > +      vinfo->stmt_vec_info_ro = true;
>> > > +
>> > >
>> > > I think this is a bit early to match patterns - I'd defer it to the
>> > > point where all entries into the same SLP subgraph are analyzed, thus
>> > > somewhere at the end of vect_analyze_slp loop over all instances and
>> > > match patterns?  That way phases are more clearly separated.
>> > 
>> > That would probably work, my only worry is that the SLP analysis itself may
>> > fail and bail out at
>> > 
>> > 	  /* If the loads and stores can be handled with load/store-lane
>> > 	     instructions do not generate this SLP instance.  */
>> > 	  if (is_a <loop_vec_info> (vinfo)
>> > 	      && loads_permuted
>> > 	      && dr && vect_store_lanes_supported (vectype, group_size,
>> > false))
>
> Ah, that piece of code.  Yeah, I'm repeatedly running into it as well - 
> it's a bad hack that stands in the way all the time :/

At one point I was wondering about trying to drop the above, vectorise with
and without SLP, and then compare their costs, like for VECT_COMPARE_COSTS.
But that seemed like a dead end with the move to doing everything on the
SLP representation.

> I guess we should try moving this upward like to
> vect_analyze_loop_2 right before
>
>   /* Check the SLP opportunities in the loop, analyze and build SLP trees.  
> */
>   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
>   if (!ok)
>     return ok;
>
> and there check whether all grouped loads and stores can be handled
> with store-/load-lanes (and there are no SLP reduction chains?) in
> which case do not try to attempt SLP at all.  Because the testcases
> this check was supposed to change were all-load/store-lane or
> all SLP so the mixed case is probably not worth special casing.
>
> Since load-/store-lanes is an arm speciality I tried to only touch
> this fragile part with a ten-foot pole ;)  CCing Richard, if he
> acks the above I can produce a patch.

Yeah, sounds good to me.  Probably also sorth checking whether the
likely_max iteration count is high enough to support group_size
vectors, if we have enough information to guess that.

We could also get the gen* machinery to emit a macro that is true if at
least one load/store-lane pattern is defined, so that we can skip the
code for non-Arm targets.  I can do that as a follow-up.

Thanks,
Richard
Richard Biener Sept. 29, 2020, 9:42 a.m. UTC | #6
On Tue, 29 Sep 2020, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info *vinfo,
> >> > >                               &tree_size, bst_map);
> >> > >    if (node != NULL)
> >> > >      {
> >> > > +      /* Temporarily allow add_stmt calls again.  */
> >> > > +      vinfo->stmt_vec_info_ro = false;
> >> > > +
> >> > > +      /* See if any patterns can be found in the constructed SLP tree
> >> > > +        before we do any analysis on it.  */
> >> > > +      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
> >> > > +                              matches, &npermutes, &tree_size,
> >> > > + bst_map);
> >> > > +
> >> > > +      /* After this no more add_stmt calls are allowed.  */
> >> > > +      vinfo->stmt_vec_info_ro = true;
> >> > > +
> >> > >
> >> > > I think this is a bit early to match patterns - I'd defer it to the
> >> > > point where all entries into the same SLP subgraph are analyzed, thus
> >> > > somewhere at the end of vect_analyze_slp loop over all instances and
> >> > > match patterns?  That way phases are more clearly separated.
> >> > 
> >> > That would probably work, my only worry is that the SLP analysis itself may
> >> > fail and bail out at
> >> > 
> >> > 	  /* If the loads and stores can be handled with load/store-lane
> >> > 	     instructions do not generate this SLP instance.  */
> >> > 	  if (is_a <loop_vec_info> (vinfo)
> >> > 	      && loads_permuted
> >> > 	      && dr && vect_store_lanes_supported (vectype, group_size,
> >> > false))
> >
> > Ah, that piece of code.  Yeah, I'm repeatedly running into it as well - 
> > it's a bad hack that stands in the way all the time :/
> 
> At one point I was wondering about trying to drop the above, vectorise with
> and without SLP, and then compare their costs, like for VECT_COMPARE_COSTS.
> But that seemed like a dead end with the move to doing everything on the
> SLP representation.

Yeah ... though even moving everything to the SLP representation will
retain the issue since there it will be N group-size 1 SLP instances
vs. 1 group-size N SLP instance.

> > I guess we should try moving this upward like to
> > vect_analyze_loop_2 right before
> >
> >   /* Check the SLP opportunities in the loop, analyze and build SLP trees.  
> > */
> >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> >   if (!ok)
> >     return ok;
> >
> > and there check whether all grouped loads and stores can be handled
> > with store-/load-lanes (and there are no SLP reduction chains?) in
> > which case do not try to attempt SLP at all.  Because the testcases
> > this check was supposed to change were all-load/store-lane or
> > all SLP so the mixed case is probably not worth special casing.
> >
> > Since load-/store-lanes is an arm speciality I tried to only touch
> > this fragile part with a ten-foot pole ;)  CCing Richard, if he
> > acks the above I can produce a patch.
> 
> Yeah, sounds good to me.  Probably also sorth checking whether the
> likely_max iteration count is high enough to support group_size
> vectors, if we have enough information to guess that.
> 
> We could also get the gen* machinery to emit a macro that is true if at
> least one load/store-lane pattern is defined, so that we can skip the
> code for non-Arm targets.  I can do that as a follow-up.

I've had a second look and one complication is that we only elide the
SLP node if any of the loads are permuted.  So if all loads/stores
are unpermuted but load/store-lanes would work we'd keep the SLP node.

Of course without actually building the SLP node we don't know whether
the loads will be permuted or not ...

But surely the current place for the check will cause some testcases
to become hybrid vectorizations which is likely undesirable.

So we could move the check after all SLP discovery is completed
and throw it all away if we can and should use load/store-lanes?
But that might then not solve Tamars issue.

Richard.
Tamar Christina Nov. 3, 2020, 3:06 p.m. UTC | #7
Hi Richi,

This is a respin which includes the changes you requested.

Thanks,
Tamar

gcc/ChangeLog:

	* Makefile.in (tree-vect-slp-patterns.o): New.
	* doc/passes.texi: Update documentation.
	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
	(vect_analyze_slp): Call pattern matcher.
	* tree-vectorizer.h (enum _complex_operation):
	(class vect_pattern_match, class vect_pattern): New.
	* tree-vect-slp-patterns.c: New file.

> -----Original Message-----
> From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Tuesday, September 29, 2020 10:42 AM
> To: Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> scaffolding.
> 
> On Tue, 29 Sep 2020, Richard Sandiford wrote:
> 
> > Richard Biener <rguenther@suse.de> writes:
> > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info
> *vinfo,
> > >> > >                               &tree_size, bst_map);
> > >> > >    if (node != NULL)
> > >> > >      {
> > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > >> > > +      vinfo->stmt_vec_info_ro = false;
> > >> > > +
> > >> > > +      /* See if any patterns can be found in the constructed SLP tree
> > >> > > +        before we do any analysis on it.  */
> > >> > > +      vect_match_slp_patterns (node, vinfo, group_size,
> &max_nunits,
> > >> > > +                              matches, &npermutes, &tree_size,
> > >> > > + bst_map);
> > >> > > +
> > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > >> > > +      vinfo->stmt_vec_info_ro = true;
> > >> > > +
> > >> > >
> > >> > > I think this is a bit early to match patterns - I'd defer it to
> > >> > > the point where all entries into the same SLP subgraph are
> > >> > > analyzed, thus somewhere at the end of vect_analyze_slp loop
> > >> > > over all instances and match patterns?  That way phases are more
> clearly separated.
> > >> >
> > >> > That would probably work, my only worry is that the SLP analysis
> > >> > itself may fail and bail out at
> > >> >
> > >> > 	  /* If the loads and stores can be handled with load/store-lane
> > >> > 	     instructions do not generate this SLP instance.  */
> > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > >> > 	      && loads_permuted
> > >> > 	      && dr && vect_store_lanes_supported (vectype, group_size,
> > >> > false))
> > >
> > > Ah, that piece of code.  Yeah, I'm repeatedly running into it as
> > > well - it's a bad hack that stands in the way all the time :/
> >
> > At one point I was wondering about trying to drop the above, vectorise
> > with and without SLP, and then compare their costs, like for
> VECT_COMPARE_COSTS.
> > But that seemed like a dead end with the move to doing everything on
> > the SLP representation.
> 
> Yeah ... though even moving everything to the SLP representation will retain
> the issue since there it will be N group-size 1 SLP instances vs. 1 group-size N
> SLP instance.
> 
> > > I guess we should try moving this upward like to
> > > vect_analyze_loop_2 right before
> > >
> > >   /* Check the SLP opportunities in the loop, analyze and build SLP trees.
> > > */
> > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > >   if (!ok)
> > >     return ok;
> > >
> > > and there check whether all grouped loads and stores can be handled
> > > with store-/load-lanes (and there are no SLP reduction chains?) in
> > > which case do not try to attempt SLP at all.  Because the testcases
> > > this check was supposed to change were all-load/store-lane or all
> > > SLP so the mixed case is probably not worth special casing.
> > >
> > > Since load-/store-lanes is an arm speciality I tried to only touch
> > > this fragile part with a ten-foot pole ;)  CCing Richard, if he acks
> > > the above I can produce a patch.
> >
> > Yeah, sounds good to me.  Probably also sorth checking whether the
> > likely_max iteration count is high enough to support group_size
> > vectors, if we have enough information to guess that.
> >
> > We could also get the gen* machinery to emit a macro that is true if
> > at least one load/store-lane pattern is defined, so that we can skip
> > the code for non-Arm targets.  I can do that as a follow-up.
> 
> I've had a second look and one complication is that we only elide the SLP
> node if any of the loads are permuted.  So if all loads/stores are unpermuted
> but load/store-lanes would work we'd keep the SLP node.
> 
> Of course without actually building the SLP node we don't know whether the
> loads will be permuted or not ...
> 
> But surely the current place for the check will cause some testcases to
> become hybrid vectorizations which is likely undesirable.
> 
> So we could move the check after all SLP discovery is completed and throw it
> all away if we can and should use load/store-lanes?
> But that might then not solve Tamars issue.
> 
> Richard.
Richard Biener Nov. 4, 2020, 12:40 p.m. UTC | #8
On Tue, 3 Nov 2020, Tamar Christina wrote:

> Hi Richi,
> 
> This is a respin which includes the changes you requested.

Comments randomly ordered, I'm pasting in pieces of the patch -
sending it inline would help to get pieces properly quoted and in-order.

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 
4bd454cfb185d7036843fc7140b073f525b2ec6a..b813508d3ceaf4c54f612bc10f9aa42ffe0ce0dd 
100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
...

I miss comments in this file, see tree-vectorizer.h where we try
to document purpose of classes and fields.

Things that sticks out to me:

+    uint8_t m_arity;
+    uint8_t m_num_args;

why uint8_t and not simply unsigned int?  Not knowing what
arity / num_args should be here ;)

+    vec_info *m_vinfo;
...
+    vect_pattern (slp_tree *node, vec_info *vinfo)

so this looks like something I freed stmt_vec_info of - back-pointers
in the "wrong" direction of the logical hierarchy.  I suppose it's
just to avoid passing down vinfo where we need it?  Please do that
instead - pass down vinfo as everything else does.

The class seems to expose both very high-level (build () it!)
and very low level details (get_ifn).  The high-level one suggests
that a pattern _not_ being represented by an ifn is possible
but there's too much implementation detail already in the
vect_pattern class to make that impossible.  I guess the IFN
details could be pushed down to the simple matching class
(and that be called vect_ifn_pattern or so).

+static bool
+vect_match_slp_patterns (slp_tree *ref_node, vec_info *vinfo)
+{
+  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
+  bool found_p = false;
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt match 
--\n");
+      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
+      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
+    }

we dumped all instances after their analysis.  Maybe just
refer to the instance with its address (dump_print %p) so
lookup in the (already large) dump file is easy.

+  hash_set<slp_tree> *visited = new hash_set<slp_tree> ();
+  for (unsigned x = 0; x < num__slp_patterns; x++)
+    {
+      visited->empty ();
+      found_p |= vect_match_slp_patterns_2 (ref_node, vinfo, 
slp_patterns[x],
+                                           visited);
+    }
+
+  delete visited;

no need to new / delete, just do

  has_set<slp_tree> visited;

like everyone else.  Btw, do you really want to scan
pieces of the SLP graph (with instances being graph entries)
multiple times?  If not then you should move the visited
set to the caller instead.

+  /* TODO: Remove in final version, only here for generating debug dot 
graphs
+          from SLP tree.  */
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
+      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
+      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
+    }

now, if there was some pattern matched it is probably useful
to dump the graph (entry) again.  But only conditional on that
I think.  So can you instead make the dump conditional on
found_p and remove the start dot/end dot markers as said in the comment?

+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                            "transformation for %s not valid due to post 
"
+                            "condition\n",

not really a MSG_MISSED_OPTIMIZATION, use MSG_NOTE.
MSG_MISSED_OPTIMIZATION should be used for things (likely) making
vectorization fail.

+  /* Perform recursive matching, it's important to do this after matching 
things

before matching things?

+     in the current node as the matches here may re-order the nodes below 
it.
+     As such the pattern that needs to be subsequently match may change.  

and this is no longer true?

*/
+
+  if (SLP_TREE_CHILDREN (node).exists ()) {

elide this check, the loop will simply not run if empty

+    slp_tree child;
+    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)

I think you want to perform the recursion in the caller so you
do it only once and not once for each pattern kind now that you
do post-order processing rather than pre-order.

+  vect_pattern *pattern = patt_fn (ref_node, vinfo);
+
+  if (pattern->matches ())

this suggests you get number of SLP nodes times number of pattern
matchers allocations & inits of vect_pattern.  If you move

+      if (pattern->is_optab_supported_p (vectype, OPTIMIZE_FOR_SPEED))
+       {

into ->matches () then whether this is a IFN or multi-node pattern
becomes an implementation detail which would be IMHO better.

+  FOR_EACH_VEC_ELT (*this->m_nodes, ix, node)
+    {
+      /* Calculate the location of the statement in NODE to replace.  */
+      stmt_info = SLP_TREE_REPRESENTATIVE (node);
+      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
+      tree type = gimple_expr_type (old_stmt);
+
+      /* Create the argument set for use by 
gimple_build_call_internal_vec.  */
+      for (int i = 0; i < this->m_num_args; i++)

the scalar argument type is the type of the definition so instead
of gimple_expr_type you want simply TREE_TYPE (gimple_get_lhs (old_stmt))

But then you seem to mix the result and the argument types?  I
think you need to walk over SLP_TREE_CHILDREN and look at their
representatives scalar def.  Which would then assume the pattern
consumes all direct children.  Somehow this "generic" 
vect_simple_pattern_match::build () has quite some restrictions.
I guess they are documented in the files toplevel comment
which also has general parts applying to all possible pattern
matchers (not deriving from vect_simple_pattern_match).  Can you
split up the comment and more explicitely mark what applies to
patterns in general (postorder traversal) and what applies
to vect_simple_pattern_match only?

+      /* We have to explicitly mark the old statement as unused because 
during
+        statement analysis the original and new pattern statement may 
require

comment needs updating.

+        different level of unrolling.  As an example add/sub when 
vectorized
+        without a pattern requires 4 copies, whereas with a COMPLEX_ADD 
pattern
+        this only requires 2 copies and the two statement will be treated 
as
+        hand unrolled.  That means that the analysis won't happen as 
it'll find
+        a mismatch.  So we don't analyze the old statement and if we end 
up
+        needing it, e.g. SLP fails then we have to quickly re-analyze it.  
*/
+      STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;

I guess relevance should be copied from the original stmt (like if
it was used_by_reduction).  Can't that be done in
vect_mark_pattern_stmts?  Likewise the ->add_pattern_stmt part
could be done there, too.  So you'd be left with

+      STMT_SLP_TYPE (call_stmt_info) = pure_slp;

and

+      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;

where the pure_slp setting should in theory be not needed (it
should be automagically detected so) but setting it is not wrong
(it would be an optimization).

--

I'm now looking down the series to see how this is all used,
so quoting different pieces from other patches.

+/* Checks to see of the expression EXPR is a gimple assign with code CODE
+   and if this is the case the two operands of EXPR is returned in OP1 
and
+   OP2.
+
+   If the matching and extraction is successful TRUE is returned 
otherwise
+   FALSE in which case the value of OP1 and OP2 will not have been 
touched.
+*/
+
+static bool
+vect_match_expression_p (slp_tree node, tree_code code)
+{
+  if (!SLP_TREE_REPRESENTATIVE (node))

I see that many overall function comments do not actually match
anymore, if at least for the parameters refered to.  Please go
over the patch and update them.

+  gimple* expr = STMT_VINFO_STMT (SLP_TREE_REPRESENTATIVE (node));
+  if (!is_gimple_assign (expr)
+      || gimple_expr_code (expr) != code)

gimple_assign_rhs_code (expr)

+/* Check if the given lane permute in PERMUTES matches an alternating 
sequence
+   of {P0 P1 P0 P1 ...}.  This to account for unrolled loops.   */
+static bool
+vect_check_lane_permute (lane_permutation_t &permutes,
+                        unsigned p0, unsigned p1)
+{
+  unsigned val[2] = {p0, p1};
+  for (unsigned i = 0; i < permutes.length (); i++)
+    if (permutes[i].first != val[i % 2])
+      return false;

so this doesn't put any constraint on permutes[].second.  So it
matches an arbitrary interleaving of two vectors.

+
+   will return PLUS_MINUS along with OPS containing {_37, _12, _38, _36}.
+*/
+
+static complex_operation_t
+vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t 
&lanes,
+                    bool two_operands = true, vec<slp_tree> *ops = NULL)

 { {_37, _38}, {_12, _36} }

there's a lot of vect_match_expression_p calls and I wonder if
inlining them and CSEing the 'code' compute of node1 and node2
would simplify things and elide the macro.  Also we then know
this is all CSEd in the end ...

I wonder what LANES specifies, TWO_OPERANDS specifies whether
two-operand variants are allowed unless .. (guess LANES causes
only same operands to match up in the end).

+static void
+vect_mark_stmts_as_in_pattern (hash_set<slp_tree> *cache, slp_tree node)
+{

need to find a use for this still, but don't you set pure_slp
"above"?

+bool
+complex_pattern::validate_p ()
+{
+      unsigned i;
+      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, tmp)
+       {
+         slp_tree vnode = NULL;
+         if (vect_slp_make_linear (node, tmp, &vnode))
+           nodes.quick_push (vnode);
+         else
+           {
+             nodes.release();
+             return false;
+           }

hum.  vect_slp_make_linear better not fail?  At least we've
created half-way stuff when it does.  Can't spot the implementation
right now (guess splitting the patch up doesn't really help much ...)

Noting elsewhere:

+static void
+vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo,
+                                hash_set<slp_tree> *visited, slp_tree 
root)
+{
+  if (!root || visited->contains (root))
+    return;
+
+  unsigned int i;
+  slp_tree node;
+  stmt_vec_info related_stmt_info;
+  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (root);
+
+  visited->add (root);

visited->add () returns whether the key was already there so
please combine the contains and the add calls here and elsewhere.

    if (stmt_info && STMT_VINFO_SLP_VECT_ONLY (stmt_info)
+        && (related_stmt_info = STMT_VINFO_RELATED_STMT (stmt_info)) != 
NULL)
+      {
+       if (dump_enabled_p ())

I think STMT_VINFO_SLP_VECT_ONLY is a thing already, so I wonder
if the above is a sufficient test.  There's is_pattern_stmt_p ()
(which obviously also applies to non-SLP only patterns).

Note that I also see another issue, namely with hybrid SLP the
original non-pattern stmt would need to stay relevant.  That
probably asks for hybrid SLP discovery and relevance marking
to be combined somehow.  You can probably create a simple
testcase by storing a lane of a complex op via a non-grouped
store.

+       STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+       STMT_VINFO_RELEVANT (related_stmt_info) = vect_used_in_scope;

as said previously the relevance should be copied, in this case
back to the original stmt info from the pattern stmt info.

+       STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
+       STMT_SLP_TYPE (related_stmt_info) = loop_vect;

one option might be to delay pattern recognition (for the loop case)
to after vect_detect_hybrid_slp and simply not make any hybrid
stmt containing nodes part of a SLP pattern.  It would be a workaround
of course, not the best solution.  Another option is to
add a new field to stmt_info to put a "saved" relevance to and
simply go over all stmts restoring relevance - we already restore
SLP_TYPE to loop_vect that way at

  /* Reset SLP type to loop_vect on all stmts.  */
  for (i = 0; i < LOOP_VINFO_LOOP (loop_vinfo)->num_nodes; ++i)
    {

so simply restore the saved relevance would work there (guess I
like that more than what vect_dissolve_slp_only_patterns does,
in any case do what vect_dissolve_slp_only_patterns does in the
above loop please).

Thanks,
Richard.



> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* Makefile.in (tree-vect-slp-patterns.o): New.
> 	* doc/passes.texi: Update documentation.
> 	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
> 	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
> 	(vect_analyze_slp): Call pattern matcher.
> 	* tree-vectorizer.h (enum _complex_operation):
> 	(class vect_pattern_match, class vect_pattern): New.
> 	* tree-vect-slp-patterns.c: New file.
> 
> > -----Original Message-----
> > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > Behalf Of Richard Biener
> > Sent: Tuesday, September 29, 2020 10:42 AM
> > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>; gcc-
> > patches@gcc.gnu.org
> > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > scaffolding.
> > 
> > On Tue, 29 Sep 2020, Richard Sandiford wrote:
> > 
> > > Richard Biener <rguenther@suse.de> writes:
> > > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info
> > *vinfo,
> > > >> > >                               &tree_size, bst_map);
> > > >> > >    if (node != NULL)
> > > >> > >      {
> > > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > > >> > > +      vinfo->stmt_vec_info_ro = false;
> > > >> > > +
> > > >> > > +      /* See if any patterns can be found in the constructed SLP tree
> > > >> > > +        before we do any analysis on it.  */
> > > >> > > +      vect_match_slp_patterns (node, vinfo, group_size,
> > &max_nunits,
> > > >> > > +                              matches, &npermutes, &tree_size,
> > > >> > > + bst_map);
> > > >> > > +
> > > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > > >> > > +      vinfo->stmt_vec_info_ro = true;
> > > >> > > +
> > > >> > >
> > > >> > > I think this is a bit early to match patterns - I'd defer it to
> > > >> > > the point where all entries into the same SLP subgraph are
> > > >> > > analyzed, thus somewhere at the end of vect_analyze_slp loop
> > > >> > > over all instances and match patterns?  That way phases are more
> > clearly separated.
> > > >> >
> > > >> > That would probably work, my only worry is that the SLP analysis
> > > >> > itself may fail and bail out at
> > > >> >
> > > >> > 	  /* If the loads and stores can be handled with load/store-lane
> > > >> > 	     instructions do not generate this SLP instance.  */
> > > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > > >> > 	      && loads_permuted
> > > >> > 	      && dr && vect_store_lanes_supported (vectype, group_size,
> > > >> > false))
> > > >
> > > > Ah, that piece of code.  Yeah, I'm repeatedly running into it as
> > > > well - it's a bad hack that stands in the way all the time :/
> > >
> > > At one point I was wondering about trying to drop the above, vectorise
> > > with and without SLP, and then compare their costs, like for
> > VECT_COMPARE_COSTS.
> > > But that seemed like a dead end with the move to doing everything on
> > > the SLP representation.
> > 
> > Yeah ... though even moving everything to the SLP representation will retain
> > the issue since there it will be N group-size 1 SLP instances vs. 1 group-size N
> > SLP instance.
> > 
> > > > I guess we should try moving this upward like to
> > > > vect_analyze_loop_2 right before
> > > >
> > > >   /* Check the SLP opportunities in the loop, analyze and build SLP trees.
> > > > */
> > > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > > >   if (!ok)
> > > >     return ok;
> > > >
> > > > and there check whether all grouped loads and stores can be handled
> > > > with store-/load-lanes (and there are no SLP reduction chains?) in
> > > > which case do not try to attempt SLP at all.  Because the testcases
> > > > this check was supposed to change were all-load/store-lane or all
> > > > SLP so the mixed case is probably not worth special casing.
> > > >
> > > > Since load-/store-lanes is an arm speciality I tried to only touch
> > > > this fragile part with a ten-foot pole ;)  CCing Richard, if he acks
> > > > the above I can produce a patch.
> > >
> > > Yeah, sounds good to me.  Probably also sorth checking whether the
> > > likely_max iteration count is high enough to support group_size
> > > vectors, if we have enough information to guess that.
> > >
> > > We could also get the gen* machinery to emit a macro that is true if
> > > at least one load/store-lane pattern is defined, so that we can skip
> > > the code for non-Arm targets.  I can do that as a follow-up.
> > 
> > I've had a second look and one complication is that we only elide the SLP
> > node if any of the loads are permuted.  So if all loads/stores are unpermuted
> > but load/store-lanes would work we'd keep the SLP node.
> > 
> > Of course without actually building the SLP node we don't know whether the
> > loads will be permuted or not ...
> > 
> > But surely the current place for the check will cause some testcases to
> > become hybrid vectorizations which is likely undesirable.
> > 
> > So we could move the check after all SLP discovery is completed and throw it
> > all away if we can and should use load/store-lanes?
> > But that might then not solve Tamars issue.
> > 
> > Richard.
>
Tamar Christina Nov. 4, 2020, 1:37 p.m. UTC | #9
> -----Original Message-----
> From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Wednesday, November 4, 2020 12:41 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> scaffolding.
> 
> On Tue, 3 Nov 2020, Tamar Christina wrote:
> 
> > Hi Richi,
> >
> > This is a respin which includes the changes you requested.
> 
> Comments randomly ordered, I'm pasting in pieces of the patch - sending it
> inline would help to get pieces properly quoted and in-order.
> 
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index
> 4bd454cfb185d7036843fc7140b073f525b2ec6a..b813508d3ceaf4c54f612bc10f9
> aa42ffe0ce0dd
> 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> ...
> 
> I miss comments in this file, see tree-vectorizer.h where we try to document
> purpose of classes and fields.
> 
> Things that sticks out to me:
> 
> +    uint8_t m_arity;
> +    uint8_t m_num_args;
> 
> why uint8_t and not simply unsigned int?  Not knowing what arity /
> num_args should be here ;)

I think I can remove arity, but num_args is how many operands the created
internal function call should take.  Since we can't vectorize calls with more than
4 arguments at the moment it seemed like 255 would be a safe limit :).

> 
> +    vec_info *m_vinfo;
> ...
> +    vect_pattern (slp_tree *node, vec_info *vinfo)
> 
> so this looks like something I freed stmt_vec_info of - back-pointers in the
> "wrong" direction of the logical hierarchy.  I suppose it's just to avoid passing
> down vinfo where we need it?  Please do that instead - pass down vinfo as
> everything else does.
> 
> The class seems to expose both very high-level (build () it!) and very low
> level details (get_ifn).  The high-level one suggests that a pattern _not_
> being represented by an ifn is possible but there's too much implementation
> detail already in the vect_pattern class to make that impossible.  I guess the
> IFN details could be pushed down to the simple matching class (and that be
> called vect_ifn_pattern or so).
> 
> +static bool
> +vect_match_slp_patterns (slp_tree *ref_node, vec_info *vinfo) {
> +  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
> +  bool found_p = false;
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt match
> --\n");
> +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> +      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
> +    }
> 
> we dumped all instances after their analysis.  Maybe just refer to the
> instance with its address (dump_print %p) so lookup in the (already large)
> dump file is easy.
> 
> +  hash_set<slp_tree> *visited = new hash_set<slp_tree> ();  for
> + (unsigned x = 0; x < num__slp_patterns; x++)
> +    {
> +      visited->empty ();
> +      found_p |= vect_match_slp_patterns_2 (ref_node, vinfo,
> slp_patterns[x],
> +                                           visited);
> +    }
> +
> +  delete visited;
> 
> no need to new / delete, just do
> 
>   has_set<slp_tree> visited;
> 
> like everyone else.  Btw, do you really want to scan pieces of the SLP graph
> (with instances being graph entries) multiple times?  If not then you should
> move the visited set to the caller instead.
> 
> +  /* TODO: Remove in final version, only here for generating debug dot
> graphs
> +          from SLP tree.  */
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
> +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> +      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
> +    }
> 
> now, if there was some pattern matched it is probably useful to dump the
> graph (entry) again.  But only conditional on that I think.  So can you instead
> make the dump conditional on found_p and remove the start dot/end dot
> markers as said in the comment?
> 
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "transformation for %s not valid due to
> + post
> "
> +                            "condition\n",
> 
> not really a MSG_MISSED_OPTIMIZATION, use MSG_NOTE.
> MSG_MISSED_OPTIMIZATION should be used for things (likely) making
> vectorization fail.
> 
> +  /* Perform recursive matching, it's important to do this after
> + matching
> things
> 
> before matching things?
> 
> +     in the current node as the matches here may re-order the nodes
> + below
> it.
> +     As such the pattern that needs to be subsequently match may change.
> 
> and this is no longer true?
> 
> */
> +
> +  if (SLP_TREE_CHILDREN (node).exists ()) {
> 
> elide this check, the loop will simply not run if empty
> 
> +    slp_tree child;
> +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> 
> I think you want to perform the recursion in the caller so you do it only once
> and not once for each pattern kind now that you do post-order processing
> rather than pre-order.

That wouldn't work as that would make multiply match before FMA/FMS.
I had tried having FM(A|S) match MUL instead, but MUL does a lot of work to
create permute nodes, which would have to be undone.

The FMS case is somewhat easy as that still has a two operators node in between,
But FMA is just a simple add on top.  I have to then remove the merging nodes that MUL
Created, and drop half the operands to create the new node.

Which is doable but effectively does the same pattern twice, whereas now the MUL would
just skip all that work since FMA matched. 

The post order is essentially really problematic for overlapping patterns.  In pre-order I could for
Instance detect all 3 patterns in one traversal and share the matching information between all 3.

Right now I can only do it between mul and add and FMA has to be standalone.

> 
> +  vect_pattern *pattern = patt_fn (ref_node, vinfo);
> +
> +  if (pattern->matches ())
> 
> this suggests you get number of SLP nodes times number of pattern
> matchers allocations & inits of vect_pattern.  If you move
> 
> +      if (pattern->is_optab_supported_p (vectype, OPTIMIZE_FOR_SPEED))
> +       {
> 
> into ->matches () then whether this is a IFN or multi-node pattern becomes
> an implementation detail which would be IMHO better.
> 
> +  FOR_EACH_VEC_ELT (*this->m_nodes, ix, node)
> +    {
> +      /* Calculate the location of the statement in NODE to replace.  */
> +      stmt_info = SLP_TREE_REPRESENTATIVE (node);
> +      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
> +      tree type = gimple_expr_type (old_stmt);
> +
> +      /* Create the argument set for use by
> gimple_build_call_internal_vec.  */
> +      for (int i = 0; i < this->m_num_args; i++)
> 
> the scalar argument type is the type of the definition so instead of
> gimple_expr_type you want simply TREE_TYPE (gimple_get_lhs (old_stmt))
> 
> But then you seem to mix the result and the argument types?  I think you
> need to walk over SLP_TREE_CHILDREN and look at their representatives
> scalar def.  Which would then assume the pattern consumes all direct
> children.  Somehow this "generic"

Since I'm only consuming simple math ops would the argument and result types
differ? Something like a widening sub I can see but not the normal sub.

> vect_simple_pattern_match::build () has quite some restrictions.
> I guess they are documented in the files toplevel comment which also has
> general parts applying to all possible pattern matchers (not deriving from
> vect_simple_pattern_match).  Can you split up the comment and more
> explicitely mark what applies to patterns in general (postorder traversal) and
> what applies to vect_simple_pattern_match only?
> 
> +      /* We have to explicitly mark the old statement as unused because
> during
> +        statement analysis the original and new pattern statement may
> require
> 
> comment needs updating.
> 
> +        different level of unrolling.  As an example add/sub when
> vectorized
> +        without a pattern requires 4 copies, whereas with a COMPLEX_ADD
> pattern
> +        this only requires 2 copies and the two statement will be treated
> as
> +        hand unrolled.  That means that the analysis won't happen as
> it'll find
> +        a mismatch.  So we don't analyze the old statement and if we end
> up
> +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> */
> +      STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> 
> I guess relevance should be copied from the original stmt (like if
> it was used_by_reduction).  Can't that be done in
> vect_mark_pattern_stmts?  Likewise the ->add_pattern_stmt part
> could be done there, too.  So you'd be left with
> 
> +      STMT_SLP_TYPE (call_stmt_info) = pure_slp;
> 
> and
> 
> +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> 
> where the pure_slp setting should in theory be not needed (it
> should be automagically detected so) but setting it is not wrong
> (it would be an optimization).

I am using it as a marker for a node dissolve needs to look at when undoing SLP
so it restores the relevancy, but given your comment below on that I may be able
to get rid of it.

> 
> --
> 
> I'm now looking down the series to see how this is all used,
> so quoting different pieces from other patches.
> 
> +/* Checks to see of the expression EXPR is a gimple assign with code CODE
> +   and if this is the case the two operands of EXPR is returned in OP1
> and
> +   OP2.
> +
> +   If the matching and extraction is successful TRUE is returned
> otherwise
> +   FALSE in which case the value of OP1 and OP2 will not have been
> touched.
> +*/
> +
> +static bool
> +vect_match_expression_p (slp_tree node, tree_code code)
> +{
> +  if (!SLP_TREE_REPRESENTATIVE (node))
> 
> I see that many overall function comments do not actually match
> anymore, if at least for the parameters refered to.  Please go
> over the patch and update them.
> 
> +  gimple* expr = STMT_VINFO_STMT (SLP_TREE_REPRESENTATIVE (node));
> +  if (!is_gimple_assign (expr)
> +      || gimple_expr_code (expr) != code)
> 
> gimple_assign_rhs_code (expr)
> 
> +/* Check if the given lane permute in PERMUTES matches an alternating
> sequence
> +   of {P0 P1 P0 P1 ...}.  This to account for unrolled loops.   */
> +static bool
> +vect_check_lane_permute (lane_permutation_t &permutes,
> +                        unsigned p0, unsigned p1)
> +{
> +  unsigned val[2] = {p0, p1};
> +  for (unsigned i = 0; i < permutes.length (); i++)
> +    if (permutes[i].first != val[i % 2])
> +      return false;
> 
> so this doesn't put any constraint on permutes[].second.  So it
> matches an arbitrary interleaving of two vectors.
> 
> +
> +   will return PLUS_MINUS along with OPS containing {_37, _12, _38, _36}.
> +*/
> +
> +static complex_operation_t
> +vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t
> &lanes,
> +                    bool two_operands = true, vec<slp_tree> *ops = NULL)
> 
>  { {_37, _38}, {_12, _36} }
> 
> there's a lot of vect_match_expression_p calls and I wonder if
> inlining them and CSEing the 'code' compute of node1 and node2
> would simplify things and elide the macro.  Also we then know
> this is all CSEd in the end ...
> 
> I wonder what LANES specifies, TWO_OPERANDS specifies whether
> two-operand variants are allowed unless .. (guess LANES causes
> only same operands to match up in the end).
> 

I'm guarding against that you could technically speaking have a +/- node
that does reverses the order of the elements below it, in which case that's
a different operation than what we want to guard against.

I hadn't been able to prove to myself that SLP build would never do this.

> +static void
> +vect_mark_stmts_as_in_pattern (hash_set<slp_tree> *cache, slp_tree
> node)
> +{
> 
> need to find a use for this still, but don't you set pure_slp
> "above"?

This is used to mark the elided nodes.  For instance in a complex multiply the
* nodes in the SLP tree become unused.  While not having them in the SLP tree
is simple, the detection of pure/hybrid is led from the BB statements so they would
mark the elided nodes as being hybrid.

There is of course the problem of is this node used in any other SLP instance, it may force
the analysis to think the SLP tree is pure, but if SLP fails it would try not SLP anyway as it
would have done eventually if Hybrid.

> 
> +bool
> +complex_pattern::validate_p ()
> +{
> +      unsigned i;
> +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, tmp)
> +       {
> +         slp_tree vnode = NULL;
> +         if (vect_slp_make_linear (node, tmp, &vnode))
> +           nodes.quick_push (vnode);
> +         else
> +           {
> +             nodes.release();
> +             return false;
> +           }
> 
> hum.  vect_slp_make_linear better not fail?  At least we've
> created half-way stuff when it does.  Can't spot the implementation
> right now (guess splitting the patch up doesn't really help much ...)
> 

It can fail, so I see I need to free some nodes otherwise I'll leak... But
It's not a correctness issue. Nothing It changed on NODE until it all
succeeds.

> Noting elsewhere:
> 
> +static void
> +vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo,
> +                                hash_set<slp_tree> *visited, slp_tree
> root)
> +{
> +  if (!root || visited->contains (root))
> +    return;
> +
> +  unsigned int i;
> +  slp_tree node;
> +  stmt_vec_info related_stmt_info;
> +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (root);
> +
> +  visited->add (root);
> 
> visited->add () returns whether the key was already there so
> please combine the contains and the add calls here and elsewhere.
> 
>     if (stmt_info && STMT_VINFO_SLP_VECT_ONLY (stmt_info)
> +        && (related_stmt_info = STMT_VINFO_RELATED_STMT (stmt_info)) !=
> NULL)
> +      {
> +       if (dump_enabled_p ())
> 
> I think STMT_VINFO_SLP_VECT_ONLY is a thing already, so I wonder
> if the above is a sufficient test.  There's is_pattern_stmt_p ()
> (which obviously also applies to non-SLP only patterns).
> 
> Note that I also see another issue, namely with hybrid SLP the
> original non-pattern stmt would need to stay relevant.  That
> probably asks for hybrid SLP discovery and relevance marking
> to be combined somehow.  You can probably create a simple
> testcase by storing a lane of a complex op via a non-grouped
> store.
> 
> +       STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> +       STMT_VINFO_RELEVANT (related_stmt_info) = vect_used_in_scope;
> 
> as said previously the relevance should be copied, in this case
> back to the original stmt info from the pattern stmt info.
> 
> +       STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
> +       STMT_SLP_TYPE (related_stmt_info) = loop_vect;
> 
> one option might be to delay pattern recognition (for the loop case)
> to after vect_detect_hybrid_slp and simply not make any hybrid
> stmt containing nodes part of a SLP pattern.  It would be a workaround
> of course, not the best solution.  Another option is to
> add a new field to stmt_info to put a "saved" relevance to and
> simply go over all stmts restoring relevance - we already restore
> SLP_TYPE to loop_vect that way at

Yes.. ideally I would want something like telling it "don't worry about this if
you end up being pure SLP".

I haven't tried with this new approach to instead add it to the PATTERN_SEQ
for the statement.   Last time I tried however it looked like in order to get this
to work correctly I'd need a bigger hack..

Thanks,
Tamar

> 
>   /* Reset SLP type to loop_vect on all stmts.  */
>   for (i = 0; i < LOOP_VINFO_LOOP (loop_vinfo)->num_nodes; ++i)
>     {
> 
> so simply restore the saved relevance would work there (guess I
> like that more than what vect_dissolve_slp_only_patterns does,
> in any case do what vect_dissolve_slp_only_patterns does in the
> above loop please).
> 
> Thanks,
> Richard.
> 
> 
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > 	* doc/passes.texi: Update documentation.
> > 	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
> > 	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
> > 	(vect_analyze_slp): Call pattern matcher.
> > 	* tree-vectorizer.h (enum _complex_operation):
> > 	(class vect_pattern_match, class vect_pattern): New.
> > 	* tree-vect-slp-patterns.c: New file.
> >
> > > -----Original Message-----
> > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > > Behalf Of Richard Biener
> > > Sent: Tuesday, September 29, 2020 10:42 AM
> > > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > > Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> gcc-
> > > patches@gcc.gnu.org
> > > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > > scaffolding.
> > >
> > > On Tue, 29 Sep 2020, Richard Sandiford wrote:
> > >
> > > > Richard Biener <rguenther@suse.de> writes:
> > > > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info
> > > *vinfo,
> > > > >> > >                               &tree_size, bst_map);
> > > > >> > >    if (node != NULL)
> > > > >> > >      {
> > > > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > > > >> > > +      vinfo->stmt_vec_info_ro = false;
> > > > >> > > +
> > > > >> > > +      /* See if any patterns can be found in the constructed SLP
> tree
> > > > >> > > +        before we do any analysis on it.  */
> > > > >> > > +      vect_match_slp_patterns (node, vinfo, group_size,
> > > &max_nunits,
> > > > >> > > +                              matches, &npermutes, &tree_size,
> > > > >> > > + bst_map);
> > > > >> > > +
> > > > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > > > >> > > +      vinfo->stmt_vec_info_ro = true;
> > > > >> > > +
> > > > >> > >
> > > > >> > > I think this is a bit early to match patterns - I'd defer it to
> > > > >> > > the point where all entries into the same SLP subgraph are
> > > > >> > > analyzed, thus somewhere at the end of vect_analyze_slp loop
> > > > >> > > over all instances and match patterns?  That way phases are
> more
> > > clearly separated.
> > > > >> >
> > > > >> > That would probably work, my only worry is that the SLP analysis
> > > > >> > itself may fail and bail out at
> > > > >> >
> > > > >> > 	  /* If the loads and stores can be handled with load/store-
> lane
> > > > >> > 	     instructions do not generate this SLP instance.  */
> > > > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > > > >> > 	      && loads_permuted
> > > > >> > 	      && dr && vect_store_lanes_supported (vectype,
> group_size,
> > > > >> > false))
> > > > >
> > > > > Ah, that piece of code.  Yeah, I'm repeatedly running into it as
> > > > > well - it's a bad hack that stands in the way all the time :/
> > > >
> > > > At one point I was wondering about trying to drop the above, vectorise
> > > > with and without SLP, and then compare their costs, like for
> > > VECT_COMPARE_COSTS.
> > > > But that seemed like a dead end with the move to doing everything on
> > > > the SLP representation.
> > >
> > > Yeah ... though even moving everything to the SLP representation will
> retain
> > > the issue since there it will be N group-size 1 SLP instances vs. 1 group-
> size N
> > > SLP instance.
> > >
> > > > > I guess we should try moving this upward like to
> > > > > vect_analyze_loop_2 right before
> > > > >
> > > > >   /* Check the SLP opportunities in the loop, analyze and build SLP
> trees.
> > > > > */
> > > > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > > > >   if (!ok)
> > > > >     return ok;
> > > > >
> > > > > and there check whether all grouped loads and stores can be handled
> > > > > with store-/load-lanes (and there are no SLP reduction chains?) in
> > > > > which case do not try to attempt SLP at all.  Because the testcases
> > > > > this check was supposed to change were all-load/store-lane or all
> > > > > SLP so the mixed case is probably not worth special casing.
> > > > >
> > > > > Since load-/store-lanes is an arm speciality I tried to only touch
> > > > > this fragile part with a ten-foot pole ;)  CCing Richard, if he acks
> > > > > the above I can produce a patch.
> > > >
> > > > Yeah, sounds good to me.  Probably also sorth checking whether the
> > > > likely_max iteration count is high enough to support group_size
> > > > vectors, if we have enough information to guess that.
> > > >
> > > > We could also get the gen* machinery to emit a macro that is true if
> > > > at least one load/store-lane pattern is defined, so that we can skip
> > > > the code for non-Arm targets.  I can do that as a follow-up.
> > >
> > > I've had a second look and one complication is that we only elide the SLP
> > > node if any of the loads are permuted.  So if all loads/stores are
> unpermuted
> > > but load/store-lanes would work we'd keep the SLP node.
> > >
> > > Of course without actually building the SLP node we don't know whether
> the
> > > loads will be permuted or not ...
> > >
> > > But surely the current place for the check will cause some testcases to
> > > become hybrid vectorizations which is likely undesirable.
> > >
> > > So we could move the check after all SLP discovery is completed and
> throw it
> > > all away if we can and should use load/store-lanes?
> > > But that might then not solve Tamars issue.
> > >
> > > Richard.
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg,
> Germany; GF: Felix Imend
Richard Biener Nov. 4, 2020, 2:04 p.m. UTC | #10
On Wed, 4 Nov 2020, Tamar Christina wrote:

> > -----Original Message-----
> > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > Behalf Of Richard Biener
> > Sent: Wednesday, November 4, 2020 12:41 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> > gcc-patches@gcc.gnu.org
> > Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > scaffolding.
> > 
> > On Tue, 3 Nov 2020, Tamar Christina wrote:
> > 
> > > Hi Richi,
> > >
> > > This is a respin which includes the changes you requested.
> > 
> > Comments randomly ordered, I'm pasting in pieces of the patch - sending it
> > inline would help to get pieces properly quoted and in-order.
> > 
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index
> > 4bd454cfb185d7036843fc7140b073f525b2ec6a..b813508d3ceaf4c54f612bc10f9
> > aa42ffe0ce0dd
> > 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > ...
> > 
> > I miss comments in this file, see tree-vectorizer.h where we try to document
> > purpose of classes and fields.
> > 
> > Things that sticks out to me:
> > 
> > +    uint8_t m_arity;
> > +    uint8_t m_num_args;
> > 
> > why uint8_t and not simply unsigned int?  Not knowing what arity /
> > num_args should be here ;)
> 
> I think I can remove arity, but num_args is how many operands the created
> internal function call should take.  Since we can't vectorize calls with more than
> 4 arguments at the moment it seemed like 255 would be a safe limit :).
> 
> > 
> > +    vec_info *m_vinfo;
> > ...
> > +    vect_pattern (slp_tree *node, vec_info *vinfo)
> > 
> > so this looks like something I freed stmt_vec_info of - back-pointers in the
> > "wrong" direction of the logical hierarchy.  I suppose it's just to avoid passing
> > down vinfo where we need it?  Please do that instead - pass down vinfo as
> > everything else does.
> > 
> > The class seems to expose both very high-level (build () it!) and very low
> > level details (get_ifn).  The high-level one suggests that a pattern _not_
> > being represented by an ifn is possible but there's too much implementation
> > detail already in the vect_pattern class to make that impossible.  I guess the
> > IFN details could be pushed down to the simple matching class (and that be
> > called vect_ifn_pattern or so).
> > 
> > +static bool
> > +vect_match_slp_patterns (slp_tree *ref_node, vec_info *vinfo) {
> > +  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
> > +  bool found_p = false;
> > +
> > +  if (dump_enabled_p ())
> > +    {
> > +      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt match
> > --\n");
> > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
> > +    }
> > 
> > we dumped all instances after their analysis.  Maybe just refer to the
> > instance with its address (dump_print %p) so lookup in the (already large)
> > dump file is easy.
> > 
> > +  hash_set<slp_tree> *visited = new hash_set<slp_tree> ();  for
> > + (unsigned x = 0; x < num__slp_patterns; x++)
> > +    {
> > +      visited->empty ();
> > +      found_p |= vect_match_slp_patterns_2 (ref_node, vinfo,
> > slp_patterns[x],
> > +                                           visited);
> > +    }
> > +
> > +  delete visited;
> > 
> > no need to new / delete, just do
> > 
> >   has_set<slp_tree> visited;
> > 
> > like everyone else.  Btw, do you really want to scan pieces of the SLP graph
> > (with instances being graph entries) multiple times?  If not then you should
> > move the visited set to the caller instead.
> > 
> > +  /* TODO: Remove in final version, only here for generating debug dot
> > graphs
> > +          from SLP tree.  */
> > +
> > +  if (dump_enabled_p ())
> > +    {
> > +      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
> > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
> > +    }
> > 
> > now, if there was some pattern matched it is probably useful to dump the
> > graph (entry) again.  But only conditional on that I think.  So can you instead
> > make the dump conditional on found_p and remove the start dot/end dot
> > markers as said in the comment?
> > 
> > +         if (dump_enabled_p ())
> > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                            "transformation for %s not valid due to
> > + post
> > "
> > +                            "condition\n",
> > 
> > not really a MSG_MISSED_OPTIMIZATION, use MSG_NOTE.
> > MSG_MISSED_OPTIMIZATION should be used for things (likely) making
> > vectorization fail.
> > 
> > +  /* Perform recursive matching, it's important to do this after
> > + matching
> > things
> > 
> > before matching things?
> > 
> > +     in the current node as the matches here may re-order the nodes
> > + below
> > it.
> > +     As such the pattern that needs to be subsequently match may change.
> > 
> > and this is no longer true?
> > 
> > */
> > +
> > +  if (SLP_TREE_CHILDREN (node).exists ()) {
> > 
> > elide this check, the loop will simply not run if empty
> > 
> > +    slp_tree child;
> > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > 
> > I think you want to perform the recursion in the caller so you do it only once
> > and not once for each pattern kind now that you do post-order processing
> > rather than pre-order.
> 
> That wouldn't work as that would make multiply match before FMA/FMS.

Hmm, I see.  Note that to avoid visiting multiple entries into the 
graph multiple times you then need another visited set or compute
pre- or post-order of the entire graph into an array separately.

> I had tried having FM(A|S) match MUL instead, but MUL does a lot of work to
> create permute nodes, which would have to be undone.

Nevertheless I think that's what regular pattern matching does.  Is
matching FMA from "scalar" really that much simpler than matching
first MUL and then FMA with a MUL component?

> The FMS case is somewhat easy as that still has a two operators node in between,
> But FMA is just a simple add on top.  I have to then remove the merging nodes that MUL
> Created, and drop half the operands to create the new node.
> 
> Which is doable but effectively does the same pattern twice, whereas now the MUL would
> just skip all that work since FMA matched. 
> 
> The post order is essentially really problematic for overlapping patterns.  In pre-order I could for
> Instance detect all 3 patterns in one traversal and share the matching information between all 3.

I do remember you doing this in some earlier variant but it looked
equally complicated.  What you've now done is do a post-order walk
for each pattern kind.

So I'd like to get the recursion up to fix the 'visited' thing.  We
can either have ->pre_match and ->post_match (ugh) or bite the bullet
and do FM(A|S) ontop of MUL (I believe for the earlier variant you
said that would be possible).

> Right now I can only do it between mul and add and FMA has to be standalone.
> 
> > 
> > +  vect_pattern *pattern = patt_fn (ref_node, vinfo);
> > +
> > +  if (pattern->matches ())
> > 
> > this suggests you get number of SLP nodes times number of pattern
> > matchers allocations & inits of vect_pattern.  If you move
> > 
> > +      if (pattern->is_optab_supported_p (vectype, OPTIMIZE_FOR_SPEED))
> > +       {
> > 
> > into ->matches () then whether this is a IFN or multi-node pattern becomes
> > an implementation detail which would be IMHO better.
> > 
> > +  FOR_EACH_VEC_ELT (*this->m_nodes, ix, node)
> > +    {
> > +      /* Calculate the location of the statement in NODE to replace.  */
> > +      stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > +      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
> > +      tree type = gimple_expr_type (old_stmt);
> > +
> > +      /* Create the argument set for use by
> > gimple_build_call_internal_vec.  */
> > +      for (int i = 0; i < this->m_num_args; i++)
> > 
> > the scalar argument type is the type of the definition so instead of
> > gimple_expr_type you want simply TREE_TYPE (gimple_get_lhs (old_stmt))
> > 
> > But then you seem to mix the result and the argument types?  I think you
> > need to walk over SLP_TREE_CHILDREN and look at their representatives
> > scalar def.  Which would then assume the pattern consumes all direct
> > children.  Somehow this "generic"
> 
> Since I'm only consuming simple math ops would the argument and result types
> differ? Something like a widening sub I can see but not the normal sub.

No.  Still gimple_expr_type is somewhat of an odd beast so better
avoid it.  And document this restriction.

> > vect_simple_pattern_match::build () has quite some restrictions.
> > I guess they are documented in the files toplevel comment which also has
> > general parts applying to all possible pattern matchers (not deriving from
> > vect_simple_pattern_match).  Can you split up the comment and more
> > explicitely mark what applies to patterns in general (postorder traversal) and
> > what applies to vect_simple_pattern_match only?
> > 
> > +      /* We have to explicitly mark the old statement as unused because
> > during
> > +        statement analysis the original and new pattern statement may
> > require
> > 
> > comment needs updating.
> > 
> > +        different level of unrolling.  As an example add/sub when
> > vectorized
> > +        without a pattern requires 4 copies, whereas with a COMPLEX_ADD
> > pattern
> > +        this only requires 2 copies and the two statement will be treated
> > as
> > +        hand unrolled.  That means that the analysis won't happen as
> > it'll find
> > +        a mismatch.  So we don't analyze the old statement and if we end
> > up
> > +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> > */
> > +      STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> > 
> > I guess relevance should be copied from the original stmt (like if
> > it was used_by_reduction).  Can't that be done in
> > vect_mark_pattern_stmts?  Likewise the ->add_pattern_stmt part
> > could be done there, too.  So you'd be left with
> > 
> > +      STMT_SLP_TYPE (call_stmt_info) = pure_slp;
> > 
> > and
> > 
> > +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> > 
> > where the pure_slp setting should in theory be not needed (it
> > should be automagically detected so) but setting it is not wrong
> > (it would be an optimization).
> 
> I am using it as a marker for a node dissolve needs to look at when undoing SLP
> so it restores the relevancy, but given your comment below on that I may be able
> to get rid of it.
> 
> > 
> > --
> > 
> > I'm now looking down the series to see how this is all used,
> > so quoting different pieces from other patches.
> > 
> > +/* Checks to see of the expression EXPR is a gimple assign with code CODE
> > +   and if this is the case the two operands of EXPR is returned in OP1
> > and
> > +   OP2.
> > +
> > +   If the matching and extraction is successful TRUE is returned
> > otherwise
> > +   FALSE in which case the value of OP1 and OP2 will not have been
> > touched.
> > +*/
> > +
> > +static bool
> > +vect_match_expression_p (slp_tree node, tree_code code)
> > +{
> > +  if (!SLP_TREE_REPRESENTATIVE (node))
> > 
> > I see that many overall function comments do not actually match
> > anymore, if at least for the parameters refered to.  Please go
> > over the patch and update them.
> > 
> > +  gimple* expr = STMT_VINFO_STMT (SLP_TREE_REPRESENTATIVE (node));
> > +  if (!is_gimple_assign (expr)
> > +      || gimple_expr_code (expr) != code)
> > 
> > gimple_assign_rhs_code (expr)
> > 
> > +/* Check if the given lane permute in PERMUTES matches an alternating
> > sequence
> > +   of {P0 P1 P0 P1 ...}.  This to account for unrolled loops.   */
> > +static bool
> > +vect_check_lane_permute (lane_permutation_t &permutes,
> > +                        unsigned p0, unsigned p1)
> > +{
> > +  unsigned val[2] = {p0, p1};
> > +  for (unsigned i = 0; i < permutes.length (); i++)
> > +    if (permutes[i].first != val[i % 2])
> > +      return false;
> > 
> > so this doesn't put any constraint on permutes[].second.  So it
> > matches an arbitrary interleaving of two vectors.
> > 
> > +
> > +   will return PLUS_MINUS along with OPS containing {_37, _12, _38, _36}.
> > +*/
> > +
> > +static complex_operation_t
> > +vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t
> > &lanes,
> > +                    bool two_operands = true, vec<slp_tree> *ops = NULL)
> > 
> >  { {_37, _38}, {_12, _36} }
> > 
> > there's a lot of vect_match_expression_p calls and I wonder if
> > inlining them and CSEing the 'code' compute of node1 and node2
> > would simplify things and elide the macro.  Also we then know
> > this is all CSEd in the end ...
> > 
> > I wonder what LANES specifies, TWO_OPERANDS specifies whether
> > two-operand variants are allowed unless .. (guess LANES causes
> > only same operands to match up in the end).
> > 
> 
> I'm guarding against that you could technically speaking have a +/- node
> that does reverses the order of the elements below it, in which case that's
> a different operation than what we want to guard against.
> 
> I hadn't been able to prove to myself that SLP build would never do this.
> 
> > +static void
> > +vect_mark_stmts_as_in_pattern (hash_set<slp_tree> *cache, slp_tree
> > node)
> > +{
> > 
> > need to find a use for this still, but don't you set pure_slp
> > "above"?
> 
> This is used to mark the elided nodes.  For instance in a complex multiply the
> * nodes in the SLP tree become unused.  While not having them in the SLP tree
> is simple, the detection of pure/hybrid is led from the BB statements so they would
> mark the elided nodes as being hybrid.

What does regular pattern matching do here?  Ah, the relevance compute
is after pattern matching.  But you don't know whether it's still
relevant (for hybrid operation).

> There is of course the problem of is this node used in any other SLP instance, it may force
> the analysis to think the SLP tree is pure, but if SLP fails it would try not SLP anyway as it
> would have done eventually if Hybrid.

I still think the code will have some issues with relevance and
pure vs. hybrid ...

> > 
> > +bool
> > +complex_pattern::validate_p ()
> > +{
> > +      unsigned i;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, tmp)
> > +       {
> > +         slp_tree vnode = NULL;
> > +         if (vect_slp_make_linear (node, tmp, &vnode))
> > +           nodes.quick_push (vnode);
> > +         else
> > +           {
> > +             nodes.release();
> > +             return false;
> > +           }
> > 
> > hum.  vect_slp_make_linear better not fail?  At least we've
> > created half-way stuff when it does.  Can't spot the implementation
> > right now (guess splitting the patch up doesn't really help much ...)
> > 
> 
> It can fail, so I see I need to free some nodes otherwise I'll leak... But
> It's not a correctness issue. Nothing It changed on NODE until it all
> succeeds.
> 
> > Noting elsewhere:
> > 
> > +static void
> > +vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo,
> > +                                hash_set<slp_tree> *visited, slp_tree
> > root)
> > +{
> > +  if (!root || visited->contains (root))
> > +    return;
> > +
> > +  unsigned int i;
> > +  slp_tree node;
> > +  stmt_vec_info related_stmt_info;
> > +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (root);
> > +
> > +  visited->add (root);
> > 
> > visited->add () returns whether the key was already there so
> > please combine the contains and the add calls here and elsewhere.
> > 
> >     if (stmt_info && STMT_VINFO_SLP_VECT_ONLY (stmt_info)
> > +        && (related_stmt_info = STMT_VINFO_RELATED_STMT (stmt_info)) !=
> > NULL)
> > +      {
> > +       if (dump_enabled_p ())
> > 
> > I think STMT_VINFO_SLP_VECT_ONLY is a thing already, so I wonder
> > if the above is a sufficient test.  There's is_pattern_stmt_p ()
> > (which obviously also applies to non-SLP only patterns).
> > 
> > Note that I also see another issue, namely with hybrid SLP the
> > original non-pattern stmt would need to stay relevant.  That
> > probably asks for hybrid SLP discovery and relevance marking
> > to be combined somehow.  You can probably create a simple
> > testcase by storing a lane of a complex op via a non-grouped
> > store.
> > 
> > +       STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > +       STMT_VINFO_RELEVANT (related_stmt_info) = vect_used_in_scope;
> > 
> > as said previously the relevance should be copied, in this case
> > back to the original stmt info from the pattern stmt info.
> > 
> > +       STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
> > +       STMT_SLP_TYPE (related_stmt_info) = loop_vect;
> > 
> > one option might be to delay pattern recognition (for the loop case)
> > to after vect_detect_hybrid_slp and simply not make any hybrid
> > stmt containing nodes part of a SLP pattern.  It would be a workaround
> > of course, not the best solution.  Another option is to
> > add a new field to stmt_info to put a "saved" relevance to and
> > simply go over all stmts restoring relevance - we already restore
> > SLP_TYPE to loop_vect that way at
> 
> Yes.. ideally I would want something like telling it "don't worry about this if
> you end up being pure SLP".
> 
> I haven't tried with this new approach to instead add it to the PATTERN_SEQ
> for the statement.   Last time I tried however it looked like in order to get this
> to work correctly I'd need a bigger hack..

I don't think PATTERN_SEQ would help us here, we'd need a "ORIG_SEQ"
though we should eventually have that by means of the pattern root
scalar stmts.

Richard.

> Thanks,
> Tamar
> 
> > 
> >   /* Reset SLP type to loop_vect on all stmts.  */
> >   for (i = 0; i < LOOP_VINFO_LOOP (loop_vinfo)->num_nodes; ++i)
> >     {
> > 
> > so simply restore the saved relevance would work there (guess I
> > like that more than what vect_dissolve_slp_only_patterns does,
> > in any case do what vect_dissolve_slp_only_patterns does in the
> > above loop please).
> > 
> > Thanks,
> > Richard.
> > 
> > 
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > > 	* doc/passes.texi: Update documentation.
> > > 	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
> > > 	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
> > > 	(vect_analyze_slp): Call pattern matcher.
> > > 	* tree-vectorizer.h (enum _complex_operation):
> > > 	(class vect_pattern_match, class vect_pattern): New.
> > > 	* tree-vect-slp-patterns.c: New file.
> > >
> > > > -----Original Message-----
> > > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > > > Behalf Of Richard Biener
> > > > Sent: Tuesday, September 29, 2020 10:42 AM
> > > > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > > > Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> > gcc-
> > > > patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > > > scaffolding.
> > > >
> > > > On Tue, 29 Sep 2020, Richard Sandiford wrote:
> > > >
> > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance (vec_info
> > > > *vinfo,
> > > > > >> > >                               &tree_size, bst_map);
> > > > > >> > >    if (node != NULL)
> > > > > >> > >      {
> > > > > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > > > > >> > > +      vinfo->stmt_vec_info_ro = false;
> > > > > >> > > +
> > > > > >> > > +      /* See if any patterns can be found in the constructed SLP
> > tree
> > > > > >> > > +        before we do any analysis on it.  */
> > > > > >> > > +      vect_match_slp_patterns (node, vinfo, group_size,
> > > > &max_nunits,
> > > > > >> > > +                              matches, &npermutes, &tree_size,
> > > > > >> > > + bst_map);
> > > > > >> > > +
> > > > > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > > > > >> > > +      vinfo->stmt_vec_info_ro = true;
> > > > > >> > > +
> > > > > >> > >
> > > > > >> > > I think this is a bit early to match patterns - I'd defer it to
> > > > > >> > > the point where all entries into the same SLP subgraph are
> > > > > >> > > analyzed, thus somewhere at the end of vect_analyze_slp loop
> > > > > >> > > over all instances and match patterns?  That way phases are
> > more
> > > > clearly separated.
> > > > > >> >
> > > > > >> > That would probably work, my only worry is that the SLP analysis
> > > > > >> > itself may fail and bail out at
> > > > > >> >
> > > > > >> > 	  /* If the loads and stores can be handled with load/store-
> > lane
> > > > > >> > 	     instructions do not generate this SLP instance.  */
> > > > > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > > > > >> > 	      && loads_permuted
> > > > > >> > 	      && dr && vect_store_lanes_supported (vectype,
> > group_size,
> > > > > >> > false))
> > > > > >
> > > > > > Ah, that piece of code.  Yeah, I'm repeatedly running into it as
> > > > > > well - it's a bad hack that stands in the way all the time :/
> > > > >
> > > > > At one point I was wondering about trying to drop the above, vectorise
> > > > > with and without SLP, and then compare their costs, like for
> > > > VECT_COMPARE_COSTS.
> > > > > But that seemed like a dead end with the move to doing everything on
> > > > > the SLP representation.
> > > >
> > > > Yeah ... though even moving everything to the SLP representation will
> > retain
> > > > the issue since there it will be N group-size 1 SLP instances vs. 1 group-
> > size N
> > > > SLP instance.
> > > >
> > > > > > I guess we should try moving this upward like to
> > > > > > vect_analyze_loop_2 right before
> > > > > >
> > > > > >   /* Check the SLP opportunities in the loop, analyze and build SLP
> > trees.
> > > > > > */
> > > > > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > > > > >   if (!ok)
> > > > > >     return ok;
> > > > > >
> > > > > > and there check whether all grouped loads and stores can be handled
> > > > > > with store-/load-lanes (and there are no SLP reduction chains?) in
> > > > > > which case do not try to attempt SLP at all.  Because the testcases
> > > > > > this check was supposed to change were all-load/store-lane or all
> > > > > > SLP so the mixed case is probably not worth special casing.
> > > > > >
> > > > > > Since load-/store-lanes is an arm speciality I tried to only touch
> > > > > > this fragile part with a ten-foot pole ;)  CCing Richard, if he acks
> > > > > > the above I can produce a patch.
> > > > >
> > > > > Yeah, sounds good to me.  Probably also sorth checking whether the
> > > > > likely_max iteration count is high enough to support group_size
> > > > > vectors, if we have enough information to guess that.
> > > > >
> > > > > We could also get the gen* machinery to emit a macro that is true if
> > > > > at least one load/store-lane pattern is defined, so that we can skip
> > > > > the code for non-Arm targets.  I can do that as a follow-up.
> > > >
> > > > I've had a second look and one complication is that we only elide the SLP
> > > > node if any of the loads are permuted.  So if all loads/stores are
> > unpermuted
> > > > but load/store-lanes would work we'd keep the SLP node.
> > > >
> > > > Of course without actually building the SLP node we don't know whether
> > the
> > > > loads will be permuted or not ...
> > > >
> > > > But surely the current place for the check will cause some testcases to
> > > > become hybrid vectorizations which is likely undesirable.
> > > >
> > > > So we could move the check after all SLP discovery is completed and
> > throw it
> > > > all away if we can and should use load/store-lanes?
> > > > But that might then not solve Tamars issue.
> > > >
> > > > Richard.
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg,
> > Germany; GF: Felix Imend
>
Tamar Christina Nov. 4, 2020, 2:36 p.m. UTC | #11
> -----Original Message-----
> From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> Behalf Of Richard Biener
> Sent: Wednesday, November 4, 2020 2:04 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> scaffolding.
> 
> On Wed, 4 Nov 2020, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > > Behalf Of Richard Biener
> > > Sent: Wednesday, November 4, 2020 12:41 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> > > gcc-patches@gcc.gnu.org
> > > Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern
> > > matching scaffolding.
> > >
> > > On Tue, 3 Nov 2020, Tamar Christina wrote:
> > >
> > > > Hi Richi,
> > > >
> > > > This is a respin which includes the changes you requested.
> > >
> > > Comments randomly ordered, I'm pasting in pieces of the patch -
> > > sending it inline would help to get pieces properly quoted and in-order.
> > >
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index
> > >
> 4bd454cfb185d7036843fc7140b073f525b2ec6a..b813508d3ceaf4c54f612bc10f
> > > 9
> > > aa42ffe0ce0dd
> > > 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > ...
> > >
> > > I miss comments in this file, see tree-vectorizer.h where we try to
> > > document purpose of classes and fields.
> > >
> > > Things that sticks out to me:
> > >
> > > +    uint8_t m_arity;
> > > +    uint8_t m_num_args;
> > >
> > > why uint8_t and not simply unsigned int?  Not knowing what arity /
> > > num_args should be here ;)
> >
> > I think I can remove arity, but num_args is how many operands the
> > created internal function call should take.  Since we can't vectorize
> > calls with more than
> > 4 arguments at the moment it seemed like 255 would be a safe limit :).
> >
> > >
> > > +    vec_info *m_vinfo;
> > > ...
> > > +    vect_pattern (slp_tree *node, vec_info *vinfo)
> > >
> > > so this looks like something I freed stmt_vec_info of -
> > > back-pointers in the "wrong" direction of the logical hierarchy.  I
> > > suppose it's just to avoid passing down vinfo where we need it?
> > > Please do that instead - pass down vinfo as everything else does.
> > >
> > > The class seems to expose both very high-level (build () it!) and
> > > very low level details (get_ifn).  The high-level one suggests that
> > > a pattern _not_ being represented by an ifn is possible but there's
> > > too much implementation detail already in the vect_pattern class to
> > > make that impossible.  I guess the IFN details could be pushed down
> > > to the simple matching class (and that be called vect_ifn_pattern or so).
> > >
> > > +static bool
> > > +vect_match_slp_patterns (slp_tree *ref_node, vec_info *vinfo) {
> > > +  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
> > > +  bool found_p = false;
> > > +
> > > +  if (dump_enabled_p ())
> > > +    {
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt
> > > + match
> > > --\n");
> > > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
> > > +    }
> > >
> > > we dumped all instances after their analysis.  Maybe just refer to
> > > the instance with its address (dump_print %p) so lookup in the
> > > (already large) dump file is easy.
> > >
> > > +  hash_set<slp_tree> *visited = new hash_set<slp_tree> ();  for
> > > + (unsigned x = 0; x < num__slp_patterns; x++)
> > > +    {
> > > +      visited->empty ();
> > > +      found_p |= vect_match_slp_patterns_2 (ref_node, vinfo,
> > > slp_patterns[x],
> > > +                                           visited);
> > > +    }
> > > +
> > > +  delete visited;
> > >
> > > no need to new / delete, just do
> > >
> > >   has_set<slp_tree> visited;
> > >
> > > like everyone else.  Btw, do you really want to scan pieces of the
> > > SLP graph (with instances being graph entries) multiple times?  If
> > > not then you should move the visited set to the caller instead.
> > >
> > > +  /* TODO: Remove in final version, only here for generating debug
> > > + dot
> > > graphs
> > > +          from SLP tree.  */
> > > +
> > > +  if (dump_enabled_p ())
> > > +    {
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
> > > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
> > > +    }
> > >
> > > now, if there was some pattern matched it is probably useful to dump
> > > the graph (entry) again.  But only conditional on that I think.  So
> > > can you instead make the dump conditional on found_p and remove the
> > > start dot/end dot markers as said in the comment?
> > >
> > > +         if (dump_enabled_p ())
> > > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > +                            "transformation for %s not valid due to
> > > + post
> > > "
> > > +                            "condition\n",
> > >
> > > not really a MSG_MISSED_OPTIMIZATION, use MSG_NOTE.
> > > MSG_MISSED_OPTIMIZATION should be used for things (likely) making
> > > vectorization fail.
> > >
> > > +  /* Perform recursive matching, it's important to do this after
> > > + matching
> > > things
> > >
> > > before matching things?
> > >
> > > +     in the current node as the matches here may re-order the nodes
> > > + below
> > > it.
> > > +     As such the pattern that needs to be subsequently match may
> change.
> > >
> > > and this is no longer true?
> > >
> > > */
> > > +
> > > +  if (SLP_TREE_CHILDREN (node).exists ()) {
> > >
> > > elide this check, the loop will simply not run if empty
> > >
> > > +    slp_tree child;
> > > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > >
> > > I think you want to perform the recursion in the caller so you do it
> > > only once and not once for each pattern kind now that you do
> > > post-order processing rather than pre-order.
> >
> > That wouldn't work as that would make multiply match before FMA/FMS.
> 
> Hmm, I see.  Note that to avoid visiting multiple entries into the graph
> multiple times you then need another visited set or compute
> pre- or post-order of the entire graph into an array separately.
> 
> > I had tried having FM(A|S) match MUL instead, but MUL does a lot of
> > work to create permute nodes, which would have to be undone.
> 
> Nevertheless I think that's what regular pattern matching does.  Is matching
> FMA from "scalar" really that much simpler than matching first MUL and then
> FMA with a MUL component?

No, the matching itself no. (in fact I already have a branch where I do this).
The FMA case however does end up needing to drop half the load nodes
(which I need to pick at random really).

Is it OK to do so? If so then I can use this variant. If I split FMA and FMS then
I can do it if dropping the loads at this point doesn't cause any issues. 

Would you prefer I just elide the nodes or add a VEC_PERM that just selects
from the first vector only?

> 
> > The FMS case is somewhat easy as that still has a two operators node
> > in between, But FMA is just a simple add on top.  I have to then
> > remove the merging nodes that MUL Created, and drop half the operands
> to create the new node.
> >
> > Which is doable but effectively does the same pattern twice, whereas
> > now the MUL would just skip all that work since FMA matched.
> >
> > The post order is essentially really problematic for overlapping
> > patterns.  In pre-order I could for Instance detect all 3 patterns in one
> traversal and share the matching information between all 3.
> 
> I do remember you doing this in some earlier variant but it looked equally
> complicated.  What you've now done is do a post-order walk for each pattern
> kind.
> 
> So I'd like to get the recursion up to fix the 'visited' thing.  We can either have
> ->pre_match and ->post_match (ugh) or bite the bullet and do FM(A|S)
> ontop of MUL (I believe for the earlier variant you said that would be
> possible).
> 
> > Right now I can only do it between mul and add and FMA has to be
> standalone.
> >
> > >
> > > +  vect_pattern *pattern = patt_fn (ref_node, vinfo);
> > > +
> > > +  if (pattern->matches ())
> > >
> > > this suggests you get number of SLP nodes times number of pattern
> > > matchers allocations & inits of vect_pattern.  If you move
> > >
> > > +      if (pattern->is_optab_supported_p (vectype,
> OPTIMIZE_FOR_SPEED))
> > > +       {
> > >
> > > into ->matches () then whether this is a IFN or multi-node pattern
> > > becomes an implementation detail which would be IMHO better.
> > >
> > > +  FOR_EACH_VEC_ELT (*this->m_nodes, ix, node)
> > > +    {
> > > +      /* Calculate the location of the statement in NODE to replace.  */
> > > +      stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > > +      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
> > > +      tree type = gimple_expr_type (old_stmt);
> > > +
> > > +      /* Create the argument set for use by
> > > gimple_build_call_internal_vec.  */
> > > +      for (int i = 0; i < this->m_num_args; i++)
> > >
> > > the scalar argument type is the type of the definition so instead of
> > > gimple_expr_type you want simply TREE_TYPE (gimple_get_lhs
> > > (old_stmt))
> > >
> > > But then you seem to mix the result and the argument types?  I think
> > > you need to walk over SLP_TREE_CHILDREN and look at their
> > > representatives scalar def.  Which would then assume the pattern
> > > consumes all direct children.  Somehow this "generic"
> >
> > Since I'm only consuming simple math ops would the argument and result
> > types differ? Something like a widening sub I can see but not the normal
> sub.
> 
> No.  Still gimple_expr_type is somewhat of an odd beast so better avoid it.
> And document this restriction.
> 
> > > vect_simple_pattern_match::build () has quite some restrictions.
> > > I guess they are documented in the files toplevel comment which also
> > > has general parts applying to all possible pattern matchers (not
> > > deriving from vect_simple_pattern_match).  Can you split up the
> > > comment and more explicitely mark what applies to patterns in
> > > general (postorder traversal) and what applies to
> vect_simple_pattern_match only?
> > >
> > > +      /* We have to explicitly mark the old statement as unused
> > > + because
> > > during
> > > +        statement analysis the original and new pattern statement
> > > + may
> > > require
> > >
> > > comment needs updating.
> > >
> > > +        different level of unrolling.  As an example add/sub when
> > > vectorized
> > > +        without a pattern requires 4 copies, whereas with a
> > > + COMPLEX_ADD
> > > pattern
> > > +        this only requires 2 copies and the two statement will be
> > > + treated
> > > as
> > > +        hand unrolled.  That means that the analysis won't happen
> > > + as
> > > it'll find
> > > +        a mismatch.  So we don't analyze the old statement and if
> > > + we end
> > > up
> > > +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> > > */
> > > +      STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> > >
> > > I guess relevance should be copied from the original stmt (like if
> > > it was used_by_reduction).  Can't that be done in
> > > vect_mark_pattern_stmts?  Likewise the ->add_pattern_stmt part could
> > > be done there, too.  So you'd be left with
> > >
> > > +      STMT_SLP_TYPE (call_stmt_info) = pure_slp;
> > >
> > > and
> > >
> > > +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> > >
> > > where the pure_slp setting should in theory be not needed (it should
> > > be automagically detected so) but setting it is not wrong (it would
> > > be an optimization).
> >
> > I am using it as a marker for a node dissolve needs to look at when
> > undoing SLP so it restores the relevancy, but given your comment below
> > on that I may be able to get rid of it.
> >
> > >
> > > --
> > >
> > > I'm now looking down the series to see how this is all used, so
> > > quoting different pieces from other patches.
> > >
> > > +/* Checks to see of the expression EXPR is a gimple assign with code
> CODE
> > > +   and if this is the case the two operands of EXPR is returned in
> > > +OP1
> > > and
> > > +   OP2.
> > > +
> > > +   If the matching and extraction is successful TRUE is returned
> > > otherwise
> > > +   FALSE in which case the value of OP1 and OP2 will not have been
> > > touched.
> > > +*/
> > > +
> > > +static bool
> > > +vect_match_expression_p (slp_tree node, tree_code code) {
> > > +  if (!SLP_TREE_REPRESENTATIVE (node))
> > >
> > > I see that many overall function comments do not actually match
> > > anymore, if at least for the parameters refered to.  Please go over
> > > the patch and update them.
> > >
> > > +  gimple* expr = STMT_VINFO_STMT (SLP_TREE_REPRESENTATIVE
> (node));
> > > + if (!is_gimple_assign (expr)
> > > +      || gimple_expr_code (expr) != code)
> > >
> > > gimple_assign_rhs_code (expr)
> > >
> > > +/* Check if the given lane permute in PERMUTES matches an
> > > +alternating
> > > sequence
> > > +   of {P0 P1 P0 P1 ...}.  This to account for unrolled loops.   */
> > > +static bool
> > > +vect_check_lane_permute (lane_permutation_t &permutes,
> > > +                        unsigned p0, unsigned p1) {
> > > +  unsigned val[2] = {p0, p1};
> > > +  for (unsigned i = 0; i < permutes.length (); i++)
> > > +    if (permutes[i].first != val[i % 2])
> > > +      return false;
> > >
> > > so this doesn't put any constraint on permutes[].second.  So it
> > > matches an arbitrary interleaving of two vectors.
> > >
> > > +
> > > +   will return PLUS_MINUS along with OPS containing {_37, _12, _38, _36}.
> > > +*/
> > > +
> > > +static complex_operation_t
> > > +vect_detect_pair_op (slp_tree node1, slp_tree node2,
> > > +lane_permutation_t
> > > &lanes,
> > > +                    bool two_operands = true, vec<slp_tree> *ops =
> > > + NULL)
> > >
> > >  { {_37, _38}, {_12, _36} }
> > >
> > > there's a lot of vect_match_expression_p calls and I wonder if
> > > inlining them and CSEing the 'code' compute of node1 and node2 would
> > > simplify things and elide the macro.  Also we then know this is all
> > > CSEd in the end ...
> > >
> > > I wonder what LANES specifies, TWO_OPERANDS specifies whether
> > > two-operand variants are allowed unless .. (guess LANES causes only
> > > same operands to match up in the end).
> > >
> >
> > I'm guarding against that you could technically speaking have a +/-
> > node that does reverses the order of the elements below it, in which
> > case that's a different operation than what we want to guard against.
> >
> > I hadn't been able to prove to myself that SLP build would never do this.
> >
> > > +static void
> > > +vect_mark_stmts_as_in_pattern (hash_set<slp_tree> *cache, slp_tree
> > > node)
> > > +{
> > >
> > > need to find a use for this still, but don't you set pure_slp
> > > "above"?
> >
> > This is used to mark the elided nodes.  For instance in a complex
> > multiply the
> > * nodes in the SLP tree become unused.  While not having them in the
> > SLP tree is simple, the detection of pure/hybrid is led from the BB
> > statements so they would mark the elided nodes as being hybrid.
> 
> What does regular pattern matching do here?  Ah, the relevance compute is
> after pattern matching.  But you don't know whether it's still relevant (for
> hybrid operation).
> 
> > There is of course the problem of is this node used in any other SLP
> > instance, it may force the analysis to think the SLP tree is pure, but
> > if SLP fails it would try not SLP anyway as it would have done eventually if
> Hybrid.
> 
> I still think the code will have some issues with relevance and pure vs.
> hybrid ...
> 
> > >
> > > +bool
> > > +complex_pattern::validate_p ()
> > > +{
> > > +      unsigned i;
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, tmp)
> > > +       {
> > > +         slp_tree vnode = NULL;
> > > +         if (vect_slp_make_linear (node, tmp, &vnode))
> > > +           nodes.quick_push (vnode);
> > > +         else
> > > +           {
> > > +             nodes.release();
> > > +             return false;
> > > +           }
> > >
> > > hum.  vect_slp_make_linear better not fail?  At least we've created
> > > half-way stuff when it does.  Can't spot the implementation right
> > > now (guess splitting the patch up doesn't really help much ...)
> > >
> >
> > It can fail, so I see I need to free some nodes otherwise I'll leak...
> > But It's not a correctness issue. Nothing It changed on NODE until it
> > all succeeds.
> >
> > > Noting elsewhere:
> > >
> > > +static void
> > > +vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo,
> > > +                                hash_set<slp_tree> *visited,
> > > +slp_tree
> > > root)
> > > +{
> > > +  if (!root || visited->contains (root))
> > > +    return;
> > > +
> > > +  unsigned int i;
> > > +  slp_tree node;
> > > +  stmt_vec_info related_stmt_info;
> > > +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (root);
> > > +
> > > +  visited->add (root);
> > >
> > > visited->add () returns whether the key was already there so
> > > please combine the contains and the add calls here and elsewhere.
> > >
> > >     if (stmt_info && STMT_VINFO_SLP_VECT_ONLY (stmt_info)
> > > +        && (related_stmt_info = STMT_VINFO_RELATED_STMT
> > > + (stmt_info)) !=
> > > NULL)
> > > +      {
> > > +       if (dump_enabled_p ())
> > >
> > > I think STMT_VINFO_SLP_VECT_ONLY is a thing already, so I wonder if
> > > the above is a sufficient test.  There's is_pattern_stmt_p () (which
> > > obviously also applies to non-SLP only patterns).
> > >
> > > Note that I also see another issue, namely with hybrid SLP the
> > > original non-pattern stmt would need to stay relevant.  That
> > > probably asks for hybrid SLP discovery and relevance marking to be
> > > combined somehow.  You can probably create a simple testcase by
> > > storing a lane of a complex op via a non-grouped store.
> > >
> > > +       STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > > +       STMT_VINFO_RELEVANT (related_stmt_info) =
> > > + vect_used_in_scope;
> > >
> > > as said previously the relevance should be copied, in this case back
> > > to the original stmt info from the pattern stmt info.
> > >
> > > +       STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
> > > +       STMT_SLP_TYPE (related_stmt_info) = loop_vect;
> > >
> > > one option might be to delay pattern recognition (for the loop case)
> > > to after vect_detect_hybrid_slp and simply not make any hybrid stmt
> > > containing nodes part of a SLP pattern.  It would be a workaround of
> > > course, not the best solution.  Another option is to add a new field
> > > to stmt_info to put a "saved" relevance to and simply go over all
> > > stmts restoring relevance - we already restore SLP_TYPE to loop_vect
> > > that way at
> >
> > Yes.. ideally I would want something like telling it "don't worry
> > about this if you end up being pure SLP".
> >
> > I haven't tried with this new approach to instead add it to the
> PATTERN_SEQ
> > for the statement.   Last time I tried however it looked like in order to get
> this
> > to work correctly I'd need a bigger hack..
> 
> I don't think PATTERN_SEQ would help us here, we'd need a "ORIG_SEQ"
> though we should eventually have that by means of the pattern root scalar
> stmts.

Shouldn't it? If I mark the unused nodes as part of the same pattern, vect_analyze_stmt
would examine the pattern instead, but once it's relevancy is determined it won't analyze it
again.  The unused statements then don't have any relevancy which will get
vect_detect_hybrid_slp to ignore them which should get us what we want no?

Unless I'm misinterpreting something.

Regards,
Tamar

> 
> Richard.
> 
> > Thanks,
> > Tamar
> >
> > >
> > >   /* Reset SLP type to loop_vect on all stmts.  */
> > >   for (i = 0; i < LOOP_VINFO_LOOP (loop_vinfo)->num_nodes; ++i)
> > >     {
> > >
> > > so simply restore the saved relevance would work there (guess I like
> > > that more than what vect_dissolve_slp_only_patterns does, in any
> > > case do what vect_dissolve_slp_only_patterns does in the above loop
> > > please).
> > >
> > > Thanks,
> > > Richard.
> > >
> > >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > > > 	* doc/passes.texi: Update documentation.
> > > > 	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
> > > > 	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
> > > > 	(vect_analyze_slp): Call pattern matcher.
> > > > 	* tree-vectorizer.h (enum _complex_operation):
> > > > 	(class vect_pattern_match, class vect_pattern): New.
> > > > 	* tree-vect-slp-patterns.c: New file.
> > > >
> > > > > -----Original Message-----
> > > > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de>
> > > > > On Behalf Of Richard Biener
> > > > > Sent: Tuesday, September 29, 2020 10:42 AM
> > > > > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > > > > Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> > > gcc-
> > > > > patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern
> > > > > matching scaffolding.
> > > > >
> > > > > On Tue, 29 Sep 2020, Richard Sandiford wrote:
> > > > >
> > > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance
> > > > > > >> > > (vec_info
> > > > > *vinfo,
> > > > > > >> > >                               &tree_size, bst_map);
> > > > > > >> > >    if (node != NULL)
> > > > > > >> > >      {
> > > > > > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > > > > > >> > > +      vinfo->stmt_vec_info_ro = false;
> > > > > > >> > > +
> > > > > > >> > > +      /* See if any patterns can be found in the
> > > > > > >> > > + constructed SLP
> > > tree
> > > > > > >> > > +        before we do any analysis on it.  */
> > > > > > >> > > +      vect_match_slp_patterns (node, vinfo,
> > > > > > >> > > + group_size,
> > > > > &max_nunits,
> > > > > > >> > > +                              matches, &npermutes,
> > > > > > >> > > + &tree_size, bst_map);
> > > > > > >> > > +
> > > > > > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > > > > > >> > > +      vinfo->stmt_vec_info_ro = true;
> > > > > > >> > > +
> > > > > > >> > >
> > > > > > >> > > I think this is a bit early to match patterns - I'd
> > > > > > >> > > defer it to the point where all entries into the same
> > > > > > >> > > SLP subgraph are analyzed, thus somewhere at the end of
> > > > > > >> > > vect_analyze_slp loop over all instances and match
> > > > > > >> > > patterns?  That way phases are
> > > more
> > > > > clearly separated.
> > > > > > >> >
> > > > > > >> > That would probably work, my only worry is that the SLP
> > > > > > >> > analysis itself may fail and bail out at
> > > > > > >> >
> > > > > > >> > 	  /* If the loads and stores can be handled with
> > > > > > >> > load/store-
> > > lane
> > > > > > >> > 	     instructions do not generate this SLP instance.  */
> > > > > > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > > > > > >> > 	      && loads_permuted
> > > > > > >> > 	      && dr && vect_store_lanes_supported (vectype,
> > > group_size,
> > > > > > >> > false))
> > > > > > >
> > > > > > > Ah, that piece of code.  Yeah, I'm repeatedly running into
> > > > > > > it as well - it's a bad hack that stands in the way all the
> > > > > > > time :/
> > > > > >
> > > > > > At one point I was wondering about trying to drop the above,
> > > > > > vectorise with and without SLP, and then compare their costs,
> > > > > > like for
> > > > > VECT_COMPARE_COSTS.
> > > > > > But that seemed like a dead end with the move to doing
> > > > > > everything on the SLP representation.
> > > > >
> > > > > Yeah ... though even moving everything to the SLP representation
> > > > > will
> > > retain
> > > > > the issue since there it will be N group-size 1 SLP instances
> > > > > vs. 1 group-
> > > size N
> > > > > SLP instance.
> > > > >
> > > > > > > I guess we should try moving this upward like to
> > > > > > > vect_analyze_loop_2 right before
> > > > > > >
> > > > > > >   /* Check the SLP opportunities in the loop, analyze and
> > > > > > > build SLP
> > > trees.
> > > > > > > */
> > > > > > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > > > > > >   if (!ok)
> > > > > > >     return ok;
> > > > > > >
> > > > > > > and there check whether all grouped loads and stores can be
> > > > > > > handled with store-/load-lanes (and there are no SLP
> > > > > > > reduction chains?) in which case do not try to attempt SLP
> > > > > > > at all.  Because the testcases this check was supposed to
> > > > > > > change were all-load/store-lane or all SLP so the mixed case is
> probably not worth special casing.
> > > > > > >
> > > > > > > Since load-/store-lanes is an arm speciality I tried to only
> > > > > > > touch this fragile part with a ten-foot pole ;)  CCing
> > > > > > > Richard, if he acks the above I can produce a patch.
> > > > > >
> > > > > > Yeah, sounds good to me.  Probably also sorth checking whether
> > > > > > the likely_max iteration count is high enough to support
> > > > > > group_size vectors, if we have enough information to guess that.
> > > > > >
> > > > > > We could also get the gen* machinery to emit a macro that is
> > > > > > true if at least one load/store-lane pattern is defined, so
> > > > > > that we can skip the code for non-Arm targets.  I can do that as a
> follow-up.
> > > > >
> > > > > I've had a second look and one complication is that we only
> > > > > elide the SLP node if any of the loads are permuted.  So if all
> > > > > loads/stores are
> > > unpermuted
> > > > > but load/store-lanes would work we'd keep the SLP node.
> > > > >
> > > > > Of course without actually building the SLP node we don't know
> > > > > whether
> > > the
> > > > > loads will be permuted or not ...
> > > > >
> > > > > But surely the current place for the check will cause some
> > > > > testcases to become hybrid vectorizations which is likely undesirable.
> > > > >
> > > > > So we could move the check after all SLP discovery is completed
> > > > > and
> > > throw it
> > > > > all away if we can and should use load/store-lanes?
> > > > > But that might then not solve Tamars issue.
> > > > >
> > > > > Richard.
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > Nuernberg, Germany; GF: Felix Imend
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imend
Richard Biener Nov. 4, 2020, 3:14 p.m. UTC | #12
On Wed, 4 Nov 2020, Tamar Christina wrote:

> > -----Original Message-----
> > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > Behalf Of Richard Biener
> > Sent: Wednesday, November 4, 2020 2:04 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> > gcc-patches@gcc.gnu.org
> > Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern matching
> > scaffolding.
> > 
> > On Wed, 4 Nov 2020, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de> On
> > > > Behalf Of Richard Biener
> > > > Sent: Wednesday, November 4, 2020 12:41 PM
> > > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > > Cc: Richard Sandiford <Richard.Sandiford@arm.com>; nd <nd@arm.com>;
> > > > gcc-patches@gcc.gnu.org
> > > > Subject: RE: [PATCH v2 3/16]middle-end Add basic SLP pattern
> > > > matching scaffolding.
> > > >
> > > > On Tue, 3 Nov 2020, Tamar Christina wrote:
> > > >
> > > > > Hi Richi,
> > > > >
> > > > > This is a respin which includes the changes you requested.
> > > >
> > > > Comments randomly ordered, I'm pasting in pieces of the patch -
> > > > sending it inline would help to get pieces properly quoted and in-order.
> > > >
> > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index
> > > >
> > 4bd454cfb185d7036843fc7140b073f525b2ec6a..b813508d3ceaf4c54f612bc10f
> > > > 9
> > > > aa42ffe0ce0dd
> > > > 100644
> > > > --- a/gcc/tree-vectorizer.h
> > > > +++ b/gcc/tree-vectorizer.h
> > > > ...
> > > >
> > > > I miss comments in this file, see tree-vectorizer.h where we try to
> > > > document purpose of classes and fields.
> > > >
> > > > Things that sticks out to me:
> > > >
> > > > +    uint8_t m_arity;
> > > > +    uint8_t m_num_args;
> > > >
> > > > why uint8_t and not simply unsigned int?  Not knowing what arity /
> > > > num_args should be here ;)
> > >
> > > I think I can remove arity, but num_args is how many operands the
> > > created internal function call should take.  Since we can't vectorize
> > > calls with more than
> > > 4 arguments at the moment it seemed like 255 would be a safe limit :).
> > >
> > > >
> > > > +    vec_info *m_vinfo;
> > > > ...
> > > > +    vect_pattern (slp_tree *node, vec_info *vinfo)
> > > >
> > > > so this looks like something I freed stmt_vec_info of -
> > > > back-pointers in the "wrong" direction of the logical hierarchy.  I
> > > > suppose it's just to avoid passing down vinfo where we need it?
> > > > Please do that instead - pass down vinfo as everything else does.
> > > >
> > > > The class seems to expose both very high-level (build () it!) and
> > > > very low level details (get_ifn).  The high-level one suggests that
> > > > a pattern _not_ being represented by an ifn is possible but there's
> > > > too much implementation detail already in the vect_pattern class to
> > > > make that impossible.  I guess the IFN details could be pushed down
> > > > to the simple matching class (and that be called vect_ifn_pattern or so).
> > > >
> > > > +static bool
> > > > +vect_match_slp_patterns (slp_tree *ref_node, vec_info *vinfo) {
> > > > +  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
> > > > +  bool found_p = false;
> > > > +
> > > > +  if (dump_enabled_p ())
> > > > +    {
> > > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt
> > > > + match
> > > > --\n");
> > > > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
> > > > +    }
> > > >
> > > > we dumped all instances after their analysis.  Maybe just refer to
> > > > the instance with its address (dump_print %p) so lookup in the
> > > > (already large) dump file is easy.
> > > >
> > > > +  hash_set<slp_tree> *visited = new hash_set<slp_tree> ();  for
> > > > + (unsigned x = 0; x < num__slp_patterns; x++)
> > > > +    {
> > > > +      visited->empty ();
> > > > +      found_p |= vect_match_slp_patterns_2 (ref_node, vinfo,
> > > > slp_patterns[x],
> > > > +                                           visited);
> > > > +    }
> > > > +
> > > > +  delete visited;
> > > >
> > > > no need to new / delete, just do
> > > >
> > > >   has_set<slp_tree> visited;
> > > >
> > > > like everyone else.  Btw, do you really want to scan pieces of the
> > > > SLP graph (with instances being graph entries) multiple times?  If
> > > > not then you should move the visited set to the caller instead.
> > > >
> > > > +  /* TODO: Remove in final version, only here for generating debug
> > > > + dot
> > > > graphs
> > > > +          from SLP tree.  */
> > > > +
> > > > +  if (dump_enabled_p ())
> > > > +    {
> > > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
> > > > +      vect_print_slp_graph (MSG_NOTE, vect_location, *ref_node);
> > > > +      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
> > > > +    }
> > > >
> > > > now, if there was some pattern matched it is probably useful to dump
> > > > the graph (entry) again.  But only conditional on that I think.  So
> > > > can you instead make the dump conditional on found_p and remove the
> > > > start dot/end dot markers as said in the comment?
> > > >
> > > > +         if (dump_enabled_p ())
> > > > +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > +                            "transformation for %s not valid due to
> > > > + post
> > > > "
> > > > +                            "condition\n",
> > > >
> > > > not really a MSG_MISSED_OPTIMIZATION, use MSG_NOTE.
> > > > MSG_MISSED_OPTIMIZATION should be used for things (likely) making
> > > > vectorization fail.
> > > >
> > > > +  /* Perform recursive matching, it's important to do this after
> > > > + matching
> > > > things
> > > >
> > > > before matching things?
> > > >
> > > > +     in the current node as the matches here may re-order the nodes
> > > > + below
> > > > it.
> > > > +     As such the pattern that needs to be subsequently match may
> > change.
> > > >
> > > > and this is no longer true?
> > > >
> > > > */
> > > > +
> > > > +  if (SLP_TREE_CHILDREN (node).exists ()) {
> > > >
> > > > elide this check, the loop will simply not run if empty
> > > >
> > > > +    slp_tree child;
> > > > +    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > > >
> > > > I think you want to perform the recursion in the caller so you do it
> > > > only once and not once for each pattern kind now that you do
> > > > post-order processing rather than pre-order.
> > >
> > > That wouldn't work as that would make multiply match before FMA/FMS.
> > 
> > Hmm, I see.  Note that to avoid visiting multiple entries into the graph
> > multiple times you then need another visited set or compute
> > pre- or post-order of the entire graph into an array separately.
> > 
> > > I had tried having FM(A|S) match MUL instead, but MUL does a lot of
> > > work to create permute nodes, which would have to be undone.
> > 
> > Nevertheless I think that's what regular pattern matching does.  Is matching
> > FMA from "scalar" really that much simpler than matching first MUL and then
> > FMA with a MUL component?
> 
> No, the matching itself no. (in fact I already have a branch where I do this).
> The FMA case however does end up needing to drop half the load nodes
> (which I need to pick at random really).
> 
> Is it OK to do so? If so then I can use this variant. If I split FMA and FMS then
> I can do it if dropping the loads at this point doesn't cause any issues. 

Sure, dopping nodes is easy.

> Would you prefer I just elide the nodes or add a VEC_PERM that just selects
> from the first vector only?

Drop them.

> > 
> > > The FMS case is somewhat easy as that still has a two operators node
> > > in between, But FMA is just a simple add on top.  I have to then
> > > remove the merging nodes that MUL Created, and drop half the operands
> > to create the new node.
> > >
> > > Which is doable but effectively does the same pattern twice, whereas
> > > now the MUL would just skip all that work since FMA matched.
> > >
> > > The post order is essentially really problematic for overlapping
> > > patterns.  In pre-order I could for Instance detect all 3 patterns in one
> > traversal and share the matching information between all 3.
> > 
> > I do remember you doing this in some earlier variant but it looked equally
> > complicated.  What you've now done is do a post-order walk for each pattern
> > kind.
> > 
> > So I'd like to get the recursion up to fix the 'visited' thing.  We can either have
> > ->pre_match and ->post_match (ugh) or bite the bullet and do FM(A|S)
> > ontop of MUL (I believe for the earlier variant you said that would be
> > possible).
> > 
> > > Right now I can only do it between mul and add and FMA has to be
> > standalone.
> > >
> > > >
> > > > +  vect_pattern *pattern = patt_fn (ref_node, vinfo);
> > > > +
> > > > +  if (pattern->matches ())
> > > >
> > > > this suggests you get number of SLP nodes times number of pattern
> > > > matchers allocations & inits of vect_pattern.  If you move
> > > >
> > > > +      if (pattern->is_optab_supported_p (vectype,
> > OPTIMIZE_FOR_SPEED))
> > > > +       {
> > > >
> > > > into ->matches () then whether this is a IFN or multi-node pattern
> > > > becomes an implementation detail which would be IMHO better.
> > > >
> > > > +  FOR_EACH_VEC_ELT (*this->m_nodes, ix, node)
> > > > +    {
> > > > +      /* Calculate the location of the statement in NODE to replace.  */
> > > > +      stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > > > +      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
> > > > +      tree type = gimple_expr_type (old_stmt);
> > > > +
> > > > +      /* Create the argument set for use by
> > > > gimple_build_call_internal_vec.  */
> > > > +      for (int i = 0; i < this->m_num_args; i++)
> > > >
> > > > the scalar argument type is the type of the definition so instead of
> > > > gimple_expr_type you want simply TREE_TYPE (gimple_get_lhs
> > > > (old_stmt))
> > > >
> > > > But then you seem to mix the result and the argument types?  I think
> > > > you need to walk over SLP_TREE_CHILDREN and look at their
> > > > representatives scalar def.  Which would then assume the pattern
> > > > consumes all direct children.  Somehow this "generic"
> > >
> > > Since I'm only consuming simple math ops would the argument and result
> > > types differ? Something like a widening sub I can see but not the normal
> > sub.
> > 
> > No.  Still gimple_expr_type is somewhat of an odd beast so better avoid it.
> > And document this restriction.
> > 
> > > > vect_simple_pattern_match::build () has quite some restrictions.
> > > > I guess they are documented in the files toplevel comment which also
> > > > has general parts applying to all possible pattern matchers (not
> > > > deriving from vect_simple_pattern_match).  Can you split up the
> > > > comment and more explicitely mark what applies to patterns in
> > > > general (postorder traversal) and what applies to
> > vect_simple_pattern_match only?
> > > >
> > > > +      /* We have to explicitly mark the old statement as unused
> > > > + because
> > > > during
> > > > +        statement analysis the original and new pattern statement
> > > > + may
> > > > require
> > > >
> > > > comment needs updating.
> > > >
> > > > +        different level of unrolling.  As an example add/sub when
> > > > vectorized
> > > > +        without a pattern requires 4 copies, whereas with a
> > > > + COMPLEX_ADD
> > > > pattern
> > > > +        this only requires 2 copies and the two statement will be
> > > > + treated
> > > > as
> > > > +        hand unrolled.  That means that the analysis won't happen
> > > > + as
> > > > it'll find
> > > > +        a mismatch.  So we don't analyze the old statement and if
> > > > + we end
> > > > up
> > > > +        needing it, e.g. SLP fails then we have to quickly re-analyze it.
> > > > */
> > > > +      STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
> > > >
> > > > I guess relevance should be copied from the original stmt (like if
> > > > it was used_by_reduction).  Can't that be done in
> > > > vect_mark_pattern_stmts?  Likewise the ->add_pattern_stmt part could
> > > > be done there, too.  So you'd be left with
> > > >
> > > > +      STMT_SLP_TYPE (call_stmt_info) = pure_slp;
> > > >
> > > > and
> > > >
> > > > +      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
> > > >
> > > > where the pure_slp setting should in theory be not needed (it should
> > > > be automagically detected so) but setting it is not wrong (it would
> > > > be an optimization).
> > >
> > > I am using it as a marker for a node dissolve needs to look at when
> > > undoing SLP so it restores the relevancy, but given your comment below
> > > on that I may be able to get rid of it.
> > >
> > > >
> > > > --
> > > >
> > > > I'm now looking down the series to see how this is all used, so
> > > > quoting different pieces from other patches.
> > > >
> > > > +/* Checks to see of the expression EXPR is a gimple assign with code
> > CODE
> > > > +   and if this is the case the two operands of EXPR is returned in
> > > > +OP1
> > > > and
> > > > +   OP2.
> > > > +
> > > > +   If the matching and extraction is successful TRUE is returned
> > > > otherwise
> > > > +   FALSE in which case the value of OP1 and OP2 will not have been
> > > > touched.
> > > > +*/
> > > > +
> > > > +static bool
> > > > +vect_match_expression_p (slp_tree node, tree_code code) {
> > > > +  if (!SLP_TREE_REPRESENTATIVE (node))
> > > >
> > > > I see that many overall function comments do not actually match
> > > > anymore, if at least for the parameters refered to.  Please go over
> > > > the patch and update them.
> > > >
> > > > +  gimple* expr = STMT_VINFO_STMT (SLP_TREE_REPRESENTATIVE
> > (node));
> > > > + if (!is_gimple_assign (expr)
> > > > +      || gimple_expr_code (expr) != code)
> > > >
> > > > gimple_assign_rhs_code (expr)
> > > >
> > > > +/* Check if the given lane permute in PERMUTES matches an
> > > > +alternating
> > > > sequence
> > > > +   of {P0 P1 P0 P1 ...}.  This to account for unrolled loops.   */
> > > > +static bool
> > > > +vect_check_lane_permute (lane_permutation_t &permutes,
> > > > +                        unsigned p0, unsigned p1) {
> > > > +  unsigned val[2] = {p0, p1};
> > > > +  for (unsigned i = 0; i < permutes.length (); i++)
> > > > +    if (permutes[i].first != val[i % 2])
> > > > +      return false;
> > > >
> > > > so this doesn't put any constraint on permutes[].second.  So it
> > > > matches an arbitrary interleaving of two vectors.
> > > >
> > > > +
> > > > +   will return PLUS_MINUS along with OPS containing {_37, _12, _38, _36}.
> > > > +*/
> > > > +
> > > > +static complex_operation_t
> > > > +vect_detect_pair_op (slp_tree node1, slp_tree node2,
> > > > +lane_permutation_t
> > > > &lanes,
> > > > +                    bool two_operands = true, vec<slp_tree> *ops =
> > > > + NULL)
> > > >
> > > >  { {_37, _38}, {_12, _36} }
> > > >
> > > > there's a lot of vect_match_expression_p calls and I wonder if
> > > > inlining them and CSEing the 'code' compute of node1 and node2 would
> > > > simplify things and elide the macro.  Also we then know this is all
> > > > CSEd in the end ...
> > > >
> > > > I wonder what LANES specifies, TWO_OPERANDS specifies whether
> > > > two-operand variants are allowed unless .. (guess LANES causes only
> > > > same operands to match up in the end).
> > > >
> > >
> > > I'm guarding against that you could technically speaking have a +/-
> > > node that does reverses the order of the elements below it, in which
> > > case that's a different operation than what we want to guard against.
> > >
> > > I hadn't been able to prove to myself that SLP build would never do this.
> > >
> > > > +static void
> > > > +vect_mark_stmts_as_in_pattern (hash_set<slp_tree> *cache, slp_tree
> > > > node)
> > > > +{
> > > >
> > > > need to find a use for this still, but don't you set pure_slp
> > > > "above"?
> > >
> > > This is used to mark the elided nodes.  For instance in a complex
> > > multiply the
> > > * nodes in the SLP tree become unused.  While not having them in the
> > > SLP tree is simple, the detection of pure/hybrid is led from the BB
> > > statements so they would mark the elided nodes as being hybrid.
> > 
> > What does regular pattern matching do here?  Ah, the relevance compute is
> > after pattern matching.  But you don't know whether it's still relevant (for
> > hybrid operation).
> > 
> > > There is of course the problem of is this node used in any other SLP
> > > instance, it may force the analysis to think the SLP tree is pure, but
> > > if SLP fails it would try not SLP anyway as it would have done eventually if
> > Hybrid.
> > 
> > I still think the code will have some issues with relevance and pure vs.
> > hybrid ...
> > 
> > > >
> > > > +bool
> > > > +complex_pattern::validate_p ()
> > > > +{
> > > > +      unsigned i;
> > > > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, tmp)
> > > > +       {
> > > > +         slp_tree vnode = NULL;
> > > > +         if (vect_slp_make_linear (node, tmp, &vnode))
> > > > +           nodes.quick_push (vnode);
> > > > +         else
> > > > +           {
> > > > +             nodes.release();
> > > > +             return false;
> > > > +           }
> > > >
> > > > hum.  vect_slp_make_linear better not fail?  At least we've created
> > > > half-way stuff when it does.  Can't spot the implementation right
> > > > now (guess splitting the patch up doesn't really help much ...)
> > > >
> > >
> > > It can fail, so I see I need to free some nodes otherwise I'll leak...
> > > But It's not a correctness issue. Nothing It changed on NODE until it
> > > all succeeds.
> > >
> > > > Noting elsewhere:
> > > >
> > > > +static void
> > > > +vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo,
> > > > +                                hash_set<slp_tree> *visited,
> > > > +slp_tree
> > > > root)
> > > > +{
> > > > +  if (!root || visited->contains (root))
> > > > +    return;
> > > > +
> > > > +  unsigned int i;
> > > > +  slp_tree node;
> > > > +  stmt_vec_info related_stmt_info;
> > > > +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (root);
> > > > +
> > > > +  visited->add (root);
> > > >
> > > > visited->add () returns whether the key was already there so
> > > > please combine the contains and the add calls here and elsewhere.
> > > >
> > > >     if (stmt_info && STMT_VINFO_SLP_VECT_ONLY (stmt_info)
> > > > +        && (related_stmt_info = STMT_VINFO_RELATED_STMT
> > > > + (stmt_info)) !=
> > > > NULL)
> > > > +      {
> > > > +       if (dump_enabled_p ())
> > > >
> > > > I think STMT_VINFO_SLP_VECT_ONLY is a thing already, so I wonder if
> > > > the above is a sufficient test.  There's is_pattern_stmt_p () (which
> > > > obviously also applies to non-SLP only patterns).
> > > >
> > > > Note that I also see another issue, namely with hybrid SLP the
> > > > original non-pattern stmt would need to stay relevant.  That
> > > > probably asks for hybrid SLP discovery and relevance marking to be
> > > > combined somehow.  You can probably create a simple testcase by
> > > > storing a lane of a complex op via a non-grouped store.
> > > >
> > > > +       STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
> > > > +       STMT_VINFO_RELEVANT (related_stmt_info) =
> > > > + vect_used_in_scope;
> > > >
> > > > as said previously the relevance should be copied, in this case back
> > > > to the original stmt info from the pattern stmt info.
> > > >
> > > > +       STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
> > > > +       STMT_SLP_TYPE (related_stmt_info) = loop_vect;
> > > >
> > > > one option might be to delay pattern recognition (for the loop case)
> > > > to after vect_detect_hybrid_slp and simply not make any hybrid stmt
> > > > containing nodes part of a SLP pattern.  It would be a workaround of
> > > > course, not the best solution.  Another option is to add a new field
> > > > to stmt_info to put a "saved" relevance to and simply go over all
> > > > stmts restoring relevance - we already restore SLP_TYPE to loop_vect
> > > > that way at
> > >
> > > Yes.. ideally I would want something like telling it "don't worry
> > > about this if you end up being pure SLP".
> > >
> > > I haven't tried with this new approach to instead add it to the
> > PATTERN_SEQ
> > > for the statement.   Last time I tried however it looked like in order to get
> > this
> > > to work correctly I'd need a bigger hack..
> > 
> > I don't think PATTERN_SEQ would help us here, we'd need a "ORIG_SEQ"
> > though we should eventually have that by means of the pattern root scalar
> > stmts.
> 
> Shouldn't it? If I mark the unused nodes as part of the same pattern, vect_analyze_stmt
> would examine the pattern instead, but once it's relevancy is determined it won't analyze it
> again.  The unused statements then don't have any relevancy which will get
> vect_detect_hybrid_slp to ignore them which should get us what we want no?
> 
> Unless I'm misinterpreting something.

The area is complex so I might misremeber as well ... guess we need
to figure out the hard way.

Richard.

> Regards,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > >   /* Reset SLP type to loop_vect on all stmts.  */
> > > >   for (i = 0; i < LOOP_VINFO_LOOP (loop_vinfo)->num_nodes; ++i)
> > > >     {
> > > >
> > > > so simply restore the saved relevance would work there (guess I like
> > > > that more than what vect_dissolve_slp_only_patterns does, in any
> > > > case do what vect_dissolve_slp_only_patterns does in the above loop
> > > > please).
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > >
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > 	* Makefile.in (tree-vect-slp-patterns.o): New.
> > > > > 	* doc/passes.texi: Update documentation.
> > > > > 	* tree-vect-slp.c (vect_print_slp_tree): Add new state.
> > > > > 	(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
> > > > > 	(vect_analyze_slp): Call pattern matcher.
> > > > > 	* tree-vectorizer.h (enum _complex_operation):
> > > > > 	(class vect_pattern_match, class vect_pattern): New.
> > > > > 	* tree-vect-slp-patterns.c: New file.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: rguenther@c653.arch.suse.de <rguenther@c653.arch.suse.de>
> > > > > > On Behalf Of Richard Biener
> > > > > > Sent: Tuesday, September 29, 2020 10:42 AM
> > > > > > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > > > > > Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>;
> > > > gcc-
> > > > > > patches@gcc.gnu.org
> > > > > > Subject: Re: [PATCH v2 3/16]middle-end Add basic SLP pattern
> > > > > > matching scaffolding.
> > > > > >
> > > > > > On Tue, 29 Sep 2020, Richard Sandiford wrote:
> > > > > >
> > > > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > > > >> > > @@ -2192,6 +2378,17 @@ vect_analyze_slp_instance
> > > > > > > >> > > (vec_info
> > > > > > *vinfo,
> > > > > > > >> > >                               &tree_size, bst_map);
> > > > > > > >> > >    if (node != NULL)
> > > > > > > >> > >      {
> > > > > > > >> > > +      /* Temporarily allow add_stmt calls again.  */
> > > > > > > >> > > +      vinfo->stmt_vec_info_ro = false;
> > > > > > > >> > > +
> > > > > > > >> > > +      /* See if any patterns can be found in the
> > > > > > > >> > > + constructed SLP
> > > > tree
> > > > > > > >> > > +        before we do any analysis on it.  */
> > > > > > > >> > > +      vect_match_slp_patterns (node, vinfo,
> > > > > > > >> > > + group_size,
> > > > > > &max_nunits,
> > > > > > > >> > > +                              matches, &npermutes,
> > > > > > > >> > > + &tree_size, bst_map);
> > > > > > > >> > > +
> > > > > > > >> > > +      /* After this no more add_stmt calls are allowed.  */
> > > > > > > >> > > +      vinfo->stmt_vec_info_ro = true;
> > > > > > > >> > > +
> > > > > > > >> > >
> > > > > > > >> > > I think this is a bit early to match patterns - I'd
> > > > > > > >> > > defer it to the point where all entries into the same
> > > > > > > >> > > SLP subgraph are analyzed, thus somewhere at the end of
> > > > > > > >> > > vect_analyze_slp loop over all instances and match
> > > > > > > >> > > patterns?  That way phases are
> > > > more
> > > > > > clearly separated.
> > > > > > > >> >
> > > > > > > >> > That would probably work, my only worry is that the SLP
> > > > > > > >> > analysis itself may fail and bail out at
> > > > > > > >> >
> > > > > > > >> > 	  /* If the loads and stores can be handled with
> > > > > > > >> > load/store-
> > > > lane
> > > > > > > >> > 	     instructions do not generate this SLP instance.  */
> > > > > > > >> > 	  if (is_a <loop_vec_info> (vinfo)
> > > > > > > >> > 	      && loads_permuted
> > > > > > > >> > 	      && dr && vect_store_lanes_supported (vectype,
> > > > group_size,
> > > > > > > >> > false))
> > > > > > > >
> > > > > > > > Ah, that piece of code.  Yeah, I'm repeatedly running into
> > > > > > > > it as well - it's a bad hack that stands in the way all the
> > > > > > > > time :/
> > > > > > >
> > > > > > > At one point I was wondering about trying to drop the above,
> > > > > > > vectorise with and without SLP, and then compare their costs,
> > > > > > > like for
> > > > > > VECT_COMPARE_COSTS.
> > > > > > > But that seemed like a dead end with the move to doing
> > > > > > > everything on the SLP representation.
> > > > > >
> > > > > > Yeah ... though even moving everything to the SLP representation
> > > > > > will
> > > > retain
> > > > > > the issue since there it will be N group-size 1 SLP instances
> > > > > > vs. 1 group-
> > > > size N
> > > > > > SLP instance.
> > > > > >
> > > > > > > > I guess we should try moving this upward like to
> > > > > > > > vect_analyze_loop_2 right before
> > > > > > > >
> > > > > > > >   /* Check the SLP opportunities in the loop, analyze and
> > > > > > > > build SLP
> > > > trees.
> > > > > > > > */
> > > > > > > >   ok = vect_analyze_slp (loop_vinfo, *n_stmts);
> > > > > > > >   if (!ok)
> > > > > > > >     return ok;
> > > > > > > >
> > > > > > > > and there check whether all grouped loads and stores can be
> > > > > > > > handled with store-/load-lanes (and there are no SLP
> > > > > > > > reduction chains?) in which case do not try to attempt SLP
> > > > > > > > at all.  Because the testcases this check was supposed to
> > > > > > > > change were all-load/store-lane or all SLP so the mixed case is
> > probably not worth special casing.
> > > > > > > >
> > > > > > > > Since load-/store-lanes is an arm speciality I tried to only
> > > > > > > > touch this fragile part with a ten-foot pole ;)  CCing
> > > > > > > > Richard, if he acks the above I can produce a patch.
> > > > > > >
> > > > > > > Yeah, sounds good to me.  Probably also sorth checking whether
> > > > > > > the likely_max iteration count is high enough to support
> > > > > > > group_size vectors, if we have enough information to guess that.
> > > > > > >
> > > > > > > We could also get the gen* machinery to emit a macro that is
> > > > > > > true if at least one load/store-lane pattern is defined, so
> > > > > > > that we can skip the code for non-Arm targets.  I can do that as a
> > follow-up.
> > > > > >
> > > > > > I've had a second look and one complication is that we only
> > > > > > elide the SLP node if any of the loads are permuted.  So if all
> > > > > > loads/stores are
> > > > unpermuted
> > > > > > but load/store-lanes would work we'd keep the SLP node.
> > > > > >
> > > > > > Of course without actually building the SLP node we don't know
> > > > > > whether
> > > > the
> > > > > > loads will be permuted or not ...
> > > > > >
> > > > > > But surely the current place for the check will cause some
> > > > > > testcases to become hybrid vectorizations which is likely undesirable.
> > > > > >
> > > > > > So we could move the check after all SLP discovery is completed
> > > > > > and
> > > > throw it
> > > > > > all away if we can and should use load/store-lanes?
> > > > > > But that might then not solve Tamars issue.
> > > > > >
> > > > > > Richard.
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguenther@suse.de>
> > > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > > Nuernberg, Germany; GF: Felix Imend
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imend
>
diff mbox series

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 9c6c1c93b976aaf350cc1f9b3bdc538308fdf08b..936202b73696c8529b32c05b2356c7316fabc542 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1638,6 +1638,7 @@  OBJS = \
 	tree-vect-loop.o \
 	tree-vect-loop-manip.o \
 	tree-vect-slp.o \
+	tree-vect-slp-patterns.o \
 	tree-vectorizer.o \
 	tree-vector-builder.o \
 	tree-vrp.o \
diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
index a5ae4143a8c1293e674b499120372ee5fe5c412b..c86df5cd843084a5b7933ef99a23386891a7b0c1 100644
--- a/gcc/doc/passes.texi
+++ b/gcc/doc/passes.texi
@@ -709,7 +709,8 @@  loop.
 The pass is implemented in @file{tree-vectorizer.c} (the main driver),
 @file{tree-vect-loop.c} and @file{tree-vect-loop-manip.c} (loop specific parts
 and general loop utilities), @file{tree-vect-slp} (loop-aware SLP
-functionality), @file{tree-vect-stmts.c} and @file{tree-vect-data-refs.c}.
+functionality), @file{tree-vect-stmts.c}, @file{tree-vect-data-refs.c} and
+@file{tree-vect-slp-patterns.c} containing the SLP pattern matcher.
 Analysis of data references is in @file{tree-data-ref.c}.
 
 SLP Vectorization.  This pass performs vectorization of straight-line code. The
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
new file mode 100644
index 0000000000000000000000000000000000000000..f605f68d2a14c4bf4941f97b7c1d57f6acb5ffb1
--- /dev/null
+++ b/gcc/tree-vect-slp-patterns.c
@@ -0,0 +1,310 @@ 
+/* SLP - Pattern matcher on SLP trees
+   Copyright (C) 2020 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "optabs-tree.h"
+#include "insn-config.h"
+#include "recog.h"		/* FIXME: for insn_data */
+#include "fold-const.h"
+#include "stor-layout.h"
+#include "gimple-iterator.h"
+#include "cfgloop.h"
+#include "tree-vectorizer.h"
+#include "langhooks.h"
+#include "gimple-walk.h"
+#include "dbgcnt.h"
+#include "tree-vector-builder.h"
+#include "vec-perm-indices.h"
+#include "gimple-fold.h"
+#include "internal-fn.h"
+
+/* SLP Pattern matching mechanism.
+
+  This extension to the SLP vectorizer allows one to transform the generated SLP
+  tree based on any pattern.  The difference between this and the normal vect
+  pattern matcher is that unlike the former, this matcher allows you to match
+  with instructions that do not belong to the same SSA dominator graph.
+
+  The only requirement that this pattern matcher has is that you are only
+  only allowed to either match an entire group or none.
+
+  As an example, the following simple loop:
+
+    double a[restrict N]; double b[restrict N]; double c[restrict N];
+
+    for (int i=0; i < N; i+=2)
+    {
+      c[i] = a[i] - b[i+1];
+      c[i+1] = a[i+1] + b[i];
+    }
+
+  which represents a complex addition on with a rotation of 90* around the
+  argand plane. i.e. if `a` and `b` were complex numbers then this would be the
+  same as `a + (b * I)`.
+
+  Here the expressions for `c[i]` and `c[i+1]` are independent but have to be
+  both recognized in order for the pattern to work.  As an SLP tree this is
+  represented as
+
+                +--------------------------------+
+                |       stmt 0 *_9 = _10;        |
+                |       stmt 1 *_15 = _16;       |
+                +--------------------------------+
+                                |
+                                |
+                                v
+                +--------------------------------+
+                |     stmt 0 _10 = _4 - _8;      |
+                |    stmt 1 _16 = _12 + _14;     |
+                | lane permutation { 0[0] 1[1] } |
+                +--------------------------------+
+                            |        |
+                            |        |
+                            |        |
+               +-----+      |        |      +-----+
+               |     |      |        |      |     |
+         +-----| { } |<-----+        +----->| { } --------+
+         |     |     |   +------------------|     |       |
+         |     +-----+   |                  +-----+       |
+         |        |      |                                |
+         |        |      |                                |
+         |        +------|------------------+             |
+         |               |                  |             |
+         v               v                  v             v
+     +--------------------------+     +--------------------------------+
+     |     stmt 0 _8 = *_7;     |     |        stmt 0 _4 = *_3;        |
+     |    stmt 1 _14 = *_13;    |     |       stmt 1 _12 = *_11;       |
+     | load permutation { 1 0 } |     |    load permutation { 0 1 }    |
+     +--------------------------+     +--------------------------------+
+
+  The pattern matcher allows you to replace both statements 0 and 1 or none at
+  all.  You are also allowed to replace and match on any number of nodes.
+
+  The pattern matcher uses a sliding window to handle unrolled cases.  Every
+  pattern has to declare the number of statements that they consume.  The
+  pattern matcher uses this to incrementally ask if the pattern can be applied.
+  This is done using the method `matches ()`.
+
+  If the pattern can be applied a VecPatternMatch is returned which contains all
+  state information on where the match was found.  This is stored in a list of
+  operations to perform.  If the match cannot be applied then the current
+  pattern is aborted and no changes made to the tree.
+
+  The pattern matcher has two modes:
+
+  1) pre-order traversal is used to perform a check to see if the pattern can be
+     applied or not. If the pattern can be applied then a second step is
+     performed that allows the pattern to rewrite it's children.  This step is
+     required because the application of a pattern can change the layout of the
+     tree which affects the nodes that are still to be matched.  This is
+     performed using `validate_p ()`.
+
+  2) post-order traversal is used to actually perform the rewriting of the
+     matches found earlier.  This is done by calling `build ()` on all matches
+     that were found earlier.
+
+  The pattern matcher currently only allows you to perform replacements to
+  internal functions.
+
+  To add a new pattern, implement the VectPattern class and add the type to
+  slp_patterns.  */
+
+/* VectSimplePatternMatch holds contextual information about a single match
+   found in the SLP tree.  The use of the class is to allow you to defer
+   performing any modifications to the SLP tree until they are to be done.  By
+   calling build () the modifications are done in-place as to allow also re-
+   writing of the root node.  */
+
+class VectSimplePatternMatch : public VectPatternMatch
+{
+  protected:
+    uint8_t m_arity;
+    vec<stmt_vec_info> m_ifn_args;
+    internal_fn m_ifn;
+    vec_info *m_vinfo;
+    int m_idx, m_num_args;
+    tree m_type, m_vectype;
+    slp_tree m_node;
+    int m_pos;
+
+  public:
+    VectSimplePatternMatch (uint8_t arity, vec<stmt_vec_info> ifn_args,
+			    internal_fn ifn, vec_info *vinfo, int idx,
+			    slp_tree node, tree type, tree vectype,
+			    int num_args)
+    {
+      /* Number of statements the pattern matches against.  */
+      this->m_arity = arity;
+
+      /* Arguments to be used when building the new stmts using the IFN.  */
+      this->m_ifn_args = ifn_args.copy ();
+
+      /* The IFN to create the new statements with.  */
+      this->m_ifn = ifn;
+
+      /* The vectorization information for the current loop.  */
+      this->m_vinfo = vinfo;
+
+      /* The index in the sliding window where the statements were matched.  */
+      this->m_idx = idx;
+
+      /* The number of arguments required to create the new IFN.  */
+      this->m_num_args = num_args;
+
+      /* The original scalar type of the statement being replaced.  */
+      this->m_type = type;
+
+      /* The vector type to create the IFN for.  */
+      this->m_vectype = vectype;
+
+      /* The node that contains the statement that is being replaced.  */
+      this->m_node = node;
+
+      /* The current position inside the arity of the statement being replaced.
+	 generally the match can be cached and re-used for multiple stmts.  */
+      this->m_pos = 0;
+
+      gcc_assert ((unsigned)(num_args * arity) == ifn_args.length ());
+    }
+
+    uint8_t get_arity ()
+    {
+      return this->m_arity;
+    }
+
+    internal_fn get_IFN ()
+    {
+      return this->m_ifn;
+    }
+
+    const vec<stmt_vec_info> get_IFN_args ()
+    {
+      return this->m_ifn_args;
+    }
+
+    /* Create a replacement pattern statement for STMT_INFO and inserts the new
+       statement into NODE.  The statement is created as call to internal
+       function IFN with arguments ARGS.  The arity of IFN needs to match the
+       amount of elements in ARGS.  The scalar type of the statement as TYPE and
+       the corresponding vector type VECTYPE.  These two types are used to
+       construct the new vector only replacement pattern statement.
+
+       Futhermore the new pattern is also added to the vectorization information
+       structure VINFO and the old statement STMT_INFO is marked as unused while
+       the new statement is marked as used and the number of SLP uses of the new
+       statement is incremented.
+
+       The newly created SLP nodes are marked as SLP only and will be dissolved
+       if SLP is aborted.
+
+       The newly created gimple call is returned and the BB remains unchanged.
+    */
+
+    gcall *build ()
+    {
+      stmt_vec_info stmt_info;
+
+      /* Check if this call was made too often.  */
+      if (this->m_pos >= this->m_arity)
+	return NULL;
+
+      auto_vec<tree> args;
+      args.create (this->m_num_args);
+
+      /* Create the argument set for use by gimple_build_call_internal_vec.  */
+      stmt_vec_info arg;
+      for (int i = 0; i < this->m_num_args; i++)
+	{
+	  arg = this->m_ifn_args[i + (this->m_pos * this->m_num_args)];
+	  args.quick_push (gimple_get_lhs (STMT_VINFO_STMT (arg)));
+	}
+
+      /* Check to see if we haven't created all the nodes already.  */
+      if (args.is_empty ())
+	return NULL;
+
+      /* Calculate the location of the statement in NODE to replace.  */
+      int entry = this->m_idx - (this->m_arity - 1) + this->m_pos;
+      stmt_info = SLP_TREE_SCALAR_STMTS (this->m_node)[entry];
+
+      /* Create the new pattern statements.  */
+      gcall *call_stmt = gimple_build_call_internal_vec (this->m_ifn, args);
+      tree var = make_temp_ssa_name (this->m_type, call_stmt, "slp_patt");
+      gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
+      gimple_call_set_lhs (call_stmt, var);
+      gimple_set_location (call_stmt, gimple_location (old_stmt));
+      gimple_call_set_nothrow (call_stmt, true);
+
+      /* Adjust the book-keeping for the new and old statements for use during SLP.
+         This is required to get the right VF and statement during SLP analysis.
+         These changes are created after relevancy has been set for the nodes as
+         such we need to manually update them.  Any changes will be undone if SLP
+         is cancelled.  */
+      stmt_vec_info call_stmt_info = this->m_vinfo->add_stmt (call_stmt);
+      vect_mark_pattern_stmts (this->m_vinfo, stmt_info, call_stmt,
+			       this->m_vectype);
+
+      /* We have to explicitly mark the old statement as unused because during
+	 statement analysis the original and new pattern statement may require
+	 different level of unrolling.  As an example add/sub when vectorized
+	 without a pattern requires 4 copies, whereas with a COMPLEX_ADD pattern
+	 this only requires 2 copies and the two statement will be treated as
+	 hand unrolled.  That means that the analysis won't happen as it'll find
+	 a mismatch.  So we don't analyze the old statement and if we end up
+	 needing it, e.g. SLP fails then we have to quickly re-analyze it.  */
+      STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+      STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
+      STMT_VINFO_RELATED_STMT (call_stmt_info) = stmt_info;
+
+      /* Since we are replacing all the statements in the group with the same
+	 thing it doesn't really matter.  So just set it every time a new stmt
+	 is created.  */
+      SLP_TREE_SCALAR_STMTS (this->m_node)[entry] = call_stmt_info;
+      SLP_TREE_REPRESENTATIVE (this->m_node) = call_stmt_info;
+      SLP_TREE_CODE (this->m_node) = gimple_expr_code (call_stmt);;
+
+      this->m_pos++;
+      return call_stmt;
+    }
+
+    ~VectSimplePatternMatch ()
+    {
+      this->m_ifn_args.release ();
+    }
+};
+
+#define SLP_PATTERN(x) &x::create
+VectPatternDecl slp_patterns[]
+{
+  /* For least amount of back-tracking and more efficient matching
+     order patterns from the largest to the smallest.  Especially if they
+     overlap in what they can detect.  */
+};
+#undef SLP_PATTERN
+
+size_t num__slp_patterns = sizeof(slp_patterns)/sizeof(VectPatternDecl);
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 01189d44d892fc42b132bbb7de1c471df45518ae..947b031a6d492e6a02621dbcf41ba60d96c606f0 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -2055,6 +2055,192 @@  calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
   return exact_div (common_multiple (nunits, group_size), group_size);
 }
 
+/* Helper function of vect_match_slp_patterns.
+
+   Attempts to match the given pattern PATT_INFO against the slp tree rooted in
+   NODE using VINFO and GROUP_SIZE.
+
+   If matching is successful the value in NODE is updated and returned, if not
+   then it is returned unchanged.  */
+
+static bool
+vect_match_slp_patterns_2 (slp_tree node, vec_info *vinfo,
+			   unsigned int group_size, VectPatternDecl patt_fn,
+			   poly_uint64 *max_nunits, bool *matches,
+			   unsigned *npermutes, unsigned *tree_size,
+			   scalar_stmts_to_slp_tree_map_t * bst_map)
+{
+  unsigned i;
+  stmt_vec_info stmt_info;
+  if (!node)
+    return false;
+
+  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
+  bool found_p = false, found_rec_p = false;
+  VectPattern *pattern = patt_fn (node, vinfo);
+  uint8_t n = pattern->get_arity ();
+
+  if (group_size % n != 0)
+    {
+      delete pattern;
+      return false;
+    }
+
+  /* The data dependency orderings will force the nodes to be created in the
+     order of their data flow.  Which means since we're matching specific
+     patterns in particular order we only have to do a linear scan here to match
+     the same instruction multiple times.  The group size doesn't have to be
+     constrained.  */
+
+  for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
+    {
+      stmt_info = scalar_stmts[i];
+
+      if (gimple_assign_load_p (STMT_VINFO_STMT (stmt_info))
+	  || gimple_store_p (STMT_VINFO_STMT (stmt_info))
+	  || gimple_assign_cast_p (STMT_VINFO_STMT (stmt_info)))
+        break;
+
+      stmt_vec_info *stmt_infos = scalar_stmts.begin () + (i - (n - 1));
+
+      gcc_assert (stmt_infos);
+
+      if (!pattern->matches (stmt_infos, i))
+	{
+	  /* We can only do replacements for entire groups, we must replace all
+	     statements in a node as the argument list/children may not have
+	     equal height then.  Operations that don't rewrite the arguments
+	     may be safe to do, so perhaps paramatrise it.  */
+
+	  found_p = false;
+	  break;
+	}
+
+      tree type = gimple_expr_type (STMT_VINFO_STMT (stmt_info));
+      tree vectype = get_vectype_for_scalar_type (vinfo, type, node);
+
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Found %s pattern in SLP tree\n",
+			 pattern->get_name ());
+
+      if (pattern->is_optab_supported_p (vectype, OPTIMIZE_FOR_SPEED))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Target supports %s vectorization with mode %T\n",
+			     internal_fn_name (pattern->get_last_ifn ()),
+			     vectype);
+
+	  found_p = true;
+	}
+      else
+        {
+	  if (dump_enabled_p ())
+	    {
+	      if (!vectype)
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "Target does not support vector type for "
+				 "%T\n", type);
+	      else
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "Target does not support %s for "
+				 "vector type %T\n",
+				 internal_fn_name (pattern->get_last_ifn ()),
+						   vectype);
+	    }
+	  found_p = false;
+        }
+    }
+
+  if (found_p)
+    {
+      /* Find which nodes should be the children of the new node.  */
+
+      if (!pattern->validate_p (max_nunits, matches,
+				npermutes, tree_size, bst_map))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "transformation for %s not valid due to post "
+			     "condition\n", internal_fn_name (pattern->get_last_ifn ()));
+	  found_p = false;
+	}
+    }
+
+  /* Perform recursive matching, it's important to do this after matching things
+     in the current node as the matches here may re-order the nodes below it.
+     As such the pattern that needs to be subsequently match may change.  */
+
+  if (SLP_TREE_CHILDREN (node).exists ()) {
+    slp_tree child;
+    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+      found_rec_p |= vect_match_slp_patterns_2 (child, vinfo, group_size,
+						patt_fn, max_nunits, matches,
+						npermutes, tree_size, bst_map);
+  }
+
+  if (found_p)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location, "Creating vec patterns\n");
+
+      while (gcall* call_stmt = pattern->build ())
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location, "\t %p stmt: %G",
+			     node, call_stmt);
+	}
+
+      vect_mark_slp_stmts_relevant (node);
+    }
+
+  delete pattern;
+  return found_p | found_rec_p;
+}
+
+/* Applies pattern matching to the given SLP tree rooted in NODE using vec_info
+   VINFO and group size GROUP_SIZE.
+
+   The modified tree is returned.  Patterns are tried in order and multiple
+   patterns may match.  If the permutes need to be cancelled then
+   CANCEL_PERMUTE is set.  */
+
+static bool
+vect_match_slp_patterns (slp_tree node, vec_info *vinfo,
+			 unsigned int group_size, poly_uint64 *max_nunits,
+			 bool *matches, unsigned *npermutes,
+			 unsigned *tree_size,
+			 scalar_stmts_to_slp_tree_map_t * bst_map)
+{
+  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
+  bool found_p = false;
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt match --\n");
+      vect_print_slp_graph (MSG_NOTE, vect_location, node);
+      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
+    }
+
+  for (unsigned x = 0; x < num__slp_patterns; x++)
+    found_p |= vect_match_slp_patterns_2 (node, vinfo, group_size,
+					  slp_patterns[x], max_nunits, matches,
+					  npermutes, tree_size, bst_map);
+
+  /* TODO: Remove in final version, only here for generating debug dot graphs
+	   from SLP tree.  */
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
+      vect_print_slp_graph (MSG_NOTE, vect_location, node);
+      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
+    }
+
+  return found_p;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -2192,6 +2378,17 @@  vect_analyze_slp_instance (vec_info *vinfo,
 			      &tree_size, bst_map);
   if (node != NULL)
     {
+      /* Temporarily allow add_stmt calls again.  */
+      vinfo->stmt_vec_info_ro = false;
+
+      /* See if any patterns can be found in the constructed SLP tree
+	 before we do any analysis on it.  */
+      vect_match_slp_patterns (node, vinfo, group_size, &max_nunits,
+			       matches, &npermutes, &tree_size, bst_map);
+
+      /* After this no more add_stmt calls are allowed.  */
+      vinfo->stmt_vec_info_ro = true;
+
       /* Calculate the unrolling factor based on the smallest type.  */
       poly_uint64 unrolling_factor
 	= calculate_unrolling_factor (max_nunits, group_size);
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 79926f1a43534635ddca85556a928e364022c40a..95bbf13b1c733c07b7deb8515c1b17c6979cff21 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -26,6 +26,7 @@  typedef class _stmt_vec_info *stmt_vec_info;
 #include "tree-data-ref.h"
 #include "tree-hash-traits.h"
 #include "target.h"
+#include "internal-fn.h"
 
 
 /* Used for naming of new temporaries.  */
@@ -2100,6 +2101,99 @@  typedef hash_map <vec <gimple *>, slp_tree,
 		  simple_hashmap_traits <bst_traits, slp_tree> >
   scalar_stmts_to_slp_tree_map_t;
 
+/* SLP Pattern matcher types, tree-vect-slp-patterns.c.  */
+
+class VectPatternMatch
+{
+  public:
+    virtual gcall *build () = 0;
+    virtual internal_fn get_IFN () = 0;
+    virtual const vec<stmt_vec_info> get_IFN_args () = 0;
+    virtual uint8_t get_arity () = 0;
+    virtual ~VectPatternMatch () {};
+};
+
+class VectPattern
+{
+  protected:
+    uint8_t m_arity;
+    uint8_t m_num_args;
+    internal_fn m_last_ifn;
+    int m_last_idx;
+    slp_tree m_node;
+    vec_info *m_vinfo;
+    vec<VectPatternMatch*> m_matches;
+    VectPattern (slp_tree node, vec_info *vinfo)
+    {
+      this->m_last_ifn = IFN_LAST;
+      this->m_node = node;
+      this->m_vinfo = vinfo;
+      this->m_matches.create (0);
+      this->m_curr_match = 0;
+    }
+
+  private:
+    unsigned m_curr_match;
+
+  public:
+    static VectPattern* create (slp_tree node, vec_info *vinfo);
+    virtual bool matches (stmt_vec_info *stmts, int idx) = 0;
+
+    virtual const char* get_name () = 0;
+    virtual ~VectPattern ()
+    {
+      int i;
+      VectPatternMatch *match;
+      FOR_EACH_VEC_ELT (this->m_matches, i, match)
+        delete match;
+      this->m_matches.release ();
+    }
+
+    virtual gcall *build ()
+    {
+      if (this->m_curr_match >= this->m_matches.length ())
+       return NULL;
+
+      gcall *entry =
+        this->m_matches[this->m_curr_match]->build ();
+
+      if (entry)
+        return entry;
+
+      this->m_curr_match++;
+      return build ();
+    }
+
+    virtual bool validate_p (poly_uint64 *, bool *, unsigned *, unsigned *,
+			     scalar_stmts_to_slp_tree_map_t *)
+    {
+      return true;
+    }
+
+    virtual uint8_t get_arity ()
+    {
+      return this->m_arity;
+    }
+
+    virtual bool is_optab_supported_p ( tree vectype, optimization_type opt_type)
+    {
+      if (!vectype)
+        return false;
+
+      return direct_internal_fn_supported_p (this->m_last_ifn, vectype,
+					     opt_type);
+    }
+
+    internal_fn get_last_ifn ()
+    {
+      return this->m_last_ifn;
+    }
+};
+
+typedef VectPattern* (*VectPatternDecl) (slp_tree, vec_info *);
+extern VectPatternDecl slp_patterns[];
+extern size_t num__slp_patterns;
+
 extern void
 vect_free_slp_tree (slp_tree node);