diff mbox series

[1/2,RFC,middle-end] : Add SLP pattern matcher.

Message ID 20191001113848.GA8365@arm.com
State New
Headers show
Series [1/2,RFC,middle-end] : Add SLP pattern matcher. | expand

Commit Message

Tamar Christina Oct. 1, 2019, 11:38 a.m. UTC
Hi All,

This adds a framework to allow pattern matchers to be written at based on the
SLP tree.  The difference between this one and the one in tree-vect-patterns is
that this matcher allows the matching of an arbitrary number of parallel
statements and replacing of an arbitrary number of children or statements.

Any relationship created by the SLP pattern matcher will be undone if SLP fails.

The pattern matcher can also cancel all permutes depending on what the pattern
requested it to do.  As soon as one pattern requests the permutes to be
cancelled all permutes are cancelled.

Compared to the previous pattern matcher this one will work for an arbitrary
group size and will match at any arbitrary node in the SLP tree.  The only
requirement is that the entire node is matched or rejected.

vect_build_slp_tree_1 is a bit more lenient in what it accepts as "compatible
operations" but the matcher cannot be because in cases where you match the order
of the operands may be changed.  So all operands must be changed or none.

Furthermore the matcher relies on canonization of the operations inside the
SLP tree and on the fact that floating math operations are not commutative.
This means that matching a pattern does not need to try all alternatives or
combinations and that arguments will always be in the same order if it's the
same operation.

The pattern matcher also ignored uninteresting nodes such as type casts, loads
and stores.  Doing so is essential to keep the runtime down.

Each matcher is allowed a post condition that can be run to perform any changes
to the SLP tree as needed before the patterns are created and may also abort
the creation of the patterns.

When a pattern is matched it is not immediately created but instead it is
deferred until all statements in the node have been analyzed.  Only if all post
conditions are true, and all statements will be replaced will the patterns be
created in batch.  This allows us to not have to undo any work if the pattern
fails but also makes it so we traverse the tree only once.

When a new pattern is created it is a marked as a pattern to the statement it is
replacing and be marked as used in the current SLP scope.  If SLP fails then
relationship is undone and the relevancy restored.

Each pattern matcher can detect any number of pattern it wants.  The only
constraint is that the optabs they produce must all have the same arity.

The pattern matcher supports instructions that have no scalar form as they
are added as pattern statements to the stmt.  The BB is left untouched and
so the scalar loop is untouched.

Bootstrapped on aarch64-none-linux-gnu and no issues.
No regression testing done yet.

Thanks,
Tamar

gcc/ChangeLog:

2019-10-01  Tamar Christina  <tamar.christina@arm.com>

	* tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
	(vect_dissolve_slp_only_groups): Use macro.
	* tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
	* tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and how
	to free.
	(ssa_name_def_to_slp_tree_map_t): New.
	(vect_create_new_slp_node, vect_build_slp_tree): Use macro.
	(vect_create_slp_patt_stmt): New.
	(vect_match_slp_patterns_2): New.
	(vect_match_slp_patterns): New.
	(vect_analyze_slp_instance): Call vect_match_slp_patterns and undo
	permutes.
	(vect_detect_hybrid_slp_stmts): Dissolve relationships created for SLP.
	* tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
	(vect_mark_pattern_stmts): New.

--

Comments

Richard Biener Oct. 7, 2019, 11:15 a.m. UTC | #1
On Tue, 1 Oct 2019, Tamar Christina wrote:

> Hi All,
> 
> This adds a framework to allow pattern matchers to be written at based on the
> SLP tree.  The difference between this one and the one in tree-vect-patterns is
> that this matcher allows the matching of an arbitrary number of parallel
> statements and replacing of an arbitrary number of children or statements.
> 
> Any relationship created by the SLP pattern matcher will be undone if SLP fails.
> 
> The pattern matcher can also cancel all permutes depending on what the pattern
> requested it to do.  As soon as one pattern requests the permutes to be
> cancelled all permutes are cancelled.
> 
> Compared to the previous pattern matcher this one will work for an arbitrary
> group size and will match at any arbitrary node in the SLP tree.  The only
> requirement is that the entire node is matched or rejected.
> 
> vect_build_slp_tree_1 is a bit more lenient in what it accepts as "compatible
> operations" but the matcher cannot be because in cases where you match the order
> of the operands may be changed.  So all operands must be changed or none.
> 
> Furthermore the matcher relies on canonization of the operations inside the
> SLP tree and on the fact that floating math operations are not commutative.
> This means that matching a pattern does not need to try all alternatives or
> combinations and that arguments will always be in the same order if it's the
> same operation.
> 
> The pattern matcher also ignored uninteresting nodes such as type casts, loads
> and stores.  Doing so is essential to keep the runtime down.
> 
> Each matcher is allowed a post condition that can be run to perform any changes
> to the SLP tree as needed before the patterns are created and may also abort
> the creation of the patterns.
> 
> When a pattern is matched it is not immediately created but instead it is
> deferred until all statements in the node have been analyzed.  Only if all post
> conditions are true, and all statements will be replaced will the patterns be
> created in batch.  This allows us to not have to undo any work if the pattern
> fails but also makes it so we traverse the tree only once.
> 
> When a new pattern is created it is a marked as a pattern to the statement it is
> replacing and be marked as used in the current SLP scope.  If SLP fails then
> relationship is undone and the relevancy restored.
> 
> Each pattern matcher can detect any number of pattern it wants.  The only
> constraint is that the optabs they produce must all have the same arity.
> 
> The pattern matcher supports instructions that have no scalar form as they
> are added as pattern statements to the stmt.  The BB is left untouched and
> so the scalar loop is untouched.
> 
> Bootstrapped on aarch64-none-linux-gnu and no issues.
> No regression testing done yet.

If you split out the introduction of SLP_TREE_REF_COUNT you can commit
that right now (sorry for being too lazy there...).

One overall comment - you do pattern matching after SLP tree
creation (good) but still do it before the whole SLP graph is
created (bad).  Would it be possible to instead do it as a separate
phase in vect_analyze_slp, looping over all instances (the instances
present entries into the single unified SLP graph now), avoiding
to visit "duplicates"?

In patch 1/2 I see references (mostly in variable names) to
"complex" which is IMHO too specific.

I'd also think that a useful first pattern to support would be
what we do with SLP_TREE_TWO_OPERATORS, where code generation
inserts extra blends.  I have yet to dive into the complex pattern
details to see if that's feasible or if you maybe re-use that...
You seem to at least rely on the SLP build succeeding with
the mixed plus/minus cases?  Which also restricts the kind of
patterns we can recognize I guess.  Plus it shows the chicken-and-egg
issue here - we'd like to detect the patterns _first_ and then
build the SLP trees (or rather combine lanes into larger groups).
Doing it after the fact makes it no more powerful than doing
it as folding post vectorization?

+typedef hash_map <stmt_vec_info, slp_tree>
+  ssa_name_def_to_slp_tree_map_t;
+

this seems to be unused.

+  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
+    {
...
+      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
+      work.idx = i;

so this tries to match patterns on ARITY arbitrary aligned
but adjacent scalar stmts [in a brute force way].  But then
you immediately fail when one matching fails.  So I think
this loop should be written like

 for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
   {
...

to make this fact clearer.  A missed optimization here is to consider
pre/post shuffling of the group to make more cases match.

In this loop you also do the target support check which could
possibly be done upfront in the pattern matcher itself to save
compile-time?  It also seems to be required that patterns match
a single IFN call?

Looking at the complex patterns I am worried and confused about
the transform phase - just quoting/commenting on random pieces:

+  FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt)
+    {
+      if (defs.contains (scalar_stmt))
+       {

this is quadratic - vec::contains does a linear search.

         arg_map->put (scalar_stmt, vect_split_slp_tree (node, i));
+         found_p = true;

it seems that you are re-doing the match here, something that
should have been done in the first phase of pattern matching already.
May I suggest to restructure the pattern matchers in a way that you
have a

 class slp_pattern
 {
   virtual match() = 0;
   virtual transform() = 0;
 };

and derive from that so you can have a pattern specific state you
can transfer from match to transform?  Iteration over patterns
then either becomes ad-hoc or you find a way to iterate over
an "array of types" with our C++04 features creating a new
instance when you match to hold this state?

I also wonder what you are doing here - shouldn't it be "simply"
replacing a subgraph of the SLP tree with a single new SLP
node that now has those IFNs as "scalar" stmts (they are not
really scalars anymore because of that arity issue).  This also
means that code-generation might better not go the "traditional"
way but instead we use a new vect_transform_slp_pattern function
which does the natural thing and we'll just have the pattern
IFN recorded directly in the slp_tree structure?  (I realize
the complication of vect_get_slp_defs using the scalar stmts to
identify vectorized operands defs)

That said, I still think the same result can be achieved by
post-vectorizer pattern matching.  I also think that
doing pattern matching after the SLP tree build is backwards.
My vision is that we'd do more general graph matching on
the SSA graph forming the SLP tree rather than the current
ad-hoc matching starting from special "sinks".

Thanks,
Richard.


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 2019-10-01  Tamar Christina  <tamar.christina@arm.com>
> 
> 	* tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
> 	(vect_dissolve_slp_only_groups): Use macro.
> 	* tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
> 	* tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and how
> 	to free.
> 	(ssa_name_def_to_slp_tree_map_t): New.
> 	(vect_create_new_slp_node, vect_build_slp_tree): Use macro.
> 	(vect_create_slp_patt_stmt): New.
> 	(vect_match_slp_patterns_2): New.
> 	(vect_match_slp_patterns): New.
> 	(vect_analyze_slp_instance): Call vect_match_slp_patterns and undo
> 	permutes.
> 	(vect_detect_hybrid_slp_stmts): Dissolve relationships created for SLP.
> 	* tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
> 	(vect_mark_pattern_stmts): New.
> 
>
Tamar Christina Oct. 8, 2019, 4:37 p.m. UTC | #2
Hi Richi,

Thanks for the review, I've added some comments inline.

The 10/07/2019 12:15, Richard Biener wrote:
> On Tue, 1 Oct 2019, Tamar Christina wrote:
> 
> > Hi All,
> > 
> > This adds a framework to allow pattern matchers to be written at based on the
> > SLP tree.  The difference between this one and the one in tree-vect-patterns is
> > that this matcher allows the matching of an arbitrary number of parallel
> > statements and replacing of an arbitrary number of children or statements.
> > 
> > Any relationship created by the SLP pattern matcher will be undone if SLP fails.
> > 
> > The pattern matcher can also cancel all permutes depending on what the pattern
> > requested it to do.  As soon as one pattern requests the permutes to be
> > cancelled all permutes are cancelled.
> > 
> > Compared to the previous pattern matcher this one will work for an arbitrary
> > group size and will match at any arbitrary node in the SLP tree.  The only
> > requirement is that the entire node is matched or rejected.
> > 
> > vect_build_slp_tree_1 is a bit more lenient in what it accepts as "compatible
> > operations" but the matcher cannot be because in cases where you match the order
> > of the operands may be changed.  So all operands must be changed or none.
> > 
> > Furthermore the matcher relies on canonization of the operations inside the
> > SLP tree and on the fact that floating math operations are not commutative.
> > This means that matching a pattern does not need to try all alternatives or
> > combinations and that arguments will always be in the same order if it's the
> > same operation.
> > 
> > The pattern matcher also ignored uninteresting nodes such as type casts, loads
> > and stores.  Doing so is essential to keep the runtime down.
> > 
> > Each matcher is allowed a post condition that can be run to perform any changes
> > to the SLP tree as needed before the patterns are created and may also abort
> > the creation of the patterns.
> > 
> > When a pattern is matched it is not immediately created but instead it is
> > deferred until all statements in the node have been analyzed.  Only if all post
> > conditions are true, and all statements will be replaced will the patterns be
> > created in batch.  This allows us to not have to undo any work if the pattern
> > fails but also makes it so we traverse the tree only once.
> > 
> > When a new pattern is created it is a marked as a pattern to the statement it is
> > replacing and be marked as used in the current SLP scope.  If SLP fails then
> > relationship is undone and the relevancy restored.
> > 
> > Each pattern matcher can detect any number of pattern it wants.  The only
> > constraint is that the optabs they produce must all have the same arity.
> > 
> > The pattern matcher supports instructions that have no scalar form as they
> > are added as pattern statements to the stmt.  The BB is left untouched and
> > so the scalar loop is untouched.
> > 
> > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > No regression testing done yet.
> 
> If you split out the introduction of SLP_TREE_REF_COUNT you can commit
> that right now (sorry for being too lazy there...).
> 

I'll split those off :)

> One overall comment - you do pattern matching after SLP tree
> creation (good) but still do it before the whole SLP graph is
> created (bad).  Would it be possible to instead do it as a separate
> phase in vect_analyze_slp, looping over all instances (the instances
> present entries into the single unified SLP graph now), avoiding
> to visit "duplicates"?
> 

It should be, the only issue I can see is that build SLP may fail because of
an unsupported permute, or because it can use load lanes.  If I'm understanding
it correctly you wouldn't get SLP vectorization in those cases so then the matching
can't work? So it would limit it a it more.

> In patch 1/2 I see references (mostly in variable names) to
> "complex" which is IMHO too specific.
> 

Sorry, missed those during my cleanup.

> I'd also think that a useful first pattern to support would be
> what we do with SLP_TREE_TWO_OPERATORS, where code generation
> inserts extra blends.  I have yet to dive into the complex pattern
> details to see if that's feasible or if you maybe re-use that...

Oh, I hadn't thought of that. I'll take a look.

> You seem to at least rely on the SLP build succeeding with
> the mixed plus/minus cases?  Which also restricts the kind of
> patterns we can recognize I guess.  Plus it shows the chicken-and-egg
> issue here - we'd like to detect the patterns _first_ and then
> build the SLP trees (or rather combine lanes into larger groups).
> Doing it after the fact makes it no more powerful than doing
> it as folding post vectorization?

It's true that I do rely on build SLP succeeding, and there is one case I know
of where SLP build fails when I didn't expect it to.  But I was operating under
the impression that that's just a case that needs better support in SLP build.


There were two other approaches I had considered here before landing on this one:

1) Do the pattern matching as part of SLP build.  This would get around the issue that
   SLP build would have failed when the pattern could apply, but comes with a much higher
   overhead on SLP build as this now needs to back-track or keep a list of possible alternatives.
   either way you're going to lose more in space and/or time over doing the pattern match post build.
   The FMA/FMS cases are matching a subtree essentially.

2) Doing it post vectorization is also more work as you have to rewrite all uses and defs.
   if I'm not mistaken I have to keep the SSA names matching at that point. But also there are/can be
   target specific gimple coming out of vectorizable_*, like Arm target's use of .LOAD_LANES instead of
   a load and permute.  This means I would have to handle target specific things as well instead of
   doing it at the SLP tree level where I'd have less issues.  At least for basic things like loads.

   Also you could have the case where the target doesn't support a particular permute but does support
   an instruction that impicitly has the permute, in which can vectorization won't happen.
   You could in principle use the pattern matcher to work around this.

   And lastly, (I'm not 100% sure about this one) but doing it post vectorization means you've done it
   after cloning no? Usually the use of these instructions results in an iteration doing half the number
   of operations each iteration. The permuted version usually end up needing larger loads, which means
   that I'd have to maintain the amount of iterations as that's fixed at that point no?

   That is to say, the permuted version on a complex double pair for instance has to load two complex
   numbers in order to get the pair or real and imaginare numbers to do the operation on.  So it processes
   32-bytes each iteration.  The version with the instruction only needs a single number at a time, so it
   can do a normal 16-byte load.

   The benefit here is that you then don't need a trailing loop as the compiler sees you always have a
   multiple of 16-bytes in your iteration.

   I may be misunderstanding this, but doing it post vectorization means I'd have to do two 16-byte loads
   since it's too late to change the iteration count?

   This also means that a hand unrolled loop doing e.g. two complex additions with a 90* rotation
   ends up with 4 loads instead of just the one (as doing before cloning makes it not clone as it can merge
   the operations into one.).  But I may be way off here.. This is my current understanding of the code :)
> 
> +typedef hash_map <stmt_vec_info, slp_tree>
> +  ssa_name_def_to_slp_tree_map_t;
> +
> 
> this seems to be unused.
> 
> +  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
> +    {
> ...
> +      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
> +      work.idx = i;
> 
> so this tries to match patterns on ARITY arbitrary aligned
> but adjacent scalar stmts [in a brute force way].  But then
> you immediately fail when one matching fails.  So I think
> this loop should be written like
> 
>  for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
>    {
> ...
> 
> to make this fact clearer.  A missed optimization here is to consider
> pre/post shuffling of the group to make more cases match.
> 
> In this loop you also do the target support check which could
> possibly be done upfront in the pattern matcher itself to save
> compile-time?  It also seems to be required that patterns match
> a single IFN call?
> 

The patterns can return any number of IFNs, for instance the FMA also detects CONJ_FMA and FMS
since the calculations are quite similar and only differ in one node.

> Looking at the complex patterns I am worried and confused about
> the transform phase - just quoting/commenting on random pieces:
> 
> +  FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt)
> +    {
> +      if (defs.contains (scalar_stmt))
> +       {
> 
> this is quadratic - vec::contains does a linear search.

True, I figured it wasn't an issue since defs will always be quite small, it's always 2 * IFN arity,
but...

> 
>          arg_map->put (scalar_stmt, vect_split_slp_tree (node, i));
> +         found_p = true;
> 
> it seems that you are re-doing the match here, something that
> should have been done in the first phase of pattern matching already.

... You are right. I no longer need to do this here.  I missed this during cleanup.

> May I suggest to restructure the pattern matchers in a way that you
> have a
> 
>  class slp_pattern
>  {
>    virtual match() = 0;
>    virtual transform() = 0;
>  };
> 
> and derive from that so you can have a pattern specific state you
> can transfer from match to transform?  Iteration over patterns
> then either becomes ad-hoc or you find a way to iterate over
> an "array of types" with our C++04 features creating a new
> instance when you match to hold this state?
> 

Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :)

> I also wonder what you are doing here - shouldn't it be "simply"
> replacing a subgraph of the SLP tree with a single new SLP
> node that now has those IFNs as "scalar" stmts (they are not
> really scalars anymore because of that arity issue).  This also
> means that code-generation might better not go the "traditional"
> way but instead we use a new vect_transform_slp_pattern function
> which does the natural thing and we'll just have the pattern
> IFN recorded directly in the slp_tree structure?  (I realize
> the complication of vect_get_slp_defs using the scalar stmts to
> identify vectorized operands defs)
> 
> That said, I still think the same result can be achieved by
> post-vectorizer pattern matching.  I also think that
> doing pattern matching after the SLP tree build is backwards.
> My vision is that we'd do more general graph matching on
> the SSA graph forming the SLP tree rather than the current
> ad-hoc matching starting from special "sinks".
>

This was also one of the original things I looked at, though I
had figured the amount of work you have to do to do the subgraph
matching is a lot more complicated.  I couldn't really think of
any way of minimizing back tracking etc into something reasonably
efficient.

Matching post build means I don't really have to backtrack because
I can use assumptions of vect_get_slp_defs to my advantage and just
access each  child node in constant time.

Kind Regards,
Tamar

> Thanks,
> Richard.
> 
> 
> > Thanks,
> > Tamar
> > 
> > gcc/ChangeLog:
> > 
> > 2019-10-01  Tamar Christina  <tamar.christina@arm.com>
> > 
> > 	* tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
> > 	(vect_dissolve_slp_only_groups): Use macro.
> > 	* tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
> > 	* tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and how
> > 	to free.
> > 	(ssa_name_def_to_slp_tree_map_t): New.
> > 	(vect_create_new_slp_node, vect_build_slp_tree): Use macro.
> > 	(vect_create_slp_patt_stmt): New.
> > 	(vect_match_slp_patterns_2): New.
> > 	(vect_match_slp_patterns): New.
> > 	(vect_analyze_slp_instance): Call vect_match_slp_patterns and undo
> > 	permutes.
> > 	(vect_detect_hybrid_slp_stmts): Dissolve relationships created for SLP.
> > 	* tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
> > 	(vect_mark_pattern_stmts): New.
> > 
> > 
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 247165 (AG München)

--
Richard Biener Oct. 11, 2019, 12:02 p.m. UTC | #3
On Tue, 8 Oct 2019, Tamar Christina wrote:

> Hi Richi,
> 
> Thanks for the review, I've added some comments inline.
> 
> The 10/07/2019 12:15, Richard Biener wrote:
> > On Tue, 1 Oct 2019, Tamar Christina wrote:
> > 
> > > Hi All,
> > > 
> > > This adds a framework to allow pattern matchers to be written at based on the
> > > SLP tree.  The difference between this one and the one in tree-vect-patterns is
> > > that this matcher allows the matching of an arbitrary number of parallel
> > > statements and replacing of an arbitrary number of children or statements.
> > > 
> > > Any relationship created by the SLP pattern matcher will be undone if SLP fails.
> > > 
> > > The pattern matcher can also cancel all permutes depending on what the pattern
> > > requested it to do.  As soon as one pattern requests the permutes to be
> > > cancelled all permutes are cancelled.
> > > 
> > > Compared to the previous pattern matcher this one will work for an arbitrary
> > > group size and will match at any arbitrary node in the SLP tree.  The only
> > > requirement is that the entire node is matched or rejected.
> > > 
> > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as "compatible
> > > operations" but the matcher cannot be because in cases where you match the order
> > > of the operands may be changed.  So all operands must be changed or none.
> > > 
> > > Furthermore the matcher relies on canonization of the operations inside the
> > > SLP tree and on the fact that floating math operations are not commutative.
> > > This means that matching a pattern does not need to try all alternatives or
> > > combinations and that arguments will always be in the same order if it's the
> > > same operation.
> > > 
> > > The pattern matcher also ignored uninteresting nodes such as type casts, loads
> > > and stores.  Doing so is essential to keep the runtime down.
> > > 
> > > Each matcher is allowed a post condition that can be run to perform any changes
> > > to the SLP tree as needed before the patterns are created and may also abort
> > > the creation of the patterns.
> > > 
> > > When a pattern is matched it is not immediately created but instead it is
> > > deferred until all statements in the node have been analyzed.  Only if all post
> > > conditions are true, and all statements will be replaced will the patterns be
> > > created in batch.  This allows us to not have to undo any work if the pattern
> > > fails but also makes it so we traverse the tree only once.
> > > 
> > > When a new pattern is created it is a marked as a pattern to the statement it is
> > > replacing and be marked as used in the current SLP scope.  If SLP fails then
> > > relationship is undone and the relevancy restored.
> > > 
> > > Each pattern matcher can detect any number of pattern it wants.  The only
> > > constraint is that the optabs they produce must all have the same arity.
> > > 
> > > The pattern matcher supports instructions that have no scalar form as they
> > > are added as pattern statements to the stmt.  The BB is left untouched and
> > > so the scalar loop is untouched.
> > > 
> > > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > > No regression testing done yet.
> > 
> > If you split out the introduction of SLP_TREE_REF_COUNT you can commit
> > that right now (sorry for being too lazy there...).
> > 
> 
> I'll split those off :)
> 
> > One overall comment - you do pattern matching after SLP tree
> > creation (good) but still do it before the whole SLP graph is
> > created (bad).  Would it be possible to instead do it as a separate
> > phase in vect_analyze_slp, looping over all instances (the instances
> > present entries into the single unified SLP graph now), avoiding
> > to visit "duplicates"?
> > 
> 
> It should be, the only issue I can see is that build SLP may fail because of
> an unsupported permute, or because it can use load lanes.  If I'm understanding
> it correctly you wouldn't get SLP vectorization in those cases so then the matching
> can't work? So it would limit it a it more.

True.  As said later ideally the pattern matching woudl happen before
or rather as part of SLP building so that the operations then actually
match (and we then don't need that plus/minus SLP_TREE_TWO_OPERATORS
handling).  At the moment it sits in a quite awkward place which
may contribute to its complexity.

> > In patch 1/2 I see references (mostly in variable names) to
> > "complex" which is IMHO too specific.
> > 
> 
> Sorry, missed those during my cleanup.
> 
> > I'd also think that a useful first pattern to support would be
> > what we do with SLP_TREE_TWO_OPERATORS, where code generation
> > inserts extra blends.  I have yet to dive into the complex pattern
> > details to see if that's feasible or if you maybe re-use that...
> 
> Oh, I hadn't thought of that. I'll take a look.
> 
> > You seem to at least rely on the SLP build succeeding with
> > the mixed plus/minus cases?  Which also restricts the kind of
> > patterns we can recognize I guess.  Plus it shows the chicken-and-egg
> > issue here - we'd like to detect the patterns _first_ and then
> > build the SLP trees (or rather combine lanes into larger groups).
> > Doing it after the fact makes it no more powerful than doing
> > it as folding post vectorization?
> 
> It's true that I do rely on build SLP succeeding, and there is one case I know
> of where SLP build fails when I didn't expect it to.  But I was operating under
> the impression that that's just a case that needs better support in SLP build.
> 
> 
> There were two other approaches I had considered here before landing on this one:
> 
> 1) Do the pattern matching as part of SLP build.  This would get around the issue that
>    SLP build would have failed when the pattern could apply, but comes with a much higher
>    overhead on SLP build as this now needs to back-track or keep a list of possible alternatives.
>    either way you're going to lose more in space and/or time over doing the pattern match post build.
>    The FMA/FMS cases are matching a subtree essentially.
> 
> 2) Doing it post vectorization is also more work as you have to rewrite all uses and defs.
>    if I'm not mistaken I have to keep the SSA names matching at that point. But also there are/can be
>    target specific gimple coming out of vectorizable_*, like Arm target's use of .LOAD_LANES instead of
>    a load and permute.  This means I would have to handle target specific things as well instead of
>    doing it at the SLP tree level where I'd have less issues.  At least for basic things like loads.
> 
>    Also you could have the case where the target doesn't support a particular permute but does support
>    an instruction that impicitly has the permute, in which can vectorization won't happen.
>    You could in principle use the pattern matcher to work around this.
> 
>    And lastly, (I'm not 100% sure about this one) but doing it post vectorization means you've done it
>    after cloning no? Usually the use of these instructions results in an iteration doing half the number
>    of operations each iteration. The permuted version usually end up needing larger loads, which means
>    that I'd have to maintain the amount of iterations as that's fixed at that point no?

True.

>    That is to say, the permuted version on a complex double pair for instance has to load two complex
>    numbers in order to get the pair or real and imaginare numbers to do the operation on.  So it processes
>    32-bytes each iteration.  The version with the instruction only needs a single number at a time, so it
>    can do a normal 16-byte load.
> 
>    The benefit here is that you then don't need a trailing loop as the compiler sees you always have a
>    multiple of 16-bytes in your iteration.
> 
>    I may be misunderstanding this, but doing it post vectorization means I'd have to do two 16-byte loads
>    since it's too late to change the iteration count?

Yes.

>    This also means that a hand unrolled loop doing e.g. two complex additions with a 90* rotation
>    ends up with 4 loads instead of just the one (as doing before cloning makes it not clone as it can merge
>    the operations into one.).  But I may be way off here.. This is my current understanding of the code :)
> > 
> > +typedef hash_map <stmt_vec_info, slp_tree>
> > +  ssa_name_def_to_slp_tree_map_t;
> > +
> > 
> > this seems to be unused.
> > 
> > +  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
> > +    {
> > ...
> > +      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
> > +      work.idx = i;
> > 
> > so this tries to match patterns on ARITY arbitrary aligned
> > but adjacent scalar stmts [in a brute force way].  But then
> > you immediately fail when one matching fails.  So I think
> > this loop should be written like
> > 
> >  for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
> >    {
> > ...
> > 
> > to make this fact clearer.  A missed optimization here is to consider
> > pre/post shuffling of the group to make more cases match.
> > 
> > In this loop you also do the target support check which could
> > possibly be done upfront in the pattern matcher itself to save
> > compile-time?  It also seems to be required that patterns match
> > a single IFN call?
> > 
> 
> The patterns can return any number of IFNs, for instance the FMA also detects CONJ_FMA and FMS
> since the calculations are quite similar and only differ in one node.
> 
> > Looking at the complex patterns I am worried and confused about
> > the transform phase - just quoting/commenting on random pieces:
> > 
> > +  FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt)
> > +    {
> > +      if (defs.contains (scalar_stmt))
> > +       {
> > 
> > this is quadratic - vec::contains does a linear search.
> 
> True, I figured it wasn't an issue since defs will always be quite small, it's always 2 * IFN arity,
> but...
> 
> > 
> >          arg_map->put (scalar_stmt, vect_split_slp_tree (node, i));
> > +         found_p = true;
> > 
> > it seems that you are re-doing the match here, something that
> > should have been done in the first phase of pattern matching already.
> 
> ... You are right. I no longer need to do this here.  I missed this during cleanup.
> 
> > May I suggest to restructure the pattern matchers in a way that you
> > have a
> > 
> >  class slp_pattern
> >  {
> >    virtual match() = 0;
> >    virtual transform() = 0;
> >  };
> > 
> > and derive from that so you can have a pattern specific state you
> > can transfer from match to transform?  Iteration over patterns
> > then either becomes ad-hoc or you find a way to iterate over
> > an "array of types" with our C++04 features creating a new
> > instance when you match to hold this state?
> > 
> 
> Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :)

;)

> > I also wonder what you are doing here - shouldn't it be "simply"
> > replacing a subgraph of the SLP tree with a single new SLP
> > node that now has those IFNs as "scalar" stmts (they are not
> > really scalars anymore because of that arity issue).  This also
> > means that code-generation might better not go the "traditional"
> > way but instead we use a new vect_transform_slp_pattern function
> > which does the natural thing and we'll just have the pattern
> > IFN recorded directly in the slp_tree structure?  (I realize
> > the complication of vect_get_slp_defs using the scalar stmts to
> > identify vectorized operands defs)
> > 
> > That said, I still think the same result can be achieved by
> > post-vectorizer pattern matching.  I also think that
> > doing pattern matching after the SLP tree build is backwards.
> > My vision is that we'd do more general graph matching on
> > the SSA graph forming the SLP tree rather than the current
> > ad-hoc matching starting from special "sinks".
> >
> 
> This was also one of the original things I looked at, though I
> had figured the amount of work you have to do to do the subgraph
> matching is a lot more complicated.  I couldn't really think of
> any way of minimizing back tracking etc into something reasonably
> efficient.
> 
> Matching post build means I don't really have to backtrack because
> I can use assumptions of vect_get_slp_defs to my advantage and just
> access each  child node in constant time.

Yeah, I realize how you've done it is probably at the easiest and
most flexible point.  Still I fear we're engineering us into a corner
here :/

You know I'm in the process of re-doing the vectorizer in terms
of SLP-only.  For the SLP build to better support BB vectorization
and partial loop vectorization my idea is to start with a set
of SLP instances 1:1 mapping the SSA graph (SLP instances are
acutally the entries into the now shared SLP graph - it's no
longer a tree) and then perform SLP node merging, forming
SLP nodes with group sizes > 1.  During that, or rather as
part of that, SLP pattern matching would come in, merging
some of the SLP nodes, forming the group-size 2 complex math opts.

But rewriting SLP build is quite late on the plan of the rewrite...
so in the current SLP build the pattern matching would hook into
vect_build_slp_tree_1 when we discover alt_stmt_code.  Of course
as you say it's likely more complex...

I wonder if re-synthesizing complex-type operations in regular
pattern matching would help?

That said - I'm fine with going with your approach but only
if you promise to help once I run into it during the SLP
rewrite...

Thanks,
Richard.

> Kind Regards,
> Tamar
> 
> > Thanks,
> > Richard.
> > 
> > 
> > > Thanks,
> > > Tamar
> > > 
> > > gcc/ChangeLog:
> > > 
> > > 2019-10-01  Tamar Christina  <tamar.christina@arm.com>
> > > 
> > > 	* tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
> > > 	(vect_dissolve_slp_only_groups): Use macro.
> > > 	* tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
> > > 	* tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and how
> > > 	to free.
> > > 	(ssa_name_def_to_slp_tree_map_t): New.
> > > 	(vect_create_new_slp_node, vect_build_slp_tree): Use macro.
> > > 	(vect_create_slp_patt_stmt): New.
> > > 	(vect_match_slp_patterns_2): New.
> > > 	(vect_match_slp_patterns): New.
> > > 	(vect_analyze_slp_instance): Call vect_match_slp_patterns and undo
> > > 	permutes.
> > > 	(vect_detect_hybrid_slp_stmts): Dissolve relationships created for SLP.
> > > 	* tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
> > > 	(vect_mark_pattern_stmts): New.
> > > 
> > > 
> > 
> > -- 
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
> 
>
Tamar Christina Oct. 12, 2019, 11:22 a.m. UTC | #4
Hi Richi,

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Friday, October 11, 2019 8:02 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; ook@ucw.cz
> Subject: Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.
> 
> On Tue, 8 Oct 2019, Tamar Christina wrote:
> 
> > Hi Richi,
> >
> > Thanks for the review, I've added some comments inline.
> >
> > The 10/07/2019 12:15, Richard Biener wrote:
> > > On Tue, 1 Oct 2019, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This adds a framework to allow pattern matchers to be written at
> > > > based on the SLP tree.  The difference between this one and the
> > > > one in tree-vect-patterns is that this matcher allows the matching
> > > > of an arbitrary number of parallel statements and replacing of an
> arbitrary number of children or statements.
> > > >
> > > > Any relationship created by the SLP pattern matcher will be undone if
> SLP fails.
> > > >
> > > > The pattern matcher can also cancel all permutes depending on what
> > > > the pattern requested it to do.  As soon as one pattern requests
> > > > the permutes to be cancelled all permutes are cancelled.
> > > >
> > > > Compared to the previous pattern matcher this one will work for an
> > > > arbitrary group size and will match at any arbitrary node in the
> > > > SLP tree.  The only requirement is that the entire node is matched or
> rejected.
> > > >
> > > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as
> > > > "compatible operations" but the matcher cannot be because in cases
> > > > where you match the order of the operands may be changed.  So all
> operands must be changed or none.
> > > >
> > > > Furthermore the matcher relies on canonization of the operations
> > > > inside the SLP tree and on the fact that floating math operations are not
> commutative.
> > > > This means that matching a pattern does not need to try all
> > > > alternatives or combinations and that arguments will always be in
> > > > the same order if it's the same operation.
> > > >
> > > > The pattern matcher also ignored uninteresting nodes such as type
> > > > casts, loads and stores.  Doing so is essential to keep the runtime down.
> > > >
> > > > Each matcher is allowed a post condition that can be run to
> > > > perform any changes to the SLP tree as needed before the patterns
> > > > are created and may also abort the creation of the patterns.
> > > >
> > > > When a pattern is matched it is not immediately created but
> > > > instead it is deferred until all statements in the node have been
> > > > analyzed.  Only if all post conditions are true, and all
> > > > statements will be replaced will the patterns be created in batch.
> > > > This allows us to not have to undo any work if the pattern fails but also
> makes it so we traverse the tree only once.
> > > >
> > > > When a new pattern is created it is a marked as a pattern to the
> > > > statement it is replacing and be marked as used in the current SLP
> > > > scope.  If SLP fails then relationship is undone and the relevancy
> restored.
> > > >
> > > > Each pattern matcher can detect any number of pattern it wants.
> > > > The only constraint is that the optabs they produce must all have the
> same arity.
> > > >
> > > > The pattern matcher supports instructions that have no scalar form
> > > > as they are added as pattern statements to the stmt.  The BB is
> > > > left untouched and so the scalar loop is untouched.
> > > >
> > > > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > > > No regression testing done yet.
> > >
> > > If you split out the introduction of SLP_TREE_REF_COUNT you can
> > > commit that right now (sorry for being too lazy there...).
> > >
> >
> > I'll split those off :)
> >
> > > One overall comment - you do pattern matching after SLP tree
> > > creation (good) but still do it before the whole SLP graph is
> > > created (bad).  Would it be possible to instead do it as a separate
> > > phase in vect_analyze_slp, looping over all instances (the instances
> > > present entries into the single unified SLP graph now), avoiding to
> > > visit "duplicates"?
> > >
> >
> > It should be, the only issue I can see is that build SLP may fail
> > because of an unsupported permute, or because it can use load lanes.
> > If I'm understanding it correctly you wouldn't get SLP vectorization
> > in those cases so then the matching can't work? So it would limit it a it more.
> 
> True.  As said later ideally the pattern matching woudl happen before or
> rather as part of SLP building so that the operations then actually match (and
> we then don't need that plus/minus SLP_TREE_TWO_OPERATORS handling).
> At the moment it sits in a quite awkward place which may contribute to its
> complexity.
> 
> > > In patch 1/2 I see references (mostly in variable names) to
> > > "complex" which is IMHO too specific.
> > >
> >
> > Sorry, missed those during my cleanup.
> >
> > > I'd also think that a useful first pattern to support would be what
> > > we do with SLP_TREE_TWO_OPERATORS, where code generation inserts
> > > extra blends.  I have yet to dive into the complex pattern details
> > > to see if that's feasible or if you maybe re-use that...
> >
> > Oh, I hadn't thought of that. I'll take a look.
> >
> > > You seem to at least rely on the SLP build succeeding with the mixed
> > > plus/minus cases?  Which also restricts the kind of patterns we can
> > > recognize I guess.  Plus it shows the chicken-and-egg issue here -
> > > we'd like to detect the patterns _first_ and then build the SLP
> > > trees (or rather combine lanes into larger groups).
> > > Doing it after the fact makes it no more powerful than doing it as
> > > folding post vectorization?
> >
> > It's true that I do rely on build SLP succeeding, and there is one
> > case I know of where SLP build fails when I didn't expect it to.  But
> > I was operating under the impression that that's just a case that needs
> better support in SLP build.
> >
> >
> > There were two other approaches I had considered here before landing on
> this one:
> >
> > 1) Do the pattern matching as part of SLP build.  This would get around the
> issue that
> >    SLP build would have failed when the pattern could apply, but comes with
> a much higher
> >    overhead on SLP build as this now needs to back-track or keep a list of
> possible alternatives.
> >    either way you're going to lose more in space and/or time over doing the
> pattern match post build.
> >    The FMA/FMS cases are matching a subtree essentially.
> >
> > 2) Doing it post vectorization is also more work as you have to rewrite all
> uses and defs.
> >    if I'm not mistaken I have to keep the SSA names matching at that point.
> But also there are/can be
> >    target specific gimple coming out of vectorizable_*, like Arm target's use
> of .LOAD_LANES instead of
> >    a load and permute.  This means I would have to handle target specific
> things as well instead of
> >    doing it at the SLP tree level where I'd have less issues.  At least for basic
> things like loads.
> >
> >    Also you could have the case where the target doesn't support a
> particular permute but does support
> >    an instruction that impicitly has the permute, in which can vectorization
> won't happen.
> >    You could in principle use the pattern matcher to work around this.
> >
> >    And lastly, (I'm not 100% sure about this one) but doing it post
> vectorization means you've done it
> >    after cloning no? Usually the use of these instructions results in an
> iteration doing half the number
> >    of operations each iteration. The permuted version usually end up
> needing larger loads, which means
> >    that I'd have to maintain the amount of iterations as that's fixed at that
> point no?
> 
> True.
> 
> >    That is to say, the permuted version on a complex double pair for instance
> has to load two complex
> >    numbers in order to get the pair or real and imaginare numbers to do the
> operation on.  So it processes
> >    32-bytes each iteration.  The version with the instruction only needs a
> single number at a time, so it
> >    can do a normal 16-byte load.
> >
> >    The benefit here is that you then don't need a trailing loop as the
> compiler sees you always have a
> >    multiple of 16-bytes in your iteration.
> >
> >    I may be misunderstanding this, but doing it post vectorization means I'd
> have to do two 16-byte loads
> >    since it's too late to change the iteration count?
> 
> Yes.
> 
> >    This also means that a hand unrolled loop doing e.g. two complex
> additions with a 90* rotation
> >    ends up with 4 loads instead of just the one (as doing before cloning
> makes it not clone as it can merge
> >    the operations into one.).  But I may be way off here.. This is my
> > current understanding of the code :)
> > >
> > > +typedef hash_map <stmt_vec_info, slp_tree>
> > > +  ssa_name_def_to_slp_tree_map_t;
> > > +
> > >
> > > this seems to be unused.
> > >
> > > +  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
> > > +    {
> > > ...
> > > +      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
> > > +      work.idx = i;
> > >
> > > so this tries to match patterns on ARITY arbitrary aligned but
> > > adjacent scalar stmts [in a brute force way].  But then you
> > > immediately fail when one matching fails.  So I think this loop
> > > should be written like
> > >
> > >  for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
> > >    {
> > > ...
> > >
> > > to make this fact clearer.  A missed optimization here is to
> > > consider pre/post shuffling of the group to make more cases match.
> > >
> > > In this loop you also do the target support check which could
> > > possibly be done upfront in the pattern matcher itself to save
> > > compile-time?  It also seems to be required that patterns match a
> > > single IFN call?
> > >
> >
> > The patterns can return any number of IFNs, for instance the FMA also
> > detects CONJ_FMA and FMS since the calculations are quite similar and
> only differ in one node.
> >
> > > Looking at the complex patterns I am worried and confused about the
> > > transform phase - just quoting/commenting on random pieces:
> > >
> > > +  FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt)
> > > +    {
> > > +      if (defs.contains (scalar_stmt))
> > > +       {
> > >
> > > this is quadratic - vec::contains does a linear search.
> >
> > True, I figured it wasn't an issue since defs will always be quite
> > small, it's always 2 * IFN arity, but...
> >
> > >
> > >          arg_map->put (scalar_stmt, vect_split_slp_tree (node, i));
> > > +         found_p = true;
> > >
> > > it seems that you are re-doing the match here, something that should
> > > have been done in the first phase of pattern matching already.
> >
> > ... You are right. I no longer need to do this here.  I missed this during
> cleanup.
> >
> > > May I suggest to restructure the pattern matchers in a way that you
> > > have a
> > >
> > >  class slp_pattern
> > >  {
> > >    virtual match() = 0;
> > >    virtual transform() = 0;
> > >  };
> > >
> > > and derive from that so you can have a pattern specific state you
> > > can transfer from match to transform?  Iteration over patterns then
> > > either becomes ad-hoc or you find a way to iterate over an "array of
> > > types" with our C++04 features creating a new instance when you
> > > match to hold this state?
> > >
> >
> > Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :)
> 
> ;)
> 
> > > I also wonder what you are doing here - shouldn't it be "simply"
> > > replacing a subgraph of the SLP tree with a single new SLP node that
> > > now has those IFNs as "scalar" stmts (they are not really scalars
> > > anymore because of that arity issue).  This also means that
> > > code-generation might better not go the "traditional"
> > > way but instead we use a new vect_transform_slp_pattern function
> > > which does the natural thing and we'll just have the pattern IFN
> > > recorded directly in the slp_tree structure?  (I realize the
> > > complication of vect_get_slp_defs using the scalar stmts to identify
> > > vectorized operands defs)
> > >
> > > That said, I still think the same result can be achieved by
> > > post-vectorizer pattern matching.  I also think that doing pattern
> > > matching after the SLP tree build is backwards.
> > > My vision is that we'd do more general graph matching on the SSA
> > > graph forming the SLP tree rather than the current ad-hoc matching
> > > starting from special "sinks".
> > >
> >
> > This was also one of the original things I looked at, though I had
> > figured the amount of work you have to do to do the subgraph matching
> > is a lot more complicated.  I couldn't really think of any way of
> > minimizing back tracking etc into something reasonably efficient.
> >
> > Matching post build means I don't really have to backtrack because I
> > can use assumptions of vect_get_slp_defs to my advantage and just
> > access each  child node in constant time.
> 
> Yeah, I realize how you've done it is probably at the easiest and most flexible
> point.  Still I fear we're engineering us into a corner here :/
> 
> You know I'm in the process of re-doing the vectorizer in terms of SLP-only.
> For the SLP build to better support BB vectorization and partial loop
> vectorization my idea is to start with a set of SLP instances 1:1 mapping the
> SSA graph (SLP instances are acutally the entries into the now shared SLP
> graph - it's no longer a tree) and then perform SLP node merging, forming
> SLP nodes with group sizes > 1.  During that, or rather as part of that, SLP
> pattern matching would come in, merging some of the SLP nodes, forming
> the group-size 2 complex math opts.

Ah yeah it would indeed fit here nicely since it wouldn't have to do the mini
partitioning itself at that time.

> 
> But rewriting SLP build is quite late on the plan of the rewrite...
> so in the current SLP build the pattern matching would hook into
> vect_build_slp_tree_1 when we discover alt_stmt_code.  Of course as you
> say it's likely more complex...
> 
> I wonder if re-synthesizing complex-type operations in regular pattern
> matching would help?
> 
> That said - I'm fine with going with your approach but only if you promise to
> help once I run into it during the SLP rewrite...

Of course!, more then happy to help! Thanks! I'll make the other changes and
send up the full patch series then.

Thanks,
Tamar

> 
> Thanks,
> Richard.
> 
> > Kind Regards,
> > Tamar
> >
> > > Thanks,
> > > Richard.
> > >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 2019-10-01  Tamar Christina  <tamar.christina@arm.com>
> > > >
> > > > 	* tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
> > > > 	(vect_dissolve_slp_only_groups): Use macro.
> > > > 	* tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
> > > > 	* tree-vect-slp.c (vect_free_slp_tree): Add control of recursion and
> how
> > > > 	to free.
> > > > 	(ssa_name_def_to_slp_tree_map_t): New.
> > > > 	(vect_create_new_slp_node, vect_build_slp_tree): Use macro.
> > > > 	(vect_create_slp_patt_stmt): New.
> > > > 	(vect_match_slp_patterns_2): New.
> > > > 	(vect_match_slp_patterns): New.
> > > > 	(vect_analyze_slp_instance): Call vect_match_slp_patterns and
> undo
> > > > 	permutes.
> > > > 	(vect_detect_hybrid_slp_stmts): Dissolve relationships created for
> SLP.
> > > > 	* tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
> > > > 	(vect_mark_pattern_stmts): New.
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
diff mbox series

Patch

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index b0cbbac0cb5ba1ffce706715d3dbb9139063803d..1535b8b74e7d79818d8cece847dc31c7596451d5 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1808,6 +1808,59 @@  vect_get_datarefs_in_loop (loop_p loop, basic_block *bbs,
   return opt_result::success ();
 }
 
+/* For every SLP only pattern created by the pattern matched rooted in ROOT
+   restore the relevancy of the original statements over those of the pattern
+   and destroy the pattern relationship.  This restores the SLP tree to a state
+   where it can be used when SLP build is cancelled or re-tried.  */
+
+static void
+vect_dissolve_slp_only_patterns (slp_tree root)
+{
+  if (!root)
+    return;
+
+  unsigned int i;
+  slp_tree node;
+  stmt_vec_info stmt_info;
+  stmt_vec_info related_stmt_info;
+
+  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (root), i, stmt_info)
+    if (STMT_VINFO_SLP_VECT_ONLY (stmt_info)
+        && (related_stmt_info = STMT_VINFO_RELATED_STMT (stmt_info)) != NULL)
+      {
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "dissolving relevancy of %G over %G",
+			   STMT_VINFO_STMT (stmt_info),
+			   STMT_VINFO_STMT (related_stmt_info));
+	STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+	STMT_VINFO_RELEVANT (related_stmt_info) = vect_used_in_scope;
+	STMT_VINFO_IN_PATTERN_P (related_stmt_info) = false;
+      }
+
+  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (root), i, node)
+    vect_dissolve_slp_only_patterns (node);
+}
+
+/* Lookup any SLP Only Pattern statements created by the SLP pattern matcher in
+   all slp_instances in LOOP_VINFO and undo the relevancy of statements such
+   that the original SLP tree before the pattern matching is used.  */
+
+static void
+vect_dissolve_slp_only_patterns (loop_vec_info loop_vinfo)
+{
+
+  unsigned int i;
+
+  DUMP_VECT_SCOPE ("vect_dissolve_slp_only_patterns");
+
+  /* Unmark any SLP only patterns as relevant and restore the STMT_INFO of the
+     related instruction.  */
+  slp_instance instance;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), i, instance)
+    vect_dissolve_slp_only_patterns (SLP_INSTANCE_TREE (instance));
+}
+
 /* Look for SLP-only access groups and turn each individual access into its own
    group.  */
 static void
@@ -1818,7 +1871,7 @@  vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
 
   DUMP_VECT_SCOPE ("vect_dissolve_slp_only_groups");
 
-  vec<data_reference_p> datarefs = loop_vinfo->shared->datarefs;
+  vec<data_reference_p> datarefs = LOOP_VINFO_DATAREFS (loop_vinfo);
   FOR_EACH_VEC_ELT (datarefs, i, dr)
     {
       gcc_assert (DR_REF (dr));
@@ -2219,6 +2272,9 @@  again:
   /* Ensure that "ok" is false (with an opt_problem if dumping is enabled).  */
   gcc_assert (!ok);
 
+  /* Dissolve any SLP patterns created by the SLP pattern matcher.  */
+  vect_dissolve_slp_only_patterns (loop_vinfo);
+
   /* Try again with SLP forced off but if we didn't do any SLP there is
      no point in re-trying.  */
   if (!slp)
diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
index 8430c98acc68de4894c5563677895e48746ee045..4824bf218ad3527162427408baa01575c853f9c5 100644
--- a/gcc/tree-vect-patterns.c
+++ b/gcc/tree-vect-patterns.c
@@ -4737,7 +4737,7 @@  const unsigned int NUM_PATTERNS = ARRAY_SIZE (vect_vect_recog_func_ptrs);
 
 /* Mark statements that are involved in a pattern.  */
 
-static inline void
+void
 vect_mark_pattern_stmts (stmt_vec_info orig_stmt_info, gimple *pattern_stmt,
                          tree pattern_vectype)
 {
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 0220b1804fb1b00dae7df622d8bf45abb138b1a1..7b7453981975f1d756965a96499f651ff2fcea7a 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -46,22 +46,28 @@  along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "internal-fn.h"
 
-
-/* Recursively free the memory allocated for the SLP tree rooted at NODE.
+/* If RECURSIVE then Recursively free the memory allocated for the SLP tree
+   rooted at NODE otherwise only free NODE.
    FINAL_P is true if we have vectorized the instance or if we have
-   made a final decision not to vectorize the statements in any way.  */
+   made a final decision not to vectorize the statements in any way.
+   Else if !FINAL_P and MARK_UNUSED is true then any statements found will be
+   marked as unused.  */
 
 static void
-vect_free_slp_tree (slp_tree node, bool final_p)
+vect_free_slp_tree (slp_tree node, bool final_p, bool recursive = true,
+		    bool mark_unused = false)
 {
   int i;
   slp_tree child;
 
-  if (--node->refcnt != 0)
+  if (--SLP_TREE_REF_COUNT(node) != 0)
     return;
 
-  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
-    vect_free_slp_tree (child, final_p);
+  if (recursive)
+    {
+      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+	vect_free_slp_tree (child, final_p, recursive, mark_unused);
+    }
 
   /* Don't update STMT_VINFO_NUM_SLP_USES if it isn't relevant.
      Some statements might no longer exist, after having been
@@ -74,6 +80,8 @@  vect_free_slp_tree (slp_tree node, bool final_p)
 	{
 	  gcc_assert (STMT_VINFO_NUM_SLP_USES (stmt_info) > 0);
 	  STMT_VINFO_NUM_SLP_USES (stmt_info)--;
+	  if (mark_unused)
+	    STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
 	}
     }
 
@@ -128,7 +136,7 @@  vect_create_new_slp_node (vec<stmt_vec_info> scalar_stmts)
   SLP_TREE_LOAD_PERMUTATION (node) = vNULL;
   SLP_TREE_TWO_OPERATORS (node) = false;
   SLP_TREE_DEF_TYPE (node) = vect_internal_def;
-  node->refcnt = 1;
+  SLP_TREE_REF_COUNT(node) = 1;
 
   unsigned i;
   FOR_EACH_VEC_ELT (scalar_stmts, i, stmt_info)
@@ -1047,6 +1055,9 @@  typedef hash_map <vec <gimple *>, slp_tree,
 		  simple_hashmap_traits <bst_traits, slp_tree> >
   scalar_stmts_to_slp_tree_map_t;
 
+typedef hash_map <stmt_vec_info, slp_tree>
+  ssa_name_def_to_slp_tree_map_t;
+
 static slp_tree
 vect_build_slp_tree_2 (vec_info *vinfo,
 		       vec<stmt_vec_info> stmts, unsigned int group_size,
@@ -1067,14 +1078,14 @@  vect_build_slp_tree (vec_info *vinfo,
 	dump_printf_loc (MSG_NOTE, vect_location, "re-using %sSLP tree %p\n",
 			 *leader ? "" : "failed ", *leader);
       if (*leader)
-	(*leader)->refcnt++;
+	SLP_TREE_REF_COUNT(*leader)++;
       return *leader;
     }
   slp_tree res = vect_build_slp_tree_2 (vinfo, stmts, group_size, max_nunits,
 					matches, npermutes, tree_size, bst_map);
   /* Keep a reference for the bst_map use.  */
   if (res)
-    res->refcnt++;
+    SLP_TREE_REF_COUNT(res)++;
   bst_map->put (stmts.copy (), res);
   return res;
 }
@@ -1894,6 +1905,340 @@  calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
   return exact_div (common_multiple (nunits, group_size), group_size);
 }
 
+/* Forward declare match_workset.  */
+
+typedef struct _match_workset match_workset;
+
+/* Holds different supported pattern functions and their names to be used in the
+   debug dumps along with the arity and post function to run if the pattern
+   matching succeeds.  */
+
+typedef struct  _complex_pattern
+{
+  /* The function to call to detect if a pattern has been detected in the
+     current scope.  The return value indicates if the pattern was found.  Each
+     matching function is allowed to match against more than one optab.  */
+
+  bool (*patt_fn) (slp_tree, stmt_vec_info *, int idx, int n, internal_fn *,
+		   vec<stmt_vec_info> *, vec_info *);
+
+  /* The arity determines how many statements the matches should look for
+     concurrently.  This will affect the number of stmt_vec_info given to the
+     above pattern matcher and the entry will be skilled if there are not enough
+     statements for the pattern to match.  */
+  uint8_t arity;
+
+  /* Name of the pattern matcher to be used in debug dumps.  */
+
+  char name[50];
+
+  /* post matching validation function that is run before any replacements are
+     done in the SLP tree.  This function also validates if the transformation
+     resulting from applying the pattern function is valid.  If it returns false
+     the transformation is aborted.  This function is allowed to re-order any
+     children of the given SLP_TREE but not the statements in the tree.  */
+  bool (*post_fn) (slp_tree, const vec<stmt_vec_info>, int, vec<match_workset>,
+		   bool *);
+} complex_pattern_t;
+
+/* The entire workset that needs to be processed if the pattern matches.
+   Patterns are not applied immediately but are differed until all matches have
+   been identified.  This caching allows us to limit the number of tree walks
+   needed to apply transformations.  It also allows you to still avoid having a
+   partially invalid SLP_TREE during pattern applications.  */
+
+typedef struct _match_workset {
+  /* Scalar type the pattern works on.  */
+  tree type;
+
+  /* Vector type the pattern will produce.  */
+  tree vectype;
+
+  /* The internal function optab to use for creating the pattern statements.  */
+  internal_fn ifn;
+
+  /* The offset in the list of all matched arguments where the arguments that
+     should be used when creating the new statements begins.  */
+  int vects_offset;
+
+  /* The number of arguments that should be consumed by this pattern.  */
+  int vects_len;
+
+  /* The list of statements that the pattern matched originally matched on.  */
+  stmt_vec_info *stmt_infos;
+
+  /* The location of the stmt_infos[0] in the list of childrens of the SLP_TREE
+     currently being inspected.  */
+  int idx;
+
+  /* The original pattern definition being applied.  */
+
+  complex_pattern_t *patt_info;
+} match_workset;
+
+/* Create a replacement pattern statement for STMT_INFO and inserts the new
+   statement into NODE.  The statement is created as call to internal function
+   IFN with arguments ARGS.  The arity of IFN needs to match the amount of
+   elements in ARGS.  The scalar type of the statement as TYPE and the
+   corresponding vector type VECTYPE.  These two types are used to construct
+   the new vector only replacement pattern statement.
+
+   NUM is the number of pattens that need to be produced and ITER is which
+   pattern is currently being created.  These are used as indices into DEFS in
+   order to select the right arguments to create the pattern with.
+
+   Futhermore the new pattern is also added to the vectorization information
+   structure VINFO and the old statement STMT_INFO is marked as unused while
+   the new statement is marked as used and the number of SLP uses of the new
+   statement is incremented.
+
+   The newly created SLP nodes are marked as SLP only and will be dissolved if
+   SLP is aborted.
+
+   The newly created gimple call is returned and the BB remains unchanged.  */
+
+static gcall*
+vect_create_slp_patt_stmt (slp_tree node, match_workset *work,
+			   const vec<stmt_vec_info> defs, int iter,
+			   int num, vec_info *vinfo)
+{
+  /* Adjust idx to write the top pos first.  This calculated the position in the
+     scalar statements list of NODE that is to be updated.  */
+  int idx = work->idx - ((num - 1) - iter);
+  stmt_vec_info stmt_info = work->stmt_infos[iter];
+
+  auto_vec<tree> args;
+  int num_args = work->vects_len / num;
+  args.create (num_args);
+  int i;
+
+  /* Create the argument set for use by gimple_build_call_internal_vec.  */
+  for (i = 0; i < num_args; i++)
+    {
+      stmt_vec_info arg = defs[(iter * num_args) + i + work->vects_offset];
+      args.quick_push (gimple_get_lhs (STMT_VINFO_STMT (arg)));
+    }
+
+  /* Create the new pattern statements.  */
+  gcall *call_stmt = gimple_build_call_internal_vec (work->ifn, args);
+  tree var = make_temp_ssa_name (work->type, call_stmt, "slp_patt");
+  gimple* old_stmt = STMT_VINFO_STMT (stmt_info);
+  gimple_call_set_lhs (call_stmt, var);
+  gimple_set_location (call_stmt, gimple_location (old_stmt));
+  gimple_call_set_nothrow (call_stmt, true);
+
+  /* Adjust the book-keeping for the new and old statements for use during SLP.
+     This is required to get the right VF and statement during SLP analysis.
+     These changes are created after relevancy has been set for the nodes as
+     such we need to manually update them.  Any changes will be undone if SLP
+     is cancelled.  */
+  stmt_vec_info call_stmt_info = vinfo->add_stmt (call_stmt);
+  vect_mark_pattern_stmts (stmt_info, call_stmt, work->vectype);
+  STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+  STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;
+  STMT_VINFO_SLP_VECT_ONLY (call_stmt_info) = true;
+  STMT_VINFO_RELATED_STMT (call_stmt_info) = stmt_info;
+  STMT_VINFO_NUM_SLP_USES (call_stmt_info)++;
+  SLP_TREE_SCALAR_STMTS (node)[idx] = call_stmt_info;
+
+  return call_stmt;
+}
+
+/* Helper function of vect_match_slp_patterns.
+
+   Attempts to match the given pattern PATT_INFO against the slp tree rooted in
+   NODE using VINFO and GROUP_SIZE.
+
+   If matching is successful the value in NODE is updated and returned, if not
+   then it is returned unchanged.  CANCEL_PERMUTES indicates whether the permute
+   should be cancelled or not for the SLP_TREE.  */
+
+static bool
+vect_match_slp_patterns_2 (slp_tree node, vec_info *vinfo,
+			   unsigned int group_size, complex_pattern_t patt_info,
+			   bool *cancel_permutes)
+{
+  unsigned i;
+  int x;
+  stmt_vec_info stmt_info;
+  auto_vec<stmt_vec_info> vects, all_vects;
+  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
+  bool found_p = false, found_rec_p = false;
+  uint8_t n = patt_info.arity;
+
+  if (n > group_size
+      || scalar_stmts.length () == 0)
+    return false;
+
+  match_workset work;
+  auto_vec<match_workset> workset;
+
+  /* The data dependency orderings will force the nodes to be created in the
+     order of their data flow.  Which means since we're matching specific
+     patterns in particular order we only have to do a linear scan here to match
+     the same instruction multiple times.  The group size doesn't have to be
+     constrained.  */
+  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
+    {
+      vects.truncate (0);
+
+      if (gimple_assign_load_p (STMT_VINFO_STMT (stmt_info))
+	  || gimple_store_p (STMT_VINFO_STMT (stmt_info))
+	  || gimple_assign_cast_p (STMT_VINFO_STMT (stmt_info)))
+        break;
+
+      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
+      work.idx = i;
+
+      gcc_assert (work.stmt_infos);
+
+      if (!patt_info.patt_fn (node, work.stmt_infos, i, n, &work.ifn, &vects,
+			      vinfo)
+	  || work.ifn == IFN_LAST)
+	{
+	  /* We can only do replacements for entire groups, we must replace all
+	     statements in a node as the argument list/children may not have
+	     equal height then.  Operations that don't rewrite the arguments
+	     may be safe to do, so perhaps paramatrise it.  */
+	  found_p = false;
+	  break;
+	}
+
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Found %s pattern in SLP tree\n", patt_info.name);
+
+      work.type = gimple_expr_type (STMT_VINFO_STMT (*work.stmt_infos));
+      work.vectype = get_vectype_for_scalar_type (work.type);
+      work.patt_info = &patt_info;
+
+
+      if (work.vectype
+	  && direct_internal_fn_supported_p (work.ifn, work.vectype,
+					     OPTIMIZE_FOR_SPEED))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Target supports %s vectorization with mode %T\n",
+			     internal_fn_name (work.ifn), work.vectype);
+
+	  /* Now push new statements.  */
+	  work.vects_offset = all_vects.length ();
+	  all_vects.safe_splice (vects);
+	  work.vects_len = vects.length ();
+	  workset.safe_push (work);
+
+	  found_p = true;
+
+	  i += (n-1);
+	}
+      else
+        {
+	  if (dump_enabled_p ())
+	    {
+	      if (!work.vectype)
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "Target does not support vector type for "
+				 "%T\n", work.type);
+	      else
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "Target does not support %s for "
+				 "vector type %T\n",
+				 internal_fn_name (work.ifn), work.vectype);
+	    }
+        }
+    }
+
+  vects.release ();
+
+  if (found_p && patt_info.post_fn)
+    {
+      /* Find which nodes should be the children of the new node.  */
+      if (!patt_info.post_fn (node, all_vects,
+				 all_vects.length () / group_size, workset,
+				 cancel_permutes))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "transformation for %s not valid due to post "
+			     "condition\n", internal_fn_name (work.ifn));
+	  found_p = false;
+	  all_vects.release ();
+	}
+    }
+
+  /* Perform recursive matching, it's important to do this after matching things
+     in the current node as the matches here may re-order the nodes below it.
+     As such the pattern that needs to be subsequently match may change.  */
+  slp_tree child;
+  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+    found_rec_p |= vect_match_slp_patterns_2 (child, vinfo, group_size,
+					      patt_info, cancel_permutes);
+
+  if (found_p)
+    {
+      if (dump_enabled_p ())
+      dump_printf_loc (MSG_NOTE, vect_location, "Creating %d patterns\n",
+		       workset.length ());
+      for (unsigned i = 0; i < workset.length (); i++)
+	{
+	  work = workset[i];
+	  for (x = 0; x < n; x++)
+	    {
+	      gcall* call_stmt =
+		vect_create_slp_patt_stmt (node, &work, all_vects, x, n,
+					   vinfo);
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location, "\t %p stmt(%d): %G",
+			         node, x, call_stmt);
+	    }
+	}
+      all_vects.release ();
+    }
+
+  return found_p | found_rec_p;
+}
+
+/* Applies pattern matching to the given SLP tree rooted in NODE using vec_info
+   VINFO and group size GROUP_SIZE.
+
+   The modified tree is returned.  Patterns are tried in order and multiple
+   patterns may match.  If the permutes need to be cancelled then
+   CANCEL_PERMUTE is set.  */
+
+static bool
+vect_match_slp_patterns (slp_tree node, vec_info *vinfo,
+			 unsigned int group_size, bool *cancel_permutes)
+{
+  DUMP_VECT_SCOPE ("vect_match_slp_patterns");
+  bool found_p = false;
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "-- before patt match --\n");
+      vect_print_slp_tree (MSG_NOTE, vect_location, node);
+      dump_printf_loc (MSG_NOTE, vect_location, "-- end patt --\n");
+    }
+
+
+  static complex_pattern_t patterns[0] = {};
+
+  for (size_t x = 0; x < sizeof (patterns) / sizeof (complex_pattern_t); x++)
+    found_p |= vect_match_slp_patterns_2 (node, vinfo, group_size, patterns[x],
+					  cancel_permutes);
+
+  /* TODO: remove in final version, only here for generating debug dot graphs
+	   from SLP tree.  */
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "-- start dot --\n");
+      vect_print_slp_tree (MSG_NOTE, vect_location, node);
+      dump_printf_loc (MSG_NOTE, vect_location, "-- end dot --\n");
+    }
+
+  return found_p;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -2019,6 +2364,11 @@  vect_analyze_slp_instance (vec_info *vinfo,
 	}
       else
 	{
+	  bool cancel_permutes;
+	  bool possible_complex_rotates
+	    = vect_match_slp_patterns (node, vinfo, group_size,
+				       &cancel_permutes);
+
 	  /* Create a new SLP instance.  */
 	  new_instance = XNEW (class _slp_instance);
 	  SLP_INSTANCE_TREE (new_instance) = node;
@@ -2097,8 +2447,15 @@  vect_analyze_slp_instance (vec_info *vinfo,
 		      (STMT_VINFO_VECTYPE (stmt_vinfo),
 		       DR_GROUP_SIZE (stmt_vinfo), false))
 		    break;
+		 if (SLP_TREE_LOAD_PERMUTATION (load_node).length () > 0
+		     && !cancel_permutes)
+		   possible_complex_rotates = false;
+
+
 		}
-	      if (i == SLP_INSTANCE_LOADS (new_instance).length ())
+
+		if (!possible_complex_rotates
+		    && i == SLP_INSTANCE_LOADS (new_instance).length ())
 		{
 		  if (dump_enabled_p ())
 		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -2109,6 +2466,11 @@  vect_analyze_slp_instance (vec_info *vinfo,
 		}
 	    }
 
+	  if (possible_complex_rotates && cancel_permutes && dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Cancelled all permutes in SLP tree due to pattern"
+			     "matching\n");
+
 	  vinfo->slp_instances.safe_push (new_instance);
 
 	  if (dump_enabled_p ())
@@ -2281,7 +2643,7 @@  vect_detect_hybrid_slp_stmts (slp_tree node, unsigned i, slp_vect_type stype,
 
   /* We need to union stype over the incoming graph edges but we still
      want to limit recursion to stay O(N+E).  */
-  bool only_edge = (++visited.get_or_insert (node) < node->refcnt);
+  bool only_edge = (++visited.get_or_insert (node) < SLP_TREE_REF_COUNT(node));
 
   /* Propagate hybrid down the SLP tree.  */
   if (stype == hybrid)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 1456cde4c2c2dec7244c504d2c496248894a4f1e..ac027c6129fa9c24780b2cd7c5a6e239c1905a0f 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -168,6 +168,7 @@  public:
 
 #define SLP_TREE_CHILDREN(S)                     (S)->children
 #define SLP_TREE_SCALAR_STMTS(S)                 (S)->stmts
+#define SLP_TREE_REF_COUNT(S)                    (S)->refcnt
 #define SLP_TREE_VEC_STMTS(S)                    (S)->vec_stmts
 #define SLP_TREE_NUMBER_OF_VEC_STMTS(S)          (S)->vec_stmts_size
 #define SLP_TREE_LOAD_PERMUTATION(S)             (S)->load_permutation
@@ -1664,6 +1665,8 @@  extern void duplicate_and_interleave (gimple_seq *, tree, vec<tree>,
 extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info);
 
 /* In tree-vect-patterns.c.  */
+extern void vect_mark_pattern_stmts (stmt_vec_info, gimple *, tree);
+
 /* Pattern recognition functions.
    Additional pattern recognition functions can (and will) be added
    in the future.  */