diff mbox series

slp: fix sharing of SLP only patterns. (PR99149)

Message ID patch-14220-tamar@arm.com
State New
Headers show
Series slp: fix sharing of SLP only patterns. (PR99149) | expand

Commit Message

Tamar Christina Feb. 19, 2021, 2:41 p.m. UTC
Hi Richi,

The attached testcase ICEs due to a couple of issues.
In the testcase you have two SLP instances that share the majority of their
definition with each other.  One tree defines a COMPLEX_MUL sequence and the
other tree a COMPLEX_FMA.

The ice happens because:

1. the refcounts are wrong, in particular the FMA case doesn't correctly count
the references for the COMPLEX_MUL that it consumes.

2. when the FMA is created it incorrectly assumes it can just tear apart the MUL
node that it's consuming.  This is wrong and should only be done when there is
no more uses of the node, in which case the vector only pattern is no longer
relevant.

To fix the last part the SLP only pattern reset code was moved into
vect_free_slp_tree which results in cleaner code.  I also think it does belong
there since that function knows when there are no more uses of the node and so
the pattern should be unmarked, so when the the vectorizer is inspecting the BB
it doesn't find the now invalid vector only patterns.

This has the obvious problem in that, eventually after analysis is done, the
entire SLP tree is dissolved before codegen.  Where we get into trouble as we
have now dissolved the patterns too...

My initial thought was to add a parameter to vect_free_slp_tree, but I know you
wouldn't like that.  So I am sending this patch up as an RFC.

PS. This testcase actually shows that the codegen we get in these cases is not
optimal.  Currently this won't vectorize as the compiler thinks the vector
version is too expensive.

My guess here is because the patterns now unshare the tree and it's likely
costing the setup for the vector code twice?

Even with the shared code (without patterns, so same as GCC 10, or turning off
the patterns) it won't vectorize it.

The scalar code is

        mov     w0, 0
        ldr     x4, [x1, #:lo12:.LANCHOR0]
        ldrsw   x2, [x3, 20]
        ldr     x1, [x3, 8]
        lsl     x2, x2, 3
        ldrsw   x3, [x3, 16]
        ldp     s3, s2, [x4]
        add     x5, x1, x2
        ldr     s0, [x1, x2]
        lsl     x3, x3, 3
        add     x4, x1, x3
        fmul    s1, s2, s2
        fnmsub  s1, s3, s3, s1
        fmul    s0, s2, s0
        fmadd   s0, s3, s2, s0
        ldr     s3, [x1, x3]
        ldr     s2, [x4, 4]
        fadd    s3, s3, s1
        fadd    s2, s0, s2
        str     s3, [x1, x3]
        str     s2, [x4, 4]
        str     s1, [x1, x2]
        str     s0, [x5, 4]
        ret

and turning off the cost model we get

        movi    v1.2s, 0
        mov     w0, 0
        ldr     x4, [x1, #:lo12:.LANCHOR0]
        ldrsw   x3, [x2, 16]
        ldr     x1, [x2, 8]
        ldrsw   x2, [x2, 20]
        ldr     d0, [x4]
        fcmla   v1.2s, v0.2s, v0.2s, #0
        ldr     d2, [x1, x3, lsl 3]
        fcmla   v2.2s, v0.2s, v0.2s, #0
        fcmla   v2.2s, v0.2s, v0.2s, #90
        str     d2, [x1, x3, lsl 3]
        fcmla   v1.2s, v0.2s, v0.2s, #90
        str     d1, [x1, x2, lsl 3]

however, if the pattern matcher doesn't create the FMA node because it would
unshare the tree (which I think is a general heuristic that would work out to
better code wouldn't it?) then we would get (with the cost model enabled even)

        movi    v0.2s, 0
        mov     w0, 0
        ldr     x4, [x1, #:lo12:.LANCHOR0]
        ldrsw   x3, [x2, 16]
        ldr     x1, [x2, 8]
        ldr     d1, [x4]
        fcmla   v0.2s, v1.2s, v1.2s, #0
        fcmla   v0.2s, v1.2s, v1.2s, #90
        ldrsw   x2, [x2, 20]
        ldr     d1, [x1, x3, lsl 3]
        fadd    v1.2s, v1.2s, v0.2s
        str     d1, [x1, x3, lsl 3]
        str     d0, [x1, x2, lsl 3]
        ret

Which is the most optimal form.  So I think this should perhaps be handled in
GCC 12 if there's a way to detect when you're going to unshare a sub-tree.

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/99149
	* tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
	buffer.
	(vect_slp_reset_pattern): Remove.
	(complex_fma_pattern::matches): Remove call to vect_slp_reset_pattern.
	(complex_mul_pattern::build, complex_fma_pattern::build,
	complex_fms_pattern::build): Fix ref counts.
	* tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern relevancy
	when node is being deleted.
	* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.

gcc/testsuite/ChangeLog:

	PR tree-optimization/99149
	* gcc.dg/vect/pr99149.C: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/pr99149.C b/gcc/testsuite/gcc.dg/vect/pr99149.C
new file mode 100755
index 0000000000000000000000000000000000000000..b12fe17e4ded148ce2bf67486e425dd65461a148


--

Comments

Tamar Christina Feb. 19, 2021, 2:47 p.m. UTC | #1
Ps. The code in comment is needed, but was commented out due to the early free issue described in the patch.

Thanks,
Tamar

> -----Original Message-----
> From: Tamar Christina <tamar.christina@arm.com>
> Sent: Friday, February 19, 2021 2:42 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; rguenther@suse.de
> Subject: [PATCH] slp: fix sharing of SLP only patterns. (PR99149)
> 
> Hi Richi,
> 
> The attached testcase ICEs due to a couple of issues.
> In the testcase you have two SLP instances that share the majority of their
> definition with each other.  One tree defines a COMPLEX_MUL sequence
> and the other tree a COMPLEX_FMA.
> 
> The ice happens because:
> 
> 1. the refcounts are wrong, in particular the FMA case doesn't correctly count
> the references for the COMPLEX_MUL that it consumes.
> 
> 2. when the FMA is created it incorrectly assumes it can just tear apart the
> MUL node that it's consuming.  This is wrong and should only be done when
> there is no more uses of the node, in which case the vector only pattern is no
> longer relevant.
> 
> To fix the last part the SLP only pattern reset code was moved into
> vect_free_slp_tree which results in cleaner code.  I also think it does belong
> there since that function knows when there are no more uses of the node
> and so the pattern should be unmarked, so when the the vectorizer is
> inspecting the BB it doesn't find the now invalid vector only patterns.
> 
> This has the obvious problem in that, eventually after analysis is done, the
> entire SLP tree is dissolved before codegen.  Where we get into trouble as
> we have now dissolved the patterns too...
> 
> My initial thought was to add a parameter to vect_free_slp_tree, but I know
> you wouldn't like that.  So I am sending this patch up as an RFC.
> 
> PS. This testcase actually shows that the codegen we get in these cases is not
> optimal.  Currently this won't vectorize as the compiler thinks the vector
> version is too expensive.
> 
> My guess here is because the patterns now unshare the tree and it's likely
> costing the setup for the vector code twice?
> 
> Even with the shared code (without patterns, so same as GCC 10, or turning
> off the patterns) it won't vectorize it.
> 
> The scalar code is
> 
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x2, [x3, 20]
>         ldr     x1, [x3, 8]
>         lsl     x2, x2, 3
>         ldrsw   x3, [x3, 16]
>         ldp     s3, s2, [x4]
>         add     x5, x1, x2
>         ldr     s0, [x1, x2]
>         lsl     x3, x3, 3
>         add     x4, x1, x3
>         fmul    s1, s2, s2
>         fnmsub  s1, s3, s3, s1
>         fmul    s0, s2, s0
>         fmadd   s0, s3, s2, s0
>         ldr     s3, [x1, x3]
>         ldr     s2, [x4, 4]
>         fadd    s3, s3, s1
>         fadd    s2, s0, s2
>         str     s3, [x1, x3]
>         str     s2, [x4, 4]
>         str     s1, [x1, x2]
>         str     s0, [x5, 4]
>         ret
> 
> and turning off the cost model we get
> 
>         movi    v1.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldrsw   x2, [x2, 20]
>         ldr     d0, [x4]
>         fcmla   v1.2s, v0.2s, v0.2s, #0
>         ldr     d2, [x1, x3, lsl 3]
>         fcmla   v2.2s, v0.2s, v0.2s, #0
>         fcmla   v2.2s, v0.2s, v0.2s, #90
>         str     d2, [x1, x3, lsl 3]
>         fcmla   v1.2s, v0.2s, v0.2s, #90
>         str     d1, [x1, x2, lsl 3]
> 
> however, if the pattern matcher doesn't create the FMA node because it
> would unshare the tree (which I think is a general heuristic that would work
> out to better code wouldn't it?) then we would get (with the cost model
> enabled even)
> 
>         movi    v0.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldr     d1, [x4]
>         fcmla   v0.2s, v1.2s, v1.2s, #0
>         fcmla   v0.2s, v1.2s, v1.2s, #90
>         ldrsw   x2, [x2, 20]
>         ldr     d1, [x1, x3, lsl 3]
>         fadd    v1.2s, v1.2s, v0.2s
>         str     d1, [x1, x3, lsl 3]
>         str     d0, [x1, x2, lsl 3]
>         ret
> 
> Which is the most optimal form.  So I think this should perhaps be handled in
> GCC 12 if there's a way to detect when you're going to unshare a sub-tree.
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/99149
> 	* tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
> 	buffer.
> 	(vect_slp_reset_pattern): Remove.
> 	(complex_fma_pattern::matches): Remove call to
> vect_slp_reset_pattern.
> 	(complex_mul_pattern::build, complex_fma_pattern::build,
> 	complex_fms_pattern::build): Fix ref counts.
> 	* tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern
> relevancy
> 	when node is being deleted.
> 	* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR tree-optimization/99149
> 	* gcc.dg/vect/pr99149.C: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/testsuite/gcc.dg/vect/pr99149.C
> b/gcc/testsuite/gcc.dg/vect/pr99149.C
> new file mode 100755
> index
> 0000000000000000000000000000000000000000..b12fe17e4ded148ce2bf67486
> e425dd65461a148
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr99149.C
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-w -O3 -march=armv8.3-a" { target {
> +aarch64*-*-* } } } */
> +
> +class a {
> +  float b;
> +  float c;
> +
> +public:
> +  a(float d, float e) : b(d), c(e) {}
> +  a operator+(a d) { return a(b + d.b, c + d.c); }
> +  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); } }; int
> +f, g; class {
> +  a *h;
> +  a *i;
> +
> +public:
> +  void j() {
> +    a k = h[0], l = i[g], m = k * i[f];
> +    i[g] = l + m;
> +    i[f] = m;
> +  }
> +} n;
> +main() { n.j(); }
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c index
> f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f9
> 7dfe206f17ab57 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree
> node2, lane_permutation_t &lanes,
> 
>    if (result != CMPLX_NONE && ops != NULL)
>      {
> -      ops->create (2);
> -      ops->quick_push (node1);
> -      ops->quick_push (node2);
> +      ops->safe_push (node1);
> +      ops->safe_push (node2);
>      }
>    return result;
>  }
> @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo)  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode
> +    = vect_build_combine_node (this->m_ops[0], this->m_ops[1],
> + *this->m_node);  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
> 
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
>    SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
> -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> -    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this-
> >m_node);
> -  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
> 
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public
> complex_pattern
>      }
>  };
> 
> -/* Helper function to "reset" a previously matched node and undo the
> changes
> -   made enough so that the node is treated as an irrelevant node.  */
> -
> -static inline void
> -vect_slp_reset_pattern (slp_tree node)
> -{
> -  stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE
> (node));
> -  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> -  STMT_SLP_TYPE (stmt_info) = pure_slp;
> -  SLP_TREE_REPRESENTATIVE (node) = stmt_info; -}
> -
>  /* Pattern matcher for trying to match complex multiply and accumulate
>     and multiply and subtract patterns in SLP tree.
>     If the operation matches then IFN is set to the operation it matched and
> @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches
> (complex_operation_t op,
>    if (!vect_pattern_validate_optab (ifn, vnode))
>      return IFN_LAST;
> 
> -  /* FMA matched ADD + CMUL.  During the matching of CMUL the
> -     stmt that starts the pattern is marked as being in a pattern,
> -     namely the CMUL.  When replacing this with a CFMA we have to
> -     unmark this statement as being in a pattern.  This is because
> -     vect_mark_pattern_stmts will only mark the current stmt as being
> -     in a pattern.  Later on when the scalar stmts are examined the
> -     old statement which is supposed to be irrelevant will point to
> -     CMUL unless we undo the pattern relationship here.  */
> -  vect_slp_reset_pattern (node);
>    ops->truncate (0);
>    ops->create (3);
> 
> @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize
> (slp_tree_to_load_perm_map_t *perm_cache,  void
> complex_fma_pattern::build (vec_info *vinfo)  {
> +  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
> +
>    SLP_TREE_CHILDREN (*this->m_node).release ();
>    SLP_TREE_CHILDREN (*this->m_node).create (3);
>    SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
> 
> +  SLP_TREE_REF_COUNT (this->m_ops[1])++;  SLP_TREE_REF_COUNT
> + (this->m_ops[2])++;
> +
> +  vect_free_slp_tree (node);
> +
>    complex_pattern::build (vinfo);
>  }
> 
> @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo)  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode =
> +    vect_build_combine_node (this->m_ops[2], this->m_ops[3],
> + *this->m_node);  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> + SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
> 
> @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo)
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
> -  SLP_TREE_CHILDREN (*this->m_node).quick_push (
> -    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this-
> >m_node));
> -  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> -  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
> 
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index
> ea8a97b01c6371791ac66de3e1dabfedee69cb67..65c2ff867ab41ea70367087dc
> 26fb6eea1375ffb 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node)
>      if (child)
>        vect_free_slp_tree (child);
> 
> +  /* If the node defines any SLP only patterns then those patterns are no
> +     longer valid and should be removed.  */  stmt_vec_info
> + rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);  if (rep_stmt_info &&
> + STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
> +    {
> +      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
> +      //STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> +      //STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
> +    }
> +
>    delete node;
>  }
> 
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc37207
> 2d98326891f19e6 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt)
>    STMT_VINFO_REDUC_FN (res) = IFN_LAST;
>    STMT_VINFO_REDUC_IDX (res) = -1;
>    STMT_VINFO_SLP_VECT_ONLY (res) = false;
> +  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
>    STMT_VINFO_VEC_STMTS (res) = vNULL;
> 
>    if (is_a <loop_vec_info> (this)
> 
> 
> --
Richard Biener Feb. 22, 2021, 8:28 a.m. UTC | #2
On Fri, 19 Feb 2021, Tamar Christina wrote:

> Hi Richi,
> 
> The attached testcase ICEs due to a couple of issues.
> In the testcase you have two SLP instances that share the majority of their
> definition with each other.  One tree defines a COMPLEX_MUL sequence and the
> other tree a COMPLEX_FMA.
> 
> The ice happens because:
> 
> 1. the refcounts are wrong, in particular the FMA case doesn't correctly count
> the references for the COMPLEX_MUL that it consumes.
> 
> 2. when the FMA is created it incorrectly assumes it can just tear apart the MUL
> node that it's consuming.  This is wrong and should only be done when there is
> no more uses of the node, in which case the vector only pattern is no longer
> relevant.
> 
> To fix the last part the SLP only pattern reset code was moved into
> vect_free_slp_tree which results in cleaner code.  I also think it does belong
> there since that function knows when there are no more uses of the node and so
> the pattern should be unmarked, so when the the vectorizer is inspecting the BB
> it doesn't find the now invalid vector only patterns.

Note that if you rely on the pattern being reset when it's no longer used
then this won't work in case we end up in vect_slp_convert_to_external
because that does

  /* Don't remove and free the child nodes here, since they could be
     referenced by other structures.  The analysis and scheduling phases
     (need to) ignore child nodes of anything that isn't 
vect_internal_def.  */
  unsigned int group_size = SLP_TREE_LANES (node);
  SLP_TREE_DEF_TYPE (node) = vect_external_def;
  SLP_TREE_SCALAR_OPS (node).safe_grow (group_size, true);
  SLP_TREE_LOAD_PERMUTATION (node).release ();
  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info)
    {
      tree lhs = gimple_get_lhs (vect_orig_stmt (stmt_info)->stmt);

where I don't remember the reason for not freeing the children
(it's Richards code, so you'd have to ask him).  OTOH the above only
triggers for non-loop vectorization where the pattern reset isn't
needed?

> This has the obvious problem in that, eventually after analysis is done, the
> entire SLP tree is dissolved before codegen.  Where we get into trouble as we
> have now dissolved the patterns too...

Huh?  Codegen needs the SLP nodes so they most definitely survive until
after vectorization has finished.  So did you run into actual problems
with this approach?

> My initial thought was to add a parameter to vect_free_slp_tree, but I know you
> wouldn't like that.  So I am sending this patch up as an RFC.
> 
> PS. This testcase actually shows that the codegen we get in these cases is not
> optimal.  Currently this won't vectorize as the compiler thinks the vector
> version is too expensive.
>
> My guess here is because the patterns now unshare the tree and it's likely
> costing the setup for the vector code twice?

I can see it figures there are LIVE lanes out of the SLP tree, likely
because the way vect_bb_slp_mark_live_stmts works out the stmt to
vectorize doesn't work for SLP patterns.  It sees

t.ii:19:10: note: node 0x3b87dd0 (max_nunits=2, refcnt=1)
t.ii:19:10: note: op template: MEM <float> [(struct a *)_10] = _24;
t.ii:19:10: note:       stmt 0 MEM <float> [(struct a *)_10] = _24;
t.ii:19:10: note:       stmt 1 MEM <float> [(struct a *)_10 + 4B] = _25;
t.ii:19:10: note:       children 0x3b87e58
t.ii:19:10: note: node 0x3b87e58 (max_nunits=2, refcnt=1)
t.ii:19:10: note: op template: slp_patt_34 = .COMPLEX_FMA (_24, _24, _24);
t.ii:19:10: note:       stmt 0 _24 = l$b_11 + _20;
t.ii:19:10: note:       stmt 1 _25 = l$c_12 + _23;
t.ii:19:10: note:       children 0x3b87ee0 0x3b88100 0x3b88100
t.ii:19:10: note: node 0x3b87ee0 (max_nunits=2, refcnt=1)
t.ii:19:10: note: op template: l$b_11 = MEM[(const struct a &)_10].b;
t.ii:19:10: note:       stmt 0 l$b_11 = MEM[(const struct a &)_10].b;
t.ii:19:10: note:       stmt 1 l$c_12 = MEM[(const struct a &)_10].c;
t.ii:19:10: note: node 0x3b88100 (max_nunits=2, refcnt=4)
t.ii:19:10: note: op template: k$b_4 = MEM[(const struct a &)_3].b;
t.ii:19:10: note:       stmt 0 k$b_4 = MEM[(const struct a &)_3].b;
t.ii:19:10: note:       stmt 1 k$c_5 = MEM[(const struct a &)_3].c;

but the scalar stmts for the .COMPLEX_FMA pattern have uses that
are not marked as PURE_SLP (_20 and _23).

There's the other issue that we have

t.ii:19:10: note: op template: slp_patt_33 = .COMPLEX_MUL (_20, _20);
t.ii:19:10: note:       stmt 0 _20 = _18 - _19;
t.ii:19:10: note:       stmt 1 _23 = _21 + _22;
t.ii:19:10: note:       children 0x3b88180 0x3b88538
t.ii:19:10: note: node 0x3b88538 (max_nunits=1, refcnt=1)
t.ii:19:10: note: op: VEC_PERM_EXPR
t.ii:19:10: note:       { }
t.ii:19:10: note:       lane permutation { 0[0] 1[1] }
t.ii:19:10: note:       children 0x3b880f8 0x3b88290
t.ii:19:10: note: node 0x3b880f8 (max_nunits=2, refcnt=1)
t.ii:19:10: note: op template: k$b_4 = MEM[(const struct a &)_3].b;
t.ii:19:10: note:       stmt 0 k$b_4 = MEM[(const struct a &)_3].b;
t.ii:19:10: note:       stmt 1 k$b_4 = MEM[(const struct a &)_3].b;
t.ii:19:10: note:       load permutation { 0 0 }
t.ii:19:10: note: node 0x3b88290 (max_nunits=2, refcnt=1)
t.ii:19:10: note: op template: k$c_5 = MEM[(const struct a &)_3].c;
t.ii:19:10: note:       stmt 0 k$c_5 = MEM[(const struct a &)_3].c;
t.ii:19:10: note:       stmt 1 k$c_5 = MEM[(const struct a &)_3].c;
t.ii:19:10: note:       load permutation { 1 1 }

which is a load that wasn't converted somehow.  Note I see

t.ii:19:10: note:   converting stmts on permute node 0x3b88538
t.ii:19:10: note:   re-using SLP tree 0x3b88180
t.ii:19:10: note:   Pattern matched SLP tree

but eventually that didn't propagate to the use in the .COMPLEX_MUL
node.

So in the end your patch looks like a reasonable fix if the commented
lines are uncommented and that doesn't cause any problems.

Thanks,
Richard.

> Even with the shared code (without patterns, so same as GCC 10, or turning off
> the patterns) it won't vectorize it.
> 
> The scalar code is
> 
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x2, [x3, 20]
>         ldr     x1, [x3, 8]
>         lsl     x2, x2, 3
>         ldrsw   x3, [x3, 16]
>         ldp     s3, s2, [x4]
>         add     x5, x1, x2
>         ldr     s0, [x1, x2]
>         lsl     x3, x3, 3
>         add     x4, x1, x3
>         fmul    s1, s2, s2
>         fnmsub  s1, s3, s3, s1
>         fmul    s0, s2, s0
>         fmadd   s0, s3, s2, s0
>         ldr     s3, [x1, x3]
>         ldr     s2, [x4, 4]
>         fadd    s3, s3, s1
>         fadd    s2, s0, s2
>         str     s3, [x1, x3]
>         str     s2, [x4, 4]
>         str     s1, [x1, x2]
>         str     s0, [x5, 4]
>         ret
> 
> and turning off the cost model we get
> 
>         movi    v1.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldrsw   x2, [x2, 20]
>         ldr     d0, [x4]
>         fcmla   v1.2s, v0.2s, v0.2s, #0
>         ldr     d2, [x1, x3, lsl 3]
>         fcmla   v2.2s, v0.2s, v0.2s, #0
>         fcmla   v2.2s, v0.2s, v0.2s, #90
>         str     d2, [x1, x3, lsl 3]
>         fcmla   v1.2s, v0.2s, v0.2s, #90
>         str     d1, [x1, x2, lsl 3]
> 
> however, if the pattern matcher doesn't create the FMA node because it would
> unshare the tree (which I think is a general heuristic that would work out to
> better code wouldn't it?) then we would get (with the cost model enabled even)
> 
>         movi    v0.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldr     d1, [x4]
>         fcmla   v0.2s, v1.2s, v1.2s, #0
>         fcmla   v0.2s, v1.2s, v1.2s, #90
>         ldrsw   x2, [x2, 20]
>         ldr     d1, [x1, x3, lsl 3]
>         fadd    v1.2s, v1.2s, v0.2s
>         str     d1, [x1, x3, lsl 3]
>         str     d0, [x1, x2, lsl 3]
>         ret
> 
> Which is the most optimal form.  So I think this should perhaps be handled in
> GCC 12 if there's a way to detect when you're going to unshare a sub-tree.
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/99149
> 	* tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
> 	buffer.
> 	(vect_slp_reset_pattern): Remove.
> 	(complex_fma_pattern::matches): Remove call to vect_slp_reset_pattern.
> 	(complex_mul_pattern::build, complex_fma_pattern::build,
> 	complex_fms_pattern::build): Fix ref counts.
> 	* tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern relevancy
> 	when node is being deleted.
> 	* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR tree-optimization/99149
> 	* gcc.dg/vect/pr99149.C: New test.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/testsuite/gcc.dg/vect/pr99149.C b/gcc/testsuite/gcc.dg/vect/pr99149.C
> new file mode 100755
> index 0000000000000000000000000000000000000000..b12fe17e4ded148ce2bf67486e425dd65461a148
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr99149.C
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-w -O3 -march=armv8.3-a" { target { aarch64*-*-* } } } */
> +
> +class a {
> +  float b;
> +  float c;
> +
> +public:
> +  a(float d, float e) : b(d), c(e) {}
> +  a operator+(a d) { return a(b + d.b, c + d.c); }
> +  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); }
> +};
> +int f, g;
> +class {
> +  a *h;
> +  a *i;
> +
> +public:
> +  void j() {
> +    a k = h[0], l = i[g], m = k * i[f];
> +    i[g] = l + m;
> +    i[f] = m;
> +  }
> +} n;
> +main() { n.j(); }
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> index f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f97dfe206f17ab57 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t &lanes,
>  
>    if (result != CMPLX_NONE && ops != NULL)
>      {
> -      ops->create (2);
> -      ops->quick_push (node1);
> -      ops->quick_push (node2);
> +      ops->safe_push (node1);
> +      ops->safe_push (node2);
>      }
>    return result;
>  }
> @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo)
>  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode
> +    = vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
> +  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
>  
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
>    SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
> -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> -    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
> -  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
>  
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public complex_pattern
>      }
>  };
>  
> -/* Helper function to "reset" a previously matched node and undo the changes
> -   made enough so that the node is treated as an irrelevant node.  */
> -
> -static inline void
> -vect_slp_reset_pattern (slp_tree node)
> -{
> -  stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> -  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> -  STMT_SLP_TYPE (stmt_info) = pure_slp;
> -  SLP_TREE_REPRESENTATIVE (node) = stmt_info;
> -}
> -
>  /* Pattern matcher for trying to match complex multiply and accumulate
>     and multiply and subtract patterns in SLP tree.
>     If the operation matches then IFN is set to the operation it matched and
> @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches (complex_operation_t op,
>    if (!vect_pattern_validate_optab (ifn, vnode))
>      return IFN_LAST;
>  
> -  /* FMA matched ADD + CMUL.  During the matching of CMUL the
> -     stmt that starts the pattern is marked as being in a pattern,
> -     namely the CMUL.  When replacing this with a CFMA we have to
> -     unmark this statement as being in a pattern.  This is because
> -     vect_mark_pattern_stmts will only mark the current stmt as being
> -     in a pattern.  Later on when the scalar stmts are examined the
> -     old statement which is supposed to be irrelevant will point to
> -     CMUL unless we undo the pattern relationship here.  */
> -  vect_slp_reset_pattern (node);
>    ops->truncate (0);
>    ops->create (3);
>  
> @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize (slp_tree_to_load_perm_map_t *perm_cache,
>  void
>  complex_fma_pattern::build (vec_info *vinfo)
>  {
> +  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
> +
>    SLP_TREE_CHILDREN (*this->m_node).release ();
>    SLP_TREE_CHILDREN (*this->m_node).create (3);
>    SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
>  
> +  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +
> +  vect_free_slp_tree (node);
> +
>    complex_pattern::build (vinfo);
>  }
>  
> @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo)
>  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode =
> +    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node);
> +  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> +  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
>  
> @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo)
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
> -  SLP_TREE_CHILDREN (*this->m_node).quick_push (
> -    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node));
> -  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> -  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
>  
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index ea8a97b01c6371791ac66de3e1dabfedee69cb67..65c2ff867ab41ea70367087dc26fb6eea1375ffb 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node)
>      if (child)
>        vect_free_slp_tree (child);
>  
> +  /* If the node defines any SLP only patterns then those patterns are no
> +     longer valid and should be removed.  */
> +  stmt_vec_info rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);
> +  if (rep_stmt_info && STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
> +    {
> +      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
> +      //STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> +      //STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
> +    }
> +
>    delete node;
>  }
>  
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc372072d98326891f19e6 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt)
>    STMT_VINFO_REDUC_FN (res) = IFN_LAST;
>    STMT_VINFO_REDUC_IDX (res) = -1;
>    STMT_VINFO_SLP_VECT_ONLY (res) = false;
> +  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
>    STMT_VINFO_VEC_STMTS (res) = vNULL;
>  
>    if (is_a <loop_vec_info> (this)
> 
> 
>
Richard Biener Feb. 24, 2021, 8:38 a.m. UTC | #3
On Tue, 23 Feb 2021, Tamar Christina wrote:

> Hi Richi,
> 
> The attached testcase ICEs due to a couple of issues.
> In the testcase you have two SLP instances that share the majority of their
> definition with each other.  One tree defines a COMPLEX_MUL sequence and the
> other tree a COMPLEX_FMA.
> 
> The ice happens because:
> 
> 1. the refcounts are wrong, in particular the FMA case doesn't correctly count
> the references for the COMPLEX_MUL that it consumes.
> 
> 2. when the FMA is created it incorrectly assumes it can just tear apart the MUL
> node that it's consuming.  This is wrong and should only be done when there is
> no more uses of the node, in which case the vector only pattern is no longer
> relevant.
> 
> To fix the last part the SLP only pattern reset code was moved into
> vect_free_slp_tree which results in cleaner code.  I also think it does belong
> there since that function knows when there are no more uses of the node and so
> the pattern should be unmarked, so when the the vectorizer is inspecting the BB
> it doesn't find the now invalid vector only patterns.
> 
> The patch also clears the SLP_TREE_REPRESENTATIVE when stores are removed such
> that we don't hit an error later trying to free the stmt_vec_info again.
> 
> Lastly it also tweaks the results of whether a pattern has been detected or not
> to return true when another SLP instance has created a pattern that is only used
> by a different instance (due to the trees being unshared).
> 
> Instead of ICEing this code now produces
> 
>         adrp    x1, .LANCHOR0
>         add     x2, x1, :lo12:.LANCHOR0
>         movi    v1.2s, 0
>         mov     w0, 0
>         ldr     x4, [x1, #:lo12:.LANCHOR0]
>         ldrsw   x3, [x2, 16]
>         ldr     x1, [x2, 8]
>         ldrsw   x2, [x2, 20]
>         ldr     d0, [x4]
>         ldr     d2, [x1, x3, lsl 3]
>         fcmla   v2.2s, v0.2s, v0.2s, #0
>         fcmla   v2.2s, v0.2s, v0.2s, #90
>         str     d2, [x1, x3, lsl 3]
>         fcmla   v1.2s, v0.2s, v0.2s, #0
>         fcmla   v1.2s, v0.2s, v0.2s, #90
>         str     d1, [x1, x2, lsl 3]
>         ret
> 
> PS. This testcase actually shows that the codegen we get in these cases is not
> optimal. It should generate a MUL + ADD instead MUL + FMA.
> 
> But that's for GCC 12.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

OK.

Thanks,
Richard.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	PR tree-optimization/99149
> 	* tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
> 	buffer.
> 	(vect_slp_reset_pattern): Remove.
> 	(complex_fma_pattern::matches): Remove call to vect_slp_reset_pattern.
> 	(complex_mul_pattern::build, complex_fma_pattern::build,
> 	complex_fms_pattern::build): Fix ref counts.
> 	* tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern relevancy
> 	when node is being deleted.
> 	(vect_match_slp_patterns_2): Correct result of cache hit on patterns.
> 	(vect_schedule_slp): Invalidate SLP_TREE_REPRESENTATIVE of removed
> 	stores.
> 	* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR tree-optimization/99149
> 	* g++.dg/vect/pr99149.cc: New test.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/testsuite/g++.dg/vect/pr99149.cc b/gcc/testsuite/g++.dg/vect/pr99149.cc
> new file mode 100755
> index 0000000000000000000000000000000000000000..902a26f576fcc79d2802bec093668674cca1c84f
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/vect/pr99149.cc
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-w -O3 -march=armv8.3-a -fdump-tree-slp-all" { target { aarch64*-*-* } } } */
> +
> +class a {
> +  float b;
> +  float c;
> +
> +public:
> +  a(float d, float e) : b(d), c(e) {}
> +  a operator+(a d) { return a(b + d.b, c + d.c); }
> +  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); }
> +};
> +int f, g;
> +class {
> +  a *h;
> +  a *i;
> +
> +public:
> +  void j() {
> +    a k = h[0], l = i[g], m = k * i[f];
> +    i[g] = l + m;
> +    i[f] = m;
> +  }
> +} n;
> +main() { n.j(); }
> +
> +/* { dg-final { scan-tree-dump-times "stmt.*COMPLEX_MUL" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "stmt.*COMPLEX_FMA" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> index f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f97dfe206f17ab57 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t &lanes,
>  
>    if (result != CMPLX_NONE && ops != NULL)
>      {
> -      ops->create (2);
> -      ops->quick_push (node1);
> -      ops->quick_push (node2);
> +      ops->safe_push (node1);
> +      ops->safe_push (node2);
>      }
>    return result;
>  }
> @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo)
>  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode
> +    = vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
> +  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
>  
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
>    SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
> -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> -    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
> -  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
>  
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public complex_pattern
>      }
>  };
>  
> -/* Helper function to "reset" a previously matched node and undo the changes
> -   made enough so that the node is treated as an irrelevant node.  */
> -
> -static inline void
> -vect_slp_reset_pattern (slp_tree node)
> -{
> -  stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> -  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> -  STMT_SLP_TYPE (stmt_info) = pure_slp;
> -  SLP_TREE_REPRESENTATIVE (node) = stmt_info;
> -}
> -
>  /* Pattern matcher for trying to match complex multiply and accumulate
>     and multiply and subtract patterns in SLP tree.
>     If the operation matches then IFN is set to the operation it matched and
> @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches (complex_operation_t op,
>    if (!vect_pattern_validate_optab (ifn, vnode))
>      return IFN_LAST;
>  
> -  /* FMA matched ADD + CMUL.  During the matching of CMUL the
> -     stmt that starts the pattern is marked as being in a pattern,
> -     namely the CMUL.  When replacing this with a CFMA we have to
> -     unmark this statement as being in a pattern.  This is because
> -     vect_mark_pattern_stmts will only mark the current stmt as being
> -     in a pattern.  Later on when the scalar stmts are examined the
> -     old statement which is supposed to be irrelevant will point to
> -     CMUL unless we undo the pattern relationship here.  */
> -  vect_slp_reset_pattern (node);
>    ops->truncate (0);
>    ops->create (3);
>  
> @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize (slp_tree_to_load_perm_map_t *perm_cache,
>  void
>  complex_fma_pattern::build (vec_info *vinfo)
>  {
> +  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
> +
>    SLP_TREE_CHILDREN (*this->m_node).release ();
>    SLP_TREE_CHILDREN (*this->m_node).create (3);
>    SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
>  
> +  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> +
> +  vect_free_slp_tree (node);
> +
>    complex_pattern::build (vinfo);
>  }
>  
> @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo)
>  {
>    slp_tree node;
>    unsigned i;
> +  slp_tree newnode =
> +    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node);
> +  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> +  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
>      vect_free_slp_tree (node);
>  
> @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo)
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
>    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
> -  SLP_TREE_CHILDREN (*this->m_node).quick_push (
> -    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node));
> -  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> -  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> +  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
>  
>    /* And then rewrite the node itself.  */
>    complex_pattern::build (vinfo);
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index ea8a97b01c6371791ac66de3e1dabfedee69cb67..605873714a5cafaaf822f61f1f769f96b3876694 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node)
>      if (child)
>        vect_free_slp_tree (child);
>  
> +  /* If the node defines any SLP only patterns then those patterns are no
> +     longer valid and should be removed.  */
> +  stmt_vec_info rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);
> +  if (rep_stmt_info && STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
> +    {
> +      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
> +      STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> +      STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
> +    }
> +
>    delete node;
>  }
>  
> @@ -2395,7 +2405,9 @@ vect_match_slp_patterns_2 (slp_tree *ref_node, vec_info *vinfo,
>    slp_tree node = *ref_node;
>    bool found_p = false;
>    if (!node || visited->add (node))
> -    return false;
> +    return node
> +	   && SLP_TREE_REPRESENTATIVE (node)
> +	   && STMT_VINFO_SLP_VECT_ONLY_PATTERN (SLP_TREE_REPRESENTATIVE (node));
>  
>    slp_tree child;
>    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> @@ -6451,6 +6463,11 @@ vect_schedule_slp (vec_info *vinfo, vec<slp_instance> slp_instances)
>  	  store_info = vect_orig_stmt (store_info);
>  	  /* Free the attached stmt_vec_info and remove the stmt.  */
>  	  vinfo->remove_stmt (store_info);
> +
> +	  /* Invalidate SLP_TREE_REPRESENTATIVE in case we released it
> +	     to not crash in vect_free_slp_tree later.  */
> +	  if (SLP_TREE_REPRESENTATIVE (root) == store_info)
> +	    SLP_TREE_REPRESENTATIVE (root) = NULL;
>          }
>      }
>  }
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc372072d98326891f19e6 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt)
>    STMT_VINFO_REDUC_FN (res) = IFN_LAST;
>    STMT_VINFO_REDUC_IDX (res) = -1;
>    STMT_VINFO_SLP_VECT_ONLY (res) = false;
> +  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
>    STMT_VINFO_VEC_STMTS (res) = vNULL;
>  
>    if (is_a <loop_vec_info> (this)
> 
> 
>
Christophe Lyon Feb. 24, 2021, 2:17 p.m. UTC | #4
On Wed, 24 Feb 2021 at 09:38, Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 23 Feb 2021, Tamar Christina wrote:
>
> > Hi Richi,
> >
> > The attached testcase ICEs due to a couple of issues.
> > In the testcase you have two SLP instances that share the majority of their
> > definition with each other.  One tree defines a COMPLEX_MUL sequence and the
> > other tree a COMPLEX_FMA.
> >
> > The ice happens because:
> >
> > 1. the refcounts are wrong, in particular the FMA case doesn't correctly count
> > the references for the COMPLEX_MUL that it consumes.
> >
> > 2. when the FMA is created it incorrectly assumes it can just tear apart the MUL
> > node that it's consuming.  This is wrong and should only be done when there is
> > no more uses of the node, in which case the vector only pattern is no longer
> > relevant.
> >
> > To fix the last part the SLP only pattern reset code was moved into
> > vect_free_slp_tree which results in cleaner code.  I also think it does belong
> > there since that function knows when there are no more uses of the node and so
> > the pattern should be unmarked, so when the the vectorizer is inspecting the BB
> > it doesn't find the now invalid vector only patterns.
> >
> > The patch also clears the SLP_TREE_REPRESENTATIVE when stores are removed such
> > that we don't hit an error later trying to free the stmt_vec_info again.
> >
> > Lastly it also tweaks the results of whether a pattern has been detected or not
> > to return true when another SLP instance has created a pattern that is only used
> > by a different instance (due to the trees being unshared).
> >
> > Instead of ICEing this code now produces
> >
> >         adrp    x1, .LANCHOR0
> >         add     x2, x1, :lo12:.LANCHOR0
> >         movi    v1.2s, 0
> >         mov     w0, 0
> >         ldr     x4, [x1, #:lo12:.LANCHOR0]
> >         ldrsw   x3, [x2, 16]
> >         ldr     x1, [x2, 8]
> >         ldrsw   x2, [x2, 20]
> >         ldr     d0, [x4]
> >         ldr     d2, [x1, x3, lsl 3]
> >         fcmla   v2.2s, v0.2s, v0.2s, #0
> >         fcmla   v2.2s, v0.2s, v0.2s, #90
> >         str     d2, [x1, x3, lsl 3]
> >         fcmla   v1.2s, v0.2s, v0.2s, #0
> >         fcmla   v1.2s, v0.2s, v0.2s, #90
> >         str     d1, [x1, x2, lsl 3]
> >         ret
> >
> > PS. This testcase actually shows that the codegen we get in these cases is not
> > optimal. It should generate a MUL + ADD instead MUL + FMA.
> >
> > But that's for GCC 12.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Hi Tamar,

The new test fails on aarch64_be:
FAIL: g++.dg/vect/pr99149.cc  -std=c++14  scan-tree-dump-times slp2
"stmt.*COMPLEX_FMA" 1
FAIL: g++.dg/vect/pr99149.cc  -std=c++17  scan-tree-dump-times slp2
"stmt.*COMPLEX_FMA" 1
FAIL: g++.dg/vect/pr99149.cc  -std=c++2a  scan-tree-dump-times slp2
"stmt.*COMPLEX_FMA" 1
FAIL: g++.dg/vect/pr99149.cc  -std=c++98  scan-tree-dump-times slp2
"stmt.*COMPLEX_FMA" 1

Is that supposed to work, or should you disable the test on aarch64_be?

Christophe

>
> OK.
>
> Thanks,
> Richard.
>
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >       PR tree-optimization/99149
> >       * tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
> >       buffer.
> >       (vect_slp_reset_pattern): Remove.
> >       (complex_fma_pattern::matches): Remove call to vect_slp_reset_pattern.
> >       (complex_mul_pattern::build, complex_fma_pattern::build,
> >       complex_fms_pattern::build): Fix ref counts.
> >       * tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern relevancy
> >       when node is being deleted.
> >       (vect_match_slp_patterns_2): Correct result of cache hit on patterns.
> >       (vect_schedule_slp): Invalidate SLP_TREE_REPRESENTATIVE of removed
> >       stores.
> >       * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       PR tree-optimization/99149
> >       * g++.dg/vect/pr99149.cc: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/g++.dg/vect/pr99149.cc b/gcc/testsuite/g++.dg/vect/pr99149.cc
> > new file mode 100755
> > index 0000000000000000000000000000000000000000..902a26f576fcc79d2802bec093668674cca1c84f
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/vect/pr99149.cc
> > @@ -0,0 +1,28 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-w -O3 -march=armv8.3-a -fdump-tree-slp-all" { target { aarch64*-*-* } } } */
> > +
> > +class a {
> > +  float b;
> > +  float c;
> > +
> > +public:
> > +  a(float d, float e) : b(d), c(e) {}
> > +  a operator+(a d) { return a(b + d.b, c + d.c); }
> > +  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); }
> > +};
> > +int f, g;
> > +class {
> > +  a *h;
> > +  a *i;
> > +
> > +public:
> > +  void j() {
> > +    a k = h[0], l = i[g], m = k * i[f];
> > +    i[g] = l + m;
> > +    i[f] = m;
> > +  }
> > +} n;
> > +main() { n.j(); }
> > +
> > +/* { dg-final { scan-tree-dump-times "stmt.*COMPLEX_MUL" 1 "slp2" } } */
> > +/* { dg-final { scan-tree-dump-times "stmt.*COMPLEX_FMA" 1 "slp2" } } */
> > diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> > index f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f97dfe206f17ab57 100644
> > --- a/gcc/tree-vect-slp-patterns.c
> > +++ b/gcc/tree-vect-slp-patterns.c
> > @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t &lanes,
> >
> >    if (result != CMPLX_NONE && ops != NULL)
> >      {
> > -      ops->create (2);
> > -      ops->quick_push (node1);
> > -      ops->quick_push (node2);
> > +      ops->safe_push (node1);
> > +      ops->safe_push (node2);
> >      }
> >    return result;
> >  }
> > @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo)
> >  {
> >    slp_tree node;
> >    unsigned i;
> > +  slp_tree newnode
> > +    = vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
> > +  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> > +
> >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
> >      vect_free_slp_tree (node);
> >
> >    /* First re-arrange the children.  */
> >    SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
> >    SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
> > -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> > -    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
> > -  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> > +  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
> >
> >    /* And then rewrite the node itself.  */
> >    complex_pattern::build (vinfo);
> > @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public complex_pattern
> >      }
> >  };
> >
> > -/* Helper function to "reset" a previously matched node and undo the changes
> > -   made enough so that the node is treated as an irrelevant node.  */
> > -
> > -static inline void
> > -vect_slp_reset_pattern (slp_tree node)
> > -{
> > -  stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
> > -  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> > -  STMT_SLP_TYPE (stmt_info) = pure_slp;
> > -  SLP_TREE_REPRESENTATIVE (node) = stmt_info;
> > -}
> > -
> >  /* Pattern matcher for trying to match complex multiply and accumulate
> >     and multiply and subtract patterns in SLP tree.
> >     If the operation matches then IFN is set to the operation it matched and
> > @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches (complex_operation_t op,
> >    if (!vect_pattern_validate_optab (ifn, vnode))
> >      return IFN_LAST;
> >
> > -  /* FMA matched ADD + CMUL.  During the matching of CMUL the
> > -     stmt that starts the pattern is marked as being in a pattern,
> > -     namely the CMUL.  When replacing this with a CFMA we have to
> > -     unmark this statement as being in a pattern.  This is because
> > -     vect_mark_pattern_stmts will only mark the current stmt as being
> > -     in a pattern.  Later on when the scalar stmts are examined the
> > -     old statement which is supposed to be irrelevant will point to
> > -     CMUL unless we undo the pattern relationship here.  */
> > -  vect_slp_reset_pattern (node);
> >    ops->truncate (0);
> >    ops->create (3);
> >
> > @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize (slp_tree_to_load_perm_map_t *perm_cache,
> >  void
> >  complex_fma_pattern::build (vec_info *vinfo)
> >  {
> > +  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
> > +
> >    SLP_TREE_CHILDREN (*this->m_node).release ();
> >    SLP_TREE_CHILDREN (*this->m_node).create (3);
> >    SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
> >
> > +  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> > +  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> > +
> > +  vect_free_slp_tree (node);
> > +
> >    complex_pattern::build (vinfo);
> >  }
> >
> > @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo)
> >  {
> >    slp_tree node;
> >    unsigned i;
> > +  slp_tree newnode =
> > +    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node);
> > +  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> > +  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> > +
> >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
> >      vect_free_slp_tree (node);
> >
> > @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo)
> >    /* First re-arrange the children.  */
> >    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
> >    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
> > -  SLP_TREE_CHILDREN (*this->m_node).quick_push (
> > -    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node));
> > -  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> > -  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> > +  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
> >
> >    /* And then rewrite the node itself.  */
> >    complex_pattern::build (vinfo);
> > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> > index ea8a97b01c6371791ac66de3e1dabfedee69cb67..605873714a5cafaaf822f61f1f769f96b3876694 100644
> > --- a/gcc/tree-vect-slp.c
> > +++ b/gcc/tree-vect-slp.c
> > @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node)
> >      if (child)
> >        vect_free_slp_tree (child);
> >
> > +  /* If the node defines any SLP only patterns then those patterns are no
> > +     longer valid and should be removed.  */
> > +  stmt_vec_info rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > +  if (rep_stmt_info && STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
> > +    {
> > +      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
> > +      STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> > +      STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
> > +    }
> > +
> >    delete node;
> >  }
> >
> > @@ -2395,7 +2405,9 @@ vect_match_slp_patterns_2 (slp_tree *ref_node, vec_info *vinfo,
> >    slp_tree node = *ref_node;
> >    bool found_p = false;
> >    if (!node || visited->add (node))
> > -    return false;
> > +    return node
> > +        && SLP_TREE_REPRESENTATIVE (node)
> > +        && STMT_VINFO_SLP_VECT_ONLY_PATTERN (SLP_TREE_REPRESENTATIVE (node));
> >
> >    slp_tree child;
> >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > @@ -6451,6 +6463,11 @@ vect_schedule_slp (vec_info *vinfo, vec<slp_instance> slp_instances)
> >         store_info = vect_orig_stmt (store_info);
> >         /* Free the attached stmt_vec_info and remove the stmt.  */
> >         vinfo->remove_stmt (store_info);
> > +
> > +       /* Invalidate SLP_TREE_REPRESENTATIVE in case we released it
> > +          to not crash in vect_free_slp_tree later.  */
> > +       if (SLP_TREE_REPRESENTATIVE (root) == store_info)
> > +         SLP_TREE_REPRESENTATIVE (root) = NULL;
> >          }
> >      }
> >  }
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > index 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc372072d98326891f19e6 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt)
> >    STMT_VINFO_REDUC_FN (res) = IFN_LAST;
> >    STMT_VINFO_REDUC_IDX (res) = -1;
> >    STMT_VINFO_SLP_VECT_ONLY (res) = false;
> > +  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
> >    STMT_VINFO_VEC_STMTS (res) = vNULL;
> >
> >    if (is_a <loop_vec_info> (this)
> >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
Tamar Christina Feb. 24, 2021, 2:20 p.m. UTC | #5
> -----Original Message-----
> From: Christophe Lyon <christophe.lyon@linaro.org>
> Sent: Wednesday, February 24, 2021 2:17 PM
> To: Richard Biener <rguenther@suse.de>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>; gcc
> Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [PATCH v2] middle-end slp: fix sharing of SLP only patterns.
> 
> On Wed, 24 Feb 2021 at 09:38, Richard Biener <rguenther@suse.de> wrote:
> >
> > On Tue, 23 Feb 2021, Tamar Christina wrote:
> >
> > > Hi Richi,
> > >
> > > The attached testcase ICEs due to a couple of issues.
> > > In the testcase you have two SLP instances that share the majority
> > > of their definition with each other.  One tree defines a COMPLEX_MUL
> > > sequence and the other tree a COMPLEX_FMA.
> > >
> > > The ice happens because:
> > >
> > > 1. the refcounts are wrong, in particular the FMA case doesn't
> > > correctly count the references for the COMPLEX_MUL that it consumes.
> > >
> > > 2. when the FMA is created it incorrectly assumes it can just tear
> > > apart the MUL node that it's consuming.  This is wrong and should
> > > only be done when there is no more uses of the node, in which case
> > > the vector only pattern is no longer relevant.
> > >
> > > To fix the last part the SLP only pattern reset code was moved into
> > > vect_free_slp_tree which results in cleaner code.  I also think it
> > > does belong there since that function knows when there are no more
> > > uses of the node and so the pattern should be unmarked, so when the
> > > the vectorizer is inspecting the BB it doesn't find the now invalid vector
> only patterns.
> > >
> > > The patch also clears the SLP_TREE_REPRESENTATIVE when stores are
> > > removed such that we don't hit an error later trying to free the
> stmt_vec_info again.
> > >
> > > Lastly it also tweaks the results of whether a pattern has been
> > > detected or not to return true when another SLP instance has created
> > > a pattern that is only used by a different instance (due to the trees being
> unshared).
> > >
> > > Instead of ICEing this code now produces
> > >
> > >         adrp    x1, .LANCHOR0
> > >         add     x2, x1, :lo12:.LANCHOR0
> > >         movi    v1.2s, 0
> > >         mov     w0, 0
> > >         ldr     x4, [x1, #:lo12:.LANCHOR0]
> > >         ldrsw   x3, [x2, 16]
> > >         ldr     x1, [x2, 8]
> > >         ldrsw   x2, [x2, 20]
> > >         ldr     d0, [x4]
> > >         ldr     d2, [x1, x3, lsl 3]
> > >         fcmla   v2.2s, v0.2s, v0.2s, #0
> > >         fcmla   v2.2s, v0.2s, v0.2s, #90
> > >         str     d2, [x1, x3, lsl 3]
> > >         fcmla   v1.2s, v0.2s, v0.2s, #0
> > >         fcmla   v1.2s, v0.2s, v0.2s, #90
> > >         str     d1, [x1, x2, lsl 3]
> > >         ret
> > >
> > > PS. This testcase actually shows that the codegen we get in these
> > > cases is not optimal. It should generate a MUL + ADD instead MUL + FMA.
> > >
> > > But that's for GCC 12.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Hi Tamar,
> 
> The new test fails on aarch64_be:
> FAIL: g++.dg/vect/pr99149.cc  -std=c++14  scan-tree-dump-times slp2
> "stmt.*COMPLEX_FMA" 1
> FAIL: g++.dg/vect/pr99149.cc  -std=c++17  scan-tree-dump-times slp2
> "stmt.*COMPLEX_FMA" 1
> FAIL: g++.dg/vect/pr99149.cc  -std=c++2a  scan-tree-dump-times slp2
> "stmt.*COMPLEX_FMA" 1
> FAIL: g++.dg/vect/pr99149.cc  -std=c++98  scan-tree-dump-times slp2
> "stmt.*COMPLEX_FMA" 1
> 
> Is that supposed to work, or should you disable the test on aarch64_be?

Args, it's a bit complicated, it's blocked on AArch64 big-endian but not SVE big-endian.

I'll disable all be for now.

Thanks,
Tamar
> 
> Christophe
> 
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >       PR tree-optimization/99149
> > >       * tree-vect-slp-patterns.c (vect_detect_pair_op): Don't recreate the
> > >       buffer.
> > >       (vect_slp_reset_pattern): Remove.
> > >       (complex_fma_pattern::matches): Remove call to
> vect_slp_reset_pattern.
> > >       (complex_mul_pattern::build, complex_fma_pattern::build,
> > >       complex_fms_pattern::build): Fix ref counts.
> > >       * tree-vect-slp.c (vect_free_slp_tree): Undo SLP only pattern
> relevancy
> > >       when node is being deleted.
> > >       (vect_match_slp_patterns_2): Correct result of cache hit on patterns.
> > >       (vect_schedule_slp): Invalidate SLP_TREE_REPRESENTATIVE of
> removed
> > >       stores.
> > >       * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize value.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >       PR tree-optimization/99149
> > >       * g++.dg/vect/pr99149.cc: New test.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/testsuite/g++.dg/vect/pr99149.cc
> > > b/gcc/testsuite/g++.dg/vect/pr99149.cc
> > > new file mode 100755
> > > index
> > >
> 0000000000000000000000000000000000000000..902a26f576fcc79d2802bec093
> > > 668674cca1c84f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/vect/pr99149.cc
> > > @@ -0,0 +1,28 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-w -O3 -march=armv8.3-a
> > > +-fdump-tree-slp-all" { target { aarch64*-*-* } } } */
> > > +
> > > +class a {
> > > +  float b;
> > > +  float c;
> > > +
> > > +public:
> > > +  a(float d, float e) : b(d), c(e) {}
> > > +  a operator+(a d) { return a(b + d.b, c + d.c); }
> > > +  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); } };
> > > +int f, g; class {
> > > +  a *h;
> > > +  a *i;
> > > +
> > > +public:
> > > +  void j() {
> > > +    a k = h[0], l = i[g], m = k * i[f];
> > > +    i[g] = l + m;
> > > +    i[f] = m;
> > > +  }
> > > +} n;
> > > +main() { n.j(); }
> > > +
> > > +/* { dg-final { scan-tree-dump-times "stmt.*COMPLEX_MUL" 1 "slp2" }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-times "stmt.*COMPLEX_FMA" 1 "slp2" }
> > > +} */
> > > diff --git a/gcc/tree-vect-slp-patterns.c
> > > b/gcc/tree-vect-slp-patterns.c index
> > >
> f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f9
> > > 7dfe206f17ab57 100644
> > > --- a/gcc/tree-vect-slp-patterns.c
> > > +++ b/gcc/tree-vect-slp-patterns.c
> > > @@ -407,9 +407,8 @@ vect_detect_pair_op (slp_tree node1, slp_tree
> > > node2, lane_permutation_t &lanes,
> > >
> > >    if (result != CMPLX_NONE && ops != NULL)
> > >      {
> > > -      ops->create (2);
> > > -      ops->quick_push (node1);
> > > -      ops->quick_push (node2);
> > > +      ops->safe_push (node1);
> > > +      ops->safe_push (node2);
> > >      }
> > >    return result;
> > >  }
> > > @@ -1090,15 +1089,17 @@ complex_mul_pattern::build (vec_info *vinfo)
> > > {
> > >    slp_tree node;
> > >    unsigned i;
> > > +  slp_tree newnode
> > > +    = vect_build_combine_node (this->m_ops[0], this->m_ops[1],
> > > + *this->m_node);  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> > > +
> > >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
> > >      vect_free_slp_tree (node);
> > >
> > >    /* First re-arrange the children.  */
> > >    SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
> > >    SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
> > > -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> > > -    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this-
> >m_node);
> > > -  SLP_TREE_REF_COUNT (this->m_ops[2])++;
> > > +  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
> > >
> > >    /* And then rewrite the node itself.  */
> > >    complex_pattern::build (vinfo);
> > > @@ -1133,18 +1134,6 @@ class complex_fma_pattern : public
> complex_pattern
> > >      }
> > >  };
> > >
> > > -/* Helper function to "reset" a previously matched node and undo the
> changes
> > > -   made enough so that the node is treated as an irrelevant node.  */
> > > -
> > > -static inline void
> > > -vect_slp_reset_pattern (slp_tree node) -{
> > > -  stmt_vec_info stmt_info = vect_orig_stmt
> (SLP_TREE_REPRESENTATIVE
> > > (node));
> > > -  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> > > -  STMT_SLP_TYPE (stmt_info) = pure_slp;
> > > -  SLP_TREE_REPRESENTATIVE (node) = stmt_info; -}
> > > -
> > >  /* Pattern matcher for trying to match complex multiply and accumulate
> > >     and multiply and subtract patterns in SLP tree.
> > >     If the operation matches then IFN is set to the operation it
> > > matched and @@ -1208,15 +1197,6 @@ complex_fma_pattern::matches
> (complex_operation_t op,
> > >    if (!vect_pattern_validate_optab (ifn, vnode))
> > >      return IFN_LAST;
> > >
> > > -  /* FMA matched ADD + CMUL.  During the matching of CMUL the
> > > -     stmt that starts the pattern is marked as being in a pattern,
> > > -     namely the CMUL.  When replacing this with a CFMA we have to
> > > -     unmark this statement as being in a pattern.  This is because
> > > -     vect_mark_pattern_stmts will only mark the current stmt as being
> > > -     in a pattern.  Later on when the scalar stmts are examined the
> > > -     old statement which is supposed to be irrelevant will point to
> > > -     CMUL unless we undo the pattern relationship here.  */
> > > -  vect_slp_reset_pattern (node);
> > >    ops->truncate (0);
> > >    ops->create (3);
> > >
> > > @@ -1259,10 +1239,17 @@ complex_fma_pattern::recognize
> > > (slp_tree_to_load_perm_map_t *perm_cache,  void
> > > complex_fma_pattern::build (vec_info *vinfo)  {
> > > +  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
> > > +
> > >    SLP_TREE_CHILDREN (*this->m_node).release ();
> > >    SLP_TREE_CHILDREN (*this->m_node).create (3);
> > >    SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
> > >
> > > +  SLP_TREE_REF_COUNT (this->m_ops[1])++;  SLP_TREE_REF_COUNT
> > > + (this->m_ops[2])++;
> > > +
> > > +  vect_free_slp_tree (node);
> > > +
> > >    complex_pattern::build (vinfo);
> > >  }
> > >
> > > @@ -1427,6 +1414,11 @@ complex_fms_pattern::build (vec_info *vinfo)
> > > {
> > >    slp_tree node;
> > >    unsigned i;
> > > +  slp_tree newnode =
> > > +    vect_build_combine_node (this->m_ops[2], this->m_ops[3],
> > > + *this->m_node);  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> > > + SLP_TREE_REF_COUNT (this->m_ops[1])++;
> > > +
> > >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
> > >      vect_free_slp_tree (node);
> > >
> > > @@ -1436,10 +1428,7 @@ complex_fms_pattern::build (vec_info *vinfo)
> > >    /* First re-arrange the children.  */
> > >    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
> > >    SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
> > > -  SLP_TREE_CHILDREN (*this->m_node).quick_push (
> > > -    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this-
> >m_node));
> > > -  SLP_TREE_REF_COUNT (this->m_ops[0])++;
> > > -  SLP_TREE_REF_COUNT (this->m_ops[1])++;
> > > +  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
> > >
> > >    /* And then rewrite the node itself.  */
> > >    complex_pattern::build (vinfo);
> > > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index
> > >
> ea8a97b01c6371791ac66de3e1dabfedee69cb67..605873714a5cafaaf822f61f1f
> > > 769f96b3876694 100644
> > > --- a/gcc/tree-vect-slp.c
> > > +++ b/gcc/tree-vect-slp.c
> > > @@ -146,6 +146,16 @@ vect_free_slp_tree (slp_tree node)
> > >      if (child)
> > >        vect_free_slp_tree (child);
> > >
> > > +  /* If the node defines any SLP only patterns then those patterns are no
> > > +     longer valid and should be removed.  */  stmt_vec_info
> > > + rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);  if (rep_stmt_info
> > > + && STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
> > > +    {
> > > +      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
> > > +      STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
> > > +      STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
> > > +    }
> > > +
> > >    delete node;
> > >  }
> > >
> > > @@ -2395,7 +2405,9 @@ vect_match_slp_patterns_2 (slp_tree
> *ref_node, vec_info *vinfo,
> > >    slp_tree node = *ref_node;
> > >    bool found_p = false;
> > >    if (!node || visited->add (node))
> > > -    return false;
> > > +    return node
> > > +        && SLP_TREE_REPRESENTATIVE (node)
> > > +        && STMT_VINFO_SLP_VECT_ONLY_PATTERN
> > > + (SLP_TREE_REPRESENTATIVE (node));
> > >
> > >    slp_tree child;
> > >    FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child) @@ -6451,6
> > > +6463,11 @@ vect_schedule_slp (vec_info *vinfo, vec<slp_instance>
> slp_instances)
> > >         store_info = vect_orig_stmt (store_info);
> > >         /* Free the attached stmt_vec_info and remove the stmt.  */
> > >         vinfo->remove_stmt (store_info);
> > > +
> > > +       /* Invalidate SLP_TREE_REPRESENTATIVE in case we released it
> > > +          to not crash in vect_free_slp_tree later.  */
> > > +       if (SLP_TREE_REPRESENTATIVE (root) == store_info)
> > > +         SLP_TREE_REPRESENTATIVE (root) = NULL;
> > >          }
> > >      }
> > >  }
> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> > >
> 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc37207
> 2
> > > d98326891f19e6 100644
> > > --- a/gcc/tree-vectorizer.c
> > > +++ b/gcc/tree-vectorizer.c
> > > @@ -695,6 +695,7 @@ vec_info::new_stmt_vec_info (gimple *stmt)
> > >    STMT_VINFO_REDUC_FN (res) = IFN_LAST;
> > >    STMT_VINFO_REDUC_IDX (res) = -1;
> > >    STMT_VINFO_SLP_VECT_ONLY (res) = false;
> > > +  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
> > >    STMT_VINFO_VEC_STMTS (res) = vNULL;
> > >
> > >    if (is_a <loop_vec_info> (this)
> > >
> > >
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
diff mbox series

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/pr99149.C b/gcc/testsuite/gcc.dg/vect/pr99149.C
new file mode 100755
index 0000000000000000000000000000000000000000..b12fe17e4ded148ce2bf67486e425dd65461a148
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr99149.C
@@ -0,0 +1,25 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-w -O3 -march=armv8.3-a" { target { aarch64*-*-* } } } */
+
+class a {
+  float b;
+  float c;
+
+public:
+  a(float d, float e) : b(d), c(e) {}
+  a operator+(a d) { return a(b + d.b, c + d.c); }
+  a operator*(a d) { return a(b * b - c * c, b * c + c * d.b); }
+};
+int f, g;
+class {
+  a *h;
+  a *i;
+
+public:
+  void j() {
+    a k = h[0], l = i[g], m = k * i[f];
+    i[g] = l + m;
+    i[f] = m;
+  }
+} n;
+main() { n.j(); }
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index f0817da9f622d22e3df2e30410d1cf610b4ffa1d..1e2769662a54229ab8e24390f97dfe206f17ab57 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -407,9 +407,8 @@  vect_detect_pair_op (slp_tree node1, slp_tree node2, lane_permutation_t &lanes,
 
   if (result != CMPLX_NONE && ops != NULL)
     {
-      ops->create (2);
-      ops->quick_push (node1);
-      ops->quick_push (node2);
+      ops->safe_push (node1);
+      ops->safe_push (node2);
     }
   return result;
 }
@@ -1090,15 +1089,17 @@  complex_mul_pattern::build (vec_info *vinfo)
 {
   slp_tree node;
   unsigned i;
+  slp_tree newnode
+    = vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
+  SLP_TREE_REF_COUNT (this->m_ops[2])++;
+
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
     vect_free_slp_tree (node);
 
   /* First re-arrange the children.  */
   SLP_TREE_CHILDREN (*this->m_node).reserve_exact (2);
   SLP_TREE_CHILDREN (*this->m_node)[0] = this->m_ops[2];
-  SLP_TREE_CHILDREN (*this->m_node)[1] =
-    vect_build_combine_node (this->m_ops[0], this->m_ops[1], *this->m_node);
-  SLP_TREE_REF_COUNT (this->m_ops[2])++;
+  SLP_TREE_CHILDREN (*this->m_node)[1] = newnode;
 
   /* And then rewrite the node itself.  */
   complex_pattern::build (vinfo);
@@ -1133,18 +1134,6 @@  class complex_fma_pattern : public complex_pattern
     }
 };
 
-/* Helper function to "reset" a previously matched node and undo the changes
-   made enough so that the node is treated as an irrelevant node.  */
-
-static inline void
-vect_slp_reset_pattern (slp_tree node)
-{
-  stmt_vec_info stmt_info = vect_orig_stmt (SLP_TREE_REPRESENTATIVE (node));
-  STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
-  STMT_SLP_TYPE (stmt_info) = pure_slp;
-  SLP_TREE_REPRESENTATIVE (node) = stmt_info;
-}
-
 /* Pattern matcher for trying to match complex multiply and accumulate
    and multiply and subtract patterns in SLP tree.
    If the operation matches then IFN is set to the operation it matched and
@@ -1208,15 +1197,6 @@  complex_fma_pattern::matches (complex_operation_t op,
   if (!vect_pattern_validate_optab (ifn, vnode))
     return IFN_LAST;
 
-  /* FMA matched ADD + CMUL.  During the matching of CMUL the
-     stmt that starts the pattern is marked as being in a pattern,
-     namely the CMUL.  When replacing this with a CFMA we have to
-     unmark this statement as being in a pattern.  This is because
-     vect_mark_pattern_stmts will only mark the current stmt as being
-     in a pattern.  Later on when the scalar stmts are examined the
-     old statement which is supposed to be irrelevant will point to
-     CMUL unless we undo the pattern relationship here.  */
-  vect_slp_reset_pattern (node);
   ops->truncate (0);
   ops->create (3);
 
@@ -1259,10 +1239,17 @@  complex_fma_pattern::recognize (slp_tree_to_load_perm_map_t *perm_cache,
 void
 complex_fma_pattern::build (vec_info *vinfo)
 {
+  slp_tree node = SLP_TREE_CHILDREN (*this->m_node)[1];
+
   SLP_TREE_CHILDREN (*this->m_node).release ();
   SLP_TREE_CHILDREN (*this->m_node).create (3);
   SLP_TREE_CHILDREN (*this->m_node).safe_splice (this->m_ops);
 
+  SLP_TREE_REF_COUNT (this->m_ops[1])++;
+  SLP_TREE_REF_COUNT (this->m_ops[2])++;
+
+  vect_free_slp_tree (node);
+
   complex_pattern::build (vinfo);
 }
 
@@ -1427,6 +1414,11 @@  complex_fms_pattern::build (vec_info *vinfo)
 {
   slp_tree node;
   unsigned i;
+  slp_tree newnode =
+    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node);
+  SLP_TREE_REF_COUNT (this->m_ops[0])++;
+  SLP_TREE_REF_COUNT (this->m_ops[1])++;
+
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (*this->m_node), i, node)
     vect_free_slp_tree (node);
 
@@ -1436,10 +1428,7 @@  complex_fms_pattern::build (vec_info *vinfo)
   /* First re-arrange the children.  */
   SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[0]);
   SLP_TREE_CHILDREN (*this->m_node).quick_push (this->m_ops[1]);
-  SLP_TREE_CHILDREN (*this->m_node).quick_push (
-    vect_build_combine_node (this->m_ops[2], this->m_ops[3], *this->m_node));
-  SLP_TREE_REF_COUNT (this->m_ops[0])++;
-  SLP_TREE_REF_COUNT (this->m_ops[1])++;
+  SLP_TREE_CHILDREN (*this->m_node).quick_push (newnode);
 
   /* And then rewrite the node itself.  */
   complex_pattern::build (vinfo);
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index ea8a97b01c6371791ac66de3e1dabfedee69cb67..65c2ff867ab41ea70367087dc26fb6eea1375ffb 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -146,6 +146,16 @@  vect_free_slp_tree (slp_tree node)
     if (child)
       vect_free_slp_tree (child);
 
+  /* If the node defines any SLP only patterns then those patterns are no
+     longer valid and should be removed.  */
+  stmt_vec_info rep_stmt_info = SLP_TREE_REPRESENTATIVE (node);
+  if (rep_stmt_info && STMT_VINFO_SLP_VECT_ONLY_PATTERN (rep_stmt_info))
+    {
+      stmt_vec_info stmt_info = vect_orig_stmt (rep_stmt_info);
+      //STMT_VINFO_IN_PATTERN_P (stmt_info) = false;
+      //STMT_SLP_TYPE (stmt_info) = STMT_SLP_TYPE (rep_stmt_info);
+    }
+
   delete node;
 }
 
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 5b45df3a4e00266b7530eb4da6985f0d940cb05b..63ba594f2276850a00fc372072d98326891f19e6 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -695,6 +695,7 @@  vec_info::new_stmt_vec_info (gimple *stmt)
   STMT_VINFO_REDUC_FN (res) = IFN_LAST;
   STMT_VINFO_REDUC_IDX (res) = -1;
   STMT_VINFO_SLP_VECT_ONLY (res) = false;
+  STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
   STMT_VINFO_VEC_STMTS (res) = vNULL;
 
   if (is_a <loop_vec_info> (this)