
[RFC] Add x86 subadd SLP pattern

Message ID soq832r0-q415-50rp-9o61-2ns838p7235@fhfr.qr
State New
Series [RFC] Add x86 subadd SLP pattern

Commit Message

Richard Biener June 17, 2021, 9:44 a.m. UTC
This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
i.e. subtract and add alternating across lanes, starting with subtract.
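
For reference, here is a scalar sketch of the per-lane semantics (purely
illustrative, the helper below is not part of the patch); the pattern emits
this as a call to the new .VEC_SUBADD internal function:

/* Scalar emulation of what [v]addsubp{ds} / .VEC_SUBADD computes:
   even lanes subtract, odd lanes add.  */
void
vec_subadd_emul (const double *v0, const double *v1, double *out, int n)
{
  for (int i = 0; i < n; ++i)
    out[i] = (i & 1) ? v0[i] + v1[i] : v0[i] - v1[i];
}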

It adds a corresponding optab and direct internal function,
vec_subadd$a3, and for the moment, to keep the i386 backend changes
"obvious", duplicates the existing avx_addsubv4df3 pattern with
the new canonical name (CODE_FOR_* and gen_* are used throughout the
intrinsic code, so the actual change to rename all existing patterns
will be quite a bit bigger).  I expect some bike-shedding on
subadd vs. addsub so I delay that change ;)

ARC seems to have both addsub{V2HI,V2SI,V4HI} and
subadd{V2HI,V2SI,V4HI}.  ARM iwmmxt has wsubaddx on V4HI but
I didn't dare to decipher whether it's -, + or +, -.  bfin
has both variants as well plus a saturating variant of both (on v2hi).
But none of the major vector ISAs besides x86 seem to have it.
Still, some targets having both and opting for the naming I chose
would make that a better fit IMHO.

Uros/Hongtao - would mass-renaming of the x86 addsub patterns
to subadd be OK?  (not touching the FMA ones which have both)

The SLP pattern matches the exact alternating lane sequence rather
than trying to be clever and anticipating incoming permutes - we
could permute the two input vectors to the needed lane alternation,
do the subadd and then permute the result vector back but that's
only profitable in case the two input permutes or the output permute
will vanish - something Tamar's refactoring of SLP pattern recog should
make possible (I hope ;)).
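
As a hypothetical example (not part of the patch or its testcases), the
following has the right mix of operations but groups the subtracts and
adds instead of alternating them, so the blend permutation is not the
exact -, +, -, + sequence and the pattern does not fire; vaddsubpd could
still be used with input and output permutes, but that is only a win if
one of those permutes goes away:

double x[4], y[4], z[4];

void grouped (void)
{
  x[0] = y[0] - z[0];
  x[1] = y[1] - z[1];
  x[2] = y[2] + z[2];
  x[3] = y[3] + z[3];
}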

So as said, the patch is incomplete on the x86 backend (and test
coverage) side, but the vectorizer part should be finalized once
Tamar posts his work.

Bootstrapped / tested on x86_64-unknown-linux-gnu.

2021-06-17  Richard Biener  <rguenther@suse.de>

	* config/i386/sse.md (vec_subaddv4df3): Duplicate from
	avx_addsubv4df3.
	* internal-fn.def (VEC_SUBADD): New internal optab fn.
	* optabs.def (vec_subadd_optab): New optab.
	* tree-vect-slp-patterns.c (class subadd_pattern): New.
	(slp_patterns): Add subadd_pattern.
	* tree-vectorizer.h (vect_pattern::vect_pattern): Make
	m_ops optional.

	* gcc.target/i386/vect-subadd-1.c: New testcase.
---
 gcc/config/i386/sse.md                        | 16 +++
 gcc/internal-fn.def                           |  1 +
 gcc/optabs.def                                |  1 +
 gcc/testsuite/gcc.target/i386/vect-subadd-1.c | 35 +++++++
 gcc/tree-vect-slp-patterns.c                  | 98 +++++++++++++++++++
 gcc/tree-vectorizer.h                         |  3 +-
 6 files changed, 153 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-subadd-1.c

Comments

Uros Bizjak June 17, 2021, 9:54 a.m. UTC | #1
On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
>
> This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> thus subtract, add alternating on lanes, starting with subtract.
>
> It adds a corresponding optab and direct internal function,
> vec_subadd$a3 and at the moment to make the i386 backend changes
> "obvious", duplicates the existing avx_addsubv4df3 pattern with
> the new canonical name (CODE_FOR_* and gen_* are used throughout the
> intrinsic code, so the actual change to rename all existing patterns
> will be quite a bit bigger).  I expect some bike-shedding on
> subadd vs. addsub so I delay that change ;)
>
> ARC seems to have both addsub{V2HI,V2SI,V4HI} and
> subadd{V2HI,V2SI,V4HI}.  ARM iwmmxt has wsubaddx on V4HI but
> I didn't dare to decipher whether it's -, + or +, -.  bfin
> has both variants as well plus a saturating variant of both (on v2hi).
> But none of the major vector ISAs besides x86 seem to have it.
> Still some targets having both and opting for the naming I choose
> would make that a better fit IMHO.
>
> Uros/Hontao - would mass-renaming of the x86 addsub patterns
> to subadd be OK?  (not touching the FMA ones which have both)

We can add define_expand (this is how we approached the problem in the
past for x86). If there are not too many of them, then this is the
preferred approach (see below) ...

... until someone comes along and mass-renames everything to a generic name.

Uros.

>
> The SLP pattern matches the exact alternating lane sequence rather
> than trying to be clever and anticipating incoming permutes - we
> could permute the two input vectors to the needed lane alternation,
> do the subadd and then permute the result vector back but that's
> only profitable in case the two input or the output permute will
> vanish - something Tamars refactoring of SLP pattern recog should
> make possible (I hope ;)).
>
> So as said, the patch is incomplete on the x86 backend (and test
> coverage) side, but the vectorizer part should be finalized until
> Tamar posts his work.
>
> Bootstrapped / tested on x86_64-unknown-linux-gnu.
>
> 2021-06-17  Richard Biener  <rguenther@suse.de>
>
>         * config/i386/sse.md (vec_subaddv4df3): Duplicate from
>         avx_addsubv4df3.
>         * internal-fn.def (VEC_SUBADD): New internal optab fn.
>         * optabs.def (vec_subadd_optab): New optab.
>         * tree-vect-slp-patterns.c (class subadd_pattern): New.
>         (slp_patterns): Add subadd_pattern.
>         * tree-vectorizer.h (vect_pattern::vect_pattern): Make
>         m_ops optional.
>
>         * gcc.target/i386/vect-subadd-1.c: New testcase.
> ---
>  gcc/config/i386/sse.md                        | 16 +++
>  gcc/internal-fn.def                           |  1 +
>  gcc/optabs.def                                |  1 +
>  gcc/testsuite/gcc.target/i386/vect-subadd-1.c | 35 +++++++
>  gcc/tree-vect-slp-patterns.c                  | 98 +++++++++++++++++++
>  gcc/tree-vectorizer.h                         |  3 +-
>  6 files changed, 153 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-subadd-1.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 94296bc773b..73238c9874a 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2396,6 +2396,22 @@
>     (set_attr "prefix" "<round_saeonly_scalar_prefix>")
>     (set_attr "mode" "<ssescalarmode>")])
>
> +/* ???  Rename avx_addsubv4df3, but its name is used throughout the backed,
> +   so avoid this churn for the moment.  */
> +(define_insn "vec_subaddv4df3"
> +  [(set (match_operand:V4DF 0 "register_operand" "=x")
> +       (vec_merge:V4DF
> +         (minus:V4DF
> +           (match_operand:V4DF 1 "register_operand" "x")
> +           (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
> +         (plus:V4DF (match_dup 1) (match_dup 2))
> +         (const_int 5)))]
> +  "TARGET_AVX"
> +  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
> +  [(set_attr "type" "sseadd")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "V4DF")])

define_expand "vec_subaddv4df3"
  [(set (match_operand:V4DF 0 "register_operand")
       (vec_merge:V4DF
         (minus:V4DF
           (match_operand:V4DF 1 "register_operand")
           (match_operand:V4DF 2 "nonimmediate_operand"))
         (plus:V4DF (match_dup 1) (match_dup 2))
         (const_int 5)))]
  "TARGET_AVX")

>  (define_insn "avx_addsubv4df3"
>    [(set (match_operand:V4DF 0 "register_operand" "=x")
>         (vec_merge:V4DF
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index b2f414d2131..a4578a0293e 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
> +DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd, binary)
>
>
>  /* FP scales.  */
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..cacbd02cfe1 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
>  OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
>  OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
>  OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
> +OPTAB_D (vec_subadd_optab, "vec_subadd$a3")
>
>  OPTAB_D (sync_add_optab, "sync_add$I$a")
>  OPTAB_D (sync_and_optab, "sync_and$I$a")
> diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> new file mode 100644
> index 00000000000..a79e5ba92e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run { target avx_runtime } } */
> +/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */
> +
> +double x[4], y[4], z[4];
> +void __attribute__((noipa)) foo ()
> +{
> +  x[0] = y[0] - z[0];
> +  x[1] = y[1] + z[1];
> +  x[2] = y[2] - z[2];
> +  x[3] = y[3] + z[3];
> +}
> +void __attribute__((noipa)) bar ()
> +{
> +  x[0] = y[0] + z[0];
> +  x[1] = y[1] - z[1];
> +  x[2] = y[2] + z[2];
> +  x[3] = y[3] - z[3];
> +}
> +int main()
> +{
> +  for (int i = 0; i < 4; ++i)
> +    {
> +      y[i] = i + 1;
> +      z[i] = 2 * i + 1;
> +    }
> +  foo ();
> +  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
> +    __builtin_abort ();
> +  bar ();
> +  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> index 2ed49cd9edc..61732c53ad7 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -1490,6 +1490,103 @@ complex_operations_pattern::build (vec_info * /* vinfo */)
>    gcc_unreachable ();
>  }
>
> +
> +/* The subadd_pattern.  */
> +
> +class subadd_pattern : public vect_pattern
> +{
> +  public:
> +    subadd_pattern (slp_tree *node)
> +       : vect_pattern (node, NULL, IFN_VEC_SUBADD) {};
> +
> +    void build (vec_info *);
> +
> +    static vect_pattern*
> +    recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
> +};
> +
> +vect_pattern *
> +subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
> +{
> +  slp_tree node = *node_;
> +  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> +      || SLP_TREE_CHILDREN (node).length () != 2)
> +    return NULL;
> +
> +  /* Match a blend of a plus and a minus op with the same number of plus and
> +     minus lanes on the same operands.  */
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  bool swapped_p = false;
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    {
> +      std::swap (add, sub);
> +      swapped_p = true;
> +    }
> +  if (!(vect_match_expression_p (add, PLUS_EXPR)
> +       && vect_match_expression_p (sub, MINUS_EXPR)))
> +    return NULL;
> +  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
> +        && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
> +       || (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
> +           && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
> +    return NULL;
> +
> +  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
> +    {
> +      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
> +      if (swapped_p)
> +       perm.first = perm.first == 0 ? 1 : 0;
> +      /* It has to be alternating -, +, -, ...
> +        While we could permute the .SUBADD inputs and the .SUBADD output
> +        that's only profitable over the add + sub + blend if at least
> +        one of the permute is optimized which we can't determine here.  */
> +      if (perm.first != (i & 1)
> +         || perm.second != i)
> +       return NULL;
> +    }
> +
> +  if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node))
> +    return NULL;
> +
> +  return new subadd_pattern (node_);
> +}
> +
> +void
> +subadd_pattern::build (vec_info *vinfo)
> +{
> +  slp_tree node = *m_node;
> +
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    std::swap (add, sub);
> +
> +  /* Modify the blend node in-place.  */
> +  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
> +  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
> +  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
> +  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
> +
> +  /* Build IFN_VEC_SUBADD from the sub representative operands.  */
> +  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
> +  gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2,
> +                                           gimple_assign_rhs1 (rep->stmt),
> +                                           gimple_assign_rhs2 (rep->stmt));
> +  gimple_call_set_lhs (call, make_ssa_name
> +                              (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
> +  gimple_call_set_nothrow (call, true);
> +  gimple_set_bb (call, gimple_bb (rep->stmt));
> +  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
> +  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
> +  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
> +  SLP_TREE_CODE (node) = ERROR_MARK;
> +  SLP_TREE_LANE_PERMUTATION (node).release ();
> +
> +  vect_free_slp_tree (sub);
> +  vect_free_slp_tree (add);
> +}
> +
>  /*******************************************************************************
>   * Pattern matching definitions
>   ******************************************************************************/
> @@ -1502,6 +1599,7 @@ vect_pattern_decl_t slp_patterns[]
>       overlap in what they can detect.  */
>
>    SLP_PATTERN (complex_operations_pattern),
> +  SLP_PATTERN (subadd_pattern)
>  };
>  #undef SLP_PATTERN
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 04c20f8bd0f..b9824623ad9 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2100,7 +2100,8 @@ class vect_pattern
>        this->m_ifn = ifn;
>        this->m_node = node;
>        this->m_ops.create (0);
> -      this->m_ops.safe_splice (*m_ops);
> +      if (m_ops)
> +       this->m_ops.safe_splice (*m_ops);
>      }
>
>    public:
> --
> 2.26.2
Uros Bizjak June 17, 2021, 9:59 a.m. UTC | #2
On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
>
> This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> thus subtract, add alternating on lanes, starting with subtract.
>
> It adds a corresponding optab and direct internal function,
> vec_subadd$a3 and at the moment to make the i386 backend changes
> "obvious", duplicates the existing avx_addsubv4df3 pattern with
> the new canonical name (CODE_FOR_* and gen_* are used throughout the
> intrinsic code, so the actual change to rename all existing patterns
> will be quite a bit bigger).  I expect some bike-shedding on
> subadd vs. addsub so I delay that change ;)

Well, the pattern is called addsub in the x86 world because highpart
does add and lowpart does sub. In left-to-right writing systems
highpart comes before lowpart, so you have addsub.

Uros.
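
For illustration (the intrinsic below is the standard SSE3 one, the wrapper
name is made up): _mm_addsub_pd (a, b) computes { a[0] - b[0], a[1] + b[1] },
i.e. the lowpart lane subtracts and the highpart lane adds, which written
high-to-low reads add, sub - hence "addsub":

#include <pmmintrin.h>

__m128d
addsub_pd_example (__m128d a, __m128d b)
{
  /* Lane 0 (lowpart): a[0] - b[0]; lane 1 (highpart): a[1] + b[1].  */
  return _mm_addsub_pd (a, b);
}
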
Richard Biener June 17, 2021, 10 a.m. UTC | #3
On Thu, 17 Jun 2021, Uros Bizjak wrote:

> On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> > thus subtract, add alternating on lanes, starting with subtract.
> >
> > It adds a corresponding optab and direct internal function,
> > vec_subadd$a3 and at the moment to make the i386 backend changes
> > "obvious", duplicates the existing avx_addsubv4df3 pattern with
> > the new canonical name (CODE_FOR_* and gen_* are used throughout the
> > intrinsic code, so the actual change to rename all existing patterns
> > will be quite a bit bigger).  I expect some bike-shedding on
> > subadd vs. addsub so I delay that change ;)
> >
> > ARC seems to have both addsub{V2HI,V2SI,V4HI} and
> > subadd{V2HI,V2SI,V4HI}.  ARM iwmmxt has wsubaddx on V4HI but
> > I didn't dare to decipher whether it's -, + or +, -.  bfin
> > has both variants as well plus a saturating variant of both (on v2hi).
> > But none of the major vector ISAs besides x86 seem to have it.
> > Still some targets having both and opting for the naming I choose
> > would make that a better fit IMHO.
> >
> > Uros/Hontao - would mass-renaming of the x86 addsub patterns
> > to subadd be OK?  (not touching the FMA ones which have both)
> 
> We can add define_expand (this is how we approached the problem in the
> past for x86). If there are not too many of them, then this is the
> preferred approach (see below) ...

OK.

> ... until someone comes along and mass-renames everything to a generic name.

Though now that I tried it, it isn't so bad for the addsub case (see
patch below).  Would that be preferred, or would it be an incomplete
renaming and thus the define_expand preferred?

Thanks,
Richard.

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index 80c2a2c0294..3222eaedbd0 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
@@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128
 /* AVX */
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 73238c9874a..48818dcf8cb 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2396,8 +2396,6 @@
    (set_attr "prefix" "<round_saeonly_scalar_prefix>")
    (set_attr "mode" "<ssescalarmode>")])
 
-/* ???  Rename avx_addsubv4df3, but its name is used throughout the backed,
-   so avoid this churn for the moment.  */
 (define_insn "vec_subaddv4df3"
   [(set (match_operand:V4DF 0 "register_operand" "=x")
 	(vec_merge:V4DF
@@ -2412,21 +2410,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V4DF")])
 
-(define_insn "avx_addsubv4df3"
-  [(set (match_operand:V4DF 0 "register_operand" "=x")
-	(vec_merge:V4DF
-	  (minus:V4DF
-	    (match_operand:V4DF 1 "register_operand" "x")
-	    (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
-	  (plus:V4DF (match_dup 1) (match_dup 2))
-	  (const_int 5)))]
-  "TARGET_AVX"
-  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "type" "sseadd")
-   (set_attr "prefix" "vex")
-   (set_attr "mode" "V4DF")])
-
-(define_insn "sse3_addsubv2df3"
+(define_insn "vec_subaddv2df3"
   [(set (match_operand:V2DF 0 "register_operand" "=x,x")
 	(vec_merge:V2DF
 	  (minus:V2DF
@@ -2444,7 +2428,7 @@
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "V2DF")])
 
-(define_insn "avx_addsubv8sf3"
+(define_insn "vec_subaddv8sf3"
   [(set (match_operand:V8SF 0 "register_operand" "=x")
 	(vec_merge:V8SF
 	  (minus:V8SF
@@ -2458,7 +2442,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V8SF")])
 
-(define_insn "sse3_addsubv4sf3"
+(define_insn "vec_subaddv4sf3"
   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
 	(vec_merge:V4SF
 	  (minus:V4SF
Uros Bizjak June 17, 2021, 10:09 a.m. UTC | #4
On Thu, Jun 17, 2021 at 12:00 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 17 Jun 2021, Uros Bizjak wrote:
>
> > On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> > > thus subtract, add alternating on lanes, starting with subtract.
> > >
> > > It adds a corresponding optab and direct internal function,
> > > vec_subadd$a3 and at the moment to make the i386 backend changes
> > > "obvious", duplicates the existing avx_addsubv4df3 pattern with
> > > the new canonical name (CODE_FOR_* and gen_* are used throughout the
> > > intrinsic code, so the actual change to rename all existing patterns
> > > will be quite a bit bigger).  I expect some bike-shedding on
> > > subadd vs. addsub so I delay that change ;)
> > >
> > > ARC seems to have both addsub{V2HI,V2SI,V4HI} and
> > > subadd{V2HI,V2SI,V4HI}.  ARM iwmmxt has wsubaddx on V4HI but
> > > I didn't dare to decipher whether it's -, + or +, -.  bfin
> > > has both variants as well plus a saturating variant of both (on v2hi).
> > > But none of the major vector ISAs besides x86 seem to have it.
> > > Still some targets having both and opting for the naming I choose
> > > would make that a better fit IMHO.
> > >
> > > Uros/Hontao - would mass-renaming of the x86 addsub patterns
> > > to subadd be OK?  (not touching the FMA ones which have both)
> >
> > We can add define_expand (this is how we approached the problem in the
> > past for x86). If there are not too many of them, then this is the
> > preferred approach (see below) ...
>
> OK.
>
> > ... until someone comes along and mass-renames everything to a generic name.
>
> Though now that I tried it isn't so bad for the addsub case (see
> patch below).  Would that be prefered or would it be an incomplete
> renaming and thus the define_expand prefered?

The approach below is OK, too. If you rename the patterns to addsub
(see my previous mail for reasoning), then the rename is fully
preferred as far as x86 is concerned.

Thanks,
Uros.

> Thanks,
> Richard.
>
> diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
> index 80c2a2c0294..3222eaedbd0 100644
> --- a/gcc/config/i386/i386-builtin.def
> +++ b/gcc/config/i386/i386-builtin.def
> @@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__
>  BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
>  BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
>
> -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
> -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
> +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
> +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
>  BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
>  BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
>  BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
> @@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128
>  /* AVX */
>  BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
>  BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
> -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
> -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
> +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
> +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
>  BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
>  BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
>  BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 73238c9874a..48818dcf8cb 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2396,8 +2396,6 @@
>     (set_attr "prefix" "<round_saeonly_scalar_prefix>")
>     (set_attr "mode" "<ssescalarmode>")])
>
> -/* ???  Rename avx_addsubv4df3, but its name is used throughout the backed,
> -   so avoid this churn for the moment.  */
>  (define_insn "vec_subaddv4df3"
>    [(set (match_operand:V4DF 0 "register_operand" "=x")
>         (vec_merge:V4DF
> @@ -2412,21 +2410,7 @@
>     (set_attr "prefix" "vex")
>     (set_attr "mode" "V4DF")])
>
> -(define_insn "avx_addsubv4df3"
> -  [(set (match_operand:V4DF 0 "register_operand" "=x")
> -       (vec_merge:V4DF
> -         (minus:V4DF
> -           (match_operand:V4DF 1 "register_operand" "x")
> -           (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
> -         (plus:V4DF (match_dup 1) (match_dup 2))
> -         (const_int 5)))]
> -  "TARGET_AVX"
> -  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
> -  [(set_attr "type" "sseadd")
> -   (set_attr "prefix" "vex")
> -   (set_attr "mode" "V4DF")])
> -
> -(define_insn "sse3_addsubv2df3"
> +(define_insn "vec_subaddv2df3"
>    [(set (match_operand:V2DF 0 "register_operand" "=x,x")
>         (vec_merge:V2DF
>           (minus:V2DF
> @@ -2444,7 +2428,7 @@
>     (set_attr "prefix" "orig,vex")
>     (set_attr "mode" "V2DF")])
>
> -(define_insn "avx_addsubv8sf3"
> +(define_insn "vec_subaddv8sf3"
>    [(set (match_operand:V8SF 0 "register_operand" "=x")
>         (vec_merge:V8SF
>           (minus:V8SF
> @@ -2458,7 +2442,7 @@
>     (set_attr "prefix" "vex")
>     (set_attr "mode" "V8SF")])
>
> -(define_insn "sse3_addsubv4sf3"
> +(define_insn "vec_subaddv4sf3"
>    [(set (match_operand:V4SF 0 "register_operand" "=x,x")
>         (vec_merge:V4SF
>           (minus:V4SF
Tamar Christina June 17, 2021, 10:24 a.m. UTC | #5
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Thursday, June 17, 2021 10:45 AM
> To: gcc-patches@gcc.gnu.org
> Cc: hongtao.liu@intel.com; ubizjak@gmail.com; Tamar Christina
> <Tamar.Christina@arm.com>
> Subject: [PATCH][RFC] Add x86 subadd SLP pattern
> 
> This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... } thus subtract,
> add alternating on lanes, starting with subtract.
> 
> It adds a corresponding optab and direct internal function,
> vec_subadd$a3 and at the moment to make the i386 backend changes
> "obvious", duplicates the existing avx_addsubv4df3 pattern with the new
> canonical name (CODE_FOR_* and gen_* are used throughout the intrinsic
> code, so the actual change to rename all existing patterns will be quite a bit
> bigger).  I expect some bike-shedding on subadd vs. addsub so I delay that
> change ;)
> 
> ARC seems to have both addsub{V2HI,V2SI,V4HI} and
> subadd{V2HI,V2SI,V4HI}.  ARM iwmmxt has wsubaddx on V4HI but I didn't
> dare to decipher whether it's -, + or +, -.  bfin has both variants as well plus a
> saturating variant of both (on v2hi).
> But none of the major vector ISAs besides x86 seem to have it.
> Still some targets having both and opting for the naming I choose would
> make that a better fit IMHO.

In terms of internal_fn I think we'll need both; whatever the target expands to is
up to it.  But the reason for both is that when you get to the bigger cases, the difference
between ADDSUB and SUBADD determines whether it's a conjugate or not
(I believe x86 has those with fmaddsub and fmsubadd).  The rework then simply
matches MUL + ADDSUB or SUBADD and never needs to look further.
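
A scalar sketch of that point (function names made up for the example, not
from the patch): for interleaved { re, im } data a complex multiply wants
sub on the even lane and add on the odd lane, while multiplying by the
conjugate flips this to add, sub - and that flip is exactly what
distinguishes the two alternation patterns:

/* (a + bi) * (c + di) = (ac - bd) + (ad + bc)i  -> sub even, add odd.  */
void cmul (const double *x, const double *y, double *res)
{
  res[0] = x[0] * y[0] - x[1] * y[1];
  res[1] = x[0] * y[1] + x[1] * y[0];
}

/* (a + bi) * (c - di) = (ac + bd) + (bc - ad)i  -> add even, sub odd.  */
void cmul_conj (const double *x, const double *y, double *res)
{
  res[0] = x[0] * y[0] + x[1] * y[1];
  res[1] = x[1] * y[0] - x[0] * y[1];
}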

This keeps stacking, to more complicated sequences like complex dot product.

> 
> Uros/Hontao - would mass-renaming of the x86 addsub patterns to subadd
> be OK?  (not touching the FMA ones which have both)
> 
> The SLP pattern matches the exact alternating lane sequence rather than
> trying to be clever and anticipating incoming permutes - we could permute
> the two input vectors to the needed lane alternation, do the subadd and
> then permute the result vector back but that's only profitable in case the two
> input or the output permute will vanish - something Tamars refactoring of
> SLP pattern recog should make possible (I hope ;)).

It will; it pushes it to optimize_slp, which will allow you to decide what to do.
As part of this it allows additional transformations.  So atm this is where I do
ADDSUB -> COMPLEX_ADD depending on the permute it sees.

Regards,
Tamar

> 
> So as said, the patch is incomplete on the x86 backend (and test
> coverage) side, but the vectorizer part should be finalized until Tamar posts
> his work.
> 
> Bootstrapped / tested on x86_64-unknown-linux-gnu.
> 
> 2021-06-17  Richard Biener  <rguenther@suse.de>
> 
> 	* config/i386/sse.md (vec_subaddv4df3): Duplicate from
> 	avx_addsubv4df3.
> 	* internal-fn.def (VEC_SUBADD): New internal optab fn.
> 	* optabs.def (vec_subadd_optab): New optab.
> 	* tree-vect-slp-patterns.c (class subadd_pattern): New.
> 	(slp_patterns): Add subadd_pattern.
> 	* tree-vectorizer.h (vect_pattern::vect_pattern): Make
> 	m_ops optional.
> 
> 	* gcc.target/i386/vect-subadd-1.c: New testcase.
> ---
>  gcc/config/i386/sse.md                        | 16 +++
>  gcc/internal-fn.def                           |  1 +
>  gcc/optabs.def                                |  1 +
>  gcc/testsuite/gcc.target/i386/vect-subadd-1.c | 35 +++++++
>  gcc/tree-vect-slp-patterns.c                  | 98 +++++++++++++++++++
>  gcc/tree-vectorizer.h                         |  3 +-
>  6 files changed, 153 insertions(+), 1 deletion(-)  create mode 100644
> gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index
> 94296bc773b..73238c9874a 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2396,6 +2396,22 @@
>     (set_attr "prefix" "<round_saeonly_scalar_prefix>")
>     (set_attr "mode" "<ssescalarmode>")])
> 
> +/* ???  Rename avx_addsubv4df3, but its name is used throughout the
> backed,
> +   so avoid this churn for the moment.  */ (define_insn
> +"vec_subaddv4df3"
> +  [(set (match_operand:V4DF 0 "register_operand" "=x")
> +	(vec_merge:V4DF
> +	  (minus:V4DF
> +	    (match_operand:V4DF 1 "register_operand" "x")
> +	    (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
> +	  (plus:V4DF (match_dup 1) (match_dup 2))
> +	  (const_int 5)))]
> +  "TARGET_AVX"
> +  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
> +  [(set_attr "type" "sseadd")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "V4DF")])
> +
>  (define_insn "avx_addsubv4df3"
>    [(set (match_operand:V4DF 0 "register_operand" "=x")
>  	(vec_merge:V4DF
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> b2f414d2131..a4578a0293e 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90,
> ECF_CONST, cadd90, binary)  DEF_INTERNAL_OPTAB_FN
> (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
> DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj,
> binary)
> +DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd,
> binary)
> 
> 
>  /* FP scales.  */
> diff --git a/gcc/optabs.def b/gcc/optabs.def index b192a9d070b..cacbd02cfe1
> 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab,
> "vec_widen_usubl_hi_$a")  OPTAB_D (vec_widen_usubl_lo_optab,
> "vec_widen_usubl_lo_$a")  OPTAB_D (vec_widen_uaddl_hi_optab,
> "vec_widen_uaddl_hi_$a")  OPTAB_D (vec_widen_uaddl_lo_optab,
> "vec_widen_uaddl_lo_$a")
> +OPTAB_D (vec_subadd_optab, "vec_subadd$a3")
> 
>  OPTAB_D (sync_add_optab, "sync_add$I$a")  OPTAB_D (sync_and_optab,
> "sync_and$I$a") diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> new file mode 100644
> index 00000000000..a79e5ba92e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run { target avx_runtime } } */
> +/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */
> +
> +double x[4], y[4], z[4];
> +void __attribute__((noipa)) foo ()
> +{
> +  x[0] = y[0] - z[0];
> +  x[1] = y[1] + z[1];
> +  x[2] = y[2] - z[2];
> +  x[3] = y[3] + z[3];
> +}
> +void __attribute__((noipa)) bar ()
> +{
> +  x[0] = y[0] + z[0];
> +  x[1] = y[1] - z[1];
> +  x[2] = y[2] + z[2];
> +  x[3] = y[3] - z[3];
> +}
> +int main()
> +{
> +  for (int i = 0; i < 4; ++i)
> +    {
> +      y[i] = i + 1;
> +      z[i] = 2 * i + 1;
> +    }
> +  foo ();
> +  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
> +    __builtin_abort ();
> +  bar ();
> +  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c index
> 2ed49cd9edc..61732c53ad7 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -1490,6 +1490,103 @@ complex_operations_pattern::build (vec_info *
> /* vinfo */)
>    gcc_unreachable ();
>  }
> 
> +
> +/* The subadd_pattern.  */
> +
> +class subadd_pattern : public vect_pattern {
> +  public:
> +    subadd_pattern (slp_tree *node)
> +	: vect_pattern (node, NULL, IFN_VEC_SUBADD) {};
> +
> +    void build (vec_info *);
> +
> +    static vect_pattern*
> +    recognize (slp_tree_to_load_perm_map_t *, slp_tree *); };
> +
> +vect_pattern *
> +subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree
> +*node_) {
> +  slp_tree node = *node_;
> +  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> +      || SLP_TREE_CHILDREN (node).length () != 2)
> +    return NULL;
> +
> +  /* Match a blend of a plus and a minus op with the same number of plus
> and
> +     minus lanes on the same operands.  */
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  bool swapped_p = false;
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    {
> +      std::swap (add, sub);
> +      swapped_p = true;
> +    }
> +  if (!(vect_match_expression_p (add, PLUS_EXPR)
> +	&& vect_match_expression_p (sub, MINUS_EXPR)))
> +    return NULL;
> +  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
> +	 && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
> +	|| (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
> +	    && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN
> (add)[0])))
> +    return NULL;
> +
> +  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length ();
> ++i)
> +    {
> +      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION
> (node)[i];
> +      if (swapped_p)
> +	perm.first = perm.first == 0 ? 1 : 0;
> +      /* It has to be alternating -, +, -, ...
> +	 While we could permute the .SUBADD inputs and the .SUBADD
> output
> +	 that's only profitable over the add + sub + blend if at least
> +	 one of the permute is optimized which we can't determine here.  */
> +      if (perm.first != (i & 1)
> +	  || perm.second != i)
> +	return NULL;
> +    }
> +
> +  if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node))
> +    return NULL;
> +
> +  return new subadd_pattern (node_);
> +}
> +
> +void
> +subadd_pattern::build (vec_info *vinfo) {
> +  slp_tree node = *m_node;
> +
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];  slp_tree add =
> + SLP_TREE_CHILDREN (node)[1];  if (vect_match_expression_p (sub,
> + PLUS_EXPR))
> +    std::swap (add, sub);
> +
> +  /* Modify the blend node in-place.  */  SLP_TREE_CHILDREN (node)[0] =
> + SLP_TREE_CHILDREN (sub)[0];  SLP_TREE_CHILDREN (node)[1] =
> + SLP_TREE_CHILDREN (sub)[1];  SLP_TREE_REF_COUNT
> (SLP_TREE_CHILDREN
> + (node)[0])++;  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
> +
> +  /* Build IFN_VEC_SUBADD from the sub representative operands.  */
> +  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
> +  gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2,
> +					    gimple_assign_rhs1 (rep->stmt),
> +					    gimple_assign_rhs2 (rep->stmt));
> +  gimple_call_set_lhs (call, make_ssa_name
> +			       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
> +  gimple_call_set_nothrow (call, true);
> +  gimple_set_bb (call, gimple_bb (rep->stmt));
> +  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
> +  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) =
> +vect_used_in_scope;
> +  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
> +  SLP_TREE_CODE (node) = ERROR_MARK;
> +  SLP_TREE_LANE_PERMUTATION (node).release ();
> +
> +  vect_free_slp_tree (sub);
> +  vect_free_slp_tree (add);
> +}
> +
> 
> /**********************************************************
> *********************
>   * Pattern matching definitions
> 
> **********************************************************
> ********************/
> @@ -1502,6 +1599,7 @@ vect_pattern_decl_t slp_patterns[]
>       overlap in what they can detect.  */
> 
>    SLP_PATTERN (complex_operations_pattern),
> +  SLP_PATTERN (subadd_pattern)
>  };
>  #undef SLP_PATTERN
> 
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index
> 04c20f8bd0f..b9824623ad9 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2100,7 +2100,8 @@ class vect_pattern
>        this->m_ifn = ifn;
>        this->m_node = node;
>        this->m_ops.create (0);
> -      this->m_ops.safe_splice (*m_ops);
> +      if (m_ops)
> +	this->m_ops.safe_splice (*m_ops);
>      }
> 
>    public:
> --
> 2.26.2
Richard Biener June 17, 2021, 10:43 a.m. UTC | #6
On Thu, 17 Jun 2021, Uros Bizjak wrote:

> On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> > thus subtract, add alternating on lanes, starting with subtract.
> >
> > It adds a corresponding optab and direct internal function,
> > vec_subadd$a3 and at the moment to make the i386 backend changes
> > "obvious", duplicates the existing avx_addsubv4df3 pattern with
> > the new canonical name (CODE_FOR_* and gen_* are used throughout the
> > intrinsic code, so the actual change to rename all existing patterns
> > will be quite a bit bigger).  I expect some bike-shedding on
> > subadd vs. addsub so I delay that change ;)
> 
> Well, the pattern is called addsub in the x86 world because highpart
> does add and lowpart does sub. In left-to-right writing systems
> highpart comes before lowpart, so you have addsub.

The other targets mentioned do not seem to agree but I can live
with that, thus I'll change back to addsub.

Richard.
Richard Biener June 17, 2021, 1:47 p.m. UTC | #7
On Thu, 17 Jun 2021, Richard Biener wrote:

> On Thu, 17 Jun 2021, Uros Bizjak wrote:
> 
> > On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> > > thus subtract, add alternating on lanes, starting with subtract.
> > >
> > > It adds a corresponding optab and direct internal function,
> > > vec_subadd$a3 and at the moment to make the i386 backend changes
> > > "obvious", duplicates the existing avx_addsubv4df3 pattern with
> > > the new canonical name (CODE_FOR_* and gen_* are used throughout the
> > > intrinsic code, so the actual change to rename all existing patterns
> > > will be quite a bit bigger).  I expect some bike-shedding on
> > > subadd vs. addsub so I delay that change ;)
> > 
> > Well, the pattern is called addsub in the x86 world because highpart
> > does add and lowpart does sub. In left-to-right writing systems
> > highpart comes before lowpart, so you have addsub.
> 
> The other targets mentioned do not seem to agree but I can live
> with that, thus I'll change back to addsub.

For reference, the following is what passed bootstrap & regtest on 
x86_64-unknown-linux-gnu.  I'm likely going to wait for Tamar a bit
as I expect the code needs refactoring for the pending changes.

I'm also looking into merging the patterns (you might have seen
the other mail about this) and into getting rid of the UNSPEC
in the fmaddsub patterns.

Richard.

From ddc804a832dec0e1b702e1a80c7cdd6b79ebe1a9 Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Mon, 31 May 2021 13:19:01 +0200
Subject: [PATCH] Add x86 addsub SLP pattern
To: gcc-patches@gcc.gnu.org

This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
i.e. subtract and add alternating across lanes, starting with subtract.

It adds a corresponding optab and direct internal function,
vec_addsub$a3 and renames the existing i386 backend patterns to
the new canonical name.

The SLP pattern matches the exact alternating lane sequence rather
than trying to be clever and anticipating incoming permutes - we
could permute the two input vectors to the needed lane alternation,
do the addsub and then permute the result vector back but that's
only profitable in case the two input permutes or the output permute
will vanish - something Tamar's refactoring of SLP pattern recog should
make possible.

2021-06-17  Richard Biener  <rguenther@suse.de>

	* config/i386/sse.md (avx_addsubv4df3): Rename to
	vec_addsubv4df3.
	(avx_addsubv8sf3): Rename to vec_addsubv8sf3.
	(sse3_addsubv2df3): Rename to vec_addsubv2df3.
	(sse3_addsubv4sf3): Rename to vec_addsubv4sf3.
	* internal-fn.def (VEC_ADDSUB): New internal optab fn.
	* optabs.def (vec_addsub_optab): New optab.
	* tree-vect-slp-patterns.c (class addsub_pattern): New.
	(slp_patterns): Add addsub_pattern.
	* tree-vectorizer.h (vect_pattern::vect_pattern): Make
	m_ops optional.
	* doc/md.texi (vec_addsub<mode>3): Document.

	* gcc.target/i386/vect-addsubv2df.c: New testcase.
	* gcc.target/i386/vect-addsubv4sf.c: Likewise.
	* gcc.target/i386/vect-addsubv4df.c: Likewise.
	* gcc.target/i386/vect-addsubv8sf.c: Likewise.
---
 gcc/config/i386/i386-builtin.def              |   8 +-
 gcc/config/i386/sse.md                        |   8 +-
 gcc/doc/md.texi                               |   8 ++
 gcc/internal-fn.def                           |   1 +
 gcc/optabs.def                                |   1 +
 .../gcc.target/i386/vect-addsubv2df.c         |  42 ++++++++
 .../gcc.target/i386/vect-addsubv4df.c         |  36 +++++++
 .../gcc.target/i386/vect-addsubv4sf.c         |  46 ++++++++
 .../gcc.target/i386/vect-addsubv8sf.c         |  46 ++++++++
 gcc/tree-vect-slp-patterns.c                  | 100 ++++++++++++++++++
 gcc/tree-vectorizer.h                         |   3 +-
 11 files changed, 290 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index 80c2a2c0294..30719e5e4bd 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
@@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128
 /* AVX */
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 94296bc773b..038e2cd90d4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2396,7 +2396,7 @@
    (set_attr "prefix" "<round_saeonly_scalar_prefix>")
    (set_attr "mode" "<ssescalarmode>")])
 
-(define_insn "avx_addsubv4df3"
+(define_insn "vec_addsubv4df3"
   [(set (match_operand:V4DF 0 "register_operand" "=x")
 	(vec_merge:V4DF
 	  (minus:V4DF
@@ -2410,7 +2410,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V4DF")])
 
-(define_insn "sse3_addsubv2df3"
+(define_insn "vec_addsubv2df3"
   [(set (match_operand:V2DF 0 "register_operand" "=x,x")
 	(vec_merge:V2DF
 	  (minus:V2DF
@@ -2428,7 +2428,7 @@
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "V2DF")])
 
-(define_insn "avx_addsubv8sf3"
+(define_insn "vec_addsubv8sf3"
   [(set (match_operand:V8SF 0 "register_operand" "=x")
 	(vec_merge:V8SF
 	  (minus:V8SF
@@ -2442,7 +2442,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V8SF")])
 
-(define_insn "sse3_addsubv4sf3"
+(define_insn "vec_addsubv4sf3"
   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
 	(vec_merge:V4SF
 	  (minus:V4SF
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 00caf3844cc..1b918144330 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5682,6 +5682,14 @@ signed/unsigned elements of size S@.  Subtract the high/low elements of 2 from
 1 and widen the resulting elements. Put the N/2 results of size 2*S in the
 output vector (operand 0).
 
+@cindex @code{vec_addsub@var{m}3} instruction pattern
+@item @samp{vec_addsub@var{m}3}
+Alternating subtract, add with even lanes doing subtract and odd
+lanes doing addition.  Operands 1 and 2 and the output operand are vectors
+with mode @var{m}.
+
+These instructions are not allowed to @code{FAIL}.
+
 @cindex @code{mulhisi3} instruction pattern
 @item @samp{mulhisi3}
 Multiply operands 1 and 2, which have mode @code{HImode}, and store
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index b2f414d2131..c3b8e730960 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
 
 
 /* FP scales.  */
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..41ab2598eb6 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
 OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
 OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
 OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
+OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
 
 OPTAB_D (sync_add_optab, "sync_add$I$a")
 OPTAB_D (sync_and_optab, "sync_and$I$a")
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
new file mode 100644
index 00000000000..547485d5519
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+double x[2], y[2], z[2];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+}
+static void
+TEST (void)
+{
+  for (int i = 0; i < 2; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
new file mode 100644
index 00000000000..e0a1b3d9d00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
@@ -0,0 +1,36 @@
+/* { dg-do run { target avx_runtime } } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */
+
+double x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+int main()
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
new file mode 100644
index 00000000000..b524f0c35a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+float x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+static void
+TEST (void)
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c
new file mode 100644
index 00000000000..0eed33b6531
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c
@@ -0,0 +1,46 @@
+/* { dg-do run { target avx_runtime } } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */
+
+float x[8], y[8], z[8];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+  x[4] = y[4] - z[4];
+  x[5] = y[5] + z[5];
+  x[6] = y[6] - z[6];
+  x[7] = y[7] + z[7];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+  x[4] = y[4] + z[4];
+  x[5] = y[5] - z[5];
+  x[6] = y[6] + z[6];
+  x[7] = y[7] - z[7];
+}
+int main()
+{
+  for (int i = 0; i < 8; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11
+      || x[4] != -4 || x[5] != 17 || x[6] != -6 || x[7] != 23)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3
+      || x[4] != 14 || x[5] != -5 || x[6] != 20 || x[7] != -7)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2ed49cd9edc..d536494a1bd 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1490,6 +1490,105 @@ complex_operations_pattern::build (vec_info * /* vinfo */)
   gcc_unreachable ();
 }
 
+
+/* The addsub_pattern.  */
+
+class addsub_pattern : public vect_pattern
+{
+  public:
+    addsub_pattern (slp_tree *node)
+	: vect_pattern (node, NULL, IFN_VEC_ADDSUB) {};
+
+    void build (vec_info *);
+
+    static vect_pattern*
+    recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
+};
+
+vect_pattern *
+addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
+{
+  slp_tree node = *node_;
+  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+      || SLP_TREE_CHILDREN (node).length () != 2)
+    return NULL;
+
+  /* Match a blend of a plus and a minus op with the same number of plus and
+     minus lanes on the same operands.  */
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  bool swapped_p = false;
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    {
+      std::swap (add, sub);
+      swapped_p = true;
+    }
+  if (!(vect_match_expression_p (add, PLUS_EXPR)
+	&& vect_match_expression_p (sub, MINUS_EXPR)))
+    return NULL;
+  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
+	 && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
+	|| (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
+	    && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
+    return NULL;
+
+  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
+    {
+      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
+      if (swapped_p)
+	perm.first = perm.first == 0 ? 1 : 0;
+      /* It has to be alternating -, +, -, ...
+	 While we could permute the .ADDSUB inputs and the .ADDSUB output
+	 that's only profitable over the add + sub + blend if at least
+	 one of the permute is optimized which we can't determine here.  */
+      if (perm.first != (i & 1)
+	  || perm.second != i)
+	return NULL;
+    }
+
+  if (!vect_pattern_validate_optab (IFN_VEC_ADDSUB, node))
+    return NULL;
+
+  return new addsub_pattern (node_);
+}
+
+void
+addsub_pattern::build (vec_info *vinfo)
+{
+  slp_tree node = *m_node;
+
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    std::swap (add, sub);
+
+  /* Modify the blend node in-place.  */
+  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
+  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
+
+  /* Build IFN_VEC_ADDSUB from the sub representative operands.  */
+  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
+  gcall *call = gimple_build_call_internal (IFN_VEC_ADDSUB, 2,
+					    gimple_assign_rhs1 (rep->stmt),
+					    gimple_assign_rhs2 (rep->stmt));
+  gimple_call_set_lhs (call, make_ssa_name
+			       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
+  gimple_call_set_nothrow (call, true);
+  gimple_set_bb (call, gimple_bb (rep->stmt));
+  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
+  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
+  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
+  STMT_VINFO_VECTYPE (SLP_TREE_REPRESENTATIVE (node)) = SLP_TREE_VECTYPE (node);
+  STMT_VINFO_SLP_VECT_ONLY_PATTERN (SLP_TREE_REPRESENTATIVE (node)) = true;
+  SLP_TREE_CODE (node) = ERROR_MARK;
+  SLP_TREE_LANE_PERMUTATION (node).release ();
+
+  vect_free_slp_tree (sub);
+  vect_free_slp_tree (add);
+}
+
 /*******************************************************************************
  * Pattern matching definitions
  ******************************************************************************/
@@ -1502,6 +1601,7 @@ vect_pattern_decl_t slp_patterns[]
      overlap in what they can detect.  */
 
   SLP_PATTERN (complex_operations_pattern),
+  SLP_PATTERN (addsub_pattern)
 };
 #undef SLP_PATTERN
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 04c20f8bd0f..b9824623ad9 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2100,7 +2100,8 @@ class vect_pattern
       this->m_ifn = ifn;
       this->m_node = node;
       this->m_ops.create (0);
-      this->m_ops.safe_splice (*m_ops);
+      if (m_ops)
+	this->m_ops.safe_splice (*m_ops);
     }
 
   public:
Richard Sandiford June 22, 2021, 9:42 a.m. UTC | #8
Richard Biener <rguenther@suse.de> writes:
> On Thu, 17 Jun 2021, Uros Bizjak wrote:
>
>> On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
>> >
>> > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
>> > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
>> > thus subtract, add alternating on lanes, starting with subtract.
>> >
>> > It adds a corresponding optab and direct internal function,
>> > vec_subadd$a3 and at the moment to make the i386 backend changes
>> > "obvious", duplicates the existing avx_addsubv4df3 pattern with
>> > the new canonical name (CODE_FOR_* and gen_* are used throughout the
>> > intrinsic code, so the actual change to rename all existing patterns
>> > will be quite a bit bigger).  I expect some bike-shedding on
>> > subadd vs. addsub so I delay that change ;)
>> 
>> Well, the pattern is called addsub in the x86 world because highpart
>> does add and lowpart does sub. In left-to-right writing systems
>> highpart comes before lowpart, so you have addsub.
>
> The other targets mentioned do not seem to agree but I can live
> with that, thus I'll change back to addsub.

FWIW, subadd sounds clearer to me too.  It seems surprising to put
imaginary before real when interpreting something as complex, for example.

Putting the highpart first feels especially odd on an LE system like x86…

Richard
Uros Bizjak June 22, 2021, 10:17 a.m. UTC | #9
On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:

> >> Well, the pattern is called addsub in the x86 world because highpart
> >> does add and lowpart does sub. In left-to-right writing systems
> >> highpart comes before lowpart, so you have addsub.
> >
> > The other targets mentioned do not seem to agree but I can live
> > with that, thus I'll change back to addsub.
>
> FWIW, subadd sounds clearer to me too.  It seems surprising to put
> imaginary before real when interpreting something as complex, for example.
>
> Putting the highpart first feels especially odd on an LE system like x86…

The XMM vector is documented left to right with MSB at the left (c.f.
most significant *DIGIT* of the number at the left)

xmm[MSB] .... xmm[LSB]

so, looking at x86 ADDSUBPD insn documentation:

xmm2[127:64] xmm2[63:0]
( + -)
xmm1[127:64] xmm1[63:0]
(=)
xmm1[127:64] holds ADD
xmm1[63:0] holds SUB

xmm1[127:64] xmm1[63:0]
ADD SUB
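
To make the lane convention concrete, a minimal standalone C sketch of the
same behaviour via the SSE3 intrinsic (an illustration only, not part of
any patch in this thread; it assumes <pmmintrin.h> and compiling with
-msse3):

#include <pmmintrin.h>
#include <stdio.h>

int
main (void)
{
  __m128d a = _mm_set_pd (2.0, 1.0);   /* a[1] = 2.0 (bits 127:64), a[0] = 1.0 (bits 63:0) */
  __m128d b = _mm_set_pd (10.0, 20.0); /* b[1] = 10.0, b[0] = 20.0 */
  /* addsubpd: r[0] = a[0] - b[0], r[1] = a[1] + b[1].  */
  __m128d r = _mm_addsub_pd (a, b);
  double out[2];
  _mm_storeu_pd (out, r);
  printf ("%g %g\n", out[0], out[1]); /* prints -19 12 */
  return 0;
}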

Uros.
Richard Biener June 22, 2021, 10:34 a.m. UTC | #10
On Tue, 22 Jun 2021, Uros Bizjak wrote:

> On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> 
> > >> Well, the pattern is called addsub in the x86 world because highpart
> > >> does add and lowpart does sub. In left-to-right writing systems
> > >> highpart comes before lowpart, so you have addsub.
> > >
> > > The other targets mentioned do not seem to agree but I can live
> > > with that, thus I'll change back to addsub.
> >
> > FWIW, subadd sounds clearer to me too.  It seems surprising to put
> > imaginary before real when interpreting something as complex, for example.
> >
> > Putting the highpart first feels especially odd on an LE system like x86…
> 
> The XMM vector is documented left to right with MSB at the left (c.f.
> most significant *DIGIT* of the number at the left)
> 
> xmm[MSB] .... xmm[LSB]
> 
> so, looking at x86 ADDSUBPD insn documentation:
> 
> xmm2[127:64] xmm2[63:0]
> ( + -)
> xmm1[127:64] xmm1[63:0]
> (=)
> xmm1[127:64] holds ADD
> xmm1[63:0] holds SUB
> 
> xmm1[127:64] xmm1 [63:0]
> ADD SUB

I think if we really want to resolve the "ambiguity" we have to
spell it out in the optab.  vec_addodd_subeven or vec_addsub_oddeven.
As I noted there are targets who have the opposite so we could
then add vec_addsub_evenodd (not vec_subadd_evenodd).

Just tell me what you prefer - I'll adjust the patch accordingly.

Richard.
Uros Bizjak June 22, 2021, 10:49 a.m. UTC | #11
On Tue, Jun 22, 2021 at 12:34 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 22 Jun 2021, Uros Bizjak wrote:
>
> > On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >
> > > >> Well, the pattern is called addsub in the x86 world because highpart
> > > >> does add and lowpart does sub. In left-to-right writing systems
> > > >> highpart comes before lowpart, so you have addsub.
> > > >
> > > > The other targets mentioned do not seem to agree but I can live
> > > > with that, thus I'll change back to addsub.
> > >
> > > FWIW, subadd sounds clearer to me too.  It seems surprising to put
> > > imaginary before real when interpreting something as complex, for example.
> > >
> > > Putting the highpart first feels especially odd on an LE system like x86…
> >
> > The XMM vector is documented left to right with MSB at the left (c.f.
> > most significant *DIGIT* of the number at the left)
> >
> > xmm[MSB] .... xmm[LSB]
> >
> > so, looking at x86 ADDSUBPD insn documentation:
> >
> > xmm2[127:64] xmm2[63:0]
> > ( + -)
> > xmm1[127:64] xmm1[63:0]
> > (=)
> > xmm1[127:64] holds ADD
> > xmm1[63:0] holds SUB
> >
> > xmm1[127:64] xmm1 [63:0]
> > ADD SUB
>
> I think if we really want to resolve the "ambiguity" we have to
> spell it out in the optab.  vec_addodd_subeven or vec_addsub_oddeven.
> As I noted there are targets who have the opposite so we could
> then add vec_addsub_evenodd (not vec_subadd_evenodd).
>
> Just tell me what you prefer - I'll adjust the patch accordingly.

I'd use addsub when add comes to the left of sub and the MSB is also
considered to be at the left, and subadd when sub comes to the left of
add, with the MSB again at the left.

I think that the documentation should clear the ambiguity.

Otherwise, just my 0.02€, I don't want to bikeshed about this ad
infinitum, so I'll stop there.

Uros.
Richard Biener June 24, 2021, 11:07 a.m. UTC | #12
On Tue, 22 Jun 2021, Uros Bizjak wrote:

> On Tue, Jun 22, 2021 at 12:34 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Tue, 22 Jun 2021, Uros Bizjak wrote:
> >
> > > On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >
> > > > >> Well, the pattern is called addsub in the x86 world because highpart
> > > > >> does add and lowpart does sub. In left-to-right writing systems
> > > > >> highpart comes before lowpart, so you have addsub.
> > > > >
> > > > > The other targets mentioned do not seem to agree but I can live
> > > > > with that, thus I'll change back to addsub.
> > > >
> > > > FWIW, subadd sounds clearer to me too.  It seems surprising to put
> > > > imaginary before real when interpreting something as complex, for example.
> > > >
> > > > Putting the highpart first feels especially odd on an LE system like x86…
> > >
> > > The XMM vector is documented left to right with MSB at the left (c.f.
> > > most significant *DIGIT* of the number at the left)
> > >
> > > xmm[MSB] .... xmm[LSB]
> > >
> > > so, looking at x86 ADDSUBPD insn documentation:
> > >
> > > xmm2[127:64] xmm2[63:0]
> > > ( + -)
> > > xmm1[127:64] xmm1[63:0]
> > > (=)
> > > xmm1[127:64] holds ADD
> > > xmm1[63:0] holds SUB
> > >
> > > xmm1[127:64] xmm1 [63:0]
> > > ADD SUB
> >
> > I think if we really want to resolve the "ambiguity" we have to
> > spell it out in the optab.  vec_addodd_subeven or vec_addsub_oddeven.
> > As I noted there are targets who have the opposite so we could
> > then add vec_addsub_evenodd (not vec_subadd_evenodd).
> >
> > Just tell me what you prefer - I'll adjust the patch accordingly.
> 
> I'd use addsub when add comes at the left of sub, and MSB is also
> considered at the left. subadd for when sub comes at the left of add
> where MSB is also at the left.
> 
> I think that the documentation should clear the ambiguity.
> 
> Otherwise, just my 0.02€, I don't want to bikeshed about this ad
> infinitum, so I'll stop there.

Yeah, so I've not changed anything, but after the permute propagation
fix I just pushed I've added a testcase that would be broken without
it and without adding the new CFN_VEC_ADDSUB case (which I now did).

Re-bootstrapped and tested on x86_64-unknown-linux-gnu, now pushed
since it helps me, while working with Tamar on the SLP pattern reorg,
to have at least one matching pattern on x86_64.

Richard.

From a6bfb9106f5605c0ce6ffe1ed91b50510e43343b Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Mon, 31 May 2021 13:19:01 +0200
Subject: [PATCH] Add x86 addsub SLP pattern
To: gcc-patches@gcc.gnu.org

This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }
thus subtract, add alternating on lanes, starting with subtract.

It adds a corresponding optab and direct internal function,
vec_addsub$a3, and renames the existing i386 backend patterns to
the new canonical name.

The SLP pattern matches the exact alternating lane sequence rather
than trying to be clever and anticipating incoming permutes - we
could permute the two input vectors to the needed lane alternation,
do the addsub and then permute the result vector back but that's
only profitable in case the two input or the output permute will
vanish - something Tamar's refactoring of SLP pattern recog should
make possible.

2021-06-17  Richard Biener  <rguenther@suse.de>

	* config/i386/sse.md (avx_addsubv4df3): Rename to
	vec_addsubv4df3.
	(avx_addsubv8sf3): Rename to vec_addsubv8sf3.
	(sse3_addsubv2df3): Rename to vec_addsubv2df3.
	(sse3_addsubv4sf3): Rename to vec_addsubv4sf3.
	* internal-fn.def (VEC_ADDSUB): New internal optab fn.
	* optabs.def (vec_addsub_optab): New optab.
	* tree-vect-slp-patterns.c (class addsub_pattern): New.
	(slp_patterns): Add addsub_pattern.
	(vect_optimize_slp): Disable propagation across CFN_VEC_ADDSUB.
	* tree-vectorizer.h (vect_pattern::vect_pattern): Make
	m_ops optional.
	* doc/md.texi (vec_addsub<mode>3): Document.

	* gcc.target/i386/vect-addsubv2df.c: New testcase.
	* gcc.target/i386/vect-addsubv4sf.c: Likewise.
	* gcc.target/i386/vect-addsubv4df.c: Likewise.
	* gcc.target/i386/vect-addsubv8sf.c: Likewise.
	* gcc.target/i386/vect-addsub-2.c: Likewise.
	* gcc.target/i386/vect-addsub-3.c: Likewise.
---
 gcc/config/i386/i386-builtin.def              |   8 +-
 gcc/config/i386/sse.md                        |   8 +-
 gcc/doc/md.texi                               |   8 ++
 gcc/internal-fn.def                           |   1 +
 gcc/optabs.def                                |   1 +
 gcc/testsuite/gcc.target/i386/vect-addsub-2.c |  21 ++++
 gcc/testsuite/gcc.target/i386/vect-addsub-3.c |  38 +++++++
 .../gcc.target/i386/vect-addsubv2df.c         |  42 ++++++++
 .../gcc.target/i386/vect-addsubv4df.c         |  36 +++++++
 .../gcc.target/i386/vect-addsubv4sf.c         |  46 ++++++++
 .../gcc.target/i386/vect-addsubv8sf.c         |  46 ++++++++
 gcc/tree-vect-slp-patterns.c                  | 100 ++++++++++++++++++
 gcc/tree-vect-slp.c                           |   1 +
 gcc/tree-vectorizer.h                         |   3 +-
 14 files changed, 350 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsub-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsub-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index 31df3a613dd..ea79e0bdda2 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
@@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128
 /* AVX */
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 5bd65dd9312..1f1db8214cc 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2410,7 +2410,7 @@
    (set_attr "prefix" "<round_saeonly_scalar_prefix>")
    (set_attr "mode" "<ssescalarmode>")])
 
-(define_insn "avx_addsubv4df3"
+(define_insn "vec_addsubv4df3"
   [(set (match_operand:V4DF 0 "register_operand" "=x")
 	(vec_merge:V4DF
 	  (minus:V4DF
@@ -2424,7 +2424,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V4DF")])
 
-(define_insn "sse3_addsubv2df3"
+(define_insn "vec_addsubv2df3"
   [(set (match_operand:V2DF 0 "register_operand" "=x,x")
 	(vec_merge:V2DF
 	  (minus:V2DF
@@ -2442,7 +2442,7 @@
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "V2DF")])
 
-(define_insn "avx_addsubv8sf3"
+(define_insn "vec_addsubv8sf3"
   [(set (match_operand:V8SF 0 "register_operand" "=x")
 	(vec_merge:V8SF
 	  (minus:V8SF
@@ -2456,7 +2456,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V8SF")])
 
-(define_insn "sse3_addsubv4sf3"
+(define_insn "vec_addsubv4sf3"
   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
 	(vec_merge:V4SF
 	  (minus:V4SF
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 00caf3844cc..1b918144330 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5682,6 +5682,14 @@ signed/unsigned elements of size S@.  Subtract the high/low elements of 2 from
 1 and widen the resulting elements. Put the N/2 results of size 2*S in the
 output vector (operand 0).
 
+@cindex @code{vec_addsub@var{m}3} instruction pattern
+@item @samp{vec_addsub@var{m}3}
+Alternating subtract, add with even lanes doing subtract and odd
+lanes doing addition.  Operands 1 and 2 and the output operand are vectors
+with mode @var{m}.
+
+These instructions are not allowed to @code{FAIL}.
+
 @cindex @code{mulhisi3} instruction pattern
 @item @samp{mulhisi3}
 Multiply operands 1 and 2, which have mode @code{HImode}, and store
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index b2f414d2131..c3b8e730960 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)
 
 
 /* FP scales.  */
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..41ab2598eb6 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
 OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
 OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
 OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
+OPTAB_D (vec_addsub_optab, "vec_addsub$a3")
 
 OPTAB_D (sync_add_optab, "sync_add$I$a")
 OPTAB_D (sync_and_optab, "sync_and$I$a")
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsub-2.c b/gcc/testsuite/gcc.target/i386/vect-addsub-2.c
new file mode 100644
index 00000000000..a6b941461e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsub-2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3" } */
+
+float a[1024], b[1024];
+
+void foo()
+{
+  for (int i = 0; i < 256; i++)
+    {
+      a[4*i+0] = a[4*i+0] - b[4*i+0];
+      a[4*i+1] = a[4*i+1] + b[4*i+1];
+      a[4*i+2] = a[4*i+2] - b[4*i+2];
+      a[4*i+3] = a[4*i+3] + b[4*i+3];
+    }
+}
+
+/* We should be able to vectorize this with SLP using the addsub
+   SLP pattern.  */
+/* { dg-final { scan-assembler "addsubps" } } */
+/* { dg-final { scan-assembler-not "shuf" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsub-3.c b/gcc/testsuite/gcc.target/i386/vect-addsub-3.c
new file mode 100644
index 00000000000..b27ee56bd73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsub-3.c
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+double a[2], b[2], c[2];
+
+void __attribute__((noipa))
+foo ()
+{
+  /* When we want to use addsubpd we have to keep permuting both
+     loads, if instead we blend the result of an add and a sub we
+     can combine the blend with the permute.  Both are similar in cost,
+     verify we did not wrongly apply both.  */
+  double tem0 = a[1] - b[1];
+  double tem1 = a[0] + b[0];
+  c[0] = tem0;
+  c[1] = tem1;
+}
+
+static void
+TEST (void)
+{
+  a[0] = 1.; a[1] = 2.;
+  b[0] = 2.; b[1] = 4.;
+  foo ();
+  if (c[0] != -2. || c[1] != 3.)
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
new file mode 100644
index 00000000000..547485d5519
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+double x[2], y[2], z[2];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+}
+static void
+TEST (void)
+{
+  for (int i = 0; i < 2; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
new file mode 100644
index 00000000000..e0a1b3d9d00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
@@ -0,0 +1,36 @@
+/* { dg-do run { target avx_runtime } } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */
+
+double x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+int main()
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
new file mode 100644
index 00000000000..b524f0c35a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+float x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+static void
+TEST (void)
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c
new file mode 100644
index 00000000000..0eed33b6531
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c
@@ -0,0 +1,46 @@
+/* { dg-do run { target avx_runtime } } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */
+
+float x[8], y[8], z[8];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+  x[4] = y[4] - z[4];
+  x[5] = y[5] + z[5];
+  x[6] = y[6] - z[6];
+  x[7] = y[7] + z[7];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+  x[4] = y[4] + z[4];
+  x[5] = y[5] - z[5];
+  x[6] = y[6] + z[6];
+  x[7] = y[7] - z[7];
+}
+int main()
+{
+  for (int i = 0; i < 8; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11
+      || x[4] != -4 || x[5] != 17 || x[6] != -6 || x[7] != 23)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3
+      || x[4] != 14 || x[5] != -5 || x[6] != 20 || x[7] != -7)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2ed49cd9edc..d536494a1bd 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1490,6 +1490,105 @@ complex_operations_pattern::build (vec_info * /* vinfo */)
   gcc_unreachable ();
 }
 
+
+/* The addsub_pattern.  */
+
+class addsub_pattern : public vect_pattern
+{
+  public:
+    addsub_pattern (slp_tree *node)
+	: vect_pattern (node, NULL, IFN_VEC_ADDSUB) {};
+
+    void build (vec_info *);
+
+    static vect_pattern*
+    recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
+};
+
+vect_pattern *
+addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
+{
+  slp_tree node = *node_;
+  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+      || SLP_TREE_CHILDREN (node).length () != 2)
+    return NULL;
+
+  /* Match a blend of a plus and a minus op with the same number of plus and
+     minus lanes on the same operands.  */
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  bool swapped_p = false;
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    {
+      std::swap (add, sub);
+      swapped_p = true;
+    }
+  if (!(vect_match_expression_p (add, PLUS_EXPR)
+	&& vect_match_expression_p (sub, MINUS_EXPR)))
+    return NULL;
+  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
+	 && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
+	|| (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
+	    && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
+    return NULL;
+
+  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
+    {
+      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
+      if (swapped_p)
+	perm.first = perm.first == 0 ? 1 : 0;
+      /* It has to be alternating -, +, -, ...
+	 While we could permute the .ADDSUB inputs and the .ADDSUB output
+	 that's only profitable over the add + sub + blend if at least
+	 one of the permute is optimized which we can't determine here.  */
+      if (perm.first != (i & 1)
+	  || perm.second != i)
+	return NULL;
+    }
+
+  if (!vect_pattern_validate_optab (IFN_VEC_ADDSUB, node))
+    return NULL;
+
+  return new addsub_pattern (node_);
+}
+
+void
+addsub_pattern::build (vec_info *vinfo)
+{
+  slp_tree node = *m_node;
+
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    std::swap (add, sub);
+
+  /* Modify the blend node in-place.  */
+  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
+  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
+
+  /* Build IFN_VEC_ADDSUB from the sub representative operands.  */
+  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
+  gcall *call = gimple_build_call_internal (IFN_VEC_ADDSUB, 2,
+					    gimple_assign_rhs1 (rep->stmt),
+					    gimple_assign_rhs2 (rep->stmt));
+  gimple_call_set_lhs (call, make_ssa_name
+			       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
+  gimple_call_set_nothrow (call, true);
+  gimple_set_bb (call, gimple_bb (rep->stmt));
+  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
+  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
+  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
+  STMT_VINFO_VECTYPE (SLP_TREE_REPRESENTATIVE (node)) = SLP_TREE_VECTYPE (node);
+  STMT_VINFO_SLP_VECT_ONLY_PATTERN (SLP_TREE_REPRESENTATIVE (node)) = true;
+  SLP_TREE_CODE (node) = ERROR_MARK;
+  SLP_TREE_LANE_PERMUTATION (node).release ();
+
+  vect_free_slp_tree (sub);
+  vect_free_slp_tree (add);
+}
+
 /*******************************************************************************
  * Pattern matching definitions
  ******************************************************************************/
@@ -1502,6 +1601,7 @@ vect_pattern_decl_t slp_patterns[]
      overlap in what they can detect.  */
 
   SLP_PATTERN (complex_operations_pattern),
+  SLP_PATTERN (addsub_pattern)
 };
 #undef SLP_PATTERN
 
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 69ee8faed09..227d6aa3ee8 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -3705,6 +3705,7 @@ vect_optimize_slp (vec_info *vinfo)
 	      case CFN_COMPLEX_ADD_ROT270:
 	      case CFN_COMPLEX_MUL:
 	      case CFN_COMPLEX_MUL_CONJ:
+	      case CFN_VEC_ADDSUB:
 		continue;
 	      default:;
 	      }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 5c71fbc487f..fa28336d429 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2100,7 +2100,8 @@ class vect_pattern
       this->m_ifn = ifn;
       this->m_node = node;
       this->m_ops.create (0);
-      this->m_ops.safe_splice (*m_ops);
+      if (m_ops)
+	this->m_ops.safe_splice (*m_ops);
     }
 
   public:
Uros Bizjak June 24, 2021, 2:25 p.m. UTC | #13
On Thu, Jun 24, 2021 at 1:07 PM Richard Biener <rguenther@suse.de> wrote:

> This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> thus subtract, add alternating on lanes, starting with subtract.
>
> It adds a corresponding optab and direct internal function,
> vec_addsub$a3 and renames the existing i386 backend patterns to
> the new canonical name.
>
> The SLP pattern matches the exact alternating lane sequence rather
> than trying to be clever and anticipating incoming permutes - we
> could permute the two input vectors to the needed lane alternation,
> do the addsub and then permute the result vector back but that's
> only profitable in case the two input or the output permute will
> vanish - something Tamars refactoring of SLP pattern recog should
> make possible.

Using the attached patch, I was also able to generate addsub for the
following testcase:

float x[2], y[2], z[2];

void foo ()
{
  x[0] = y[0] - z[0];
  x[1] = y[1] + z[1];
}

       vmovq   y(%rip), %xmm0
       vmovq   z(%rip), %xmm1
       vaddsubps       %xmm1, %xmm0, %xmm0
       vmovlps %xmm0, x(%rip)
       ret

Uros.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index e887f03474d..5f10572718d 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -788,6 +788,24 @@ (define_insn "*mmx_haddsubv2sf3"
    (set_attr "prefix_extra" "1")
    (set_attr "mode" "V2SF")])
 
+(define_insn "vec_addsubv2sf3"
+  [(set (match_operand:V2SF 0 "register_operand" "=x,x")
+	(vec_merge:V2SF
+	  (minus:V2SF
+	    (match_operand:V2SF 1 "register_operand" "0,x")
+	    (match_operand:V2SF 2 "register_operand" "x,x"))
+	  (plus:V2SF (match_dup 1) (match_dup 2))
+	  (const_int 1)))]
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE"
+  "@
+   addsubps\t{%2, %0|%0, %2}
+   vaddsubps\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "sseadd")
+   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix_rep" "1,*")
+   (set_attr "mode" "V4SF")])
+
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
 ;; Parallel single-precision floating point comparisons
Richard Biener June 25, 2021, 6:48 a.m. UTC | #14
On Thu, 24 Jun 2021, Uros Bizjak wrote:

> On Thu, Jun 24, 2021 at 1:07 PM Richard Biener <rguenther@suse.de> wrote:
> 
> > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> > thus subtract, add alternating on lanes, starting with subtract.
> >
> > It adds a corresponding optab and direct internal function,
> > vec_addsub$a3 and renames the existing i386 backend patterns to
> > the new canonical name.
> >
> > The SLP pattern matches the exact alternating lane sequence rather
> > than trying to be clever and anticipating incoming permutes - we
> > could permute the two input vectors to the needed lane alternation,
> > do the addsub and then permute the result vector back but that's
> > only profitable in case the two input or the output permute will
> > vanish - something Tamars refactoring of SLP pattern recog should
> > make possible.
> 
> Using the attached patch, I was also able to generate addsub for the
> following testcase:
> 
> float x[2], y[2], z[2];
> 
> void foo ()
> {
>   x[0] = y[0] - z[0];
>   x[1] = y[1] + z[1];
> }
> 
>        vmovq   y(%rip), %xmm0
>        vmovq   z(%rip), %xmm1
>        vaddsubps       %xmm1, %xmm0, %xmm0
>        vmovlps %xmm0, x(%rip)
>        ret

Nice.  But I suppose it can be merged with the other, now single,
pattern?

Richard.
Uros Bizjak June 25, 2021, 7:55 a.m. UTC | #15
On Fri, Jun 25, 2021 at 8:48 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 24 Jun 2021, Uros Bizjak wrote:
>
> > On Thu, Jun 24, 2021 at 1:07 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... }
> > > thus subtract, add alternating on lanes, starting with subtract.
> > >
> > > It adds a corresponding optab and direct internal function,
> > > vec_addsub$a3 and renames the existing i386 backend patterns to
> > > the new canonical name.
> > >
> > > The SLP pattern matches the exact alternating lane sequence rather
> > > than trying to be clever and anticipating incoming permutes - we
> > > could permute the two input vectors to the needed lane alternation,
> > > do the addsub and then permute the result vector back but that's
> > > only profitable in case the two input or the output permute will
> > > vanish - something Tamars refactoring of SLP pattern recog should
> > > make possible.
> >
> > Using the attached patch, I was also able to generate addsub for the
> > following testcase:
> >
> > float x[2], y[2], z[2];
> >
> > void foo ()
> > {
> >   x[0] = y[0] - z[0];
> >   x[1] = y[1] + z[1];
> > }
> >
> >        vmovq   y(%rip), %xmm0
> >        vmovq   z(%rip), %xmm1
> >        vaddsubps       %xmm1, %xmm0, %xmm0
> >        vmovlps %xmm0, x(%rip)
> >        ret
>
> Nice.  But I suppose it can be merged with the other now single
> pattern?

Actually, MMX_WITH_SSE V2SF modes do not support memory operands (they
operate on a V2SF subreg of V4SF SSE registers), so a new mode attribute
would be needed to instantiate a register operand in the case of V2SF.
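
As an aside, a minimal sketch of the corresponding 64-bit kernel written
with GCC vector extensions (an illustration only; whether it actually maps
to [v]addsubps depends on the V2SF pattern above and on
TARGET_MMX_WITH_SSE):

typedef float v2sf __attribute__ ((vector_size (8)));

v2sf
addsub_v2sf (v2sf a, v2sf b)
{
  v2sf r;
  /* Lane 0 subtracts, lane 1 adds - the same convention as the
     V4SF/V2DF patterns.  */
  r[0] = a[0] - b[0];
  r[1] = a[1] + b[1];
  return r;
}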

Uros.
>
> Richard.

Patch

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 94296bc773b..73238c9874a 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2396,6 +2396,22 @@ 
    (set_attr "prefix" "<round_saeonly_scalar_prefix>")
    (set_attr "mode" "<ssescalarmode>")])
 
+/* ???  Rename avx_addsubv4df3, but its name is used throughout the backend,
+   so avoid this churn for the moment.  */
+(define_insn "vec_subaddv4df3"
+  [(set (match_operand:V4DF 0 "register_operand" "=x")
+	(vec_merge:V4DF
+	  (minus:V4DF
+	    (match_operand:V4DF 1 "register_operand" "x")
+	    (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
+	  (plus:V4DF (match_dup 1) (match_dup 2))
+	  (const_int 5)))]
+  "TARGET_AVX"
+  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "type" "sseadd")
+   (set_attr "prefix" "vex")
+   (set_attr "mode" "V4DF")])
+
 (define_insn "avx_addsubv4df3"
   [(set (match_operand:V4DF 0 "register_operand" "=x")
 	(vec_merge:V4DF
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index b2f414d2131..a4578a0293e 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -281,6 +281,7 @@  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd, binary)
 
 
 /* FP scales.  */
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..cacbd02cfe1 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -407,6 +407,7 @@  OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
 OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
 OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
 OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
+OPTAB_D (vec_subadd_optab, "vec_subadd$a3")
 
 OPTAB_D (sync_add_optab, "sync_add$I$a")
 OPTAB_D (sync_and_optab, "sync_and$I$a")
diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
new file mode 100644
index 00000000000..a79e5ba92e2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
@@ -0,0 +1,35 @@ 
+/* { dg-do run { target avx_runtime } } */
+/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */
+
+double x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+int main()
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2ed49cd9edc..61732c53ad7 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1490,6 +1490,103 @@  complex_operations_pattern::build (vec_info * /* vinfo */)
   gcc_unreachable ();
 }
 
+
+/* The subadd_pattern.  */
+
+class subadd_pattern : public vect_pattern
+{
+  public:
+    subadd_pattern (slp_tree *node)
+	: vect_pattern (node, NULL, IFN_VEC_SUBADD) {};
+
+    void build (vec_info *);
+
+    static vect_pattern*
+    recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
+};
+
+vect_pattern *
+subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
+{
+  slp_tree node = *node_;
+  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+      || SLP_TREE_CHILDREN (node).length () != 2)
+    return NULL;
+
+  /* Match a blend of a plus and a minus op with the same number of plus and
+     minus lanes on the same operands.  */
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  bool swapped_p = false;
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    {
+      std::swap (add, sub);
+      swapped_p = true;
+    }
+  if (!(vect_match_expression_p (add, PLUS_EXPR)
+	&& vect_match_expression_p (sub, MINUS_EXPR)))
+    return NULL;
+  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
+	 && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
+	|| (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
+	    && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
+    return NULL;
+
+  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
+    {
+      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
+      if (swapped_p)
+	perm.first = perm.first == 0 ? 1 : 0;
+      /* It has to be alternating -, +, -, ...
+	 While we could permute the .SUBADD inputs and the .SUBADD output
+	 that's only profitable over the add + sub + blend if at least
+	 one of the permute is optimized which we can't determine here.  */
+      if (perm.first != (i & 1)
+	  || perm.second != i)
+	return NULL;
+    }
+
+  if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node))
+    return NULL;
+
+  return new subadd_pattern (node_);
+}
+
+void
+subadd_pattern::build (vec_info *vinfo)
+{
+  slp_tree node = *m_node;
+
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    std::swap (add, sub);
+
+  /* Modify the blend node in-place.  */
+  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
+  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
+
+  /* Build IFN_VEC_SUBADD from the sub representative operands.  */
+  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
+  gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2,
+					    gimple_assign_rhs1 (rep->stmt),
+					    gimple_assign_rhs2 (rep->stmt));
+  gimple_call_set_lhs (call, make_ssa_name
+			       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
+  gimple_call_set_nothrow (call, true);
+  gimple_set_bb (call, gimple_bb (rep->stmt));
+  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
+  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
+  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
+  SLP_TREE_CODE (node) = ERROR_MARK;
+  SLP_TREE_LANE_PERMUTATION (node).release ();
+
+  vect_free_slp_tree (sub);
+  vect_free_slp_tree (add);
+}
+
 /*******************************************************************************
  * Pattern matching definitions
  ******************************************************************************/
@@ -1502,6 +1599,7 @@  vect_pattern_decl_t slp_patterns[]
      overlap in what they can detect.  */
 
   SLP_PATTERN (complex_operations_pattern),
+  SLP_PATTERN (subadd_pattern)
 };
 #undef SLP_PATTERN
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 04c20f8bd0f..b9824623ad9 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2100,7 +2100,8 @@  class vect_pattern
       this->m_ifn = ifn;
       this->m_node = node;
       this->m_ops.create (0);
-      this->m_ops.safe_splice (*m_ops);
+      if (m_ops)
+	this->m_ops.safe_splice (*m_ops);
     }
 
   public: