Message ID: soq832r0-q415-50rp-9o61-2ns838p7235@fhfr.qr
State: New
Series: [RFC] Add x86 subadd SLP pattern
On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote: > > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }, > thus subtract and add alternating on lanes, starting with subtract. > > It adds a corresponding optab and direct internal function, > vec_subadd$a3, and at the moment, to make the i386 backend changes > "obvious", duplicates the existing avx_addsubv4df3 pattern with > the new canonical name (CODE_FOR_* and gen_* are used throughout the > intrinsic code, so the actual change to rename all existing patterns > will be quite a bit bigger). I expect some bike-shedding on > subadd vs. addsub so I delay that change ;) > > ARC seems to have both addsub{V2HI,V2SI,V4HI} and > subadd{V2HI,V2SI,V4HI}. ARM iwmmxt has wsubaddx on V4HI but > I didn't dare to decipher whether it's -, + or +, -. bfin > has both variants as well, plus a saturating variant of both (on v2hi). > But none of the major vector ISAs besides x86 seem to have it. > Still, some targets having both and opting for the naming I chose > would make that a better fit IMHO. > > Uros/Hongtao - would mass-renaming of the x86 addsub patterns > to subadd be OK? (not touching the FMA ones, which have both) We can add a define_expand (this is how we approached the problem in the past for x86). If there are not too many of them, then this is the preferred approach (see below) ... ... until someone comes along and mass-renames everything to a generic name. Uros. > > The SLP pattern matches the exact alternating lane sequence rather > than trying to be clever and anticipating incoming permutes - we > could permute the two input vectors to the needed lane alternation, > do the subadd and then permute the result vector back, but that's > only profitable in case the two input permutes or the output permute > vanish - something Tamar's refactoring of SLP pattern recog should > make possible (I hope ;)).
> > So as said, the patch is incomplete on the x86 backend (and test > coverage) side, but the vectorizer part should be finalized until > Tamar posts his work. > > Bootstrapped / tested on x86_64-unknown-linux-gnu. > > 2021-06-17 Richard Biener <rguenther@suse.de> > > * config/i386/sse.md (vec_subaddv4df3): Duplicate from > avx_addsubv4df3. > * internal-fn.def (VEC_SUBADD): New internal optab fn. > * optabs.def (vec_subadd_optab): New optab. > * tree-vect-slp-patterns.c (class subadd_pattern): New. > (slp_patterns): Add subadd_pattern. > * tree-vectorizer.h (vect_pattern::vect_pattern): Make > m_ops optional. > > * gcc.target/i386/vect-subadd-1.c: New testcase. > --- > gcc/config/i386/sse.md | 16 +++ > gcc/internal-fn.def | 1 + > gcc/optabs.def | 1 + > gcc/testsuite/gcc.target/i386/vect-subadd-1.c | 35 +++++++ > gcc/tree-vect-slp-patterns.c | 98 +++++++++++++++++++ > gcc/tree-vectorizer.h | 3 +- > 6 files changed, 153 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.target/i386/vect-subadd-1.c > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index 94296bc773b..73238c9874a 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -2396,6 +2396,22 @@ > (set_attr "prefix" "<round_saeonly_scalar_prefix>") > (set_attr "mode" "<ssescalarmode>")]) > > +/* ??? Rename avx_addsubv4df3, but its name is used throughout the backed, > + so avoid this churn for the moment. 
*/ > +(define_insn "vec_subaddv4df3" > + [(set (match_operand:V4DF 0 "register_operand" "=x") > + (vec_merge:V4DF > + (minus:V4DF > + (match_operand:V4DF 1 "register_operand" "x") > + (match_operand:V4DF 2 "nonimmediate_operand" "xm")) > + (plus:V4DF (match_dup 1) (match_dup 2)) > + (const_int 5)))] > + "TARGET_AVX" > + "vaddsubpd\t{%2, %1, %0|%0, %1, %2}" > + [(set_attr "type" "sseadd") > + (set_attr "prefix" "vex") > + (set_attr "mode" "V4DF")]) define_expand "vec_subaddv4df3" [(set (match_operand:V4DF 0 "register_operand") (vec_merge:V4DF (minus:V4DF (match_operand:V4DF 1 "register_operand") (match_operand:V4DF 2 "nonimmediate_operand")) (plus:V4DF (match_dup 1) (match_dup 2)) (const_int 5)))] "TARGET_AVX") > (define_insn "avx_addsubv4df3" > [(set (match_operand:V4DF 0 "register_operand" "=x") > (vec_merge:V4DF > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def > index b2f414d2131..a4578a0293e 100644 > --- a/gcc/internal-fn.def > +++ b/gcc/internal-fn.def > @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary) > DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary) > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary) > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary) > +DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd, binary) > > > /* FP scales. 
*/ > diff --git a/gcc/optabs.def b/gcc/optabs.def > index b192a9d070b..cacbd02cfe1 100644 > --- a/gcc/optabs.def > +++ b/gcc/optabs.def > @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a") > OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a") > OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a") > OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a") > +OPTAB_D (vec_subadd_optab, "vec_subadd$a3") > > OPTAB_D (sync_add_optab, "sync_add$I$a") > OPTAB_D (sync_and_optab, "sync_and$I$a") > diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c > new file mode 100644 > index 00000000000..a79e5ba92e2 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c > @@ -0,0 +1,35 @@ > +/* { dg-do run { target avx_runtime } } */ > +/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */ > + > +double x[4], y[4], z[4]; > +void __attribute__((noipa)) foo () > +{ > + x[0] = y[0] - z[0]; > + x[1] = y[1] + z[1]; > + x[2] = y[2] - z[2]; > + x[3] = y[3] + z[3]; > +} > +void __attribute__((noipa)) bar () > +{ > + x[0] = y[0] + z[0]; > + x[1] = y[1] - z[1]; > + x[2] = y[2] + z[2]; > + x[3] = y[3] - z[3]; > +} > +int main() > +{ > + for (int i = 0; i < 4; ++i) > + { > + y[i] = i + 1; > + z[i] = 2 * i + 1; > + } > + foo (); > + if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11) > + __builtin_abort (); > + bar (); > + if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */ > diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c > index 2ed49cd9edc..61732c53ad7 100644 > --- a/gcc/tree-vect-slp-patterns.c > +++ b/gcc/tree-vect-slp-patterns.c > @@ -1490,6 +1490,103 @@ complex_operations_pattern::build (vec_info * /* vinfo */) > gcc_unreachable (); > } > > + > +/* The subadd_pattern. 
*/ > + > +class subadd_pattern : public vect_pattern > +{ > + public: > + subadd_pattern (slp_tree *node) > + : vect_pattern (node, NULL, IFN_VEC_SUBADD) {}; > + > + void build (vec_info *); > + > + static vect_pattern* > + recognize (slp_tree_to_load_perm_map_t *, slp_tree *); > +}; > + > +vect_pattern * > +subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_) > +{ > + slp_tree node = *node_; > + if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > + || SLP_TREE_CHILDREN (node).length () != 2) > + return NULL; > + > + /* Match a blend of a plus and a minus op with the same number of plus and > + minus lanes on the same operands. */ > + slp_tree sub = SLP_TREE_CHILDREN (node)[0]; > + slp_tree add = SLP_TREE_CHILDREN (node)[1]; > + bool swapped_p = false; > + if (vect_match_expression_p (sub, PLUS_EXPR)) > + { > + std::swap (add, sub); > + swapped_p = true; > + } > + if (!(vect_match_expression_p (add, PLUS_EXPR) > + && vect_match_expression_p (sub, MINUS_EXPR))) > + return NULL; > + if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0] > + && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1]) > + || (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1] > + && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0]))) > + return NULL; > + > + for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i) > + { > + std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i]; > + if (swapped_p) > + perm.first = perm.first == 0 ? 1 : 0; > + /* It has to be alternating -, +, -, ... > + While we could permute the .SUBADD inputs and the .SUBADD output > + that's only profitable over the add + sub + blend if at least > + one of the permute is optimized which we can't determine here. 
*/ > + if (perm.first != (i & 1) > + || perm.second != i) > + return NULL; > + } > + > + if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node)) > + return NULL; > + > + return new subadd_pattern (node_); > +} > + > +void > +subadd_pattern::build (vec_info *vinfo) > +{ > + slp_tree node = *m_node; > + > + slp_tree sub = SLP_TREE_CHILDREN (node)[0]; > + slp_tree add = SLP_TREE_CHILDREN (node)[1]; > + if (vect_match_expression_p (sub, PLUS_EXPR)) > + std::swap (add, sub); > + > + /* Modify the blend node in-place. */ > + SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0]; > + SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1]; > + SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++; > + SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++; > + > + /* Build IFN_VEC_SUBADD from the sub representative operands. */ > + stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub); > + gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2, > + gimple_assign_rhs1 (rep->stmt), > + gimple_assign_rhs2 (rep->stmt)); > + gimple_call_set_lhs (call, make_ssa_name > + (TREE_TYPE (gimple_assign_lhs (rep->stmt)))); > + gimple_call_set_nothrow (call, true); > + gimple_set_bb (call, gimple_bb (rep->stmt)); > + SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep); > + STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope; > + STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp; > + SLP_TREE_CODE (node) = ERROR_MARK; > + SLP_TREE_LANE_PERMUTATION (node).release (); > + > + vect_free_slp_tree (sub); > + vect_free_slp_tree (add); > +} > + > /******************************************************************************* > * Pattern matching definitions > ******************************************************************************/ > @@ -1502,6 +1599,7 @@ vect_pattern_decl_t slp_patterns[] > overlap in what they can detect. 
*/ > > SLP_PATTERN (complex_operations_pattern), > + SLP_PATTERN (subadd_pattern) > }; > #undef SLP_PATTERN > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index 04c20f8bd0f..b9824623ad9 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -2100,7 +2100,8 @@ class vect_pattern > this->m_ifn = ifn; > this->m_node = node; > this->m_ops.create (0); > - this->m_ops.safe_splice (*m_ops); > + if (m_ops) > + this->m_ops.safe_splice (*m_ops); > } > > public: > -- > 2.26.2
On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote: > > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }, > thus subtract and add alternating on lanes, starting with subtract. > > It adds a corresponding optab and direct internal function, > vec_subadd$a3, and at the moment, to make the i386 backend changes > "obvious", duplicates the existing avx_addsubv4df3 pattern with > the new canonical name (CODE_FOR_* and gen_* are used throughout the > intrinsic code, so the actual change to rename all existing patterns > will be quite a bit bigger). I expect some bike-shedding on > subadd vs. addsub so I delay that change ;) Well, the pattern is called addsub in the x86 world because the highpart does an add and the lowpart does a sub. In left-to-right writing systems the highpart comes before the lowpart, so you get addsub. Uros.
On Thu, 17 Jun 2021, Uros Bizjak wrote: > On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote: > > > > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }, > > thus subtract and add alternating on lanes, starting with subtract. > > > > It adds a corresponding optab and direct internal function, > > vec_subadd$a3, and at the moment, to make the i386 backend changes > > "obvious", duplicates the existing avx_addsubv4df3 pattern with > > the new canonical name (CODE_FOR_* and gen_* are used throughout the > > intrinsic code, so the actual change to rename all existing patterns > > will be quite a bit bigger). I expect some bike-shedding on > > subadd vs. addsub so I delay that change ;) > > > > ARC seems to have both addsub{V2HI,V2SI,V4HI} and > > subadd{V2HI,V2SI,V4HI}. ARM iwmmxt has wsubaddx on V4HI but > > I didn't dare to decipher whether it's -, + or +, -. bfin > > has both variants as well, plus a saturating variant of both (on v2hi). > > But none of the major vector ISAs besides x86 seem to have it. > > Still, some targets having both and opting for the naming I chose > > would make that a better fit IMHO. > > > > Uros/Hongtao - would mass-renaming of the x86 addsub patterns > > to subadd be OK? (not touching the FMA ones, which have both) > > We can add define_expand (this is how we approached the problem in the > past for x86). If there are not too many of them, then this is the > preferred approach (see below) ... OK. > ... until someone comes along and mass-renames everything to a generic name. Though now that I've tried it, it isn't so bad for the addsub case (see patch below). Would that be preferred, or would it be an incomplete renaming and thus the define_expand preferred? Thanks, Richard.
diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def index 80c2a2c0294..3222eaedbd0 100644 --- a/gcc/config/i386/i386-builtin.def +++ b/gcc/config/i386/i386-builtin.def @@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__ BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF) -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) @@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128 /* AVX */ BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) -BDESC (OPTION_MASK_ISA_AVX, 0, 
CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 73238c9874a..48818dcf8cb 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -2396,8 +2396,6 @@ (set_attr "prefix" "<round_saeonly_scalar_prefix>") (set_attr "mode" "<ssescalarmode>")]) -/* ??? Rename avx_addsubv4df3, but its name is used throughout the backed, - so avoid this churn for the moment. 
*/ (define_insn "vec_subaddv4df3" [(set (match_operand:V4DF 0 "register_operand" "=x") (vec_merge:V4DF @@ -2412,21 +2410,7 @@ (set_attr "prefix" "vex") (set_attr "mode" "V4DF")]) -(define_insn "avx_addsubv4df3" - [(set (match_operand:V4DF 0 "register_operand" "=x") - (vec_merge:V4DF - (minus:V4DF - (match_operand:V4DF 1 "register_operand" "x") - (match_operand:V4DF 2 "nonimmediate_operand" "xm")) - (plus:V4DF (match_dup 1) (match_dup 2)) - (const_int 5)))] - "TARGET_AVX" - "vaddsubpd\t{%2, %1, %0|%0, %1, %2}" - [(set_attr "type" "sseadd") - (set_attr "prefix" "vex") - (set_attr "mode" "V4DF")]) - -(define_insn "sse3_addsubv2df3" +(define_insn "vec_subaddv2df3" [(set (match_operand:V2DF 0 "register_operand" "=x,x") (vec_merge:V2DF (minus:V2DF @@ -2444,7 +2428,7 @@ (set_attr "prefix" "orig,vex") (set_attr "mode" "V2DF")]) -(define_insn "avx_addsubv8sf3" +(define_insn "vec_subaddv8sf3" [(set (match_operand:V8SF 0 "register_operand" "=x") (vec_merge:V8SF (minus:V8SF @@ -2458,7 +2442,7 @@ (set_attr "prefix" "vex") (set_attr "mode" "V8SF")]) -(define_insn "sse3_addsubv4sf3" +(define_insn "vec_subaddv4sf3" [(set (match_operand:V4SF 0 "register_operand" "=x,x") (vec_merge:V4SF (minus:V4SF
On Thu, Jun 17, 2021 at 12:00 PM Richard Biener <rguenther@suse.de> wrote: > > On Thu, 17 Jun 2021, Uros Bizjak wrote: > > > On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote: > > > > > > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > > > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }, > > > thus subtract and add alternating on lanes, starting with subtract. > > > > > > It adds a corresponding optab and direct internal function, > > > vec_subadd$a3, and at the moment, to make the i386 backend changes > > > "obvious", duplicates the existing avx_addsubv4df3 pattern with > > > the new canonical name (CODE_FOR_* and gen_* are used throughout the > > > intrinsic code, so the actual change to rename all existing patterns > > > will be quite a bit bigger). I expect some bike-shedding on > > > subadd vs. addsub so I delay that change ;) > > > > > > ARC seems to have both addsub{V2HI,V2SI,V4HI} and > > > subadd{V2HI,V2SI,V4HI}. ARM iwmmxt has wsubaddx on V4HI but > > > I didn't dare to decipher whether it's -, + or +, -. bfin > > > has both variants as well, plus a saturating variant of both (on v2hi). > > > But none of the major vector ISAs besides x86 seem to have it. > > > Still, some targets having both and opting for the naming I chose > > > would make that a better fit IMHO. > > > > > > Uros/Hongtao - would mass-renaming of the x86 addsub patterns > > > to subadd be OK? (not touching the FMA ones, which have both) > > > > We can add define_expand (this is how we approached the problem in the > > past for x86). If there are not too many of them, then this is the > > preferred approach (see below) ... > > OK. > > > ... until someone comes along and mass-renames everything to a generic name. > > Though now that I've tried it, it isn't so bad for the addsub case (see > patch below). Would that be preferred, or would it be an incomplete > renaming and thus the define_expand preferred? The approach below is OK, too.
If you rename the patterns to addsub (see my previous mail for reasoning), then the rename is fully preferred as far as x86 is concerned. Thanks, Uros. > Thanks, > Richard. > > diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def > index 80c2a2c0294..3222eaedbd0 100644 > --- a/gcc/config/i386/i386-builtin.def > +++ b/gcc/config/i386/i386-builtin.def > @@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__ > BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF) > BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF) > > -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) > -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) > +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) > +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_subaddv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) > BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) > BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) > BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) > @@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128 > /* AVX */ > BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", 
IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) > BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) > -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) > -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) > +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) > +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_subaddv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) > BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) > BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) > BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index 73238c9874a..48818dcf8cb 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -2396,8 +2396,6 @@ > (set_attr "prefix" "<round_saeonly_scalar_prefix>") > (set_attr "mode" "<ssescalarmode>")]) > > -/* ??? Rename avx_addsubv4df3, but its name is used throughout the backed, > - so avoid this churn for the moment. 
*/ > (define_insn "vec_subaddv4df3" > [(set (match_operand:V4DF 0 "register_operand" "=x") > (vec_merge:V4DF > @@ -2412,21 +2410,7 @@ > (set_attr "prefix" "vex") > (set_attr "mode" "V4DF")]) > > -(define_insn "avx_addsubv4df3" > - [(set (match_operand:V4DF 0 "register_operand" "=x") > - (vec_merge:V4DF > - (minus:V4DF > - (match_operand:V4DF 1 "register_operand" "x") > - (match_operand:V4DF 2 "nonimmediate_operand" "xm")) > - (plus:V4DF (match_dup 1) (match_dup 2)) > - (const_int 5)))] > - "TARGET_AVX" > - "vaddsubpd\t{%2, %1, %0|%0, %1, %2}" > - [(set_attr "type" "sseadd") > - (set_attr "prefix" "vex") > - (set_attr "mode" "V4DF")]) > - > -(define_insn "sse3_addsubv2df3" > +(define_insn "vec_subaddv2df3" > [(set (match_operand:V2DF 0 "register_operand" "=x,x") > (vec_merge:V2DF > (minus:V2DF > @@ -2444,7 +2428,7 @@ > (set_attr "prefix" "orig,vex") > (set_attr "mode" "V2DF")]) > > -(define_insn "avx_addsubv8sf3" > +(define_insn "vec_subaddv8sf3" > [(set (match_operand:V8SF 0 "register_operand" "=x") > (vec_merge:V8SF > (minus:V8SF > @@ -2458,7 +2442,7 @@ > (set_attr "prefix" "vex") > (set_attr "mode" "V8SF")]) > > -(define_insn "sse3_addsubv4sf3" > +(define_insn "vec_subaddv4sf3" > [(set (match_operand:V4SF 0 "register_operand" "=x,x") > (vec_merge:V4SF > (minus:V4SF
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Thursday, June 17, 2021 10:45 AM > To: gcc-patches@gcc.gnu.org > Cc: hongtao.liu@intel.com; ubizjak@gmail.com; Tamar Christina > <Tamar.Christina@arm.com> > Subject: [PATCH][RFC] Add x86 subadd SLP pattern > > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... }, thus subtract > and add alternating on lanes, starting with subtract. > > It adds a corresponding optab and direct internal function, > vec_subadd$a3, and at the moment, to make the i386 backend changes > "obvious", duplicates the existing avx_addsubv4df3 pattern with the new > canonical name (CODE_FOR_* and gen_* are used throughout the intrinsic > code, so the actual change to rename all existing patterns will be quite a bit > bigger). I expect some bike-shedding on subadd vs. addsub so I delay that > change ;) > > ARC seems to have both addsub{V2HI,V2SI,V4HI} and > subadd{V2HI,V2SI,V4HI}. ARM iwmmxt has wsubaddx on V4HI but I didn't > dare to decipher whether it's -, + or +, -. bfin has both variants as well, plus a > saturating variant of both (on v2hi). > But none of the major vector ISAs besides x86 seem to have it. > Still, some targets having both and opting for the naming I chose would > make that a better fit IMHO. In terms of internal_fn I think we'll need both; whatever the target expands to is up to it. The reason for both is that in the bigger cases the difference between ADDSUB and SUBADD determines whether it's a conjugate or not. (I believe x86 has those with fmaddsub and fmsubadd.) The rework then simply matches MUL + ADDSUB or SUBADD and never needs to look further. This keeps stacking, up to more complicated sequences like complex dot product. > > Uros/Hongtao - would mass-renaming of the x86 addsub patterns to subadd > be OK?
(not touching the FMA ones, which have both) > > The SLP pattern matches the exact alternating lane sequence rather than > trying to be clever and anticipating incoming permutes - we could permute > the two input vectors to the needed lane alternation, do the subadd and > then permute the result vector back, but that's only profitable in case the two > input permutes or the output permute vanish - something Tamar's refactoring of > SLP pattern recog should make possible (I hope ;)). It will, it pushes it to optimize_slp, which will allow you to decide what to do. As part of this it allows additional transformations. So atm this is where I do ADDSUB -> COMPLEX_ADD depending on the permute it sees. Regards, Tamar > > So as said, the patch is incomplete on the x86 backend (and test > coverage) side, but the vectorizer part should be finalized until Tamar posts > his work. > > Bootstrapped / tested on x86_64-unknown-linux-gnu. > > 2021-06-17 Richard Biener <rguenther@suse.de> > > * config/i386/sse.md (vec_subaddv4df3): Duplicate from > avx_addsubv4df3. > * internal-fn.def (VEC_SUBADD): New internal optab fn. > * optabs.def (vec_subadd_optab): New optab. > * tree-vect-slp-patterns.c (class subadd_pattern): New. > (slp_patterns): Add subadd_pattern. > * tree-vectorizer.h (vect_pattern::vect_pattern): Make > m_ops optional. > > * gcc.target/i386/vect-subadd-1.c: New testcase.
> ---
>  gcc/config/i386/sse.md                        | 16 +++
>  gcc/internal-fn.def                           |  1 +
>  gcc/optabs.def                                |  1 +
>  gcc/testsuite/gcc.target/i386/vect-subadd-1.c | 35 +++++++
>  gcc/tree-vect-slp-patterns.c                  | 98 +++++++++++++++++++
>  gcc/tree-vectorizer.h                         |  3 +-
>  6 files changed, 153 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-subadd-1.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 94296bc773b..73238c9874a 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2396,6 +2396,22 @@
>     (set_attr "prefix" "<round_saeonly_scalar_prefix>")
>     (set_attr "mode" "<ssescalarmode>")])
>
> +/* ??? Rename avx_addsubv4df3, but its name is used throughout the backend,
> +   so avoid this churn for the moment.  */
> +(define_insn "vec_subaddv4df3"
> +  [(set (match_operand:V4DF 0 "register_operand" "=x")
> +        (vec_merge:V4DF
> +          (minus:V4DF
> +            (match_operand:V4DF 1 "register_operand" "x")
> +            (match_operand:V4DF 2 "nonimmediate_operand" "xm"))
> +          (plus:V4DF (match_dup 1) (match_dup 2))
> +          (const_int 5)))]
> +  "TARGET_AVX"
> +  "vaddsubpd\t{%2, %1, %0|%0, %1, %2}"
> +  [(set_attr "type" "sseadd")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "V4DF")])
> +
>  (define_insn "avx_addsubv4df3"
>    [(set (match_operand:V4DF 0 "register_operand" "=x")
>          (vec_merge:V4DF
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index b2f414d2131..a4578a0293e 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
> +DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd, binary)
>
>  /* FP scales.  */
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..cacbd02cfe1 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
>  OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
>  OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
>  OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
> +OPTAB_D (vec_subadd_optab, "vec_subadd$a3")
>
>  OPTAB_D (sync_add_optab, "sync_add$I$a")
>  OPTAB_D (sync_and_optab, "sync_and$I$a")
> diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> new file mode 100644
> index 00000000000..a79e5ba92e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run { target avx_runtime } } */
> +/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */
> +
> +double x[4], y[4], z[4];
> +void __attribute__((noipa)) foo ()
> +{
> +  x[0] = y[0] - z[0];
> +  x[1] = y[1] + z[1];
> +  x[2] = y[2] - z[2];
> +  x[3] = y[3] + z[3];
> +}
> +void __attribute__((noipa)) bar ()
> +{
> +  x[0] = y[0] + z[0];
> +  x[1] = y[1] - z[1];
> +  x[2] = y[2] + z[2];
> +  x[3] = y[3] - z[3];
> +}
> +int main()
> +{
> +  for (int i = 0; i < 4; ++i)
> +    {
> +      y[i] = i + 1;
> +      z[i] = 2 * i + 1;
> +    }
> +  foo ();
> +  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
> +    __builtin_abort ();
> +  bar ();
> +  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
> +    __builtin_abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> index 2ed49cd9edc..61732c53ad7 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -1490,6 +1490,103 @@ complex_operations_pattern::build (vec_info * /* vinfo */)
>    gcc_unreachable ();
>  }
>
> +
> +/* The subadd_pattern.  */
> +
> +class subadd_pattern : public vect_pattern
> +{
> + public:
> +  subadd_pattern (slp_tree *node)
> +    : vect_pattern (node, NULL, IFN_VEC_SUBADD) {};
> +
> +  void build (vec_info *);
> +
> +  static vect_pattern*
> +  recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
> +};
> +
> +vect_pattern *
> +subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
> +{
> +  slp_tree node = *node_;
> +  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> +      || SLP_TREE_CHILDREN (node).length () != 2)
> +    return NULL;
> +
> +  /* Match a blend of a plus and a minus op with the same number of plus and
> +     minus lanes on the same operands.  */
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  bool swapped_p = false;
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    {
> +      std::swap (add, sub);
> +      swapped_p = true;
> +    }
> +  if (!(vect_match_expression_p (add, PLUS_EXPR)
> +        && vect_match_expression_p (sub, MINUS_EXPR)))
> +    return NULL;
> +  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
> +         && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
> +        || (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
> +            && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
> +    return NULL;
> +
> +  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
> +    {
> +      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
> +      if (swapped_p)
> +        perm.first = perm.first == 0 ? 1 : 0;
> +      /* It has to be alternating -, +, -, ...
> +         While we could permute the .SUBADD inputs and the .SUBADD output
> +         that's only profitable over the add + sub + blend if at least
> +         one of the permutes is optimized which we can't determine here.  */
> +      if (perm.first != (i & 1)
> +          || perm.second != i)
> +        return NULL;
> +    }
> +
> +  if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node))
> +    return NULL;
> +
> +  return new subadd_pattern (node_);
> +}
> +
> +void
> +subadd_pattern::build (vec_info *vinfo)
> +{
> +  slp_tree node = *m_node;
> +
> +  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
> +  slp_tree add = SLP_TREE_CHILDREN (node)[1];
> +  if (vect_match_expression_p (sub, PLUS_EXPR))
> +    std::swap (add, sub);
> +
> +  /* Modify the blend node in-place.  */
> +  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
> +  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
> +  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
> +  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
> +
> +  /* Build IFN_VEC_SUBADD from the sub representative operands.  */
> +  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
> +  gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2,
> +                                            gimple_assign_rhs1 (rep->stmt),
> +                                            gimple_assign_rhs2 (rep->stmt));
> +  gimple_call_set_lhs (call, make_ssa_name
> +                       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
> +  gimple_call_set_nothrow (call, true);
> +  gimple_set_bb (call, gimple_bb (rep->stmt));
> +  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
> +  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
> +  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
> +  SLP_TREE_CODE (node) = ERROR_MARK;
> +  SLP_TREE_LANE_PERMUTATION (node).release ();
> +
> +  vect_free_slp_tree (sub);
> +  vect_free_slp_tree (add);
> +}
> +
>  /*******************************************************************************
>  * Pattern matching definitions
>  ******************************************************************************/
> @@ -1502,6 +1599,7 @@ vect_pattern_decl_t slp_patterns[]
>     overlap in what they can detect.  */
>
>    SLP_PATTERN (complex_operations_pattern),
> +  SLP_PATTERN (subadd_pattern)
>  };
>  #undef SLP_PATTERN
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 04c20f8bd0f..b9824623ad9 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2100,7 +2100,8 @@ class vect_pattern
>        this->m_ifn = ifn;
>        this->m_node = node;
>        this->m_ops.create (0);
> -      this->m_ops.safe_splice (*m_ops);
> +      if (m_ops)
> +        this->m_ops.safe_splice (*m_ops);
>      }
>
>    public:
> --
> 2.26.2
On Thu, 17 Jun 2021, Uros Bizjak wrote:

> On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
> > thus subtract, add alternating on lanes, starting with subtract.
> >
> > It adds a corresponding optab and direct internal function,
> > vec_subadd$a3, and, at the moment, to make the i386 backend changes
> > "obvious", duplicates the existing avx_addsubv4df3 pattern with
> > the new canonical name (CODE_FOR_* and gen_* are used throughout the
> > intrinsic code, so the actual change to rename all existing patterns
> > will be quite a bit bigger).  I expect some bike-shedding on
> > subadd vs. addsub so I delay that change ;)
>
> Well, the pattern is called addsub in the x86 world because the highpart
> does add and the lowpart does sub.  In left-to-right writing systems
> the highpart comes before the lowpart, so you have addsub.

The other targets mentioned do not seem to agree, but I can live
with that, thus I'll change back to addsub.

Richard.
On Thu, 17 Jun 2021, Richard Biener wrote:

> On Thu, 17 Jun 2021, Uros Bizjak wrote:
>
> > On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
> > > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
> > > thus subtract, add alternating on lanes, starting with subtract.
> > >
> > > It adds a corresponding optab and direct internal function,
> > > vec_subadd$a3, and, at the moment, to make the i386 backend changes
> > > "obvious", duplicates the existing avx_addsubv4df3 pattern with
> > > the new canonical name (CODE_FOR_* and gen_* are used throughout the
> > > intrinsic code, so the actual change to rename all existing patterns
> > > will be quite a bit bigger).  I expect some bike-shedding on
> > > subadd vs. addsub so I delay that change ;)
> >
> > Well, the pattern is called addsub in the x86 world because the highpart
> > does add and the lowpart does sub.  In left-to-right writing systems
> > the highpart comes before the lowpart, so you have addsub.
>
> The other targets mentioned do not seem to agree, but I can live
> with that, thus I'll change back to addsub.

For reference, the following is what passed bootstrap & regtest on
x86_64-unknown-linux-gnu.  I'm likely going to wait for Tamar a bit
since I expect the code needs refactoring for the pending changes.
I'm also looking into merging the patterns (you might have seen the
other mail about this) and into getting rid of the UNSPEC in the
fmaddsub patterns.

Richard.

From ddc804a832dec0e1b702e1a80c7cdd6b79ebe1a9 Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Mon, 31 May 2021 13:19:01 +0200
Subject: [PATCH] Add x86 addsub SLP pattern
To: gcc-patches@gcc.gnu.org

This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
thus subtract, add alternating on lanes, starting with subtract.

It adds a corresponding optab and direct internal function,
vec_addsub$a3, and renames the existing i386 backend patterns to
the new canonical name.

The SLP pattern matches the exact alternating lane sequence rather
than trying to be clever and anticipating incoming permutes - we
could permute the two input vectors to the needed lane alternation,
do the addsub and then permute the result vector back, but that's
only profitable in case the two input permutes or the output permute
will vanish - something Tamar's refactoring of SLP pattern recog
should make possible.

2021-06-17  Richard Biener  <rguenther@suse.de>

        * config/i386/sse.md (avx_addsubv4df3): Rename to vec_addsubv4df3.
        (avx_addsubv8sf3): Rename to vec_addsubv8sf3.
        (sse3_addsubv2df3): Rename to vec_addsubv2df3.
        (sse3_addsubv4sf3): Rename to vec_addsubv4sf3.
        * internal-fn.def (VEC_ADDSUB): New internal optab fn.
        * optabs.def (vec_addsub_optab): New optab.
        * tree-vect-slp-patterns.c (class addsub_pattern): New.
        (slp_patterns): Add addsub_pattern.
        * tree-vectorizer.h (vect_pattern::vect_pattern): Make
        m_ops optional.
        * doc/md.texi (vec_addsub<mode>3): Document.

        * gcc.target/i386/vect-addsubv2df.c: New testcase.
        * gcc.target/i386/vect-addsubv4sf.c: Likewise.
        * gcc.target/i386/vect-addsubv4df.c: Likewise.
        * gcc.target/i386/vect-addsubv8sf.c: Likewise.
---
 gcc/config/i386/i386-builtin.def              |   8 +-
 gcc/config/i386/sse.md                        |   8 +-
 gcc/doc/md.texi                               |   8 ++
 gcc/internal-fn.def                           |   1 +
 gcc/optabs.def                                |   1 +
 .../gcc.target/i386/vect-addsubv2df.c         |  42 ++++++++
 .../gcc.target/i386/vect-addsubv4df.c         |  36 +++++++
 .../gcc.target/i386/vect-addsubv4sf.c         |  46 ++++++++
 .../gcc.target/i386/vect-addsubv8sf.c         |  46 ++++++++
 gcc/tree-vect-slp-patterns.c                  | 100 ++++++++++++++++++
 gcc/tree-vectorizer.h                         |   3 +-
 11 files changed, 290 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index 80c2a2c0294..30719e5e4bd 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF)
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
-BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
+BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF)
 BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF)
@@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128
 /* AVX */
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF)
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 94296bc773b..038e2cd90d4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2396,7 +2396,7 @@
    (set_attr "prefix" "<round_saeonly_scalar_prefix>")
    (set_attr "mode" "<ssescalarmode>")])

-(define_insn "avx_addsubv4df3"
+(define_insn "vec_addsubv4df3"
   [(set (match_operand:V4DF 0 "register_operand" "=x")
         (vec_merge:V4DF
           (minus:V4DF
@@ -2410,7 +2410,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V4DF")])

-(define_insn "sse3_addsubv2df3"
+(define_insn "vec_addsubv2df3"
   [(set (match_operand:V2DF 0 "register_operand" "=x,x")
         (vec_merge:V2DF
           (minus:V2DF
@@ -2428,7 +2428,7 @@
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "V2DF")])

-(define_insn "avx_addsubv8sf3"
+(define_insn "vec_addsubv8sf3"
   [(set (match_operand:V8SF 0 "register_operand" "=x")
         (vec_merge:V8SF
           (minus:V8SF
@@ -2442,7 +2442,7 @@
    (set_attr "prefix" "vex")
    (set_attr "mode" "V8SF")])

-(define_insn "sse3_addsubv4sf3"
+(define_insn "vec_addsubv4sf3"
   [(set (match_operand:V4SF 0 "register_operand" "=x,x")
         (vec_merge:V4SF
           (minus:V4SF
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 00caf3844cc..1b918144330 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5682,6 +5682,14 @@ signed/unsigned elements of size S@.  Subtract the high/low elements of 2 from
 1 and widen the resulting elements. Put the N/2 results of size 2*S in the
 output vector (operand 0).

+@cindex @code{vec_addsub@var{m}3} instruction pattern
+@item @samp{vec_addsub@var{m}3}
+Alternating subtract, add with even lanes doing subtract and odd
+lanes doing addition.  Operands 1 and 2 and the output operand are vectors
+with mode @var{m}.
+
+These instructions are not allowed to @code{FAIL}.
+
 @cindex @code{mulhisi3} instruction pattern
 @item @samp{mulhisi3}
 Multiply operands 1 and 2, which have mode @code{HImode}, and store
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index b2f414d2131..c3b8e730960 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
 DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary)

 /* FP scales.  */
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..41ab2598eb6 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a")
 OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a")
 OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a")
 OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a")
+OPTAB_D (vec_addsub_optab, "vec_addsub$a3")

 OPTAB_D (sync_add_optab, "sync_add$I$a")
 OPTAB_D (sync_and_optab, "sync_and$I$a")
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
new file mode 100644
index 00000000000..547485d5519
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c
@@ -0,0 +1,42 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+double x[2], y[2], z[2];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+}
+static void
+TEST (void)
+{
+  for (int i = 0; i < 2; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
new file mode 100644
index 00000000000..e0a1b3d9d00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c
@@ -0,0 +1,36 @@
+/* { dg-do run { target avx_runtime } } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */
+
+double x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+int main()
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
new file mode 100644
index 00000000000..b524f0c35a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c
@@ -0,0 +1,46 @@
+/* { dg-do run } */
+/* { dg-require-effective-target sse3 } */
+/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#ifndef TEST
+#define TEST sse3_test
+#endif
+
+#include CHECK_H
+
+float x[4], y[4], z[4];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+}
+static void
+TEST (void)
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3)
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c
new file mode 100644
index 00000000000..0eed33b6531
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c
@@ -0,0 +1,46 @@
+/* { dg-do run { target avx_runtime } } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */
+
+float x[8], y[8], z[8];
+void __attribute__((noipa)) foo ()
+{
+  x[0] = y[0] - z[0];
+  x[1] = y[1] + z[1];
+  x[2] = y[2] - z[2];
+  x[3] = y[3] + z[3];
+  x[4] = y[4] - z[4];
+  x[5] = y[5] + z[5];
+  x[6] = y[6] - z[6];
+  x[7] = y[7] + z[7];
+}
+void __attribute__((noipa)) bar ()
+{
+  x[0] = y[0] + z[0];
+  x[1] = y[1] - z[1];
+  x[2] = y[2] + z[2];
+  x[3] = y[3] - z[3];
+  x[4] = y[4] + z[4];
+  x[5] = y[5] - z[5];
+  x[6] = y[6] + z[6];
+  x[7] = y[7] - z[7];
+}
+int main()
+{
+  for (int i = 0; i < 8; ++i)
+    {
+      y[i] = i + 1;
+      z[i] = 2 * i + 1;
+    }
+  foo ();
+  if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11
+      || x[4] != -4 || x[5] != 17 || x[6] != -6 || x[7] != 23)
+    __builtin_abort ();
+  bar ();
+  if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3
+      || x[4] != 14 || x[5] != -5 || x[6] != 20 || x[7] != -7)
+    __builtin_abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2ed49cd9edc..d536494a1bd 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1490,6 +1490,105 @@ complex_operations_pattern::build (vec_info * /* vinfo */)
   gcc_unreachable ();
 }

+
+/* The addsub_pattern.  */
+
+class addsub_pattern : public vect_pattern
+{
+ public:
+  addsub_pattern (slp_tree *node)
+    : vect_pattern (node, NULL, IFN_VEC_ADDSUB) {};
+
+  void build (vec_info *);
+
+  static vect_pattern*
+  recognize (slp_tree_to_load_perm_map_t *, slp_tree *);
+};
+
+vect_pattern *
+addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
+{
+  slp_tree node = *node_;
+  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
+      || SLP_TREE_CHILDREN (node).length () != 2)
+    return NULL;
+
+  /* Match a blend of a plus and a minus op with the same number of plus and
+     minus lanes on the same operands.  */
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  bool swapped_p = false;
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    {
+      std::swap (add, sub);
+      swapped_p = true;
+    }
+  if (!(vect_match_expression_p (add, PLUS_EXPR)
+        && vect_match_expression_p (sub, MINUS_EXPR)))
+    return NULL;
+  if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0]
+         && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1])
+        || (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1]
+            && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0])))
+    return NULL;
+
+  for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i)
+    {
+      std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i];
+      if (swapped_p)
+        perm.first = perm.first == 0 ? 1 : 0;
+      /* It has to be alternating -, +, -, ...
+         While we could permute the .ADDSUB inputs and the .ADDSUB output
+         that's only profitable over the add + sub + blend if at least
+         one of the permutes is optimized which we can't determine here.  */
+      if (perm.first != (i & 1)
+          || perm.second != i)
+        return NULL;
+    }
+
+  if (!vect_pattern_validate_optab (IFN_VEC_ADDSUB, node))
+    return NULL;
+
+  return new addsub_pattern (node_);
+}
+
+void
+addsub_pattern::build (vec_info *vinfo)
+{
+  slp_tree node = *m_node;
+
+  slp_tree sub = SLP_TREE_CHILDREN (node)[0];
+  slp_tree add = SLP_TREE_CHILDREN (node)[1];
+  if (vect_match_expression_p (sub, PLUS_EXPR))
+    std::swap (add, sub);
+
+  /* Modify the blend node in-place.  */
+  SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0];
+  SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1];
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++;
+  SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++;
+
+  /* Build IFN_VEC_ADDSUB from the sub representative operands.  */
+  stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub);
+  gcall *call = gimple_build_call_internal (IFN_VEC_ADDSUB, 2,
+                                            gimple_assign_rhs1 (rep->stmt),
+                                            gimple_assign_rhs2 (rep->stmt));
+  gimple_call_set_lhs (call, make_ssa_name
+                       (TREE_TYPE (gimple_assign_lhs (rep->stmt))));
+  gimple_call_set_nothrow (call, true);
+  gimple_set_bb (call, gimple_bb (rep->stmt));
+  SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep);
+  STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope;
+  STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp;
+  STMT_VINFO_VECTYPE (SLP_TREE_REPRESENTATIVE (node)) = SLP_TREE_VECTYPE (node);
+  STMT_VINFO_SLP_VECT_ONLY_PATTERN (SLP_TREE_REPRESENTATIVE (node)) = true;
+  SLP_TREE_CODE (node) = ERROR_MARK;
+  SLP_TREE_LANE_PERMUTATION (node).release ();
+
+  vect_free_slp_tree (sub);
+  vect_free_slp_tree (add);
+}
+
 /*******************************************************************************
 * Pattern matching definitions
 ******************************************************************************/
@@ -1502,6 +1601,7 @@ vect_pattern_decl_t slp_patterns[]
    overlap in what they can detect.  */

   SLP_PATTERN (complex_operations_pattern),
+  SLP_PATTERN (addsub_pattern)
 };
 #undef SLP_PATTERN

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 04c20f8bd0f..b9824623ad9 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2100,7 +2100,8 @@ class vect_pattern
       this->m_ifn = ifn;
       this->m_node = node;
       this->m_ops.create (0);
-      this->m_ops.safe_splice (*m_ops);
+      if (m_ops)
+        this->m_ops.safe_splice (*m_ops);
     }

   public:
Richard Biener <rguenther@suse.de> writes:
> On Thu, 17 Jun 2021, Uros Bizjak wrote:
>
>> On Thu, Jun 17, 2021 at 11:44 AM Richard Biener <rguenther@suse.de> wrote:
>> >
>> > This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
>> > instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
>> > thus subtract, add alternating on lanes, starting with subtract.
>> >
>> > It adds a corresponding optab and direct internal function,
>> > vec_subadd$a3, and at the moment, to make the i386 backend changes
>> > "obvious", duplicates the existing avx_addsubv4df3 pattern with
>> > the new canonical name (CODE_FOR_* and gen_* are used throughout the
>> > intrinsic code, so the actual change to rename all existing patterns
>> > will be quite a bit bigger).  I expect some bike-shedding on
>> > subadd vs. addsub so I delay that change ;)
>>
>> Well, the pattern is called addsub in the x86 world because highpart
>> does add and lowpart does sub.  In left-to-right writing systems
>> highpart comes before lowpart, so you have addsub.
>
> The other targets mentioned do not seem to agree but I can live
> with that, thus I'll change back to addsub.

FWIW, subadd sounds clearer to me too.  It seems surprising to put
imaginary before real when interpreting something as complex, for example.

Putting the highpart first feels especially odd on an LE system like x86…

Richard
On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:

> >> Well, the pattern is called addsub in the x86 world because highpart
> >> does add and lowpart does sub.  In left-to-right writing systems
> >> highpart comes before lowpart, so you have addsub.
> >
> > The other targets mentioned do not seem to agree but I can live
> > with that, thus I'll change back to addsub.
>
> FWIW, subadd sounds clearer to me too.  It seems surprising to put
> imaginary before real when interpreting something as complex, for example.
>
> Putting the highpart first feels especially odd on an LE system like x86…

The XMM vector is documented left to right with the MSB at the left
(c.f. the most significant *DIGIT* of a number being at the left):

  xmm[MSB] .... xmm[LSB]

so, looking at the x86 ADDSUBPD insn documentation:

  xmm2[127:64]   xmm2[63:0]
      (+             -)
  xmm1[127:64]   xmm1[63:0]
      (=)
  xmm1[127:64] holds the ADD result
  xmm1[63:0]   holds the SUB result

  xmm1[127:64]   xmm1[63:0]
      ADD            SUB

Uros.
On Tue, 22 Jun 2021, Uros Bizjak wrote:

> On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>
> > >> Well, the pattern is called addsub in the x86 world because highpart
> > >> does add and lowpart does sub.  In left-to-right writing systems
> > >> highpart comes before lowpart, so you have addsub.
> > >
> > > The other targets mentioned do not seem to agree but I can live
> > > with that, thus I'll change back to addsub.
> >
> > FWIW, subadd sounds clearer to me too.  It seems surprising to put
> > imaginary before real when interpreting something as complex, for example.
> >
> > Putting the highpart first feels especially odd on an LE system like x86…
>
> The XMM vector is documented left to right with MSB at the left (c.f.
> most significant *DIGIT* of the number at the left)
>
>   xmm[MSB] .... xmm[LSB]
>
> so, looking at x86 ADDSUBPD insn documentation:
>
>   xmm2[127:64]   xmm2[63:0]
>       (+             -)
>   xmm1[127:64]   xmm1[63:0]
>       (=)
>   xmm1[127:64] holds ADD
>   xmm1[63:0]   holds SUB
>
>   xmm1[127:64]   xmm1[63:0]
>       ADD            SUB

I think if we really want to resolve the "ambiguity" we have to
spell it out in the optab: vec_addodd_subeven or vec_addsub_oddeven.
As I noted, there are targets that have the opposite, so we could
then add vec_addsub_evenodd (not vec_subadd_evenodd).

Just tell me what you prefer - I'll adjust the patch accordingly.

Richard.
On Tue, Jun 22, 2021 at 12:34 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 22 Jun 2021, Uros Bizjak wrote:
>
> > On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >
> > > >> Well, the pattern is called addsub in the x86 world because highpart
> > > >> does add and lowpart does sub.  In left-to-right writing systems
> > > >> highpart comes before lowpart, so you have addsub.
> > > >
> > > > The other targets mentioned do not seem to agree but I can live
> > > > with that, thus I'll change back to addsub.
> > >
> > > FWIW, subadd sounds clearer to me too.  It seems surprising to put
> > > imaginary before real when interpreting something as complex, for example.
> > >
> > > Putting the highpart first feels especially odd on an LE system like x86…
> >
> > The XMM vector is documented left to right with MSB at the left (c.f.
> > most significant *DIGIT* of the number at the left)
> >
> >   xmm[MSB] .... xmm[LSB]
> >
> > so, looking at x86 ADDSUBPD insn documentation:
> >
> >   xmm2[127:64]   xmm2[63:0]
> >       (+             -)
> >   xmm1[127:64]   xmm1[63:0]
> >       (=)
> >   xmm1[127:64] holds ADD
> >   xmm1[63:0]   holds SUB
> >
> >   xmm1[127:64]   xmm1[63:0]
> >       ADD            SUB
>
> I think if we really want to resolve the "ambiguity" we have to
> spell it out in the optab: vec_addodd_subeven or vec_addsub_oddeven.
> As I noted, there are targets who have the opposite, so we could
> then add vec_addsub_evenodd (not vec_subadd_evenodd).
>
> Just tell me what you prefer - I'll adjust the patch accordingly.

I'd use addsub when add comes at the left of sub, with the MSB also
considered at the left, and subadd when sub comes at the left of add,
with the MSB also at the left.

I think that the documentation should clear up the ambiguity.

Otherwise, just my 0.02€ - I don't want to bikeshed about this ad
infinitum, so I'll stop there.

Uros.
On Tue, 22 Jun 2021, Uros Bizjak wrote:

> On Tue, Jun 22, 2021 at 12:34 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Tue, 22 Jun 2021, Uros Bizjak wrote:
> >
> > > On Tue, Jun 22, 2021 at 11:42 AM Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >
> > > > >> Well, the pattern is called addsub in the x86 world because highpart
> > > > >> does add and lowpart does sub.  In left-to-right writing systems
> > > > >> highpart comes before lowpart, so you have addsub.
> > > > >
> > > > > The other targets mentioned do not seem to agree but I can live
> > > > > with that, thus I'll change back to addsub.
> > > >
> > > > FWIW, subadd sounds clearer to me too.  It seems surprising to put
> > > > imaginary before real when interpreting something as complex, for example.
> > > >
> > > > Putting the highpart first feels especially odd on an LE system like x86…
> > >
> > > The XMM vector is documented left to right with MSB at the left (c.f.
> > > most significant *DIGIT* of the number at the left)
> > >
> > >   xmm[MSB] .... xmm[LSB]
> > >
> > > so, looking at x86 ADDSUBPD insn documentation:
> > >
> > >   xmm2[127:64]   xmm2[63:0]
> > >       (+             -)
> > >   xmm1[127:64]   xmm1[63:0]
> > >       (=)
> > >   xmm1[127:64] holds ADD
> > >   xmm1[63:0]   holds SUB
> > >
> > >   xmm1[127:64]   xmm1[63:0]
> > >       ADD            SUB
> >
> > I think if we really want to resolve the "ambiguity" we have to
> > spell it out in the optab: vec_addodd_subeven or vec_addsub_oddeven.
> > As I noted, there are targets who have the opposite, so we could
> > then add vec_addsub_evenodd (not vec_subadd_evenodd).
> >
> > Just tell me what you prefer - I'll adjust the patch accordingly.
>
> I'd use addsub when add comes at the left of sub, and MSB is also
> considered at the left, and subadd for when sub comes at the left of
> add, where MSB is also at the left.
>
> I think that the documentation should clear up the ambiguity.
>
> Otherwise, just my 0.02€, I don't want to bikeshed about this ad
> infinitum, so I'll stop there.
Yeah, so I've not changed the naming, but after the permute propagation
fix I just pushed I've added a testcase that would be broken without it
plus the new CFN_VEC_ADDSUB case (which I now did).

Re-bootstrapped and tested on x86_64-unknown-linux-gnu, now pushed
since it helps me working with Tamar on the SLP pattern reorg to have
at least one matching pattern on x86_64.

Richard.

From a6bfb9106f5605c0ce6ffe1ed91b50510e43343b Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Mon, 31 May 2021 13:19:01 +0200
Subject: [PATCH] Add x86 addsub SLP pattern
To: gcc-patches@gcc.gnu.org

This adds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1
instructions which compute { v0[0] - v1[0], v0[1] + v1[1], ... },
thus subtract, add alternating on lanes, starting with subtract.

It adds a corresponding optab and direct internal function,
vec_addsub$a3, and renames the existing i386 backend patterns to
the new canonical name.

The SLP pattern matches the exact alternating lane sequence rather
than trying to be clever and anticipating incoming permutes - we
could permute the two input vectors to the needed lane alternation,
do the addsub and then permute the result vector back, but that's
only profitable in case the two input permutes or the output permute
will vanish - something Tamar's refactoring of SLP pattern recog
should make possible.

2021-06-17  Richard Biener  <rguenther@suse.de>

        * config/i386/sse.md (avx_addsubv4df3): Rename to vec_addsubv4df3.
        (avx_addsubv8sf3): Rename to vec_addsubv8sf3.
        (sse3_addsubv2df3): Rename to vec_addsubv2df3.
        (sse3_addsubv4sf3): Rename to vec_addsubv4sf3.
        * internal-fn.def (VEC_ADDSUB): New internal optab fn.
        * optabs.def (vec_addsub_optab): New optab.
        * tree-vect-slp-patterns.c (class addsub_pattern): New.
        (slp_patterns): Add addsub_pattern.
        (vect_optimize_slp): Disable propagation across CFN_VEC_ADDSUB.
        * tree-vectorizer.h (vect_pattern::vect_pattern): Make
        m_ops optional.
        * doc/md.texi (vec_addsub<mode>3): Document.
* gcc.target/i386/vect-addsubv2df.c: New testcase. * gcc.target/i386/vect-addsubv4sf.c: Likewise. * gcc.target/i386/vect-addsubv4df.c: Likewise. * gcc.target/i386/vect-addsubv8sf.c: Likewise. * gcc.target/i386/vect-addsub-2.c: Likewise. * gcc.target/i386/vect-addsub-3.c: Likewise. --- gcc/config/i386/i386-builtin.def | 8 +- gcc/config/i386/sse.md | 8 +- gcc/doc/md.texi | 8 ++ gcc/internal-fn.def | 1 + gcc/optabs.def | 1 + gcc/testsuite/gcc.target/i386/vect-addsub-2.c | 21 ++++ gcc/testsuite/gcc.target/i386/vect-addsub-3.c | 38 +++++++ .../gcc.target/i386/vect-addsubv2df.c | 42 ++++++++ .../gcc.target/i386/vect-addsubv4df.c | 36 +++++++ .../gcc.target/i386/vect-addsubv4sf.c | 46 ++++++++ .../gcc.target/i386/vect-addsubv8sf.c | 46 ++++++++ gcc/tree-vect-slp-patterns.c | 100 ++++++++++++++++++ gcc/tree-vect-slp.c | 1 + gcc/tree-vectorizer.h | 3 +- 14 files changed, 350 insertions(+), 9 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsub-2.c create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsub-3.c create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv2df.c create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4df.c create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c create mode 100644 gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def index 31df3a613dd..ea79e0bdda2 100644 --- a/gcc/config/i386/i386-builtin.def +++ b/gcc/config/i386/i386-builtin.def @@ -855,8 +855,8 @@ BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_mmx_subv1di3, "__ BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movshdup, "__builtin_ia32_movshdup", IX86_BUILTIN_MOVSHDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_movsldup, "__builtin_ia32_movsldup", IX86_BUILTIN_MOVSLDUP, UNKNOWN, (int) V4SF_FTYPE_V4SF) -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, 
(int) V4SF_FTYPE_V4SF_V4SF) -BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv4sf3, "__builtin_ia32_addsubps", IX86_BUILTIN_ADDSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) +BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_vec_addsubv2df3, "__builtin_ia32_addsubpd", IX86_BUILTIN_ADDSUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv4sf3, "__builtin_ia32_haddps", IX86_BUILTIN_HADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_haddv2df3, "__builtin_ia32_haddpd", IX86_BUILTIN_HADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF) BDESC (OPTION_MASK_ISA_SSE3, 0, CODE_FOR_sse3_hsubv4sf3, "__builtin_ia32_hsubps", IX86_BUILTIN_HSUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF) @@ -996,8 +996,8 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_pclmulqdq, 0, IX86_BUILTIN_PCLMULQDQ128 /* AVX */ BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv4df3, "__builtin_ia32_addpd256", IX86_BUILTIN_ADDPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_addv8sf3, "__builtin_ia32_addps256", IX86_BUILTIN_ADDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv4df3, "__builtin_ia32_addsubpd256", IX86_BUILTIN_ADDSUBPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_vec_addsubv8sf3, "__builtin_ia32_addsubps256", IX86_BUILTIN_ADDSUBPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv4df3, "__builtin_ia32_andpd256", IX86_BUILTIN_ANDPD256, UNKNOWN, (int) 
V4DF_FTYPE_V4DF_V4DF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_andv8sf3, "__builtin_ia32_andps256", IX86_BUILTIN_ANDPS256, UNKNOWN, (int) V8SF_FTYPE_V8SF_V8SF) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_andnotv4df3, "__builtin_ia32_andnpd256", IX86_BUILTIN_ANDNPD256, UNKNOWN, (int) V4DF_FTYPE_V4DF_V4DF) diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 5bd65dd9312..1f1db8214cc 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -2410,7 +2410,7 @@ (set_attr "prefix" "<round_saeonly_scalar_prefix>") (set_attr "mode" "<ssescalarmode>")]) -(define_insn "avx_addsubv4df3" +(define_insn "vec_addsubv4df3" [(set (match_operand:V4DF 0 "register_operand" "=x") (vec_merge:V4DF (minus:V4DF @@ -2424,7 +2424,7 @@ (set_attr "prefix" "vex") (set_attr "mode" "V4DF")]) -(define_insn "sse3_addsubv2df3" +(define_insn "vec_addsubv2df3" [(set (match_operand:V2DF 0 "register_operand" "=x,x") (vec_merge:V2DF (minus:V2DF @@ -2442,7 +2442,7 @@ (set_attr "prefix" "orig,vex") (set_attr "mode" "V2DF")]) -(define_insn "avx_addsubv8sf3" +(define_insn "vec_addsubv8sf3" [(set (match_operand:V8SF 0 "register_operand" "=x") (vec_merge:V8SF (minus:V8SF @@ -2456,7 +2456,7 @@ (set_attr "prefix" "vex") (set_attr "mode" "V8SF")]) -(define_insn "sse3_addsubv4sf3" +(define_insn "vec_addsubv4sf3" [(set (match_operand:V4SF 0 "register_operand" "=x,x") (vec_merge:V4SF (minus:V4SF diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index 00caf3844cc..1b918144330 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5682,6 +5682,14 @@ signed/unsigned elements of size S@. Subtract the high/low elements of 2 from 1 and widen the resulting elements. Put the N/2 results of size 2*S in the output vector (operand 0). +@cindex @code{vec_addsub@var{m}3} instruction pattern +@item @samp{vec_addsub@var{m}3} +Alternating subtract, add with even lanes doing subtract and odd +lanes doing addition. Operands 1 and 2 and the output operand are vectors +with mode @var{m}. 
+ +These instructions are not allowed to @code{FAIL}. + @cindex @code{mulhisi3} instruction pattern @item @samp{mulhisi3} Multiply operands 1 and 2, which have mode @code{HImode}, and store diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index b2f414d2131..c3b8e730960 100644 --- a/gcc/internal-fn.def +++ b/gcc/internal-fn.def @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary) DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary) DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary) DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary) +DEF_INTERNAL_OPTAB_FN (VEC_ADDSUB, ECF_CONST, vec_addsub, binary) /* FP scales. */ diff --git a/gcc/optabs.def b/gcc/optabs.def index b192a9d070b..41ab2598eb6 100644 --- a/gcc/optabs.def +++ b/gcc/optabs.def @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a") OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a") OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a") OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a") +OPTAB_D (vec_addsub_optab, "vec_addsub$a3") OPTAB_D (sync_add_optab, "sync_add$I$a") OPTAB_D (sync_and_optab, "sync_and$I$a") diff --git a/gcc/testsuite/gcc.target/i386/vect-addsub-2.c b/gcc/testsuite/gcc.target/i386/vect-addsub-2.c new file mode 100644 index 00000000000..a6b941461e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-addsub-2.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target sse3 } */ +/* { dg-options "-O3 -msse3" } */ + +float a[1024], b[1024]; + +void foo() +{ + for (int i = 0; i < 256; i++) + { + a[4*i+0] = a[4*i+0] - b[4*i+0]; + a[4*i+1] = a[4*i+1] + b[4*i+1]; + a[4*i+2] = a[4*i+2] - b[4*i+2]; + a[4*i+3] = a[4*i+3] + b[4*i+3]; + } +} + +/* We should be able to vectorize this with SLP using the addsub + SLP pattern. 
*/ +/* { dg-final { scan-assembler "addsubps" } } */ +/* { dg-final { scan-assembler-not "shuf" } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-addsub-3.c b/gcc/testsuite/gcc.target/i386/vect-addsub-3.c new file mode 100644 index 00000000000..b27ee56bd73 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-addsub-3.c @@ -0,0 +1,38 @@ +/* { dg-do run } */ +/* { dg-require-effective-target sse3 } */ +/* { dg-options "-O3 -msse3" } */ + +#ifndef CHECK_H +#define CHECK_H "sse3-check.h" +#endif + +#ifndef TEST +#define TEST sse3_test +#endif + +#include CHECK_H + +double a[2], b[2], c[2]; + +void __attribute__((noipa)) +foo () +{ + /* When we want to use addsubpd we have to keep permuting both + loads, if instead we blend the result of an add and a sub we + can combine the blend with the permute. Both are similar in cost, + verify we did not wrongly apply both. */ + double tem0 = a[1] - b[1]; + double tem1 = a[0] + b[0]; + c[0] = tem0; + c[1] = tem1; +} + +static void +TEST (void) +{ + a[0] = 1.; a[1] = 2.; + b[0] = 2.; b[1] = 4.; + foo (); + if (c[0] != -2. || c[1] != 3.) 
+ __builtin_abort (); +} diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c new file mode 100644 index 00000000000..547485d5519 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-addsubv2df.c @@ -0,0 +1,42 @@ +/* { dg-do run } */ +/* { dg-require-effective-target sse3 } */ +/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */ + +#ifndef CHECK_H +#define CHECK_H "sse3-check.h" +#endif + +#ifndef TEST +#define TEST sse3_test +#endif + +#include CHECK_H + +double x[2], y[2], z[2]; +void __attribute__((noipa)) foo () +{ + x[0] = y[0] - z[0]; + x[1] = y[1] + z[1]; +} +void __attribute__((noipa)) bar () +{ + x[0] = y[0] + z[0]; + x[1] = y[1] - z[1]; +} +static void +TEST (void) +{ + for (int i = 0; i < 2; ++i) + { + y[i] = i + 1; + z[i] = 2 * i + 1; + } + foo (); + if (x[0] != 0 || x[1] != 5) + __builtin_abort (); + bar (); + if (x[0] != 2 || x[1] != -1) + __builtin_abort (); +} + +/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c new file mode 100644 index 00000000000..e0a1b3d9d00 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4df.c @@ -0,0 +1,36 @@ +/* { dg-do run { target avx_runtime } } */ +/* { dg-require-effective-target avx } */ +/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */ + +double x[4], y[4], z[4]; +void __attribute__((noipa)) foo () +{ + x[0] = y[0] - z[0]; + x[1] = y[1] + z[1]; + x[2] = y[2] - z[2]; + x[3] = y[3] + z[3]; +} +void __attribute__((noipa)) bar () +{ + x[0] = y[0] + z[0]; + x[1] = y[1] - z[1]; + x[2] = y[2] + z[2]; + x[3] = y[3] - z[3]; +} +int main() +{ + for (int i = 0; i < 4; ++i) + { + y[i] = i + 1; + z[i] = 2 * i + 1; + } + foo (); + if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11) + __builtin_abort (); + bar (); + if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3) + __builtin_abort (); + return 0; +} + +/* { dg-final { 
scan-tree-dump-times "ADDSUB" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c new file mode 100644 index 00000000000..b524f0c35a8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-addsubv4sf.c @@ -0,0 +1,46 @@ +/* { dg-do run } */ +/* { dg-require-effective-target sse3 } */ +/* { dg-options "-O3 -msse3 -fdump-tree-slp2" } */ + +#ifndef CHECK_H +#define CHECK_H "sse3-check.h" +#endif + +#ifndef TEST +#define TEST sse3_test +#endif + +#include CHECK_H + +float x[4], y[4], z[4]; +void __attribute__((noipa)) foo () +{ + x[0] = y[0] - z[0]; + x[1] = y[1] + z[1]; + x[2] = y[2] - z[2]; + x[3] = y[3] + z[3]; +} +void __attribute__((noipa)) bar () +{ + x[0] = y[0] + z[0]; + x[1] = y[1] - z[1]; + x[2] = y[2] + z[2]; + x[3] = y[3] - z[3]; +} +static void +TEST (void) +{ + for (int i = 0; i < 4; ++i) + { + y[i] = i + 1; + z[i] = 2 * i + 1; + } + foo (); + if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11) + __builtin_abort (); + bar (); + if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3) + __builtin_abort (); +} + +/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c new file mode 100644 index 00000000000..0eed33b6531 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-addsubv8sf.c @@ -0,0 +1,46 @@ +/* { dg-do run { target avx_runtime } } */ +/* { dg-require-effective-target avx } */ +/* { dg-options "-O3 -mavx -fdump-tree-slp2" } */ + +float x[8], y[8], z[8]; +void __attribute__((noipa)) foo () +{ + x[0] = y[0] - z[0]; + x[1] = y[1] + z[1]; + x[2] = y[2] - z[2]; + x[3] = y[3] + z[3]; + x[4] = y[4] - z[4]; + x[5] = y[5] + z[5]; + x[6] = y[6] - z[6]; + x[7] = y[7] + z[7]; +} +void __attribute__((noipa)) bar () +{ + x[0] = y[0] + z[0]; + x[1] = y[1] - z[1]; + x[2] = y[2] + z[2]; + x[3] = y[3] - z[3]; + x[4] = y[4] + z[4]; + x[5] = y[5] - z[5]; + x[6] = y[6] + z[6]; 
+ x[7] = y[7] - z[7]; +} +int main() +{ + for (int i = 0; i < 8; ++i) + { + y[i] = i + 1; + z[i] = 2 * i + 1; + } + foo (); + if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11 + || x[4] != -4 || x[5] != 17 || x[6] != -6 || x[7] != 23) + __builtin_abort (); + bar (); + if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3 + || x[4] != 14 || x[5] != -5 || x[6] != 20 || x[7] != -7) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "ADDSUB" 1 "slp2" } } */ diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c index 2ed49cd9edc..d536494a1bd 100644 --- a/gcc/tree-vect-slp-patterns.c +++ b/gcc/tree-vect-slp-patterns.c @@ -1490,6 +1490,105 @@ complex_operations_pattern::build (vec_info * /* vinfo */) gcc_unreachable (); } + +/* The addsub_pattern. */ + +class addsub_pattern : public vect_pattern +{ + public: + addsub_pattern (slp_tree *node) + : vect_pattern (node, NULL, IFN_VEC_ADDSUB) {}; + + void build (vec_info *); + + static vect_pattern* + recognize (slp_tree_to_load_perm_map_t *, slp_tree *); +}; + +vect_pattern * +addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_) +{ + slp_tree node = *node_; + if (SLP_TREE_CODE (node) != VEC_PERM_EXPR + || SLP_TREE_CHILDREN (node).length () != 2) + return NULL; + + /* Match a blend of a plus and a minus op with the same number of plus and + minus lanes on the same operands. 
*/ + slp_tree sub = SLP_TREE_CHILDREN (node)[0]; + slp_tree add = SLP_TREE_CHILDREN (node)[1]; + bool swapped_p = false; + if (vect_match_expression_p (sub, PLUS_EXPR)) + { + std::swap (add, sub); + swapped_p = true; + } + if (!(vect_match_expression_p (add, PLUS_EXPR) + && vect_match_expression_p (sub, MINUS_EXPR))) + return NULL; + if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0] + && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1]) + || (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1] + && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0]))) + return NULL; + + for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i) + { + std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i]; + if (swapped_p) + perm.first = perm.first == 0 ? 1 : 0; + /* It has to be alternating -, +, -, ... + While we could permute the .ADDSUB inputs and the .ADDSUB output + that's only profitable over the add + sub + blend if at least + one of the permute is optimized which we can't determine here. */ + if (perm.first != (i & 1) + || perm.second != i) + return NULL; + } + + if (!vect_pattern_validate_optab (IFN_VEC_ADDSUB, node)) + return NULL; + + return new addsub_pattern (node_); +} + +void +addsub_pattern::build (vec_info *vinfo) +{ + slp_tree node = *m_node; + + slp_tree sub = SLP_TREE_CHILDREN (node)[0]; + slp_tree add = SLP_TREE_CHILDREN (node)[1]; + if (vect_match_expression_p (sub, PLUS_EXPR)) + std::swap (add, sub); + + /* Modify the blend node in-place. */ + SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0]; + SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1]; + SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++; + SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++; + + /* Build IFN_VEC_ADDSUB from the sub representative operands. 
*/ + stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub); + gcall *call = gimple_build_call_internal (IFN_VEC_ADDSUB, 2, + gimple_assign_rhs1 (rep->stmt), + gimple_assign_rhs2 (rep->stmt)); + gimple_call_set_lhs (call, make_ssa_name + (TREE_TYPE (gimple_assign_lhs (rep->stmt)))); + gimple_call_set_nothrow (call, true); + gimple_set_bb (call, gimple_bb (rep->stmt)); + SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep); + STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope; + STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp; + STMT_VINFO_VECTYPE (SLP_TREE_REPRESENTATIVE (node)) = SLP_TREE_VECTYPE (node); + STMT_VINFO_SLP_VECT_ONLY_PATTERN (SLP_TREE_REPRESENTATIVE (node)) = true; + SLP_TREE_CODE (node) = ERROR_MARK; + SLP_TREE_LANE_PERMUTATION (node).release (); + + vect_free_slp_tree (sub); + vect_free_slp_tree (add); +} + /******************************************************************************* * Pattern matching definitions ******************************************************************************/ @@ -1502,6 +1601,7 @@ vect_pattern_decl_t slp_patterns[] overlap in what they can detect. */ SLP_PATTERN (complex_operations_pattern), + SLP_PATTERN (addsub_pattern) }; #undef SLP_PATTERN diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index 69ee8faed09..227d6aa3ee8 100644 --- a/gcc/tree-vect-slp.c +++ b/gcc/tree-vect-slp.c @@ -3705,6 +3705,7 @@ vect_optimize_slp (vec_info *vinfo) case CFN_COMPLEX_ADD_ROT270: case CFN_COMPLEX_MUL: case CFN_COMPLEX_MUL_CONJ: + case CFN_VEC_ADDSUB: continue; default:; } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 5c71fbc487f..fa28336d429 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2100,7 +2100,8 @@ class vect_pattern this->m_ifn = ifn; this->m_node = node; this->m_ops.create (0); - this->m_ops.safe_splice (*m_ops); + if (m_ops) + this->m_ops.safe_splice (*m_ops); } public:
On Thu, Jun 24, 2021 at 1:07 PM Richard Biener <rguenther@suse.de> wrote: > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... } > thus subtract, add alternating on lanes, starting with subtract. > > It adds a corresponding optab and direct internal function, > vec_addsub$a3 and renames the existing i386 backend patterns to > the new canonical name. > > The SLP pattern matches the exact alternating lane sequence rather > than trying to be clever and anticipating incoming permutes - we > could permute the two input vectors to the needed lane alternation, > do the addsub and then permute the result vector back but that's > only profitable in case the two input or the output permute will > vanish - something Tamars refactoring of SLP pattern recog should > make possible. Using the attached patch, I was also able to generate addsub for the following testcase: float x[2], y[2], z[2]; void foo () { x[0] = y[0] - z[0]; x[1] = y[1] + z[1]; } vmovq y(%rip), %xmm0 vmovq z(%rip), %xmm1 vaddsubps %xmm1, %xmm0, %xmm0 vmovlps %xmm0, x(%rip) ret Uros. 
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md index e887f03474d..5f10572718d 100644 --- a/gcc/config/i386/mmx.md +++ b/gcc/config/i386/mmx.md @@ -788,6 +788,24 @@ (define_insn "*mmx_haddsubv2sf3" (set_attr "prefix_extra" "1") (set_attr "mode" "V2SF")]) +(define_insn "vec_addsubv2sf3" + [(set (match_operand:V2SF 0 "register_operand" "=x,x") + (vec_merge:V2SF + (minus:V2SF + (match_operand:V2SF 1 "register_operand" "0,x") + (match_operand:V2SF 2 "register_operand" "x,x")) + (plus:V2SF (match_dup 1) (match_dup 2)) + (const_int 1)))] + "TARGET_SSE3 && TARGET_MMX_WITH_SSE" + "@ + addsubps\t{%2, %0|%0, %2} + vaddsubps\t{%2, %1, %0|%0, %1, %2}" + [(set_attr "isa" "noavx,avx") + (set_attr "type" "sseadd") + (set_attr "prefix" "orig,vex") + (set_attr "prefix_rep" "1,*") + (set_attr "mode" "V4SF")]) + ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; ;; Parallel single-precision floating point comparisons
On Thu, 24 Jun 2021, Uros Bizjak wrote: > On Thu, Jun 24, 2021 at 1:07 PM Richard Biener <rguenther@suse.de> wrote: > > > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... } > > thus subtract, add alternating on lanes, starting with subtract. > > > > It adds a corresponding optab and direct internal function, > > vec_addsub$a3 and renames the existing i386 backend patterns to > > the new canonical name. > > > > The SLP pattern matches the exact alternating lane sequence rather > > than trying to be clever and anticipating incoming permutes - we > > could permute the two input vectors to the needed lane alternation, > > do the addsub and then permute the result vector back but that's > > only profitable in case the two input or the output permute will > > vanish - something Tamars refactoring of SLP pattern recog should > > make possible. > > Using the attached patch, I was also able to generate addsub for the > following testcase: > > float x[2], y[2], z[2]; > > void foo () > { > x[0] = y[0] - z[0]; > x[1] = y[1] + z[1]; > } > > vmovq y(%rip), %xmm0 > vmovq z(%rip), %xmm1 > vaddsubps %xmm1, %xmm0, %xmm0 > vmovlps %xmm0, x(%rip) > ret Nice. But I suppose it can be merged with the other now single pattern? Richard.
On Fri, Jun 25, 2021 at 8:48 AM Richard Biener <rguenther@suse.de> wrote: > > On Thu, 24 Jun 2021, Uros Bizjak wrote: > > > On Thu, Jun 24, 2021 at 1:07 PM Richard Biener <rguenther@suse.de> wrote: > > > > > This addds SLP pattern recognition for the SSE3/AVX [v]addsubp{ds} v0, v1 > > > instructions which compute { v0[0] - v1[0], v0[1], + v1[1], ... } > > > thus subtract, add alternating on lanes, starting with subtract. > > > > > > It adds a corresponding optab and direct internal function, > > > vec_addsub$a3 and renames the existing i386 backend patterns to > > > the new canonical name. > > > > > > The SLP pattern matches the exact alternating lane sequence rather > > > than trying to be clever and anticipating incoming permutes - we > > > could permute the two input vectors to the needed lane alternation, > > > do the addsub and then permute the result vector back but that's > > > only profitable in case the two input or the output permute will > > > vanish - something Tamars refactoring of SLP pattern recog should > > > make possible. > > > > Using the attached patch, I was also able to generate addsub for the > > following testcase: > > > > float x[2], y[2], z[2]; > > > > void foo () > > { > > x[0] = y[0] - z[0]; > > x[1] = y[1] + z[1]; > > } > > > > vmovq y(%rip), %xmm0 > > vmovq z(%rip), %xmm1 > > vaddsubps %xmm1, %xmm0, %xmm0 > > vmovlps %xmm0, x(%rip) > > ret > > Nice. But I suppose it can be merged with the other now single > pattern? Actually, MMX_WITH_SSE V2SF modes do not support memory operands (they operate on V2SF subreg of V4SF SSE registers), so a new mode attribute would be needed to instantiate register operand in case of V2SF. Uros. > > Richard.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 94296bc773b..73238c9874a 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -2396,6 +2396,22 @@ (set_attr "prefix" "<round_saeonly_scalar_prefix>") (set_attr "mode" "<ssescalarmode>")]) +/* ??? Rename avx_addsubv4df3, but its name is used throughout the backend, + so avoid this churn for the moment. */ +(define_insn "vec_subaddv4df3" + [(set (match_operand:V4DF 0 "register_operand" "=x") + (vec_merge:V4DF + (minus:V4DF + (match_operand:V4DF 1 "register_operand" "x") + (match_operand:V4DF 2 "nonimmediate_operand" "xm")) + (plus:V4DF (match_dup 1) (match_dup 2)) + (const_int 5)))] + "TARGET_AVX" + "vaddsubpd\t{%2, %1, %0|%0, %1, %2}" + [(set_attr "type" "sseadd") + (set_attr "prefix" "vex") + (set_attr "mode" "V4DF")]) + (define_insn "avx_addsubv4df3" [(set (match_operand:V4DF 0 "register_operand" "=x") (vec_merge:V4DF diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index b2f414d2131..a4578a0293e 100644 --- a/gcc/internal-fn.def +++ b/gcc/internal-fn.def @@ -281,6 +281,7 @@ DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary) DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary) DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary) DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL_CONJ, ECF_CONST, cmul_conj, binary) +DEF_INTERNAL_OPTAB_FN (VEC_SUBADD, ECF_CONST, vec_subadd, binary) /* FP scales. 
*/ diff --git a/gcc/optabs.def b/gcc/optabs.def index b192a9d070b..cacbd02cfe1 100644 --- a/gcc/optabs.def +++ b/gcc/optabs.def @@ -407,6 +407,7 @@ OPTAB_D (vec_widen_usubl_hi_optab, "vec_widen_usubl_hi_$a") OPTAB_D (vec_widen_usubl_lo_optab, "vec_widen_usubl_lo_$a") OPTAB_D (vec_widen_uaddl_hi_optab, "vec_widen_uaddl_hi_$a") OPTAB_D (vec_widen_uaddl_lo_optab, "vec_widen_uaddl_lo_$a") +OPTAB_D (vec_subadd_optab, "vec_subadd$a3") OPTAB_D (sync_add_optab, "sync_add$I$a") OPTAB_D (sync_and_optab, "sync_and$I$a") diff --git a/gcc/testsuite/gcc.target/i386/vect-subadd-1.c b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c new file mode 100644 index 00000000000..a79e5ba92e2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-subadd-1.c @@ -0,0 +1,35 @@ +/* { dg-do run { target avx_runtime } } */ +/* { dg-options "-O3 -mavx2 -fdump-tree-slp2" } */ + +double x[4], y[4], z[4]; +void __attribute__((noipa)) foo () +{ + x[0] = y[0] - z[0]; + x[1] = y[1] + z[1]; + x[2] = y[2] - z[2]; + x[3] = y[3] + z[3]; +} +void __attribute__((noipa)) bar () +{ + x[0] = y[0] + z[0]; + x[1] = y[1] - z[1]; + x[2] = y[2] + z[2]; + x[3] = y[3] - z[3]; +} +int main() +{ + for (int i = 0; i < 4; ++i) + { + y[i] = i + 1; + z[i] = 2 * i + 1; + } + foo (); + if (x[0] != 0 || x[1] != 5 || x[2] != -2 || x[3] != 11) + __builtin_abort (); + bar (); + if (x[0] != 2 || x[1] != -1 || x[2] != 8 || x[3] != -3) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "SUBADD" 1 "slp2" } } */ diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c index 2ed49cd9edc..61732c53ad7 100644 --- a/gcc/tree-vect-slp-patterns.c +++ b/gcc/tree-vect-slp-patterns.c @@ -1490,6 +1490,103 @@ complex_operations_pattern::build (vec_info * /* vinfo */) gcc_unreachable (); } + +/* The subadd_pattern. 
*/ + +class subadd_pattern : public vect_pattern +{ + public: + subadd_pattern (slp_tree *node) + : vect_pattern (node, NULL, IFN_VEC_SUBADD) {}; + + void build (vec_info *); + + static vect_pattern* + recognize (slp_tree_to_load_perm_map_t *, slp_tree *); +}; + +vect_pattern * +subadd_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_) +{ + slp_tree node = *node_; + if (SLP_TREE_CODE (node) != VEC_PERM_EXPR + || SLP_TREE_CHILDREN (node).length () != 2) + return NULL; + + /* Match a blend of a plus and a minus op with the same number of plus and + minus lanes on the same operands. */ + slp_tree sub = SLP_TREE_CHILDREN (node)[0]; + slp_tree add = SLP_TREE_CHILDREN (node)[1]; + bool swapped_p = false; + if (vect_match_expression_p (sub, PLUS_EXPR)) + { + std::swap (add, sub); + swapped_p = true; + } + if (!(vect_match_expression_p (add, PLUS_EXPR) + && vect_match_expression_p (sub, MINUS_EXPR))) + return NULL; + if (!((SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[0] + && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[1]) + || (SLP_TREE_CHILDREN (sub)[0] == SLP_TREE_CHILDREN (add)[1] + && SLP_TREE_CHILDREN (sub)[1] == SLP_TREE_CHILDREN (add)[0]))) + return NULL; + + for (unsigned i = 0; i < SLP_TREE_LANE_PERMUTATION (node).length (); ++i) + { + std::pair<unsigned, unsigned> perm = SLP_TREE_LANE_PERMUTATION (node)[i]; + if (swapped_p) + perm.first = perm.first == 0 ? 1 : 0; + /* It has to be alternating -, +, -, ... + While we could permute the .SUBADD inputs and the .SUBADD output + that's only profitable over the add + sub + blend if at least + one of the permute is optimized which we can't determine here. 
*/ + if (perm.first != (i & 1) + || perm.second != i) + return NULL; + } + + if (!vect_pattern_validate_optab (IFN_VEC_SUBADD, node)) + return NULL; + + return new subadd_pattern (node_); +} + +void +subadd_pattern::build (vec_info *vinfo) +{ + slp_tree node = *m_node; + + slp_tree sub = SLP_TREE_CHILDREN (node)[0]; + slp_tree add = SLP_TREE_CHILDREN (node)[1]; + if (vect_match_expression_p (sub, PLUS_EXPR)) + std::swap (add, sub); + + /* Modify the blend node in-place. */ + SLP_TREE_CHILDREN (node)[0] = SLP_TREE_CHILDREN (sub)[0]; + SLP_TREE_CHILDREN (node)[1] = SLP_TREE_CHILDREN (sub)[1]; + SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[0])++; + SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (node)[1])++; + + /* Build IFN_VEC_SUBADD from the sub representative operands. */ + stmt_vec_info rep = SLP_TREE_REPRESENTATIVE (sub); + gcall *call = gimple_build_call_internal (IFN_VEC_SUBADD, 2, + gimple_assign_rhs1 (rep->stmt), + gimple_assign_rhs2 (rep->stmt)); + gimple_call_set_lhs (call, make_ssa_name + (TREE_TYPE (gimple_assign_lhs (rep->stmt)))); + gimple_call_set_nothrow (call, true); + gimple_set_bb (call, gimple_bb (rep->stmt)); + SLP_TREE_REPRESENTATIVE (node) = vinfo->add_pattern_stmt (call, rep); + STMT_VINFO_RELEVANT (SLP_TREE_REPRESENTATIVE (node)) = vect_used_in_scope; + STMT_SLP_TYPE (SLP_TREE_REPRESENTATIVE (node)) = pure_slp; + SLP_TREE_CODE (node) = ERROR_MARK; + SLP_TREE_LANE_PERMUTATION (node).release (); + + vect_free_slp_tree (sub); + vect_free_slp_tree (add); +} + /******************************************************************************* * Pattern matching definitions ******************************************************************************/ @@ -1502,6 +1599,7 @@ vect_pattern_decl_t slp_patterns[] overlap in what they can detect. 
*/ SLP_PATTERN (complex_operations_pattern), + SLP_PATTERN (subadd_pattern) }; #undef SLP_PATTERN diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 04c20f8bd0f..b9824623ad9 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2100,7 +2100,8 @@ class vect_pattern this->m_ifn = ifn; this->m_node = node; this->m_ops.create (0); - this->m_ops.safe_splice (*m_ops); + if (m_ops) + this->m_ops.safe_splice (*m_ops); } public: