Message ID | 20220218140102.13F4E13C91@imap2.suse-dmz.suse.de |
---|---|
State | New |
Headers | show |
Series | [1/3] tree-optimization/104582 - Simplify vectorizer cost API and fixes | expand |
On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > This uses the now passed SLP node to the vectorizer costing hook > to adjust vector construction costs for the cost of moving an > integer component from a GPR to a vector register when that's > required for building a vector from components. A cruical difference > here is whether the component is loaded from memory or extracted > from a vector register as in those cases no intermediate GPR is involved. > > The pr99881.c testcase can be Un-XFAILed with this patch, the > pr91446.c testcase now produces scalar code which looks superior > to me so I've adjusted it as well. > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu > after adding the BIT_FIELD_REF vector extracting special casing. Does the patch handle PR101929? > > I suppose we can let autotesters look for SPEC performance fallout. > > OK if testing succeeds? > > Thanks, > Richard. > > 2022-02-18 Richard Biener <rguenther@suse.de> > > PR tree-optimization/104582 > PR target/99881 > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > Cost GPR to vector register moves for integer vector construction. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New. > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise. > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise. > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise. > * gcc.target/i386/pr99881.c: Un-XFAIL. > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization. > --- > gcc/config/i386/i386.cc | 45 ++++++++++++++++++- > .../costmodel/x86_64/costmodel-pr104582-1.c | 15 +++++++ > .../costmodel/x86_64/costmodel-pr104582-2.c | 13 ++++++ > .../costmodel/x86_64/costmodel-pr104582-3.c | 13 ++++++ > .../costmodel/x86_64/costmodel-pr104582-4.c | 15 +++++++ > gcc/testsuite/gcc.target/i386/pr91446.c | 2 +- > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > 7 files changed, 102 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index 0830dbd7dca..b2bf90576d5 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar) > > unsigned > ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > - stmt_vec_info stmt_info, slp_tree, > + stmt_vec_info stmt_info, slp_tree node, > tree vectype, int misalign, > vect_cost_model_location where) > { > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); > } > + else if (kind == vec_construct > + && node > + && SLP_TREE_DEF_TYPE (node) == vect_external_def > + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) > + { > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > + unsigned i; > + tree op; > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > + if (TREE_CODE (op) == SSA_NAME) > + TREE_VISITED (op) = 0; > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > + { > + if (TREE_CODE (op) != SSA_NAME > + || TREE_VISITED (op)) > + continue; > + TREE_VISITED (op) = 1; > + gimple *def = SSA_NAME_DEF_STMT (op); > + tree tem; > + if (is_gimple_assign (def) > + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) > + && ((tem = gimple_assign_rhs1 (def)), true) > + && TREE_CODE (tem) == SSA_NAME > + /* A sign-change expands to nothing. */ > + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), > + TREE_TYPE (tem))) > + def = SSA_NAME_DEF_STMT (tem); > + /* When the component is loaded from memory we can directly > + move it to a vector register, otherwise we have to go > + via a GPR or via vpinsr which involves similar cost. > + Likewise with a BIT_FIELD_REF extracting from a vector > + register we can hope to avoid using a GPR. */ > + if (!is_gimple_assign (def) > + || (!gimple_assign_load_p (def) > + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF > + || !VECTOR_TYPE_P (TREE_TYPE > + (TREE_OPERAND (gimple_assign_rhs1 (def), 0)))))) > + stmt_cost += ix86_cost->sse_to_integer; > + } > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > + if (TREE_CODE (op) == SSA_NAME) > + TREE_VISITED (op) = 0; > + } > if (stmt_cost == -1) > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > new file mode 100644 > index 00000000000..992a845ad7a > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > @@ -0,0 +1,15 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > + > +struct S { unsigned long a, b; } s; > + > +void > +foo (unsigned long *a, unsigned long *b) > +{ > + unsigned long a_ = *a; > + unsigned long b_ = *b; > + s.a = a_; > + s.b = b_; > +} > + > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > new file mode 100644 > index 00000000000..7637cdb4a97 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > + > +struct S { unsigned long a, b; } s; > + > +void > +foo (unsigned long a, unsigned long b) > +{ > + s.a = a; > + s.b = b; > +} > + > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > new file mode 100644 > index 00000000000..999c4905708 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > + > +struct S { double a, b; } s; > + > +void > +foo (double a, double b) > +{ > + s.a = a; > + s.b = b; > +} > + > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > new file mode 100644 > index 00000000000..cc471e1ed73 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > @@ -0,0 +1,15 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > + > +struct S { unsigned long a, b; } s; > + > +void > +foo (signed long *a, unsigned long *b) > +{ > + unsigned long a_ = *a; > + unsigned long b_ = *b; > + s.a = a_; > + s.b = b_; > +} > + > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > index 0243ca3ea68..067bf43f698 100644 > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, > bar (&t); > } > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > index 3e087eb2ed7..a1ec1d1ba8a 100644 > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > @@ -1,7 +1,7 @@ > /* PR target/99881. */ > /* { dg-do compile { target { ! ia32 } } } */ > /* { dg-options "-Ofast -march=skylake" } */ > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > void > foo (int* __restrict a, int n, int c) > -- > 2.34.1
On Mon, 21 Feb 2022, Hongtao Liu wrote: > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > > > This uses the now passed SLP node to the vectorizer costing hook > > to adjust vector construction costs for the cost of moving an > > integer component from a GPR to a vector register when that's > > required for building a vector from components. A cruical difference > > here is whether the component is loaded from memory or extracted > > from a vector register as in those cases no intermediate GPR is involved. > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the > > pr91446.c testcase now produces scalar code which looks superior > > to me so I've adjusted it as well. > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu > > after adding the BIT_FIELD_REF vector extracting special casing. > Does the patch handle PR101929? The patch will regress the testcase posted in PR101929 again: _255 1 times scalar_store costs 12 in body _261 1 times scalar_store costs 12 in body _258 1 times scalar_store costs 12 in body _264 1 times scalar_store costs 12 in body t0_247 + t2_251 1 times scalar_stmt costs 4 in body t1_472 + t3_444 1 times scalar_stmt costs 4 in body t0_406 - t2_451 1 times scalar_stmt costs 4 in body t1_472 - t3_444 1 times scalar_stmt costs 4 in body -node 0x4182f48 1 times vec_construct costs 16 in prologue -node 0x41882b0 1 times vec_construct costs 16 in prologue +node 0x4182f48 1 times vec_construct costs 28 in prologue +node 0x41882b0 1 times vec_construct costs 28 in prologue t0_406 + t2_451 1 times vector_stmt costs 4 in body t1_472 - t3_444 1 times vector_stmt costs 4 in body node 0x41829f8 1 times vec_perm costs 4 in body _436 1 times vector_store costs 16 in body t.c:37:9: note: Cost model analysis for part in loop 0: - Vector cost: 60 + Vector cost: 84 Scalar cost: 64 +t.c:37:9: missed: not vectorized: vectorization is not profitable. We're constructing V4SI from patterns like { _407, _480, _407, _480 } where the components are results of integer adds (so the result is definitely in a GPR). We are costing the construction as 4 * sse_op + 2 * sse_to_integer which with skylake cost is 4 * COSTS_N_INSNS (1) + 2 * 6. Whether the vectorization itself is profitable is likely questionable but then it's true that the construction of V4SI is more costly in terms of uops than a construction of V4SF. Now, we can - for the first time - now see the actual construction pattern and ideal construction might be two GPR->xmm moves + two splats + one unpack or maybe two GPR->xmm moves + one unpack + splat of DI (or other means of duplicating the lowpart). It still wouldn't help here of course, and we would not have RTL expansion adhere to this scheme. Trying to capture secondary effects (the FRE opportunity unleashed) in costing of this particular SLP subgraph is difficult and probably undesirable though. Adjusting any of the target specific costs is likely not going to work because of the original narrow profitability gap (60 vs. 64). For the particular kernel the overall vectorization strathegy needs to improve (PR98138 is about this I think). I know we've reverted the previous change that attempted to cost for GPR -> XMM. It did case vec_construct: { /* N element inserts into SSE vectors. */ + int cost + = TYPE_VECTOR_SUBPARTS (vectype) * (fp ? + ix86_cost->sse_op + : ix86_cost->integer_to_sse); + - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; thus treated integer_to_sse as GPR insert into vector cost for integer components in general while the now proposed change _adds_ integer to XMM move costs (but only once for duplicate elements and not for the cases we know are from memory / vector regs). With the proposed patch we can probably be more "correct" above and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given for FP the first one is free and for int we're costing it separately. Again it won't help here. Thanks, Richard. > > > > I suppose we can let autotesters look for SPEC performance fallout. > > > > OK if testing succeeds? > > > > Thanks, > > Richard. > > > > 2022-02-18 Richard Biener <rguenther@suse.de> > > > > PR tree-optimization/104582 > > PR target/99881 > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > > Cost GPR to vector register moves for integer vector construction. > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise. > > * gcc.target/i386/pr99881.c: Un-XFAIL. > > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization. > > --- > > gcc/config/i386/i386.cc | 45 ++++++++++++++++++- > > .../costmodel/x86_64/costmodel-pr104582-1.c | 15 +++++++ > > .../costmodel/x86_64/costmodel-pr104582-2.c | 13 ++++++ > > .../costmodel/x86_64/costmodel-pr104582-3.c | 13 ++++++ > > .../costmodel/x86_64/costmodel-pr104582-4.c | 15 +++++++ > > gcc/testsuite/gcc.target/i386/pr91446.c | 2 +- > > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > > 7 files changed, 102 insertions(+), 3 deletions(-) > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index 0830dbd7dca..b2bf90576d5 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar) > > > > unsigned > > ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > - stmt_vec_info stmt_info, slp_tree, > > + stmt_vec_info stmt_info, slp_tree node, > > tree vectype, int misalign, > > vect_cost_model_location where) > > { > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); > > } > > + else if (kind == vec_construct > > + && node > > + && SLP_TREE_DEF_TYPE (node) == vect_external_def > > + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) > > + { > > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > + unsigned i; > > + tree op; > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > + if (TREE_CODE (op) == SSA_NAME) > > + TREE_VISITED (op) = 0; > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > + { > > + if (TREE_CODE (op) != SSA_NAME > > + || TREE_VISITED (op)) > > + continue; > > + TREE_VISITED (op) = 1; > > + gimple *def = SSA_NAME_DEF_STMT (op); > > + tree tem; > > + if (is_gimple_assign (def) > > + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) > > + && ((tem = gimple_assign_rhs1 (def)), true) > > + && TREE_CODE (tem) == SSA_NAME > > + /* A sign-change expands to nothing. */ > > + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), > > + TREE_TYPE (tem))) > > + def = SSA_NAME_DEF_STMT (tem); > > + /* When the component is loaded from memory we can directly > > + move it to a vector register, otherwise we have to go > > + via a GPR or via vpinsr which involves similar cost. > > + Likewise with a BIT_FIELD_REF extracting from a vector > > + register we can hope to avoid using a GPR. */ > > + if (!is_gimple_assign (def) > > + || (!gimple_assign_load_p (def) > > + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF > > + || !VECTOR_TYPE_P (TREE_TYPE > > + (TREE_OPERAND (gimple_assign_rhs1 (def), 0)))))) > > + stmt_cost += ix86_cost->sse_to_integer; > > + } > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > + if (TREE_CODE (op) == SSA_NAME) > > + TREE_VISITED (op) = 0; > > + } > > if (stmt_cost == -1) > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > new file mode 100644 > > index 00000000000..992a845ad7a > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > @@ -0,0 +1,15 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { unsigned long a, b; } s; > > + > > +void > > +foo (unsigned long *a, unsigned long *b) > > +{ > > + unsigned long a_ = *a; > > + unsigned long b_ = *b; > > + s.a = a_; > > + s.b = b_; > > +} > > + > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > new file mode 100644 > > index 00000000000..7637cdb4a97 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > @@ -0,0 +1,13 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { unsigned long a, b; } s; > > + > > +void > > +foo (unsigned long a, unsigned long b) > > +{ > > + s.a = a; > > + s.b = b; > > +} > > + > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > new file mode 100644 > > index 00000000000..999c4905708 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > @@ -0,0 +1,13 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { double a, b; } s; > > + > > +void > > +foo (double a, double b) > > +{ > > + s.a = a; > > + s.b = b; > > +} > > + > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > new file mode 100644 > > index 00000000000..cc471e1ed73 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > @@ -0,0 +1,15 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > + > > +struct S { unsigned long a, b; } s; > > + > > +void > > +foo (signed long *a, unsigned long *b) > > +{ > > + unsigned long a_ = *a; > > + unsigned long b_ = *b; > > + s.a = a_; > > + s.b = b_; > > +} > > + > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > > index 0243ca3ea68..067bf43f698 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, > > bar (&t); > > } > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > > index 3e087eb2ed7..a1ec1d1ba8a 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > > @@ -1,7 +1,7 @@ > > /* PR target/99881. */ > > /* { dg-do compile { target { ! ia32 } } } */ > > /* { dg-options "-Ofast -march=skylake" } */ > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > > > void > > foo (int* __restrict a, int n, int c) > > -- > > 2.34.1 > > > >
On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote: > > On Mon, 21 Feb 2022, Hongtao Liu wrote: > > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > This uses the now passed SLP node to the vectorizer costing hook > > > to adjust vector construction costs for the cost of moving an > > > integer component from a GPR to a vector register when that's > > > required for building a vector from components. A cruical difference > > > here is whether the component is loaded from memory or extracted > > > from a vector register as in those cases no intermediate GPR is involved. > > > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the > > > pr91446.c testcase now produces scalar code which looks superior > > > to me so I've adjusted it as well. > > > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu > > > after adding the BIT_FIELD_REF vector extracting special casing. > > Does the patch handle PR101929? > > The patch will regress the testcase posted in PR101929 again: > > _255 1 times scalar_store costs 12 in body > _261 1 times scalar_store costs 12 in body > _258 1 times scalar_store costs 12 in body > _264 1 times scalar_store costs 12 in body > t0_247 + t2_251 1 times scalar_stmt costs 4 in body > t1_472 + t3_444 1 times scalar_stmt costs 4 in body > t0_406 - t2_451 1 times scalar_stmt costs 4 in body > t1_472 - t3_444 1 times scalar_stmt costs 4 in body > -node 0x4182f48 1 times vec_construct costs 16 in prologue > -node 0x41882b0 1 times vec_construct costs 16 in prologue > +node 0x4182f48 1 times vec_construct costs 28 in prologue > +node 0x41882b0 1 times vec_construct costs 28 in prologue > t0_406 + t2_451 1 times vector_stmt costs 4 in body > t1_472 - t3_444 1 times vector_stmt costs 4 in body > node 0x41829f8 1 times vec_perm costs 4 in body > _436 1 times vector_store costs 16 in body > t.c:37:9: note: Cost model analysis for part in loop 0: > - Vector cost: 60 > + Vector cost: 84 > Scalar cost: 64 > +t.c:37:9: missed: not vectorized: vectorization is not profitable. > > We're constructing V4SI from patterns like { _407, _480, _407, _480 } > where the components are results of integer adds (so the result is > definitely in a GPR). We are costing the construction as > 4 * sse_op + 2 * sse_to_integer which with skylake cost is > 4 * COSTS_N_INSNS (1) + 2 * 6. > > Whether the vectorization itself is profitable is likely questionable > but then it's true that the construction of V4SI is more costly > in terms of uops than a construction of V4SF. > > Now, we can - for the first time - now see the actual construction > pattern and ideal construction might be two GPR->xmm moves > + two splats + one unpack or maybe two GPR->xmm moves + one > unpack + splat of DI (or other means of duplicating the lowpart). Yes, the patch is technically right. I'm ok with the patch. > It still wouldn't help here of course, and we would not have RTL > expansion adhere to this scheme. > > Trying to capture secondary effects (the FRE opportunity unleashed) > in costing of this particular SLP subgraph is difficult and probably > undesirable though. Adjusting any of the target specific costs > is likely not going to work because of the original narrow profitability > gap (60 vs. 64). > > For the particular kernel the overall vectorization strathegy needs > to improve (PR98138 is about this I think). I know we've reverted > the previous change that attempted to cost for GPR -> XMM. It did > > case vec_construct: > { > /* N element inserts into SSE vectors. */ > + int cost > + = TYPE_VECTOR_SUBPARTS (vectype) * (fp ? > + ix86_cost->sse_op > + : > ix86_cost->integer_to_sse); > + > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > > thus treated integer_to_sse as GPR insert into vector cost for > integer components in general while the now proposed change > _adds_ integer to XMM move costs (but only once for duplicate > elements and not for the cases we know are from memory / vector regs). > With the proposed patch we can probably be more "correct" above > and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given > for FP the first one is free and for int we're costing it separately. > Again it won't help here. > > Thanks, > Richard. > > > > > > > I suppose we can let autotesters look for SPEC performance fallout. > > > > > > OK if testing succeeds? > > > > > > Thanks, > > > Richard. > > > > > > 2022-02-18 Richard Biener <rguenther@suse.de> > > > > > > PR tree-optimization/104582 > > > PR target/99881 > > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > > > Cost GPR to vector register moves for integer vector construction. > > > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New. > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise. > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise. > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise. > > > * gcc.target/i386/pr99881.c: Un-XFAIL. > > > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization. > > > --- > > > gcc/config/i386/i386.cc | 45 ++++++++++++++++++- > > > .../costmodel/x86_64/costmodel-pr104582-1.c | 15 +++++++ > > > .../costmodel/x86_64/costmodel-pr104582-2.c | 13 ++++++ > > > .../costmodel/x86_64/costmodel-pr104582-3.c | 13 ++++++ > > > .../costmodel/x86_64/costmodel-pr104582-4.c | 15 +++++++ > > > gcc/testsuite/gcc.target/i386/pr91446.c | 2 +- > > > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > > > 7 files changed, 102 insertions(+), 3 deletions(-) > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > > index 0830dbd7dca..b2bf90576d5 100644 > > > --- a/gcc/config/i386/i386.cc > > > +++ b/gcc/config/i386/i386.cc > > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar) > > > > > > unsigned > > > ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > > - stmt_vec_info stmt_info, slp_tree, > > > + stmt_vec_info stmt_info, slp_tree node, > > > tree vectype, int misalign, > > > vect_cost_model_location where) > > > { > > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); > > > } > > > + else if (kind == vec_construct > > > + && node > > > + && SLP_TREE_DEF_TYPE (node) == vect_external_def > > > + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) > > > + { > > > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > + unsigned i; > > > + tree op; > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > + if (TREE_CODE (op) == SSA_NAME) > > > + TREE_VISITED (op) = 0; > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > + { > > > + if (TREE_CODE (op) != SSA_NAME > > > + || TREE_VISITED (op)) > > > + continue; > > > + TREE_VISITED (op) = 1; > > > + gimple *def = SSA_NAME_DEF_STMT (op); > > > + tree tem; > > > + if (is_gimple_assign (def) > > > + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) > > > + && ((tem = gimple_assign_rhs1 (def)), true) > > > + && TREE_CODE (tem) == SSA_NAME > > > + /* A sign-change expands to nothing. */ > > > + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), > > > + TREE_TYPE (tem))) > > > + def = SSA_NAME_DEF_STMT (tem); > > > + /* When the component is loaded from memory we can directly > > > + move it to a vector register, otherwise we have to go > > > + via a GPR or via vpinsr which involves similar cost. > > > + Likewise with a BIT_FIELD_REF extracting from a vector > > > + register we can hope to avoid using a GPR. */ > > > + if (!is_gimple_assign (def) > > > + || (!gimple_assign_load_p (def) > > > + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF > > > + || !VECTOR_TYPE_P (TREE_TYPE > > > + (TREE_OPERAND (gimple_assign_rhs1 (def), 0)))))) > > > + stmt_cost += ix86_cost->sse_to_integer; > > > + } > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > + if (TREE_CODE (op) == SSA_NAME) > > > + TREE_VISITED (op) = 0; > > > + } > > > if (stmt_cost == -1) > > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > new file mode 100644 > > > index 00000000000..992a845ad7a > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > @@ -0,0 +1,15 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > + > > > +struct S { unsigned long a, b; } s; > > > + > > > +void > > > +foo (unsigned long *a, unsigned long *b) > > > +{ > > > + unsigned long a_ = *a; > > > + unsigned long b_ = *b; > > > + s.a = a_; > > > + s.b = b_; > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > new file mode 100644 > > > index 00000000000..7637cdb4a97 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > @@ -0,0 +1,13 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > + > > > +struct S { unsigned long a, b; } s; > > > + > > > +void > > > +foo (unsigned long a, unsigned long b) > > > +{ > > > + s.a = a; > > > + s.b = b; > > > +} > > > + > > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > new file mode 100644 > > > index 00000000000..999c4905708 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > @@ -0,0 +1,13 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > + > > > +struct S { double a, b; } s; > > > + > > > +void > > > +foo (double a, double b) > > > +{ > > > + s.a = a; > > > + s.b = b; > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > new file mode 100644 > > > index 00000000000..cc471e1ed73 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > @@ -0,0 +1,15 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > + > > > +struct S { unsigned long a, b; } s; > > > + > > > +void > > > +foo (signed long *a, unsigned long *b) > > > +{ > > > + unsigned long a_ = *a; > > > + unsigned long b_ = *b; > > > + s.a = a_; > > > + s.b = b_; > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > > > index 0243ca3ea68..067bf43f698 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, > > > bar (&t); > > > } > > > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ > > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > > > index 3e087eb2ed7..a1ec1d1ba8a 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > > > @@ -1,7 +1,7 @@ > > > /* PR target/99881. */ > > > /* { dg-do compile { target { ! ia32 } } } */ > > > /* { dg-options "-Ofast -march=skylake" } */ > > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > > > > > void > > > foo (int* __restrict a, int n, int c) > > > -- > > > 2.34.1 > > > > > > > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, > Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
On Tue, 22 Feb 2022, Hongtao Liu wrote: > On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote: > > > > On Mon, 21 Feb 2022, Hongtao Liu wrote: > > > > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > This uses the now passed SLP node to the vectorizer costing hook > > > > to adjust vector construction costs for the cost of moving an > > > > integer component from a GPR to a vector register when that's > > > > required for building a vector from components. A cruical difference > > > > here is whether the component is loaded from memory or extracted > > > > from a vector register as in those cases no intermediate GPR is involved. > > > > > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the > > > > pr91446.c testcase now produces scalar code which looks superior > > > > to me so I've adjusted it as well. > > > > > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu > > > > after adding the BIT_FIELD_REF vector extracting special casing. > > > Does the patch handle PR101929? > > > > The patch will regress the testcase posted in PR101929 again: > > > > _255 1 times scalar_store costs 12 in body > > _261 1 times scalar_store costs 12 in body > > _258 1 times scalar_store costs 12 in body > > _264 1 times scalar_store costs 12 in body > > t0_247 + t2_251 1 times scalar_stmt costs 4 in body > > t1_472 + t3_444 1 times scalar_stmt costs 4 in body > > t0_406 - t2_451 1 times scalar_stmt costs 4 in body > > t1_472 - t3_444 1 times scalar_stmt costs 4 in body > > -node 0x4182f48 1 times vec_construct costs 16 in prologue > > -node 0x41882b0 1 times vec_construct costs 16 in prologue > > +node 0x4182f48 1 times vec_construct costs 28 in prologue > > +node 0x41882b0 1 times vec_construct costs 28 in prologue > > t0_406 + t2_451 1 times vector_stmt costs 4 in body > > t1_472 - t3_444 1 times vector_stmt costs 4 in body > > node 0x41829f8 1 times vec_perm costs 4 in body > > _436 1 times vector_store costs 16 in body > > t.c:37:9: note: Cost model analysis for part in loop 0: > > - Vector cost: 60 > > + Vector cost: 84 > > Scalar cost: 64 > > +t.c:37:9: missed: not vectorized: vectorization is not profitable. > > > > We're constructing V4SI from patterns like { _407, _480, _407, _480 } > > where the components are results of integer adds (so the result is > > definitely in a GPR). We are costing the construction as > > 4 * sse_op + 2 * sse_to_integer which with skylake cost is > > 4 * COSTS_N_INSNS (1) + 2 * 6. > > > > Whether the vectorization itself is profitable is likely questionable > > but then it's true that the construction of V4SI is more costly > > in terms of uops than a construction of V4SF. > > > > Now, we can - for the first time - now see the actual construction > > pattern and ideal construction might be two GPR->xmm moves > > + two splats + one unpack or maybe two GPR->xmm moves + one > > unpack + splat of DI (or other means of duplicating the lowpart). > Yes, the patch is technically right. I'm ok with the patch. Thanks, I've pushed it now. I've also tested the suggested adjustment doing diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index b2bf90576d5..acf2cc977b4 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, case vec_construct: { /* N element inserts into SSE vectors. */ - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; + int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * ix86_cost->sse_op; /* One vinserti128 for combining two SSE vectors for AVX256. */ if (GET_MODE_BITSIZE (mode) == 256) cost += ix86_vec_cost (mode, ix86_cost->addss); successfully (with no effect on the PR101929 case as expected), I will queue that for stage1 since it isn't known to fix any regression (but I will keep it as option in case something pops up). I'll also have a more detailed look into the x264_r case to see if there's something we can do about the regression that will now show up (and I'll watch autotesters). Thanks, Richard. > > It still wouldn't help here of course, and we would not have RTL > > expansion adhere to this scheme. > > > > Trying to capture secondary effects (the FRE opportunity unleashed) > > in costing of this particular SLP subgraph is difficult and probably > > undesirable though. Adjusting any of the target specific costs > > is likely not going to work because of the original narrow profitability > > gap (60 vs. 64). > > > > For the particular kernel the overall vectorization strathegy needs > > to improve (PR98138 is about this I think). I know we've reverted > > the previous change that attempted to cost for GPR -> XMM. It did > > > > case vec_construct: > > { > > /* N element inserts into SSE vectors. */ > > + int cost > > + = TYPE_VECTOR_SUBPARTS (vectype) * (fp ? > > + ix86_cost->sse_op > > + : > > ix86_cost->integer_to_sse); > > + > > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > > > > thus treated integer_to_sse as GPR insert into vector cost for > > integer components in general while the now proposed change > > _adds_ integer to XMM move costs (but only once for duplicate > > elements and not for the cases we know are from memory / vector regs). > > With the proposed patch we can probably be more "correct" above > > and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given > > for FP the first one is free and for int we're costing it separately. > > Again it won't help here. > > > > Thanks, > > Richard. > > > > > > > > > > I suppose we can let autotesters look for SPEC performance fallout. > > > > > > > > OK if testing succeeds? > > > > > > > > Thanks, > > > > Richard. > > > > > > > > 2022-02-18 Richard Biener <rguenther@suse.de> > > > > > > > > PR tree-optimization/104582 > > > > PR target/99881 > > > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > > > > Cost GPR to vector register moves for integer vector construction. > > > > > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New. > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise. > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise. > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise. > > > > * gcc.target/i386/pr99881.c: Un-XFAIL. > > > > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization. > > > > --- > > > > gcc/config/i386/i386.cc | 45 ++++++++++++++++++- > > > > .../costmodel/x86_64/costmodel-pr104582-1.c | 15 +++++++ > > > > .../costmodel/x86_64/costmodel-pr104582-2.c | 13 ++++++ > > > > .../costmodel/x86_64/costmodel-pr104582-3.c | 13 ++++++ > > > > .../costmodel/x86_64/costmodel-pr104582-4.c | 15 +++++++ > > > > gcc/testsuite/gcc.target/i386/pr91446.c | 2 +- > > > > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > > > > 7 files changed, 102 insertions(+), 3 deletions(-) > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > > > index 0830dbd7dca..b2bf90576d5 100644 > > > > --- a/gcc/config/i386/i386.cc > > > > +++ b/gcc/config/i386/i386.cc > > > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar) > > > > > > > > unsigned > > > > ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > > > - stmt_vec_info stmt_info, slp_tree, > > > > + stmt_vec_info stmt_info, slp_tree node, > > > > tree vectype, int misalign, > > > > vect_cost_model_location where) > > > > { > > > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); > > > > } > > > > + else if (kind == vec_construct > > > > + && node > > > > + && SLP_TREE_DEF_TYPE (node) == vect_external_def > > > > + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) > > > > + { > > > > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > + unsigned i; > > > > + tree op; > > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > > + if (TREE_CODE (op) == SSA_NAME) > > > > + TREE_VISITED (op) = 0; > > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > > + { > > > > + if (TREE_CODE (op) != SSA_NAME > > > > + || TREE_VISITED (op)) > > > > + continue; > > > > + TREE_VISITED (op) = 1; > > > > + gimple *def = SSA_NAME_DEF_STMT (op); > > > > + tree tem; > > > > + if (is_gimple_assign (def) > > > > + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) > > > > + && ((tem = gimple_assign_rhs1 (def)), true) > > > > + && TREE_CODE (tem) == SSA_NAME > > > > + /* A sign-change expands to nothing. */ > > > > + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), > > > > + TREE_TYPE (tem))) > > > > + def = SSA_NAME_DEF_STMT (tem); > > > > + /* When the component is loaded from memory we can directly > > > > + move it to a vector register, otherwise we have to go > > > > + via a GPR or via vpinsr which involves similar cost. > > > > + Likewise with a BIT_FIELD_REF extracting from a vector > > > > + register we can hope to avoid using a GPR. */ > > > > + if (!is_gimple_assign (def) > > > > + || (!gimple_assign_load_p (def) > > > > + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF > > > > + || !VECTOR_TYPE_P (TREE_TYPE > > > > + (TREE_OPERAND (gimple_assign_rhs1 (def), 0)))))) > > > > + stmt_cost += ix86_cost->sse_to_integer; > > > > + } > > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > > + if (TREE_CODE (op) == SSA_NAME) > > > > + TREE_VISITED (op) = 0; > > > > + } > > > > if (stmt_cost == -1) > > > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > > new file mode 100644 > > > > index 00000000000..992a845ad7a > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > > @@ -0,0 +1,15 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > + > > > > +struct S { unsigned long a, b; } s; > > > > + > > > > +void > > > > +foo (unsigned long *a, unsigned long *b) > > > > +{ > > > > + unsigned long a_ = *a; > > > > + unsigned long b_ = *b; > > > > + s.a = a_; > > > > + s.b = b_; > > > > +} > > > > + > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > > new file mode 100644 > > > > index 00000000000..7637cdb4a97 > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > > @@ -0,0 +1,13 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > + > > > > +struct S { unsigned long a, b; } s; > > > > + > > > > +void > > > > +foo (unsigned long a, unsigned long b) > > > > +{ > > > > + s.a = a; > > > > + s.b = b; > > > > +} > > > > + > > > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > > new file mode 100644 > > > > index 00000000000..999c4905708 > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > > @@ -0,0 +1,13 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > + > > > > +struct S { double a, b; } s; > > > > + > > > > +void > > > > +foo (double a, double b) > > > > +{ > > > > + s.a = a; > > > > + s.b = b; > > > > +} > > > > + > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > new file mode 100644 > > > > index 00000000000..cc471e1ed73 > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > @@ -0,0 +1,15 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > + > > > > +struct S { unsigned long a, b; } s; > > > > + > > > > +void > > > > +foo (signed long *a, unsigned long *b) > > > > +{ > > > > + unsigned long a_ = *a; > > > > + unsigned long b_ = *b; > > > > + s.a = a_; > > > > + s.b = b_; > > > > +} > > > > + > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > > > > index 0243ca3ea68..067bf43f698 100644 > > > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > > > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > > > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, > > > > bar (&t); > > > > } > > > > > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > > > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > > > > index 3e087eb2ed7..a1ec1d1ba8a 100644 > > > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > > > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > > > > @@ -1,7 +1,7 @@ > > > > /* PR target/99881. */ > > > > /* { dg-do compile { target { ! ia32 } } } */ > > > > /* { dg-options "-Ofast -march=skylake" } */ > > > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > > > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > > > > > > > void > > > > foo (int* __restrict a, int n, int c) > > > > -- > > > > 2.34.1 > > > > > > > > > > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, > > Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg) > > > >
On Tue, 22 Feb 2022, Richard Biener wrote: > On Tue, 22 Feb 2022, Hongtao Liu wrote: > > > On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > On Mon, 21 Feb 2022, Hongtao Liu wrote: > > > > > > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches > > > > <gcc-patches@gcc.gnu.org> wrote: > > > > > > > > > > This uses the now passed SLP node to the vectorizer costing hook > > > > > to adjust vector construction costs for the cost of moving an > > > > > integer component from a GPR to a vector register when that's > > > > > required for building a vector from components. A cruical difference > > > > > here is whether the component is loaded from memory or extracted > > > > > from a vector register as in those cases no intermediate GPR is involved. > > > > > > > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the > > > > > pr91446.c testcase now produces scalar code which looks superior > > > > > to me so I've adjusted it as well. > > > > > > > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu > > > > > after adding the BIT_FIELD_REF vector extracting special casing. > > > > Does the patch handle PR101929? > > > > > > The patch will regress the testcase posted in PR101929 again: > > > > > > _255 1 times scalar_store costs 12 in body > > > _261 1 times scalar_store costs 12 in body > > > _258 1 times scalar_store costs 12 in body > > > _264 1 times scalar_store costs 12 in body > > > t0_247 + t2_251 1 times scalar_stmt costs 4 in body > > > t1_472 + t3_444 1 times scalar_stmt costs 4 in body > > > t0_406 - t2_451 1 times scalar_stmt costs 4 in body > > > t1_472 - t3_444 1 times scalar_stmt costs 4 in body > > > -node 0x4182f48 1 times vec_construct costs 16 in prologue > > > -node 0x41882b0 1 times vec_construct costs 16 in prologue > > > +node 0x4182f48 1 times vec_construct costs 28 in prologue > > > +node 0x41882b0 1 times vec_construct costs 28 in prologue > > > t0_406 + t2_451 1 times vector_stmt costs 4 in body > > > t1_472 - t3_444 1 times vector_stmt costs 4 in body > > > node 0x41829f8 1 times vec_perm costs 4 in body > > > _436 1 times vector_store costs 16 in body > > > t.c:37:9: note: Cost model analysis for part in loop 0: > > > - Vector cost: 60 > > > + Vector cost: 84 > > > Scalar cost: 64 > > > +t.c:37:9: missed: not vectorized: vectorization is not profitable. > > > > > > We're constructing V4SI from patterns like { _407, _480, _407, _480 } > > > where the components are results of integer adds (so the result is > > > definitely in a GPR). We are costing the construction as > > > 4 * sse_op + 2 * sse_to_integer which with skylake cost is > > > 4 * COSTS_N_INSNS (1) + 2 * 6. > > > > > > Whether the vectorization itself is profitable is likely questionable > > > but then it's true that the construction of V4SI is more costly > > > in terms of uops than a construction of V4SF. > > > > > > Now, we can - for the first time - now see the actual construction > > > pattern and ideal construction might be two GPR->xmm moves > > > + two splats + one unpack or maybe two GPR->xmm moves + one > > > unpack + splat of DI (or other means of duplicating the lowpart). > > Yes, the patch is technically right. I'm ok with the patch. > > Thanks, I've pushed it now. I've also tested the suggested adjustment > doing > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index b2bf90576d5..acf2cc977b4 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum > vect_cost_for_stmt type_of_cost, > case vec_construct: > { > /* N element inserts into SSE vectors. */ > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > + int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * > ix86_cost->sse_op; > /* One vinserti128 for combining two SSE vectors for AVX256. */ > if (GET_MODE_BITSIZE (mode) == 256) > cost += ix86_vec_cost (mode, ix86_cost->addss); > > successfully (with no effect on the PR101929 case as expected), I > will queue that for stage1 since it isn't known to fix any > regression (but I will keep it as option in case something pops up). > > I'll also have a more detailed look into the x264_r case to see > if there's something we can do about the regression that will now > show up (and I'll watch autotesters). I found a way to get back the vectorization doing diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 9188d727e33..7f1f12fb6c6 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -2374,7 +2375,7 @@ fail: n_vector_builds++; } } - if (all_uniform_p + if ((all_uniform_p && !two_operators) || n_vector_builds > 1 || (n_vector_builds == children.length () && is_a <gphi *> (stmt_info->stmt))) which in itself is reasonable since the result of the operation is a non-uniform vector. That causes us to do _37 = {a3_245_6(D), a3_245_6(D), a3_245_6(D), a3_245_6(D)}; _36 = {a2_232_5(D), a2_232_5(D), a2_232_5(D), a2_232_5(D)}; vect__252_8.7_39 = _36 - _37; vect__250_7.6_38 = _36 + _37; _40 = VEC_PERM_EXPR <vect__250_7.6_38, vect__252_8.7_39, { 0, 5, 2, 7 }>; and then vector lowering does what vect_optimize_slp should do as well (but doesn't): _26 = a2_232_5(D) - a3_245_6(D); vect__252_8.7_39 = {_26, _26, _26, _26}; _25 = a2_232_5(D) + a3_245_6(D); vect__250_7.6_38 = {_25, _25, _25, _25}; _40 = VEC_PERM_EXPR <vect__250_7.6_38, vect__252_8.7_39, { 0, 5, 2, 7 }>; and later stmt folding will get us back the non-uniform vector construction by folding the VEC_PERM_EXPR where we simply assume that we can optimally expand the constructor. But the important part is that this way we evade vec_construct costing and instead get node 0x40e11a8 1 times scalar_to_vec costs 4 in prologue where of course it's obvious that we miss to account for the GPR -> XMM penalty here in the backend. When we are not fixing that we get the kernel for x264_r optimized like before. A self-contained testcase for this opportunity is void foo (unsigned * __restrict tmp, unsigned a0_206, unsigned a1_219, unsigned a2_232, unsigned a3_245) { unsigned _246 = a0_206 + a1_219; unsigned _248 = a0_206 - a1_219; unsigned _250 = a2_232 + a3_245; unsigned _252 = a2_232 - a3_245; int t0_247 = (int) _246; int t1_249 = (int) _248; int t2_251 = (int) _250; int t3_253 = (int) _252; int _254 = t0_247 + t2_251; int _260 = t1_249 + t3_253; int _257 = t0_247 - t2_251; int _263 = t1_249 - t3_253; unsigned _255 = (unsigned int) _254; unsigned _261 = (unsigned int) _260; unsigned _258 = (unsigned int) _257; unsigned _264 = (unsigned int) _263; tmp[0] = _255; tmp[1] = _261; tmp[2] = _258; tmp[3] = _264; } which does show that the vectorization is not profitable since we get 0000000000000000 <foo>: 0: 89 c8 mov %ecx,%eax 2: 44 01 c1 add %r8d,%ecx 5: 44 29 c0 sub %r8d,%eax 8: c5 f9 6e d9 vmovd %ecx,%xmm3 c: c4 e3 61 22 c8 01 vpinsrd $0x1,%eax,%xmm3,%xmm1 12: 89 f0 mov %esi,%eax 14: 01 d6 add %edx,%esi 16: 29 d0 sub %edx,%eax 18: c5 f9 6e e6 vmovd %esi,%xmm4 1c: c4 e3 59 22 c0 01 vpinsrd $0x1,%eax,%xmm4,%xmm0 22: c5 f1 6c c9 vpunpcklqdq %xmm1,%xmm1,%xmm1 26: c5 f9 6c c0 vpunpcklqdq %xmm0,%xmm0,%xmm0 2a: c5 f9 fe d1 vpaddd %xmm1,%xmm0,%xmm2 2e: c5 f9 fa c1 vpsubd %xmm1,%xmm0,%xmm0 32: c5 e8 c6 c0 e4 vshufps $0xe4,%xmm0,%xmm2,%xmm0 37: c5 fa 7f 07 vmovdqu %xmm0,(%rdi) 3b: c3 ret vs the scalar code 0000000000000000 <foo>: 0: 48 89 f8 mov %rdi,%rax 3: 8d 3c 16 lea (%rsi,%rdx,1),%edi 6: 29 d6 sub %edx,%esi 8: 42 8d 14 01 lea (%rcx,%r8,1),%edx c: 44 29 c1 sub %r8d,%ecx f: 44 8d 04 17 lea (%rdi,%rdx,1),%r8d 13: 44 89 00 mov %r8d,(%rax) 16: 29 d7 sub %edx,%edi 18: 44 8d 04 0e lea (%rsi,%rcx,1),%r8d 1c: 29 ce sub %ecx,%esi 1e: 44 89 40 04 mov %r8d,0x4(%rax) 22: 89 78 08 mov %edi,0x8(%rax) 25: 89 70 0c mov %esi,0xc(%rax) 28: c3 ret but we could choose to ignore this for the sake of avoiding the x264 regression (shall it appear which I think it will) and try to address this in GCC 12. Richard. > Thanks, > Richard. > > > > > It still wouldn't help here of course, and we would not have RTL > > > expansion adhere to this scheme. > > > > > > Trying to capture secondary effects (the FRE opportunity unleashed) > > > in costing of this particular SLP subgraph is difficult and probably > > > undesirable though. Adjusting any of the target specific costs > > > is likely not going to work because of the original narrow profitability > > > gap (60 vs. 64). > > > > > > For the particular kernel the overall vectorization strathegy needs > > > to improve (PR98138 is about this I think). I know we've reverted > > > the previous change that attempted to cost for GPR -> XMM. It did > > > > > > case vec_construct: > > > { > > > /* N element inserts into SSE vectors. */ > > > + int cost > > > + = TYPE_VECTOR_SUBPARTS (vectype) * (fp ? > > > + ix86_cost->sse_op > > > + : > > > ix86_cost->integer_to_sse); > > > + > > > - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op; > > > > > > thus treated integer_to_sse as GPR insert into vector cost for > > > integer components in general while the now proposed change > > > _adds_ integer to XMM move costs (but only once for duplicate > > > elements and not for the cases we know are from memory / vector regs). > > > With the proposed patch we can probably be more "correct" above > > > and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given > > > for FP the first one is free and for int we're costing it separately. > > > Again it won't help here. > > > > > > Thanks, > > > Richard. > > > > > > > > > > > > > I suppose we can let autotesters look for SPEC performance fallout. > > > > > > > > > > OK if testing succeeds? > > > > > > > > > > Thanks, > > > > > Richard. > > > > > > > > > > 2022-02-18 Richard Biener <rguenther@suse.de> > > > > > > > > > > PR tree-optimization/104582 > > > > > PR target/99881 > > > > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > > > > > Cost GPR to vector register moves for integer vector construction. > > > > > > > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New. > > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise. > > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise. > > > > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise. > > > > > * gcc.target/i386/pr99881.c: Un-XFAIL. > > > > > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization. > > > > > --- > > > > > gcc/config/i386/i386.cc | 45 ++++++++++++++++++- > > > > > .../costmodel/x86_64/costmodel-pr104582-1.c | 15 +++++++ > > > > > .../costmodel/x86_64/costmodel-pr104582-2.c | 13 ++++++ > > > > > .../costmodel/x86_64/costmodel-pr104582-3.c | 13 ++++++ > > > > > .../costmodel/x86_64/costmodel-pr104582-4.c | 15 +++++++ > > > > > gcc/testsuite/gcc.target/i386/pr91446.c | 2 +- > > > > > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > > > > > 7 files changed, 102 insertions(+), 3 deletions(-) > > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > > > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > > > > index 0830dbd7dca..b2bf90576d5 100644 > > > > > --- a/gcc/config/i386/i386.cc > > > > > +++ b/gcc/config/i386/i386.cc > > > > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar) > > > > > > > > > > unsigned > > > > > ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > > > > - stmt_vec_info stmt_info, slp_tree, > > > > > + stmt_vec_info stmt_info, slp_tree node, > > > > > tree vectype, int misalign, > > > > > vect_cost_model_location where) > > > > > { > > > > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > > > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > > stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); > > > > > } > > > > > + else if (kind == vec_construct > > > > > + && node > > > > > + && SLP_TREE_DEF_TYPE (node) == vect_external_def > > > > > + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) > > > > > + { > > > > > + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > > + unsigned i; > > > > > + tree op; > > > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > > > + if (TREE_CODE (op) == SSA_NAME) > > > > > + TREE_VISITED (op) = 0; > > > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > > > + { > > > > > + if (TREE_CODE (op) != SSA_NAME > > > > > + || TREE_VISITED (op)) > > > > > + continue; > > > > > + TREE_VISITED (op) = 1; > > > > > + gimple *def = SSA_NAME_DEF_STMT (op); > > > > > + tree tem; > > > > > + if (is_gimple_assign (def) > > > > > + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) > > > > > + && ((tem = gimple_assign_rhs1 (def)), true) > > > > > + && TREE_CODE (tem) == SSA_NAME > > > > > + /* A sign-change expands to nothing. */ > > > > > + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), > > > > > + TREE_TYPE (tem))) > > > > > + def = SSA_NAME_DEF_STMT (tem); > > > > > + /* When the component is loaded from memory we can directly > > > > > + move it to a vector register, otherwise we have to go > > > > > + via a GPR or via vpinsr which involves similar cost. > > > > > + Likewise with a BIT_FIELD_REF extracting from a vector > > > > > + register we can hope to avoid using a GPR. */ > > > > > + if (!is_gimple_assign (def) > > > > > + || (!gimple_assign_load_p (def) > > > > > + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF > > > > > + || !VECTOR_TYPE_P (TREE_TYPE > > > > > + (TREE_OPERAND (gimple_assign_rhs1 (def), 0)))))) > > > > > + stmt_cost += ix86_cost->sse_to_integer; > > > > > + } > > > > > + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) > > > > > + if (TREE_CODE (op) == SSA_NAME) > > > > > + TREE_VISITED (op) = 0; > > > > > + } > > > > > if (stmt_cost == -1) > > > > > stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); > > > > > > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > > > new file mode 100644 > > > > > index 00000000000..992a845ad7a > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > > > > @@ -0,0 +1,15 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > > + > > > > > +struct S { unsigned long a, b; } s; > > > > > + > > > > > +void > > > > > +foo (unsigned long *a, unsigned long *b) > > > > > +{ > > > > > + unsigned long a_ = *a; > > > > > + unsigned long b_ = *b; > > > > > + s.a = a_; > > > > > + s.b = b_; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > > > new file mode 100644 > > > > > index 00000000000..7637cdb4a97 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c > > > > > @@ -0,0 +1,13 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > > + > > > > > +struct S { unsigned long a, b; } s; > > > > > + > > > > > +void > > > > > +foo (unsigned long a, unsigned long b) > > > > > +{ > > > > > + s.a = a; > > > > > + s.b = b; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > > > new file mode 100644 > > > > > index 00000000000..999c4905708 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > > > > @@ -0,0 +1,13 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > > + > > > > > +struct S { double a, b; } s; > > > > > + > > > > > +void > > > > > +foo (double a, double b) > > > > > +{ > > > > > + s.a = a; > > > > > + s.b = b; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > > new file mode 100644 > > > > > index 00000000000..cc471e1ed73 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > > > > @@ -0,0 +1,15 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ > > > > > + > > > > > +struct S { unsigned long a, b; } s; > > > > > + > > > > > +void > > > > > +foo (signed long *a, unsigned long *b) > > > > > +{ > > > > > + unsigned long a_ = *a; > > > > > + unsigned long b_ = *b; > > > > > + s.a = a_; > > > > > + s.b = b_; > > > > > +} > > > > > + > > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > > > > > index 0243ca3ea68..067bf43f698 100644 > > > > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > > > > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, > > > > > bar (&t); > > > > > } > > > > > > > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > > > > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > > > > > index 3e087eb2ed7..a1ec1d1ba8a 100644 > > > > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > > > > > @@ -1,7 +1,7 @@ > > > > > /* PR target/99881. */ > > > > > /* { dg-do compile { target { ! ia32 } } } */ > > > > > /* { dg-options "-Ofast -march=skylake" } */ > > > > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > > > > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > > > > > > > > > void > > > > > foo (int* __restrict a, int n, int c) > > > > > -- > > > > > 2.34.1 > > > > > > > > > > > > > > > > > > > > > > -- > > > Richard Biener <rguenther@suse.de> > > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, > > > Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg) > > > > > > > > > >
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 0830dbd7dca..b2bf90576d5 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar) unsigned ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, - stmt_vec_info stmt_info, slp_tree, + stmt_vec_info stmt_info, slp_tree node, tree vectype, int misalign, vect_cost_model_location where) { @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1); } + else if (kind == vec_construct + && node + && SLP_TREE_DEF_TYPE (node) == vect_external_def + && INTEGRAL_TYPE_P (TREE_TYPE (vectype))) + { + stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); + unsigned i; + tree op; + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) + if (TREE_CODE (op) == SSA_NAME) + TREE_VISITED (op) = 0; + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) + { + if (TREE_CODE (op) != SSA_NAME + || TREE_VISITED (op)) + continue; + TREE_VISITED (op) = 1; + gimple *def = SSA_NAME_DEF_STMT (op); + tree tem; + if (is_gimple_assign (def) + && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def)) + && ((tem = gimple_assign_rhs1 (def)), true) + && TREE_CODE (tem) == SSA_NAME + /* A sign-change expands to nothing. */ + && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)), + TREE_TYPE (tem))) + def = SSA_NAME_DEF_STMT (tem); + /* When the component is loaded from memory we can directly + move it to a vector register, otherwise we have to go + via a GPR or via vpinsr which involves similar cost. + Likewise with a BIT_FIELD_REF extracting from a vector + register we can hope to avoid using a GPR. */ + if (!is_gimple_assign (def) + || (!gimple_assign_load_p (def) + && (gimple_assign_rhs_code (def) != BIT_FIELD_REF + || !VECTOR_TYPE_P (TREE_TYPE + (TREE_OPERAND (gimple_assign_rhs1 (def), 0)))))) + stmt_cost += ix86_cost->sse_to_integer; + } + FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op) + if (TREE_CODE (op) == SSA_NAME) + TREE_VISITED (op) = 0; + } if (stmt_cost == -1) stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign); diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c new file mode 100644 index 00000000000..992a845ad7a --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ + +struct S { unsigned long a, b; } s; + +void +foo (unsigned long *a, unsigned long *b) +{ + unsigned long a_ = *a; + unsigned long b_ = *b; + s.a = a_; + s.b = b_; +} + +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c new file mode 100644 index 00000000000..7637cdb4a97 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ + +struct S { unsigned long a, b; } s; + +void +foo (unsigned long a, unsigned long b) +{ + s.a = a; + s.b = b; +} + +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c new file mode 100644 index 00000000000..999c4905708 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ + +struct S { double a, b; } s; + +void +foo (double a, double b) +{ + s.a = a; + s.b = b; +} + +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c new file mode 100644 index 00000000000..cc471e1ed73 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */ + +struct S { unsigned long a, b; } s; + +void +foo (signed long *a, unsigned long *b) +{ + unsigned long a_ = *a; + unsigned long b_ = *b; + s.a = a_; + s.b = b_; +} + +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c index 0243ca3ea68..067bf43f698 100644 --- a/gcc/testsuite/gcc.target/i386/pr91446.c +++ b/gcc/testsuite/gcc.target/i386/pr91446.c @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height, bar (&t); } -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c index 3e087eb2ed7..a1ec1d1ba8a 100644 --- a/gcc/testsuite/gcc.target/i386/pr99881.c +++ b/gcc/testsuite/gcc.target/i386/pr99881.c @@ -1,7 +1,7 @@ /* PR target/99881. */ /* { dg-do compile { target { ! ia32 } } } */ /* { dg-options "-Ofast -march=skylake" } */ -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ void foo (int* __restrict a, int n, int c)