diff mbox series

[3/3] target/99881 - x86 vector cost of CTOR from integer regs

Message ID 20220218140102.13F4E13C91@imap2.suse-dmz.suse.de
State New
Headers show
Series [1/3] tree-optimization/104582 - Simplify vectorizer cost API and fixes | expand

Commit Message

Richard Biener Feb. 18, 2022, 2:01 p.m. UTC
This uses the now passed SLP node to the vectorizer costing hook
to adjust vector construction costs for the cost of moving an
integer component from a GPR to a vector register when that's
required for building a vector from components.  A cruical difference
here is whether the component is loaded from memory or extracted
from a vector register as in those cases no intermediate GPR is involved.

The pr99881.c testcase can be Un-XFAILed with this patch, the
pr91446.c testcase now produces scalar code which looks superior
to me so I've adjusted it as well.

I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
after adding the BIT_FIELD_REF vector extracting special casing.

I suppose we can let autotesters look for SPEC performance fallout.

OK if testing succeeds?

Thanks,
Richard.

2022-02-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/104582
	PR target/99881
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Cost GPR to vector register moves for integer vector construction.

	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
	* gcc.target/i386/pr99881.c: Un-XFAIL.
	* gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
---
 gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
 .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
 .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
 .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
 .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
 gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
 gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
 7 files changed, 102 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c

Comments

Hongtao Liu Feb. 21, 2022, 1:35 a.m. UTC | #1
On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> This uses the now passed SLP node to the vectorizer costing hook
> to adjust vector construction costs for the cost of moving an
> integer component from a GPR to a vector register when that's
> required for building a vector from components.  A cruical difference
> here is whether the component is loaded from memory or extracted
> from a vector register as in those cases no intermediate GPR is involved.
>
> The pr99881.c testcase can be Un-XFAILed with this patch, the
> pr91446.c testcase now produces scalar code which looks superior
> to me so I've adjusted it as well.
>
> I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> after adding the BIT_FIELD_REF vector extracting special casing.
Does the patch handle PR101929?
>
> I suppose we can let autotesters look for SPEC performance fallout.
>
> OK if testing succeeds?
>
> Thanks,
> Richard.
>
> 2022-02-18  Richard Biener  <rguenther@suse.de>
>
>         PR tree-optimization/104582
>         PR target/99881
>         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>         Cost GPR to vector register moves for integer vector construction.
>
>         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
>         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
>         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
>         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
>         * gcc.target/i386/pr99881.c: Un-XFAIL.
>         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> ---
>  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
>  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
>  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
>  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
>  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
>  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
>  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
>  7 files changed, 102 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 0830dbd7dca..b2bf90576d5 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
>
>  unsigned
>  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> -                                 stmt_vec_info stmt_info, slp_tree,
> +                                 stmt_vec_info stmt_info, slp_tree node,
>                                   tree vectype, int misalign,
>                                   vect_cost_model_location where)
>  {
> @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>      }
> +  else if (kind == vec_construct
> +          && node
> +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> +    {
> +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +      unsigned i;
> +      tree op;
> +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> +       if (TREE_CODE (op) == SSA_NAME)
> +         TREE_VISITED (op) = 0;
> +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> +       {
> +         if (TREE_CODE (op) != SSA_NAME
> +             || TREE_VISITED (op))
> +           continue;
> +         TREE_VISITED (op) = 1;
> +         gimple *def = SSA_NAME_DEF_STMT (op);
> +         tree tem;
> +         if (is_gimple_assign (def)
> +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> +             && ((tem = gimple_assign_rhs1 (def)), true)
> +             && TREE_CODE (tem) == SSA_NAME
> +             /* A sign-change expands to nothing.  */
> +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> +                                       TREE_TYPE (tem)))
> +           def = SSA_NAME_DEF_STMT (tem);
> +         /* When the component is loaded from memory we can directly
> +            move it to a vector register, otherwise we have to go
> +            via a GPR or via vpinsr which involves similar cost.
> +            Likewise with a BIT_FIELD_REF extracting from a vector
> +            register we can hope to avoid using a GPR.  */
> +         if (!is_gimple_assign (def)
> +             || (!gimple_assign_load_p (def)
> +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> +                     || !VECTOR_TYPE_P (TREE_TYPE
> +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> +           stmt_cost += ix86_cost->sse_to_integer;
> +       }
> +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> +       if (TREE_CODE (op) == SSA_NAME)
> +         TREE_VISITED (op) = 0;
> +    }
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> new file mode 100644
> index 00000000000..992a845ad7a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> +
> +struct S { unsigned long a, b; } s;
> +
> +void
> +foo (unsigned long *a, unsigned long *b)
> +{
> +  unsigned long a_ = *a;
> +  unsigned long b_ = *b;
> +  s.a = a_;
> +  s.b = b_;
> +}
> +
> +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> new file mode 100644
> index 00000000000..7637cdb4a97
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> +
> +struct S { unsigned long a, b; } s;
> +
> +void
> +foo (unsigned long a, unsigned long b)
> +{
> +  s.a = a;
> +  s.b = b;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> new file mode 100644
> index 00000000000..999c4905708
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> +
> +struct S { double a, b; } s;
> +
> +void
> +foo (double a, double b)
> +{
> +  s.a = a;
> +  s.b = b;
> +}
> +
> +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> new file mode 100644
> index 00000000000..cc471e1ed73
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> +
> +struct S { unsigned long a, b; } s;
> +
> +void
> +foo (signed long *a, unsigned long *b)
> +{
> +  unsigned long a_ = *a;
> +  unsigned long b_ = *b;
> +  s.a = a_;
> +  s.b = b_;
> +}
> +
> +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> index 0243ca3ea68..067bf43f698 100644
> --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
>    bar (&t);
>  }
>
> -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> index 3e087eb2ed7..a1ec1d1ba8a 100644
> --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> @@ -1,7 +1,7 @@
>  /* PR target/99881.  */
>  /* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-Ofast -march=skylake" } */
> -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
>
>  void
>  foo (int* __restrict a, int n, int c)
> --
> 2.34.1
Richard Biener Feb. 21, 2022, 9:10 a.m. UTC | #2
On Mon, 21 Feb 2022, Hongtao Liu wrote:

> On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > This uses the now passed SLP node to the vectorizer costing hook
> > to adjust vector construction costs for the cost of moving an
> > integer component from a GPR to a vector register when that's
> > required for building a vector from components.  A cruical difference
> > here is whether the component is loaded from memory or extracted
> > from a vector register as in those cases no intermediate GPR is involved.
> >
> > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > pr91446.c testcase now produces scalar code which looks superior
> > to me so I've adjusted it as well.
> >
> > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > after adding the BIT_FIELD_REF vector extracting special casing.
> Does the patch handle PR101929?

The patch will regress the testcase posted in PR101929 again:

 _255 1 times scalar_store costs 12 in body
 _261 1 times scalar_store costs 12 in body
 _258 1 times scalar_store costs 12 in body
 _264 1 times scalar_store costs 12 in body
 t0_247 + t2_251 1 times scalar_stmt costs 4 in body
 t1_472 + t3_444 1 times scalar_stmt costs 4 in body
 t0_406 - t2_451 1 times scalar_stmt costs 4 in body
 t1_472 - t3_444 1 times scalar_stmt costs 4 in body
-node 0x4182f48 1 times vec_construct costs 16 in prologue
-node 0x41882b0 1 times vec_construct costs 16 in prologue
+node 0x4182f48 1 times vec_construct costs 28 in prologue
+node 0x41882b0 1 times vec_construct costs 28 in prologue
 t0_406 + t2_451 1 times vector_stmt costs 4 in body
 t1_472 - t3_444 1 times vector_stmt costs 4 in body
 node 0x41829f8 1 times vec_perm costs 4 in body
 _436 1 times vector_store costs 16 in body
 t.c:37:9: note: Cost model analysis for part in loop 0:
-  Vector cost: 60
+  Vector cost: 84
   Scalar cost: 64
+t.c:37:9: missed: not vectorized: vectorization is not profitable.

We're constructing V4SI from patterns like { _407, _480, _407, _480 }
where the components are results of integer adds (so the result is
definitely in a GPR).  We are costing the construction as
4 * sse_op + 2 * sse_to_integer which with skylake cost is
4 * COSTS_N_INSNS (1) + 2 * 6.

Whether the vectorization itself is profitable is likely questionable
but then it's true that the construction of V4SI is more costly
in terms of uops than a construction of V4SF.

Now, we can - for the first time - now see the actual construction
pattern and ideal construction might be two GPR->xmm moves
+ two splats + one unpack or maybe two GPR->xmm moves + one
unpack + splat of DI (or other means of duplicating the lowpart).
It still wouldn't help here of course, and we would not have RTL
expansion adhere to this scheme.

Trying to capture secondary effects (the FRE opportunity unleashed)
in costing of this particular SLP subgraph is difficult and probably
undesirable though.  Adjusting any of the target specific costs
is likely not going to work because of the original narrow profitability
gap (60 vs. 64).

For the particular kernel the overall vectorization strathegy needs
to improve (PR98138 is about this I think).  I know we've reverted
the previous change that attempted to cost for GPR -> XMM.  It did

       case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
+         int cost
+           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
+                                               ix86_cost->sse_op
+                                               : 
ix86_cost->integer_to_sse);
+
-         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;

thus treated integer_to_sse as GPR insert into vector cost for
integer components in general while the now proposed change
_adds_ integer to XMM move costs (but only once for duplicate
elements and not for the cases we know are from memory / vector regs).
With the proposed patch we can probably be more "correct" above
and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
for FP the first one is free and for int we're costing it separately.
Again it won't help here.

Thanks,
Richard.

> >
> > I suppose we can let autotesters look for SPEC performance fallout.
> >
> > OK if testing succeeds?
> >
> > Thanks,
> > Richard.
> >
> > 2022-02-18  Richard Biener  <rguenther@suse.de>
> >
> >         PR tree-optimization/104582
> >         PR target/99881
> >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> >         Cost GPR to vector register moves for integer vector construction.
> >
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > ---
> >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> >  7 files changed, 102 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 0830dbd7dca..b2bf90576d5 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> >
> >  unsigned
> >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > -                                 stmt_vec_info stmt_info, slp_tree,
> > +                                 stmt_vec_info stmt_info, slp_tree node,
> >                                   tree vectype, int misalign,
> >                                   vect_cost_model_location where)
> >  {
> > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> >      }
> > +  else if (kind == vec_construct
> > +          && node
> > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > +    {
> > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > +      unsigned i;
> > +      tree op;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > +       if (TREE_CODE (op) == SSA_NAME)
> > +         TREE_VISITED (op) = 0;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > +       {
> > +         if (TREE_CODE (op) != SSA_NAME
> > +             || TREE_VISITED (op))
> > +           continue;
> > +         TREE_VISITED (op) = 1;
> > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > +         tree tem;
> > +         if (is_gimple_assign (def)
> > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > +             && TREE_CODE (tem) == SSA_NAME
> > +             /* A sign-change expands to nothing.  */
> > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > +                                       TREE_TYPE (tem)))
> > +           def = SSA_NAME_DEF_STMT (tem);
> > +         /* When the component is loaded from memory we can directly
> > +            move it to a vector register, otherwise we have to go
> > +            via a GPR or via vpinsr which involves similar cost.
> > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > +            register we can hope to avoid using a GPR.  */
> > +         if (!is_gimple_assign (def)
> > +             || (!gimple_assign_load_p (def)
> > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > +           stmt_cost += ix86_cost->sse_to_integer;
> > +       }
> > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > +       if (TREE_CODE (op) == SSA_NAME)
> > +         TREE_VISITED (op) = 0;
> > +    }
> >    if (stmt_cost == -1)
> >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > new file mode 100644
> > index 00000000000..992a845ad7a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { unsigned long a, b; } s;
> > +
> > +void
> > +foo (unsigned long *a, unsigned long *b)
> > +{
> > +  unsigned long a_ = *a;
> > +  unsigned long b_ = *b;
> > +  s.a = a_;
> > +  s.b = b_;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > new file mode 100644
> > index 00000000000..7637cdb4a97
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { unsigned long a, b; } s;
> > +
> > +void
> > +foo (unsigned long a, unsigned long b)
> > +{
> > +  s.a = a;
> > +  s.b = b;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > new file mode 100644
> > index 00000000000..999c4905708
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { double a, b; } s;
> > +
> > +void
> > +foo (double a, double b)
> > +{
> > +  s.a = a;
> > +  s.b = b;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > new file mode 100644
> > index 00000000000..cc471e1ed73
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > +
> > +struct S { unsigned long a, b; } s;
> > +
> > +void
> > +foo (signed long *a, unsigned long *b)
> > +{
> > +  unsigned long a_ = *a;
> > +  unsigned long b_ = *b;
> > +  s.a = a_;
> > +  s.b = b_;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > index 0243ca3ea68..067bf43f698 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> >    bar (&t);
> >  }
> >
> > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > @@ -1,7 +1,7 @@
> >  /* PR target/99881.  */
> >  /* { dg-do compile { target { ! ia32 } } } */
> >  /* { dg-options "-Ofast -march=skylake" } */
> > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> >
> >  void
> >  foo (int* __restrict a, int n, int c)
> > --
> > 2.34.1
> 
> 
> 
>
Hongtao Liu Feb. 22, 2022, 4:08 a.m. UTC | #3
On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 21 Feb 2022, Hongtao Liu wrote:
>
> > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This uses the now passed SLP node to the vectorizer costing hook
> > > to adjust vector construction costs for the cost of moving an
> > > integer component from a GPR to a vector register when that's
> > > required for building a vector from components.  A cruical difference
> > > here is whether the component is loaded from memory or extracted
> > > from a vector register as in those cases no intermediate GPR is involved.
> > >
> > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > pr91446.c testcase now produces scalar code which looks superior
> > > to me so I've adjusted it as well.
> > >
> > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > after adding the BIT_FIELD_REF vector extracting special casing.
> > Does the patch handle PR101929?
>
> The patch will regress the testcase posted in PR101929 again:
>
>  _255 1 times scalar_store costs 12 in body
>  _261 1 times scalar_store costs 12 in body
>  _258 1 times scalar_store costs 12 in body
>  _264 1 times scalar_store costs 12 in body
>  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
>  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
>  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
>  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> -node 0x4182f48 1 times vec_construct costs 16 in prologue
> -node 0x41882b0 1 times vec_construct costs 16 in prologue
> +node 0x4182f48 1 times vec_construct costs 28 in prologue
> +node 0x41882b0 1 times vec_construct costs 28 in prologue
>  t0_406 + t2_451 1 times vector_stmt costs 4 in body
>  t1_472 - t3_444 1 times vector_stmt costs 4 in body
>  node 0x41829f8 1 times vec_perm costs 4 in body
>  _436 1 times vector_store costs 16 in body
>  t.c:37:9: note: Cost model analysis for part in loop 0:
> -  Vector cost: 60
> +  Vector cost: 84
>    Scalar cost: 64
> +t.c:37:9: missed: not vectorized: vectorization is not profitable.
>
> We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> where the components are results of integer adds (so the result is
> definitely in a GPR).  We are costing the construction as
> 4 * sse_op + 2 * sse_to_integer which with skylake cost is
> 4 * COSTS_N_INSNS (1) + 2 * 6.
>
> Whether the vectorization itself is profitable is likely questionable
> but then it's true that the construction of V4SI is more costly
> in terms of uops than a construction of V4SF.
>
> Now, we can - for the first time - now see the actual construction
> pattern and ideal construction might be two GPR->xmm moves
> + two splats + one unpack or maybe two GPR->xmm moves + one
> unpack + splat of DI (or other means of duplicating the lowpart).
Yes, the patch is technically right. I'm ok with the patch.
> It still wouldn't help here of course, and we would not have RTL
> expansion adhere to this scheme.
>
> Trying to capture secondary effects (the FRE opportunity unleashed)
> in costing of this particular SLP subgraph is difficult and probably
> undesirable though.  Adjusting any of the target specific costs
> is likely not going to work because of the original narrow profitability
> gap (60 vs. 64).
>
> For the particular kernel the overall vectorization strathegy needs
> to improve (PR98138 is about this I think).  I know we've reverted
> the previous change that attempted to cost for GPR -> XMM.  It did
>
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> +         int cost
> +           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
> +                                               ix86_cost->sse_op
> +                                               :
> ix86_cost->integer_to_sse);
> +
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>
> thus treated integer_to_sse as GPR insert into vector cost for
> integer components in general while the now proposed change
> _adds_ integer to XMM move costs (but only once for duplicate
> elements and not for the cases we know are from memory / vector regs).
> With the proposed patch we can probably be more "correct" above
> and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
> for FP the first one is free and for int we're costing it separately.
> Again it won't help here.
>
> Thanks,
> Richard.
>
> > >
> > > I suppose we can let autotesters look for SPEC performance fallout.
> > >
> > > OK if testing succeeds?
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 2022-02-18  Richard Biener  <rguenther@suse.de>
> > >
> > >         PR tree-optimization/104582
> > >         PR target/99881
> > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > >         Cost GPR to vector register moves for integer vector construction.
> > >
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> > >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> > >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > > ---
> > >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> > >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> > >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> > >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> > >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> > >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> > >  7 files changed, 102 insertions(+), 3 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 0830dbd7dca..b2bf90576d5 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> > >
> > >  unsigned
> > >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > -                                 stmt_vec_info stmt_info, slp_tree,
> > > +                                 stmt_vec_info stmt_info, slp_tree node,
> > >                                   tree vectype, int misalign,
> > >                                   vect_cost_model_location where)
> > >  {
> > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > >      }
> > > +  else if (kind == vec_construct
> > > +          && node
> > > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > > +    {
> > > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > +      unsigned i;
> > > +      tree op;
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > +       if (TREE_CODE (op) == SSA_NAME)
> > > +         TREE_VISITED (op) = 0;
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > +       {
> > > +         if (TREE_CODE (op) != SSA_NAME
> > > +             || TREE_VISITED (op))
> > > +           continue;
> > > +         TREE_VISITED (op) = 1;
> > > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > > +         tree tem;
> > > +         if (is_gimple_assign (def)
> > > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > > +             && TREE_CODE (tem) == SSA_NAME
> > > +             /* A sign-change expands to nothing.  */
> > > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > > +                                       TREE_TYPE (tem)))
> > > +           def = SSA_NAME_DEF_STMT (tem);
> > > +         /* When the component is loaded from memory we can directly
> > > +            move it to a vector register, otherwise we have to go
> > > +            via a GPR or via vpinsr which involves similar cost.
> > > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > > +            register we can hope to avoid using a GPR.  */
> > > +         if (!is_gimple_assign (def)
> > > +             || (!gimple_assign_load_p (def)
> > > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > > +           stmt_cost += ix86_cost->sse_to_integer;
> > > +       }
> > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > +       if (TREE_CODE (op) == SSA_NAME)
> > > +         TREE_VISITED (op) = 0;
> > > +    }
> > >    if (stmt_cost == -1)
> > >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > new file mode 100644
> > > index 00000000000..992a845ad7a
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { unsigned long a, b; } s;
> > > +
> > > +void
> > > +foo (unsigned long *a, unsigned long *b)
> > > +{
> > > +  unsigned long a_ = *a;
> > > +  unsigned long b_ = *b;
> > > +  s.a = a_;
> > > +  s.b = b_;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > new file mode 100644
> > > index 00000000000..7637cdb4a97
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { unsigned long a, b; } s;
> > > +
> > > +void
> > > +foo (unsigned long a, unsigned long b)
> > > +{
> > > +  s.a = a;
> > > +  s.b = b;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > new file mode 100644
> > > index 00000000000..999c4905708
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { double a, b; } s;
> > > +
> > > +void
> > > +foo (double a, double b)
> > > +{
> > > +  s.a = a;
> > > +  s.b = b;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > new file mode 100644
> > > index 00000000000..cc471e1ed73
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > +
> > > +struct S { unsigned long a, b; } s;
> > > +
> > > +void
> > > +foo (signed long *a, unsigned long *b)
> > > +{
> > > +  unsigned long a_ = *a;
> > > +  unsigned long b_ = *b;
> > > +  s.a = a_;
> > > +  s.b = b_;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > index 0243ca3ea68..067bf43f698 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> > >    bar (&t);
> > >  }
> > >
> > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > @@ -1,7 +1,7 @@
> > >  /* PR target/99881.  */
> > >  /* { dg-do compile { target { ! ia32 } } } */
> > >  /* { dg-options "-Ofast -march=skylake" } */
> > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> > >
> > >  void
> > >  foo (int* __restrict a, int n, int c)
> > > --
> > > 2.34.1
> >
> >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
Richard Biener Feb. 22, 2022, 7:58 a.m. UTC | #4
On Tue, 22 Feb 2022, Hongtao Liu wrote:

> On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Mon, 21 Feb 2022, Hongtao Liu wrote:
> >
> > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > This uses the now passed SLP node to the vectorizer costing hook
> > > > to adjust vector construction costs for the cost of moving an
> > > > integer component from a GPR to a vector register when that's
> > > > required for building a vector from components.  A cruical difference
> > > > here is whether the component is loaded from memory or extracted
> > > > from a vector register as in those cases no intermediate GPR is involved.
> > > >
> > > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > > pr91446.c testcase now produces scalar code which looks superior
> > > > to me so I've adjusted it as well.
> > > >
> > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > > after adding the BIT_FIELD_REF vector extracting special casing.
> > > Does the patch handle PR101929?
> >
> > The patch will regress the testcase posted in PR101929 again:
> >
> >  _255 1 times scalar_store costs 12 in body
> >  _261 1 times scalar_store costs 12 in body
> >  _258 1 times scalar_store costs 12 in body
> >  _264 1 times scalar_store costs 12 in body
> >  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
> >  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
> >  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
> >  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> > -node 0x4182f48 1 times vec_construct costs 16 in prologue
> > -node 0x41882b0 1 times vec_construct costs 16 in prologue
> > +node 0x4182f48 1 times vec_construct costs 28 in prologue
> > +node 0x41882b0 1 times vec_construct costs 28 in prologue
> >  t0_406 + t2_451 1 times vector_stmt costs 4 in body
> >  t1_472 - t3_444 1 times vector_stmt costs 4 in body
> >  node 0x41829f8 1 times vec_perm costs 4 in body
> >  _436 1 times vector_store costs 16 in body
> >  t.c:37:9: note: Cost model analysis for part in loop 0:
> > -  Vector cost: 60
> > +  Vector cost: 84
> >    Scalar cost: 64
> > +t.c:37:9: missed: not vectorized: vectorization is not profitable.
> >
> > We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> > where the components are results of integer adds (so the result is
> > definitely in a GPR).  We are costing the construction as
> > 4 * sse_op + 2 * sse_to_integer which with skylake cost is
> > 4 * COSTS_N_INSNS (1) + 2 * 6.
> >
> > Whether the vectorization itself is profitable is likely questionable
> > but then it's true that the construction of V4SI is more costly
> > in terms of uops than a construction of V4SF.
> >
> > Now, we can - for the first time - now see the actual construction
> > pattern and ideal construction might be two GPR->xmm moves
> > + two splats + one unpack or maybe two GPR->xmm moves + one
> > unpack + splat of DI (or other means of duplicating the lowpart).
> Yes, the patch is technically right. I'm ok with the patch.

Thanks, I've pushed it now.  I've also tested the suggested adjustment
doing

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b2bf90576d5..acf2cc977b4 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
       case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
-         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * 
ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);

successfully (with no effect on the PR101929 case as expected), I
will queue that for stage1 since it isn't known to fix any
regression (but I will keep it as option in case something pops up).

I'll also have a more detailed look into the x264_r case to see
if there's something we can do about the regression that will now
show up (and I'll watch autotesters).

Thanks,
Richard.


> > It still wouldn't help here of course, and we would not have RTL
> > expansion adhere to this scheme.
> >
> > Trying to capture secondary effects (the FRE opportunity unleashed)
> > in costing of this particular SLP subgraph is difficult and probably
> > undesirable though.  Adjusting any of the target specific costs
> > is likely not going to work because of the original narrow profitability
> > gap (60 vs. 64).
> >
> > For the particular kernel the overall vectorization strathegy needs
> > to improve (PR98138 is about this I think).  I know we've reverted
> > the previous change that attempted to cost for GPR -> XMM.  It did
> >
> >        case vec_construct:
> >         {
> >           /* N element inserts into SSE vectors.  */
> > +         int cost
> > +           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
> > +                                               ix86_cost->sse_op
> > +                                               :
> > ix86_cost->integer_to_sse);
> > +
> > -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> >
> > thus treated integer_to_sse as GPR insert into vector cost for
> > integer components in general while the now proposed change
> > _adds_ integer to XMM move costs (but only once for duplicate
> > elements and not for the cases we know are from memory / vector regs).
> > With the proposed patch we can probably be more "correct" above
> > and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
> > for FP the first one is free and for int we're costing it separately.
> > Again it won't help here.
> >
> > Thanks,
> > Richard.
> >
> > > >
> > > > I suppose we can let autotesters look for SPEC performance fallout.
> > > >
> > > > OK if testing succeeds?
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > 2022-02-18  Richard Biener  <rguenther@suse.de>
> > > >
> > > >         PR tree-optimization/104582
> > > >         PR target/99881
> > > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > >         Cost GPR to vector register moves for integer vector construction.
> > > >
> > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> > > >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> > > >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > > > ---
> > > >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> > > >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> > > >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> > > >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> > > >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> > > >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> > > >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> > > >  7 files changed, 102 insertions(+), 3 deletions(-)
> > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index 0830dbd7dca..b2bf90576d5 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> > > >
> > > >  unsigned
> > > >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > > -                                 stmt_vec_info stmt_info, slp_tree,
> > > > +                                 stmt_vec_info stmt_info, slp_tree node,
> > > >                                   tree vectype, int misalign,
> > > >                                   vect_cost_model_location where)
> > > >  {
> > > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > > >      }
> > > > +  else if (kind == vec_construct
> > > > +          && node
> > > > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > > > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > > > +    {
> > > > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > +      unsigned i;
> > > > +      tree op;
> > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > +       if (TREE_CODE (op) == SSA_NAME)
> > > > +         TREE_VISITED (op) = 0;
> > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > +       {
> > > > +         if (TREE_CODE (op) != SSA_NAME
> > > > +             || TREE_VISITED (op))
> > > > +           continue;
> > > > +         TREE_VISITED (op) = 1;
> > > > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > > > +         tree tem;
> > > > +         if (is_gimple_assign (def)
> > > > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > > > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > > > +             && TREE_CODE (tem) == SSA_NAME
> > > > +             /* A sign-change expands to nothing.  */
> > > > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > > > +                                       TREE_TYPE (tem)))
> > > > +           def = SSA_NAME_DEF_STMT (tem);
> > > > +         /* When the component is loaded from memory we can directly
> > > > +            move it to a vector register, otherwise we have to go
> > > > +            via a GPR or via vpinsr which involves similar cost.
> > > > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > > > +            register we can hope to avoid using a GPR.  */
> > > > +         if (!is_gimple_assign (def)
> > > > +             || (!gimple_assign_load_p (def)
> > > > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > > > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > > > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > > > +           stmt_cost += ix86_cost->sse_to_integer;
> > > > +       }
> > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > +       if (TREE_CODE (op) == SSA_NAME)
> > > > +         TREE_VISITED (op) = 0;
> > > > +    }
> > > >    if (stmt_cost == -1)
> > > >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > >
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > new file mode 100644
> > > > index 00000000000..992a845ad7a
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > @@ -0,0 +1,15 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > +
> > > > +struct S { unsigned long a, b; } s;
> > > > +
> > > > +void
> > > > +foo (unsigned long *a, unsigned long *b)
> > > > +{
> > > > +  unsigned long a_ = *a;
> > > > +  unsigned long b_ = *b;
> > > > +  s.a = a_;
> > > > +  s.b = b_;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > new file mode 100644
> > > > index 00000000000..7637cdb4a97
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > @@ -0,0 +1,13 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > +
> > > > +struct S { unsigned long a, b; } s;
> > > > +
> > > > +void
> > > > +foo (unsigned long a, unsigned long b)
> > > > +{
> > > > +  s.a = a;
> > > > +  s.b = b;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > new file mode 100644
> > > > index 00000000000..999c4905708
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > @@ -0,0 +1,13 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > +
> > > > +struct S { double a, b; } s;
> > > > +
> > > > +void
> > > > +foo (double a, double b)
> > > > +{
> > > > +  s.a = a;
> > > > +  s.b = b;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > new file mode 100644
> > > > index 00000000000..cc471e1ed73
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > @@ -0,0 +1,15 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > +
> > > > +struct S { unsigned long a, b; } s;
> > > > +
> > > > +void
> > > > +foo (signed long *a, unsigned long *b)
> > > > +{
> > > > +  unsigned long a_ = *a;
> > > > +  unsigned long b_ = *b;
> > > > +  s.a = a_;
> > > > +  s.b = b_;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > index 0243ca3ea68..067bf43f698 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> > > >    bar (&t);
> > > >  }
> > > >
> > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > @@ -1,7 +1,7 @@
> > > >  /* PR target/99881.  */
> > > >  /* { dg-do compile { target { ! ia32 } } } */
> > > >  /* { dg-options "-Ofast -march=skylake" } */
> > > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> > > >
> > > >  void
> > > >  foo (int* __restrict a, int n, int c)
> > > > --
> > > > 2.34.1
> > >
> > >
> > >
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
> 
> 
> 
>
Richard Biener Feb. 22, 2022, 9:51 a.m. UTC | #5
On Tue, 22 Feb 2022, Richard Biener wrote:

> On Tue, 22 Feb 2022, Hongtao Liu wrote:
> 
> > On Mon, Feb 21, 2022 at 5:10 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Mon, 21 Feb 2022, Hongtao Liu wrote:
> > >
> > > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > This uses the now passed SLP node to the vectorizer costing hook
> > > > > to adjust vector construction costs for the cost of moving an
> > > > > integer component from a GPR to a vector register when that's
> > > > > required for building a vector from components.  A cruical difference
> > > > > here is whether the component is loaded from memory or extracted
> > > > > from a vector register as in those cases no intermediate GPR is involved.
> > > > >
> > > > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > > > pr91446.c testcase now produces scalar code which looks superior
> > > > > to me so I've adjusted it as well.
> > > > >
> > > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > > > after adding the BIT_FIELD_REF vector extracting special casing.
> > > > Does the patch handle PR101929?
> > >
> > > The patch will regress the testcase posted in PR101929 again:
> > >
> > >  _255 1 times scalar_store costs 12 in body
> > >  _261 1 times scalar_store costs 12 in body
> > >  _258 1 times scalar_store costs 12 in body
> > >  _264 1 times scalar_store costs 12 in body
> > >  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
> > >  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
> > >  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
> > >  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> > > -node 0x4182f48 1 times vec_construct costs 16 in prologue
> > > -node 0x41882b0 1 times vec_construct costs 16 in prologue
> > > +node 0x4182f48 1 times vec_construct costs 28 in prologue
> > > +node 0x41882b0 1 times vec_construct costs 28 in prologue
> > >  t0_406 + t2_451 1 times vector_stmt costs 4 in body
> > >  t1_472 - t3_444 1 times vector_stmt costs 4 in body
> > >  node 0x41829f8 1 times vec_perm costs 4 in body
> > >  _436 1 times vector_store costs 16 in body
> > >  t.c:37:9: note: Cost model analysis for part in loop 0:
> > > -  Vector cost: 60
> > > +  Vector cost: 84
> > >    Scalar cost: 64
> > > +t.c:37:9: missed: not vectorized: vectorization is not profitable.
> > >
> > > We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> > > where the components are results of integer adds (so the result is
> > > definitely in a GPR).  We are costing the construction as
> > > 4 * sse_op + 2 * sse_to_integer which with skylake cost is
> > > 4 * COSTS_N_INSNS (1) + 2 * 6.
> > >
> > > Whether the vectorization itself is profitable is likely questionable
> > > but then it's true that the construction of V4SI is more costly
> > > in terms of uops than a construction of V4SF.
> > >
> > > Now, we can - for the first time - now see the actual construction
> > > pattern and ideal construction might be two GPR->xmm moves
> > > + two splats + one unpack or maybe two GPR->xmm moves + one
> > > unpack + splat of DI (or other means of duplicating the lowpart).
> > Yes, the patch is technically right. I'm ok with the patch.
> 
> Thanks, I've pushed it now.  I've also tested the suggested adjustment
> doing
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..acf2cc977b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum 
> vect_cost_for_stmt type_of_cost,
>        case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
> -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> +         int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * 
> ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           if (GET_MODE_BITSIZE (mode) == 256)
>             cost += ix86_vec_cost (mode, ix86_cost->addss);
> 
> successfully (with no effect on the PR101929 case as expected), I
> will queue that for stage1 since it isn't known to fix any
> regression (but I will keep it as option in case something pops up).
> 
> I'll also have a more detailed look into the x264_r case to see
> if there's something we can do about the regression that will now
> show up (and I'll watch autotesters).

I found a way to get back the vectorization doing

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 9188d727e33..7f1f12fb6c6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2374,7 +2375,7 @@ fail:
                n_vector_builds++;
            }
        }
-      if (all_uniform_p
+      if ((all_uniform_p && !two_operators)
          || n_vector_builds > 1
          || (n_vector_builds == children.length ()
              && is_a <gphi *> (stmt_info->stmt)))

which in itself is reasonable since the result of the operation
is a non-uniform vector.  That causes us to do

  _37 = {a3_245_6(D), a3_245_6(D), a3_245_6(D), a3_245_6(D)};
  _36 = {a2_232_5(D), a2_232_5(D), a2_232_5(D), a2_232_5(D)};
  vect__252_8.7_39 = _36 - _37;
  vect__250_7.6_38 = _36 + _37;
  _40 = VEC_PERM_EXPR <vect__250_7.6_38, vect__252_8.7_39, { 0, 5, 2, 7 
}>;

and then vector lowering does what vect_optimize_slp should do
as well (but doesn't):

  _26 = a2_232_5(D) - a3_245_6(D);
  vect__252_8.7_39 = {_26, _26, _26, _26};
  _25 = a2_232_5(D) + a3_245_6(D);
  vect__250_7.6_38 = {_25, _25, _25, _25};
  _40 = VEC_PERM_EXPR <vect__250_7.6_38, vect__252_8.7_39, { 0, 5, 2, 7 
}>;

and later stmt folding will get us back the non-uniform vector
construction by folding the VEC_PERM_EXPR where we simply assume
that we can optimally expand the constructor.

But the important part is that this way we evade vec_construct
costing and instead get

node 0x40e11a8 1 times scalar_to_vec costs 4 in prologue

where of course it's obvious that we miss to account for
the GPR -> XMM penalty here in the backend.  When we are not fixing
that we get the kernel for x264_r optimized like before.

A self-contained testcase for this opportunity is

void foo (unsigned * __restrict tmp, unsigned a0_206, unsigned a1_219,
          unsigned a2_232, unsigned a3_245)
{
  unsigned _246 = a0_206 + a1_219;
  unsigned _248 = a0_206 - a1_219;
  unsigned _250 = a2_232 + a3_245;
  unsigned _252 = a2_232 - a3_245;
  int t0_247 = (int) _246;
  int t1_249 = (int) _248;
  int t2_251 = (int) _250;
  int t3_253 = (int) _252;
  int _254 = t0_247 + t2_251;
  int _260 = t1_249 + t3_253;
  int _257 = t0_247 - t2_251;
  int _263 = t1_249 - t3_253;
  unsigned _255 = (unsigned int) _254;
  unsigned _261 = (unsigned int) _260;
  unsigned _258 = (unsigned int) _257;
  unsigned _264 = (unsigned int) _263;
  tmp[0] = _255;
  tmp[1] = _261;
  tmp[2] = _258;
  tmp[3] = _264;
}

which does show that the vectorization is not profitable since we
get

0000000000000000 <foo>:
   0:   89 c8                   mov    %ecx,%eax
   2:   44 01 c1                add    %r8d,%ecx
   5:   44 29 c0                sub    %r8d,%eax
   8:   c5 f9 6e d9             vmovd  %ecx,%xmm3
   c:   c4 e3 61 22 c8 01       vpinsrd $0x1,%eax,%xmm3,%xmm1
  12:   89 f0                   mov    %esi,%eax
  14:   01 d6                   add    %edx,%esi
  16:   29 d0                   sub    %edx,%eax
  18:   c5 f9 6e e6             vmovd  %esi,%xmm4
  1c:   c4 e3 59 22 c0 01       vpinsrd $0x1,%eax,%xmm4,%xmm0
  22:   c5 f1 6c c9             vpunpcklqdq %xmm1,%xmm1,%xmm1
  26:   c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
  2a:   c5 f9 fe d1             vpaddd %xmm1,%xmm0,%xmm2
  2e:   c5 f9 fa c1             vpsubd %xmm1,%xmm0,%xmm0
  32:   c5 e8 c6 c0 e4          vshufps $0xe4,%xmm0,%xmm2,%xmm0
  37:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
  3b:   c3                      ret    

vs the scalar code

0000000000000000 <foo>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   8d 3c 16                lea    (%rsi,%rdx,1),%edi
   6:   29 d6                   sub    %edx,%esi
   8:   42 8d 14 01             lea    (%rcx,%r8,1),%edx
   c:   44 29 c1                sub    %r8d,%ecx
   f:   44 8d 04 17             lea    (%rdi,%rdx,1),%r8d
  13:   44 89 00                mov    %r8d,(%rax)
  16:   29 d7                   sub    %edx,%edi
  18:   44 8d 04 0e             lea    (%rsi,%rcx,1),%r8d
  1c:   29 ce                   sub    %ecx,%esi
  1e:   44 89 40 04             mov    %r8d,0x4(%rax)
  22:   89 78 08                mov    %edi,0x8(%rax)
  25:   89 70 0c                mov    %esi,0xc(%rax)
  28:   c3                      ret    

but we could choose to ignore this for the sake of avoiding
the x264 regression (shall it appear which I think it will)
and try to address this in GCC 12.

Richard.

> Thanks,
> Richard.
> 
> 
> > > It still wouldn't help here of course, and we would not have RTL
> > > expansion adhere to this scheme.
> > >
> > > Trying to capture secondary effects (the FRE opportunity unleashed)
> > > in costing of this particular SLP subgraph is difficult and probably
> > > undesirable though.  Adjusting any of the target specific costs
> > > is likely not going to work because of the original narrow profitability
> > > gap (60 vs. 64).
> > >
> > > For the particular kernel the overall vectorization strathegy needs
> > > to improve (PR98138 is about this I think).  I know we've reverted
> > > the previous change that attempted to cost for GPR -> XMM.  It did
> > >
> > >        case vec_construct:
> > >         {
> > >           /* N element inserts into SSE vectors.  */
> > > +         int cost
> > > +           = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
> > > +                                               ix86_cost->sse_op
> > > +                                               :
> > > ix86_cost->integer_to_sse);
> > > +
> > > -         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > >
> > > thus treated integer_to_sse as GPR insert into vector cost for
> > > integer components in general while the now proposed change
> > > _adds_ integer to XMM move costs (but only once for duplicate
> > > elements and not for the cases we know are from memory / vector regs).
> > > With the proposed patch we can probably be more "correct" above
> > > and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
> > > for FP the first one is free and for int we're costing it separately.
> > > Again it won't help here.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > >
> > > > > I suppose we can let autotesters look for SPEC performance fallout.
> > > > >
> > > > > OK if testing succeeds?
> > > > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > > 2022-02-18  Richard Biener  <rguenther@suse.de>
> > > > >
> > > > >         PR tree-optimization/104582
> > > > >         PR target/99881
> > > > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > > >         Cost GPR to vector register moves for integer vector construction.
> > > > >
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> > > > >         * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> > > > >         * gcc.target/i386/pr99881.c: Un-XFAIL.
> > > > >         * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc                       | 45 ++++++++++++++++++-
> > > > >  .../costmodel/x86_64/costmodel-pr104582-1.c   | 15 +++++++
> > > > >  .../costmodel/x86_64/costmodel-pr104582-2.c   | 13 ++++++
> > > > >  .../costmodel/x86_64/costmodel-pr104582-3.c   | 13 ++++++
> > > > >  .../costmodel/x86_64/costmodel-pr104582-4.c   | 15 +++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr91446.c       |  2 +-
> > > > >  gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
> > > > >  7 files changed, 102 insertions(+), 3 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index 0830dbd7dca..b2bf90576d5 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -22997,7 +22997,7 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
> > > > >
> > > > >  unsigned
> > > > >  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > > > -                                 stmt_vec_info stmt_info, slp_tree,
> > > > > +                                 stmt_vec_info stmt_info, slp_tree node,
> > > > >                                   tree vectype, int misalign,
> > > > >                                   vect_cost_model_location where)
> > > > >  {
> > > > > @@ -23160,6 +23160,49 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > > >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > >        stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > > > >      }
> > > > > +  else if (kind == vec_construct
> > > > > +          && node
> > > > > +          && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > > > > +          && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > > > > +    {
> > > > > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > > +      unsigned i;
> > > > > +      tree op;
> > > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > > +       if (TREE_CODE (op) == SSA_NAME)
> > > > > +         TREE_VISITED (op) = 0;
> > > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > > +       {
> > > > > +         if (TREE_CODE (op) != SSA_NAME
> > > > > +             || TREE_VISITED (op))
> > > > > +           continue;
> > > > > +         TREE_VISITED (op) = 1;
> > > > > +         gimple *def = SSA_NAME_DEF_STMT (op);
> > > > > +         tree tem;
> > > > > +         if (is_gimple_assign (def)
> > > > > +             && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
> > > > > +             && ((tem = gimple_assign_rhs1 (def)), true)
> > > > > +             && TREE_CODE (tem) == SSA_NAME
> > > > > +             /* A sign-change expands to nothing.  */
> > > > > +             && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
> > > > > +                                       TREE_TYPE (tem)))
> > > > > +           def = SSA_NAME_DEF_STMT (tem);
> > > > > +         /* When the component is loaded from memory we can directly
> > > > > +            move it to a vector register, otherwise we have to go
> > > > > +            via a GPR or via vpinsr which involves similar cost.
> > > > > +            Likewise with a BIT_FIELD_REF extracting from a vector
> > > > > +            register we can hope to avoid using a GPR.  */
> > > > > +         if (!is_gimple_assign (def)
> > > > > +             || (!gimple_assign_load_p (def)
> > > > > +                 && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> > > > > +                     || !VECTOR_TYPE_P (TREE_TYPE
> > > > > +                               (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > > > > +           stmt_cost += ix86_cost->sse_to_integer;
> > > > > +       }
> > > > > +      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> > > > > +       if (TREE_CODE (op) == SSA_NAME)
> > > > > +         TREE_VISITED (op) = 0;
> > > > > +    }
> > > > >    if (stmt_cost == -1)
> > > > >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > > new file mode 100644
> > > > > index 00000000000..992a845ad7a
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
> > > > > @@ -0,0 +1,15 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { unsigned long a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (unsigned long *a, unsigned long *b)
> > > > > +{
> > > > > +  unsigned long a_ = *a;
> > > > > +  unsigned long b_ = *b;
> > > > > +  s.a = a_;
> > > > > +  s.b = b_;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > > new file mode 100644
> > > > > index 00000000000..7637cdb4a97
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
> > > > > @@ -0,0 +1,13 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { unsigned long a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (unsigned long a, unsigned long b)
> > > > > +{
> > > > > +  s.a = a;
> > > > > +  s.b = b;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > > new file mode 100644
> > > > > index 00000000000..999c4905708
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
> > > > > @@ -0,0 +1,13 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { double a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (double a, double b)
> > > > > +{
> > > > > +  s.a = a;
> > > > > +  s.b = b;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > > new file mode 100644
> > > > > index 00000000000..cc471e1ed73
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
> > > > > @@ -0,0 +1,15 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
> > > > > +
> > > > > +struct S { unsigned long a, b; } s;
> > > > > +
> > > > > +void
> > > > > +foo (signed long *a, unsigned long *b)
> > > > > +{
> > > > > +  unsigned long a_ = *a;
> > > > > +  unsigned long b_ = *b;
> > > > > +  s.a = a_;
> > > > > +  s.b = b_;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > > index 0243ca3ea68..067bf43f698 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c
> > > > > @@ -21,4 +21,4 @@ foo (unsigned long long width, unsigned long long height,
> > > > >    bar (&t);
> > > > >  }
> > > > >
> > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
> > > > > +/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > > index 3e087eb2ed7..a1ec1d1ba8a 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c
> > > > > @@ -1,7 +1,7 @@
> > > > >  /* PR target/99881.  */
> > > > >  /* { dg-do compile { target { ! ia32 } } } */
> > > > >  /* { dg-options "-Ofast -march=skylake" } */
> > > > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
> > > > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
> > > > >
> > > > >  void
> > > > >  foo (int* __restrict a, int n, int c)
> > > > > --
> > > > > 2.34.1
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > > Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
> > 
> > 
> > 
> > 
> 
>
diff mbox series

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 0830dbd7dca..b2bf90576d5 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22997,7 +22997,7 @@  ix86_vectorize_create_costs (vec_info *vinfo, bool costing_for_scalar)
 
 unsigned
 ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
-				  stmt_vec_info stmt_info, slp_tree,
+				  stmt_vec_info stmt_info, slp_tree node,
 				  tree vectype, int misalign,
 				  vect_cost_model_location where)
 {
@@ -23160,6 +23160,49 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
+  else if (kind == vec_construct
+	   && node
+	   && SLP_TREE_DEF_TYPE (node) == vect_external_def
+	   && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
+    {
+      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
+      unsigned i;
+      tree op;
+      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
+	if (TREE_CODE (op) == SSA_NAME)
+	  TREE_VISITED (op) = 0;
+      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
+	{
+	  if (TREE_CODE (op) != SSA_NAME
+	      || TREE_VISITED (op))
+	    continue;
+	  TREE_VISITED (op) = 1;
+	  gimple *def = SSA_NAME_DEF_STMT (op);
+	  tree tem;
+	  if (is_gimple_assign (def)
+	      && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def))
+	      && ((tem = gimple_assign_rhs1 (def)), true)
+	      && TREE_CODE (tem) == SSA_NAME
+	      /* A sign-change expands to nothing.  */
+	      && tree_nop_conversion_p (TREE_TYPE (gimple_assign_lhs (def)),
+					TREE_TYPE (tem)))
+	    def = SSA_NAME_DEF_STMT (tem);
+	  /* When the component is loaded from memory we can directly
+	     move it to a vector register, otherwise we have to go
+	     via a GPR or via vpinsr which involves similar cost.
+	     Likewise with a BIT_FIELD_REF extracting from a vector
+	     register we can hope to avoid using a GPR.  */
+	  if (!is_gimple_assign (def)
+	      || (!gimple_assign_load_p (def)
+		  && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
+		      || !VECTOR_TYPE_P (TREE_TYPE
+				(TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
+	    stmt_cost += ix86_cost->sse_to_integer;
+	}
+      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
+	if (TREE_CODE (op) == SSA_NAME)
+	  TREE_VISITED (op) = 0;
+    }
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
new file mode 100644
index 00000000000..992a845ad7a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
+
+struct S { unsigned long a, b; } s;
+
+void
+foo (unsigned long *a, unsigned long *b)
+{
+  unsigned long a_ = *a;
+  unsigned long b_ = *b;
+  s.a = a_;
+  s.b = b_;
+}
+
+/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
new file mode 100644
index 00000000000..7637cdb4a97
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
+
+struct S { unsigned long a, b; } s;
+
+void
+foo (unsigned long a, unsigned long b)
+{
+  s.a = a;
+  s.b = b;
+}
+
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
new file mode 100644
index 00000000000..999c4905708
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
+
+struct S { double a, b; } s;
+
+void
+foo (double a, double b)
+{
+  s.a = a;
+  s.b = b;
+}
+
+/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
new file mode 100644
index 00000000000..cc471e1ed73
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-msse -fdump-tree-slp2-details" } */
+
+struct S { unsigned long a, b; } s;
+
+void
+foo (signed long *a, unsigned long *b)
+{
+  unsigned long a_ = *a;
+  unsigned long b_ = *b;
+  s.a = a_;
+  s.b = b_;
+}
+
+/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
index 0243ca3ea68..067bf43f698 100644
--- a/gcc/testsuite/gcc.target/i386/pr91446.c
+++ b/gcc/testsuite/gcc.target/i386/pr91446.c
@@ -21,4 +21,4 @@  foo (unsigned long long width, unsigned long long height,
   bar (&t);
 }
 
-/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "xmm\[0-9\]" 0 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
index 3e087eb2ed7..a1ec1d1ba8a 100644
--- a/gcc/testsuite/gcc.target/i386/pr99881.c
+++ b/gcc/testsuite/gcc.target/i386/pr99881.c
@@ -1,7 +1,7 @@ 
 /* PR target/99881.  */
 /* { dg-do compile { target { ! ia32 } } } */
 /* { dg-options "-Ofast -march=skylake" } */
-/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
 
 void
 foo (int* __restrict a, int n, int c)