
Adjust costing of emulated vectorized gather/scatter

Message ID 20230830103516.882926-1-hongtao.liu@intel.com
State New
Series Adjust costing of emulated vectorized gather/scatter

Commit Message

Liu, Hongtao Aug. 30, 2023, 10:35 a.m. UTC
r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
gather/scatter.
----
commit 24905a4bd1375ccd99c02510b9f9529015a48315
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jan 18 11:04:49 2023 +0100

    Adjust costing of emulated vectorized gather/scatter

    Emulated gather/scatter behave similarly to strided elementwise
    accesses in that they need to decompose the offset vector
    and construct or decompose the data vector, so handle them
    the same way, pessimizing the cases with many elements.
----

But for emulated gather/scatter, the offset vector load/vec_construct has
already been counted, and in real cases it's probably eliminated by the
later optimizers.
Also, after decomposing, element loads from contiguous memory could be
less bound than normal elementwise loads.
The patch decreases the cost a little bit.

This will enable gather emulation for the loop below with VF=8 (ymm):

double
foo (double* a, double* b, unsigned int* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}
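
For reference, with gathers emulated the vector body amounts to roughly the
following per vector iteration (a hand-written sketch of the decomposition,
not the exact code the vectorizer generates; the helper name and temporaries
are illustrative):

/* Sketch only: what one vector iteration of the emulated gather boils
   down to for the loop above.  */
static double
emulated_gather_step (const double *a, const double *b,
                      const unsigned int *c, int i)
{
  double b_vals[8], partial = 0.0;
  /* One vector load of c[i..i+7] supplies the offsets; each element is
     then extracted (vec_to_scalar) and used for a scalar load from b.  */
  for (int j = 0; j < 8; j++)
    b_vals[j] = b[c[i + j]];
  /* b_vals is rebuilt into vector registers (vec_construct) and fed to
     the FMAs with a[i..i+7]; shown here as plain scalar code.  */
  for (int j = 0; j < 8; j++)
    partial += a[i + j] * b_vals[j];
  return partial;
}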

For the above loop, microbenchmark results on ICX show that emulated
gather with VF=8 is 30% faster than emulated gather with VF=4 when the
trip count is big enough.
It brings back ~4% for 510.parest; there is still a ~5% regression
compared to using the gather instruction, due to being throughput bound.

For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
loop; VF remains 4 (xmm) as before (presumably related to their own
cost models).


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/111064
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Decrease cost a little bit for vec_to_scalar(offset vector) in
	emulated gather.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr111064.c: New test.
---
 gcc/config/i386/i386.cc                  | 11 ++++++++++-
 gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c

Comments

Richard Biener Aug. 30, 2023, 12:18 p.m. UTC | #1
On Wed, Aug 30, 2023 at 12:38 PM liuhongt via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
> gather/scatter.
> ----
> commit 24905a4bd1375ccd99c02510b9f9529015a48315
> Author: Richard Biener <rguenther@suse.de>
> Date:   Wed Jan 18 11:04:49 2023 +0100
>
>     Adjust costing of emulated vectorized gather/scatter
>
>     Emulated gather/scatter behave similarly to strided elementwise
>     accesses in that they need to decompose the offset vector
>     and construct or decompose the data vector, so handle them
>     the same way, pessimizing the cases with many elements.
> ----
>
> But for emulated gather/scatter, the offset vector load/vec_construct has
> already been counted, and in real cases it's probably eliminated by the
> later optimizers.
> Also, after decomposing, element loads from contiguous memory could be
> less bound than normal elementwise loads.
> The patch decreases the cost a little bit.
>
> This will enable gather emulation for the loop below with VF=8 (ymm):
>
> double
> foo (double* a, double* b, unsigned int* c, int n)
> {
>   double sum = 0;
>   for (int i = 0; i != n; i++)
>     sum += a[i] * b[c[i]];
>   return sum;
> }
>
> For the above loop, microbenchmark results on ICX show that emulated
> gather with VF=8 is 30% faster than emulated gather with VF=4 when the
> trip count is big enough.
> It brings back ~4% for 510.parest; there is still a ~5% regression
> compared to using the gather instruction, due to being throughput bound.
>
> For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
> loop; VF remains 4 (xmm) as before (presumably related to their own
> cost models).
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/111064
>         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>         Decrease cost a little bit for vec_to_scalar(offset vector) in
>         emulated gather.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr111064.c: New test.
> ---
>  gcc/config/i386/i386.cc                  | 11 ++++++++++-
>  gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
>  2 files changed, 22 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 1bc3f11ff07..337e0f1bfbb 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>           || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
>      {
>        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> -      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> +      /* For emulated gather/scatter, offset vector load/vec_construct has
> +        already been counted and in real case, it's probably eliminated by
> +        later optimizer.
> +        Also after decomposing, element loads from continous memory
> +        could be less bounded compared to normal elementwise load.  */
> +      if (kind == vec_to_scalar
> +         && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> +       stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);

For gather we cost N vector extracts (from the offset vector), N scalar loads
(the actual data loads) and one vec_construct.

For scatter we cost N vector extracts (from the offset vector),
N vector extracts (from the data vector) and N scalar stores.
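
In other words, the per-copy accounting for an emulated gather/scatter of an
N-element data vector looks roughly like the following (a standalone sketch
with made-up unit costs; the constants are placeholders, not real i386 cost
table entries):

#include <stdio.h>

int main (void)
{
  const int n = 8;                 /* elements per data vector */
  const int cost_vec_to_scalar = 4, cost_scalar_load = 12,
            cost_scalar_store = 12, cost_vec_construct = 40;

  /* Emulated gather: N offset extracts, N scalar loads, one vec_construct.  */
  int gather = n * cost_vec_to_scalar + n * cost_scalar_load
               + cost_vec_construct;
  /* Emulated scatter: N offset extracts, N data extracts, N scalar stores.  */
  int scatter = 2 * n * cost_vec_to_scalar + n * cost_scalar_store;

  printf ("gather=%d scatter=%d\n", gather, scatter);
  return 0;
}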

It was intended to penalize the extracts the same way as vector construction.

Your change will adjust all three different decomposition kinds "a bit".
I realize the scaling by (TYPE_VECTOR_SUBPARTS + 1) is kind-of arbitrary,
but so is your adjustment, and I don't see why VMAT_GATHER_SCATTER is
special to your adjustment.

So the comment you put before the special-casing doesn't really make
sense to me.

For zen4 costing we currently have

*_11 8 times vec_to_scalar costs 576 in body
*_11 8 times scalar_load costs 96 in body
*_11 1 times vec_construct costs 792 in body

for zmm

*_11 4 times vec_to_scalar costs 80 in body
*_11 4 times scalar_load costs 48 in body
*_11 1 times vec_construct costs 100 in body

for ymm and

*_11 2 times vec_to_scalar costs 24 in body
*_11 2 times scalar_load costs 24 in body
*_11 1 times vec_construct costs 12 in body

for xmm.  Even with your adjustment, if we were to enable cost comparison
between vector sizes, I bet we'd choose xmm (you can try by re-ordering the
modes in the ix86_autovectorize_vector_modes hook).  So it feels like a
hack.  If you think that Icelake should enable 4-element vectorized emulated
gather, then we should disable this individual scaling and possibly instead
penalize when the number of (emulated) gathers is too high?
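
For reference, that re-ordering experiment amounts to something like the
following inside the hook (a debug-only sketch; the real
ix86_autovectorize_vector_modes has more cases and honors the
preferred-vector-width tuning):

/* Sketch only: push the 128-bit mode first so it is tried (and, with
   cost comparison enabled, compared) before the wider modes.  */
modes->safe_push (V16QImode);    /* xmm first */
if (TARGET_AVX)
  modes->safe_push (V32QImode);  /* then ymm */
if (TARGET_AVX512F)
  modes->safe_push (V64QImode);  /* then zmm */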

That said, we could count the number of element extracts and inserts
(and maybe [scalar] loads and stores) and at finish_cost time weight them
against the number of "other" operations.

As repeatedly said, the current cost model setup is a bit
garbage-in-garbage-out since it in no way models latency correctly; instead
it disregards all dependencies and simply counts ops.

> +      else
> +       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
>      }
>    else if ((kind == vec_construct || kind == scalar_to_vec)
>            && node
> diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
> new file mode 100644
> index 00000000000..aa2589bd36f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr111064.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
> +/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
> +
> +double
> +foo (double* a, double* b, unsigned int* c, int n)
> +{
> +  double sum = 0;
> +  for (int i = 0; i != n; i++)
> +    sum += a[i] * b[c[i]];
> +  return sum;
> +}
> --
> 2.31.1
>
Hongtao Liu Aug. 31, 2023, 8:06 a.m. UTC | #2
On Wed, Aug 30, 2023 at 8:18 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Wed, Aug 30, 2023 at 12:38 PM liuhongt via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
> > gather/scatter.
> > ----
> > commit 24905a4bd1375ccd99c02510b9f9529015a48315
> > Author: Richard Biener <rguenther@suse.de>
> > Date:   Wed Jan 18 11:04:49 2023 +0100
> >
> >     Adjust costing of emulated vectorized gather/scatter
> >
> >     Emulated gather/scatter behave similarly to strided elementwise
> >     accesses in that they need to decompose the offset vector
> >     and construct or decompose the data vector, so handle them
> >     the same way, pessimizing the cases with many elements.
> > ----
> >
> > But for emulated gather/scatter, the offset vector load/vec_construct has
> > already been counted, and in real cases it's probably eliminated by the
> > later optimizers.
> > Also, after decomposing, element loads from contiguous memory could be
> > less bound than normal elementwise loads.
> > The patch decreases the cost a little bit.
> >
> > This will enable gather emulation for the loop below with VF=8 (ymm):
> >
> > double
> > foo (double* a, double* b, unsigned int* c, int n)
> > {
> >   double sum = 0;
> >   for (int i = 0; i != n; i++)
> >     sum += a[i] * b[c[i]];
> >   return sum;
> > }
> >
> > For the above loop, microbenchmark results on ICX show that emulated
> > gather with VF=8 is 30% faster than emulated gather with VF=4 when the
> > trip count is big enough.
> > It brings back ~4% for 510.parest; there is still a ~5% regression
> > compared to using the gather instruction, due to being throughput bound.
> >
> > For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
> > loop; VF remains 4 (xmm) as before (presumably related to their own
> > cost models).
> >
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/111064
> >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> >         Decrease cost a little bit for vec_to_scalar(offset vector) in
> >         emulated gather.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/pr111064.c: New test.
> > ---
> >  gcc/config/i386/i386.cc                  | 11 ++++++++++-
> >  gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
> >  2 files changed, 22 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 1bc3f11ff07..337e0f1bfbb 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >           || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
> >      {
> >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > -      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > +      /* For emulated gather/scatter, offset vector load/vec_construct has
> > +        already been counted and in real case, it's probably eliminated by
> > +        later optimizer.
> > +        Also after decomposing, element loads from continous memory
> > +        could be less bounded compared to normal elementwise load.  */
> > +      if (kind == vec_to_scalar
> > +         && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > +       stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
>
> For gather we cost N vector extracts (from the offset vector), N scalar loads
> (the actual data loads) and one vec_construct.
>
> For scatter we cost N vector extracts (from the offset vector),
> N vector extracts (from the data vector) and N scalar stores.
>
> It was intended to penalize the extracts the same way as vector construction.
>
> Your change will adjust all three different decomposition kinds "a bit".
> I realize the scaling by (TYPE_VECTOR_SUBPARTS + 1) is kind-of arbitrary,
> but so is your adjustment, and I don't see why VMAT_GATHER_SCATTER is
> special to your adjustment.
>
> So the comment you put before the special-casing doesn't really make
> sense to me.
>
> For zen4 costing we currently have
>
> *_11 8 times vec_to_scalar costs 576 in body
> *_11 8 times scalar_load costs 96 in body
> *_11 1 times vec_construct costs 792 in body
>
> for zmm
>
> *_11 4 times vec_to_scalar costs 80 in body
> *_11 4 times scalar_load costs 48 in body
> *_11 1 times vec_construct costs 100 in body
>
> for ymm and
>
> *_11 2 times vec_to_scalar costs 24 in body
> *_11 2 times scalar_load costs 24 in body
> *_11 1 times vec_construct costs 12 in body
>
> for xmm.  Even with your adjustment, if we were to enable cost comparison
> between vector sizes, I bet we'd choose xmm (you can try by re-ordering
> the modes in the ix86_autovectorize_vector_modes hook).  So it feels like
> a hack.  If you think that Icelake should enable 4-element vectorized
> emulated gather, then we should disable this individual scaling and
> possibly instead penalize when the number of (emulated) gathers is too high?
I think even for elementwise load/store the penalty is too high.
Looking at the original issue PR84037, the regression comes from many parts,
and for the related issue PR87561 the regression is due to the outer-loop
context (similar for PR82862), not a real vectorization issue (the PR82862
vectorized code standalone is even faster than the scalar version).
The stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1) scaling seems to just
disable vectorization as a workaround, not a realistic estimation.
For simplicity, maybe we should reduce the penalty, i.e. stmt_cost *=
TYPE_VECTOR_SUBPARTS (vectype) / 2; at least with this the vectorizer will
still choose ymm even with cost comparison.
But I'm not sure if this will regress PR87561, PR84037 or PR84016.  (Maybe
we should only reduce the penalty when there's no outer loop, due to
PR87561/PR82862; does that make some sense?)
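
As a concrete sketch of that alternative (illustration only, not a tested
patch; it simply halves the multiplier in ix86_vector_costs::add_stmt_cost,
and nunits is always a compile-time constant on x86):

/* Sketch of the "reduce the penalty" idea; untested.  */
unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
stmt_cost *= MAX (nunits / 2, 1);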
>
> That said, we could count the number of element extracts and inserts
> (and maybe [scalar] loads and stores) and at finish_cost time weight them
> against the number of "other" operations.
>
> As repeatedly said, the current cost model setup is a bit
> garbage-in-garbage-out since it in no way models latency correctly;
> instead it disregards all dependencies and simply counts ops.
>
> > +      else
> > +       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> >      }
> >    else if ((kind == vec_construct || kind == scalar_to_vec)
> >            && node
> > diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
> > new file mode 100644
> > index 00000000000..aa2589bd36f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr111064.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
> > +
> > +double
> > +foo (double* a, double* b, unsigned int* c, int n)
> > +{
> > +  double sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[c[i]];
> > +  return sum;
> > +}
> > --
> > 2.31.1
> >
Richard Biener Aug. 31, 2023, 8:53 a.m. UTC | #3
On Thu, Aug 31, 2023 at 10:06 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Aug 30, 2023 at 8:18 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Wed, Aug 30, 2023 at 12:38 PM liuhongt via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
> > > gather/scatter.
> > > ----
> > > commit 24905a4bd1375ccd99c02510b9f9529015a48315
> > > Author: Richard Biener <rguenther@suse.de>
> > > Date:   Wed Jan 18 11:04:49 2023 +0100
> > >
> > >     Adjust costing of emulated vectorized gather/scatter
> > >
> > >     Emulated gather/scatter behave similarly to strided elementwise
> > >     accesses in that they need to decompose the offset vector
> > >     and construct or decompose the data vector, so handle them
> > >     the same way, pessimizing the cases with many elements.
> > > ----
> > >
> > > But for emulated gather/scatter, the offset vector load/vec_construct has
> > > already been counted, and in real cases it's probably eliminated by the
> > > later optimizers.
> > > Also, after decomposing, element loads from contiguous memory could be
> > > less bound than normal elementwise loads.
> > > The patch decreases the cost a little bit.
> > >
> > > This will enable gather emulation for the loop below with VF=8 (ymm):
> > >
> > > double
> > > foo (double* a, double* b, unsigned int* c, int n)
> > > {
> > >   double sum = 0;
> > >   for (int i = 0; i != n; i++)
> > >     sum += a[i] * b[c[i]];
> > >   return sum;
> > > }
> > >
> > > For the above loop, microbenchmark results on ICX show that emulated
> > > gather with VF=8 is 30% faster than emulated gather with VF=4 when the
> > > trip count is big enough.
> > > It brings back ~4% for 510.parest; there is still a ~5% regression
> > > compared to using the gather instruction, due to being throughput bound.
> > >
> > > For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
> > > loop; VF remains 4 (xmm) as before (presumably related to their own
> > > cost models).
> > >
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > >         PR target/111064
> > >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > >         Decrease cost a little bit for vec_to_scalar(offset vector) in
> > >         emulated gather.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.target/i386/pr111064.c: New test.
> > > ---
> > >  gcc/config/i386/i386.cc                  | 11 ++++++++++-
> > >  gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
> > >  2 files changed, 22 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 1bc3f11ff07..337e0f1bfbb 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > >           || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
> > >      {
> > >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > > -      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > > +      /* For emulated gather/scatter, offset vector load/vec_construct has
> > > +        already been counted and in real case, it's probably eliminated by
> > > +        later optimizer.
> > > +        Also after decomposing, element loads from continous memory
> > > +        could be less bounded compared to normal elementwise load.  */
> > > +      if (kind == vec_to_scalar
> > > +         && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > > +       stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
> >
> > For gather we cost N vector extracts (from the offset vector), N scalar loads
> > (the actual data loads) and one vec_construct.
> >
> > For scatter we cost N vector extracts (from the offset vector),
> > N vector extracts (from the data vector) and N scalar stores.
> >
> > It was intended to penalize the extracts the same way as vector construction.
> >
> > Your change will adjust all three different decomposition kinds "a bit".
> > I realize the scaling by (TYPE_VECTOR_SUBPARTS + 1) is kind-of arbitrary,
> > but so is your adjustment, and I don't see why VMAT_GATHER_SCATTER is
> > special to your adjustment.
> >
> > So the comment you put before the special-casing doesn't really make
> > sense to me.
> >
> > For zen4 costing we currently have
> >
> > *_11 8 times vec_to_scalar costs 576 in body
> > *_11 8 times scalar_load costs 96 in body
> > *_11 1 times vec_construct costs 792 in body
> >
> > for zmm
> >
> > *_11 4 times vec_to_scalar costs 80 in body
> > *_11 4 times scalar_load costs 48 in body
> > *_11 1 times vec_construct costs 100 in body
> >
> > for ymm and
> >
> > *_11 2 times vec_to_scalar costs 24 in body
> > *_11 2 times scalar_load costs 24 in body
> > *_11 1 times vec_construct costs 12 in body
> >
> > for xmm.  Even with your adjustment, if we were to enable cost comparison
> > between vector sizes, I bet we'd choose xmm (you can try by re-ordering
> > the modes in the ix86_autovectorize_vector_modes hook).  So it feels like
> > a hack.  If you think that Icelake should enable 4-element vectorized
> > emulated gather, then we should disable this individual scaling and
> > possibly instead penalize when the number of (emulated) gathers is too high?
> I think even for elementwise load/store the penalty is too high.
> Looking at the original issue PR84037, the regression comes from many
> parts, and for the related issue PR87561 the regression is due to the
> outer-loop context (similar for PR82862), not a real vectorization issue
> (the PR82862 vectorized code standalone is even faster than the scalar
> version).
> The stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1) scaling seems to
> just disable vectorization as a workaround, not a realistic estimation.
> For simplicity, maybe we should reduce the penalty, i.e. stmt_cost *=
> TYPE_VECTOR_SUBPARTS (vectype) / 2; at least with this the vectorizer
> will still choose ymm even with cost comparison.
> But I'm not sure if this will regress PR87561, PR84037 or PR84016.
> (Maybe we should only reduce the penalty when there's no outer loop, due
> to PR87561/PR82862; does that make some sense?)

Indeed the scaling is more a workaround than a real fix.  The original issue
is to honor the fact that using any form of strided load/store or
decomposition of vectors will put us "closer" to simply unrolling the loop
VF times, but with the added disadvantage that we tie the unrolled copies
into vectors, removing the advantage of scalar unrolling, namely that the
unrolled copies can execute in parallel (to some extent).

That's why I suggest moving the workaround to finish_cost: estimate a
scalar unroll factor that would arrive at a similar number of ops and disable
vectorization when either a very small loop gets very big (so no uop cache
or loop stream detector will work on it) or when the vectorization benefit
is smaller than this estimated scalar unroll factor.

A real fix would need to track dependences and estimate the latency of a
loop iteration (which in turn needs modeling of execution resources).

So - can you try removing the existing scaling and instead adding a
counter accumulating 'count' for vect_body?  Note that for scalar costing
we account stmts to 'vect_prologue' (we should possibly change that;
BB vectorization uses 'vect_body' correctly here).  And then do some
cut-off heuristic in finish_cost, setting costs to INT_MAX as the existing
example there does?  If we'd use COMPARE_COSTs we could also apply
scaling or simply use that info when comparing costs, but we don't
do that currently, so we have to hard-reject some cases (which is also
why the scaling is so big).
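
A minimal sketch of the kind of finish_cost cutoff being suggested (the
m_num_* counters are hypothetical members that add_stmt_cost would have to
accumulate for vect_body; the existing body of
ix86_vector_costs::finish_cost is omitted):

/* Sketch only.  Assumes two hypothetical counters filled in add_stmt_cost:
   m_num_elt_extracts (vec_to_scalar/vec_construct pieces from emulated
   gather/scatter and strided accesses) and m_num_body_stmts (all
   vect_body statements).  */
void
ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
{
  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
  if (loop_vinfo
      /* Hard-reject this mode when decomposition dominates the body,
         instead of scaling each statement by TYPE_VECTOR_SUBPARTS + 1.  */
      && m_num_elt_extracts * 2 > m_num_body_stmts)
    m_costs[vect_body] = INT_MAX;

  vector_costs::finish_cost (scalar_costs);
}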

> >
> > That said, we could count the number of element extracts and inserts
> > (and maybe [scalar] loads and stores) and at finish_cost time weight them
> > against the number of "other" operations.
> >
> > As repeatedly said, the current cost model setup is a bit
> > garbage-in-garbage-out since it in no way models latency correctly;
> > instead it disregards all dependencies and simply counts ops.
> >
> > > +      else
> > > +       stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
> > >      }
> > >    else if ((kind == vec_construct || kind == scalar_to_vec)
> > >            && node
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
> > > new file mode 100644
> > > index 00000000000..aa2589bd36f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr111064.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
> > > +/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
> > > +
> > > +double
> > > +foo (double* a, double* b, unsigned int* c, int n)
> > > +{
> > > +  double sum = 0;
> > > +  for (int i = 0; i != n; i++)
> > > +    sum += a[i] * b[c[i]];
> > > +  return sum;
> > > +}
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> BR,
> Hongtao

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1bc3f11ff07..337e0f1bfbb 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24079,7 +24079,16 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	  || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
+      /* For emulated gather/scatter, offset vector load/vec_construct has
+	 already been counted and in real case, it's probably eliminated by
+	 later optimizer.
+	 Also after decomposing, element loads from continous memory
+	 could be less bounded compared to normal elementwise load.  */
+      if (kind == vec_to_scalar
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+	stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      else
+	stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   else if ((kind == vec_construct || kind == scalar_to_vec)
 	   && node
diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
new file mode 100644
index 00000000000..aa2589bd36f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr111064.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
+
+double
+foo (double* a, double* b, unsigned int* c, int n)
+{
+  double sum = 0;
+  for (int i = 0; i != n; i++)
+    sum += a[i] * b[c[i]];
+  return sum;
+}