
Don't vectorize when vector stmts are only vec_construct and stores

Message ID 20231204053208.908533-1-hongtao.liu@intel.com
State New
Series Don't vectorize when vector stmts are only vec_construct and stores

Commit Message

Liu, Hongtao Dec. 4, 2023, 5:32 a.m. UTC
I.e., for cases like:
   a[0] = b1;
   a[1] = b2;
   ..
   a[n] = bn;

There are extra dependences when constructing the vector, but not for the
scalar stores. According to experiments, vectorizing is generally worse.
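
Concretely, a compilable variant of the pattern (hypothetical, just for
illustration):

void
foo (int *a, int b1, int b2, int b3, int b4)
{
  /* Vectorized: a vec_construct of {b1, b2, b3, b4} feeding one V4SI
     store; scalar: four independent 4-byte stores.  */
  a[0] = b1;
  a[1] = b2;
  a[2] = b3;
  a[3] = b4;
}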

The patch adds a cut-off heuristic for the case where the vector stmts are
just a vec_construct and vector stores. It improves SPEC2017 a little bit.

Benchmarks		Ratio
500.perlbench_r		2.60%
502.gcc_r		0.30%
505.mcf_r		0.40%
520.omnetpp_r		-1.00%
523.xalancbmk_r		0.90%
525.x264_r		0.00%
531.deepsjeng_r		0.30%
541.leela_r		0.90%
548.exchange2_r		3.20%
557.xz_r		1.40%
503.bwaves_r		0.00%
507.cactuBSSN_r		0.00%
508.namd_r		0.30%
510.parest_r		0.00%
511.povray_r		0.20%
519.lbm_r		SAME BIN
521.wrf_r		-0.30%
526.blender_r		-1.20%
527.cam4_r		-0.20%
538.imagick_r		4.00%
544.nab_r		0.40%
549.fotonik3d_r		0.00%
554.roms_r		0.00%
Geomean-int		0.90%
Geomean-fp		0.30%
Geomean-all		0.50%

Regressed testcases:

gcc.target/i386/part-vect-absneghf.c
gcc.target/i386/part-vect-copysignhf.c
gcc.target/i386/part-vect-xorsignhf.c

Regressed under -m32 since 2 vector .ABS/NEG/XORSIGN/COPYSIGN operations
are now generated vs. the original single 64-bit vec_construct. The
original testcases are meant to test vectorization capability for
.ABS/NEG/XORSIGN/COPYSIGN, so just restrict them to TARGET_64BIT.

gcc.target/i386/pr111023-2.c
gcc.target/i386/pr111023.c
Regressed under -m32. The testcase is:

void
v8hi_v8qi (v8hi *dst, v16qi src)
{
  short tem[8];
  tem[0] = src[0];
  tem[1] = src[1];
  tem[2] = src[2];
  tem[3] = src[3];
  tem[4] = src[4];
  tem[5] = src[5];
  tem[6] = src[6];
  tem[7] = src[7];
  dst[0] = *(v8hi *) tem;
}

Under a 64-bit target, the vectorizer realizes it's just a permutation of
the original src vector, but under -m32 the vectorizer relies on
vec_construct for the vectorization. I think the optimization for this
case under a 32-bit target may not matter much, so just add
-fno-vect-cost-model.

gcc.target/i386/pr91446.c: This testcase guards the cost model of vector
stores, not vectorization capability, so just adjust the testcase.

gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to be
optimized to bswap; as with other follow-up optimizations, the vectorizer
can't foresee the optimization that happens after it. So the current
solution is to add -fno-vect-cost-model to the testcase.
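
For reference, the byte-reversal idea is roughly of the following shape (a
hypothetical sketch, not the actual pr108938-3.c testcase):

void
store_bswap (unsigned char *dst, unsigned int x)
{
  /* The bytes of x stored in reverse order; once the group of stores is
     vectorized through a vec_construct, later passes can turn it into a
     bswap plus a single 4-byte store.  */
  dst[0] = x >> 24;
  dst[1] = x >> 16;
  dst[2] = x >> 8;
  dst[3] = x;
}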

costmodel-pr104582-1.c
costmodel-pr104582-3.c
costmodel-pr104582-4.c

These fail since they're no longer vectorized; looking at the PR, that's
exactly what's wanted, so adjust the testcases to scan-tree-dump-not.


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/99881
	PR target/104582
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Check if kind is vec_construct or vector store.
	(ix86_vector_costs::finish_cost): Don't do vectorization when
	vector stmts are only vec_construct and stores.
	(ix86_vector_costs::ix86_vect_construct_store_only_p): New
	function.
	(ix86_vector_costs::ix86_vect_cut_off): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/part-vect-absneghf.c: Restrict testcase to
	TARGET_64BIT.
	* gcc.target/i386/part-vect-copysignhf.c: Ditto.
	* gcc.target/i386/part-vect-xorsignhf.c: Ditto.
	* gcc.target/i386/pr91446.c: Adjust testcase.
	* gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model.
	* gcc.target/i386/pr111023-2.c: Ditto.
	* gcc.target/i386/pr111023.c: Ditto.
	* gcc.target/i386/pr99881.c: Remove xfail.
	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Change
	to scan-tree-dump-not.
	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto.
	* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto.
---
 gcc/config/i386/i386.cc                       | 81 ++++++++++++++++++-
 .../costmodel/x86_64/costmodel-pr104582-1.c   |  2 +-
 .../costmodel/x86_64/costmodel-pr104582-3.c   |  2 +-
 .../costmodel/x86_64/costmodel-pr104582-4.c   |  2 +-
 .../gcc.target/i386/part-vect-absneghf.c      |  4 +-
 .../gcc.target/i386/part-vect-copysignhf.c    |  4 +-
 .../gcc.target/i386/part-vect-xorsignhf.c     |  4 +-
 gcc/testsuite/gcc.target/i386/pr108938-3.c    |  2 +-
 gcc/testsuite/gcc.target/i386/pr111023-2.c    |  2 +-
 gcc/testsuite/gcc.target/i386/pr111023.c      |  2 +-
 gcc/testsuite/gcc.target/i386/pr91446.c       | 14 ++--
 gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
 12 files changed, 99 insertions(+), 22 deletions(-)

Comments

Richard Biener Dec. 4, 2023, 2:10 p.m. UTC | #1
On Mon, Dec 4, 2023 at 6:32 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> .i.e. for below cases.
>    a[0] = b1;
>    a[1] = b2;
>    ..
>    a[n] = bn;
>
> There're extra dependences when contructing the vector, but not for
> scalar store. According to experiments, it's generally worse.
>
> The patch adds an cut-off heuristic when vec_stmt is just
> vec_construct and vector store. It improves SPEC2017 a little bit.
>
> BenchMarks              Ratio
> 500.perlbench_r         2.60%
> 502.gcc_r               0.30%
> 505.mcf_r               0.40%
> 520.omnetpp_r           -1.00%
> 523.xalancbmk_r         0.90%
> 525.x264_r              0.00%
> 531.deepsjeng_r         0.30%
> 541.leela_r             0.90%
> 548.exchange2_r         3.20%
> 557.xz_r                1.40%
> 503.bwaves_r            0.00%
> 507.cactuBSSN_r         0.00%
> 508.namd_r              0.30%
> 510.parest_r            0.00%
> 511.povray_r            0.20%
> 519.lbm_r               SAME BIN
> 521.wrf_r               -0.30%
> 526.blender_r           -1.20%
> 527.cam4_r              -0.20%
> 538.imagick_r           4.00%
> 544.nab_r               0.40%
> 549.fotonik3d_r         0.00%
> 554.roms_r              0.00%
> Geomean-int             0.90%
> Geomean-fp              0.30%
> Geomean-all             0.50%
>
> And
> Regressed testcases:
>
> gcc.target/i386/part-vect-absneghf.c
> gcc.target/i386/part-vect-copysignhf.c
> gcc.target/i386/part-vect-xorsignhf.c
>
> Regressed under -m32 since it generates 2 vector
> .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The
> original testcases are used to test vectorization capability for
> .ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT.
>
> gcc.target/i386/pr111023-2.c
> gcc.target/i386/pr111023.c
> Regressed under -m32
>
> testcase as below
>
> void
> v8hi_v8qi (v8hi *dst, v16qi src)
> {
>   short tem[8];
>   tem[0] = src[0];
>   tem[1] = src[1];
>   tem[2] = src[2];
>   tem[3] = src[3];
>   tem[4] = src[4];
>   tem[5] = src[5];
>   tem[6] = src[6];
>   tem[7] = src[7];
>   dst[0] = *(v8hi *) tem;
> }
>
> under 64-bit target, vectorizer realize it's just permutation of
> original src vector, but under -m32, vectorizer relies on
> vec_construct for vectorization. I think optimziation for this case
> under 32-bit target maynot impact much, so just add
> -fno-vect-cost-model.
>
> gcc.target/i386/pr91446.c: This testcase is guard for cost model of
> vector store, not vectorization capability, so just adjust testcase.
>
> gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to
> optimize for bswap, like other optimziation vectorizer can't realize
> optimization after it. So the current solution is add
> -fno-vect-cost-model to the testcase.
>
> costmodel-pr104582-1.c
> costmodel-pr104582-2.c
> costmodel-pr104582-4.c
>
> Failed since it's now not vectorized, looked at the PR, it's exactly
> what's wanted, so adjust testcase to scan-tree-dump-not.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

So the original motivation to not more aggressively prune
store-from-CTOR vectorization in the vectorizer itself is that
the vector store is possibly better for STLF (larger stores are
good, larger loads eventually problematic).

I'd also expect the costs to play out to not make those profitable.

OTOH, if you have a series of 'double' stores you can convert to
a series of V2DF stores you _may_ be faster if this reduces
pressure on the store unit.  Esp. V2DF is cheap to construct
with one movhpd.
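
(Purely as a hypothetical illustration of that case: two adjacent double
stores that can become one V2DF store, with the vector built e.g. via
unpcklpd for register operands or movhpd when the second value comes from
memory.)

void
store_two (double *p, double a, double b)
{
  p[0] = a;
  p[1] = b;
}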

So I don't think we want to try to pattern match it this way?

In fact the SLP vectorization cases could all arrive with an
SLP node specified (vectorizable_store would have to be
changed here), which means you could check for a
vect_external_def child instead?
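
(A rough sketch of such a check, assuming the SLP node were made available
to the backend cost hook as suggested above; hypothetical, not part of the
patch:)

  /* Inside ix86_vector_costs::add_stmt_cost, with NODE being the SLP
     node of the store: true when the stored value is a CTOR of scalars
     defined outside the vectorized region.  */
  bool ctor_store_p
    = (node
       && kind == vector_store
       && SLP_TREE_CHILDREN (node).length () == 1
       && SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[0]) == vect_external_def);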

But as said, I would hope that we can arrive at a better way
assessing the CONSTRUCTOR cost.  IMHO one big issue
is that load and store cost are comparatively high compared
to simple stmt ops so it's very hard to offset saving many
stores with "ops".  That's because we generally think of
'cost' to model latency but as you say stores don't really
have latency - we only have store bandwidth of the store
unit and of course issue width (but that's true for other ops
as well).  I wonder what happens if we set both scalar and
vector store cost to zero?  Or maybe one (to count one
issue slot)?
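
(One way such an experiment could look in ix86_vector_costs::add_stmt_cost,
purely as a sketch, not something that has been tested:)

  /* Model stores as taking one issue slot instead of having latency,
     on both the scalar and the vector side.  */
  if (kind == scalar_store
      || kind == vector_store
      || kind == unaligned_store)
    stmt_cost = 1;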

Richard.


Hongtao Liu Dec. 6, 2023, 2:16 a.m. UTC | #2
On Mon, Dec 4, 2023 at 10:10 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, Dec 4, 2023 at 6:32 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > .i.e. for below cases.
> >    a[0] = b1;
> >    a[1] = b2;
> >    ..
> >    a[n] = bn;
> >
> > There're extra dependences when contructing the vector, but not for
> > scalar store. According to experiments, it's generally worse.
> >
> > The patch adds an cut-off heuristic when vec_stmt is just
> > vec_construct and vector store. It improves SPEC2017 a little bit.
> >
> > BenchMarks              Ratio
> > 500.perlbench_r         2.60%
> > 502.gcc_r               0.30%
> > 505.mcf_r               0.40%
> > 520.omnetpp_r           -1.00%
> > 523.xalancbmk_r         0.90%
> > 525.x264_r              0.00%
> > 531.deepsjeng_r         0.30%
> > 541.leela_r             0.90%
> > 548.exchange2_r         3.20%
> > 557.xz_r                1.40%
> > 503.bwaves_r            0.00%
> > 507.cactuBSSN_r         0.00%
> > 508.namd_r              0.30%
> > 510.parest_r            0.00%
> > 511.povray_r            0.20%
> > 519.lbm_r               SAME BIN
> > 521.wrf_r               -0.30%
> > 526.blender_r           -1.20%
> > 527.cam4_r              -0.20%
> > 538.imagick_r           4.00%
> > 544.nab_r               0.40%
> > 549.fotonik3d_r         0.00%
> > 554.roms_r              0.00%
> > Geomean-int             0.90%
> > Geomean-fp              0.30%
> > Geomean-all             0.50%
> >
> > And
> > Regressed testcases:
> >
> > gcc.target/i386/part-vect-absneghf.c
> > gcc.target/i386/part-vect-copysignhf.c
> > gcc.target/i386/part-vect-xorsignhf.c
> >
> > Regressed under -m32 since it generates 2 vector
> > .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The
> > original testcases are used to test vectorization capability for
> > .ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT.
> >
> > gcc.target/i386/pr111023-2.c
> > gcc.target/i386/pr111023.c
> > Regressed under -m32
> >
> > testcase as below
> >
> > void
> > v8hi_v8qi (v8hi *dst, v16qi src)
> > {
> >   short tem[8];
> >   tem[0] = src[0];
> >   tem[1] = src[1];
> >   tem[2] = src[2];
> >   tem[3] = src[3];
> >   tem[4] = src[4];
> >   tem[5] = src[5];
> >   tem[6] = src[6];
> >   tem[7] = src[7];
> >   dst[0] = *(v8hi *) tem;
> > }
> >
> > under 64-bit target, vectorizer realize it's just permutation of
> > original src vector, but under -m32, vectorizer relies on
> > vec_construct for vectorization. I think optimziation for this case
> > under 32-bit target maynot impact much, so just add
> > -fno-vect-cost-model.
> >
> > gcc.target/i386/pr91446.c: This testcase is guard for cost model of
> > vector store, not vectorization capability, so just adjust testcase.
> >
> > gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to
> > optimize for bswap, like other optimziation vectorizer can't realize
> > optimization after it. So the current solution is add
> > -fno-vect-cost-model to the testcase.
> >
> > costmodel-pr104582-1.c
> > costmodel-pr104582-2.c
> > costmodel-pr104582-4.c
> >
> > Failed since it's now not vectorized, looked at the PR, it's exactly
> > what's wanted, so adjust testcase to scan-tree-dump-not.
> >
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> So the original motivation to not more aggressively prune
> store-from-CTOR vectorization in the vectorizer itself is that
> the vector store is possibly better for STLF (larger stores are
> good, larger loads eventually problematic).

That's exactly what I was worried about; I didn't observe any STLF
stalls in SPEC2017, but I'll try with more benchmarks.
On the other hand, the cost model is not well suited to solving this
problem; at best it only circumvents part of it.

>
> I'd also expect the costs to play out to not make those profitable.
>
> OTOH, if you have a series of 'double' stores you can convert to
> a series of V2DF stores you _may_ be faster if this reduces
> pressure on the store unit.  Esp. V2DF is cheap to construct
> with one movhpd.

>
> So I don't think we want to try to pattern match it this way?
>
> In fact the SLP vectorization cases could all arrive with an
> SLP node specified (vectorizable_store would have to be
> changed here), which means you could check for an
> vect_external_def child instead?
>
> But as said, I would hope that we can arrive at a better way
> assessing the CONSTRUCTOR cost.  IMHO one big issue
> is that load and store cost are comparatively high compared
> to simple stmt ops so it's very hard to offset saving many
> stores with "ops".  That's because we generally think of
> 'cost' to model latency but as you say stores don't really
> have latency - we only have store bandwidth of the store

Yes.

> unit and of course issue width (but that's true for other ops
> as well).  I wonder what happens if we set both scalar and
> vector store cost to zero?  Or maybe one (to count one
> issue slot)?

I tried reducing the cost of the scalar store, but it regressed other cases.
I haven't tried zero/one yet; it sounds like a good idea, let me try that.


Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index dcaea6c2096..a4b23e29eba 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24573,6 +24573,10 @@  public:
 
 private:
 
+  /* Don't do vectorization for certain patterns.  */
+  void ix86_vect_cut_off ();
+
+  bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info);
   /* Estimate register pressure of the vectorized code.  */
   void ix86_vect_estimate_reg_pressure ();
   /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
@@ -24581,12 +24585,15 @@  private:
      where we know it's not loaded from memory.  */
   unsigned m_num_gpr_needed[3];
   unsigned m_num_sse_needed[3];
+
+  bool m_vec_construct_store_only;
 };
 
 ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar)
   : vector_costs (vinfo, costing_for_scalar),
     m_num_gpr_needed (),
-    m_num_sse_needed ()
+    m_num_sse_needed (),
+    m_vec_construct_store_only (true)
 {
 }
 
@@ -24609,6 +24616,10 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
     = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store);
   int stmt_cost = - 1;
 
+  if (m_vec_construct_store_only
+      && !ix86_vect_construct_store_only_p (kind, stmt_info))
+    m_vec_construct_store_only = false;
+
   bool fp = false;
   machine_mode mode = scalar_p ? SImode : TImode;
 
@@ -24865,8 +24876,45 @@  ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
     }
 }
 
+/* Return true if KIND with STMT_INFO is either a vec_construct or a
+   vector store.  STMT_INFO could also be a promote/demote between the
+   vec_construct and the vector store; count that too.  */
+bool
+ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind,
+						    stmt_vec_info stmt_info)
+{
+  switch (kind)
+    {
+    case vec_construct:
+    case vector_store:
+    case unaligned_store:
+    case vector_scatter_store:
+    case vec_promote_demote:
+      return true;
+
+      /* Vectorizer will try VEC_PACK_TRUNC_EXPR for things like
+	 char* a;
+	 short b1, b2, b3, b4;
+	 a[0] = b1;
+	 a[1] = b2;
+	 a[2] = b3;
+	 a[3] = b4;
+	 Also don't vectorize it.  */
+    case vector_stmt:
+      if (stmt_info && stmt_info->stmt
+	  && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN
+	  && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR)
+	return true;
+
+    default:
+      break;
+    }
+
+  return false;
+}
+
 void
-ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+ix86_vector_costs::ix86_vect_cut_off ()
 {
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
   if (loop_vinfo && !m_costing_for_scalar)
@@ -24885,10 +24933,39 @@  ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
 	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
 	m_costs[vect_body] = INT_MAX;
+      return;
+    }
+
+  /* Don't do vectorization for things like
+
+     a[0] = b1;
+     a[1] = b2;
+     ..
+     a[n] = bn;
+
+     There are extra dependences when constructing the vector, but not for
+     the scalar stores.  According to experiments, it's generally worse.  */
+  if (m_vec_construct_store_only)
+    {
+      m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Skip vectorization for stmts which only contain"
+			 " vec_construct and vector store.\n");
+      return;
     }
 
+  return;
+}
+
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+
   ix86_vect_estimate_reg_pressure ();
 
+  ix86_vect_cut_off ();
+
   vector_costs::finish_cost (scalar_costs);
 }
 
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
index 992a845ad7a..f940af70b72 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
@@ -12,4 +12,4 @@  foo (unsigned long *a, unsigned long *b)
   s.b = b_;
 }
 
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
index 999c4905708..eff60b2a82a 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
@@ -10,4 +10,4 @@  foo (double a, double b)
   s.b = b;
 }
 
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
index cc471e1ed73..2f354338e96 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
@@ -12,4 +12,4 @@  foo (signed long *a, unsigned long *b)
   s.b = b_;
 }
 
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
index 48aed14d604..4052210ec39 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
@@ -85,7 +85,7 @@  do_test (void)
       abort ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
index 811617bc3dd..006941f6651 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
@@ -55,6 +55,6 @@  do_test (void)
       abort ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
index a8ec60a088a..a57dc5aba23 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
@@ -55,6 +55,6 @@  do_test (void)
       abort ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c
index 32ac544c7ed..24725a9ab1d 100644
--- a/gcc/testsuite/gcc.target/i386/pr108938-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */
+/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */
 /* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c
index 6c69f947544..db25668a9ae 100644
--- a/gcc/testsuite/gcc.target/i386/pr111023-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c
@@ -1,6 +1,6 @@ 
 /* PR target/111023 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */
+/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */
 
 typedef char v16qi __attribute__((vector_size (16)));
 typedef short v8hi __attribute__((vector_size (16)));
diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c
index 6144c371f32..18cd579d937 100644
--- a/gcc/testsuite/gcc.target/i386/pr111023.c
+++ b/gcc/testsuite/gcc.target/i386/pr111023.c
@@ -1,6 +1,6 @@ 
 /* PR target/111023 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */
+/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */
 
 typedef unsigned char v16qi __attribute__((vector_size (16)));
 typedef unsigned short v8hi __attribute__((vector_size (16)));
diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
index 0243ca3ea68..3e0f8a4b315 100644
--- a/gcc/testsuite/gcc.target/i386/pr91446.c
+++ b/gcc/testsuite/gcc.target/i386/pr91446.c
@@ -10,15 +10,15 @@  typedef struct
 extern void bar (info *);
 
 void
-foo (unsigned long long width, unsigned long long height,
-     long long x, long long y)
+foo (unsigned long long* width,
+     long long* x)
 {
   info t;
-  t.width = width;
-  t.height = height;
-  t.x = x;
-  t.y = y;
+  t.width = width[0];
+  t.height = width[1];
+  t.x = x[0];
+  t.y = x[1];
   bar (&t);
 }
 
-/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
index 3e087eb2ed7..1a59325ac1c 100644
--- a/gcc/testsuite/gcc.target/i386/pr99881.c
+++ b/gcc/testsuite/gcc.target/i386/pr99881.c
@@ -1,7 +1,7 @@ 
 /* PR target/99881.  */
 /* { dg-do compile { target { ! ia32 } } } */
 /* { dg-options "-Ofast -march=skylake" } */
-/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not "xmm\[0-9\]"  } } */
 
 void
 foo (int* __restrict a, int n, int c)