[i386] Fix vec_construct cost, remove unused ix86_vec_cost arg

Message ID alpine.LSU.2.20.1810110929110.16707@zhemvz.fhfr.qr
State New
Headers show
Series
  • [i386] Fix vec_construct cost, remove unused ix86_vec_cost arg
Related show

Commit Message

Richard Biener Oct. 11, 2018, 7:46 a.m.
The following fixes vec_construct cost calculation to properly consider
that the inserts will happen to SSE regs thus forgo the multiplication
done in ix86_vec_cost which is passed the wrong mode.  This gets rid of
the only call passing false to ix86_vec_cost (so consider the patch
amended to remove the arg if approved).

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK for trunk?

I am considering to make the factor we apply in ix86_vec_cost
which currently depends on X86_TUNE_AVX128_OPTIMAL and
X86_TUNE_SSE_SPLIT_REGS part of the actual cost tables since
the reason we apply them are underlying CPU architecture details.
Was the original reason of doing the multiplication based on
those tunings to be able to "share" the same basic cost table
across architectures that differ in this important detail?
I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8
and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2
and m_ZNVER1.  Those all have (multiple) exclusive processor_cost_table
entries.

As a first step I'd like to remove the use of ix86_vec_cost for
the entries that already have entries for multiple modes
(loads and stores) and apply the factor there.  For example
Zen can do two 128bit loads per cycle but only one 128bit store.
With multiplying AVX256 costs by two we seem to cost sth like
# instructions to dispatch * instruction latency which is an
odd thing.  I'd have expected # instructions to dispatch / instruction 
throughput * instruction latency - so a AVX256 add would cost
the same as a AVX128 add, likewise for loads but stores would be
more expensive because of the throughput issue.  This all
ignores resource utilization across multiple insns but that's
how the cost model works ...

Thanks,
Richard.

2018-10-11  Richard Biener  <rguenther@suse.de>

	* config/i386/i386.c (ix86_vec_cost): Remove !parallel path
	and argument.
	(ix86_builtin_vectorization_cost): For vec_construct properly
	cost insertion into SSE regs.
	(...): Adjust calls to ix86_vec_cost.

Comments

Richard Biener Oct. 11, 2018, 9:15 a.m. | #1
On Thu, 11 Oct 2018, Richard Biener wrote:

> 
> The following fixes vec_construct cost calculation to properly consider
> that the inserts will happen to SSE regs thus forgo the multiplication
> done in ix86_vec_cost which is passed the wrong mode.  This gets rid of
> the only call passing false to ix86_vec_cost (so consider the patch
> amended to remove the arg if approved).
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> OK for trunk?
> 
> I am considering to make the factor we apply in ix86_vec_cost
> which currently depends on X86_TUNE_AVX128_OPTIMAL and
> X86_TUNE_SSE_SPLIT_REGS part of the actual cost tables since
> the reason we apply them are underlying CPU architecture details.
> Was the original reason of doing the multiplication based on
> those tunings to be able to "share" the same basic cost table
> across architectures that differ in this important detail?
> I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8
> and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2
> and m_ZNVER1.  Those all have (multiple) exclusive processor_cost_table
> entries.
> 
> As a first step I'd like to remove the use of ix86_vec_cost for
> the entries that already have entries for multiple modes
> (loads and stores) and apply the factor there.  For example
> Zen can do two 128bit loads per cycle but only one 128bit store.
> With multiplying AVX256 costs by two we seem to cost sth like
> # instructions to dispatch * instruction latency which is an
> odd thing.  I'd have expected # instructions to dispatch / instruction 
> throughput * instruction latency - so a AVX256 add would cost
> the same as a AVX128 add, likewise for loads but stores would be
> more expensive because of the throughput issue.  This all
> ignores resource utilization across multiple insns but that's
> how the cost model works ...

So like the following which removes the use of ix86_vec_cost
for SSE loads and stores since we have per-mode costs already.
I've applied the relevant factor to the individual cost tables
(noting that for X86_TUNE_SSE_SPLIT_REGS we only apply the
multiplication for size == 128, not size >= 128 ...)

There's a ??? hunk in inline_memory_move_cost where we
failed to apply the scaling thus in that place we'd now have
a behavior change.  Alternatively I could leave the cost
tables unaltered if that costing part is more critical than
the vectorizer one.

I've also spotted, when reviewing ix86_vec_cost uses, a bug
in ix86_rtx_cost which keys on SFmode which doesn't work
for SSE modes, thus use GET_MODE_INNER.

Also I've changed X86_TUNE_AVX128_OPTIMAL to also apply
to BTVER1 - everywhere else we glob BTVER1 and BTVER2 so
this must surely be a omission.

Honza - is a patch like this OK?

Should I split out individual fixes to make bisection possible?

Should I update the cost tables or instead change the vectorizer
costing when considering the inline_memory_move_cost "issue"?

Thanks,
Richard.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 0cf4152acb2..f5392232f61 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -39432,6 +39432,7 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass,
       int index = sse_store_index (mode);
       if (index == -1)
 	return 100;
+      /* ??? */
       if (in == 2)
         return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
       return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
@@ -40183,7 +40181,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
         gcc_assert (TARGET_FMA || TARGET_FMA4 || TARGET_AVX512F);
 
         *total = ix86_vec_cost (mode,
-				mode == SFmode ? cost->fmass : cost->fmasd,
+				GET_MODE_INNER (mode) == SFmode
+				? cost->fmass : cost->fmasd,
 				true);
 	*total += rtx_cost (XEXP (x, 1), mode, FMA, 1, speed);
 
@@ -45122,18 +45121,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	/* See PR82713 - we may end up being called on non-vector type.  */
 	if (index < 0)
 	  index = 2;
-        return ix86_vec_cost (mode,
-			      COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2,
-			      true);
+        return COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2;
 
       case vector_store:
 	index = sse_store_index (mode);
 	/* See PR82713 - we may end up being called on non-vector type.  */
 	if (index < 0)
 	  index = 2;
-        return ix86_vec_cost (mode,
-			      COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2,
-			      true);
+        return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;
 
       case vec_to_scalar:
       case scalar_to_vec:
@@ -45146,20 +45141,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	/* See PR82713 - we may end up being called on non-vector type.  */
 	if (index < 0)
 	  index = 2;
-        return ix86_vec_cost (mode,
-			      COSTS_N_INSNS
-				 (ix86_cost->sse_unaligned_load[index]) / 2,
-			      true);
+        return COSTS_N_INSNS (ix86_cost->sse_unaligned_load[index]) / 2;
 
       case unaligned_store:
 	index = sse_store_index (mode);
 	/* See PR82713 - we may end up being called on non-vector type.  */
 	if (index < 0)
 	  index = 2;
-        return ix86_vec_cost (mode,
-			      COSTS_N_INSNS
-				 (ix86_cost->sse_unaligned_store[index]) / 2,
-			      true);
+        return COSTS_N_INSNS (ix86_cost->sse_unaligned_store[index]) / 2;
 
       case vector_gather_load:
         return ix86_vec_cost (mode,
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 9c8ae0a7841..59d0a8b17d0 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -795,12 +795,12 @@ struct processor_costs athlon_cost = {
   {4, 4},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {4, 4, 6, 12, 24},			/* cost of loading SSE registers
+  {4, 4, 12, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 6, 12, 24},			/* cost of unaligned loads.  */
-  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
+  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
+  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
+  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
   5, 5,					/* SSE->integer and integer->SSE moves */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
@@ -891,12 +891,12 @@ struct processor_costs k8_cost = {
   {4, 4},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {4, 3, 6, 12, 24},			/* cost of loading SSE registers
+  {4, 3, 12, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 3, 6, 12, 24},			/* cost of unaligned loads.  */
-  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
+  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
+  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
+  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
   5, 5,					/* SSE->integer and integer->SSE moves */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
@@ -1100,12 +1100,12 @@ const struct processor_costs bdver1_cost = {
   {10, 10},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
+  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
-  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
+  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
+  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
+  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
   16, 20,				/* SSE->integer and integer->SSE moves */
   12, 12,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
@@ -1203,12 +1203,12 @@ const struct processor_costs bdver2_cost = {
   {10, 10},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
+  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
-  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
+  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
+  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
+  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
   16, 20,				/* SSE->integer and integer->SSE moves */
   12, 12,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
@@ -1305,12 +1305,12 @@ struct processor_costs bdver3_cost = {
   {10, 10},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
+  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
-  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
+  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
+  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
+  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
   16, 20,				/* SSE->integer and integer->SSE moves */
   12, 12,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
@@ -1406,12 +1406,12 @@ struct processor_costs bdver4_cost = {
   {10, 10},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
+  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
-  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
+  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
+  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
+  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
   16, 20,				/* SSE->integer and integer->SSE moves */
   12, 12,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
@@ -1518,12 +1518,12 @@ struct processor_costs znver1_cost = {
   {8, 8},				/* cost of storing MMX registers
 					   in SImode and DImode.  */
   2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
-  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
+  {6, 6, 6, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},			/* cost of storing SSE registers
+  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
      throughput 12.  Approx 9 uops do not depend on vector size and every load
@@ -1726,12 +1726,12 @@ const struct processor_costs btver1_cost = {
   {12, 12},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
+  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
-  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
+  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
+  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
+  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
   14, 14,				/* SSE->integer and integer->SSE moves */
   10, 10,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
@@ -1817,12 +1817,12 @@ const struct processor_costs btver2_cost = {
   {12, 12},				/* cost of storing MMX registers
 					   in SImode and DImode */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
-  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
+  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
-  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
+  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
+  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
+  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
   14, 14,				/* SSE->integer and integer->SSE moves */
   10, 10,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index a46450ad99d..ad8661075aa 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -441,7 +441,7 @@ DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal"
 
 /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
    the auto-vectorizer.  */
-DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
+DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER
 	  | m_ZNVER1)
 
 /* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX
Richard Biener Oct. 15, 2018, 1:36 p.m. | #2
On Thu, 11 Oct 2018, Richard Biener wrote:

> On Thu, 11 Oct 2018, Richard Biener wrote:
> 
> > 
> > The following fixes vec_construct cost calculation to properly consider
> > that the inserts will happen to SSE regs thus forgo the multiplication
> > done in ix86_vec_cost which is passed the wrong mode.  This gets rid of
> > the only call passing false to ix86_vec_cost (so consider the patch
> > amended to remove the arg if approved).
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > 
> > OK for trunk?
> > 
> > I am considering to make the factor we apply in ix86_vec_cost
> > which currently depends on X86_TUNE_AVX128_OPTIMAL and
> > X86_TUNE_SSE_SPLIT_REGS part of the actual cost tables since
> > the reason we apply them are underlying CPU architecture details.
> > Was the original reason of doing the multiplication based on
> > those tunings to be able to "share" the same basic cost table
> > across architectures that differ in this important detail?
> > I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8
> > and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2
> > and m_ZNVER1.  Those all have (multiple) exclusive processor_cost_table
> > entries.
> > 
> > As a first step I'd like to remove the use of ix86_vec_cost for
> > the entries that already have entries for multiple modes
> > (loads and stores) and apply the factor there.  For example
> > Zen can do two 128bit loads per cycle but only one 128bit store.
> > With multiplying AVX256 costs by two we seem to cost sth like
> > # instructions to dispatch * instruction latency which is an
> > odd thing.  I'd have expected # instructions to dispatch / instruction 
> > throughput * instruction latency - so a AVX256 add would cost
> > the same as a AVX128 add, likewise for loads but stores would be
> > more expensive because of the throughput issue.  This all
> > ignores resource utilization across multiple insns but that's
> > how the cost model works ...
> 
> So like the following which removes the use of ix86_vec_cost
> for SSE loads and stores since we have per-mode costs already.
> I've applied the relevant factor to the individual cost tables
> (noting that for X86_TUNE_SSE_SPLIT_REGS we only apply the
> multiplication for size == 128, not size >= 128 ...)
> 
> There's a ??? hunk in inline_memory_move_cost where we
> failed to apply the scaling thus in that place we'd now have
> a behavior change.  Alternatively I could leave the cost
> tables unaltered if that costing part is more critical than
> the vectorizer one.
> 
> I've also spotted, when reviewing ix86_vec_cost uses, a bug
> in ix86_rtx_cost which keys on SFmode which doesn't work
> for SSE modes, thus use GET_MODE_INNER.
> 
> Also I've changed X86_TUNE_AVX128_OPTIMAL to also apply
> to BTVER1 - everywhere else we glob BTVER1 and BTVER2 so
> this must surely be a omission.
> 
> Honza - is a patch like this OK?

PING.  I dropped the config/i386/x86-tune.def hunk since btver1
doesn't have AVX.

Otherwise bootstrapped and tested on x86_64-unknown-linux-gnu.

Richard.
 
> Should I split out individual fixes to make bisection possible?
> 
> Should I update the cost tables or instead change the vectorizer
> costing when considering the inline_memory_move_cost "issue"?
> 
> Thanks,
> Richard.
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 0cf4152acb2..f5392232f61 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -39432,6 +39432,7 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass,
>        int index = sse_store_index (mode);
>        if (index == -1)
>  	return 100;
> +      /* ??? */
>        if (in == 2)
>          return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
>        return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
> @@ -40183,7 +40181,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
>          gcc_assert (TARGET_FMA || TARGET_FMA4 || TARGET_AVX512F);
>  
>          *total = ix86_vec_cost (mode,
> -				mode == SFmode ? cost->fmass : cost->fmasd,
> +				GET_MODE_INNER (mode) == SFmode
> +				? cost->fmass : cost->fmasd,
>  				true);
>  	*total += rtx_cost (XEXP (x, 1), mode, FMA, 1, speed);
>  
> @@ -45122,18 +45121,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2;
>  
>        case vector_store:
>  	index = sse_store_index (mode);
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;
>  
>        case vec_to_scalar:
>        case scalar_to_vec:
> @@ -45146,20 +45141,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS
> -				 (ix86_cost->sse_unaligned_load[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_unaligned_load[index]) / 2;
>  
>        case unaligned_store:
>  	index = sse_store_index (mode);
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS
> -				 (ix86_cost->sse_unaligned_store[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_unaligned_store[index]) / 2;
>  
>        case vector_gather_load:
>          return ix86_vec_cost (mode,
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 9c8ae0a7841..59d0a8b17d0 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -795,12 +795,12 @@ struct processor_costs athlon_cost = {
>    {4, 4},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {4, 4, 6, 12, 24},			/* cost of loading SSE registers
> +  {4, 4, 12, 12, 24},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 4, 6, 12, 24},			/* cost of unaligned loads.  */
> -  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
> +  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
> +  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
> +  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
>    5, 5,					/* SSE->integer and integer->SSE moves */
>    4, 4,					/* Gather load static, per_elt.  */
>    4, 4,					/* Gather store static, per_elt.  */
> @@ -891,12 +891,12 @@ struct processor_costs k8_cost = {
>    {4, 4},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {4, 3, 6, 12, 24},			/* cost of loading SSE registers
> +  {4, 3, 12, 12, 24},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 3, 6, 12, 24},			/* cost of unaligned loads.  */
> -  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
> +  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
> +  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
> +  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
>    5, 5,					/* SSE->integer and integer->SSE moves */
>    4, 4,					/* Gather load static, per_elt.  */
>    4, 4,					/* Gather store static, per_elt.  */
> @@ -1100,12 +1100,12 @@ const struct processor_costs bdver1_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1203,12 +1203,12 @@ const struct processor_costs bdver2_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1305,12 +1305,12 @@ struct processor_costs bdver3_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1406,12 +1406,12 @@ struct processor_costs bdver4_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1518,12 +1518,12 @@ struct processor_costs znver1_cost = {
>    {8, 8},				/* cost of storing MMX registers
>  					   in SImode and DImode.  */
>    2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
> -  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
> +  {6, 6, 6, 12, 24},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit.  */
> -  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
> -  {8, 8, 8, 8, 16},			/* cost of storing SSE registers
> +  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
> +  {8, 8, 8, 16, 32},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit.  */
> -  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
> +  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
>    6, 6,					/* SSE->integer and integer->SSE moves.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
>       throughput 12.  Approx 9 uops do not depend on vector size and every load
> @@ -1726,12 +1726,12 @@ const struct processor_costs btver1_cost = {
>    {12, 12},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
> -  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
> +  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
> +  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
>    14, 14,				/* SSE->integer and integer->SSE moves */
>    10, 10,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1817,12 +1817,12 @@ const struct processor_costs btver2_cost = {
>    {12, 12},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
> -  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
> +  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
> +  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
>    14, 14,				/* SSE->integer and integer->SSE moves */
>    10, 10,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index a46450ad99d..ad8661075aa 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -441,7 +441,7 @@ DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal"
>  
>  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
>     the auto-vectorizer.  */
> -DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
> +DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER
>  	  | m_ZNVER1)
>  
>  /* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX
> 
>
Richard Biener Oct. 15, 2018, 1:37 p.m. | #3
On Thu, 11 Oct 2018, Richard Biener wrote:

> 
> The following fixes vec_construct cost calculation to properly consider
> that the inserts will happen to SSE regs thus forgo the multiplication
> done in ix86_vec_cost which is passed the wrong mode.  This gets rid of
> the only call passing false to ix86_vec_cost (so consider the patch
> amended to remove the arg if approved).
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> OK for trunk?

PING.

> I am considering to make the factor we apply in ix86_vec_cost
> which currently depends on X86_TUNE_AVX128_OPTIMAL and
> X86_TUNE_SSE_SPLIT_REGS part of the actual cost tables since
> the reason we apply them are underlying CPU architecture details.
> Was the original reason of doing the multiplication based on
> those tunings to be able to "share" the same basic cost table
> across architectures that differ in this important detail?
> I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8
> and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2
> and m_ZNVER1.  Those all have (multiple) exclusive processor_cost_table
> entries.
> 
> As a first step I'd like to remove the use of ix86_vec_cost for
> the entries that already have entries for multiple modes
> (loads and stores) and apply the factor there.  For example
> Zen can do two 128bit loads per cycle but only one 128bit store.
> With multiplying AVX256 costs by two we seem to cost sth like
> # instructions to dispatch * instruction latency which is an
> odd thing.  I'd have expected # instructions to dispatch / instruction 
> throughput * instruction latency - so a AVX256 add would cost
> the same as a AVX128 add, likewise for loads but stores would be
> more expensive because of the throughput issue.  This all
> ignores resource utilization across multiple insns but that's
> how the cost model works ...
> 
> Thanks,
> Richard.
> 
> 2018-10-11  Richard Biener  <rguenther@suse.de>
> 
> 	* config/i386/i386.c (ix86_vec_cost): Remove !parallel path
> 	and argument.
> 	(ix86_builtin_vectorization_cost): For vec_construct properly
> 	cost insertion into SSE regs.
> 	(...): Adjust calls to ix86_vec_cost.
> 
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 265022)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -39846,11 +39846,10 @@ ix86_set_reg_reg_cost (machine_mode mode
>  static int
>  ix86_vec_cost (machine_mode mode, int cost, bool parallel)
>  {
> +  gcc_assert (parallel);
>    if (!VECTOR_MODE_P (mode))
>      return cost;
> - 
> -  if (!parallel)
> -    return cost * GET_MODE_NUNITS (mode);
> +
>    if (GET_MODE_BITSIZE (mode) == 128
>        && TARGET_SSE_SPLIT_REGS)
>      return cost * 2;
> @@ -45190,8 +45189,9 @@ ix86_builtin_vectorization_cost (enum ve
>  
>        case vec_construct:
>  	{
> -	  /* N element inserts.  */
> -	  int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +	  gcc_assert (VECTOR_MODE_P (mode));
> +	  /* N element inserts into SSE vectors.  */
> +	  int cost = GET_MODE_NUNITS (mode) * ix86_cost->sse_op;
>  	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
>  	  if (GET_MODE_BITSIZE (mode) == 256)
>  	    cost += ix86_vec_cost (mode, ix86_cost->addss, true);
> 
>
Jan Hubicka Oct. 18, 2018, 11:25 a.m. | #4
> 
> The following fixes vec_construct cost calculation to properly consider
> that the inserts will happen to SSE regs thus forgo the multiplication
> done in ix86_vec_cost which is passed the wrong mode.  This gets rid of
> the only call passing false to ix86_vec_cost (so consider the patch
> amended to remove the arg if approved).
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> OK for trunk?

OK, thanks!
> 
> I am considering to make the factor we apply in ix86_vec_cost
> which currently depends on X86_TUNE_AVX128_OPTIMAL and
> X86_TUNE_SSE_SPLIT_REGS part of the actual cost tables since
> the reason we apply them are underlying CPU architecture details.
> Was the original reason of doing the multiplication based on
> those tunings to be able to "share" the same basic cost table
> across architectures that differ in this important detail?

No, just to have fewer entries in the table (since they are rather big
and painful to fill in as they are already)

> I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8
> and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2
> and m_ZNVER1.  Those all have (multiple) exclusive processor_cost_table
> entries.
> 
> As a first step I'd like to remove the use of ix86_vec_cost for
> the entries that already have entries for multiple modes
> (loads and stores) and apply the factor there.  For example
> Zen can do two 128bit loads per cycle but only one 128bit store.

That sounds like a good plan (I think I introduced the entries in cost
table only afer introducing the vec_cost thingy)

> With multiplying AVX256 costs by two we seem to cost sth like
> # instructions to dispatch * instruction latency which is an
> odd thing.  I'd have expected # instructions to dispatch / instruction 
> throughput * instruction latency - so a AVX256 add would cost
> the same as a AVX128 add, likewise for loads but stores would be
> more expensive because of the throughput issue.  This all
> ignores resource utilization across multiple insns but that's
> how the cost model works ...

Yep, cost model simply uses latencies because it originated at a time
CPUs was not parallel.  I know that LLVM backend goes the other way
and uses throughputs only.
Correct thing would be to build the dependence dag and guess how CPU
will schedule but that would be fun to implement at gimple level...

Honza
> 
> Thanks,
> Richard.
> 
> 2018-10-11  Richard Biener  <rguenther@suse.de>
> 
> 	* config/i386/i386.c (ix86_vec_cost): Remove !parallel path
> 	and argument.
> 	(ix86_builtin_vectorization_cost): For vec_construct properly
> 	cost insertion into SSE regs.
> 	(...): Adjust calls to ix86_vec_cost.
> 
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 265022)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -39846,11 +39846,10 @@ ix86_set_reg_reg_cost (machine_mode mode
>  static int
>  ix86_vec_cost (machine_mode mode, int cost, bool parallel)
>  {
> +  gcc_assert (parallel);
>    if (!VECTOR_MODE_P (mode))
>      return cost;
> - 
> -  if (!parallel)
> -    return cost * GET_MODE_NUNITS (mode);
> +
>    if (GET_MODE_BITSIZE (mode) == 128
>        && TARGET_SSE_SPLIT_REGS)
>      return cost * 2;
> @@ -45190,8 +45189,9 @@ ix86_builtin_vectorization_cost (enum ve
>  
>        case vec_construct:
>  	{
> -	  /* N element inserts.  */
> -	  int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
> +	  gcc_assert (VECTOR_MODE_P (mode));
> +	  /* N element inserts into SSE vectors.  */
> +	  int cost = GET_MODE_NUNITS (mode) * ix86_cost->sse_op;
>  	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
>  	  if (GET_MODE_BITSIZE (mode) == 256)
>  	    cost += ix86_vec_cost (mode, ix86_cost->addss, true);
>
Jan Hubicka Oct. 18, 2018, 11:49 a.m. | #5
> 
> So like the following which removes the use of ix86_vec_cost
> for SSE loads and stores since we have per-mode costs already.
> I've applied the relevant factor to the individual cost tables
> (noting that for X86_TUNE_SSE_SPLIT_REGS we only apply the
> multiplication for size == 128, not size >= 128 ...)
> 
> There's a ??? hunk in inline_memory_move_cost where we
> failed to apply the scaling thus in that place we'd now have
> a behavior change.  Alternatively I could leave the cost
> tables unaltered if that costing part is more critical than
> the vectorizer one.

Changing the behaviour (applying the scale there) seems like
right way to go to me...
> 
> I've also spotted, when reviewing ix86_vec_cost uses, a bug
> in ix86_rtx_cost which keys on SFmode which doesn't work
> for SSE modes, thus use GET_MODE_INNER.
> 
> Also I've changed X86_TUNE_AVX128_OPTIMAL to also apply
> to BTVER1 - everywhere else we glob BTVER1 and BTVER2 so
> this must surely be a omission.

BTVER1 did not have AVX :)
> 
> Honza - is a patch like this OK?

Looks OK to me.  Splitting up individual changes is up to you.
I think it is not that dramatic change so hopefully regressions
won't be that hard to analyze.
> 
> Should I split out individual fixes to make bisection possible?
> 
> Should I update the cost tables or instead change the vectorizer
> costing when considering the inline_memory_move_cost "issue"?

Looks like memory move cost should do the right thing now after your patch?
Having larger loads/stores more expensive seems correct to me.

Patch is OK, without the ??? comment ;)
Honza
> 
> Thanks,
> Richard.
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 0cf4152acb2..f5392232f61 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -39432,6 +39432,7 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass,
>        int index = sse_store_index (mode);
>        if (index == -1)
>  	return 100;
> +      /* ??? */
>        if (in == 2)
>          return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
>        return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
> @@ -40183,7 +40181,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
>          gcc_assert (TARGET_FMA || TARGET_FMA4 || TARGET_AVX512F);
>  
>          *total = ix86_vec_cost (mode,
> -				mode == SFmode ? cost->fmass : cost->fmasd,
> +				GET_MODE_INNER (mode) == SFmode
> +				? cost->fmass : cost->fmasd,
>  				true);
>  	*total += rtx_cost (XEXP (x, 1), mode, FMA, 1, speed);
>  
> @@ -45122,18 +45121,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2;
>  
>        case vector_store:
>  	index = sse_store_index (mode);
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;
>  
>        case vec_to_scalar:
>        case scalar_to_vec:
> @@ -45146,20 +45141,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS
> -				 (ix86_cost->sse_unaligned_load[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_unaligned_load[index]) / 2;
>  
>        case unaligned_store:
>  	index = sse_store_index (mode);
>  	/* See PR82713 - we may end up being called on non-vector type.  */
>  	if (index < 0)
>  	  index = 2;
> -        return ix86_vec_cost (mode,
> -			      COSTS_N_INSNS
> -				 (ix86_cost->sse_unaligned_store[index]) / 2,
> -			      true);
> +        return COSTS_N_INSNS (ix86_cost->sse_unaligned_store[index]) / 2;
>  
>        case vector_gather_load:
>          return ix86_vec_cost (mode,
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 9c8ae0a7841..59d0a8b17d0 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -795,12 +795,12 @@ struct processor_costs athlon_cost = {
>    {4, 4},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {4, 4, 6, 12, 24},			/* cost of loading SSE registers
> +  {4, 4, 12, 12, 24},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 4, 6, 12, 24},			/* cost of unaligned loads.  */
> -  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
> +  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
> +  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
> +  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
>    5, 5,					/* SSE->integer and integer->SSE moves */
>    4, 4,					/* Gather load static, per_elt.  */
>    4, 4,					/* Gather store static, per_elt.  */
> @@ -891,12 +891,12 @@ struct processor_costs k8_cost = {
>    {4, 4},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {4, 3, 6, 12, 24},			/* cost of loading SSE registers
> +  {4, 3, 12, 12, 24},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 3, 6, 12, 24},			/* cost of unaligned loads.  */
> -  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
> +  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
> +  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
> +  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
>    5, 5,					/* SSE->integer and integer->SSE moves */
>    4, 4,					/* Gather load static, per_elt.  */
>    4, 4,					/* Gather store static, per_elt.  */
> @@ -1100,12 +1100,12 @@ const struct processor_costs bdver1_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1203,12 +1203,12 @@ const struct processor_costs bdver2_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1305,12 +1305,12 @@ struct processor_costs bdver3_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1406,12 +1406,12 @@ struct processor_costs bdver4_cost = {
>    {10, 10},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
>    16, 20,				/* SSE->integer and integer->SSE moves */
>    12, 12,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1518,12 +1518,12 @@ struct processor_costs znver1_cost = {
>    {8, 8},				/* cost of storing MMX registers
>  					   in SImode and DImode.  */
>    2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
> -  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
> +  {6, 6, 6, 12, 24},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit.  */
> -  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
> -  {8, 8, 8, 8, 16},			/* cost of storing SSE registers
> +  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
> +  {8, 8, 8, 16, 32},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit.  */
> -  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
> +  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
>    6, 6,					/* SSE->integer and integer->SSE moves.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
>       throughput 12.  Approx 9 uops do not depend on vector size and every load
> @@ -1726,12 +1726,12 @@ const struct processor_costs btver1_cost = {
>    {12, 12},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
> -  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
> +  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
> +  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
>    14, 14,				/* SSE->integer and integer->SSE moves */
>    10, 10,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> @@ -1817,12 +1817,12 @@ const struct processor_costs btver2_cost = {
>    {12, 12},				/* cost of storing MMX registers
>  					   in SImode and DImode */
>    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> -  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
> -  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
> +  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
> +  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
> -  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
> +  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
>    14, 14,				/* SSE->integer and integer->SSE moves */
>    10, 10,				/* Gather load static, per_elt.  */
>    10, 10,				/* Gather store static, per_elt.  */
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index a46450ad99d..ad8661075aa 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -441,7 +441,7 @@ DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal"
>  
>  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
>     the auto-vectorizer.  */
> -DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
> +DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER
>  	  | m_ZNVER1)
>  
>  /* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX
>
Richard Biener Oct. 18, 2018, 12:16 p.m. | #6
On Thu, 18 Oct 2018, Jan Hubicka wrote:

> > 
> > So like the following which removes the use of ix86_vec_cost
> > for SSE loads and stores since we have per-mode costs already.
> > I've applied the relevant factor to the individual cost tables
> > (noting that for X86_TUNE_SSE_SPLIT_REGS we only apply the
> > multiplication for size == 128, not size >= 128 ...)
> > 
> > There's a ??? hunk in inline_memory_move_cost where we
> > failed to apply the scaling thus in that place we'd now have
> > a behavior change.  Alternatively I could leave the cost
> > tables unaltered if that costing part is more critical than
> > the vectorizer one.
> 
> Changing the behaviour (applying the scale there) seems like
> right way to go to me...
> > 
> > I've also spotted, when reviewing ix86_vec_cost uses, a bug
> > in ix86_rtx_cost which keys on SFmode which doesn't work
> > for SSE modes, thus use GET_MODE_INNER.
> > 
> > Also I've changed X86_TUNE_AVX128_OPTIMAL to also apply
> > to BTVER1 - everywhere else we glob BTVER1 and BTVER2 so
> > this must surely be a omission.
> 
> BTVER1 did not have AVX :)

Yeah - noticed that as well ;)

> > 
> > Honza - is a patch like this OK?
> 
> Looks OK to me.  Splitting up individual changes is up to you.
> I think it is not that dramatic change so hopefully regressions
> won't be that hard to analyze.

I'll commit the ix86_rtx_costs change separately from the load/store
cost changes just to make bisection easier.

> > 
> > Should I split out individual fixes to make bisection possible?
> > 
> > Should I update the cost tables or instead change the vectorizer
> > costing when considering the inline_memory_move_cost "issue"?
> 
> Looks like memory move cost should do the right thing now after your patch?
> Having larger loads/stores more expensive seems correct to me.

Yes.

> Patch is OK, without the ??? comment ;)

Thanks,
Richard.

> Honza
> > 
> > Thanks,
> > Richard.
> > 
> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > index 0cf4152acb2..f5392232f61 100644
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -39432,6 +39432,7 @@ inline_memory_move_cost (machine_mode mode, enum reg_class regclass,
> >        int index = sse_store_index (mode);
> >        if (index == -1)
> >  	return 100;
> > +      /* ??? */
> >        if (in == 2)
> >          return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
> >        return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
> > @@ -40183,7 +40181,8 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
> >          gcc_assert (TARGET_FMA || TARGET_FMA4 || TARGET_AVX512F);
> >  
> >          *total = ix86_vec_cost (mode,
> > -				mode == SFmode ? cost->fmass : cost->fmasd,
> > +				GET_MODE_INNER (mode) == SFmode
> > +				? cost->fmass : cost->fmasd,
> >  				true);
> >  	*total += rtx_cost (XEXP (x, 1), mode, FMA, 1, speed);
> >  
> > @@ -45122,18 +45121,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
> >  	/* See PR82713 - we may end up being called on non-vector type.  */
> >  	if (index < 0)
> >  	  index = 2;
> > -        return ix86_vec_cost (mode,
> > -			      COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2,
> > -			      true);
> > +        return COSTS_N_INSNS (ix86_cost->sse_load[index]) / 2;
> >  
> >        case vector_store:
> >  	index = sse_store_index (mode);
> >  	/* See PR82713 - we may end up being called on non-vector type.  */
> >  	if (index < 0)
> >  	  index = 2;
> > -        return ix86_vec_cost (mode,
> > -			      COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2,
> > -			      true);
> > +        return COSTS_N_INSNS (ix86_cost->sse_store[index]) / 2;
> >  
> >        case vec_to_scalar:
> >        case scalar_to_vec:
> > @@ -45146,20 +45141,14 @@ ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
> >  	/* See PR82713 - we may end up being called on non-vector type.  */
> >  	if (index < 0)
> >  	  index = 2;
> > -        return ix86_vec_cost (mode,
> > -			      COSTS_N_INSNS
> > -				 (ix86_cost->sse_unaligned_load[index]) / 2,
> > -			      true);
> > +        return COSTS_N_INSNS (ix86_cost->sse_unaligned_load[index]) / 2;
> >  
> >        case unaligned_store:
> >  	index = sse_store_index (mode);
> >  	/* See PR82713 - we may end up being called on non-vector type.  */
> >  	if (index < 0)
> >  	  index = 2;
> > -        return ix86_vec_cost (mode,
> > -			      COSTS_N_INSNS
> > -				 (ix86_cost->sse_unaligned_store[index]) / 2,
> > -			      true);
> > +        return COSTS_N_INSNS (ix86_cost->sse_unaligned_store[index]) / 2;
> >  
> >        case vector_gather_load:
> >          return ix86_vec_cost (mode,
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index 9c8ae0a7841..59d0a8b17d0 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -795,12 +795,12 @@ struct processor_costs athlon_cost = {
> >    {4, 4},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {4, 4, 6, 12, 24},			/* cost of loading SSE registers
> > +  {4, 4, 12, 12, 24},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {4, 4, 6, 12, 24},			/* cost of unaligned loads.  */
> > -  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
> > +  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
> > +  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
> > +  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
> >    5, 5,					/* SSE->integer and integer->SSE moves */
> >    4, 4,					/* Gather load static, per_elt.  */
> >    4, 4,					/* Gather store static, per_elt.  */
> > @@ -891,12 +891,12 @@ struct processor_costs k8_cost = {
> >    {4, 4},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {4, 3, 6, 12, 24},			/* cost of loading SSE registers
> > +  {4, 3, 12, 12, 24},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {4, 3, 6, 12, 24},			/* cost of unaligned loads.  */
> > -  {4, 4, 5, 10, 20},			/* cost of storing SSE registers
> > +  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
> > +  {4, 4, 10, 10, 20},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
> > +  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
> >    5, 5,					/* SSE->integer and integer->SSE moves */
> >    4, 4,					/* Gather load static, per_elt.  */
> >    4, 4,					/* Gather store static, per_elt.  */
> > @@ -1100,12 +1100,12 @@ const struct processor_costs bdver1_cost = {
> >    {10, 10},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> > -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> > +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> > +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
> >    16, 20,				/* SSE->integer and integer->SSE moves */
> >    12, 12,				/* Gather load static, per_elt.  */
> >    10, 10,				/* Gather store static, per_elt.  */
> > @@ -1203,12 +1203,12 @@ const struct processor_costs bdver2_cost = {
> >    {10, 10},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> > -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> > +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> > +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
> >    16, 20,				/* SSE->integer and integer->SSE moves */
> >    12, 12,				/* Gather load static, per_elt.  */
> >    10, 10,				/* Gather store static, per_elt.  */
> > @@ -1305,12 +1305,12 @@ struct processor_costs bdver3_cost = {
> >    {10, 10},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> > -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> > +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> > +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
> >    16, 20,				/* SSE->integer and integer->SSE moves */
> >    12, 12,				/* Gather load static, per_elt.  */
> >    10, 10,				/* Gather store static, per_elt.  */
> > @@ -1406,12 +1406,12 @@ struct processor_costs bdver4_cost = {
> >    {10, 10},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {12, 12, 10, 20, 30},			/* cost of loading SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {12, 12, 10, 20, 30},			/* cost of unaligned loads.  */
> > -  {10, 10, 10, 20, 30},			/* cost of storing SSE registers
> > +  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
> > +  {10, 10, 10, 40, 60},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 10, 20, 30},			/* cost of unaligned stores.  */
> > +  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
> >    16, 20,				/* SSE->integer and integer->SSE moves */
> >    12, 12,				/* Gather load static, per_elt.  */
> >    10, 10,				/* Gather store static, per_elt.  */
> > @@ -1518,12 +1518,12 @@ struct processor_costs znver1_cost = {
> >    {8, 8},				/* cost of storing MMX registers
> >  					   in SImode and DImode.  */
> >    2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
> > -  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
> > +  {6, 6, 6, 12, 24},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit.  */
> > -  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
> > -  {8, 8, 8, 8, 16},			/* cost of storing SSE registers
> > +  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
> > +  {8, 8, 8, 16, 32},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit.  */
> > -  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
> > +  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
> >    6, 6,					/* SSE->integer and integer->SSE moves.  */
> >    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> >       throughput 12.  Approx 9 uops do not depend on vector size and every load
> > @@ -1726,12 +1726,12 @@ const struct processor_costs btver1_cost = {
> >    {12, 12},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
> > +  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
> > -  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
> > +  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
> > +  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
> > +  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
> >    14, 14,				/* SSE->integer and integer->SSE moves */
> >    10, 10,				/* Gather load static, per_elt.  */
> >    10, 10,				/* Gather store static, per_elt.  */
> > @@ -1817,12 +1817,12 @@ const struct processor_costs btver2_cost = {
> >    {12, 12},				/* cost of storing MMX registers
> >  					   in SImode and DImode */
> >    2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
> > -  {10, 10, 12, 24, 48},			/* cost of loading SSE registers
> > +  {10, 10, 12, 48, 96},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 12, 24, 48},			/* cost of unaligned loads.  */
> > -  {10, 10, 12, 24, 48},			/* cost of storing SSE registers
> > +  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
> > +  {10, 10, 12, 48, 96},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit */
> > -  {10, 10, 12, 24, 48},			/* cost of unaligned stores.  */
> > +  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
> >    14, 14,				/* SSE->integer and integer->SSE moves */
> >    10, 10,				/* Gather load static, per_elt.  */
> >    10, 10,				/* Gather store static, per_elt.  */
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index a46450ad99d..ad8661075aa 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -441,7 +441,7 @@ DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal"
> >  
> >  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
> >     the auto-vectorizer.  */
> > -DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
> > +DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER
> >  	  | m_ZNVER1)
> >  
> >  /* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX
> > 
> 
>

Patch

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 265022)
+++ gcc/config/i386/i386.c	(working copy)
@@ -39846,11 +39846,10 @@  ix86_set_reg_reg_cost (machine_mode mode
 static int
 ix86_vec_cost (machine_mode mode, int cost, bool parallel)
 {
+  gcc_assert (parallel);
   if (!VECTOR_MODE_P (mode))
     return cost;
- 
-  if (!parallel)
-    return cost * GET_MODE_NUNITS (mode);
+
   if (GET_MODE_BITSIZE (mode) == 128
       && TARGET_SSE_SPLIT_REGS)
     return cost * 2;
@@ -45190,8 +45189,9 @@  ix86_builtin_vectorization_cost (enum ve
 
       case vec_construct:
 	{
-	  /* N element inserts.  */
-	  int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
+	  gcc_assert (VECTOR_MODE_P (mode));
+	  /* N element inserts into SSE vectors.  */
+	  int cost = GET_MODE_NUNITS (mode) * ix86_cost->sse_op;
 	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
 	  if (GET_MODE_BITSIZE (mode) == 256)
 	    cost += ix86_vec_cost (mode, ix86_cost->addss, true);