[i386] Update Zen tuning for vector load cost

Message ID alpine.LSU.2.20.1810081233180.16707@zhemvz.fhfr.qr
State New
Series [i386] Update Zen tuning for vector load cost

Commit Message

Richard Biener Oct. 8, 2018, 10:55 a.m. UTC
This adjusts the Zen AVX256 vector load cost to be twice as expensive
as the AVX128 vector load cost (twice via ix86_vec_cost keying on
TARGET_AVX128_OPTIMAL - it should rather use some
TARGET_VECTOR_IMPL_WIDTH or so).

Likely the current cost value was meant to make AVX256 loads _cheaper_
than two AVX128 ones, in which case we'd have to use 5 here.  Honza,
what was the intent here?  Maybe not call ix86_vec_cost for loads or
stores at all (given we model different sizes explicitly)?  The
odd thing is that in some places you pass ix86_vec_cost
COSTS_N_INSNS (ix86_cost->...) and in others just
ix86_cost->....  That's probably caused by most entries in the cost
tables being scaled via COSTS_N_INSNS but not all, which makes them
not easily comparable... (I found a comment that says
"We assume COSTS_N_INSNS is defined as (N)*4")

That ix86_vec_cost adds a factor of two for AVX256 also throws
throughput into the mix, because latency-wise AVX256 behaves the
same as AVX128 AFAICS.  That is, the modeling isn't very precise...
(multiplying by two still makes sense IMHO - but for example stores
have only one 128-bit port, so those are the ones that should be
pessimized by more than a factor of two, if any)

Anyhow, the patch removes one oddity when comparing costs of vectorized
loops when there are no lane-crossing operations (after the patch
such loops will cost the same with AVX128 and AVX256, even though
Zen's frontend will benefit from using AVX256).

Bootstrap & regtest running on x86_64-unknown-linux-gnu.

OK?

Thanks,
Richard.

2018-10-08  Richard Biener  <rguenther@suse.de>

	* config/i386/x86-tune-costs.h (znver1_cost): Make AVX256 vector loads
	cost the same as AVX128 ones.

Comments

Jan Hubicka Oct. 8, 2018, 2:22 p.m. UTC | #1
> 
> This adjusts the Zen AVX256 vector load cost to be twice as expensive
> as the AVX128 vector load cost (twice via ix86_vec_cost keying on
> TARGET_AVX128_OPTIMAL - it should rather use some
> TARGET_VECTOR_IMPL_WIDTH or so).
> 
> Likely the current cost value was meant to make AVX256 loads _cheaper_
> than two AVX128 ones, in which case we'd have to use 5 here.  Honza,
> what was the intent here?  Maybe not call ix86_vec_cost for loads or
> stores at all (given we model different sizes explicitly)?  The
> odd thing is that in some places you pass ix86_vec_cost
> COSTS_N_INSNS (ix86_cost->...) and in others just
> ix86_cost->....  That's probably caused by most entries in the cost
> tables being scaled via COSTS_N_INSNS but not all, which makes them
> not easily comparable... (I found a comment that says
> "We assume COSTS_N_INSNS is defined as (N)*4")
> 
> That ix86_vec_cost adds a factor of two for AVX256 also throws
> throughput into the mix, because latency-wise AVX256 behaves the
> same as AVX128 AFAICS.  That is, the modeling isn't very precise...
> (multiplying by two still makes sense IMHO - but for example stores
> have only one 128-bit port, so those are the ones that should be
> pessimized by more than a factor of two, if any)

Hmm, the intent was to make the tables latency based, so a 256-bit load should
be definitely cheaper than two 128-bit loads.  I guess the current state is the
result of getting the TARGET_AVX128_OPTIMAL adjustment inconsistent.  It did
not make much difference previously because TARGET_AVX128_OPTIMAL prevents us
from using 256-bit instructions in most cases.

The vectorizer cost model is a bit of an odd mix of throughputs and latencies
in general.  I wondered if it would make sense to compute the cost of a loop
by producing the DAG of dependencies and computing the cost of the most
expensive path through the DAG...

The patch looks OK to me (it makes the situation better than before, for sure :)
Honza
> 
> Anyhow, the patch removes one oddity when comparing costs of vectorized
> loops when there are no lane-crossing operations (after the patch
> such loops will cost the same with AVX128 and AVX256, even though
> Zen's frontend will benefit from using AVX256).
> 
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> 
> OK?
> 
> Thanks,
> Richard.
> 
> 2018-10-08  Richard Biener  <rguenther@suse.de>
> 
> 	* config/i386/x86-tune-costs.h (znver1_cost): Make AVX256 vector loads
> 	cost the same as AVX128 ones.
> 
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 71a5854c09a..c7f3945d72c 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1518,9 +1518,9 @@ struct processor_costs znver1_cost = {
>    {8, 8},				/* cost of storing MMX registers
>  					   in SImode and DImode.  */
>    2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
> -  {6, 6, 6, 10, 20},			/* cost of loading SSE registers
> +  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
>  					   in 32,64,128,256 and 512-bit.  */
> -  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
> +  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
>    {8, 8, 8, 8, 16},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit.  */
>    {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
>
Richard Biener Oct. 8, 2018, 2:46 p.m. UTC | #2
On Mon, 8 Oct 2018, Jan Hubicka wrote:

> > 
> > This adjusts the Zen AVX256 vector load cost to be twice as expensive
> > as the AVX128 vector load cost (twice via ix86_vec_cost keying on
> > TARGET_AVX128_OPTIMAL - it should rather use some
> > TARGET_VECTOR_IMPL_WIDTH or so).
> > 
> > Likely the current cost value was meant to make AVX256 loads _cheaper_
> > than two AVX128 ones, in which case we'd have to use 5 here.  Honza,
> > what was the intent here?  Maybe not call ix86_vec_cost for loads or
> > stores at all (given we model different sizes explicitly)?  The
> > odd thing is that in some places you pass ix86_vec_cost
> > COSTS_N_INSNS (ix86_cost->...) and in others just
> > ix86_cost->....  That's probably caused by most entries in the cost
> > tables being scaled via COSTS_N_INSNS but not all, which makes them
> > not easily comparable... (I found a comment that says
> > "We assume COSTS_N_INSNS is defined as (N)*4")
> > 
> > That ix86_vec_cost adds a factor of two for AVX256 also throws
> > throughput into the mix, because latency-wise AVX256 behaves the
> > same as AVX128 AFAICS.  That is, the modeling isn't very precise...
> > (multiplying by two still makes sense IMHO - but for example stores
> > have only one 128-bit port, so those are the ones that should be
> > pessimized by more than a factor of two, if any)
> 
> Hmm, the intent was to make the tables latency based, so a 256-bit load
> should be definitely cheaper than two 128-bit loads.  I guess the current
> state is the result of getting the TARGET_AVX128_OPTIMAL adjustment
> inconsistent.  It did not make much difference previously because
> TARGET_AVX128_OPTIMAL prevents us from using 256-bit instructions in most
> cases.

Yeah, I guess I'll propose splitting that TARGET_AVX128_OPTIMAL use
in ix86_vec_cost into a different flag/tunable (possibly just a cost
table entry, because it really is closely tied to the costs we assign).

> The vectorizer cost model is a bit of an odd mix of throughputs and
> latencies in general.  I wondered if it would make sense to compute the
> cost of a loop by producing the DAG of dependencies and computing the cost
> of the most expensive path through the DAG...

Yeah - in the office we threw around some thoughts about feeding
all stmts added via the add_stmt hook to the scheduler pipeline model
and computing that path, basically only modeling resources and dependences.
We'd need to identify instructions (do instruction selection) at the
granularity of the pipeline description of course, which may be
"interesting".

> The patch looks OK to me (it makes the situation better than before, for sure :)

Thanks, I've also done some benchmarking with this and only saw
positive effects.

Applied.

Richard.

> Honza
> > 
> > Anyhow, the patch removes one oddity when comparing costs of vectorized
> > loops when there are no lane-crossing operations (after the patch
> > such loops will cost the same with AVX128 and AVX256, even though
> > Zen's frontend will benefit from using AVX256).
> > 
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> > 
> > OK?
> > 
> > Thanks,
> > Richard.
> > 
> > 2018-10-08  Richard Biener  <rguenther@suse.de>
> > 
> > 	* config/i386/x86-tune-costs.h (znver1_cost): Make AVX256 vector loads
> > 	cost the same as AVX128 ones.
> > 
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index 71a5854c09a..c7f3945d72c 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -1518,9 +1518,9 @@ struct processor_costs znver1_cost = {
> >    {8, 8},				/* cost of storing MMX registers
> >  					   in SImode and DImode.  */
> >    2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
> > -  {6, 6, 6, 10, 20},			/* cost of loading SSE registers
> > +  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
> >  					   in 32,64,128,256 and 512-bit.  */
> > -  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
> > +  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
> >    {8, 8, 8, 8, 16},			/* cost of storing SSE registers
> >  					   in 32,64,128,256 and 512-bit.  */
> >    {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
> > 
> 
>

Patch

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 71a5854c09a..c7f3945d72c 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1518,9 +1518,9 @@ struct processor_costs znver1_cost = {
   {8, 8},				/* cost of storing MMX registers
 					   in SImode and DImode.  */
   2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
-  {6, 6, 6, 10, 20},			/* cost of loading SSE registers
+  {6, 6, 6, 6, 12},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
+  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */