i386: Separate costs of RTL expressions from costs of moves

Message ID CAMe9rOo04knepkq4WAXCs8kaD0KXxJ3Ur=wfs39uub1rVV=Ptw@mail.gmail.com
State New
Series i386: Separate costs of RTL expressions from costs of moves

Commit Message

H.J. Lu June 17, 2019, 4:26 p.m. UTC
processor_costs has costs of RTL expressions and costs of moves:

1. Costs of RTL expressions are computed in COSTS_N_INSNS units and are
used to generate RTL expressions with the lowest costs.  Costs of RTL
memory operations can be very close to costs of fast instructions to
indicate fast memory operations.

2. After RTL expressions have been generated, costs of moves are used by
TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
costs for the register allocator.  Costs of loads and stores are higher
than costs of register moves to reduce stack usage by the register
allocator.

We should separate costs of RTL expressions from costs of moves so that
they can be adjusted independently.  This patch moves costs of moves into
the new used_by_ra field and duplicates the costs of moves that are also
used for costs of RTL expressions.
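
For illustration, here is a minimal sketch of the intended layout (field
names follow the description above; the exact contents are in the patch
below):

struct example_processor_costs
{
  /* Costs of moves, consumed only by TARGET_REGISTER_MOVE_COST and
     TARGET_MEMORY_MOVE_COST for the register allocator.  */
  struct
  {
    int int_load[3];     /* QImode, HImode and SImode loads.  */
    int int_store[3];    /* QImode, HImode and SImode stores.  */
    int xmm_move;        /* XMM register moves.  */
    int sse_to_integer;  /* SSE -> integer register moves.  */
    int integer_to_sse;  /* integer -> SSE register moves.  */
  } used_by_ra;

  /* Duplicated entries used only for costs of RTL expressions, so they
     can be tuned without affecting register allocation.  */
  int int_load;
  int int_store;
  int xmm_move;
  int sse_to_integer;
  int integer_to_sse;
};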

All cost models have been checked with

static void
check_one (const struct processor_costs *p)
{
  if (p->used_by_ra.int_load[2] != p->int_load)
    abort ();
  if (p->used_by_ra.int_store[2] != p->int_store)
    abort ();
  if (p->used_by_ra.xmm_move != p->xmm_move)
    abort ();
  if (p->used_by_ra.sse_to_integer != p->sse_to_integer)
    abort ();
  if (p->used_by_ra.integer_to_sse != p->integer_to_sse)
    abort ();
  if (memcmp (p->used_by_ra.sse_load, p->sse_load, sizeof (p->sse_load)))
    abort ();
  if (memcmp (p->used_by_ra.sse_store, p->sse_store, sizeof (p->sse_store)))
    abort ();
}

static void
check_cost ()
{
  check_one (&ix86_size_cost);
  for (unsigned int i = 0; i < ARRAY_SIZE (processor_cost_table); i++)
    check_one (processor_cost_table[i]);
}

by calling check_cost from ix86_option_override_internal.

PR target/90878
* config/i386/i386-features.c
(dimode_scalar_chain::compute_convert_gain): Replace int_store[2]
and int_load[2] with int_store and int_load.
* config/i386/i386.c (inline_memory_move_cost): Use used_by_ra
for costs of moves.
(ix86_register_move_cost): Likewise.
(ix86_builtin_vectorization_cost): Replace int_store[2] and
int_load[2] with int_store and int_load.
* config/i386/i386.h (processor_costs): Move costs of moves to
used_by_ra.  Add int_load, int_store, xmm_move, sse_to_integer,
integer_to_sse, sse_load, sse_store, sse_unaligned_load and
sse_unaligned_store for costs of RTL expressions.
* config/i386/x86-tune-costs.h: Duplicate int_load, int_store,
xmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store
for costs of RTL expressions.  Use sse_unaligned_load and
sse_unaligned_store only for costs of RTL expressions.

Comments

Uros Bizjak June 20, 2019, 7:40 a.m. UTC | #1
On Mon, Jun 17, 2019 at 6:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> processor_costs has costs of RTL expressions and costs of moves:
>
> 1. Costs of RTL expressions is computed as COSTS_N_INSNS which are used
> to generate RTL expressions with the lowest costs.  Costs of RTL memory
> operation can be very close to costs of fast instructions to indicate
> fast memory operations.
>
> 2. After RTL expressions have been generated, costs of moves are used by
> TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
> costs for register allocator.  Costs of load and store are higher than
> costs of register moves to reduce stack usages by register allocator.
>
> We should separate costs of RTL expressions from costs of moves so that
> they can be adjusted independently.  This patch moves costs of moves to
> the new used_by_ra field and duplicates costs of moves which are also
> used for costs of RTL expressions.

Actually, I think that the current separation is OK. Before reload, we
actually don't know which register set will perform the move (not even
if float mode will be moved in integer registers), the only thing we
can estimate is the number of move instructions. The real cost of
register moves is later calculated by the register allocator, where
the register class is taken into account when calculating the cost.

Uros.

>
> All cost models have been checked with
>
> static void
> check_one (const struct processor_costs *p)
> {
>   if (p->used_by_ra.int_load[2] != p->int_load)
>     abort ();
>   if (p->used_by_ra.int_store[2] != p->int_store)
>     abort ();
>   if (p->used_by_ra.xmm_move != p->xmm_move)
>     abort ();
>   if (p->used_by_ra.sse_to_integer != p->sse_to_integer)
>     abort ();
>   if (p->used_by_ra.integer_to_sse != p->integer_to_sse)
>     abort ();
>   if (memcmp (p->used_by_ra.sse_load, p->sse_load, sizeof (p->sse_load)))
>     abort ();
>   if (memcmp (p->used_by_ra.sse_store, p->sse_store, sizeof (p->sse_store)))
>     abort ();
> }
>
> static void
> check_cost ()
> {
>  check_one (&ix86_size_cost);
>   for (unsigned int i = 0; i < ARRAY_SIZE (processor_cost_table); i++)
>     check_one (processor_cost_table[i]);
> }
>
> by calling check_cost from ix86_option_override_internal.
>
> PR target/90878
> * config/i386/i386-features.c
> (dimode_scalar_chain::compute_convert_gain): Replace int_store[2]
> and int_load[2] with int_store and int_load.
> * config/i386/i386.c (inline_memory_move_cost): Use used_by_ra
> for costs of moves.
> (ix86_register_move_cost): Likewise.
> (ix86_builtin_vectorization_cost): Replace int_store[2] and
> int_load[2] with int_store and int_load.
> * config/i386/i386.h (processor_costs): Move costs of moves to
> used_by_ra.  Add int_load, int_store, xmm_move, sse_to_integer,
> integer_to_sse, sse_load, sse_store, sse_unaligned_load and
> sse_unaligned_store for costs of RTL expressions.
> * config/i386/x86-tune-costs.h: Duplicate int_load, int_store,
> xmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store
> for costs of RTL expressions.  Use sse_unaligned_load and
> sse_unaligned_store only for costs of RTL expressions.
>
> --
> H.J.
Uros Bizjak June 20, 2019, 7:43 a.m. UTC | #2
On Thu, Jun 20, 2019 at 9:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Mon, Jun 17, 2019 at 6:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > processor_costs has costs of RTL expressions and costs of moves:
> >
> > 1. Costs of RTL expressions is computed as COSTS_N_INSNS which are used
> > to generate RTL expressions with the lowest costs.  Costs of RTL memory
> > operation can be very close to costs of fast instructions to indicate
> > fast memory operations.
> >
> > 2. After RTL expressions have been generated, costs of moves are used by
> > TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
> > costs for register allocator.  Costs of load and store are higher than
> > costs of register moves to reduce stack usages by register allocator.
> >
> > We should separate costs of RTL expressions from costs of moves so that
> > they can be adjusted independently.  This patch moves costs of moves to
> > the new used_by_ra field and duplicates costs of moves which are also
> > used for costs of RTL expressions.
>
> Actually, I think that the current separation is OK. Before reload, we
> actually don't know which register set will perform the move (not even
> if float mode will be moved in integer registers), the only thing we
> can estimate is the number of move instructions. The real cost of
> register moves is later calculated by the register allocator, where
> the register class is taken into account when calculating the cost.

Forgot to say that due to the above reasoning, cost of moves should
not be used in the calculation of costs of RTL expressions, as we are
talking about two different cost functions. RTL expressions should
know nothing about register classes.

Uros.
>
> >
> > All cost models have been checked with
> >
> > static void
> > check_one (const struct processor_costs *p)
> > {
> >   if (p->used_by_ra.int_load[2] != p->int_load)
> >     abort ();
> >   if (p->used_by_ra.int_store[2] != p->int_store)
> >     abort ();
> >   if (p->used_by_ra.xmm_move != p->xmm_move)
> >     abort ();
> >   if (p->used_by_ra.sse_to_integer != p->sse_to_integer)
> >     abort ();
> >   if (p->used_by_ra.integer_to_sse != p->integer_to_sse)
> >     abort ();
> >   if (memcmp (p->used_by_ra.sse_load, p->sse_load, sizeof (p->sse_load)))
> >     abort ();
> >   if (memcmp (p->used_by_ra.sse_store, p->sse_store, sizeof (p->sse_store)))
> >     abort ();
> > }
> >
> > static void
> > check_cost ()
> > {
> >  check_one (&ix86_size_cost);
> >   for (unsigned int i = 0; i < ARRAY_SIZE (processor_cost_table); i++)
> >     check_one (processor_cost_table[i]);
> > }
> >
> > by calling check_cost from ix86_option_override_internal.
> >
> > PR target/90878
> > * config/i386/i386-features.c
> > (dimode_scalar_chain::compute_convert_gain): Replace int_store[2]
> > and int_load[2] with int_store and int_load.
> > * config/i386/i386.c (inline_memory_move_cost): Use used_by_ra
> > for costs of moves.
> > (ix86_register_move_cost): Likewise.
> > (ix86_builtin_vectorization_cost): Replace int_store[2] and
> > int_load[2] with int_store and int_load.
> > * config/i386/i386.h (processor_costs): Move costs of moves to
> > used_by_ra.  Add int_load, int_store, xmm_move, sse_to_integer,
> > integer_to_sse, sse_load, sse_store, sse_unaligned_load and
> > sse_unaligned_store for costs of RTL expressions.
> > * config/i386/x86-tune-costs.h: Duplicate int_load, int_store,
> > xmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store
> > for costs of RTL expressions.  Use sse_unaligned_load and
> > sse_unaligned_store only for costs of RTL expressions.
> >
> > --
> > H.J.
H.J. Lu June 20, 2019, 3:18 p.m. UTC | #3
On Thu, Jun 20, 2019 at 12:43 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 9:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Mon, Jun 17, 2019 at 6:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > processor_costs has costs of RTL expressions and costs of moves:
> > >
> > > 1. Costs of RTL expressions is computed as COSTS_N_INSNS which are used
> > > to generate RTL expressions with the lowest costs.  Costs of RTL memory
> > > operation can be very close to costs of fast instructions to indicate
> > > fast memory operations.
> > >
> > > 2. After RTL expressions have been generated, costs of moves are used by
> > > TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
> > > costs for register allocator.  Costs of load and store are higher than
> > > costs of register moves to reduce stack usages by register allocator.
> > >
> > > We should separate costs of RTL expressions from costs of moves so that
> > > they can be adjusted independently.  This patch moves costs of moves to
> > > the new used_by_ra field and duplicates costs of moves which are also
> > > used for costs of RTL expressions.
> >
> > Actually, I think that the current separation is OK. Before reload, we
> > actually don't know which register set will perform the move (not even
> > if float mode will be moved in integer registers), the only thing we
> > can estimate is the number of move instructions. The real cost of
> > register moves is later calculated by the register allocator, where
> > the register class is taken into account when calculating the cost.
>
> Forgot to say that due to the above reasoning, cost of moves should
> not be used in the calculation of costs of RTL expressions, as we are
> talking about two different cost functions. RTL expressions should
> know nothing about register classes.
>

Currently, costs of moves are also used for costs of RTL expressions.  This
patch:

https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html

includes:

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index e943d13..8409a5f 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4}, /* cost of loading integer registers
     in QImode, HImode and SImode.
     Relative to reg-reg move (2).  */
-  {6, 6, 6}, /* cost of storing integer registers */
+  {6, 6, 3}, /* cost of storing integer registers */
   2, /* cost of reg,reg fld/fst */
   {6, 6, 8}, /* cost of loading fp registers
     in SFmode, DFmode and XFmode */

It lowered the cost of SImode stores and made them cheaper than SSE<->integer
register moves.  That caused a regression:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878

Since the cost for SImode store is also used to compute scalar_store
in ix86_builtin_vectorization_cost, it changed loop costs in

void
foo (long p2, long *diag, long d, long i)
{
  long k;
  k = p2 < 3 ? p2 + p2 : p2 + 3;
  while (i < k)
    diag[i++] = d;
}

As a result, the loop is unrolled 4 times with -O3 -march=skylake instead
of 3.
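
For reference, the connection between the two costs is roughly the
following (a simplified paraphrase of the scalar_store case in
ix86_builtin_vectorization_cost, not the exact source): move-cost entries
are relative to a register-register move of cost 2, so they are rescaled
to COSTS_N_INSNS units.

static int
example_scalar_store_cost (const struct processor_costs *c)
{
  /* COSTS_N_INSNS (n) is (n) * 4; integer case only.  */
  return COSTS_N_INSNS (c->int_store[2]) / 2;
}

With the skylake entry changed from 6 to 3, scalar_store drops from 12 to
6, which is what shifts the unrolling decision above.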

My patch separates costs of moves from costs of RTL expressions.  We have
a follow-up patch which restores the cost of SImode stores back to 6 and
leaves the cost of scalar_store unchanged.  It keeps loop unrolling
unchanged and improves powf performance in glibc by 30%.  We are collecting
SPEC CPU 2017 data now.
Uros Bizjak June 20, 2019, 8:33 p.m. UTC | #4
On Thu, Jun 20, 2019 at 5:19 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 12:43 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > On Thu, Jun 20, 2019 at 9:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > On Mon, Jun 17, 2019 at 6:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > processor_costs has costs of RTL expressions and costs of moves:
> > > >
> > > > 1. Costs of RTL expressions is computed as COSTS_N_INSNS which are used
> > > > to generate RTL expressions with the lowest costs.  Costs of RTL memory
> > > > operation can be very close to costs of fast instructions to indicate
> > > > fast memory operations.
> > > >
> > > > 2. After RTL expressions have been generated, costs of moves are used by
> > > > TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
> > > > costs for register allocator.  Costs of load and store are higher than
> > > > costs of register moves to reduce stack usages by register allocator.
> > > >
> > > > We should separate costs of RTL expressions from costs of moves so that
> > > > they can be adjusted independently.  This patch moves costs of moves to
> > > > the new used_by_ra field and duplicates costs of moves which are also
> > > > used for costs of RTL expressions.
> > >
> > > Actually, I think that the current separation is OK. Before reload, we
> > > actually don't know which register set will perform the move (not even
> > > if float mode will be moved in integer registers), the only thing we
> > > can estimate is the number of move instructions. The real cost of
> > > register moves is later calculated by the register allocator, where
> > > the register class is taken into account when calculating the cost.
> >
> > Forgot to say that due to the above reasoning, cost of moves should
> > not be used in the calculation of costs of RTL expressions, as we are
> > talking about two different cost functions. RTL expressions should
> > know nothing about register classes.
> >
>
> Currently, costs of moves are also used for costs of RTL expressions.   This
> patch:
>
> https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
>
> includes:
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index e943d13..8409a5f 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
>    {4, 4, 4}, /* cost of loading integer registers
>      in QImode, HImode and SImode.
>      Relative to reg-reg move (2).  */
> -  {6, 6, 6}, /* cost of storing integer registers */
> +  {6, 6, 3}, /* cost of storing integer registers */
>    2, /* cost of reg,reg fld/fst */
>    {6, 6, 8}, /* cost of loading fp registers
>      in SFmode, DFmode and XFmode */
>
> It lowered the cost for SImode store and made it cheaper than SSE<->integer
> register move.  It caused a regression:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
>
> Since the cost for SImode store is also used to compute scalar_store
> in ix86_builtin_vectorization_cost, it changed loop costs in
>
> void
> foo (long p2, long *diag, long d, long i)
> {
>   long k;
>   k = p2 < 3 ? p2 + p2 : p2 + 3;
>   while (i < k)
>     diag[i++] = d;
> }
>
> As the result, the loop is unrolled 4 times with -O3 -march=skylake,
> instead of 3.
>
> My patch separates costs of moves from costs of RTL expressions.  We have
> a follow up patch which restores the cost for SImode store back to 6 and leave
> the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
> improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> data now.

It looks like the x86 costs are one big mess. I suggest you take this
matter to Honza; he knows this part better than I do.

Uros.
Jan Hubicka June 20, 2019, 9:10 p.m. UTC | #5
> > Currently, costs of moves are also used for costs of RTL expressions.   This
> > patch:
> >
> > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> >
> > includes:
> >
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index e943d13..8409a5f 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> >    {4, 4, 4}, /* cost of loading integer registers
> >      in QImode, HImode and SImode.
> >      Relative to reg-reg move (2).  */
> > -  {6, 6, 6}, /* cost of storing integer registers */
> > +  {6, 6, 3}, /* cost of storing integer registers */
> >    2, /* cost of reg,reg fld/fst */
> >    {6, 6, 8}, /* cost of loading fp registers
> >      in SFmode, DFmode and XFmode */

Well, it seems that the patch was fixing things in the wrong spot - the
tables are intended to be mostly latency based. I think we ought to
document divergences from these, including benchmarks where the change
helped. Otherwise it is very hard to figure out why an entry does not
match reality.
> >
> > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > register move.  It caused a regression:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> >
> > Since the cost for SImode store is also used to compute scalar_store
> > in ix86_builtin_vectorization_cost, it changed loop costs in
> >
> > void
> > foo (long p2, long *diag, long d, long i)
> > {
> >   long k;
> >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> >   while (i < k)
> >     diag[i++] = d;
> > }
> >
> > As the result, the loop is unrolled 4 times with -O3 -march=skylake,
> > instead of 3.
> >
> > My patch separates costs of moves from costs of RTL expressions.  We have
> > a follow up patch which restores the cost for SImode store back to 6 and leave
> > the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
> > improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > data now.

I have seen the problem with scalar_store with AMD tuning as well.
It seems to make the SLP vectorizer happy about the idea of turning a
sequence of, say, integer stores into code which moves all the values into
an AVX register and then does one vector store.

The cost model basically compares the cost of N scalar stores to the cost
of one store plus vector construction, where vector construction is
N*sse_op + addss.
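
In other words (an illustrative sketch only; the helper name and its
parameters are made up, and the per-CPU constants come from the cost
tables):

static int
example_slp_store_is_win (int n, int scalar_store, int vector_store,
			  int sse_op, int addss)
{
  int scalar_cost = n * scalar_store;
  int vector_cost = vector_store   /* the single vector store */
		    + n * sse_op   /* N element inserts */
		    + addss;       /* combining step, e.g. for AVX256 */
  return vector_cost < scalar_cost;
}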

With testcase:

short array[8];
test (short a,short b,short c,short d,short e,short f,short g,short h)
{ 
  array[0]=a;
  array[1]=b;
  array[2]=c;
  array[3]=d;
  array[4]=e;
  array[5]=f;
  array[6]=g;
  array[7]=h;
}
int iarray[8];
test2 (int a,int b,int c,int d,int e,int f,int g,int h)
{ 
  iarray[0]=a;
  iarray[1]=b;
  iarray[2]=c;
  iarray[3]=d;
  iarray[4]=e;
  iarray[5]=f;
  iarray[6]=g;
  iarray[7]=h;
}

I get the following codegen:


test:
        vmovd   %edi, %xmm0
        vmovd   %edx, %xmm2
        vmovd   %r8d, %xmm1
        vmovd   8(%rsp), %xmm3
        vpinsrw $1, 16(%rsp), %xmm3, %xmm3
        vpinsrw $1, %esi, %xmm0, %xmm0
        vpinsrw $1, %ecx, %xmm2, %xmm2
        vpinsrw $1, %r9d, %xmm1, %xmm1
        vpunpckldq      %xmm2, %xmm0, %xmm0
        vpunpckldq      %xmm3, %xmm1, %xmm1
        vpunpcklqdq     %xmm1, %xmm0, %xmm0
        vmovaps %xmm0, array(%rip)
        ret

test2:
        vmovd   %r8d, %xmm5
        vmovd   %edx, %xmm6
        vmovd   %edi, %xmm7
        vpinsrd $1, %r9d, %xmm5, %xmm1
        vpinsrd $1, %ecx, %xmm6, %xmm3
        vpinsrd $1, %esi, %xmm7, %xmm0
        vpunpcklqdq     %xmm3, %xmm0, %xmm0
        vmovd   16(%rbp), %xmm4
        vpinsrd $1, 24(%rbp), %xmm4, %xmm2
        vpunpcklqdq     %xmm2, %xmm1, %xmm1
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
        vmovdqu %ymm0, iarray(%rip)
        vzeroupper
        ret

which is about 20% slower on my skylake notebook than the
non-SLP-vectorized variant.

I wonder if the vec_construct costs should be made more realistic.
It is computed as:

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;

So it expects 8 simple SSE operations + one SSE FP arithmetic operation,
while the code above has 8 inter-unit moves + 3 SSE integer operations to
shuffle things around, not to mention the increased register pressure.

I would say that for integer constructs it is a common case that things
need to be moved from the integer unit to SSE.

Overall the problem is deeper, since the vectorizer really may need a
better idea of latencies and throughputs to estimate loop times more
realistically.

One also may want to account somewhat for the fact that stores are often
not part of the hot path and thus their latency is not too critical, and,
on the other hand, for the fact that vector stores prevent later partial
memory stalls...

Honza
H.J. Lu June 20, 2019, 9:42 p.m. UTC | #6
On Thu, Jun 20, 2019 at 2:10 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > Currently, costs of moves are also used for costs of RTL expressions.   This
> > > patch:
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > >
> > > includes:
> > >
> > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > index e943d13..8409a5f 100644
> > > --- a/gcc/config/i386/x86-tune-costs.h
> > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > >    {4, 4, 4}, /* cost of loading integer registers
> > >      in QImode, HImode and SImode.
> > >      Relative to reg-reg move (2).  */
> > > -  {6, 6, 6}, /* cost of storing integer registers */
> > > +  {6, 6, 3}, /* cost of storing integer registers */
> > >    2, /* cost of reg,reg fld/fst */
> > >    {6, 6, 8}, /* cost of loading fp registers
> > >      in SFmode, DFmode and XFmode */
>
> Well, it seems that the patch was fixing things on wrong spot - the
> tables are intended to be mostly latency based. I think we ought to
> document divergences from these including benchmarks where the change
> helped. Otherwise it is very hard to figure out why the entry does not
> match the reality.
> > >
> > > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > > register move.  It caused a regression:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > >
> > > Since the cost for SImode store is also used to compute scalar_store
> > > in ix86_builtin_vectorization_cost, it changed loop costs in
> > >
> > > void
> > > foo (long p2, long *diag, long d, long i)
> > > {
> > >   long k;
> > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > >   while (i < k)
> > >     diag[i++] = d;
> > > }
> > >
> > > As the result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > instead of 3.
> > >
> > > My patch separates costs of moves from costs of RTL expressions.  We have
> > > a follow up patch which restores the cost for SImode store back to 6 and leave
> > > the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
> > > improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > data now.
>
> I have seen the problem with scalar_store with AMD tuning as well.
> It seems to make SLP vectorizer to be happy about idea of turning
> sequence of say integer tores into code which moves all the values into
> AVX register and then does one vector store.
>
> The cost basically compare cost of N scalar stores to 1 scalar store +
> vector construction. Vector construction then N*sse_op+addss.
>
> With testcase:
>
> short array[8];
> test (short a,short b,short c,short d,short e,short f,short g,short h)
> {
>   array[0]=a;
>   array[1]=b;
>   array[2]=c;
>   array[3]=d;
>   array[4]=e;
>   array[5]=f;
>   array[6]=g;
>   array[7]=h;
> }
> int iarray[8];
> test2 (int a,int b,int c,int d,int e,int f,int g,int h)
> {
>   iarray[0]=a;
>   iarray[1]=b;
>   iarray[2]=c;
>   iarray[3]=d;
>   iarray[4]=e;
>   iarray[5]=f;
>   iarray[6]=g;
>   iarray[7]=h;
> }
>
> I get the following codegen:
>
>
> test:
>         vmovd   %edi, %xmm0
>         vmovd   %edx, %xmm2
>         vmovd   %r8d, %xmm1
>         vmovd   8(%rsp), %xmm3
>         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
>         vpinsrw $1, %esi, %xmm0, %xmm0
>         vpinsrw $1, %ecx, %xmm2, %xmm2
>         vpinsrw $1, %r9d, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm0, %xmm0
>         vpunpckldq      %xmm3, %xmm1, %xmm1
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
>         vmovaps %xmm0, array(%rip)
>         ret
>
> test2:
>         vmovd   %r8d, %xmm5
>         vmovd   %edx, %xmm6
>         vmovd   %edi, %xmm7
>         vpinsrd $1, %r9d, %xmm5, %xmm1
>         vpinsrd $1, %ecx, %xmm6, %xmm3
>         vpinsrd $1, %esi, %xmm7, %xmm0
>         vpunpcklqdq     %xmm3, %xmm0, %xmm0
>         vmovd   16(%rbp), %xmm4
>         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
>         vpunpcklqdq     %xmm2, %xmm1, %xmm1
>         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
>         vmovdqu %ymm0, iarray(%rip)
>         vzeroupper
>         ret
>
> which is about 20% slower on my skylake notebook than the
> non-SLP-vectorized variant.
>
> I wonder if the vec_construct costs should be made more realistic.
> It is computed as:
>
>       case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
>           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           if (GET_MODE_BITSIZE (mode) == 256)
>             cost += ix86_vec_cost (mode, ix86_cost->addss);
>           /* One vinserti64x4 and two vinserti128 for combining SSE
>              and AVX256 vectors to AVX512.  */
>           else if (GET_MODE_BITSIZE (mode) == 512)
>             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
>           return cost;
>
> So it expects 8 simple SSE operations + one SSE FP arithmetical
> operations.  While code above has 8 inter-unit moves + 3 SSE integer
> operations to shuffle things around. Not mentioning the increased
> register pressure.
>
> I would say that for integer constructs it is a common case that things
> needs to be moved from integer unit to SSE.
>
> Overall the problem is deeper since vectorizer really may need to get
> better idea about latencies and throughputs to estimate loop times more
> realistically.
>
> One also may want to account somewhat that stores are often not part
> of the hot path and thus their latency is not too critical and the
> fact that vector stores prevents later partial memory stalls on the
> other hand...
>

I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90952

We shouldn't use costs of moves for costs of RTL expressions.  We can
experiment with different RTL expression cost formulas, but we need to
separate costs of RTL expressions from costs of moves first.  What is the
best way to partition processor_costs to avoid confusion between costs of
moves and costs of RTL expressions?
Jan Hubicka June 23, 2019, 11:18 a.m. UTC | #7
> I opened:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90952
> 
> We shouldn't use costs for moves for costs of RTL expressions.   We can
> experiment different RTL expression cost formulas.   But we need to separate
> costs of RTL expressions from costs for moves first.   What is the best way
> to partition processor_costs to avoid confusion between costs of moves vs.
> costs of RTL expressions?

I am still worried that splitting the costs and experimentally finding a
value which works well for SPEC 2017 is not a very reliable solution here,
since the problematic decision is not only about the store cost but also
about other factors.

What benchmarks besides x264 are sensitive to this?

Looking at x264, the problem is really the simple SLP vectorization of 8
integer stores into one AVX256 store, which is not a win on Core.
I wrote a simple microbenchmark that tests the SLP-vectorized versus the
normal store (attached). Results on Skylake are:

64bit
     float          2 SLP:    1.54
     float          2 no-SLP: 1.52
     float          2 def:    1.55
      char          8 SLP:    3.35
      char          8 no-SLP: 3.34
      char          8 def:    3.32
     short          4 SLP:    1.51
     short          4 no-SLP: 1.51
     short          4 def:    1.52
       int          2 SLP:    1.22
       int          2 no-SLP: 1.24
       int          2 def:    1.25
AVX128
     float          4 SLP:    1.51
     float          4 no-SLP: 1.81
     float          4 def:    1.54
    double          2 SLP:    1.51
    double          2 no-SLP: 1.53
    double          2 def:    1.55
      char         16 SLP:    6.31
      char         16 no-SLP: 8.31
      char         16 def:    6.33
     short          8 SLP:    3.91
     short          8 no-SLP: 3.33
     short          8 def:    3.92
       int          4 SLP:    2.12
       int          4 no-SLP: 1.51
       int          4 def:    1.56
 long long          2 SLP:    1.50
 long long          2 no-SLP: 1.21
 long long          2 def:    1.26

AVX256
     float          8 SLP:    2.11
     float          8 no-SLP: 2.70
     float          8 def:    2.13
    double          4 SLP:    1.83
    double          4 no-SLP: 1.80
    double          4 def:    1.82
      char         32 SLP:    12.72
      char         32 no-SLP: 17.28
      char         32 def:    12.71
     short         16 SLP:    6.32
     short         16 no-SLP: 8.77
     short         16 def:    6.20
       int          8 SLP:    3.93
       int          8 no-SLP: 3.31
       int          8 def:    3.33
 long long          4 SLP:    2.13
 long long          4 no-SLP: 1.52
 long long          4 def:    1.51

def is with the cost-model-based decision.
SLP seems a bad idea for
 - 256-bit long long and int vectors
   (which I see are cured by your change in the cost table),
 - doubles (a little bit),
 - shorts for 128-bit vectors
   (I guess that would be cured if the 16-bit store cost was
    decreased a bit like you did for int).

For Zen we get:

64bit
     float          2 SLP:    2.22
     float          2 no-SLP: 2.23
     float          2 def:    2.23
      char          8 SLP:    4.08
      char          8 no-SLP: 4.08
      char          8 def:    4.08
     short          4 SLP:    2.22
     short          4 no-SLP: 2.23
     short          4 def:    2.23
       int          2 SLP:    1.86
       int          2 no-SLP: 1.87
       int          2 def:    1.86
AVX128
     float          4 SLP:    2.23
     float          4 no-SLP: 2.60
     float          4 def:    2.23
    double          2 SLP:    2.23
    double          2 no-SLP: 2.23
    double          2 def:    2.23
      char         16 SLP:    4.79
      char         16 no-SLP: 10.03
      char         16 def:    4.85
     short          8 SLP:    3.20
     short          8 no-SLP: 4.08
     short          8 def:    3.22
       int          4 SLP:    2.23
       int          4 no-SLP: 2.23
       int          4 def:    2.23
 long long          2 SLP:    1.86
 long long          2 no-SLP: 1.86
 long long          2 def:    1.87

So SLP is a win in general.
And for Bulldozer:

64bit
     float          2 SLP:    2.76
     float          2 no-SLP: 2.77
     float          2 def:    2.77
      char          8 SLP:    4.48
      char          8 no-SLP: 4.49
      char          8 def:    4.48
     short          4 SLP:    2.84
     short          4 no-SLP: 2.84
     short          4 def:    2.83
       int          2 SLP:    2.14
       int          2 no-SLP: 2.13
       int          2 def:    2.15
AVX128
     float          4 SLP:    2.59
     float          4 no-SLP: 3.07
     float          4 def:    2.59
    double          2 SLP:    2.48
    double          2 no-SLP: 2.49
    double          2 def:    2.48
      char         16 SLP:    30.33
      char         16 no-SLP: 11.72
      char         16 def:    30.30
     short          8 SLP:    21.04
     short          8 no-SLP: 4.62
     short          8 def:    21.06
       int          4 SLP:    4.29
       int          4 no-SLP: 2.84
       int          4 def:    4.30
 long long          2 SLP:    3.07
 long long          2 no-SLP: 2.14
 long long          2 def:    2.16

Here SLP is a major loss for integers and we get it all wrong.
This is because SLP for integers implies inter-unit moves, which are bad
on this chip.

Looking at the generated code, we seem to get constructor costs wrong.

SLP for float4 is generated as:
        vunpcklps       %xmm3, %xmm2, %xmm2
        vunpcklps       %xmm1, %xmm0, %xmm0
        vmovlhps        %xmm2, %xmm0, %xmm0
        vmovaps %xmm0, array(%rip)

While the vectorizer computes:
0x3050e50 a0_2(D) 1 times vec_construct costs 16 in prologue
0x3050e50 a0_2(D) 1 times vector_store costs 16 in body
0x3051030 a0_2(D) 1 times scalar_store costs 16 in body
0x3051030 a1_4(D) 1 times scalar_store costs 16 in body
0x3051030 a2_6(D) 1 times scalar_store costs 16 in body
0x3051030 a3_8(D) 1 times scalar_store costs 16 in body
testslp.C:70:1: note:  Cost model analysis: 
  Vector inside of basic block cost: 16
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

So it thinks that the vectorized sequence will take the same time as one
store.  This is a result of:

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;
        }
So that is 4 * normal sse_op (latency 1) plus addss (latency 4), 8 cycles
overall; the SSE store should be 4 cycles.

This does not quite match reality.

For the integer version this is even less realistic, since we output 8
int->SSE moves followed by packing code.

The attached patch gets the number of instructions right, but it still
won't result in optimal scores in my microbenchmark.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 272507)
+++ config/i386/i386.c	(working copy)
@@ -21130,15 +21132,38 @@ ix86_builtin_vectorization_cost (enum ve
 
       case vec_construct:
 	{
-	  /* N element inserts into SSE vectors.  */
-	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
-	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
-	  if (GET_MODE_BITSIZE (mode) == 256)
-	    cost += ix86_vec_cost (mode, ix86_cost->addss);
-	  /* One vinserti64x4 and two vinserti128 for combining SSE
-	     and AVX256 vectors to AVX512.  */
-	  else if (GET_MODE_BITSIZE (mode) == 512)
-	    cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
+	  int cost;
+	  if (fp)
+	      /* vunpcklps or vunpcklpd to move half of the values above
+		 the other half.  */
+	    cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op / 2;
+	  else
+	    /* Scalar values are usually converted from integer unit.
+	       N/2 vmovs and N/2 vpinsrd  */
+	    cost = TYPE_VECTOR_SUBPARTS (vectype)
+		   * COSTS_N_INSNS (ix86_cost->sse_to_integer / 2);
+	  switch (TYPE_VECTOR_SUBPARTS (vectype))
+	    {
+	    case 2:
+	       break;
+	    case 4:
+	       /* movhlps or vinsertf128.  */
+	       cost += ix86_vec_cost (mode, ix86_cost->sse_op);
+	       break;
+	    case 8:
+	       /* 2 vmovlhps + vinsertf128.  */
+	       cost += ix86_vec_cost (mode, 3 * ix86_cost->sse_op);
+	       break;
+	    case 16:
+	       cost += ix86_vec_cost (mode, 7 * ix86_cost->sse_op);
+	       break;
+	    case 32:
+	       cost += ix86_vec_cost (mode, 15 * ix86_cost->sse_op);
+	       break;
+	    case 64:
+	       cost += ix86_vec_cost (mode, 31 * ix86_cost->sse_op);
+	       break;
+	    }
 	  return cost;
 	}
Richard Biener June 24, 2019, 1:37 p.m. UTC | #8
On Thu, 20 Jun 2019, Jan Hubicka wrote:

> > > Currently, costs of moves are also used for costs of RTL expressions.   This
> > > patch:
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > >
> > > includes:
> > >
> > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > index e943d13..8409a5f 100644
> > > --- a/gcc/config/i386/x86-tune-costs.h
> > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > >    {4, 4, 4}, /* cost of loading integer registers
> > >      in QImode, HImode and SImode.
> > >      Relative to reg-reg move (2).  */
> > > -  {6, 6, 6}, /* cost of storing integer registers */
> > > +  {6, 6, 3}, /* cost of storing integer registers */
> > >    2, /* cost of reg,reg fld/fst */
> > >    {6, 6, 8}, /* cost of loading fp registers
> > >      in SFmode, DFmode and XFmode */
> 
> Well, it seems that the patch was fixing things on wrong spot - the
> tables are intended to be mostly latency based. I think we ought to
> document divergences from these including benchmarks where the change
> helped. Otherwise it is very hard to figure out why the entry does not
> match the reality.
> > >
> > > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > > register move.  It caused a regression:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > >
> > > Since the cost for SImode store is also used to compute scalar_store
> > > in ix86_builtin_vectorization_cost, it changed loop costs in
> > >
> > > void
> > > foo (long p2, long *diag, long d, long i)
> > > {
> > >   long k;
> > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > >   while (i < k)
> > >     diag[i++] = d;
> > > }
> > >
> > > As the result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > instead of 3.
> > >
> > > My patch separates costs of moves from costs of RTL expressions.  We have
> > > a follow up patch which restores the cost for SImode store back to 6 and leave
> > > the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
> > > improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > data now.
> 
> I have seen the problem with scalar_store with AMD tuning as well.
> It seems to make SLP vectorizer to be happy about idea of turning
> sequence of say integer tores into code which moves all the values into
> AVX register and then does one vector store.
> 
> The cost basically compare cost of N scalar stores to 1 scalar store +
> vector construction. Vector construction then N*sse_op+addss.
> 
> With testcase:
> 
> short array[8];
> test (short a,short b,short c,short d,short e,short f,short g,short h)
> { 
>   array[0]=a;
>   array[1]=b;
>   array[2]=c;
>   array[3]=d;
>   array[4]=e;
>   array[5]=f;
>   array[6]=g;
>   array[7]=h;
> }
> int iarray[8];
> test2 (int a,int b,int c,int d,int e,int f,int g,int h)
> { 
>   iarray[0]=a;
>   iarray[1]=b;
>   iarray[2]=c;
>   iarray[3]=d;
>   iarray[4]=e;
>   iarray[5]=f;
>   iarray[6]=g;
>   iarray[7]=h;
> }
> 
> I get the following codegen:
> 
> 
> test:
>         vmovd   %edi, %xmm0
>         vmovd   %edx, %xmm2
>         vmovd   %r8d, %xmm1
>         vmovd   8(%rsp), %xmm3
>         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
>         vpinsrw $1, %esi, %xmm0, %xmm0
>         vpinsrw $1, %ecx, %xmm2, %xmm2
>         vpinsrw $1, %r9d, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm0, %xmm0
>         vpunpckldq      %xmm3, %xmm1, %xmm1
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
>         vmovaps %xmm0, array(%rip)
>         ret
> 
> test2:
>         vmovd   %r8d, %xmm5
>         vmovd   %edx, %xmm6
>         vmovd   %edi, %xmm7
>         vpinsrd $1, %r9d, %xmm5, %xmm1
>         vpinsrd $1, %ecx, %xmm6, %xmm3
>         vpinsrd $1, %esi, %xmm7, %xmm0
>         vpunpcklqdq     %xmm3, %xmm0, %xmm0
>         vmovd   16(%rbp), %xmm4
>         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
>         vpunpcklqdq     %xmm2, %xmm1, %xmm1
>         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
>         vmovdqu %ymm0, iarray(%rip)
>         vzeroupper
> 	ret
> 
> which is about 20% slower on my skylake notebook than the
> non-SLP-vectorized variant.
> 
> I wonder if the vec_construct costs should be made more realistic.
> It is computed as:
> 
>       case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
>           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           if (GET_MODE_BITSIZE (mode) == 256)
>             cost += ix86_vec_cost (mode, ix86_cost->addss);
>           /* One vinserti64x4 and two vinserti128 for combining SSE
>              and AVX256 vectors to AVX512.  */
>           else if (GET_MODE_BITSIZE (mode) == 512)
>             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
>           return cost;
> 
> So it expects 8 simple SSE operations + one SSE FP arithmetical
> operations.  While code above has 8 inter-unit moves + 3 SSE integer
> operations to shuffle things around. Not mentioning the increased
> register pressure.

But aren't the inter-unit moves a red herring?  Your testcase places
the sources in integer registers, but usually for the case of
vectorization we arrive here from strided loads, for which we could
load the first value into an %xmm reg directly and have the
later vpinsr instructions take a memory source?

Yes, vec_construct cost isn't the full story in this case, which is
why add_stmt special-cases strided loads/stores, adding some
pessimization.

> I would say that for integer constructs it is a common case that things
> needs to be moved from integer unit to SSE.

Is it?  For SLP vectorization probably yes.  The costing interface
unfortunately does not give much information here (well, add_stmt
has access to the stmt_info ...).

> Overall the problem is deeper since vectorizer really may need to get
> better idea about latencies and throughputs to estimate loop times more
> realistically. 

Indeed, but I hardly see how we can handle this in a sensible way, since
we don't even understand performance corner cases when analyzing them and
looking at this info - the HW still behaves in unexpected ways :/

> One also may want to account somewhat that stores are often not part
> of the hot path and thus their latency is not too critical and the
> fact that vector stores prevents later partial memory stalls on the
> other hand...
> 
> Honza
>
H.J. Lu June 24, 2019, 4:16 p.m. UTC | #9
On Mon, Jun 24, 2019 at 6:37 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 20 Jun 2019, Jan Hubicka wrote:
>
> > > > Currently, costs of moves are also used for costs of RTL expressions.   This
> > > > patch:
> > > >
> > > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > > >
> > > > includes:
> > > >
> > > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > > index e943d13..8409a5f 100644
> > > > --- a/gcc/config/i386/x86-tune-costs.h
> > > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > > >    {4, 4, 4}, /* cost of loading integer registers
> > > >      in QImode, HImode and SImode.
> > > >      Relative to reg-reg move (2).  */
> > > > -  {6, 6, 6}, /* cost of storing integer registers */
> > > > +  {6, 6, 3}, /* cost of storing integer registers */
> > > >    2, /* cost of reg,reg fld/fst */
> > > >    {6, 6, 8}, /* cost of loading fp registers
> > > >      in SFmode, DFmode and XFmode */
> >
> > Well, it seems that the patch was fixing things on wrong spot - the
> > tables are intended to be mostly latency based. I think we ought to
> > document divergences from these including benchmarks where the change
> > helped. Otherwise it is very hard to figure out why the entry does not
> > match the reality.
> > > >
> > > > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > > > register move.  It caused a regression:
> > > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > > >
> > > > Since the cost for SImode store is also used to compute scalar_store
> > > > in ix86_builtin_vectorization_cost, it changed loop costs in
> > > >
> > > > void
> > > > foo (long p2, long *diag, long d, long i)
> > > > {
> > > >   long k;
> > > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > > >   while (i < k)
> > > >     diag[i++] = d;
> > > > }
> > > >
> > > > As the result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > > instead of 3.
> > > >
> > > > My patch separates costs of moves from costs of RTL expressions.  We have
> > > > a follow up patch which restores the cost for SImode store back to 6 and leave
> > > > the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
> > > > improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > > data now.
> >
> > I have seen the problem with scalar_store with AMD tuning as well.
> > It seems to make SLP vectorizer to be happy about idea of turning
> > sequence of say integer tores into code which moves all the values into
> > AVX register and then does one vector store.
> >
> > The cost basically compare cost of N scalar stores to 1 scalar store +
> > vector construction. Vector construction then N*sse_op+addss.
> >
> > With testcase:
> >
> > short array[8];
> > test (short a,short b,short c,short d,short e,short f,short g,short h)
> > {
> >   array[0]=a;
> >   array[1]=b;
> >   array[2]=c;
> >   array[3]=d;
> >   array[4]=e;
> >   array[5]=f;
> >   array[6]=g;
> >   array[7]=h;
> > }
> > int iarray[8];
> > test2 (int a,int b,int c,int d,int e,int f,int g,int h)
> > {
> >   iarray[0]=a;
> >   iarray[1]=b;
> >   iarray[2]=c;
> >   iarray[3]=d;
> >   iarray[4]=e;
> >   iarray[5]=f;
> >   iarray[6]=g;
> >   iarray[7]=h;
> > }
> >
> > I get the following codegen:
> >
> >
> > test:
> >         vmovd   %edi, %xmm0
> >         vmovd   %edx, %xmm2
> >         vmovd   %r8d, %xmm1
> >         vmovd   8(%rsp), %xmm3
> >         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
> >         vpinsrw $1, %esi, %xmm0, %xmm0
> >         vpinsrw $1, %ecx, %xmm2, %xmm2
> >         vpinsrw $1, %r9d, %xmm1, %xmm1
> >         vpunpckldq      %xmm2, %xmm0, %xmm0
> >         vpunpckldq      %xmm3, %xmm1, %xmm1
> >         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> >         vmovaps %xmm0, array(%rip)
> >         ret
> >
> > test2:
> >         vmovd   %r8d, %xmm5
> >         vmovd   %edx, %xmm6
> >         vmovd   %edi, %xmm7
> >         vpinsrd $1, %r9d, %xmm5, %xmm1
> >         vpinsrd $1, %ecx, %xmm6, %xmm3
> >         vpinsrd $1, %esi, %xmm7, %xmm0
> >         vpunpcklqdq     %xmm3, %xmm0, %xmm0
> >         vmovd   16(%rbp), %xmm4
> >         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
> >         vpunpcklqdq     %xmm2, %xmm1, %xmm1
> >         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
> >         vmovdqu %ymm0, iarray(%rip)
> >         vzeroupper
> >       ret
> >
> > which is about 20% slower on my skylake notebook than the
> > non-SLP-vectorized variant.
> >
> > I wonder if the vec_construct costs should be made more realistic.
> > It is computed as:
> >
> >       case vec_construct:
> >         {
> >           /* N element inserts into SSE vectors.  */
> >           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> >           /* One vinserti128 for combining two SSE vectors for AVX256.  */
> >           if (GET_MODE_BITSIZE (mode) == 256)
> >             cost += ix86_vec_cost (mode, ix86_cost->addss);
> >           /* One vinserti64x4 and two vinserti128 for combining SSE
> >              and AVX256 vectors to AVX512.  */
> >           else if (GET_MODE_BITSIZE (mode) == 512)
> >             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
> >           return cost;
> >
> > So it expects 8 simple SSE operations + one SSE FP arithmetical
> > operations.  While code above has 8 inter-unit moves + 3 SSE integer
> > operations to shuffle things around. Not mentioning the increased
> > register pressure.
>
> But aren't the inter-unit moves a red herring?  Your testcase places
> the sources in integer registers but usually for the case of
> vectorization we arrive here from strided loads for which we could
> load the first value into a %xmm reg directly and have the
> later vpinsr instruction have memory source?
>
> Yes, vec_construct cost isn't the full story in this case which is
> why add_stmt special-cases strided loads/stores adding some
> pessimization.
>
> > I would say that for integer constructs it is a common case that things
> > needs to be moved from integer unit to SSE.
>
> Is it?  For SLP vectorization probably yes.  The costing interface
> unfortunately is not giving much information here (well, add_stmt
> has access to the stmt_info ...).
>
> > Overall the problem is deeper since vectorizer really may need to get
> > better idea about latencies and throughputs to estimate loop times more
> > realistically.
>
> Indeed, but I hardly see how we can handle this in a sensible way since
> we don't even understand performance corner-cases when analyzing them
> and looking at this info but the HW still behaves in unexpected ways :/
>
> > One also may want to account somewhat that stores are often not part
> > of the hot path and thus their latency is not too critical and the
> > fact that vector stores prevents later partial memory stalls on the
> > other hand...
> >

Costs of moves are closely related to latency and should only be used for
the register allocator.  We shouldn't use costs of moves for RTL costs.
For the register allocator, register <-> register moves are preferred over
loads and stores unless they are slower than register -> memory -> register.
For RTL costs, we may want to make loads and stores cheap to improve RTL
expansion, but we don't want to change load and store costs for the
register allocator.  We need to separate costs of moves from costs of
RTL expressions first.
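
To make the intended split concrete, here is a minimal sketch of the two
consumers after the separation (the function names are invented for
illustration; the real consumers are inline_memory_move_cost,
ix86_register_move_cost and the RTL cost computations in the patch below):

static int
example_ra_memory_cost (const struct processor_costs *c, bool store)
{
  /* Register allocator: kept relatively high so spilling stays
     unattractive compared with register moves.  */
  return store ? c->used_by_ra.int_store[2] : c->used_by_ra.int_load[2];
}

static int
example_rtx_memory_cost (const struct processor_costs *c, bool store)
{
  /* RTL expressions: a separate copy that can be tuned close to the cost
     of fast instructions without affecting register allocation.  */
  return store ? c->int_store : c->int_load;
}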

Patch

From 1c04d184860d613ba0d789b9bc4e8754ca283e1e Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 14 Jun 2019 13:30:16 -0700
Subject: [PATCH] i386: Separate costs of RTL expressions from costs of moves

processor_costs has costs of RTL expressions and costs of moves:

1. Costs of RTL expressions are computed in COSTS_N_INSNS units and are
used to generate RTL expressions with the lowest costs.  Costs of RTL
memory operations can be very close to costs of fast instructions to
indicate fast memory operations.

2. After RTL expressions have been generated, costs of moves are used by
TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute move
costs for the register allocator.  Costs of loads and stores are higher
than costs of register moves to reduce stack usage by the register
allocator.

We should separate costs of RTL expressions from costs of moves so that
they can be adjusted independently.  This patch moves costs of moves into
the new used_by_ra field and duplicates the costs of moves that are also
used for costs of RTL expressions.

All cost models have been checked with

static void
check_one (const struct processor_costs *p)
{
  if (p->used_by_ra.int_load[2] != p->int_load)
    abort ();
  if (p->used_by_ra.int_store[2] != p->int_store)
    abort ();
  if (p->used_by_ra.xmm_move != p->xmm_move)
    abort ();
  if (p->used_by_ra.sse_to_integer != p->sse_to_integer)
    abort ();
  if (p->used_by_ra.integer_to_sse != p->integer_to_sse)
    abort ();
  if (memcmp (p->used_by_ra.sse_load, p->sse_load, sizeof (p->sse_load)))
    abort ();
  if (memcmp (p->used_by_ra.sse_store, p->sse_store, sizeof (p->sse_store)))
    abort ();
}

static void
check_cost ()
{
  check_one (&ix86_size_cost);
  for (unsigned int i = 0; i < ARRAY_SIZE (processor_cost_table); i++)
    check_one (processor_cost_table[i]);
}

by calling check_cost from ix86_option_override_internal.

	PR target/90878
	* config/i386/i386-features.c
	(dimode_scalar_chain::compute_convert_gain): Replace int_store[2]
	and int_load[2] with int_store and int_load.
	* config/i386/i386.c (inline_memory_move_cost): Use used_by_ra
	for costs of moves.
	(ix86_register_move_cost): Likewise.
	(ix86_builtin_vectorization_cost): Replace int_store[2] and
	int_load[2] with int_store and int_load.
	* config/i386/i386.h (processor_costs): Move costs of moves to
	used_by_ra.  Add int_load, int_store, xmm_move, sse_to_integer,
	integer_to_sse, sse_load, sse_store, sse_unaligned_load and
	sse_unaligned_store for costs of RTL expressions.
	* config/i386/x86-tune-costs.h: Duplicate int_load, int_store,
	xmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store
	for costs of RTL expressions.  Use sse_unaligned_load and
	sse_unaligned_store only for costs of RTL expressions.
---
 gcc/config/i386/i386-features.c  |   6 +-
 gcc/config/i386/i386.c           |  63 +++--
 gcc/config/i386/i386.h           |  49 ++--
 gcc/config/i386/x86-tune-costs.h | 409 ++++++++++++++++++++++++-------
 4 files changed, 388 insertions(+), 139 deletions(-)

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 2eac8f715bb..34eb70c874f 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -501,9 +501,9 @@  dimode_scalar_chain::compute_convert_gain ()
       if (REG_P (src) && REG_P (dst))
 	gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
       else if (REG_P (src) && MEM_P (dst))
-	gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	gain += 2 * ix86_cost->int_store - ix86_cost->sse_store[1];
       else if (MEM_P (src) && REG_P (dst))
-	gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
+	gain += 2 * ix86_cost->int_load - ix86_cost->sse_load[1];
       else if (GET_CODE (src) == ASHIFT
 	       || GET_CODE (src) == ASHIFTRT
 	       || GET_CODE (src) == LSHIFTRT)
@@ -543,7 +543,7 @@  dimode_scalar_chain::compute_convert_gain ()
 	  if (REG_P (dst))
 	    gain += COSTS_N_INSNS (2);
 	  else if (MEM_P (dst))
-	    gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
+	    gain += 2 * ix86_cost->int_store - ix86_cost->sse_store[1];
 	  gain -= vector_const_cost (src);
 	}
       else
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 941e208bcf0..bf3184f4a8b 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18511,8 +18511,10 @@  inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
 	    return 100;
 	}
       if (in == 2)
-        return MAX (ix86_cost->fp_load [index], ix86_cost->fp_store [index]);
-      return in ? ix86_cost->fp_load [index] : ix86_cost->fp_store [index];
+        return MAX (ix86_cost->used_by_ra.fp_load [index],
+		    ix86_cost->used_by_ra.fp_store [index]);
+      return in ? ix86_cost->used_by_ra.fp_load [index]
+		: ix86_cost->used_by_ra.fp_store [index];
     }
   if (SSE_CLASS_P (regclass))
     {
@@ -18520,8 +18522,10 @@  inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
       if (index == -1)
 	return 100;
       if (in == 2)
-        return MAX (ix86_cost->sse_load [index], ix86_cost->sse_store [index]);
-      return in ? ix86_cost->sse_load [index] : ix86_cost->sse_store [index];
+        return MAX (ix86_cost->used_by_ra.sse_load [index],
+		    ix86_cost->used_by_ra.sse_store [index]);
+      return in ? ix86_cost->used_by_ra.sse_load [index]
+		: ix86_cost->used_by_ra.sse_store [index];
     }
   if (MMX_CLASS_P (regclass))
     {
@@ -18538,8 +18542,10 @@  inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
 	    return 100;
 	}
       if (in == 2)
-        return MAX (ix86_cost->mmx_load [index], ix86_cost->mmx_store [index]);
-      return in ? ix86_cost->mmx_load [index] : ix86_cost->mmx_store [index];
+        return MAX (ix86_cost->used_by_ra.mmx_load [index],
+		    ix86_cost->used_by_ra.mmx_store [index]);
+      return in ? ix86_cost->used_by_ra.mmx_load [index]
+		: ix86_cost->used_by_ra.mmx_store [index];
     }
   switch (GET_MODE_SIZE (mode))
     {
@@ -18547,37 +18553,41 @@  inline_memory_move_cost (machine_mode mode, enum reg_class regclass, int in)
 	if (Q_CLASS_P (regclass) || TARGET_64BIT)
 	  {
 	    if (!in)
-	      return ix86_cost->int_store[0];
+	      return ix86_cost->used_by_ra.int_store[0];
 	    if (TARGET_PARTIAL_REG_DEPENDENCY
 	        && optimize_function_for_speed_p (cfun))
-	      cost = ix86_cost->movzbl_load;
+	      cost = ix86_cost->used_by_ra.movzbl_load;
 	    else
-	      cost = ix86_cost->int_load[0];
+	      cost = ix86_cost->used_by_ra.int_load[0];
 	    if (in == 2)
-	      return MAX (cost, ix86_cost->int_store[0]);
+	      return MAX (cost, ix86_cost->used_by_ra.int_store[0]);
 	    return cost;
 	  }
 	else
 	  {
 	   if (in == 2)
-	     return MAX (ix86_cost->movzbl_load, ix86_cost->int_store[0] + 4);
+	     return MAX (ix86_cost->used_by_ra.movzbl_load,
+			 ix86_cost->used_by_ra.int_store[0] + 4);
 	   if (in)
-	     return ix86_cost->movzbl_load;
+	     return ix86_cost->used_by_ra.movzbl_load;
 	   else
-	     return ix86_cost->int_store[0] + 4;
+	     return ix86_cost->used_by_ra.int_store[0] + 4;
 	  }
 	break;
       case 2:
 	if (in == 2)
-	  return MAX (ix86_cost->int_load[1], ix86_cost->int_store[1]);
-	return in ? ix86_cost->int_load[1] : ix86_cost->int_store[1];
+	  return MAX (ix86_cost->used_by_ra.int_load[1],
+		      ix86_cost->used_by_ra.int_store[1]);
+	return in ? ix86_cost->used_by_ra.int_load[1]
+		  : ix86_cost->used_by_ra.int_store[1];
       default:
 	if (in == 2)
-	  cost = MAX (ix86_cost->int_load[2], ix86_cost->int_store[2]);
+	  cost = MAX (ix86_cost->used_by_ra.int_load[2],
+		      ix86_cost->used_by_ra.int_store[2]);
 	else if (in)
-	  cost = ix86_cost->int_load[2];
+	  cost = ix86_cost->used_by_ra.int_load[2];
 	else
-	  cost = ix86_cost->int_store[2];
+	  cost = ix86_cost->used_by_ra.int_store[2];
 	/* Multiply with the number of GPR moves needed.  */
 	return cost * CEIL ((int) GET_MODE_SIZE (mode), UNITS_PER_WORD);
     }
@@ -18647,20 +18657,21 @@  ix86_register_move_cost (machine_mode mode, reg_class_t class1_i,
        because of missing QImode and HImode moves to, from or between
        MMX/SSE registers.  */
     return MAX (8, SSE_CLASS_P (class1)
-		? ix86_cost->sse_to_integer : ix86_cost->integer_to_sse);
+		? ix86_cost->used_by_ra.sse_to_integer
+		: ix86_cost->used_by_ra.integer_to_sse);
 
   if (MAYBE_FLOAT_CLASS_P (class1))
-    return ix86_cost->fp_move;
+    return ix86_cost->used_by_ra.fp_move;
   if (MAYBE_SSE_CLASS_P (class1))
     {
       if (GET_MODE_BITSIZE (mode) <= 128)
-	return ix86_cost->xmm_move;
+	return ix86_cost->used_by_ra.xmm_move;
       if (GET_MODE_BITSIZE (mode) <= 256)
-	return ix86_cost->ymm_move;
-      return ix86_cost->zmm_move;
+	return ix86_cost->used_by_ra.ymm_move;
+      return ix86_cost->used_by_ra.zmm_move;
     }
   if (MAYBE_MMX_CLASS_P (class1))
-    return ix86_cost->mmx_move;
+    return ix86_cost->used_by_ra.mmx_move;
   return 2;
 }
 
@@ -21071,11 +21082,11 @@  ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 	/* load/store costs are relative to register move which is 2. Recompute
  	   it to COSTS_N_INSNS so everything have same base.  */
         return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
-			      : ix86_cost->int_load [2]) / 2;
+			      : ix86_cost->int_load) / 2;
 
       case scalar_store:
         return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
-			      : ix86_cost->int_store [2]) / 2;
+			      : ix86_cost->int_store) / 2;
 
       case vector_stmt:
         return ix86_vec_cost (mode,
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 0ac5d651823..1c7ef500d37 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -235,7 +235,11 @@  struct stringop_algs
   } size [MAX_STRINGOP_ALGS];
 };
 
-/* Define the specific costs for a given cpu */
+/* Define the specific costs for a given cpu.  NB: used_by_ra is used
+   by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
+   move costs for the register allocator.  Don't use it to describe the
+   relative costs of RTL expressions in TARGET_RTX_COSTS.
+ */
 
 struct processor_costs {
   const int add;		/* cost of an add instruction */
@@ -252,32 +256,47 @@  struct processor_costs {
   const int large_insn;		/* insns larger than this cost more */
   const int move_ratio;		/* The threshold of number of scalar
 				   memory-to-memory move insns.  */
-  const int movzbl_load;	/* cost of loading using movzbl */
-  const int int_load[3];	/* cost of loading integer registers
+
+  /* Costs used by register allocator.  integer->integer register move
+     cost is 2.  */
+  struct
+    {
+      const int movzbl_load;	/* cost of loading using movzbl */
+      const int int_load[3];	/* cost of loading integer registers
 				   in QImode, HImode and SImode relative
 				   to reg-reg move (2).  */
-  const int int_store[3];	/* cost of storing integer register
+      const int int_store[3];	/* cost of storing integer register
 				   in QImode, HImode and SImode */
-  const int fp_move;		/* cost of reg,reg fld/fst */
-  const int fp_load[3];		/* cost of loading FP register
+      const int fp_move;	/* cost of reg,reg fld/fst */
+      const int fp_load[3];	/* cost of loading FP register
 				   in SFmode, DFmode and XFmode */
-  const int fp_store[3];	/* cost of storing FP register
+      const int fp_store[3];	/* cost of storing FP register
 				   in SFmode, DFmode and XFmode */
-  const int mmx_move;		/* cost of moving MMX register.  */
-  const int mmx_load[2];	/* cost of loading MMX register
+      const int mmx_move;	/* cost of moving MMX register.  */
+      const int mmx_load[2];	/* cost of loading MMX register
 				   in SImode and DImode */
-  const int mmx_store[2];	/* cost of storing MMX register
+      const int mmx_store[2];	/* cost of storing MMX register
 				   in SImode and DImode */
-  const int xmm_move, ymm_move, /* cost of moving XMM and YMM register.  */
-	    zmm_move;
+      const int xmm_move;	/* cost of moving XMM register.  */
+      const int ymm_move;	/* cost of moving YMM register.  */
+      const int zmm_move;	/* cost of moving ZMM register.  */
+      const int sse_load[5];	/* cost of loading SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+      const int sse_store[5];	/* cost of storing SSE register
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+      const int sse_to_integer;	/* cost of moving SSE register to integer.  */
+      const int integer_to_sse;	/* cost of moving integer register to SSE. */
+    } used_by_ra;
+  const int int_load;		/* cost of loading integer register.  */
+  const int int_store;		/* cost of storing integer register.  */
   const int sse_load[5];	/* cost of loading SSE register
 				   in 32bit, 64bit, 128bit, 256bit and 512bit */
-  const int sse_unaligned_load[5];/* cost of unaligned load.  */
   const int sse_store[5];	/* cost of storing SSE register
-				   in SImode, DImode and TImode.  */
+				   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  const int sse_unaligned_load[5];/* cost of unaligned load.  */
   const int sse_unaligned_store[5];/* cost of unaligned store.  */
+  const int xmm_move;		/* cost of moving XMM register.  */
   const int sse_to_integer;	/* cost of moving SSE register to integer.  */
-  const int integer_to_sse;	/* cost of moving integer register to SSE. */
   const int gather_static, gather_per_elt; /* Cost of gather load is computed
 				   as static + per_item * nelts. */
   const int scatter_static, scatter_per_elt; /* Cost of gather store is
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index ac06e37733a..879f5aeb09f 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -56,7 +56,7 @@  struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   0,					/* "large" insn */
   2,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   2,				     /* cost for loading QImode using movzbl */
   {2, 2, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -75,13 +75,23 @@  struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   3, 3, 3,				/* cost of moving XMM,YMM,ZMM register */
   {3, 3, 3, 3, 3},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {3, 3, 3, 3, 3},			/* cost of unaligned SSE load
-					   in 128bit, 256bit and 512bit */
   {3, 3, 3, 3, 3},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {3, 3, 3, 3, 3},				/* cost of unaligned SSE store
-					   in 128bit, 256bit and 512bit */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {3, 3, 3, 3, 3},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {3, 3, 3, 3, 3},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {3, 3, 3, 3, 3},			/* cost of unaligned SSE load
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {3, 3, 3, 3, 3},			/* cost of unaligned SSE store
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  3,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   5, 0,					/* Gather load static, per_elt.  */
   5, 0,					/* Gather store static, per_elt.  */
   0,					/* size of l1 cache  */
@@ -147,8 +157,7 @@  struct processor_costs i386_cost = {	/* 386 specific costs */
   15,					/* "large" insn */
   3,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -167,11 +176,21 @@  struct processor_costs i386_cost = {	/* 386 specific costs */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   0,					/* size of l1 cache  */
@@ -236,8 +255,7 @@  struct processor_costs i486_cost = {	/* 486 specific costs */
   15,					/* "large" insn */
   3,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -256,11 +274,21 @@  struct processor_costs i486_cost = {	/* 486 specific costs */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   4,					/* size of l1 cache.  486 has 8kB cache
@@ -327,8 +355,7 @@  struct processor_costs pentium_cost = {
   8,					/* "large" insn */
   6,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -347,11 +374,21 @@  struct processor_costs pentium_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -409,8 +446,7 @@  struct processor_costs lakemont_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {2, 4, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -429,11 +465,21 @@  struct processor_costs lakemont_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -506,8 +552,7 @@  struct processor_costs pentiumpro_cost = {
   8,					/* "large" insn */
   6,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   2,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -526,11 +571,21 @@  struct processor_costs pentiumpro_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 8, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {4, 8, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
+  {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -594,8 +649,7 @@  struct processor_costs geode_cost = {
   8,					/* "large" insn */
   4,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   2,				     /* cost for loading QImode using movzbl */
   {2, 2, 2},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -615,11 +669,21 @@  struct processor_costs geode_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {2, 2, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  2,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {2, 2, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
+  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
   2, 2,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -683,8 +747,7 @@  struct processor_costs k6_cost = {
   8,					/* "large" insn */
   4,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   3,				     /* cost for loading QImode using movzbl */
   {4, 5, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -703,11 +766,21 @@  struct processor_costs k6_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {2, 2, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {2, 2, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
+  {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
   2, 2,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -777,8 +850,7 @@  struct processor_costs athlon_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {3, 4, 3},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -797,11 +869,21 @@  struct processor_costs athlon_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 4, 12, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
   5, 5,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  3,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {4, 4, 12, 12, 24},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 10, 10, 20},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
+  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -873,8 +955,7 @@  struct processor_costs k8_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {3, 4, 3},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -893,11 +974,21 @@  struct processor_costs k8_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 3, 12, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
   5, 5,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  3,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {4, 3, 12, 12, 24},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 10, 10, 20},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
+  {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -973,8 +1064,7 @@  struct processor_costs amdfam10_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {3, 4, 3},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -993,11 +1083,11 @@  struct processor_costs amdfam10_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {4, 4, 3, 6, 12},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 3, 7, 12},			/* cost of unaligned loads.  */
   {4, 4, 5, 10, 20},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
   3, 3,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
   					/* On K8:
   					    MOVD reg64, xmmreg Double FSTORE 4
 					    MOVD reg32, xmmreg Double FSTORE 4
@@ -1006,6 +1096,16 @@  struct processor_costs amdfam10_cost = {
 							       1/1  1/1
 					    MOVD reg32, xmmreg Double FADD 3
 							       1/1  1/1 */
+  3,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {4, 4, 3, 6, 12},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 5, 10, 20},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {4, 4, 3, 7, 12},			/* cost of unaligned loads.  */
+  {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
   4, 4,					/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -1082,8 +1182,7 @@  const struct processor_costs bdver_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,				     /* cost for loading QImode using movzbl */
   {8, 8, 8},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1102,11 +1201,21 @@  const struct processor_costs bdver_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {12, 12, 10, 40, 60},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
   {10, 10, 10, 40, 60},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
   16, 20,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  8,					/* cost of loading integer register.  */
+  8,					/* cost of storing integer register.  */
+  {12, 12, 10, 40, 60},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 10, 40, 60},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
+  {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  16,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
   16,					/* size of l1 cache.  */
@@ -1187,8 +1296,7 @@  struct processor_costs znver1_cost = {
   8,					/* "large" insn.  */
   9,					/* MOVE_RATIO.  */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
 
   /* reg-reg moves are done by renaming and thus they are even cheaper than
      1 cycle. Becuase reg-reg move cost is 2 and the following tables correspond
@@ -1214,11 +1322,21 @@  struct processor_costs znver1_cost = {
   2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
   {6, 6, 6, 12, 24},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE moves.  */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  8,					/* cost of storing integer register.  */
+  {6, 6, 6, 12, 24},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
+  {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
      throughput 12.  Approx 9 uops do not depend on vector size and every load
      is 7 uops.  */
@@ -1311,8 +1429,7 @@  struct processor_costs znver2_cost = {
   8,					/* "large" insn.  */
   9,					/* MOVE_RATIO.  */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2.  */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
 
   /* reg-reg moves are done by renaming and thus they are even cheaper than
      1 cycle.  Because reg-reg move cost is 2 and following tables correspond
@@ -1339,12 +1456,22 @@  struct processor_costs znver2_cost = {
 					   register.  */
   {6, 6, 6, 10, 20},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit.  */
-  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
   6, 6,					/* SSE->integer and integer->SSE
 					   moves.  */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  8,					/* cost of storing integer register.  */
+  {6, 6, 6, 10, 20},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 8, 16},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
      throughput 12.  Approx 9 uops do not depend on vector size and every load
      is 7 uops.  */
@@ -1438,6 +1565,7 @@  struct processor_costs skylake_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1456,11 +1584,21 @@  struct processor_costs skylake_cost = {
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 10, 20},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 12, 24},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
   2, 2,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  3,					/* cost of storing integer register.  */
+  {6, 6, 6, 10, 20},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 12, 24},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
+  {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  2,					/* cost of moving SSE register to integer.  */
   20, 8,				/* Gather load static, per_elt.  */
   22, 10,				/* Gather store static, per_elt.  */
   64,					/* size of l1 cache.  */
@@ -1529,8 +1667,7 @@  const struct processor_costs btver1_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,				     /* cost for loading QImode using movzbl */
   {6, 8, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1549,11 +1686,21 @@  const struct processor_costs btver1_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {10, 10, 12, 48, 96},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
   14, 14,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {10, 10, 12, 48, 96},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
+  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -1620,8 +1767,7 @@  const struct processor_costs btver2_cost = {
   8,					/* "large" insn */
   9,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,				     /* cost for loading QImode using movzbl */
   {8, 8, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1640,11 +1786,21 @@  const struct processor_costs btver2_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {10, 10, 12, 48, 96},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
   14, 14,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {10, 10, 12, 48, 96},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
+  {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
   10, 10,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -1710,8 +1866,7 @@  struct processor_costs pentium4_cost = {
   16,					/* "large" insn */
   6,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   5,				     /* cost for loading QImode using movzbl */
   {4, 5, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1730,11 +1885,21 @@  struct processor_costs pentium4_cost = {
   12, 24, 48,				/* cost of moving XMM,YMM,ZMM register */
   {16, 16, 16, 32, 64},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {32, 32, 32, 64, 128},		/* cost of unaligned loads.  */
   {16, 16, 16, 32, 64},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {32, 32, 32, 64, 128},		/* cost of unaligned stores.  */
   20, 12,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  2,					/* cost of storing integer register.  */
+  {16, 16, 16, 32, 64},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {16, 16, 16, 32, 64},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {32, 32, 32, 64, 128},		/* cost of unaligned loads.  */
+  {32, 32, 32, 64, 128},		/* cost of unaligned stores.  */
+  12,					/* cost of moving XMM register.  */
+  20,					/* cost of moving SSE register to integer.  */
   16, 16,				/* Gather load static, per_elt.  */
   16, 16,				/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -1803,8 +1968,7 @@  struct processor_costs nocona_cost = {
   16,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   4,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1823,11 +1987,21 @@  struct processor_costs nocona_cost = {
   6, 12, 24,				/* cost of moving XMM,YMM,ZMM register */
   {12, 12, 12, 24, 48},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {24, 24, 24, 48, 96},			/* cost of unaligned loads.  */
   {12, 12, 12, 24, 48},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {24, 24, 24, 48, 96},			/* cost of unaligned stores.  */
   20, 12,				/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  4,					/* cost of storing integer register.  */
+  {12, 12, 12, 24, 48},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {12, 12, 12, 24, 48},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {24, 24, 24, 48, 96},			/* cost of unaligned loads.  */
+  {24, 24, 24, 48, 96},			/* cost of unaligned stores.  */
+  6,					/* cost of moving XMM register.  */
+  20,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
   12, 12,				/* Gather store static, per_elt.  */
   8,					/* size of l1 cache.  */
@@ -1894,8 +2068,7 @@  struct processor_costs atom_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,					/* cost for loading QImode using movzbl */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -1914,11 +2087,21 @@  struct processor_costs atom_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {8, 8, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
   8, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {8, 8, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
+  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
   8, 8,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -1985,8 +2168,7 @@  struct processor_costs slm_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   8,					/* cost for loading QImode using movzbl */
   {8, 8, 8},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2005,11 +2187,21 @@  struct processor_costs slm_cost = {
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   {8, 8, 8, 16, 32},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
   8, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  8,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {8, 8, 8, 16, 32},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {8, 8, 8, 16, 32},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
+  {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
   8, 8,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -2076,8 +2268,7 @@  struct processor_costs intel_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2096,11 +2287,21 @@  struct processor_costs intel_cost = {
   2, 2, 2,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 6, 6},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
   {6, 6, 6, 6, 6},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
   4, 4,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {6, 6, 6, 6, 6},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 6, 6},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
+  {10, 10, 10, 10, 10},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  4,					/* cost of moving SSE register to integer.  */
   6, 6,					/* Gather load static, per_elt.  */
   6, 6,					/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -2174,8 +2375,7 @@  struct processor_costs generic_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {6, 6, 6},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2194,11 +2394,21 @@  struct processor_costs generic_cost = {
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 10, 15},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
   6, 6,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  6,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {6, 6, 6, 10, 15},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
+  {6, 6, 6, 10, 15},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
   18, 6,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
@@ -2278,8 +2488,7 @@  struct processor_costs core_cost = {
   8,					/* "large" insn */
   17,					/* MOVE_RATIO */
 
-  /* All move costs are relative to integer->integer move times 2 and thus
-     they are latency*2. */
+  /* Start of register allocator costs.  integer->integer move cost is 2. */
   6,				     /* cost for loading QImode using movzbl */
   {4, 4, 4},				/* cost of loading integer registers
 					   in QImode, HImode and SImode.
@@ -2298,11 +2507,21 @@  struct processor_costs core_cost = {
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   {6, 6, 6, 6, 12},			/* cost of loading SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {6, 6, 6, 6, 12},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
-  {6, 6, 6, 6, 12},			/* cost of unaligned stores.  */
   2, 2,					/* SSE->integer and integer->SSE moves */
+  /* End of register allocator costs.  */
+
+  4,					/* cost of loading integer register.  */
+  6,					/* cost of storing integer register.  */
+  {6, 6, 6, 6, 12},			/* cost of loading SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 6, 12},			/* cost of storing SSE register
+					   in 32bit, 64bit, 128bit, 256bit and 512bit */
+  {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
+  {6, 6, 6, 6, 12},			/* cost of unaligned stores.  */
+  2,					/* cost of moving XMM register.  */
+  2,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
      rec. throughput 6.
      So 5 uops statically and one uops per load.  */
-- 
2.20.1