Patchwork Correct cost model for strided loads

login
register
mail settings
Submitter William J. Schmidt
Date June 10, 2012, 3:58 p.m.
Message ID <1339343898.20419.8.camel@gnopaine>
Download mbox | patch
Permalink /patch/164005/
State New
Headers show

Comments

William J. Schmidt - June 10, 2012, 3:58 p.m.
The fix for PR53331 caused a degradation to 187.facerec on
powerpc64-unknown-linux-gnu.  The following simple patch reverses the
degradation without otherwise affecting SPEC cpu2000 or cpu2006.
Bootstrapped and regtested on that platform with no new regressions.  Ok
for trunk?

Thanks,
Bill


2012-06-10  Bill Schmidt  <wschmidt@linux.ibm.com>

	* tree-vect-stmts.c (vect_model_load_cost):  Change cost model
	for strided loads.
Richard Guenther - June 11, 2012, 9:15 a.m.
On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> The fix for PR53331 caused a degradation to 187.facerec on
> powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> Bootstrapped and regtested on that platform with no new regressions.  Ok
> for trunk?

Well, would the real cost not be subparts * scalar_to_vec plus
subparts * vec_perm?
At least vec_perm isn't the cost for building up a vector from N scalar elements
either (it might be close enough if subparts == 2).  What's the case
with facerec
here?  Does it have subparts == 2?  I really wanted to pessimize this case
for say AVX and char elements, thus building up a vector from 32 scalars which
certainly does not cost a mere vec_perm.  So, maybe special-case the
subparts == 2 case and assume vec_perm would match the cost only in that
case.

Thanks,
Richard.

> Thanks,
> Bill
>
>
> 2012-06-10  Bill Schmidt  <wschmidt@linux.ibm.com>
>
>        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
>        for strided loads.
>
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       (revision 188341)
> +++ gcc/tree-vect-stmts.c       (working copy)
> @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>   /* The loads themselves.  */
>   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
>     {
> -      /* N scalar loads plus gathering them into a vector.
> -         ???  scalar_to_vec isn't the cost for that.  */
> +      /* N scalar loads plus gathering them into a vector.  */
>       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
>                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
> -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>     }
>   else
>     vect_get_load_cost (first_dr, ncopies,
>
>
William J. Schmidt - June 11, 2012, 1:54 p.m.
On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > The fix for PR53331 caused a degradation to 187.facerec on
> > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > for trunk?
> 
> Well, would the real cost not be subparts * scalar_to_vec plus
> subparts * vec_perm?
> At least vec_perm isn't the cost for building up a vector from N scalar elements
> either (it might be close enough if subparts == 2).  What's the case
> with facerec
> here?  Does it have subparts == 2?  

In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
PowerPC, this requires two merge instructions and a permute instruction
to get the four 32-bit quantities into the right place in a 128-bit
register.  Currently this is modeled as a vec_perm in other parts of the
vectorizer per Ira's earlier patches, so I naively changed this to do
the same thing.

The types of vectorizer instructions aren't documented, and I can't
infer much from the i386.c cost model, so I need a little education.
What semantics are represented by scalar_to_vec?

On PowerPC, we have this mapping of the floating-point registers and
vector float registers where they overlap (the low-order half of each of
the first 32 vector float regs corresponds to a scalar float reg).  So
in this case we have four scalar loads that place things in the bottom
half of four vector registers, two vector merge instructions that
collapse the four registers into two vector registers, and a vector
permute that gets things in the right order.(*)  I wonder if what we
refer to as a merge instruction is similar to scalar_to_vec.

If so, then maybe we need something like

     subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
     inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
     inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
     inside_cost += ncopies * vect_get_stmt_cost (vec_perm);

But then we'd have to change how vec_perm is modeled elsewhere for
PowerPC based on Ira's earlier patches.  As I said, it's difficult for
me to figure out all the intent of cost model decisions that have been
made in the past, using current documentation.

> I really wanted to pessimize this case
> for say AVX and char elements, thus building up a vector from 32 scalars which
> certainly does not cost a mere vec_perm.  So, maybe special-case the
> subparts == 2 case and assume vec_perm would match the cost only in that
> case.

(I'm a little confused by this as what you have at the moment is a
single scalar_to_vec per copy, which has a cost of 1 on most i386
targets (occasionally 2).  The subparts multiplier is only applied to
the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

(*) There are actually a couple more instructions here to convert 64-bit
values to 32-bit values, since on PowerPC 32-bit loads are converted to
64-bit values in scalar float registers and they have to be coerced back
to 32-bit.  Very ugly.  The cost model currently doesn't represent this
at all, which I'll have to look at fixing at some point in some way that
isn't too nasty for the other targets.  The cost model for PowerPC seems
to need a lot of TLC.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> >
> >
> > 2012-06-10  Bill Schmidt  <wschmidt@linux.ibm.com>
> >
> >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> >        for strided loads.
> >
> >
> > Index: gcc/tree-vect-stmts.c
> > ===================================================================
> > --- gcc/tree-vect-stmts.c       (revision 188341)
> > +++ gcc/tree-vect-stmts.c       (working copy)
> > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> >   /* The loads themselves.  */
> >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> >     {
> > -      /* N scalar loads plus gathering them into a vector.
> > -         ???  scalar_to_vec isn't the cost for that.  */
> > +      /* N scalar loads plus gathering them into a vector.  */
> >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
> > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> >     }
> >   else
> >     vect_get_load_cost (first_dr, ncopies,
> >
> >
>
Richard Guenther - June 11, 2012, 2:10 p.m.
On Mon, 11 Jun 2012, William J. Schmidt wrote:

> 
> 
> On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> > <wschmidt@linux.vnet.ibm.com> wrote:
> > > The fix for PR53331 caused a degradation to 187.facerec on
> > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > > for trunk?
> > 
> > Well, would the real cost not be subparts * scalar_to_vec plus
> > subparts * vec_perm?
> > At least vec_perm isn't the cost for building up a vector from N scalar elements
> > either (it might be close enough if subparts == 2).  What's the case
> > with facerec
> > here?  Does it have subparts == 2?  
> 
> In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
> PowerPC, this requires two merge instructions and a permute instruction
> to get the four 32-bit quantities into the right place in a 128-bit
> register.  Currently this is modeled as a vec_perm in other parts of the
> vectorizer per Ira's earlier patches, so I naively changed this to do
> the same thing.

I see.

> The types of vectorizer instructions aren't documented, and I can't
> infer much from the i386.c cost model, so I need a little education.
> What semantics are represented by scalar_to_vec?

It's a vector splat, thus x -> { x, x, x, ... }.  You can create
{ x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
as VEC_PERM_EXPR, takes two input vectors).  That's by far not
the most efficient way to build up such a vector of course (with AVX
you could do one splat plus N - 1 inserts for example).  The cost
is of course dependent on the number of vector elements, so a
simple new enum vect_cost_for_stmt kind does not cover it but
the target would have to look at the vector type passed and do
some reasonable guess.

> On PowerPC, we have this mapping of the floating-point registers and
> vector float registers where they overlap (the low-order half of each of
> the first 32 vector float regs corresponds to a scalar float reg).  So
> in this case we have four scalar loads that place things in the bottom
> half of four vector registers, two vector merge instructions that
> collapse the four registers into two vector registers, and a vector
> permute that gets things in the right order.(*)  I wonder if what we
> refer to as a merge instruction is similar to scalar_to_vec.

Looks similar to x86 SSE then.

> If so, then maybe we need something like
> 
>      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
>      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
>      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> 
> But then we'd have to change how vec_perm is modeled elsewhere for
> PowerPC based on Ira's earlier patches.  As I said, it's difficult for
> me to figure out all the intent of cost model decisions that have been
> made in the past, using current documentation.

Heh, usually the intent was to make the changes simple, not to compute
a proper cost.

I think we simply need a new scalars_to_vec cost kind.

> > I really wanted to pessimize this case
> > for say AVX and char elements, thus building up a vector from 32 scalars which
> > certainly does not cost a mere vec_perm.  So, maybe special-case the
> > subparts == 2 case and assume vec_perm would match the cost only in that
> > case.
> 
> (I'm a little confused by this as what you have at the moment is a
> single scalar_to_vec per copy, which has a cost of 1 on most i386
> targets (occasionally 2).  The subparts multiplier is only applied to
> the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

Oh, I somehow read the patch as you were removing the multiplication
by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
to reflect it with N scalar loads plus N splats plus N - 1 permutes
originally.  You could also model it with N scalar loads plus N
inserts (but we don't have a vec_insert cost either).  I think adding
a scalars_to_vec or vec_init or however we want to call it - basically
what the cost of a vector CONSTRUCTOR would be - is best.

> (*) There are actually a couple more instructions here to convert 64-bit
> values to 32-bit values, since on PowerPC 32-bit loads are converted to
> 64-bit values in scalar float registers and they have to be coerced back
> to 32-bit.  Very ugly.  The cost model currently doesn't represent this
> at all, which I'll have to look at fixing at some point in some way that
> isn't too nasty for the other targets.  The cost model for PowerPC seems
> to need a lot of TLC.

Maybe the above would be a possible way to do it.  On x86 for example
a vector of two doubles is extremely cheap for generic SSE2 to construct,
we can directly load into the low/high part, thus when we use it as
N scalar loads plus the vector-build the vector-build would be free.

Thanks,
Richard.

> Thanks,
> Bill
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > Thanks,
> > > Bill
> > >
> > >
> > > 2012-06-10  Bill Schmidt  <wschmidt@linux.ibm.com>
> > >
> > >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> > >        for strided loads.
> > >
> > >
> > > Index: gcc/tree-vect-stmts.c
> > > ===================================================================
> > > --- gcc/tree-vect-stmts.c       (revision 188341)
> > > +++ gcc/tree-vect-stmts.c       (working copy)
> > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> > >   /* The loads themselves.  */
> > >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> > >     {
> > > -      /* N scalar loads plus gathering them into a vector.
> > > -         ???  scalar_to_vec isn't the cost for that.  */
> > > +      /* N scalar loads plus gathering them into a vector.  */
> > >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> > >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
> > > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > >     }
> > >   else
> > >     vect_get_load_cost (first_dr, ncopies,
> > >
> > >
> > 
> 
>
William J. Schmidt - June 11, 2012, 3:01 p.m.
On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
> On Mon, 11 Jun 2012, William J. Schmidt wrote:
> 
> > 
> > 
> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> > > <wschmidt@linux.vnet.ibm.com> wrote:
> > > > The fix for PR53331 caused a degradation to 187.facerec on
> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > > > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > > > for trunk?
> > > 
> > > Well, would the real cost not be subparts * scalar_to_vec plus
> > > subparts * vec_perm?
> > > At least vec_perm isn't the cost for building up a vector from N scalar elements
> > > either (it might be close enough if subparts == 2).  What's the case
> > > with facerec
> > > here?  Does it have subparts == 2?  
> > 
> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
> > PowerPC, this requires two merge instructions and a permute instruction
> > to get the four 32-bit quantities into the right place in a 128-bit
> > register.  Currently this is modeled as a vec_perm in other parts of the
> > vectorizer per Ira's earlier patches, so I naively changed this to do
> > the same thing.
> 
> I see.
> 
> > The types of vectorizer instructions aren't documented, and I can't
> > infer much from the i386.c cost model, so I need a little education.
> > What semantics are represented by scalar_to_vec?
> 
> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
> the most efficient way to build up such a vector of course (with AVX
> you could do one splat plus N - 1 inserts for example).  The cost
> is of course dependent on the number of vector elements, so a
> simple new enum vect_cost_for_stmt kind does not cover it but
> the target would have to look at the vector type passed and do
> some reasonable guess.

Ah, splat!  Yes, that's lingo I understand.  I see the intent now.

> 
> > On PowerPC, we have this mapping of the floating-point registers and
> > vector float registers where they overlap (the low-order half of each of
> > the first 32 vector float regs corresponds to a scalar float reg).  So
> > in this case we have four scalar loads that place things in the bottom
> > half of four vector registers, two vector merge instructions that
> > collapse the four registers into two vector registers, and a vector
> > permute that gets things in the right order.(*)  I wonder if what we
> > refer to as a merge instruction is similar to scalar_to_vec.
> 
> Looks similar to x86 SSE then.
> 
> > If so, then maybe we need something like
> > 
> >      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
> >      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
> >      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
> >      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > 
> > But then we'd have to change how vec_perm is modeled elsewhere for
> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
> > me to figure out all the intent of cost model decisions that have been
> > made in the past, using current documentation.
> 
> Heh, usually the intent was to make the changes simple, not to compute
> a proper cost.
> 
> I think we simply need a new scalars_to_vec cost kind.

That works.  Maybe vec_construct gets the point across a little better?
I think we need to use the full builtin_vectorization_cost interface
instead of vect_get_stmt_cost here, so the targets can parameterize on
type.  Then we can just do one cost calculation for vec_construct that
covers the full costs of getting the vector in order after the loads.

> 
> > > I really wanted to pessimize this case
> > > for say AVX and char elements, thus building up a vector from 32 scalars which
> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
> > > subparts == 2 case and assume vec_perm would match the cost only in that
> > > case.
> > 
> > (I'm a little confused by this as what you have at the moment is a
> > single scalar_to_vec per copy, which has a cost of 1 on most i386
> > targets (occasionally 2).  The subparts multiplier is only applied to
> > the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
> 
> Oh, I somehow read the patch as you were removing the multiplication
> by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
> to reflect it with N scalar loads plus N splats plus N - 1 permutes
> originally.  You could also model it with N scalar loads plus N
> inserts (but we don't have a vec_insert cost either).  I think adding
> a scalars_to_vec or vec_init or however we want to call it - basically
> what the cost of a vector CONSTRUCTOR would be - is best.
> 
> > (*) There are actually a couple more instructions here to convert 64-bit
> > values to 32-bit values, since on PowerPC 32-bit loads are converted to
> > 64-bit values in scalar float registers and they have to be coerced back
> > to 32-bit.  Very ugly.  The cost model currently doesn't represent this
> > at all, which I'll have to look at fixing at some point in some way that
> > isn't too nasty for the other targets.  The cost model for PowerPC seems
> > to need a lot of TLC.
> 
> Maybe the above would be a possible way to do it.  On x86 for example
> a vector of two doubles is extremely cheap for generic SSE2 to construct,
> we can directly load into the low/high part, thus when we use it as
> N scalar loads plus the vector-build the vector-build would be free.

Absolutely.  As long as the vector type is available, this can be built
into vec_construct's logic in the target.  I will feel much more
comfortable having better estimation for the ugly 32-bit case.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> > 
> > > 
> > > Thanks,
> > > Richard.
> > > 
> > > > Thanks,
> > > > Bill
> > > >
> > > >
> > > > 2012-06-10  Bill Schmidt  <wschmidt@linux.ibm.com>
> > > >
> > > >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> > > >        for strided loads.
> > > >
> > > >
> > > > Index: gcc/tree-vect-stmts.c
> > > > ===================================================================
> > > > --- gcc/tree-vect-stmts.c       (revision 188341)
> > > > +++ gcc/tree-vect-stmts.c       (working copy)
> > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> > > >   /* The loads themselves.  */
> > > >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> > > >     {
> > > > -      /* N scalar loads plus gathering them into a vector.
> > > > -         ???  scalar_to_vec isn't the cost for that.  */
> > > > +      /* N scalar loads plus gathering them into a vector.  */
> > > >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> > > >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
> > > > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > > > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > > >     }
> > > >   else
> > > >     vect_get_load_cost (first_dr, ncopies,
> > > >
> > > >
> > > 
> > 
> > 
>
Richard Guenther - June 11, 2012, 6:37 p.m.
On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
>
>
> On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
>> On Mon, 11 Jun 2012, William J. Schmidt wrote:
>>
>> >
>> >
>> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
>> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
>> > > <wschmidt@linux.vnet.ibm.com> wrote:
>> > > > The fix for PR53331 caused a degradation to 187.facerec on
>> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
>> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
>> > > > Bootstrapped and regtested on that platform with no new regressions.  Ok
>> > > > for trunk?
>> > >
>> > > Well, would the real cost not be subparts * scalar_to_vec plus
>> > > subparts * vec_perm?
>> > > At least vec_perm isn't the cost for building up a vector from N scalar elements
>> > > either (it might be close enough if subparts == 2).  What's the case
>> > > with facerec
>> > > here?  Does it have subparts == 2?
>> >
>> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
>> > PowerPC, this requires two merge instructions and a permute instruction
>> > to get the four 32-bit quantities into the right place in a 128-bit
>> > register.  Currently this is modeled as a vec_perm in other parts of the
>> > vectorizer per Ira's earlier patches, so I naively changed this to do
>> > the same thing.
>>
>> I see.
>>
>> > The types of vectorizer instructions aren't documented, and I can't
>> > infer much from the i386.c cost model, so I need a little education.
>> > What semantics are represented by scalar_to_vec?
>>
>> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
>> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
>> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
>> the most efficient way to build up such a vector of course (with AVX
>> you could do one splat plus N - 1 inserts for example).  The cost
>> is of course dependent on the number of vector elements, so a
>> simple new enum vect_cost_for_stmt kind does not cover it but
>> the target would have to look at the vector type passed and do
>> some reasonable guess.
>
> Ah, splat!  Yes, that's lingo I understand.  I see the intent now.
>
>>
>> > On PowerPC, we have this mapping of the floating-point registers and
>> > vector float registers where they overlap (the low-order half of each of
>> > the first 32 vector float regs corresponds to a scalar float reg).  So
>> > in this case we have four scalar loads that place things in the bottom
>> > half of four vector registers, two vector merge instructions that
>> > collapse the four registers into two vector registers, and a vector
>> > permute that gets things in the right order.(*)  I wonder if what we
>> > refer to as a merge instruction is similar to scalar_to_vec.
>>
>> Looks similar to x86 SSE then.
>>
>> > If so, then maybe we need something like
>> >
>> >      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>> >      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
>> >      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
>> >      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>> >
>> > But then we'd have to change how vec_perm is modeled elsewhere for
>> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
>> > me to figure out all the intent of cost model decisions that have been
>> > made in the past, using current documentation.
>>
>> Heh, usually the intent was to make the changes simple, not to compute
>> a proper cost.
>>
>> I think we simply need a new scalars_to_vec cost kind.
>
> That works.  Maybe vec_construct gets the point across a little better?
> I think we need to use the full builtin_vectorization_cost interface
> instead of vect_get_stmt_cost here, so the targets can parameterize on
> type.  Then we can just do one cost calculation for vec_construct that
> covers the full costs of getting the vector in order after the loads.

Yes, vec_construct sounds good to me.

>>
>> > > I really wanted to pessimize this case
>> > > for say AVX and char elements, thus building up a vector from 32 scalars which
>> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
>> > > subparts == 2 case and assume vec_perm would match the cost only in that
>> > > case.
>> >
>> > (I'm a little confused by this as what you have at the moment is a
>> > single scalar_to_vec per copy, which has a cost of 1 on most i386
>> > targets (occasionally 2).  The subparts multiplier is only applied to
>> > the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
>>
>> Oh, I somehow read the patch as you were removing the multiplication
>> by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
>> to reflect it with N scalar loads plus N splats plus N - 1 permutes
>> originally.  You could also model it with N scalar loads plus N
>> inserts (but we don't have a vec_insert cost either).  I think adding
>> a scalars_to_vec or vec_init or however we want to call it - basically
>> what the cost of a vector CONSTRUCTOR would be - is best.
>>
>> > (*) There are actually a couple more instructions here to convert 64-bit
>> > values to 32-bit values, since on PowerPC 32-bit loads are converted to
>> > 64-bit values in scalar float registers and they have to be coerced back
>> > to 32-bit.  Very ugly.  The cost model currently doesn't represent this
>> > at all, which I'll have to look at fixing at some point in some way that
>> > isn't too nasty for the other targets.  The cost model for PowerPC seems
>> > to need a lot of TLC.
>>
>> Maybe the above would be a possible way to do it.  On x86 for example
>> a vector of two doubles is extremely cheap for generic SSE2 to construct,
>> we can directly load into the low/high part, thus when we use it as
>> N scalar loads plus the vector-build the vector-build would be free.
>
> Absolutely.  As long as the vector type is available, this can be built
> into vec_construct's logic in the target.  I will feel much more
> comfortable having better estimation for the ugly 32-bit case.

Indeed.

Richard.

> Thanks,
> Bill
>
>>
>> Thanks,
>> Richard.
>>
>> > Thanks,
>> > Bill
>> >
>> > >
>> > > Thanks,
>> > > Richard.
>> > >
>> > > > Thanks,
>> > > > Bill
>> > > >
>> > > >
>> > > > 2012-06-10  Bill Schmidt  <wschmidt@linux.ibm.com>
>> > > >
>> > > >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
>> > > >        for strided loads.
>> > > >
>> > > >
>> > > > Index: gcc/tree-vect-stmts.c
>> > > > ===================================================================
>> > > > --- gcc/tree-vect-stmts.c       (revision 188341)
>> > > > +++ gcc/tree-vect-stmts.c       (working copy)
>> > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
>> > > >   /* The loads themselves.  */
>> > > >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
>> > > >     {
>> > > > -      /* N scalar loads plus gathering them into a vector.
>> > > > -         ???  scalar_to_vec isn't the cost for that.  */
>> > > > +      /* N scalar loads plus gathering them into a vector.  */
>> > > >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
>> > > >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
>> > > > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
>> > > > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>> > > >     }
>> > > >   else
>> > > >     vect_get_load_cost (first_dr, ncopies,
>> > > >
>> > > >
>> > >
>> >
>> >
>>
>
Richard Guenther - June 12, 2012, 10:59 a.m.
On Mon, Jun 11, 2012 at 8:37 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
>>
>>
>> On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
>>> On Mon, 11 Jun 2012, William J. Schmidt wrote:
>>>
>>> >
>>> >
>>> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
>>> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
>>> > > <wschmidt@linux.vnet.ibm.com> wrote:
>>> > > > The fix for PR53331 caused a degradation to 187.facerec on
>>> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
>>> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
>>> > > > Bootstrapped and regtested on that platform with no new regressions.  Ok
>>> > > > for trunk?
>>> > >
>>> > > Well, would the real cost not be subparts * scalar_to_vec plus
>>> > > subparts * vec_perm?
>>> > > At least vec_perm isn't the cost for building up a vector from N scalar elements
>>> > > either (it might be close enough if subparts == 2).  What's the case
>>> > > with facerec
>>> > > here?  Does it have subparts == 2?
>>> >
>>> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
>>> > PowerPC, this requires two merge instructions and a permute instruction
>>> > to get the four 32-bit quantities into the right place in a 128-bit
>>> > register.  Currently this is modeled as a vec_perm in other parts of the
>>> > vectorizer per Ira's earlier patches, so I naively changed this to do
>>> > the same thing.
>>>
>>> I see.
>>>
>>> > The types of vectorizer instructions aren't documented, and I can't
>>> > infer much from the i386.c cost model, so I need a little education.
>>> > What semantics are represented by scalar_to_vec?
>>>
>>> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
>>> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
>>> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
>>> the most efficient way to build up such a vector of course (with AVX
>>> you could do one splat plus N - 1 inserts for example).  The cost
>>> is of course dependent on the number of vector elements, so a
>>> simple new enum vect_cost_for_stmt kind does not cover it but
>>> the target would have to look at the vector type passed and do
>>> some reasonable guess.
>>
>> Ah, splat!  Yes, that's lingo I understand.  I see the intent now.
>>
>>>
>>> > On PowerPC, we have this mapping of the floating-point registers and
>>> > vector float registers where they overlap (the low-order half of each of
>>> > the first 32 vector float regs corresponds to a scalar float reg).  So
>>> > in this case we have four scalar loads that place things in the bottom
>>> > half of four vector registers, two vector merge instructions that
>>> > collapse the four registers into two vector registers, and a vector
>>> > permute that gets things in the right order.(*)  I wonder if what we
>>> > refer to as a merge instruction is similar to scalar_to_vec.
>>>
>>> Looks similar to x86 SSE then.
>>>
>>> > If so, then maybe we need something like
>>> >
>>> >      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>> >      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
>>> >      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
>>> >      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>>> >
>>> > But then we'd have to change how vec_perm is modeled elsewhere for
>>> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
>>> > me to figure out all the intent of cost model decisions that have been
>>> > made in the past, using current documentation.
>>>
>>> Heh, usually the intent was to make the changes simple, not to compute
>>> a proper cost.
>>>
>>> I think we simply need a new scalars_to_vec cost kind.
>>
>> That works.  Maybe vec_construct gets the point across a little better?
>> I think we need to use the full builtin_vectorization_cost interface
>> instead of vect_get_stmt_cost here, so the targets can parameterize on
>> type.  Then we can just do one cost calculation for vec_construct that
>> covers the full costs of getting the vector in order after the loads.
>
> Yes, vec_construct sounds good to me.
>
>>>
>>> > > I really wanted to pessimize this case
>>> > > for say AVX and char elements, thus building up a vector from 32 scalars which
>>> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
>>> > > subparts == 2 case and assume vec_perm would match the cost only in that
>>> > > case.
>>> >
>>> > (I'm a little confused by this as what you have at the moment is a
>>> > single scalar_to_vec per copy, which has a cost of 1 on most i386
>>> > targets (occasionally 2).  The subparts multiplier is only applied to
>>> > the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
>>>
>>> Oh, I somehow read the patch as you were removing the multiplication
>>> by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
>>> to reflect it with N scalar loads plus N splats plus N - 1 permutes
>>> originally.  You could also model it with N scalar loads plus N
>>> inserts (but we don't have a vec_insert cost either).  I think adding
>>> a scalars_to_vec or vec_init or however we want to call it - basically
>>> what the cost of a vector CONSTRUCTOR would be - is best.
>>>
>>> > (*) There are actually a couple more instructions here to convert 64-bit
>>> > values to 32-bit values, since on PowerPC 32-bit loads are converted to
>>> > 64-bit values in scalar float registers and they have to be coerced back
>>> > to 32-bit.  Very ugly.  The cost model currently doesn't represent this
>>> > at all, which I'll have to look at fixing at some point in some way that
>>> > isn't too nasty for the other targets.  The cost model for PowerPC seems
>>> > to need a lot of TLC.
>>>
>>> Maybe the above would be a possible way to do it.  On x86 for example
>>> a vector of two doubles is extremely cheap for generic SSE2 to construct,
>>> we can directly load into the low/high part, thus when we use it as
>>> N scalar loads plus the vector-build the vector-build would be free.
>>
>> Absolutely.  As long as the vector type is available, this can be built
>> into vec_construct's logic in the target.  I will feel much more
>> comfortable having better estimation for the ugly 32-bit case.
>
> Indeed.

Btw, with PR53533 I now have a case where multiplications of v4si are
really expensive on x86 without SSE 4.1.  But we only have vect_stmt_cost
and no further subdivision ...

Thus we'd need a tree_code argument to the cost hook.  Though it gets
quite overloaded then, so maybe splitting it into one handling loads/stores
(and get the misalign parameter) and one handling only vector_stmt but
with a tree_code argument.  Or splitting it even further, seeing
cond_branch_taken ...

Richard.
William J. Schmidt - June 12, 2012, 11:59 a.m.
On Tue, 2012-06-12 at 12:59 +0200, Richard Guenther wrote:

> Btw, with PR53533 I now have a case where multiplications of v4si are
> really expensive on x86 without SSE 4.1.  But we only have vect_stmt_cost
> and no further subdivision ...
> 
> Thus we'd need a tree_code argument to the cost hook.  Though it gets
> quite overloaded then, so maybe splitting it into one handling loads/stores
> (and get the misalign parameter) and one handling only vector_stmt but
> with a tree_code argument.  Or splitting it even further, seeing
> cond_branch_taken ...

Yes, I think subdividing the hook for the vector_stmt kind is pretty
much inevitable -- more situations like this expensive multiply will
arise.  I agree with the interface starting to get messy also.
Splitting it is probably the way to go -- a little painful but keeping
it all in one hook is going to get ugly.

Bill

> 
> Richard.
>

Patch

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 188341)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -1031,11 +1031,10 @@  vect_model_load_cost (stmt_vec_info stmt_info, int
   /* The loads themselves.  */
   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
     {
-      /* N scalar loads plus gathering them into a vector.
-         ???  scalar_to_vec isn't the cost for that.  */
+      /* N scalar loads plus gathering them into a vector.  */
       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
 		      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
-      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
+      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
     }
   else
     vect_get_load_cost (first_dr, ncopies,