Submitted by William J. Schmidt on June 10, 2012, 3:58 p.m.

Message ID | 1339343898.20419.8.camel@gnopaine |
---|---|

State | New |

Headers | show |

On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt <wschmidt@linux.vnet.ibm.com> wrote: > The fix for PR53331 caused a degradation to 187.facerec on > powerpc64-unknown-linux-gnu. The following simple patch reverses the > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > Bootstrapped and regtested on that platform with no new regressions. Ok > for trunk? Well, would the real cost not be subparts * scalar_to_vec plus subparts * vec_perm? At least vec_perm isn't the cost for building up a vector from N scalar elements either (it might be close enough if subparts == 2). What's the case with facerec here? Does it have subparts == 2? I really wanted to pessimize this case for say AVX and char elements, thus building up a vector from 32 scalars which certainly does not cost a mere vec_perm. So, maybe special-case the subparts == 2 case and assume vec_perm would match the cost only in that case. Thanks, Richard. > Thanks, > Bill > > > 2012-06-10 Bill Schmidt <wschmidt@linux.ibm.com> > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > for strided loads. > > > Index: gcc/tree-vect-stmts.c > =================================================================== > --- gcc/tree-vect-stmts.c (revision 188341) > +++ gcc/tree-vect-stmts.c (working copy) > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int > /* The loads themselves. */ > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > { > - /* N scalar loads plus gathering them into a vector. > - ??? scalar_to_vec isn't the cost for that. */ > + /* N scalar loads plus gathering them into a vector. */ > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > } > else > vect_get_load_cost (first_dr, ncopies, > >

On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt > <wschmidt@linux.vnet.ibm.com> wrote: > > The fix for PR53331 caused a degradation to 187.facerec on > > powerpc64-unknown-linux-gnu. The following simple patch reverses the > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > > Bootstrapped and regtested on that platform with no new regressions. Ok > > for trunk? > > Well, would the real cost not be subparts * scalar_to_vec plus > subparts * vec_perm? > At least vec_perm isn't the cost for building up a vector from N scalar elements > either (it might be close enough if subparts == 2). What's the case > with facerec > here? Does it have subparts == 2? In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On PowerPC, this requires two merge instructions and a permute instruction to get the four 32-bit quantities into the right place in a 128-bit register. Currently this is modeled as a vec_perm in other parts of the vectorizer per Ira's earlier patches, so I naively changed this to do the same thing. The types of vectorizer instructions aren't documented, and I can't infer much from the i386.c cost model, so I need a little education. What semantics are represented by scalar_to_vec? On PowerPC, we have this mapping of the floating-point registers and vector float registers where they overlap (the low-order half of each of the first 32 vector float regs corresponds to a scalar float reg). So in this case we have four scalar loads that place things in the bottom half of four vector registers, two vector merge instructions that collapse the four registers into two vector registers, and a vector permute that gets things in the right order.(*) I wonder if what we refer to as a merge instruction is similar to scalar_to_vec. If so, then maybe we need something like subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; inside_cost += ncopies * vect_get_stmt_cost (vec_perm); But then we'd have to change how vec_perm is modeled elsewhere for PowerPC based on Ira's earlier patches. As I said, it's difficult for me to figure out all the intent of cost model decisions that have been made in the past, using current documentation. > I really wanted to pessimize this case > for say AVX and char elements, thus building up a vector from 32 scalars which > certainly does not cost a mere vec_perm. So, maybe special-case the > subparts == 2 case and assume vec_perm would match the cost only in that > case. (I'm a little confused by this as what you have at the moment is a single scalar_to_vec per copy, which has a cost of 1 on most i386 targets (occasionally 2). The subparts multiplier is only applied to the loads. So changing this to vec_perm seemed to be a no-op for i386.) (*) There are actually a couple more instructions here to convert 64-bit values to 32-bit values, since on PowerPC 32-bit loads are converted to 64-bit values in scalar float registers and they have to be coerced back to 32-bit. Very ugly. The cost model currently doesn't represent this at all, which I'll have to look at fixing at some point in some way that isn't too nasty for the other targets. The cost model for PowerPC seems to need a lot of TLC. Thanks, Bill > > Thanks, > Richard. > > > Thanks, > > Bill > > > > > > 2012-06-10 Bill Schmidt <wschmidt@linux.ibm.com> > > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > > for strided loads. > > > > > > Index: gcc/tree-vect-stmts.c > > =================================================================== > > --- gcc/tree-vect-stmts.c (revision 188341) > > +++ gcc/tree-vect-stmts.c (working copy) > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int > > /* The loads themselves. */ > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > > { > > - /* N scalar loads plus gathering them into a vector. > > - ??? scalar_to_vec isn't the cost for that. */ > > + /* N scalar loads plus gathering them into a vector. */ > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > } > > else > > vect_get_load_cost (first_dr, ncopies, > > > > >

On Mon, 11 Jun 2012, William J. Schmidt wrote: > > > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt > > <wschmidt@linux.vnet.ibm.com> wrote: > > > The fix for PR53331 caused a degradation to 187.facerec on > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > > > Bootstrapped and regtested on that platform with no new regressions. Ok > > > for trunk? > > > > Well, would the real cost not be subparts * scalar_to_vec plus > > subparts * vec_perm? > > At least vec_perm isn't the cost for building up a vector from N scalar elements > > either (it might be close enough if subparts == 2). What's the case > > with facerec > > here? Does it have subparts == 2? > > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On > PowerPC, this requires two merge instructions and a permute instruction > to get the four 32-bit quantities into the right place in a 128-bit > register. Currently this is modeled as a vec_perm in other parts of the > vectorizer per Ira's earlier patches, so I naively changed this to do > the same thing. I see. > The types of vectorizer instructions aren't documented, and I can't > infer much from the i386.c cost model, so I need a little education. > What semantics are represented by scalar_to_vec? It's a vector splat, thus x -> { x, x, x, ... }. You can create { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, as VEC_PERM_EXPR, takes two input vectors). That's by far not the most efficient way to build up such a vector of course (with AVX you could do one splat plus N - 1 inserts for example). The cost is of course dependent on the number of vector elements, so a simple new enum vect_cost_for_stmt kind does not cover it but the target would have to look at the vector type passed and do some reasonable guess. > On PowerPC, we have this mapping of the floating-point registers and > vector float registers where they overlap (the low-order half of each of > the first 32 vector float regs corresponds to a scalar float reg). So > in this case we have four scalar loads that place things in the bottom > half of four vector registers, two vector merge instructions that > collapse the four registers into two vector registers, and a vector > permute that gets things in the right order.(*) I wonder if what we > refer to as a merge instruction is similar to scalar_to_vec. Looks similar to x86 SSE then. > If so, then maybe we need something like > > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > But then we'd have to change how vec_perm is modeled elsewhere for > PowerPC based on Ira's earlier patches. As I said, it's difficult for > me to figure out all the intent of cost model decisions that have been > made in the past, using current documentation. Heh, usually the intent was to make the changes simple, not to compute a proper cost. I think we simply need a new scalars_to_vec cost kind. > > I really wanted to pessimize this case > > for say AVX and char elements, thus building up a vector from 32 scalars which > > certainly does not cost a mere vec_perm. So, maybe special-case the > > subparts == 2 case and assume vec_perm would match the cost only in that > > case. > > (I'm a little confused by this as what you have at the moment is a > single scalar_to_vec per copy, which has a cost of 1 on most i386 > targets (occasionally 2). The subparts multiplier is only applied to > the loads. So changing this to vec_perm seemed to be a no-op for i386.) Oh, I somehow read the patch as you were removing the multiplication by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted to reflect it with N scalar loads plus N splats plus N - 1 permutes originally. You could also model it with N scalar loads plus N inserts (but we don't have a vec_insert cost either). I think adding a scalars_to_vec or vec_init or however we want to call it - basically what the cost of a vector CONSTRUCTOR would be - is best. > (*) There are actually a couple more instructions here to convert 64-bit > values to 32-bit values, since on PowerPC 32-bit loads are converted to > 64-bit values in scalar float registers and they have to be coerced back > to 32-bit. Very ugly. The cost model currently doesn't represent this > at all, which I'll have to look at fixing at some point in some way that > isn't too nasty for the other targets. The cost model for PowerPC seems > to need a lot of TLC. Maybe the above would be a possible way to do it. On x86 for example a vector of two doubles is extremely cheap for generic SSE2 to construct, we can directly load into the low/high part, thus when we use it as N scalar loads plus the vector-build the vector-build would be free. Thanks, Richard. > Thanks, > Bill > > > > > Thanks, > > Richard. > > > > > Thanks, > > > Bill > > > > > > > > > 2012-06-10 Bill Schmidt <wschmidt@linux.ibm.com> > > > > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > > > for strided loads. > > > > > > > > > Index: gcc/tree-vect-stmts.c > > > =================================================================== > > > --- gcc/tree-vect-stmts.c (revision 188341) > > > +++ gcc/tree-vect-stmts.c (working copy) > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int > > > /* The loads themselves. */ > > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > > > { > > > - /* N scalar loads plus gathering them into a vector. > > > - ??? scalar_to_vec isn't the cost for that. */ > > > + /* N scalar loads plus gathering them into a vector. */ > > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); > > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > > } > > > else > > > vect_get_load_cost (first_dr, ncopies, > > > > > > > > > >

On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: > On Mon, 11 Jun 2012, William J. Schmidt wrote: > > > > > > > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: > > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt > > > <wschmidt@linux.vnet.ibm.com> wrote: > > > > The fix for PR53331 caused a degradation to 187.facerec on > > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the > > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > > > > Bootstrapped and regtested on that platform with no new regressions. Ok > > > > for trunk? > > > > > > Well, would the real cost not be subparts * scalar_to_vec plus > > > subparts * vec_perm? > > > At least vec_perm isn't the cost for building up a vector from N scalar elements > > > either (it might be close enough if subparts == 2). What's the case > > > with facerec > > > here? Does it have subparts == 2? > > > > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On > > PowerPC, this requires two merge instructions and a permute instruction > > to get the four 32-bit quantities into the right place in a 128-bit > > register. Currently this is modeled as a vec_perm in other parts of the > > vectorizer per Ira's earlier patches, so I naively changed this to do > > the same thing. > > I see. > > > The types of vectorizer instructions aren't documented, and I can't > > infer much from the i386.c cost model, so I need a little education. > > What semantics are represented by scalar_to_vec? > > It's a vector splat, thus x -> { x, x, x, ... }. You can create > { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, > as VEC_PERM_EXPR, takes two input vectors). That's by far not > the most efficient way to build up such a vector of course (with AVX > you could do one splat plus N - 1 inserts for example). The cost > is of course dependent on the number of vector elements, so a > simple new enum vect_cost_for_stmt kind does not cover it but > the target would have to look at the vector type passed and do > some reasonable guess. Ah, splat! Yes, that's lingo I understand. I see the intent now. > > > On PowerPC, we have this mapping of the floating-point registers and > > vector float registers where they overlap (the low-order half of each of > > the first 32 vector float regs corresponds to a scalar float reg). So > > in this case we have four scalar loads that place things in the bottom > > half of four vector registers, two vector merge instructions that > > collapse the four registers into two vector registers, and a vector > > permute that gets things in the right order.(*) I wonder if what we > > refer to as a merge instruction is similar to scalar_to_vec. > > Looks similar to x86 SSE then. > > > If so, then maybe we need something like > > > > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); > > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; > > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; > > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > > > But then we'd have to change how vec_perm is modeled elsewhere for > > PowerPC based on Ira's earlier patches. As I said, it's difficult for > > me to figure out all the intent of cost model decisions that have been > > made in the past, using current documentation. > > Heh, usually the intent was to make the changes simple, not to compute > a proper cost. > > I think we simply need a new scalars_to_vec cost kind. That works. Maybe vec_construct gets the point across a little better? I think we need to use the full builtin_vectorization_cost interface instead of vect_get_stmt_cost here, so the targets can parameterize on type. Then we can just do one cost calculation for vec_construct that covers the full costs of getting the vector in order after the loads. > > > > I really wanted to pessimize this case > > > for say AVX and char elements, thus building up a vector from 32 scalars which > > > certainly does not cost a mere vec_perm. So, maybe special-case the > > > subparts == 2 case and assume vec_perm would match the cost only in that > > > case. > > > > (I'm a little confused by this as what you have at the moment is a > > single scalar_to_vec per copy, which has a cost of 1 on most i386 > > targets (occasionally 2). The subparts multiplier is only applied to > > the loads. So changing this to vec_perm seemed to be a no-op for i386.) > > Oh, I somehow read the patch as you were removing the multiplication > by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted > to reflect it with N scalar loads plus N splats plus N - 1 permutes > originally. You could also model it with N scalar loads plus N > inserts (but we don't have a vec_insert cost either). I think adding > a scalars_to_vec or vec_init or however we want to call it - basically > what the cost of a vector CONSTRUCTOR would be - is best. > > > (*) There are actually a couple more instructions here to convert 64-bit > > values to 32-bit values, since on PowerPC 32-bit loads are converted to > > 64-bit values in scalar float registers and they have to be coerced back > > to 32-bit. Very ugly. The cost model currently doesn't represent this > > at all, which I'll have to look at fixing at some point in some way that > > isn't too nasty for the other targets. The cost model for PowerPC seems > > to need a lot of TLC. > > Maybe the above would be a possible way to do it. On x86 for example > a vector of two doubles is extremely cheap for generic SSE2 to construct, > we can directly load into the low/high part, thus when we use it as > N scalar loads plus the vector-build the vector-build would be free. Absolutely. As long as the vector type is available, this can be built into vec_construct's logic in the target. I will feel much more comfortable having better estimation for the ugly 32-bit case. Thanks, Bill > > Thanks, > Richard. > > > Thanks, > > Bill > > > > > > > > Thanks, > > > Richard. > > > > > > > Thanks, > > > > Bill > > > > > > > > > > > > 2012-06-10 Bill Schmidt <wschmidt@linux.ibm.com> > > > > > > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > > > > for strided loads. > > > > > > > > > > > > Index: gcc/tree-vect-stmts.c > > > > =================================================================== > > > > --- gcc/tree-vect-stmts.c (revision 188341) > > > > +++ gcc/tree-vect-stmts.c (working copy) > > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int > > > > /* The loads themselves. */ > > > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > > > > { > > > > - /* N scalar loads plus gathering them into a vector. > > > > - ??? scalar_to_vec isn't the cost for that. */ > > > > + /* N scalar loads plus gathering them into a vector. */ > > > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > > > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); > > > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > > > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > > > } > > > > else > > > > vect_get_load_cost (first_dr, ncopies, > > > > > > > > > > > > > > > >

On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt <wschmidt@linux.vnet.ibm.com> wrote: > > > On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: >> On Mon, 11 Jun 2012, William J. Schmidt wrote: >> >> > >> > >> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: >> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt >> > > <wschmidt@linux.vnet.ibm.com> wrote: >> > > > The fix for PR53331 caused a degradation to 187.facerec on >> > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the >> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. >> > > > Bootstrapped and regtested on that platform with no new regressions. Ok >> > > > for trunk? >> > > >> > > Well, would the real cost not be subparts * scalar_to_vec plus >> > > subparts * vec_perm? >> > > At least vec_perm isn't the cost for building up a vector from N scalar elements >> > > either (it might be close enough if subparts == 2). What's the case >> > > with facerec >> > > here? Does it have subparts == 2? >> > >> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On >> > PowerPC, this requires two merge instructions and a permute instruction >> > to get the four 32-bit quantities into the right place in a 128-bit >> > register. Currently this is modeled as a vec_perm in other parts of the >> > vectorizer per Ira's earlier patches, so I naively changed this to do >> > the same thing. >> >> I see. >> >> > The types of vectorizer instructions aren't documented, and I can't >> > infer much from the i386.c cost model, so I need a little education. >> > What semantics are represented by scalar_to_vec? >> >> It's a vector splat, thus x -> { x, x, x, ... }. You can create >> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, >> as VEC_PERM_EXPR, takes two input vectors). That's by far not >> the most efficient way to build up such a vector of course (with AVX >> you could do one splat plus N - 1 inserts for example). The cost >> is of course dependent on the number of vector elements, so a >> simple new enum vect_cost_for_stmt kind does not cover it but >> the target would have to look at the vector type passed and do >> some reasonable guess. > > Ah, splat! Yes, that's lingo I understand. I see the intent now. > >> >> > On PowerPC, we have this mapping of the floating-point registers and >> > vector float registers where they overlap (the low-order half of each of >> > the first 32 vector float regs corresponds to a scalar float reg). So >> > in this case we have four scalar loads that place things in the bottom >> > half of four vector registers, two vector merge instructions that >> > collapse the four registers into two vector registers, and a vector >> > permute that gets things in the right order.(*) I wonder if what we >> > refer to as a merge instruction is similar to scalar_to_vec. >> >> Looks similar to x86 SSE then. >> >> > If so, then maybe we need something like >> > >> > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); >> > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; >> > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; >> > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); >> > >> > But then we'd have to change how vec_perm is modeled elsewhere for >> > PowerPC based on Ira's earlier patches. As I said, it's difficult for >> > me to figure out all the intent of cost model decisions that have been >> > made in the past, using current documentation. >> >> Heh, usually the intent was to make the changes simple, not to compute >> a proper cost. >> >> I think we simply need a new scalars_to_vec cost kind. > > That works. Maybe vec_construct gets the point across a little better? > I think we need to use the full builtin_vectorization_cost interface > instead of vect_get_stmt_cost here, so the targets can parameterize on > type. Then we can just do one cost calculation for vec_construct that > covers the full costs of getting the vector in order after the loads. Yes, vec_construct sounds good to me. >> >> > > I really wanted to pessimize this case >> > > for say AVX and char elements, thus building up a vector from 32 scalars which >> > > certainly does not cost a mere vec_perm. So, maybe special-case the >> > > subparts == 2 case and assume vec_perm would match the cost only in that >> > > case. >> > >> > (I'm a little confused by this as what you have at the moment is a >> > single scalar_to_vec per copy, which has a cost of 1 on most i386 >> > targets (occasionally 2). The subparts multiplier is only applied to >> > the loads. So changing this to vec_perm seemed to be a no-op for i386.) >> >> Oh, I somehow read the patch as you were removing the multiplication >> by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted >> to reflect it with N scalar loads plus N splats plus N - 1 permutes >> originally. You could also model it with N scalar loads plus N >> inserts (but we don't have a vec_insert cost either). I think adding >> a scalars_to_vec or vec_init or however we want to call it - basically >> what the cost of a vector CONSTRUCTOR would be - is best. >> >> > (*) There are actually a couple more instructions here to convert 64-bit >> > values to 32-bit values, since on PowerPC 32-bit loads are converted to >> > 64-bit values in scalar float registers and they have to be coerced back >> > to 32-bit. Very ugly. The cost model currently doesn't represent this >> > at all, which I'll have to look at fixing at some point in some way that >> > isn't too nasty for the other targets. The cost model for PowerPC seems >> > to need a lot of TLC. >> >> Maybe the above would be a possible way to do it. On x86 for example >> a vector of two doubles is extremely cheap for generic SSE2 to construct, >> we can directly load into the low/high part, thus when we use it as >> N scalar loads plus the vector-build the vector-build would be free. > > Absolutely. As long as the vector type is available, this can be built > into vec_construct's logic in the target. I will feel much more > comfortable having better estimation for the ugly 32-bit case. Indeed. Richard. > Thanks, > Bill > >> >> Thanks, >> Richard. >> >> > Thanks, >> > Bill >> > >> > > >> > > Thanks, >> > > Richard. >> > > >> > > > Thanks, >> > > > Bill >> > > > >> > > > >> > > > 2012-06-10 Bill Schmidt <wschmidt@linux.ibm.com> >> > > > >> > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model >> > > > for strided loads. >> > > > >> > > > >> > > > Index: gcc/tree-vect-stmts.c >> > > > =================================================================== >> > > > --- gcc/tree-vect-stmts.c (revision 188341) >> > > > +++ gcc/tree-vect-stmts.c (working copy) >> > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int >> > > > /* The loads themselves. */ >> > > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) >> > > > { >> > > > - /* N scalar loads plus gathering them into a vector. >> > > > - ??? scalar_to_vec isn't the cost for that. */ >> > > > + /* N scalar loads plus gathering them into a vector. */ >> > > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies >> > > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); >> > > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); >> > > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); >> > > > } >> > > > else >> > > > vect_get_load_cost (first_dr, ncopies, >> > > > >> > > > >> > > >> > >> > >> >

On Mon, Jun 11, 2012 at 8:37 PM, Richard Guenther <richard.guenther@gmail.com> wrote: > On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt > <wschmidt@linux.vnet.ibm.com> wrote: >> >> >> On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: >>> On Mon, 11 Jun 2012, William J. Schmidt wrote: >>> >>> > >>> > >>> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: >>> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt >>> > > <wschmidt@linux.vnet.ibm.com> wrote: >>> > > > The fix for PR53331 caused a degradation to 187.facerec on >>> > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the >>> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. >>> > > > Bootstrapped and regtested on that platform with no new regressions. Ok >>> > > > for trunk? >>> > > >>> > > Well, would the real cost not be subparts * scalar_to_vec plus >>> > > subparts * vec_perm? >>> > > At least vec_perm isn't the cost for building up a vector from N scalar elements >>> > > either (it might be close enough if subparts == 2). What's the case >>> > > with facerec >>> > > here? Does it have subparts == 2? >>> > >>> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On >>> > PowerPC, this requires two merge instructions and a permute instruction >>> > to get the four 32-bit quantities into the right place in a 128-bit >>> > register. Currently this is modeled as a vec_perm in other parts of the >>> > vectorizer per Ira's earlier patches, so I naively changed this to do >>> > the same thing. >>> >>> I see. >>> >>> > The types of vectorizer instructions aren't documented, and I can't >>> > infer much from the i386.c cost model, so I need a little education. >>> > What semantics are represented by scalar_to_vec? >>> >>> It's a vector splat, thus x -> { x, x, x, ... }. You can create >>> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, >>> as VEC_PERM_EXPR, takes two input vectors). That's by far not >>> the most efficient way to build up such a vector of course (with AVX >>> you could do one splat plus N - 1 inserts for example). The cost >>> is of course dependent on the number of vector elements, so a >>> simple new enum vect_cost_for_stmt kind does not cover it but >>> the target would have to look at the vector type passed and do >>> some reasonable guess. >> >> Ah, splat! Yes, that's lingo I understand. I see the intent now. >> >>> >>> > On PowerPC, we have this mapping of the floating-point registers and >>> > vector float registers where they overlap (the low-order half of each of >>> > the first 32 vector float regs corresponds to a scalar float reg). So >>> > in this case we have four scalar loads that place things in the bottom >>> > half of four vector registers, two vector merge instructions that >>> > collapse the four registers into two vector registers, and a vector >>> > permute that gets things in the right order.(*) I wonder if what we >>> > refer to as a merge instruction is similar to scalar_to_vec. >>> >>> Looks similar to x86 SSE then. >>> >>> > If so, then maybe we need something like >>> > >>> > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); >>> > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; >>> > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2; >>> > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); >>> > >>> > But then we'd have to change how vec_perm is modeled elsewhere for >>> > PowerPC based on Ira's earlier patches. As I said, it's difficult for >>> > me to figure out all the intent of cost model decisions that have been >>> > made in the past, using current documentation. >>> >>> Heh, usually the intent was to make the changes simple, not to compute >>> a proper cost. >>> >>> I think we simply need a new scalars_to_vec cost kind. >> >> That works. Maybe vec_construct gets the point across a little better? >> I think we need to use the full builtin_vectorization_cost interface >> instead of vect_get_stmt_cost here, so the targets can parameterize on >> type. Then we can just do one cost calculation for vec_construct that >> covers the full costs of getting the vector in order after the loads. > > Yes, vec_construct sounds good to me. > >>> >>> > > I really wanted to pessimize this case >>> > > for say AVX and char elements, thus building up a vector from 32 scalars which >>> > > certainly does not cost a mere vec_perm. So, maybe special-case the >>> > > subparts == 2 case and assume vec_perm would match the cost only in that >>> > > case. >>> > >>> > (I'm a little confused by this as what you have at the moment is a >>> > single scalar_to_vec per copy, which has a cost of 1 on most i386 >>> > targets (occasionally 2). The subparts multiplier is only applied to >>> > the loads. So changing this to vec_perm seemed to be a no-op for i386.) >>> >>> Oh, I somehow read the patch as you were removing the multiplication >>> by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted >>> to reflect it with N scalar loads plus N splats plus N - 1 permutes >>> originally. You could also model it with N scalar loads plus N >>> inserts (but we don't have a vec_insert cost either). I think adding >>> a scalars_to_vec or vec_init or however we want to call it - basically >>> what the cost of a vector CONSTRUCTOR would be - is best. >>> >>> > (*) There are actually a couple more instructions here to convert 64-bit >>> > values to 32-bit values, since on PowerPC 32-bit loads are converted to >>> > 64-bit values in scalar float registers and they have to be coerced back >>> > to 32-bit. Very ugly. The cost model currently doesn't represent this >>> > at all, which I'll have to look at fixing at some point in some way that >>> > isn't too nasty for the other targets. The cost model for PowerPC seems >>> > to need a lot of TLC. >>> >>> Maybe the above would be a possible way to do it. On x86 for example >>> a vector of two doubles is extremely cheap for generic SSE2 to construct, >>> we can directly load into the low/high part, thus when we use it as >>> N scalar loads plus the vector-build the vector-build would be free. >> >> Absolutely. As long as the vector type is available, this can be built >> into vec_construct's logic in the target. I will feel much more >> comfortable having better estimation for the ugly 32-bit case. > > Indeed. Btw, with PR53533 I now have a case where multiplications of v4si are really expensive on x86 without SSE 4.1. But we only have vect_stmt_cost and no further subdivision ... Thus we'd need a tree_code argument to the cost hook. Though it gets quite overloaded then, so maybe splitting it into one handling loads/stores (and get the misalign parameter) and one handling only vector_stmt but with a tree_code argument. Or splitting it even further, seeing cond_branch_taken ... Richard.

On Tue, 2012-06-12 at 12:59 +0200, Richard Guenther wrote: > Btw, with PR53533 I now have a case where multiplications of v4si are > really expensive on x86 without SSE 4.1. But we only have vect_stmt_cost > and no further subdivision ... > > Thus we'd need a tree_code argument to the cost hook. Though it gets > quite overloaded then, so maybe splitting it into one handling loads/stores > (and get the misalign parameter) and one handling only vector_stmt but > with a tree_code argument. Or splitting it even further, seeing > cond_branch_taken ... Yes, I think subdividing the hook for the vector_stmt kind is pretty much inevitable -- more situations like this expensive multiply will arise. I agree with the interface starting to get messy also. Splitting it is probably the way to go -- a little painful but keeping it all in one hook is going to get ugly. Bill > > Richard. >

Index: gcc/tree-vect-stmts.c =================================================================== --- gcc/tree-vect-stmts.c (revision 188341) +++ gcc/tree-vect-stmts.c (working copy) @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int /* The loads themselves. */ if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) { - /* N scalar loads plus gathering them into a vector. - ??? scalar_to_vec isn't the cost for that. */ + /* N scalar loads plus gathering them into a vector. */ inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info))); - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); } else vect_get_load_cost (first_dr, ncopies,