Message ID: 1442413689.2896.45.camel@gnopaine
State: New
On Wed, Sep 16, 2015 at 10:28 AM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote: > Hi, > > A recent patch proposal from Alan Hayward > (https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00690.html) uncovered > that the PowerPC back end doesn't have expansions for > reduc_{smax,smin,umax,umin}_<mode> and > reduc_{smax,smin,umax,umin}_scal_<mode> for the integer modes. This > prevents vectorization of reductions involving comparisons that can be > transformed into REDUC_{MAX,MIN}_EXPR expressions. This patch adds > these expansions. > > PowerPC does not have hardware reduction instructions for maximum and > minimum. However, we can emulate this with varying degrees of > efficiency for different modes. The size of the expansion is > logarithmic in the number of vector elements E. The expansions for > reduc_{smax,smin,umax,umin}_<mode> consist of log E stages, each > comprising a rotate operation and a maximum or minimum operation. After > stage N, the maximum value in the vector will appear in at least 2^N > consecutive positions in the intermediate result. > > The ...scal_<mode> expansions just invoke the related non-scalar > expansions, and then extract an arbitrary element from the result > vector. > > The expansions for V16QI, V8HI, and V4SI require TARGET_ALTIVEC. The > expansions for V2DI make use of vector instructions added for ISA 2.07, > so they require TARGET_P8_VECTOR. > > I was able to use iterators for the sub-doubleword ...scal_<mode> > expansions, but that's all. I experimented with trying to use > code_iterators to generate the {smax,smin,umax,umin} expansions, but > couldn't find a way to make that work, as the substitution wasn't being > done into the UNSPEC constants. If there is a way to do this, please > let me know and I'll try to reduce the code size. > > There are already a number of common reduction execution tests that > exercise this logic. 
I've also added PowerPC-specific code generation > tests to verify the patterns produce what's expected. These are based > on the existing execution tests. > > Some future work will be required: > > (1) The vectorization cost model does not currently allow us to > distinguish between reductions of additions and reductions of max/min. > On PowerPC, these costs are very different, as the former is supported > by hardware and the latter is not. After this patch is applied, we will > possibly vectorize some code when it's not profitable to do so. I think > it's probably best to go ahead with this patch now, and deal with the > cost model as a separate issue after Alan's patch is complete and > upstream. > > (2) The use of rs6000_expand_vector_extract to obtain a scalar from a > vector is not optimal for sub-doubleword modes using the latest > hardware. Currently this generates a vector store followed by a scalar > load, which is Very Bad. We should instead use a mfvsrd and sign- or > zero-extend the rightmost element in the result GPR. To accomplish > this, we should update rs6000_expand_vector_extract to do the more > general thing: mfvsrd, shift the selected element into the rightmost > position, and extend it. At that time we should change the _scal_<mode> > expansions to select the element number that avoids the shift (that > number will differ for BE and LE). > > Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no > regressions. Is this ok for trunk? > > Thanks, > Bill > > > [gcc] > > 2015-09-16 Bill Schmidt <wschmidt@linux.vnet.ibm.com> > > * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN, > UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL, > UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL, > UNSPEC_REDUC_UMIN_SCAL): New enumerated constants. > (reduc_smax_v2di): New define_expand. > (reduc_smax_scal_v2di): Likewise. > (reduc_smin_v2di): Likewise. > (reduc_smin_scal_v2di): Likewise. > (reduc_umax_v2di): Likewise. 
> (reduc_umax_scal_v2di): Likewise. > (reduc_umin_v2di): Likewise. > (reduc_umin_scal_v2di): Likewise. > (reduc_smax_v4si): Likewise. > (reduc_smin_v4si): Likewise. > (reduc_umax_v4si): Likewise. > (reduc_umin_v4si): Likewise. > (reduc_smax_v8hi): Likewise. > (reduc_smin_v8hi): Likewise. > (reduc_umax_v8hi): Likewise. > (reduc_umin_v8hi): Likewise. > (reduc_smax_v16qi): Likewise. > (reduc_smin_v16qi): Likewise. > (reduc_umax_v16qi): Likewise. > (reduc_umin_v16qi): Likewise. > (reduc_smax_scal_<mode>): Likewise. > (reduc_smin_scal_<mode>): Likewise. > (reduc_umax_scal_<mode>): Likewise. > (reduc_umin_scal_<mode>): Likewise. > > [gcc/testsuite] > > 2015-09-16 Bill Schmidt <wschmidt@linux.vnet.ibm.com> > > * gcc.target/powerpc/vect-reduc-minmax-char.c: New. > * gcc.target/powerpc/vect-reduc-minmax-short.c: New. > * gcc.target/powerpc/vect-reduc-minmax-int.c: New. > * gcc.target/powerpc/vect-reduc-minmax-long.c: New. This is okay. I don't think that I have seen iterators for UNSPECs, but maybe someone else is aware of the right idiom. Thanks, David
On Wed, Sep 16, 2015 at 3:34 PM, David Edelsohn <dje.gcc@gmail.com> wrote: > On Wed, Sep 16, 2015 at 10:28 AM, Bill Schmidt > > This is okay. > > I don't think that I have seen iterators for UNSPECs, but maybe > someone else is aware of the right idiom. https://gcc.gnu.org/onlinedocs/gccint/Int-Iterators.html. define_int_iterator was added precisely for this functionality in the aarch64 back end. Regards, Ramana > > Thanks, David
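For reference, the idiom Ramana points to looks roughly like the fragment below. This is a sketch only: the iterator and attribute names are hypothetical, and the real uses are in the aarch64 back end and the gccint "Int Iterators" documentation:

```lisp
;; Iterate one expander over several unspec constants.
(define_int_iterator REDUC_MAXMIN
  [UNSPEC_REDUC_SMAX UNSPEC_REDUC_SMIN
   UNSPEC_REDUC_UMAX UNSPEC_REDUC_UMIN])

;; Map each unspec to the name fragment used in the pattern name.
(define_int_attr reduc_op
  [(UNSPEC_REDUC_SMAX "smax") (UNSPEC_REDUC_SMIN "smin")
   (UNSPEC_REDUC_UMAX "umax") (UNSPEC_REDUC_UMIN "umin")])

;; Expands to reduc_smax_v4si, reduc_smin_v4si, and so on.
(define_expand "reduc_<reduc_op>_v4si"
  [(set (match_operand:V4SI 0 "register_operand" "")
        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "")]
                     REDUC_MAXMIN))]
  "TARGET_ALTIVEC"
{
  /* ... emit the expansion and call DONE ...  */
})
```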
On Wed, 2015-09-16 at 15:37 +0100, Ramana Radhakrishnan wrote: > On Wed, Sep 16, 2015 at 3:34 PM, David Edelsohn <dje.gcc@gmail.com> wrote: > > On Wed, Sep 16, 2015 at 10:28 AM, Bill Schmidt > > > > > This is okay. > > > > I don't think that I have seen iterators for UNSPECs, but maybe > > someone else is aware of the right idiom. > > https://gcc.gnu.org/onlinedocs/gccint/Int-Iterators.html . > define_int_iterator was added precisely for this functionality in the > aarch64 backend. Ah, thanks! Much obliged, Ramana. Bill > > > > regards > Ramana > > > > > Thanks, David >
On Wed, Sep 16, 2015 at 09:28:09AM -0500, Bill Schmidt wrote: > I was able to use iterators for the sub-doubleword ...scal_<mode> > expansions, but that's all. I experimented with trying to use > code_iterators to generate the {smax,smin,umax,umin} expansions, but > couldn't find a way to make that work, as the substitution wasn't being > done into the UNSPEC constants. If there is a way to do this, please > let me know and I'll try to reduce the code size. Do you need the unspecs at all? They are only used in the pattern of expanders, and all such expanders call DONE. You can instead write those patterns as just USEs of their operands? Segher
On Wed, 2015-09-16 at 10:14 -0500, Segher Boessenkool wrote: > On Wed, Sep 16, 2015 at 09:28:09AM -0500, Bill Schmidt wrote: > > I was able to use iterators for the sub-doubleword ...scal_<mode> > > expansions, but that's all. I experimented with trying to use > > code_iterators to generate the {smax,smin,umax,umin} expansions, but > > couldn't find a way to make that work, as the substitution wasn't being > > done into the UNSPEC constants. If there is a way to do this, please > > let me know and I'll try to reduce the code size. > > Do you need the unspecs at all? They are only used in the pattern of > expanders, and all such expanders call DONE. You can instead write > those patterns as just USEs of their operands? Hi Segher, Yes, I had just independently come to the same conclusion -- since we are always calling DONE, I don't need the UNSPECs at all. I'm verifying to be sure, but this should solve my problem. Thanks! Bill > > > Segher >
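Segher's suggestion amounts to something like the following sketch (not the committed pattern): because the expander always calls DONE, the RTL template is never used for matching, so plain `use`s of the operands suffice and no unspec constants are needed at all:

```lisp
(define_expand "reduc_smax_v4si"
  [(use (match_operand:V4SI 0 "register_operand" ""))
   (use (match_operand:V4SI 1 "register_operand" ""))]
  "TARGET_ALTIVEC"
{
  /* Emit the rotate/max sequence directly; since we always
     call DONE, the template above is only a placeholder that
     declares the operands.  */
  /* ... */
  DONE;
})
```

With the unspecs gone, ordinary code iterators can generate the smax/smin/umax/umin variants, since nothing needs to be substituted into UNSPEC constants anymore.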
On 16/09/15 15:28, Bill Schmidt wrote: > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com> > > [ChangeLog entries snipped] You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if the _scal are present. The non-_scal's were previously defined as producing a vector with one element holding the result and the other elements all zero, and this was only ever used with a vec_extract immediately after; the _scal pattern now includes the vec_extract as well. Hence the non-_scal patterns are deprecated / considered legacy, as per md.texi. I proposed a patch to migrate PPC off the old patterns, but have forgotten to ping it recently - last at https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!) --Alan
On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > On 16/09/15 15:28, Bill Schmidt wrote: > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com> > > > > [ChangeLog entries snipped] > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if > the _scal are present. The non-_scal's were previously defined as producing a > vector with one element holding the result and the other elements all zero, and > this was only ever used with a vec_extract immediately after; the _scal pattern > now includes the vec_extract as well. Hence the non-_scal patterns are > deprecated / considered legacy, as per md.texi. Thanks -- I had misread the description of the non-scalar versions, missing the part where the other elements are zero. What I really want/need is an optab defined as computing the maximum value in all elements of the vector.
This seems like a strange thing to want, but Alan Hayward's proposed patch will cause us to generate the scalar version, followed by a broadcast of the vector. Since our patterns already generate the maximum value in all positions, this creates an unnecessary extract followed by an unnecessary broadcast. As discussed elsewhere, we *could* remove the unnecessary code by recognizing this in simplify-rtx, etc., but the vectorization cost modeling would be wrong. It would have still told us to model this as a vec_to_scalar for the reduc_max_scal, and a vec_stmt for the broadcast. This would overcount the cost of the reduction compared to what we would actually generate. To get this right for all targets, one could envision having a new optab for a reduction-to-vector, which most targets wouldn't implement, but PowerPC and AArch32, at least, would. If a target has a reduction-to-vector, the vectorizer would have to generate a different GIMPLE code that mapped to this; otherwise it would do the REDUC_MAX_EXPR and the broadcast. This obviously starts to get complicated, since adding a GIMPLE code certainly has a nontrivial cost. :/ Perhaps the practical thing is to have the vectorizer also do an add_stmt_cost with some new token that indicates the cost model should make an adjustment if the back end doesn't need the extract/broadcast. Targets like PowerPC and AArch32 could then subtract the unnecessary cost, and remove the unnecessary code in simplify-rtx. Copying Richi and ARM folks for opinions on the best design. I want to be able to model this stuff as accurately as possible, but obviously we need to avoid unnecessary effects on other architectures. In any case, I will remove implementing the deprecated optabs, and I'll also try to look at Alan L's patch shortly. Thanks, Bill > > I proposed a patch to migrate PPC off the old patterns, but have forgotten to > ping it recently - last at > https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!) > > --Alan >
On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > > I proposed a patch to migrate PPC off the old patterns, but have forgotten to > ping it recently - last at > https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!) > Hi Alan, Thanks for this patch. I see that you tested it on gcc110, which is a big-endian platform. I think the pattern for V4SF might have an endian problem on little-endian, but I'm not positive just eyeballing it. (I think that the select of element 3 will address the wrong end of the vector for LE.) Can you please try the patch on gcc112 as well to set my mind at ease? Thanks, Bill > --Alan >
On 16/09/15 17:10, Bill Schmidt wrote: > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: >> On 16/09/15 15:28, Bill Schmidt wrote: >>> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com> >>> >>> [ChangeLog entries snipped] >> >> You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if >> the _scal are present. The non-_scal's were previously defined as producing a >> vector with one element holding the result and the other elements all zero, and >> this was only ever used with a vec_extract immediately after; the _scal pattern >> now includes the vec_extract as well. Hence the non-_scal patterns are >> deprecated / considered legacy, as per md.texi. > > Thanks -- I had misread the description of the non-scalar versions, > missing the part where the other elements are zero. What I really > want/need is an optab defined as computing the maximum value in all > elements of the vector.
Yes, indeed. It seems reasonable to me that this would coexist with an optab which computes only a single value (i.e. a scalar). At that point it might be appropriate to change the cond-reduction code to generate the reduce-to-vector in all cases, and optabs.c expand it to reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with the (parallel) changes to cost model already proposed, does that cover all the cases? It does add a new tree code, yes, but I'm feeling that could be justified if we go down this route. However, another point that springs to mind: if you reduce a loop containing OR or MUL expressions, the vect_create_epilog_for_reduction reduces these using shifts, and I think will also use shifts for platforms not possessing a reduc_plus/min/max. If shifts could be changed for rotates, the code there would do your reduce-to-a-vector-of-identical-elements in the midend...can we (sensibly!) bring all of these together? > Perhaps the practical thing is to have the vectorizer also do an > add_stmt_cost with some new token that indicates the cost model should > make an adjustment if the back end doesn't need the extract/broadcast. > Targets like PowerPC and AArch32 could then subtract the unnecessary > cost, and remove the unnecessary code in simplify-rtx. I think it'd be good if we could do it before simplify-rtx, really, although I'm not sure I have a strong argument as to why, as long as we can cost it appropriately. > In any case, I will remove implementing the deprecated optabs, and I'll > also try to look at Alan L's patch shortly. That'd be great, thanks :) Cheers, Alan
On 16/09/15 17:19, Bill Schmidt wrote: > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: >> >> I proposed a patch to migrate PPC off the old patterns, but have forgotten to >> ping it recently - last at >> https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!) >> > > Hi Alan, > > Thanks for this patch. I see that you tested it on gcc110, which is a > big-endian platform. I think the pattern for V4SF might have an endian > problem on little-endian, but I'm not positive just eyeballing it. (I > think that the select of element 3 will address the wrong end of the > vector for LE.) Can you please try the patch on gcc112 as well to set > my mind at ease? > > Thanks, > Bill > >> --Alan >> > > I think you are right....I'm just retesting without the patch to rule out other test setup problems etc., but I see more tests failing on gcc112 than I expect (comparing against e.g. https://gcc.gnu.org/ml/gcc-testresults/2015-09/msg01479.html). What's the best way to determine endianness - is it BYTES_BIG_ENDIAN, would that be true on gcc110 but false on gcc112? Cheers, Alan
On Wed, 2015-09-16 at 19:16 +0100, Alan Lawrence wrote: > On 16/09/15 17:19, Bill Schmidt wrote: > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > >> > >> I proposed a patch to migrate PPC off the old patterns, but have forgotten to > >> ping it recently - last at > >> https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!) > >> > > > > Hi Alan, > > > > Thanks for this patch. I see that you tested it on gcc110, which is a > > big-endian platform. I think the pattern for V4SF might have an endian > > problem on little-endian, but I'm not positive just eyeballing it. (I > > think that the select of element 3 will address the wrong end of the > > vector for LE.) Can you please try the patch on gcc112 as well to set > > my mind at ease? > > > > Thanks, > > Bill > > > >> --Alan > >> > > > > > > I think you are right....I'm just retesting without the patch to rule out other > test setup problems etc., but I see more tests failing on gcc112 than I expect > (comparing against e.g. > https://gcc.gnu.org/ml/gcc-testresults/2015-09/msg01479.html). > > What's the best way to determine endianness - is it BYTES_BIG_ENDIAN, would that > be true on gcc110 but false on gcc112? Yes, that's the correct test, thanks! Bill > > Cheers, Alan >
On Wed, 2015-09-16 at 18:03 +0100, Alan Lawrence wrote: > On 16/09/15 17:10, Bill Schmidt wrote: > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > >> On 16/09/15 15:28, Bill Schmidt wrote: > >>> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com> > >>> > >>> [ChangeLog entries snipped] > >> > >> You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if > >> the _scal are present. The non-_scal's were previously defined as producing a > >> vector with one element holding the result and the other elements all zero, and > >> this was only ever used with a vec_extract immediately after; the _scal pattern > >> now includes the vec_extract as well. Hence the non-_scal patterns are > >> deprecated / considered legacy, as per md.texi. 
> > > > Thanks -- I had misread the description of the non-scalar versions, > > missing the part where the other elements are zero. What I really > > want/need is an optab defined as computing the maximum value in all > > elements of the vector. > > Yes, indeed. It seems reasonable to me that this would coexist with an optab > which computes only a single value (i.e. a scalar). > > At that point it might be appropriate to change the cond-reduction code to > generate the reduce-to-vector in all cases, and optabs.c expand it to > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with > the (parallel) changes to cost model already proposed, does that cover all the > cases? It does add a new tree code, yes, but I'm feeling that could be justified > if we go down this route. That's how I envisioned it as well, and it was my original preference, provided the maintainers are ok with it. However, your next suggestion is intriguing... > > However, another point that springs to mind: if you reduce a loop containing OR > or MUL expressions, the vect_create_epilog_for_reduction reduces these using > shifts, and I think will also use shifts for platforms not possessing a > reduc_plus/min/max. If shifts could be changed for rotates, the code there would > do your reduce-to-a-vector-of-identical-elements in the midend...can we > (sensibly!) bring all of these together? Perhaps so. I can have a look at that and see. What I'm calling a rotate is really a double-vector shift where both input vectors are the same, so perhaps this is already pretty close to what we need. Thanks! I'll try to put together some sort of proposal over the next week or so, workload permitting. Bill > > > Perhaps the practical thing is to have the vectorizer also do an > > add_stmt_cost with some new token that indicates the cost model should > > make an adjustment if the back end doesn't need the extract/broadcast. 
> > Targets like PowerPC and AArch32 could then subtract the unnecessary > > cost, and remove the unnecessary code in simplify-rtx. > > I think it'd be good if we could do it before simplify-rtx, really, although I'm > not sure I have a strong argument as to why, as long as we can cost it > appropriately. > > > In any case, I will remove implementing the deprecated optabs, and I'll > > also try to look at Alan L's patch shortly. > > That'd be great, thanks :) > > Cheers, Alan >
On Wed, 16 Sep 2015, Alan Lawrence wrote: > On 16/09/15 17:10, Bill Schmidt wrote: > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > > > On 16/09/15 15:28, Bill Schmidt wrote: > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com> > > > > > > > > [ChangeLog entries snipped] > > > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be > > > used if > > > the _scal are present. The non-_scal's were previously defined as > > > producing a > > > vector with one element holding the result and the other elements all > > > zero, and > > > this was only ever used with a vec_extract immediately after; the _scal > > > pattern > > > now includes the vec_extract as well. Hence the non-_scal patterns are > > > deprecated / considered legacy, as per md.texi. 
> > > > Thanks -- I had misread the description of the non-scalar versions, > > missing the part where the other elements are zero. What I really > > want/need is an optab defined as computing the maximum value in all > > elements of the vector. > > Yes, indeed. It seems reasonable to me that this would coexist with an optab > which computes only a single value (i.e. a scalar). So just to clarify - you need to reduce the vector with max to a scalar but want the (same) result in all vector elements? > At that point it might be appropriate to change the cond-reduction code to > generate the reduce-to-vector in all cases, and optabs.c expand it to > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with > the (parallel) changes to cost model already proposed, does that cover all the > cases? It does add a new tree code, yes, but I'm feeling that could be > justified if we go down this route. I'd rather have combine recognize an insn that does both (if it exists). As I understand powerpc doesn't have reduction to scalar (I think no ISA actually has this, but all ISAs have reduce to one vector element plus a cheap way of extraction (effectively a subreg)) but it's reduction already has all result vector elements set to the same value (as opposed to having some undefined or zero or whatever). > However, another point that springs to mind: if you reduce a loop containing > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these > using shifts, and I think will also use shifts for platforms not possessing a > reduc_plus/min/max. If shifts could be changed for rotates, the code there > would do your reduce-to-a-vector-of-identical-elements in the midend...can we > (sensibly!) bring all of these together? > > > Perhaps the practical thing is to have the vectorizer also do an > > add_stmt_cost with some new token that indicates the cost model should > > make an adjustment if the back end doesn't need the extract/broadcast. 
> > Targets like PowerPC and AArch32 could then subtract the unnecessary > > cost, and remove the unnecessary code in simplify-rtx. > > I think it'd be good if we could do it before simplify-rtx, really, although > I'm not sure I have a strong argument as to why, as long as we can cost it > appropriately. > > > In any case, I will remove implementing the deprecated optabs, and I'll > > also try to look at Alan L's patch shortly. > > That'd be great, thanks :) As for the cost modeling the simple solution is of course to have these as "high-level" costs (a new enum entry), but the sustainable solution is to have the target see the vectorized code and decide for itself. Iff we'd go down the simple route then the target may as well create the vectorized code for that special high-level operation itself. Richard.
On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > On Wed, 16 Sep 2015, Alan Lawrence wrote: > > > On 16/09/15 17:10, Bill Schmidt wrote: > > > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > > > > On 16/09/15 15:28, Bill Schmidt wrote: > > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com> > > > > > > > > > > [ChangeLog entries snipped] > > > > > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be > > > > used if > > > > the _scal are present. 
The non-_scal's were previously defined as > > > > producing a > > > > vector with one element holding the result and the other elements all > > > > zero, and > > > > this was only ever used with a vec_extract immediately after; the _scal > > > > pattern > > > > now includes the vec_extract as well. Hence the non-_scal patterns are > > > > deprecated / considered legacy, as per md.texi. > > > > > > Thanks -- I had misread the description of the non-scalar versions, > > > missing the part where the other elements are zero. What I really > > > want/need is an optab defined as computing the maximum value in all > > > elements of the vector. > > > > Yes, indeed. It seems reasonable to me that this would coexist with an optab > > which computes only a single value (i.e. a scalar). > > So just to clarify - you need to reduce the vector with max to a scalar > but want the (same) result in all vector elements? Yes. Alan Hayward's cond-reduction patch is set up to perform a reduction to scalar, followed by a scalar broadcast to get the value into all positions. It happens that our most efficient expansion to reduce to scalar will naturally produce the value in all positions. Using the reduc_smax_scal_<mode> expander, we are required to extract this into a scalar and then perform the broadcast. Those two operations are unnecessary. We can clean up after the fact, but the cost model will still reflect the unnecessary activity. > > > At that point it might be appropriate to change the cond-reduction code to > > generate the reduce-to-vector in all cases, and optabs.c expand it to > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with > > the (parallel) changes to cost model already proposed, does that cover all the > > cases? It does add a new tree code, yes, but I'm feeling that could be > > justified if we go down this route. > > I'd rather have combine recognize an insn that does both (if it > exists). 
As I understand powerpc doesn't have reduction to scalar > (I think no ISA actually has this, but all ISAs have reduce to > one vector element plus a cheap way of extraction (effectively a subreg)) > but its reduction already has all result vector elements set to the > same value (as opposed to having some undefined or zero or whatever). The extraction is not necessarily so cheap. For floating-point values in a register, we can indeed use a subreg. But in this case we are talking about integers, and an extract must move them from the vector registers to the GPRs. With POWER8, we can do this with a 5-cycle move instruction. Prior to that, we can only do this with a very expensive path through memory, as previous processors had no direct path between the vector and integer registers. > > > However, another point that springs to mind: if you reduce a loop containing > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these > > using shifts, and I think will also use shifts for platforms not possessing a > > reduc_plus/min/max. If shifts could be changed for rotates, the code there > > would do your reduce-to-a-vector-of-identical-elements in the midend...can we > > (sensibly!) bring all of these together? If you think it's reasonable, I will take a look at whether this can work. For targets like PowerPC and AArch32, perhaps we are better off not defining an expander at all, but letting the vector epilogue code expose the sequence of shifts/max-mins at the GIMPLE level. For this case the vectorizer would then not generate the unnecessary extract and broadcast. However, I'd want to let Alan Hayward finish revising and committing his patch first. Regards, Bill > > > > > Perhaps the practical thing is to have the vectorizer also do an > > > add_stmt_cost with some new token that indicates the cost model should > > > make an adjustment if the back end doesn't need the extract/broadcast. 
> > > Targets like PowerPC and AArch32 could then subtract the unnecessary > > > cost, and remove the unnecessary code in simplify-rtx. > > > > I think it'd be good if we could do it before simplify-rtx, really, although > > I'm not sure I have a strong argument as to why, as long as we can cost it > > appropriately. > > > > > In any case, I will remove implementing the deprecated optabs, and I'll > > > also try to look at Alan L's patch shortly. > > > > That'd be great, thanks :) > > As for the cost modeling the simple solution is of course to have > these as "high-level" costs (a new enum entry), but the sustainable > solution is to have the target see the vectorized code and decide > for itself. Iff we'd go down the simple route then the target > may as well create the vectorized code for that special high-level > operation itself. > > Richard. >
On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote: > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > So just to clarify - you need to reduce the vector with max to a scalar > > but want the (same) result in all vector elements? > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > reduction to scalar, followed by a scalar broadcast to get the value > into all positions. It happens that our most efficient expansion to > reduce to scalar will naturally produce the value in all positions. It also is many insns after expand, so relying on combine to combine all that plus the following splat (as Richard suggests below) is not really going to work. If there also are targets where the _scal version is cheaper, maybe we should keep both, and have expand expand to whatever the target supports? Segher
On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote: > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > On Wed, 16 Sep 2015, Alan Lawrence wrote: > > > > > On 16/09/15 17:10, Bill Schmidt wrote: > > > > > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > > > > > On 16/09/15 15:28, Bill Schmidt wrote: > > > > > > 2015-09-16 Bill Schmidt <wschmidt@linux.vnet.ibm.com> > > > > > > > > > > > > * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, > > > > > > UNSPEC_REDUC_SMIN, > > > > > > UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL, > > > > > > UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL, > > > > > > UNSPEC_REDUC_UMIN_SCAL): New enumerated constants. > > > > > > (reduc_smax_v2di): New define_expand. > > > > > > (reduc_smax_scal_v2di): Likewise. > > > > > > (reduc_smin_v2di): Likewise. > > > > > > (reduc_smin_scal_v2di): Likewise. > > > > > > (reduc_umax_v2di): Likewise. > > > > > > (reduc_umax_scal_v2di): Likewise. > > > > > > (reduc_umin_v2di): Likewise. > > > > > > (reduc_umin_scal_v2di): Likewise. > > > > > > (reduc_smax_v4si): Likewise. > > > > > > (reduc_smin_v4si): Likewise. > > > > > > (reduc_umax_v4si): Likewise. > > > > > > (reduc_umin_v4si): Likewise. > > > > > > (reduc_smax_v8hi): Likewise. > > > > > > (reduc_smin_v8hi): Likewise. > > > > > > (reduc_umax_v8hi): Likewise. > > > > > > (reduc_umin_v8hi): Likewise. > > > > > > (reduc_smax_v16qi): Likewise. > > > > > > (reduc_smin_v16qi): Likewise. > > > > > > (reduc_umax_v16qi): Likewise. > > > > > > (reduc_umin_v16qi): Likewise. > > > > > > (reduc_smax_scal_<mode>): Likewise. > > > > > > (reduc_smin_scal_<mode>): Likewise. > > > > > > (reduc_umax_scal_<mode>): Likewise. > > > > > > (reduc_umin_scal_<mode>): Likewise. > > > > > > > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be > > > > > used if > > > > > the _scal are present. 
The non-_scal's were previously defined as > > > > > producing a > > > > > vector with one element holding the result and the other elements all > > > > > zero, and > > > > > this was only ever used with a vec_extract immediately after; the _scal > > > > > pattern > > > > > now includes the vec_extract as well. Hence the non-_scal patterns are > > > > > deprecated / considered legacy, as per md.texi. > > > > > > > > Thanks -- I had misread the description of the non-scalar versions, > > > > missing the part where the other elements are zero. What I really > > > > want/need is an optab defined as computing the maximum value in all > > > > elements of the vector. > > > > > > Yes, indeed. It seems reasonable to me that this would coexist with an optab > > > which computes only a single value (i.e. a scalar). > > > > So just to clarify - you need to reduce the vector with max to a scalar > > but want the (same) result in all vector elements? > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > reduction to scalar, followed by a scalar broadcast to get the value > into all positions. It happens that our most efficient expansion to > reduce to scalar will naturally produce the value in all positions. > Using the reduc_smax_scal_<mode> expander, we are required to extract > this into a scalar and then perform the broadcast. Those two operations > are unnecessary. We can clean up after the fact, but the cost model > will still reflect the unnecessary activity. > > > > > > At that point it might be appropriate to change the cond-reduction code to > > > generate the reduce-to-vector in all cases, and optabs.c expand it to > > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with > > > the (parallel) changes to cost model already proposed, does that cover all the > > > cases? It does add a new tree code, yes, but I'm feeling that could be > > > justified if we go down this route. 
> > > > I'd rather have combine recognize an insn that does both (if it > > exists). As I understand powerpc doesn't have reduction to scalar > > (I think no ISA actually has this, but all ISAs have reduce to > > one vector element plus a cheap way of extraction (effectively a subreg)) > > but it's reduction already has all result vector elements set to the > > same value (as opposed to having some undefined or zero or whatever). > > The extraction is not necessarily so cheap. For floating-point values > in a register, we can indeed use a subreg. But in this case we are > talking about integers, and an extract must move them from the vector > registers to the GPRs. With POWER8, we can do this with a 5-cycle move > instruction. Prior to that, we can only do this with a very expensive > path through memory, as previous processors had no direct path between > the vector and integer registers. > > > > > > However, another point that springs to mind: if you reduce a loop containing > > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these > > > using shifts, and I think will also use shifts for platforms not possessing a > > > reduc_plus/min/max. If shifts could be changed for rotates, the code there > > > would do your reduce-to-a-vector-of-identical-elements in the midend...can we > > > (sensibly!) bring all of these together? > > If you think it's reasonable, I will take a look at whether this can > work. For targets like PowerPC and AArch32, perhaps we are better off > not defining an expander at all, but letting the vector epilogue code > expose the sequence of shifts/max-mins at the GIMPLE level. For this > case the vectorizer would then not generate the unnecessary extract and > broadcast. Well, I don't see an easy way to do this either. We really need either a whole-vector rotate or a double-vector shift to do the expansion early, neither of which has a defined standard pattern name. 
So either way we would need to introduce a new GIMPLE code and a new optab entry. I'm still leaning towards interest in exploring the reduce-to-vector approach. Thanks, Bill > > However, I'd want to let Alan Hayward finish revising and committing his > patch first. > > Regards, > Bill > > > > > > > > Perhaps the practical thing is to have the vectorizer also do an > > > > add_stmt_cost with some new token that indicates the cost model should > > > > make an adjustment if the back end doesn't need the extract/broadcast. > > > > Targets like PowerPC and AArch32 could then subtract the unnecessary > > > > cost, and remove the unnecessary code in simplify-rtx. > > > > > > I think it'd be good if we could do it before simplify-rtx, really, although > > > I'm not sure I have a strong argument as to why, as long as we can cost it > > > appropriately. > > > > > > > In any case, I will remove implementing the deprecated optabs, and I'll > > > > also try to look at Alan L's patch shortly. > > > > > > That'd be great, thanks :) > > > > As for the cost modeling the simple solution is of course to have > > these as "high-level" costs (a new enum entry), but the sustainable > > solution is to have the target see the vectorized code and decide > > for itself. Iff we'd go down the simple route then the target > > may as well create the vectorized code for that special high-level > > operation itself. > > > > Richard. > > >
On Thu, 17 Sep 2015, Bill Schmidt wrote: > On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote: > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > > On Wed, 16 Sep 2015, Alan Lawrence wrote: > > > > > > > On 16/09/15 17:10, Bill Schmidt wrote: > > > > > > > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote: > > > > > > On 16/09/15 15:28, Bill Schmidt wrote: > > > > > > > 2015-09-16 Bill Schmidt <wschmidt@linux.vnet.ibm.com> > > > > > > > > > > > > > > * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, > > > > > > > UNSPEC_REDUC_SMIN, > > > > > > > UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL, > > > > > > > UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL, > > > > > > > UNSPEC_REDUC_UMIN_SCAL): New enumerated constants. > > > > > > > (reduc_smax_v2di): New define_expand. > > > > > > > (reduc_smax_scal_v2di): Likewise. > > > > > > > (reduc_smin_v2di): Likewise. > > > > > > > (reduc_smin_scal_v2di): Likewise. > > > > > > > (reduc_umax_v2di): Likewise. > > > > > > > (reduc_umax_scal_v2di): Likewise. > > > > > > > (reduc_umin_v2di): Likewise. > > > > > > > (reduc_umin_scal_v2di): Likewise. > > > > > > > (reduc_smax_v4si): Likewise. > > > > > > > (reduc_smin_v4si): Likewise. > > > > > > > (reduc_umax_v4si): Likewise. > > > > > > > (reduc_umin_v4si): Likewise. > > > > > > > (reduc_smax_v8hi): Likewise. > > > > > > > (reduc_smin_v8hi): Likewise. > > > > > > > (reduc_umax_v8hi): Likewise. > > > > > > > (reduc_umin_v8hi): Likewise. > > > > > > > (reduc_smax_v16qi): Likewise. > > > > > > > (reduc_smin_v16qi): Likewise. > > > > > > > (reduc_umax_v16qi): Likewise. > > > > > > > (reduc_umin_v16qi): Likewise. > > > > > > > (reduc_smax_scal_<mode>): Likewise. > > > > > > > (reduc_smin_scal_<mode>): Likewise. > > > > > > > (reduc_umax_scal_<mode>): Likewise. > > > > > > > (reduc_umin_scal_<mode>): Likewise. > > > > > > > > > > > > You shouldn't need the non-_scal reductions. 
Indeed, they shouldn't be > > > > > > used if > > > > > > the _scal are present. The non-_scal's were previously defined as > > > > > > producing a > > > > > > vector with one element holding the result and the other elements all > > > > > > zero, and > > > > > > this was only ever used with a vec_extract immediately after; the _scal > > > > > > pattern > > > > > > now includes the vec_extract as well. Hence the non-_scal patterns are > > > > > > deprecated / considered legacy, as per md.texi. > > > > > > > > > > Thanks -- I had misread the description of the non-scalar versions, > > > > > missing the part where the other elements are zero. What I really > > > > > want/need is an optab defined as computing the maximum value in all > > > > > elements of the vector. > > > > > > > > Yes, indeed. It seems reasonable to me that this would coexist with an optab > > > > which computes only a single value (i.e. a scalar). > > > > > > So just to clarify - you need to reduce the vector with max to a scalar > > > but want the (same) result in all vector elements? > > > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > > reduction to scalar, followed by a scalar broadcast to get the value > > into all positions. It happens that our most efficient expansion to > > reduce to scalar will naturally produce the value in all positions. > > Using the reduc_smax_scal_<mode> expander, we are required to extract > > this into a scalar and then perform the broadcast. Those two operations > > are unnecessary. We can clean up after the fact, but the cost model > > will still reflect the unnecessary activity. > > > > > > > > > At that point it might be appropriate to change the cond-reduction code to > > > > generate the reduce-to-vector in all cases, and optabs.c expand it to > > > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with > > > > the (parallel) changes to cost model already proposed, does that cover all the > > > > cases? 
It does add a new tree code, yes, but I'm feeling that could be > > > > justified if we go down this route. > > > > > > I'd rather have combine recognize an insn that does both (if it > > > exists). As I understand powerpc doesn't have reduction to scalar > > > (I think no ISA actually has this, but all ISAs have reduce to > > > one vector element plus a cheap way of extraction (effectively a subreg)) > > > but it's reduction already has all result vector elements set to the > > > same value (as opposed to having some undefined or zero or whatever). > > > > The extraction is not necessarily so cheap. For floating-point values > > in a register, we can indeed use a subreg. But in this case we are > > talking about integers, and an extract must move them from the vector > > registers to the GPRs. With POWER8, we can do this with a 5-cycle move > > instruction. Prior to that, we can only do this with a very expensive > > path through memory, as previous processors had no direct path between > > the vector and integer registers. > > > > > > > > > However, another point that springs to mind: if you reduce a loop containing > > > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these > > > > using shifts, and I think will also use shifts for platforms not possessing a > > > > reduc_plus/min/max. If shifts could be changed for rotates, the code there > > > > would do your reduce-to-a-vector-of-identical-elements in the midend...can we > > > > (sensibly!) bring all of these together? > > > > If you think it's reasonable, I will take a look at whether this can > > work. For targets like PowerPC and AArch32, perhaps we are better off > > not defining an expander at all, but letting the vector epilogue code > > expose the sequence of shifts/max-mins at the GIMPLE level. For this > > case the vectorizer would then not generate the unnecessary extract and > > broadcast. > > Well, I don't see an easy way to do this either. 
We really need either > a whole-vector rotate or a double-vector shift to do the expansion > early, neither of which has a defined standard pattern name. So either > way we would need to introduce a new GIMPLE code and a new optab entry. > I'm still leaning towards interest in exploring the reduce-to-vector > approach. Btw, we ditched the original reduce-to-vector variant due to its endianess issues (it only had _one_ element of the vector contain the reduction result). Re-introducing reduce-to-vector but with the reduction result in all elements wouldn't have any issue like that. Of course it would introduce a third optab variant similar to the old (now legacy) version but with different semantics... Richard. > Thanks, > Bill > > > > > However, I'd want to let Alan Hayward finish revising and committing his > > patch first. > > > > Regards, > > Bill > > > > > > > > > > > Perhaps the practical thing is to have the vectorizer also do an > > > > > add_stmt_cost with some new token that indicates the cost model should > > > > > make an adjustment if the back end doesn't need the extract/broadcast. > > > > > Targets like PowerPC and AArch32 could then subtract the unnecessary > > > > > cost, and remove the unnecessary code in simplify-rtx. > > > > > > > > I think it'd be good if we could do it before simplify-rtx, really, although > > > > I'm not sure I have a strong argument as to why, as long as we can cost it > > > > appropriately. > > > > > > > > > In any case, I will remove implementing the deprecated optabs, and I'll > > > > > also try to look at Alan L's patch shortly. > > > > > > > > That'd be great, thanks :) > > > > > > As for the cost modeling the simple solution is of course to have > > > these as "high-level" costs (a new enum entry), but the sustainable > > > solution is to have the target see the vectorized code and decide > > > for itself. 
Iff we'd go down the simple route then the target > > > may as well create the vectorized code for that special high-level > > > operation itself. > > > > > > Richard. > > > > > > > >
On Thu, 17 Sep 2015, Segher Boessenkool wrote: > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote: > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > > So just to clarify - you need to reduce the vector with max to a scalar > > > but want the (same) result in all vector elements? > > > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > > reduction to scalar, followed by a scalar broadcast to get the value > > into all positions. It happens that our most efficient expansion to > > reduce to scalar will naturally produce the value in all positions. > > It also is many insns after expand, so relying on combine to combine > all that plus the following splat (as Richard suggests below) is not > really going to work. > > If there also are targets where the _scal version is cheaper, maybe > we should keep both, and have expand expand to whatever the target > supports? Wait .. so you don't actually have an instruction to do, say, REDUC_MAX_EXPR (neither to scalar nor to vector)? Then it's better to _not_ define such a pattern and let the vectorizer generate its fallback code. If the fallback code isn't "best" then better think of a way to make it choose the best variant out of its available ones (and maybe add another). I think it tests availability of the building blocks for the variants and simply picks the first that works without checking the cost model. Richard. > > Segher > >
On Fri, 2015-09-18 at 10:38 +0200, Richard Biener wrote: > On Thu, 17 Sep 2015, Segher Boessenkool wrote: > > > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote: > > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > > > So just to clarify - you need to reduce the vector with max to a scalar > > > > but want the (same) result in all vector elements? > > > > > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > > > reduction to scalar, followed by a scalar broadcast to get the value > > > into all positions. It happens that our most efficient expansion to > > > reduce to scalar will naturally produce the value in all positions. > > > > It also is many insns after expand, so relying on combine to combine > > all that plus the following splat (as Richard suggests below) is not > > really going to work. > > > > If there also are targets where the _scal version is cheaper, maybe > > we should keep both, and have expand expand to whatever the target > > supports? > > Wait .. so you don't actually have an instruction to do, say, > REDUC_MAX_EXPR (neither to scalar nor to vector)? Then it's better > to _not_ define such pattern and let the vectorizer generate > its fallback code. If the fallback code isn't "best" then better > think of a way to make it choose the best variant out of its > available ones (and maybe add another). I think it tests > availability of the building blocks for the variants and simply > picks the first that works without checking the cost model. That's what we were considering per Alan Lawrence's suggestion elsewhere in this thread, but there isn't currently a way to represent a whole-vector rotate in gimple. So we'd either have to add that or fall back to an inferior code sequence, I believe. Bill > > Richard. > > > > > Segher > > > > > > -- > Richard Biener <rguenther@suse.de> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg) >
On Fri, 18 Sep 2015, Bill Schmidt wrote: > On Fri, 2015-09-18 at 10:38 +0200, Richard Biener wrote: > > On Thu, 17 Sep 2015, Segher Boessenkool wrote: > > > > > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote: > > > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > > > > So just to clarify - you need to reduce the vector with max to a scalar > > > > > but want the (same) result in all vector elements? > > > > > > > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > > > > reduction to scalar, followed by a scalar broadcast to get the value > > > > into all positions. It happens that our most efficient expansion to > > > > reduce to scalar will naturally produce the value in all positions. > > > > > > It also is many insns after expand, so relying on combine to combine > > > all that plus the following splat (as Richard suggests below) is not > > > really going to work. > > > > > > If there also are targets where the _scal version is cheaper, maybe > > > we should keep both, and have expand expand to whatever the target > > > supports? > > > > Wait .. so you don't actually have an instruction to do, say, > > REDUC_MAX_EXPR (neither to scalar nor to vector)? Then it's better > > to _not_ define such pattern and let the vectorizer generate > > its fallback code. If the fallback code isn't "best" then better > > think of a way to make it choose the best variant out of its > > available ones (and maybe add another). I think it tests > > availability of the building blocks for the variants and simply > > picks the first that works without checking the cost model. > > That's what we were considering per Alan Lawrence's suggestion elsewhere > in this thread, but there isn't currently a way to represent a > whole-vector rotate in gimple. So we'd either have to add that or fall > back to an inferior code sequence, I believe. A whole-vector rotate is just a VEC_PERM with a proper constant mask. 
Of course the target would have to detect these cases and use vector rotate instructions (x86 does that for example). Richard.
On Fri, 2015-09-18 at 15:15 +0200, Richard Biener wrote: > On Fri, 18 Sep 2015, Bill Schmidt wrote: > > > On Fri, 2015-09-18 at 10:38 +0200, Richard Biener wrote: > > > On Thu, 17 Sep 2015, Segher Boessenkool wrote: > > > > > > > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote: > > > > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote: > > > > > > So just to clarify - you need to reduce the vector with max to a scalar > > > > > > but want the (same) result in all vector elements? > > > > > > > > > > Yes. Alan Hayward's cond-reduction patch is set up to perform a > > > > > reduction to scalar, followed by a scalar broadcast to get the value > > > > > into all positions. It happens that our most efficient expansion to > > > > > reduce to scalar will naturally produce the value in all positions. > > > > > > > > It also is many insns after expand, so relying on combine to combine > > > > all that plus the following splat (as Richard suggests below) is not > > > > really going to work. > > > > > > > > If there also are targets where the _scal version is cheaper, maybe > > > > we should keep both, and have expand expand to whatever the target > > > > supports? > > > > > > Wait .. so you don't actually have an instruction to do, say, > > > REDUC_MAX_EXPR (neither to scalar nor to vector)? Then it's better > > > to _not_ define such pattern and let the vectorizer generate > > > its fallback code. If the fallback code isn't "best" then better > > > think of a way to make it choose the best variant out of its > > > available ones (and maybe add another). I think it tests > > > availability of the building blocks for the variants and simply > > > picks the first that works without checking the cost model. > > > > That's what we were considering per Alan Lawrence's suggestion elsewhere > > in this thread, but there isn't currently a way to represent a > > whole-vector rotate in gimple. 
So we'd either have to add that or fall > > back to an inferior code sequence, I believe. > > A whole-vector rotate is just a VEC_PERM with a proper constant mask. > Of course the target would have to detect these cases and use > vector rotate instructions (x86 does that for example). Hm, yes, that's right. And we should already have those special-permute recognitions in place; I had just forgotten about them. Ok, I agree this is probably the best approach, then. I'll have to refresh my memory on Alan H's patch, but we may need to add logic to do this sort of epilogue expansion for his new reduction. I think right now it is just giving up if the REDUC_MAX_EXPR isn't supported. Thanks, Bill > > Richard. > > -- > Richard Biener <rguenther@suse.de> > SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg) >
On 18/09/15 09:35, Richard Biener wrote: > > Btw, we ditched the original reduce-to-vector variant due to its > endianess issues (it only had _one_ element of the vector contain > the reduction result). Re-introducing reduce-to-vector but with > the reduction result in all elements wouldn't have any issue like > that. > > Of course it would introduce a third optab variant similar to the > old (now legacy) version but with different semantics... I am working (again) on finishing this migration, and think I have fixed the endianness issues on PowerPC, patch posted just now. MIPS/Loongson, I'm still struggling with; AFAICT, they implement their own reduc_ pattern for most cases, which for the most part repeats the midend's approach using shifts, but the first step uses a funky permute, as they have a faster instruction for this... --Alan
Index: gcc/config/rs6000/altivec.md =================================================================== --- gcc/config/rs6000/altivec.md (revision 227817) +++ gcc/config/rs6000/altivec.md (working copy) @@ -87,6 +87,14 @@ UNSPEC_GET_VRSAVE UNSPEC_LVX UNSPEC_REDUC_PLUS + UNSPEC_REDUC_SMAX + UNSPEC_REDUC_SMIN + UNSPEC_REDUC_UMAX + UNSPEC_REDUC_UMIN + UNSPEC_REDUC_SMAX_SCAL + UNSPEC_REDUC_SMIN_SCAL + UNSPEC_REDUC_UMAX_SCAL + UNSPEC_REDUC_UMIN_SCAL UNSPEC_VECSH UNSPEC_EXTEVEN_V4SI UNSPEC_EXTEVEN_V8HI @@ -2690,6 +2698,430 @@ DONE; }) +;; Strategy used here guarantees that after round N, the maximum value +;; will appear in 2^N adjacent positions (where adjacent can include the +;; lowest and highest positions being adjacent). Thus after N = log E +;; rounds, where E is the number of vector elements, the maximum value +;; will appear in every position in the vector. +;; +;; Note that reduc_s{max,min}_v{2d,4s}f already exist in vector.md. +;; +(define_expand "reduc_smax_v2di" + [(set (match_operand:V2DI 0 "register_operand" "=v") + (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")] + UNSPEC_REDUC_SMAX))] + "TARGET_P8_VECTOR" +{ + rtx shifted = gen_reg_rtx (V2DImode); + rtx op1 = operands[1]; + + emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8))); + emit_insn (gen_smaxv2di3 (operands[0], op1, shifted)); + DONE; +}) + +(define_expand "reduc_smax_scal_v2di" + [(set (match_operand:DI 0 "register_operand" "=r") + (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")] + UNSPEC_REDUC_SMAX_SCAL))] + "TARGET_P8_VECTOR" +{ + rtx reduc = gen_reg_rtx (V2DImode); + emit_insn (gen_reduc_smax_v2di (reduc, operands[1])); + rs6000_expand_vector_extract (operands[0], reduc, 0); + DONE; +}) + +(define_expand "reduc_smin_v2di" + [(set (match_operand:V2DI 0 "register_operand" "=v") + (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")] + UNSPEC_REDUC_SMIN))] + "TARGET_P8_VECTOR" +{ + rtx shifted = gen_reg_rtx (V2DImode); + rtx op1 = operands[1]; + + 
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_sminv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_smin_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_smin_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umax_v2di"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX))]
+  "TARGET_P8_VECTOR"
+{
+  rtx shifted = gen_reg_rtx (V2DImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_umaxv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_umax_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_umax_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umin_v2di"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN))]
+  "TARGET_P8_VECTOR"
+{
+  rtx shifted = gen_reg_rtx (V2DImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_uminv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_umin_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_umin_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_smax_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx max1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_smaxv4si3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, max1, max1, GEN_INT (8)));
+  emit_insn (gen_smaxv4si3 (operands[0], max1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_smin_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx min1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_sminv4si3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, min1, min1, GEN_INT (8)));
+  emit_insn (gen_sminv4si3 (operands[0], min1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_umax_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx max1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_umaxv4si3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, max1, max1, GEN_INT (8)));
+  emit_insn (gen_umaxv4si3 (operands[0], max1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_umin_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx min1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_uminv4si3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, min1, min1, GEN_INT (8)));
+  emit_insn (gen_uminv4si3 (operands[0], min1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_smax_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx max1 = gen_reg_rtx (V8HImode);
+  rtx max2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_smaxv8hi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, max1, max1, GEN_INT (4)));
+  emit_insn (gen_smaxv8hi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, max2, max2, GEN_INT (8)));
+  emit_insn (gen_smaxv8hi3 (operands[0], max2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_smin_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx min1 = gen_reg_rtx (V8HImode);
+  rtx min2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_sminv8hi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, min1, min1, GEN_INT (4)));
+  emit_insn (gen_sminv8hi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, min2, min2, GEN_INT (8)));
+  emit_insn (gen_sminv8hi3 (operands[0], min2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_umax_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx max1 = gen_reg_rtx (V8HImode);
+  rtx max2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_umaxv8hi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, max1, max1, GEN_INT (4)));
+  emit_insn (gen_umaxv8hi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, max2, max2, GEN_INT (8)));
+  emit_insn (gen_umaxv8hi3 (operands[0], max2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_umin_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx min1 = gen_reg_rtx (V8HImode);
+  rtx min2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_uminv8hi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, min1, min1, GEN_INT (4)));
+  emit_insn (gen_uminv8hi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, min2, min2, GEN_INT (8)));
+  emit_insn (gen_uminv8hi3 (operands[0], min2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_smax_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_SMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx max1 = gen_reg_rtx (V16QImode);
+  rtx max2 = gen_reg_rtx (V16QImode);
+  rtx max3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_smaxv16qi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, max1, max1, GEN_INT (2)));
+  emit_insn (gen_smaxv16qi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, max2, max2, GEN_INT (4)));
+  emit_insn (gen_smaxv16qi3 (max3, max2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, max3, max3, GEN_INT (8)));
+  emit_insn (gen_smaxv16qi3 (operands[0], max3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_smin_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_SMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx min1 = gen_reg_rtx (V16QImode);
+  rtx min2 = gen_reg_rtx (V16QImode);
+  rtx min3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_sminv16qi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, min1, min1, GEN_INT (2)));
+  emit_insn (gen_sminv16qi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, min2, min2, GEN_INT (4)));
+  emit_insn (gen_sminv16qi3 (min3, min2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, min3, min3, GEN_INT (8)));
+  emit_insn (gen_sminv16qi3 (operands[0], min3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_umax_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_UMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx max1 = gen_reg_rtx (V16QImode);
+  rtx max2 = gen_reg_rtx (V16QImode);
+  rtx max3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_umaxv16qi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, max1, max1, GEN_INT (2)));
+  emit_insn (gen_umaxv16qi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, max2, max2, GEN_INT (4)));
+  emit_insn (gen_umaxv16qi3 (max3, max2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, max3, max3, GEN_INT (8)));
+  emit_insn (gen_umaxv16qi3 (operands[0], max3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_umin_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_UMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx min1 = gen_reg_rtx (V16QImode);
+  rtx min2 = gen_reg_rtx (V16QImode);
+  rtx min3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_uminv16qi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, min1, min1, GEN_INT (2)));
+  emit_insn (gen_uminv16qi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, min2, min2, GEN_INT (4)));
+  emit_insn (gen_uminv16qi3 (min3, min2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, min3, min3, GEN_INT (8)));
+  emit_insn (gen_uminv16qi3 (operands[0], min3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_smax_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                   UNSPEC_REDUC_SMAX_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_smax_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_smin_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                   UNSPEC_REDUC_SMIN_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_smin_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umax_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                   UNSPEC_REDUC_UMAX_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_umax_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umin_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                   UNSPEC_REDUC_UMIN_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_umin_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
 (define_expand "neg<mode>2"
   [(use (match_operand:VI 0 "register_operand" ""))
    (use (match_operand:VI 1 "register_operand" ""))]
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-char.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-char.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-char.c	(working copy)
@@ -0,0 +1,85 @@
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsb" 6 } } */
+/* { dg-final { scan-assembler-times "vminsb" 6 } } */
+/* { dg-final { scan-assembler-times "vmaxub" 6 } } */
+/* { dg-final { scan-assembler-times "vminub" 6 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 16 } } */
+/* { dg-final { scan-assembler-times "stxvd2x" 4 } } */
+/* { dg-final { scan-assembler-times "lbz" 4 } } */
+/* { dg-final { scan-assembler-times "extsb" 2 } } */
+
+/* Test maximum and minimum reduction operations for V16QImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by four shift-max/min
+   pairs to obtain the maximum value in all positions.  An arbitrary
+   one of these is extracted using stxvd2x and lbz; in the signed
+   case, the lbz is followed by extsb.  */
+
+#define N 32
+
+typedef unsigned char T;
+typedef signed char T2;
+
+T
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-int.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-int.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-int.c	(working copy)
@@ -0,0 +1,85 @@
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsw" 4 } } */
+/* { dg-final { scan-assembler-times "vminsw" 4 } } */
+/* { dg-final { scan-assembler-times "vmaxuw" 4 } } */
+/* { dg-final { scan-assembler-times "vminuw" 4 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 8 } } */
+/* { dg-final { scan-assembler-times "stxvd2x" 4 } } */
+/* { dg-final { scan-assembler-times "lwz" 2 } } */
+/* { dg-final { scan-assembler-times "lwa" 2 } } */
+
+/* Test maximum and minimum reduction operations for V4SImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by two shift-max/min
+   pairs to obtain the maximum value in all positions.  An arbitrary
+   one of these is extracted using stxvd2x and either lwz (unsigned)
+   or lwa (signed).  */
+
+#define N 8
+
+typedef unsigned int T;
+typedef signed int T2;
+
+T
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-long.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-long.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-long.c	(working copy)
@@ -0,0 +1,82 @@
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsd" 3 } } */
+/* { dg-final { scan-assembler-times "vminsd" 3 } } */
+/* { dg-final { scan-assembler-times "vmaxud" 3 } } */
+/* { dg-final { scan-assembler-times "vminud" 3 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 4 } } */
+/* { dg-final { scan-assembler-times "mfvsrd" 4 } } */
+
+/* Test maximum and minimum reduction operations for V2DImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by a shift and
+   max/min to obtain the maximum value in both positions.  An
+   arbitrary one of these is extracted using mfvsrd.  */
+
+#define N 4
+
+typedef unsigned long long T;
+typedef signed long long T2;
+
+T
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-short.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-short.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-short.c	(working copy)
@@ -0,0 +1,85 @@
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsh" 5 } } */
+/* { dg-final { scan-assembler-times "vminsh" 5 } } */
+/* { dg-final { scan-assembler-times "vmaxuh" 5 } } */
+/* { dg-final { scan-assembler-times "vminuh" 5 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 12 } } */
+/* { dg-final { scan-assembler-times "stxvd2x" 4 } } */
+/* { dg-final { scan-assembler-times "lhz" 2 } } */
+/* { dg-final { scan-assembler-times "lha" 2 } } */
+
+/* Test maximum and minimum reduction operations for V8HImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by three shift-max/min
+   pairs to obtain the maximum value in all positions.  An arbitrary
+   one of these is extracted using stxvd2x and either lhz (unsigned)
+   or lha (signed).  */
+
+#define N 16
+
+typedef unsigned short T;
+typedef signed short T2;
+
+T
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
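For readers unfamiliar with the rotate-and-compare idiom these expansions use, here is a scalar C model of the V4SI case (hypothetical illustration only, not part of the patch): log2(4) = 2 stages, where each stage rotates the vector by 2^k elements (the role vsldoi plays above) and takes an element-wise max, so that after the last stage every lane holds the maximum and any lane may be extracted.

```c
#include <assert.h>

/* Scalar sketch of the reduc_umax_v4si expansion: two
   rotate-and-max stages; afterwards every position of the
   result holds the maximum, so position 0 can be extracted.  */
static unsigned int
reduce_umax_v4si_model (const unsigned int v[4])
{
  unsigned int a[4], b[4];

  /* Stage 1: rotate by one word (vsldoi ..., 4) and take max.
     Each lane now holds the max of two adjacent elements.  */
  for (int i = 0; i < 4; i++)
    {
      unsigned int r = v[(i + 1) % 4];
      a[i] = v[i] > r ? v[i] : r;
    }

  /* Stage 2: rotate by two words (vsldoi ..., 8) and take max.
     Each lane now holds the max of all four elements.  */
  for (int i = 0; i < 4; i++)
    {
      unsigned int r = a[(i + 2) % 4];
      b[i] = a[i] > r ? a[i] : r;
    }

  return b[0];  /* arbitrary lane; all lanes are equal here */
}
```

The V8HI and V16QI expansions are the same ladder with three and four stages respectively, which is why their test files expect 12 and 16 vsldoi instructions above.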