
[rs6000] Add expansions for min/max vector reductions

Message ID 1442413689.2896.45.camel@gnopaine
State New

Commit Message

Bill Schmidt Sept. 16, 2015, 2:28 p.m. UTC
Hi,

A recent patch proposal from Alan Hayward
(https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00690.html) uncovered
that the PowerPC back end doesn't have expansions for
reduc_{smax,smin,umax,umin}_<mode> and
reduc_{smax,smin,umax,umin}_scal_<mode> for the integer modes.  This
prevents vectorization of reductions involving comparisons that can be
transformed into REDUC_{MAX,MIN}_EXPR expressions.  This patch adds
these expansions.

PowerPC does not have hardware reduction instructions for maximum and
minimum.  However, we can emulate this with varying degrees of
efficiency for different modes.  The size of the expansion is
logarithmic in the number of vector elements E.  The expansions for
reduc_{smax,smin,umax,umin}_<mode> consist of log E stages, each
comprising a rotate operation and a maximum or minimum operation.  After
stage N, the maximum value in the vector will appear in at least 2^N
consecutive positions in the intermediate result.

The ...scal_<mode> expansions just invoke the related non-scalar
expansions, and then extract an arbitrary element from the result
vector.
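
In rough C terms -- using AltiVec intrinsics purely for illustration,
since the actual expansions are emitted as RTL from define_expands in
altivec.md, and the helper names here are made up -- the V4SI signed-max
case looks like this:

  #include <altivec.h>

  /* Reduce by signed max; on return every lane holds the maximum.
     vec_sld with the same vector as both inputs acts as the whole-vector
     rotate; vec_max performs the per-stage maximum.  */
  static inline vector signed int
  reduc_smax_v4si_sketch (vector signed int v)
  {
    v = vec_max (v, vec_sld (v, v, 8));   /* stage 1: rotate by 2 elements */
    v = vec_max (v, vec_sld (v, v, 4));   /* stage 2: rotate by 1 element  */
    return v;
  }

  /* The ...scal_<mode> form then just extracts an arbitrary lane.  */
  static inline int
  reduc_smax_scal_v4si_sketch (vector signed int v)
  {
    return vec_extract (reduc_smax_v4si_sketch (v), 0);
  }

(Compile with -maltivec; the doubleword V2DI case follows the same shape
with a single stage.)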

The expansions for V16QI, V8HI, and V4SI require TARGET_ALTIVEC.  The
expansions for V2DI make use of vector instructions added for ISA 2.07,
so they require TARGET_P8_VECTOR.

I was able to use iterators for the sub-doubleword ...scal_<mode>
expansions, but that's all.  I experimented with trying to use
code_iterators to generate the {smax,smin,umax,umin} expansions, but
couldn't find a way to make that work, as the substitution wasn't being
done into the UNSPEC constants.  If there is a way to do this, please
let me know and I'll try to reduce the code size.

There are already a number of common reduction execution tests that
exercise this logic.  I've also added PowerPC-specific code generation
tests to verify the patterns produce what's expected.  These are based
on the existing execution tests.

Some future work will be required:

(1) The vectorization cost model does not currently allow us to
distinguish between reductions of additions and reductions of max/min.
On PowerPC, these costs are very different, as the former is supported
by hardware and the latter is not.  After this patch is applied, we will
possibly vectorize some code when it's not profitable to do so.  I think
it's probably best to go ahead with this patch now, and deal with the
cost model as a separate issue after Alan's patch is complete and
upstream.

(2) The use of rs6000_expand_vector_extract to obtain a scalar from a
vector is not optimal for sub-doubleword modes using the latest
hardware.  Currently this generates a vector store followed by a scalar
load, which is Very Bad.  We should instead use a mfvsrd and sign- or
zero-extend the rightmost element in the result GPR.  To accomplish
this, we should update rs6000_expand_vector_extract to do the more
general thing:  mfvsrd, shift the selected element into the rightmost
position, and extend it.  At that time we should change the _scal_<mode>
expansions to select the element number that avoids the shift (that
number will differ for BE and LE).

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no
regressions.  Is this ok for trunk?

Thanks,
Bill


[gcc]

2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN,
	UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
	UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
	UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
	(reduc_smax_v2di): New define_expand.
	(reduc_smax_scal_v2di): Likewise.
	(reduc_smin_v2di): Likewise.
	(reduc_smin_scal_v2di): Likewise.
	(reduc_umax_v2di): Likewise.
	(reduc_umax_scal_v2di): Likewise.
	(reduc_umin_v2di): Likewise.
	(reduc_umin_scal_v2di): Likewise.
	(reduc_smax_v4si): Likewise.
	(reduc_smin_v4si): Likewise.
	(reduc_umax_v4si): Likewise.
	(reduc_umin_v4si): Likewise.
	(reduc_smax_v8hi): Likewise.
	(reduc_smin_v8hi): Likewise.
	(reduc_umax_v8hi): Likewise.
	(reduc_umin_v8hi): Likewise.
	(reduc_smax_v16qi): Likewise.
	(reduc_smin_v16qi): Likewise.
	(reduc_umax_v16qi): Likewise.
	(reduc_umin_v16qi): Likewise.
	(reduc_smax_scal_<mode>): Likewise.
	(reduc_smin_scal_<mode>): Likewise.
	(reduc_umax_scal_<mode>): Likewise.
	(reduc_umin_scal_<mode>): Likewise.

[gcc/testsuite]

2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* gcc.target/powerpc/vect-reduc-minmax-char.c: New.
	* gcc.target/powerpc/vect-reduc-minmax-short.c: New.
	* gcc.target/powerpc/vect-reduc-minmax-int.c: New.
	* gcc.target/powerpc/vect-reduc-minmax-long.c: New.

Comments

David Edelsohn Sept. 16, 2015, 2:34 p.m. UTC | #1
On Wed, Sep 16, 2015 at 10:28 AM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> Hi,
>
> A recent patch proposal from Alan Hayward
> (https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00690.html) uncovered
> that the PowerPC back end doesn't have expansions for
> reduc_{smax,smin,umax,umin}_<mode> and
> reduc_{smax,smin,umax,umin}_scal_<mode> for the integer modes.  This
> prevents vectorization of reductions involving comparisons that can be
> transformed into REDUC_{MAX,MIN}_EXPR expressions.  This patch adds
> these expansions.
>
> PowerPC does not have hardware reduction instructions for maximum and
> minimum.  However, we can emulate this with varying degrees of
> efficiency for different modes.  The size of the expansion is
> logarithmic in the number of vector elements E.  The expansions for
> reduc_{smax,smin,umax,umin}_<mode> consist of log E stages, each
> comprising a rotate operation and a maximum or minimum operation.  After
> stage N, the maximum value in the vector will appear in at least 2^N
> consecutive positions in the intermediate result.
>
> The ...scal_<mode> expansions just invoke the related non-scalar
> expansions, and then extract an arbitrary element from the result
> vector.
>
> The expansions for V16QI, V8HI, and V4SI require TARGET_ALTIVEC.  The
> expansions for V2DI make use of vector instructions added for ISA 2.07,
> so they require TARGET_P8_VECTOR.
>
> I was able to use iterators for the sub-doubleword ...scal_<mode>
> expansions, but that's all.  I experimented with trying to use
> code_iterators to generate the {smax,smin,umax,umin} expansions, but
> couldn't find a way to make that work, as the substitution wasn't being
> done into the UNSPEC constants.  If there is a way to do this, please
> let me know and I'll try to reduce the code size.
>
> There are already a number of common reduction execution tests that
> exercise this logic.  I've also added PowerPC-specific code generation
> tests to verify the patterns produce what's expected.  These are based
> on the existing execution tests.
>
> Some future work will be required:
>
> (1) The vectorization cost model does not currently allow us to
> distinguish between reductions of additions and reductions of max/min.
> On PowerPC, these costs are very different, as the former is supported
> by hardware and the latter is not.  After this patch is applied, we will
> possibly vectorize some code when it's not profitable to do so.  I think
> it's probably best to go ahead with this patch now, and deal with the
> cost model as a separate issue after Alan's patch is complete and
> upstream.
>
> (2) The use of rs6000_expand_vector_extract to obtain a scalar from a
> vector is not optimal for sub-doubleword modes using the latest
> hardware.  Currently this generates a vector store followed by a scalar
> load, which is Very Bad.  We should instead use a mfvsrd and sign- or
> zero-extend the rightmost element in the result GPR.  To accomplish
> this, we should update rs6000_expand_vector_extract to do the more
> general thing:  mfvsrd, shift the selected element into the rightmost
> position, and extend it.  At that time we should change the _scal_<mode>
> expansions to select the element number that avoids the shift (that
> number will differ for BE and LE).
>
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no
> regressions.  Is this ok for trunk?
>
> Thanks,
> Bill
>
>
> [gcc]
>
> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
>
>         * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN,
>         UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
>         UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
>         UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
>         (reduc_smax_v2di): New define_expand.
>         (reduc_smax_scal_v2di): Likewise.
>         (reduc_smin_v2di): Likewise.
>         (reduc_smin_scal_v2di): Likewise.
>         (reduc_umax_v2di): Likewise.
>         (reduc_umax_scal_v2di): Likewise.
>         (reduc_umin_v2di): Likewise.
>         (reduc_umin_scal_v2di): Likewise.
>         (reduc_smax_v4si): Likewise.
>         (reduc_smin_v4si): Likewise.
>         (reduc_umax_v4si): Likewise.
>         (reduc_umin_v4si): Likewise.
>         (reduc_smax_v8hi): Likewise.
>         (reduc_smin_v8hi): Likewise.
>         (reduc_umax_v8hi): Likewise.
>         (reduc_umin_v8hi): Likewise.
>         (reduc_smax_v16qi): Likewise.
>         (reduc_smin_v16qi): Likewise.
>         (reduc_umax_v16qi): Likewise.
>         (reduc_umin_v16qi): Likewise.
>         (reduc_smax_scal_<mode>): Likewise.
>         (reduc_smin_scal_<mode>): Likewise.
>         (reduc_umax_scal_<mode>): Likewise.
>         (reduc_umin_scal_<mode>): Likewise.
>
> [gcc/testsuite]
>
> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
>
>         * gcc.target/powerpc/vect-reduc-minmax-char.c: New.
>         * gcc.target/powerpc/vect-reduc-minmax-short.c: New.
>         * gcc.target/powerpc/vect-reduc-minmax-int.c: New.
>         * gcc.target/powerpc/vect-reduc-minmax-long.c: New.

This is okay.

I don't think that I have seen iterators for UNSPECs, but maybe
someone else is aware of the right idiom.

Thanks, David
Ramana Radhakrishnan Sept. 16, 2015, 2:37 p.m. UTC | #2
On Wed, Sep 16, 2015 at 3:34 PM, David Edelsohn <dje.gcc@gmail.com> wrote:
> On Wed, Sep 16, 2015 at 10:28 AM, Bill Schmidt

>
> This is okay.
>
> I don't think that I have seen iterators for UNSPECs, but maybe
> someone else is aware of the right idiom.

https://gcc.gnu.org/onlinedocs/gccint/Int-Iterators.html .
define_int_iterator was added precisely for this functionality in the
aarch64 backend.



regards
Ramana

>
> Thanks, David
Bill Schmidt Sept. 16, 2015, 2:52 p.m. UTC | #3
On Wed, 2015-09-16 at 15:37 +0100, Ramana Radhakrishnan wrote:
> On Wed, Sep 16, 2015 at 3:34 PM, David Edelsohn <dje.gcc@gmail.com> wrote:
> > On Wed, Sep 16, 2015 at 10:28 AM, Bill Schmidt
> 
> >
> > This is okay.
> >
> > I don't think that I have seen iterators for UNSPECs, but maybe
> > someone else is aware of the right idiom.
> 
> https://gcc.gnu.org/onlinedocs/gccint/Int-Iterators.html .
> define_int_iterator was added precisely for this functionality in the
> aarch64 backend.

Ah, thanks!  Much obliged, Ramana.

Bill

> 
> 
> 
> regards
> Ramana
> 
> >
> > Thanks, David
>
Segher Boessenkool Sept. 16, 2015, 3:14 p.m. UTC | #4
On Wed, Sep 16, 2015 at 09:28:09AM -0500, Bill Schmidt wrote:
> I was able to use iterators for the sub-doubleword ...scal_<mode>
> expansions, but that's all.  I experimented with trying to use
> code_iterators to generate the {smax,smin,umax,umin} expansions, but
> couldn't find a way to make that work, as the substitution wasn't being
> done into the UNSPEC constants.  If there is a way to do this, please
> let me know and I'll try to reduce the code size.

Do you need the unspecs at all?  They are only used in the pattern of
expanders, and all such expanders call DONE.  You can instead write
those patterns as just USEs of their operands?


Segher
Bill Schmidt Sept. 16, 2015, 3:29 p.m. UTC | #5
On Wed, 2015-09-16 at 10:14 -0500, Segher Boessenkool wrote:
> On Wed, Sep 16, 2015 at 09:28:09AM -0500, Bill Schmidt wrote:
> > I was able to use iterators for the sub-doubleword ...scal_<mode>
> > expansions, but that's all.  I experimented with trying to use
> > code_iterators to generate the {smax,smin,umax,umin} expansions, but
> > couldn't find a way to make that work, as the substitution wasn't being
> > done into the UNSPEC constants.  If there is a way to do this, please
> > let me know and I'll try to reduce the code size.
> 
> Do you need the unspecs at all?  They are only used in the pattern of
> expanders, and all such expanders call DONE.  You can instead write
> those patterns as just USEs of their operands?

Hi Segher,

Yes, I had just independently come to the same conclusion -- since we
are always calling DONE, I don't need the UNSPECs at all.  I'm verifying
to be sure, but this should solve my problem.

Thanks!
Bill
> 
> 
> Segher
>
Alan Lawrence Sept. 16, 2015, 3:29 p.m. UTC | #6
On 16/09/15 15:28, Bill Schmidt wrote:
> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
>
>          * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN,
>          UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
>          UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
>          UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
>          (reduc_smax_v2di): New define_expand.
>          (reduc_smax_scal_v2di): Likewise.
>          (reduc_smin_v2di): Likewise.
>          (reduc_smin_scal_v2di): Likewise.
>          (reduc_umax_v2di): Likewise.
>          (reduc_umax_scal_v2di): Likewise.
>          (reduc_umin_v2di): Likewise.
>          (reduc_umin_scal_v2di): Likewise.
>          (reduc_smax_v4si): Likewise.
>          (reduc_smin_v4si): Likewise.
>          (reduc_umax_v4si): Likewise.
>          (reduc_umin_v4si): Likewise.
>          (reduc_smax_v8hi): Likewise.
>          (reduc_smin_v8hi): Likewise.
>          (reduc_umax_v8hi): Likewise.
>          (reduc_umin_v8hi): Likewise.
>          (reduc_smax_v16qi): Likewise.
>          (reduc_smin_v16qi): Likewise.
>          (reduc_umax_v16qi): Likewise.
>          (reduc_umin_v16qi): Likewise.
>          (reduc_smax_scal_<mode>): Likewise.
>          (reduc_smin_scal_<mode>): Likewise.
>          (reduc_umax_scal_<mode>): Likewise.
>          (reduc_umin_scal_<mode>): Likewise.

You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if 
the _scal are present. The non-_scal's were previously defined as producing a 
vector with one element holding the result and the other elements all zero, and 
this was only ever used with a vec_extract immediately after; the _scal pattern 
now includes the vec_extract as well. Hence the non-_scal patterns are 
deprecated / considered legacy, as per md.texi.

I proposed a patch to migrate PPC off the old patterns, but have forgotten to 
ping it recently - last at 
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!)

--Alan
Bill Schmidt Sept. 16, 2015, 4:10 p.m. UTC | #7
On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> On 16/09/15 15:28, Bill Schmidt wrote:
> > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> >
> >          * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN,
> >          UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> >          UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> >          UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> >          (reduc_smax_v2di): New define_expand.
> >          (reduc_smax_scal_v2di): Likewise.
> >          (reduc_smin_v2di): Likewise.
> >          (reduc_smin_scal_v2di): Likewise.
> >          (reduc_umax_v2di): Likewise.
> >          (reduc_umax_scal_v2di): Likewise.
> >          (reduc_umin_v2di): Likewise.
> >          (reduc_umin_scal_v2di): Likewise.
> >          (reduc_smax_v4si): Likewise.
> >          (reduc_smin_v4si): Likewise.
> >          (reduc_umax_v4si): Likewise.
> >          (reduc_umin_v4si): Likewise.
> >          (reduc_smax_v8hi): Likewise.
> >          (reduc_smin_v8hi): Likewise.
> >          (reduc_umax_v8hi): Likewise.
> >          (reduc_umin_v8hi): Likewise.
> >          (reduc_smax_v16qi): Likewise.
> >          (reduc_smin_v16qi): Likewise.
> >          (reduc_umax_v16qi): Likewise.
> >          (reduc_umin_v16qi): Likewise.
> >          (reduc_smax_scal_<mode>): Likewise.
> >          (reduc_smin_scal_<mode>): Likewise.
> >          (reduc_umax_scal_<mode>): Likewise.
> >          (reduc_umin_scal_<mode>): Likewise.
> 
> You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if 
> the _scal are present. The non-_scal's were previously defined as producing a 
> vector with one element holding the result and the other elements all zero, and 
> this was only ever used with a vec_extract immediately after; the _scal pattern 
> now includes the vec_extract as well. Hence the non-_scal patterns are 
> deprecated / considered legacy, as per md.texi.

Thanks -- I had misread the description of the non-scalar versions,
missing the part where the other elements are zero.  What I really
want/need is an optab defined as computing the maximum value in all
elements of the vector.  This seems like a strange thing to want, but
Alan Hayward's proposed patch will cause us to generate the scalar
version, followed by a broadcast of the vector.  Since our patterns
already generate the maximum value in all positions, this creates an
unnecessary extract followed by an unnecessary broadcast.

As discussed elsewhere, we *could* remove the unnecessary code by
recognizing this in simplify-rtx, etc., but the vectorization cost
modeling would still be wrong.  It would still have counted a
vec_to_scalar for the reduc_max_scal and a vec_stmt for the broadcast,
overcounting the cost of the reduction compared to what we actually
generate.

To get this right for all targets, one could envision having a new optab
for a reduction-to-vector, which most targets wouldn't implement, but
PowerPC and AArch32, at least, would.  If a target has a
reduction-to-vector, the vectorizer would have to generate a different
GIMPLE code that mapped to this; otherwise it would do the
REDUC_MAX_EXPR and the broadcast.  This obviously starts to get
complicated, since adding a GIMPLE code certainly has a nontrivial
cost. :/

Perhaps the practical thing is to have the vectorizer also do an
add_stmt_cost with some new token that indicates the cost model should
make an adjustment if the back end doesn't need the extract/broadcast.
Targets like PowerPC and AArch32 could then subtract the unnecessary
cost, and remove the unnecessary code in simplify-rtx.

Copying Richi and ARM folks for opinions on the best design.  I want to
be able to model this stuff as accurately as possible, but obviously we
need to avoid unnecessary effects on other architectures.

In any case, I will remove implementing the deprecated optabs, and I'll
also try to look at Alan L's patch shortly.

Thanks,
Bill


> 
> I proposed a patch to migrate PPC off the old patterns, but have forgotten to 
> ping it recently - last at 
> https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!)
> 
> --Alan
>
Bill Schmidt Sept. 16, 2015, 4:19 p.m. UTC | #8
On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> 
> I proposed a patch to migrate PPC off the old patterns, but have forgotten to 
> ping it recently - last at 
> https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!)
> 

Hi Alan,

Thanks for this patch.  I see that you tested it on gcc110, which is a
big-endian platform.  I think the pattern for V4SF might have an endian
problem on little-endian, but I'm not positive just eyeballing it.  (I
think that the select of element 3 will address the wrong end of the
vector for LE.)  Can you please try the patch on gcc112 as well to set
my mind at ease?

Thanks,
Bill

> --Alan
>
Alan Lawrence Sept. 16, 2015, 5:03 p.m. UTC | #9
On 16/09/15 17:10, Bill Schmidt wrote:
>
> On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
>> On 16/09/15 15:28, Bill Schmidt wrote:
>>> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
>>>
>>>           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN,
>>>           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
>>>           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
>>>           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
>>>           (reduc_smax_v2di): New define_expand.
>>>           (reduc_smax_scal_v2di): Likewise.
>>>           (reduc_smin_v2di): Likewise.
>>>           (reduc_smin_scal_v2di): Likewise.
>>>           (reduc_umax_v2di): Likewise.
>>>           (reduc_umax_scal_v2di): Likewise.
>>>           (reduc_umin_v2di): Likewise.
>>>           (reduc_umin_scal_v2di): Likewise.
>>>           (reduc_smax_v4si): Likewise.
>>>           (reduc_smin_v4si): Likewise.
>>>           (reduc_umax_v4si): Likewise.
>>>           (reduc_umin_v4si): Likewise.
>>>           (reduc_smax_v8hi): Likewise.
>>>           (reduc_smin_v8hi): Likewise.
>>>           (reduc_umax_v8hi): Likewise.
>>>           (reduc_umin_v8hi): Likewise.
>>>           (reduc_smax_v16qi): Likewise.
>>>           (reduc_smin_v16qi): Likewise.
>>>           (reduc_umax_v16qi): Likewise.
>>>           (reduc_umin_v16qi): Likewise.
>>>           (reduc_smax_scal_<mode>): Likewise.
>>>           (reduc_smin_scal_<mode>): Likewise.
>>>           (reduc_umax_scal_<mode>): Likewise.
>>>           (reduc_umin_scal_<mode>): Likewise.
>>
>> You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if
>> the _scal are present. The non-_scal's were previously defined as producing a
>> vector with one element holding the result and the other elements all zero, and
>> this was only ever used with a vec_extract immediately after; the _scal pattern
>> now includes the vec_extract as well. Hence the non-_scal patterns are
>> deprecated / considered legacy, as per md.texi.
>
> Thanks -- I had misread the description of the non-scalar versions,
> missing the part where the other elements are zero.  What I really
> want/need is an optab defined as computing the maximum value in all
> elements of the vector.

Yes, indeed. It seems reasonable to me that this would coexist with an optab 
which computes only a single value (i.e. a scalar).

At that point it might be appropriate to change the cond-reduction code to 
generate the reduce-to-vector in all cases, and optabs.c expand it to 
reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with 
the (parallel) changes to cost model already proposed, does that cover all the 
cases? It does add a new tree code, yes, but I'm feeling that could be justified 
if we go down this route.

However, another point that springs to mind: if you reduce a loop containing OR 
or MUL expressions, the vect_create_epilog_for_reduction reduces these using 
shifts, and I think will also use shifts for platforms not possessing a 
reduc_plus/min/max. If shifts could be changed for rotates, the code there would 
do your reduce-to-a-vector-of-identical-elements in the midend...can we 
(sensibly!) bring all of these together?

> Perhaps the practical thing is to have the vectorizer also do an
> add_stmt_cost with some new token that indicates the cost model should
> make an adjustment if the back end doesn't need the extract/broadcast.
> Targets like PowerPC and AArch32 could then subtract the unnecessary
> cost, and remove the unnecessary code in simplify-rtx.

I think it'd be good if we could do it before simplify-rtx, really, although I'm 
not sure I have a strong argument as to why, as long as we can cost it 
appropriately.

> In any case, I will remove implementing the deprecated optabs, and I'll
> also try to look at Alan L's patch shortly.

That'd be great, thanks :)

Cheers, Alan
Alan Lawrence Sept. 16, 2015, 6:16 p.m. UTC | #10
On 16/09/15 17:19, Bill Schmidt wrote:
> On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
>>
>> I proposed a patch to migrate PPC off the old patterns, but have forgotten to
>> ping it recently - last at
>> https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!)
>>
>
> Hi Alan,
>
> Thanks for this patch.  I see that you tested it on gcc110, which is a
> big-endian platform.  I think the pattern for V4SF might have an endian
> problem on little-endian, but I'm not positive just eyeballing it.  (I
> think that the select of element 3 will address the wrong end of the
> vector for LE.)  Can you please try the patch on gcc112 as well to set
> my mind at ease?
>
> Thanks,
> Bill
>
>> --Alan
>>
>
>

I think you are right....I'm just retesting without the patch to rule out other 
test setup problems etc., but I see more tests failing on gcc112 than I expect 
(comparing against e.g.
https://gcc.gnu.org/ml/gcc-testresults/2015-09/msg01479.html).

What's the best way to determine endianness - is it BYTES_BIG_ENDIAN, would that 
be true on gcc110 but false on gcc112?

Cheers, Alan
Bill Schmidt Sept. 16, 2015, 6:25 p.m. UTC | #11
On Wed, 2015-09-16 at 19:16 +0100, Alan Lawrence wrote:
> On 16/09/15 17:19, Bill Schmidt wrote:
> > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> >>
> >> I proposed a patch to migrate PPC off the old patterns, but have forgotten to
> >> ping it recently - last at
> >> https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01024.html ... (ping?!)
> >>
> >
> > Hi Alan,
> >
> > Thanks for this patch.  I see that you tested it on gcc110, which is a
> > big-endian platform.  I think the pattern for V4SF might have an endian
> > problem on little-endian, but I'm not positive just eyeballing it.  (I
> > think that the select of element 3 will address the wrong end of the
> > vector for LE.)  Can you please try the patch on gcc112 as well to set
> > my mind at ease?
> >
> > Thanks,
> > Bill
> >
> >> --Alan
> >>
> >
> >
> 
> I think you are right....I'm just retesting without the patch to rule out other 
> test setup problems etc., but I see more tests failing on gcc112 than I expect 
> (comparing against e.g.
> https://gcc.gnu.org/ml/gcc-testresults/2015-09/msg01479.html).
> 
> What's the best way to determine endianness - is it BYTES_BIG_ENDIAN, would that 
> be true on gcc110 but false on gcc112?

Yes, that's the correct test, thanks!
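
For illustration only (this is not from your patch, and the concrete lane
numbers are just my assumption from the V4SF discussion above), the
expander would pick the element along these lines:

  /* The lane holding the result differs by endianness: element 3 on BE
     corresponds to element 0 on LE for the V4SF case.  */
  int element = BYTES_BIG_ENDIAN ? 3 : 0;
  rtx elt = GEN_INT (element);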

Bill

> 
> Cheers, Alan
>
Bill Schmidt Sept. 16, 2015, 9:21 p.m. UTC | #12
On Wed, 2015-09-16 at 18:03 +0100, Alan Lawrence wrote:
> On 16/09/15 17:10, Bill Schmidt wrote:
> >
> > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> >> On 16/09/15 15:28, Bill Schmidt wrote:
> >>> 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> >>>
> >>>           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX, UNSPEC_REDUC_SMIN,
> >>>           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> >>>           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> >>>           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> >>>           (reduc_smax_v2di): New define_expand.
> >>>           (reduc_smax_scal_v2di): Likewise.
> >>>           (reduc_smin_v2di): Likewise.
> >>>           (reduc_smin_scal_v2di): Likewise.
> >>>           (reduc_umax_v2di): Likewise.
> >>>           (reduc_umax_scal_v2di): Likewise.
> >>>           (reduc_umin_v2di): Likewise.
> >>>           (reduc_umin_scal_v2di): Likewise.
> >>>           (reduc_smax_v4si): Likewise.
> >>>           (reduc_smin_v4si): Likewise.
> >>>           (reduc_umax_v4si): Likewise.
> >>>           (reduc_umin_v4si): Likewise.
> >>>           (reduc_smax_v8hi): Likewise.
> >>>           (reduc_smin_v8hi): Likewise.
> >>>           (reduc_umax_v8hi): Likewise.
> >>>           (reduc_umin_v8hi): Likewise.
> >>>           (reduc_smax_v16qi): Likewise.
> >>>           (reduc_smin_v16qi): Likewise.
> >>>           (reduc_umax_v16qi): Likewise.
> >>>           (reduc_umin_v16qi): Likewise.
> >>>           (reduc_smax_scal_<mode>): Likewise.
> >>>           (reduc_smin_scal_<mode>): Likewise.
> >>>           (reduc_umax_scal_<mode>): Likewise.
> >>>           (reduc_umin_scal_<mode>): Likewise.
> >>
> >> You shouldn't need the non-_scal reductions. Indeed, they shouldn't be used if
> >> the _scal are present. The non-_scal's were previously defined as producing a
> >> vector with one element holding the result and the other elements all zero, and
> >> this was only ever used with a vec_extract immediately after; the _scal pattern
> >> now includes the vec_extract as well. Hence the non-_scal patterns are
> >> deprecated / considered legacy, as per md.texi.
> >
> > Thanks -- I had misread the description of the non-scalar versions,
> > missing the part where the other elements are zero.  What I really
> > want/need is an optab defined as computing the maximum value in all
> > elements of the vector.
> 
> Yes, indeed. It seems reasonable to me that this would coexist with an optab 
> which computes only a single value (i.e. a scalar).
> 
> At that point it might be appropriate to change the cond-reduction code to 
> generate the reduce-to-vector in all cases, and optabs.c expand it to 
> reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with 
> the (parallel) changes to cost model already proposed, does that cover all the 
> cases? It does add a new tree code, yes, but I'm feeling that could be justified 
> if we go down this route.

That's how I envisioned it as well, and it was my original preference,
provided the maintainers are ok with it.  However, your next suggestion
is intriguing...

> 
> However, another point that springs to mind: if you reduce a loop containing OR 
> or MUL expressions, the vect_create_epilog_for_reduction reduces these using 
> shifts, and I think will also use shifts for platforms not possessing a 
> reduc_plus/min/max. If shifts could be changed for rotates, the code there would 
> do your reduce-to-a-vector-of-identical-elements in the midend...can we 
> (sensibly!) bring all of these together?

Perhaps so.  I can have a look at that and see.  What I'm calling a
rotate is really a double-vector shift where both input vectors are the
same, so perhaps this is already pretty close to what we need.

Thanks!  I'll try to put together some sort of proposal over the next
week or so, workload permitting.

Bill

> 
> > Perhaps the practical thing is to have the vectorizer also do an
> > add_stmt_cost with some new token that indicates the cost model should
> > make an adjustment if the back end doesn't need the extract/broadcast.
> > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > cost, and remove the unnecessary code in simplify-rtx.
> 
> I think it'd be good if we could do it before simplify-rtx, really, although I'm 
> not sure I have a strong argument as to why, as long as we can cost it 
> appropriately.
> 
> > In any case, I will remove implementing the deprecated optabs, and I'll
> > also try to look at Alan L's patch shortly.
> 
> That'd be great, thanks :)
> 
> Cheers, Alan
>
Richard Biener Sept. 17, 2015, 7:39 a.m. UTC | #13
On Wed, 16 Sep 2015, Alan Lawrence wrote:

> On 16/09/15 17:10, Bill Schmidt wrote:
> > 
> > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> > > > 
> > > >           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > UNSPEC_REDUC_SMIN,
> > > >           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > >           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > >           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > >           (reduc_smax_v2di): New define_expand.
> > > >           (reduc_smax_scal_v2di): Likewise.
> > > >           (reduc_smin_v2di): Likewise.
> > > >           (reduc_smin_scal_v2di): Likewise.
> > > >           (reduc_umax_v2di): Likewise.
> > > >           (reduc_umax_scal_v2di): Likewise.
> > > >           (reduc_umin_v2di): Likewise.
> > > >           (reduc_umin_scal_v2di): Likewise.
> > > >           (reduc_smax_v4si): Likewise.
> > > >           (reduc_smin_v4si): Likewise.
> > > >           (reduc_umax_v4si): Likewise.
> > > >           (reduc_umin_v4si): Likewise.
> > > >           (reduc_smax_v8hi): Likewise.
> > > >           (reduc_smin_v8hi): Likewise.
> > > >           (reduc_umax_v8hi): Likewise.
> > > >           (reduc_umin_v8hi): Likewise.
> > > >           (reduc_smax_v16qi): Likewise.
> > > >           (reduc_smin_v16qi): Likewise.
> > > >           (reduc_umax_v16qi): Likewise.
> > > >           (reduc_umin_v16qi): Likewise.
> > > >           (reduc_smax_scal_<mode>): Likewise.
> > > >           (reduc_smin_scal_<mode>): Likewise.
> > > >           (reduc_umax_scal_<mode>): Likewise.
> > > >           (reduc_umin_scal_<mode>): Likewise.
> > > 
> > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be
> > > used if
> > > the _scal are present. The non-_scal's were previously defined as
> > > producing a
> > > vector with one element holding the result and the other elements all
> > > zero, and
> > > this was only ever used with a vec_extract immediately after; the _scal
> > > pattern
> > > now includes the vec_extract as well. Hence the non-_scal patterns are
> > > deprecated / considered legacy, as per md.texi.
> > 
> > Thanks -- I had misread the description of the non-scalar versions,
> > missing the part where the other elements are zero.  What I really
> > want/need is an optab defined as computing the maximum value in all
> > elements of the vector.
> 
> Yes, indeed. It seems reasonable to me that this would coexist with an optab
> which computes only a single value (i.e. a scalar).

So just to clarify - you need to reduce the vector with max to a scalar
but want the (same) result in all vector elements?

> At that point it might be appropriate to change the cond-reduction code to
> generate the reduce-to-vector in all cases, and optabs.c expand it to
> reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> the (parallel) changes to cost model already proposed, does that cover all the
> cases? It does add a new tree code, yes, but I'm feeling that could be
> justified if we go down this route.

I'd rather have combine recognize an insn that does both (if it
exists).  As I understand powerpc doesn't have reduction to scalar
(I think no ISA actually has this, but all ISAs have reduce to
one vector element plus a cheap way of extraction (effectively a subreg))
but its reduction already has all result vector elements set to the
same value (as opposed to having some undefined or zero or whatever).

> However, another point that springs to mind: if you reduce a loop containing
> OR or MUL expressions, the vect_create_epilog_for_reduction reduces these
> using shifts, and I think will also use shifts for platforms not possessing a
> reduc_plus/min/max. If shifts could be changed for rotates, the code there
> would do your reduce-to-a-vector-of-identical-elements in the midend...can we
> (sensibly!) bring all of these together?
>
> > Perhaps the practical thing is to have the vectorizer also do an
> > add_stmt_cost with some new token that indicates the cost model should
> > make an adjustment if the back end doesn't need the extract/broadcast.
> > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > cost, and remove the unnecessary code in simplify-rtx.
> 
> I think it'd be good if we could do it before simplify-rtx, really, although
> I'm not sure I have a strong argument as to why, as long as we can cost it
> appropriately.
> 
> > In any case, I will remove implementing the deprecated optabs, and I'll
> > also try to look at Alan L's patch shortly.
> 
> That'd be great, thanks :)

As for the cost modeling the simple solution is of course to have
these as "high-level" costs (a new enum entry), but the sustainable
solution is to have the target see the vectorized code and decide
for itself.  Iff we'd go down the simple route then the target
may as well create the vectorized code for that special high-level
operation itself.

Richard.
Bill Schmidt Sept. 17, 2015, 2:18 p.m. UTC | #14
On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> On Wed, 16 Sep 2015, Alan Lawrence wrote:
> 
> > On 16/09/15 17:10, Bill Schmidt wrote:
> > > 
> > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> > > > > 
> > > > >           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > > UNSPEC_REDUC_SMIN,
> > > > >           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > > >           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > > >           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > > >           (reduc_smax_v2di): New define_expand.
> > > > >           (reduc_smax_scal_v2di): Likewise.
> > > > >           (reduc_smin_v2di): Likewise.
> > > > >           (reduc_smin_scal_v2di): Likewise.
> > > > >           (reduc_umax_v2di): Likewise.
> > > > >           (reduc_umax_scal_v2di): Likewise.
> > > > >           (reduc_umin_v2di): Likewise.
> > > > >           (reduc_umin_scal_v2di): Likewise.
> > > > >           (reduc_smax_v4si): Likewise.
> > > > >           (reduc_smin_v4si): Likewise.
> > > > >           (reduc_umax_v4si): Likewise.
> > > > >           (reduc_umin_v4si): Likewise.
> > > > >           (reduc_smax_v8hi): Likewise.
> > > > >           (reduc_smin_v8hi): Likewise.
> > > > >           (reduc_umax_v8hi): Likewise.
> > > > >           (reduc_umin_v8hi): Likewise.
> > > > >           (reduc_smax_v16qi): Likewise.
> > > > >           (reduc_smin_v16qi): Likewise.
> > > > >           (reduc_umax_v16qi): Likewise.
> > > > >           (reduc_umin_v16qi): Likewise.
> > > > >           (reduc_smax_scal_<mode>): Likewise.
> > > > >           (reduc_smin_scal_<mode>): Likewise.
> > > > >           (reduc_umax_scal_<mode>): Likewise.
> > > > >           (reduc_umin_scal_<mode>): Likewise.
> > > > 
> > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be
> > > > used if
> > > > the _scal are present. The non-_scal's were previously defined as
> > > > producing a
> > > > vector with one element holding the result and the other elements all
> > > > zero, and
> > > > this was only ever used with a vec_extract immediately after; the _scal
> > > > pattern
> > > > now includes the vec_extract as well. Hence the non-_scal patterns are
> > > > deprecated / considered legacy, as per md.texi.
> > > 
> > > Thanks -- I had misread the description of the non-scalar versions,
> > > missing the part where the other elements are zero.  What I really
> > > want/need is an optab defined as computing the maximum value in all
> > > elements of the vector.
> > 
> > Yes, indeed. It seems reasonable to me that this would coexist with an optab
> > which computes only a single value (i.e. a scalar).
> 
> So just to clarify - you need to reduce the vector with max to a scalar
> but want the (same) result in all vector elements?

Yes.  Alan Hayward's cond-reduction patch is set up to perform a
reduction to scalar, followed by a scalar broadcast to get the value
into all positions.  It happens that our most efficient expansion to
reduce to scalar will naturally produce the value in all positions.
Using the reduc_smax_scal_<mode> expander, we are required to extract
this into a scalar and then perform the broadcast.  Those two operations
are unnecessary.  We can clean up after the fact, but the cost model
will still reflect the unnecessary activity.
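
In intrinsic terms (illustration only; reduce_to_all_lanes below is a
made-up stand-in for the rotate/max expansion, not a real function), the
redundancy looks like:

  vector signed int x;                           /* loop's vector accumulator */
  vector signed int v = reduce_to_all_lanes (x); /* max already in every lane */
  int s = vec_extract (v, 0);                    /* counted as vec_to_scalar  */
  vector signed int b = vec_splats (s);          /* counted as vec_stmt       */
  /* b is identical to v, so the extract and the splat -- and their
     cost-model entries -- buy us nothing on this target.  */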

> 
> > At that point it might be appropriate to change the cond-reduction code to
> > generate the reduce-to-vector in all cases, and optabs.c expand it to
> > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> > the (parallel) changes to cost model already proposed, does that cover all the
> > cases? It does add a new tree code, yes, but I'm feeling that could be
> > justified if we go down this route.
> 
> I'd rather have combine recognize an insn that does both (if it
> exists).  As I understand powerpc doesn't have reduction to scalar
> (I think no ISA actually has this, but all ISAs have reduce to
> one vector element plus a cheap way of extraction (effectively a subreg))
> but its reduction already has all result vector elements set to the
> same value (as opposed to having some undefined or zero or whatever).

The extraction is not necessarily so cheap.  For floating-point values
in a register, we can indeed use a subreg.  But in this case we are
talking about integers, and an extract must move them from the vector
registers to the GPRs.  With POWER8, we can do this with a 5-cycle move
instruction.  Prior to that, we can only do this with a very expensive
path through memory, as previous processors had no direct path between
the vector and integer registers.

> 
> > However, another point that springs to mind: if you reduce a loop containing
> > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these
> > using shifts, and I think will also use shifts for platforms not possessing a
> > reduc_plus/min/max. If shifts could be changed for rotates, the code there
> > would do your reduce-to-a-vector-of-identical-elements in the midend...can we
> > (sensibly!) bring all of these together?

If you think it's reasonable, I will take a look at whether this can
work.  For targets like PowerPC and AArch32, perhaps we are better off
not defining an expander at all, but letting the vector epilogue code
expose the sequence of shifts/max-mins at the GIMPLE level.  For this
case the vectorizer would then not generate the unnecessary extract and
broadcast.

However, I'd want to let Alan Hayward finish revising and committing his
patch first.

Regards,
Bill

> >
> > > Perhaps the practical thing is to have the vectorizer also do an
> > > add_stmt_cost with some new token that indicates the cost model should
> > > make an adjustment if the back end doesn't need the extract/broadcast.
> > > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > > cost, and remove the unnecessary code in simplify-rtx.
> > 
> > I think it'd be good if we could do it before simplify-rtx, really, although
> > I'm not sure I have a strong argument as to why, as long as we can cost it
> > appropriately.
> > 
> > > In any case, I will remove implementing the deprecated optabs, and I'll
> > > also try to look at Alan L's patch shortly.
> > 
> > That'd be great, thanks :)
> 
> As for the cost modeling the simple solution is of course to have
> these as "high-level" costs (a new enum entry), but the sustainable
> solution is to have the target see the vectorized code and decide
> for itself.  Iff we'd go down the simple route then the target
> may as well create the vectorized code for that special high-level
> operation itself.
> 
> Richard.
>
Segher Boessenkool Sept. 17, 2015, 4:17 p.m. UTC | #15
On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote:
> On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > So just to clarify - you need to reduce the vector with max to a scalar
> > but want the (same) result in all vector elements?
> 
> Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> reduction to scalar, followed by a scalar broadcast to get the value
> into all positions.  It happens that our most efficient expansion to
> reduce to scalar will naturally produce the value in all positions.

It also is many insns after expand, so relying on combine to combine
all that plus the following splat (as Richard suggests below) is not
really going to work.

If there also are targets where the _scal version is cheaper, maybe
we should keep both, and have expand expand to whatever the target
supports?


Segher
Bill Schmidt Sept. 17, 2015, 4:26 p.m. UTC | #16
On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote:
> On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > On Wed, 16 Sep 2015, Alan Lawrence wrote:
> > 
> > > On 16/09/15 17:10, Bill Schmidt wrote:
> > > > 
> > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> > > > > > 
> > > > > >           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > > > UNSPEC_REDUC_SMIN,
> > > > > >           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > > > >           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > > > >           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > > > >           (reduc_smax_v2di): New define_expand.
> > > > > >           (reduc_smax_scal_v2di): Likewise.
> > > > > >           (reduc_smin_v2di): Likewise.
> > > > > >           (reduc_smin_scal_v2di): Likewise.
> > > > > >           (reduc_umax_v2di): Likewise.
> > > > > >           (reduc_umax_scal_v2di): Likewise.
> > > > > >           (reduc_umin_v2di): Likewise.
> > > > > >           (reduc_umin_scal_v2di): Likewise.
> > > > > >           (reduc_smax_v4si): Likewise.
> > > > > >           (reduc_smin_v4si): Likewise.
> > > > > >           (reduc_umax_v4si): Likewise.
> > > > > >           (reduc_umin_v4si): Likewise.
> > > > > >           (reduc_smax_v8hi): Likewise.
> > > > > >           (reduc_smin_v8hi): Likewise.
> > > > > >           (reduc_umax_v8hi): Likewise.
> > > > > >           (reduc_umin_v8hi): Likewise.
> > > > > >           (reduc_smax_v16qi): Likewise.
> > > > > >           (reduc_smin_v16qi): Likewise.
> > > > > >           (reduc_umax_v16qi): Likewise.
> > > > > >           (reduc_umin_v16qi): Likewise.
> > > > > >           (reduc_smax_scal_<mode>): Likewise.
> > > > > >           (reduc_smin_scal_<mode>): Likewise.
> > > > > >           (reduc_umax_scal_<mode>): Likewise.
> > > > > >           (reduc_umin_scal_<mode>): Likewise.
> > > > > 
> > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be
> > > > > used if
> > > > > the _scal are present. The non-_scal's were previously defined as
> > > > > producing a
> > > > > vector with one element holding the result and the other elements all
> > > > > zero, and
> > > > > this was only ever used with a vec_extract immediately after; the _scal
> > > > > pattern
> > > > > now includes the vec_extract as well. Hence the non-_scal patterns are
> > > > > deprecated / considered legacy, as per md.texi.
> > > > 
> > > > Thanks -- I had misread the description of the non-scalar versions,
> > > > missing the part where the other elements are zero.  What I really
> > > > want/need is an optab defined as computing the maximum value in all
> > > > elements of the vector.
> > > 
> > > Yes, indeed. It seems reasonable to me that this would coexist with an optab
> > > which computes only a single value (i.e. a scalar).
> > 
> > So just to clarify - you need to reduce the vector with max to a scalar
> > but want the (same) result in all vector elements?
> 
> Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> reduction to scalar, followed by a scalar broadcast to get the value
> into all positions.  It happens that our most efficient expansion to
> reduce to scalar will naturally produce the value in all positions.
> Using the reduc_smax_scal_<mode> expander, we are required to extract
> this into a scalar and then perform the broadcast.  Those two operations
> are unnecessary.  We can clean up after the fact, but the cost model
> will still reflect the unnecessary activity.
> 
> > 
> > > At that point it might be appropriate to change the cond-reduction code to
> > > generate the reduce-to-vector in all cases, and optabs.c expand it to
> > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> > > the (parallel) changes to cost model already proposed, does that cover all the
> > > cases? It does add a new tree code, yes, but I'm feeling that could be
> > > justified if we go down this route.
> > 
> > I'd rather have combine recognize an insn that does both (if it
> > exists).  As I understand powerpc doesn't have reduction to scalar
> > (I think no ISA actually has this, but all ISAs have reduce to
> > one vector element plus a cheap way of extraction (effectively a subreg))
> > but its reduction already has all result vector elements set to the
> > same value (as opposed to having some undefined or zero or whatever).
> 
> The extraction is not necessarily so cheap.  For floating-point values
> in a register, we can indeed use a subreg.  But in this case we are
> talking about integers, and an extract must move them from the vector
> registers to the GPRs.  With POWER8, we can do this with a 5-cycle move
> instruction.  Prior to that, we can only do this with a very expensive
> path through memory, as previous processors had no direct path between
> the vector and integer registers.
> 
> > 
> > > However, another point that springs to mind: if you reduce a loop containing
> > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these
> > > using shifts, and I think will also use shifts for platforms not possessing a
> > > reduc_plus/min/max. If shifts could be changed for rotates, the code there
> > > would do your reduce-to-a-vector-of-identical-elements in the midend...can we
> > > (sensibly!) bring all of these together?
> 
> If you think it's reasonable, I will take a look at whether this can
> work.  For targets like PowerPC and AArch32, perhaps we are better off
> not defining an expander at all, but letting the vector epilogue code
> expose the sequence of shifts/max-mins at the GIMPLE level.  For this
> case the vectorizer would then not generate the unnecessary extract and
> broadcast.

Well, I don't see an easy way to do this either.  We really need either
a whole-vector rotate or a double-vector shift to do the expansion
early, neither of which has a defined standard pattern name.  So either
way we would need to introduce a new GIMPLE code and a new optab entry.
I'm still leaning towards interest in exploring the reduce-to-vector
approach.

Thanks,
Bill

> 
> However, I'd want to let Alan Hayward finish revising and committing his
> patch first.
> 
> Regards,
> Bill
> 
> > >
> > > > Perhaps the practical thing is to have the vectorizer also do an
> > > > add_stmt_cost with some new token that indicates the cost model should
> > > > make an adjustment if the back end doesn't need the extract/broadcast.
> > > > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > > > cost, and remove the unnecessary code in simplify-rtx.
> > > 
> > > I think it'd be good if we could do it before simplify-rtx, really, although
> > > I'm not sure I have a strong argument as to why, as long as we can cost it
> > > appropriately.
> > > 
> > > > In any case, I will remove implementing the deprecated optabs, and I'll
> > > > also try to look at Alan L's patch shortly.
> > > 
> > > That'd be great, thanks :)
> > 
> > As for the cost modeling the simple solution is of course to have
> > these as "high-level" costs (a new enum entry), but the sustainable
> > solution is to have the target see the vectorized code and decide
> > for itself.  Iff we'd go down the simple route then the target
> > may as well create the vectorized code for that special high-level
> > operation itself.
> > 
> > Richard.
> > 
>
Richard Biener Sept. 18, 2015, 8:35 a.m. UTC | #17
On Thu, 17 Sep 2015, Bill Schmidt wrote:

> On Thu, 2015-09-17 at 09:18 -0500, Bill Schmidt wrote:
> > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > > On Wed, 16 Sep 2015, Alan Lawrence wrote:
> > > 
> > > > On 16/09/15 17:10, Bill Schmidt wrote:
> > > > > 
> > > > > On Wed, 2015-09-16 at 16:29 +0100, Alan Lawrence wrote:
> > > > > > On 16/09/15 15:28, Bill Schmidt wrote:
> > > > > > > 2015-09-16  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > >           * config/rs6000/altivec.md (UNSPEC_REDUC_SMAX,
> > > > > > > UNSPEC_REDUC_SMIN,
> > > > > > >           UNSPEC_REDUC_UMAX, UNSPEC_REDUC_UMIN, UNSPEC_REDUC_SMAX_SCAL,
> > > > > > >           UNSPEC_REDUC_SMIN_SCAL, UNSPEC_REDUC_UMAX_SCAL,
> > > > > > >           UNSPEC_REDUC_UMIN_SCAL): New enumerated constants.
> > > > > > >           (reduc_smax_v2di): New define_expand.
> > > > > > >           (reduc_smax_scal_v2di): Likewise.
> > > > > > >           (reduc_smin_v2di): Likewise.
> > > > > > >           (reduc_smin_scal_v2di): Likewise.
> > > > > > >           (reduc_umax_v2di): Likewise.
> > > > > > >           (reduc_umax_scal_v2di): Likewise.
> > > > > > >           (reduc_umin_v2di): Likewise.
> > > > > > >           (reduc_umin_scal_v2di): Likewise.
> > > > > > >           (reduc_smax_v4si): Likewise.
> > > > > > >           (reduc_smin_v4si): Likewise.
> > > > > > >           (reduc_umax_v4si): Likewise.
> > > > > > >           (reduc_umin_v4si): Likewise.
> > > > > > >           (reduc_smax_v8hi): Likewise.
> > > > > > >           (reduc_smin_v8hi): Likewise.
> > > > > > >           (reduc_umax_v8hi): Likewise.
> > > > > > >           (reduc_umin_v8hi): Likewise.
> > > > > > >           (reduc_smax_v16qi): Likewise.
> > > > > > >           (reduc_smin_v16qi): Likewise.
> > > > > > >           (reduc_umax_v16qi): Likewise.
> > > > > > >           (reduc_umin_v16qi): Likewise.
> > > > > > >           (reduc_smax_scal_<mode>): Likewise.
> > > > > > >           (reduc_smin_scal_<mode>): Likewise.
> > > > > > >           (reduc_umax_scal_<mode>): Likewise.
> > > > > > >           (reduc_umin_scal_<mode>): Likewise.
> > > > > > 
> > > > > > You shouldn't need the non-_scal reductions. Indeed, they shouldn't be
> > > > > > used if
> > > > > > the _scal are present. The non-_scal's were previously defined as
> > > > > > producing a
> > > > > > vector with one element holding the result and the other elements all
> > > > > > zero, and
> > > > > > this was only ever used with a vec_extract immediately after; the _scal
> > > > > > pattern
> > > > > > now includes the vec_extract as well. Hence the non-_scal patterns are
> > > > > > deprecated / considered legacy, as per md.texi.
> > > > > 
> > > > > Thanks -- I had misread the description of the non-scalar versions,
> > > > > missing the part where the other elements are zero.  What I really
> > > > > want/need is an optab defined as computing the maximum value in all
> > > > > elements of the vector.
> > > > 
> > > > Yes, indeed. It seems reasonable to me that this would coexist with an optab
> > > > which computes only a single value (i.e. a scalar).
> > > 
> > > So just to clarify - you need to reduce the vector with max to a scalar
> > > but want the (same) result in all vector elements?
> > 
> > Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> > reduction to scalar, followed by a scalar broadcast to get the value
> > into all positions.  It happens that our most efficient expansion to
> > reduce to scalar will naturally produce the value in all positions.
> > Using the reduc_smax_scal_<mode> expander, we are required to extract
> > this into a scalar and then perform the broadcast.  Those two operations
> > are unnecessary.  We can clean up after the fact, but the cost model
> > will still reflect the unnecessary activity.
> > 
> > > 
> > > > At that point it might be appropriate to change the cond-reduction code to
> > > > generate the reduce-to-vector in all cases, and optabs.c expand it to
> > > > reduce-to-scalar + broadcast if reduce-to-vector was not available. Along with
> > > > the (parallel) changes to cost model already proposed, does that cover all the
> > > > cases? It does add a new tree code, yes, but I'm feeling that could be
> > > > justified if we go down this route.
> > > 
> > > I'd rather have combine recognize an insn that does both (if it
> > > exists).  As I understand powerpc doesn't have reduction to scalar
> > > (I think no ISA actually has this, but all ISAs have reduce to
> > > one vector element plus a cheap way of extraction (effectively a subreg))
> > > but its reduction already has all result vector elements set to the
> > > same value (as opposed to having some undefined or zero or whatever).
> > 
> > The extraction is not necessarily so cheap.  For floating-point values
> > in a register, we can indeed use a subreg.  But in this case we are
> > talking about integers, and an extract must move them from the vector
> > registers to the GPRs.  With POWER8, we can do this with a 5-cycle move
> > instruction.  Prior to that, we can only do this with a very expensive
> > path through memory, as previous processors had no direct path between
> > the vector and integer registers.
> > 
> > > 
> > > > However, another point that springs to mind: if you reduce a loop containing
> > > > OR or MUL expressions, the vect_create_epilog_for_reduction reduces these
> > > > using shifts, and I think will also use shifts for platforms not possessing a
> > > > reduc_plus/min/max. If shifts could be changed for rotates, the code there
> > > > would do your reduce-to-a-vector-of-identical-elements in the midend...can we
> > > > (sensibly!) bring all of these together?
> > 
> > If you think it's reasonable, I will take a look at whether this can
> > work.  For targets like PowerPC and AArch32, perhaps we are better off
> > not defining an expander at all, but letting the vector epilogue code
> > expose the sequence of shifts/max-mins at the GIMPLE level.  For this
> > case the vectorizer would then not generate the unnecessary extract and
> > broadcast.
> 
> Well, I don't see an easy way to do this either.  We really need either
> a whole-vector rotate or a double-vector shift to do the expansion
> early, neither of which has a defined standard pattern name.  So either
> way we would need to introduce a new GIMPLE code and a new optab entry.
> I'm still leaning towards interest in exploring the reduce-to-vector
> approach.

Btw, we ditched the original reduce-to-vector variant due to its
endianness issues (it only had _one_ element of the vector contain
the reduction result).  Re-introducing reduce-to-vector but with
the reduction result in all elements wouldn't have any issue like
that.

Of course it would introduce a third optab variant similar to the
old (now legacy) version but with different semantics...

Richard.

> Thanks,
> Bill
> 
> > 
> > However, I'd want to let Alan Hayward finish revising and committing his
> > patch first.
> > 
> > Regards,
> > Bill
> > 
> > > >
> > > > > Perhaps the practical thing is to have the vectorizer also do an
> > > > > add_stmt_cost with some new token that indicates the cost model should
> > > > > make an adjustment if the back end doesn't need the extract/broadcast.
> > > > > Targets like PowerPC and AArch32 could then subtract the unnecessary
> > > > > cost, and remove the unnecessary code in simplify-rtx.
> > > > 
> > > > I think it'd be good if we could do it before simplify-rtx, really, although
> > > > I'm not sure I have a strong argument as to why, as long as we can cost it
> > > > appropriately.
> > > > 
> > > > > In any case, I will remove implementing the deprecated optabs, and I'll
> > > > > also try to look at Alan L's patch shortly.
> > > > 
> > > > That'd be great, thanks :)
> > > 
> > > As for the cost modeling, the simple solution is of course to have
> > > these as "high-level" costs (a new enum entry), but the sustainable
> > > solution is to have the target see the vectorized code and decide
> > > for itself.  Iff we'd go down the simple route then the target
> > > may as well create the vectorized code for that special high-level
> > > operation itself.
> > > 
> > > Richard.
> > > 
> > 
> 
> 
>
Richard Biener Sept. 18, 2015, 8:38 a.m. UTC | #18
On Thu, 17 Sep 2015, Segher Boessenkool wrote:

> On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote:
> > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > > So just to clarify - you need to reduce the vector with max to a scalar
> > > but want the (same) result in all vector elements?
> > 
> > Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> > reduction to scalar, followed by a scalar broadcast to get the value
> > into all positions.  It happens that our most efficient expansion to
> > reduce to scalar will naturally produce the value in all positions.
> 
> It also is many insns after expand, so relying on combine to combine
> all that plus the following splat (as Richard suggests below) is not
> really going to work.
> 
> If there also are targets where the _scal version is cheaper, maybe
> we should keep both, and have expand expand to whatever the target
> supports?

Wait .. so you don't actually have an instruction to do, say,
REDUC_MAX_EXPR (neither to scalar nor to vector)?  Then it's better
to _not_ define such pattern and let the vectorizer generate
its fallback code.  If the fallback code isn't "best" then better
think of a way to make it choose the best variant out of its
available ones (and maybe add another).  I think it tests
availability of the building blocks for the variants and simply
picks the first that works without checking the cost model.

Richard.

> 
> Segher
> 
>
Bill Schmidt Sept. 18, 2015, 1:07 p.m. UTC | #19
On Fri, 2015-09-18 at 10:38 +0200, Richard Biener wrote:
> On Thu, 17 Sep 2015, Segher Boessenkool wrote:
> 
> > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote:
> > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > > > So just to clarify - you need to reduce the vector with max to a scalar
> > > > but want the (same) result in all vector elements?
> > > 
> > > Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> > > reduction to scalar, followed by a scalar broadcast to get the value
> > > into all positions.  It happens that our most efficient expansion to
> > > reduce to scalar will naturally produce the value in all positions.
> > 
> > It also is many insns after expand, so relying on combine to combine
> > all that plus the following splat (as Richard suggests below) is not
> > really going to work.
> > 
> > If there also are targets where the _scal version is cheaper, maybe
> > we should keep both, and have expand expand to whatever the target
> > supports?
> 
> Wait .. so you don't actually have an instruction to do, say,
> REDUC_MAX_EXPR (neither to scalar nor to vector)?  Then it's better
> to _not_ define such pattern and let the vectorizer generate
> its fallback code.  If the fallback code isn't "best" then better
> think of a way to make it choose the best variant out of its
> available ones (and maybe add another).  I think it tests
> availability of the building blocks for the variants and simply
> picks the first that works without checking the cost model.

That's what we were considering per Alan Lawrence's suggestion elsewhere
in this thread, but there isn't currently a way to represent a
whole-vector rotate in gimple.  So we'd either have to add that or fall
back to an inferior code sequence, I believe.

Bill

> 
> Richard.
> 
> > 
> > Segher
> > 
> > 
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>
Richard Biener Sept. 18, 2015, 1:15 p.m. UTC | #20
On Fri, 18 Sep 2015, Bill Schmidt wrote:

> On Fri, 2015-09-18 at 10:38 +0200, Richard Biener wrote:
> > On Thu, 17 Sep 2015, Segher Boessenkool wrote:
> > 
> > > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote:
> > > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > > > > So just to clarify - you need to reduce the vector with max to a scalar
> > > > > but want the (same) result in all vector elements?
> > > > 
> > > > Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> > > > reduction to scalar, followed by a scalar broadcast to get the value
> > > > into all positions.  It happens that our most efficient expansion to
> > > > reduce to scalar will naturally produce the value in all positions.
> > > 
> > > It also is many insns after expand, so relying on combine to combine
> > > all that plus the following splat (as Richard suggests below) is not
> > > really going to work.
> > > 
> > > If there also are targets where the _scal version is cheaper, maybe
> > > we should keep both, and have expand expand to whatever the target
> > > supports?
> > 
> > Wait .. so you don't actually have an instruction to do, say,
> > REDUC_MAX_EXPR (neither to scalar nor to vector)?  Then it's better
> > to _not_ define such pattern and let the vectorizer generate
> > its fallback code.  If the fallback code isn't "best" then better
> > think of a way to make it choose the best variant out of its
> > available ones (and maybe add another).  I think it tests
> > availability of the building blocks for the variants and simply
> > picks the first that works without checking the cost model.
> 
> That's what we were considering per Alan Lawrence's suggestion elsewhere
> in this thread, but there isn't currently a way to represent a
> whole-vector rotate in gimple.  So we'd either have to add that or fall
> back to an inferior code sequence, I believe.

A whole-vector rotate is just a VEC_PERM with a proper constant mask.
Of course the target would have to detect these cases and use
vector rotate instructions (x86 does that for example).
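
For illustration (not part of the patch): a whole-vector rotate written as a constant permute at the C level.  GCC expands __builtin_shuffle to VEC_PERM_EXPR, and a back end can recognise the constant selector as a rotate (e.g. vsldoi on PowerPC).

typedef int v4si __attribute__ ((vector_size (16)));

/* Rotate the whole vector by one element via a constant permute.  */
v4si
rotate_left_one (v4si v)
{
  const v4si sel = { 1, 2, 3, 0 };
  return __builtin_shuffle (v, sel);
}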

Richard.
Bill Schmidt Sept. 18, 2015, 1:52 p.m. UTC | #21
On Fri, 2015-09-18 at 15:15 +0200, Richard Biener wrote:
> On Fri, 18 Sep 2015, Bill Schmidt wrote:
> 
> > On Fri, 2015-09-18 at 10:38 +0200, Richard Biener wrote:
> > > On Thu, 17 Sep 2015, Segher Boessenkool wrote:
> > > 
> > > > On Thu, Sep 17, 2015 at 09:18:42AM -0500, Bill Schmidt wrote:
> > > > > On Thu, 2015-09-17 at 09:39 +0200, Richard Biener wrote:
> > > > > > So just to clarify - you need to reduce the vector with max to a scalar
> > > > > > but want the (same) result in all vector elements?
> > > > > 
> > > > > Yes.  Alan Hayward's cond-reduction patch is set up to perform a
> > > > > reduction to scalar, followed by a scalar broadcast to get the value
> > > > > into all positions.  It happens that our most efficient expansion to
> > > > > reduce to scalar will naturally produce the value in all positions.
> > > > 
> > > > It also is many insns after expand, so relying on combine to combine
> > > > all that plus the following splat (as Richard suggests below) is not
> > > > really going to work.
> > > > 
> > > > If there also are targets where the _scal version is cheaper, maybe
> > > > we should keep both, and have expand expand to whatever the target
> > > > supports?
> > > 
> > > Wait .. so you don't actually have an instruction to do, say,
> > > REDUC_MAX_EXPR (neither to scalar nor to vector)?  Then it's better
> > > to _not_ define such pattern and let the vectorizer generate
> > > its fallback code.  If the fallback code isn't "best" then better
> > > think of a way to make it choose the best variant out of its
> > > available ones (and maybe add another).  I think it tests
> > > availability of the building blocks for the variants and simply
> > > picks the first that works without checking the cost model.
> > 
> > That's what we were considering per Alan Lawrence's suggestion elsewhere
> > in this thread, but there isn't currently a way to represent a
> > whole-vector rotate in gimple.  So we'd either have to add that or fall
> > back to an inferior code sequence, I believe.
> 
> A whole-vector rotate is just a VEC_PERM with a proper constant mask.
> Of course the target would have to detect these cases and use
> vector rotate instructions (x86 does that for example).

Hm, yes, that's right.  And we should already have those special-permute
recognitions in place; I had just forgotten about them.  Ok, I agree
this is probably the best approach, then.

I'll have to refresh my memory on Alan H's patch, but we may need to add
logic to do this sort of epilogue expansion for his new reduction.  I
think right now it is just giving up if the REDUC_MAX_EXPR isn't
supported.

Thanks,
Bill


> 
> Richard.
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
>
Alan Lawrence Sept. 18, 2015, 3:39 p.m. UTC | #22
On 18/09/15 09:35, Richard Biener wrote:
>
> Btw, we ditched the original reduce-to-vector variant due to its
> endianness issues (it only had _one_ element of the vector contain
> the reduction result).  Re-introducing reduce-to-vector but with
> the reduction result in all elements wouldn't have any issue like
> that.
>
> Of course it would introduce a third optab variant similar to the
> old (now legacy) version but with different semantics...

I am working (again) on finishing this migration, and think I have fixed the 
endianness issues on PowerPC, patch posted just now.  MIPS/Loongson, I'm still 
struggling with; AFAICT, they implement their own reduc_ pattern for most cases, 
which for the most part repeats the midend's approach using shifts, but the 
first step uses a funky permute, as they have a faster instruction for this...

--Alan
diff mbox

Patch

Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md	(revision 227817)
+++ gcc/config/rs6000/altivec.md	(working copy)
@@ -87,6 +87,14 @@ 
    UNSPEC_GET_VRSAVE
    UNSPEC_LVX
    UNSPEC_REDUC_PLUS
+   UNSPEC_REDUC_SMAX
+   UNSPEC_REDUC_SMIN
+   UNSPEC_REDUC_UMAX
+   UNSPEC_REDUC_UMIN
+   UNSPEC_REDUC_SMAX_SCAL
+   UNSPEC_REDUC_SMIN_SCAL
+   UNSPEC_REDUC_UMAX_SCAL
+   UNSPEC_REDUC_UMIN_SCAL
    UNSPEC_VECSH
    UNSPEC_EXTEVEN_V4SI
    UNSPEC_EXTEVEN_V8HI
@@ -2690,6 +2698,430 @@ 
   DONE;
 })
 
+;; Strategy used here guarantees that after round N, the maximum value
+;; will appear in 2^N adjacent positions (where adjacent can include the
+;; lowest and highest positions being adjacent).  Thus after N = log E
+;; rounds, where E is the number of vector elements, the maximum value
+;; will appear in every position in the vector.
+;;
+;; Note that reduc_s{max,min}_v{2d,4s}f already exist in vector.md.
+;;
+(define_expand "reduc_smax_v2di"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX))]
+  "TARGET_P8_VECTOR"
+{
+  rtx shifted = gen_reg_rtx (V2DImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_smaxv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_smax_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_smax_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_smin_v2di"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN))]
+  "TARGET_P8_VECTOR"
+{
+  rtx shifted = gen_reg_rtx (V2DImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_sminv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_smin_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_smin_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umax_v2di"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX))]
+  "TARGET_P8_VECTOR"
+{
+  rtx shifted = gen_reg_rtx (V2DImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_umaxv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_umax_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_umax_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umin_v2di"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN))]
+  "TARGET_P8_VECTOR"
+{
+  rtx shifted = gen_reg_rtx (V2DImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v2di (shifted, op1, op1, GEN_INT (8)));
+  emit_insn (gen_uminv2di3 (operands[0], op1, shifted));
+  DONE;
+})
+
+(define_expand "reduc_umin_scal_v2di"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN_SCAL))]
+  "TARGET_P8_VECTOR"
+{
+  rtx reduc = gen_reg_rtx (V2DImode);
+  emit_insn (gen_reduc_umin_v2di (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_smax_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx max1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_smaxv4si3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, max1, max1, GEN_INT (8)));
+  emit_insn (gen_smaxv4si3 (operands[0], max1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_smin_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx min1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_sminv4si3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, min1, min1, GEN_INT (8)));
+  emit_insn (gen_sminv4si3 (operands[0], min1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_umax_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx max1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_umaxv4si3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, max1, max1, GEN_INT (8)));
+  emit_insn (gen_umaxv4si3 (operands[0], max1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_umin_v4si"
+  [(set (match_operand:V4SI 0 "register_operand" "=v")
+        (unspec:V4SI [(match_operand:V4SI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V4SImode);
+  rtx shifted2 = gen_reg_rtx (V4SImode);
+  rtx min1 = gen_reg_rtx (V4SImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v4si (shifted1, op1, op1, GEN_INT (4)));
+  emit_insn (gen_uminv4si3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v4si (shifted2, min1, min1, GEN_INT (8)));
+  emit_insn (gen_uminv4si3 (operands[0], min1, shifted2));
+  DONE;
+})
+
+(define_expand "reduc_smax_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx max1 = gen_reg_rtx (V8HImode);
+  rtx max2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_smaxv8hi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, max1, max1, GEN_INT (4)));
+  emit_insn (gen_smaxv8hi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, max2, max2, GEN_INT (8)));
+  emit_insn (gen_smaxv8hi3 (operands[0], max2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_smin_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx min1 = gen_reg_rtx (V8HImode);
+  rtx min2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_sminv8hi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, min1, min1, GEN_INT (4)));
+  emit_insn (gen_sminv8hi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, min2, min2, GEN_INT (8)));
+  emit_insn (gen_sminv8hi3 (operands[0], min2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_umax_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx max1 = gen_reg_rtx (V8HImode);
+  rtx max2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_umaxv8hi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, max1, max1, GEN_INT (4)));
+  emit_insn (gen_umaxv8hi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, max2, max2, GEN_INT (8)));
+  emit_insn (gen_umaxv8hi3 (operands[0], max2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_umin_v8hi"
+  [(set (match_operand:V8HI 0 "register_operand" "=v")
+        (unspec:V8HI [(match_operand:V8HI 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V8HImode);
+  rtx shifted2 = gen_reg_rtx (V8HImode);
+  rtx shifted3 = gen_reg_rtx (V8HImode);
+  rtx min1 = gen_reg_rtx (V8HImode);
+  rtx min2 = gen_reg_rtx (V8HImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted1, op1, op1, GEN_INT (2)));
+  emit_insn (gen_uminv8hi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted2, min1, min1, GEN_INT (4)));
+  emit_insn (gen_uminv8hi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v8hi (shifted3, min2, min2, GEN_INT (8)));
+  emit_insn (gen_uminv8hi3 (operands[0], min2, shifted3));
+  DONE;
+})
+
+(define_expand "reduc_smax_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_SMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx max1 = gen_reg_rtx (V16QImode);
+  rtx max2 = gen_reg_rtx (V16QImode);
+  rtx max3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_smaxv16qi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, max1, max1, GEN_INT (2)));
+  emit_insn (gen_smaxv16qi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, max2, max2, GEN_INT (4)));
+  emit_insn (gen_smaxv16qi3 (max3, max2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, max3, max3, GEN_INT (8)));
+  emit_insn (gen_smaxv16qi3 (operands[0], max3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_smin_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_SMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx min1 = gen_reg_rtx (V16QImode);
+  rtx min2 = gen_reg_rtx (V16QImode);
+  rtx min3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_sminv16qi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, min1, min1, GEN_INT (2)));
+  emit_insn (gen_sminv16qi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, min2, min2, GEN_INT (4)));
+  emit_insn (gen_sminv16qi3 (min3, min2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, min3, min3, GEN_INT (8)));
+  emit_insn (gen_sminv16qi3 (operands[0], min3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_umax_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_UMAX))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx max1 = gen_reg_rtx (V16QImode);
+  rtx max2 = gen_reg_rtx (V16QImode);
+  rtx max3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_umaxv16qi3 (max1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, max1, max1, GEN_INT (2)));
+  emit_insn (gen_umaxv16qi3 (max2, max1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, max2, max2, GEN_INT (4)));
+  emit_insn (gen_umaxv16qi3 (max3, max2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, max3, max3, GEN_INT (8)));
+  emit_insn (gen_umaxv16qi3 (operands[0], max3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_umin_v16qi"
+  [(set (match_operand:V16QI 0 "register_operand" "=v")
+        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")]
+                      UNSPEC_REDUC_UMIN))]
+  "TARGET_ALTIVEC"
+{
+  rtx shifted1 = gen_reg_rtx (V16QImode);
+  rtx shifted2 = gen_reg_rtx (V16QImode);
+  rtx shifted3 = gen_reg_rtx (V16QImode);
+  rtx shifted4 = gen_reg_rtx (V16QImode);
+  rtx min1 = gen_reg_rtx (V16QImode);
+  rtx min2 = gen_reg_rtx (V16QImode);
+  rtx min3 = gen_reg_rtx (V16QImode);
+  rtx op1 = operands[1];
+
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted1, op1, op1, const1_rtx));
+  emit_insn (gen_uminv16qi3 (min1, op1, shifted1));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted2, min1, min1, GEN_INT (2)));
+  emit_insn (gen_uminv16qi3 (min2, min1, shifted2));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted3, min2, min2, GEN_INT (4)));
+  emit_insn (gen_uminv16qi3 (min3, min2, shifted3));
+  emit_insn (gen_altivec_vsldoi_v16qi (shifted4, min3, min3, GEN_INT (8)));
+  emit_insn (gen_uminv16qi3 (operands[0], min3, shifted4));
+  DONE;
+})
+
+(define_expand "reduc_smax_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMAX_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_smax_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_smin_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                     UNSPEC_REDUC_SMIN_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_smin_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umax_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMAX_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_umax_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
+(define_expand "reduc_umin_scal_<mode>"
+  [(set (match_operand:<VI_scalar> 0 "register_operand" "=r")
+        (unspec:VI [(match_operand:<MODE> 1 "register_operand" "v")]
+                     UNSPEC_REDUC_UMIN_SCAL))]
+  "TARGET_ALTIVEC"
+{
+  rtx reduc = gen_reg_rtx (<MODE>mode);
+  emit_insn (gen_reduc_umin_<mode> (reduc, operands[1]));
+  rs6000_expand_vector_extract (operands[0], reduc, 0);
+  DONE;
+})
+
 (define_expand "neg<mode>2"
   [(use (match_operand:VI 0 "register_operand" ""))
    (use (match_operand:VI 1 "register_operand" ""))]
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-char.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-char.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-char.c	(working copy)
@@ -0,0 +1,85 @@ 
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsb" 6 } } */
+/* { dg-final { scan-assembler-times "vminsb" 6 } } */
+/* { dg-final { scan-assembler-times "vmaxub" 6 } } */
+/* { dg-final { scan-assembler-times "vminub" 6 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 16 } } */
+/* { dg-final { scan-assembler-times "stxvd2x" 4 } } */
+/* { dg-final { scan-assembler-times "lbz" 4 } } */
+/* { dg-final { scan-assembler-times "extsb" 2 } } */
+
+/* Test maximum and minimum reduction operations for V16QImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by four shift-max/min
+   pairs to obtain the maximum value in all positions.  An arbitrary
+   one of these is extracted using stxvd2x and lbz; in the signed
+   case, the lbz is followed by extsb. */
+
+#define N 32
+
+typedef unsigned char T;
+typedef signed char T2;
+
+T 
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2 
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-int.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-int.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-int.c	(working copy)
@@ -0,0 +1,85 @@ 
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsw" 4 } } */
+/* { dg-final { scan-assembler-times "vminsw" 4 } } */
+/* { dg-final { scan-assembler-times "vmaxuw" 4 } } */
+/* { dg-final { scan-assembler-times "vminuw" 4 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 8 } } */
+/* { dg-final { scan-assembler-times "stxvd2x" 4 } } */
+/* { dg-final { scan-assembler-times "lwz" 2 } } */
+/* { dg-final { scan-assembler-times "lwa" 2 } } */
+
+/* Test maximum and minimum reduction operations for V4SImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by two shift-max/min
+   pairs to obtain the maximum value in all positions.  An arbitrary
+   one of these is extracted using stxvd2x and either lwz (unsigned)
+   or lwa (signed). */
+
+#define N 8
+
+typedef unsigned int T;
+typedef signed int T2;
+
+T 
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2 
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-long.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-long.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-long.c	(working copy)
@@ -0,0 +1,82 @@ 
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsd" 3 } } */
+/* { dg-final { scan-assembler-times "vminsd" 3 } } */
+/* { dg-final { scan-assembler-times "vmaxud" 3 } } */
+/* { dg-final { scan-assembler-times "vminud" 3 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 4 } } */
+/* { dg-final { scan-assembler-times "mfvsrd" 4 } } */
+
+/* Test maximum and minimum reduction operations for V2DImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by a shift and
+   max/min to obtain the maximum value in both positions.  An
+   arbitrary one of these is extracted using mfvsrd.  */
+
+#define N 4
+
+typedef unsigned long long T;
+typedef signed long long T2;
+
+T 
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2 
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
Index: gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-short.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-short.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vect-reduc-minmax-short.c	(working copy)
@@ -0,0 +1,85 @@ 
+/* { dg-do compile { target { powerpc64*-*-* } } } */
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O3" } */
+/* { dg-final { scan-assembler-times "vmaxsh" 5 } } */
+/* { dg-final { scan-assembler-times "vminsh" 5 } } */
+/* { dg-final { scan-assembler-times "vmaxuh" 5 } } */
+/* { dg-final { scan-assembler-times "vminuh" 5 } } */
+/* { dg-final { scan-assembler-times "vsldoi" 12 } } */
+/* { dg-final { scan-assembler-times "stxvd2x" 4 } } */
+/* { dg-final { scan-assembler-times "lhz" 2 } } */
+/* { dg-final { scan-assembler-times "lha" 2 } } */
+
+/* Test maximum and minimum reduction operations for V8HImode.
+   This code will be unrolled and SLP vectorized.  The numbers
+   of instructions expected are obtained as follows.  There will be
+   two max or min instructions to compare the broadcast init vector
+   with the two source vectors in lc[], followed by three shift-max/min
+   pairs to obtain the maximum value in all positions.  An arbitrary
+   one of these is extracted using stxvd2x and either lhz (unsigned)
+   or lha (signed).  */
+
+#define N 16
+
+typedef unsigned short T;
+typedef signed short T2;
+
+T 
+testmax (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T
+testmin (const T *c, T init, T result)
+{
+  T lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2 
+testmax2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum < lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}
+
+T2
+testmin2 (const T2 *c, T2 init, T2 result)
+{
+  T2 lc[N], accum = init;
+  int i;
+
+  __builtin_memcpy (lc, c, sizeof(lc));
+
+  for (i = 0; i < N; i++) {
+    accum = accum > lc[i] ? lc[i] : accum;
+  }
+
+  return accum;
+}