
Add optabs for common types of permutation

Message ID: 87shdnwpnr.fsf@linaro.org
State: New
Series: Add optabs for common types of permutation

Commit Message

Richard Sandiford Nov. 9, 2017, 1:24 p.m. UTC
...so that we can use them for variable-length vectors.  For now
constant-length vectors continue to use VEC_PERM_EXPR and the
vec_perm_const optab even for cases that the new optabs could
handle.
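
For reference, on two 4-element inputs a = { a0, a1, a2, a3 } and
b = { b0, b1, b2, b3 } the new operations compute roughly the following
(the element count is purely illustrative -- for SVE it isn't a
compile-time constant -- and the exact lo/hi split depends on endianness):

  vec_interleave_lo (a, b) -> { a0, b0, a1, b1 }
  vec_interleave_hi (a, b) -> { a2, b2, a3, b3 }
  vec_extract_even (a, b)  -> { a0, a2, b0, b2 }
  vec_extract_odd (a, b)   -> { a1, a3, b1, b3 }
  vec_reverse (a)          -> { a3, a2, a1, a0 }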

The vector optabs are inconsistent about whether there should be
an underscore before the mode part of the name, but the other lo/hi
optabs have one.

Doing this means that we're able to optimise some SLP tests using
non-SLP (for now) on targets with variable-length vectors, so the
patch needs to add a few XFAILs.  Most of these go away with later
patches.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-09  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (vec_reverse, vec_interleave_lo, vec_interleave_hi)
	(vec_extract_even, vec_extract_odd): Document new optabs.
	* internal-fn.def (VEC_INTERLEAVE_LO, VEC_INTERLEAVE_HI)
	(VEC_EXTRACT_EVEN, VEC_EXTRACT_ODD, VEC_REVERSE): New internal
	functions.
	* optabs.def (vec_interleave_lo_optab, vec_interleave_hi_optab)
	(vec_extract_even_optab, vec_extract_odd_optab, vec_reverse_optab):
	New optabs.
	* tree-vect-data-refs.c: Include internal-fn.h.
	(vect_grouped_store_supported): Try using IFN_VEC_INTERLEAVE_{LO,HI}.
	(vect_permute_store_chain): Use them here too.
	(vect_grouped_load_supported): Try using IFN_VEC_EXTRACT_{EVEN,ODD}.
	(vect_permute_load_chain): Use them here too.
	* tree-vect-stmts.c (can_reverse_vector_p): New function.
	(get_negative_load_store_type): Use it.
	(reverse_vector): New function.
	(vectorizable_store, vectorizable_load): Use it.
	* config/aarch64/iterators.md (perm_optab): New iterator.
	* config/aarch64/aarch64-sve.md (<perm_optab>_<mode>): New expander.
	(vec_reverse_<mode>): Likewise.

gcc/testsuite/
	* gcc.dg/vect/no-vfa-vect-depend-2.c: Remove XFAIL.
	* gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
	* gcc.dg/vect/pr33953.c: XFAIL for vect_variable_length.
	* gcc.dg/vect/pr68445.c: Likewise.
	* gcc.dg/vect/slp-12a.c: Likewise.
	* gcc.dg/vect/slp-13-big-array.c: Likewise.
	* gcc.dg/vect/slp-13.c: Likewise.
	* gcc.dg/vect/slp-14.c: Likewise.
	* gcc.dg/vect/slp-15.c: Likewise.
	* gcc.dg/vect/slp-42.c: Likewise.
	* gcc.dg/vect/slp-multitypes-2.c: Likewise.
	* gcc.dg/vect/slp-multitypes-4.c: Likewise.
	* gcc.dg/vect/slp-multitypes-5.c: Likewise.
	* gcc.dg/vect/slp-reduc-4.c: Likewise.
	* gcc.dg/vect/slp-reduc-7.c: Likewise.
	* gcc.target/aarch64/sve_vec_perm_2.c: New test.
	* gcc.target/aarch64/sve_vec_perm_2_run.c: Likewise.
	* gcc.target/aarch64/sve_vec_perm_3.c: New test.
	* gcc.target/aarch64/sve_vec_perm_3_run.c: Likewise.
	* gcc.target/aarch64/sve_vec_perm_4.c: New test.
	* gcc.target/aarch64/sve_vec_perm_4_run.c: Likewise.

Comments

Jeff Law Nov. 19, 2017, 11:56 p.m. UTC | #1
On 11/09/2017 06:24 AM, Richard Sandiford wrote:
> ...so that we can use them for variable-length vectors.  For now
> constant-length vectors continue to use VEC_PERM_EXPR and the
> vec_perm_const optab even for cases that the new optabs could
> handle.
> 
> The vector optabs are inconsistent about whether there should be
> an underscore before the mode part of the name, but the other lo/hi
> optabs have one.
> 
> Doing this means that we're able to optimise some SLP tests using
> non-SLP (for now) on targets with variable-length vectors, so the
> patch needs to add a few XFAILs.  Most of these go away with later
> patches.
> 
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linus-gnu.  OK to install?
> 
> Richard
> 
> 
> [...]
OK.
jeff
Richard Biener Nov. 20, 2017, 10:46 a.m. UTC | #2
On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>> ...so that we can use them for variable-length vectors.  For now
>> constant-length vectors continue to use VEC_PERM_EXPR and the
>> vec_perm_const optab even for cases that the new optabs could
>> handle.
>>
>> The vector optabs are inconsistent about whether there should be
>> an underscore before the mode part of the name, but the other lo/hi
>> optabs have one.
>>
>> Doing this means that we're able to optimise some SLP tests using
>> non-SLP (for now) on targets with variable-length vectors, so the
>> patch needs to add a few XFAILs.  Most of these go away with later
>> patches.
>>
>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>> and powerpc64le-linus-gnu.  OK to install?
>>
>> Richard
>>
>>
>> [...]
> OK.

It's really a step backwards - we had those optabs and a tree code in the
past, and canonicalizing things to VEC_PERM_EXPR made things simpler.

Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> work?

:/

Richard.

> jeff
Richard Sandiford Nov. 20, 2017, 12:35 p.m. UTC | #3
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>> ...so that we can use them for variable-length vectors.  For now
>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>> vec_perm_const optab even for cases that the new optabs could
>>> handle.
>>>
>>> The vector optabs are inconsistent about whether there should be
>>> an underscore before the mode part of the name, but the other lo/hi
>>> optabs have one.
>>>
>>> Doing this means that we're able to optimise some SLP tests using
>>> non-SLP (for now) on targets with variable-length vectors, so the
>>> patch needs to add a few XFAILs.  Most of these go away with later
>>> patches.
>>>
>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>> and powerpc64le-linus-gnu.  OK to install?
>>>
>>> Richard
>>>
>>>
>>> [...]
>> OK.
>
> It's really a step backwards - we had those optabs and a tree code in
> the past and
> canonicalizing things to VEC_PERM_EXPR made things simpler.
>
> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?

The problems with that are:

- It doesn't work for vectors with 256 or more byte elements, because
  the indices into the concatenation of the two inputs no longer fit
  in a byte-sized selector element and wrap round.

- Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
  special cases would be hard, especially since v256hi isn't a normal
  vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
  then have to worry about that special case.

- VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
  and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... } (sketched below).
  I guess it's possible to represent that using a combination of
  shifts, masks, and additions, but then:

  1) when generating them, we'd need to make sure that we cost the
     operation as a single permute, rather than costing all the shifts,
     masks and additions

  2) we'd need to make sure that all gimple optimisations that run
     afterwards don't perturb the sequence, otherwise we'll end up
     with something that's very expensive.

  3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
     optimisations, and it wouldn't be trivial to add it, since we'd
     need to re-recognise the sequence first.

  4) expand would need to re-recognise the sequence and use the
     optab anyway.

  Using an internal function seems much simpler :-)
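
To put the INTERLEAVE_LO point in concrete terms, here's a rough C-style
sketch of the selector that would have to be synthesised (writing n for
the number of elements in each input vector, so the indices run over the
2*n-element concatenation of the inputs; the names are only illustrative):

  /* { 0, n, 1, n+1, ... }: not a single linear series, unlike
     extract even { 0, 2, 4, ... }, extract odd { 1, 3, 5, ... }
     or reverse { n-1, n-2, ..., 0 }.  */
  static void
  interleave_lo_selector (unsigned int *sel, unsigned int n)
  {
    for (unsigned int i = 0; i < n; ++i)
      sel[i] = (i & 1) ? n + i / 2 : i / 2;
  }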

I think VEC_PERM_EXPR is useful because it represents the same
operation as __builtin_shuffle, and we want to optimise that as best
we can.  But these internal functions are only used by the vectoriser,
which should always see what the final form of the permute should be.

Thanks,
Richard
Richard Biener Nov. 21, 2017, 2:39 p.m. UTC | #4
On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>> ...so that we can use them for variable-length vectors.  For now
>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>> vec_perm_const optab even for cases that the new optabs could
>>>> handle.
>>>>
>>>> The vector optabs are inconsistent about whether there should be
>>>> an underscore before the mode part of the name, but the other lo/hi
>>>> optabs have one.
>>>>
>>>> Doing this means that we're able to optimise some SLP tests using
>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>> patches.
>>>>
>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>> and powerpc64le-linus-gnu.  OK to install?
>>>>
>>>> Richard
>>>>
>>>>
>>>> [...]
>>> OK.
>>
>> It's really a step backwards - we had those optabs and a tree code in
>> the past and
>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>
>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>
> The problems with that are:
>
> - It doesn't work for vectors with 256-bit elements because the indices
>   wrap round.

That's a general issue that would need to be addressed for larger
vectors (GCN?).
I presume the requirement that the permutation vector have the same size
needs to be relaxed.

> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>   special cases would be hard, especially since v256hi isn't a normal
>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>   then have to worry about that special case.

I think it's not really a special case - any code here should just
expect the same
number of vector elements and not a particular size.  You already dealt with
using a char[] vector for permutations I think.

> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>   I guess it's possible to represent that using a combination of
>   shifts, masks, and additions, but then:
>
>   1) when generating them, we'd need to make sure that we cost the
>      operation as a single permute, rather than costing all the shifts,
>      masks and additions
>
>   2) we'd need to make sure that all gimple optimisations that run
>      afterwards don't perturb the sequence, otherwise we'll end up
>      with something that's very expensive.
>
>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>      optimisations, and it wouldn't be trivial to add it, since we'd
>      need to re-recognise the sequence first.
>
>   4) expand would need to re-recognise the sequence and use the
>      optab anyway.

Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
an INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
Suggested fix: add an interleaved bit to VEC_SERIES_CST.

At least I'd like to see it used for the cases it can already handle.

VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
handle some cases it needs to be fixed / its constraints relaxed (like
the v256qi case).

>   Using an internal function seems much simpler :-)
>
> I think VEC_PERM_EXPR is useful because it represents the same
> operation as __builtin_shuffle, and we want to optimise that as best
> we can.  But these internal functions are only used by the vectoriser,
> which should always see what the final form of the permute should be.

You hope so.  We have several cases where later unrolling and CSE/forwprop
optimize permutations away.

Richard.

> Thanks,
> Richard
Richard Sandiford Nov. 21, 2017, 10:47 p.m. UTC | #5
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>> ...so that we can use them for variable-length vectors.  For now
>>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>>> vec_perm_const optab even for cases that the new optabs could
>>>>> handle.
>>>>>
>>>>> The vector optabs are inconsistent about whether there should be
>>>>> an underscore before the mode part of the name, but the other lo/hi
>>>>> optabs have one.
>>>>>
>>>>> Doing this means that we're able to optimise some SLP tests using
>>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>>> patches.
>>>>>
>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>> and powerpc64le-linus-gnu.  OK to install?
>>>>>
>>>>> Richard
>>>>>
>>>>>
>>>>> [...]
>>>> OK.
>>>
>>> It's really a step backwards - we had those optabs and a tree code in
>>> the past and
>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>
>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>>
>> The problems with that are:
>>
>> - It doesn't work for vectors with 256-bit elements because the indices
>>   wrap round.
>
> That's a general issue that would need to be addressed for larger
> vectors (GCN?).
> I presume the requirement that the permutation vector have the same size
> needs to be relaxed.
>
>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>   special cases would be hard, especially since v256hi isn't a normal
>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>   then have to worry about that special case.
>
> I think it's not really a special case - any code here should just
> expect the same
> number of vector elements and not a particular size.  You already dealt with
> using a char[] vector for permutations I think.

It sounds like you're talking about the case in which the permutation
vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
vectors, so that doesn't change.  (And yes, that probably means that it
does break for *fixed-length* 2048-bit vectors.)

But this patch is about the variable-length case, in which the
permutation vector is never a VECTOR_CST, and couldn't get converted
to a vec_perm_indices array.  As far as existing code is concerned,
it's no different from a VEC_PERM_EXPR with a variable permutation
vector.

So by taking this approach, we'd effectively be committing to supporting
VEC_PERM_EXPRs with variable permutation vectors that are wider than
the vectors being permuted.  Those permutation vectors will usually not
have a vector_mode_supported_p mode and will have to be synthesised
somehow.  Trying to support the general case like this could be incredibly
expensive.  Only certain special cases like interleave hi/lo could be
handled cheaply.

>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>   I guess it's possible to represent that using a combination of
>>   shifts, masks, and additions, but then:
>>
>>   1) when generating them, we'd need to make sure that we cost the
>>      operation as a single permute, rather than costing all the shifts,
>>      masks and additions
>>
>>   2) we'd need to make sure that all gimple optimisations that run
>>      afterwards don't perturb the sequence, otherwise we'll end up
>>      with something that's very expensive.
>>
>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>      need to re-recognise the sequence first.
>>
>>   4) expand would need to re-recognise the sequence and use the
>>      optab anyway.
>
> Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
> a INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.

It can do lots of other things too :-)  But in all cases as separate
statements.  It seems better to expose them as separate statements in
gimple so that they get optimised by the more powerful gimple optimisers,
rather than waiting until rtl.

I think if we go down the route of building more and more operations into
the constant, we'll end up inventing a gimple version of rtx CONST.

I also don't see why it's OK to expose the concept of interleave hi/lo
as an operation on constants but not as an operation on general vectors.

> Suggested fix: add a interleaved bit to VEC_SERIES_CST.

That only handles this one case though.  We'd have to keep making it
more and more complicated as more cases come up.  E.g. the extra bit
couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.

> At least I'd like to see it used for the cases it can already handle.
>
> VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
> handle some cases it needs to be fixed / its constraints relaxed (like
> the v256qi case).

I don't think we gain anything by shoehorning everything into one code
for the variable-length case though.  None of the existing vec_perm_const
code can (or should) be used, and none of the existing VECTOR_CST-based
VEC_PERM_EXPR handling will do anything.  That accounts for the majority
of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
VECTOR_CST, there won't be any vector types that use a mixture of
current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.

So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
permute functions, we wouldn't get any new optimisations for free: we'd
have to write new code to match the new constants.  So it becomes a
question of whether it's easier to do that on VEC_SERIES_CSTs or
internal functions.  I think match.pd makes it much easier to optimise
internal functions, and you get the added benefit that the result is
automatically checked against what the target supports.  And the
direct mapping to optabs means that no re-recognition is necessary.

It also seems inconsistent to allow things like TARGET_MEM_REF vs.
MEM_REF but still require a single gimple permute operation, regardless
of circumstances.  I thought we were trying to move in the other direction,
i.e. trying to get the power of the gimple optimisers for things that
were previously handled by rtl.

>>   Using an internal function seems much simpler :-)
>>
>> I think VEC_PERM_EXPR is useful because it represents the same
>> operation as __builtin_shuffle, and we want to optimise that as best
>> we can.  But these internal functions are only used by the vectoriser,
>> which should always see what the final form of the permute should be.
>
> You hope so.  We have several cases where later unrolling and CSE/forwprop
> optimize permutations away.

Unrolling doesn't usually expose anything useful for variable-length
vectors though, since the iv step is also variable.  I guess it could
still happen, but TBH I'd rather take the hit of that than the risk
that optimisers could create expensive non-native permutes.

Thanks,
Richard
Richard Biener Nov. 23, 2017, 8:50 a.m. UTC | #6
On Tue, Nov 21, 2017 at 11:47 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>>> ...so that we can use them for variable-length vectors.  For now
>>>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>>>> vec_perm_const optab even for cases that the new optabs could
>>>>>> handle.
>>>>>>
>>>>>> The vector optabs are inconsistent about whether there should be
>>>>>> an underscore before the mode part of the name, but the other lo/hi
>>>>>> optabs have one.
>>>>>>
>>>>>> Doing this means that we're able to optimise some SLP tests using
>>>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>>>> patches.
>>>>>>
>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>> and powerpc64le-linus-gnu.  OK to install?
>>>>>>
>>>>>> Richard
>>>>>>
>>>>>>
>>>>>> [...]
>>>>> OK.
>>>>
>>>> It's really a step backwards - we had those optabs and a tree code in
>>>> the past and
>>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>>
>>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>>>
>>> The problems with that are:
>>>
>>> - It doesn't work for vectors with 256-bit elements because the indices
>>>   wrap round.
>>
>> That's a general issue that would need to be addressed for larger
>> vectors (GCN?).
>> I presume the requirement that the permutation vector have the same size
>> needs to be relaxed.
>>
>>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>>   special cases would be hard, especially since v256hi isn't a normal
>>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>>   then have to worry about that special case.
>>
>> I think it's not really a special case - any code here should just
>> expect the same
>> number of vector elements and not a particular size.  You already dealt with
>> using a char[] vector for permutations I think.
>
> It sounds like you're talking about the case in which the permutation
> vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
> vectors, so that doesn't change.  (And yes, that probably means that it
> does break for *fixed-length* 2048-bit vectors.)
>
> But this patch is about the variable-length case, in which the
> permutation vector is never a VECTOR_CST, and couldn't get converted
> to a vec_perm_indices array.  As far as existing code is concerned,
> it's no different from a VEC_PERM_EXPR with a variable permutation
> vector.

But the permutation vector is constant as well - this is what you added
that VEC_SERIES_CST stuff and whatnot for.

I don't want variable-size vector special-casing everywhere.  I want it
to integrate somehow naturally with existing stuff.

> So by taking this approach, we'd effectively be committing to supporting
> VEC_PERM_EXPRs with variable permutation vectors that that are wider than
> the vectors being permuted.  Those permutation vectors will usually not
> have a vector_mode_supported_p mode and will have to be synthesised
> somehow.  Trying to support the general case like this could be incredibly
> expensive.  Only certain special cases like interleave hi/lo could be
> handled cheaply.

As far as I understand SVE only supports interleave / extract even/odd anyway.

>>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>>   I guess it's possible to represent that using a combination of
>>>   shifts, masks, and additions, but then:
>>>
>>>   1) when generating them, we'd need to make sure that we cost the
>>>      operation as a single permute, rather than costing all the shifts,
>>>      masks and additions
>>>
>>>   2) we'd need to make sure that all gimple optimisations that run
>>>      afterwards don't perturb the sequence, otherwise we'll end up
>>>      with something that's very expensive.
>>>
>>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>>      need to re-recognise the sequence first.
>>>
>>>   4) expand would need to re-recognise the sequence and use the
>>>      optab anyway.
>>
>> Well, the answer is of course that you just need a more powerful VEC_SERIES_CST
>> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
>> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
>> a INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
>
> It can do lots of other things too :-)  But in all cases as separate
> statements.  It seems better to expose them as separate statements in
> gimple so that they get optimised by the more powerful gimple optimisers,
> rather than waiting until rtl.
>
> I think if we go down the route of building more and more operations into
> the constant, we'll end up inventing a gimple version of rtx CONST.
>
> I also don't see why it's OK to expose the concept of interleave hi/lo
> as an operation on constants but not as an operation on general vectors.

I'm not suggesting to expose it as an operation.  I'm suggesting that if the
target can vec_perm_const_ok () with an "interleave/extract" permutation
then we should be able to represent that with VEC_PERM_EXPR and thus
also represent the permutation vector.

I wasn't too happy with VEC_SERIES_CST either you know.

As said, having something as disruptive as poly_int everywhere but then
still needing all that special-casing for variable-length vectors in the
vectorizer looks just wrong.

How are you going to handle __builtin_shuffle () with SVE & intrinsics?
How are you going to handle generic vector "lowering"?

All the predication stuff is hidden from the middle-end as well.  It would
have been nice to finally have a nice way to express these things in GIMPLE.

Bah.

>> Suggested fix: add a interleaved bit to VEC_SERIES_CST.
>
> That only handles this one case though.  We'd have to keep making it
> more and more complicated as more cases come up.  E.g. the extra bit
> couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.

But is there an instruction for this in SVE?  I understand there's a single
instruction doing interleave low/high and extract even/odd?  But is there
more?  Possibly a generic permute but for it you'd have to explicitly
construct a permutation vector using some primitives like that "series"
instruction?  So for that case it's reasonable to have GIMPLE like

 perm_vector_1 = VEC_SERIES_EXPR <...>
 ...
 v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;

that is, it's not required to pretend the VEC_PERM_EXPR is a single
instruction or the permutation vector is "constant"?

>> At least I'd like to see it used for the cases it can already handle.
>>
>> VEC_PERM_EXPR is supposed to be the only permutation operation, if it cannot
>> handle some cases it needs to be fixed / its constraints relaxed (like
>> the v256qi case).
>
> I don't think we gain anything by shoehorning everything into one code
> for the variable-length case though.  None of the existing vec_perm_const
> code can (or should) be used, and none of the existing VECTOR_CST-based
> VEC_PERM_EXPR handling will do anything.  That accounts for the majority
> of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
> VECTOR_CST, there won't be any vector types that use a mixture of
> current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.

I dislike having those as you know.

> So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
> permute functions, we wouldn't get any new optimisations for free: we'd
> have to write new code to match the new constants.

Yeah, too bad we have those new constants ;)

>  So it becomes a
> question of whether it's easier to do that on VEC_SERIES_CSTs or
> internal functions.  I think match.pd makes it much easier to optimise
> internal functions, and you get the added benefit that the result is
> automatically checked against what the target supports.  And the
> direct mapping to optabs means that no re-recognition is necessary.
>
> It also seems inconsistent to allow things like TARGET_MEM_REF vs.
> MEM_REF but still require a single gimple permute operation, regardless
> of circumstances.  I thought we were trying to move in the other direction,
> i.e. trying to get the power of the gimple optimisers for things that
> were previously handled by rtl.

Indeed I don't like TARGET_MEM_REF too much either.

>>>   Using an internal function seems much simpler :-)
>>>
>>> I think VEC_PERM_EXPR is useful because it represents the same
>>> operation as __builtin_shuffle, and we want to optimise that as best
>>> we can.  But these internal functions are only used by the vectoriser,
>>> which should always see what the final form of the permute should be.
>>
>> You hope so.  We have several cases where later unrolling and CSE/forwprop
>> optimize permutations away.
>
> Unrolling doesn't usually expose anything useful for variable-length
> vectors though, since the iv step is also variable.  I guess it could
> still happen, but TBH I'd rather take the hit of that than the risk
> that optimisers could create expensive non-native permutes.

optimizers / folders have to (and do) check vec_perm_const_ok if they
change a constant permute vector.

Note there's always an advantage of exposing target capabilities directly,
like on x86 those vec_perm_const_ok VEC_PERM_EXPRs could be
expanded to the (series of!) native "permute" instructions of x86 by adding
(target specific!) IFNs.  But then those would be black boxes to all followup
optimizers which means we could as well have none of those.  But fact is
the vectorizer isn't perfect and we rely on useless permutes being
a) CSEd, b) combined, c) eliminated against extracts, etc.  You'd have to
replicate all VEC_PERM/BIT_FIELD_REF/etc. patterns we have in match.pd
for all of the target IFNs.  Yes, if we were right before RTL expansion we can
have those "target IFNs" immediately but fact is we do a _lot_ of optimizations
after vectorization.

Oh, and there I mentioned "target IFNs" (and related, "target match.pd").
You are adding IFNs that exist for each target (because that's how optabs
work(?)) but in reality you are generating ones that match SVE.  Not
very nice either.

That said, all this feels a bit like a hack throughout GCC rather than
designing variable-length vectors into GIMPLE and then providing some
implementation meat in the arm backend(s).  I know you're time constrained
but I think we're carrying quite a big maintenance burden that will be very
difficult to "fix" afterwards (because of lack of motivation once this is
in).  And we've not even seen SVE silicon...
(happened with SSE5 for example but that at least was x86-only).

Richard.

> Thanks,
> Richard
Richard Sandiford Nov. 23, 2017, 11:16 a.m. UTC | #7
Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Nov 21, 2017 at 11:47 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Biener <richard.guenther@gmail.com> writes:
>>> On Mon, Nov 20, 2017 at 1:35 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> Richard Biener <richard.guenther@gmail.com> writes:
>>>>> On Mon, Nov 20, 2017 at 12:56 AM, Jeff Law <law@redhat.com> wrote:
>>>>>> On 11/09/2017 06:24 AM, Richard Sandiford wrote:
>>>>>>> ...so that we can use them for variable-length vectors.  For now
>>>>>>> constant-length vectors continue to use VEC_PERM_EXPR and the
>>>>>>> vec_perm_const optab even for cases that the new optabs could
>>>>>>> handle.
>>>>>>>
>>>>>>> The vector optabs are inconsistent about whether there should be
>>>>>>> an underscore before the mode part of the name, but the other lo/hi
>>>>>>> optabs have one.
>>>>>>>
>>>>>>> Doing this means that we're able to optimise some SLP tests using
>>>>>>> non-SLP (for now) on targets with variable-length vectors, so the
>>>>>>> patch needs to add a few XFAILs.  Most of these go away with later
>>>>>>> patches.
>>>>>>>
>>>>>>> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
>>>>>>> and powerpc64le-linus-gnu.  OK to install?
>>>>>>>
>>>>>>> Richard
>>>>>>>
>>>>>>>
>>>>>>> [...]
>>>>>> OK.
>>>>>
>>>>> It's really a step backwards - we had those optabs and a tree code in
>>>>> the past and
>>>>> canonicalizing things to VEC_PERM_EXPR made things simpler.
>>>>>
>>>>> Why doesn't VEC_PERM <v1, v2, that-constant-series-expr-thing> not work?
>>>>
>>>> The problems with that are:
>>>>
>>>> - It doesn't work for vectors with 256-bit elements because the indices
>>>>   wrap round.
>>>
>>> That's a general issue that would need to be addressed for larger
>>> vectors (GCN?).
>>> I presume the requirement that the permutation vector have the same size
>>> needs to be relaxed.
>>>
>>>> - Supporting a fake VEC_PERM_EXPR <v256qi, v256qi, v256hi> for a few
>>>>   special cases would be hard, especially since v256hi isn't a normal
>>>>   vector mode.  I imagine everything dealing with VEC_PERM_EXPR would
>>>>   then have to worry about that special case.
>>>
>>> I think it's not really a special case - any code here should just
>>> expect the same
>>> number of vector elements and not a particular size.  You already dealt with
>>> using a char[] vector for permutations I think.
>>
>> It sounds like you're talking about the case in which the permutation
>> vector is a VECTOR_CST.  We still use VEC_PERM_EXPRs for constant-length
>> vectors, so that doesn't change.  (And yes, that probably means that it
>> does break for *fixed-length* 2048-bit vectors.)
>>
>> But this patch is about the variable-length case, in which the
>> permutation vector is never a VECTOR_CST, and couldn't get converted
>> to a vec_perm_indices array.  As far as existing code is concerned,
>> it's no different from a VEC_PERM_EXPR with a variable permutation
>> vector.
>
> But the permutation vector is constant as well - this is what you added those
> VEC_SERIES_CST stuff and whatnot for.
>
> I don't want variable-size vector special-casing everywhere.  I want it to be
> somehow naturally integrating with existing stuff.

It's going to be a special case whatever happens though.  If it's a
VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.  The advantage
of the internal functions and optabs is that they map to a concept that
already exists.  The code that generates the permutation already knows that
it's generating an interleave lo/hi, and like you say, it used to do that
directly via special tree codes.  I agree that having a VEC_PERM_EXPR makes
more sense for the constant-length case, but the concept is still there.

And although using VEC_PERM_EXPR in gimple makes sense, I think not
having the optabs is a step backwards, because it means that every
target with interleave lo/hi has to duplicate the detection logic.
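
For example, the kind of re-recognition each target (or the middle-end on
its behalf) would otherwise need is roughly the following -- a sketch only,
with hypothetical names, and with the selector shown as a plain array
rather than a vec_perm_indices:

  /* Does SEL, indexing the concatenation of two NELTS-element vectors,
     describe an interleave of the low halves, i.e.
     { 0, nelts, 1, nelts + 1, ... }?  */
  static bool
  interleave_lo_sel_p (const unsigned int *sel, unsigned int nelts)
  {
    for (unsigned int i = 0; i < nelts; ++i)
      if (sel[i] != ((i & 1) ? nelts + i / 2 : i / 2))
        return false;
    return true;
  }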

>> So by taking this approach, we'd effectively be committing to supporting
>> VEC_PERM_EXPRs with variable permutation vectors that that are wider than
>> the vectors being permuted.  Those permutation vectors will usually not
>> have a vector_mode_supported_p mode and will have to be synthesised
>> somehow.  Trying to support the general case like this could be incredibly
>> expensive.  Only certain special cases like interleave hi/lo could be
>> handled cheaply.
>
> As far as I understand SVE only supports interleave / extract even/odd anyway.
>
>>>> - VEC_SERIES_CST only copes naturally with EXTRACT_EVEN, EXTRACT_ODD
>>>>   and REVERSE.  INTERLEAVE_LO is { 0, N/2, 1, N/2+1, ... }.
>>>>   I guess it's possible to represent that using a combination of
>>>>   shifts, masks, and additions, but then:
>>>>
>>>>   1) when generating them, we'd need to make sure that we cost the
>>>>      operation as a single permute, rather than costing all the shifts,
>>>>      masks and additions
>>>>
>>>>   2) we'd need to make sure that all gimple optimisations that run
>>>>      afterwards don't perturb the sequence, otherwise we'll end up
>>>>      with something that's very expensive.
>>>>
>>>>   3) that sequence wouldn't be handled by existing VEC_PERM_EXPR
>>>>      optimisations, and it wouldn't be trivial to add it, since we'd
>>>>      need to re-recognise the sequence first.
>>>>
>>>>   4) expand would need to re-recognise the sequence and use the
>>>>      optab anyway.
>>>
>>> Well, the answer is of course that you just need a more powerful
>>> VEC_SERIES_CST
>>> that can handle INTERLEAVE_HI/LO.  It seems to me SVE can generate
>>> such masks relatively cheaply -- do a 0, 1, 2, 3... sequence and then do
>>> a INTERLEAVE_HI/LO on it.  So it makes sense that we can directly specify it.
>>
>> It can do lots of other things too :-)  But in all cases as separate
>> statements.  It seems better to expose them as separate statements in
>> gimple so that they get optimised by the more powerful gimple optimisers,
>> rather than waiting until rtl.
>>
>> I think if we go down the route of building more and more operations into
>> the constant, we'll end up inventing a gimple version of rtx CONST.
>>
>> I also don't see why it's OK to expose the concept of interleave hi/lo
>> as an operation on constants but not as an operation on general vectors.
>
> I'm not suggesting to expose it as an operation.  I'm suggesting that if the
> target can vec_perm_const_ok () with an "interleave/extract" permutation
> then we should be able to represent that with VEC_PERM_EXPR and thus
> also represent the permutation vector.

But vec_perm_const_ok () takes a fixed-length mask, so it can't be
used here.  It would need to be a new hook (and thus a new special
case for variable-length vectors).

> I wasn't too happy with VEC_SERIES_CST either you know.
>
> As said, having something as disruptive as poly_int everywhere but then
> still need all those special casing for variable length vectors in the
> vectorizer
> looks just wrong.
>
> How are you going to handle __builtin_shuffle () with SVE & intrinsics?

The variable case should work with the current constraints, i.e. with
the permutation vector having the same element width as the vectors
being permuted, once there's a way of writing __builtin_shuffle with
variable-length vectors.  That means that 256-element shuffles can't
refer to the second vector, but that's correct.
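
For comparison, the constraint as it stands today for fixed-length vectors
looks like this (illustrative types only):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  var_shuffle (v4si x, v4si y, v4si sel)
  {
    /* The selector has the same element count and element width as the
       data vectors; selector values are taken modulo 2 * nelts.  */
    return __builtin_shuffle (x, y, sel);
  }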

(The problem here is that the interleaves *do* need to refer to
the second vector, but using wider vectors for the permutation vector
wouldn't be a native operation in general for SVE or for any existing
target.)

> How are you going to handle generic vector "lowering"?

We shouldn't generate variable-length vectors that don't exist on the
target (and there's no syntax for doing that in C).

> All the predication stuff is hidden from the middle-end as well.  It would
> have been nice to finally have a nice way to express these things in GIMPLE.

Not sure what you mean here.  The only time predication is hidden is
when SVE requires an all-true predicate for a full-vector operation,
which seems like a target-specific detail.  All "real" predication
is exposed in GIMPLE.  It builds on the existing support for vector
boolean types.

> Bah.
>
>>> Suggested fix: add a interleaved bit to VEC_SERIES_CST.
>>
>> That only handles this one case though.  We'd have to keep making it
>> more and more complicated as more cases come up.  E.g. the extra bit
>> couldn't represent { 0, 1, 2, ..., n/2-1, n, n+1, ... }.
>
> But is there an instruction for this in SVE?  I understand there's a single
> instruction doing interleave low/high and extract even/odd?

This specific example, no.  But my point is that...

> But is there more?  Possibly a generic permute but for it you'd have
> to explicitly construct a permutation vector using some primitives
> like that "series" instruction?  So for that case it's reasonable to
> have GIMPLE like
>
>  perm_vector_1 = VEC_SERIES_EXPR <...>
>  ...
>  v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;
>
> that is, it's not required to pretend the VEC_PERM_EXPR is a single
> instruction or the permutation vector is "constant"?

...by taking this approach, we're saying that we need to ensure that
there is always a way of representing every directly-supported variable-
length permutation mask as a constant, so that it doesn't get split from
VEC_PERM_EXPR.  I don't see why that's better than having internal
functions.  You said that you don't like the extra constants, and each
time we make the constants more complicated, we have to support the
more complicated constants everywhere that handles the constants
(rather than everywhere that handles the VEC_PERM_EXPRs).

>>> At least I'd like to see it used for the cases it can already handle.
>>>
>>> VEC_PERM_EXPR is supposed to be the only permutation operation; if it cannot
>>> handle some cases it needs to be fixed / its constraints relaxed (like
>>> the v256qi case).
>>
>> I don't think we gain anything by shoehorning everything into one code
>> for the variable-length case though.  None of the existing vec_perm_const
>> code can (or should) be used, and none of the existing VECTOR_CST-based
>> VEC_PERM_EXPR handling will do anything.  That accounts for the majority
>> of the VEC_PERM_EXPR support.  Also, like with VEC_DUPLICATE_CST vs.
>> VECTOR_CST, there won't be any vector types that use a mixture of
>> current VEC_PERM_EXPRs and new VEC_PERM_EXPRs.
>
> I dislike having those as you know.
>
>> So if we used VEC_PERM_EXPRs with VEC_SERIES_CSTs instead of internal
>> permute functions, we wouldn't get any new optimisations for free: we'd
>> have to write new code to match the new constants.
>
> Yeah, too bad we have those new constants ;)
>
>>  So it becomes a
>> question of whether it's easier to do that on VEC_SERIES_CSTs or
>> internal functions.  I think match.pd makes it much easier to optimise
>> internal functions, and you get the added benefit that the result is
>> automatically checked against what the target supports.  And the
>> direct mapping to optabs means that no re-recognition is necessary.
>>
>> It also seems inconsistent to allow things like TARGET_MEM_REF vs.
>> MEM_REF but still require a single gimple permute operation, regardless
>> of circumstances.  I thought we were trying to move in the other direction,
>> i.e. trying to get the power of the gimple optimisers for things that
>> were previously handled by rtl.
>
> Indeed I don't like TARGET_MEM_REF too much either.

Hmm, ok.  Maybe that's the difference here.  It seems like a really
nice feature to me :-)

>>>>   Using an internal function seems much simpler :-)
>>>>
>>>> I think VEC_PERM_EXPR is useful because it represents the same
>>>> operation as __builtin_shuffle, and we want to optimise that as best
>>>> we can.  But these internal functions are only used by the vectoriser,
>>>> which should always see what the final form of the permute should be.
>>>
>>> You hope so.  We have several cases where later unrolling and CSE/forwprop
>>> optimize permutations away.
>>
>> Unrolling doesn't usually expose anything useful for variable-length
>> vectors though, since the iv step is also variable.  I guess it could
>> still happen, but TBH I'd rather take the hit of that than the risk
>> that optimisers could create expensive non-native permutes.
>
> optimizers / folders have to (and do) check vec_perm_const_ok if they
> change a constant permute vector.
>
> Note there's always an advantage of exposing target capabilities directly,
> like on x86 those vec_perm_const_ok VEC_PERM_EXPRs could be
> expanded to the (series of!) native "permute" instructions of x86 by adding
> (target specific!) IFNs.  But then those would be black boxes to all followup
> optimizers which means we could as well have none of those.  But fact is
> the vectorizer isn't perfect and we rely on useless permutes being
> a) CSEd, b) combined, c) eliminated against extracts, etc.  You'd have to
> replicate all VEC_PERM/BIT_FIELD_REF/etc. patterns we have in match.pd
> for all of the target IFNs.  Yes, if we were right before RTL expansion we can
> have those "target IFNs" immediately but fact is we do a _lot_ of optimizations
> after vectorization.
>
> Oh, and there I mentioned "target IFNs" (and related, "target match.pd").
> You are adding IFNs that exist for each target (because that's how optabs
> work(?)) but in reality you are generating ones that match SVE.  Not
> very nice either.

But the concepts are general, even if they're implemented by only
one target at the moment.  One architecture always has to come first.

E.g. when IFN_MASK_LOAD went in, it was only supported for x86_64.
Adding it as a generic function was still the right thing to do and
meant that all SVE had to do was define the optab.

I think target IFNs would only make sense if we have some sort
of pre-expand target-specific lowering pass.  (Which might be
a good thing.)  Here we're adding internal functions for things
that the vectoriser has to be aware of.
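
Roughly, the contrast in the vectoriser's output is (SSA names are
illustrative; the constant-length form is unchanged by the patch):

  /* Constant-length vectors keep the existing representation:  */
  vect_inter_low_10 = VEC_PERM_EXPR <vect1_8, vect2_9, { 0, 4, 1, 5 }>;

  /* Variable-length vectors use the new internal function, which maps
     directly onto the vec_interleave_lo_<mode> optab:  */
  vect_inter_low_10 = .VEC_INTERLEAVE_LO (vect1_8, vect2_9);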

> That said, all this feels a bit like a hack throughout GCC rather
> than designing variable-length vectors into GIMPLE and then providing
> some implementation meat in the arm backend(s).  I know you're time
> constrained but I think we're carrying quite a big maintenance burden
> that will be very difficult to "fix" afterwards (because of lack of
> motivation once this is in).  And we've not even seen SVE silicon...
> (happened with SSE5 for example but that at least was x86-only).

I don't think it's a hack, and it didn't end up this way because
of time constraints.  IMO designing variable-length vectors into
gimple means that (a) it needs to be possible to create variable-
length vector constants in gimple (hence the new constants) and
(b) gimple optimisers need to be aware of the fact that vectors
have a variable length, element offsets can be variable, etc.
(hence the poly_int stuff, which is also needed for rtl).

Thanks,
Richard
Michael Matz Nov. 23, 2017, 1:43 p.m. UTC | #8
Hi,

On Thu, 23 Nov 2017, Richard Sandiford wrote:

> > I don't want variable-size vector special-casing everywhere.  I want 
> > it to be somehow naturally integrating with existing stuff.
> 
> It's going to be a special case whatever happens though.

It wouldn't have to be this way.  It's like saying that loops with a 
constant upper bound should be represented in a different way than loops 
with an invariant upper bound.  That would seem like a bad idea.

> If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.

No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new 
EXPR type, instead of being a mere constant.

> The advantage of the internal functions and optabs is that they map to a 
> concept that already exists.  The code that generates the permutation 
> already knows that it's generating an interleave lo/hi, and like you 
> say, it used to do that directly via special tree codes.  I agree that 
> having a VEC_PERM_EXPR makes more sense for the constant-length case, 
> but the concept is still there.
> 
> And although using VEC_PERM_EXPR in gimple makes sense, I think not
> having the optabs is a step backwards, because it means that every
> target with interleave lo/hi has to duplicate the detection logic.

The middle end can provide helper routines to make detection easy.  The 
RTL expander could also match VEC_PERM_EXPR to specific optabs, if we 
really really want to add optab over optab for each specific kind of 
permutation in the future.

In a way the difference boils down to have
  PERM(x,y, TYPE)
(with TYPE being, say, HI_LO, EXTR_EVEN/ODD, REVERSE, and what not)
vs.
  PERM_HI_LO(x,y)
  PERM_EVEN(x,y)
  PERM_ODD(x,y)
  PERM_REVERSE(x,y)
  ...

The former way seems saner for an intermediate representation.  In this 
specific case TYPE would be detected by magicness of the constant, and if 
extended to SVE by magicness of the definition of the variably-sized 
invariant.

> > I'm not suggesting to expose it as an operation.  I'm suggesting that 
> > if the target can vec_perm_const_ok () with an "interleave/extract" 
> > permutation then we should be able to represent that with 
> > VEC_PERM_EXPR and thus also represent the permutation vector.
> 
> But vec_perm_const_ok () takes a fixed-length mask, so it can't be
> used here.  It would need to be a new hook (and thus a new special
> case for variable-length vectors).

Why do you reject extending vec_perm_const_ok to _do_ take an invariant 
mask?

> > But is there more?  Possibly a generic permute but for it you'd have
> > to explicitly construct a permutation vector using some primitives
> > like that "series" instruction?  So for that case it's reasonable to
> > have GIMPLE like
> >
> >  perm_vector_1 = VEC_SERIES_EXPR <...>
> >  ...
> >  v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;
> >
> > that is, it's not required to pretend the VEC_PERM_EXPR is a single
> > instruction or the permutation vector is "constant"?
> 
> ...by taking this approach, we're saying that we need to ensure that
> there is always a way of representing every directly-supported variable-
> length permutation mask as a constant, so that it doesn't get split from
> VEC_PERM_EXPR.

I'm having trouble understanding this.  Why would splitting away the 
definition of perm_vector_1 from VEC_PERM_EXPR be a problem?  It's still 
the same VEC_SERIES_EXPR, and hence still recognizable as a special 
permutation (if it is one).  The optimizers won't touch VEC_SERIES_EXPR, or 
if they do (e.g. combine two of them) and they feed a VEC_PERM_EXPR, they 
will make sure the combined result is still supported by the target.

In a way, on targets which support only specific forms of permutation for 
the vector type in question, this invariant mask won't be explicitly 
generated in code; it's an abstract tag in the IR to specify the type of 
the transformation.  Hence moving the def for that tag around is no 
problem.

> I don't see why that's better than having internal
> functions.

The real difference isn't internal functions vs. expression nodes, but 
rather multiple node types vs. a single node type.


Ciao,
Michael.
Jakub Jelinek Nov. 23, 2017, 2:06 p.m. UTC | #9
On Thu, Nov 23, 2017 at 02:43:32PM +0100, Michael Matz wrote:
> > If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.
> 
> No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new 
> EXPR type, instead of being a mere constant.

Or an internal function that would produce the permutation mask vector
given kind and number of vector elements and element type for the mask.

	Jakub
Richard Sandiford Nov. 23, 2017, 2:45 p.m. UTC | #10
Michael Matz <matz@suse.de> writes:
> Hi,
>
> On Thu, 23 Nov 2017, Richard Sandiford wrote:
>
>> > I don't want variable-size vector special-casing everywhere.  I want 
>> > it to be somehow naturally integrating with existing stuff.
>> 
>> It's going to be a special case whatever happens though.
>
> It wouldn't have to be this way.  It's like saying that loops with a 
> constant upper bound should be represented in a different way than loops 
> with an invariant upper bound.  That would seem like a bad idea.

The difference is that with a loop, each iteration follows a set pattern.
But:

(1) for constant-length VEC_PERM_EXPRs, each element of the permutation
    vector is independent of the others: you can't predict what the selector
    for element i is given the selectors for the other elements.

(2) for variable-length permutes, the elements *do* have to follow
    a set pattern that can be extended indefinitely.
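
E.g. (a purely illustrative sketch, using the notation from the patch):

  /* Constant-length: any combination of indices is representable;
     there is no pattern to extrapolate from.  */
  v_1 = VEC_PERM_EXPR <a_2, b_3, { 2, 7, 7, 0 }>;

  /* Variable-length: the selector has to follow a single rule that can
     be extended indefinitely, e.g. the interleave-lo pattern
     { 0, nelt, 1, nelt + 1, ... } for whatever nelt turns out to be at
     run time; element i cannot be chosen independently.  */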

Or do you mean that we should use the new representation of interleave
masks even for constant-length vectors, rather than using a VECTOR_CST?
I suppose that would be more consistent, but we'd then have to check
when generating a VEC_PERM_EXPR of a VECTOR_CST whether it should be
represented in this new way instead.  I think we then lose the benefit
of using a single tree code.

The decision for VEC_DUPLICATE_CST and VEC_SERIES_CST was to restrict
them only to variable-length vectors.

>> If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.
>
> No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new 
> EXPR type, instead of being a mere constant.

(See [1] below)

>> The advantage of the internal functions and optabs is that they map to a 
>> concept that already exists.  The code that generates the permutation 
>> already knows that it's generating an interleave lo/hi, and like you 
>> say, it used to do that directly via special tree codes.  I agree that 
>> having a VEC_PERM_EXPR makes more sense for the constant-length case, 
>> but the concept is still there.
>> 
>> And although using VEC_PERM_EXPR in gimple makes sense, I think not
>> having the optabs is a step backwards, because it means that every
>> target with interleave lo/hi has to duplicate the detection logic.
>
> The middle end can provide helper routines to make detection easy.  The 
> RTL expander could also match VEC_PERM_EXPR to specific optabs, if we 
> really really want to add optab over optab for each specific kind of 
> permutation in the future.

Adding a new optab doesn't seem like a big deal to me, if it's for
something that target-independent code already needs to worry about.
(The reason we need these specific optabs is because the target-
independent code already generates these particular permutes.)

The overhead attached to adding an optab isn't really any higher
than adding a new detector function, especially on targets that
don't implement the optab.

> In a way the difference boils down to have
>   PERM(x,y, TYPE)
> (with TYPE being, say, HI_LO, EXTR_EVEN/ODD, REVERSE, and what not)
> vs.
>   PERM_HI_LO(x,y)
>   PERM_EVEN(x,y)
>   PERM_ODD(x,y)
>   PERM_REVERSE(x,y)
>   ...
>
> The former way seems saner for an intermediate representation.  In this 
> specific case TYPE would be detected by magicness of the constant, and if 
> extended to SVE by magicness of the definition of the variably-sized 
> invariant.

[1] That sounds similar to the way that COND_EXPR and VEC_COND_EXPR can
embed the comparison in the first operand.  I think Richard has complained
about that in the past (and it does cause some ugliness in the way mask
types are calculated during vectorisation).

>> > I'm not suggesting to expose it as an operation.  I'm suggesting that 
>> > if the target can vec_perm_const_ok () with an "interleave/extract" 
>> > permutation then we should be able to represent that with 
>> > VEC_PERM_EXPR and thus also represent the permutation vector.
>> 
>> But vec_perm_const_ok () takes a fixed-length mask, so it can't be
>> used here.  It would need to be a new hook (and thus a new special
>> case for variable-length vectors).
>
> Why do you reject extending vec_perm_const_ok to _do_ take an invariant 
> mask?

What kind of interface were you thinking of though?

Note that the current interface is independent of the tree or rtl levels,
since it's called by both gimple optimisers and vec_perm_const expanders.
I assume we'd want to keep that.

>> > But is there more?  Possibly a generic permute but for it you'd have
>> > to explicitly construct a permutation vector using some primitives
>> > like that "series" instruction?  So for that case it's reasonable to
>> > have GIMPLE like
>> >
>> >  perm_vector_1 = VEC_SERIES_EXPR <...>
>> >  ...
>> >  v_2 = VEC_PERM_EXPR <.., .., perm_vector_1>;
>> >
>> > that is, it's not required to pretend the VEC_PERM_EXPR is a single
>> > instruction or the permutation vector is "constant"?
>> 
>> ...by taking this approach, we're saying that we need to ensure that
>> there is always a way of representing every directly-supported variable-
>> length permutation mask as a constant, so that it doesn't get split from
>> VEC_PERM_EXPR.
>
> I'm having trouble understanding this.  Why would splitting away the 
> definition of perm_vector_1 from VEC_PERM_EXPR be a problem?  It's still 
> the same VEC_SERIES_EXPR, and hence still recognizable as a special 
> permutation (if it is one).  The optimizers won't touch VEC_SERIES_EXPR, or 
> if they do (e.g. combine two of them) and they feed a VEC_PERM_EXPR, they 
> will make sure the combined result is still supported by the target.

The problem with splitting it out is that it just becomes any old
gassign, and you don't normally have to check the uses of an SSA_NAME
before optimising the definition.
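
Concretely, once the mask definition is split out it looks like
(SSA names illustrative):

  perm_vector_1 = VEC_SERIES_EXPR <...>;    /* just an ordinary gassign */
  ...
  v_2 = VEC_PERM_EXPR <a_3, b_4, perm_vector_1>;

so anything that rewrites perm_vector_1's definition would, under this
scheme, also have to inspect every use of perm_vector_1 to know whether
the result is still a permutation the target supports.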

> In a way, on targets which support only specific forms of permutation for 
> the vector type in question, this invariant mask won't be explicitly 
> generated in code; it's an abstract tag in the IR to specify the type of 
> the transformation.  Hence moving the def for that tag around is no 
> problem.

But why should the def exist as a separate gimple statement in that case?
If it's one operation then it seems better to keep it as one operation,
both for code-gen and for optimisation.

>> I don't see why that's better than having internal
>> functions.
>
> The real difference isn't internal functions vs. expression nodes, but 
> rather multiple node types vs. a single node type.

But maybe more specifically: multiple node types that each have a single
form vs. one node type that has multiple forms.  And at that level it
seems like it's a difference between:

  gimple_assign_rhs_code (x) == VEC_PERM_EXPR

vs.

  gimple_vec_perm_p (x)

Thanks,
Richard
Richard Sandiford Nov. 23, 2017, 5:01 p.m. UTC | #11
Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, Nov 23, 2017 at 02:43:32PM +0100, Michael Matz wrote:
>> > If it's a VEC_PERM_EXPR then it'll be a new form of VEC_PERM_EXPR.
>> 
>> No, it'd be a VEC_PERM_EXPR where the magic mask is generated by a new 
>> EXPR type, instead of being a mere constant.
>
> Or an internal function that would produce the permutation mask vector
> given kind and number of vector elements and element type for the mask.

I think this comes back to what the return type of the functions
should be.  A vector of QImodes wouldn't be wide enough when vectors
are 256 elements or wider, so we'd need a vector of HImodes with the
same number of elements as a native vector of QImodes.  This means
that the function will need to return a non-native vector type.
That doesn't matter if the function remains glued to the VEC_PERM_EXPR,
but once we expose it as a separate operation, it can get optimised
separately from the VEC_PERM_EXPR.
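
To put numbers on it (assuming the 2048-bit architectural maximum for SVE):

  2048 bits / 8 bits per QImode element  = 256 elements per vector
  a two-input permute selects from 2 * 256 = 512 candidate elements
  indices 0..511 need 9 bits, so the mask must use HImode elements,
  and 256 HImode elements = 4096 bits, twice the widest native vector.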

Also, even with this representation, the operation isn't truly
variable-length.  It just increases the maximum number of elements
from 256 to 65536.  That ought to be enough realistically (in the
same way that 640k ought to be enough realistically), but the patches
as posted have avoided needing to encode a maximum like that.

Having the internal function do the permute rather than produce the
mask means that the operation really is variable-length: we don't
require a vector index to fit within a specific integer type.
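
In other words, sketching both options (.VEC_INTERLEAVE_LO_MASK is a
made-up name for the kind of mask-producing function suggested above):

  /* Mask-producing function: mask_5 still needs an element type wide
     enough to hold every index (hypothetical function name).  */
  mask_5 = .VEC_INTERLEAVE_LO_MASK (...);
  v_6 = VEC_PERM_EXPR <a_1, b_2, mask_5>;

  /* This patch: the internal function performs the permute itself,
     so no index ever needs to be materialised.  */
  v_6 = .VEC_INTERLEAVE_LO (a_1, b_2);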

Thanks,
Richard

Patch

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-09 13:21:01.989917982 +0000
+++ gcc/doc/md.texi	2017-11-09 13:21:02.323463345 +0000
@@ -5017,6 +5017,46 @@  There is no need for a target to supply
 and @samp{vec_perm_const@var{m}} if the former can trivially implement
 the operation with, say, the vector constant loaded into a register.
 
+@cindex @code{vec_reverse_@var{m}} instruction pattern
+@item @samp{vec_reverse_@var{m}}
+Reverse the order of the elements in vector input operand 1 and store
+the result in vector output operand 0.  Both operands have mode @var{m}.
+
+This pattern is provided mainly for targets with variable-length vectors.
+Targets with fixed-length vectors can instead handle any reverse-specific
+optimizations in @samp{vec_perm_const@var{m}}.
+
+@cindex @code{vec_interleave_lo_@var{m}} instruction pattern
+@item @samp{vec_interleave_lo_@var{m}}
+Take the lowest-indexed halves of vector input operands 1 and 2 and
+interleave the elements, so that element @var{x} of operand 1 is followed by
+element @var{x} of operand 2.  Store the result in vector output operand 0.
+All three operands have mode @var{m}.
+
+This pattern is provided mainly for targets with variable-length
+vectors.  Targets with fixed-length vectors can instead handle any
+interleave-specific optimizations in @samp{vec_perm_const@var{m}}.
+
+@cindex @code{vec_interleave_hi_@var{m}} instruction pattern
+@item @samp{vec_interleave_hi_@var{m}}
+Like @samp{vec_interleave_lo_@var{m}}, but operate on the highest-indexed
+halves instead of the lowest-indexed halves.
+
+@cindex @code{vec_extract_even_@var{m}} instruction pattern
+@item @samp{vec_extract_even_@var{m}}
+Concatenate vector input operands 1 and 2, extract the elements with
+even-numbered indices, and store the result in vector output operand 0.
+All three operands have mode @var{m}.
+
+This pattern is provided mainly for targets with variable-length vectors.
+Targets with fixed-length vectors can instead handle any
+extract-specific optimizations in @samp{vec_perm_const@var{m}}.
+
+@cindex @code{vec_extract_odd_@var{m}} instruction pattern
+@item @samp{vec_extract_odd_@var{m}}
+Like @samp{vec_extract_even_@var{m}}, but extract the elements with
+odd-numbered indices.
+
 @cindex @code{push@var{m}1} instruction pattern
 @item @samp{push@var{m}1}
 Output a push instruction.  Operand 0 is value to push.  Used only when
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-09 13:21:01.989917982 +0000
+++ gcc/internal-fn.def	2017-11-09 13:21:02.323463345 +0000
@@ -102,6 +102,17 @@  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_LO, ECF_CONST | ECF_NOTHROW,
+		       vec_interleave_lo, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_HI, ECF_CONST | ECF_NOTHROW,
+		       vec_interleave_hi, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT_EVEN, ECF_CONST | ECF_NOTHROW,
+		       vec_extract_even, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT_ODD, ECF_CONST | ECF_NOTHROW,
+		       vec_extract_odd, binary)
+DEF_INTERNAL_OPTAB_FN (VEC_REVERSE, ECF_CONST | ECF_NOTHROW,
+		       vec_reverse, unary)
+
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
 /* Unary math functions.  */
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-09 13:21:01.989917982 +0000
+++ gcc/optabs.def	2017-11-09 13:21:02.323463345 +0000
@@ -309,6 +309,11 @@  OPTAB_D (vec_perm_optab, "vec_perm$a")
 OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a")
 OPTAB_D (vec_set_optab, "vec_set$a")
 OPTAB_D (vec_shr_optab, "vec_shr_$a")
+OPTAB_D (vec_interleave_lo_optab, "vec_interleave_lo_$a")
+OPTAB_D (vec_interleave_hi_optab, "vec_interleave_hi_$a")
+OPTAB_D (vec_extract_even_optab, "vec_extract_even_$a")
+OPTAB_D (vec_extract_odd_optab, "vec_extract_odd_$a")
+OPTAB_D (vec_reverse_optab, "vec_reverse_$a")
 OPTAB_D (vec_unpacks_float_hi_optab, "vec_unpacks_float_hi_$a")
 OPTAB_D (vec_unpacks_float_lo_optab, "vec_unpacks_float_lo_$a")
 OPTAB_D (vec_unpacks_hi_optab, "vec_unpacks_hi_$a")
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/tree-vect-data-refs.c	2017-11-09 13:21:02.326167766 +0000
@@ -52,6 +52,7 @@  Software Foundation; either version 3, o
 #include "params.h"
 #include "tree-cfg.h"
 #include "tree-hash-traits.h"
+#include "internal-fn.h"
 
 /* Return true if load- or store-lanes optab OPTAB is implemented for
    COUNT vectors of type VECTYPE.  NAME is the name of OPTAB.  */
@@ -4636,7 +4637,16 @@  vect_grouped_store_supported (tree vecty
       return false;
     }
 
-  /* Check that the permutation is supported.  */
+  /* Powers of 2 use a tree of interleaving operations.  See whether the
+     target supports them directly.  */
+  if (count != 3
+      && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_LO, vectype,
+					 OPTIMIZE_FOR_SPEED)
+      && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_HI, vectype,
+					 OPTIMIZE_FOR_SPEED))
+    return true;
+
+  /* Otherwise check for support in the form of general permutations.  */
   unsigned int nelt;
   if (VECTOR_MODE_P (mode) && GET_MODE_NUNITS (mode).is_constant (&nelt))
     {
@@ -4881,50 +4891,78 @@  vect_permute_store_chain (vec<tree> dr_c
       /* If length is not equal to 3 then only power of 2 is supported.  */
       gcc_assert (pow2p_hwi (length));
 
-      /* vect_grouped_store_supported ensures that this is constant.  */
-      unsigned int nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
-      auto_vec_perm_indices sel (nelt);
-      sel.quick_grow (nelt);
-      for (i = 0, n = nelt / 2; i < n; i++)
+      if (direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_LO, vectype,
+					  OPTIMIZE_FOR_SPEED)
+	  && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_HI, vectype,
+					     OPTIMIZE_FOR_SPEED))
+	{
+	  /* We could support the case where only one of the optabs is
+	     implemented, but that seems unlikely.  */
+	  perm_mask_low = NULL_TREE;
+	  perm_mask_high = NULL_TREE;
+	}
+      else
 	{
-	  sel[i * 2] = i;
-	  sel[i * 2 + 1] = i + nelt;
+	  /* vect_grouped_store_supported ensures that this is constant.  */
+	  unsigned int nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
+	  auto_vec_perm_indices sel (nelt);
+	  sel.quick_grow (nelt);
+	  for (i = 0, n = nelt / 2; i < n; i++)
+	    {
+	      sel[i * 2] = i;
+	      sel[i * 2 + 1] = i + nelt;
+	    }
+	  perm_mask_low = vect_gen_perm_mask_checked (vectype, sel);
+
+	  for (i = 0; i < nelt; i++)
+	    sel[i] += nelt / 2;
+	  perm_mask_high = vect_gen_perm_mask_checked (vectype, sel);
 	}
-	perm_mask_high = vect_gen_perm_mask_checked (vectype, sel);
 
-	for (i = 0; i < nelt; i++)
-	  sel[i] += nelt / 2;
-	perm_mask_low = vect_gen_perm_mask_checked (vectype, sel);
+      for (i = 0, n = log_length; i < n; i++)
+	{
+	  for (j = 0; j < length / 2; j++)
+	    {
+	      vect1 = dr_chain[j];
+	      vect2 = dr_chain[j + length / 2];
 
-	for (i = 0, n = log_length; i < n; i++)
-	  {
-	    for (j = 0; j < length/2; j++)
-	      {
-		vect1 = dr_chain[j];
-		vect2 = dr_chain[j+length/2];
+	      /* Create interleaving stmt:
+		 high = VEC_PERM_EXPR <vect1, vect2,
+				       {0, nelt, 1, nelt + 1, ...}>  */
+	      low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
+	      if (perm_mask_low)
+		perm_stmt = gimple_build_assign (low, VEC_PERM_EXPR, vect1,
+						 vect2, perm_mask_low);
+	      else
+		{
+		  perm_stmt = gimple_build_call_internal
+		    (IFN_VEC_INTERLEAVE_LO, 2, vect1, vect2);
+		  gimple_set_lhs (perm_stmt, low);
+		}
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[2 * j] = low;
 
-		/* Create interleaving stmt:
-		   high = VEC_PERM_EXPR <vect1, vect2, {0, nelt, 1, nelt+1,
-							...}>  */
-		high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
+	      /* Create interleaving stmt:
+		 high = VEC_PERM_EXPR <vect1, vect2,
+				      {nelt / 2, nelt * 3 / 2,
+				       nelt / 2 + 1, nelt * 3 / 2 + 1,
+				       ...}>  */
+	      high = make_temp_ssa_name (vectype, NULL, "vect_inter_high");
+	      if (perm_mask_high)
 		perm_stmt = gimple_build_assign (high, VEC_PERM_EXPR, vect1,
 						 vect2, perm_mask_high);
-		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-		(*result_chain)[2*j] = high;
-
-		/* Create interleaving stmt:
-		   low = VEC_PERM_EXPR <vect1, vect2,
-					{nelt/2, nelt*3/2, nelt/2+1, nelt*3/2+1,
-					 ...}>  */
-		low = make_temp_ssa_name (vectype, NULL, "vect_inter_low");
-		perm_stmt = gimple_build_assign (low, VEC_PERM_EXPR, vect1,
-						 vect2, perm_mask_low);
-		vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-		(*result_chain)[2*j+1] = low;
-	      }
-	    memcpy (dr_chain.address (), result_chain->address (),
-		    length * sizeof (tree));
-	  }
+	      else
+		{
+		  perm_stmt = gimple_build_call_internal
+		    (IFN_VEC_INTERLEAVE_HI, 2, vect1, vect2);
+		  gimple_set_lhs (perm_stmt, high);
+		}
+	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+	      (*result_chain)[2 * j + 1] = high;
+	    }
+	  memcpy (dr_chain.address (), result_chain->address (),
+		  length * sizeof (tree));
+	}
     }
 }
 
@@ -5235,7 +5273,16 @@  vect_grouped_load_supported (tree vectyp
       return false;
     }
 
-  /* Check that the permutation is supported.  */
+  /* Powers of 2 use a tree of extract operations.  See whether the
+     target supports them directly.  */
+  if (count != 3
+      && direct_internal_fn_supported_p (IFN_VEC_EXTRACT_EVEN, vectype,
+					 OPTIMIZE_FOR_SPEED)
+      && direct_internal_fn_supported_p (IFN_VEC_EXTRACT_ODD, vectype,
+					 OPTIMIZE_FOR_SPEED))
+    return true;
+
+  /* Otherwise check for support in the form of general permutations.  */
   unsigned int nelt;
   if (VECTOR_MODE_P (mode) && GET_MODE_NUNITS (mode).is_constant (&nelt))
     {
@@ -5464,17 +5511,30 @@  vect_permute_load_chain (vec<tree> dr_ch
       /* If length is not equal to 3 then only power of 2 is supported.  */
       gcc_assert (pow2p_hwi (length));
 
-      /* vect_grouped_load_supported ensures that this is constant.  */
-      unsigned nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
-      auto_vec_perm_indices sel (nelt);
-      sel.quick_grow (nelt);
-      for (i = 0; i < nelt; ++i)
-	sel[i] = i * 2;
-      perm_mask_even = vect_gen_perm_mask_checked (vectype, sel);
-
-      for (i = 0; i < nelt; ++i)
-	sel[i] = i * 2 + 1;
-      perm_mask_odd = vect_gen_perm_mask_checked (vectype, sel);
+      if (direct_internal_fn_supported_p (IFN_VEC_EXTRACT_EVEN, vectype,
+					  OPTIMIZE_FOR_SPEED)
+	  && direct_internal_fn_supported_p (IFN_VEC_EXTRACT_ODD, vectype,
+					     OPTIMIZE_FOR_SPEED))
+	{
+	  /* We could support the case where only one of the optabs is
+	     implemented, but that seems unlikely.  */
+	  perm_mask_even = NULL_TREE;
+	  perm_mask_odd = NULL_TREE;
+	}
+      else
+	{
+	  /* vect_grouped_load_supported ensures that this is constant.  */
+	  unsigned nelt = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
+	  auto_vec_perm_indices sel (nelt);
+	  sel.quick_grow (nelt);
+	  for (i = 0; i < nelt; ++i)
+	    sel[i] = i * 2;
+	  perm_mask_even = vect_gen_perm_mask_checked (vectype, sel);
+
+	  for (i = 0; i < nelt; ++i)
+	    sel[i] = i * 2 + 1;
+	  perm_mask_odd = vect_gen_perm_mask_checked (vectype, sel);
+	}
 
       for (i = 0; i < log_length; i++)
 	{
@@ -5485,19 +5545,33 @@  vect_permute_load_chain (vec<tree> dr_ch
 
 	      /* data_ref = permute_even (first_data_ref, second_data_ref);  */
 	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_even");
-	      perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR,
-					       first_vect, second_vect,
-					       perm_mask_even);
+	      if (perm_mask_even)
+		perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR,
+						 first_vect, second_vect,
+						 perm_mask_even);
+	      else
+		{
+		  perm_stmt = gimple_build_call_internal
+		    (IFN_VEC_EXTRACT_EVEN, 2, first_vect, second_vect);
+		  gimple_set_lhs (perm_stmt, data_ref);
+		}
 	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	      (*result_chain)[j/2] = data_ref;
+	      (*result_chain)[j / 2] = data_ref;
 
 	      /* data_ref = permute_odd (first_data_ref, second_data_ref);  */
 	      data_ref = make_temp_ssa_name (vectype, NULL, "vect_perm_odd");
-	      perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR,
-					       first_vect, second_vect,
-					       perm_mask_odd);
+	      if (perm_mask_odd)
+		perm_stmt = gimple_build_assign (data_ref, VEC_PERM_EXPR,
+						 first_vect, second_vect,
+						 perm_mask_odd);
+	      else
+		{
+		  perm_stmt = gimple_build_call_internal
+		    (IFN_VEC_EXTRACT_ODD, 2, first_vect, second_vect);
+		  gimple_set_lhs (perm_stmt, data_ref);
+		}
 	      vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-	      (*result_chain)[j/2+length/2] = data_ref;
+	      (*result_chain)[j / 2 + length / 2] = data_ref;
 	    }
 	  memcpy (dr_chain.address (), result_chain->address (),
 		  length * sizeof (tree));
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/tree-vect-stmts.c	2017-11-09 13:21:02.327069240 +0000
@@ -1796,6 +1796,46 @@  perm_mask_for_reverse (tree vectype)
   return vect_gen_perm_mask_checked (vectype, sel);
 }
 
+/* Return true if the target can reverse the elements in a vector of
+   type VECTOR_TYPE.  */
+
+static bool
+can_reverse_vector_p (tree vector_type)
+{
+  return (direct_internal_fn_supported_p (IFN_VEC_REVERSE, vector_type,
+					  OPTIMIZE_FOR_SPEED)
+	  || perm_mask_for_reverse (vector_type));
+}
+
+/* Generate a statement to reverse the elements in vector INPUT and
+   return the SSA name that holds the result.  GSI is a statement iterator
+   pointing to STMT, which is the scalar statement we're vectorizing.
+   VEC_DEST is the destination variable with which new SSA names
+   should be associated.  */
+
+static tree
+reverse_vector (tree vec_dest, tree input, gimple *stmt,
+		gimple_stmt_iterator *gsi)
+{
+  tree new_temp = make_ssa_name (vec_dest);
+  tree vector_type = TREE_TYPE (input);
+  gimple *perm_stmt;
+  if (direct_internal_fn_supported_p (IFN_VEC_REVERSE, vector_type,
+				      OPTIMIZE_FOR_SPEED))
+    {
+      perm_stmt = gimple_build_call_internal (IFN_VEC_REVERSE, 1, input);
+      gimple_set_lhs (perm_stmt, new_temp);
+    }
+  else
+    {
+      tree perm_mask = perm_mask_for_reverse (vector_type);
+      perm_stmt = gimple_build_assign (new_temp, VEC_PERM_EXPR,
+				       input, input, perm_mask);
+    }
+  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
+  return new_temp;
+}
+
 /* A subroutine of get_load_store_type, with a subset of the same
    arguments.  Handle the case where STMT is part of a grouped load
    or store.
@@ -1999,7 +2039,7 @@  get_negative_load_store_type (gimple *st
       return VMAT_CONTIGUOUS_DOWN;
     }
 
-  if (!perm_mask_for_reverse (vectype))
+  if (!can_reverse_vector_p (vectype))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -6760,20 +6800,10 @@  vectorizable_store (gimple *stmt, gimple
 
 	      if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
 		{
-		  tree perm_mask = perm_mask_for_reverse (vectype);
 		  tree perm_dest 
 		    = vect_create_destination_var (gimple_assign_rhs1 (stmt),
 						   vectype);
-		  tree new_temp = make_ssa_name (perm_dest);
-
-		  /* Generate the permute statement.  */
-		  gimple *perm_stmt 
-		    = gimple_build_assign (new_temp, VEC_PERM_EXPR, vec_oprnd,
-					   vec_oprnd, perm_mask);
-		  vect_finish_stmt_generation (stmt, perm_stmt, gsi);
-
-		  perm_stmt = SSA_NAME_DEF_STMT (new_temp);
-		  vec_oprnd = new_temp;
+		  vec_oprnd = reverse_vector (perm_dest, vec_oprnd, stmt, gsi);
 		}
 
 	      /* Arguments are ready.  Create the new vector stmt.  */
@@ -7998,9 +8028,7 @@  vectorizable_load (gimple *stmt, gimple_
 
 	      if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
 		{
-		  tree perm_mask = perm_mask_for_reverse (vectype);
-		  new_temp = permute_vec_elements (new_temp, new_temp,
-						   perm_mask, stmt, gsi);
+		  new_temp = reverse_vector (vec_dest, new_temp, stmt, gsi);
 		  new_stmt = SSA_NAME_DEF_STMT (new_temp);
 		}
 
Index: gcc/config/aarch64/iterators.md
===================================================================
--- gcc/config/aarch64/iterators.md	2017-11-09 13:21:01.989917982 +0000
+++ gcc/config/aarch64/iterators.md	2017-11-09 13:21:02.322561871 +0000
@@ -1556,6 +1556,11 @@  (define_int_attr pauth_hint_num_a [(UNSP
 				    (UNSPEC_PACI1716 "8")
 				    (UNSPEC_AUTI1716 "12")])
 
+(define_int_attr perm_optab [(UNSPEC_ZIP1 "vec_interleave_lo")
+			     (UNSPEC_ZIP2 "vec_interleave_hi")
+			     (UNSPEC_UZP1 "vec_extract_even")
+			     (UNSPEC_UZP2 "vec_extract_odd")])
+
 (define_int_attr perm_insn [(UNSPEC_ZIP1 "zip") (UNSPEC_ZIP2 "zip")
 			    (UNSPEC_TRN1 "trn") (UNSPEC_TRN2 "trn")
 			    (UNSPEC_UZP1 "uzp") (UNSPEC_UZP2 "uzp")])
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-09 13:21:01.989917982 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-09 13:21:02.320758923 +0000
@@ -630,6 +630,19 @@  (define_expand "vec_perm<mode>"
   }
 )
 
+(define_expand "<perm_optab>_<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand")
+	(unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand")
+			 (match_operand:SVE_ALL 2 "register_operand")]
+			OPTAB_PERMUTE))]
+  "TARGET_SVE && !GET_MODE_NUNITS (<MODE>mode).is_constant ()")
+
+(define_expand "vec_reverse_<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand")
+	(unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand")]
+			UNSPEC_REV))]
+  "TARGET_SVE && !GET_MODE_NUNITS (<MODE>mode).is_constant ()")
+
 (define_insn "*aarch64_sve_tbl<mode>"
   [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
 	(unspec:SVE_ALL
Index: gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-2.c	2017-11-09 13:21:02.323463345 +0000
@@ -51,7 +51,4 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" {xfail { vect_no_align && { ! vect_hw_misalign } } } } } */
-/* Requires reverse for variable-length SVE, which is implemented for
-   by a later patch.  Until then we report it twice, once for SVE and
-   once for 128-bit Advanced SIMD.  */
-/* { dg-final { scan-tree-dump-times "dependence distance negative" 1 "vect" { xfail { aarch64_sve && vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "dependence distance negative" 1 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/no-vfa-vect-depend-3.c	2017-11-09 13:21:02.323463345 +0000
@@ -183,7 +183,4 @@  int main ()
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" {xfail { vect_no_align && { ! vect_hw_misalign } } } } } */
-/* f4 requires reverse for SVE, which is implemented by a later patch.
-   Until then we report it twice, once for SVE and once for 128-bit
-   Advanced SIMD.  */
-/* { dg-final { scan-tree-dump-times "dependence distance negative" 4 "vect" { xfail { aarch64_sve && vect_variable_length } } } } */
+/* { dg-final { scan-tree-dump-times "dependence distance negative" 4 "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/pr33953.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr33953.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/pr33953.c	2017-11-09 13:21:02.323463345 +0000
@@ -29,6 +29,6 @@  void blockmove_NtoN_blend_noremap32 (con
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { vect_no_align && { ! vect_hw_misalign } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_align && { ! vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { { vect_no_align && { ! vect_hw_misalign } } || vect_variable_length } } } } */
 
 
Index: gcc/testsuite/gcc.dg/vect/pr68445.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr68445.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/pr68445.c	2017-11-09 13:21:02.323463345 +0000
@@ -16,4 +16,4 @@  void IMB_double_fast_x (int *destf, int
     }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail vect_variable_length } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-12a.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-12a.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-12a.c	2017-11-09 13:21:02.323463345 +0000
@@ -75,5 +75,5 @@  int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { vect_strided8 && vect_int_mult } xfail vect_variable_length } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-13-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-13-big-array.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-13-big-array.c	2017-11-09 13:21:02.324364818 +0000
@@ -134,4 +134,4 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && { ! vect_pack_trunc } } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! vect_pack_trunc } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && vect_pack_trunc } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc xfail vect_variable_length } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-13.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-13.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-13.c	2017-11-09 13:21:02.324364818 +0000
@@ -128,4 +128,4 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && { ! vect_pack_trunc } } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target { ! vect_pack_trunc } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 3 loops" 1 "vect" { target { { vect_interleave && vect_extract_even_odd } && vect_pack_trunc } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" { target vect_pack_trunc xfail vect_variable_length } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-14.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-14.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-14.c	2017-11-09 13:21:02.324364818 +0000
@@ -111,5 +111,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_int_mult } } }  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult xfail vect_variable_length } } } */
   
Index: gcc/testsuite/gcc.dg/vect/slp-15.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-15.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-15.c	2017-11-09 13:21:02.324364818 +0000
@@ -112,6 +112,6 @@  int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  {target vect_int_mult } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect"  {target  { ! { vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" {target vect_int_mult } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { target vect_int_mult xfail vect_variable_length } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" {target { ! { vect_int_mult } } } } } */
   
Index: gcc/testsuite/gcc.dg/vect/slp-42.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-42.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-42.c	2017-11-09 13:21:02.324364818 +0000
@@ -15,5 +15,5 @@  void foo (int n)
     }
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail vect_variable_length } } } */
 /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c	2017-11-09 13:21:02.324364818 +0000
@@ -77,5 +77,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect"  } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { xfail vect_variable_length } } } */
   
Index: gcc/testsuite/gcc.dg/vect/slp-multitypes-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-multitypes-4.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-multitypes-4.c	2017-11-09 13:21:02.324364818 +0000
@@ -52,5 +52,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  { target vect_unpack } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect"  { target vect_unpack } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_unpack xfail vect_variable_length } } } */
   
Index: gcc/testsuite/gcc.dg/vect/slp-multitypes-5.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-multitypes-5.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-multitypes-5.c	2017-11-09 13:21:02.324364818 +0000
@@ -52,5 +52,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_pack_trunc } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_pack_trunc } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target vect_pack_trunc xfail vect_variable_length } } } */
   
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-4.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-4.c	2017-11-09 13:21:02.325266292 +0000
@@ -57,5 +57,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_min_max } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_min_max || vect_variable_length } } } } */
 
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-7.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-7.c	2017-11-09 13:21:01.989917982 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-7.c	2017-11-09 13:21:02.325266292 +0000
@@ -55,5 +55,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || vect_variable_length } } } } */
 
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2.c
===================================================================
--- /dev/null	2017-11-09 12:47:20.377612760 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2.c	2017-11-09 13:21:02.325266292 +0000
@@ -0,0 +1,31 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+TYPE __attribute__ ((noinline, noclone))			\
+vec_reverse_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  for (int i = 0; i < n; ++i)					\
+    a[i] = b[n - i - 1];					\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.b, z[0-9]+\.b\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.s, z[0-9]+\.s\n} 3 } } */
+/* { dg-final { scan-assembler-times {\trev\tz[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2_run.c
===================================================================
--- /dev/null	2017-11-09 12:47:20.377612760 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_2_run.c	2017-11-09 13:21:02.325266292 +0000
@@ -0,0 +1,29 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_vec_perm_2.c"
+
+#define N 153
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N], b[N];						\
+    for (unsigned int i = 0; i < N; ++i)			\
+      {								\
+	b[i] = i * 2 + i % 5;					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    vec_reverse_##TYPE (a, b, N);				\
+    for (unsigned int i = 0; i < N; ++i)			\
+      {								\
+	TYPE expected = (N - i - 1) * 2 + (N - i - 1) % 5;	\
+	if (a[i] != expected)					\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3.c
===================================================================
--- /dev/null	2017-11-09 12:47:20.377612760 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3.c	2017-11-09 13:21:02.325266292 +0000
@@ -0,0 +1,46 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)					\
+TYPE __attribute__ ((noinline, noclone))		\
+vec_zip_##TYPE (TYPE *restrict a, TYPE *restrict b,	\
+		TYPE *restrict c, long n)		\
+{							\
+  for (long i = 0; i < n; ++i)				\
+    {							\
+      a[i * 8] = c[i * 4];				\
+      a[i * 8 + 1] = b[i * 4];				\
+      a[i * 8 + 2] = c[i * 4 + 1];			\
+      a[i * 8 + 3] = b[i * 4 + 1];			\
+      a[i * 8 + 4] = c[i * 4 + 2];			\
+      a[i * 8 + 5] = b[i * 4 + 2];			\
+      a[i * 8 + 6] = c[i * 4 + 3];			\
+      a[i * 8 + 7] = b[i * 4 + 3];			\
+    }							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */
+/* Currently we can't use SLP for groups bigger than 128 bits.  */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 36 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 36 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 36 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 36 { xfail *-*-* } } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3_run.c
===================================================================
--- /dev/null	2017-11-09 12:47:20.377612760 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_3_run.c	2017-11-09 13:21:02.325266292 +0000
@@ -0,0 +1,31 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_vec_perm_3.c"
+
+#define N (43 * 8)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N], b[N], c[N];					\
+    for (unsigned int i = 0; i < N; ++i)			\
+      {								\
+	b[i] = i * 2 + i % 5;					\
+	c[i] = i * 3;						\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    vec_zip_##TYPE (a, b, c, N / 8);				\
+    for (unsigned int i = 0; i < N / 2; ++i)			\
+      {								\
+	TYPE expected1 = i * 3;					\
+	TYPE expected2 = i * 2 + i % 5;				\
+	if (a[i * 2] != expected1 || a[i * 2 + 1] != expected2)	\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4.c
===================================================================
--- /dev/null	2017-11-09 12:47:20.377612760 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4.c	2017-11-09 13:21:02.325266292 +0000
@@ -0,0 +1,52 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)					\
+TYPE __attribute__ ((noinline, noclone))		\
+vec_uzp_##TYPE (TYPE *restrict a, TYPE *restrict b,	\
+		 TYPE *restrict c, long n)		\
+{							\
+  for (long i = 0; i < n; ++i)				\
+    {							\
+      a[i * 4] = c[i * 8];				\
+      b[i * 4] = c[i * 8 + 1];				\
+      a[i * 4 + 1] = c[i * 8 + 2];			\
+      b[i * 4 + 1] = c[i * 8 + 3];			\
+      a[i * 4 + 2] = c[i * 8 + 4];			\
+      b[i * 4 + 2] = c[i * 8 + 5];			\
+      a[i * 4 + 3] = c[i * 8 + 6];			\
+      b[i * 4 + 3] = c[i * 8 + 7];			\
+    }							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* We could use a single uzp1 and uzp2 per function by implementing
+   SLP load permutation for variable width.  XFAIL until then.  */
+/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 { xfail *-*-* } } } */
+/* Delete these if the tests above start passing instead.  */
+/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */
+/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.b, z[0-9]+\.b, z[0-9]+\.b\n} 24 } } */
+/* { dg-final { scan-assembler-times {\tuzp1\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */
+/* { dg-final { scan-assembler-times {\tuzp2\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 24 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4_run.c
===================================================================
--- /dev/null	2017-11-09 12:47:20.377612760 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_perm_4_run.c	2017-11-09 13:21:02.325266292 +0000
@@ -0,0 +1,29 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_vec_perm_4.c"
+
+#define N (43 * 8)
+
+#define HARNESS(TYPE)					\
+  {							\
+    TYPE a[N], b[N], c[N];				\
+    for (unsigned int i = 0; i < N; ++i)		\
+      {							\
+	c[i] = i * 2 + i % 5;				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    vec_uzp_##TYPE (a, b, c, N / 8);			\
+    for (unsigned int i = 0; i < N; ++i)		\
+      {							\
+	TYPE expected = i * 2 + i % 5;			\
+	if ((i & 1 ? b[i / 2] : a[i / 2]) != expected)	\
+	  __builtin_abort ();				\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}