
[ARM,RFC] Fix vect.exp failures for NEON in big-endian mode

Message ID 20130227172947.31fa279c@octopus
State New

Commit Message

Julian Brown Feb. 27, 2013, 5:29 p.m. UTC
Hi,

Several new (ish?) autovectorizer features have apparently caused NEON
support for same to regress quite heavily in big-endian mode. This
patch is an attempt to fix things up, but is not without problems --
maybe someone will have a suggestion as to how we should proceed.

The problem (as ever) is that the ARM backend must lie to the
middle-end about the layout of NEON vectors in big-endian mode (due to
ABI requirements, VFP compatibility, and the middle-end semantics of
vector indices being equivalent to those of an array with the same type
of elements when stored in memory). A few years ago when the vectorizer
was relatively less sophisticated, the ordering of vector elements
could be ignored to some extent by disabling certain instruction
patterns used by the vectorizer in big-endian mode which were sensitive
to the ordering of elements: in fact this is still the strategy we're
using, but it is clearly becoming less and less tenable as time
progresses. Quad-word registers (being composed of two double-word
registers, loaded/stored the "wrong way round" in big-endian mode)
arguably cause more problems than double-word registers.
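
To make the constraint concrete, the invariant the middle-end assumes can
be written as a tiny C check (purely an illustrative sketch, not code from
the patch):

typedef int v4si __attribute__ ((vector_size (16)));

int
middle_end_assumption (void)
{
  union { v4si vec; int arr[4]; } u = { .arr = { 0, 1, 2, 3 } };
  /* The middle-end assumes vector element i is the same datum as array
     element i of the in-memory representation.  A backend whose vector
     loads/stores permute elements in big-endian mode has to hide that
     permutation to preserve this equivalence.  */
  return u.vec[1] == u.arr[1];
}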

So, the idea behind the attached patch was supposed to be to limit the
autovectorizer to using double-word registers only, and to disable a
few additional (or newly-used by the vectorizer) patterns in big-endian
mode. That, plus several testsuite tweaks, gets us down to zero
failures for vect.exp, which is good.

The problem is that at the same time quite a large set of neon.exp tests
regress (vzip/vuzp/vtrn): one of the new patterns which is
disabled because it causes trouble (i.e. execution failures) for the
vectorizer is vec_perm_const<mode>. However __builtin_shuffle (which
uses that pattern) is used for arm_neon.h now -- so disabling it means
that the proper instructions aren't generated for intrinsics any more in
big-endian mode.
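
For illustration (a minimal sketch, not the actual header code), this is
the kind of shuffle involved:

typedef signed char v8qi __attribute__ ((vector_size (8)));

v8qi
zip_lo (v8qi a, v8qi b)
{
  /* Interleave the low halves of a and b.  With vec_perm_const<mode>
     enabled this permutation can be recognised by the backend's vzip
     support and implemented with a single vzip.8; with the pattern
     disabled, the proper instruction is no longer generated and GCC
     falls back to slower generic code.  */
  return __builtin_shuffle (a, b, (v8qi) { 0, 8, 1, 9, 2, 10, 3, 11 });
}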

I think we have a problem here. The vectorizer also tries to use
__builtin_shuffle (for scatter/gather operations, when lane
loading/storing ops aren't available), but does not understand the
"special tweaks" that arm_evpc_neon_{vuzp,vzip,vtrn} does to try to
hide the true element ordering of vectors from the middle-end. So, I'm
left wondering:

 * Given our funky element ordering in BE mode, are the
   __builtin_shuffle lists in arm_neon.h actually an accurate
   representation of what the given intrinsic should do? (The fallback
   code might or might not do the same thing, I'm not sure.)

 * The vectorizer tries to use VEC_PERM_EXPR (equivalent to
   __builtin_shuffle) with e.g. pairs of doubleword registers loaded
   from adjacent memory locations. Are the semantics required for this
   (again, with our funky element ordering) even the same as those
   required for the intrinsics? Including quad-word registers for the
   latter? (My suspicion is "no", in which case there's a fundamental
   incompatibility here that needs to be resolved somehow.)

Anyway: the tl;dr is "fixing NEON vect tests breaks intrinsics". Any
ideas for what to do about that? (FAOD, I don't think I'm in a position
to do the kind of middle-end surgery required to fix the problem
"properly" at this point :-p).

(It's arguably more important for the vectorizer to not generate bad
code than it is for intrinsics to work properly, in which case: OK to
apply? Tested cross to ARM EABI with configury modifications to build
LE/BE multilibs.)

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/arm.c (arm_array_mode_supported_p): No array modes for
    big-endian NEON.
    (arm_preferred_simd_mode): Always prefer 64-bit modes for
    big-endian NEON.
    (arm_autovectorize_vector_sizes): Use 8-byte vectors only for NEON.
    (arm_vectorize_vec_perm_const_ok): No permutations are OK in
    big-endian mode.
    * config/arm/neon.md (vec_load_lanes<mode><mode>): Disable in
    big-endian mode.
    (vec_store_lanes<mode><mode>, vec_load_lanesti<mode>)
    (vec_load_lanesoi<mode>, vec_store_lanesti<mode>)
    (vec_store_lanesoi<mode>, vec_load_lanesei<mode>)
    (vec_load_lanesci<mode>, vec_store_lanesei<mode>)
    (vec_store_lanesci<mode>, vec_load_lanesxi<mode>)
    (vec_store_lanesxi<mode>): Likewise.
    (vec_widen_<US>shiftl_lo_<mode>, vec_widen_<US>shiftl_hi_<mode>)
    (vec_widen_<US>mult_hi_<mode>, vec_widen_<US>mult_lo_<mode>):
    Likewise.

    gcc/testsuite/
    * gcc.dg/vect/slp-cond-3.c: XFAIL for !vect_unpack.
    * gcc.dg/vect/slp-cond-4.c: Likewise.
    * gcc.dg/vect/vect-1.c: Likewise.
    * gcc.dg/vect/vect-1-big-array.c: Likewise.
    * gcc.dg/vect/vect-35.c: Likewise.
    * gcc.dg/vect/vect-35-big-array.c: Likewise.
    * gcc.dg/vect/bb-slp-11.c: Likewise.
    * gcc.dg/vect/bb-slp-26.c: Likewise.
    * gcc.dg/vect/vect-over-widen-3-big-array.c: XFAIL
    for !vect_element_align.
    * gcc.dg/vect/vect-over-widen-1.c: Likewise.
    * gcc.dg/vect/vect-over-widen-1-big-array.c: Likewise.
    * gcc.dg/vect/vect-over-widen-2.c: Likewise.
    * gcc.dg/vect/vect-over-widen-2-big-array.c: Likewise.
    * gcc.dg/vect/vect-over-widen-3.c: Likewise.
    * gcc.dg/vect/vect-over-widen-4.c: Likewise.
    * gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
    * gcc.dg/vect/pr43430-2.c: Likewise.
    * gcc.dg/vect/vect-widen-shift-u16.c: XFAIL for !vect_widen_shift
    && !vect_unpack.
    * gcc.dg/vect/vect-widen-shift-s8.c: Likewise.
    * gcc.dg/vect/vect-widen-shift-u8.c: Likewise.
    * gcc.dg/vect/vect-widen-shift-s16.c: Likewise.
    * gcc.dg/vect/vect-93.c: Only run if !vect_intfloat_cvt.
    * gcc.dg/vect/vect-intfloat-conversion-4a.c: Only run if
    vect_unpack.
    * gcc.dg/vect/vect-intfloat-conversion-4b.c: Likewise.
    * lib/target-supports.exp (check_effective_target_vect_perm): Only
    enable for NEON little-endian.
    (check_effective_target_vect_widen_sum_qi_to_hi): Likewise.
    (check_effective_target_vect_widen_mult_qi_to_hi): Likewise.
    (check_effective_target_vect_widen_mult_hi_to_si): Likewise.
    (check_effective_target_vect_widen_shift): Likewise.
    (check_effective_target_vect_extract_even_odd): Likewise.
    (check_effective_target_vect_interleave): Likewise.
    (check_effective_target_vect_stridedN): Likewise.
    (check_effective_target_vect_multiple_sizes): Likewise.
    (check_effective_target_vect64): Enable for any NEON.

Comments

Janis Johnson Feb. 27, 2013, 7:04 p.m. UTC | #1
On 02/27/2013 09:29 AM, Julian Brown wrote:
> Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
> +++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
> @@ -79,6 +79,6 @@ int main ()
>    return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
>  /* { dg-final { cleanup-tree-dump "vect" } } */
>  

If this and other modified checks only fail for ARM big-endian then they
should check for that so they don't XPASS for other targets.  It's also
possible now to do things like { target vect_blah xfail arm_big_endian },
which might be useful for some tests.

Janis
Julian Brown Feb. 28, 2013, 10:06 a.m. UTC | #2
On Wed, 27 Feb 2013 11:04:04 -0800
Janis Johnson <janis_johnson@mentor.com> wrote:

> On 02/27/2013 09:29 AM, Julian Brown wrote:
> > Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
> > ===================================================================
> > --- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
> > +++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
> > @@ -79,6 +79,6 @@ int main ()
> >    return 0;
> >  }
> >  
> > -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP"
> > 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing
> > stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */ /*
> > { dg-final { cleanup-tree-dump "vect" } } */ 
> 
> If this and other modified checks only fail for ARM big-endian then
> they should check for that so they don't XPASS for other targets.
> It's also possible now to do things like { target vect_blah xfail
> arm_big_endian }, which might be useful for some tests.

I don't think I understand -- my expectation was e.g. that that test
would fail for any target which doesn't support vect_unpack. Surely
you'd only get an XPASS if the test passed when vect_unpack was not
true?

I'm not sure why checking for a particular architecture-specific
predicate would be preferable to checking that a general feature is
supported. As time progresses, it might well be that e.g. vect_unpack
becomes supported for big-endian ARM, at which point we shouldn't need
to edit all the individual tests again...

Thanks,

Julian
Janis Johnson Feb. 28, 2013, 4:10 p.m. UTC | #3
On 02/28/2013 02:06 AM, Julian Brown wrote:
> On Wed, 27 Feb 2013 11:04:04 -0800
> Janis Johnson <janis_johnson@mentor.com> wrote:
> 
>> On 02/27/2013 09:29 AM, Julian Brown wrote:
>>> Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
>>> ===================================================================
>>> --- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
>>> +++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
>>> @@ -79,6 +79,6 @@ int main ()
>>>    return 0;
>>>  }
>>>  
>>> -/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP"
>>> 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorizing
>>> stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */ /*
>>> { dg-final { cleanup-tree-dump "vect" } } */ 
>>
>> If this and other modified checks only fail for ARM big-endian then
>> they should check for that so they don't XPASS for other targets.
>> It's also possible now to do things like { target vect_blah xfail
>> arm_big_endian }, which might be useful for some tests.
> 
> I don't think I understand -- my expectation was e.g. that that test
> would fail for any target which doesn't support vect_unpack. Surely
> you'd only get an XPASS if the test passed when vect_unpack was not
> true?

Right.  Please ignore my mail, I was confused. 

> I'm not sure why checking for a particular architecture-specific
> predicate would be preferable to checking that a general feature is
> supported. As time progresses, it might well be that e.g. vect_unpack
> becomes supported for big-endian ARM, at which point we shouldn't need
> to edit all the individual tests again...

Right.  Once again, I was confused, ignore me.

Janis
Richard Biener March 1, 2013, 10:07 a.m. UTC | #4
On Wed, Feb 27, 2013 at 6:29 PM, Julian Brown <julian@codesourcery.com> wrote:
> Hi,
>
> Several new (ish?) autovectorizer features have apparently caused NEON
> support for same to regress quite heavily in big-endian mode. This
> patch is an attempt to fix things up, but is not without problems --
> maybe someone will have a suggestion as to how we should proceed.
>
> The problem (as ever) is that the ARM backend must lie to the
> middle-end about the layout of NEON vectors in big-endian mode (due to
> ABI requirements, VFP compatibility, and the middle-end semantics of
> vector indices being equivalent to those of an array with the same type
> of elements when stored in memory).

Why not simply give up?  Thus, make autovectorization unsupported for
ARM big-endian targets?

Do I understand correctly that the "only" issue is memory vs. register
element ordering?  Thus a fixup could be as simple as extra shuffles
inserted after vector memory loads and before vector memory stores?
(with the hope of RTL optimizers optimizing those)?

Any "lies" are of course bad and you'll pay for them later.

Richard.

> A few years ago when the vectorizer
> was relatively less sophisticated, the ordering of vector elements
> could be ignored to some extent by disabling certain instruction
> patterns used by the vectorizer in big-endian mode which were sensitive
> to the ordering of elements: in fact this is still the strategy we're
> using, but it is clearly becoming less and less tenable as time
> progresses. Quad-word registers (being composed of two double-word
> registers, loaded/stored the "wrong way round" in big-endian mode)
> arguably cause more problems than double-word registers.
>
> So, the idea behind the attached patch was supposed to be to limit the
> autovectorizer to using double-word registers only, and to disable a
> few additional (or newly-used by the vectorizer) patterns in big-endian
> mode. That, plus several testsuite tweaks, gets us down to zero
> failures for vect.exp, which is good.
>
> The problem is that at the same time quite a large set of neon.exp tests
> regress (vzip/vuzp/vtrn): one of the new patterns which is
> disabled because it causes trouble (i.e. execution failures) for the
> vectorizer is vec_perm_const<mode>. However __builtin_shuffle (which
> uses that pattern) is used for arm_neon.h now -- so disabling it means
> that the proper instructions aren't generated for intrinsics any more in
> big-endian mode.
>
> I think we have a problem here. The vectorizer also tries to use
> __builtin_shuffle (for scatter/gather operations, when lane
> loading/storing ops aren't available), but does not understand the
> "special tweaks" that arm_evpc_neon_{vuzp,vzip,vtrn} does to try to
> hide the true element ordering of vectors from the middle-end. So, I'm
> left wondering:
>
>  * Given our funky element ordering in BE mode, are the
>    __builtin_shuffle lists in arm_neon.h actually an accurate
>    representation of what the given intrinsic should do? (The fallback
>    code might or might not do the same thing, I'm not sure.)
>
>  * The vectorizer tries to use VEC_PERM_EXPR (equivalent to
>    __builtin_shuffle) with e.g. pairs of doubleword registers loaded
>    from adjacent memory locations. Are the semantics required for this
>    (again, with our funky element ordering) even the same as those
>    required for the intrinsics? Including quad-word registers for the
>    latter? (My suspicion is "no", in which case there's a fundamental
>    incompatibility here that needs to be resolved somehow.)
>
> Anyway: the tl;dr is "fixing NEON vect tests breaks intrinsics". Any
> ideas for what to do about that? (FAOD, I don't think I'm in a position
> to do the kind of middle-end surgery required to fix the problem
> "properly" at this point :-p).
>
> (It's arguably more important for the vectorizer to not generate bad
> code than it is for intrinsics to work properly, in which case: OK to
> apply? Tested cross to ARM EABI with configury modifications to build
> LE/BE multilibs.)
>
> Thanks,
>
> Julian
>
> ChangeLog
>
>     gcc/
>     * config/arm/arm.c (arm_array_mode_supported_p): No array modes for
>     big-endian NEON.
>     (arm_preferred_simd_mode): Always prefer 64-bit modes for
>     big-endian NEON.
>     (arm_autovectorize_vector_sizes): Use 8-byte vectors only for NEON.
>     (arm_vectorize_vec_perm_const_ok): No permutations are OK in
>     big-endian mode.
>     * config/arm/neon.md (vec_load_lanes<mode><mode>): Disable in
>     big-endian mode.
>     (vec_store_lanes<mode><mode>, vec_load_lanesti<mode>)
>     (vec_load_lanesoi<mode>, vec_store_lanesti<mode>)
>     (vec_store_lanesoi<mode>, vec_load_lanesei<mode>)
>     (vec_load_lanesci<mode>, vec_store_lanesei<mode>)
>     (vec_store_lanesci<mode>, vec_load_lanesxi<mode>)
>     (vec_store_lanesxi<mode>): Likewise.
>     (vec_widen_<US>shiftl_lo_<mode>, vec_widen_<US>shiftl_hi_<mode>)
>     (vec_widen_<US>mult_hi_<mode>, vec_widen_<US>mult_lo_<mode>):
>     Likewise.
>
>     gcc/testsuite/
>     * gcc.dg/vect/slp-cond-3.c: XFAIL for !vect_unpack.
>     * gcc.dg/vect/slp-cond-4.c: Likewise.
>     * gcc.dg/vect/vect-1.c: Likewise.
>     * gcc.dg/vect/vect-1-big-array.c: Likewise.
>     * gcc.dg/vect/vect-35.c: Likewise.
>     * gcc.dg/vect/vect-35-big-array.c: Likewise.
>     * gcc.dg/vect/bb-slp-11.c: Likewise.
>     * gcc.dg/vect/bb-slp-26.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-3-big-array.c: XFAIL
>     for !vect_element_align.
>     * gcc.dg/vect/vect-over-widen-1.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-1-big-array.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-2.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-2-big-array.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-3.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-4.c: Likewise.
>     * gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
>     * gcc.dg/vect/pr43430-2.c: Likewise.
>     * gcc.dg/vect/vect-widen-shift-u16.c: XFAIL for !vect_widen_shift
>     && !vect_unpack.
>     * gcc.dg/vect/vect-widen-shift-s8.c: Likewise.
>     * gcc.dg/vect/vect-widen-shift-u8.c: Likewise.
>     * gcc.dg/vect/vect-widen-shift-s16.c: Likewise.
>     * gcc.dg/vect/vect-93.c: Only run if !vect_intfloat_cvt.
>     * gcc.dg/vect/vect-intfloat-conversion-4a.c: Only run if
>     vect_unpack.
>     * gcc.dg/vect/vect-intfloat-conversion-4b.c: Likewise.
>     * lib/target-supports.exp (check_effective_target_vect_perm): Only
>     enable for NEON little-endian.
>     (check_effective_target_vect_widen_sum_qi_to_hi): Likewise.
>     (check_effective_target_vect_widen_mult_qi_to_hi): Likewise.
>     (check_effective_target_vect_widen_mult_hi_to_si): Likewise.
>     (check_effective_target_vect_widen_shift): Likewise.
>     (check_effective_target_vect_extract_even_odd): Likewise.
>     (check_effective_target_vect_interleave): Likewise.
>     (check_effective_target_vect_stridedN): Likewise.
>     (check_effective_target_vect_multiple_sizes): Likewise.
>     (check_effective_target_vect64): Enable for any NEON.
>
Julian Brown March 1, 2013, 12:02 p.m. UTC | #5
On Fri, 1 Mar 2013 11:07:17 +0100
Richard Biener <richard.guenther@gmail.com> wrote:

> On Wed, Feb 27, 2013 at 6:29 PM, Julian Brown
> <julian@codesourcery.com> wrote:
> > Hi,
> >
> > Several new (ish?) autovectorizer features have apparently caused
> > NEON support for same to regress quite heavily in big-endian mode.
> > This patch is an attempt to fix things up, but is not without
> > problems -- maybe someone will have a suggestion as to how we
> > should proceed.
> >
> > The problem (as ever) is that the ARM backend must lie to the
> > middle-end about the layout of NEON vectors in big-endian mode (due
> > to ABI requirements, VFP compatibility, and the middle-end
> > semantics of vector indices being equivalent to those of an array
> > with the same type of elements when stored in memory).
> 
> Why not simply give up?  Thus, make autovectorization unsupported for
> ARM big-endian targets?

That's certainly a tempting option...

> Do I understand correctly that the "only" issue is memory vs. register
> element ordering?  Thus a fixup could be as simple as extra shuffles
> inserted after vector memory loads and before vector memory stores?
> (with the hope of RTL optimizers optimizing those)?

It's not even necessary to use explicit shuffles -- NEON has perfectly
good instructions for loading/storing vectors in the "right" order, in
the form of vld1 & vst1. I'm afraid the solution to this problem might
have been staring us in the face for years, which is simply to forbid
vldr/vstr/vldm/vstm (the instructions which lead to weird element
permutations in BE mode) for loading/storing NEON vectors altogether.
That way the vectorizer gets what it wants, the intrinsics can continue
to use __builtin_shuffle exactly as they are doing, and we get to
remove all the bits which fiddle vector element numbering in BE mode in
the ARM backend.

I can't exactly remember why we didn't do that to start with. I think
the problem was ABI-related, or to do with transferring NEON vectors
to/from ARM registers when it was necessary to do that... I'm planning
to do some archaeology to try to see if I can figure out a definitive
answer.

(Previous discussions include, e.g.:

http://gcc.gnu.org/ml/gcc-patches/2009-11/msg00876.html

http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html

http://lists.linaro.org/pipermail/linaro-toolchain/2010-November/000437.html

it looks like ABI boundaries require vldr/vstr/vldm/vstm ordering:
maybe those can be treated as "opaque" transfers and continue to use
the same instructions & ordering, but vld1/vst1 can be used everywhere
else?)

> Any "lies" are of course bad and you'll pay for them later.

Indeed :-).

Cheers,

Julian
Paul Brook March 1, 2013, 2:35 p.m. UTC | #6
> > Do I understand correctly that the "only" issue is memory vs. register
> > element ordering?  Thus a fixup could be as simple as extra shuffles
> > inserted after vector memory loads and before vector memory stores?
> > (with the hope of RTL optimizers optimizing those)?
> 
> It's not even necessary to use explicit shuffles -- NEON has perfectly
> good instructions for loading/storing vectors in the "right" order, in
> the form of vld1 & vst1. I'm afraid the solution to this problem might
> have been staring us in the face for years, which is simply to forbid
> vldr/vstr/vldm/vstm (the instructions which lead to weird element
> permutations in BE mode) for loading/storing NEON vectors altogether.
> That way the vectorizer gets what it wants, the intrinsics can continue
> to use __builtin_shuffle exactly as they are doing, and we get to
> remove all the bits which fiddle vector element numbering in BE mode in
> the ARM backend.
> 
> I can't exactly remember why we didn't do that to start with. I think
> the problem was ABI-related, or to do with transferring NEON vectors
> to/from ARM registers when it was necessary to do that... I'm planning
> to do some archaeology to try to see if I can figure out a definitive
> answer.

The ABI defined vector types (uint32x4_t etc) are defined to be in vldm/vstm 
order.

Paul
Julian Brown March 4, 2013, 11:56 a.m. UTC | #7
On Fri, 1 Mar 2013 14:35:05 +0000
Paul Brook <paul@codesourcery.com> wrote:

> > It's not even necessary to use explicit shuffles -- NEON has
> > perfectly good instructions for loading/storing vectors in the
> > "right" order, in the form of vld1 & vst1. I'm afraid the solution
> > to this problem might have been staring us in the face for years,
> > which is simply to forbid vldr/vstr/vldm/vstm (the instructions
> > which lead to weird element permutations in BE mode) for
> > loading/storing NEON vectors altogether. That way the vectorizer
> > gets what it wants, the intrinsics can continue to use
> > __builtin_shuffle exactly as they are doing, and we get to remove
> > all the bits which fiddle vector element numbering in BE mode in
> > the ARM backend.
> > 
> > I can't exactly remember why we didn't do that to start with. I
> > think the problem was ABI-related, or to do with transferring NEON
> > vectors to/from ARM registers when it was necessary to do that...
> > I'm planning to do some archaeology to try to see if I can figure
> > out a definitive answer.
> 
> The ABI defined vector types (uint32x4_t etc) are defined to be in
> vldm/vstm order.

There's no conflict with the ABI-defined vector order -- the ABI
(looking at AAPCS, IHI 0042D) describes "containerized" vectors which
should be used to pass and return vector quantities at ABI boundaries,
but I couldn't find any further restrictions. Internally to a function,
we are still free to use vld1/vst1 vector ordering. Using
"containerized"/opaque transfers, the bit pattern of a vector in one
function (using vld1/vst1 ordering internally) will of course remain
unchanged if passed to another function and using the same ordering
there also.

Actually making that work (especially efficiently) with GCC is a
slightly different matter. Let's call vldm/vstm-ordered vectors
"containerized" format, and vld1/vst1-ordered vectors "array" format. We
need to introduce the concept of marshalling vector arguments from
array format to containerized format when passing them to a function,
and unmarshalling those vector arguments back the other way on function
entry. AFAICT, GCC does not have suitable infrastructure for
implementing such functionality at present: consider that e.g. vectors
passed by value on the stack should use containerized format, which
means the called function cannot simply dereference the stack pointer
to read the vector:

void foo (int dummy1, int dummy2, int dummy3, int dummy4, v4si myvec)
{
  v4si *myvec_ptr = &myvec;
  ...
}

Here the hypothetical "unmarshal" operation would need to do something
like:

  add r0, sp, #myvec_offset
  vldm r0, {q0}
  add r0, sp, #myvec_temp_offset
  vst1.32 {q0}, [r0]
  /* myvec_ptr points to myvec_temp_offset.  */

In many cases the marshall/unmarshall operations don't have to do
anything except use vldr/vstr/vldm/vstm or the core-register transfer
equivalents instead of vld1/vst1 for reading/writing vectors used as
arguments, so we generally don't have to incur any overhead like that,
though.

I experimented with a patch which tried to do marshalling/unmarshalling
in RTL, using DImode/TImode for the containerized format (splitting
neon.md/*neon_mov<mode> into DImode/TImode versions for containerized
vectors, and V*mode versions for array-format vectors with only
vmov/vld1/vst1 alternatives, and tweaking several other target macros
etc. appropriately), but that didn't work very well, and wouldn't be
able to handle the case described above which requires a copy, I don't
think. (Several optimisation passes are keen to form V*mode subregs of
DImode values, even if CANNOT_CHANGE_MODE_CLASS/MODES_TIEABLE_P are
tweaked. The hooks/macros controlling argument & function-return
promotion appear to get some of the way there to implementing the RTL
"solution", but evidently not far enough.)

So, I think the proper way of implementing this is probably at the tree
level -- maybe rewriting vector types in function argument lists to
"opaque" vectors, like e.g. rs6000 uses for some intrinsics, and
inserting machine-dependent operations for marshalling and
unmarshalling at appropriate points -- maybe still using DImode/TImode
to represent containerized (opaque) vectors at the RTL level, or maybe
introducing new machine modes if that doesn't work reliably.

The two main advantages of this approach over the status quo are:

1. Big-endian mode works as well as little-endian mode for NEON --
intrinsics, vectorization, the lot.

2. Even in little-endian mode, using vld1/vst1 predominantly over
vldr/vstr means that the alignment hints in those instructions can be
used more often, which might be a minor performance boost.

Would this be a sensible approach, or am I completely wrong? I'm not
sure if I can dedicate time to implementing it at the moment in any
case. Maybe someone within ARM (or Linaro) could take it up? ;-)

(Anyway, I still think it might be a good idea to apply the original
patch until such work is done, considering vectorization -- enabled at
-O3 -- is broken with NEON turned on in big-endian mode at the moment.)

Thanks,

Julian
Paul Brook March 4, 2013, 1:08 p.m. UTC | #8
> > > I can't exactly remember why we didn't do that to start with. I
> > > think the problem was ABI-related, or to do with transferring NEON
> > > vectors to/from ARM registers when it was necessary to do that...
> > > I'm planning to do some archaeology to try to see if I can figure
> > > out a definitive answer.
> > 
> > The ABI defined vector types (uint32x4_t etc) are defined to be in
> > vldm/vstm order.
> 
> There's no conflict with the ABI-defined vector order -- the ABI
> (looking at AAPCS, IHI 0042D) describes "containerized" vectors which
> should be used to pass and return vector quantities at ABI boundaries,
> but I couldn't find any further restrictions. Internally to a function,
> we are still free to use vld1/vst1 vector ordering. Using
> "containerized"/opaque transfers, the bit pattern of a vector in one
> function (using vld1/vst1 ordering internally) will of course remain
> unchanged if passed to another function and using the same ordering
> there also.

Ah, ok.  If you make the ABI defined types distinct from the GCC generic 
vector types (as used by the vectorizer), then in principle that should work.  
I agree that current GCC probably does not have the infrastructure to do that, 
and some of the vector code plays a bit fast and loose with type 
conversions/subregs.

Remember that it's not just function arguments, it's any interface shared 
between functions.  i.e. including structures and global variables.

> Actually making that work (especially efficiently) with GCC is a
> slightly different matter. Let's call vldm/vstm-ordered vectors
> "containerized" format, and vld1/vst1-ordered vectors "array" format. We
> need to do introduce the concept of marshalling vector arguments from
> array format to containerized format when passing them to a function,
> and unmarshalling those vector arguments back the other way on function
> entry. AFAICT, GCC does not have suitable infrastructure for
> implementing such functionality at present: consider that e.g. vectors
> passed by value on the stack should use containerized format, which
> means the called function cannot simply dereference the stack pointer
> to read the vector:

IIRC I/we tried to do something very similar (possibly the other way around) 
by abusing the unaligned load mechanism.  I don't remember why that failed.

Paul
Julian Brown March 4, 2013, 3:29 p.m. UTC | #9
On Mon, 4 Mar 2013 13:08:57 +0000
Paul Brook <paul@codesourcery.com> wrote:

> > > > I can't exactly remember why we didn't do that to start with. I
> > > > think the problem was ABI-related, or to do with transferring
> > > > NEON vectors to/from ARM registers when it was necessary to do
> > > > that... I'm planning to do some archaeology to try to see if I
> > > > can figure out a definitive answer.
> > > 
> > > The ABI defined vector types (uint32x4_t etc) are defined to be in
> > > vldm/vstm order.
> > 
> > There's no conflict with the ABI-defined vector order -- the ABI
> > (looking at AAPCS, IHI 0042D) describes "containerized" vectors
> > which should be used to pass and return vector quantities at ABI
> > boundaries, but I couldn't find any further restrictions.
> > Internally to a function, we are still free to use vld1/vst1 vector
> > ordering. Using "containerized"/opaque transfers, the bit pattern
> > of a vector in one function (using vld1/vst1 ordering internally)
> > will of course remain unchanged if passed to another function and
> > using the same ordering there also.
> 
> Ah, ok.  If you make the ABI defined types distinct from the GCC
> generic vector types (as used by the vectorizer), then in principle
> that should work. I agree that current GCC probably does not have the
> infrastructure to do that, and some of the vector code plays a bit
> fast and loose with type conversions/subregs.

(Subregs use memory ordering for the byte offset, so I think those are
OK if we use array-order loads/stores pervasively. I'm not 100% sure
though...)

> Remember that it's not just function arguments, it's any interface
> shared between functions.  i.e. including structures and global
> variables.

Ugh, I hadn't considered structures or global variables :-/. If we
decide they have to use the containerized format also, then we lose a
lot of the supposed advantage of using array-format vectors
"everywhere" (apart from at procedure call boundaries), for instance if
we want code with a global variable like:

union {
  char myarr[8];
  v8qi myvec;
} foo;

to do the right thing (i.e., with elements of myvec corresponding
one-to-one to elements of myarr), then using the containerized format
for accesses to myvec would be a non-starter.

Skimming the AAPCS, I'm not sure it actually specifies anything about
the layout of global variables which may be shared between functions
(it'd make sense to do so -- maybe it's elsewhere in the EABI
documents). Aggregates passed by value could also be
marshalled/unmarshalled like vectors, though that starts to sound much
less tractable than dealing with vectors alone.

> > Actually making that work (especially efficiently) with GCC is a
> > slightly different matter. Let's call vldm/vstm-ordered vectors
> > "containerized" format, and vld1/vst1-ordered vectors "array"
> > format. We need to do introduce the concept of marshalling vector
> > arguments from array format to containerized format when passing
> > them to a function, and unmarshalling those vector arguments back
> > the other way on function entry. AFAICT, GCC does not have suitable
> > infrastructure for implementing such functionality at present:
> > consider that e.g. vectors passed by value on the stack should use
> > containerized format, which means the called function cannot simply
> > dereference the stack pointer to read the vector:
> 
> IIRC I/we tried to do something very similar (possibly the other way
> around) by abusing the unaligned load mechanism.  I don't remember
> why that failed.

That'd be this conversation:

http://gcc.gnu.org/ml/gcc-patches/2009-11/msg00876.html

we only tweaked the vectorizer to always use movmisalign, leaving
intrinsics & generic vectors using vldm/vstm order. Fixing up the
resulting chaos using ad-hoc hacks didn't go down too well with
maintainers, so the patch fizzled out.

Cheers,

Julian
Julian Brown March 4, 2013, 5:24 p.m. UTC | #10
On Mon, 4 Mar 2013 15:29:22 +0000
Julian Brown <julian@codesourcery.com> wrote:

> > Remember that it's not just function arguments, it's any interface
> > shared between functions.  i.e. including structures and global
> > variables.
> 
> Ugh, I hadn't considered structures or global variables :-/. If we
> decide they have to use the containerized format also, then we lose a
> lot of the supposed advantage of using array-format vectors
> "everywhere" (apart from at procedure call boundaries), for instance
> if we want code with a global variable like:
> [...]
> Skimming the AAPCS, I'm not sure it actually specifies anything about
> the layout of global variables which may be shared between functions
> (it'd make sense to do so -- maybe it's elsewhere in the EABI
> documents). Aggregates passed by value could also be
> marshalled/unmarshalled like vectors, though that starts to sound much
> less tractable than dealing with vectors alone.

I somehow missed the "Appendix A: Support for Advanced SIMD Extensions"
in the AAPCS document (it's not in the TOC!). It looks like the
builtin vector types are indeed defined to be stored in memory in
vldm/vstm order -- I think that means we're back to square one.

So: thoughts on disabling vectorization altogether in big-endian mode?

Julian
Paul Brook March 4, 2013, 11:47 p.m. UTC | #11
> I somehow missed the "Appendix A: Support for Advanced SIMD Extensions"
> in the AAPCS document (it's not in the TOC!). It looks like the
> builtin vector types are indeed defined to be stored in memory in
> vldm/vstm order -- I think that means we're back to square one.

There's still the possibility of making gcc "generic" vector types different 
from the ABI specified types[1], but that feels like it's probably a really 
bad idea.

Having a distinct set of types just for the vectorizer may be a more viable 
option. IIRC the type selection hooks are more flexible than when we first 
looked at this problem.

Paul

[1] e.g. int gcc __attribute__((vector_size(8))); vs. int32x2_t eabi;
Richard Biener March 5, 2013, 9:42 a.m. UTC | #12
On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com> wrote:
>> I somehow missed the "Appendix A: Support for Advanced SIMD Extensions"
>> in the AAPCS document (it's not in the TOC!). It looks like the
>> builtin vector types are indeed defined to be stored in memory in
>> vldm/vstm order -- I think that means we're back to square one.
>
> There's still the possibility of making gcc "generic" vector types different
> from the ABI specified types[1], but that feels like it's probably a really
> bad idea.
>
> Having a distinct set of types just for the vectorizer may be a more viable
> option. IIRC the type selection hooks are more flexible than when we first
> looked at this problem.
>
> Paul
>
> [1] e.g. int gcc __attribute__((vector_size(8)));  v.s. int32x2_t eabi;

I think int32x2_t should not be a GCC vector type (thus not have a vector mode).
The ABI specified types should map to an integer mode of the right size
instead.  The vectorizer would then still use internal GCC vector types
and modes and the backend needs to provide instruction patterns that
do the right thing with the element ordering the vectorizer expects.

How are the int32x2_t types used?  I suppose they are arguments to
the intrinsics.  Which means that for _most_ operations element order
does not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply
use the equivalent of return (int32x2_t)((gcc_int32x2_t)x + (gcc_int32x2_t)y).
In intrinsics where order matters you'd insert appropriate __builtin_shuffle()s.
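
A rough sketch of that idea (type names purely illustrative; the real work
would of course be in the backend's type and builtin machinery, not plain C):

typedef long long int32x2_abi_t;   /* stand-in for a DImode ABI type */
typedef int gcc_int32x2_t __attribute__ ((vector_size (8)));

static inline int32x2_abi_t
plus32x2 (int32x2_abi_t x, int32x2_abi_t y)
{
  /* Lane-wise addition is insensitive to element order, so reinterpret
     the opaque ABI value as a GCC generic vector, add, and convert back.
     Where order does matter, a __builtin_shuffle would be inserted here.  */
  union { int32x2_abi_t abi; gcc_int32x2_t vec; } ux = { x }, uy = { y }, ur;
  ur.vec = ux.vec + uy.vec;
  return ur.abi;
}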

Oh, of course do the above only for big-endian mode ...

The other way around, mapping intrinsics and ABI vectors to vector modes
will have issues ... you'd have to guard all optab queries in the middle-end
to fail for arm big-endian as they expect instruction patterns that deal with
the GCC vector ordering.

Thus: model the backend after GCCs expectations and "fixup" the rest
by fixing the ABI types and intrinsics.

Richard.
Julian Brown March 5, 2013, 12:18 p.m. UTC | #13
On Tue, 5 Mar 2013 10:42:59 +0100
Richard Biener <richard.guenther@gmail.com> wrote:

> On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com>
> wrote:
> >> I somehow missed the "Appendix A: Support for Advanced SIMD
> >> Extensions" in the AAPCS document (it's not in the TOC!). It looks
> >> like the builtin vector types are indeed defined to be stored in
> >> memory in vldm/vstm order -- I think that means we're back to
> >> square one.
> >
> > There's still the possibility of making gcc "generic" vector types
> > different from the ABI specified types[1], but that feels like it's
> > probably a really bad idea.
> >
> > Having a distinct set of types just for the vectorizer may be a
> > more viable option. IIRC the type selection hooks are more flexible
> > than when we first looked at this problem.
> >
> > Paul
> >
> > [1] e.g. int gcc __attribute__((vector_size(8)));  v.s. int32x2_t
> > eabi;
> 
> I think int32x2_t should not be a GCC vector type (thus not have a
> vector mode). The ABI specified types should map to an integer mode
> of the right size instead.  The vectorizer would then still use
> internal GCC vector types and modes and the backend needs to provide
> instruction patterns that do the right thing with the element
> ordering the vectorizer expects.
> 
> How are the int32x2_t types used?  I suppose they are arguments to
> the intrinsics.  Which means that for _most_ operations element order
> does not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply
> use the equivalent of return (int32x2_t)((gcc_int32x2_t)x +
> (gcc_int32x2_t)y). In intrinsics where order matters you'd insert
> appropriate __builtin_shuffle()s.

Maybe there's no need to interpret the vector layout for any of the
intrinsics -- just treat all inputs & outputs as opaque (there are
intrinsics for getting/setting lanes -- IMO these shouldn't attempt to
convert lane numbers at all, though they do at present). Several
intrinsics are currently implemented using __builtin_shuffle, e.g.:

__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vrev64_s8 (int8x8_t __a)
{
  return (int8x8_t) __builtin_shuffle (__a, (uint8x8_t) { 7, 6, 5, 4, 3, 2, 1, 0 });
}

I'd imagine that if int8x8_t are not actual vector types, we could
invent extra builtins to convert them to and from such types to be able
to still do this kind of thing (in arm_neon.h, not necessarily for
direct use by users), i.e.:

typedef char gcc_int8x8_t __attribute__((vector_size(8)));

int8x8_t
vrev64_s8 (int8x8_t __a)
{
  gcc_int8x8_t tmp = __builtin_neon2generic (__a);
  tmp = __builtin_shuffle (tmp, (gcc_int8x8_t) { 7, 6, 5, 4, ... });
  return __builtin_generic2neon (tmp);
}

(On re-reading, that's basically the same as what you suggested, I
think.)

> Oh, of course do the above only for big-endian mode ...
> 
> The other way around, mapping intrinsics and ABI vectors to vector
> modes will have issues ... you'd have to guard all optab queries in
> the middle-end to fail for arm big-endian as they expect instruction
> patterns that deal with the GCC vector ordering.
> 
> Thus: model the backend after GCCs expectations and "fixup" the rest
> by fixing the ABI types and intrinsics.

I think this plan will work fine -- it has the added advantage (which
looks like a disadvantage, but really isn't) that generic vector
operations like:

void foo (void)
{
  int8x8_t x = { 0, 1, 2, 3, 4, 5, 6, 7 };
}

will *not* work -- nor will e.g. subscripting ABI-defined vectors using
[]s. At the moment using these features can lead to surprising results.
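
For example, something like this (purely illustrative) would be rejected
under that scheme, whereas today it compiles but can behave surprisingly in
big-endian mode:

#include <arm_neon.h>

signed char
element_three (int8x8_t x)
{
  /* Generic-vector subscripting of an ABI-defined type: with the current
     big-endian lane numbering this need not return the element that the
     in-memory order would suggest.  */
  return x[3];
}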

Unfortunately NEON's pretty complicated, and the ARM backend currently
uses vector modes quite heavily implementing it, so just using integer
modes for intrinsics is going to be tough. It might work to create a
shadow set of vector modes for use only by the intrinsics (O*mode for
"opaque" instead of V*mode, say), if the middle end won't barf at that.

Thanks,

Julian
Tejas Belagod March 6, 2013, 2:57 p.m. UTC | #14
Julian Brown wrote:
> On Tue, 5 Mar 2013 10:42:59 +0100
> Richard Biener <richard.guenther@gmail.com> wrote:
> 
>> On Tue, Mar 5, 2013 at 12:47 AM, Paul Brook <paul@codesourcery.com>
>> wrote:
>>>> I somehow missed the "Appendix A: Support for Advanced SIMD
>>>> Extensions" in the AAPCS document (it's not in the TOC!). It looks
>>>> like the builtin vector types are indeed defined to be stored in
>>>> memory in vldm/vstm order -- I think that means we're back to
>>>> square one.
>>> There's still the possibility of making gcc "generic" vector types
>>> different from the ABI specified types[1], but that feels like it's
>>> probably a really bad idea.
>>>
>>> Having a distinct set of types just for the vectorizer may be a
>>> more viable option. IIRC the type selection hooks are more flexible
>>> than when we first looked at this problem.
>>>
>>> Paul
>>>
>>> [1] e.g. int gcc __attribute__((vector_size(8)));  v.s. int32x2_t
>>> eabi;
>> I think int32x2_t should not be a GCC vector type (thus not have a
>> vector mode). The ABI specified types should map to an integer mode
>> of the right size instead.  The vectorizer would then still use
>> internal GCC vector types and modes and the backend needs to provide
>> instruction patterns that do the right thing with the element
>> ordering the vectorizer expects.
>>
>> How are the int32x2_t types used?  I suppose they are arguments to
>> the intrinsics.  Which means that for _most_ operations element order
>> does not matter, thus a plus32x2 (int32x2_t x, int32x2_t y) can simply
>> use the equivalent of return (int32x2_t)((gcc_int32x2_t)x +
>> (gcc_int32x2_t)y). In intrinsics where order matters you'd insert
>> appropriate __builtin_shuffle()s.
> 
> Maybe there's no need to interpret the vector layout for any of the
> intrinsics -- just treat all inputs & outputs as opaque (there are
> intrinsics for getting/setting lanes -- IMO these shouldn't attempt to
> convert lane numbers at all, though they do at present). Several
> intrinsics are currently implemented using __builtin_shuffle, e.g.:
> 
> __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
> vrev64_s8 (int8x8_t __a)
> {
>   return (int8x8_t) __builtin_shuffle (__a, (uint8x8_t) { 7, 6, 5, 4, 3, 2, 1, 0 });
> }
> 
> I'd imagine that if int8x8_t are not actual vector types, we could
> invent extra builtins to convert them to and from such types to be able
> to still do this kind of thing (in arm_neon.h, not necessarily for
> direct use by users), i.e.:
> 
> typedef char gcc_int8x8_t __attribute__((vector_size(8)));
> 
> int8x8_t
> vrev64_s8 (int8x8_t __a)
> {
>   gcc_int8x8_t tmp = __builtin_neon2generic (__a);
>   tmp = __builtin_shuffle (tmp, (gcc_int8x8_t) { 7, 6, 5, 4, ... });
>   return __builtin_generic2neon (tmp);
> }
> 
> (On re-reading, that's basically the same as what you suggested, I
> think.)
> 
>> Oh, of course do the above only for big-endian mode ...
>>
>> The other way around, mapping intrinsics and ABI vectors to vector
>> modes will have issues ... you'd have to guard all optab queries in
>> the middle-end to fail for arm big-endian as they expect instruction
>> patterns that deal with the GCC vector ordering.
>>
>> Thus: model the backend after GCCs expectations and "fixup" the rest
>> by fixing the ABI types and intrinsics.
> 
> I think this plan will work fine -- it has the added advantage (which
> looks like a disadvantage, but really isn't) that generic vector
> operations like:
> 
> void foo (void)
> {
>   int8x8_t x = { 0, 1, 2, 3, 4, 5, 6, 7 };
> }
> 
> will *not* work -- nor will e.g. subscripting ABI-defined vectors using
> []s. At the moment using these features can lead to surprising results.
> 
> Unfortunately NEON's pretty complicated, and the ARM backend currently
> uses vector modes quite heavily implementing it, so just using integer
> modes for intrinsics is going to be tough. It might work to create a
> shadow set of vector modes for use only by the intrinsics (O*mode for
> "opaque" instead of V*mode, say), if the middle end won't barf at that.

I suspect the mid-end may not be too happy with opaque modes for vectors. I've
faced some issues in the past while experimenting with large int modes for
vector register lists when implementing permuted loads in AArch64, particularly
in the area of subreg generation: SUBREG_BYTE is computed from BITS_PER_WORD
for all INT mode classes, without taking into account which registers the
values of the particular mode end up in. This causes subreg byte offsets that
are not aligned to the vector register boundary. To illustrate this, here is an
example that exposed the issue:

For aarch64, I mirrored the approach that the arm/thumb backend employs and
defined 'large int' opaque modes to represent the register lists (i.e. OImode,
CImode and XImode), and defined the standard patterns that implement permuted
loads/stores -- vec_store_lanes<INT_MODE><VEC_MODE> and
vec_load_lanes<INT_MODE><VEC_MODE>.

At the time, I remember this test case

typedef unsigned short V __attribute__((vector_size(32)));
typedef V VI;

V in = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
VI mask = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, };
V out = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

extern void bar(V);

int main()
{

     V r = __builtin_shuffle(in, mask);

     bar (r);
}

generated this RTL with my experimental compiler:

...
(insn 65 59 61 2 (set (reg:DI 178)
           (and:DI (ashift:DI (subreg:DI (reg:OI 74 [ mask.3 ]) 8)
                   (const_int 1 [0x1]))
               (const_int 30 [0x1e]))) vs.c:24 380 {*andim_ashiftdi_bfiz}
        (nil))
...

(insn 151 145 147 2 (set (reg:DI 256)
           (and:DI (ashift:DI (subreg:DI (reg:OI 74 [ mask.3 ]) 24)
                   (const_int 1 [0x1]))
               (const_int 30 [0x1e]))) vs.c:24 380 {*andim_ashiftdi_bfiz}
        (nil))

....

which is the short value extraction out of the vectors. I ran into this 
situation where the subregs were generated with byte offsets such that 
byte_offset % UNITS_PER_VREG != 0 i.e. subreg offsets that were not aligned to 
the vector register boundary. The above dump is before the reload phase. During 
reload subreg elimination, these subregs were converted to refer to the 
incorrect part of vector registers.

Though OImode is a large INT mode, we force these modes only to live in FPSIMD 
registers for which the UNITS_PER_VREG or BITS_PER_WORD is different from the 
integer word size i.e. UNITS_PER_VREG is 16 and BITS_PER_WORD for FPSIMD is 128.

I discovered that in the mid-end subregs were generated using BITS_PER_WORD,
with no check for the fact that BITS_PER_WORD can depend on the mode the subreg
is being generated for. There was an assumption that BITS_PER_WORD applied to
all INT modes. In this case, because OImode was only allowed in FPSIMD regs,
BITS_PER_WORD should have been 128 -- in other words, mode-dependent. In
general, shouldn't BITS_PER_WORD depend on the registers that a particular mode
ultimately ends up in, as dictated by the target hook HARD_REGNO_MODE_OK? As
far as I can see, expmed.c:store_bit_field_1 () hasn't changed much in this
respect and I suspect this issue still remains.

We don't have the same issue on the ARM backend because the basic unit of
register allocation is 32 bits for both FP and Int units (arm.h, #define
ARM_NUM_INTS) and the FP unit is a register-packing architecture.

That was in the context of register lists where large opaque int modes
represent more than one vector register. As you suggest, if we extend opaque
int modes to represent a single vector register, then with SUBREGs being
generated independently of mode in the mid-end, I imagine this may cause pain
for later phases (like reload subreg elimination).

But that said, I'm not an expert on how the mid-end handles opaque int modes and
things might have improved in the area of SUBREG generation since my experiments.

Thanks,
Tejas Belagod
ARM.

Patch

Index: gcc/testsuite/gcc.dg/vect/slp-cond-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/slp-cond-3.c	(working copy)
@@ -79,6 +79,6 @@  int main ()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-1.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-1.c	(working copy)
@@ -86,5 +86,5 @@  foo (int n)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 6 loops" 1 "vect" { target vect_strided2 } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail vect_strided2 } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail { vect_strided2 || { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/slp-cond-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-cond-4.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/slp-cond-4.c	(working copy)
@@ -82,5 +82,5 @@  int main ()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-1-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-1-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-1-big-array.c	(working copy)
@@ -86,5 +86,5 @@  foo (int n)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 6 loops" 1 "vect" { target vect_strided2 } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail vect_strided2 } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 5 loops" 1 "vect" { xfail { vect_strided2 || { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-35.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-35.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-35.c	(working copy)
@@ -45,6 +45,6 @@  int main (void)
 } 
 
 
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect"  { xfail { ia64-*-* sparc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect"  { xfail { { ia64-*-* sparc*-*-* } || { ! vect_unpack } } } } } */
 /* { dg-final { scan-tree-dump "can't determine dependence between" "vect" } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-3-big-array.c	(working copy)
@@ -59,6 +59,6 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 1 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-u16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-u16.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-u16.c	(working copy)
@@ -53,6 +53,6 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 1 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_widen_shift } && { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/bb-slp-26.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/bb-slp-26.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/bb-slp-26.c	(working copy)
@@ -55,6 +55,6 @@  int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 } } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "slp" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-1.c	(working copy)
@@ -62,6 +62,6 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { { ! vect_sizes_32B_16B } && { ! vect_widen_shift } } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 8 "vect" { target vect_sizes_32B_16B } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-35-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-35-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-35-big-array.c	(working copy)
@@ -45,6 +45,6 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect"  { xfail { ia64-*-* sparc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect"  { xfail { { ia64-*-* sparc*-*-* } || { ! vect_unpack } } } } } */
 /* { dg-final { scan-tree-dump-times "can't determine dependence between" 1 "vect" } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-2.c	(working copy)
@@ -60,6 +60,6 @@  int main (void)
 
 /* Final value stays in int, so no over-widening is detected at the moment.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 0 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/pr43430-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr43430-2.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/pr43430-2.c	(working copy)
@@ -12,5 +12,5 @@  vsad16_c (void *c, uint8_t * s1, uint8_t
   return score;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_condition } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_condition && vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-s8.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-s8.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-s8.c	(working copy)
@@ -53,6 +53,6 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 1 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { { ! vect_widen_shift } && { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-1-big-array.c	(working copy)
@@ -61,6 +61,6 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { ! vect_widen_shift } } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-3.c	(working copy)
@@ -59,6 +59,6 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump "vect_recog_over_widening_pattern: detected" "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-4.c	(working copy)
@@ -66,6 +66,6 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { { ! vect_sizes_32B_16B } && { ! vect_widen_shift } } } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 8 "vect" { target vect_sizes_32B_16B } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-4-big-array.c	(working copy)
@@ -65,6 +65,6 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 4 "vect" { target { ! vect_widen_shift } } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/vect-93.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-93.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-93.c	(working copy)
@@ -79,7 +79,7 @@  int main (void)
 /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target vect_no_align } } } */
 
 /* in main: */
-/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target vect_no_align } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { vect_no_align && { ! vect_intfloat_cvt } } } } } */
 /* { dg-final { scan-tree-dump-times "Vectorizing an unaligned access" 1 "vect" { xfail { vect_no_align } } } } */
 
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-u8.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-u8.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-u8.c	(working copy)
@@ -60,5 +60,5 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 2 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { { ! vect_widen_shift } && { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4a.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4a.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4a.c	(working copy)
@@ -35,5 +35,5 @@  int main (void)
   return main1 ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_intfloat_cvt } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_intfloat_cvt && vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4b.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4b.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-intfloat-conversion-4b.c	(working copy)
@@ -35,5 +35,5 @@  int main (void)
   return main1 ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_intfloat_cvt } } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_intfloat_cvt && vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-over-widen-2-big-array.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-over-widen-2-big-array.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-over-widen-2-big-array.c	(working copy)
@@ -60,6 +60,6 @@  int main (void)
 
 /* Final value stays in int, so no over-widening is detected at the moment.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_over_widening_pattern: detected" 0 "vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_element_align } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/gcc.dg/vect/bb-slp-11.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/bb-slp-11.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/bb-slp-11.c	(working copy)
@@ -48,6 +48,6 @@  int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 } } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized using SLP" 1 "slp" { target vect64 xfail { ! vect_unpack } } } } */
 /* { dg-final { cleanup-tree-dump "slp" } } */
   
Index: gcc/testsuite/gcc.dg/vect/vect-widen-shift-s16.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-widen-shift-s16.c	(revision 196170)
+++ gcc/testsuite/gcc.dg/vect/vect-widen-shift-s16.c	(working copy)
@@ -102,6 +102,6 @@  int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vect_recog_widen_shift_pattern: detected" 8 "vect" { target vect_widen_shift } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { xfail { { ! vect_widen_shift } && { ! vect_unpack } } } } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
 
Index: gcc/testsuite/lib/target-supports.exp
===================================================================
--- gcc/testsuite/lib/target-supports.exp	(revision 196170)
+++ gcc/testsuite/lib/target-supports.exp	(working copy)
@@ -3089,7 +3089,8 @@  proc check_effective_target_vect_perm { 
         verbose "check_effective_target_vect_perm: using cached result" 2
     } else {
         set et_vect_perm_saved 0
-        if { [is-effective-target arm_neon_ok]
+        if { ([is-effective-target arm_neon_ok]
+	      && [is-effective-target arm_little_endian])
 	     || [istarget aarch64*-*-*]
 	     || [istarget powerpc*-*-*]
              || [istarget spu-*-*]
@@ -3211,7 +3212,8 @@  proc check_effective_target_vect_widen_s
     } else {
         set et_vect_widen_sum_qi_to_hi_saved 0
 	if { [check_effective_target_vect_unpack] 
-	     || [check_effective_target_arm_neon_ok]
+	     || ([check_effective_target_arm_neon_ok]
+		 && [check_effective_target_arm_little_endian])
 	     || [istarget ia64-*-*] } {
             set et_vect_widen_sum_qi_to_hi_saved 1
 	}
@@ -3263,7 +3265,8 @@  proc check_effective_target_vect_widen_m
 	}
         if { [istarget powerpc*-*-*]
               || [istarget aarch64*-*-*]
-              || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]) } {
+              || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+		  && [check_effective_target_arm_little_endian]) } {
             set et_vect_widen_mult_qi_to_hi_saved 1
         }
     }
@@ -3298,7 +3301,8 @@  proc check_effective_target_vect_widen_m
 	      || [istarget aarch64*-*-*]
 	      || [istarget i?86-*-*]
 	      || [istarget x86_64-*-*]
-              || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]) } {
+              || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+		  && [check_effective_target_arm_little_endian]) } {
             set et_vect_widen_mult_hi_to_si_saved 1
         }
     }
@@ -3368,7 +3372,8 @@  proc check_effective_target_vect_widen_s
         verbose "check_effective_target_vect_widen_shift: using cached result" 2
     } else {
         set et_vect_widen_shift_saved 0
-        if { ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]) } {
+        if { ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+	      && [check_effective_target_arm_little_endian]) } {
             set et_vect_widen_shift_saved 1
         }
     }
@@ -3859,7 +3864,8 @@  proc check_effective_target_vect_extract
         set et_vect_extract_even_odd_saved 0 
 	if { [istarget aarch64*-*-*]
 	     || [istarget powerpc*-*-*]
-	     || [is-effective-target arm_neon_ok]
+	     || ([is-effective-target arm_neon_ok]
+		 && [is-effective-target arm_little_endian])
              || [istarget i?86-*-*]
              || [istarget x86_64-*-*]
              || [istarget ia64-*-*]
@@ -3885,7 +3891,8 @@  proc check_effective_target_vect_interle
         set et_vect_interleave_saved 0
 	if { [istarget aarch64*-*-*]
 	     || [istarget powerpc*-*-*]
-	     || [is-effective-target arm_neon_ok]
+	     || ([is-effective-target arm_neon_ok]
+		 && [is-effective-target arm_little_endian])
              || [istarget i?86-*-*]
              || [istarget x86_64-*-*]
              || [istarget ia64-*-*]
@@ -3915,7 +3922,8 @@  foreach N {2 3 4 8} {
 		     && [check_effective_target_vect_extract_even_odd] } {
 		    set et_vect_stridedN_saved 1
 		}
-		if { ([istarget arm*-*-*]
+		if { (([istarget arm*-*-*] && [is-effective-target arm_neon_ok]
+		       && [is-effective-target arm_little_endian])
 		      || [istarget aarch64*-*-*]) && N >= 2 && N <= 4 } {
 		    set et_vect_stridedN_saved 1
 		}
@@ -3934,7 +3942,8 @@  proc check_effective_target_vect_multipl
 
     set et_vect_multiple_sizes_saved 0
     if { ([istarget aarch64*-*-*]
-	  || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok])) } {
+	  || ([istarget arm*-*-*] && [check_effective_target_arm_neon_ok]
+	      && [check_effective_target_arm_little_endian])) } {
        set et_vect_multiple_sizes_saved 1
     }
     if { ([istarget x86_64-*-*] || [istarget i?86-*-*]) } {
@@ -3957,8 +3966,7 @@  proc check_effective_target_vect64 { } {
     } else {
         set et_vect64_saved 0
         if { ([istarget arm*-*-*]
-	      && [check_effective_target_arm_neon_ok]
-	      && [check_effective_target_arm_little_endian]) } {
+	      && [check_effective_target_arm_neon_ok]) } {
            set et_vect64_saved 1
         }
     }
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 196170)
+++ gcc/config/arm/arm.c	(working copy)
@@ -25041,7 +25041,7 @@  static bool
 arm_array_mode_supported_p (enum machine_mode mode,
 			    unsigned HOST_WIDE_INT nelems)
 {
-  if (TARGET_NEON
+  if (TARGET_NEON && !BYTES_BIG_ENDIAN
       && (VALID_NEON_DREG_MODE (mode) || VALID_NEON_QREG_MODE (mode))
       && (nelems >= 2 && nelems <= 4))
     return true;
@@ -25057,23 +25057,27 @@  static enum machine_mode
 arm_preferred_simd_mode (enum machine_mode mode)
 {
   if (TARGET_NEON)
-    switch (mode)
-      {
-      case SFmode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V2SFmode : V4SFmode;
-      case SImode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V2SImode : V4SImode;
-      case HImode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V4HImode : V8HImode;
-      case QImode:
-	return TARGET_NEON_VECTORIZE_DOUBLE ? V8QImode : V16QImode;
-      case DImode:
-	if (!TARGET_NEON_VECTORIZE_DOUBLE)
-	  return V2DImode;
-	break;
+    {
+      bool double_only = BYTES_BIG_ENDIAN || TARGET_NEON_VECTORIZE_DOUBLE;
 
-      default:;
-      }
+      switch (mode)
+	{
+	case SFmode:
+	  return double_only ? V2SFmode : V4SFmode;
+	case SImode:
+	  return double_only ? V2SImode : V4SImode;
+	case HImode:
+	  return double_only ? V4HImode : V8HImode;
+	case QImode:
+	  return double_only ? V8QImode : V16QImode;
+	case DImode:
+	  if (!double_only)
+	    return V2DImode;
+	  break;
+
+	default:;
+	}
+    }
 
   if (TARGET_REALLY_IWMMXT)
     switch (mode)
@@ -25974,6 +25978,11 @@  arm_vector_alignment (const_tree type)
 static unsigned int
 arm_autovectorize_vector_sizes (void)
 {
+  /* Autovectorizing with quad-word NEON registers is fraught with
+     difficulties in big-endian mode.  Just don't do that.  */
+  if (TARGET_NEON && BYTES_BIG_ENDIAN)
+    return 8;
+
   return TARGET_NEON_VECTORIZE_DOUBLE ? 0 : (16 | 8);
 }
 
@@ -27008,6 +27017,12 @@  arm_vectorize_vec_perm_const_ok (enum ma
   unsigned int i, nelt, which;
   bool ret;
 
+  /* FIXME: There appear to be element-numbering problems with vector
+     permutations in big-endian mode that cause the vectorizer to produce bad
+     code.  Disable for now.  */
+  if (BYTES_BIG_ENDIAN)
+    return false;
+
   d.vmode = vmode;
   d.nelt = nelt = GET_MODE_NUNITS (d.vmode);
   d.testing_p = true;
Index: gcc/config/arm/neon.md
===================================================================
--- gcc/config/arm/neon.md	(revision 196170)
+++ gcc/config/arm/neon.md	(working copy)
@@ -4506,7 +4506,7 @@ 
   [(set (match_operand:VDQX 0 "s_register_operand")
         (unspec:VDQX [(match_operand:VDQX 1 "neon_struct_operand")]
                      UNSPEC_VLD1))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld1<mode>"
   [(set (match_operand:VDQX 0 "s_register_operand" "=w")
@@ -4618,7 +4618,7 @@ 
   [(set (match_operand:VDQX 0 "neon_struct_operand")
 	(unspec:VDQX [(match_operand:VDQX 1 "s_register_operand")]
 		     UNSPEC_VST1))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst1<mode>"
   [(set (match_operand:VDQX 0 "neon_struct_operand" "=Um")
@@ -4683,7 +4683,7 @@ 
         (unspec:TI [(match_operand:TI 1 "neon_struct_operand")
                     (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 		   UNSPEC_VLD2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld2<mode>"
   [(set (match_operand:TI 0 "s_register_operand" "=w")
@@ -4708,7 +4708,7 @@ 
         (unspec:OI [(match_operand:OI 1 "neon_struct_operand")
                     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 		   UNSPEC_VLD2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld2<mode>"
   [(set (match_operand:OI 0 "s_register_operand" "=w")
@@ -4797,7 +4797,7 @@ 
 	(unspec:TI [(match_operand:TI 1 "s_register_operand")
                     (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst2<mode>"
   [(set (match_operand:TI 0 "neon_struct_operand" "=Um")
@@ -4822,7 +4822,7 @@ 
 	(unspec:OI [(match_operand:OI 1 "s_register_operand")
                     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST2))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst2<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
@@ -4894,7 +4894,7 @@ 
         (unspec:EI [(match_operand:EI 1 "neon_struct_operand")
                     (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 		   UNSPEC_VLD3))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld3<mode>"
   [(set (match_operand:EI 0 "s_register_operand" "=w")
@@ -4918,7 +4918,7 @@ 
   [(match_operand:CI 0 "s_register_operand")
    (match_operand:CI 1 "neon_struct_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vld3<mode> (operands[0], operands[1]));
   DONE;
@@ -5068,7 +5068,7 @@ 
 	(unspec:EI [(match_operand:EI 1 "s_register_operand")
                     (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST3))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst3<mode>"
   [(set (match_operand:EI 0 "neon_struct_operand" "=Um")
@@ -5091,7 +5091,7 @@ 
   [(match_operand:CI 0 "neon_struct_operand")
    (match_operand:CI 1 "s_register_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vst3<mode> (operands[0], operands[1]));
   DONE;
@@ -5213,7 +5213,7 @@ 
         (unspec:OI [(match_operand:OI 1 "neon_struct_operand")
                     (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 		   UNSPEC_VLD4))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vld4<mode>"
   [(set (match_operand:OI 0 "s_register_operand" "=w")
@@ -5237,7 +5237,7 @@ 
   [(match_operand:XI 0 "s_register_operand")
    (match_operand:XI 1 "neon_struct_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vld4<mode> (operands[0], operands[1]));
   DONE;
@@ -5394,7 +5394,7 @@ 
 	(unspec:OI [(match_operand:OI 1 "s_register_operand")
                     (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST4))]
-  "TARGET_NEON")
+  "TARGET_NEON && !BYTES_BIG_ENDIAN")
 
 (define_insn "neon_vst4<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
@@ -5418,7 +5418,7 @@ 
   [(match_operand:XI 0 "neon_struct_operand")
    (match_operand:XI 1 "s_register_operand")
    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   emit_insn (gen_neon_vst4<mode> (operands[0], operands[1]));
   DONE;
@@ -5725,7 +5725,7 @@ 
  [(set (match_operand:<V_widen> 0 "register_operand" "=w")
        (SE:<V_widen> (ashift:VW (match_operand:VW 1 "register_operand" "w")
        (match_operand:<V_innermode> 2 "const_neon_scalar_shift_amount_operand" ""))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   return "vshll.<US><V_sz_elem> %q0, %P1, %2";
 }
@@ -5771,7 +5771,7 @@ 
 (define_expand "vec_unpack<US>_lo_<mode>"
  [(match_operand:<V_double_width> 0 "register_operand" "")
   (SE:<V_double_width>(match_operand:VDI 1 "register_operand"))]
- "TARGET_NEON"
+ "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx tmpreg = gen_reg_rtx (<V_widen>mode);
   emit_insn (gen_neon_unpack<US>_<mode> (tmpreg, operands[1]));
@@ -5784,7 +5784,7 @@ 
 (define_expand "vec_unpack<US>_hi_<mode>"
  [(match_operand:<V_double_width> 0 "register_operand" "")
   (SE:<V_double_width>(match_operand:VDI 1 "register_operand"))]
- "TARGET_NEON"
+ "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx tmpreg = gen_reg_rtx (<V_widen>mode);
   emit_insn (gen_neon_unpack<US>_<mode> (tmpreg, operands[1]));
@@ -5800,7 +5800,7 @@ 
 		 	   (match_operand:VDI 1 "register_operand" "w"))
  		       (SE:<V_widen> 
 			   (match_operand:VDI 2 "register_operand" "w"))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
   "vmull.<US><V_sz_elem> %q0, %P1, %P2"
   [(set_attr "neon_type" "neon_shift_1")]
 )
@@ -5809,7 +5809,7 @@ 
   [(match_operand:<V_double_width> 0 "register_operand" "")
    (SE:<V_double_width> (match_operand:VDI 1 "register_operand" ""))
    (SE:<V_double_width> (match_operand:VDI 2 "register_operand" ""))]
- "TARGET_NEON"
+ "TARGET_NEON && !BYTES_BIG_ENDIAN"
  {
    rtx tmpreg = gen_reg_rtx (<V_widen>mode);
    emit_insn (gen_neon_vec_<US>mult_<mode> (tmpreg, operands[1], operands[2]));
@@ -5824,7 +5824,7 @@ 
   [(match_operand:<V_double_width> 0 "register_operand" "")
    (SE:<V_double_width> (match_operand:VDI 1 "register_operand" ""))
    (SE:<V_double_width> (match_operand:VDI 2 "register_operand" ""))]
- "TARGET_NEON"
+ "TARGET_NEON && !BYTES_BIG_ENDIAN"
  {
    rtx tmpreg = gen_reg_rtx (<V_widen>mode);
    emit_insn (gen_neon_vec_<US>mult_<mode> (tmpreg, operands[1], operands[2]));
@@ -5839,7 +5839,7 @@ 
  [(match_operand:<V_double_width> 0 "register_operand" "")
    (SE:<V_double_width> (match_operand:VDI 1 "register_operand" ""))
    (match_operand:SI 2 "immediate_operand" "i")]
- "TARGET_NEON"
+ "TARGET_NEON && !BYTES_BIG_ENDIAN"
  {
    rtx tmpreg = gen_reg_rtx (<V_widen>mode);
    emit_insn (gen_neon_vec_<US>shiftl_<mode> (tmpreg, operands[1], operands[2]));
@@ -5853,7 +5853,7 @@ 
   [(match_operand:<V_double_width> 0 "register_operand" "")
    (SE:<V_double_width> (match_operand:VDI 1 "register_operand" ""))
    (match_operand:SI 2 "immediate_operand" "i")]
- "TARGET_NEON"
+ "TARGET_NEON && !BYTES_BIG_ENDIAN"
  {
    rtx tmpreg = gen_reg_rtx (<V_widen>mode);
    emit_insn (gen_neon_vec_<US>shiftl_<mode> (tmpreg, operands[1], operands[2]));