
[ARM,1/3] Neon intrinsics TLC: Replace intrinsics with GNU C implementations where possible.

Message ID 535E30F1.4020902@arm.com
State New

Commit Message

Ramana Radhakrishnan April 28, 2014, 10:44 a.m. UTC
I've special-cased the -ffast-math case for the _f32 intrinsics to
prevent the auto-vectorizer from coming along and vectorizing addv2sf
and addv4sf type operations, which we don't want to happen by default.
Patch 1/3 causes apparent "regressions" in the rather ineffective neon
intrinsics tests that we currently carry, soon hopefully to be replaced
by Christophe Lyon's rewrite that is being reviewed. On the whole I deem
this patch stack to be safe to go in if necessary. These "regressions"
are at -O0 with the vbic and vorn intrinsics, which no longer get
combined; well, so be it.
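
For reference, the shape of the change, excerpted from the patch below:
the integer intrinsics become plain GNU C vector operations, while the
float variants keep the builtin unless -ffast-math is in effect.

__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
vadd_s8 (int8x8_t __a, int8x8_t __b)
{
  return __a + __b;   /* GNU C vector addition */
}

__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vadd_f32 (float32x2_t __a, float32x2_t __b)
{
#ifdef __FAST_MATH__
  return __a + __b;   /* expose the add to the midend only under -ffast-math */
#else
  return (float32x2_t) __builtin_neon_vaddv2sf (__a, __b, 3);
#endif
}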


Given that we're in stage 1 and that I think we're getting somewhere
with clyon's testsuite, I feel it is reasonably practical to just
carry the noise from these extra failures. Christophe and I will
test-drive his testsuite work in this space with these patches to see
how the conversion process works and whether there are any issues with
these patches.


<DATE>  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

	* config/arm/arm_neon.h (vadd_s8): GNU C implementation.
	(vadd_s16): Likewise.
	(vadd_s32): Likewise.
	(vadd_f32): Likewise.
	(vadd_u8): Likewise.
	(vadd_u16): Likewise.
	(vadd_u32): Likewise.
	(vadd_s64): Likewise.
	(vadd_u64): Likewise.
	(vaddq_s8): Likewise.
	(vaddq_s16): Likewise.
	(vaddq_s32): Likewise.
	(vaddq_s64): Likewise.
	(vaddq_f32): Likewise.
	(vaddq_u8): Likewise.
	(vaddq_u16): Likewise.
	(vaddq_u32): Likewise.
	(vaddq_u64): Likewise.
	(vmul_s8): Likewise.
	(vmul_s16): Likewise.
	(vmul_s32): Likewise.
	(vmul_f32): Likewise.
	(vmul_u8): Likewise.
	(vmul_u16): Likewise.
	(vmul_u32): Likewise.
	(vmul_p8): Likewise.
	(vmulq_s8): Likewise.
	(vmulq_s16): Likewise.
	(vmulq_s32): Likewise.
	(vmulq_f32): Likewise.
	(vmulq_u8): Likewise.
	(vmulq_u16): Likewise.
	(vmulq_u32): Likewise.
	(vsub_s8): Likewise.
	(vsub_s16): Likewise.
	(vsub_s32): Likewise.
	(vsub_f32): Likewise.
	(vsub_u8): Likewise.
	(vsub_u16): Likewise.
	(vsub_u32): Likewise.
	(vsub_s64): Likewise.
	(vsub_u64): Likewise.
	(vsubq_s8): Likewise.
	(vsubq_s16): Likewise.
	(vsubq_s32): Likewise.
	(vsubq_s64): Likewise.
	(vsubq_f32): Likewise.
	(vsubq_u8): Likewise.
	(vsubq_u16): Likewise.
	(vsubq_u32): Likewise.
	(vsubq_u64): Likewise.
	(vand_s8): Likewise.
	(vand_s16): Likewise.
	(vand_s32): Likewise.
	(vand_u8): Likewise.
	(vand_u16): Likewise.
	(vand_u32): Likewise.
	(vand_s64): Likewise.
	(vand_u64): Likewise.
	(vandq_s8): Likewise.
	(vandq_s16): Likewise.
	(vandq_s32): Likewise.
	(vandq_s64): Likewise.
	(vandq_u8): Likewise.
	(vandq_u16): Likewise.
	(vandq_u32): Likewise.
	(vandq_u64): Likewise.
	(vorr_s8): Likewise.
	(vorr_s16): Likewise.
	(vorr_s32): Likewise.
	(vorr_u8): Likewise.
	(vorr_u16): Likewise.
	(vorr_u32): Likewise.
	(vorr_s64): Likewise.
	(vorr_u64): Likewise.
	(vorrq_s8): Likewise.
	(vorrq_s16): Likewise.
	(vorrq_s32): Likewise.
	(vorrq_s64): Likewise.
	(vorrq_u8): Likewise.
	(vorrq_u16): Likewise.
	(vorrq_u32): Likewise.
	(vorrq_u64): Likewise.
	(veor_s8): Likewise.
	(veor_s16): Likewise.
	(veor_s32): Likewise.
	(veor_u8): Likewise.
	(veor_u16): Likewise.
	(veor_u32): Likewise.
	(veor_s64): Likewise.
	(veor_u64): Likewise.
	(veorq_s8): Likewise.
	(veorq_s16): Likewise.
	(veorq_s32): Likewise.
	(veorq_s64): Likewise.
	(veorq_u8): Likewise.
	(veorq_u16): Likewise.
	(veorq_u32): Likewise.
	(veorq_u64): Likewise.
	(vbic_s8): Likewise.
	(vbic_s16): Likewise.
	(vbic_s32): Likewise.
	(vbic_u8): Likewise.
	(vbic_u16): Likewise.
	(vbic_u32): Likewise.
	(vbic_s64): Likewise.
	(vbic_u64): Likewise.
	(vbicq_s8): Likewise.
	(vbicq_s16): Likewise.
	(vbicq_s32): Likewise.
	(vbicq_s64): Likewise.
	(vbicq_u8): Likewise.
	(vbicq_u16): Likewise.
	(vbicq_u32): Likewise.
	(vbicq_u64): Likewise.
	(vorn_s8): Likewise.
	(vorn_s16): Likewise.
	(vorn_s32): Likewise.
	(vorn_u8): Likewise.
	(vorn_u16): Likewise.
	(vorn_u32): Likewise.
	(vorn_s64): Likewise.
	(vorn_u64): Likewise.
	(vornq_s8): Likewise.
	(vornq_s16): Likewise.
	(vornq_s32): Likewise.
	(vornq_s64): Likewise.
	(vornq_u8): Likewise.
	(vornq_u16): Likewise.
	(vornq_u32): Likewise.
	(vornq_u64): Likewise.

Comments

Julian Brown April 28, 2014, 11:44 a.m. UTC | #1
On Mon, 28 Apr 2014 11:44:01 +0100
Ramana Radhakrishnan <ramrad01@arm.com> wrote:

> I've special-cased the -ffast-math case for the _f32 intrinsics to
> prevent the auto-vectorizer from coming along and vectorizing addv2sf
> and addv4sf type operations, which we don't want to happen by default.
> Patch 1/3 causes apparent "regressions" in the rather ineffective
> neon intrinsics tests that we currently carry, soon hopefully to be
> replaced by Christophe Lyon's rewrite that is being reviewed. On the
> whole I deem this patch stack to be safe to go in if necessary. These
> "regressions" are at -O0 with the vbic and vorn intrinsics, which
> no longer get combined; well, so be it.

I think reimplementing these intrinsics in C is a mistake if we ever
hope to make big-endian mode work properly, and "fixing" the generated
header file by bypassing the generator makes it harder to accurately
perform the sweeping changes that will probably be necessary to do that.
Recall e.g. the discussion around:

http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00161.html

Generally (though in this case it's merely an implementation detail)
the NEON intrinsics and GCC's generic vector support cannot be expected
to interwork properly (because of incompatible lane ordering). Of
course we get away with it in little-endian mode though, and I guess
the bridge has already been crossed by earlier patches.
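
A contrived sketch of the mismatch, not from the patch: lane intrinsics
number lanes architecturally, while GNU C vector subscripting follows
memory order, and on a big-endian target the two need not select the
same element.

#include <arm_neon.h>

/* Hypothetical illustration only: mixing the two styles is fragile
   on big-endian because each side has its own notion of lane 0.  */
int32_t
lane_zero (int32x2_t __v)
{
  return vget_lane_s32 (__v, 0);   /* architectural lane 0 */
}

int32_t
subscript_zero (int32x2_t __v)
{
  return __v[0];                   /* generic vector element 0 */
}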

Of course it's possible nobody actually wants to use big-endian NEON,
in which case it's probably time to declare it unsupported?

Julian
Ramana Radhakrishnan April 28, 2014, 1:01 p.m. UTC | #2
On Mon, Apr 28, 2014 at 12:44 PM, Julian Brown <julian@codesourcery.com> 
wrote:
 > On Mon, 28 Apr 2014 11:44:01 +0100
 > Ramana Radhakrishnan <ramrad01@arm.com> wrote:
 >
 >> [...]
 >
 > I think reimplementing these intrinsics in C is a mistake if we ever
 > hope to make big-endian mode work properly, and "fixing" the generated
 > header file by bypassing the generator makes it harder to accurately
 > perform the sweeping changes that will probably be necessary to do that.
 > Recall e.g. the discussion around:
 >
 > http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00161.html

Well, it would help if the generator were written in a better language
than ML :). While I don't mind a different language in the backend
once in a while, the problem is that every time anyone needs to make a
change to this file, we spend far more time relearning ML than actually
making the change :(.

 >
 > Generally (though in this case it's merely an implementation detail)
 > the NEON intrinsics and GCC's generic vector support cannot be expected
 > to interwork properly (because of incompatible lane ordering). Of
 > course we get away with it in little-endian mode though, and I guess
 > the bridge has already been crossed by earlier patches.

Please note that I have been very careful to touch only those
operations that are not afflicted by big-endian issues. I am not
touching any of the lane-wise intrinsics or intrinsics that deal with
lane numbers. It is the intrinsics with explicit lane numbering that
have the issue, not the intrinsics I have touched. What's being done
here is similar to how these particular intrinsics have been handled in
the AArch64 backend; we don't see any issues with these intrinsics
there in big-endian mode, and I do not expect them to be any more
broken in big-endian than they currently are with this patch or this
set of patches.

What specifically are you worried about with Patch 1/3 with respect to
big-endian in this case? I agree that there may be issues with the
specific "lane" extraction and vector lane numbering extensions that GCC
has in big-endian mode vs Neon intrinsics, but otherwise this change
should *not* cause any issues in that space.
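
Concretely, the property being relied on: a purely elementwise
operation gives the same vector under any lane permutation, so its
definition cannot expose endianness; lane-indexed intrinsics do not
have that property. A minimal sketch, not from the patch:

#include <arm_neon.h>

/* For every lane i, (a & ~b)[i] == a[i] & ~b[i], so it does not matter
   which end of the D register lane 0 lives at.  vget_lane/vset_lane and
   friends bake a lane number into their result and so are
   endian-sensitive.  */
uint32x2_t
bic_sketch (uint32x2_t __a, uint32x2_t __b)
{
  return __a & ~__b;   /* same vector under any lane numbering */
}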

What specifically are you worried about with this patch, other than
losing the ability to auto-generate these intrinsics? As it stands, the
patch touches only those intrinsics that operate on the entire vector
and have no dependence at all on lane numbering.

regards
Ramana
Christophe Lyon April 29, 2014, 1:51 p.m. UTC | #3
Hi Ramana,

FWIW, I have executed the current set of my tests, which cover all your
changes except vmul, and I have noticed no regressions.

Christophe.


2014-04-28 12:44 GMT+02:00 Ramana Radhakrishnan <ramrad01@arm.com>:
> [...]
Richard Earnshaw May 7, 2014, 3:15 p.m. UTC | #4
On 28/04/14 14:01, Ramana Radhakrishnan wrote:
> [...]
>
> Well, it would help if the generator were written in a better language
> than ML :). While I don't mind a different language in the backend
> once in a while, the problem is that every time anyone needs to make a
> change to this file, we spend far more time relearning ML than actually
> making the change :(.
> 

I agree: it's time the ML files went.  They're an impediment to
maintenance these days.

When the ML description was added it did three things: generated
arm_neon.h, generated the testsuite and generated a pipeline description
for Cortex-A8.  As we've progressed, the second and third of these have
gone away (or at least, are about to in the case of the testsuite),
leaving only the arm_neon.h generation.  I don't see any real merit in
having that file generated from the ML file; we might as well just
maintain the existing code directly, and that gives more people the
chance to actively work on fixing issues there without having to learn
ML first.

R.

Patch

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index 37a6e61..479ec2c 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -453,114 +453,121 @@  typedef struct poly64x2x4_t
 } poly64x2x4_t;
 #endif
 
-
-
+/* vadd  */
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vadd_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vaddv8qi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vadd_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vaddv4hi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vadd_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vaddv2si (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
 vadd_f32 (float32x2_t __a, float32x2_t __b)
 {
-  return (float32x2_t)__builtin_neon_vaddv2sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a + __b;
+#else
+  return (float32x2_t) __builtin_neon_vaddv2sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vadd_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vaddv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vadd_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vaddv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vadd_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vaddv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vadd_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vadddi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vadd_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vadddi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vaddq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vaddv16qi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vaddq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vaddv8hi (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vaddq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vaddv4si (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vaddq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vaddv2di (__a, __b, 1);
+  return __a + __b;
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
 vaddq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return (float32x4_t)__builtin_neon_vaddv4sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a + __b;
+#else
+  return (float32x4_t) __builtin_neon_vaddv4sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vaddq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vaddv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vaddq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vaddv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vaddq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vaddv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vaddq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vaddv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a + __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
@@ -950,91 +957,100 @@  vraddhn_u64 (uint64x2_t __a, uint64x2_t __b)
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vmul_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vmulv8qi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vmul_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vmulv4hi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vmul_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vmulv2si (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
 vmul_f32 (float32x2_t __a, float32x2_t __b)
 {
-  return (float32x2_t)__builtin_neon_vmulv2sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a * __b;
+#else
+  return (float32x2_t) __builtin_neon_vmulv2sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vmul_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vmulv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vmul_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vmulv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vmul_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vmulv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
-}
-
-__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
-vmul_p8 (poly8x8_t __a, poly8x8_t __b)
-{
-  return (poly8x8_t)__builtin_neon_vmulv8qi ((int8x8_t) __a, (int8x8_t) __b, 2);
+  return __a * __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vmulq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vmulv16qi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vmulq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vmulv8hi (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vmulq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vmulv4si (__a, __b, 1);
+  return __a * __b;
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
 vmulq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return (float32x4_t)__builtin_neon_vmulv4sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a * __b;
+#else
+  return (float32x4_t) __builtin_neon_vmulv4sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vmulq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vmulv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vmulq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vmulv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a * __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vmulq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vmulv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a * __b;
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vmul_p8 (poly8x8_t __a, poly8x8_t __b)
+{
+  return (poly8x8_t)__builtin_neon_vmulv8qi ((int8x8_t) __a, (int8x8_t) __b, 2);
 }
 
 __extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
@@ -1521,112 +1537,121 @@  vrndq_f32 (float32x4_t __a)
 }
 
 #endif
+
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vsub_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vsubv8qi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vsub_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vsubv4hi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vsub_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vsubv2si (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
 vsub_f32 (float32x2_t __a, float32x2_t __b)
 {
-  return (float32x2_t)__builtin_neon_vsubv2sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a - __b;
+#else
+  return (float32x2_t) __builtin_neon_vsubv2sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vsub_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vsubv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vsub_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vsubv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vsub_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vsubv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vsub_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vsubdi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vsub_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vsubdi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vsubq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vsubv16qi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vsubq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vsubv8hi (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vsubq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vsubv4si (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vsubq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vsubv2di (__a, __b, 1);
+  return __a - __b;
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
 vsubq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return (float32x4_t)__builtin_neon_vsubv4sf (__a, __b, 3);
+#ifdef __FAST_MATH__
+  return __a - __b;
+#else
+  return (float32x4_t) __builtin_neon_vsubv4sf (__a, __b, 3);
+#endif
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vsubq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vsubv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vsubq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vsubv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vsubq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vsubv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vsubq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vsubv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a - __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
@@ -10907,484 +10932,483 @@  vst4q_lane_p16 (poly16_t * __a, poly16x8x4_t __b, const int __c)
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vand_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vandv8qi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vand_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vandv4hi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vand_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vandv2si (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vand_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vandv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vand_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vandv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vand_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vandv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vand_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vanddi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vand_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vanddi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vandq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vandv16qi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vandq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vandv8hi (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vandq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vandv4si (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vandq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vandv2di (__a, __b, 1);
+  return __a & __b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vandq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vandv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vandq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vandv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vandq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vandv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vandq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vandv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a & __b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vorr_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vorrv8qi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vorr_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vorrv4hi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vorr_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vorrv2si (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vorr_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vorrv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vorr_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vorrv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vorr_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vorrv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vorr_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vorrdi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vorr_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vorrdi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vorrq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vorrv16qi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vorrq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vorrv8hi (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vorrq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vorrv4si (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vorrq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vorrv2di (__a, __b, 1);
+  return __a | __b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vorrq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vorrv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vorrq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vorrv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vorrq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vorrv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vorrq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vorrv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a | __b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 veor_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_veorv8qi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 veor_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_veorv4hi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 veor_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_veorv2si (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 veor_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_veorv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 veor_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_veorv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 veor_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_veorv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 veor_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_veordi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 veor_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_veordi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 veorq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_veorv16qi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 veorq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_veorv8hi (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 veorq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_veorv4si (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 veorq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_veorv2di (__a, __b, 1);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 veorq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_veorv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 veorq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_veorv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 veorq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_veorv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 veorq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_veorv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a ^ __b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vbic_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vbicv8qi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vbic_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vbicv4hi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vbic_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vbicv2si (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vbic_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vbicv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vbic_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vbicv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vbic_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vbicv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vbic_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vbicdi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vbic_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vbicdi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vbicq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vbicv16qi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vbicq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vbicv8hi (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vbicq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vbicv4si (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vbicq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vbicv2di (__a, __b, 1);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vbicq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vbicv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vbicq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vbicv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vbicq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vbicv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vbicq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vbicv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a & ~__b;
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vorn_s8 (int8x8_t __a, int8x8_t __b)
 {
-  return (int8x8_t)__builtin_neon_vornv8qi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
 vorn_s16 (int16x4_t __a, int16x4_t __b)
 {
-  return (int16x4_t)__builtin_neon_vornv4hi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
 vorn_s32 (int32x2_t __a, int32x2_t __b)
 {
-  return (int32x2_t)__builtin_neon_vornv2si (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
 vorn_u8 (uint8x8_t __a, uint8x8_t __b)
 {
-  return (uint8x8_t)__builtin_neon_vornv8qi ((int8x8_t) __a, (int8x8_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
 vorn_u16 (uint16x4_t __a, uint16x4_t __b)
 {
-  return (uint16x4_t)__builtin_neon_vornv4hi ((int16x4_t) __a, (int16x4_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
 vorn_u32 (uint32x2_t __a, uint32x2_t __b)
 {
-  return (uint32x2_t)__builtin_neon_vornv2si ((int32x2_t) __a, (int32x2_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
 vorn_s64 (int64x1_t __a, int64x1_t __b)
 {
-  return (int64x1_t)__builtin_neon_vorndi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
 vorn_u64 (uint64x1_t __a, uint64x1_t __b)
 {
-  return (uint64x1_t)__builtin_neon_vorndi ((int64x1_t) __a, (int64x1_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
 vornq_s8 (int8x16_t __a, int8x16_t __b)
 {
-  return (int8x16_t)__builtin_neon_vornv16qi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
 vornq_s16 (int16x8_t __a, int16x8_t __b)
 {
-  return (int16x8_t)__builtin_neon_vornv8hi (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
 vornq_s32 (int32x4_t __a, int32x4_t __b)
 {
-  return (int32x4_t)__builtin_neon_vornv4si (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
 vornq_s64 (int64x2_t __a, int64x2_t __b)
 {
-  return (int64x2_t)__builtin_neon_vornv2di (__a, __b, 1);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
 vornq_u8 (uint8x16_t __a, uint8x16_t __b)
 {
-  return (uint8x16_t)__builtin_neon_vornv16qi ((int8x16_t) __a, (int8x16_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
 vornq_u16 (uint16x8_t __a, uint16x8_t __b)
 {
-  return (uint16x8_t)__builtin_neon_vornv8hi ((int16x8_t) __a, (int16x8_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
 vornq_u32 (uint32x4_t __a, uint32x4_t __b)
 {
-  return (uint32x4_t)__builtin_neon_vornv4si ((int32x4_t) __a, (int32x4_t) __b, 0);
+  return __a | ~__b;
 }
 
 __extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
 vornq_u64 (uint64x2_t __a, uint64x2_t __b)
 {
-  return (uint64x2_t)__builtin_neon_vornv2di ((int64x2_t) __a, (int64x2_t) __b, 0);
+  return __a | ~__b;
 }
 
-
 __extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
 vreinterpret_p8_p16 (poly16x4_t __a)
 {