Patchwork [AArch64] Implement vset_lane intrinsics in C

Submitter James Greenhalgh
Date Sept. 13, 2013, 6:35 p.m.
Message ID <1379097315-27647-1-git-send-email-james.greenhalgh@arm.com>
Permalink /patch/274854/
State New
Headers show

Comments

James Greenhalgh - Sept. 13, 2013, 6:35 p.m.
Hi,

The vset<q>_lane_<fpsu><8,16,32,64> intrinsics are currently
written using inline assembler, but can be easily expressed
in C.

As I expect we will want to efficiently compose these intrinsics
I've added them as macros, just as was done with the vget_lane
intrinsics.

Regression tested for aarch64-none-elf and a new testcase
added to ensure these intrinsics generate the expected
instruction.

OK?

Thanks,
James

---
gcc/

2013-09-13  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/arm_neon.h
	(__aarch64_vset_lane_any): New.
	(__aarch64_vset<q>_lane_<fpsu><8,16,32,64>): Likewise.
	(vset<q>_lane_<fpsu><8,16,32,64>): Use new macros.

gcc/testsuite

2013-09-13  James Greenhalgh  <james.greenhalgh@arm.com>

	* gcc.target/aarch64/vect_set_lane_1.c: New.
Andrew Pinski - Sept. 13, 2013, 6:39 p.m.
On Fri, Sep 13, 2013 at 11:35 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
>
> Hi,
>
> The vset<q>_lane_<fpsu><8,16,32,64> intrinsics are currently
> written using inline assembler, but can be easily expressed
> in C.
>
> As I expect we will want to efficiently compose these intrinsics
> I've added them as macros, just as was done with the vget_lane
> intrinsics.
>
> Regression tested for aarch64-none-elf and a new testcase
> added to ensure these intrinsics generate the expected
> instruction.
>
> OK?


I don't think this works for big-endian, due to the way ARM decided
that lanes don't match up with array entries there.

Thanks,
Andrew Pinski

>
> Thanks,
> James
>
> ---
> gcc/
>
> 2013-09-13  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * config/aarch64/arm_neon.h
>         (__aarch64_vset_lane_any): New.
>         (__aarch64_vset<q>_lane_<fpsu><8,16,32,64>): Likewise.
>         (vset<q>_lane_<fpsu><8,16,32,64>): Use new macros.
>
> gcc/testsuite
>
> 2013-09-13  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         * gcc.target/aarch64/vect_set_lane_1.c: New.
James Greenhalgh - Sept. 13, 2013, 6:57 p.m.
On Fri, Sep 13, 2013 at 07:39:08PM +0100, Andrew Pinski wrote:
> I don't think this works for big-endian, due to the way ARM decided
> that lanes don't match up with array entries there.

Hi Andrew,

Certainly for the testcase I've added in this patch there are no issues.

Vector indexing should work consistently between big and little endian
AArch64 targets. So,

  int32_t foo[4] = {0, 1, 2, 3};
  int32x4_t a = vld1q_s32 (foo);
  int b = a[1];
  return b;

Should return '1' whatever your endianness. Throwing together a quick
test case, that is the case for current trunk. Do you have a testcase
where this goes wrong?

Thanks,
James
Andrew Pinski - Sept. 13, 2013, 9:47 p.m.
On Fri, Sep 13, 2013 at 11:57 AM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
> On Fri, Sep 13, 2013 at 07:39:08PM +0100, Andrew Pinski wrote:
>> I don't think this works for big-endian, due to the way ARM decided
>> that lanes don't match up with array entries there.
>
> Hi Andrew,
>
> Certainly for the testcase I've added in this patch there are no issues.
>
> Vector indexing should work consistently between big and little endian
> AArch64 targets. So,
>
>   int32_t foo[4] = {0, 1, 2, 3};
>   int32x4_t a = vld1q_s32 (foo);
>   int b = a[1];
>   return b;
>
> Should return '1' whatever your endianness. Throwing together a quick
> test case, that is the case for current trunk. Do you have a testcase
> where this goes wrong?

I was not thinking of that, but rather that the definition of a lane in
ARM64 differs from that of an array element due to the memory ordering
in big-endian mode.  That is, lane 0 is element 3 in big-endian.  Or is
that issue confined to aarch32?

Thanks,
Andrew Pinski

>
> Thanks,
> James
>
James Greenhalgh - Sept. 16, 2013, 9:29 a.m.
On Fri, Sep 13, 2013 at 10:47:01PM +0100, Andrew Pinski wrote:
> On Fri, Sep 13, 2013 at 11:57 AM, James Greenhalgh
> <james.greenhalgh@arm.com> wrote:
> > Should return '1' whatever your endianness. Throwing together a quick
> > test case, that is the case for current trunk. Do you have a testcase
> > where this goes wrong?
> 
> I was not thinking of that, but rather that the definition of a lane in
> ARM64 differs from that of an array element due to the memory ordering
> in big-endian mode.  That is, lane 0 is element 3 in big-endian.  Or is
> that issue confined to aarch32?
> 
> Thanks,
> Andrew Pinski

Well, AArch64 has the AArch32 style memory ordering for vectors,
which I think is different from what other big-endian architectures
use, but gives consistent behaviour between vector and array indexing.

So, take the easy case of a byte array

  uint8_t foo [8] = {0, 1, 2, 3, 4, 5, 6, 7};

We would expect both the big and little endian toolchains to lay
this out in memory as:

   0x0             ...         0x8
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

And element 0 would give us '0'. If we take the same array and load it
as a vector with ld1.b, both big and little-endian toolchains would load
it as:

   bit 127 ..   bit 64                           bit 0
   lane 15    | lane 7 |                       |  lane 0 |
  |.....      |    7   | 6 | 5 | 4 | 3 | 2 | 1 |   0     |

So lane 0 is '0', we're OK so far!

For a short array:

  uint16_t foo [4] = {0x0a0b, 0x1a1b, 0x2a2b, 0x3a3b};

The little endian compiler would lay memory out as:

   0x0             ...                0x8
  | 0b | 0a | 1b | 1a | 2b | 2a | 3b | 3a |

And the big endian compiler would lay out memory as:

   0x0             ...                0x8
  | 0a | 0b | 1a | 1b | 2a | 2b | 3a | 3b |

In both cases, element 0 is '0x0a0b'. If we load this array as a
vector with ld1.h both big and little-endian compilers will load
the vector as:

   bit 127 ..  bit 64                        bit 0
   lane 7     | lane 3  |                   | lane 0  |
  |.....      | 3b | 3a | 2b | 2a | 1b | 1a | 0b | 0a |

And lane 0 is '0x0a0b'.  So we are OK again!

Lanes and elements should match under our model.  I don't think that
is true of other architectures, where I believe the whole vector object
is arranged big-endian, such that we would need to lay our byte array
out as:

   0x0             ...         0x8
  | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

For it to be correctly loaded, at which point there is a discrepancy
between element and lane.

But as I say, that is other architectures. AArch64 should be consistent.

Thanks,
James
James Greenhalgh - Sept. 16, 2013, 9:33 a.m.
On Mon, Sep 16, 2013 at 10:29:37AM +0100, James Greenhalgh wrote:
> The little endian compiler would lay memory out as:
> 
>    0x0             ...                0x8
>   | 0b | 0a | 1b | 1a | 2b | 2a | 3b | 3a |
> 
> And the big endian compiler would lay out memory as:
> 
>    0x0             ...                0x8
>   | 0a | 0b | 1a | 1b | 2a | 2b | 3a | 3b |
> 
> In both cases, element 0 is '0x0a0b'. If we load this array as a
> vector with ld1.h both big and little-endian compilers will load
> the vector as:
> 
>    bit 127 ..  bit 64                        bit 0
>    lane 7     | lane 3  |                   | lane 0  |
>   |.....      | 3b | 3a | 2b | 2a | 1b | 1a | 0b | 0a |
> 

Ugh, I knew I would make a mistake somewhere!

This should, of course, be loaded as:

    bit 127 ..  bit 64                        bit 0
    lane 7     | lane 3  |                   | lane 0  |
   |.....      | 3a | 3b | 2a | 2b | 1a | 1b | 0a | 0b |
 
James

Patch

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index cb58602..6335ddf 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -508,6 +508,58 @@  typedef struct poly16x8x4_t
 #define __aarch64_vgetq_lane_u64(__a, __b) \
   __aarch64_vget_lane_any (v2di, (uint64_t), (int64x2_t), __a, __b)
 
+/* __aarch64_vset_lane internal macros.  */
+#define __aarch64_vset_lane_any(__source, __v, __index) \
+  (__v[__index] = __source, __v)
+
+#define __aarch64_vset_lane_f32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_f64(__source, __v, __index) (__source)
+#define __aarch64_vset_lane_p8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_p16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s64(__source, __v, __index) (__source)
+#define __aarch64_vset_lane_u8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_u16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_u32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_u64(__source, __v, __index) (__source)
+
+/* __aarch64_vsetq_lane internal macros.  */
+#define __aarch64_vsetq_lane_f32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_f64(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_p8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_p16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s64(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u64(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+
 /* __aarch64_vdup_lane internal macros.  */
 #define __aarch64_vdup_lane_any(__size, __q1, __q2, __a, __b) \
   vdup##__q1##_n_##__size (__aarch64_vget##__q2##_lane_##__size (__a, __b))
@@ -3969,6 +4021,154 @@  vreinterpretq_u32_p16 (poly16x8_t __a)
   return (uint32x4_t) __builtin_aarch64_reinterpretv4siv8hi ((int16x8_t) __a);
 }
 
+/* vset_lane.  */
+
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vset_lane_f32 (float32_t __a, float32x2_t __v, const int __index)
+{
+  return __aarch64_vset_lane_f32 (__a, __v, __index);
+}
+
+__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
+vset_lane_f64 (float64_t __a, float64x1_t __v, const int __index)
+{
+  return __aarch64_vset_lane_f64 (__a, __v, __index);
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vset_lane_p8 (poly8_t __a, poly8x8_t __v, const int __index)
+{
+  return __aarch64_vset_lane_p8 (__a, __v, __index);
+}
+
+__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
+vset_lane_p16 (poly16_t __a, poly16x4_t __v, const int __index)
+{
+  return __aarch64_vset_lane_p16 (__a, __v, __index);
+}
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vset_lane_s8 (int8_t __a, int8x8_t __v, const int __index)
+{
+  return __aarch64_vset_lane_s8 (__a, __v, __index);
+}
+
+__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
+vset_lane_s16 (int16_t __a, int16x4_t __v, const int __index)
+{
+  return __aarch64_vset_lane_s16 (__a, __v, __index);
+}
+
+__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
+vset_lane_s32 (int32_t __a, int32x2_t __v, const int __index)
+{
+  return __aarch64_vset_lane_s32 (__a, __v, __index);
+}
+
+__extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
+vset_lane_s64 (int64_t __a, int64x1_t __v, const int __index)
+{
+  return __aarch64_vset_lane_s64 (__a, __v, __index);
+}
+
+__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
+vset_lane_u8 (uint8_t __a, uint8x8_t __v, const int __index)
+{
+  return __aarch64_vset_lane_u8 (__a, __v, __index);
+}
+
+__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
+vset_lane_u16 (uint16_t __a, uint16x4_t __v, const int __index)
+{
+  return __aarch64_vset_lane_u16 (__a, __v, __index);
+}
+
+__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
+vset_lane_u32 (uint32_t __a, uint32x2_t __v, const int __index)
+{
+  return __aarch64_vset_lane_u32 (__a, __v, __index);
+}
+
+__extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
+vset_lane_u64 (uint64_t __a, uint64x1_t __v, const int __index)
+{
+  return __aarch64_vset_lane_u64 (__a, __v, __index);
+}
+
+/* vsetq_lane.  */
+
+__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
+vsetq_lane_f32 (float32_t __a, float32x4_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_f32 (__a, __v, __index);
+}
+
+__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
+vsetq_lane_f64 (float64_t __a, float64x2_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_f64 (__a, __v, __index);
+}
+
+__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
+vsetq_lane_p8 (poly8_t __a, poly8x16_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_p8 (__a, __v, __index);
+}
+
+__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
+vsetq_lane_p16 (poly16_t __a, poly16x8_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_p16 (__a, __v, __index);
+}
+
+__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
+vsetq_lane_s8 (int8_t __a, int8x16_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_s8 (__a, __v, __index);
+}
+
+__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
+vsetq_lane_s16 (int16_t __a, int16x8_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_s16 (__a, __v, __index);
+}
+
+__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
+vsetq_lane_s32 (int32_t __a, int32x4_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_s32 (__a, __v, __index);
+}
+
+__extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
+vsetq_lane_s64 (int64_t __a, int64x2_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_s64 (__a, __v, __index);
+}
+
+__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
+vsetq_lane_u8 (uint8_t __a, uint8x16_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_u8 (__a, __v, __index);
+}
+
+__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
+vsetq_lane_u16 (uint16_t __a, uint16x8_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_u16 (__a, __v, __index);
+}
+
+__extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
+vsetq_lane_u32 (uint32_t __a, uint32x4_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_u32 (__a, __v, __index);
+}
+
+__extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
+vsetq_lane_u64 (uint64_t __a, uint64x2_t __v, const int __index)
+{
+  return __aarch64_vsetq_lane_u64 (__a, __v, __index);
+}
+
 #define __GET_LOW(__TYPE) \
   uint64x2_t tmp = vreinterpretq_u64_##__TYPE (__a);  \
   uint64_t lo = vgetq_lane_u64 (tmp, 0);  \
@@ -12192,318 +12392,6 @@  vrsubhn_u64 (uint64x2_t a, uint64x2_t b)
   return result;
 }
 
-#define vset_lane_f32(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       float32x2_t b_ = (b);                                            \
-       float32_t a_ = (a);                                              \
-       float32x2_t result;                                              \
-       __asm__ ("ins %0.s[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_f64(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       float64x1_t b_ = (b);                                            \
-       float64_t a_ = (a);                                              \
-       float64x1_t result;                                              \
-       __asm__ ("ins %0.d[%3], %x1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_p8(a, b, c)                                           \
-  __extension__                                                         \
-    ({                                                                  \
-       poly8x8_t b_ = (b);                                              \
-       poly8_t a_ = (a);                                                \
-       poly8x8_t result;                                                \
-       __asm__ ("ins %0.b[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_p16(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       poly16x4_t b_ = (b);                                             \
-       poly16_t a_ = (a);                                               \
-       poly16x4_t result;                                               \
-       __asm__ ("ins %0.h[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_s8(a, b, c)                                           \
-  __extension__                                                         \
-    ({                                                                  \
-       int8x8_t b_ = (b);                                               \
-       int8_t a_ = (a);                                                 \
-       int8x8_t result;                                                 \
-       __asm__ ("ins %0.b[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_s16(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       int16x4_t b_ = (b);                                              \
-       int16_t a_ = (a);                                                \
-       int16x4_t result;                                                \
-       __asm__ ("ins %0.h[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_s32(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       int32x2_t b_ = (b);                                              \
-       int32_t a_ = (a);                                                \
-       int32x2_t result;                                                \
-       __asm__ ("ins %0.s[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_s64(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       int64x1_t b_ = (b);                                              \
-       int64_t a_ = (a);                                                \
-       int64x1_t result;                                                \
-       __asm__ ("ins %0.d[%3], %x1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_u8(a, b, c)                                           \
-  __extension__                                                         \
-    ({                                                                  \
-       uint8x8_t b_ = (b);                                              \
-       uint8_t a_ = (a);                                                \
-       uint8x8_t result;                                                \
-       __asm__ ("ins %0.b[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_u16(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       uint16x4_t b_ = (b);                                             \
-       uint16_t a_ = (a);                                               \
-       uint16x4_t result;                                               \
-       __asm__ ("ins %0.h[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_u32(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       uint32x2_t b_ = (b);                                             \
-       uint32_t a_ = (a);                                               \
-       uint32x2_t result;                                               \
-       __asm__ ("ins %0.s[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vset_lane_u64(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       uint64x1_t b_ = (b);                                             \
-       uint64_t a_ = (a);                                               \
-       uint64x1_t result;                                               \
-       __asm__ ("ins %0.d[%3], %x1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_f32(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       float32x4_t b_ = (b);                                            \
-       float32_t a_ = (a);                                              \
-       float32x4_t result;                                              \
-       __asm__ ("ins %0.s[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_f64(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       float64x2_t b_ = (b);                                            \
-       float64_t a_ = (a);                                              \
-       float64x2_t result;                                              \
-       __asm__ ("ins %0.d[%3], %x1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_p8(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       poly8x16_t b_ = (b);                                             \
-       poly8_t a_ = (a);                                                \
-       poly8x16_t result;                                               \
-       __asm__ ("ins %0.b[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_p16(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       poly16x8_t b_ = (b);                                             \
-       poly16_t a_ = (a);                                               \
-       poly16x8_t result;                                               \
-       __asm__ ("ins %0.h[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_s8(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       int8x16_t b_ = (b);                                              \
-       int8_t a_ = (a);                                                 \
-       int8x16_t result;                                                \
-       __asm__ ("ins %0.b[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_s16(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       int16x8_t b_ = (b);                                              \
-       int16_t a_ = (a);                                                \
-       int16x8_t result;                                                \
-       __asm__ ("ins %0.h[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_s32(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       int32x4_t b_ = (b);                                              \
-       int32_t a_ = (a);                                                \
-       int32x4_t result;                                                \
-       __asm__ ("ins %0.s[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_s64(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       int64x2_t b_ = (b);                                              \
-       int64_t a_ = (a);                                                \
-       int64x2_t result;                                                \
-       __asm__ ("ins %0.d[%3], %x1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_u8(a, b, c)                                          \
-  __extension__                                                         \
-    ({                                                                  \
-       uint8x16_t b_ = (b);                                             \
-       uint8_t a_ = (a);                                                \
-       uint8x16_t result;                                               \
-       __asm__ ("ins %0.b[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_u16(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       uint16x8_t b_ = (b);                                             \
-       uint16_t a_ = (a);                                               \
-       uint16x8_t result;                                               \
-       __asm__ ("ins %0.h[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_u32(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       uint32x4_t b_ = (b);                                             \
-       uint32_t a_ = (a);                                               \
-       uint32x4_t result;                                               \
-       __asm__ ("ins %0.s[%3], %w1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
-#define vsetq_lane_u64(a, b, c)                                         \
-  __extension__                                                         \
-    ({                                                                  \
-       uint64x2_t b_ = (b);                                             \
-       uint64_t a_ = (a);                                               \
-       uint64x2_t result;                                               \
-       __asm__ ("ins %0.d[%3], %x1"                                     \
-                : "=w"(result)                                          \
-                : "r"(a_), "0"(b_), "i"(c)                              \
-                : /* No clobbers */);                                   \
-       result;                                                          \
-     })
-
 #define vshrn_high_n_s16(a, b, c)                                       \
   __extension__                                                         \
     ({                                                                  \
@@ -25537,6 +25425,33 @@  __INTERLEAVE_LIST (zip)
 #undef __aarch64_vgetq_lane_u32
 #undef __aarch64_vgetq_lane_u64
 
+#undef __aarch64_vset_lane_any
+#undef __aarch64_vset_lane_f32
+#undef __aarch64_vset_lane_f64
+#undef __aarch64_vset_lane_p8
+#undef __aarch64_vset_lane_p16
+#undef __aarch64_vset_lane_s8
+#undef __aarch64_vset_lane_s16
+#undef __aarch64_vset_lane_s32
+#undef __aarch64_vset_lane_s64
+#undef __aarch64_vset_lane_u8
+#undef __aarch64_vset_lane_u16
+#undef __aarch64_vset_lane_u32
+#undef __aarch64_vset_lane_u64
+
+#undef __aarch64_vsetq_lane_f32
+#undef __aarch64_vsetq_lane_f64
+#undef __aarch64_vsetq_lane_p8
+#undef __aarch64_vsetq_lane_p16
+#undef __aarch64_vsetq_lane_s8
+#undef __aarch64_vsetq_lane_s16
+#undef __aarch64_vsetq_lane_s32
+#undef __aarch64_vsetq_lane_s64
+#undef __aarch64_vsetq_lane_u8
+#undef __aarch64_vsetq_lane_u16
+#undef __aarch64_vsetq_lane_u32
+#undef __aarch64_vsetq_lane_u64
+
 #undef __aarch64_vdup_lane_any
 #undef __aarch64_vdup_lane_f32
 #undef __aarch64_vdup_lane_f64
diff --git a/gcc/testsuite/gcc.target/aarch64/vect_set_lane_1.c b/gcc/testsuite/gcc.target/aarch64/vect_set_lane_1.c
new file mode 100644
index 0000000..800ffce
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect_set_lane_1.c
@@ -0,0 +1,57 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#include "arm_neon.h"
+
+#define BUILD_TEST(TYPE, INNER_TYPE, Q, SUFFIX, INDEX)			\
+TYPE									\
+test_set##Q##_lane_##SUFFIX (INNER_TYPE a, TYPE v)			\
+{									\
+  return vset##Q##_lane_##SUFFIX (a, v, INDEX);				\
+}
+
+/* vset_lane.  */
+BUILD_TEST (poly8x8_t, poly8_t, , p8, 7)
+BUILD_TEST (int8x8_t, int8_t, , s8, 7)
+BUILD_TEST (uint8x8_t, uint8_t, , u8, 7)
+/* { dg-final { scan-assembler-times "ins\\tv0.b\\\[7\\\], w0" 3 } } */
+BUILD_TEST (poly16x4_t, poly16_t, , p16, 3)
+BUILD_TEST (int16x4_t, int16_t, , s16, 3)
+BUILD_TEST (uint16x4_t, uint16_t, , u16, 3)
+/* { dg-final { scan-assembler-times "ins\\tv0.h\\\[3\\\], w0" 3 } } */
+BUILD_TEST (int32x2_t, int32_t, , s32, 1)
+BUILD_TEST (uint32x2_t, uint32_t, , u32, 1)
+/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], w0" 2 } } */
+BUILD_TEST (int64x1_t, int64_t, , s64, 0)
+BUILD_TEST (uint64x1_t, uint64_t, , u64, 0)
+/* Nothing to do.  */
+
+/* vsetq_lane.  */
+
+BUILD_TEST (poly8x16_t, poly8_t, q, p8, 15)
+BUILD_TEST (int8x16_t, int8_t, q, s8, 15)
+BUILD_TEST (uint8x16_t, uint8_t, q, u8, 15)
+/* { dg-final { scan-assembler-times "ins\\tv0.b\\\[15\\\], w0" 3 } } */
+BUILD_TEST (poly16x8_t, poly16_t, q, p16, 7)
+BUILD_TEST (int16x8_t, int16_t, q, s16, 7)
+BUILD_TEST (uint16x8_t, uint16_t, q, u16, 7)
+/* { dg-final { scan-assembler-times "ins\\tv0.h\\\[7\\\], w0" 3 } } */
+BUILD_TEST (int32x4_t, int32_t, q, s32, 3)
+BUILD_TEST (uint32x4_t, uint32_t, q, u32, 3)
+/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[3\\\], w0" 2 } } */
+BUILD_TEST (int64x2_t, int64_t, q, s64, 1)
+BUILD_TEST (uint64x2_t, uint64_t, q, u64, 1)
+/* { dg-final { scan-assembler-times "ins\\tv0.d\\\[1\\\], x0" 2 } } */
+
+/* Float versions are slightly different as their scalar value
+   will be in v0 rather than w0.  */
+BUILD_TEST (float32x2_t, float32_t, , f32, 1)
+/* { dg-final { scan-assembler-times "ins\\tv1.s\\\[1\\\], v0.s\\\[0\\\]" 1 } } */
+BUILD_TEST (float64x1_t, float64_t, , f64, 0)
+/* Nothing to do.  */
+BUILD_TEST (float32x4_t, float32_t, q, f32, 3)
+/* { dg-final { scan-assembler-times "ins\\tv1.s\\\[3\\\], v0.s\\\[0\\\]" 1 } } */
+BUILD_TEST (float64x2_t, float64_t, q, f64, 1)
+/* { dg-final { scan-assembler-times "ins\\tv1.d\\\[1\\\], v0.d\\\[0\\\]" 1 } } */
+
+/* { dg-final { cleanup-saved-temps } } */