diff mbox series

ACLE intrinsics: BFloat16 store (vst<n>{q}_bf16) intrinsics for AArch32

Message ID fb8d8bd6-f2ea-9990-617c-1b543d8d07e3@arm.com
State New
Headers show
Series ACLE intrinsics: BFloat16 store (vst<n>{q}_bf16) intrinsics for AArch32 | expand

Commit Message

Delia Burduv Dec. 20, 2019, 6:46 p.m. UTC
This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics 
vst<n>{q}_bf16 as part of the BFloat16 extension.
(https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics)
The intrinsics are declared in arm_neon.h .
A new test is added to check assembler output.

This patch depends on the Arm back-end patche. 
(https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)

Tested for regression on arm-none-eabi and armeb-none-eabi. I don't have 
commit rights, so if this is ok can someone please commit it for me?

gcc/ChangeLog:

2019-11-14  Delia Burduv  <delia.burduv@arm.com>

	* config/arm/arm_neon.h (bfloat16_t): New typedef.
         (bfloat16x4x2_t): New typedef.
         (bfloat16x8x2_t): New typedef.
         (bfloat16x4x3_t): New typedef.
         (bfloat16x8x3_t): New typedef.
         (bfloat16x4x4_t): New typedef.
         (bfloat16x8x4_t): New typedef.
         (vst2_bf16): New.
	(vst2q_bf16): New.
	(vst3_bf16): New.
	(vst3q_bf16): New.
	(vst4_bf16): New.
	(vst4q_bf16): New.
         * config/arm/arm-builtins.c (E_V2BFmode): New mode.
         (VAR13): New.
         (arm_simd_types[Bfloat16x2_t]):New type.
         * config/arm/arm-modes.def (V2BF): New mode.
         * config/arm/arm-simd-builtin-types.def
         (Bfloat16x2_t): New entry.
         * config/arm/arm_neon_builtins.def
         (vst2): Changed to VAR13 and added v4bf, v8bf
         (vst3): Changed to VAR13 and added v4bf, v8bf
         (vst4): Changed to VAR13 and added v4bf, v8bf
         * config/arm/iterators.md (VDXBF): New iterator.
         (VQ2BF): New iterator.
         (V_elem): Added V4BF, V8BF.
         (V_sz_elem): Added V4BF, V8BF.
         (V_mode_nunits): Added V4BF, V8BF.
         (q): Added V4BF, V8BF.
         *config/arm/neon.md (vst2): Used new iterators.
         (vst3): Used new iterators.
         (vst3qa): Used new iterators.
         (vst3qb): Used new iterators.
         (vst4): Used new iterators.
         (vst4qa): Used new iterators.
         (vst4qb): Used new iterators.


gcc/testsuite/ChangeLog:

2019-11-14  Delia Burduv  <delia.burduv@arm.com>

	* gcc.target/arm/simd/bf16_vstn_1.c: New test.

Comments

Delia Burduv Jan. 22, 2020, 5:29 p.m. UTC | #1
Ping.

I will change the tests to use the exact input and output registers as 
Richard Sandiford suggested for the AArch64 patches.

On 12/20/19 6:46 PM, Delia Burduv wrote:
> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics 
> vst<n>{q}_bf16 as part of the BFloat16 extension.
> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
> 
> The intrinsics are declared in arm_neon.h .
> A new test is added to check assembler output.
> 
> This patch depends on the Arm back-end patche. 
> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
> 
> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't have 
> commit rights, so if this is ok can someone please commit it for me?
> 
> gcc/ChangeLog:
> 
> 2019-11-14  Delia Burduv  <delia.burduv@arm.com>
> 
>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>          (bfloat16x4x2_t): New typedef.
>          (bfloat16x8x2_t): New typedef.
>          (bfloat16x4x3_t): New typedef.
>          (bfloat16x8x3_t): New typedef.
>          (bfloat16x4x4_t): New typedef.
>          (bfloat16x8x4_t): New typedef.
>          (vst2_bf16): New.
>      (vst2q_bf16): New.
>      (vst3_bf16): New.
>      (vst3q_bf16): New.
>      (vst4_bf16): New.
>      (vst4q_bf16): New.
>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>          (VAR13): New.
>          (arm_simd_types[Bfloat16x2_t]):New type.
>          * config/arm/arm-modes.def (V2BF): New mode.
>          * config/arm/arm-simd-builtin-types.def
>          (Bfloat16x2_t): New entry.
>          * config/arm/arm_neon_builtins.def
>          (vst2): Changed to VAR13 and added v4bf, v8bf
>          (vst3): Changed to VAR13 and added v4bf, v8bf
>          (vst4): Changed to VAR13 and added v4bf, v8bf
>          * config/arm/iterators.md (VDXBF): New iterator.
>          (VQ2BF): New iterator.
>          (V_elem): Added V4BF, V8BF.
>          (V_sz_elem): Added V4BF, V8BF.
>          (V_mode_nunits): Added V4BF, V8BF.
>          (q): Added V4BF, V8BF.
>          *config/arm/neon.md (vst2): Used new iterators.
>          (vst3): Used new iterators.
>          (vst3qa): Used new iterators.
>          (vst3qb): Used new iterators.
>          (vst4): Used new iterators.
>          (vst4qa): Used new iterators.
>          (vst4qb): Used new iterators.
> 
> 
> gcc/testsuite/ChangeLog:
> 
> 2019-11-14  Delia Burduv  <delia.burduv@arm.com>
> 
>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
Delia Burduv Jan. 28, 2020, 4:44 p.m. UTC | #2
Ping.
Delia Burduv Feb. 19, 2020, 5:25 p.m. UTC | #3
Hi,

Here is the latest version of the patch. It just has some minor 
formatting changes that were brought up by Richard Sandiford in the 
AArch64 patches

Thanks,
Delia

On 1/22/20 5:29 PM, Delia Burduv wrote:
> Ping.
> 
> I will change the tests to use the exact input and output registers as 
> Richard Sandiford suggested for the AArch64 patches.
> 
> On 12/20/19 6:46 PM, Delia Burduv wrote:
>> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics 
>> vst<n>{q}_bf16 as part of the BFloat16 extension.
>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>
>> The intrinsics are declared in arm_neon.h .
>> A new test is added to check assembler output.
>>
>> This patch depends on the Arm back-end patche. 
>> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>
>> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't 
>> have commit rights, so if this is ok can someone please commit it for me?
>>
>> gcc/ChangeLog:
>>
>> 2019-11-14  Delia Burduv  <delia.burduv@arm.com>
>>
>>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>          (bfloat16x4x2_t): New typedef.
>>          (bfloat16x8x2_t): New typedef.
>>          (bfloat16x4x3_t): New typedef.
>>          (bfloat16x8x3_t): New typedef.
>>          (bfloat16x4x4_t): New typedef.
>>          (bfloat16x8x4_t): New typedef.
>>          (vst2_bf16): New.
>>      (vst2q_bf16): New.
>>      (vst3_bf16): New.
>>      (vst3q_bf16): New.
>>      (vst4_bf16): New.
>>      (vst4q_bf16): New.
>>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>          (VAR13): New.
>>          (arm_simd_types[Bfloat16x2_t]):New type.
>>          * config/arm/arm-modes.def (V2BF): New mode.
>>          * config/arm/arm-simd-builtin-types.def
>>          (Bfloat16x2_t): New entry.
>>          * config/arm/arm_neon_builtins.def
>>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>          * config/arm/iterators.md (VDXBF): New iterator.
>>          (VQ2BF): New iterator.
>>          (V_elem): Added V4BF, V8BF.
>>          (V_sz_elem): Added V4BF, V8BF.
>>          (V_mode_nunits): Added V4BF, V8BF.
>>          (q): Added V4BF, V8BF.
>>          *config/arm/neon.md (vst2): Used new iterators.
>>          (vst3): Used new iterators.
>>          (vst3qa): Used new iterators.
>>          (vst3qb): Used new iterators.
>>          (vst4): Used new iterators.
>>          (vst4qa): Used new iterators.
>>          (vst4qb): Used new iterators.
>>
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2019-11-14  Delia Burduv  <delia.burduv@arm.com>
>>
>>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
Kyrill Tkachov Feb. 21, 2020, 2:06 p.m. UTC | #4
Hi Delia,

On 2/19/20 5:25 PM, Delia Burduv wrote:
> Hi,
>
> Here is the latest version of the patch. It just has some minor
> formatting changes that were brought up by Richard Sandiford in the
> AArch64 patches
>
> Thanks,
> Delia
>
> On 1/22/20 5:29 PM, Delia Burduv wrote:
> > Ping.
> >
> > I will change the tests to use the exact input and output registers as
> > Richard Sandiford suggested for the AArch64 patches.
> >
> > On 12/20/19 6:46 PM, Delia Burduv wrote:
> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
> >> 
> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>
> >>
> >> The intrinsics are declared in arm_neon.h .
> >> A new test is added to check assembler output.
> >>
> >> This patch depends on the Arm back-end patche.
> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
> >>
> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
> >> have commit rights, so if this is ok can someone please commit it 
> for me?
> >>
> >> gcc/ChangeLog:
> >>
> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
> >>
> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
> >>          (bfloat16x4x2_t): New typedef.
> >>          (bfloat16x8x2_t): New typedef.
> >>          (bfloat16x4x3_t): New typedef.
> >>          (bfloat16x8x3_t): New typedef.
> >>          (bfloat16x4x4_t): New typedef.
> >>          (bfloat16x8x4_t): New typedef.
> >>          (vst2_bf16): New.
> >>      (vst2q_bf16): New.
> >>      (vst3_bf16): New.
> >>      (vst3q_bf16): New.
> >>      (vst4_bf16): New.
> >>      (vst4q_bf16): New.
> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
> >>          (VAR13): New.
> >>          (arm_simd_types[Bfloat16x2_t]):New type.
> >>          * config/arm/arm-modes.def (V2BF): New mode.
> >>          * config/arm/arm-simd-builtin-types.def
> >>          (Bfloat16x2_t): New entry.
> >>          * config/arm/arm_neon_builtins.def
> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
> >>          * config/arm/iterators.md (VDXBF): New iterator.
> >>          (VQ2BF): New iterator.
> >>          (V_elem): Added V4BF, V8BF.
> >>          (V_sz_elem): Added V4BF, V8BF.
> >>          (V_mode_nunits): Added V4BF, V8BF.
> >>          (q): Added V4BF, V8BF.
> >>          *config/arm/neon.md (vst2): Used new iterators.
> >>          (vst3): Used new iterators.
> >>          (vst3qa): Used new iterators.
> >>          (vst3qb): Used new iterators.
> >>          (vst4): Used new iterators.
> >>          (vst4qa): Used new iterators.
> >>          (vst4qb): Used new iterators.
> >>
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
> >>
> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.

One thing I just noticed in this and the other arm bfloat16 patches...

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, float32x4_t __a, float32x4_t __b,
    return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
  }
  
+#pragma GCC push_options
+#pragma GCC target ("arch=armv8.2-a+bf16")
+
+typedef struct bfloat16x4x2_t
+{
+  bfloat16x4_t val[2];
+} bfloat16x4x2_t;


These should be in a new arm_bf16.h file that gets included in the main arm_neon.h file, right?
I believe the aarch64 versions are implemented that way.

Otherwise the patch looks good to me.
Thanks!
Kyrill


  +
+typedef struct bfloat16x8x2_t
+{
+  bfloat16x8_t val[2];
+} bfloat16x8x2_t;
+
Delia Burduv Feb. 21, 2020, 3:18 p.m. UTC | #5
Hi Kyrill,

The arm_bf16.h is only used for scalar operations. That is how the 
aarch64 versions are implemented too.

Thanks,
Delia

On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
> Hi Delia,
> 
> On 2/19/20 5:25 PM, Delia Burduv wrote:
>> Hi,
>>
>> Here is the latest version of the patch. It just has some minor
>> formatting changes that were brought up by Richard Sandiford in the
>> AArch64 patches
>>
>> Thanks,
>> Delia
>>
>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>> > Ping.
>> >
>> > I will change the tests to use the exact input and output registers as
>> > Richard Sandiford suggested for the AArch64 patches.
>> >
>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>> >> 
>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>
>> >>
>> >> The intrinsics are declared in arm_neon.h .
>> >> A new test is added to check assembler output.
>> >>
>> >> This patch depends on the Arm back-end patche.
>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>> >>
>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
>> >> have commit rights, so if this is ok can someone please commit it 
>> for me?
>> >>
>> >> gcc/ChangeLog:
>> >>
>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>> >>
>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>> >>          (bfloat16x4x2_t): New typedef.
>> >>          (bfloat16x8x2_t): New typedef.
>> >>          (bfloat16x4x3_t): New typedef.
>> >>          (bfloat16x8x3_t): New typedef.
>> >>          (bfloat16x4x4_t): New typedef.
>> >>          (bfloat16x8x4_t): New typedef.
>> >>          (vst2_bf16): New.
>> >>      (vst2q_bf16): New.
>> >>      (vst3_bf16): New.
>> >>      (vst3q_bf16): New.
>> >>      (vst4_bf16): New.
>> >>      (vst4q_bf16): New.
>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>> >>          (VAR13): New.
>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>> >>          * config/arm/arm-simd-builtin-types.def
>> >>          (Bfloat16x2_t): New entry.
>> >>          * config/arm/arm_neon_builtins.def
>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>> >>          (VQ2BF): New iterator.
>> >>          (V_elem): Added V4BF, V8BF.
>> >>          (V_sz_elem): Added V4BF, V8BF.
>> >>          (V_mode_nunits): Added V4BF, V8BF.
>> >>          (q): Added V4BF, V8BF.
>> >>          *config/arm/neon.md (vst2): Used new iterators.
>> >>          (vst3): Used new iterators.
>> >>          (vst3qa): Used new iterators.
>> >>          (vst3qb): Used new iterators.
>> >>          (vst4): Used new iterators.
>> >>          (vst4qa): Used new iterators.
>> >>          (vst4qb): Used new iterators.
>> >>
>> >>
>> >> gcc/testsuite/ChangeLog:
>> >>
>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>> >>
>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
> 
> One thing I just noticed in this and the other arm bfloat16 patches...
> 
> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
> index 
> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
> 100644
> --- a/gcc/config/arm/arm_neon.h
> +++ b/gcc/config/arm/arm_neon.h
> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
> float32x4_t __a, float32x4_t __b,
>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>   }
> 
> +#pragma GCC push_options
> +#pragma GCC target ("arch=armv8.2-a+bf16")
> +
> +typedef struct bfloat16x4x2_t
> +{
> +  bfloat16x4_t val[2];
> +} bfloat16x4x2_t;
> 
> 
> These should be in a new arm_bf16.h file that gets included in the main 
> arm_neon.h file, right?
> I believe the aarch64 versions are implemented that way.
> 
> Otherwise the patch looks good to me.
> Thanks!
> Kyrill
> 
> 
>   +
> +typedef struct bfloat16x8x2_t
> +{
> +  bfloat16x8_t val[2];
> +} bfloat16x8x2_t;
> +
>
Delia Burduv March 3, 2020, 4:20 p.m. UTC | #6
Hi,

I made a mistake in the previous patch. This is the latest version. 
Please let me know if it is ok.

Thanks,
Delia

On 2/21/20 3:18 PM, Delia Burduv wrote:
> Hi Kyrill,
> 
> The arm_bf16.h is only used for scalar operations. That is how the 
> aarch64 versions are implemented too.
> 
> Thanks,
> Delia
> 
> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>> Hi Delia,
>>
>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>> Hi,
>>>
>>> Here is the latest version of the patch. It just has some minor
>>> formatting changes that were brought up by Richard Sandiford in the
>>> AArch64 patches
>>>
>>> Thanks,
>>> Delia
>>>
>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>> > Ping.
>>> >
>>> > I will change the tests to use the exact input and output registers as
>>> > Richard Sandiford suggested for the AArch64 patches.
>>> >
>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>> >> 
>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>
>>> >>
>>> >> The intrinsics are declared in arm_neon.h .
>>> >> A new test is added to check assembler output.
>>> >>
>>> >> This patch depends on the Arm back-end patche.
>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>> >>
>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
>>> >> have commit rights, so if this is ok can someone please commit it 
>>> for me?
>>> >>
>>> >> gcc/ChangeLog:
>>> >>
>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>> >>
>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>> >>          (bfloat16x4x2_t): New typedef.
>>> >>          (bfloat16x8x2_t): New typedef.
>>> >>          (bfloat16x4x3_t): New typedef.
>>> >>          (bfloat16x8x3_t): New typedef.
>>> >>          (bfloat16x4x4_t): New typedef.
>>> >>          (bfloat16x8x4_t): New typedef.
>>> >>          (vst2_bf16): New.
>>> >>      (vst2q_bf16): New.
>>> >>      (vst3_bf16): New.
>>> >>      (vst3q_bf16): New.
>>> >>      (vst4_bf16): New.
>>> >>      (vst4q_bf16): New.
>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>> >>          (VAR13): New.
>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>> >>          * config/arm/arm-simd-builtin-types.def
>>> >>          (Bfloat16x2_t): New entry.
>>> >>          * config/arm/arm_neon_builtins.def
>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>> >>          (VQ2BF): New iterator.
>>> >>          (V_elem): Added V4BF, V8BF.
>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>> >>          (q): Added V4BF, V8BF.
>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>> >>          (vst3): Used new iterators.
>>> >>          (vst3qa): Used new iterators.
>>> >>          (vst3qb): Used new iterators.
>>> >>          (vst4): Used new iterators.
>>> >>          (vst4qa): Used new iterators.
>>> >>          (vst4qb): Used new iterators.
>>> >>
>>> >>
>>> >> gcc/testsuite/ChangeLog:
>>> >>
>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>> >>
>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>
>> One thing I just noticed in this and the other arm bfloat16 patches...
>>
>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>> index 
>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>> 100644
>> --- a/gcc/config/arm/arm_neon.h
>> +++ b/gcc/config/arm/arm_neon.h
>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
>> float32x4_t __a, float32x4_t __b,
>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>>   }
>>
>> +#pragma GCC push_options
>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>> +
>> +typedef struct bfloat16x4x2_t
>> +{
>> +  bfloat16x4_t val[2];
>> +} bfloat16x4x2_t;
>>
>>
>> These should be in a new arm_bf16.h file that gets included in the 
>> main arm_neon.h file, right?
>> I believe the aarch64 versions are implemented that way.
>>
>> Otherwise the patch looks good to me.
>> Thanks!
>> Kyrill
>>
>>
>>   +
>> +typedef struct bfloat16x8x2_t
>> +{
>> +  bfloat16x8_t val[2];
>> +} bfloat16x8x2_t;
>> +
>>
Delia Burduv March 3, 2020, 4:23 p.m. UTC | #7
Sorry, I forgot the attachment.

On 3/3/20 4:20 PM, Delia Burduv wrote:
> Hi,
> 
> I made a mistake in the previous patch. This is the latest version. 
> Please let me know if it is ok.
> 
> Thanks,
> Delia
> 
> On 2/21/20 3:18 PM, Delia Burduv wrote:
>> Hi Kyrill,
>>
>> The arm_bf16.h is only used for scalar operations. That is how the 
>> aarch64 versions are implemented too.
>>
>> Thanks,
>> Delia
>>
>> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>>> Hi Delia,
>>>
>>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>>> Hi,
>>>>
>>>> Here is the latest version of the patch. It just has some minor
>>>> formatting changes that were brought up by Richard Sandiford in the
>>>> AArch64 patches
>>>>
>>>> Thanks,
>>>> Delia
>>>>
>>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>>> > Ping.
>>>> >
>>>> > I will change the tests to use the exact input and output 
>>>> registers as
>>>> > Richard Sandiford suggested for the AArch64 patches.
>>>> >
>>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>>> >> 
>>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>>
>>>> >>
>>>> >> The intrinsics are declared in arm_neon.h .
>>>> >> A new test is added to check assembler output.
>>>> >>
>>>> >> This patch depends on the Arm back-end patche.
>>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>>> >>
>>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
>>>> >> have commit rights, so if this is ok can someone please commit it 
>>>> for me?
>>>> >>
>>>> >> gcc/ChangeLog:
>>>> >>
>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>> >>
>>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>>> >>          (bfloat16x4x2_t): New typedef.
>>>> >>          (bfloat16x8x2_t): New typedef.
>>>> >>          (bfloat16x4x3_t): New typedef.
>>>> >>          (bfloat16x8x3_t): New typedef.
>>>> >>          (bfloat16x4x4_t): New typedef.
>>>> >>          (bfloat16x8x4_t): New typedef.
>>>> >>          (vst2_bf16): New.
>>>> >>      (vst2q_bf16): New.
>>>> >>      (vst3_bf16): New.
>>>> >>      (vst3q_bf16): New.
>>>> >>      (vst4_bf16): New.
>>>> >>      (vst4q_bf16): New.
>>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>>> >>          (VAR13): New.
>>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>>> >>          * config/arm/arm-simd-builtin-types.def
>>>> >>          (Bfloat16x2_t): New entry.
>>>> >>          * config/arm/arm_neon_builtins.def
>>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>>> >>          (VQ2BF): New iterator.
>>>> >>          (V_elem): Added V4BF, V8BF.
>>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>>> >>          (q): Added V4BF, V8BF.
>>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>>> >>          (vst3): Used new iterators.
>>>> >>          (vst3qa): Used new iterators.
>>>> >>          (vst3qb): Used new iterators.
>>>> >>          (vst4): Used new iterators.
>>>> >>          (vst4qa): Used new iterators.
>>>> >>          (vst4qb): Used new iterators.
>>>> >>
>>>> >>
>>>> >> gcc/testsuite/ChangeLog:
>>>> >>
>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>> >>
>>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>>
>>> One thing I just noticed in this and the other arm bfloat16 patches...
>>>
>>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>>> index 
>>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>>> 100644
>>> --- a/gcc/config/arm/arm_neon.h
>>> +++ b/gcc/config/arm/arm_neon.h
>>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
>>> float32x4_t __a, float32x4_t __b,
>>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>>>   }
>>>
>>> +#pragma GCC push_options
>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>> +
>>> +typedef struct bfloat16x4x2_t
>>> +{
>>> +  bfloat16x4_t val[2];
>>> +} bfloat16x4x2_t;
>>>
>>>
>>> These should be in a new arm_bf16.h file that gets included in the 
>>> main arm_neon.h file, right?
>>> I believe the aarch64 versions are implemented that way.
>>>
>>> Otherwise the patch looks good to me.
>>> Thanks!
>>> Kyrill
>>>
>>>
>>>   +
>>> +typedef struct bfloat16x8x2_t
>>> +{
>>> +  bfloat16x8_t val[2];
>>> +} bfloat16x8x2_t;
>>> +
>>>
Delia Burduv March 3, 2020, 5:23 p.m. UTC | #8
Hi,

I noticed that the patch doesn't apply cleanly. I fixed it and this is 
the latest version.

Thanks,
Delia

On 3/3/20 4:23 PM, Delia Burduv wrote:
> Sorry, I forgot the attachment.
> 
> On 3/3/20 4:20 PM, Delia Burduv wrote:
>> Hi,
>>
>> I made a mistake in the previous patch. This is the latest version. 
>> Please let me know if it is ok.
>>
>> Thanks,
>> Delia
>>
>> On 2/21/20 3:18 PM, Delia Burduv wrote:
>>> Hi Kyrill,
>>>
>>> The arm_bf16.h is only used for scalar operations. That is how the 
>>> aarch64 versions are implemented too.
>>>
>>> Thanks,
>>> Delia
>>>
>>> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>>>> Hi Delia,
>>>>
>>>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>>>> Hi,
>>>>>
>>>>> Here is the latest version of the patch. It just has some minor
>>>>> formatting changes that were brought up by Richard Sandiford in the
>>>>> AArch64 patches
>>>>>
>>>>> Thanks,
>>>>> Delia
>>>>>
>>>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>>>> > Ping.
>>>>> >
>>>>> > I will change the tests to use the exact input and output 
>>>>> registers as
>>>>> > Richard Sandiford suggested for the AArch64 patches.
>>>>> >
>>>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>>>> >> 
>>>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>>>
>>>>> >>
>>>>> >> The intrinsics are declared in arm_neon.h .
>>>>> >> A new test is added to check assembler output.
>>>>> >>
>>>>> >> This patch depends on the Arm back-end patche.
>>>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>>>> >>
>>>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
>>>>> >> have commit rights, so if this is ok can someone please commit 
>>>>> it for me?
>>>>> >>
>>>>> >> gcc/ChangeLog:
>>>>> >>
>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>> >>
>>>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>>>> >>          (bfloat16x4x2_t): New typedef.
>>>>> >>          (bfloat16x8x2_t): New typedef.
>>>>> >>          (bfloat16x4x3_t): New typedef.
>>>>> >>          (bfloat16x8x3_t): New typedef.
>>>>> >>          (bfloat16x4x4_t): New typedef.
>>>>> >>          (bfloat16x8x4_t): New typedef.
>>>>> >>          (vst2_bf16): New.
>>>>> >>      (vst2q_bf16): New.
>>>>> >>      (vst3_bf16): New.
>>>>> >>      (vst3q_bf16): New.
>>>>> >>      (vst4_bf16): New.
>>>>> >>      (vst4q_bf16): New.
>>>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>>>> >>          (VAR13): New.
>>>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>>>> >>          * config/arm/arm-simd-builtin-types.def
>>>>> >>          (Bfloat16x2_t): New entry.
>>>>> >>          * config/arm/arm_neon_builtins.def
>>>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>>>> >>          (VQ2BF): New iterator.
>>>>> >>          (V_elem): Added V4BF, V8BF.
>>>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>>>> >>          (q): Added V4BF, V8BF.
>>>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>>>> >>          (vst3): Used new iterators.
>>>>> >>          (vst3qa): Used new iterators.
>>>>> >>          (vst3qb): Used new iterators.
>>>>> >>          (vst4): Used new iterators.
>>>>> >>          (vst4qa): Used new iterators.
>>>>> >>          (vst4qb): Used new iterators.
>>>>> >>
>>>>> >>
>>>>> >> gcc/testsuite/ChangeLog:
>>>>> >>
>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>> >>
>>>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>>>
>>>> One thing I just noticed in this and the other arm bfloat16 patches...
>>>>
>>>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>>>> index 
>>>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>>>> 100644
>>>> --- a/gcc/config/arm/arm_neon.h
>>>> +++ b/gcc/config/arm/arm_neon.h
>>>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
>>>> float32x4_t __a, float32x4_t __b,
>>>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>>>>   }
>>>>
>>>> +#pragma GCC push_options
>>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>>> +
>>>> +typedef struct bfloat16x4x2_t
>>>> +{
>>>> +  bfloat16x4_t val[2];
>>>> +} bfloat16x4x2_t;
>>>>
>>>>
>>>> These should be in a new arm_bf16.h file that gets included in the 
>>>> main arm_neon.h file, right?
>>>> I believe the aarch64 versions are implemented that way.
>>>>
>>>> Otherwise the patch looks good to me.
>>>> Thanks!
>>>> Kyrill
>>>>
>>>>
>>>>   +
>>>> +typedef struct bfloat16x8x2_t
>>>> +{
>>>> +  bfloat16x8_t val[2];
>>>> +} bfloat16x8x2_t;
>>>> +
>>>>
Kyrill Tkachov March 4, 2020, 5:20 p.m. UTC | #9
Hi Delia,

On 3/3/20 5:23 PM, Delia Burduv wrote:
> Hi,
>
> I noticed that the patch doesn't apply cleanly. I fixed it and this is 
> the latest version.
>
> Thanks,
> Delia
>
> On 3/3/20 4:23 PM, Delia Burduv wrote:
>> Sorry, I forgot the attachment.
>>
>> On 3/3/20 4:20 PM, Delia Burduv wrote:
>>> Hi,
>>>
>>> I made a mistake in the previous patch. This is the latest version. 
>>> Please let me know if it is ok.
>>>
>>> Thanks,
>>> Delia
>>>
>>> On 2/21/20 3:18 PM, Delia Burduv wrote:
>>>> Hi Kyrill,
>>>>
>>>> The arm_bf16.h is only used for scalar operations. That is how the 
>>>> aarch64 versions are implemented too.
>>>>
>>>> Thanks,
>>>> Delia
>>>>
>>>> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>>>>> Hi Delia,
>>>>>
>>>>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Here is the latest version of the patch. It just has some minor
>>>>>> formatting changes that were brought up by Richard Sandiford in the
>>>>>> AArch64 patches
>>>>>>
>>>>>> Thanks,
>>>>>> Delia
>>>>>>
>>>>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>>>>> > Ping.
>>>>>> >
>>>>>> > I will change the tests to use the exact input and output 
>>>>>> registers as
>>>>>> > Richard Sandiford suggested for the AArch64 patches.
>>>>>> >
>>>>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>>>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>>>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>>>>> >> 
>>>>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>>>>
>>>>>> >>
>>>>>> >> The intrinsics are declared in arm_neon.h .
>>>>>> >> A new test is added to check assembler output.
>>>>>> >>
>>>>>> >> This patch depends on the Arm back-end patche.
>>>>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>>>>> >>
>>>>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I 
>>>>>> don't
>>>>>> >> have commit rights, so if this is ok can someone please commit 
>>>>>> it for me?
>>>>>> >>
>>>>>> >> gcc/ChangeLog:
>>>>>> >>
>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>> >>
>>>>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>>>>> >>          (bfloat16x4x2_t): New typedef.
>>>>>> >>          (bfloat16x8x2_t): New typedef.
>>>>>> >>          (bfloat16x4x3_t): New typedef.
>>>>>> >>          (bfloat16x8x3_t): New typedef.
>>>>>> >>          (bfloat16x4x4_t): New typedef.
>>>>>> >>          (bfloat16x8x4_t): New typedef.
>>>>>> >>          (vst2_bf16): New.
>>>>>> >>      (vst2q_bf16): New.
>>>>>> >>      (vst3_bf16): New.
>>>>>> >>      (vst3q_bf16): New.
>>>>>> >>      (vst4_bf16): New.
>>>>>> >>      (vst4q_bf16): New.
>>>>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>>>>> >>          (VAR13): New.
>>>>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>>>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>>>>> >>          * config/arm/arm-simd-builtin-types.def
>>>>>> >>          (Bfloat16x2_t): New entry.
>>>>>> >>          * config/arm/arm_neon_builtins.def
>>>>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>>>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>>>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>>>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>>>>> >>          (VQ2BF): New iterator.
>>>>>> >>          (V_elem): Added V4BF, V8BF.
>>>>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>>>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>>>>> >>          (q): Added V4BF, V8BF.
>>>>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>>>>> >>          (vst3): Used new iterators.
>>>>>> >>          (vst3qa): Used new iterators.
>>>>>> >>          (vst3qb): Used new iterators.
>>>>>> >>          (vst4): Used new iterators.
>>>>>> >>          (vst4qa): Used new iterators.
>>>>>> >>          (vst4qb): Used new iterators.
>>>>>> >>
>>>>>> >>
>>>>>> >> gcc/testsuite/ChangeLog:
>>>>>> >>
>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>> >>
>>>>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>>>>
>>>>> One thing I just noticed in this and the other arm bfloat16 
>>>>> patches...
>>>>>
>>>>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>>>>> index 
>>>>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>>>>> 100644
>>>>> --- a/gcc/config/arm/arm_neon.h
>>>>> +++ b/gcc/config/arm/arm_neon.h
>>>>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
>>>>> float32x4_t __a, float32x4_t __b,
>>>>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>>>>>   }
>>>>>
>>>>> +#pragma GCC push_options
>>>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>>>> +
>>>>> +typedef struct bfloat16x4x2_t
>>>>> +{
>>>>> +  bfloat16x4_t val[2];
>>>>> +} bfloat16x4x2_t;
>>>>>
>>>>>
>>>>> These should be in a new arm_bf16.h file that gets included in the 
>>>>> main arm_neon.h file, right?
>>>>> I believe the aarch64 versions are implemented that way.
>>>>>
>>>>> Otherwise the patch looks good to me.
>>>>> Thanks!
>>>>> Kyrill
>>>>>
>>>>>
>>>>>   +
>>>>> +typedef struct bfloat16x8x2_t
>>>>> +{
>>>>> +  bfloat16x8_t val[2];
>>>>> +} bfloat16x8x2_t;
>>>>> +
>>>>>

diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..b52ecfb959776fd04c7c33908cb7f8898ec3fe0b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
@@ -0,0 +1,84 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon } */
+/* { dg-additional-options "-save-temps" } */
+/* { dg-final { check-function-bodies "**" "" {-O[^0]} } } */
+


I don't see the check-function-bodies checks being performed in my testing. Changing the directives order to:
/* { dg-do assemble } */
/* { dg-options "-save-temps" }  */
/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
/* { dg-add-options arm_v8_2a_bf16_neon } */
/* { dg-final { check-function-bodies "**" "" } } */

makes them run but they fail, I think because this test also needs an -O2 option, same as the load intrinsics patch. Can you please adjust the order of the dg-* directives in the test and the function body scan tests to match the codegen?
With this, it will be ready to go :)
Thanks,
Kyrill



  +#include "arm_neon.h"
+
+/*
+**test_vst2_bf16:
+**	...
+**	vst2.16	{d16-d17}, \[r0\]
+**	...
+*/
+void
+test_vst2_bf16 (bfloat16_t *ptr, bfloat16x4x2_t val)
+{
+  vst2_bf16 (ptr, val);
+}
+
Delia Burduv March 5, 2020, 3:50 p.m. UTC | #10
Hi,

This is the latest version of the patch. I am forcing -mfloat-abi=hard 
because the register allocator behaves differently depending on which 
float-abi is used.

Thanks,
Delia

On 3/4/20 5:20 PM, Kyrill Tkachov wrote:
> Hi Delia,
> 
> On 3/3/20 5:23 PM, Delia Burduv wrote:
>> Hi,
>>
>> I noticed that the patch doesn't apply cleanly. I fixed it and this is 
>> the latest version.
>>
>> Thanks,
>> Delia
>>
>> On 3/3/20 4:23 PM, Delia Burduv wrote:
>>> Sorry, I forgot the attachment.
>>>
>>> On 3/3/20 4:20 PM, Delia Burduv wrote:
>>>> Hi,
>>>>
>>>> I made a mistake in the previous patch. This is the latest version. 
>>>> Please let me know if it is ok.
>>>>
>>>> Thanks,
>>>> Delia
>>>>
>>>> On 2/21/20 3:18 PM, Delia Burduv wrote:
>>>>> Hi Kyrill,
>>>>>
>>>>> The arm_bf16.h is only used for scalar operations. That is how the 
>>>>> aarch64 versions are implemented too.
>>>>>
>>>>> Thanks,
>>>>> Delia
>>>>>
>>>>> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>>>>>> Hi Delia,
>>>>>>
>>>>>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here is the latest version of the patch. It just has some minor
>>>>>>> formatting changes that were brought up by Richard Sandiford in the
>>>>>>> AArch64 patches
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Delia
>>>>>>>
>>>>>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>>>>>> > Ping.
>>>>>>> >
>>>>>>> > I will change the tests to use the exact input and output 
>>>>>>> registers as
>>>>>>> > Richard Sandiford suggested for the AArch64 patches.
>>>>>>> >
>>>>>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>>>>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>>>>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>>>>>> >> 
>>>>>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>>>>>
>>>>>>> >>
>>>>>>> >> The intrinsics are declared in arm_neon.h .
>>>>>>> >> A new test is added to check assembler output.
>>>>>>> >>
>>>>>>> >> This patch depends on the Arm back-end patche.
>>>>>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>>>>>> >>
>>>>>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I 
>>>>>>> don't
>>>>>>> >> have commit rights, so if this is ok can someone please commit 
>>>>>>> it for me?
>>>>>>> >>
>>>>>>> >> gcc/ChangeLog:
>>>>>>> >>
>>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>>> >>
>>>>>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>>>>>> >>          (bfloat16x4x2_t): New typedef.
>>>>>>> >>          (bfloat16x8x2_t): New typedef.
>>>>>>> >>          (bfloat16x4x3_t): New typedef.
>>>>>>> >>          (bfloat16x8x3_t): New typedef.
>>>>>>> >>          (bfloat16x4x4_t): New typedef.
>>>>>>> >>          (bfloat16x8x4_t): New typedef.
>>>>>>> >>          (vst2_bf16): New.
>>>>>>> >>      (vst2q_bf16): New.
>>>>>>> >>      (vst3_bf16): New.
>>>>>>> >>      (vst3q_bf16): New.
>>>>>>> >>      (vst4_bf16): New.
>>>>>>> >>      (vst4q_bf16): New.
>>>>>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>>>>>> >>          (VAR13): New.
>>>>>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>>>>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>>>>>> >>          * config/arm/arm-simd-builtin-types.def
>>>>>>> >>          (Bfloat16x2_t): New entry.
>>>>>>> >>          * config/arm/arm_neon_builtins.def
>>>>>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>>>>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>>>>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>>>>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>>>>>> >>          (VQ2BF): New iterator.
>>>>>>> >>          (V_elem): Added V4BF, V8BF.
>>>>>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>>>>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>>>>>> >>          (q): Added V4BF, V8BF.
>>>>>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>>>>>> >>          (vst3): Used new iterators.
>>>>>>> >>          (vst3qa): Used new iterators.
>>>>>>> >>          (vst3qb): Used new iterators.
>>>>>>> >>          (vst4): Used new iterators.
>>>>>>> >>          (vst4qa): Used new iterators.
>>>>>>> >>          (vst4qb): Used new iterators.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> gcc/testsuite/ChangeLog:
>>>>>>> >>
>>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>>> >>
>>>>>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>>>>>
>>>>>> One thing I just noticed in this and the other arm bfloat16 
>>>>>> patches...
>>>>>>
>>>>>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>>>>>> index 
>>>>>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>>>>>> 100644
>>>>>> --- a/gcc/config/arm/arm_neon.h
>>>>>> +++ b/gcc/config/arm/arm_neon.h
>>>>>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
>>>>>> float32x4_t __a, float32x4_t __b,
>>>>>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>>>>>>   }
>>>>>>
>>>>>> +#pragma GCC push_options
>>>>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>>>>> +
>>>>>> +typedef struct bfloat16x4x2_t
>>>>>> +{
>>>>>> +  bfloat16x4_t val[2];
>>>>>> +} bfloat16x4x2_t;
>>>>>>
>>>>>>
>>>>>> These should be in a new arm_bf16.h file that gets included in the 
>>>>>> main arm_neon.h file, right?
>>>>>> I believe the aarch64 versions are implemented that way.
>>>>>>
>>>>>> Otherwise the patch looks good to me.
>>>>>> Thanks!
>>>>>> Kyrill
>>>>>>
>>>>>>
>>>>>>   +
>>>>>> +typedef struct bfloat16x8x2_t
>>>>>> +{
>>>>>> +  bfloat16x8_t val[2];
>>>>>> +} bfloat16x8x2_t;
>>>>>> +
>>>>>>
> 
> diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c 
> b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..b52ecfb959776fd04c7c33908cb7f8898ec3fe0b 
> 
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
> @@ -0,0 +1,84 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> +/* { dg-add-options arm_v8_2a_bf16_neon } */
> +/* { dg-additional-options "-save-temps" } */
> +/* { dg-final { check-function-bodies "**" "" {-O[^0]} } } */
> +
> 
> 
> I don't see the check-function-bodies checks being performed in my 
> testing. Changing the directives order to:
> /* { dg-do assemble } */
> /* { dg-options "-save-temps" }  */
> /* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> /* { dg-add-options arm_v8_2a_bf16_neon } */
> /* { dg-final { check-function-bodies "**" "" } } */
> 
> makes them run but they fail, I think because this test also needs an 
> -O2 option, same as the load intrinsics patch. Can you please adjust the 
> order of the dg-* directives in the test and the function body scan 
> tests to match the codegen?
> With this, it will be ready to go :)
> Thanks,
> Kyrill
> 
> 
> 
>   +#include "arm_neon.h"
> +
> +/*
> +**test_vst2_bf16:
> +**    ...
> +**    vst2.16    {d16-d17}, \[r0\]
> +**    ...
> +*/
> +void
> +test_vst2_bf16 (bfloat16_t *ptr, bfloat16x4x2_t val)
> +{
> +  vst2_bf16 (ptr, val);
> +}
> +
>
Delia Burduv March 5, 2020, 3:53 p.m. UTC | #11
Hi,

This is the latest version of the patch. I am forcing -mfloat-abi=hard 
because the register allocator behaves differently depending on the 
float-abi used.

Thanks,
Delia

On 3/4/20 5:20 PM, Kyrill Tkachov wrote:
> Hi Delia,
> 
> On 3/3/20 5:23 PM, Delia Burduv wrote:
>> Hi,
>>
>> I noticed that the patch doesn't apply cleanly. I fixed it and this is 
>> the latest version.
>>
>> Thanks,
>> Delia
>>
>> On 3/3/20 4:23 PM, Delia Burduv wrote:
>>> Sorry, I forgot the attachment.
>>>
>>> On 3/3/20 4:20 PM, Delia Burduv wrote:
>>>> Hi,
>>>>
>>>> I made a mistake in the previous patch. This is the latest version. 
>>>> Please let me know if it is ok.
>>>>
>>>> Thanks,
>>>> Delia
>>>>
>>>> On 2/21/20 3:18 PM, Delia Burduv wrote:
>>>>> Hi Kyrill,
>>>>>
>>>>> The arm_bf16.h is only used for scalar operations. That is how the 
>>>>> aarch64 versions are implemented too.
>>>>>
>>>>> Thanks,
>>>>> Delia
>>>>>
>>>>> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>>>>>> Hi Delia,
>>>>>>
>>>>>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Here is the latest version of the patch. It just has some minor
>>>>>>> formatting changes that were brought up by Richard Sandiford in the
>>>>>>> AArch64 patches
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Delia
>>>>>>>
>>>>>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>>>>>> > Ping.
>>>>>>> >
>>>>>>> > I will change the tests to use the exact input and output 
>>>>>>> registers as
>>>>>>> > Richard Sandiford suggested for the AArch64 patches.
>>>>>>> >
>>>>>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>>>>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>>>>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>>>>>> >> 
>>>>>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>>>>>
>>>>>>> >>
>>>>>>> >> The intrinsics are declared in arm_neon.h .
>>>>>>> >> A new test is added to check assembler output.
>>>>>>> >>
>>>>>>> >> This patch depends on the Arm back-end patche.
>>>>>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>>>>>> >>
>>>>>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. I 
>>>>>>> don't
>>>>>>> >> have commit rights, so if this is ok can someone please commit 
>>>>>>> it for me?
>>>>>>> >>
>>>>>>> >> gcc/ChangeLog:
>>>>>>> >>
>>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>>> >>
>>>>>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>>>>>> >>          (bfloat16x4x2_t): New typedef.
>>>>>>> >>          (bfloat16x8x2_t): New typedef.
>>>>>>> >>          (bfloat16x4x3_t): New typedef.
>>>>>>> >>          (bfloat16x8x3_t): New typedef.
>>>>>>> >>          (bfloat16x4x4_t): New typedef.
>>>>>>> >>          (bfloat16x8x4_t): New typedef.
>>>>>>> >>          (vst2_bf16): New.
>>>>>>> >>      (vst2q_bf16): New.
>>>>>>> >>      (vst3_bf16): New.
>>>>>>> >>      (vst3q_bf16): New.
>>>>>>> >>      (vst4_bf16): New.
>>>>>>> >>      (vst4q_bf16): New.
>>>>>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>>>>>> >>          (VAR13): New.
>>>>>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>>>>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>>>>>> >>          * config/arm/arm-simd-builtin-types.def
>>>>>>> >>          (Bfloat16x2_t): New entry.
>>>>>>> >>          * config/arm/arm_neon_builtins.def
>>>>>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>>>>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>>>>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>>>>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>>>>>> >>          (VQ2BF): New iterator.
>>>>>>> >>          (V_elem): Added V4BF, V8BF.
>>>>>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>>>>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>>>>>> >>          (q): Added V4BF, V8BF.
>>>>>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>>>>>> >>          (vst3): Used new iterators.
>>>>>>> >>          (vst3qa): Used new iterators.
>>>>>>> >>          (vst3qb): Used new iterators.
>>>>>>> >>          (vst4): Used new iterators.
>>>>>>> >>          (vst4qa): Used new iterators.
>>>>>>> >>          (vst4qb): Used new iterators.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> gcc/testsuite/ChangeLog:
>>>>>>> >>
>>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>>> >>
>>>>>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>>>>>
>>>>>> One thing I just noticed in this and the other arm bfloat16 
>>>>>> patches...
>>>>>>
>>>>>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>>>>>> index 
>>>>>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>>>>>> 100644
>>>>>> --- a/gcc/config/arm/arm_neon.h
>>>>>> +++ b/gcc/config/arm/arm_neon.h
>>>>>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t __r, 
>>>>>> float32x4_t __a, float32x4_t __b,
>>>>>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, __index);
>>>>>>   }
>>>>>>
>>>>>> +#pragma GCC push_options
>>>>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>>>>> +
>>>>>> +typedef struct bfloat16x4x2_t
>>>>>> +{
>>>>>> +  bfloat16x4_t val[2];
>>>>>> +} bfloat16x4x2_t;
>>>>>>
>>>>>>
>>>>>> These should be in a new arm_bf16.h file that gets included in the 
>>>>>> main arm_neon.h file, right?
>>>>>> I believe the aarch64 versions are implemented that way.
>>>>>>
>>>>>> Otherwise the patch looks good to me.
>>>>>> Thanks!
>>>>>> Kyrill
>>>>>>
>>>>>>
>>>>>>   +
>>>>>> +typedef struct bfloat16x8x2_t
>>>>>> +{
>>>>>> +  bfloat16x8_t val[2];
>>>>>> +} bfloat16x8x2_t;
>>>>>> +
>>>>>>
> 
> diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c 
> b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..b52ecfb959776fd04c7c33908cb7f8898ec3fe0b 
> 
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
> @@ -0,0 +1,84 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> +/* { dg-add-options arm_v8_2a_bf16_neon } */
> +/* { dg-additional-options "-save-temps" } */
> +/* { dg-final { check-function-bodies "**" "" {-O[^0]} } } */
> +
> 
> 
> I don't see the check-function-bodies checks being performed in my 
> testing. Changing the directives order to:
> /* { dg-do assemble } */
> /* { dg-options "-save-temps" }  */
> /* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> /* { dg-add-options arm_v8_2a_bf16_neon } */
> /* { dg-final { check-function-bodies "**" "" } } */
> 
> makes them run but they fail, I think because this test also needs an 
> -O2 option, same as the load intrinsics patch. Can you please adjust the 
> order of the dg-* directives in the test and the function body scan 
> tests to match the codegen?
> With this, it will be ready to go :)
> Thanks,
> Kyrill
> 
> 
> 
>   +#include "arm_neon.h"
> +
> +/*
> +**test_vst2_bf16:
> +**    ...
> +**    vst2.16    {d16-d17}, \[r0\]
> +**    ...
> +*/
> +void
> +test_vst2_bf16 (bfloat16_t *ptr, bfloat16x4x2_t val)
> +{
> +  vst2_bf16 (ptr, val);
> +}
> +
>
Kyrill Tkachov March 6, 2020, 10:44 a.m. UTC | #12
Hi Delia,

On 3/5/20 3:53 PM, Delia Burduv wrote:
> Hi,
>
> This is the latest version of the patch. I am forcing -mfloat-abi=hard 
> because the register allocator behaves differently depending on the 
> float-abi used.

Thanks, I've pushed it to master with an updated ChangeLog reflecting 
the recent changes. In the future, please send an updated ChangeLog 
whenever something changes in the patches.

Thanks again!

Kyrill


2020-03-06  Delia Burduv  <delia.burduv@arm.com>

     * config/arm/arm_neon.h (bfloat16x4x2_t): New typedef.
     (bfloat16x8x2_t): New typedef.
     (bfloat16x4x3_t): New typedef.
     (bfloat16x8x3_t): New typedef.
     (bfloat16x4x4_t): New typedef.
     (bfloat16x8x4_t): New typedef.
     (vst2_bf16): New.
     (vst2q_bf16): New.
     (vst3_bf16): New.
     (vst3q_bf16): New.
     (vst4_bf16): New.
     (vst4q_bf16): New.
     * config/arm/arm-builtins.c (v2bf_UP): Define.
     (VAR13): New.
     (arm_init_simd_builtin_types): Init Bfloat16x2_t eltype.
     * config/arm/arm-modes.def (V2BF): New mode.
     * config/arm/arm-simd-builtin-types.def
     (Bfloat16x2_t): New entry.
     * config/arm/arm_neon_builtins.def
     (vst2): Changed to VAR13 and added v4bf, v8bf
     (vst3): Changed to VAR13 and added v4bf, v8bf
     (vst4): Changed to VAR13 and added v4bf, v8bf
     * config/arm/iterators.md (VDXBF): New iterator.
     (VQ2BF): New iterator.
     *config/arm/neon.md (neon_vst2<mode>): Used new iterators.
     (neon_vst2<mode>): Used new iterators.
     (neon_vst3<mode>): Used new iterators.
     (neon_vst3<mode>): Used new iterators.
     (neon_vst3qa<mode>): Used new iterators.
     (neon_vst3qb<mode>): Used new iterators.
     (neon_vst4<mode>): Used new iterators.
     (neon_vst4<mode>): Used new iterators.
     (neon_vst4qa<mode>): Used new iterators.
     (neon_vst4qb<mode>): Used new iterators.



>
> Thanks,
> Delia
>
> On 3/4/20 5:20 PM, Kyrill Tkachov wrote:
>> Hi Delia,
>>
>> On 3/3/20 5:23 PM, Delia Burduv wrote:
>>> Hi,
>>>
>>> I noticed that the patch doesn't apply cleanly. I fixed it and this 
>>> is the latest version.
>>>
>>> Thanks,
>>> Delia
>>>
>>> On 3/3/20 4:23 PM, Delia Burduv wrote:
>>>> Sorry, I forgot the attachment.
>>>>
>>>> On 3/3/20 4:20 PM, Delia Burduv wrote:
>>>>> Hi,
>>>>>
>>>>> I made a mistake in the previous patch. This is the latest 
>>>>> version. Please let me know if it is ok.
>>>>>
>>>>> Thanks,
>>>>> Delia
>>>>>
>>>>> On 2/21/20 3:18 PM, Delia Burduv wrote:
>>>>>> Hi Kyrill,
>>>>>>
>>>>>> The arm_bf16.h is only used for scalar operations. That is how 
>>>>>> the aarch64 versions are implemented too.
>>>>>>
>>>>>> Thanks,
>>>>>> Delia
>>>>>>
>>>>>> On 2/21/20 2:06 PM, Kyrill Tkachov wrote:
>>>>>>> Hi Delia,
>>>>>>>
>>>>>>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Here is the latest version of the patch. It just has some minor
>>>>>>>> formatting changes that were brought up by Richard Sandiford in 
>>>>>>>> the
>>>>>>>> AArch64 patches
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Delia
>>>>>>>>
>>>>>>>> On 1/22/20 5:29 PM, Delia Burduv wrote:
>>>>>>>> > Ping.
>>>>>>>> >
>>>>>>>> > I will change the tests to use the exact input and output 
>>>>>>>> registers as
>>>>>>>> > Richard Sandiford suggested for the AArch64 patches.
>>>>>>>> >
>>>>>>>> > On 12/20/19 6:46 PM, Delia Burduv wrote:
>>>>>>>> >> This patch adds the ARMv8.6 ACLE BFloat16 store intrinsics
>>>>>>>> >> vst<n>{q}_bf16 as part of the BFloat16 extension.
>>>>>>>> >> 
>>>>>>>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics) 
>>>>>>>>
>>>>>>>> >>
>>>>>>>> >> The intrinsics are declared in arm_neon.h .
>>>>>>>> >> A new test is added to check assembler output.
>>>>>>>> >>
>>>>>>>> >> This patch depends on the Arm back-end patche.
>>>>>>>> >> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>>>>>>>> >>
>>>>>>>> >> Tested for regression on arm-none-eabi and armeb-none-eabi. 
>>>>>>>> I don't
>>>>>>>> >> have commit rights, so if this is ok can someone please 
>>>>>>>> commit it for me?
>>>>>>>> >>
>>>>>>>> >> gcc/ChangeLog:
>>>>>>>> >>
>>>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>>>> >>
>>>>>>>> >>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>>>>>>> >>          (bfloat16x4x2_t): New typedef.
>>>>>>>> >>          (bfloat16x8x2_t): New typedef.
>>>>>>>> >>          (bfloat16x4x3_t): New typedef.
>>>>>>>> >>          (bfloat16x8x3_t): New typedef.
>>>>>>>> >>          (bfloat16x4x4_t): New typedef.
>>>>>>>> >>          (bfloat16x8x4_t): New typedef.
>>>>>>>> >>          (vst2_bf16): New.
>>>>>>>> >>      (vst2q_bf16): New.
>>>>>>>> >>      (vst3_bf16): New.
>>>>>>>> >>      (vst3q_bf16): New.
>>>>>>>> >>      (vst4_bf16): New.
>>>>>>>> >>      (vst4q_bf16): New.
>>>>>>>> >>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>>>>>>>> >>          (VAR13): New.
>>>>>>>> >>          (arm_simd_types[Bfloat16x2_t]):New type.
>>>>>>>> >>          * config/arm/arm-modes.def (V2BF): New mode.
>>>>>>>> >>          * config/arm/arm-simd-builtin-types.def
>>>>>>>> >>          (Bfloat16x2_t): New entry.
>>>>>>>> >>          * config/arm/arm_neon_builtins.def
>>>>>>>> >>          (vst2): Changed to VAR13 and added v4bf, v8bf
>>>>>>>> >>          (vst3): Changed to VAR13 and added v4bf, v8bf
>>>>>>>> >>          (vst4): Changed to VAR13 and added v4bf, v8bf
>>>>>>>> >>          * config/arm/iterators.md (VDXBF): New iterator.
>>>>>>>> >>          (VQ2BF): New iterator.
>>>>>>>> >>          (V_elem): Added V4BF, V8BF.
>>>>>>>> >>          (V_sz_elem): Added V4BF, V8BF.
>>>>>>>> >>          (V_mode_nunits): Added V4BF, V8BF.
>>>>>>>> >>          (q): Added V4BF, V8BF.
>>>>>>>> >>          *config/arm/neon.md (vst2): Used new iterators.
>>>>>>>> >>          (vst3): Used new iterators.
>>>>>>>> >>          (vst3qa): Used new iterators.
>>>>>>>> >>          (vst3qb): Used new iterators.
>>>>>>>> >>          (vst4): Used new iterators.
>>>>>>>> >>          (vst4qa): Used new iterators.
>>>>>>>> >>          (vst4qb): Used new iterators.
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> gcc/testsuite/ChangeLog:
>>>>>>>> >>
>>>>>>>> >> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>>>>>>>> >>
>>>>>>>> >>      * gcc.target/arm/simd/bf16_vstn_1.c: New test.
>>>>>>>
>>>>>>> One thing I just noticed in this and the other arm bfloat16 
>>>>>>> patches...
>>>>>>>
>>>>>>> diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
>>>>>>> index 
>>>>>>> 3c78f435009ab027f92693d00ab5b40960d5419d..fd81c18948db3a7f6e8e863d32511f75bf950e6a 
>>>>>>> 100644
>>>>>>> --- a/gcc/config/arm/arm_neon.h
>>>>>>> +++ b/gcc/config/arm/arm_neon.h
>>>>>>> @@ -18742,6 +18742,89 @@ vcmlaq_rot270_laneq_f32 (float32x4_t 
>>>>>>> __r, float32x4_t __a, float32x4_t __b,
>>>>>>>     return __builtin_neon_vcmla_lane270v4sf (__r, __a, __b, 
>>>>>>> __index);
>>>>>>>   }
>>>>>>>
>>>>>>> +#pragma GCC push_options
>>>>>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>>>>>> +
>>>>>>> +typedef struct bfloat16x4x2_t
>>>>>>> +{
>>>>>>> +  bfloat16x4_t val[2];
>>>>>>> +} bfloat16x4x2_t;
>>>>>>>
>>>>>>>
>>>>>>> These should be in a new arm_bf16.h file that gets included in 
>>>>>>> the main arm_neon.h file, right?
>>>>>>> I believe the aarch64 versions are implemented that way.
>>>>>>>
>>>>>>> Otherwise the patch looks good to me.
>>>>>>> Thanks!
>>>>>>> Kyrill
>>>>>>>
>>>>>>>
>>>>>>>   +
>>>>>>> +typedef struct bfloat16x8x2_t
>>>>>>> +{
>>>>>>> +  bfloat16x8_t val[2];
>>>>>>> +} bfloat16x8x2_t;
>>>>>>> +
>>>>>>>
>>
>> diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c 
>> b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
>> new file mode 100644
>> index 
>> 0000000000000000000000000000000000000000..b52ecfb959776fd04c7c33908cb7f8898ec3fe0b 
>>
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
>> @@ -0,0 +1,84 @@
>> +/* { dg-do assemble } */
>> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
>> +/* { dg-add-options arm_v8_2a_bf16_neon } */
>> +/* { dg-additional-options "-save-temps" } */
>> +/* { dg-final { check-function-bodies "**" "" {-O[^0]} } } */
>> +
>>
>>
>> I don't see the check-function-bodies checks being performed in my 
>> testing. Changing the directives order to:
>> /* { dg-do assemble } */
>> /* { dg-options "-save-temps" }  */
>> /* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
>> /* { dg-add-options arm_v8_2a_bf16_neon } */
>> /* { dg-final { check-function-bodies "**" "" } } */
>>
>> makes them run but they fail, I think because this test also needs an 
>> -O2 option, same as the load intrinsics patch. Can you please adjust 
>> the order of the dg-* directives in the test and the function body 
>> scan tests to match the codegen?
>> With this, it will be ready to go :)
>> Thanks,
>> Kyrill
>>
>>
>>
>>   +#include "arm_neon.h"
>> +
>> +/*
>> +**test_vst2_bf16:
>> +**    ...
>> +**    vst2.16    {d16-d17}, \[r0\]
>> +**    ...
>> +*/
>> +void
>> +test_vst2_bf16 (bfloat16_t *ptr, bfloat16x4x2_t val)
>> +{
>> +  vst2_bf16 (ptr, val);
>> +}
>> +
>>
diff mbox series

Patch

diff --git a/gcc/config/arm/arm-builtins.c b/gcc/config/arm/arm-builtins.c
index df09a6bb1fce5f9216337d71cba51a890fd57baf..551d76a44fadc58a35a6155486ec1fb16c959da0 100644
--- a/gcc/config/arm/arm-builtins.c
+++ b/gcc/config/arm/arm-builtins.c
@@ -318,6 +318,7 @@  arm_set_sat_qualifiers[SIMD_MAX_BUILTIN_ARGS]
 #define v4bf_UP  E_V4BFmode
 #define v2si_UP  E_V2SImode
 #define v2sf_UP  E_V2SFmode
+#define v2bf_UP  E_V2BFmode
 #define di_UP    E_DImode
 #define v16qi_UP E_V16QImode
 #define v8hi_UP  E_V8HImode
@@ -381,6 +382,9 @@  typedef struct {
 #define VAR12(T, N, A, B, C, D, E, F, G, H, I, J, K, L) \
   VAR11 (T, N, A, B, C, D, E, F, G, H, I, J, K) \
   VAR1 (T, N, L)
+#define VAR13(T, N, A, B, C, D, E, F, G, H, I, J, K, L, M) \
+  VAR12 (T, N, A, B, C, D, E, F, G, H, I, J, K, L) \
+  VAR1 (T, N, M)
 
 /* The builtin data can be found in arm_neon_builtins.def, arm_vfp_builtins.def
    and arm_acle_builtins.def.  The entries in arm_neon_builtins.def require
@@ -1013,6 +1017,7 @@  arm_init_simd_builtin_types (void)
   arm_simd_types[Float32x4_t].eltype = float_type_node;
 
   /* Init Bfloat vector types with underlying __bf16 scalar type.  */
+  arm_simd_types[Bfloat16x2_t].eltype = arm_bf16_type_node;
   arm_simd_types[Bfloat16x4_t].eltype = arm_bf16_type_node;
   arm_simd_types[Bfloat16x8_t].eltype = arm_bf16_type_node;
 
diff --git a/gcc/config/arm/arm-modes.def b/gcc/config/arm/arm-modes.def
index 80c3c1a6eb258d116b07ad71fafafc9befb76e8b..9533d177059d98fa2a9e9d1d6321f3d92dad7592 100644
--- a/gcc/config/arm/arm-modes.def
+++ b/gcc/config/arm/arm-modes.def
@@ -80,6 +80,7 @@  VECTOR_MODE (FLOAT, HF, 2);   /*                 V2HF */
 
 FLOAT_MODE (BF, 2, 0);
 ADJUST_FLOAT_FORMAT (BF, &arm_bfloat_half_format);
+VECTOR_MODE (FLOAT, BF, 2);   /*                 V2BF.  */
 VECTOR_MODE (FLOAT, BF, 4);   /*		 V4BF.  */
 VECTOR_MODE (FLOAT, BF, 8);   /*		 V8BF.  */
 
diff --git a/gcc/config/arm/arm-simd-builtin-types.def b/gcc/config/arm/arm-simd-builtin-types.def
index ee240f85c5618417fff039ec43b81641b187c126..f52f679156d5041ab109909393dc37fda33a390d 100644
--- a/gcc/config/arm/arm-simd-builtin-types.def
+++ b/gcc/config/arm/arm-simd-builtin-types.def
@@ -48,5 +48,6 @@ 
   ENTRY (Float16x8_t, V8HF, none, 128, float16, 19)
   ENTRY (Float32x4_t, V4SF, none, 128, float32, 19)
 
+  ENTRY (Bfloat16x2_t, V2BF, none, 32, bfloat16, 20)
   ENTRY (Bfloat16x4_t, V4BF, none, 64, bfloat16, 20)
   ENTRY (Bfloat16x8_t, V8BF, none, 128, bfloat16, 20)
diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index 71e7568e4315a9354062dee5442ca4af9d9660a9..2bed33800facb65c20ea95646a5c4053dd5673de 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -91,6 +91,85 @@  typedef float float32_t;
 #ifdef __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
 typedef __simd128_bfloat16_t bfloat16x8_t;
 typedef __simd64_bfloat16_t bfloat16x4_t;
+
+typedef struct bfloat16x4x2_t
+{
+  bfloat16x4_t val[2];
+} bfloat16x4x2_t;
+
+typedef struct bfloat16x8x2_t
+{
+  bfloat16x8_t val[2];
+} bfloat16x8x2_t;
+
+typedef struct bfloat16x4x3_t
+{
+  bfloat16x4_t val[3];
+} bfloat16x4x3_t;
+
+typedef struct bfloat16x8x3_t
+{
+  bfloat16x8_t val[3];
+} bfloat16x8x3_t;
+
+typedef struct bfloat16x4x4_t
+{
+  bfloat16x4_t val[4];
+} bfloat16x4x4_t;
+
+typedef struct bfloat16x8x4_t
+{
+  bfloat16x8_t val[4];
+} bfloat16x8x4_t;
+
+__extension__ extern __inline void
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vst2_bf16 (bfloat16_t * __ptr, bfloat16x4x2_t __val)
+{
+  union { bfloat16x4x2_t __i; __builtin_neon_ti __o; } __bu = { __val };
+  return __builtin_neon_vst2v4bf (__ptr, __bu.__o);
+}
+
+__extension__ extern __inline void
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vst2q_bf16 (bfloat16_t * __ptr, bfloat16x8x2_t __val)
+{
+  union { bfloat16x8x2_t __i; __builtin_neon_oi __o; } __bu = { __val };
+  return __builtin_neon_vst2v8bf (__ptr, __bu.__o);
+}
+
+__extension__ extern __inline void
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vst3_bf16 (bfloat16_t * __ptr, bfloat16x4x3_t __val)
+{
+  union { bfloat16x4x3_t __i; __builtin_neon_ei __o; } __bu = { __val };
+  return __builtin_neon_vst3v4bf (__ptr, __bu.__o);
+}
+
+__extension__ extern __inline void
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vst3q_bf16 (bfloat16_t * __ptr, bfloat16x8x3_t __val)
+{
+  union { bfloat16x8x3_t __i; __builtin_neon_ci __o; } __bu = { __val };
+  return __builtin_neon_vst3v8bf (__ptr, __bu.__o);
+}
+
+__extension__ extern __inline void
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vst4_bf16 (bfloat16_t * __ptr, bfloat16x4x4_t __val)
+{
+  union { bfloat16x4x4_t __i; __builtin_neon_oi __o; } __bu = { __val };
+  return __builtin_neon_vst4v4bf (__ptr, __bu.__o);
+}
+
+__extension__ extern __inline void
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vst4q_bf16 (bfloat16_t * __ptr, bfloat16x8x4_t __val)
+{
+  union { bfloat16x8x4_t __i; __builtin_neon_xi __o; } __bu = { __val };
+  return __builtin_neon_vst4v8bf (__ptr, __bu.__o);
+}
+
 #endif
 #pragma GCC pop_options
 #pragma GCC pop_options
diff --git a/gcc/config/arm/arm_neon_builtins.def b/gcc/config/arm/arm_neon_builtins.def
index bcccf93f7fa2750e9006e5856efecbec0fb331b9..7f0d58efa4feebf8854631bab58b3f41a55869c5 100644
--- a/gcc/config/arm/arm_neon_builtins.def
+++ b/gcc/config/arm/arm_neon_builtins.def
@@ -325,8 +325,8 @@  VAR11 (LOAD1, vld2,
 VAR9 (LOAD1LANE, vld2_lane,
 	v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
 VAR6 (LOAD1, vld2_dup, v8qi, v4hi, v4hf, v2si, v2sf, di)
-VAR11 (STORE1, vst2,
-	v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf)
+VAR13 (STORE1, vst2,
+	v8qi, v4hi, v4hf, v4bf, v2si, v2sf, di, v16qi, v8hi, v8hf, v8bf, v4si, v4sf)
 VAR9 (STORE1LANE, vst2_lane,
 	v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
 VAR11 (LOAD1, vld3,
@@ -334,8 +334,8 @@  VAR11 (LOAD1, vld3,
 VAR9 (LOAD1LANE, vld3_lane,
 	v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
 VAR6 (LOAD1, vld3_dup, v8qi, v4hi, v4hf, v2si, v2sf, di)
-VAR11 (STORE1, vst3,
-	v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf)
+VAR13 (STORE1, vst3,
+	v8qi, v4hi, v4hf, v4bf, v2si, v2sf, di, v16qi, v8hi, v8hf, v8bf, v4si, v4sf)
 VAR9 (STORE1LANE, vst3_lane,
 	v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
 VAR11 (LOAD1, vld4,
@@ -343,8 +343,8 @@  VAR11 (LOAD1, vld4,
 VAR9 (LOAD1LANE, vld4_lane,
 	v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
 VAR6 (LOAD1, vld4_dup, v8qi, v4hi, v4hf, v2si, v2sf, di)
-VAR11 (STORE1, vst4,
-	v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf)
+VAR13 (STORE1, vst4,
+	v8qi, v4hi, v4hf, v4bf, v2si, v2sf, di, v16qi, v8hi, v8hf, v8bf, v4si, v4sf)
 VAR9 (STORE1LANE, vst4_lane,
 	v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
 VAR2 (TERNOP, sdot, v8qi, v16qi)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 439021fa0733ac31706287c4f98d62b080afc3a1..53ab1e079fbbe38f834de0b7086edb0b7d804798 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -86,6 +86,9 @@ 
 ;; Double-width vector modes plus 64-bit elements.
 (define_mode_iterator VDX [V8QI V4HI V4HF V2SI V2SF DI])
 
+;; Double-width vector modes plus 64-bit elements, including V4BF.
+(define_mode_iterator VDXBF [V8QI V4HI V4HF (V4BF "TARGET_BF16_SIMD") V2SI V2SF DI])
+
 ;; Double-width vector modes plus 64-bit elements,
 ;; with V4BFmode added, suitable for moves.
 (define_mode_iterator VDXMOV [V8QI V4HI V4HF V4BF V2SI V2SF DI])
@@ -102,6 +105,9 @@ 
 ;; Quad-width vector modes, including V8HF.
 (define_mode_iterator VQ2 [V16QI V8HI V8HF V4SI V4SF])
 
+;; Quad-width vector modes, including V8BF.
+(define_mode_iterator VQ2BF [V16QI V8HI V8HF (V8BF "TARGET_BF16_SIMD") V4SI V4SF])
+
 ;; Quad-width vector modes with 16- or 32-bit elements
 (define_mode_iterator VQ_HS [V8HI V8HF V4SI V4SF])
 
@@ -521,6 +527,7 @@ 
 (define_mode_attr V_elem [(V8QI "QI") (V16QI "QI")
 			  (V4HI "HI") (V8HI "HI")
 			  (V4HF "HF") (V8HF "HF")
+			  (V4BF "BF") (V8BF "BF")
                           (V2SI "SI") (V4SI "SI")
                           (V2SF "SF") (V4SF "SF")
                           (DI "DI")   (V2DI "DI")])
@@ -695,6 +702,7 @@ 
 (define_mode_attr V_sz_elem [(V8QI "8")  (V16QI "8")
 			     (V4HI "16") (V8HI  "16")
 			     (V2SI "32") (V4SI  "32")
+                             (V4BF "16") (V8BF "16")
 			     (DI   "64") (V2DI  "64")
 			     (V4HF "16") (V8HF "16")
 			     (V2SF "32") (V4SF  "32")])
@@ -772,6 +780,7 @@ 
 (define_mode_attr V_mode_nunits [(V8QI "8") (V16QI "16")
 				 (V4HF "4") (V8HF "8")
                                  (V4HI "4") (V8HI "8")
+                                 (V4BF "4") (V8BF "8")
                                  (V2SI "2") (V4SI "4")
                                  (V2SF "2") (V4SF "4")
                                  (DI "1")   (V2DI "2")
@@ -822,6 +831,7 @@ 
 		     (V4HI "") (V8HI "_q")
 		     (V2SI "") (V4SI "_q")
 		     (V4HF "") (V8HF "_q")
+                     (V4BF "") (V8BF "_q")
 		     (V2SF "") (V4SF "_q")
 		     (V4HF "") (V8HF "_q")
 		     (V4BF "") (V8BF "_q")
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index b724aab65f720bf0e48bb828f0874426effd235c..e1c80b75d7692a76a8eabc7476e180f5e5a18a0f 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5496,7 +5496,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst2<mode>"
   [(set (match_operand:TI 0 "neon_struct_operand" "=Um")
         (unspec:TI [(match_operand:TI 1 "s_register_operand" "w")
-                    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST2))]
   "TARGET_NEON"
 {
@@ -5521,7 +5521,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst2<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
 	(unspec:OI [(match_operand:OI 1 "s_register_operand" "w")
-		    (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+		    (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 		   UNSPEC_VST2))]
   "TARGET_NEON"
   "vst2.<V_sz_elem>\t%h1, %A0"
@@ -5765,7 +5765,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst3<mode>"
   [(set (match_operand:EI 0 "neon_struct_operand" "=Um")
         (unspec:EI [(match_operand:EI 1 "s_register_operand" "w")
-                    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST3))]
   "TARGET_NEON"
 {
@@ -5792,7 +5792,7 @@  if (BYTES_BIG_ENDIAN)
 (define_expand "neon_vst3<mode>"
   [(match_operand:CI 0 "neon_struct_operand")
    (match_operand:CI 1 "s_register_operand")
-   (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+   (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
   "TARGET_NEON"
 {
   rtx mem;
@@ -5807,7 +5807,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst3qa<mode>"
   [(set (match_operand:EI 0 "neon_struct_operand" "=Um")
         (unspec:EI [(match_operand:CI 1 "s_register_operand" "w")
-                    (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST3A))]
   "TARGET_NEON"
 {
@@ -5826,7 +5826,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst3qb<mode>"
   [(set (match_operand:EI 0 "neon_struct_operand" "=Um")
         (unspec:EI [(match_operand:CI 1 "s_register_operand" "w")
-                    (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST3B))]
   "TARGET_NEON"
 {
@@ -6090,7 +6090,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst4<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
         (unspec:OI [(match_operand:OI 1 "s_register_operand" "w")
-                    (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST4))]
   "TARGET_NEON"
 {
@@ -6118,7 +6118,7 @@  if (BYTES_BIG_ENDIAN)
 (define_expand "neon_vst4<mode>"
   [(match_operand:XI 0 "neon_struct_operand")
    (match_operand:XI 1 "s_register_operand")
-   (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+   (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
   "TARGET_NEON"
 {
   rtx mem;
@@ -6133,7 +6133,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst4qa<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
         (unspec:OI [(match_operand:XI 1 "s_register_operand" "w")
-                    (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST4A))]
   "TARGET_NEON"
 {
@@ -6153,7 +6153,7 @@  if (BYTES_BIG_ENDIAN)
 (define_insn "neon_vst4qb<mode>"
   [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
         (unspec:OI [(match_operand:XI 1 "s_register_operand" "w")
-                    (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+                    (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                    UNSPEC_VST4B))]
   "TARGET_NEON"
 {
diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..277755c4fd533280a51980fc80853de0d3c583d1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vstn_1.c
@@ -0,0 +1,84 @@ 
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon } */
+/* { dg-additional-options "-save-temps" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "arm_neon.h"
+
+/*
+**test_vst2_bf16:
+**	...
+**	vst2.16\t{d[0-9]+-d[0-9]+}, \[r[0-9]+\]
+**	...
+*/
+void
+test_vst2_bf16 (bfloat16_t *ptr, bfloat16x4x2_t val)
+{
+  vst2_bf16 (ptr, val);
+}
+
+/*
+**test_vst2q_bf16:
+**      ...
+**      vst2.16\t{d[0-9]+-d[0-9]+}, \[r[0-9]+\]
+**      ...
+*/
+void
+test_vst2q_bf16 (bfloat16_t *ptr, bfloat16x8x2_t val)
+{
+  vst2q_bf16 (ptr, val);
+}
+
+/*
+**test_vst3_bf16:
+**      ...
+**      vst3.16\t{d[0-9]+-d[0-9]+}, \[r[0-9]+\]
+**      ...
+*/
+void
+test_vst3_bf16 (bfloat16_t *ptr, bfloat16x4x3_t val)
+{
+  vst3_bf16 (ptr, val);
+}
+
+/*
+**test_vst3q_bf16:
+**      ...
+**      vst3.16\t{d[0-9]+, d[0-9]+, d[0-9]+}, \[r[0-9]+\]
+**      ...
+*/
+void
+test_vst3q_bf16 (bfloat16_t *ptr, bfloat16x8x3_t val)
+{
+  vst3q_bf16 (ptr, val);
+}
+
+/*
+**test_vst4_bf16:
+**      ...
+**      vst4.16\t{d[0-9]+-d[0-9]+}, \[r[0-9]+\]
+**      ...
+*/
+void
+test_vst4_bf16 (bfloat16_t *ptr, bfloat16x4x4_t val)
+{
+  vst4_bf16 (ptr, val);
+}
+
+/*
+**test_vst4q_bf16:
+**      ...
+**      vst4.16\t{d[0-9]+, d[0-9]+, d[0-9]+, d[0-9]+}, \[r[0-9]+\]
+**      ...
+*/
+void
+test_vst4q_bf16 (bfloat16_t *ptr, bfloat16x8x4_t val)
+{
+  vst4q_bf16 (ptr, val);
+}
+
+int main()
+{
+  return 0;
+}