
[AArch64] Improve code generation for float16 vector code

Message ID 1441631341-6599-1-git-send-email-alan.lawrence@arm.com
State New

Commit Message

Alan Lawrence Sept. 7, 2015, 1:09 p.m. UTC
On 04/09/15 13:32, James Greenhalgh wrote:
> In that case, these should be implemented as inline assembly blocks. As it
> stands, the code generation for these intrinsics will be very poor with this
> patch applied.
>
> I'm going to hold off OKing this until I see a follow-up to fix the code
> generation, either replacing those particular intrinsics with inline asm,
> or doing the more comprehensive fix in the back-end.
>
> Thanks,
> James

In that case, here is the follow-up now ;). This fixes each of the following
functions to generate a single instruction followed by ret:
  * vld1_dup_f16, vld1q_dup_f16
  * vset_lane_f16, vsetq_lane_f16
  * vget_lane_f16, vgetq_lane_f16
  * For IN of type either float16x4_t or float16x8_t, and constant C:
return (float16x4_t) {in[C], in[C], in[C], in[C]};
  * Similarly,
return (float16x8_t) {in[C], in[C], in[C], in[C], in[C], in[C], in[C], in[C]};
(These correspond intuitively to what one might expect for "vdup_lane_f16",
"vdup_laneq_f16", "vdupq_lane_f16" and "vdupq_laneq_f16" intrinsics,
although such intrinsics do not actually exist.)
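
(Purely for illustration, not part of the patch: a minimal sketch of the
float16x4_t case above. With the patch applied I would expect something like
the following to compile to a single "dup v0.4h, v0.h[2]" plus ret, although
the exact assembly will of course depend on compiler version and options.)

#include <arm_neon.h>

float16x4_t
broadcast_lane_2 (float16x4_t in)
{
  /* Broadcast element 2 of the input vector to all four lanes.  */
  return (float16x4_t) { in[2], in[2], in[2], in[2] };
}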

This patch does not deal with equivalents to vdup_n_s16 and other intrinsics
that load immediates, rather than using elements of pre-existing vectors.

I'd welcome thoughts/opinions on what testcase would be appropriate. Correctness
of all the intrinsics is already tested by the advsimd-intrinsics testsuite, and
the only way I can see to verify code generation is to scan the assembler output
for particular instructions; do we wish to see more scan-assembler tests?
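
(If more scan-assembler coverage is wanted, here is a rough sketch of what such
a test might look like; the dg- directives and the regex are only my guess at
something plausible, not taken from an existing test.)

/* { dg-do compile } */
/* { dg-options "-O2" } */

#include <arm_neon.h>

float16x8_t
dup_lane_3 (float16x8_t in)
{
  /* Expected to become a single lane-broadcast, e.g. "dup v0.8h, v0.h[3]".  */
  return (float16x8_t) { in[3], in[3], in[3], in[3],
			 in[3], in[3], in[3], in[3] };
}

/* { dg-final { scan-assembler "dup\\tv\[0-9\]+\.8h, v\[0-9\]+\.h\\\[3\\\]" } } */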

Bootstrapped + check-gcc on aarch64-none-linux-gnu.

Thanks,
Alan

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_simd_dup<mode>,
	aarch64_dup_lane<mode>, aarch64_dup_lane_<vswap_width_name><mode>,
	aarch64_simd_vec_set<mode>, vec_set<mode>, vec_perm_const<mode>,
	vec_init<mode>, *aarch64_simd_ld1r<mode>, vec_extract<mode>): Add
	V4HF and V8HF variants to iterator.

	* config/aarch64/aarch64.c (aarch64_evpc_dup): Add V4HF and V8HF cases.

	* config/aarch64/iterators.md (VDQF_F16): New.
	(VSWAP_WIDTH, vswap_width_name): Add V4HF and V8HF cases.
---
 gcc/config/aarch64/aarch64-simd.md | 39 +++++++++++++++++++-------------------
 gcc/config/aarch64/aarch64.c       |  2 ++
 gcc/config/aarch64/iterators.md    |  7 ++++++-
 3 files changed, 28 insertions(+), 20 deletions(-)

Comments

James Greenhalgh Sept. 8, 2015, 8:21 a.m. UTC | #1
On Mon, Sep 07, 2015 at 02:09:01PM +0100, Alan Lawrence wrote:
> On 04/09/15 13:32, James Greenhalgh wrote:
> > In that case, these should be implemented as inline assembly blocks. As it
> > stands, the code generation for these intrinsics will be very poor with this
> > patch applied.
> >
> > I'm going to hold off OKing this until I see a follow-up to fix the code
> > generation, either replacing those particular intrinsics with inline asm,
> > or doing the more comprehensive fix in the back-end.
> >
> > Thanks,
> > James
> 
> In that case, here is the follow-up now ;). This fixes each of the following
> functions to generate a single instruction followed by ret:
>   * vld1_dup_f16, vld1q_dup_f16
>   * vset_lane_f16, vsetq_lane_f16
>   * vget_lane_f16, vgetq_lane_f16
>   * For IN of type either float16x4_t or float16x8_t, and constant C:
> return (float16x4_t) {in[C], in[C], in[C], in[C]};
>   * Similarly,
> return (float16x8_t) {in[C], in[C], in[C], in[C], in[C], in[C], in[C], in[C]};
> (These correspond intuitively to what one might expect for "vdup_lane_f16",
> "vdup_laneq_f16", "vdupq_lane_f16" and "vdupq_laneq_f16" intrinsics,
> although such intrinsics do not actually exist.)
> 
> This patch does not deal with equivalents to vdup_n_s16 and other intrinsics
> that load immediates, rather than using elements of pre-existing vectors.

What is code generation like for these then? If I remember correctly, it
was the vdup_n_f16 implementation that looked most objectionable before.

> I'd welcome thoughts/opinions on what testcase would be appropriate.
> Correctness of all the intrinsics is already tested by the advsimd-intrinsics
> testsuite, and the only way I can see to verify code generation is to scan
> the assembler output for particular instructions; do we wish to see more
> scan-assembler tests?

I think these are fine without a test case; as you say, correctness is
already handled elsewhere.

> Bootstrapped + check-gcc on aarch64-none-linux-gnu.

OK,

Thanks,
James

> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md (aarch64_simd_dup<mode>,
> 	aarch64_dup_lane<mode>, aarch64_dup_lane_<vswap_width_name><mode>,
> 	aarch64_simd_vec_set<mode>, vec_set<mode>, vec_perm_const<mode>,
> 	vec_init<mode>, *aarch64_simd_ld1r<mode>, vec_extract<mode>): Add
> 	V4HF and V8HF variants to iterator.
> 
> 	* config/aarch64/aarch64.c (aarch64_evpc_dup): Add V4HF and V8HF cases.
> 
> 	* config/aarch64/iterators.md (VDQF_F16): New.
> 	(VSWAP_WIDTH, vswap_width_name): Add V4HF and V8HF cases.
James Greenhalgh Sept. 8, 2015, 8:26 a.m. UTC | #2
On Tue, Sep 08, 2015 at 09:21:08AM +0100, James Greenhalgh wrote:
> On Mon, Sep 07, 2015 at 02:09:01PM +0100, Alan Lawrence wrote:
> > On 04/09/15 13:32, James Greenhalgh wrote:
> > > In that case, these should be implemented as inline assembly blocks. As it
> > > stands, the code generation for these intrinsics will be very poor with this
> > > patch applied.
> > >
> > > I'm going to hold off OKing this until I see a follow-up to fix the code
> > > generation, either replacing those particular intrinsics with inline asm,
> > > or doing the more comprehensive fix in the back-end.
> > >
> > > Thanks,
> > > James
> > 
> > In that case, here is the follow-up now ;). This fixes each of the following
> > functions to generate a single instruction followed by ret:
> >   * vld1_dup_f16, vld1q_dup_f16
> >   * vset_lane_f16, vsetq_lane_f16
> >   * vget_lane_f16, vgetq_lane_f16
> >   * For IN of type either float16x4_t or float16x8_t, and constant C:
> > return (float16x4_t) {in[C], in[C], in[C], in[C]};
> >   * Similarly,
> > return (float16x8_t) {in[C], in[C], in[C], in[C], in[C], in[C], in[C], in[C]};
> > (These correspond intuitively to what one might expect for "vdup_lane_f16",
> > "vdup_laneq_f16", "vdupq_lane_f16" and "vdupq_laneq_f16" intrinsics,
> > although such intrinsics do not actually exist.)
> > 
> > This patch does not deal with equivalents to vdup_n_s16 and other intrinsics
> > that load immediates, rather than using elements of pre-existing vectors.
> 
> What is code generation like for these then? If I remember correctly, it
> was the vdup_n_f16 implementation that looked most objectionable before.

Ah, I see what you are saying here. You mean: if there were intrinsics
equivalent to vdup_n_s16 (which there are not), then this patch would not
handle them. I was confused as vld1_dup_f16 does not use an element of a
pre-existing vector, and may well load an immediate, but is handled by
your patch.
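
(For the record, a minimal sketch of the distinction, not taken from the
patch; the prototype is the ACLE one, and the ld1r in the comment is just the
instruction I would expect at -O2, not a guarantee.)

#include <arm_neon.h>

float16x4_t
dup_from_memory (const float16_t *p)
{
  /* Broadcasts the half-precision value loaded from *p,
     e.g. "ld1r {v0.4h}, [x0]".  */
  return vld1_dup_f16 (p);
}

/* A hypothetical vdup_n_f16, by contrast, would broadcast a scalar value
   already held in a register; no such intrinsic exists at this point.  */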

Sorry for the noise.

James
Alan Lawrence Sept. 8, 2015, 11:47 a.m. UTC | #3
On 08/09/15 09:26, James Greenhalgh wrote:
> On Tue, Sep 08, 2015 at 09:21:08AM +0100, James Greenhalgh wrote:
>> On Mon, Sep 07, 2015 at 02:09:01PM +0100, Alan Lawrence wrote:
>>> On 04/09/15 13:32, James Greenhalgh wrote:
>>>> In that case, these should be implemented as inline assembly blocks. As it
>>>> stands, the code generation for these intrinsics will be very poor with this
>>>> patch applied.
>>>>
>>>> I'm going to hold off OKing this until I see a follow-up to fix the code
>>>> generation, either replacing those particular intrinsics with inline asm,
>>>> or doing the more comprehensive fix in the back-end.
>>>>
>>>> Thanks,
>>>> James
>>>
>>> In that case, here is the follow-up now ;). This fixes each of the following
>>> functions to generate a single instruction followed by ret:
>>>    * vld1_dup_f16, vld1q_dup_f16
>>>    * vset_lane_f16, vsetq_lane_f16
>>>    * vget_lane_f16, vgetq_lane_f16
>>>    * For IN of type either float16x4_t or float16x8_t, and constant C:
>>> return (float16x4_t) {in[C], in[C], in[C], in[C]};
>>>    * Similarly,
>>> return (float16x8_t) {in[C], in[C], in[C], in[C], in[C], in[C], in[C], in[C]};
>>> (These correspond intuitively to what one might expect for "vdup_lane_f16",
>>> "vdup_laneq_f16", "vdupq_lane_f16" and "vdupq_laneq_f16" intrinsics,
>>> although such intrinsics do not actually exist.)
>>>
>>> This patch does not deal with equivalents to vdup_n_s16 and other intrinsics
>>> that load immediates, rather than using elements of pre-existing vectors.
>>
>> What is code generation like for these then? If I remember correctly, it
>> was the vdup_n_f16 implementation that looked most objectionable before.
>
> Ah, I see what you are saying here. You mean: if there were intrinsics
> equivalent to vdup_n_s16 (which there are not), then this patch would not
> handle them. I was confused as vld1_dup_f16 does not use an element of a
> pre-existing vector, and may well load an immediate, but is handled by
> your patch.

To be clear: we do not yet use the *immediate* case of this at all, as HFmode
constants are disabled in aarch64_float_const_representable_p - we would need
to do some mangling to express the floating-point value as a binary constant
in the assembler output. (See the ARM backend.) That is, we cannot output
(say) an HFmode load of 16.0, as the assembler would express 16.0 as a 32-bit
float constant; we would instead need to output a load of the immediate
0x4c00. Instead, we will push the constant out to the constant pool and use a
load instruction taking an address.
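
(Purely for illustration, not part of the patch: a quick way to check the
binary16 bit pattern above. 16.0 is 1.0 * 2^4, so with an exponent bias of 15
the biased exponent field is 19 (0b10011) and the mantissa is 0, giving
0x4c00.)

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Convert a float that is exactly representable as a normal binary16 value
   into its 16-bit encoding, by truncating the binary32 fields.  */
static uint16_t
to_binary16 (float f)
{
  uint32_t bits;
  memcpy (&bits, &f, sizeof bits);
  uint32_t sign = bits >> 31;
  int32_t exponent = (int32_t) ((bits >> 23) & 0xff) - 127;
  uint32_t mantissa = (bits >> 13) & 0x3ff;
  return (uint16_t) ((sign << 15) | ((uint32_t) (exponent + 15) << 10) | mantissa);
}

int
main (void)
{
  printf ("16.0 -> 0x%04x\n", to_binary16 (16.0f));	/* prints 0x4c00 */
  return 0;
}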

--Alan

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 160acf9..b303d58 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -53,18 +53,19 @@ 
 )
 
 (define_insn "aarch64_simd_dup<mode>"
-  [(set (match_operand:VDQF 0 "register_operand" "=w")
-        (vec_duplicate:VDQF (match_operand:<VEL> 1 "register_operand" "w")))]
+  [(set (match_operand:VDQF_F16 0 "register_operand" "=w")
+	(vec_duplicate:VDQF_F16
+	  (match_operand:<VEL> 1 "register_operand" "w")))]
   "TARGET_SIMD"
   "dup\\t%0.<Vtype>, %1.<Vetype>[0]"
   [(set_attr "type" "neon_dup<q>")]
 )
 
 (define_insn "aarch64_dup_lane<mode>"
-  [(set (match_operand:VALL 0 "register_operand" "=w")
-	(vec_duplicate:VALL
+  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
+	(vec_duplicate:VALL_F16
 	  (vec_select:<VEL>
-	    (match_operand:VALL 1 "register_operand" "w")
+	    (match_operand:VALL_F16 1 "register_operand" "w")
 	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
           )))]
   "TARGET_SIMD"
@@ -76,8 +77,8 @@ 
 )
 
 (define_insn "aarch64_dup_lane_<vswap_width_name><mode>"
-  [(set (match_operand:VALL 0 "register_operand" "=w")
-	(vec_duplicate:VALL
+  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
+	(vec_duplicate:VALL_F16
 	  (vec_select:<VEL>
 	    (match_operand:<VSWAP_WIDTH> 1 "register_operand" "w")
 	    (parallel [(match_operand:SI 2 "immediate_operand" "i")])
@@ -834,11 +835,11 @@ 
 )
 
 (define_insn "aarch64_simd_vec_set<mode>"
-  [(set (match_operand:VDQF 0 "register_operand" "=w")
-        (vec_merge:VDQF
-	    (vec_duplicate:VDQF
+  [(set (match_operand:VDQF_F16 0 "register_operand" "=w")
+	(vec_merge:VDQF_F16
+	    (vec_duplicate:VDQF_F16
 		(match_operand:<VEL> 1 "register_operand" "w"))
-	    (match_operand:VDQF 3 "register_operand" "0")
+	    (match_operand:VDQF_F16 3 "register_operand" "0")
 	    (match_operand:SI 2 "immediate_operand" "i")))]
   "TARGET_SIMD"
   {
@@ -851,7 +852,7 @@ 
 )
 
 (define_expand "vec_set<mode>"
-  [(match_operand:VDQF 0 "register_operand" "+w")
+  [(match_operand:VDQF_F16 0 "register_operand" "+w")
    (match_operand:<VEL> 1 "register_operand" "w")
    (match_operand:SI 2 "immediate_operand" "")]
   "TARGET_SIMD"
@@ -4691,9 +4692,9 @@ 
 ;; vec_perm support
 
 (define_expand "vec_perm_const<mode>"
-  [(match_operand:VALL 0 "register_operand")
-   (match_operand:VALL 1 "register_operand")
-   (match_operand:VALL 2 "register_operand")
+  [(match_operand:VALL_F16 0 "register_operand")
+   (match_operand:VALL_F16 1 "register_operand")
+   (match_operand:VALL_F16 2 "register_operand")
    (match_operand:<V_cmp_result> 3)]
   "TARGET_SIMD"
 {
@@ -4957,7 +4958,7 @@ 
 ;; Standard pattern name vec_init<mode>.
 
 (define_expand "vec_init<mode>"
-  [(match_operand:VALL 0 "register_operand" "")
+  [(match_operand:VALL_F16 0 "register_operand" "")
    (match_operand 1 "" "")]
   "TARGET_SIMD"
 {
@@ -4966,8 +4967,8 @@ 
 })
 
 (define_insn "*aarch64_simd_ld1r<mode>"
-  [(set (match_operand:VALL 0 "register_operand" "=w")
-	(vec_duplicate:VALL
+  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
+	(vec_duplicate:VALL_F16
 	  (match_operand:<VEL> 1 "aarch64_simd_struct_operand" "Utv")))]
   "TARGET_SIMD"
   "ld1r\\t{%0.<Vtype>}, %1"
@@ -5014,7 +5015,7 @@ 
 
 (define_expand "vec_extract<mode>"
   [(match_operand:<VEL> 0 "aarch64_simd_nonimmediate_operand" "")
-   (match_operand:VALL 1 "register_operand" "")
+   (match_operand:VALL_F16 1 "register_operand" "")
    (match_operand:SI 2 "immediate_operand" "")]
   "TARGET_SIMD"
 {
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f589ca3..362e0a0 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11749,6 +11749,8 @@  aarch64_evpc_dup (struct expand_vec_perm_d *d)
     case V4SImode: gen = gen_aarch64_dup_lanev4si; break;
     case V2SImode: gen = gen_aarch64_dup_lanev2si; break;
     case V2DImode: gen = gen_aarch64_dup_lanev2di; break;
+    case V8HFmode: gen = gen_aarch64_dup_lanev8hf; break;
+    case V4HFmode: gen = gen_aarch64_dup_lanev4hf; break;
     case V4SFmode: gen = gen_aarch64_dup_lanev4sf; break;
     case V2SFmode: gen = gen_aarch64_dup_lanev2sf; break;
     case V2DFmode: gen = gen_aarch64_dup_lanev2df; break;
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index c2af1de..b1f483c 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -82,7 +82,10 @@ 
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
 (define_mode_iterator PTR [(SI "ptr_mode == SImode") (DI "ptr_mode == DImode")])
 
-;; Vector Float modes.
+;; Vector Float modes suitable for moving, loading and storing.
+(define_mode_iterator VDQF_F16 [V4HF V8HF V2SF V4SF V2DF])
+
+;; Vector Float modes, barring HF modes.
 (define_mode_iterator VDQF [V2SF V4SF V2DF])
 
 ;; Vector Float modes, and DF.
@@ -638,12 +641,14 @@ 
 				(V2SI "V4SI") (V4SI  "V2SI")
 				(DI   "V2DI") (V2DI  "DI")
 				(V2SF "V4SF") (V4SF  "V2SF")
+				(V4HF "V8HF") (V8HF  "V4HF")
 				(DF   "V2DF") (V2DF  "DF")])
 
 (define_mode_attr vswap_width_name [(V8QI "to_128") (V16QI "to_64")
 				    (V4HI "to_128") (V8HI  "to_64")
 				    (V2SI "to_128") (V4SI  "to_64")
 				    (DI   "to_128") (V2DI  "to_64")
+				    (V4HF "to_128") (V8HF  "to_64")
 				    (V2SF "to_128") (V4SF  "to_64")
 				    (DF   "to_128") (V2DF  "to_64")])