diff mbox series

[RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op.

Message ID patch-13979-tamar@arm.com
State New
Headers show
Series [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op. | expand

Commit Message

Tamar Christina Jan. 4, 2021, 12:11 p.m. UTC
Hi All,

I have been looking into a class of problems where GCC does not recognize that
lane 0 of a vector register (using little-endian as the example) is just a
subreg of that register, and so fails to pass the register directly to the
instruction that uses it.

As an example consider

poly64_t
testcase (uint8x16_t input, poly64x2_t mask)
{
    poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
			       vgetq_lane_p64(mask, 0));
    poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
    return prodL + prodH;
}

Where we generate

testcase:
	dup     d2, v0.d[0]
	dup     d3, v1.d[0]
	pmull2  v0.1q, v0.2d, v1.2d
	pmull   v2.1q, v2.1d, v3.1d
	add     d0, d2, d0
	fmov    x0, d0
	ret

whereas the expected code, which is what clang generates, is:

testcase:
	pmull   v2.1q, v0.1d, v1.1d
	pmull2  v0.1q, v0.2d, v1.2d
	add     v0.2d, v0.2d, v2.2d
	fmov    x0, d0
	ret

Now this can be naively solved by just adding RTL patterns for the
vec_selects, as in the example in the patch, but that doesn't solve the overall
problem and I am wondering how best to do this.

One approach would be to extend combine's noop detection in noop_move_p to
recognize these cases.

The downside here is that the conversion becomes implicit in the RTL, i.e.
you'll see a SET of a V2DI but a use of DI for that same register.  I'm not sure
the semantics of RTL allow such implicit uses?
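
To make that concrete, the kind of RTL I mean would look roughly like this
(register numbers and the surrounding detail are only illustrative):

;; v0 is set as a whole V2DI, but the pmull then reads it as a DI with no
;; vec_select or subreg in between -- the lane-0 extract has simply vanished.
(set (reg:V2DI 32 v0) (...))
(set (reg:TI 34 v2)
     (unspec:TI [(reg:DI 32 v0)
                 (reg:DI 33 v1)] UNSPEC_PMULL))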

The second approach I can think of is to extend reload to recognize these
no-ops, assign the same register, and mark the extract as unused so that DSE
cleans it up.

But there's probably a better approach I didn't think of :)

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(*aarch64_crypto_pmullv2di): Example RTL.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pmull_2.c: New test.


Comments

Richard Sandiford Jan. 4, 2021, 12:28 p.m. UTC | #1
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> I have been looking into a class of problems where GCC does not recognize that
> lane 0 of a vector register (using little-endian as the example) is just a
> subreg of that register, and so fails to pass the register directly to the
> instruction that uses it.
>
> As an example consider
>
> poly64_t
> testcase (uint8x16_t input, poly64x2_t mask)
> {
>     poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
> 			       vgetq_lane_p64(mask, 0));
>     poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
>     return prodL + prodH;
> }
>
> Where we generate
>
> testcase:
> 	dup     d2, v0.d[0]
> 	dup     d3, v1.d[0]
> 	pmull2  v0.1q, v0.2d, v1.2d
> 	pmull   v2.1q, v2.1d, v3.1d
> 	add     d0, d2, d0
> 	fmov    x0, d0
> 	ret
>
> whereas the expected code, which is what clang generates, is:
>
> testcase:
> 	pmull   v2.1q, v0.1d, v1.1d
> 	pmull2  v0.1q, v0.2d, v1.2d
> 	add     v0.2d, v0.2d, v2.2d
> 	fmov    x0, d0
> 	ret
>
> Now this can be naively solved by just adding RTL patterns for the
> vec_selects, as in the example in the patch, but that doesn't solve the overall
> problem and I am wondering how best to do this.
>
> One approach would be to extend combine's noop detection in noop_move_p to
> recognize these cases.
>
> The downside here is that the conversion becomes implicit in the RTL, i.e.
> you'll see a SET of a V2DI but a use of DI for that same register.  I'm not sure
> the semantics of RTL allow such implicit uses?

It's OK to set a hard register in one mode and use it in a different mode
(without subregs), but it's not possible to do the same using pseudos.

> The second approach I can think of is to extend reload to recognize these
> no-ops, assign the same register, and mark the extract as unused so that DSE
> cleans it up.
>
> But there's probably a better approach I didn't think of :)

FWIW, for MIPS we tended to handle this kind of thing using matching
constraints.  E.g. for:

(define_insn_and_split "aarch64_simd_mov_from_<mode>low"
  [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
        (vec_select:<VHALF>
          (match_operand:VQMOV_NO2E 1 "register_operand" "w,w")
          (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
  "TARGET_SIMD"
  "@
   #
   umov\t%0, %1.d[0]"
  "&& reload_completed && aarch64_simd_register (operands[0], <VHALF>mode)"
  [(set (match_dup 0) (match_dup 1))]
  {
    operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode);
  }
  [(set_attr "type" "mov_reg,neon_to_gp<q>")
   (set_attr "length" "4")]
)

use something like "0,w" for operand 1, so that the first alternative
can be split to nothing:

;; When TARGET_64BIT, all SImode integer and accumulator registers
;; should already be in sign-extended form (see TARGET_TRULY_NOOP_TRUNCATION
;; and truncdisi2).  We can therefore get rid of register->register
;; instructions if we constrain the source to be in the same register as
;; the destination.
;;
;; Only the pre-reload scheduler sees the type of the register alternatives;
;; we split them into nothing before the post-reload scheduler runs.
;; These alternatives therefore have type "move" in order to reflect
;; what happens if the two pre-reload operands cannot be tied, and are
;; instead allocated two separate GPRs.  We don't distinguish between
;; the GPR and LO cases because we don't usually know during pre-reload
;; scheduling whether an operand will be LO or not.
(define_insn_and_split "extendsidi2"
  [(set (match_operand:DI 0 "register_operand" "=d,l,d")
        (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "0,0,m")))]
  "TARGET_64BIT"
  "@
   #
   #
   lw\t%0,%1"
  "&& reload_completed && register_operand (operands[1], VOIDmode)"
  [(const_int 0)]
{
  emit_note (NOTE_INSN_DELETED);
  DONE;
}
  [(set_attr "move_type" "move,move,load")
   (set_attr "mode" "DI")])

It'll need some experimentation though.  E.g. is it worth providing
a w<-w alternative as well, with ? or ^ to disparage it?

Independently of that, it might be worth trying to add a memory
alternative, so that we can load spilled values directly from
memory instead of first reloading the vector.
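
Concretely, a rough and untested sketch of that "0,w" variant of the AArch64
pattern above, with the split-to-nothing handling borrowed from the MIPS
example (all of the choices here are guesses that would need checking):

(define_insn_and_split "aarch64_simd_mov_from_<mode>low"
  [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
        (vec_select:<VHALF>
          (match_operand:VQMOV_NO2E 1 "register_operand" "0,w")
          (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
  "TARGET_SIMD"
  "@
   #
   umov\t%0, %1.d[0]"
  ;; Once reload has tied operands 0 and 1 to the same register, the
  ;; low-half extract is a no-op and the insn can split away entirely.
  "&& reload_completed && aarch64_simd_register (operands[0], <VHALF>mode)"
  [(const_int 0)]
  {
    emit_note (NOTE_INSN_DELETED);
    DONE;
  }
  [(set_attr "type" "mov_reg,neon_to_gp<q>")
   (set_attr "length" "4")]
)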

Thanks,
Richard
Tamar Christina Jan. 4, 2021, 12:52 p.m. UTC | #2
Hi Richard,

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, January 4, 2021 12:29 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [RFC] AArch64: Have RTL patterns recognize DI extracts from
> vectors at offset 0 as no-op.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > I have been looking into a class of problems where GCC does not recognize
> > that lane 0 of a vector register (using little-endian as the example) is
> > just a subreg of that register, and so fails to pass the register directly
> > to the instruction that uses it.
> >
> > As an example consider
> >
> > poly64_t
> > testcase (uint8x16_t input, poly64x2_t mask)
> > {
> >     poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
> > 			       vgetq_lane_p64(mask, 0));
> >     poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
> >     return prodL + prodH;
> > }
> >
> > Where we generate
> >
> > testcase:
> > 	dup     d2, v0.d[0]
> > 	dup     d3, v1.d[0]
> > 	pmull2  v0.1q, v0.2d, v1.2d
> > 	pmull   v2.1q, v2.1d, v3.1d
> > 	add     d0, d2, d0
> > 	fmov    x0, d0
> > 	ret
> >
> > whereas the expected code, which is what clang generates, is:
> >
> > testcase:
> > 	pmull   v2.1q, v0.1d, v1.1d
> > 	pmull2  v0.1q, v0.2d, v1.2d
> > 	add     v0.2d, v0.2d, v2.2d
> > 	fmov    x0, d0
> > 	ret
> >
> > Now this can be naively solved by just adding RTL patterns for the
> > vec_selects, as in the example in the patch, but that doesn't solve the
> > overall problem and I am wondering how best to do this.
> >
> > One approach would be to extend combine's noop detection in
> > noop_move_p to recognize these cases.
> >
> > The downside here is that the conversion becomes implicit in the RTL, i.e.
> > you'll see a SET of a V2DI but a use of DI for that same register.
> > I'm not sure the semantics of RTL allow such implicit uses?
> 
> It's OK to set a hard register in one mode and use it in a different mode
> (without subregs), but it's not possible to do the same using pseudos.
> 
> > The second approach I can think of is to extend reload to recognize these
> > no-ops, assign the same register, and mark the extract as unused so that
> > DSE cleans it up.
> >
> > But there's probably a better approach I didn't think of :)
> 
> FWIW, for MIPS we tended to handle this kind of thing using matching
> constraints.  E.g. for:
> 
> (define_insn_and_split "aarch64_simd_mov_from_<mode>low"
>   [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r")
>         (vec_select:<VHALF>
>           (match_operand:VQMOV_NO2E 1 "register_operand" "w,w")
>           (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))]
>   "TARGET_SIMD"
>   "@
>    #
>    umov\t%0, %1.d[0]"
>   "&& reload_completed && aarch64_simd_register (operands[0],
> <VHALF>mode)"
>   [(set (match_dup 0) (match_dup 1))]
>   {
>     operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode);
>   }
>   [(set_attr "type" "mov_reg,neon_to_gp<q>")
>    (set_attr "length" "4")]
> )
> 
> use something like "0,w" for operand 1, so that the first alternative can be
> split to nothing:

Ah, interesting, I indeed didn't think of this approach.  I'll go experiment.

Thanks!

> 
> ;; When TARGET_64BIT, all SImode integer and accumulator registers
> ;; should already be in sign-extended form (see TARGET_TRULY_NOOP_TRUNCATION
> ;; and truncdisi2).  We can therefore get rid of register->register
> ;; instructions if we constrain the source to be in the same register as
> ;; the destination.
> ;;
> ;; Only the pre-reload scheduler sees the type of the register alternatives;
> ;; we split them into nothing before the post-reload scheduler runs.
> ;; These alternatives therefore have type "move" in order to reflect
> ;; what happens if the two pre-reload operands cannot be tied, and are
> ;; instead allocated two separate GPRs.  We don't distinguish between
> ;; the GPR and LO cases because we don't usually know during pre-reload
> ;; scheduling whether an operand will be LO or not.
> (define_insn_and_split "extendsidi2"
>   [(set (match_operand:DI 0 "register_operand" "=d,l,d")
>         (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "0,0,m")))]
>   "TARGET_64BIT"
>   "@
>    #
>    #
>    lw\t%0,%1"
>   "&& reload_completed && register_operand (operands[1], VOIDmode)"
>   [(const_int 0)]
> {
>   emit_note (NOTE_INSN_DELETED);
>   DONE;
> }
>   [(set_attr "move_type" "move,move,load")
>    (set_attr "mode" "DI")])
> 
> It'll need some experimentation though.  E.g. is it worth providing a w<-w
> alternative as well, with ? or ^ to disparage it?
> 
> Independently of that, it might be worth trying to add a memory alternative,
> so that we can load spilled values directly from memory instead of first
> reloading the vector.
> 
> Thanks,
> Richard
diff mbox series

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 05d18f8bd3ac09c56c82dc73cff855315eb302b7..7bdb93869dbbedc786575b5f89f39c4c6d0d76d0 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7231,6 +7231,20 @@  (define_insn "aarch64_crypto_pmulldi"
   [(set_attr "type" "crypto_pmull")]
 )
 
+(define_insn "*aarch64_crypto_pmullv2di"
+  [(set (match_operand:TI 0 "register_operand" "=w")
+        (unspec:TI  [(vec_select:DI
+			(match_operand:V2DI 1 "register_operand" "w")
+			(parallel [
+			  (match_operand:SI 2 "const_int_operand" "Z")]))
+		     (match_operand:DI 3 "register_operand" "w")]
+		    UNSPEC_PMULL))]
+ "TARGET_SIMD && TARGET_AES"
+ "pmull\\t%0.1q, %1.1d, %3.1d"
+  [(set_attr "type" "crypto_pmull")]
+)
+
+
 (define_insn "aarch64_crypto_pmullv2di"
  [(set (match_operand:TI 0 "register_operand" "=w")
        (unspec:TI [(match_operand:V2DI 1 "register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/aarch64/pmull_2.c b/gcc/testsuite/gcc.target/aarch64/pmull_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..d9d47518fab2b582329b6332e3a9c7d97c148192
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pmull_2.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-march=armv8-a+crypto -O3" } */
+
+#include "arm_neon.h"
+
+poly64_t
+testcase (uint8x16_t input, poly64x2_t mask)
+{
+    poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0),
+			       vgetq_lane_p64(mask, 0));
+    poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask);
+    return prodL + prodH;
+}
+
+/* { dg-final { scan-assembler-times "pmull\\tv" 1 } } */