Message ID | patch-13979-tamar@arm.com |
---|---|
State | New |
Headers | show |
Series | [RFC] AArch64: Have RTL patterns recognize DI extracts from vectors at offset 0 as no-op. | expand |
Tamar Christina <tamar.christina@arm.com> writes: > Hi All, > > I have been looking into a class of problems where GCC is not recognizing that > a subreg of lane 0 (using little-endian as example) of a vector register and > passing that to an instruction. > > As an example consider > > poly64_t > testcase (uint8x16_t input, poly64x2_t mask) > { > poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0), > vgetq_lane_p64(mask, 0)); > poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask); > return prodL + prodH; > } > > Where we generate > > testcase: > dup d2, v0.d[0] > dup d3, v1.d[0] > pmull2 v0.1q, v0.2d, v1.2d > pmull v2.1q, v2.1d, v3.1d > add d0, d2, d0 > fmov x0, d0 > ret > > whereas it should have been, which clang generates: > > testcase: > pmull v2.1q, v0.1d, v1.1d > pmull2 v0.1q, v0.2d, v1.2d > add v0.2d, v0.2d, v2.2d > fmov x0, d0 > ret > > Now this can be naively solved by just adding the RTL patterns for the > vec_selects as the example in the patch, but this doesn't solve the overall > problem and I am wondering how to best do this. > > One approach would be to extend combine's noop detection in noop_move_p to > recognize these cases. > > The downside here is that the conversion becomes implicit in the rtl. i.e. > you'll see a SET of a V2DI but a use of DI for that same register. I'm not sure > the semantics of RTL allow such implicit uses? It's OK to set a hard register in one mode and use it in a different mode (without subregs), but it's not possible to do the same using pseudos. > The second approach I can think of is to extend reload to recognize these no-ops > and give the same register and mark the extract as unused such that DSE cleans > it up. > > But there's probably a better approach I didn't think of :) FWIW, for MIPS we tended to handle this kind of thing using matching constraints. E.g. for: (define_insn_and_split "aarch64_simd_mov_from_<mode>low" [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r") (vec_select:<VHALF> (match_operand:VQMOV_NO2E 1 "register_operand" "w,w") (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))] "TARGET_SIMD" "@ # umov\t%0, %1.d[0]" "&& reload_completed && aarch64_simd_register (operands[0], <VHALF>mode)" [(set (match_dup 0) (match_dup 1))] { operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode); } [(set_attr "type" "mov_reg,neon_to_gp<q>") (set_attr "length" "4")] ) use something like "0,w" for operand 1, so that the first alternative can be split to nothing: ;; When TARGET_64BIT, all SImode integer and accumulator registers ;; should already be in sign-extended form (see TARGET_TRULY_NOOP_TRUNCATION ;; and truncdisi2). We can therefore get rid of register->register ;; instructions if we constrain the source to be in the same register as ;; the destination. ;; ;; Only the pre-reload scheduler sees the type of the register alternatives; ;; we split them into nothing before the post-reload scheduler runs. ;; These alternatives therefore have type "move" in order to reflect ;; what happens if the two pre-reload operands cannot be tied, and are ;; instead allocated two separate GPRs. We don't distinguish between ;; the GPR and LO cases because we don't usually know during pre-reload ;; scheduling whether an operand will be LO or not. (define_insn_and_split "extendsidi2" [(set (match_operand:DI 0 "register_operand" "=d,l,d") (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" "0,0,m")))] "TARGET_64BIT" "@ # # lw\t%0,%1" "&& reload_completed && register_operand (operands[1], VOIDmode)" [(const_int 0)] { emit_note (NOTE_INSN_DELETED); DONE; } [(set_attr "move_type" "move,move,load") (set_attr "mode" "DI")]) It'll need some experimentation though. E.g. is it worth providing a w<-w alternative as well, with ? or ^ to disparage it? Independently of that, it might be worth trying to add a memory alternative, so that we can load spilled values directly from memory instead of first reloading the vector. Thanks, Richard
Hi Richard, > -----Original Message----- > From: Richard Sandiford <richard.sandiford@arm.com> > Sent: Monday, January 4, 2021 12:29 PM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw > <Richard.Earnshaw@arm.com>; Marcus Shawcroft > <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com> > Subject: Re: [RFC] AArch64: Have RTL patterns recognize DI extracts from > vectors at offset 0 as no-op. > > Tamar Christina <tamar.christina@arm.com> writes: > > Hi All, > > > > I have been looking into a class of problems where GCC is not > > recognizing that a subreg of lane 0 (using little-endian as example) > > of a vector register and passing that to an instruction. > > > > As an example consider > > > > poly64_t > > testcase (uint8x16_t input, poly64x2_t mask) { > > poly64_t prodL = > vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0), > > vgetq_lane_p64(mask, 0)); > > poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask); > > return prodL + prodH; > > } > > > > Where we generate > > > > testcase: > > dup d2, v0.d[0] > > dup d3, v1.d[0] > > pmull2 v0.1q, v0.2d, v1.2d > > pmull v2.1q, v2.1d, v3.1d > > add d0, d2, d0 > > fmov x0, d0 > > ret > > > > whereas it should have been, which clang generates: > > > > testcase: > > pmull v2.1q, v0.1d, v1.1d > > pmull2 v0.1q, v0.2d, v1.2d > > add v0.2d, v0.2d, v2.2d > > fmov x0, d0 > > ret > > > > Now this can be naively solved by just adding the RTL patterns for the > > vec_selects as the example in the patch, but this doesn't solve the > > overall problem and I am wondering how to best do this. > > > > One approach would be to extend combine's noop detection in > > noop_move_p to recognize these cases. > > > > The downside here is that the conversion becomes implicit in the rtl. i.e. > > you'll see a SET of a V2DI but a use of DI for that same register. > > I'm not sure the semantics of RTL allow such implicit uses? > > It's OK to set a hard register in one mode and use it in a different mode > (without subregs), but it's not possible to do the same using pseudos. > > > The second approach I can think of is to extend reload to recognize > > these no-ops and give the same register and mark the extract as unused > > such that DSE cleans it up. > > > > But there's probably a better approach I didn't think of :) > > FWIW, for MIPS we tended to handle this kind of thing using matching > constraints. E.g. for: > > (define_insn_and_split "aarch64_simd_mov_from_<mode>low" > [(set (match_operand:<VHALF> 0 "register_operand" "=w,?r") > (vec_select:<VHALF> > (match_operand:VQMOV_NO2E 1 "register_operand" "w,w") > (match_operand:VQMOV_NO2E 2 "vect_par_cnst_lo_half" "")))] > "TARGET_SIMD" > "@ > # > umov\t%0, %1.d[0]" > "&& reload_completed && aarch64_simd_register (operands[0], > <VHALF>mode)" > [(set (match_dup 0) (match_dup 1))] > { > operands[1] = aarch64_replace_reg_mode (operands[1], <VHALF>mode); > } > [(set_attr "type" "mov_reg,neon_to_gp<q>") > (set_attr "length" "4")] > ) > > use something like "0,w" for operand 1, so that the first alternative can be > split to nothing: Ah, interesting, I indeed didn't think of this approach. I'll go experiment. Thanks! > > ;; When TARGET_64BIT, all SImode integer and accumulator registers ;; > should already be in sign-extended form (see > TARGET_TRULY_NOOP_TRUNCATION ;; and truncdisi2). We can therefore > get rid of register->register ;; instructions if we constrain the source to be in > the same register as ;; the destination. > ;; > ;; Only the pre-reload scheduler sees the type of the register alternatives; ;; > we split them into nothing before the post-reload scheduler runs. > ;; These alternatives therefore have type "move" in order to reflect ;; what > happens if the two pre-reload operands cannot be tied, and are ;; instead > allocated two separate GPRs. We don't distinguish between ;; the GPR and > LO cases because we don't usually know during pre-reload ;; scheduling > whether an operand will be LO or not. > (define_insn_and_split "extendsidi2" > [(set (match_operand:DI 0 "register_operand" "=d,l,d") > (sign_extend:DI (match_operand:SI 1 "nonimmediate_operand" > "0,0,m")))] > "TARGET_64BIT" > "@ > # > # > lw\t%0,%1" > "&& reload_completed && register_operand (operands[1], VOIDmode)" > [(const_int 0)] > { > emit_note (NOTE_INSN_DELETED); > DONE; > } > [(set_attr "move_type" "move,move,load") > (set_attr "mode" "DI")]) > > It'll need some experimentation though. E.g. is it worth providing a w<-w > alternative as well, with ? or ^ to disparage it? > > Independently of that, it might be worth trying to add a memory alternative, > so that we can load spilled values directly from memory instead of first > reloading the vector. > > Thanks, > Richard
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 05d18f8bd3ac09c56c82dc73cff855315eb302b7..7bdb93869dbbedc786575b5f89f39c4c6d0d76d0 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -7231,6 +7231,20 @@ (define_insn "aarch64_crypto_pmulldi" [(set_attr "type" "crypto_pmull")] ) +(define_insn "*aarch64_crypto_pmullv2di" + [(set (match_operand:TI 0 "register_operand" "=w") + (unspec:TI [(vec_select:DI + (match_operand:V2DI 1 "register_operand" "w") + (parallel [ + (match_operand:SI 2 "const_int_operand" "Z")])) + (match_operand:DI 3 "register_operand" "w")] + UNSPEC_PMULL))] + "TARGET_SIMD && TARGET_AES" + "pmull\\t%0.1q, %1.1d, %3.1d" + [(set_attr "type" "crypto_pmull")] +) + + (define_insn "aarch64_crypto_pmullv2di" [(set (match_operand:TI 0 "register_operand" "=w") (unspec:TI [(match_operand:V2DI 1 "register_operand" "w") diff --git a/gcc/testsuite/gcc.target/aarch64/pmull_2.c b/gcc/testsuite/gcc.target/aarch64/pmull_2.c new file mode 100644 index 0000000000000000000000000000000000000000..d9d47518fab2b582329b6332e3a9c7d97c148192 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pmull_2.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-march=armv8-a+crypto -O3" } */ + +#include "arm_neon.h" + +poly64_t +testcase (uint8x16_t input, poly64x2_t mask) +{ + poly64_t prodL = vmull_p64((poly64_t)vgetq_lane_p64((poly64x2_t)input, 0), + vgetq_lane_p64(mask, 0)); + poly64_t prodH = vmull_high_p64((poly64x2_t)input, mask); + return prodL + prodH; +} + +/* { dg-final { scan-assembler-times "pmull\\tv" 1 } } */