Message ID: 20210607112007.8659-2-christophe.lyon@linaro.org
State: New
Series: [1/2] arm: Auto-vectorization for MVE: vclz
Christophe Lyon <christophe.lyon@linaro.org> writes:
> This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
> vec_pack_trunc_<mode> patterns for MVE.
>
> It does so by moving the unpack patterns from neon.md to
> vec-common.md, while adding MVE support to them.  The pack expander is
> derived from the Neon one (which in turn is renamed to
> neon_quad_vec_pack_trunc_<mode>).
>
> The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
> zero-initialized temporary, which is needed if the
> vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
> instead.
>
> With this patch, we can now vectorize the 16 and 8-bit versions of
> vclz and vshl, although the generated code could still be improved.
> For test_clz_s16, we now generate
>         vldrh.16        q3, [r1]
>         vmovlb.s16      q2, q3
>         vmovlt.s16      q3, q3
>         vclz.i32        q2, q2
>         vclz.i32        q3, q3
>         vmovnb.i32      q1, q2
>         vmovnt.i32      q1, q3
>         vstrh.16        q1, [r0]
> which could be improved to
>         vldrh.16        q3, [r1]
>         vclz.i16        q1, q3
>         vstrh.16        q1, [r0]
> if we could avoid the need for unpack/pack steps.

Yeah, there was a PR about fixing this for popcount.  I guess the same
approach would apply here too.

> For reference, clang-12 generates:
>         vldrh.s32       q0, [r1]
>         vldrh.s32       q1, [r1, #8]
>         vclz.i32        q0, q0
>         vstrh.32        q0, [r0]
>         vclz.i32        q0, q1
>         vstrh.32        q0, [r0, #8]
>
> 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
>
> gcc/
>         * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
>         (mve_vmovlbq_<supf><mode>): Likewise.
>         (mve_vmovnbq_<supf><mode>): Likewise.
>         (mve_vmovntq_<supf><mode>): Likewise.
>         (@mve_vec_pack_trunc_<mode>): New pattern.
>         * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
>         vec-common.md.
>         (vec_unpack<US>_lo_<mode>): Likewise.
>         (vec_pack_trunc_<mode>): Rename to
>         neon_quad_vec_pack_trunc_<mode>.
>         * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
>         pattern.
>         (vec_unpack<US>_lo_<mode>): New.
>         (vec_pack_trunc_<mode>): New.
> > gcc/testsuite/ > * gcc.target/arm/simd/mve-vclz.c: Update expected results. > * gcc.target/arm/simd/mve-vshl.c: Likewise. > --- > gcc/config/arm/mve.md | 20 ++++- > gcc/config/arm/neon.md | 39 +-------- > gcc/config/arm/vec-common.md | 89 ++++++++++++++++++++ > gcc/testsuite/gcc.target/arm/simd/mve-vclz.c | 7 +- > gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 5 +- > 5 files changed, 114 insertions(+), 46 deletions(-) > > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md > index 99e46d0bc69..b18292c07d3 100644 > --- a/gcc/config/arm/mve.md > +++ b/gcc/config/arm/mve.md > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>" > ;; > ;; [vmovltq_u, vmovltq_s]) > ;; > -(define_insn "mve_vmovltq_<supf><mode>" > +(define_insn "@mve_vmovltq_<supf><mode>" > [ > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>" > ;; > ;; [vmovlbq_s, vmovlbq_u]) > ;; > -(define_insn "mve_vmovlbq_<supf><mode>" > +(define_insn "@mve_vmovlbq_<supf><mode>" > [ > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>" > ;; > ;; [vmovnbq_u, vmovnbq_s]) > ;; > -(define_insn "mve_vmovnbq_<supf><mode>" > +(define_insn "@mve_vmovnbq_<supf><mode>" > [ > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>" > ;; > ;; [vmovntq_s, vmovntq_u]) > ;; > -(define_insn "mve_vmovntq_<supf><mode>" > +(define_insn "@mve_vmovntq_<supf><mode>" > [ > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > @@ -2214,6 +2214,18 @@ (define_insn 
"mve_vmovntq_<supf><mode>" > [(set_attr "type" "mve_move") > ]) > > +(define_insn "@mve_vec_pack_trunc_<mode>" > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > + (vec_concat:<V_narrow_pack> > + (truncate:<V_narrow> > + (match_operand:MVE_5 1 "register_operand" "w")) > + (truncate:<V_narrow> > + (match_operand:MVE_5 2 "register_operand" "w"))))] > + "TARGET_HAVE_MVE" > + "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2" > + [(set_attr "type" "mve_move")] > +) > + I realise this is (like you say) based on the neon.md pattern, but we should use separate vmovnb and vmovnt instructions instead of putting two instructions into a single pattern. One specific advantage to using separate patterns is that it would avoid the imprecision of the earlyclobber: the output only conflicts with operand 1 and can be tied to operand 2. > ;; > ;; [vmulq_f]) > ;; > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md > index 0fdffaf4ec4..392d9607919 100644 > --- a/gcc/config/arm/neon.md > +++ b/gcc/config/arm/neon.md > @@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>" > [(set_attr "type" "neon_shift_imm_long")] > ) > > -(define_expand "vec_unpack<US>_hi_<mode>" > - [(match_operand:<V_unpack> 0 "register_operand") > - (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > - "TARGET_NEON && !BYTES_BIG_ENDIAN" > - { > - rtvec v = rtvec_alloc (<V_mode_nunits>/2) ; > - rtx t1; > - int i; > - for (i = 0; i < (<V_mode_nunits>/2); i++) > - RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > - > - t1 = gen_rtx_PARALLEL (<MODE>mode, v); > - emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > - operands[1], > - t1)); > - DONE; > - } > -) > - > -(define_expand "vec_unpack<US>_lo_<mode>" > - [(match_operand:<V_unpack> 0 "register_operand") > - (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > - "TARGET_NEON && !BYTES_BIG_ENDIAN" > - { > - rtvec v = rtvec_alloc (<V_mode_nunits>/2) ; > - rtx t1; > - int i; > - for (i 
= 0; i < (<V_mode_nunits>/2) ; i++) > - RTVEC_ELT (v, i) = GEN_INT (i); > - t1 = gen_rtx_PARALLEL (<MODE>mode, v); > - emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], > - operands[1], > - t1)); > - DONE; > - } > -) > - > (define_insn "neon_vec_<US>mult_lo_<mode>" > [(set (match_operand:<V_unpack> 0 "register_operand" "=w") > (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF> > @@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>" > ; because the ordering of vector elements in Q registers is different from what > ; the semantics of the instructions require. > > -(define_insn "vec_pack_trunc_<mode>" > +(define_insn "neon_quad_vec_pack_trunc_<mode>" > [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > (vec_concat:<V_narrow_pack> > (truncate:<V_narrow> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md > index 1ba1e5eb008..0ffc7a9322c 100644 > --- a/gcc/config/arm/vec-common.md > +++ b/gcc/config/arm/vec-common.md > @@ -638,3 +638,92 @@ (define_expand "clz<mode>2" > emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1])); > DONE; > }) > + > +;; vmovl[tb] are not available for V4SI on MVE > +(define_expand "vec_unpack<US>_hi_<mode>" > + [(match_operand:<V_unpack> 0 "register_operand") > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > + "ARM_HAVE_<MODE>_ARITH > + && !TARGET_REALLY_IWMMXT > + && ! 
(<MODE>mode == V4SImode && TARGET_HAVE_MVE) > + && !BYTES_BIG_ENDIAN" > + { > + if (TARGET_NEON) > + { > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > + rtx t1; > + int i; > + for (i = 0; i < (<V_mode_nunits>/2); i++) > + RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > + > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > + emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > + operands[1], > + t1)); > + } > + else > + { > + emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0], > + operands[1])); It would be good to use the same rtx pattern for the MVE instructions, since the compiler can optimise it better than an unspec. It would be good to have separate tests for the packs and unpacks as well, in case the vclz thing is fixed in future. Thanks, Richard > + } > + DONE; > + } > +) > + > +;; vmovl[tb] are not available for V4SI on MVE > +(define_expand "vec_unpack<US>_lo_<mode>" > + [(match_operand:<V_unpack> 0 "register_operand") > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > + "ARM_HAVE_<MODE>_ARITH > + && !TARGET_REALLY_IWMMXT > + && ! 
(<MODE>mode == V4SImode && TARGET_HAVE_MVE) > + && !BYTES_BIG_ENDIAN" > + { > + if (TARGET_NEON) > + { > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > + rtx t1; > + int i; > + for (i = 0; i < (<V_mode_nunits>/2) ; i++) > + RTVEC_ELT (v, i) = GEN_INT (i); > + > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > + emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], > + operands[1], > + t1)); > + } > + else > + { > + emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0], > + operands[1])); > + } > + DONE; > + } > +) > + > +;; vmovn[tb] are not available for V2DI on MVE > +(define_expand "vec_pack_trunc_<mode>" > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > + (vec_concat:<V_narrow_pack> > + (truncate:<V_narrow> > + (match_operand:VN 1 "register_operand" "w")) > + (truncate:<V_narrow> > + (match_operand:VN 2 "register_operand" "w"))))] > + "ARM_HAVE_<MODE>_ARITH > + && !TARGET_REALLY_IWMMXT > + && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE) > + && !BYTES_BIG_ENDIAN" > + { > + if (TARGET_NEON) > + { > + emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1], > + operands[2])); > + } > + else > + { > + emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1], > + operands[2])); > + } > + DONE; > + } > +) > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > index 7068736bc28..5d6e991cfc6 100644 > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > @@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz) > FUNC(s, int, 8, clz) > FUNC(u, uint, 8, clz) > > -/* 16 and 8-bit versions are not vectorized because they need pack/unpack > - patterns since __builtin_clz uses 32-bit parameter and return value. 
*/ > -/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 2 } } */ > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for > + instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of > + 14 vclz.i32 expected in this testcase. */ > +/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 14 } } */ > /* { dg-final { scan-assembler-times {vclz\.i16 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */ > /* { dg-final { scan-assembler-times {vclz\.i8 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */ > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > index 7a0644997c8..91dd942d818 100644 > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > @@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm) > /* MVE has only 128-bit vectors, so we can vectorize only half of the > functions above. */ > /* We only emit vshl.u, which is equivalent to vshl.s anyway. */ > -/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */ > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for > + instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of > + 14 vshl.i32 expected in this testcase. */ > +/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */ > > /* We emit vshl.i when the shift amount is an immediate. */ > /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
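For context, the testsuite's FUNC macro expands to simple elementwise loops. A reconstruction of the s16 clz case (not the exact testsuite source; the `n` parameter and layout here are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Reconstruction of what FUNC(s, int, 16, clz) in mve-vclz.c expands
   to: an elementwise loop that the vectorizer is expected to turn
   into vclz instructions.  */
void test_clz_s16 (int16_t *dest, int16_t *a, int n)
{
  for (int i = 0; i < n; i++)
    /* __builtin_clz takes and returns a 32-bit int, which is why the
       vectorizer has to widen each half-vector to 32 bits (vmovlb/
       vmovlt), apply vclz.i32, and narrow back (vmovnb/vmovnt).  */
    dest[i] = __builtin_clz (a[i]);
}
```

The 32-bit intermediate type is the reason each vectorized 16-bit iteration needs a widen/narrow pair around the 32-bit vclz, matching the asm sequences quoted above.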
On Tue, 8 Jun 2021 at 14:10, Richard Sandiford <richard.sandiford@arm.com> wrote: > > Christophe Lyon <christophe.lyon@linaro.org> writes: > > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>, > > vec_pack_trunc_<mode> patterns for MVE. > > > > It does so by moving the unpack patterns from neon.md to > > vec-common.md, while adding them support for MVE. The pack expander is > > derived from the Neon one (which in turn is renamed into > > neon_quad_vec_pack_trunc_<mode>). > > > > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a > > zero-initialized temporary, which is needed if the > > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode> > > instead. > > > > With this patch, we can now vectorize the 16 and 8-bit versions of > > vclz and vshl, although the generated code could still be improved. > > For test_clz_s16, we now generate > > vldrh.16 q3, [r1] > > vmovlb.s16 q2, q3 > > vmovlt.s16 q3, q3 > > vclz.i32 q2, q2 > > vclz.i32 q3, q3 > > vmovnb.i32 q1, q2 > > vmovnt.i32 q1, q3 > > vstrh.16 q1, [r0] > > which could be improved to > > vldrh.16 q3, [r1] > > vclz.i16 q1, q3 > > vstrh.16 q1, [r0] > > if we could avoid the need for unpack/pack steps. > > Yeah, there was a PR about fixing this for popcount. I guess the same > approach would apply here too. > > > For reference, clang-12 generates: > > vldrh.s32 q0, [r1] > > vldrh.s32 q1, [r1, #8] > > vclz.i32 q0, q0 > > vstrh.32 q0, [r0] > > vclz.i32 q0, q1 > > vstrh.32 q0, [r0, #8] > > > > 2021-06-03 Christophe Lyon <christophe.lyon@linaro.org> > > > > gcc/ > > * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'. > > (mve_vmovlbq_<supf><mode>): Likewise. > > (mve_vmovnbq_<supf><mode>): Likewise. > > (mve_vmovntq_<supf><mode>): Likewise. > > (@mve_vec_pack_trunc_<mode>): New pattern. > > * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to > > vec-common.md. > > (vec_unpack<US>_lo_<mode>): Likewise. 
> > (vec_pack_trunc_<mode>): Rename to > > neon_quad_vec_pack_trunc_<mode>. > > * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New > > pattern. > > (vec_unpack<US>_lo_<mode>): New. > > (vec_pack_trunc_<mode>): New. > > > > gcc/testsuite/ > > * gcc.target/arm/simd/mve-vclz.c: Update expected results. > > * gcc.target/arm/simd/mve-vshl.c: Likewise. > > --- > > gcc/config/arm/mve.md | 20 ++++- > > gcc/config/arm/neon.md | 39 +-------- > > gcc/config/arm/vec-common.md | 89 ++++++++++++++++++++ > > gcc/testsuite/gcc.target/arm/simd/mve-vclz.c | 7 +- > > gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 5 +- > > 5 files changed, 114 insertions(+), 46 deletions(-) > > > > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md > > index 99e46d0bc69..b18292c07d3 100644 > > --- a/gcc/config/arm/mve.md > > +++ b/gcc/config/arm/mve.md > > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>" > > ;; > > ;; [vmovltq_u, vmovltq_s]) > > ;; > > -(define_insn "mve_vmovltq_<supf><mode>" > > +(define_insn "@mve_vmovltq_<supf><mode>" > > [ > > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>" > > ;; > > ;; [vmovlbq_s, vmovlbq_u]) > > ;; > > -(define_insn "mve_vmovlbq_<supf><mode>" > > +(define_insn "@mve_vmovlbq_<supf><mode>" > > [ > > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>" > > ;; > > ;; [vmovnbq_u, vmovnbq_s]) > > ;; > > -(define_insn "mve_vmovnbq_<supf><mode>" > > +(define_insn "@mve_vmovnbq_<supf><mode>" > > [ > > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>" > > ;; > > ;; 
[vmovntq_s, vmovntq_u]) > > ;; > > -(define_insn "mve_vmovntq_<supf><mode>" > > +(define_insn "@mve_vmovntq_<supf><mode>" > > [ > > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>" > > [(set_attr "type" "mve_move") > > ]) > > > > +(define_insn "@mve_vec_pack_trunc_<mode>" > > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > > + (vec_concat:<V_narrow_pack> > > + (truncate:<V_narrow> > > + (match_operand:MVE_5 1 "register_operand" "w")) > > + (truncate:<V_narrow> > > + (match_operand:MVE_5 2 "register_operand" "w"))))] > > + "TARGET_HAVE_MVE" > > + "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2" > > + [(set_attr "type" "mve_move")] > > +) > > + > > I realise this is (like you say) based on the neon.md pattern, > but we should use separate vmovnb and vmovnt instructions instead > of putting two instructions into a single pattern. > > One specific advantage to using separate patterns is that it would > avoid the imprecision of the earlyclobber: the output only conflicts > with operand 1 and can be tied to operand 2. 
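For readers less familiar with md constraint syntax, the distinction being drawn here can be illustrated with a generic fragment (not code from the patch):

```
;; "=&w": earlyclobber.  The register allocator assumes the output may
;; be written before the last input is read, so the output must not
;; share a register with *any* input -- coarser than necessary when
;; only one of the two inputs actually conflicts with the output.
(match_operand:V8HI 0 "s_register_operand" "=&w")

;; A "0" constraint on an input ties that input to operand 0: they are
;; allocated the same register, as the existing mve_vmovnbq/mve_vmovntq
;; insns already do for their accumulator operand.
(match_operand:V8HI 1 "s_register_operand" "0")
```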
>
I'm not sure I follow: with the current patch, for test_clz_s16 we generate:
        vldrh.16        q3, [r1]
        vmovlb.s16      q2, q3
        vmovlt.s16      q3, q3
        vclz.i32        q2, q2
        vclz.i32        q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]

If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
and update define_expand "vec_pack_trunc_<mode>" to call:
  emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
                              operands[0], operands[1]));
  emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
                              operands[0], operands[2]));
instead, we now generate:
        vldrh.16        q3, [r1]
        vmovlb.s16      q1, q3
        vldr.64         d4, .L5
        vldr.64         d5, .L5+8
        vmovlt.s16      q3, q3
        vclz.i32        q1, q1
        vclz.i32        q3, q3
        vmovnb.i32      q2, q1
        vmovnt.i32      q2, q3
        vstrh.16        q2, [r0]
        bx              lr
.L6:
        .align  3
.L5:
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0

That's what I tried to explain in the commit message.

I guess I've misunderstood your comment?

Thanks,

Christophe

> > ;; > > ;; [vmulq_f]) > > ;; > > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md > > index 0fdffaf4ec4..392d9607919 100644 > > --- a/gcc/config/arm/neon.md > > +++ b/gcc/config/arm/neon.md > > @@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>" > > [(set_attr "type" "neon_shift_imm_long")] > > ) > > > > -(define_expand "vec_unpack<US>_hi_<mode>" > > - [(match_operand:<V_unpack> 0 "register_operand") > > - (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > - "TARGET_NEON && !BYTES_BIG_ENDIAN" > > - { > > - rtvec v = rtvec_alloc (<V_mode_nunits>/2) ; > > - rtx t1; > > - int i; > > - for (i = 0; i < (<V_mode_nunits>/2); i++) > > - RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > > - > > - t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > - emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > > - operands[1], > > - t1)); > > - DONE; > > - } > > -) > > - > > -(define_expand "vec_unpack<US>_lo_<mode>" > > - [(match_operand:<V_unpack> 0 "register_operand") > > - (SE:<V_unpack> (match_operand:VU 1
"register_operand"))] > > - "TARGET_NEON && !BYTES_BIG_ENDIAN" > > - { > > - rtvec v = rtvec_alloc (<V_mode_nunits>/2) ; > > - rtx t1; > > - int i; > > - for (i = 0; i < (<V_mode_nunits>/2) ; i++) > > - RTVEC_ELT (v, i) = GEN_INT (i); > > - t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > - emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], > > - operands[1], > > - t1)); > > - DONE; > > - } > > -) > > - > > (define_insn "neon_vec_<US>mult_lo_<mode>" > > [(set (match_operand:<V_unpack> 0 "register_operand" "=w") > > (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF> > > @@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>" > > ; because the ordering of vector elements in Q registers is different from what > > ; the semantics of the instructions require. > > > > -(define_insn "vec_pack_trunc_<mode>" > > +(define_insn "neon_quad_vec_pack_trunc_<mode>" > > [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > > (vec_concat:<V_narrow_pack> > > (truncate:<V_narrow> > > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md > > index 1ba1e5eb008..0ffc7a9322c 100644 > > --- a/gcc/config/arm/vec-common.md > > +++ b/gcc/config/arm/vec-common.md > > @@ -638,3 +638,92 @@ (define_expand "clz<mode>2" > > emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1])); > > DONE; > > }) > > + > > +;; vmovl[tb] are not available for V4SI on MVE > > +(define_expand "vec_unpack<US>_hi_<mode>" > > + [(match_operand:<V_unpack> 0 "register_operand") > > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > + "ARM_HAVE_<MODE>_ARITH > > + && !TARGET_REALLY_IWMMXT > > + && ! 
(<MODE>mode == V4SImode && TARGET_HAVE_MVE) > > + && !BYTES_BIG_ENDIAN" > > + { > > + if (TARGET_NEON) > > + { > > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > > + rtx t1; > > + int i; > > + for (i = 0; i < (<V_mode_nunits>/2); i++) > > + RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > > + > > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > + emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > > + operands[1], > > + t1)); > > + } > > + else > > + { > > + emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0], > > + operands[1])); > > It would be good to use the same rtx pattern for the MVE instructions, > since the compiler can optimise it better than an unspec. > > It would be good to have separate tests for the packs and unpacks > as well, in case the vclz thing is fixed in future. > > Thanks, > Richard > > > + } > > + DONE; > > + } > > +) > > + > > +;; vmovl[tb] are not available for V4SI on MVE > > +(define_expand "vec_unpack<US>_lo_<mode>" > > + [(match_operand:<V_unpack> 0 "register_operand") > > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > + "ARM_HAVE_<MODE>_ARITH > > + && !TARGET_REALLY_IWMMXT > > + && ! 
(<MODE>mode == V4SImode && TARGET_HAVE_MVE) > > + && !BYTES_BIG_ENDIAN" > > + { > > + if (TARGET_NEON) > > + { > > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > > + rtx t1; > > + int i; > > + for (i = 0; i < (<V_mode_nunits>/2) ; i++) > > + RTVEC_ELT (v, i) = GEN_INT (i); > > + > > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > + emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], > > + operands[1], > > + t1)); > > + } > > + else > > + { > > + emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0], > > + operands[1])); > > + } > > + DONE; > > + } > > +) > > + > > +;; vmovn[tb] are not available for V2DI on MVE > > +(define_expand "vec_pack_trunc_<mode>" > > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > > + (vec_concat:<V_narrow_pack> > > + (truncate:<V_narrow> > > + (match_operand:VN 1 "register_operand" "w")) > > + (truncate:<V_narrow> > > + (match_operand:VN 2 "register_operand" "w"))))] > > + "ARM_HAVE_<MODE>_ARITH > > + && !TARGET_REALLY_IWMMXT > > + && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE) > > + && !BYTES_BIG_ENDIAN" > > + { > > + if (TARGET_NEON) > > + { > > + emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1], > > + operands[2])); > > + } > > + else > > + { > > + emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1], > > + operands[2])); > > + } > > + DONE; > > + } > > +) > > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > > index 7068736bc28..5d6e991cfc6 100644 > > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > > @@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz) > > FUNC(s, int, 8, clz) > > FUNC(u, uint, 8, clz) > > > > -/* 16 and 8-bit versions are not vectorized because they need pack/unpack > > - patterns since __builtin_clz uses 32-bit parameter and return value. 
*/ > > -/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 2 } } */ > > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for > > + instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of > > + 14 vclz.i32 expected in this testcase. */ > > +/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 14 } } */ > > /* { dg-final { scan-assembler-times {vclz\.i16 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */ > > /* { dg-final { scan-assembler-times {vclz\.i8 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */ > > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > > index 7a0644997c8..91dd942d818 100644 > > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > > @@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm) > > /* MVE has only 128-bit vectors, so we can vectorize only half of the > > functions above. */ > > /* We only emit vshl.u, which is equivalent to vshl.s anyway. */ > > -/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */ > > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for > > + instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of > > + 14 vshl.i32 expected in this testcase. */ > > +/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */ > > > > /* We emit vshl.i when the shift amount is an immediate. */ > > /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
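Richard's unspec suggestion presumably means giving the MVE widening moves the same kind of extend-of-vec_select RTL as neon_vec_unpack<US>_hi_<mode>. A rough sketch for one case (illustrative only: the pattern name is hypothetical, and note that MVE's vmovlt widens the odd-numbered lanes, not a contiguous high half as on Neon):

```
;; Hypothetical sketch: vmovlt.s16 as sign-extension of the odd lanes
;; of a V8HI source, expressed without an unspec so the RTL optimisers
;; can see through it.
(define_insn "*mve_vmovlt_s16_sketch"
  [(set (match_operand:V4SI 0 "s_register_operand" "=w")
        (sign_extend:V4SI
          (vec_select:V4HI
            (match_operand:V8HI 1 "s_register_operand" "w")
            (parallel [(const_int 1) (const_int 3)
                       (const_int 5) (const_int 7)]))))]
  "TARGET_HAVE_MVE"
  "vmovlt.s16\t%q0, %q1"
  [(set_attr "type" "mve_move")]
)
```

The real pattern would need the proper mode iterators and <US>/<supf> handling; this only shows the unspec-free shape being asked for.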
On Thu, 10 Jun 2021 at 11:50, Christophe Lyon <christophe.lyon@linaro.org> wrote: > > On Tue, 8 Jun 2021 at 14:10, Richard Sandiford > <richard.sandiford@arm.com> wrote: > > > > Christophe Lyon <christophe.lyon@linaro.org> writes: > > > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>, > > > vec_pack_trunc_<mode> patterns for MVE. > > > > > > It does so by moving the unpack patterns from neon.md to > > > vec-common.md, while adding them support for MVE. The pack expander is > > > derived from the Neon one (which in turn is renamed into > > > neon_quad_vec_pack_trunc_<mode>). > > > > > > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a > > > zero-initialized temporary, which is needed if the > > > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode> > > > instead. > > > > > > With this patch, we can now vectorize the 16 and 8-bit versions of > > > vclz and vshl, although the generated code could still be improved. > > > For test_clz_s16, we now generate > > > vldrh.16 q3, [r1] > > > vmovlb.s16 q2, q3 > > > vmovlt.s16 q3, q3 > > > vclz.i32 q2, q2 > > > vclz.i32 q3, q3 > > > vmovnb.i32 q1, q2 > > > vmovnt.i32 q1, q3 > > > vstrh.16 q1, [r0] > > > which could be improved to > > > vldrh.16 q3, [r1] > > > vclz.i16 q1, q3 > > > vstrh.16 q1, [r0] > > > if we could avoid the need for unpack/pack steps. > > > > Yeah, there was a PR about fixing this for popcount. I guess the same > > approach would apply here too. > > > > > For reference, clang-12 generates: > > > vldrh.s32 q0, [r1] > > > vldrh.s32 q1, [r1, #8] > > > vclz.i32 q0, q0 > > > vstrh.32 q0, [r0] > > > vclz.i32 q0, q1 > > > vstrh.32 q0, [r0, #8] > > > > > > 2021-06-03 Christophe Lyon <christophe.lyon@linaro.org> > > > > > > gcc/ > > > * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'. > > > (mve_vmovlbq_<supf><mode>): Likewise. > > > (mve_vmovnbq_<supf><mode>): Likewise. > > > (mve_vmovntq_<supf><mode>): Likewise. 
> > > (@mve_vec_pack_trunc_<mode>): New pattern. > > > * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to > > > vec-common.md. > > > (vec_unpack<US>_lo_<mode>): Likewise. > > > (vec_pack_trunc_<mode>): Rename to > > > neon_quad_vec_pack_trunc_<mode>. > > > * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New > > > pattern. > > > (vec_unpack<US>_lo_<mode>): New. > > > (vec_pack_trunc_<mode>): New. > > > > > > gcc/testsuite/ > > > * gcc.target/arm/simd/mve-vclz.c: Update expected results. > > > * gcc.target/arm/simd/mve-vshl.c: Likewise. > > > --- > > > gcc/config/arm/mve.md | 20 ++++- > > > gcc/config/arm/neon.md | 39 +-------- > > > gcc/config/arm/vec-common.md | 89 ++++++++++++++++++++ > > > gcc/testsuite/gcc.target/arm/simd/mve-vclz.c | 7 +- > > > gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 5 +- > > > 5 files changed, 114 insertions(+), 46 deletions(-) > > > > > > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md > > > index 99e46d0bc69..b18292c07d3 100644 > > > --- a/gcc/config/arm/mve.md > > > +++ b/gcc/config/arm/mve.md > > > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>" > > > ;; > > > ;; [vmovltq_u, vmovltq_s]) > > > ;; > > > -(define_insn "mve_vmovltq_<supf><mode>" > > > +(define_insn "@mve_vmovltq_<supf><mode>" > > > [ > > > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > > > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > > > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>" > > > ;; > > > ;; [vmovlbq_s, vmovlbq_u]) > > > ;; > > > -(define_insn "mve_vmovlbq_<supf><mode>" > > > +(define_insn "@mve_vmovlbq_<supf><mode>" > > > [ > > > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > > > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > > > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>" > > > ;; > > > ;; [vmovnbq_u, vmovnbq_s]) > > > ;; > > > -(define_insn "mve_vmovnbq_<supf><mode>" > > > +(define_insn 
"@mve_vmovnbq_<supf><mode>" > > > [ > > > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > > > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > > > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>" > > > ;; > > > ;; [vmovntq_s, vmovntq_u]) > > > ;; > > > -(define_insn "mve_vmovntq_<supf><mode>" > > > +(define_insn "@mve_vmovntq_<supf><mode>" > > > [ > > > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > > > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > > > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>" > > > [(set_attr "type" "mve_move") > > > ]) > > > > > > +(define_insn "@mve_vec_pack_trunc_<mode>" > > > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > > > + (vec_concat:<V_narrow_pack> > > > + (truncate:<V_narrow> > > > + (match_operand:MVE_5 1 "register_operand" "w")) > > > + (truncate:<V_narrow> > > > + (match_operand:MVE_5 2 "register_operand" "w"))))] > > > + "TARGET_HAVE_MVE" > > > + "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2" > > > + [(set_attr "type" "mve_move")] > > > +) > > > + > > > > I realise this is (like you say) based on the neon.md pattern, > > but we should use separate vmovnb and vmovnt instructions instead > > of putting two instructions into a single pattern. > > > > One specific advantage to using separate patterns is that it would > > avoid the imprecision of the earlyclobber: the output only conflicts > > with operand 1 and can be tied to operand 2. 
> > > I'm not sure to follow: with the current patch, for test_clz_s16 we generate: > vldrh.16 q3, [r1] > vmovlb.s16 q2, q3 > vmovlt.s16 q3, q3 > vclz.i32 q2, q2 > vclz.i32 q3, q3 > vmovnb.i32 q1, q2 > vmovnt.i32 q1, q3 > vstrh.16 q1, [r0] > > If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern > and update define_expand "vec_pack_trunc_<mode>" to call: > emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0], > operands[0], operands[1])); > emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0], > operands[0], operands[2])); > instead, we now generate: > vldrh.16 q3, [r1] > vmovlb.s16 q1, q3 > vldr.64 d4, .L5 > vldr.64 d5, .L5+8 > vmovlt.s16 q3, q3 > vclz.i32 q1, q1 > vclz.i32 q3, q3 > vmovnb.i32 q2, q1 > vmovnt.i32 q2, q3 > vstrh.16 q2, [r0] > bx lr > .L6: > .align 3 > .L5: > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > > That's what I tried to explain in the commit message. > > I guess I've misunderstood your comment? 
> > Thanks, > > Christophe > > > > ;; > > > ;; [vmulq_f]) > > > ;; > > > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md > > > index 0fdffaf4ec4..392d9607919 100644 > > > --- a/gcc/config/arm/neon.md > > > +++ b/gcc/config/arm/neon.md > > > @@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>" > > > [(set_attr "type" "neon_shift_imm_long")] > > > ) > > > > > > -(define_expand "vec_unpack<US>_hi_<mode>" > > > - [(match_operand:<V_unpack> 0 "register_operand") > > > - (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > > - "TARGET_NEON && !BYTES_BIG_ENDIAN" > > > - { > > > - rtvec v = rtvec_alloc (<V_mode_nunits>/2) ; > > > - rtx t1; > > > - int i; > > > - for (i = 0; i < (<V_mode_nunits>/2); i++) > > > - RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > > > - > > > - t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > > - emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > > > - operands[1], > > > - t1)); > > > - DONE; > > > - } > > > -) > > > - > > > -(define_expand "vec_unpack<US>_lo_<mode>" > > > - [(match_operand:<V_unpack> 0 "register_operand") > > > - (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > > - "TARGET_NEON && !BYTES_BIG_ENDIAN" > > > - { > > > - rtvec v = rtvec_alloc (<V_mode_nunits>/2) ; > > > - rtx t1; > > > - int i; > > > - for (i = 0; i < (<V_mode_nunits>/2) ; i++) > > > - RTVEC_ELT (v, i) = GEN_INT (i); > > > - t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > > - emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], > > > - operands[1], > > > - t1)); > > > - DONE; > > > - } > > > -) > > > - > > > (define_insn "neon_vec_<US>mult_lo_<mode>" > > > [(set (match_operand:<V_unpack> 0 "register_operand" "=w") > > > (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF> > > > @@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>" > > > ; because the ordering of vector elements in Q registers is different from what > > > ; the semantics of the instructions require. 
> > > > > > -(define_insn "vec_pack_trunc_<mode>" > > > +(define_insn "neon_quad_vec_pack_trunc_<mode>" > > > [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > > > (vec_concat:<V_narrow_pack> > > > (truncate:<V_narrow> > > > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md > > > index 1ba1e5eb008..0ffc7a9322c 100644 > > > --- a/gcc/config/arm/vec-common.md > > > +++ b/gcc/config/arm/vec-common.md > > > @@ -638,3 +638,92 @@ (define_expand "clz<mode>2" > > > emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1])); > > > DONE; > > > }) > > > + > > > +;; vmovl[tb] are not available for V4SI on MVE > > > +(define_expand "vec_unpack<US>_hi_<mode>" > > > + [(match_operand:<V_unpack> 0 "register_operand") > > > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > > + "ARM_HAVE_<MODE>_ARITH > > > + && !TARGET_REALLY_IWMMXT > > > + && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE) > > > + && !BYTES_BIG_ENDIAN" > > > + { > > > + if (TARGET_NEON) > > > + { > > > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > > > + rtx t1; > > > + int i; > > > + for (i = 0; i < (<V_mode_nunits>/2); i++) > > > + RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > > > + > > > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > > + emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > > > + operands[1], > > > + t1)); > > > + } > > > + else > > > + { > > > + emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0], > > > + operands[1])); > > > > It would be good to use the same rtx pattern for the MVE instructions, > > since the compiler can optimise it better than an unspec. > > > > It would be good to have separate tests for the packs and unpacks > > as well, in case the vclz thing is fixed in future. 
> > And while writing new tests, I noticed that the same problem happens with unpack as I proposed it:

test_pack_s32:
	mov	r3, r1
	vldr.64	d6, .L3
	vldr.64	d7, .L3+8
	vldrw.32	q1, [r3]
	adds	r1, r1, #16
	vldrw.32	q2, [r1]
	vmovnb.i32	q3, q1
	vmovnt.i32	q3, q2
	vstrh.16	q3, [r0]
	bx	lr
.L4:
	.align	3
.L3:
	.short	0
	.short	0
	.short	0
	.short	0
	.short	0
	.short	0
	.short	0
	.short	0

We shouldn't need to initialize d6/d7 (q3)

> > Thanks, > > Richard > > > > > + } > > > + DONE; > > > + } > > > +) > > > + > > > +;; vmovl[tb] are not available for V4SI on MVE > > > +(define_expand "vec_unpack<US>_lo_<mode>" > > > + [(match_operand:<V_unpack> 0 "register_operand") > > > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > > > + "ARM_HAVE_<MODE>_ARITH > > > + && !TARGET_REALLY_IWMMXT > > > + && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE) > > > + && !BYTES_BIG_ENDIAN" > > > + { > > > + if (TARGET_NEON) > > > + { > > > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > > > + rtx t1; > > > + int i; > > > + for (i = 0; i < (<V_mode_nunits>/2) ; i++) > > > + RTVEC_ELT (v, i) = GEN_INT (i); > > > + > > > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > > > + emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], > > > + operands[1], > > > + t1)); > > > + } > > > + else > > > + { > > > + emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0], > > > + operands[1])); > > > + } > > > + DONE; > > > + } > > > +) > > > + > > > +;; vmovn[tb] are not available for V2DI on MVE > > > +(define_expand "vec_pack_trunc_<mode>" > > > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > > > + (vec_concat:<V_narrow_pack> > > > + (truncate:<V_narrow> > > > + (match_operand:VN 1 "register_operand" "w")) > > > + (truncate:<V_narrow> > > > + (match_operand:VN 2 "register_operand" "w"))))] > > > + "ARM_HAVE_<MODE>_ARITH > > > + && !TARGET_REALLY_IWMMXT > > > + && !
(<MODE>mode == V2DImode && TARGET_HAVE_MVE) > > > + && !BYTES_BIG_ENDIAN" > > > + { > > > + if (TARGET_NEON) > > > + { > > > + emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1], > > > + operands[2])); > > > + } > > > + else > > > + { > > > + emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1], > > > + operands[2])); > > > + } > > > + DONE; > > > + } > > > +) > > > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > > > index 7068736bc28..5d6e991cfc6 100644 > > > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > > > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c > > > @@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz) > > > FUNC(s, int, 8, clz) > > > FUNC(u, uint, 8, clz) > > > > > > -/* 16 and 8-bit versions are not vectorized because they need pack/unpack > > > - patterns since __builtin_clz uses 32-bit parameter and return value. */ > > > -/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 2 } } */ > > > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for > > > + instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of > > > + 14 vclz.i32 expected in this testcase. */ > > > +/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 14 } } */ > > > /* { dg-final { scan-assembler-times {vclz\.i16 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */ > > > /* { dg-final { scan-assembler-times {vclz\.i8 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */ > > > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > > > index 7a0644997c8..91dd942d818 100644 > > > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > > > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c > > > @@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm) > > > /* MVE has only 128-bit vectors, so we can vectorize only half of the > > > functions above. 
*/ > > > /* We only emit vshl.u, which is equivalent to vshl.s anyway. */ > > > -/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */ > > > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for > > > + instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of > > > + 14 vshl.i32 expected in this testcase. */ > > > +/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */ > > > > > > /* We emit vshl.i when the shift amount is an immediate. */ > > > /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
Christophe Lyon <christophe.lyon@linaro.org> writes: > On Tue, 8 Jun 2021 at 14:10, Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> Christophe Lyon <christophe.lyon@linaro.org> writes: >> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>, >> > vec_pack_trunc_<mode> patterns for MVE. >> > >> > It does so by moving the unpack patterns from neon.md to >> > vec-common.md, while adding them support for MVE. The pack expander is >> > derived from the Neon one (which in turn is renamed into >> > neon_quad_vec_pack_trunc_<mode>). >> > >> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a >> > zero-initialized temporary, which is needed if the >> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode> >> > instead. >> > >> > With this patch, we can now vectorize the 16 and 8-bit versions of >> > vclz and vshl, although the generated code could still be improved. >> > For test_clz_s16, we now generate >> > vldrh.16 q3, [r1] >> > vmovlb.s16 q2, q3 >> > vmovlt.s16 q3, q3 >> > vclz.i32 q2, q2 >> > vclz.i32 q3, q3 >> > vmovnb.i32 q1, q2 >> > vmovnt.i32 q1, q3 >> > vstrh.16 q1, [r0] >> > which could be improved to >> > vldrh.16 q3, [r1] >> > vclz.i16 q1, q3 >> > vstrh.16 q1, [r0] >> > if we could avoid the need for unpack/pack steps. >> >> Yeah, there was a PR about fixing this for popcount. I guess the same >> approach would apply here too. >> >> > For reference, clang-12 generates: >> > vldrh.s32 q0, [r1] >> > vldrh.s32 q1, [r1, #8] >> > vclz.i32 q0, q0 >> > vstrh.32 q0, [r0] >> > vclz.i32 q0, q1 >> > vstrh.32 q0, [r0, #8] >> > >> > 2021-06-03 Christophe Lyon <christophe.lyon@linaro.org> >> > >> > gcc/ >> > * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'. >> > (mve_vmovlbq_<supf><mode>): Likewise. >> > (mve_vmovnbq_<supf><mode>): Likewise. >> > (mve_vmovntq_<supf><mode>): Likewise. >> > (@mve_vec_pack_trunc_<mode>): New pattern. 
>> > * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to >> > vec-common.md. >> > (vec_unpack<US>_lo_<mode>): Likewise. >> > (vec_pack_trunc_<mode>): Rename to >> > neon_quad_vec_pack_trunc_<mode>. >> > * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New >> > pattern. >> > (vec_unpack<US>_lo_<mode>): New. >> > (vec_pack_trunc_<mode>): New. >> > >> > gcc/testsuite/ >> > * gcc.target/arm/simd/mve-vclz.c: Update expected results. >> > * gcc.target/arm/simd/mve-vshl.c: Likewise. >> > --- >> > gcc/config/arm/mve.md | 20 ++++- >> > gcc/config/arm/neon.md | 39 +-------- >> > gcc/config/arm/vec-common.md | 89 ++++++++++++++++++++ >> > gcc/testsuite/gcc.target/arm/simd/mve-vclz.c | 7 +- >> > gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 5 +- >> > 5 files changed, 114 insertions(+), 46 deletions(-) >> > >> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md >> > index 99e46d0bc69..b18292c07d3 100644 >> > --- a/gcc/config/arm/mve.md >> > +++ b/gcc/config/arm/mve.md >> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>" >> > ;; >> > ;; [vmovltq_u, vmovltq_s]) >> > ;; >> > -(define_insn "mve_vmovltq_<supf><mode>" >> > +(define_insn "@mve_vmovltq_<supf><mode>" >> > [ >> > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") >> > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] >> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>" >> > ;; >> > ;; [vmovlbq_s, vmovlbq_u]) >> > ;; >> > -(define_insn "mve_vmovlbq_<supf><mode>" >> > +(define_insn "@mve_vmovlbq_<supf><mode>" >> > [ >> > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") >> > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] >> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>" >> > ;; >> > ;; [vmovnbq_u, vmovnbq_s]) >> > ;; >> > -(define_insn "mve_vmovnbq_<supf><mode>" >> > +(define_insn "@mve_vmovnbq_<supf><mode>" >> > [ >> > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" 
"=w") >> > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") >> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>" >> > ;; >> > ;; [vmovntq_s, vmovntq_u]) >> > ;; >> > -(define_insn "mve_vmovntq_<supf><mode>" >> > +(define_insn "@mve_vmovntq_<supf><mode>" >> > [ >> > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") >> > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") >> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>" >> > [(set_attr "type" "mve_move") >> > ]) >> > >> > +(define_insn "@mve_vec_pack_trunc_<mode>" >> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") >> > + (vec_concat:<V_narrow_pack> >> > + (truncate:<V_narrow> >> > + (match_operand:MVE_5 1 "register_operand" "w")) >> > + (truncate:<V_narrow> >> > + (match_operand:MVE_5 2 "register_operand" "w"))))] >> > + "TARGET_HAVE_MVE" >> > + "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2" >> > + [(set_attr "type" "mve_move")] >> > +) >> > + >> >> I realise this is (like you say) based on the neon.md pattern, >> but we should use separate vmovnb and vmovnt instructions instead >> of putting two instructions into a single pattern. >> >> One specific advantage to using separate patterns is that it would >> avoid the imprecision of the earlyclobber: the output only conflicts >> with operand 1 and can be tied to operand 2. 
>> > I'm not sure to follow: with the current patch, for test_clz_s16 we generate: > vldrh.16 q3, [r1] > vmovlb.s16 q2, q3 > vmovlt.s16 q3, q3 > vclz.i32 q2, q2 > vclz.i32 q3, q3 > vmovnb.i32 q1, q2 > vmovnt.i32 q1, q3 > vstrh.16 q1, [r0] > > If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern > and update define_expand "vec_pack_trunc_<mode>" to call: > emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0], > operands[0], operands[1])); > emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0], > operands[0], operands[2])); > instead, we now generate: > vldrh.16 q3, [r1] > vmovlb.s16 q1, q3 > vldr.64 d4, .L5 > vldr.64 d5, .L5+8 > vmovlt.s16 q3, q3 > vclz.i32 q1, q1 > vclz.i32 q3, q3 > vmovnb.i32 q2, q1 > vmovnt.i32 q2, q3 > vstrh.16 q2, [r0] > bx lr > .L6: > .align 3 > .L5: > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > > That's what I tried to explain in the commit message. Sorry, I missed that part of the message. If we want the first instruction to have an in/out operand whose initial value is undefined then I think the best way would be to have a second pattern for vmovnb (only) that acts like a unary operation. FWIW, in the above, we shouldn't assign to operands[0] twice. The result of the first operation needs to be a new temporary register. Thanks, Richard
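To make the vmovnb/vmovnt discussion above concrete, here is a rough, illustrative model (plain Python, not GCC code) of how the bottom/top narrowing pair fills a destination register; the lane behaviour follows the Armv8.1-M MVE descriptions of VMOVNB/VMOVNT, treating a q register as a list of narrow lanes. It shows why, once both instructions have run, every lane of the destination has been written, so the initial contents of the destination register are dead:

```python
def vmovnb(dest, src):
    # vmovnb writes the truncated lanes of src into the even-numbered
    # (bottom) narrow lanes of dest, leaving the odd lanes unchanged.
    out = list(dest)
    for i, v in enumerate(src):
        out[2 * i] = v & 0xFFFF  # truncate 32 -> 16 bits
    return out

def vmovnt(dest, src):
    # vmovnt writes the truncated lanes of src into the odd-numbered
    # (top) narrow lanes of dest, leaving the even lanes unchanged.
    out = list(dest)
    for i, v in enumerate(src):
        out[2 * i + 1] = v & 0xFFFF
    return out

# A vmovnb/vmovnt pair: after both instructions every lane of the
# result has been written, so the expander should not need a
# zero-initialized temporary for the destination.
lo = [0x10001, 0x10002, 0x10003, 0x10004]   # four 32-bit lanes
hi = [0x20005, 0x20006, 0x20007, 0x20008]
undef = [None] * 8                          # "uninitialized" q register
packed = vmovnt(vmovnb(undef, lo), hi)
print(packed)  # prints [1, 5, 2, 6, 3, 7, 4, 8]
```

In the pattern-level terms used above: vmovnt naturally ties its output to its first input (the "0" constraint), while a unary-style vmovnb pattern avoids requiring an initialized input at all.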
On Fri, 11 Jun 2021 at 11:38, Richard Sandiford <richard.sandiford@arm.com> wrote: > > Christophe Lyon <christophe.lyon@linaro.org> writes: > > On Tue, 8 Jun 2021 at 14:10, Richard Sandiford > > <richard.sandiford@arm.com> wrote: > >> > >> Christophe Lyon <christophe.lyon@linaro.org> writes: > >> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>, > >> > vec_pack_trunc_<mode> patterns for MVE. > >> > > >> > It does so by moving the unpack patterns from neon.md to > >> > vec-common.md, while adding them support for MVE. The pack expander is > >> > derived from the Neon one (which in turn is renamed into > >> > neon_quad_vec_pack_trunc_<mode>). > >> > > >> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a > >> > zero-initialized temporary, which is needed if the > >> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode> > >> > instead. > >> > > >> > With this patch, we can now vectorize the 16 and 8-bit versions of > >> > vclz and vshl, although the generated code could still be improved. > >> > For test_clz_s16, we now generate > >> > vldrh.16 q3, [r1] > >> > vmovlb.s16 q2, q3 > >> > vmovlt.s16 q3, q3 > >> > vclz.i32 q2, q2 > >> > vclz.i32 q3, q3 > >> > vmovnb.i32 q1, q2 > >> > vmovnt.i32 q1, q3 > >> > vstrh.16 q1, [r0] > >> > which could be improved to > >> > vldrh.16 q3, [r1] > >> > vclz.i16 q1, q3 > >> > vstrh.16 q1, [r0] > >> > if we could avoid the need for unpack/pack steps. > >> > >> Yeah, there was a PR about fixing this for popcount. I guess the same > >> approach would apply here too. > >> > >> > For reference, clang-12 generates: > >> > vldrh.s32 q0, [r1] > >> > vldrh.s32 q1, [r1, #8] > >> > vclz.i32 q0, q0 > >> > vstrh.32 q0, [r0] > >> > vclz.i32 q0, q1 > >> > vstrh.32 q0, [r0, #8] > >> > > >> > 2021-06-03 Christophe Lyon <christophe.lyon@linaro.org> > >> > > >> > gcc/ > >> > * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'. > >> > (mve_vmovlbq_<supf><mode>): Likewise. 
> >> > (mve_vmovnbq_<supf><mode>): Likewise. > >> > (mve_vmovntq_<supf><mode>): Likewise. > >> > (@mve_vec_pack_trunc_<mode>): New pattern. > >> > * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to > >> > vec-common.md. > >> > (vec_unpack<US>_lo_<mode>): Likewise. > >> > (vec_pack_trunc_<mode>): Rename to > >> > neon_quad_vec_pack_trunc_<mode>. > >> > * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New > >> > pattern. > >> > (vec_unpack<US>_lo_<mode>): New. > >> > (vec_pack_trunc_<mode>): New. > >> > > >> > gcc/testsuite/ > >> > * gcc.target/arm/simd/mve-vclz.c: Update expected results. > >> > * gcc.target/arm/simd/mve-vshl.c: Likewise. > >> > --- > >> > gcc/config/arm/mve.md | 20 ++++- > >> > gcc/config/arm/neon.md | 39 +-------- > >> > gcc/config/arm/vec-common.md | 89 ++++++++++++++++++++ > >> > gcc/testsuite/gcc.target/arm/simd/mve-vclz.c | 7 +- > >> > gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 5 +- > >> > 5 files changed, 114 insertions(+), 46 deletions(-) > >> > > >> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md > >> > index 99e46d0bc69..b18292c07d3 100644 > >> > --- a/gcc/config/arm/mve.md > >> > +++ b/gcc/config/arm/mve.md > >> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>" > >> > ;; > >> > ;; [vmovltq_u, vmovltq_s]) > >> > ;; > >> > -(define_insn "mve_vmovltq_<supf><mode>" > >> > +(define_insn "@mve_vmovltq_<supf><mode>" > >> > [ > >> > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > >> > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > >> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>" > >> > ;; > >> > ;; [vmovlbq_s, vmovlbq_u]) > >> > ;; > >> > -(define_insn "mve_vmovlbq_<supf><mode>" > >> > +(define_insn "@mve_vmovlbq_<supf><mode>" > >> > [ > >> > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") > >> > (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")] > >> > @@ -2187,7 +2187,7 @@ (define_insn 
"mve_vmlsldavxq_s<mode>" > >> > ;; > >> > ;; [vmovnbq_u, vmovnbq_s]) > >> > ;; > >> > -(define_insn "mve_vmovnbq_<supf><mode>" > >> > +(define_insn "@mve_vmovnbq_<supf><mode>" > >> > [ > >> > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > >> > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > >> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>" > >> > ;; > >> > ;; [vmovntq_s, vmovntq_u]) > >> > ;; > >> > -(define_insn "mve_vmovntq_<supf><mode>" > >> > +(define_insn "@mve_vmovntq_<supf><mode>" > >> > [ > >> > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > >> > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0") > >> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>" > >> > [(set_attr "type" "mve_move") > >> > ]) > >> > > >> > +(define_insn "@mve_vec_pack_trunc_<mode>" > >> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") > >> > + (vec_concat:<V_narrow_pack> > >> > + (truncate:<V_narrow> > >> > + (match_operand:MVE_5 1 "register_operand" "w")) > >> > + (truncate:<V_narrow> > >> > + (match_operand:MVE_5 2 "register_operand" "w"))))] > >> > + "TARGET_HAVE_MVE" > >> > + "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2" > >> > + [(set_attr "type" "mve_move")] > >> > +) > >> > + > >> > >> I realise this is (like you say) based on the neon.md pattern, > >> but we should use separate vmovnb and vmovnt instructions instead > >> of putting two instructions into a single pattern. > >> > >> One specific advantage to using separate patterns is that it would > >> avoid the imprecision of the earlyclobber: the output only conflicts > >> with operand 1 and can be tied to operand 2. 
> >> > > I'm not sure to follow: with the current patch, for test_clz_s16 we generate: > > vldrh.16 q3, [r1] > > vmovlb.s16 q2, q3 > > vmovlt.s16 q3, q3 > > vclz.i32 q2, q2 > > vclz.i32 q3, q3 > > vmovnb.i32 q1, q2 > > vmovnt.i32 q1, q3 > > vstrh.16 q1, [r0] > > > > If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern > > and update define_expand "vec_pack_trunc_<mode>" to call: > > emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0], > > operands[0], operands[1])); > > emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0], > > operands[0], operands[2])); > > instead, we now generate: > > vldrh.16 q3, [r1] > > vmovlb.s16 q1, q3 > > vldr.64 d4, .L5 > > vldr.64 d5, .L5+8 > > vmovlt.s16 q3, q3 > > vclz.i32 q1, q1 > > vclz.i32 q3, q3 > > vmovnb.i32 q2, q1 > > vmovnt.i32 q2, q3 > > vstrh.16 q2, [r0] > > bx lr > > .L6: > > .align 3 > > .L5: > > .short 0 > > .short 0 > > .short 0 > > .short 0 > > .short 0 > > .short 0 > > .short 0 > > .short 0 > > > > That's what I tried to explain in the commit message. > > Sorry, I missed that part of the message. If we want the first > instruction to have an in/out operand whose initial value is undefined > then I think the best way would be to have a second pattern for vmovnb > (only) that acts like a unary operation. > > FWIW, in the above, we shouldn't assign to operands[0] twice. > The result of the first operation needs to be a new temporary register. Thanks. In the meantime, I tried to make some progress, and looked at how things work on aarch64. 
This led me to define (in mve.md):

(define_insn "@mve_vec_pack_trunc_lo_<mode>"
  [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
	(truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))]
  "TARGET_HAVE_MVE"
  "vmovnb.i<V_sz_elem>	%q0, %q1\;"
  [(set_attr "type" "mve_move")]
)

(define_insn "@mve_vec_pack_trunc_hi_<mode>"
  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w")
	(vec_concat:<V_narrow_pack>
	  (match_operand:<V_narrow> 1 "register_operand" "0")
	  (truncate:<V_narrow>
	    (match_operand:MVE_5 2 "register_operand" "w"))))]
  "TARGET_HAVE_MVE"
  "vmovnt.i<V_sz_elem>	%q0, %q2\;"
  [(set_attr "type" "mve_move")]
)

and in vec-common.md, for

(define_expand "vec_pack_trunc_<mode>"
  [(set (match_operand:<V_narrow_pack> 0 "register_operand")
	(vec_concat:<V_narrow_pack>
	  (truncate:<V_narrow>
	    (match_operand:VN 1 "register_operand"))
	  (truncate:<V_narrow>
	    (match_operand:VN 2 "register_operand"))))]

I expand this for MVE:

  rtx tmpreg = gen_reg_rtx (<V_narrow>mode);
  emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
  emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0],
					tmpreg, operands[2]));

I am getting an error in reload:

error: unable to generate reloads for:
(insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ])
        (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ])))
     "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609 {mve_vec_pack_trunc_lo_v4si}
     (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ])
        (nil)))

The next insn is:

(insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ])
        (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 ])
            (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ]))))
     "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611 {mve_vec_pack_trunc_hi_v4si}
     (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ])
        (expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ])
            (nil))))

What is causing the reload error?

Thanks,

Christophe
Christophe Lyon <christophe.lyon@linaro.org> writes: > In the meantime, I tried to make some progress, and looked at how > things work on aarch64. > > This led me to define (in mve.md): > > (define_insn "@mve_vec_pack_trunc_lo_<mode>" > [(set (match_operand:<V_narrow> 0 "register_operand" "=w") > (truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))] > "TARGET_HAVE_MVE" > "vmovnb.i<V_sz_elem> %q0, %q1\;" > [(set_attr "type" "mve_move")] > ) > > (define_insn "@mve_vec_pack_trunc_hi_<mode>" > [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w") > (vec_concat:<V_narrow_pack> > (match_operand:<V_narrow> 1 "register_operand" "0") > (truncate:<V_narrow> > (match_operand:MVE_5 2 "register_operand" "w"))))] > "TARGET_HAVE_MVE" > "vmovnt.i<V_sz_elem> %q0, %q2\;" > [(set_attr "type" "mve_move")] > ) > > and in vec-common.md, for > (define_expand "vec_pack_trunc_<mode>" > [(set (match_operand:<V_narrow_pack> 0 "register_operand") > (vec_concat:<V_narrow_pack> > (truncate:<V_narrow> > (match_operand:VN 1 "register_operand")) > (truncate:<V_narrow> > (match_operand:VN 2 "register_operand"))))] > > I expand this for MVE: > rtx tmpreg = gen_reg_rtx (<V_narrow>mode); > emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1])); > emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0], > tmpreg, operands[2])); > > I am getting an error in reload: > error: unable to generate reloads for: > (insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ]) > (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ]))) > "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609 > {mve_vec_pack_trunc_lo_v4si} > (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ]) > (nil))) > > The next insn is: > (insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ]) > (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 ]) > (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ])))) > "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611 > {mve_vec_pack_trunc_hi_v4si} > (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ]) > 
(expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ]) > (nil)))) > > What is causing the reload error? For this to work on MVE, we'd need to allow the 64-bit V_narrow modes to be stored in MVE registers. I'm not sure off-hand whether allowing that would be a good idea or not. If we continue to allow only 128-bit vectors in MVE registers then we'll need to continue to use unspecs instead of truncate and vec_concat. Thanks, Richard
On Fri, 11 Jun 2021 at 12:20, Richard Sandiford <richard.sandiford@arm.com> wrote: > > Christophe Lyon <christophe.lyon@linaro.org> writes: > > In the meantime, I tried to make some progress, and looked at how > > things work on aarch64. > > > > This led me to define (in mve.md): > > > > (define_insn "@mve_vec_pack_trunc_lo_<mode>" > > [(set (match_operand:<V_narrow> 0 "register_operand" "=w") > > (truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))] > > "TARGET_HAVE_MVE" > > "vmovnb.i<V_sz_elem> %q0, %q1\;" > > [(set_attr "type" "mve_move")] > > ) > > > > (define_insn "@mve_vec_pack_trunc_hi_<mode>" > > [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w") > > (vec_concat:<V_narrow_pack> > > (match_operand:<V_narrow> 1 "register_operand" "0") > > (truncate:<V_narrow> > > (match_operand:MVE_5 2 "register_operand" "w"))))] > > "TARGET_HAVE_MVE" > > "vmovnt.i<V_sz_elem> %q0, %q2\;" > > [(set_attr "type" "mve_move")] > > ) > > > > and in vec-common.md, for > > (define_expand "vec_pack_trunc_<mode>" > > [(set (match_operand:<V_narrow_pack> 0 "register_operand") > > (vec_concat:<V_narrow_pack> > > (truncate:<V_narrow> > > (match_operand:VN 1 "register_operand")) > > (truncate:<V_narrow> > > (match_operand:VN 2 "register_operand"))))] > > > > I expand this for MVE: > > rtx tmpreg = gen_reg_rtx (<V_narrow>mode); > > emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1])); > > emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0], > > tmpreg, operands[2])); > > > > I am getting an error in reload: > > error: unable to generate reloads for: > > (insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ]) > > (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ]))) > > "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609 > > {mve_vec_pack_trunc_lo_v4si} > > (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ]) > > (nil))) > > > > The next insn is: > > (insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ]) > > (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 
]) > > (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ])))) > > "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611 > > {mve_vec_pack_trunc_hi_v4si} > > (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ]) > > (expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ]) > > (nil)))) > > > > What is causing the reload error? > > For this to work on MVE, we'd need to allow the 64-bit V_narrow modes to be > stored in MVE registers. I'm not sure off-hand whether allowing that > would be a good idea or not. > > If we continue to allow only 128-bit vectors in MVE registers then > we'll need to continue to use unspecs instead of truncate and vec_concat. >

Thanks for the feedback. How about v2 attached?
Do you want me to merge neon_vec_unpack<US> and
mve_vec_unpack<US> and only have different assembly?

  if (TARGET_HAVE_MVE)
    return "vmovlb.<US>%#<V_sz_elem>	%q0, %q1";
  else
    return "vmovlb.<US>%#<V_sz_elem>	%q0, %q1";

> Thanks, > Richard
Christophe Lyon <christophe.lyon@linaro.org> writes: > Thanks for the feedback. How about v2 attached? > Do you want me to merge neon_vec_unpack<US> and > mve_vec_unpack<US> and only have different assembly? > if (TARGET_HAVE_MVE) > return "vmovlb.<US>%#<V_sz_elem> %q0, %q1"; > else > return "vmovlb.<US>%#<V_sz_elem> %q0, %q1"; I think it'd be better to keep them separate, given that they have different attributes. > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>" > [(set_attr "type" "mve_move") > ]) > > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the > +;; need for an extra register. “to avoid the need for an uninitailized input operand” might be clearer. > +(define_insn "@mve_vec_pack_trunc_lo_<mode>" > + [ > + (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") > + (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")] > + VMOVNBQ_S)) > + ] > + "TARGET_HAVE_MVE" > + "vmovnb.i%#<V_sz_elem> %q0, %q1" > + [(set_attr "type" "mve_move") > +]) > + > […] > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md > index 430a92ce966..3afb3e1d891 100644 > --- a/gcc/config/arm/vec-common.md > +++ b/gcc/config/arm/vec-common.md > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2" > "ARM_HAVE_<MODE>_ARITH > && !TARGET_REALLY_IWMMXT" > ) > + > +;; vmovl[tb] are not available for V4SI on MVE > +(define_expand "vec_unpack<US>_hi_<mode>" > + [(match_operand:<V_unpack> 0 "register_operand") > + (SE:<V_unpack> (match_operand:VU 1 "register_operand"))] > + "ARM_HAVE_<MODE>_ARITH > + && !TARGET_REALLY_IWMMXT > + && ! 
(<MODE>mode == V4SImode && TARGET_HAVE_MVE) > + && !BYTES_BIG_ENDIAN" > + { > + rtvec v = rtvec_alloc (<V_mode_nunits>/2); > + rtx t1; > + int i; > + for (i = 0; i < (<V_mode_nunits>/2); i++) > + RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i); > + > + t1 = gen_rtx_PARALLEL (<MODE>mode, v); > + > + if (TARGET_NEON) > + emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], > + operands[1], > + t1)); > + else > + emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0], > + operands[1], > + t1)); > + DONE; > + } > +) Since the patterns are the same for both MVE and Neon (a good thing!), it would be simpler to write this as: (define_insn "mve_vec_unpack<US>_lo_<mode>" [(set (match_operand:<V_unpack> 0 "register_operand" "=w") (SE:<V_unpack> (vec_select:<V_HALF> (match_operand:VU 1 "register_operand" "w") (match_dup 2))))] … and then set operands[2] to what is t1 above. There would then be no need to call emit_insn & DONE, and we could use the natural iterator (MVE_3) for the MVE patterns. Same idea for the lo version. > […] > +;; vmovn[tb] are not available for V2DI on MVE > +(define_expand "vec_pack_trunc_<mode>" > + [(set (match_operand:<V_narrow_pack> 0 "register_operand") > + (vec_concat:<V_narrow_pack> > + (truncate:<V_narrow> > + (match_operand:VN 1 "register_operand")) > + (truncate:<V_narrow> > + (match_operand:VN 2 "register_operand"))))] > + "ARM_HAVE_<MODE>_ARITH > + && !TARGET_REALLY_IWMMXT > + && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE) > + && !BYTES_BIG_ENDIAN" > + { > + if (TARGET_NEON) > + { > + emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1], > + operands[2])); > + } > + else > + { > + rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode); > + emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1])); > + emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, > + tmpreg, tmpreg, operands[2])); > + emit_move_insn (operands[0], tmpreg); vmovntq can assign directly to operands[0], without a separate move. 
OK with those changes, thanks. Richard > + } > + DONE; > + } > +) > […]
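For readers following along, the match_dup approach Richard describes could look roughly like this as a machine-description sketch (illustrative only, assuming the conditions from the patch above; this is not the committed version of the pattern):

```
;; Sketch: the expander's RTL template carries the full vec_select form,
;; and the preparation statement fills in operands[2] (the lane-selection
;; PARALLEL that was called t1 above), so no emit_insn/DONE is needed.
(define_expand "vec_unpack<US>_lo_<mode>"
  [(set (match_operand:<V_unpack> 0 "register_operand")
	(SE:<V_unpack> (vec_select:<V_HALF>
			 (match_operand:VU 1 "register_operand")
			 (match_dup 2))))]
  "ARM_HAVE_<MODE>_ARITH
   && !TARGET_REALLY_IWMMXT
   && !(<MODE>mode == V4SImode && TARGET_HAVE_MVE)
   && !BYTES_BIG_ENDIAN"
{
  /* The lo version selects lanes 0 .. nunits/2 - 1; the hi version
     would use (nunits/2 + i) instead.  */
  rtvec v = rtvec_alloc (<V_mode_nunits> / 2);
  for (int i = 0; i < <V_mode_nunits> / 2; i++)
    RTVEC_ELT (v, i) = GEN_INT (i);
  operands[2] = gen_rtx_PARALLEL (<MODE>mode, v);
})
```

The expander then matches the Neon or MVE define_insn directly through the shared rtx pattern, which is what lets each backend keep its own assembly and attributes.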
On Fri, 11 Jun 2021 at 18:25, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > Thanks for the feedback. How about v2 attached?
> > Do you want me to merge neon_vec_unpack<US> and
> > mve_vec_unpack<US> and only have different assembly?
> >   if (TARGET_HAVE_MVE)
> >     return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> >   else
> >     return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
>
> I think it'd be better to keep them separate, given that they have
> different attributes.
>
> > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >    [(set_attr "type" "mve_move")
> >  ])
> >
> > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
> > +;; need for an extra register.
>
> “to avoid the need for an uninitialized input operand” might
> be clearer.
>
Done

> > +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
> > +  [
> > +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> > +        (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
> > +         VMOVNBQ_S))
> > +  ]
> > +  "TARGET_HAVE_MVE"
> > +  "vmovnb.i%#<V_sz_elem> %q0, %q1"
> > +  [(set_attr "type" "mve_move")
> > +])
> > +
> > […]
> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > index 430a92ce966..3afb3e1d891 100644
> > --- a/gcc/config/arm/vec-common.md
> > +++ b/gcc/config/arm/vec-common.md
> > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
> >    "ARM_HAVE_<MODE>_ARITH
> >     && !TARGET_REALLY_IWMMXT"
> >  )
> > +
> > +;; vmovl[tb] are not available for V4SI on MVE
> > +(define_expand "vec_unpack<US>_hi_<mode>"
> > +  [(match_operand:<V_unpack> 0 "register_operand")
> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > +  "ARM_HAVE_<MODE>_ARITH
> > +   && !TARGET_REALLY_IWMMXT
> > +   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > +   && !BYTES_BIG_ENDIAN"
> > +  {
> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > +    rtx t1;
> > +    int i;
> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > +
> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > +
> > +    if (TARGET_NEON)
> > +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> > +                                                    operands[1],
> > +                                                    t1));
> > +    else
> > +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
> > +                                                   operands[1],
> > +                                                   t1));
> > +    DONE;
> > +  }
> > +)
>
> Since the patterns are the same for both MVE and Neon (a good thing!),
> it would be simpler to write this as:
>
> (define_insn "mve_vec_unpack<US>_lo_<mode>"

I guess you mean define_expand?

> [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>       (SE:<V_unpack> (vec_select:<V_HALF>
>                        (match_operand:VU 1 "register_operand" "w")
>                        (match_dup 2))))]
> …
>
> and then set operands[2] to what is t1 above.  There would then be
> no need to call emit_insn & DONE, and we could use the natural
> iterator (MVE_3) for the MVE patterns.
>
> Same idea for the lo version.
>
> > […]
> > +;; vmovn[tb] are not available for V2DI on MVE
> > +(define_expand "vec_pack_trunc_<mode>"
> > +  [(set (match_operand:<V_narrow_pack> 0 "register_operand")
> > +        (vec_concat:<V_narrow_pack>
> > +         (truncate:<V_narrow>
> > +          (match_operand:VN 1 "register_operand"))
> > +         (truncate:<V_narrow>
> > +          (match_operand:VN 2 "register_operand"))))]
> > +  "ARM_HAVE_<MODE>_ARITH
> > +   && !TARGET_REALLY_IWMMXT
> > +   && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> > +   && !BYTES_BIG_ENDIAN"
> > +  {
> > +    if (TARGET_NEON)
> > +      {
> > +        emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> > +                                                        operands[2]));
> > +      }
> > +    else
> > +      {
> > +        rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
> > +        emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
> > +        emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> > +                                    tmpreg, tmpreg, operands[2]));
> > +        emit_move_insn (operands[0], tmpreg);
>
> vmovntq can assign directly to operands[0], without a separate move.
>
> > +      }
> > +    DONE;
> > +  }
> > +)
> > […]
>
> OK with those changes, thanks.

Just to be sure, here is v3 which hopefully matches what you meant.

I updated the commit message and removed:
< I did not prefix them with '@', because I couldn't find how to call
< gen_mve_vec_unpack_[lo|hi] with <US> as first parameter. Anyway, we
< can keep the VU iterator instead of MVE_3 since the offending V4SI
< mode is disabled in the expander.

OK?

Thanks

Christophe

> Richard
Christophe Lyon <christophe.lyon@linaro.org> writes:
> On Fri, 11 Jun 2021 at 18:25, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Christophe Lyon <christophe.lyon@linaro.org> writes:
>> > Thanks for the feedback. How about v2 attached?
>> > Do you want me to merge neon_vec_unpack<US> and
>> > mve_vec_unpack<US> and only have different assembly?
>> >   if (TARGET_HAVE_MVE)
>> >     return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
>> >   else
>> >     return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
>>
>> I think it'd be better to keep them separate, given that they have
>> different attributes.
>>
>> > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
>> >    [(set_attr "type" "mve_move")
>> >  ])
>> >
>> > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
>> > +;; need for an extra register.
>>
>> “to avoid the need for an uninitialized input operand” might
>> be clearer.
>>
> Done
>
>> > +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
>> > +  [
>> > +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>> > +        (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
>> > +         VMOVNBQ_S))
>> > +  ]
>> > +  "TARGET_HAVE_MVE"
>> > +  "vmovnb.i%#<V_sz_elem> %q0, %q1"
>> > +  [(set_attr "type" "mve_move")
>> > +])
>> > +
>> > […]
>> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
>> > index 430a92ce966..3afb3e1d891 100644
>> > --- a/gcc/config/arm/vec-common.md
>> > +++ b/gcc/config/arm/vec-common.md
>> > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
>> >    "ARM_HAVE_<MODE>_ARITH
>> >     && !TARGET_REALLY_IWMMXT"
>> >  )
>> > +
>> > +;; vmovl[tb] are not available for V4SI on MVE
>> > +(define_expand "vec_unpack<US>_hi_<mode>"
>> > +  [(match_operand:<V_unpack> 0 "register_operand")
>> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
>> > +  "ARM_HAVE_<MODE>_ARITH
>> > +   && !TARGET_REALLY_IWMMXT
>> > +   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
>> > +   && !BYTES_BIG_ENDIAN"
>> > +  {
>> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
>> > +    rtx t1;
>> > +    int i;
>> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
>> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
>> > +
>> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
>> > +
>> > +    if (TARGET_NEON)
>> > +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
>> > +                                                    operands[1],
>> > +                                                    t1));
>> > +    else
>> > +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
>> > +                                                   operands[1],
>> > +                                                   t1));
>> > +    DONE;
>> > +  }
>> > +)
>>
>> Since the patterns are the same for both MVE and Neon (a good thing!),
>> it would be simpler to write this as:
>>
>> (define_insn "mve_vec_unpack<US>_lo_<mode>"
> I guess you mean define_expand?

Yes, sorry.

>> [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>>       (SE:<V_unpack> (vec_select:<V_HALF>
>>                        (match_operand:VU 1 "register_operand" "w")
>>                        (match_dup 2))))]
>> …
>>
>> and then set operands[2] to what is t1 above.  There would then be
>> no need to call emit_insn & DONE, and we could use the natural
>> iterator (MVE_3) for the MVE patterns.
>>
>> Same idea for the lo version.
>>
>> > […]
>> > +;; vmovn[tb] are not available for V2DI on MVE
>> > +(define_expand "vec_pack_trunc_<mode>"
>> > +  [(set (match_operand:<V_narrow_pack> 0 "register_operand")
>> > +        (vec_concat:<V_narrow_pack>
>> > +         (truncate:<V_narrow>
>> > +          (match_operand:VN 1 "register_operand"))
>> > +         (truncate:<V_narrow>
>> > +          (match_operand:VN 2 "register_operand"))))]
>> > +  "ARM_HAVE_<MODE>_ARITH
>> > +   && !TARGET_REALLY_IWMMXT
>> > +   && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
>> > +   && !BYTES_BIG_ENDIAN"
>> > +  {
>> > +    if (TARGET_NEON)
>> > +      {
>> > +        emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
>> > +                                                        operands[2]));
>> > +      }
>> > +    else
>> > +      {
>> > +        rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
>> > +        emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
>> > +        emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
>> > +                                    tmpreg, tmpreg, operands[2]));
>> > +        emit_move_insn (operands[0], tmpreg);
>>
>> vmovntq can assign directly to operands[0], without a separate move.
>>
>> OK with those changes, thanks.
>
> Just to be sure, here is v3 which hopefully matches what you meant.

> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> index 430a92ce966..1d6d4379855 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm/vec-common.md
> @@ -632,3 +632,78 @@ (define_expand "clz<mode>2"
>    "ARM_HAVE_<MODE>_ARITH
>     && !TARGET_REALLY_IWMMXT"
>  )
> +
> +;; vmovl[tb] are not available for V4SI on MVE
> +(define_expand "vec_unpack<US>_hi_<mode>"
> +  [(set (match_operand:<V_unpack> 0 "register_operand")
> +        (SE:<V_unpack> (vec_select:<V_HALF>
> +                        (match_operand:VU 1 "register_operand")
> +                        (match_dup 2))))]

The formatting of the vec_unpack<US>_lo_<mode> pattern looks nicer.

> +  "ARM_HAVE_<MODE>_ARITH
> +   && !TARGET_REALLY_IWMMXT
> +   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> +   && !BYTES_BIG_ENDIAN"
> +  {
> +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> +    rtx t1;
> +    int i;
> +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> +
> +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> +
> +    operands[2] = t1;

t1 doesn't seem worth it now, might as well assign directly to
operands[2].

> […]
> +    emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> +                operands[0], tmpreg, operands[2]));

The indentation looks odd here (operands[0] should line up with
VMOVNTQ_S), but that might be a mailer issue.

> +      }
> +    DONE;

OK with those changes, thanks.

Richard
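A side note on why the fused vmovnb/vmovnt pattern needs no zero-initialized temporary: per the Arm MVE descriptions of VMOVNB and VMOVNT, the "bottom" variant writes truncated elements into the even-numbered narrow lanes of the destination and the "top" variant into the odd-numbered lanes, so together they write every lane. A behavioral sketch in Python (the helper name is made up; this models lane placement only, not GCC's RTL):

```python
def vmovn(dest, src, bits, top):
    """Truncate each element of src to `bits` bits and write it to the
    odd (top=True) or even (top=False) narrow lanes of dest; other
    lanes of dest are left unchanged, as the real instructions do."""
    mask = (1 << bits) - 1
    out = list(dest)
    for i, v in enumerate(src):
        out[2 * i + (1 if top else 0)] = v & mask
    return out

# Pack two 4 x 32-bit operands into one 8 x 16-bit result, as the
# fused pattern's "vmovnb;vmovnt" sequence does: operand 1 fills lanes
# 0,2,4,6 and operand 2 fills lanes 1,3,5,7, so no destination lane is
# left uninitialized.
op1 = [0x10001, 0x10002, 0x10003, 0x10004]
op2 = [0x20005, 0x20006, 0x20007, 0x20008]
dest = vmovn([0] * 8, op1, 16, top=False)   # vmovnb.i32
dest = vmovn(dest, op2, 16, top=True)       # vmovnt.i32
print(dest)  # [1, 5, 2, 6, 3, 7, 4, 8]
```

Note the result is lane-interleaved rather than concatenated; this pairs up with vmovlb/vmovlt on the unpack side (which widen the even and odd lanes respectively), so an unpack-operate-pack round trip still preserves element order.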
On Mon, 14 Jun 2021 at 18:01, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > On Fri, 11 Jun 2021 at 18:25, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Christophe Lyon <christophe.lyon@linaro.org> writes:
> >> > Thanks for the feedback. How about v2 attached?
> >> > Do you want me to merge neon_vec_unpack<US> and
> >> > mve_vec_unpack<US> and only have different assembly?
> >> >   if (TARGET_HAVE_MVE)
> >> >     return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> >> >   else
> >> >     return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> >>
> >> I think it'd be better to keep them separate, given that they have
> >> different attributes.
> >>
> >> > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >> >    [(set_attr "type" "mve_move")
> >> >  ])
> >> >
> >> > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
> >> > +;; need for an extra register.
> >>
> >> “to avoid the need for an uninitialized input operand” might
> >> be clearer.
> >>
> > Done
> >
> >> > +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
> >> > +  [
> >> > +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >> > +        (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
> >> > +         VMOVNBQ_S))
> >> > +  ]
> >> > +  "TARGET_HAVE_MVE"
> >> > +  "vmovnb.i%#<V_sz_elem> %q0, %q1"
> >> > +  [(set_attr "type" "mve_move")
> >> > +])
> >> > +
> >> > […]
> >> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> >> > index 430a92ce966..3afb3e1d891 100644
> >> > --- a/gcc/config/arm/vec-common.md
> >> > +++ b/gcc/config/arm/vec-common.md
> >> > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
> >> >    "ARM_HAVE_<MODE>_ARITH
> >> >     && !TARGET_REALLY_IWMMXT"
> >> >  )
> >> > +
> >> > +;; vmovl[tb] are not available for V4SI on MVE
> >> > +(define_expand "vec_unpack<US>_hi_<mode>"
> >> > +  [(match_operand:<V_unpack> 0 "register_operand")
> >> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> >> > +  "ARM_HAVE_<MODE>_ARITH
> >> > +   && !TARGET_REALLY_IWMMXT
> >> > +   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> >> > +   && !BYTES_BIG_ENDIAN"
> >> > +  {
> >> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> >> > +    rtx t1;
> >> > +    int i;
> >> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> >> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> >> > +
> >> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> >> > +
> >> > +    if (TARGET_NEON)
> >> > +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> >> > +                                                    operands[1],
> >> > +                                                    t1));
> >> > +    else
> >> > +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
> >> > +                                                   operands[1],
> >> > +                                                   t1));
> >> > +    DONE;
> >> > +  }
> >> > +)
> >>
> >> Since the patterns are the same for both MVE and Neon (a good thing!),
> >> it would be simpler to write this as:
> >>
> >> (define_insn "mve_vec_unpack<US>_lo_<mode>"
> > I guess you mean define_expand?
>
> Yes, sorry.
>
> >> [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
> >>       (SE:<V_unpack> (vec_select:<V_HALF>
> >>                        (match_operand:VU 1 "register_operand" "w")
> >>                        (match_dup 2))))]
> >> …
> >>
> >> and then set operands[2] to what is t1 above.  There would then be
> >> no need to call emit_insn & DONE, and we could use the natural
> >> iterator (MVE_3) for the MVE patterns.
> >>
> >> Same idea for the lo version.
> >>
> >> > […]
> >> > +;; vmovn[tb] are not available for V2DI on MVE
> >> > +(define_expand "vec_pack_trunc_<mode>"
> >> > +  [(set (match_operand:<V_narrow_pack> 0 "register_operand")
> >> > +        (vec_concat:<V_narrow_pack>
> >> > +         (truncate:<V_narrow>
> >> > +          (match_operand:VN 1 "register_operand"))
> >> > +         (truncate:<V_narrow>
> >> > +          (match_operand:VN 2 "register_operand"))))]
> >> > +  "ARM_HAVE_<MODE>_ARITH
> >> > +   && !TARGET_REALLY_IWMMXT
> >> > +   && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> >> > +   && !BYTES_BIG_ENDIAN"
> >> > +  {
> >> > +    if (TARGET_NEON)
> >> > +      {
> >> > +        emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> >> > +                                                        operands[2]));
> >> > +      }
> >> > +    else
> >> > +      {
> >> > +        rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
> >> > +        emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
> >> > +        emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> >> > +                                    tmpreg, tmpreg, operands[2]));
> >> > +        emit_move_insn (operands[0], tmpreg);
> >>
> >> vmovntq can assign directly to operands[0], without a separate move.
> >>
> >> OK with those changes, thanks.
> >
> > Just to be sure, here is v3 which hopefully matches what you meant.
>
> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > index 430a92ce966..1d6d4379855 100644
> > --- a/gcc/config/arm/vec-common.md
> > +++ b/gcc/config/arm/vec-common.md
> > @@ -632,3 +632,78 @@ (define_expand "clz<mode>2"
> >    "ARM_HAVE_<MODE>_ARITH
> >     && !TARGET_REALLY_IWMMXT"
> >  )
> > +
> > +;; vmovl[tb] are not available for V4SI on MVE
> > +(define_expand "vec_unpack<US>_hi_<mode>"
> > +  [(set (match_operand:<V_unpack> 0 "register_operand")
> > +        (SE:<V_unpack> (vec_select:<V_HALF>
> > +                        (match_operand:VU 1 "register_operand")
> > +                        (match_dup 2))))]
>
> The formatting of the vec_unpack<US>_lo_<mode> pattern looks nicer.
>
> > +  "ARM_HAVE_<MODE>_ARITH
> > +   && !TARGET_REALLY_IWMMXT
> > +   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > +   && !BYTES_BIG_ENDIAN"
> > +  {
> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > +    rtx t1;
> > +    int i;
> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > +
> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > +
> > +    operands[2] = t1;
>
> t1 doesn't seem worth it now, might as well assign directly to
> operands[2].
>
> > […]
> > +    emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> > +                operands[0], tmpreg, operands[2]));
>
> The indentation looks odd here (operands[0] should line up with
> VMOVNTQ_S), but that might be a mailer issue.
>

That was the intent, but there was a mix of tabs/spaces, now fixed
(and elsewhere too).

I have attached what I have committed.

Thanks

Christophe

> > +      }
> > +    DONE;
>
> OK with those changes, thanks.
>
> Richard
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 99e46d0bc69..b18292c07d3 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
 ;;
 ;; [vmovltq_u, vmovltq_s])
 ;;
-(define_insn "mve_vmovltq_<supf><mode>"
+(define_insn "@mve_vmovltq_<supf><mode>"
   [
    (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
        (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
@@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
 ;;
 ;; [vmovlbq_s, vmovlbq_u])
 ;;
-(define_insn "mve_vmovlbq_<supf><mode>"
+(define_insn "@mve_vmovlbq_<supf><mode>"
   [
    (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
        (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
@@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
 ;;
 ;; [vmovnbq_u, vmovnbq_s])
 ;;
-(define_insn "mve_vmovnbq_<supf><mode>"
+(define_insn "@mve_vmovnbq_<supf><mode>"
   [
    (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
        (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
@@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
 ;;
 ;; [vmovntq_s, vmovntq_u])
 ;;
-(define_insn "mve_vmovntq_<supf><mode>"
+(define_insn "@mve_vmovntq_<supf><mode>"
   [
    (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
        (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
@@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
   [(set_attr "type" "mve_move")
 ])
 
+(define_insn "@mve_vec_pack_trunc_<mode>"
+  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
+       (vec_concat:<V_narrow_pack>
+        (truncate:<V_narrow>
+         (match_operand:MVE_5 1 "register_operand" "w"))
+        (truncate:<V_narrow>
+         (match_operand:MVE_5 2 "register_operand" "w"))))]
+  "TARGET_HAVE_MVE"
+  "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2"
+  [(set_attr "type" "mve_move")]
+)
+
 ;;
 ;; [vmulq_f])
 ;;
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 0fdffaf4ec4..392d9607919 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>"
   [(set_attr "type" "neon_shift_imm_long")]
 )
 
-(define_expand "vec_unpack<US>_hi_<mode>"
-  [(match_operand:<V_unpack> 0 "register_operand")
-   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
-  "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  {
-    rtvec v = rtvec_alloc (<V_mode_nunits>/2) ;
-    rtx t1;
-    int i;
-    for (i = 0; i < (<V_mode_nunits>/2); i++)
-      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
-
-    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
-    emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
-                                                  operands[1],
-                                                  t1));
-    DONE;
-  }
-)
-
-(define_expand "vec_unpack<US>_lo_<mode>"
-  [(match_operand:<V_unpack> 0 "register_operand")
-   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
-  "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  {
-    rtvec v = rtvec_alloc (<V_mode_nunits>/2) ;
-    rtx t1;
-    int i;
-    for (i = 0; i < (<V_mode_nunits>/2) ; i++)
-      RTVEC_ELT (v, i) = GEN_INT (i);
-    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
-    emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
-                                                  operands[1],
-                                                  t1));
-    DONE;
-  }
-)
-
 (define_insn "neon_vec_<US>mult_lo_<mode>"
   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
        (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF>
@@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>"
 ; because the ordering of vector elements in Q registers is different from what
 ; the semantics of the instructions require.
 
-(define_insn "vec_pack_trunc_<mode>"
+(define_insn "neon_quad_vec_pack_trunc_<mode>"
   [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
        (vec_concat:<V_narrow_pack>
         (truncate:<V_narrow>
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 1ba1e5eb008..0ffc7a9322c 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -638,3 +638,92 @@ (define_expand "clz<mode>2"
     emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1]));
   DONE;
 })
+
+;; vmovl[tb] are not available for V4SI on MVE
+(define_expand "vec_unpack<US>_hi_<mode>"
+  [(match_operand:<V_unpack> 0 "register_operand")
+   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
+  "ARM_HAVE_<MODE>_ARITH
+   && !TARGET_REALLY_IWMMXT
+   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
+   && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+        rtvec v = rtvec_alloc (<V_mode_nunits>/2);
+        rtx t1;
+        int i;
+        for (i = 0; i < (<V_mode_nunits>/2); i++)
+          RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
+
+        t1 = gen_rtx_PARALLEL (<MODE>mode, v);
+        emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
+                                                      operands[1],
+                                                      t1));
+      }
+    else
+      {
+        emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0],
+                                    operands[1]));
+      }
+    DONE;
+  }
+)
+
+;; vmovl[tb] are not available for V4SI on MVE
+(define_expand "vec_unpack<US>_lo_<mode>"
+  [(match_operand:<V_unpack> 0 "register_operand")
+   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
+  "ARM_HAVE_<MODE>_ARITH
+   && !TARGET_REALLY_IWMMXT
+   && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
+   && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+        rtvec v = rtvec_alloc (<V_mode_nunits>/2);
+        rtx t1;
+        int i;
+        for (i = 0; i < (<V_mode_nunits>/2) ; i++)
+          RTVEC_ELT (v, i) = GEN_INT (i);
+
+        t1 = gen_rtx_PARALLEL (<MODE>mode, v);
+        emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
+                                                      operands[1],
+                                                      t1));
+      }
+    else
+      {
+        emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0],
+                                    operands[1]));
+      }
+    DONE;
+  }
+)
+
+;; vmovn[tb] are not available for V2DI on MVE
+(define_expand "vec_pack_trunc_<mode>"
+  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
+        (vec_concat:<V_narrow_pack>
+         (truncate:<V_narrow>
+          (match_operand:VN 1 "register_operand" "w"))
+         (truncate:<V_narrow>
+          (match_operand:VN 2 "register_operand" "w"))))]
+  "ARM_HAVE_<MODE>_ARITH
+   && !TARGET_REALLY_IWMMXT
+   && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
+   && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+        emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
+                                                        operands[2]));
+      }
+    else
+      {
+        emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1],
+                                           operands[2]));
+      }
+    DONE;
+  }
+)
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
index 7068736bc28..5d6e991cfc6 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
@@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz)
 FUNC(s, int, 8, clz)
 FUNC(u, uint, 8, clz)
 
-/* 16 and 8-bit versions are not vectorized because they need pack/unpack
-   patterns since __builtin_clz uses 32-bit parameter and return value.  */
-/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 2 } } */
+/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
+   instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of
+   14 vclz.i32 expected in this testcase.  */
+/* { dg-final { scan-assembler-times {vclz\.i32 q[0-9]+, q[0-9]+} 14 } } */
 /* { dg-final { scan-assembler-times {vclz\.i16 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {vclz\.i8 q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
index 7a0644997c8..91dd942d818 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
@@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm)
 /* MVE has only 128-bit vectors, so we can vectorize only half of the
    functions above.  */
 /* We only emit vshl.u, which is equivalent to vshl.s anyway.  */
-/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */
+/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
+   instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of
+   14 vshl.i32 expected in this testcase.  */
+/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */
 /* We emit vshl.i when the shift amount is an immediate.  */
 /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
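The count of 14 used in the updated scan-assembler-times directives can be cross-checked with a little arithmetic: each test defines one signed and one unsigned function per element size, and with the new unpack/pack expanders every function's 128-bit vector is processed in 32-bit chunks of four lanes. A quick sketch (plain Python, modeling only the counting argument, not the testsuite itself):

```python
def expected_32bit_ops(element_bits):
    """Number of 32-bit vector operations needed per function: narrower
    element types are widened to 32 bits first, so each function needs
    one 32-bit op per group of four lanes of its 128-bit vector."""
    lanes = 128 // element_bits   # lanes in one 128-bit MVE vector
    return lanes // 4             # 4 x 32-bit lanes handled per op

# 32-bit: 1 op, 16-bit: 2 ops, 8-bit: 4 ops; two functions
# (signed + unsigned) per element size gives 2 * (1 + 2 + 4) = 14.
total = sum(2 * expected_32bit_ops(b) for b in (32, 16, 8))
print(total)  # 14
```

This matches the 14 expected by both the vclz and vshl scans above.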