
[2/2] arm: Auto-vectorization for MVE: add pack/unpack patterns

Message ID 20210607112007.8659-2-christophe.lyon@linaro.org
State New
Series [1/2] arm: Auto-vectorization for MVE: vclz

Commit Message

Christophe Lyon June 7, 2021, 11:20 a.m. UTC
This patch adds the vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>
and vec_pack_trunc_<mode> patterns for MVE.

It does so by moving the unpack patterns from neon.md to
vec-common.md, while adding MVE support to them. The pack expander is
derived from the Neon one (which in turn is renamed to
neon_quad_vec_pack_trunc_<mode>).

The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
zero-initialized temporary, which would otherwise be required if the
vec_pack_trunc_<mode> expander called @mve_vmovn[bt]q_<supf><mode>
instead.
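
For the record, the avoided alternative would look something like this
(untested sketch, written against the expander API used elsewhere in
this patch; the explicit zero initialization is precisely the cost
being avoided):

    rtx tmp = gen_reg_rtx (<V_narrow_pack>mode);
    /* vmovn[bt]q read their tied input operand, so the temporary
       must be initialized before its first use.  */
    emit_move_insn (tmp, CONST0_RTX (<V_narrow_pack>mode));
    emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, tmp, tmp,
                                operands[1]));
    emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0], tmp,
                                operands[2]));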

With this patch, we can now vectorize the 16- and 8-bit versions of
vclz and vshl, although the generated code could still be improved.
For test_clz_s16, we now generate
        vldrh.16        q3, [r1]
        vmovlb.s16   q2, q3
        vmovlt.s16   q3, q3
        vclz.i32  q2, q2
        vclz.i32  q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]
which could be improved to
        vldrh.16        q3, [r1]
        vclz.i16        q1, q3
        vstrh.16        q1, [r0]
if we could avoid the need for unpack/pack steps.

For reference, clang-12 generates:
	vldrh.s32       q0, [r1]
	vldrh.s32       q1, [r1, #8]
	vclz.i32        q0, q0
	vstrh.32        q0, [r0]
	vclz.i32        q0, q1
	vstrh.32        q0, [r0, #8]
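
For the record, the testcase functions have roughly the following
shape; this is a schematic hand-written version (the actual mve-vclz.c
testcase generates them via its FUNC macro, so the iteration count
here is only illustrative):

        #include <stdint.h>

        /* __builtin_clz takes and returns 32-bit values, which is why
           the 16-bit version needs the unpack/pack steps above.  */
        void
        test_clz_s16 (int16_t * __restrict__ dest, int16_t * __restrict__ a)
        {
          int i;
          for (i = 0; i < 8; i++)
            dest[i] = __builtin_clz (a[i]);
        }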

2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
	(mve_vmovlbq_<supf><mode>): Likewise.
	(mve_vmovnbq_<supf><mode>): Likewise.
	(mve_vmovntq_<supf><mode>): Likewise.
	(@mve_vec_pack_trunc_<mode>): New pattern.
	* config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
	vec-common.md.
	(vec_unpack<US>_lo_<mode>): Likewise.
	(vec_pack_trunc_<mode>): Rename to
	neon_quad_vec_pack_trunc_<mode>.
	* config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
	pattern.
	(vec_unpack<US>_lo_<mode>): New.
	(vec_pack_trunc_<mode>): New.

	gcc/testsuite/
	* gcc.target/arm/simd/mve-vclz.c: Update expected results.
	* gcc.target/arm/simd/mve-vshl.c: Likewise.
---
 gcc/config/arm/mve.md                        | 20 ++++-
 gcc/config/arm/neon.md                       | 39 +--------
 gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
 gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
 gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
 5 files changed, 114 insertions(+), 46 deletions(-)

Comments

Richard Sandiford June 8, 2021, 12:10 p.m. UTC | #1
Christophe Lyon <christophe.lyon@linaro.org> writes:
> This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
> vec_pack_trunc_<mode> patterns for MVE.
>
> It does so by moving the unpack patterns from neon.md to
> vec-common.md, while adding them support for MVE. The pack expander is
> derived from the Neon one (which in turn is renamed into
> neon_quad_vec_pack_trunc_<mode>).
>
> The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
> zero-initialized temporary, which is needed if the
> vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
> instead.
>
> With this patch, we can now vectorize the 16 and 8-bit versions of
> vclz and vshl, although the generated code could still be improved.
> For test_clz_s16, we now generate
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q2, q3
>         vmovlt.s16   q3, q3
>         vclz.i32  q2, q2
>         vclz.i32  q3, q3
>         vmovnb.i32      q1, q2
>         vmovnt.i32      q1, q3
>         vstrh.16        q1, [r0]
> which could be improved to
>         vldrh.16        q3, [r1]
> 	vclz.i16	q1, q3
>         vstrh.16        q1, [r0]
> if we could avoid the need for unpack/pack steps.

Yeah, there was a PR about fixing this for popcount.  I guess the same
approach would apply here too.

> For reference, clang-12 generates:
> 	vldrh.s32       q0, [r1]
> 	vldrh.s32       q1, [r1, #8]
> 	vclz.i32        q0, q0
> 	vstrh.32        q0, [r0]
> 	vclz.i32        q0, q1
> 	vstrh.32        q0, [r0, #8]
>
> 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
>
> 	gcc/
> 	* config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
> 	(mve_vmovlbq_<supf><mode>): Likewise.
> 	(mve_vmovnbq_<supf><mode>): Likewise.
> 	(mve_vmovntq_<supf><mode>): Likewise.
> 	(@mve_vec_pack_trunc_<mode>): New pattern.
> 	* config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
> 	vec-common.md.
> 	(vec_unpack<US>_lo_<mode>): Likewise.
> 	(vec_pack_trunc_<mode>): Rename to
> 	neon_quad_vec_pack_trunc_<mode>.
> 	* config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
> 	pattern.
> 	(vec_unpack<US>_lo_<mode>): New.
> 	(vec_pack_trunc_<mode>): New.
>
> 	gcc/testsuite/
> 	* gcc.target/arm/simd/mve-vclz.c: Update expected results.
> 	* gcc.target/arm/simd/mve-vshl.c: Likewise.
> ---
>  gcc/config/arm/mve.md                        | 20 ++++-
>  gcc/config/arm/neon.md                       | 39 +--------
>  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
>  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
>  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
>  5 files changed, 114 insertions(+), 46 deletions(-)
>
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 99e46d0bc69..b18292c07d3 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
>  ;;
>  ;; [vmovltq_u, vmovltq_s])
>  ;;
> -(define_insn "mve_vmovltq_<supf><mode>"
> +(define_insn "@mve_vmovltq_<supf><mode>"
>    [
>     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
>  	(unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
>  ;;
>  ;; [vmovlbq_s, vmovlbq_u])
>  ;;
> -(define_insn "mve_vmovlbq_<supf><mode>"
> +(define_insn "@mve_vmovlbq_<supf><mode>"
>    [
>     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
>  	(unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
>  ;;
>  ;; [vmovnbq_u, vmovnbq_s])
>  ;;
> -(define_insn "mve_vmovnbq_<supf><mode>"
> +(define_insn "@mve_vmovnbq_<supf><mode>"
>    [
>     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>  	(unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
>  ;;
>  ;; [vmovntq_s, vmovntq_u])
>  ;;
> -(define_insn "mve_vmovntq_<supf><mode>"
> +(define_insn "@mve_vmovntq_<supf><mode>"
>    [
>     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>  	(unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
>    [(set_attr "type" "mve_move")
>  ])
>  
> +(define_insn "@mve_vec_pack_trunc_<mode>"
> + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> +       (vec_concat:<V_narrow_pack>
> +		(truncate:<V_narrow>
> +			(match_operand:MVE_5 1 "register_operand" "w"))
> +		(truncate:<V_narrow>
> +			(match_operand:MVE_5 2 "register_operand" "w"))))]
> + "TARGET_HAVE_MVE"
> + "vmovnb.i<V_sz_elem>	%q0, %q1\;vmovnt.i<V_sz_elem>	%q0, %q2"
> +  [(set_attr "type" "mve_move")]
> +)
> +

I realise this is (like you say) based on the neon.md pattern,
but we should use separate vmovnb and vmovnt instructions instead
of putting two instructions into a single pattern.

One specific advantage to using separate patterns is that it would
avoid the imprecision of the earlyclobber: the output only conflicts
with operand 2 and can be tied to operand 1.

>  ;;
>  ;; [vmulq_f])
>  ;;
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 0fdffaf4ec4..392d9607919 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>"
>    [(set_attr "type" "neon_shift_imm_long")]
>  )
>  
> -(define_expand "vec_unpack<US>_hi_<mode>"
> -  [(match_operand:<V_unpack> 0 "register_operand")
> -   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> - "TARGET_NEON && !BYTES_BIG_ENDIAN"
> -  {
> -   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
> -   rtx t1;
> -   int i;
> -   for (i = 0; i < (<V_mode_nunits>/2); i++)
> -     RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> -  
> -   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> -   emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], 
> -                                                 operands[1], 
> -					         t1));
> -   DONE;
> -  }
> -)
> -
> -(define_expand "vec_unpack<US>_lo_<mode>"
> -  [(match_operand:<V_unpack> 0 "register_operand")
> -   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> - "TARGET_NEON && !BYTES_BIG_ENDIAN"
> -  {
> -   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
> -   rtx t1;
> -   int i;
> -   for (i = 0; i < (<V_mode_nunits>/2) ; i++)
> -     RTVEC_ELT (v, i) = GEN_INT (i);
> -   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> -   emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], 
> -                                                 operands[1], 
> -				   	         t1));
> -   DONE;
> -  }
> -)
> -
>  (define_insn "neon_vec_<US>mult_lo_<mode>"
>   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>         (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF>
> @@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>"
>  ; because the ordering of vector elements in Q registers is different from what
>  ; the semantics of the instructions require.
>  
> -(define_insn "vec_pack_trunc_<mode>"
> +(define_insn "neon_quad_vec_pack_trunc_<mode>"
>   [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
>         (vec_concat:<V_narrow_pack> 
>  		(truncate:<V_narrow> 
> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> index 1ba1e5eb008..0ffc7a9322c 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm/vec-common.md
> @@ -638,3 +638,92 @@ (define_expand "clz<mode>2"
>      emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1]));
>    DONE;
>  })
> +
> +;; vmovl[tb] are not available for V4SI on MVE
> +(define_expand "vec_unpack<US>_hi_<mode>"
> +  [(match_operand:<V_unpack> 0 "register_operand")
> +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> + "ARM_HAVE_<MODE>_ARITH
> +  && !TARGET_REALLY_IWMMXT
> +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> +  && !BYTES_BIG_ENDIAN"
> +  {
> +    if (TARGET_NEON)
> +      {
> +	rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> +	rtx t1;
> +	int i;
> +	for (i = 0; i < (<V_mode_nunits>/2); i++)
> +	  RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> +
> +	t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> +	emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> +						      operands[1],
> +						      t1));
> +      }
> +    else
> +      {
> +	emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0],
> +				    operands[1]));

It would be good to use the same rtx pattern for the MVE instructions,
since the compiler can optimise it better than an unspec.
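
E.g. vmovlb operates on the even-numbered elements, so something like
the following untested sketch should work (where
"vect_par_constant_even" is a hypothetical even-lanes predicate along
the lines of the existing vect_par_constant_low, and the insn name is
only illustrative):

    (define_insn "@mve_vec_unpack<US>_lo_<mode>"
      [(set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
            (SE:<V_double_width>
              (vec_select:<V_HALF>
                (match_operand:MVE_3 1 "s_register_operand" "w")
                (match_operand:MVE_3 2 "vect_par_constant_even" ""))))]
      "TARGET_HAVE_MVE"
      "vmovlb.<US><V_sz_elem>  %q0, %q1"
      [(set_attr "type" "mve_move")]
    )

with an analogous _hi pattern selecting the odd lanes for vmovlt.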

It would be good to have separate tests for the packs and unpacks
as well, in case the vclz thing is fixed in future.

Thanks,
Richard

> +      }
> +    DONE;
> +  }
> +)
> +
> +;; vmovl[tb] are not available for V4SI on MVE
> +(define_expand "vec_unpack<US>_lo_<mode>"
> +  [(match_operand:<V_unpack> 0 "register_operand")
> +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> + "ARM_HAVE_<MODE>_ARITH
> +  && !TARGET_REALLY_IWMMXT
> +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> +  && !BYTES_BIG_ENDIAN"
> +  {
> +    if (TARGET_NEON)
> +      {
> +	rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> +	rtx t1;
> +	int i;
> +	for (i = 0; i < (<V_mode_nunits>/2) ; i++)
> +	  RTVEC_ELT (v, i) = GEN_INT (i);
> +
> +	t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> +	emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
> +						      operands[1],
> +						      t1));
> +      }
> +    else
> +      {
> +	emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0],
> +				    operands[1]));
> +      }
> +    DONE;
> +  }
> +)
> +
> +;; vmovn[tb] are not available for V2DI on MVE
> +(define_expand "vec_pack_trunc_<mode>"
> + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> +       (vec_concat:<V_narrow_pack>
> +		(truncate:<V_narrow>
> +			(match_operand:VN 1 "register_operand" "w"))
> +		(truncate:<V_narrow>
> +			(match_operand:VN 2 "register_operand" "w"))))]
> + "ARM_HAVE_<MODE>_ARITH
> +  && !TARGET_REALLY_IWMMXT
> +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> +  && !BYTES_BIG_ENDIAN"
> + {
> +   if (TARGET_NEON)
> +     {
> +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> +						       operands[2]));
> +     }
> +   else
> +     {
> +       emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1],
> +					  operands[2]));
> +     }
> +   DONE;
> + }
> +)
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> index 7068736bc28..5d6e991cfc6 100644
> --- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> @@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz)
>  FUNC(s, int, 8, clz)
>  FUNC(u, uint, 8, clz)
>  
> -/* 16 and 8-bit versions are not vectorized because they need pack/unpack
> -   patterns since __builtin_clz uses 32-bit parameter and return value.  */
> -/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 2 } } */
> +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
> +   instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of
> +   14 vclz.i32 expected in this testcase.  */
> +/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 14 } } */
>  /* { dg-final { scan-assembler-times {vclz\.i16  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times {vclz\.i8  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> index 7a0644997c8..91dd942d818 100644
> --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> @@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm)
>  /* MVE has only 128-bit vectors, so we can vectorize only half of the
>     functions above.  */
>  /* We only emit vshl.u, which is equivalent to vshl.s anyway.  */
> -/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */
> +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
> +   instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of
> +   14 vshl.i32 expected in this testcase.  */
> +/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */
>  
>  /* We emit vshl.i when the shift amount is an immediate.  */
>  /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
Christophe Lyon June 10, 2021, 9:50 a.m. UTC | #2
On Tue, 8 Jun 2021 at 14:10, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
> > vec_pack_trunc_<mode> patterns for MVE.
> >
> > It does so by moving the unpack patterns from neon.md to
> > vec-common.md, while adding them support for MVE. The pack expander is
> > derived from the Neon one (which in turn is renamed into
> > neon_quad_vec_pack_trunc_<mode>).
> >
> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
> > zero-initialized temporary, which is needed if the
> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
> > instead.
> >
> > With this patch, we can now vectorize the 16 and 8-bit versions of
> > vclz and vshl, although the generated code could still be improved.
> > For test_clz_s16, we now generate
> >         vldrh.16        q3, [r1]
> >         vmovlb.s16   q2, q3
> >         vmovlt.s16   q3, q3
> >         vclz.i32  q2, q2
> >         vclz.i32  q3, q3
> >         vmovnb.i32      q1, q2
> >         vmovnt.i32      q1, q3
> >         vstrh.16        q1, [r0]
> > which could be improved to
> >         vldrh.16        q3, [r1]
> >       vclz.i16        q1, q3
> >         vstrh.16        q1, [r0]
> > if we could avoid the need for unpack/pack steps.
>
> Yeah, there was a PR about fixing this for popcount.  I guess the same
> approach would apply here too.
>
> > For reference, clang-12 generates:
> >       vldrh.s32       q0, [r1]
> >       vldrh.s32       q1, [r1, #8]
> >       vclz.i32        q0, q0
> >       vstrh.32        q0, [r0]
> >       vclz.i32        q0, q1
> >       vstrh.32        q0, [r0, #8]
> >
> > 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
> >
> >       gcc/
> >       * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
> >       (mve_vmovlbq_<supf><mode>): Likewise.
> >       (mve_vmovnbq_<supf><mode>): Likewise.
> >       (mve_vmovntq_<supf><mode>): Likewise.
> >       (@mve_vec_pack_trunc_<mode>): New pattern.
> >       * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
> >       vec-common.md.
> >       (vec_unpack<US>_lo_<mode>): Likewise.
> >       (vec_pack_trunc_<mode>): Rename to
> >       neon_quad_vec_pack_trunc_<mode>.
> >       * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
> >       pattern.
> >       (vec_unpack<US>_lo_<mode>): New.
> >       (vec_pack_trunc_<mode>): New.
> >
> >       gcc/testsuite/
> >       * gcc.target/arm/simd/mve-vclz.c: Update expected results.
> >       * gcc.target/arm/simd/mve-vshl.c: Likewise.
> > ---
> >  gcc/config/arm/mve.md                        | 20 ++++-
> >  gcc/config/arm/neon.md                       | 39 +--------
> >  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
> >  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
> >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
> >  5 files changed, 114 insertions(+), 46 deletions(-)
> >
> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> > index 99e46d0bc69..b18292c07d3 100644
> > --- a/gcc/config/arm/mve.md
> > +++ b/gcc/config/arm/mve.md
> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
> >  ;;
> >  ;; [vmovltq_u, vmovltq_s])
> >  ;;
> > -(define_insn "mve_vmovltq_<supf><mode>"
> > +(define_insn "@mve_vmovltq_<supf><mode>"
> >    [
> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
> >  ;;
> >  ;; [vmovlbq_s, vmovlbq_u])
> >  ;;
> > -(define_insn "mve_vmovlbq_<supf><mode>"
> > +(define_insn "@mve_vmovlbq_<supf><mode>"
> >    [
> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
> >  ;;
> >  ;; [vmovnbq_u, vmovnbq_s])
> >  ;;
> > -(define_insn "mve_vmovnbq_<supf><mode>"
> > +(define_insn "@mve_vmovnbq_<supf><mode>"
> >    [
> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >  ;;
> >  ;; [vmovntq_s, vmovntq_u])
> >  ;;
> > -(define_insn "mve_vmovntq_<supf><mode>"
> > +(define_insn "@mve_vmovntq_<supf><mode>"
> >    [
> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
> >    [(set_attr "type" "mve_move")
> >  ])
> >
> > +(define_insn "@mve_vec_pack_trunc_<mode>"
> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> > +       (vec_concat:<V_narrow_pack>
> > +             (truncate:<V_narrow>
> > +                     (match_operand:MVE_5 1 "register_operand" "w"))
> > +             (truncate:<V_narrow>
> > +                     (match_operand:MVE_5 2 "register_operand" "w"))))]
> > + "TARGET_HAVE_MVE"
> > + "vmovnb.i<V_sz_elem>        %q0, %q1\;vmovnt.i<V_sz_elem>   %q0, %q2"
> > +  [(set_attr "type" "mve_move")]
> > +)
> > +
>
> I realise this is (like you say) based on the neon.md pattern,
> but we should use separate vmovnb and vmovnt instructions instead
> of putting two instructions into a single pattern.
>
> One specific advantage to using separate patterns is that it would
> avoid the imprecision of the earlyclobber: the output only conflicts
> with operand 2 and can be tied to operand 1.
>
I'm not sure I follow: with the current patch, for test_clz_s16 we generate:
        vldrh.16        q3, [r1]
        vmovlb.s16   q2, q3
        vmovlt.s16   q3, q3
        vclz.i32  q2, q2
        vclz.i32  q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]

If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
and update define_expand "vec_pack_trunc_<mode>" to call:
  emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
                              operands[0], operands[1]));
  emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
                              operands[0], operands[2]));
instead, we now generate:
        vldrh.16        q3, [r1]
        vmovlb.s16   q1, q3
        vldr.64 d4, .L5
        vldr.64 d5, .L5+8
        vmovlt.s16   q3, q3
        vclz.i32  q1, q1
        vclz.i32  q3, q3
        vmovnb.i32      q2, q1
        vmovnt.i32      q2, q3
        vstrh.16        q2, [r0]
        bx      lr
.L6:
        .align  3
.L5:
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0

That's what I tried to explain in the commit message.

I guess I've misunderstood your comment?

Thanks,

Christophe

> >  ;;
> >  ;; [vmulq_f])
> >  ;;
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > index 0fdffaf4ec4..392d9607919 100644
> > --- a/gcc/config/arm/neon.md
> > +++ b/gcc/config/arm/neon.md
> > @@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>"
> >    [(set_attr "type" "neon_shift_imm_long")]
> >  )
> >
> > -(define_expand "vec_unpack<US>_hi_<mode>"
> > -  [(match_operand:<V_unpack> 0 "register_operand")
> > -   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > - "TARGET_NEON && !BYTES_BIG_ENDIAN"
> > -  {
> > -   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
> > -   rtx t1;
> > -   int i;
> > -   for (i = 0; i < (<V_mode_nunits>/2); i++)
> > -     RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > -
> > -   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > -   emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> > -                                                 operands[1],
> > -                                              t1));
> > -   DONE;
> > -  }
> > -)
> > -
> > -(define_expand "vec_unpack<US>_lo_<mode>"
> > -  [(match_operand:<V_unpack> 0 "register_operand")
> > -   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > - "TARGET_NEON && !BYTES_BIG_ENDIAN"
> > -  {
> > -   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
> > -   rtx t1;
> > -   int i;
> > -   for (i = 0; i < (<V_mode_nunits>/2) ; i++)
> > -     RTVEC_ELT (v, i) = GEN_INT (i);
> > -   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > -   emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
> > -                                                 operands[1],
> > -                                              t1));
> > -   DONE;
> > -  }
> > -)
> > -
> >  (define_insn "neon_vec_<US>mult_lo_<mode>"
> >   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
> >         (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF>
> > @@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>"
> >  ; because the ordering of vector elements in Q registers is different from what
> >  ; the semantics of the instructions require.
> >
> > -(define_insn "vec_pack_trunc_<mode>"
> > +(define_insn "neon_quad_vec_pack_trunc_<mode>"
> >   [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> >         (vec_concat:<V_narrow_pack>
> >               (truncate:<V_narrow>
> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > index 1ba1e5eb008..0ffc7a9322c 100644
> > --- a/gcc/config/arm/vec-common.md
> > +++ b/gcc/config/arm/vec-common.md
> > @@ -638,3 +638,92 @@ (define_expand "clz<mode>2"
> >      emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1]));
> >    DONE;
> >  })
> > +
> > +;; vmovl[tb] are not available for V4SI on MVE
> > +(define_expand "vec_unpack<US>_hi_<mode>"
> > +  [(match_operand:<V_unpack> 0 "register_operand")
> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > + "ARM_HAVE_<MODE>_ARITH
> > +  && !TARGET_REALLY_IWMMXT
> > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > +  && !BYTES_BIG_ENDIAN"
> > +  {
> > +    if (TARGET_NEON)
> > +      {
> > +     rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > +     rtx t1;
> > +     int i;
> > +     for (i = 0; i < (<V_mode_nunits>/2); i++)
> > +       RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > +
> > +     t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > +     emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> > +                                                   operands[1],
> > +                                                   t1));
> > +      }
> > +    else
> > +      {
> > +     emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0],
> > +                                 operands[1]));
>
> It would be good to use the same rtx pattern for the MVE instructions,
> since the compiler can optimise it better than an unspec.
>
> It would be good to have separate tests for the packs and unpacks
> as well, in case the vclz thing is fixed in future.
>
> Thanks,
> Richard
>
> > +      }
> > +    DONE;
> > +  }
> > +)
> > +
> > +;; vmovl[tb] are not available for V4SI on MVE
> > +(define_expand "vec_unpack<US>_lo_<mode>"
> > +  [(match_operand:<V_unpack> 0 "register_operand")
> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > + "ARM_HAVE_<MODE>_ARITH
> > +  && !TARGET_REALLY_IWMMXT
> > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > +  && !BYTES_BIG_ENDIAN"
> > +  {
> > +    if (TARGET_NEON)
> > +      {
> > +     rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > +     rtx t1;
> > +     int i;
> > +     for (i = 0; i < (<V_mode_nunits>/2) ; i++)
> > +       RTVEC_ELT (v, i) = GEN_INT (i);
> > +
> > +     t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > +     emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
> > +                                                   operands[1],
> > +                                                   t1));
> > +      }
> > +    else
> > +      {
> > +     emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0],
> > +                                 operands[1]));
> > +      }
> > +    DONE;
> > +  }
> > +)
> > +
> > +;; vmovn[tb] are not available for V2DI on MVE
> > +(define_expand "vec_pack_trunc_<mode>"
> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> > +       (vec_concat:<V_narrow_pack>
> > +             (truncate:<V_narrow>
> > +                     (match_operand:VN 1 "register_operand" "w"))
> > +             (truncate:<V_narrow>
> > +                     (match_operand:VN 2 "register_operand" "w"))))]
> > + "ARM_HAVE_<MODE>_ARITH
> > +  && !TARGET_REALLY_IWMMXT
> > +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> > +  && !BYTES_BIG_ENDIAN"
> > + {
> > +   if (TARGET_NEON)
> > +     {
> > +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> > +                                                    operands[2]));
> > +     }
> > +   else
> > +     {
> > +       emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1],
> > +                                       operands[2]));
> > +     }
> > +   DONE;
> > + }
> > +)
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> > index 7068736bc28..5d6e991cfc6 100644
> > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> > @@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz)
> >  FUNC(s, int, 8, clz)
> >  FUNC(u, uint, 8, clz)
> >
> > -/* 16 and 8-bit versions are not vectorized because they need pack/unpack
> > -   patterns since __builtin_clz uses 32-bit parameter and return value.  */
> > -/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 2 } } */
> > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
> > +   instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of
> > +   14 vclz.i32 expected in this testcase.  */
> > +/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 14 } } */
> >  /* { dg-final { scan-assembler-times {vclz\.i16  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
> >  /* { dg-final { scan-assembler-times {vclz\.i8  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > index 7a0644997c8..91dd942d818 100644
> > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > @@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm)
> >  /* MVE has only 128-bit vectors, so we can vectorize only half of the
> >     functions above.  */
> >  /* We only emit vshl.u, which is equivalent to vshl.s anyway.  */
> > -/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */
> > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
> > +   instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of
> > +   14 vshl.i32 expected in this testcase.  */
> > +/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */
> >
> >  /* We emit vshl.i when the shift amount is an immediate.  */
> >  /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
Christophe Lyon June 10, 2021, 1:46 p.m. UTC | #3
On Thu, 10 Jun 2021 at 11:50, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
>
> On Tue, 8 Jun 2021 at 14:10, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Christophe Lyon <christophe.lyon@linaro.org> writes:
> > > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
> > > vec_pack_trunc_<mode> patterns for MVE.
> > >
> > > It does so by moving the unpack patterns from neon.md to
> > > vec-common.md, while adding them support for MVE. The pack expander is
> > > derived from the Neon one (which in turn is renamed into
> > > neon_quad_vec_pack_trunc_<mode>).
> > >
> > > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
> > > zero-initialized temporary, which is needed if the
> > > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
> > > instead.
> > >
> > > With this patch, we can now vectorize the 16 and 8-bit versions of
> > > vclz and vshl, although the generated code could still be improved.
> > > For test_clz_s16, we now generate
> > >         vldrh.16        q3, [r1]
> > >         vmovlb.s16   q2, q3
> > >         vmovlt.s16   q3, q3
> > >         vclz.i32  q2, q2
> > >         vclz.i32  q3, q3
> > >         vmovnb.i32      q1, q2
> > >         vmovnt.i32      q1, q3
> > >         vstrh.16        q1, [r0]
> > > which could be improved to
> > >         vldrh.16        q3, [r1]
> > >       vclz.i16        q1, q3
> > >         vstrh.16        q1, [r0]
> > > if we could avoid the need for unpack/pack steps.
> >
> > Yeah, there was a PR about fixing this for popcount.  I guess the same
> > approach would apply here too.
> >
> > > For reference, clang-12 generates:
> > >       vldrh.s32       q0, [r1]
> > >       vldrh.s32       q1, [r1, #8]
> > >       vclz.i32        q0, q0
> > >       vstrh.32        q0, [r0]
> > >       vclz.i32        q0, q1
> > >       vstrh.32        q0, [r0, #8]
> > >
> > > 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
> > >
> > >       gcc/
> > >       * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
> > >       (mve_vmovlbq_<supf><mode>): Likewise.
> > >       (mve_vmovnbq_<supf><mode>): Likewise.
> > >       (mve_vmovntq_<supf><mode>): Likewise.
> > >       (@mve_vec_pack_trunc_<mode>): New pattern.
> > >       * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
> > >       vec-common.md.
> > >       (vec_unpack<US>_lo_<mode>): Likewise.
> > >       (vec_pack_trunc_<mode>): Rename to
> > >       neon_quad_vec_pack_trunc_<mode>.
> > >       * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
> > >       pattern.
> > >       (vec_unpack<US>_lo_<mode>): New.
> > >       (vec_pack_trunc_<mode>): New.
> > >
> > >       gcc/testsuite/
> > >       * gcc.target/arm/simd/mve-vclz.c: Update expected results.
> > >       * gcc.target/arm/simd/mve-vshl.c: Likewise.
> > > ---
> > >  gcc/config/arm/mve.md                        | 20 ++++-
> > >  gcc/config/arm/neon.md                       | 39 +--------
> > >  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
> > >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
> > >  5 files changed, 114 insertions(+), 46 deletions(-)
> > >
> > > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> > > index 99e46d0bc69..b18292c07d3 100644
> > > --- a/gcc/config/arm/mve.md
> > > +++ b/gcc/config/arm/mve.md
> > > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
> > >  ;;
> > >  ;; [vmovltq_u, vmovltq_s])
> > >  ;;
> > > -(define_insn "mve_vmovltq_<supf><mode>"
> > > +(define_insn "@mve_vmovltq_<supf><mode>"
> > >    [
> > >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> > >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> > > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
> > >  ;;
> > >  ;; [vmovlbq_s, vmovlbq_u])
> > >  ;;
> > > -(define_insn "mve_vmovlbq_<supf><mode>"
> > > +(define_insn "@mve_vmovlbq_<supf><mode>"
> > >    [
> > >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> > >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> > > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
> > >  ;;
> > >  ;; [vmovnbq_u, vmovnbq_s])
> > >  ;;
> > > -(define_insn "mve_vmovnbq_<supf><mode>"
> > > +(define_insn "@mve_vmovnbq_<supf><mode>"
> > >    [
> > >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> > >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> > > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> > >  ;;
> > >  ;; [vmovntq_s, vmovntq_u])
> > >  ;;
> > > -(define_insn "mve_vmovntq_<supf><mode>"
> > > +(define_insn "@mve_vmovntq_<supf><mode>"
> > >    [
> > >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> > >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> > > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
> > >    [(set_attr "type" "mve_move")
> > >  ])
> > >
> > > +(define_insn "@mve_vec_pack_trunc_<mode>"
> > > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> > > +       (vec_concat:<V_narrow_pack>
> > > +             (truncate:<V_narrow>
> > > +                     (match_operand:MVE_5 1 "register_operand" "w"))
> > > +             (truncate:<V_narrow>
> > > +                     (match_operand:MVE_5 2 "register_operand" "w"))))]
> > > + "TARGET_HAVE_MVE"
> > > + "vmovnb.i<V_sz_elem>        %q0, %q1\;vmovnt.i<V_sz_elem>   %q0, %q2"
> > > +  [(set_attr "type" "mve_move")]
> > > +)
> > > +
> >
> > I realise this is (like you say) based on the neon.md pattern,
> > but we should use separate vmovnb and vmovnt instructions instead
> > of putting two instructions into a single pattern.
> >
> > One specific advantage to using separate patterns is that it would
> > avoid the imprecision of the earlyclobber: the output only conflicts
> > with operand 2 and can be tied to operand 1.
> >
> I'm not sure to follow: with the current patch, for test_clz_s16 we generate:
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q2, q3
>         vmovlt.s16   q3, q3
>         vclz.i32  q2, q2
>         vclz.i32  q3, q3
>         vmovnb.i32      q1, q2
>         vmovnt.i32      q1, q3
>         vstrh.16        q1, [r0]
>
> If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
> and update define_expand "vec_pack_trunc_<mode>" to call:
> emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
> operands[0], operands[1]));
> emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
> operands[0], operands[2]));
> instead, we now generate:
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q1, q3
>         vldr.64 d4, .L5
>         vldr.64 d5, .L5+8
>         vmovlt.s16   q3, q3
>         vclz.i32  q1, q1
>         vclz.i32  q3, q3
>         vmovnb.i32      q2, q1
>         vmovnt.i32      q2, q3
>         vstrh.16        q2, [r0]
>         bx      lr
> .L6:
>         .align  3
> .L5:
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>
> That's what I tried to explain in the commit message.
>
> I guess I've misunderstood your comment?
>
> Thanks,
>
> Christophe
>
> > >  ;;
> > >  ;; [vmulq_f])
> > >  ;;
> > > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > > index 0fdffaf4ec4..392d9607919 100644
> > > --- a/gcc/config/arm/neon.md
> > > +++ b/gcc/config/arm/neon.md
> > > @@ -5924,43 +5924,6 @@ (define_insn "neon_vec_unpack<US>_hi_<mode>"
> > >    [(set_attr "type" "neon_shift_imm_long")]
> > >  )
> > >
> > > -(define_expand "vec_unpack<US>_hi_<mode>"
> > > -  [(match_operand:<V_unpack> 0 "register_operand")
> > > -   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > > - "TARGET_NEON && !BYTES_BIG_ENDIAN"
> > > -  {
> > > -   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
> > > -   rtx t1;
> > > -   int i;
> > > -   for (i = 0; i < (<V_mode_nunits>/2); i++)
> > > -     RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > > -
> > > -   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > > -   emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> > > -                                                 operands[1],
> > > -                                              t1));
> > > -   DONE;
> > > -  }
> > > -)
> > > -
> > > -(define_expand "vec_unpack<US>_lo_<mode>"
> > > -  [(match_operand:<V_unpack> 0 "register_operand")
> > > -   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > > - "TARGET_NEON && !BYTES_BIG_ENDIAN"
> > > -  {
> > > -   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
> > > -   rtx t1;
> > > -   int i;
> > > -   for (i = 0; i < (<V_mode_nunits>/2) ; i++)
> > > -     RTVEC_ELT (v, i) = GEN_INT (i);
> > > -   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > > -   emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
> > > -                                                 operands[1],
> > > -                                              t1));
> > > -   DONE;
> > > -  }
> > > -)
> > > -
> > >  (define_insn "neon_vec_<US>mult_lo_<mode>"
> > >   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
> > >         (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF>
> > > @@ -6176,7 +6139,7 @@ (define_expand "vec_widen_<US>shiftl_lo_<mode>"
> > >  ; because the ordering of vector elements in Q registers is different from what
> > >  ; the semantics of the instructions require.
> > >
> > > -(define_insn "vec_pack_trunc_<mode>"
> > > +(define_insn "neon_quad_vec_pack_trunc_<mode>"
> > >   [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> > >         (vec_concat:<V_narrow_pack>
> > >               (truncate:<V_narrow>
> > > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > > index 1ba1e5eb008..0ffc7a9322c 100644
> > > --- a/gcc/config/arm/vec-common.md
> > > +++ b/gcc/config/arm/vec-common.md
> > > @@ -638,3 +638,92 @@ (define_expand "clz<mode>2"
> > >      emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1]));
> > >    DONE;
> > >  })
> > > +
> > > +;; vmovl[tb] are not available for V4SI on MVE
> > > +(define_expand "vec_unpack<US>_hi_<mode>"
> > > +  [(match_operand:<V_unpack> 0 "register_operand")
> > > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > > + "ARM_HAVE_<MODE>_ARITH
> > > +  && !TARGET_REALLY_IWMMXT
> > > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > > +  && !BYTES_BIG_ENDIAN"
> > > +  {
> > > +    if (TARGET_NEON)
> > > +      {
> > > +     rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > > +     rtx t1;
> > > +     int i;
> > > +     for (i = 0; i < (<V_mode_nunits>/2); i++)
> > > +       RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > > +
> > > +     t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > > +     emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> > > +                                                   operands[1],
> > > +                                                   t1));
> > > +      }
> > > +    else
> > > +      {
> > > +     emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0],
> > > +                                 operands[1]));
> >
> > It would be good to use the same rtx pattern for the MVE instructions,
> > since the compiler can optimise it better than an unspec.
> >
> > It would be good to have separate tests for the packs and unpacks
> > as well, in case the vclz thing is fixed in future.
> >

And while writing new tests, I noticed that the same problem happens
with pack as I proposed it:
test_pack_s32:
        mov     r3, r1
        vldr.64 d6, .L3
        vldr.64 d7, .L3+8
        vldrw.32        q1, [r3]
        adds    r1, r1, #16
        vldrw.32        q2, [r1]
        vmovnb.i32      q3, q1
        vmovnt.i32      q3, q2
        vstrh.16        q3, [r0]
        bx      lr
.L4:
        .align  3
.L3:
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
        .short  0
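
For reference, the source of this test is schematically as follows; a
hand-written sketch inferred from the asm above:

        void
        test_pack_s32 (int16_t * __restrict__ dest, int32_t * __restrict__ a)
        {
          int i;
          /* The int32_t -> int16_t truncation maps to vec_pack_trunc.  */
          for (i = 0; i < 8; i++)
            dest[i] = a[i];
        }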

We shouldn't need to initialize d6/d7 (q3): the zeros are only loaded
because vmovnb's tied input operand makes q3 appear to be read while
still uninitialized.



> > Thanks,
> > Richard
> >
> > > +      }
> > > +    DONE;
> > > +  }
> > > +)
> > > +
> > > +;; vmovl[tb] are not available for V4SI on MVE
> > > +(define_expand "vec_unpack<US>_lo_<mode>"
> > > +  [(match_operand:<V_unpack> 0 "register_operand")
> > > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > > + "ARM_HAVE_<MODE>_ARITH
> > > +  && !TARGET_REALLY_IWMMXT
> > > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > > +  && !BYTES_BIG_ENDIAN"
> > > +  {
> > > +    if (TARGET_NEON)
> > > +      {
> > > +     rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > > +     rtx t1;
> > > +     int i;
> > > +     for (i = 0; i < (<V_mode_nunits>/2) ; i++)
> > > +       RTVEC_ELT (v, i) = GEN_INT (i);
> > > +
> > > +     t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > > +     emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
> > > +                                                   operands[1],
> > > +                                                   t1));
> > > +      }
> > > +    else
> > > +      {
> > > +     emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0],
> > > +                                 operands[1]));
> > > +      }
> > > +    DONE;
> > > +  }
> > > +)
> > > +
> > > +;; vmovn[tb] are not available for V2DI on MVE
> > > +(define_expand "vec_pack_trunc_<mode>"
> > > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> > > +       (vec_concat:<V_narrow_pack>
> > > +             (truncate:<V_narrow>
> > > +                     (match_operand:VN 1 "register_operand" "w"))
> > > +             (truncate:<V_narrow>
> > > +                     (match_operand:VN 2 "register_operand" "w"))))]
> > > + "ARM_HAVE_<MODE>_ARITH
> > > +  && !TARGET_REALLY_IWMMXT
> > > +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> > > +  && !BYTES_BIG_ENDIAN"
> > > + {
> > > +   if (TARGET_NEON)
> > > +     {
> > > +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> > > +                                                    operands[2]));
> > > +     }
> > > +   else
> > > +     {
> > > +       emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1],
> > > +                                       operands[2]));
> > > +     }
> > > +   DONE;
> > > + }
> > > +)
> > > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> > > index 7068736bc28..5d6e991cfc6 100644
> > > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> > > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
> > > @@ -21,8 +21,9 @@ FUNC(u, uint, 16, clz)
> > >  FUNC(s, int, 8, clz)
> > >  FUNC(u, uint, 8, clz)
> > >
> > > -/* 16 and 8-bit versions are not vectorized because they need pack/unpack
> > > -   patterns since __builtin_clz uses 32-bit parameter and return value.  */
> > > -/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 2 } } */
> > > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
> > > +   instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of
> > > +   14 vclz.i32 expected in this testcase.  */
> > > +/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 14 } } */
> > >  /* { dg-final { scan-assembler-times {vclz\.i16  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
> > >  /* { dg-final { scan-assembler-times {vclz\.i8  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
> > > diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > > index 7a0644997c8..91dd942d818 100644
> > > --- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > > +++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > > @@ -56,7 +56,10 @@ FUNC_IMM(u, uint, 8, 16, <<, vshlimm)
> > >  /* MVE has only 128-bit vectors, so we can vectorize only half of the
> > >     functions above.  */
> > >  /* We only emit vshl.u, which is equivalent to vshl.s anyway.  */
> > > -/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */
> > > +/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
> > > +   instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of
> > > +   14 vshl.i32 expected in this testcase.  */
> > > +/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */
> > >
> > >  /* We emit vshl.i when the shift amount is an immediate.  */
> > >  /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */
Richard Sandiford June 11, 2021, 9:38 a.m. UTC | #4
Christophe Lyon <christophe.lyon@linaro.org> writes:
> On Tue, 8 Jun 2021 at 14:10, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Christophe Lyon <christophe.lyon@linaro.org> writes:
>> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
>> > vec_pack_trunc_<mode> patterns for MVE.
>> >
>> > It does so by moving the unpack patterns from neon.md to
>> > vec-common.md, while adding them support for MVE. The pack expander is
>> > derived from the Neon one (which in turn is renamed into
>> > neon_quad_vec_pack_trunc_<mode>).
>> >
>> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
>> > zero-initialized temporary, which is needed if the
>> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
>> > instead.
>> >
>> > With this patch, we can now vectorize the 16 and 8-bit versions of
>> > vclz and vshl, although the generated code could still be improved.
>> > For test_clz_s16, we now generate
>> >         vldrh.16        q3, [r1]
>> >         vmovlb.s16   q2, q3
>> >         vmovlt.s16   q3, q3
>> >         vclz.i32  q2, q2
>> >         vclz.i32  q3, q3
>> >         vmovnb.i32      q1, q2
>> >         vmovnt.i32      q1, q3
>> >         vstrh.16        q1, [r0]
>> > which could be improved to
>> >         vldrh.16        q3, [r1]
>> >       vclz.i16        q1, q3
>> >         vstrh.16        q1, [r0]
>> > if we could avoid the need for unpack/pack steps.
>>
>> Yeah, there was a PR about fixing this for popcount.  I guess the same
>> approach would apply here too.
>>
>> > For reference, clang-12 generates:
>> >       vldrh.s32       q0, [r1]
>> >       vldrh.s32       q1, [r1, #8]
>> >       vclz.i32        q0, q0
>> >       vstrh.32        q0, [r0]
>> >       vclz.i32        q0, q1
>> >       vstrh.32        q0, [r0, #8]
>> >
>> > 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
>> >
>> >       gcc/
>> >       * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
>> >       (mve_vmovlbq_<supf><mode>): Likewise.
>> >       (mve_vmovnbq_<supf><mode>): Likewise.
>> >       (mve_vmovntq_<supf><mode>): Likewise.
>> >       (@mve_vec_pack_trunc_<mode>): New pattern.
>> >       * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
>> >       vec-common.md.
>> >       (vec_unpack<US>_lo_<mode>): Likewise.
>> >       (vec_pack_trunc_<mode>): Rename to
>> >       neon_quad_vec_pack_trunc_<mode>.
>> >       * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
>> >       pattern.
>> >       (vec_unpack<US>_lo_<mode>): New.
>> >       (vec_pack_trunc_<mode>): New.
>> >
>> >       gcc/testsuite/
>> >       * gcc.target/arm/simd/mve-vclz.c: Update expected results.
>> >       * gcc.target/arm/simd/mve-vshl.c: Likewise.
>> > ---
>> >  gcc/config/arm/mve.md                        | 20 ++++-
>> >  gcc/config/arm/neon.md                       | 39 +--------
>> >  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
>> >  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
>> >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
>> >  5 files changed, 114 insertions(+), 46 deletions(-)
>> >
>> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>> > index 99e46d0bc69..b18292c07d3 100644
>> > --- a/gcc/config/arm/mve.md
>> > +++ b/gcc/config/arm/mve.md
>> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
>> >  ;;
>> >  ;; [vmovltq_u, vmovltq_s])
>> >  ;;
>> > -(define_insn "mve_vmovltq_<supf><mode>"
>> > +(define_insn "@mve_vmovltq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
>> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
>> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
>> >  ;;
>> >  ;; [vmovlbq_s, vmovlbq_u])
>> >  ;;
>> > -(define_insn "mve_vmovlbq_<supf><mode>"
>> > +(define_insn "@mve_vmovlbq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
>> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
>> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
>> >  ;;
>> >  ;; [vmovnbq_u, vmovnbq_s])
>> >  ;;
>> > -(define_insn "mve_vmovnbq_<supf><mode>"
>> > +(define_insn "@mve_vmovnbq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
>> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
>> >  ;;
>> >  ;; [vmovntq_s, vmovntq_u])
>> >  ;;
>> > -(define_insn "mve_vmovntq_<supf><mode>"
>> > +(define_insn "@mve_vmovntq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
>> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
>> >    [(set_attr "type" "mve_move")
>> >  ])
>> >
>> > +(define_insn "@mve_vec_pack_trunc_<mode>"
>> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
>> > +       (vec_concat:<V_narrow_pack>
>> > +             (truncate:<V_narrow>
>> > +                     (match_operand:MVE_5 1 "register_operand" "w"))
>> > +             (truncate:<V_narrow>
>> > +                     (match_operand:MVE_5 2 "register_operand" "w"))))]
>> > + "TARGET_HAVE_MVE"
>> > + "vmovnb.i<V_sz_elem>        %q0, %q1\;vmovnt.i<V_sz_elem>   %q0, %q2"
>> > +  [(set_attr "type" "mve_move")]
>> > +)
>> > +
>>
>> I realise this is (like you say) based on the neon.md pattern,
>> but we should use separate vmovnb and vmovnt instructions instead
>> of putting two instructions into a single pattern.
>>
>> One specific advantage to using separate patterns is that it would
>> avoid the imprecision of the earlyclobber: the output only conflicts
>> with operand 2 and can be tied to operand 1.
>>
> I'm not sure to follow: with the current patch, for test_clz_s16 we generate:
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q2, q3
>         vmovlt.s16   q3, q3
>         vclz.i32  q2, q2
>         vclz.i32  q3, q3
>         vmovnb.i32      q1, q2
>         vmovnt.i32      q1, q3
>         vstrh.16        q1, [r0]
>
> If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
> and update define_expand "vec_pack_trunc_<mode>" to call:
> emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
> operands[0], operands[1]));
> emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
> operands[0], operands[2]));
> instead, we now generate:
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q1, q3
>         vldr.64 d4, .L5
>         vldr.64 d5, .L5+8
>         vmovlt.s16   q3, q3
>         vclz.i32  q1, q1
>         vclz.i32  q3, q3
>         vmovnb.i32      q2, q1
>         vmovnt.i32      q2, q3
>         vstrh.16        q2, [r0]
>         bx      lr
> .L6:
>         .align  3
> .L5:
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>
> That's what I tried to explain in the commit message.

Sorry, I missed that part of the message.  If we want the first
instruction to have an in/out operand whose initial value is undefined
then I think the best way would be to have a second pattern for vmovnb
(only) that acts like a unary operation.

FWIW, in the above, we shouldn't assign to operands[0] twice.
The result of the first operation needs to be a new temporary register.
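
Something along these lines, as an untested sketch (VMOVNBQ_UNDEF
would be a new unspec code; the names are only illustrative):

    (define_insn "@mve_vmovnbq_undef_<mode>"
      [(set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
            (unspec:<V_narrow_pack>
              [(match_operand:MVE_5 1 "s_register_operand" "w")]
              VMOVNBQ_UNDEF))]
      "TARGET_HAVE_MVE"
      "vmovnb.i<V_sz_elem>  %q0, %q1"
      [(set_attr "type" "mve_move")]
    )

with the expander then using a fresh temporary for the first step:

    rtx tmp = gen_reg_rtx (<V_narrow_pack>mode);
    emit_insn (gen_mve_vmovnbq_undef (<MODE>mode, tmp, operands[1]));
    emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
                                tmp, operands[2]));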

Thanks,
Richard
Christophe Lyon June 11, 2021, 10:15 a.m. UTC | #5
On Fri, 11 Jun 2021 at 11:38, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > On Tue, 8 Jun 2021 at 14:10, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Christophe Lyon <christophe.lyon@linaro.org> writes:
> >> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
> >> > vec_pack_trunc_<mode> patterns for MVE.
> >> >
> >> > It does so by moving the unpack patterns from neon.md to
> >> > vec-common.md, while adding them support for MVE. The pack expander is
> >> > derived from the Neon one (which in turn is renamed into
> >> > neon_quad_vec_pack_trunc_<mode>).
> >> >
> >> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
> >> > zero-initialized temporary, which is needed if the
> >> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
> >> > instead.
> >> >
> >> > With this patch, we can now vectorize the 16 and 8-bit versions of
> >> > vclz and vshl, although the generated code could still be improved.
> >> > For test_clz_s16, we now generate
> >> >         vldrh.16        q3, [r1]
> >> >         vmovlb.s16   q2, q3
> >> >         vmovlt.s16   q3, q3
> >> >         vclz.i32  q2, q2
> >> >         vclz.i32  q3, q3
> >> >         vmovnb.i32      q1, q2
> >> >         vmovnt.i32      q1, q3
> >> >         vstrh.16        q1, [r0]
> >> > which could be improved to
> >> >         vldrh.16        q3, [r1]
> >> >       vclz.i16        q1, q3
> >> >         vstrh.16        q1, [r0]
> >> > if we could avoid the need for unpack/pack steps.
> >>
> >> Yeah, there was a PR about fixing this for popcount.  I guess the same
> >> approach would apply here too.
> >>
> >> > For reference, clang-12 generates:
> >> >       vldrh.s32       q0, [r1]
> >> >       vldrh.s32       q1, [r1, #8]
> >> >       vclz.i32        q0, q0
> >> >       vstrh.32        q0, [r0]
> >> >       vclz.i32        q0, q1
> >> >       vstrh.32        q0, [r0, #8]
> >> >
> >> > 2021-06-03  Christophe Lyon  <christophe.lyon@linaro.org>
> >> >
> >> >       gcc/
> >> >       * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
> >> >       (mve_vmovlbq_<supf><mode>): Likewise.
> >> >       (mve_vmovnbq_<supf><mode>): Likewise.
> >> >       (mve_vmovntq_<supf><mode>): Likewise.
> >> >       (@mve_vec_pack_trunc_<mode>): New pattern.
> >> >       * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
> >> >       vec-common.md.
> >> >       (vec_unpack<US>_lo_<mode>): Likewise.
> >> >       (vec_pack_trunc_<mode>): Rename to
> >> >       neon_quad_vec_pack_trunc_<mode>.
> >> >       * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
> >> >       pattern.
> >> >       (vec_unpack<US>_lo_<mode>): New.
> >> >       (vec_pack_trunc_<mode>): New.
> >> >
> >> >       gcc/testsuite/
> >> >       * gcc.target/arm/simd/mve-vclz.c: Update expected results.
> >> >       * gcc.target/arm/simd/mve-vshl.c: Likewise.
> >> > ---
> >> >  gcc/config/arm/mve.md                        | 20 ++++-
> >> >  gcc/config/arm/neon.md                       | 39 +--------
> >> >  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
> >> >  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
> >> >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
> >> >  5 files changed, 114 insertions(+), 46 deletions(-)
> >> >
> >> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> >> > index 99e46d0bc69..b18292c07d3 100644
> >> > --- a/gcc/config/arm/mve.md
> >> > +++ b/gcc/config/arm/mve.md
> >> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
> >> >  ;;
> >> >  ;; [vmovltq_u, vmovltq_s])
> >> >  ;;
> >> > -(define_insn "mve_vmovltq_<supf><mode>"
> >> > +(define_insn "@mve_vmovltq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> >> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
> >> >  ;;
> >> >  ;; [vmovlbq_s, vmovlbq_u])
> >> >  ;;
> >> > -(define_insn "mve_vmovlbq_<supf><mode>"
> >> > +(define_insn "@mve_vmovlbq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
> >> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
> >> >  ;;
> >> >  ;; [vmovnbq_u, vmovnbq_s])
> >> >  ;;
> >> > -(define_insn "mve_vmovnbq_<supf><mode>"
> >> > +(define_insn "@mve_vmovnbq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> >> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >> >  ;;
> >> >  ;; [vmovntq_s, vmovntq_u])
> >> >  ;;
> >> > -(define_insn "mve_vmovntq_<supf><mode>"
> >> > +(define_insn "@mve_vmovntq_<supf><mode>"
> >> >    [
> >> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
> >> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
> >> >    [(set_attr "type" "mve_move")
> >> >  ])
> >> >
> >> > +(define_insn "@mve_vec_pack_trunc_<mode>"
> >> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
> >> > +       (vec_concat:<V_narrow_pack>
> >> > +             (truncate:<V_narrow>
> >> > +                     (match_operand:MVE_5 1 "register_operand" "w"))
> >> > +             (truncate:<V_narrow>
> >> > +                     (match_operand:MVE_5 2 "register_operand" "w"))))]
> >> > + "TARGET_HAVE_MVE"
> >> > + "vmovnb.i<V_sz_elem>        %q0, %q1\;vmovnt.i<V_sz_elem>   %q0, %q2"
> >> > +  [(set_attr "type" "mve_move")]
> >> > +)
> >> > +
> >>
> >> I realise this is (like you say) based on the neon.md pattern,
> >> but we should use separate vmovnb and vmovnt instructions instead
> >> of putting two instructions into a single pattern.
> >>
> >> One specific advantage to using separate patterns is that it would
> >> avoid the imprecision of the earlyclobber: the output only conflicts
> >> with operand 1 and can be tied to operand 2.
> >>
> > I'm not sure I follow: with the current patch, for test_clz_s16 we generate:
> >         vldrh.16        q3, [r1]
> >         vmovlb.s16   q2, q3
> >         vmovlt.s16   q3, q3
> >         vclz.i32  q2, q2
> >         vclz.i32  q3, q3
> >         vmovnb.i32      q1, q2
> >         vmovnt.i32      q1, q3
> >         vstrh.16        q1, [r0]
> >
> > If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
> > and update define_expand "vec_pack_trunc_<mode>" to call:
> > emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
> > operands[0], operands[1]));
> > emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
> > operands[0], operands[2]));
> > instead, we now generate:
> >         vldrh.16        q3, [r1]
> >         vmovlb.s16   q1, q3
> >         vldr.64 d4, .L5
> >         vldr.64 d5, .L5+8
> >         vmovlt.s16   q3, q3
> >         vclz.i32  q1, q1
> >         vclz.i32  q3, q3
> >         vmovnb.i32      q2, q1
> >         vmovnt.i32      q2, q3
> >         vstrh.16        q2, [r0]
> >         bx      lr
> > .L6:
> >         .align  3
> > .L5:
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >         .short  0
> >
> > That's what I tried to explain in the commit message.
>
> Sorry, I missed that part of the message.  If we want the first
> instruction to have an in/out operand whose initial value is undefined
> then I think the best way would be to have a second pattern for vmovnb
> (only) that acts like a unary operation.
>
> FWIW, in the above, we shouldn't assign to operands[0] twice.
> The result of the first operation needs to be a new temporary register.

Thanks.
In the meantime, I tried to make some progress, and looked at how
things work on aarch64.

This led me to define (in mve.md):

(define_insn "@mve_vec_pack_trunc_lo_<mode>"
 [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
       (truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))]
 "TARGET_HAVE_MVE"
 "vmovnb.i<V_sz_elem>   %q0, %q1\;"
  [(set_attr "type" "mve_move")]
)

(define_insn "@mve_vec_pack_trunc_hi_<mode>"
 [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w")
       (vec_concat:<V_narrow_pack>
                (match_operand:<V_narrow> 1 "register_operand" "0")
                (truncate:<V_narrow>
                        (match_operand:MVE_5 2 "register_operand" "w"))))]
 "TARGET_HAVE_MVE"
 "vmovnt.i<V_sz_elem>   %q0, %q2\;"
  [(set_attr "type" "mve_move")]
)

and in vec-common.md, for
(define_expand "vec_pack_trunc_<mode>"
 [(set (match_operand:<V_narrow_pack> 0 "register_operand")
       (vec_concat:<V_narrow_pack>
                (truncate:<V_narrow>
                        (match_operand:VN 1 "register_operand"))
                (truncate:<V_narrow>
                        (match_operand:VN 2 "register_operand"))))]

I expand this for MVE:
      rtx tmpreg = gen_reg_rtx (<V_narrow>mode);
      emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
      emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0],
tmpreg, operands[2]));

I am getting an error in reload:
error: unable to generate reloads for:
(insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ])
        (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ])))
"/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609
{mve_vec_pack_trunc_lo_v4si}
     (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ])
        (nil)))

The next insn is:
(insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ])
        (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 ])
            (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ]))))
"/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611
{mve_vec_pack_trunc_hi_v4si}
     (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ])
        (expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ])
            (nil))))

What is causing the reload error?

Thanks,

Christophe
Richard Sandiford June 11, 2021, 10:20 a.m. UTC | #6
Christophe Lyon <christophe.lyon@linaro.org> writes:
> In the meantime, I tried to make some progress, and looked at how
> things work on aarch64.
>
> This led me to define (in mve.md):
>
> (define_insn "@mve_vec_pack_trunc_lo_<mode>"
>  [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
>        (truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))]
>  "TARGET_HAVE_MVE"
>  "vmovnb.i<V_sz_elem>   %q0, %q1\;"
>   [(set_attr "type" "mve_move")]
> )
>
> (define_insn "@mve_vec_pack_trunc_hi_<mode>"
>  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w")
>        (vec_concat:<V_narrow_pack>
>                 (match_operand:<V_narrow> 1 "register_operand" "0")
>                 (truncate:<V_narrow>
>                         (match_operand:MVE_5 2 "register_operand" "w"))))]
>  "TARGET_HAVE_MVE"
>  "vmovnt.i<V_sz_elem>   %q0, %q2\;"
>   [(set_attr "type" "mve_move")]
> )
>
> and in vec-common.md, for
> (define_expand "vec_pack_trunc_<mode>"
>  [(set (match_operand:<V_narrow_pack> 0 "register_operand")
>        (vec_concat:<V_narrow_pack>
>                 (truncate:<V_narrow>
>                         (match_operand:VN 1 "register_operand"))
>                 (truncate:<V_narrow>
>                         (match_operand:VN 2 "register_operand"))))]
>
> I expand this for MVE:
>       rtx tmpreg = gen_reg_rtx (<V_narrow>mode);
>       emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
>       emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0],
> tmpreg, operands[2]));
>
> I am getting an error in reload:
> error: unable to generate reloads for:
> (insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ])
>         (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ])))
> "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609
> {mve_vec_pack_trunc_lo_v4si}
>      (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ])
>         (nil)))
>
> The next insn is:
> (insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ])
>         (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 ])
>             (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ]))))
> "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611
> {mve_vec_pack_trunc_hi_v4si}
>      (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ])
>         (expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ])
>             (nil))))
>
> What is causing the reload error?

For this to work on MVE, we'd need to allow the 64-bit V_narrow modes to be
stored in MVE registers.  I'm not sure off-hand whether allowing that
would be a good idea or not.

If we continue to allow only 128-bit vectors in MVE registers then
we'll need to continue to use unspecs instead of truncate and vec_concat.
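
For the narrowing side that would mean a unary variant of the existing
mve_vmovnbq unspec pattern, keeping the 128-bit <V_narrow_pack> mode on
the output so that no 64-bit vector mode is ever created, e.g. (a
sketch only):

  (define_insn "@mve_vec_pack_trunc_lo_<mode>"
    [(set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
          (unspec:<V_narrow_pack>
            [(match_operand:MVE_5 1 "s_register_operand" "w")]
           VMOVNBQ_S))]
    "TARGET_HAVE_MVE"
    "vmovnb.i%#<V_sz_elem>  %q0, %q1"
    [(set_attr "type" "mve_move")])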

Thanks,
Richard
Christophe Lyon June 11, 2021, 3:35 p.m. UTC | #7
On Fri, 11 Jun 2021 at 12:20, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > In the meantime, I tried to make some progress, and looked at how
> > things work on aarch64.
> >
> > This led me to define (in mve.md):
> >
> > (define_insn "@mve_vec_pack_trunc_lo_<mode>"
> >  [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
> >        (truncate:<V_narrow> (match_operand:MVE_5 1 "register_operand" "w")))]
> >  "TARGET_HAVE_MVE"
> >  "vmovnb.i<V_sz_elem>   %q0, %q1\;"
> >   [(set_attr "type" "mve_move")]
> > )
> >
> > (define_insn "@mve_vec_pack_trunc_hi_<mode>"
> >  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=w")
> >        (vec_concat:<V_narrow_pack>
> >                 (match_operand:<V_narrow> 1 "register_operand" "0")
> >                 (truncate:<V_narrow>
> >                         (match_operand:MVE_5 2 "register_operand" "w"))))]
> >  "TARGET_HAVE_MVE"
> >  "vmovnt.i<V_sz_elem>   %q0, %q2\;"
> >   [(set_attr "type" "mve_move")]
> > )
> >
> > and in vec-common.md, for
> > (define_expand "vec_pack_trunc_<mode>"
> >  [(set (match_operand:<V_narrow_pack> 0 "register_operand")
> >        (vec_concat:<V_narrow_pack>
> >                 (truncate:<V_narrow>
> >                         (match_operand:VN 1 "register_operand"))
> >                 (truncate:<V_narrow>
> >                         (match_operand:VN 2 "register_operand"))))]
> >
> > I expand this for MVE:
> >       rtx tmpreg = gen_reg_rtx (<V_narrow>mode);
> >       emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
> >       emit_insn (gen_mve_vec_pack_trunc_hi (<MODE>mode, operands[0],
> > tmpreg, operands[2]));
> >
> > I am getting an error in reload:
> > error: unable to generate reloads for:
> > (insn 10 9 11 2 (set (reg:V4HI 122 [ vect__4.16 ])
> >         (truncate:V4HI (reg:V4SI 118 [ vect__4.16 ])))
> > "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3609
> > {mve_vec_pack_trunc_lo_v4si}
> >      (expr_list:REG_DEAD (reg:V4SI 118 [ vect__4.16 ])
> >         (nil)))
> >
> > The next insn is:
> > (insn 11 10 12 2 (set (reg:V8HI 121 [ vect__7.18 ])
> >         (vec_concat:V8HI (reg:V4HI 122 [ vect__4.16 ])
> >             (truncate:V4HI (reg:V4SI 119 [ vect__4.17 ]))))
> > "/gcc.target/arm/simd/mve-vec-pack.c":17:112 3611
> > {mve_vec_pack_trunc_hi_v4si}
> >      (expr_list:REG_DEAD (reg:V4HI 122 [ vect__4.16 ])
> >         (expr_list:REG_DEAD (reg:V4SI 119 [ vect__4.17 ])
> >             (nil))))
> >
> > What is causing the reload error?
>
> For this to work on MVE, we'd need to allow the 64-bit V_narrow modes to be
> stored in MVE registers.  I'm not sure off-hand whether allowing that
> would be a good idea or not.
>
> If we continue to allow only 128-bit vectors in MVE registers then
> we'll need to continue to use unspecs instead of truncate and vec_concat.
>

Thanks for the feedback. How about v2 attached?
Do you want me to merge neon_vec_unpack<US> and
mve_vec_unpack<US>  and only have different assembly?
if (TARGET_HAVE_MVE)
  return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
else
  return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";


> Thanks,
> Richard
Richard Sandiford June 11, 2021, 4:25 p.m. UTC | #8
Christophe Lyon <christophe.lyon@linaro.org> writes:
> Thanks for the feedback. How about v2 attached?
> Do you want me to merge neon_vec_unpack<US> and
> mve_vec_unpack<US>  and only have different assembly?
> if (TARGET_HAVE_MVE)
>   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> else
>   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";

I think it'd be better to keep them separate, given that they have
different attributes.

> @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
>    [(set_attr "type" "mve_move")
>  ])
>  
> +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
> +;; need for an extra register.

“to avoid the need for an uninitialized input operand” might
be clearer.

> +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
> +  [
> +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> +	(unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
> +	 VMOVNBQ_S))
> +  ]
> +  "TARGET_HAVE_MVE"
> +  "vmovnb.i%#<V_sz_elem>	%q0, %q1"
> +  [(set_attr "type" "mve_move")
> +])
> +
> […]
> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> index 430a92ce966..3afb3e1d891 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm/vec-common.md
> @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
>    "ARM_HAVE_<MODE>_ARITH
>     && !TARGET_REALLY_IWMMXT"
>  )
> +
> +;; vmovl[tb] are not available for V4SI on MVE
> +(define_expand "vec_unpack<US>_hi_<mode>"
> +  [(match_operand:<V_unpack> 0 "register_operand")
> +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> + "ARM_HAVE_<MODE>_ARITH
> +  && !TARGET_REALLY_IWMMXT
> +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> +  && !BYTES_BIG_ENDIAN"
> +  {
> +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> +    rtx t1;
> +    int i;
> +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> +
> +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> +
> +    if (TARGET_NEON)
> +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> +						    operands[1],
> +						    t1));
> +    else
> +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
> +						   operands[1],
> +						   t1));
> +    DONE;
> +  }
> +)

Since the patterns are the same for both MVE and Neon (a good thing!),
it would be simpler to write this as:

(define_insn "mve_vec_unpack<US>_lo_<mode>"
  [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
	(SE:<V_unpack> (vec_select:<V_HALF>
			  (match_operand:VU 1 "register_operand" "w")
			  (match_dup 2))))]
  …

and then set operands[2] to what is t1 above.  There would then be
no need to call emit_insn & DONE, and we could use the natural
iterator (MVE_3) for the MVE patterns.

Same idea for the lo version.
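
Filled in for the lo version, that would give something like (a sketch,
reusing the guards from your expander; the "…" above stands for the
same condition):

  (define_expand "vec_unpack<US>_lo_<mode>"
    [(set (match_operand:<V_unpack> 0 "register_operand")
          (SE:<V_unpack> (vec_select:<V_HALF>
                           (match_operand:VU 1 "register_operand")
                           (match_dup 2))))]
    "ARM_HAVE_<MODE>_ARITH
     && !TARGET_REALLY_IWMMXT
     && !(<MODE>mode == V4SImode && TARGET_HAVE_MVE)
     && !BYTES_BIG_ENDIAN"
    {
      rtvec v = rtvec_alloc (<V_mode_nunits> / 2);
      for (int i = 0; i < <V_mode_nunits> / 2; i++)
        RTVEC_ELT (v, i) = GEN_INT (i);
      operands[2] = gen_rtx_PARALLEL (<MODE>mode, v);
    })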

> […]
> +;; vmovn[tb] are not available for V2DI on MVE
> +(define_expand "vec_pack_trunc_<mode>"
> + [(set (match_operand:<V_narrow_pack> 0 "register_operand")
> +       (vec_concat:<V_narrow_pack>
> +		(truncate:<V_narrow>
> +			(match_operand:VN 1 "register_operand"))
> +		(truncate:<V_narrow>
> +			(match_operand:VN 2 "register_operand"))))]
> + "ARM_HAVE_<MODE>_ARITH
> +  && !TARGET_REALLY_IWMMXT
> +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> +  && !BYTES_BIG_ENDIAN"
> + {
> +   if (TARGET_NEON)
> +     {
> +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> +						       operands[2]));
> +     }
> +   else
> +     {
> +       rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
> +       emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
> +       emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> +       	                           tmpreg, tmpreg, operands[2]));
> +       emit_move_insn (operands[0], tmpreg);

vmovntq can assign directly to operands[0], without a separate move.

OK with those changes, thanks.

Richard

> +     }
> +   DONE;
> + }
> +)
> […]
Christophe Lyon June 14, 2021, 2:35 p.m. UTC | #9
On Fri, 11 Jun 2021 at 18:25, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > Thanks for the feedback. How about v2 attached?
> > Do you want me to merge neon_vec_unpack<US> and
> > mve_vec_unpack<US>  and only have different assembly?
> > if (TARGET_HAVE_MVE)
> >   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> > else
> >   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
>
> I think it'd be better to keep them separate, given that they have
> different attributes.
>
> > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >    [(set_attr "type" "mve_move")
> >  ])
> >
> > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
> > +;; need for an extra register.
>
> “to avoid the need for an uninitialized input operand” might
> be clearer.
>
Done

> > +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
> > +  [
> > +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> > +     (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
> > +      VMOVNBQ_S))
> > +  ]
> > +  "TARGET_HAVE_MVE"
> > +  "vmovnb.i%#<V_sz_elem>     %q0, %q1"
> > +  [(set_attr "type" "mve_move")
> > +])
> > +
> > […]
> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > index 430a92ce966..3afb3e1d891 100644
> > --- a/gcc/config/arm/vec-common.md
> > +++ b/gcc/config/arm/vec-common.md
> > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
> >    "ARM_HAVE_<MODE>_ARITH
> >     && !TARGET_REALLY_IWMMXT"
> >  )
> > +
> > +;; vmovl[tb] are not available for V4SI on MVE
> > +(define_expand "vec_unpack<US>_hi_<mode>"
> > +  [(match_operand:<V_unpack> 0 "register_operand")
> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> > + "ARM_HAVE_<MODE>_ARITH
> > +  && !TARGET_REALLY_IWMMXT
> > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > +  && !BYTES_BIG_ENDIAN"
> > +  {
> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > +    rtx t1;
> > +    int i;
> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > +
> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > +
> > +    if (TARGET_NEON)
> > +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> > +                                                 operands[1],
> > +                                                 t1));
> > +    else
> > +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
> > +                                                operands[1],
> > +                                                t1));
> > +    DONE;
> > +  }
> > +)
>
> Since the patterns are the same for both MVE and Neon (a good thing!),
> it would be simpler to write this as:
>
> (define_insn "mve_vec_unpack<US>_lo_<mode>"
I guess you mean define_expand?

>   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>         (SE:<V_unpack> (vec_select:<V_HALF>
>                           (match_operand:VU 1 "register_operand" "w")
>                           (match_dup 2))))]
>   …
>
> and then set operands[2] to what is t1 above.  There would then be
> no need to call emit_insn & DONE, and we could use the natural
> iterator (MVE_3) for the MVE patterns.
>
> Same idea for the lo version.
>
> > […]
> > +;; vmovn[tb] are not available for V2DI on MVE
> > +(define_expand "vec_pack_trunc_<mode>"
> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand")
> > +       (vec_concat:<V_narrow_pack>
> > +             (truncate:<V_narrow>
> > +                     (match_operand:VN 1 "register_operand"))
> > +             (truncate:<V_narrow>
> > +                     (match_operand:VN 2 "register_operand"))))]
> > + "ARM_HAVE_<MODE>_ARITH
> > +  && !TARGET_REALLY_IWMMXT
> > +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> > +  && !BYTES_BIG_ENDIAN"
> > + {
> > +   if (TARGET_NEON)
> > +     {
> > +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> > +                                                    operands[2]));
> > +     }
> > +   else
> > +     {
> > +       rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
> > +       emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
> > +       emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> > +                                        tmpreg, tmpreg, operands[2]));
> > +       emit_move_insn (operands[0], tmpreg);
>
> vmovntq can assign directly to operands[0], without a separate move.
>
> OK with those changes, thanks.

Just to be sure, here is v3 which hopefully matches what you meant.
I updated the commit message and removed:
< I did not prefix them with '@', because I couldn't find how to call
< gen_mve_vec_unpack_[lo|hi] with <US> as first parameter.  Anyway, we
< can keep the VU iterator instead of MVE_3 since the offending V4SI
< mode is disabled in the expander.

OK?

Thanks

Christophe

>
> Richard
>
> > +     }
> > +   DONE;
> > + }
> > +)
> > […]
Richard Sandiford June 14, 2021, 4:01 p.m. UTC | #10
Christophe Lyon <christophe.lyon@linaro.org> writes:
> On Fri, 11 Jun 2021 at 18:25, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Christophe Lyon <christophe.lyon@linaro.org> writes:
>> > Thanks for the feedback. How about v2 attached?
>> > Do you want me to merge neon_vec_unpack<US> and
>> > mve_vec_unpack<US>  and only have different assembly?
>> > if (TARGET_HAVE_MVE)
>> >   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
>> > else
>> >   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
>>
>> I think it'd be better to keep them separate, given that they have
>> different attributes.
>>
>> > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
>> >    [(set_attr "type" "mve_move")
>> >  ])
>> >
>> > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
>> > +;; need for an extra register.
>>
>> “to avoid the need for an uninitialized input operand” might
>> be clearer.
>>
> Done
>
>> > +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
>> > +  [
>> > +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>> > +     (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
>> > +      VMOVNBQ_S))
>> > +  ]
>> > +  "TARGET_HAVE_MVE"
>> > +  "vmovnb.i%#<V_sz_elem>     %q0, %q1"
>> > +  [(set_attr "type" "mve_move")
>> > +])
>> > +
>> > […]
>> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
>> > index 430a92ce966..3afb3e1d891 100644
>> > --- a/gcc/config/arm/vec-common.md
>> > +++ b/gcc/config/arm/vec-common.md
>> > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
>> >    "ARM_HAVE_<MODE>_ARITH
>> >     && !TARGET_REALLY_IWMMXT"
>> >  )
>> > +
>> > +;; vmovl[tb] are not available for V4SI on MVE
>> > +(define_expand "vec_unpack<US>_hi_<mode>"
>> > +  [(match_operand:<V_unpack> 0 "register_operand")
>> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
>> > + "ARM_HAVE_<MODE>_ARITH
>> > +  && !TARGET_REALLY_IWMMXT
>> > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
>> > +  && !BYTES_BIG_ENDIAN"
>> > +  {
>> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
>> > +    rtx t1;
>> > +    int i;
>> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
>> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
>> > +
>> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
>> > +
>> > +    if (TARGET_NEON)
>> > +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
>> > +                                                 operands[1],
>> > +                                                 t1));
>> > +    else
>> > +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
>> > +                                                operands[1],
>> > +                                                t1));
>> > +    DONE;
>> > +  }
>> > +)
>>
>> Since the patterns are the same for both MVE and Neon (a good thing!),
>> it would be simpler to write this as:
>>
>> (define_insn "mve_vec_unpack<US>_lo_<mode>"
> I guess you mean define_expand?

Yes, sorry.

>>   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
>>         (SE:<V_unpack> (vec_select:<V_HALF>
>>                           (match_operand:VU 1 "register_operand" "w")
>>                           (match_dup 2))))]
>>   …
>>
>> and then set operands[2] to what is t1 above.  There would then be
>> no need to call emit_insn & DONE, and we could use the natural
>> iterator (MVE_3) for the MVE patterns.
>>
>> Same idea for the lo version.
>>
>> > […]
>> > +;; vmovn[tb] are not available for V2DI on MVE
>> > +(define_expand "vec_pack_trunc_<mode>"
>> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand")
>> > +       (vec_concat:<V_narrow_pack>
>> > +             (truncate:<V_narrow>
>> > +                     (match_operand:VN 1 "register_operand"))
>> > +             (truncate:<V_narrow>
>> > +                     (match_operand:VN 2 "register_operand"))))]
>> > + "ARM_HAVE_<MODE>_ARITH
>> > +  && !TARGET_REALLY_IWMMXT
>> > +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
>> > +  && !BYTES_BIG_ENDIAN"
>> > + {
>> > +   if (TARGET_NEON)
>> > +     {
>> > +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
>> > +                                                    operands[2]));
>> > +     }
>> > +   else
>> > +     {
>> > +       rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
>> > +       emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
>> > +       emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
>> > +                                        tmpreg, tmpreg, operands[2]));
>> > +       emit_move_insn (operands[0], tmpreg);
>>
>> vmovntq can assign directly to operands[0], without a separate move.
>>
>> OK with those changes, thanks.
>
> Just to be sure, here is v3 which hopefully matches what you meant.

> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> index 430a92ce966..1d6d4379855 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm/vec-common.md
> @@ -632,3 +632,78 @@ (define_expand "clz<mode>2"
>    "ARM_HAVE_<MODE>_ARITH
>     && !TARGET_REALLY_IWMMXT"
>  )
> +
> +;; vmovl[tb] are not available for V4SI on MVE
> +(define_expand "vec_unpack<US>_hi_<mode>"
> +  [(set (match_operand:<V_unpack> 0 "register_operand")
> +       (SE:<V_unpack> (vec_select:<V_HALF>
> +       		        (match_operand:VU 1 "register_operand")
> +			(match_dup 2))))]

The formatting of the vec_unpack<US>_lo_<mode> pattern looks nicer.

> + "ARM_HAVE_<MODE>_ARITH
> +  && !TARGET_REALLY_IWMMXT
> +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> +  && !BYTES_BIG_ENDIAN"
> +  {
> +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> +    rtx t1;
> +    int i;
> +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> +
> +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> +
> +    operands[2] = t1;

t1 doesn't seem worth it now, might as well assign directly to
operands[2].
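
i.e. simply:

  operands[2] = gen_rtx_PARALLEL (<MODE>mode, v);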

> […]
> +       emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> +       	                           operands[0], tmpreg, operands[2]));

The indentation looks odd here (operands[0] should line up with
VMOVNTQ_S), but that might be a mailer issue.

> +     }
> +   DONE;


OK with those changes, thanks.

Richard
Christophe Lyon June 14, 2021, 4:41 p.m. UTC | #11
On Mon, 14 Jun 2021 at 18:01, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Christophe Lyon <christophe.lyon@linaro.org> writes:
> > On Fri, 11 Jun 2021 at 18:25, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Christophe Lyon <christophe.lyon@linaro.org> writes:
> >> > Thanks for the feedback. How about v2 attached?
> >> > Do you want me to merge neon_vec_unpack<US> and
> >> > mve_vec_unpack<US>  and only have different assembly?
> >> > if (TARGET_HAVE_MVE)
> >> >   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> >> > else
> >> >   return "vmovlb.<US>%#<V_sz_elem> %q0, %q1";
> >>
> >> I think it'd be better to keep them separate, given that they have
> >> different attributes.
> >>
> >> > @@ -2199,10 +2222,23 @@ (define_insn "mve_vmovnbq_<supf><mode>"
> >> >    [(set_attr "type" "mve_move")
> >> >  ])
> >> >
> >> > +;; vmovnb pattern used by the vec_pack_trunc expander to avoid the
> >> > +;; need for an extra register.
> >>
> >> “to avoid the need for an uninitialized input operand” might
> >> be clearer.
> >>
> > Done
> >
> >> > +(define_insn "@mve_vec_pack_trunc_lo_<mode>"
> >> > +  [
> >> > +   (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
> >> > +     (unspec:<V_narrow_pack> [(match_operand:MVE_5 1 "s_register_operand" "w")]
> >> > +      VMOVNBQ_S))
> >> > +  ]
> >> > +  "TARGET_HAVE_MVE"
> >> > +  "vmovnb.i%#<V_sz_elem>     %q0, %q1"
> >> > +  [(set_attr "type" "mve_move")
> >> > +])
> >> > +
> >> > […]
> >> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> >> > index 430a92ce966..3afb3e1d891 100644
> >> > --- a/gcc/config/arm/vec-common.md
> >> > +++ b/gcc/config/arm/vec-common.md
> >> > @@ -632,3 +632,91 @@ (define_expand "clz<mode>2"
> >> >    "ARM_HAVE_<MODE>_ARITH
> >> >     && !TARGET_REALLY_IWMMXT"
> >> >  )
> >> > +
> >> > +;; vmovl[tb] are not available for V4SI on MVE
> >> > +(define_expand "vec_unpack<US>_hi_<mode>"
> >> > +  [(match_operand:<V_unpack> 0 "register_operand")
> >> > +   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
> >> > + "ARM_HAVE_<MODE>_ARITH
> >> > +  && !TARGET_REALLY_IWMMXT
> >> > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> >> > +  && !BYTES_BIG_ENDIAN"
> >> > +  {
> >> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> >> > +    rtx t1;
> >> > +    int i;
> >> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> >> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> >> > +
> >> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> >> > +
> >> > +    if (TARGET_NEON)
> >> > +      emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
> >> > +                                                 operands[1],
> >> > +                                                 t1));
> >> > +    else
> >> > +      emit_insn (gen_mve_vec_unpack<US>_hi_<mode> (operands[0],
> >> > +                                                operands[1],
> >> > +                                                t1));
> >> > +    DONE;
> >> > +  }
> >> > +)
> >>
> >> Since the patterns are the same for both MVE and Neon (a good thing!),
> >> it would be simpler to write this as:
> >>
> >> (define_insn "mve_vec_unpack<US>_lo_<mode>"
> > I guess you mean define_expand?
>
> Yes, sorry.
>
> >>   [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
> >>         (SE:<V_unpack> (vec_select:<V_HALF>
> >>                           (match_operand:VU 1 "register_operand" "w")
> >>                           (match_dup 2))))]
> >>   …
> >>
> >> and then set operands[2] to what is t1 above.  There would then be
> >> no need to call emit_insn & DONE, and we could use the natural
> >> iterator (MVE_3) for the MVE patterns.
> >>
> >> Same idea for the lo version.
> >>
> >> > […]
> >> > +;; vmovn[tb] are not available for V2DI on MVE
> >> > +(define_expand "vec_pack_trunc_<mode>"
> >> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand")
> >> > +       (vec_concat:<V_narrow_pack>
> >> > +             (truncate:<V_narrow>
> >> > +                     (match_operand:VN 1 "register_operand"))
> >> > +             (truncate:<V_narrow>
> >> > +                     (match_operand:VN 2 "register_operand"))))]
> >> > + "ARM_HAVE_<MODE>_ARITH
> >> > +  && !TARGET_REALLY_IWMMXT
> >> > +  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
> >> > +  && !BYTES_BIG_ENDIAN"
> >> > + {
> >> > +   if (TARGET_NEON)
> >> > +     {
> >> > +       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
> >> > +                                                    operands[2]));
> >> > +     }
> >> > +   else
> >> > +     {
> >> > +       rtx tmpreg = gen_reg_rtx (<V_narrow_pack>mode);
> >> > +       emit_insn (gen_mve_vec_pack_trunc_lo (<MODE>mode, tmpreg, operands[1]));
> >> > +       emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> >> > +                                        tmpreg, tmpreg, operands[2]));
> >> > +       emit_move_insn (operands[0], tmpreg);
> >>
> >> vmovntq can assign directly to operands[0], without a separate move.
> >>
> >> OK with those changes, thanks.
> >
> > Just to be sure, here is v3 which hopefully matches what you meant.
>
> > diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
> > index 430a92ce966..1d6d4379855 100644
> > --- a/gcc/config/arm/vec-common.md
> > +++ b/gcc/config/arm/vec-common.md
> > @@ -632,3 +632,78 @@ (define_expand "clz<mode>2"
> >    "ARM_HAVE_<MODE>_ARITH
> >     && !TARGET_REALLY_IWMMXT"
> >  )
> > +
> > +;; vmovl[tb] are not available for V4SI on MVE
> > +(define_expand "vec_unpack<US>_hi_<mode>"
> > +  [(set (match_operand:<V_unpack> 0 "register_operand")
> > +       (SE:<V_unpack> (vec_select:<V_HALF>
> > +                             (match_operand:VU 1 "register_operand")
> > +                     (match_dup 2))))]
>
> The formatting of the vec_unpack<US>_lo_<mode> pattern looks nicer.
>
> > + "ARM_HAVE_<MODE>_ARITH
> > +  && !TARGET_REALLY_IWMMXT
> > +  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
> > +  && !BYTES_BIG_ENDIAN"
> > +  {
> > +    rtvec v = rtvec_alloc (<V_mode_nunits>/2);
> > +    rtx t1;
> > +    int i;
> > +    for (i = 0; i < (<V_mode_nunits>/2); i++)
> > +      RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
> > +
> > +    t1 = gen_rtx_PARALLEL (<MODE>mode, v);
> > +
> > +    operands[2] = t1;
>
> t1 doesn't seem worth it now, might as well assign directly to
> operands[2].
>
> > […]
> > +       emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
> > +                                        operands[0], tmpreg, operands[2]));
>
> The indentation looks odd here (operands[0] should line up with
> VMOVNTQ_S), but that might be a mailer issue.
>

That was the intent, but there was a mix of tabs/spaces, now fixed
(and elsewhere too).
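
For reference, the intended layout is:

  emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode,
                              operands[0], tmpreg, operands[2]));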

I have attached what I have committed.

Thanks
Christophe

> > +     }
> > +   DONE;
>
>
> OK with those changes, thanks.
>
> Richard

Patch

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 99e46d0bc69..b18292c07d3 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -510,7 +510,7 @@  (define_insn "mve_vrev32q_<supf><mode>"
 ;;
 ;; [vmovltq_u, vmovltq_s])
 ;;
-(define_insn "mve_vmovltq_<supf><mode>"
+(define_insn "@mve_vmovltq_<supf><mode>"
   [
    (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
 	(unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
@@ -524,7 +524,7 @@  (define_insn "mve_vmovltq_<supf><mode>"
 ;;
 ;; [vmovlbq_s, vmovlbq_u])
 ;;
-(define_insn "mve_vmovlbq_<supf><mode>"
+(define_insn "@mve_vmovlbq_<supf><mode>"
   [
    (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
 	(unspec:<V_double_width> [(match_operand:MVE_3 1 "s_register_operand" "w")]
@@ -2187,7 +2187,7 @@  (define_insn "mve_vmlsldavxq_s<mode>"
 ;;
 ;; [vmovnbq_u, vmovnbq_s])
 ;;
-(define_insn "mve_vmovnbq_<supf><mode>"
+(define_insn "@mve_vmovnbq_<supf><mode>"
   [
    (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
 	(unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
@@ -2202,7 +2202,7 @@  (define_insn "mve_vmovnbq_<supf><mode>"
 ;;
 ;; [vmovntq_s, vmovntq_u])
 ;;
-(define_insn "mve_vmovntq_<supf><mode>"
+(define_insn "@mve_vmovntq_<supf><mode>"
   [
    (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
 	(unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 "s_register_operand" "0")
@@ -2214,6 +2214,18 @@  (define_insn "mve_vmovntq_<supf><mode>"
   [(set_attr "type" "mve_move")
 ])
 
+(define_insn "@mve_vec_pack_trunc_<mode>"
+ [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
+       (vec_concat:<V_narrow_pack>
+		(truncate:<V_narrow>
+			(match_operand:MVE_5 1 "register_operand" "w"))
+		(truncate:<V_narrow>
+			(match_operand:MVE_5 2 "register_operand" "w"))))]
+ "TARGET_HAVE_MVE"
+ "vmovnb.i<V_sz_elem>	%q0, %q1\;vmovnt.i<V_sz_elem>	%q0, %q2"
+  [(set_attr "type" "mve_move")]
+)
+
 ;;
 ;; [vmulq_f])
 ;;
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 0fdffaf4ec4..392d9607919 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5924,43 +5924,6 @@  (define_insn "neon_vec_unpack<US>_hi_<mode>"
   [(set_attr "type" "neon_shift_imm_long")]
 )
 
-(define_expand "vec_unpack<US>_hi_<mode>"
-  [(match_operand:<V_unpack> 0 "register_operand")
-   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
- "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  {
-   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
-   rtx t1;
-   int i;
-   for (i = 0; i < (<V_mode_nunits>/2); i++)
-     RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
-  
-   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
-   emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0], 
-                                                 operands[1], 
-					         t1));
-   DONE;
-  }
-)
-
-(define_expand "vec_unpack<US>_lo_<mode>"
-  [(match_operand:<V_unpack> 0 "register_operand")
-   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
- "TARGET_NEON && !BYTES_BIG_ENDIAN"
-  {
-   rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
-   rtx t1;
-   int i;
-   for (i = 0; i < (<V_mode_nunits>/2) ; i++)
-     RTVEC_ELT (v, i) = GEN_INT (i);
-   t1 = gen_rtx_PARALLEL (<MODE>mode, v);
-   emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0], 
-                                                 operands[1], 
-				   	         t1));
-   DONE;
-  }
-)
-
 (define_insn "neon_vec_<US>mult_lo_<mode>"
  [(set (match_operand:<V_unpack> 0 "register_operand" "=w")
        (mult:<V_unpack> (SE:<V_unpack> (vec_select:<V_HALF>
@@ -6176,7 +6139,7 @@  (define_expand "vec_widen_<US>shiftl_lo_<mode>"
 ; because the ordering of vector elements in Q registers is different from what
 ; the semantics of the instructions require.
 
-(define_insn "vec_pack_trunc_<mode>"
+(define_insn "neon_quad_vec_pack_trunc_<mode>"
  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
        (vec_concat:<V_narrow_pack> 
 		(truncate:<V_narrow> 
diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-common.md
index 1ba1e5eb008..0ffc7a9322c 100644
--- a/gcc/config/arm/vec-common.md
+++ b/gcc/config/arm/vec-common.md
@@ -638,3 +638,92 @@  (define_expand "clz<mode>2"
     emit_insn (gen_mve_vclzq_s (<MODE>mode, operands[0], operands[1]));
   DONE;
 })
+
+;; vmovl[tb] are not available for V4SI on MVE
+(define_expand "vec_unpack<US>_hi_<mode>"
+  [(match_operand:<V_unpack> 0 "register_operand")
+   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
+ "ARM_HAVE_<MODE>_ARITH
+  && !TARGET_REALLY_IWMMXT
+  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
+  && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+	rtvec v = rtvec_alloc (<V_mode_nunits>/2);
+	rtx t1;
+	int i;
+	for (i = 0; i < (<V_mode_nunits>/2); i++)
+	  RTVEC_ELT (v, i) = GEN_INT ((<V_mode_nunits>/2) + i);
+
+	t1 = gen_rtx_PARALLEL (<MODE>mode, v);
+	emit_insn (gen_neon_vec_unpack<US>_hi_<mode> (operands[0],
+						      operands[1],
+						      t1));
+      }
+    else
+      {
+	emit_insn (gen_mve_vmovltq (VMOVLTQ_S, <MODE>mode, operands[0],
+				    operands[1]));
+      }
+    DONE;
+  }
+)
+
+;; vmovl[tb] are not available for V4SI on MVE
+(define_expand "vec_unpack<US>_lo_<mode>"
+  [(match_operand:<V_unpack> 0 "register_operand")
+   (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
+ "ARM_HAVE_<MODE>_ARITH
+  && !TARGET_REALLY_IWMMXT
+  && ! (<MODE>mode == V4SImode && TARGET_HAVE_MVE)
+  && !BYTES_BIG_ENDIAN"
+  {
+    if (TARGET_NEON)
+      {
+	rtvec v = rtvec_alloc (<V_mode_nunits>/2);
+	rtx t1;
+	int i;
+	for (i = 0; i < (<V_mode_nunits>/2) ; i++)
+	  RTVEC_ELT (v, i) = GEN_INT (i);
+
+	t1 = gen_rtx_PARALLEL (<MODE>mode, v);
+	emit_insn (gen_neon_vec_unpack<US>_lo_<mode> (operands[0],
+						      operands[1],
+						      t1));
+      }
+    else
+      {
+	emit_insn (gen_mve_vmovlbq (VMOVLBQ_S, <MODE>mode, operands[0],
+				    operands[1]));
+      }
+    DONE;
+  }
+)
+
+;; vmovn[tb] are not available for V2DI on MVE
+(define_expand "vec_pack_trunc_<mode>"
+ [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
+       (vec_concat:<V_narrow_pack>
+		(truncate:<V_narrow>
+			(match_operand:VN 1 "register_operand" "w"))
+		(truncate:<V_narrow>
+			(match_operand:VN 2 "register_operand" "w"))))]
+ "ARM_HAVE_<MODE>_ARITH
+  && !TARGET_REALLY_IWMMXT
+  && ! (<MODE>mode == V2DImode && TARGET_HAVE_MVE)
+  && !BYTES_BIG_ENDIAN"
+ {
+   if (TARGET_NEON)
+     {
+       emit_insn (gen_neon_quad_vec_pack_trunc_<mode> (operands[0], operands[1],
+						       operands[2]));
+     }
+   else
+     {
+       emit_insn (gen_mve_vec_pack_trunc (<MODE>mode, operands[0], operands[1],
+					  operands[2]));
+     }
+   DONE;
+ }
+)
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
index 7068736bc28..5d6e991cfc6 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vclz.c
@@ -21,8 +21,9 @@  FUNC(u, uint, 16, clz)
 FUNC(s, int, 8, clz)
 FUNC(u, uint, 8, clz)
 
-/* 16 and 8-bit versions are not vectorized because they need pack/unpack
-   patterns since __builtin_clz uses 32-bit parameter and return value.  */
-/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 2 } } */
+/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
+   instance instead of using vclz.i8, we need 4 vclz.i32, leading to a total of
+   14 vclz.i32 expected in this testcase.  */
+/* { dg-final { scan-assembler-times {vclz\.i32  q[0-9]+, q[0-9]+} 14 } } */
 /* { dg-final { scan-assembler-times {vclz\.i16  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler-times {vclz\.i8  q[0-9]+, q[0-9]+} 2 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
index 7a0644997c8..91dd942d818 100644
--- a/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
+++ b/gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
@@ -56,7 +56,10 @@  FUNC_IMM(u, uint, 8, 16, <<, vshlimm)
 /* MVE has only 128-bit vectors, so we can vectorize only half of the
    functions above.  */
 /* We only emit vshl.u, which is equivalent to vshl.s anyway.  */
-/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 2 } } */
+/* 16 and 8-bit versions still use 32-bit intermediate temporaries, so for
+   instance instead of using vshl.u8, we need 4 vshl.i32, leading to a total of
+   14 vshl.i32 expected in this testcase.  */
+/* { dg-final { scan-assembler-times {vshl.u[0-9]+\tq[0-9]+, q[0-9]+} 14 } } */
 
 /* We emit vshl.i when the shift amount is an immediate.  */
 /* { dg-final { scan-assembler-times {vshl.i[0-9]+\tq[0-9]+, q[0-9]+} 6 } } */