
[i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.

Message ID 20210812054323.897480-1-hongtao.liu@intel.com
State New
Series [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.

Commit Message

liuhongt Aug. 12, 2021, 5:43 a.m. UTC
Hi:
  This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb}
under AVX512.
  For scenarios (like pr101846-2.c) where the upper half is not used, this patch
generates better code with only one vpmov{wb,dw,qd} instruction.  For
scenarios (like pr101846-3.c) where the upper half is actually used, if the src
vector length is 256/512 bits, the patch can still generate better code, but for
128 bits the code generation is worse.

128 bits upper half not used.

-       vpshufb .LC2(%rip), %xmm0, %xmm0
+       vpmovdw %xmm0, %xmm0

128 bits upper half used.
-       vpshufb .LC2(%rip), %xmm0, %xmm0
+       vpmovdw %xmm0, %xmm1
+       vmovq   %xmm1, %rax
+       vpinsrq $0, %rax, %xmm0, %xmm0
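
In source form, the two 128-bit cases above correspond to shuffles like the
following (condensed from the new pr101846-2.c/pr101846-3.c tests; the second
function is renamed here only to show both variants side by side):

typedef short v4hi __attribute__((vector_size (8)));
typedef short v8hi __attribute__((vector_size (16)));

/* Upper half not used: a single vpmovdw suffices.  */
v4hi
foo_dw_128 (v8hi x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
}

/* Upper half still used: vpmovdw plus reinsertion of the low half.  */
v8hi
foo_dw_128_upper_used (v8hi x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
}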

  Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512-bit
vectors, but considering that the real-world likelihood of scenarios like the
pr101846-3.c foo_*_128 cases is relatively low, I still keep this part of the
code.

  Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
  Ok for trunk?

gcc/ChangeLog:

	PR target/101846
	* config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
	New function.
	(ix86_vectorize_vec_perm_const): Call
	expand_vec_perm_trunc_vinsert.
	* config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
	(vec_set_lo_v64qi): Ditto.
	(vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.

gcc/testsuite/ChangeLog:

	PR target/101846
	* gcc.target/i386/pr101846-2.c: New test.
	* gcc.target/i386/pr101846-3.c: New test.
---
 gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
 gcc/config/i386/sse.md                     |  60 +++++++++-
 gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
 4 files changed, 359 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c

Comments

Jakub Jelinek Aug. 12, 2021, 9:22 a.m. UTC | #1
On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote:
> Hi:
>   This is another patch to optimize vec_perm_expr to match vpmov{dw,dq,wb}
> under AVX512.
>   For scenarios(like pr101846-2.c) where the upper half is not used, this patch
> generates better code with only one vpmov{wb,dw,qd} instruction. For
> scenarios(like pr101846-3.c) where the upper half is actually used,  if the src
> vector length is 256/512bits, the patch can still generate better code, but for
> 128bits, the code generation is worse.
> 
> 128 bits upper half not used.
> 
> -       vpshufb .LC2(%rip), %xmm0, %xmm0
> +       vpmovdw %xmm0, %xmm0
> 
> 128 bits upper half used.
> -       vpshufb .LC2(%rip), %xmm0, %xmm0
> +       vpmovdw %xmm0, %xmm1
> +       vmovq   %xmm1, %rax
> +       vpinsrq $0, %rax, %xmm0, %xmm0
> 
>   Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512bits of
> vectors, but considering the real use of scenarios like pr101846-3.c
> foo_*_128 possibility is relatively low, I still keep this part of the code.

I actually am not sure if even
 foo_dw_512:
 .LFB0:
        .cfi_startproc
-       vmovdqa64       %zmm0, %zmm1
-       vmovdqa64       .LC0(%rip), %zmm0
-       vpermi2w        %zmm1, %zmm1, %zmm0
+       vpmovdw %zmm0, %ymm1
+       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
        ret
is always a win; the permutations we should care most about are in loops,
and in that case the constant load as well as the first move likely go
away, so it is one permutation insn vs. two.
A different case is e.g.
-       vmovdqa64       .LC5(%rip), %zmm2
-       vmovdqa64       %zmm0, %zmm1
-       vmovdqa64       .LC0(%rip), %zmm0
-       vpermi2w        %zmm1, %zmm1, %zmm2
-       vpermi2w        %zmm1, %zmm1, %zmm0
-       vpshufb .LC6(%rip), %zmm0, %zmm0
-       vpshufb .LC7(%rip), %zmm2, %zmm1
-       vporq   %zmm1, %zmm0, %zmm0
+       vpmovwb %zmm0, %ymm1
+       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
So, I wonder if your new routine shouldn't instead be done in
ix86_expand_vec_perm_const_1 after expand_vec_perm_1 among the other 2-insn
cases, and handle the other vpmovdw etc. cases in combine splitters (see that
we only use the low half or quarter of the result and transform whatever
permutation we've used into what we want).

And perhaps make the routine eventually more general: don't handle
just the identity permutation in the upper half, but allow other
permutations there too (ones where that half can be represented by a single
insn permutation).
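
For instance (purely illustrative, not something the patch currently matches),
an upper half that is itself a single-insn permutation, such as a 64-bit swap
within the upper lane:

typedef short v16hi __attribute__((vector_size (32)));

/* Hypothetical variant of pr101846-3.c foo_dw_256: the low half is still the
   vpmovdw truncation, while the upper half is the source's upper lane with
   its two 64-bit halves swapped, i.e. still one permutation insn.  */
v16hi
foo_dw_256_swapped_upper (v16hi x)
{
  return __builtin_shufflevector (x, x,
				  0, 2, 4, 6, 8, 10, 12, 14,
				  12, 13, 14, 15, 8, 9, 10, 11);
}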
> 
>   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
>   Ok for trunk?
> 
> gcc/ChangeLog:
> 
> 	PR target/101846
> 	* config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
> 	New function.
> 	(ix86_vectorize_vec_perm_const): Call
> 	expand_vec_perm_trunc_vinsert.
> 	* config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
> 	(vec_set_lo_v64qi): Ditto.
> 	(vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR target/101846
> 	* gcc.target/i386/pr101846-2.c: New test.
> 	* gcc.target/i386/pr101846-3.c: New test.
> ---
>  gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
>  gcc/config/i386/sse.md                     |  60 +++++++++-
>  gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
>  4 files changed, 359 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c
> 
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index bd21efa9530..519caac2e15 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
>    return false;
>  }
>  
> +/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
> +   in terms of a pair of vpmovdw + vinserti128 instructions.  */
> +static bool
> +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
> +{
> +  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
> +  unsigned half = nelt / 2;
> +  machine_mode half_mode, trunc_mode;
> +
> +  /* vpmov{wb,dw,qd} only available under AVX512.  */
> +  if (!d->one_operand_p || !TARGET_AVX512F
> +      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)

Too many spaces.
> +      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
> +    return false;
> +
> +  /* TARGET_AVX512BW is needed for vpmovwb.  */
> +  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
> +    return false;
> +
> +  for (i = 0; i < nelt; i++)
> +    {
> +      unsigned idx = d->perm[i] & mask;
> +      if (idx != i * 2 && i < half)
> +	return false;
> +      if (idx != i && i >= half)
> +	return false;
> +    }
> +
> +  rtx (*gen_trunc) (rtx, rtx) = NULL;
> +  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
> +  switch (d->vmode)
> +    {
> +    case E_V16QImode:
> +      gen_trunc = gen_truncv8hiv8qi2;
> +      gen_vec_set_lo = gen_vec_setv2di;
> +      half_mode = V8QImode;
> +      trunc_mode = V8HImode;
> +      break;
> +    case E_V32QImode:
> +      gen_trunc = gen_truncv16hiv16qi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v32qi;
> +      half_mode = V16QImode;
> +      trunc_mode = V16HImode;
> +      break;
> +    case E_V64QImode:
> +      gen_trunc = gen_truncv32hiv32qi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v64qi;
> +      half_mode = V32QImode;
> +      trunc_mode = V32HImode;
> +      break;
> +    case E_V8HImode:
> +      gen_trunc = gen_truncv4siv4hi2;
> +      gen_vec_set_lo = gen_vec_setv2di;
> +      half_mode = V4HImode;
> +      trunc_mode = V4SImode;
> +      break;
> +    case E_V16HImode:
> +      gen_trunc = gen_truncv8siv8hi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v16hi;
> +      half_mode = V8HImode;
> +      trunc_mode = V8SImode;
> +      break;
> +    case E_V32HImode:
> +      gen_trunc = gen_truncv16siv16hi2;
> +      gen_vec_set_lo = gen_vec_set_lo_v32hi;
> +      half_mode = V16HImode;
> +      trunc_mode = V16SImode;
> +      break;
> +    case E_V4SImode:
> +      gen_trunc = gen_truncv2div2si2;
> +      gen_vec_set_lo = gen_vec_setv2di;
> +      half_mode = V2SImode;
> +      trunc_mode = V2DImode;
> +      break;
> +    case E_V8SImode:
> +      gen_trunc = gen_truncv4div4si2;
> +      gen_vec_set_lo = gen_vec_set_lo_v8si;
> +      half_mode = V4SImode;
> +      trunc_mode = V4DImode;
> +      break;
> +    case E_V16SImode:
> +      gen_trunc = gen_truncv8div8si2;
> +      gen_vec_set_lo = gen_vec_set_lo_v16si;
> +      half_mode = V8SImode;
> +      trunc_mode = V8DImode;
> +      break;
> +
> +    default:
> +      break;
> +    }
> +
> +  if (gen_trunc == NULL)
> +    return false;

Isn't it simpler to return false; in the default: case?
> @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
>    d.op0 = nop0;
>    d.op1 = force_reg (vmode, d.op1);
>  
> +  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
> +     it's very likely to be optimized off. So let's put the function here.  */

Two spaces after full stop.

	Jakub
Jakub Jelinek Aug. 12, 2021, 9:41 a.m. UTC | #2
On Thu, Aug 12, 2021 at 11:22:48AM +0200, Jakub Jelinek via Gcc-patches wrote:
> So, I wonder if your new routine shouldn't be instead done after
> in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
> and handle the other vpmovdw etc. cases in combine splitters (see that we
> only use low half or quarter of the result and transform whatever
> permutation we've used into what we want).

E.g. in the first function, combine tries:
(set (reg:V16HI 85)
    (vec_select:V16HI (unspec:V32HI [
                (mem/u/c:V32HI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S64 A512])
                (reg:V32HI 88) repeated x2
            ] UNSPEC_VPERMT2)
        (parallel [
                (const_int 0 [0])
                (const_int 1 [0x1])
                (const_int 2 [0x2])
                (const_int 3 [0x3])
                (const_int 4 [0x4])
                (const_int 5 [0x5])
                (const_int 6 [0x6])
                (const_int 7 [0x7])
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 14 [0xe])
                (const_int 15 [0xf])
            ])))
A combine splitter could run avoid_constant_pool_reference on the
first UNSPEC_VPERMT2 argument and check whether the permutation can be
optimized, ideally using some function call so that we wouldn't need too
many splitters.
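
Such a function could be something along the lines of the following untested
sketch (the name is made up; it would live next to the other helpers in
i386-expand.c):

/* Return true if the low NELT elements of the UNSPEC_VPERMT2 selector SEL
   (possibly a constant pool reference) pick the even elements 0, 2, 4, ...,
   i.e. the permutation is really a truncation.  A real implementation would
   also have to mask off the operand-select bit when both inputs are the
   same register.  */
static bool
vpermt2_selector_is_trunc_p (rtx sel, unsigned nelt)
{
  sel = avoid_constant_pool_reference (sel);
  if (!CONST_VECTOR_P (sel))
    return false;
  for (unsigned i = 0; i < nelt; i++)
    if (INTVAL (CONST_VECTOR_ELT (sel, i)) != (HOST_WIDE_INT) (2 * i))
      return false;
  return true;
}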

	Jakub
Hongtao Liu Aug. 13, 2021, 1:42 a.m. UTC | #3
On Thu, Aug 12, 2021 at 5:23 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote:
> > Hi:
> >   This is another patch to optimize vec_perm_expr to match vpmov{dw,dq,wb}
> > under AVX512.
> >   For scenarios(like pr101846-2.c) where the upper half is not used, this patch
> > generates better code with only one vpmov{wb,dw,qd} instruction. For
> > scenarios(like pr101846-3.c) where the upper half is actually used,  if the src
> > vector length is 256/512bits, the patch can still generate better code, but for
> > 128bits, the code generation is worse.
> >
> > 128 bits upper half not used.
> >
> > -       vpshufb .LC2(%rip), %xmm0, %xmm0
> > +       vpmovdw %xmm0, %xmm0
> >
> > 128 bits upper half used.
> > -       vpshufb .LC2(%rip), %xmm0, %xmm0
> > +       vpmovdw %xmm0, %xmm1
> > +       vmovq   %xmm1, %rax
> > +       vpinsrq $0, %rax, %xmm0, %xmm0
> >
> >   Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512bits of
> > vectors, but considering the real use of scenarios like pr101846-3.c
> > foo_*_128 possibility is relatively low, I still keep this part of the code.
>
> I actually am not sure if even
>  foo_dw_512:
>  .LFB0:
>         .cfi_startproc
> -       vmovdqa64       %zmm0, %zmm1
> -       vmovdqa64       .LC0(%rip), %zmm0
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> +       vpmovdw %zmm0, %ymm1
> +       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
>         ret
> is always a win, the permutations we should care most about are in loops
Yes, and vpmov{qd,dw} and vpermi2{w,d} are both available under
avx512f, so expand_vec_perm_trunc_vinsert will never be matched when
it's placed among the other 2-insn cases.
The only part we need to handle is vpmovwb, which is under avx512bw while
vpermi2b requires avx512vbmi.
> and the constant load as well as the first move in that case likely go
> away and it is one permutation insn vs. two.
> Different case is e.g.
> -       vmovdqa64       .LC5(%rip), %zmm2
> -       vmovdqa64       %zmm0, %zmm1
> -       vmovdqa64       .LC0(%rip), %zmm0
> -       vpermi2w        %zmm1, %zmm1, %zmm2
> -       vpermi2w        %zmm1, %zmm1, %zmm0
> -       vpshufb .LC6(%rip), %zmm0, %zmm0
> -       vpshufb .LC7(%rip), %zmm2, %zmm1
> -       vporq   %zmm1, %zmm0, %zmm0
> +       vpmovwb %zmm0, %ymm1
> +       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
> So, I wonder if your new routine shouldn't be instead done after
> in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
> and handle the other vpmovdw etc. cases in combine splitters (see that we
> only use low half or quarter of the result and transform whatever
> permutation we've used into what we want).
>
Got it, I'll try that way.
> And perhaps make the routine eventually more general, don't handle
> just identity permutation in the upper half, but allow there other
> permutations too (ones where that half can be represented by a single insn
> permutation).
> >
> >   Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> >   Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >       PR target/101846
> >       * config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
> >       New function.
> >       (ix86_vectorize_vec_perm_const): Call
> >       expand_vec_perm_trunc_vinsert.
> >       * config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
> >       (vec_set_lo_v64qi): Ditto.
> >       (vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       PR target/101846
> >       * gcc.target/i386/pr101846-2.c: New test.
> >       * gcc.target/i386/pr101846-3.c: New test.
> > ---
> >  gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
> >  gcc/config/i386/sse.md                     |  60 +++++++++-
> >  gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
> >  4 files changed, 359 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index bd21efa9530..519caac2e15 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
> >    return false;
> >  }
> >
> > +/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
> > +   in terms of a pair of vpmovdw + vinserti128 instructions.  */
> > +static bool
> > +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
> > +{
> > +  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
> > +  unsigned half = nelt / 2;
> > +  machine_mode half_mode, trunc_mode;
> > +
> > +  /* vpmov{wb,dw,qd} only available under AVX512.  */
> > +  if (!d->one_operand_p || !TARGET_AVX512F
> > +      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)
>
> Too many spaces.
> > +      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
> > +    return false;
> > +
> > +  /* TARGET_AVX512BW is needed for vpmovwb.  */
> > +  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
> > +    return false;
> > +
> > +  for (i = 0; i < nelt; i++)
> > +    {
> > +      unsigned idx = d->perm[i] & mask;
> > +      if (idx != i * 2 && i < half)
> > +     return false;
> > +      if (idx != i && i >= half)
> > +     return false;
> > +    }
> > +
> > +  rtx (*gen_trunc) (rtx, rtx) = NULL;
> > +  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
> > +  switch (d->vmode)
> > +    {
> > +    case E_V16QImode:
> > +      gen_trunc = gen_truncv8hiv8qi2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V8QImode;
> > +      trunc_mode = V8HImode;
> > +      break;
> > +    case E_V32QImode:
> > +      gen_trunc = gen_truncv16hiv16qi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v32qi;
> > +      half_mode = V16QImode;
> > +      trunc_mode = V16HImode;
> > +      break;
> > +    case E_V64QImode:
> > +      gen_trunc = gen_truncv32hiv32qi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v64qi;
> > +      half_mode = V32QImode;
> > +      trunc_mode = V32HImode;
> > +      break;
> > +    case E_V8HImode:
> > +      gen_trunc = gen_truncv4siv4hi2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V4HImode;
> > +      trunc_mode = V4SImode;
> > +      break;
> > +    case E_V16HImode:
> > +      gen_trunc = gen_truncv8siv8hi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v16hi;
> > +      half_mode = V8HImode;
> > +      trunc_mode = V8SImode;
> > +      break;
> > +    case E_V32HImode:
> > +      gen_trunc = gen_truncv16siv16hi2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v32hi;
> > +      half_mode = V16HImode;
> > +      trunc_mode = V16SImode;
> > +      break;
> > +    case E_V4SImode:
> > +      gen_trunc = gen_truncv2div2si2;
> > +      gen_vec_set_lo = gen_vec_setv2di;
> > +      half_mode = V2SImode;
> > +      trunc_mode = V2DImode;
> > +      break;
> > +    case E_V8SImode:
> > +      gen_trunc = gen_truncv4div4si2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v8si;
> > +      half_mode = V4SImode;
> > +      trunc_mode = V4DImode;
> > +      break;
> > +    case E_V16SImode:
> > +      gen_trunc = gen_truncv8div8si2;
> > +      gen_vec_set_lo = gen_vec_set_lo_v16si;
> > +      half_mode = V8SImode;
> > +      trunc_mode = V8DImode;
> > +      break;
> > +
> > +    default:
> > +      break;
> > +    }
> > +
> > +  if (gen_trunc == NULL)
> > +    return false;
>
> Isn't it simpler to return false; in the default: case?
> > @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
> >    d.op0 = nop0;
> >    d.op1 = force_reg (vmode, d.op1);
> >
> > +  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
> > +     it's very likely to be optimized off. So let's put the function here.  */
>
> Two spaces after full stop.
>
>         Jakub
>
Jakub Jelinek Aug. 13, 2021, 8:47 a.m. UTC | #4
On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote:
> > So, I wonder if your new routine shouldn't be instead done after
> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
> > and handle the other vpmovdw etc. cases in combine splitters (see that we
> > only use low half or quarter of the result and transform whatever
> > permutation we've used into what we want).
> >
> Got it, i'll try that way.

Note, IMHO the ultimate fix would be to add real support for the
__builtin_shufflevector -1 indices (meaning I don't care what will be in
that element, perhaps narrowed down to an implementation choice of
any element of the input vector(s) or 0).
As VEC_PERM_EXPR is currently used both for perms by a variable permutation
vector and by a constant one, I think we'd need to introduce
VEC_PERM_CONST_EXPR, which would be exactly like VEC_PERM_EXPR, except that
the last operand would be required to be a VECTOR_CST and that an all-ones
element would mean something different, namely the "I don't care" behavior.
The GIMPLE side would be fairly easy, except that there should be some
optimizations eventually, like when only a certain subset of elements of
a vector is used later, we can mark the other elements as don't care.
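
As a source-level illustration, with real -1 support the 128-bit case from
this patch could simply be written (or internally narrowed to) something like
the sketch below, where the target is free to leave whatever vpmovdw puts in
the upper half:

typedef short v8hi __attribute__((vector_size (16)));

/* Sketch only: -1 means the value of that output element is unspecified,
   so no vmovq/vpinsrq fixup of the upper half would be needed.  */
v8hi
foo_dw_128_dontcare (v8hi x)
{
  return __builtin_shufflevector (x, x, 0, 2, 4, 6, -1, -1, -1, -1);
}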

The hard part would be backend expansion, especially x86.
I guess we could easily canonicalize VEC_PERM_EXPR with constant
permutations into VEC_PERM_CONST_EXPR by replacing all ones elements
with elements modulo the number of elements (or twice that for 2 operand
perms), but then in all the routines that recognize something we'd
need to special case the unknown elements to match anything during testing
and for expansion replace it by something that would match.
That is again a lot of work, but not extremely hard.  The hardest would be
to deal with the expand_vec_perm_1 handling many cases by trying to recog
an instruction.  Either we'd need to represent the unknown case by a magic
CONST_INT_WILDCARD or CONST_INT_RANGE that recog with the help of the
patterns would replace by some CONST_INT that matches it, but note we have
all those const_0_to_N_operand and in conditions
INTVAL (operands[3]) & 1 == 0 and INTVAL (operands[3]) + 1 == INTVAL (operands[4])
etc., or we'd need to either manually or semi-automatically build some code that
would try to guess right values for unknown before trying to recog it.

	Jakub
Richard Sandiford Aug. 13, 2021, 9:03 a.m. UTC | #5
Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote:
>> > So, I wonder if your new routine shouldn't be instead done after
>> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
>> > and handle the other vpmovdw etc. cases in combine splitters (see that we
>> > only use low half or quarter of the result and transform whatever
>> > permutation we've used into what we want).
>> >
>> Got it, i'll try that way.
>
> Note, IMHO the ultimate fix would be to add real support for the
> __builtin_shufflevector -1 indices (meaning I don't care what will be in
> that element, perhaps narrowed down to an implementation choice of
> any element of the input vector(s) or 0).
> As VEC_PERM_EXPR is currently used for both perms by variable permutation
> vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR,
> which would be exactly like VEC_PERM_EXPR, except that the last operand
> would be required to be a VECTOR_CST and that all ones element would mean
> something different, the I don't care behavior.
> The GIMPLE side would be fairly easy, except that there should be some
> optimizations eventually, like when only certain subset of elements of
> a vector are used later, we can mark the other elements as don't care.

Another alternative I'd wondered about was keeping a single tree code,
but adding an extra operand with a “care/don't care” mask.  I think
that would fit with variable-length vectors better.

Thanks,
Richard
Richard Biener Aug. 16, 2021, 7:19 a.m. UTC | #6
On Fri, Aug 13, 2021 at 11:04 AM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote:
> >> > So, I wonder if your new routine shouldn't be instead done after
> >> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
> >> > and handle the other vpmovdw etc. cases in combine splitters (see that we
> >> > only use low half or quarter of the result and transform whatever
> >> > permutation we've used into what we want).
> >> >
> >> Got it, i'll try that way.
> >
> > Note, IMHO the ultimate fix would be to add real support for the
> > __builtin_shufflevector -1 indices (meaning I don't care what will be in
> > that element, perhaps narrowed down to an implementation choice of
> > any element of the input vector(s) or 0).
> > As VEC_PERM_EXPR is currently used for both perms by variable permutation
> > vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR,
> > which would be exactly like VEC_PERM_EXPR, except that the last operand
> > would be required to be a VECTOR_CST and that all ones element would mean
> > something different, the I don't care behavior.
> > The GIMPLE side would be fairly easy, except that there should be some
> > optimizations eventually, like when only certain subset of elements of
> > a vector are used later, we can mark the other elements as don't care.
>
> Another alternative I'd wondered about was keeping a single tree code,
> but adding an extra operand with a “care/don't care” mask.  I think
> that would fit with variable-length vectors better.

That sounds reasonable.  Note I avoided "quaternary" ops for
BIT_INSERT_EXPR but I don't see a good way to avoid that for
such a VEC_PERM_EXPR extension.

Richard.

> Thanks,
> Richard

Patch

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index bd21efa9530..519caac2e15 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -18317,6 +18317,126 @@  expand_vec_perm_1 (struct expand_vec_perm_d *d)
   return false;
 }
 
+/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
+   in terms of a pair of vpmovdw + vinserti128 instructions.  */
+static bool
+expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
+{
+  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
+  unsigned half = nelt / 2;
+  machine_mode half_mode, trunc_mode;
+
+  /* vpmov{wb,dw,qd} only available under AVX512.  */
+  if (!d->one_operand_p || !TARGET_AVX512F
+      || (!TARGET_AVX512VL  && GET_MODE_SIZE (d->vmode) < 64)
+      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
+    return false;
+
+  /* TARGET_AVX512BW is needed for vpmovwb.  */
+  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
+    return false;
+
+  for (i = 0; i < nelt; i++)
+    {
+      unsigned idx = d->perm[i] & mask;
+      if (idx != i * 2 && i < half)
+	return false;
+      if (idx != i && i >= half)
+	return false;
+    }
+
+  rtx (*gen_trunc) (rtx, rtx) = NULL;
+  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
+  switch (d->vmode)
+    {
+    case E_V16QImode:
+      gen_trunc = gen_truncv8hiv8qi2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V8QImode;
+      trunc_mode = V8HImode;
+      break;
+    case E_V32QImode:
+      gen_trunc = gen_truncv16hiv16qi2;
+      gen_vec_set_lo = gen_vec_set_lo_v32qi;
+      half_mode = V16QImode;
+      trunc_mode = V16HImode;
+      break;
+    case E_V64QImode:
+      gen_trunc = gen_truncv32hiv32qi2;
+      gen_vec_set_lo = gen_vec_set_lo_v64qi;
+      half_mode = V32QImode;
+      trunc_mode = V32HImode;
+      break;
+    case E_V8HImode:
+      gen_trunc = gen_truncv4siv4hi2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V4HImode;
+      trunc_mode = V4SImode;
+      break;
+    case E_V16HImode:
+      gen_trunc = gen_truncv8siv8hi2;
+      gen_vec_set_lo = gen_vec_set_lo_v16hi;
+      half_mode = V8HImode;
+      trunc_mode = V8SImode;
+      break;
+    case E_V32HImode:
+      gen_trunc = gen_truncv16siv16hi2;
+      gen_vec_set_lo = gen_vec_set_lo_v32hi;
+      half_mode = V16HImode;
+      trunc_mode = V16SImode;
+      break;
+    case E_V4SImode:
+      gen_trunc = gen_truncv2div2si2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V2SImode;
+      trunc_mode = V2DImode;
+      break;
+    case E_V8SImode:
+      gen_trunc = gen_truncv4div4si2;
+      gen_vec_set_lo = gen_vec_set_lo_v8si;
+      half_mode = V4SImode;
+      trunc_mode = V4DImode;
+      break;
+    case E_V16SImode:
+      gen_trunc = gen_truncv8div8si2;
+      gen_vec_set_lo = gen_vec_set_lo_v16si;
+      half_mode = V8SImode;
+      trunc_mode = V8DImode;
+      break;
+
+    default:
+      break;
+    }
+
+  if (gen_trunc == NULL)
+    return false;
+
+  rtx op_half = gen_reg_rtx (half_mode);
+  rtx op_trunc = d->op0;
+  if (d->vmode != trunc_mode)
+    op_trunc = lowpart_subreg (trunc_mode, op_trunc, d->vmode);
+  emit_insn (gen_trunc (op_half, op_trunc));
+
+  if (gen_vec_set_lo == gen_vec_setv2di)
+    {
+      op_half = lowpart_subreg (DImode, op_half, half_mode);
+      rtx op_dest = lowpart_subreg (V2DImode, d->op0, d->vmode);
+
+      /* vec_set<mode> require register_operand.  */
+      if (MEM_P (op_dest))
+	op_dest = force_reg (V2DImode, op_dest);
+      if (MEM_P (op_half))
+	op_half = force_reg (DImode, op_half);
+
+      emit_insn (gen_vec_set_lo (op_dest, op_half, GEN_INT(0)));
+      op_dest = lowpart_subreg (d->vmode, op_dest, V2DImode);
+      emit_move_insn (d->target, op_dest);
+    }
+  else
+    emit_insn (gen_vec_set_lo (d->target, d->op0, op_half));
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
    in terms of a pair of pshuflw + pshufhw instructions.  */
 
@@ -21028,6 +21148,11 @@  ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
   d.op0 = nop0;
   d.op1 = force_reg (vmode, d.op1);
 
+  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
+     it's very likely to be optimized off. So let's put the function here.  */
+  if (expand_vec_perm_trunc_vinsert (&d))
+    return true;
+
   if (ix86_expand_vec_perm_const_1 (&d))
     return true;
 
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f631756c829..87e22332c83 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15162,8 +15162,12 @@  (define_insn "vec_set_lo_<mode><mask_name>"
 		       (const_int 10) (const_int 11)
 		       (const_int 12) (const_int 13)
 		       (const_int 14) (const_int 15)]))))]
-  "TARGET_AVX512DQ"
-  "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}"
+  "TARGET_AVX512F && <mask_avx512dq_condition>"
+{
+  if (TARGET_AVX512DQ)
+    return "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}";
+  return "vinsert<shuffletype>64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}";
+}
   [(set_attr "type" "sselog")
    (set_attr "length_immediate" "1")
    (set_attr "prefix" "evex")
@@ -22806,6 +22810,28 @@  (define_insn "vec_set_hi_v16hi"
    (set_attr "prefix" "vex,evex")
    (set_attr "mode" "OI")])
 
+(define_insn "vec_set_lo_v32hi"
+  [(set (match_operand:V32HI 0 "register_operand" "=v")
+	(vec_concat:V32HI
+	  (match_operand:V16HI 2 "nonimmediate_operand" "vm")
+	  (vec_select:V16HI
+	    (match_operand:V32HI 1 "register_operand" "v")
+	    (parallel [(const_int 16) (const_int 17)
+		       (const_int 18) (const_int 19)
+		       (const_int 20) (const_int 21)
+		       (const_int 22) (const_int 23)
+		       (const_int 24) (const_int 25)
+		       (const_int 26) (const_int 27)
+		       (const_int 28) (const_int 29)
+		       (const_int 30) (const_int 31)]))))]
+  "TARGET_AVX512F"
+  "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "XI")])
+
 (define_insn "vec_set_lo_v32qi"
   [(set (match_operand:V32QI 0 "register_operand" "=x,v")
 	(vec_concat:V32QI
@@ -22854,6 +22880,36 @@  (define_insn "vec_set_hi_v32qi"
    (set_attr "prefix" "vex,evex")
    (set_attr "mode" "OI")])
 
+(define_insn "vec_set_lo_v64qi"
+  [(set (match_operand:V64QI 0 "register_operand" "=v")
+	(vec_concat:V64QI
+	  (match_operand:V32QI 2 "nonimmediate_operand" "vm")
+	  (vec_select:V32QI
+	    (match_operand:V64QI 1 "register_operand" "v")
+	    (parallel [(const_int 32) (const_int 33)
+		       (const_int 34) (const_int 35)
+		       (const_int 36) (const_int 37)
+		       (const_int 38) (const_int 39)
+		       (const_int 40) (const_int 41)
+		       (const_int 42) (const_int 43)
+		       (const_int 44) (const_int 45)
+		       (const_int 46) (const_int 47)
+		       (const_int 48) (const_int 49)
+		       (const_int 50) (const_int 51)
+		       (const_int 52) (const_int 53)
+		       (const_int 54) (const_int 55)
+		       (const_int 56) (const_int 57)
+		       (const_int 58) (const_int 59)
+		       (const_int 60) (const_int 61)
+		       (const_int 62) (const_int 63)]))))]
+  "TARGET_AVX512F"
+  "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "XI")])
+
 (define_insn "<avx_avx2>_maskload<ssemodesuffix><avxsizesuffix>"
   [(set (match_operand:V48_AVX2 0 "register_operand" "=x")
 	(unspec:V48_AVX2
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-2.c b/gcc/testsuite/gcc.target/i386/pr101846-2.c
new file mode 100644
index 00000000000..af4ae8ccdd6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-2.c
@@ -0,0 +1,81 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v16hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v8si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v2si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2);
+}
+
+v32qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62);
+}
+
+v16qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-3.c b/gcc/testsuite/gcc.target/i386/pr101846-3.c
new file mode 100644
index 00000000000..380b1220327
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-3.c
@@ -0,0 +1,95 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v32hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v16si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x, 0,
+				  2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v4si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 2, 3);
+}
+
+v64qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62,
+				  32, 33, 34, 35, 36, 37, 38, 39,
+				  40, 41, 42, 43, 44, 45, 46, 47,
+				  48, 49, 50, 51, 52, 53, 54, 55,
+				  56, 57, 58, 59, 60, 61, 62, 63);
+}
+
+v32qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+