i386: Optimize _mm_unpacklo_epi8 of 0 vector as second argument or similar VEC_PERM_EXPRs into pmovzx [PR95905]

Message ID 20210112093318.GA1034503@tucnak
State New
Series i386: Optimize _mm_unpacklo_epi8 of 0 vector as second argument or similar VEC_PERM_EXPRs into pmovzx [PR95905]

Commit Message

Jakub Jelinek Jan. 12, 2021, 9:33 a.m. UTC
Hi!

The following patch adds patterns (in the end I went with define_insn rather
than the combiner define_split + define_insn_and_split I had initially hoped
for, or a define_insn_and_split) to represent (so far 128-bit only)
permutations like { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 } where the
second operand is a CONST0_RTX CONST_VECTOR as pmovzx.
The define_split didn't work (because the combiner doesn't try
combine_split_insn when i1 is NULL) and would in the end have been quite
large anyway; the reason for not trying to split this afterwards is the
different vector mode of the output, and lowpart_subreg on the result is
undesirable, so we'd need to split it into two instructions and hope some
later pass optimizes the move into just rewriting the uses using
lowpart_subreg.
While I initially also wrote avx2 and avx512{bw,f} patterns, it turns out
that without further changes they aren't useful: such permutations are
expanded into multiple permutation instructions, and while the combiner
could in theory undo that, what stops it is that the const_vector of zeros
has multiple uses.  So either we'd need to allow a pre-reload const0_operand
next to registers (and sometimes memory) in some permutation patterns, or
perhaps change the vec_perm_const target hook calling code not to force zero
vectors into registers and change each vec_perm_const hook to force them
into a register itself if it can't expand such a permutation smartly.
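
For illustration only (not part of the patch), the effect of the byte-case
selector above with an all-zero second operand, spelled out in scalar C
(the function name is made up), is exactly what pmovzxbw computes on
little-endian x86:

#include <stdint.h>

/* VEC_PERM_EXPR <x, {0,...,0}, { 0, 16, 1, 17, ... }>: even result bytes
   come from the low half of x, odd result bytes come from the zero
   vector.  */
void
unpacklo_bytes_with_zero (const uint8_t x[16], uint8_t out[16])
{
  for (int i = 0; i < 8; i++)
    {
      out[2 * i] = x[i];
      out[2 * i + 1] = 0;
    }
  /* Each (x[i], 0) byte pair read as a little-endian uint16_t is just
     (uint16_t) x[i], i.e. the low 8 bytes zero-extended to 16 bits each.  */
}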

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2021-01-12  Jakub Jelinek  <jakub@redhat.com>

	PR target/95905
	* config/i386/predicates.md (pmovzx_parallel): New predicate.
	* config/i386/sse.md (*sse4_1_zero_extendv8qiv8hi2_3,
	*sse4_1_zero_extendv4hiv4si2_3, *sse4_1_zero_extendv2siv2di2_3): New
	define_insn patterns.

	* gcc.target/i386/pr95905-1.c: New test.
	* gcc.target/i386/pr95905-2.c: New test.


	Jakub

Comments

Uros Bizjak Jan. 12, 2021, 10:42 a.m. UTC | #1
On Tue, Jan 12, 2021 at 10:33 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> The following patch adds patterns (in the end I went with define_insn rather
> than the combiner define_split + define_insn_and_split I had initially hoped
> for, or a define_insn_and_split) to represent (so far 128-bit only)
> permutations like { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 } where the
> second operand is a CONST0_RTX CONST_VECTOR as pmovzx.
> The define_split didn't work (because the combiner doesn't try
> combine_split_insn when i1 is NULL) and would in the end have been quite
> large anyway; the reason for not trying to split this afterwards is the
> different vector mode of the output, and lowpart_subreg on the result is
> undesirable, so we'd need to split it into two instructions and hope some
> later pass optimizes the move into just rewriting the uses using
> lowpart_subreg.

You can use post-reload define_insn_and_split here. This way,
gen_lowpart on all arguments, including output, can be used. So,
instead of generating an insn template, the patterns you introduced
should split to "real" sse4_1 zero-extend insns. This approach is
preferred to avoid having several pseudo-insns in .md files that do
the same thing with slightly different patterns. There are many
examples of post-reload splitters that use gen_lowpart in i386.md.

OTOH, perhaps some of the new testcases can be handled in x86
target_fold_builtin? In the long term, maybe target_fold_shuffle can
be introduced to map __builtin_shuffle to various target builtins, so
the builtin can be processed further in target_fold_builtin. As
pointed out below, vector insn patterns can be quite complex, and push
RTL combiners to their limits, so perhaps they can be more efficiently
handled by tree passes.
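
For concreteness (an illustration, not taken from the patch or this thread,
with made-up function names): the two styles used in the new testcases
compute the same value, and a fold along these lines would effectively
rewrite the intrinsic form into the VEC_PERM_EXPR that the generic-vector
form already produces in GIMPLE.

#include <x86intrin.h>

typedef unsigned char V16 __attribute__ ((vector_size (16)));

/* The intrinsic form, currently expanded through a target builtin...  */
__m128i
via_intrinsic (__m128i a)
{
  return _mm_unpacklo_epi8 (a, _mm_setzero_si128 ());
}

/* ...and the equivalent generic-vector shuffle, which is already a
   VEC_PERM_EXPR in GIMPLE.  */
V16
via_shuffle (V16 x)
{
  return __builtin_shuffle (x, (V16) {},
                            (V16) { 0, 16, 1, 17, 2, 18, 3, 19,
                                    4, 20, 5, 21, 6, 22, 7, 23 });
}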

Uros.

> While I initially also wrote avx2 and avx512{bw,f} patterns, it turns out
> that without further changes they aren't useful: such permutations are
> expanded into multiple permutation instructions, and while the combiner
> could in theory undo that, what stops it is that the const_vector of zeros
> has multiple uses.  So either we'd need to allow a pre-reload const0_operand
> next to registers (and sometimes memory) in some permutation patterns, or
> perhaps change the vec_perm_const target hook calling code not to force zero
> vectors into registers and change each vec_perm_const hook to force them
> into a register itself if it can't expand such a permutation smartly.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2021-01-12  Jakub Jelinek  <jakub@redhat.com>
>
>         PR target/95905
>         * config/i386/predicates.md (pmovzx_parallel): New predicate.
>         * config/i386/sse.md (*sse4_1_zero_extendv8qiv8hi2_3,
>         *sse4_1_zero_extendv4hiv4si2_3, *sse4_1_zero_extendv2siv2di2_3): New
>         define_insn patterns.
>
>         * gcc.target/i386/pr95905-1.c: New test.
>         * gcc.target/i386/pr95905-2.c: New test.
>
> --- gcc/config/i386/predicates.md.jj    2021-01-04 10:25:45.248161185 +0100
> +++ gcc/config/i386/predicates.md       2021-01-11 19:40:04.793433826 +0100
> @@ -1600,6 +1600,38 @@ (define_predicate "addsub_vs_parallel"
>    return true;
>  })
>
> +;; Return true if OP is a parallel for a pmovz{bw,wd,dq} vec_select,
> +;; where one of the two operands of the vec_concat is const0_operand.
> +(define_predicate "pmovzx_parallel"
> +  (and (match_code "parallel")
> +       (match_code "const_int" "a"))
> +{
> +  int nelt = XVECLEN (op, 0);
> +  int elt, i;
> +
> +  if (nelt < 2)
> +    return false;
> +
> +  /* Check that the permutation is suitable for pmovz{bw,wd,dq}.
> +     For example { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 }.  */
> +  elt = INTVAL (XVECEXP (op, 0, 0));
> +  if (elt == 0)
> +    {
> +      for (i = 1; i < nelt; ++i)
> +       if ((i & 1) != 0)
> +         {
> +           if (INTVAL (XVECEXP (op, 0, i)) < nelt)
> +             return false;
> +         }
> +       else if (INTVAL (XVECEXP (op, 0, i)) != i / 2)
> +         return false;
> +    }
> +  else
> +    return false;
> +
> +  return true;
> +})
> +
>  ;; Return true if OP is a parallel for a vbroadcast permute.
>  (define_predicate "avx_vbroadcast_operand"
>    (and (match_code "parallel")
> --- gcc/config/i386/sse.md.jj   2021-01-07 15:29:52.604974544 +0100
> +++ gcc/config/i386/sse.md      2021-01-11 21:19:06.247346965 +0100
> @@ -17683,6 +17683,27 @@ (define_insn_and_split "*sse4_1_<code>v8
>         (any_extend:V8HI (match_dup 1)))]
>    "operands[1] = adjust_address_nv (operands[1], V8QImode, 0);")
>
> +(define_insn "*sse4_1_zero_extendv8qiv8hi2_3"
> +  [(set (match_operand:V16QI 0 "register_operand" "=Yr,*x,v")
> +       (vec_select:V16QI
> +         (vec_concat:V32QI
> +           (match_operand:V16QI 1 "vector_operand" "Yrm,*xm,vm")
> +           (match_operand:V16QI 2 "const0_operand" "C,C,C"))
> +         (match_parallel 3 "pmovzx_parallel"
> +           [(match_operand 4 "const_int_operand" "n,n,n")])))]
> +  "TARGET_SSE4_1"
> +{
> +  if (MEM_P (operands[1]))
> +    return "%vpmovzxbw\t{%1, %0|%0, %1}";
> +  else
> +    return "%vpmovzxbw\t{%1, %0|%0, %q1}";
> +}
> +  [(set_attr "isa" "noavx,noavx,avx")
> +   (set_attr "type" "ssemov")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "prefix" "orig,orig,maybe_evex")
> +   (set_attr "mode" "TI")])
> +
>  (define_expand "<insn>v8qiv8hi2"
>    [(set (match_operand:V8HI 0 "register_operand")
>         (any_extend:V8HI
> @@ -17929,6 +17950,27 @@ (define_expand "<insn>v4hiv4si2"
>      }
>  })
>
> +(define_insn "*sse4_1_zero_extendv4hiv4si2_3"
> +  [(set (match_operand:V8HI 0 "register_operand" "=Yr,*x,v")
> +       (vec_select:V8HI
> +         (vec_concat:V16HI
> +           (match_operand:V8HI 1 "vector_operand" "Yrm,*xm,vm")
> +           (match_operand:V8HI 2 "const0_operand" "C,C,C"))
> +         (match_parallel 3 "pmovzx_parallel"
> +           [(match_operand 4 "const_int_operand" "n,n,n")])))]
> +  "TARGET_SSE4_1"
> +{
> +  if (MEM_P (operands[1]))
> +    return "%vpmovzxwd\t{%1, %0|%0, %1}";
> +  else
> +    return "%vpmovzxwd\t{%1, %0|%0, %q1}";
> +}
> +  [(set_attr "isa" "noavx,noavx,avx")
> +   (set_attr "type" "ssemov")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "prefix" "orig,orig,maybe_evex")
> +   (set_attr "mode" "TI")])
> +
>  (define_insn "avx512f_<code>v8qiv8di2<mask_name>"
>    [(set (match_operand:V8DI 0 "register_operand" "=v")
>         (any_extend:V8DI
> @@ -18283,6 +18325,27 @@ (define_insn_and_split "*sse4_1_<code>v2
>         (any_extend:V2DI (match_dup 1)))]
>    "operands[1] = adjust_address_nv (operands[1], V2SImode, 0);")
>
> +(define_insn "*sse4_1_zero_extendv2siv2di2_3"
> +  [(set (match_operand:V4SI 0 "register_operand" "=Yr,*x,v")
> +       (vec_select:V4SI
> +         (vec_concat:V8SI
> +           (match_operand:V4SI 1 "vector_operand" "Yrm,*xm,vm")
> +           (match_operand:V4SI 2 "const0_operand" "C,C,C"))
> +         (match_parallel 3 "pmovzx_parallel"
> +           [(match_operand 4 "const_int_operand" "n,n,n")])))]
> +  "TARGET_SSE4_1"
> +{
> +  if (MEM_P (operands[1]))
> +    return "%vpmovzxdq\t{%1, %0|%0, %1}";
> +  else
> +    return "%vpmovzxdq\t{%1, %0|%0, %q1}";
> +}
> +  [(set_attr "isa" "noavx,noavx,avx")
> +   (set_attr "type" "ssemov")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "prefix" "orig,orig,maybe_evex")
> +   (set_attr "mode" "TI")])
> +
>  (define_expand "<insn>v2siv2di2"
>    [(set (match_operand:V2DI 0 "register_operand")
>         (any_extend:V2DI
> --- gcc/testsuite/gcc.target/i386/pr95905-1.c.jj        2021-01-11 21:19:46.657890361 +0100
> +++ gcc/testsuite/gcc.target/i386/pr95905-1.c   2021-01-11 20:41:25.189872347 +0100
> @@ -0,0 +1,26 @@
> +/* PR target/95905 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse4.1 -mno-avx" } */
> +/* { dg-final { scan-assembler "\tpmovzxbw\t" } } */
> +/* { dg-final { scan-assembler "\tpmovzxwd\t" } } */
> +/* { dg-final { scan-assembler "\tpmovzxdq\t" } } */
> +
> +#include <x86intrin.h>
> +
> +__m128i
> +f1 (__m128i a)
> +{
> +  return _mm_unpacklo_epi8 (a, _mm_setzero_si128 ());
> +}
> +
> +__m128i
> +f2 (__m128i a)
> +{
> +  return _mm_unpacklo_epi16 (a, _mm_setzero_si128 ());
> +}
> +
> +__m128i
> +f3 (__m128i a)
> +{
> +  return _mm_unpacklo_epi32 (a, _mm_setzero_si128 ());
> +}
> --- gcc/testsuite/gcc.target/i386/pr95905-2.c.jj        2021-01-11 21:20:10.075625762 +0100
> +++ gcc/testsuite/gcc.target/i386/pr95905-2.c   2021-01-11 21:20:44.126241029 +0100
> @@ -0,0 +1,46 @@
> +/* PR target/95905 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse4.1" } */
> +/* { dg-final { scan-assembler "\tv?pmovzxbw\t" } } */
> +/* { dg-final { scan-assembler "\tv?pmovzxwd\t" } } */
> +/* { dg-final { scan-assembler "\tv?pmovzxdq\t" } } */
> +
> +typedef unsigned char V1 __attribute__((vector_size (16)));
> +typedef unsigned short V2 __attribute__((vector_size (16)));
> +typedef unsigned int V3 __attribute__((vector_size (16)));
> +
> +V1
> +f1 (V1 x)
> +{
> +  return __builtin_shuffle (x, (V1) {}, (V1) { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 });
> +}
> +
> +V2
> +f2 (V2 x)
> +{
> +  return __builtin_shuffle (x, (V2) {}, (V2) { 0, 8, 1, 9, 2, 10, 3, 11 });
> +}
> +
> +V3
> +f3 (V3 x)
> +{
> +  return __builtin_shuffle (x, (V3) {}, (V3) { 0, 4, 1, 5 });
> +}
> +
> +V1
> +f4 (V1 *x)
> +{
> +  return __builtin_shuffle (*x, (V1) {}, (V1) { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 });
> +}
> +
> +V2
> +f5 (V2 *x)
> +{
> +  return __builtin_shuffle (*x, (V2) {}, (V2) { 0, 8, 1, 9, 2, 10, 3, 11 });
> +}
> +
> +V3
> +f6 (V3 *x)
> +{
> +  return __builtin_shuffle (*x, (V3) {}, (V3) { 0, 4, 1, 5 });
> +}
>
>         Jakub
>
Jakub Jelinek Jan. 12, 2021, 10:47 a.m. UTC | #2
On Tue, Jan 12, 2021 at 11:42:44AM +0100, Uros Bizjak wrote:
> > The following patch adds patterns (in the end I went with define_insn rather
> > than the combiner define_split + define_insn_and_split I had initially hoped
> > for, or a define_insn_and_split) to represent (so far 128-bit only)
> > permutations like { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 } where the
> > second operand is a CONST0_RTX CONST_VECTOR as pmovzx.
> > The define_split didn't work (because the combiner doesn't try
> > combine_split_insn when i1 is NULL) and would in the end have been quite
> > large anyway; the reason for not trying to split this afterwards is the
> > different vector mode of the output, and lowpart_subreg on the result is
> > undesirable, so we'd need to split it into two instructions and hope some
> > later pass optimizes the move into just rewriting the uses using
> > lowpart_subreg.
> 
> You can use post-reload define_insn_and_split here. This way,
> gen_lowpart on all arguments, including output, can be used. So,
> instead of generating an insn template, the patterns you introduced
> should split to "real" sse4_1 zero-extend insns. This approach is
> preferred to avoid having several pseudo-insns in .md files that do
> the same thing with slightly different patterns. There are many
> examples of post-reload splitters that use gen_lowpart in i386.md.

Ok, will change it that way.

> OTOH, perhaps some of the new testcases can be handled in x86
> target_fold_builtin? In the long term, maybe target_fold_shuffle can
> > be introduced to map __builtin_shuffle to various target builtins, so
> the builtin can be processed further in target_fold_builtin. As
> pointed out below, vector insn patterns can be quite complex, and push
> RTL combiners to their limits, so perhaps they can be more efficiently
> handled by tree passes.

My primary motivation was to generate good code from __builtin_shuffle here,
and trying to find the best permutation and map it back from insns to
builtins would be a nightmare.
I'll see how many targets I'd need to modify to try the variant where the
middle-end doesn't force_reg the CONST0_RTX...

	Jakub
Jakub Jelinek Jan. 12, 2021, 12:34 p.m. UTC | #3
On Tue, Jan 12, 2021 at 11:47:48AM +0100, Jakub Jelinek via Gcc-patches wrote:
> > OTOH, perhaps some of the new testcases can be handled in x86
> > target_fold_builtin? In the long term, maybe target_fold_shuffle can
> > be introduced to map __builtin_shuffle to various target builtins, so
> > the builtin can be processed further in target_fold_builtin. As
> > pointed out below, vector insn patterns can be quite complex, and push
> > RTL combiners to their limits, so perhaps they can be more efficiently
> > handled by tree passes.
> 
> My primary motivation was to generate good code from __builtin_shuffle here,
> and trying to find the best permutation and map it back from insns to
> builtins would be a nightmare.
> I'll see how many targets I'd need to modify to try the variant where the
> middle-end doesn't force_reg the CONST0_RTX...

For the folding, I think the best approach would be to change
_mm_unpacklo_epi8 and all the similar intrinsics for hardcoded specific
permutations from using a builtin to just using __builtin_shuffle (together
with verification that we emit as good or better code for each case, of
course), and keep __builtin_shuffle -> VEC_PERM_EXPR as the canonical
form (with which the GIMPLE code can do any optimizations it wants).
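
Roughly what that could look like for this particular intrinsic (a sketch
only, not the current emmintrin.h implementation; the wrapper name is made
up, and the vector typedefs mirror the SSE header definitions):

typedef char __v16qi __attribute__ ((__vector_size__ (16)));
typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));

static __inline __m128i
__attribute__ ((__always_inline__, __artificial__))
unpacklo_epi8_via_shuffle (__m128i __A, __m128i __B)
{
  /* The same { 0 16 1 17 ... } selector as above; when __B is a zero
     vector this is the pmovzxbw form the patch recognizes.  */
  return (__m128i) __builtin_shuffle ((__v16qi) __A, (__v16qi) __B,
                                      (__v16qi) { 0, 16, 1, 17, 2, 18, 3, 19,
                                                  4, 20, 5, 21, 6, 22, 7, 23 });
}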

	Jakub

Patch

--- gcc/config/i386/predicates.md.jj	2021-01-04 10:25:45.248161185 +0100
+++ gcc/config/i386/predicates.md	2021-01-11 19:40:04.793433826 +0100
@@ -1600,6 +1600,38 @@  (define_predicate "addsub_vs_parallel"
   return true;
 })
 
+;; Return true if OP is a parallel for a pmovz{bw,wd,dq} vec_select,
+;; where one of the two operands of the vec_concat is const0_operand.
+(define_predicate "pmovzx_parallel"
+  (and (match_code "parallel")
+       (match_code "const_int" "a"))
+{
+  int nelt = XVECLEN (op, 0);
+  int elt, i;
+
+  if (nelt < 2)
+    return false;
+
+  /* Check that the permutation is suitable for pmovz{bw,wd,dq}.
+     For example { 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23 }.  */
+  elt = INTVAL (XVECEXP (op, 0, 0));
+  if (elt == 0)
+    {
+      for (i = 1; i < nelt; ++i)
+	if ((i & 1) != 0)
+	  {
+	    if (INTVAL (XVECEXP (op, 0, i)) < nelt)
+	      return false;
+	  }
+	else if (INTVAL (XVECEXP (op, 0, i)) != i / 2)
+	  return false;
+    }
+  else
+    return false;
+
+  return true;
+})
+
 ;; Return true if OP is a parallel for a vbroadcast permute.
 (define_predicate "avx_vbroadcast_operand"
   (and (match_code "parallel")
--- gcc/config/i386/sse.md.jj	2021-01-07 15:29:52.604974544 +0100
+++ gcc/config/i386/sse.md	2021-01-11 21:19:06.247346965 +0100
@@ -17683,6 +17683,27 @@  (define_insn_and_split "*sse4_1_<code>v8
 	(any_extend:V8HI (match_dup 1)))]
   "operands[1] = adjust_address_nv (operands[1], V8QImode, 0);")
 
+(define_insn "*sse4_1_zero_extendv8qiv8hi2_3"
+  [(set (match_operand:V16QI 0 "register_operand" "=Yr,*x,v")
+	(vec_select:V16QI
+	  (vec_concat:V32QI
+	    (match_operand:V16QI 1 "vector_operand" "Yrm,*xm,vm")
+	    (match_operand:V16QI 2 "const0_operand" "C,C,C"))
+	  (match_parallel 3 "pmovzx_parallel"
+	    [(match_operand 4 "const_int_operand" "n,n,n")])))]
+  "TARGET_SSE4_1"
+{
+  if (MEM_P (operands[1]))
+    return "%vpmovzxbw\t{%1, %0|%0, %1}";
+  else
+    return "%vpmovzxbw\t{%1, %0|%0, %q1}";
+}
+  [(set_attr "isa" "noavx,noavx,avx")
+   (set_attr "type" "ssemov")
+   (set_attr "prefix_extra" "1")
+   (set_attr "prefix" "orig,orig,maybe_evex")
+   (set_attr "mode" "TI")])
+
 (define_expand "<insn>v8qiv8hi2"
   [(set (match_operand:V8HI 0 "register_operand")
 	(any_extend:V8HI
@@ -17929,6 +17950,27 @@  (define_expand "<insn>v4hiv4si2"
     }
 })
 
+(define_insn "*sse4_1_zero_extendv4hiv4si2_3"
+  [(set (match_operand:V8HI 0 "register_operand" "=Yr,*x,v")
+	(vec_select:V8HI
+	  (vec_concat:V16HI
+	    (match_operand:V8HI 1 "vector_operand" "Yrm,*xm,vm")
+	    (match_operand:V8HI 2 "const0_operand" "C,C,C"))
+	  (match_parallel 3 "pmovzx_parallel"
+	    [(match_operand 4 "const_int_operand" "n,n,n")])))]
+  "TARGET_SSE4_1"
+{
+  if (MEM_P (operands[1]))
+    return "%vpmovzxwd\t{%1, %0|%0, %1}";
+  else
+    return "%vpmovzxwd\t{%1, %0|%0, %q1}";
+}
+  [(set_attr "isa" "noavx,noavx,avx")
+   (set_attr "type" "ssemov")
+   (set_attr "prefix_extra" "1")
+   (set_attr "prefix" "orig,orig,maybe_evex")
+   (set_attr "mode" "TI")])
+
 (define_insn "avx512f_<code>v8qiv8di2<mask_name>"
   [(set (match_operand:V8DI 0 "register_operand" "=v")
 	(any_extend:V8DI
@@ -18283,6 +18325,27 @@  (define_insn_and_split "*sse4_1_<code>v2
 	(any_extend:V2DI (match_dup 1)))]
   "operands[1] = adjust_address_nv (operands[1], V2SImode, 0);")
 
+(define_insn "*sse4_1_zero_extendv2siv2di2_3"
+  [(set (match_operand:V4SI 0 "register_operand" "=Yr,*x,v")
+	(vec_select:V4SI
+	  (vec_concat:V8SI
+	    (match_operand:V4SI 1 "vector_operand" "Yrm,*xm,vm")
+	    (match_operand:V4SI 2 "const0_operand" "C,C,C"))
+	  (match_parallel 3 "pmovzx_parallel"
+	    [(match_operand 4 "const_int_operand" "n,n,n")])))]
+  "TARGET_SSE4_1"
+{
+  if (MEM_P (operands[1]))
+    return "%vpmovzxdq\t{%1, %0|%0, %1}";
+  else
+    return "%vpmovzxdq\t{%1, %0|%0, %q1}";
+}
+  [(set_attr "isa" "noavx,noavx,avx")
+   (set_attr "type" "ssemov")
+   (set_attr "prefix_extra" "1")
+   (set_attr "prefix" "orig,orig,maybe_evex")
+   (set_attr "mode" "TI")])
+
 (define_expand "<insn>v2siv2di2"
   [(set (match_operand:V2DI 0 "register_operand")
 	(any_extend:V2DI
--- gcc/testsuite/gcc.target/i386/pr95905-1.c.jj	2021-01-11 21:19:46.657890361 +0100
+++ gcc/testsuite/gcc.target/i386/pr95905-1.c	2021-01-11 20:41:25.189872347 +0100
@@ -0,0 +1,26 @@ 
+/* PR target/95905 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mno-avx" } */
+/* { dg-final { scan-assembler "\tpmovzxbw\t" } } */
+/* { dg-final { scan-assembler "\tpmovzxwd\t" } } */
+/* { dg-final { scan-assembler "\tpmovzxdq\t" } } */
+
+#include <x86intrin.h>
+
+__m128i
+f1 (__m128i a)
+{
+  return _mm_unpacklo_epi8 (a, _mm_setzero_si128 ());
+}
+
+__m128i
+f2 (__m128i a)
+{
+  return _mm_unpacklo_epi16 (a, _mm_setzero_si128 ());
+}
+
+__m128i
+f3 (__m128i a)
+{
+  return _mm_unpacklo_epi32 (a, _mm_setzero_si128 ());
+}
--- gcc/testsuite/gcc.target/i386/pr95905-2.c.jj	2021-01-11 21:20:10.075625762 +0100
+++ gcc/testsuite/gcc.target/i386/pr95905-2.c	2021-01-11 21:20:44.126241029 +0100
@@ -0,0 +1,46 @@ 
+/* PR target/95905 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1" } */
+/* { dg-final { scan-assembler "\tv?pmovzxbw\t" } } */
+/* { dg-final { scan-assembler "\tv?pmovzxwd\t" } } */
+/* { dg-final { scan-assembler "\tv?pmovzxdq\t" } } */
+
+typedef unsigned char V1 __attribute__((vector_size (16)));
+typedef unsigned short V2 __attribute__((vector_size (16)));
+typedef unsigned int V3 __attribute__((vector_size (16)));
+
+V1
+f1 (V1 x)
+{
+  return __builtin_shuffle (x, (V1) {}, (V1) { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 });
+}
+
+V2
+f2 (V2 x)
+{
+  return __builtin_shuffle (x, (V2) {}, (V2) { 0, 8, 1, 9, 2, 10, 3, 11 });
+}
+
+V3
+f3 (V3 x)
+{
+  return __builtin_shuffle (x, (V3) {}, (V3) { 0, 4, 1, 5 });
+}
+
+V1
+f4 (V1 *x)
+{
+  return __builtin_shuffle (*x, (V1) {}, (V1) { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23 });
+}
+
+V2
+f5 (V2 *x)
+{
+  return __builtin_shuffle (*x, (V2) {}, (V2) { 0, 8, 1, 9, 2, 10, 3, 11 });
+}
+
+V3
+f6 (V3 *x)
+{
+  return __builtin_shuffle (*x, (V3) {}, (V3) { 0, 4, 1, 5 });
+}