Reintroduce vec_shl_optab and use it for #pragma omp scan inclusive

Message ID 20190619085516.GN815@tucnak
State New
Series Reintroduce vec_shl_optab and use it for #pragma omp scan inclusive

Commit Message

Jakub Jelinek June 19, 2019, 8:55 a.m. UTC
Hi!

When VEC_[LR]SHIFT_EXPR was replaced with VEC_PERM_EXPR, vec_shl_optab
was removed as unused, because we only used vec_shr_optab for the
reductions.
Without this patch the vect-simd-*.c tests can be vectorized just fine
for SSE4 and above, but not with SSE2.  As the comment in
tree-vect-stmts.c tries to explain, for the inclusive scan operation we
want (when using V8SImode vectors):
       _30 = MEM <vector(8) int> [(int *)&D.2043];
       _31 = MEM <vector(8) int> [(int *)&D.2042];
       _32 = VEC_PERM_EXPR <_31, _40, { 8, 0, 1, 2, 3, 4, 5, 6 }>;
       _33 = _31 + _32;
       // _33 = { _31[0], _31[0]+_31[1], _31[1]+_31[2], ..., _31[6]+_31[7] };
       _34 = VEC_PERM_EXPR <_33, _40, { 8, 9, 0, 1, 2, 3, 4, 5 }>;
       _35 = _33 + _34;
       // _35 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2], _31[0]+.._31[3],
       //         _31[1]+.._31[4], ... _31[4]+.._31[7] };
       _36 = VEC_PERM_EXPR <_35, _40, { 8, 9, 10, 11, 0, 1, 2, 3 }>;
       _37 = _35 + _36;
       // _37 = { _31[0], _31[0]+_31[1], _31[0]+.._31[2], _31[0]+.._31[3],
       //         _31[0]+.._31[4], ... _31[0]+.._31[7] };
       _38 = _30 + _37;
       _39 = VEC_PERM_EXPR <_38, _38, { 7, 7, 7, 7, 7, 7, 7, 7 }>;
       MEM <vector(8) int> [(int *)&D.2043] = _39;
       MEM <vector(8) int> [(int *)&D.2042] = _38;
For V4SImode vectors that would be VEC_PERM_EXPR <x, init, { 4, 0, 1, 2 }>,
VEC_PERM_EXPR <x2, init, { 4, 5, 0, 1 }> and
VEC_PERM_EXPR <x3, init, { 3, 3, 3, 3 }> etc.
Unfortunately, SSE2 can't do the VEC_PERM_EXPR <x, init, { 4, 0, 1, 2 }>
permutation (the other two it can do).  Well, to be precise, it can do it
using the whole-vector left shift that had been removed as unused, provided
that init is initializer_zerop (the shift fills the vacated lanes with
zeros).  init usually is all zeros, as that is the neutral element of
additive reductions and a couple of others too; in the unlikely case that
some other reduction is used with scan (multiplication, minimum, maximum,
bitwise and), we can use a VEC_COND_EXPR with a constant first argument,
i.e. a blend or an and/or pair.
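Concretely, for V4SImode the problematic permutation can be done as a whole
vector left shift plus, when init is not all zeros, a blend; roughly (a
sketch, not an exact GIMPLE dump, and the boolean mask notation is
approximate):
       _t = VEC_PERM_EXPR <{ 0, 0, 0, 0 }, x, { 0, 4, 5, 6 }>;
       // _t = { 0, x[0], x[1], x[2] }, a whole vector shift left by one element
       _u = VEC_COND_EXPR <{ 0, -1, -1, -1 }, _t, init>;
       // _u = { init[0], x[0], x[1], x[2] }, blending init[0] back into lane 0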

So, this patch reintroduces vec_shl_optab (most backends actually have those
patterns already) and handles its expansion and vector generic lowering
similarly to vec_shr_optab - i.e. it matches a VEC_PERM_EXPR whose first
operand is initializer_zerop and whose third operand starts with a few
numbers smaller than the number of elements (it doesn't matter which ones,
as all elements of the zero vector are the same), followed by nelts,
nelts+1, nelts+2, ...
Unlike vec_shr_optab, which has zero as the second operand, this one has it
as the first operand, because VEC_PERM_EXPR canonicalization wants the
first selector element to be smaller than the number of elements.  And
unlike vec_shr_optab, where we also have a fallback in
have_whole_vector_shift using normal permutations, this one doesn't need
one; that "fallback" is tried first, before vec_shl_optab.
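For illustration, with V4SImode the two canonical forms are (an assumed
example for a shift by one element, not taken verbatim from the patch):
       vec_shr:  VEC_PERM_EXPR <x, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }>
       vec_shl:  VEC_PERM_EXPR <{ 0, 0, 0, 0 }, x, { 0, 4, 5, 6 }>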

For the vec_shl_optab checks, the patch only handles vectors with a
constant number of elements; I'm not really sure whether our VECTOR_CST
encoding can express the left shifts in any way, nor whether SVE supports
them (I see aarch64 has vec_shl_insert, but that is just a fixed shift by
element bits and it shifts in a scalar rather than zeros).
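For reference, the shape of loop the tests exercise looks like the
following (a sketch modeled on the new vect-simd-* tests; the actual tests
vary the reduction operation and element type):

int r, a[1024], b[1024];

__attribute__((noipa)) void
foo (void)
{
  #pragma omp simd reduction (inscan, +:r)
  for (int i = 0; i < 1024; i++)
    {
      r += a[i];          /* inclusive scan accumulation */
      #pragma omp scan inclusive(r)
      b[i] = r;           /* each b[i] gets the prefix sum up to i */
    }
}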

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2019-06-19  Jakub Jelinek  <jakub@redhat.com>

	* doc/md.texi: Document vec_shl_<mode> pattern.
	* optabs.def (vec_shl_optab): New optab.
	* optabs.c (shift_amt_for_vec_perm_mask): Add shift_optab
	argument, if == vec_shl_optab, check for left whole vector shift
	pattern rather than right shift.
	(expand_vec_perm_const): Add vec_shl_optab support.
	* optabs-query.c (can_vec_perm_var_p): Mention also vec_shl optab
	in the comment.
	* tree-vect-generic.c (lower_vec_perm): Support permutations which
	can be handled by vec_shl_optab.
	* tree-vect-stmts.c (scan_store_can_perm_p): New function.
	(check_scan_store): Use it.
	(vectorizable_scan_store): If target can't do normal permutations,
	try to use whole vector left shifts and if needed a VEC_COND_EXPR
	after it.
	* config/i386/sse.md (vec_shl_<mode>): New expander.

	* gcc.dg/vect/vect-simd-8.c: If main is defined, don't include
	tree-vect.h nor call check_vect.
	* gcc.dg/vect/vect-simd-9.c: Likewise.
	* gcc.dg/vect/vect-simd-10.c: New test.
	* gcc.target/i386/sse2-vect-simd-8.c: New test.
	* gcc.target/i386/sse2-vect-simd-9.c: New test.
	* gcc.target/i386/sse2-vect-simd-10.c: New test.
	* gcc.target/i386/avx2-vect-simd-8.c: New test.
	* gcc.target/i386/avx2-vect-simd-9.c: New test.
	* gcc.target/i386/avx2-vect-simd-10.c: New test.
	* gcc.target/i386/avx512f-vect-simd-8.c: New test.
	* gcc.target/i386/avx512f-vect-simd-9.c: New test.
	* gcc.target/i386/avx512f-vect-simd-10.c: New test.


	Jakub

Comments

Richard Biener June 19, 2019, 9:02 a.m. UTC | #1
On June 19, 2019 10:55:16 AM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:
>[...]
>Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Ok. 

Richard. 

Richard Sandiford June 19, 2019, 9:05 a.m. UTC | #2
Richard Biener <rguenther@suse.de> writes:
> On June 19, 2019 10:55:16 AM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:
>>[...]
>>Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> Ok. 

I think it would be worth instead telling evpc that the second permute
vector is zero.  Permutes with a second vector of zero are somewhat
special for SVE too, and could be in other cases for other targets.

I thought the direction of travel was not to have optabs for specific
kinds of permute any more.  E.g. zip, unzip and blend are common
permutes too, but we decided not to commonise those.

Thanks,
Richard
Richard Biener June 19, 2019, 11:27 a.m. UTC | #3
On June 19, 2019 11:05:42 AM GMT+02:00, Richard Sandiford <richard.sandiford@arm.com> wrote:
>[...]
>
>I think it would be worth instead telling evpc that the second permute
>vector is zero.  Permutes with a second vector of zero are somewhat
>special for SVE too, and could be in other cases for other targets.
>
>I thought the direction of travel was not to have optabs for specific
>kinds of permute any more.  E.g. zip, unzip and blend are common
>permutes too, but we decided not to commonise those.

The issue is that vec_perm_const_ok doesn't have enough info here (second operand is all zeros). So the optab is needed to query target capabilities, not so much for the actual expansion which could go via vec_perm_const. 

I thought of having special permute vector entries to denote zero or don't-care, which would make it
possible to have this info and allow these permutes to be single vector permutes.  But then encoding this might be awkward.

Richard. 

Richard Sandiford June 19, 2019, 12:46 p.m. UTC | #4
Richard Biener <rguenther@suse.de> writes:
> On June 19, 2019 11:05:42 AM GMT+02:00, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>[...]
>>
>>I think it would be worth instead telling evpc that the second permute
>>vector is zero.  Permutes with a second vector of zero are somewhat
>>special for SVE too, and could be in other cases for other targets.
>>
>>I thought the direction of travel was not to have optabs for specific
>>kinds of permute any more.  E.g. zip, unzip and blend are common
>>permutes too, but we decided not to commonise those.
>
> The issue is that vec_perm_const_ok doesn't have enough info here
> (second operand is all zeros). So the optab is needed to query target
> capabilities, not so much for the actual expansion which could go via
> vec_perm_const.

Not at the moment, sure.  But my point was that we could add that
information instead of doing what the patch does, since the information
could be useful in other cases too.

These days all permute queries go through the same target hook as the
expansion: targetm.vec_perm_const.  So the target interface itself
should adapt naturally.

For can_vec_perm_const_p we could either add zeroness information
to vec_perm_indices or provide it separately (e.g. with tree inputs).
In the latter case the zeroness could be relayed to
targetm.vec_perm_const by passing zero rtxes even for queries.

> I thought of having special permute vector entries to denote zero or
> don't-care, which would make it possible to have this info and allow
> these permutes to be single vector permutes.  But then encoding this
> might be awkward.

ISTM similar to the way that we already pass down whether the two vector
inputs are equal.

The patch is adding code to places that would not be patched in the same
way if we already passed down information about zeroness.  So it feels
like we're adding back code that we already want to take out again at
some point, unless we change policy and allow specific optabs for
common permutes.  (I'd be fine with that FWIW.)

Thanks,
Richard
Jakub Jelinek June 19, 2019, 1:05 p.m. UTC | #5
On Wed, Jun 19, 2019 at 01:46:15PM +0100, Richard Sandiford wrote:
> For can_vec_perm_const_p we could either add zeroness information
> to vec_perm_indices or provide it separately (e.g. with tree inputs).
> In the latter case the zeroness could be relayed to
> targetm.vec_perm_const by passing zero rtxes even for queries.

Yeah, I'm not against doing this, it might clean stuff up.

But for a start I'd still use the vec_sh[lr]_optab under the hood
in can_vec_perm_const_p, and then we can gradually decide if we want
to convert the targets to use that information too in their target hooks
and whether we'll provide some helper routines for them or not.
vec_sh[rl]_<mode> is right now present in aarch64, alpha, i386, ia64, mips,
rs6000 and s390 backends.

> The patch is adding code to places that would not be patched in the same
> way if we already passed down information about zeroness.  So it feels

Note, I've already committed the patch, but it can be improved
incrementally.

> like we're adding back code that we already want to take out again at
> some point, unless we change policy and allow specific optabs for
> common permutes.  (I'd be fine with that FWIW.)

Well, that is not entirely true; the code that was added for it would
essentially need to be added either way - just to can_vec_perm_const_p
instead of tree-vect-generic.c if we still kept the optabs, or to each of
the above 7 backends if we propagated that info down to the target hooks.

	Jakub
Richard Biener June 19, 2019, 6:47 p.m. UTC | #6
On June 19, 2019 2:46:15 PM GMT+02:00, Richard Sandiford <richard.sandiford@arm.com> wrote:
>[...]
>
>The patch is adding code to places that would not be patched in the same
>way if we already passed down information about zeroness.  So it feels
>like we're adding back code that we already want to take out again at
>some point, unless we change policy and allow specific optabs for
>common permutes.  (I'd be fine with that FWIW.)

I don't care too much about this implementation detail, but in the end I'd like to have better costing for permutes, where I think the code would be simpler if we just had a single interface to the target.
IIRC we kept the right shift variant only because the vectorizer relied on it and we were too lazy to adjust the targets.

Richard. 


Patch

--- gcc/doc/md.texi.jj	2019-06-13 00:35:43.518942525 +0200
+++ gcc/doc/md.texi	2019-06-18 15:32:38.496629946 +0200
@@ -5454,6 +5454,14 @@  in operand 2.  Store the result in vecto
 0 and 1 have mode @var{m} and operand 2 has the mode appropriate for
 one element of @var{m}.
 
+@cindex @code{vec_shl_@var{m}} instruction pattern
+@item @samp{vec_shl_@var{m}}
+Whole vector left shift in bits, i.e.@: away from element 0.
+Operand 1 is a vector to be shifted.
+Operand 2 is an integer shift amount in bits.
+Operand 0 is where the resulting shifted vector is stored.
+The output and input vectors should have the same modes.
+
 @cindex @code{vec_shr_@var{m}} instruction pattern
 @item @samp{vec_shr_@var{m}}
 Whole vector right shift in bits, i.e.@: towards element 0.
--- gcc/optabs.def.jj	2019-02-11 11:38:08.263617017 +0100
+++ gcc/optabs.def	2019-06-18 14:56:57.934971410 +0200
@@ -348,6 +348,7 @@  OPTAB_D (vec_packu_float_optab, "vec_pac
 OPTAB_D (vec_perm_optab, "vec_perm$a")
 OPTAB_D (vec_realign_load_optab, "vec_realign_load_$a")
 OPTAB_D (vec_set_optab, "vec_set$a")
+OPTAB_D (vec_shl_optab, "vec_shl_$a")
 OPTAB_D (vec_shr_optab, "vec_shr_$a")
 OPTAB_D (vec_unpack_sfix_trunc_hi_optab, "vec_unpack_sfix_trunc_hi_$a")
 OPTAB_D (vec_unpack_sfix_trunc_lo_optab, "vec_unpack_sfix_trunc_lo_$a")
--- gcc/optabs.c.jj	2019-02-13 13:11:47.927612362 +0100
+++ gcc/optabs.c	2019-06-18 16:45:29.347895585 +0200
@@ -5444,19 +5444,45 @@  vector_compare_rtx (machine_mode cmp_mod
 }
 
 /* Check if vec_perm mask SEL is a constant equivalent to a shift of
-   the first vec_perm operand, assuming the second operand is a constant
-   vector of zeros.  Return the shift distance in bits if so, or NULL_RTX
-   if the vec_perm is not a shift.  MODE is the mode of the value being
-   shifted.  */
+   the first vec_perm operand, assuming the second operand (for a left
+   shift, the first operand) is a constant vector of zeros.  Return the
+   shift distance in bits if so, or NULL_RTX if the vec_perm is not a
+   shift.  MODE is the mode of the value being shifted.  SHIFT_OPTAB is
+   vec_shr_optab for a right shift or vec_shl_optab for a left shift.  */
 static rtx
-shift_amt_for_vec_perm_mask (machine_mode mode, const vec_perm_indices &sel)
+shift_amt_for_vec_perm_mask (machine_mode mode, const vec_perm_indices &sel,
+			     optab shift_optab)
 {
   unsigned int bitsize = GET_MODE_UNIT_BITSIZE (mode);
   poly_int64 first = sel[0];
   if (maybe_ge (sel[0], GET_MODE_NUNITS (mode)))
     return NULL_RTX;
 
-  if (!sel.series_p (0, 1, first, 1))
+  if (shift_optab == vec_shl_optab)
+    {
+      unsigned int nelt;
+      if (!GET_MODE_NUNITS (mode).is_constant (&nelt))
+	return NULL_RTX;
+      unsigned firstidx = 0;
+      for (unsigned int i = 0; i < nelt; i++)
+	{
+	  if (known_eq (sel[i], nelt))
+	    {
+	      if (i == 0 || firstidx)
+		return NULL_RTX;
+	      firstidx = i;
+	    }
+	  else if (firstidx
+		   ? maybe_ne (sel[i], nelt + i - firstidx)
+		   : maybe_ge (sel[i], nelt))
+	    return NULL_RTX;
+	}
+
+      if (firstidx == 0)
+	return NULL_RTX;
+      first = firstidx;
+    }
+  else if (!sel.series_p (0, 1, first, 1))
     {
       unsigned int nelt;
       if (!GET_MODE_NUNITS (mode).is_constant (&nelt))
@@ -5544,25 +5570,37 @@  expand_vec_perm_const (machine_mode mode
      target instruction.  */
   vec_perm_indices indices (sel, 2, GET_MODE_NUNITS (mode));
 
-  /* See if this can be handled with a vec_shr.  We only do this if the
-     second vector is all zeroes.  */
-  insn_code shift_code = optab_handler (vec_shr_optab, mode);
-  insn_code shift_code_qi = ((qimode != VOIDmode && qimode != mode)
-			     ? optab_handler (vec_shr_optab, qimode)
-			     : CODE_FOR_nothing);
-
-  if (v1 == CONST0_RTX (GET_MODE (v1))
-      && (shift_code != CODE_FOR_nothing
-	  || shift_code_qi != CODE_FOR_nothing))
+  /* See if this can be handled with a vec_shr or vec_shl.  We only do this
+     if the second (for vec_shr) or first (for vec_shl) vector is all
+     zeroes.  */
+  insn_code shift_code = CODE_FOR_nothing;
+  insn_code shift_code_qi = CODE_FOR_nothing;
+  optab shift_optab = unknown_optab;
+  rtx v2 = v0;
+  if (v1 == CONST0_RTX (GET_MODE (v1)))
+    shift_optab = vec_shr_optab;
+  else if (v0 == CONST0_RTX (GET_MODE (v0)))
+    {
+      shift_optab = vec_shl_optab;
+      v2 = v1;
+    }
+  if (shift_optab != unknown_optab)
+    {
+      shift_code = optab_handler (shift_optab, mode);
+      shift_code_qi = ((qimode != VOIDmode && qimode != mode)
+		       ? optab_handler (shift_optab, qimode)
+		       : CODE_FOR_nothing);
+    }
+  if (shift_code != CODE_FOR_nothing || shift_code_qi != CODE_FOR_nothing)
     {
-      rtx shift_amt = shift_amt_for_vec_perm_mask (mode, indices);
+      rtx shift_amt = shift_amt_for_vec_perm_mask (mode, indices, shift_optab);
       if (shift_amt)
 	{
 	  struct expand_operand ops[3];
 	  if (shift_code != CODE_FOR_nothing)
 	    {
 	      create_output_operand (&ops[0], target, mode);
-	      create_input_operand (&ops[1], v0, mode);
+	      create_input_operand (&ops[1], v2, mode);
 	      create_convert_operand_from_type (&ops[2], shift_amt, sizetype);
 	      if (maybe_expand_insn (shift_code, 3, ops))
 		return ops[0].value;
@@ -5571,7 +5609,7 @@  expand_vec_perm_const (machine_mode mode
 	    {
 	      rtx tmp = gen_reg_rtx (qimode);
 	      create_output_operand (&ops[0], tmp, qimode);
-	      create_input_operand (&ops[1], gen_lowpart (qimode, v0), qimode);
+	      create_input_operand (&ops[1], gen_lowpart (qimode, v2), qimode);
 	      create_convert_operand_from_type (&ops[2], shift_amt, sizetype);
 	      if (maybe_expand_insn (shift_code_qi, 3, ops))
 		return gen_lowpart (mode, ops[0].value);
--- gcc/optabs-query.c.jj	2019-05-20 11:40:16.691121967 +0200
+++ gcc/optabs-query.c	2019-06-18 15:26:53.028980804 +0200
@@ -415,8 +415,9 @@  can_vec_perm_var_p (machine_mode mode)
    permute (if the target supports that).
 
    Note that additional permutations representing whole-vector shifts may
-   also be handled via the vec_shr optab, but only where the second input
-   vector is entirely constant zeroes; this case is not dealt with here.  */
+   also be handled via the vec_shr or vec_shl optab, but only where the
+   second (for vec_shr) or first (for vec_shl) input vector is entirely
+   constant zeroes; this case is not dealt with here.  */
 
 bool
 can_vec_perm_const_p (machine_mode mode, const vec_perm_indices &sel,
--- gcc/tree-vect-generic.c.jj	2019-01-07 09:47:32.988518893 +0100
+++ gcc/tree-vect-generic.c	2019-06-18 16:35:29.033319526 +0200
@@ -1367,6 +1367,32 @@  lower_vec_perm (gimple_stmt_iterator *gs
 	      return;
 	    }
 	}
+      /* And similarly vec_shl pattern.  */
+      if (optab_handler (vec_shl_optab, TYPE_MODE (vect_type))
+	  != CODE_FOR_nothing
+	  && TREE_CODE (vec0) == VECTOR_CST
+	  && initializer_zerop (vec0))
+	{
+	  unsigned int first = 0;
+	  for (i = 0; i < elements; ++i)
+	    if (known_eq (poly_uint64 (indices[i]), elements))
+	      {
+		if (i == 0 || first)
+		  break;
+		first = i;
+	      }
+	    else if (first
+		     ? maybe_ne (poly_uint64 (indices[i]),
+				 elements + i - first)
+		     : maybe_ge (poly_uint64 (indices[i]), elements))
+	      break;
+	  if (i == elements)
+	    {
+	      gimple_assign_set_rhs3 (stmt, mask);
+	      update_stmt (stmt);
+	      return;
+	    }
+	}
     }
   else if (can_vec_perm_var_p (TYPE_MODE (vect_type)))
     return;
--- gcc/tree-vect-stmts.c.jj	2019-06-17 23:18:53.620850072 +0200
+++ gcc/tree-vect-stmts.c	2019-06-18 17:43:27.484350807 +0200
@@ -6356,6 +6356,71 @@  scan_operand_equal_p (tree ref1, tree re
 
-/* Function check_scan_store.
+/* Function scan_store_can_perm_p.
 
+   Verify if we can perform the needed permutations or whole vector shifts.
+   Return -1 on failure, otherwise exact log2 of vectype's nunits.  */
+
+static int
+scan_store_can_perm_p (tree vectype, tree init, int *use_whole_vector_p = NULL)
+{
+  enum machine_mode vec_mode = TYPE_MODE (vectype);
+  unsigned HOST_WIDE_INT nunits;
+  if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits))
+    return -1;
+  int units_log2 = exact_log2 (nunits);
+  if (units_log2 <= 0)
+    return -1;
+
+  int i;
+  for (i = 0; i <= units_log2; ++i)
+    {
+      unsigned HOST_WIDE_INT j, k;
+      vec_perm_builder sel (nunits, nunits, 1);
+      sel.quick_grow (nunits);
+      if (i == 0)
+	{
+	  for (j = 0; j < nunits; ++j)
+	    sel[j] = nunits - 1;
+	}
+      else
+	{
+	  for (j = 0; j < (HOST_WIDE_INT_1U << (i - 1)); ++j)
+	    sel[j] = j;
+	  for (k = 0; j < nunits; ++j, ++k)
+	    sel[j] = nunits + k;
+	}
+      vec_perm_indices indices (sel, i == 0 ? 1 : 2, nunits);
+      if (!can_vec_perm_const_p (vec_mode, indices))
+	break;
+    }
+
+  if (i == 0)
+    return -1;
+
+  if (i <= units_log2)
+    {
+      if (optab_handler (vec_shl_optab, vec_mode) == CODE_FOR_nothing)
+	return -1;
+      int kind = 1;
+      /* Whole vector shifts shift in zeros, so if init is all zero constant,
+	 there is no need to do anything further.  */
+      if ((TREE_CODE (init) != INTEGER_CST
+	   && TREE_CODE (init) != REAL_CST)
+	  || !initializer_zerop (init))
+	{
+	  tree masktype = build_same_sized_truth_vector_type (vectype);
+	  if (!expand_vec_cond_expr_p (vectype, masktype, VECTOR_CST))
+	    return -1;
+	  kind = 2;
+	}
+      if (use_whole_vector_p)
+	*use_whole_vector_p = kind;
+    }
+  return units_log2;
+}
+
+
+/* Function check_scan_store.
+
    Check magic stores for #pragma omp scan {in,ex}clusive reductions.  */
 
 static bool
@@ -6596,34 +6661,9 @@  check_scan_store (stmt_vec_info stmt_inf
   if (!optab || optab_handler (optab, vec_mode) == CODE_FOR_nothing)
     goto fail;
 
-  unsigned HOST_WIDE_INT nunits;
-  if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits))
+  int units_log2 = scan_store_can_perm_p (vectype, *init);
+  if (units_log2 == -1)
     goto fail;
-  int units_log2 = exact_log2 (nunits);
-  if (units_log2 <= 0)
-    goto fail;
-
-  for (int i = 0; i <= units_log2; ++i)
-    {
-      unsigned HOST_WIDE_INT j, k;
-      vec_perm_builder sel (nunits, nunits, 1);
-      sel.quick_grow (nunits);
-      if (i == units_log2)
-	{
-	  for (j = 0; j < nunits; ++j)
-	    sel[j] = nunits - 1;
-	}
-      else
-	{
-	  for (j = 0; j < (HOST_WIDE_INT_1U << i); ++j)
-	    sel[j] = nunits + j;
-	  for (k = 0; j < nunits; ++j, ++k)
-	    sel[j] = k;
-	}
-      vec_perm_indices indices (sel, i == units_log2 ? 1 : 2, nunits);
-      if (!can_vec_perm_const_p (vec_mode, indices))
-	goto fail;
-    }
 
   return true;
 }
@@ -6686,7 +6726,8 @@  vectorizable_scan_store (stmt_vec_info s
   unsigned HOST_WIDE_INT nunits;
   if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits))
     gcc_unreachable ();
-  int units_log2 = exact_log2 (nunits);
+  int use_whole_vector_p = 0;
+  int units_log2 = scan_store_can_perm_p (vectype, *init, &use_whole_vector_p);
   gcc_assert (units_log2 > 0);
   auto_vec<tree, 16> perms;
   perms.quick_grow (units_log2 + 1);
@@ -6696,21 +6737,25 @@  vectorizable_scan_store (stmt_vec_info s
       vec_perm_builder sel (nunits, nunits, 1);
       sel.quick_grow (nunits);
       if (i == units_log2)
-	{
-	  for (j = 0; j < nunits; ++j)
-	    sel[j] = nunits - 1;
-	}
-      else
-	{
-	  for (j = 0; j < (HOST_WIDE_INT_1U << i); ++j)
-	    sel[j] = nunits + j;
-	  for (k = 0; j < nunits; ++j, ++k)
-	    sel[j] = k;
-	}
+	for (j = 0; j < nunits; ++j)
+	  sel[j] = nunits - 1;
+      else
+	{
+	  for (j = 0; j < (HOST_WIDE_INT_1U << i); ++j)
+	    sel[j] = j;
+	  for (k = 0; j < nunits; ++j, ++k)
+	    sel[j] = nunits + k;
+	}
       vec_perm_indices indices (sel, i == units_log2 ? 1 : 2, nunits);
-      perms[i] = vect_gen_perm_mask_checked (vectype, indices);
+      if (use_whole_vector_p && i < units_log2)
+	perms[i] = vect_gen_perm_mask_any (vectype, indices);
+      else
+	perms[i] = vect_gen_perm_mask_checked (vectype, indices);
     }
 
+  tree zero_vec = use_whole_vector_p ? build_zero_cst (vectype) : NULL_TREE;
+  tree masktype = (use_whole_vector_p == 2
+		   ? build_same_sized_truth_vector_type (vectype) : NULL_TREE);
   stmt_vec_info prev_stmt_info = NULL;
   tree vec_oprnd1 = NULL_TREE;
   tree vec_oprnd2 = NULL_TREE;
@@ -6742,8 +6787,9 @@  vectorizable_scan_store (stmt_vec_info s
       for (int i = 0; i < units_log2; ++i)
 	{
 	  tree new_temp = make_ssa_name (vectype);
-	  gimple *g = gimple_build_assign (new_temp, VEC_PERM_EXPR, v,
-					   vec_oprnd1, perms[i]);
+	  gimple *g = gimple_build_assign (new_temp, VEC_PERM_EXPR,
+					   zero_vec ? zero_vec : vec_oprnd1, v,
+					   perms[i]);
 	  new_stmt_info = vect_finish_stmt_generation (stmt_info, g, gsi);
 	  if (prev_stmt_info == NULL)
 	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt_info;
@@ -6751,6 +6797,25 @@  vectorizable_scan_store (stmt_vec_info s
 	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt_info;
 	  prev_stmt_info = new_stmt_info;
 
+	  if (use_whole_vector_p == 2)
+	    {
+	      /* Whole vector shift shifted in zero bits, but if *init
+		 is not initializer_zerop, we need to replace those elements
+		 with elements from vec_oprnd1.  */
+	      tree_vector_builder vb (masktype, nunits, 1);
+	      for (unsigned HOST_WIDE_INT k = 0; k < nunits; ++k)
+		vb.quick_push (k < (HOST_WIDE_INT_1U << i)
+			       ? boolean_false_node : boolean_true_node);
+
+	      tree new_temp2 = make_ssa_name (vectype);
+	      g = gimple_build_assign (new_temp2, VEC_COND_EXPR, vb.build (),
+				       new_temp, vec_oprnd1);
+	      new_stmt_info = vect_finish_stmt_generation (stmt_info, g, gsi);
+	      STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt_info;
+	      prev_stmt_info = new_stmt_info;
+	      new_temp = new_temp2;
+	    }
+
 	  tree new_temp2 = make_ssa_name (vectype);
 	  g = gimple_build_assign (new_temp2, code, v, new_temp);
 	  new_stmt_info = vect_finish_stmt_generation (stmt_info, g, gsi);
--- gcc/config/i386/sse.md.jj	2019-06-17 23:18:26.821267440 +0200
+++ gcc/config/i386/sse.md	2019-06-18 15:37:28.342043528 +0200
@@ -11758,6 +11758,19 @@  (define_insn "<shift_insn><mode>3<mask_n
    (set_attr "mode" "<sseinsnmode>")])
 
 
+(define_expand "vec_shl_<mode>"
+  [(set (match_dup 3)
+	(ashift:V1TI
+	 (match_operand:VI_128 1 "register_operand")
+	 (match_operand:SI 2 "const_0_to_255_mul_8_operand")))
+   (set (match_operand:VI_128 0 "register_operand") (match_dup 4))]
+  "TARGET_SSE2"
+{
+  operands[1] = gen_lowpart (V1TImode, operands[1]);
+  operands[3] = gen_reg_rtx (V1TImode);
+  operands[4] = gen_lowpart (<MODE>mode, operands[3]);
+})
+
 (define_expand "vec_shr_<mode>"
   [(set (match_dup 3)
 	(lshiftrt:V1TI
--- gcc/testsuite/gcc.dg/vect/vect-simd-8.c.jj	2019-06-17 23:18:53.621850057 +0200
+++ gcc/testsuite/gcc.dg/vect/vect-simd-8.c	2019-06-18 18:02:09.428798006 +0200
@@ -3,7 +3,9 @@ 
 /* { dg-additional-options "-mavx" { target avx_runtime } } */
 /* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" { target i?86-*-* x86_64-*-* } } } */
 
+#ifndef main
 #include "tree-vect.h"
+#endif
 
 int r, a[1024], b[1024];
 
@@ -63,7 +65,9 @@  int
 main ()
 {
   int s = 0;
+#ifndef main
   check_vect ();
+#endif
   for (int i = 0; i < 1024; ++i)
     {
       a[i] = i;
--- gcc/testsuite/gcc.dg/vect/vect-simd-9.c.jj	2019-06-17 23:18:53.621850057 +0200
+++ gcc/testsuite/gcc.dg/vect/vect-simd-9.c	2019-06-18 18:02:34.649406773 +0200
@@ -3,7 +3,9 @@ 
 /* { dg-additional-options "-mavx" { target avx_runtime } } */
 /* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" { target i?86-*-* x86_64-*-* } } } */
 
+#ifndef main
 #include "tree-vect.h"
+#endif
 
 int r, a[1024], b[1024];
 
@@ -65,7 +67,9 @@  int
 main ()
 {
   int s = 0;
+#ifndef main
   check_vect ();
+#endif
   for (int i = 0; i < 1024; ++i)
     {
       a[i] = i;
--- gcc/testsuite/gcc.dg/vect/vect-simd-10.c.jj	2019-06-18 18:37:30.742838613 +0200
+++ gcc/testsuite/gcc.dg/vect/vect-simd-10.c	2019-06-18 19:44:20.614082076 +0200
@@ -0,0 +1,96 @@ 
+/* { dg-require-effective-target size32plus } */
+/* { dg-additional-options "-fopenmp-simd" } */
+/* { dg-additional-options "-mavx" { target avx_runtime } } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" { target i?86-*-* x86_64-*-* } } } */
+
+#ifndef main
+#include "tree-vect.h"
+#endif
+
+float r = 1.0f, a[1024], b[1024];
+
+__attribute__((noipa)) void
+foo (float *a, float *b)
+{
+  #pragma omp simd reduction (inscan, *:r)
+  for (int i = 0; i < 1024; i++)
+    {
+      r *= a[i];
+      #pragma omp scan inclusive(r)
+      b[i] = r;
+    }
+}
+
+__attribute__((noipa)) float
+bar (void)
+{
+  float s = -__builtin_inff ();
+  #pragma omp simd reduction (inscan, max:s)
+  for (int i = 0; i < 1024; i++)
+    {
+      s = s > a[i] ? s : a[i];
+      #pragma omp scan inclusive(s)
+      b[i] = s;
+    }
+  return s;
+}
+
+int
+main ()
+{
+  float s = 1.0f;
+#ifndef main
+  check_vect ();
+#endif
+  for (int i = 0; i < 1024; ++i)
+    {
+      if (i < 80)
+	a[i] = (i & 1) ? 0.25f : 0.5f;
+      else if (i < 200)
+	a[i] = (i % 3) == 0 ? 2.0f : (i % 3) == 1 ? 4.0f : 1.0f;
+      else if (i < 280)
+	a[i] = (i & 1) ? 0.25f : 0.5f;
+      else if (i < 380)
+	a[i] = (i % 3) == 0 ? 2.0f : (i % 3) == 1 ? 4.0f : 1.0f;
+      else
+	switch (i % 6)
+	  {
+	  case 0: a[i] = 0.25f; break;
+	  case 1: a[i] = 2.0f; break;
+	  case 2: a[i] = -1.0f; break;
+	  case 3: a[i] = -4.0f; break;
+	  case 4: a[i] = 0.5f; break;
+	  case 5: a[i] = 1.0f; break;
+	  default: a[i] = 0.0f; break;
+	  }
+      b[i] = -19.0f;
+      asm ("" : "+g" (i));
+    }
+  foo (a, b);
+  if (r * 16384.0f != 0.125f)
+    abort ();
+  float m = -175.25f;
+  for (int i = 0; i < 1024; ++i)
+    {
+      s *= a[i];
+      if (b[i] != s)
+	abort ();
+      else
+	{
+	  a[i] = m - ((i % 3) == 1 ? 2.0f : (i % 3) == 2 ? 4.0f : 0.0f);
+	  b[i] = -231.75f;
+	  m += 0.75f;
+	}
+    }
+  if (bar () != 592.0f)
+    abort ();
+  s = -__builtin_inff ();
+  for (int i = 0; i < 1024; ++i)
+    {
+      if (s < a[i])
+	s = a[i];
+      if (b[i] != s)
+	abort ();
+    }
+  return 0;
+}
--- gcc/testsuite/gcc.target/i386/sse2-vect-simd-8.c.jj	2019-06-18 17:59:27.182314827 +0200
+++ gcc/testsuite/gcc.target/i386/sse2-vect-simd-8.c	2019-06-18 18:19:48.417341734 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -msse2 -mno-sse3 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target sse2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "sse2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-8.c"
+
+static void
+sse2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/sse2-vect-simd-9.c.jj	2019-06-18 18:03:30.174545446 +0200
+++ gcc/testsuite/gcc.target/i386/sse2-vect-simd-9.c	2019-06-18 18:20:05.770072628 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -msse2 -mno-sse3 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target sse2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "sse2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-9.c"
+
+static void
+sse2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/sse2-vect-simd-10.c.jj	2019-06-18 19:46:09.015410603 +0200
+++ gcc/testsuite/gcc.target/i386/sse2-vect-simd-10.c	2019-06-18 19:50:31.621361409 +0200
@@ -0,0 +1,15 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -msse2 -mno-sse3 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target sse2 } */
+
+#include "sse2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-10.c"
+
+static void
+sse2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-vect-simd-8.c.jj	2019-06-18 17:59:27.182314827 +0200
+++ gcc/testsuite/gcc.target/i386/avx2-vect-simd-8.c	2019-06-18 18:19:40.310467451 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx2 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-8.c"
+
+static void
+avx2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-vect-simd-9.c.jj	2019-06-18 18:03:30.174545446 +0200
+++ gcc/testsuite/gcc.target/i386/avx2-vect-simd-9.c	2019-06-18 18:19:56.479216712 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx2 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-9.c"
+
+static void
+avx2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx2-vect-simd-10.c.jj	2019-06-18 19:50:47.692113611 +0200
+++ gcc/testsuite/gcc.target/i386/avx2-vect-simd-10.c	2019-06-18 19:50:56.180982721 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx2 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx2-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-10.c"
+
+static void
+avx2_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx512f-vect-simd-8.c.jj	2019-06-18 17:59:27.182314827 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f-vect-simd-8.c	2019-06-18 18:19:44.364404586 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx512f-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-8.c"
+
+static void
+avx512f_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx512f-vect-simd-9.c.jj	2019-06-18 18:03:30.174545446 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f-vect-simd-9.c	2019-06-18 18:20:00.884148400 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx512f-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-9.c"
+
+static void
+avx512f_test (void)
+{
+  do_main ();
+}
--- gcc/testsuite/gcc.target/i386/avx512f-vect-simd-10.c.jj	2019-06-18 19:51:12.309734025 +0200
+++ gcc/testsuite/gcc.target/i386/avx512f-vect-simd-10.c	2019-06-18 19:51:18.285641883 +0200
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -fopenmp-simd -mavx512f -mprefer-vector-width=512 -fdump-tree-vect-details" } */
+/* { dg-require-effective-target avx512f } */
+/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 2 "vect" } } */
+
+#include "avx512f-check.h"
+
+#define main() do_main ()
+
+#include "../../gcc.dg/vect/vect-simd-10.c"
+
+static void
+avx512f_test (void)
+{
+  do_main ();
+}