
[RFC] Fix PR90510, VEC_PERM -> BIT_INSERT folding

Message ID alpine.LSU.2.20.1905171531490.10704@zhemvz.fhfr.qr
State New
Series [RFC] Fix PR90510, VEC_PERM -> BIT_INSERT folding

Commit Message

Richard Biener May 17, 2019, 1:36 p.m. UTC
This adds, incrementally on top of moving VEC_PERM_EXPR folding
to match.pd (but not as an incremental patch - sorry...), folding
of single-element insert permutations to BIT_INSERT_EXPR.
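
For the PR90510 testcase this turns (in GIMPLE terms, roughly)

  tem_3 = x[0] + y[0];
  _4 = { tem_3, tem_3 };
  _5 = VEC_PERM_EXPR <x, _4, { 2, 1 }>;

into

  _5 = BIT_INSERT_EXPR <x, tem_3, 0>;

i.e. element zero of x is replaced by tem_3.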

Things are quite awkward with the new poly-int vec-perm stuff,
so effectively this doesn't work for SVE and is still very
ugly.  I wonder how to make it generic enough that SVE would
be happy and/or how to make the code prettier.

I also can't find a helper to read a specific vector element
from a VECTOR_CST/CONSTRUCTOR (can I even do that "generally"
with a poly-int index?!), but there surely must be one.

Richard - any comments / hints?  Look at the match.pd
change next to /* See if the permutation is performing a single elem

Thanks,
Richard.

Comments

Richard Sandiford May 19, 2019, 7:42 a.m. UTC | #1
Richard Biener <rguenther@suse.de> writes:
> This adds, incrementally on top of moving VEC_PERM_EXPR folding
> to match.pd (but not as an incremental patch - sorry...), folding
> of single-element insert permutations to BIT_INSERT_EXPR.
>
> Things are quite awkward with the new poly-int vec-perm stuff,
> so effectively this doesn't work for SVE and is still very
> ugly.  I wonder how to make it generic enough that SVE would
> be happy and/or how to make the code prettier.
>
> I also can't find a helper to read a specific vector element
> from a VECTOR_CST/CONSTRUCTOR (can I even do that "generally"
> with a poly-int index?!), but there surely must be one.

Yeah, would be nice to have a helper that handles both VECTOR_CST
and CONSTRUCTOR, even if just for constant indices.

CONSTRUCTORs are still very much fixed-length, so it wouldn't really
make sense to fold a poly_int index at compile time.  The only case we
could handle is when the index is known to be beyond the last nonzero
element.

We could fold some poly_int VECTOR_CST_ELT indices to poly_ints, but
it'd depend on the values involved.

Indexing specific poly_int elements of a single vector is a bit dubious
in length-agnostic code though.  Not saying it'll never be useful, but
it's certainly not a native SVE operation.  So I think even for SVE,
constant VECTOR_CST/CONSTRUCTOR elements are the only interesting case
for now.
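
(A hypothetical sketch of that one foldable CONSTRUCTOR case - the
helper below doesn't exist today:

   /* Fold a read from vector CONSTRUCTOR CTOR at (possibly poly_int)
      index IDX when IDX is known to be past the last written element;
      trailing elements of a vector CONSTRUCTOR are implicitly zero.  */
   tree
   fold_read_from_ctor_tail (tree ctor, poly_uint64 idx)
   {
     if (TREE_CODE (ctor) == CONSTRUCTOR
	 && known_ge (idx, CONSTRUCTOR_NELTS (ctor))
	 && known_lt (idx, TYPE_VECTOR_SUBPARTS (TREE_TYPE (ctor))))
       return build_zero_cst (TREE_TYPE (TREE_TYPE (ctor)));
     return NULL_TREE;
   }

...which would work even when IDX isn't a compile-time constant.)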

> [...]
> @@ -11774,61 +11777,7 @@ fold_ternary_loc (location_t loc, enum t
>  	  poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
>  	  bool single_arg = (op0 == op1);
>  	  vec_perm_indices sel (builder, single_arg ? 1 : 2, nelts);
> -
> -	  /* Check for cases that fold to OP0 or OP1 in their original
> -	     element order.  */
> -	  if (sel.series_p (0, 1, 0, 1))
> -	    return op0;
> -	  if (sel.series_p (0, 1, nelts, 1))
> -	    return op1;
> -
> -	  if (!single_arg)
> -	    {
> -	      if (sel.all_from_input_p (0))
> -		op1 = op0;
> -	      else if (sel.all_from_input_p (1))
> -		{
> -		  op0 = op1;
> -		  sel.rotate_inputs (1);
> -		}
> -	    }

Since this isn't an incremental patch... :-)

One of the holes of the current code is that we still allow two
permute indices for the same permutation, e.g.  { 4, 1, 2, 3 } and
{ 0, 5, 6, 7 }.  IMO we should canonicalize it so that the first index
selects from the first vector.  So maybe the above should be:

	  if (!single_arg)
	    {
	      if (known_ge (sel[0], nelts))
		{
		  std::swap (op0, op1);
		  sel.rotate_inputs (1);
		}
	      if (sel.all_from_input_p (0))
		op1 = op0;
	    }
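
E.g. with that canonicalization the two equivalent forms above collapse
into one (for four-element vectors):

   VEC_PERM_EXPR <a, b, { 4, 1, 2, 3 }>
     --> VEC_PERM_EXPR <b, a, { 0, 5, 6, 7 }>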

> [...]
> +	/* See if the permutation is performing a single element
> +	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
> +	   in that case.  */
> +	unsigned HOST_WIDE_INT cnelts;
> +        if ((TREE_CODE (cop0) == VECTOR_CST
> +	     || TREE_CODE (cop0) == CONSTRUCTOR
> +	     || TREE_CODE (cop1) == VECTOR_CST
> +	     || TREE_CODE (cop1) == CONSTRUCTOR)
> +	    && nelts.is_constant (&cnelts))
> +          {
> +	    unsigned first = 0, first_oo = 0, first_i;
> +	    unsigned second = 0, second_oo = 0, second_i;
> +	    HOST_WIDE_INT idx;
> +	    for (unsigned HOST_WIDE_INT i = 0; i < cnelts; ++i)
> +	      if (!sel[i].is_constant (&idx))
> +	        {
> +		  first = second = 0;
> +		  break;
> +		}
> +	      else if ((unsigned HOST_WIDE_INT)idx < cnelts)
> +	        {
> +		  first_i = i;
> +		  first++;
> +		  first_oo += (unsigned HOST_WIDE_INT)idx != i;
> +		}
> +	      else
> +	        {
> +		  second_i = i;
> +	          second++;
> +		  second_oo += (unsigned HOST_WIDE_INT)idx != i + cnelts;
> +		}

This won't handle the case in which the inserted element comes from
the same vector.

If we add the extra canonicalization above, we'd only ever be inserting
into the second vector at element 0.   The test for that would be:

   if (sel.series_p (1, 1, nelts + 1, 1))
     // insert sel[0] into index 0 of the second vector

I think the SVE-friendly way of doing the check for the first vector
would be:

   unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
   unsigned int i = 0;
   for (; i < encoded_nelts; ++i)
     if (maybe_ne (sel[i], i))
       break;
   if (i < encoded_nelts && sel.series_p (i + 1, 1, i + 1, 1))
     // insert sel[i] into index i of the first vector

Although that's two searches, one of them will halt very early.
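
E.g. with nelts == 4 and sel = { 0, 1, 6, 3 } the loop stops at i == 2,
sel.series_p (3, 1, 3, 1) holds, and we'd insert sel[2] (element 2 of
the second vector) at index 2 of the first vector.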

The SVE-friendly way of getting the VECTOR_CST/CONSTRUCTOR element would be:

  poly_uint64 idx = sel[i];
  unsigned HOST_WIDE_INT const_idx;
   if (known_lt (idx, nelts) && idx.is_constant (&const_idx))
    // const_idx from first vector
  else if (known_ge (idx, nelts) && (idx - nelts).is_constant (&const_idx))
    // const_idx from second vector
  else
    // bail out

...which is a bit ugly, I admit.
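
(Continuing the example: idx == 6 with nelts == 4 takes the known_ge
branch and gives const_idx == 2, read from the second vector.)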

Thanks,
Richard
Richard Biener May 20, 2019, 11:36 a.m. UTC | #2
On Sun, 19 May 2019, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > This adds, incrementally on top of moving VEC_PERM_EXPR folding
> > to match.pd (but not as an incremental patch - sorry...), folding
> > of single-element insert permutations to BIT_INSERT_EXPR.
> >
> > Things are quite awkward with the new poly-int vec-perm stuff,
> > so effectively this doesn't work for SVE and is still very
> > ugly.  I wonder how to make it generic enough that SVE would
> > be happy and/or how to make the code prettier.
> >
> > I also can't find a helper to read a specific vector element
> > from a VECTOR_CST/CONSTRUCTOR (can I even do that "generally"
> > with a poly-int index?!), but there surely must be one.
> 
> Yeah, would be nice to have a helper that handles both VECTOR_CST
> and CONSTRUCTOR, even if just for constant indices.
> 
> CONSTRUCTORs are still very much fixed-length, so it wouldn't really
> make sense to fold a poly_int index at compile time.  The only case we
> could handle is when the index is known to be beyond the last nonzero
> element.
> 
> We could fold some poly_int VECTOR_CST_ELT indices to poly_ints, but
> it'd depend on the values involved.
> 
> Indexing specific poly_int elements of a single vector is a bit dubious
> in length-agnostic code though.  Not saying it'll never be useful, but
> it's certainly not a native SVE operation.  So I think even for SVE,
> constant VECTOR_CST/CONSTRUCTOR elements are the only interesting case
> for now.
> 
> > [...]
> > @@ -11774,61 +11777,7 @@ fold_ternary_loc (location_t loc, enum t
> >  	  poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
> >  	  bool single_arg = (op0 == op1);
> >  	  vec_perm_indices sel (builder, single_arg ? 1 : 2, nelts);
> > -
> > -	  /* Check for cases that fold to OP0 or OP1 in their original
> > -	     element order.  */
> > -	  if (sel.series_p (0, 1, 0, 1))
> > -	    return op0;
> > -	  if (sel.series_p (0, 1, nelts, 1))
> > -	    return op1;
> > -
> > -	  if (!single_arg)
> > -	    {
> > -	      if (sel.all_from_input_p (0))
> > -		op1 = op0;
> > -	      else if (sel.all_from_input_p (1))
> > -		{
> > -		  op0 = op1;
> > -		  sel.rotate_inputs (1);
> > -		}
> > -	    }
> 
> Since this isn't an incremental patch... :-)
> 
> One of the holes of the current code is that we still allow two
> permute indices for the same permutation, e.g.  { 4, 1, 2, 3 } and
> { 0, 5, 6, 7 }.  IMO we should canonicalize it so that the first index
> selects from the first vector.  So maybe the above should be:
> 
> 	  if (!single_arg)
> 	    {
> 	      if (known_ge (sel[0], nelts))
> 		{
> 		  std::swap (op0, op1);
> 		  sel.rotate_inputs (1);
> 		}
> 	      if (sel.all_from_input_p (0))
> 		op1 = op0;
> 	    }
> 
> > [...]
> > +	/* See if the permutation is performing a single element
> > +	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
> > +	   in that case.  */
> > +	unsigned HOST_WIDE_INT cnelts;
> > +        if ((TREE_CODE (cop0) == VECTOR_CST
> > +	     || TREE_CODE (cop0) == CONSTRUCTOR
> > +	     || TREE_CODE (cop1) == VECTOR_CST
> > +	     || TREE_CODE (cop1) == CONSTRUCTOR)
> > +	    && nelts.is_constant (&cnelts))
> > +          {
> > +	    unsigned first = 0, first_oo = 0, first_i;
> > +	    unsigned second = 0, second_oo = 0, second_i;
> > +	    HOST_WIDE_INT idx;
> > +	    for (unsigned HOST_WIDE_INT i = 0; i < cnelts; ++i)
> > +	      if (!sel[i].is_constant (&idx))
> > +	        {
> > +		  first = second = 0;
> > +		  break;
> > +		}
> > +	      else if ((unsigned HOST_WIDE_INT)idx < cnelts)
> > +	        {
> > +		  first_i = i;
> > +		  first++;
> > +		  first_oo += (unsigned HOST_WIDE_INT)idx != i;
> > +		}
> > +	      else
> > +	        {
> > +		  second_i = i;
> > +	          second++;
> > +		  second_oo += (unsigned HOST_WIDE_INT)idx != i + cnelts;
> > +		}
> 
> This won't handle the case in which the inserted element comes from
> the same vector.
> 
> If we add the extra canonicalization above, we'd only ever be inserting
> into the second vector at element 0.   The test for that would be:
> 
>    if (sel.series_p (1, 1, nelts + 1, 1))
>      // insert sel[0] into index 0 of the second vector
> 
> I think the SVE-friendly way of doing the check for the first vector
> would be:
> 
>    unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
>    unsigned int i = 0;
>    for (; i < encoded_nelts; ++i)
>      if (maybe_ne (sel[i], i))
>        break;
>    if (i < encoded_nelts && sel.series_p (i + 1, 1, i + 1, 1))
>      // insert sel[i] into index i of the first vector
> 
> Although that's two searches, one of them will halt very early.
> 
> The SVE-friendly way of getting the VECTOR_CST/CONSTRUCTOR element would be:
> 
>   poly_uint64 idx = sel[i];
>   unsigned HOST_WIDE_INT const_idx;
>   if (known_lt (idx, nelts) && idx.is_constant (&const_idx))
>     // const_idx from first vector
>   else if (known_ge (idx, nelts) && (idx - nelts).is_constant (&const_idx))
>     // const_idx from second vector
>   else
>     // bail out
> 
> ...which is a bit ugly, I admit.

So in the end it's all still a bit smaller than before.  I've split
out a fold_read_from_vector routine.
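
E.g. fold_read_from_vector on the VECTOR_CST { 1, 2, 3, 4 } with index 1
returns the tree for 2; a CONSTRUCTOR index past the last written element
folds to zero, and a non-constant or out-of-range index returns NULL_TREE.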

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

Richard.

2019-05-20  Richard Biener  <rguenther@suse.de>

	PR middle-end/90510
	* fold-const.c (fold_read_from_vector): New function.
	* fold-const.h (fold_read_from_vector): Declare.
	* match.pd (VEC_PERM_EXPR): Build BIT_INSERT_EXPRs for
	single-element insert permutations.  Canonicalize selector
	further and fix issue with last commit.

	* gcc.target/i386/pr90510.c: New testcase.

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 271404)
+++ gcc/fold-const.c	(working copy)
@@ -13793,6 +13793,28 @@ fold_read_from_constant_string (tree exp
   return NULL;
 }
 
+/* Folds a read from vector element at IDX of vector ARG.  */
+
+tree
+fold_read_from_vector (tree arg, poly_int64 idx)
+{
+  HOST_WIDE_INT i;
+  if (known_lt (idx, TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)))
+      && known_ge (idx, 0)
+      && idx.is_constant (&i))
+    {
+      if (TREE_CODE (arg) == VECTOR_CST)
+	return VECTOR_CST_ELT (arg, i);
+      else if (TREE_CODE (arg) == CONSTRUCTOR)
+	{
+	  if (i >= CONSTRUCTOR_NELTS (arg))
+	    return build_zero_cst (TREE_TYPE (TREE_TYPE (arg)));
+	  return CONSTRUCTOR_ELT (arg, i)->value;
+	}
+    }
+  return NULL_TREE;
+}
+
 /* Return the tree for neg (ARG0) when ARG0 is known to be either
    an integer constant, real, or fixed-point constant.
 
Index: gcc/fold-const.h
===================================================================
--- gcc/fold-const.h	(revision 271404)
+++ gcc/fold-const.h	(working copy)
@@ -100,6 +100,7 @@ extern tree fold_bit_and_mask (tree, tre
 			       tree, enum tree_code, tree, tree,
 			       tree, enum tree_code, tree, tree, tree *);
 extern tree fold_read_from_constant_string (tree);
+extern tree fold_read_from_vector (tree, poly_int64);
 #if GCC_VEC_PERN_INDICES_H
 extern tree fold_vec_perm (tree, tree, tree, const vec_perm_indices &);
 #endif
Index: gcc/match.pd
===================================================================
--- gcc/match.pd	(revision 271404)
+++ gcc/match.pd	(working copy)
@@ -5406,6 +5406,11 @@ (define_operator_list COND_TERNARY
 	       op0 = op1;
 	       sel.rotate_inputs (1);
 	     }
+	   else if (known_ge (sel[0], nelts))
+	     {
+	       std::swap (op0, op1);
+	       sel.rotate_inputs (1);
+	     }
          }
        gassign *def;
        tree cop0 = op0, cop1 = op1;
@@ -5429,9 +5434,47 @@ (define_operator_list COND_TERNARY
      (with
       {
 	bool changed = (op0 == op1 && !single_arg);
+	tree ins = NULL_TREE;
+	unsigned at = 0;
+
+	/* See if the permutation is performing a single element
+	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
+	   in that case.  But only if the vector mode is supported,
+	   otherwise this is invalid GIMPLE.  */
+	unsigned HOST_WIDE_INT cnelts;
+        if (TYPE_MODE (type) != BLKmode
+	    && (TREE_CODE (cop0) == VECTOR_CST
+		|| TREE_CODE (cop0) == CONSTRUCTOR
+		|| TREE_CODE (cop1) == VECTOR_CST
+		|| TREE_CODE (cop1) == CONSTRUCTOR))
+          {
+	    if (sel.series_p (1, 1, nelts + 1, 1))
+	      {
+	        /* After canonicalizing the first elt to come from the
+		   first vector we only can insert the first elt from
+		   the first vector.  */
+	        at = 0;
+		ins = fold_read_from_vector (cop0, 0);
+	        op0 = op1;
+	      }
+	    else
+	      {
+	        unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
+		for (at = 0; at < encoded_nelts; ++at)
+		  if (maybe_ne (sel[at], at))
+		    break;
+		if (at < encoded_nelts && sel.series_p (at + 1, 1, at + 1, 1))
+		  {
+		    if (known_lt (at, nelts))
+		      ins = fold_read_from_vector (cop0, sel[at]);
+		    else
+		      ins = fold_read_from_vector (cop1, sel[at] - nelts);
+		  }
+	      }
+	  }
 
 	/* Generate a canonical form of the selector.  */
-	if (sel.encoding () != builder)
+	if (!ins && sel.encoding () != builder)
 	  {
 	    /* Some targets are deficient and fail to expand a single
 	       argument permutation while still allowing an equivalent
@@ -5450,10 +5493,12 @@ (define_operator_list COND_TERNARY
 		     so use the preferred form.  */
 		  op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
 	      }
-	    /* Differences in the encoder do not necessarily mean
-	       differences in the resulting vector.  */
-	    changed = !operand_equal_p (op2, oldop2, 0);
+	    if (!operand_equal_p (op2, oldop2, 0))
+	      changed = true;
 	  }
       }
-      (if (changed)
-       (vec_perm { op0; } { op1; } { op2; })))))))))
+      (if (ins)
+       (bit_insert { op0; } { ins; }
+         { bitsize_int (at * tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))); })
+       (if (changed)
+        (vec_perm { op0; } { op1; } { op2; }))))))))))
Index: gcc/testsuite/gcc.target/i386/pr90510.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr90510.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr90510.c	(working copy)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+typedef double __v2df __attribute__ ((__vector_size__ (16)));
+typedef long __v2di __attribute__ ((__vector_size__ (16)));
+
+__v2df
+_mm_add_sd_A (__v2df x, __v2df y)
+{
+  double tem = x[0] + y[0];
+  return __builtin_shuffle ( x, (__v2df) { tem, tem }, (__v2di) { 2, 1 } );
+}
+
+__v2df
+_mm_add_sd_B (__v2df x, __v2df y)
+{
+  __v2df z = { (x[0] + y[0]), x[1] };
+  return z;
+}
+
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 2 "optimized" } } */
+/* { dg-final { scan-assembler-not "unpck" } } */
Richard Biener May 21, 2019, 11:11 a.m. UTC | #3
On Mon, 20 May 2019, Richard Biener wrote:

> On Sun, 19 May 2019, Richard Sandiford wrote:
> 
> > Richard Biener <rguenther@suse.de> writes:
> > > This adds, incrementally on top of moving VEC_PERM_EXPR folding
> > > to match.pd (but not as an incremental patch - sorry...), folding
> > > of single-element insert permutations to BIT_INSERT_EXPR.
> > >
> > > Things are quite awkward with the new poly-int vec-perm stuff,
> > > so effectively this doesn't work for SVE and is still very
> > > ugly.  I wonder how to make it generic enough that SVE would
> > > be happy and/or how to make the code prettier.
> > >
> > > I also can't find a helper to read a specific vector element
> > > from a VECTOR_CST/CONSTRUCTOR (can I even do that "generally"
> > > with a poly-int index?!), but there surely must be one.
> > 
> > Yeah, would be nice to have a helper that handles both VECTOR_CST
> > and CONSTRUCTOR, even if just for constant indices.
> > 
> > CONSTRUCTORs are still very much fixed-length, so it wouldn't really
> > make sense to fold a poly_int index at compile time.  The only case we
> > could handle is when the index is known to be beyond the last nonzero
> > element.
> > 
> > We could fold some poly_int VECTOR_CST_ELT indices to poly_ints, but
> > it'd depend on the values involved.
> > 
> > Indexing specific poly_int elements of a single vector is a bit dubious
> > in length-agnostic code though.  Not saying it'll never be useful, but
> > it's certainly not a native SVE operation.  So I think even for SVE,
> > constant VECTOR_CST/CONSTRUCTOR elements are the only interesting case
> > for now.
> > 
> > > [...]
> > > @@ -11774,61 +11777,7 @@ fold_ternary_loc (location_t loc, enum t
> > >  	  poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
> > >  	  bool single_arg = (op0 == op1);
> > >  	  vec_perm_indices sel (builder, single_arg ? 1 : 2, nelts);
> > > -
> > > -	  /* Check for cases that fold to OP0 or OP1 in their original
> > > -	     element order.  */
> > > -	  if (sel.series_p (0, 1, 0, 1))
> > > -	    return op0;
> > > -	  if (sel.series_p (0, 1, nelts, 1))
> > > -	    return op1;
> > > -
> > > -	  if (!single_arg)
> > > -	    {
> > > -	      if (sel.all_from_input_p (0))
> > > -		op1 = op0;
> > > -	      else if (sel.all_from_input_p (1))
> > > -		{
> > > -		  op0 = op1;
> > > -		  sel.rotate_inputs (1);
> > > -		}
> > > -	    }
> > 
> > Since this isn't an incremental patch... :-)
> > 
> > One of the holes of the current code is that we still allow two
> > permute indices for the same permutation, e.g.  { 4, 1, 2, 3 } and
> > { 0, 5, 6, 7 }.  IMO we should canonicalize it so that the first index
> > selects from the first vector.  So maybe the above should be:
> > 
> > 	  if (!single_arg)
> > 	    {
> > 	      if (known_ge (sel[0], nelts))
> > 		{
> > 		  std::swap (op0, op1);
> > 		  sel.rotate_inputs (1);
> > 		}
> > 	      if (sel.all_from_input_p (0))
> > 		op1 = op0;
> > 	    }
> > 
> > > [...]
> > > +	/* See if the permutation is performing a single element
> > > +	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
> > > +	   in that case.  */
> > > +	unsigned HOST_WIDE_INT cnelts;
> > > +        if ((TREE_CODE (cop0) == VECTOR_CST
> > > +	     || TREE_CODE (cop0) == CONSTRUCTOR
> > > +	     || TREE_CODE (cop1) == VECTOR_CST
> > > +	     || TREE_CODE (cop1) == CONSTRUCTOR)
> > > +	    && nelts.is_constant (&cnelts))
> > > +          {
> > > +	    unsigned first = 0, first_oo = 0, first_i;
> > > +	    unsigned second = 0, second_oo = 0, second_i;
> > > +	    HOST_WIDE_INT idx;
> > > +	    for (unsigned HOST_WIDE_INT i = 0; i < cnelts; ++i)
> > > +	      if (!sel[i].is_constant (&idx))
> > > +	        {
> > > +		  first = second = 0;
> > > +		  break;
> > > +		}
> > > +	      else if ((unsigned HOST_WIDE_INT)idx < cnelts)
> > > +	        {
> > > +		  first_i = i;
> > > +		  first++;
> > > +		  first_oo += (unsigned HOST_WIDE_INT)idx != i;
> > > +		}
> > > +	      else
> > > +	        {
> > > +		  second_i = i;
> > > +	          second++;
> > > +		  second_oo += (unsigned HOST_WIDE_INT)idx != i + cnelts;
> > > +		}
> > 
> > This won't handle the case in which the inserted element comes from
> > the same vector.
> > 
> > If we add the extra canonicalization above, we'd only ever be inserting
> > into the second vector at element 0.   The test for that would be:
> > 
> >    if (sel.series_p (1, 1, nelts + 1, 1))
> >      // insert sel[0] into index 0 of the second vector
> > 
> > I think the SVE-friendly way of doing the check for the first vector
> > would be:
> > 
> >    unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
> >    unsigned int i = 0;
> >    for (; i < encoded_nelts; ++i)
> >      if (maybe_ne (sel[i], i))
> >        break;
> >    if (i < encoded_nelts && sel.series_p (i + 1, 1, i + 1, 1))
> >      // insert sel[i] into index i of the first vector
> > 
> > Although that's two searches, one of them will halt very early.
> > 
> > The SVE-friendly way of getting the VECTOR_CST/CONSTRUCTOR element would be:
> > 
> >   poly_uint64 idx = sel[i];
> >   unsigned HOST_WIDE_INT const_idx;
> >   if (known_lt (idx, nelts) && idx.is_constant (&const_idx))
> >     // const_idx from first vector
> >   else if (known_ge (idx, nelts) && (idx - nelts).is_constant (&const_idx))
> >     // const_idx from second vector
> >   else
> >     // bail out
> > 
> > ...which is a bit ugly, I admit.
> 
> So in the end it's all still a bit smaller than before.  I've split
> out a fold_read_from_vector routine.

And the following is what I applied after fixing all sign-compare
issues.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

2019-05-21  Richard Biener  <rguenther@suse.de>

	PR middle-end/90510
	* fold-const.c (fold_read_from_vector): New function.
	* fold-const.h (fold_read_from_vector): Declare.
	* match.pd (VEC_PERM_EXPR): Build BIT_INSERT_EXPRs for
	single-element insert permutations.  Canonicalize selector
	further and fix issue with last commit.

	* gcc.target/i386/pr90510.c: New testcase.

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 271412)
+++ gcc/fold-const.c	(working copy)
@@ -13793,6 +13793,28 @@ fold_read_from_constant_string (tree exp
   return NULL;
 }
 
+/* Folds a read from vector element at IDX of vector ARG.  */
+
+tree
+fold_read_from_vector (tree arg, poly_uint64 idx)
+{
+  unsigned HOST_WIDE_INT i;
+  if (known_lt (idx, TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)))
+      && known_ge (idx, 0u)
+      && idx.is_constant (&i))
+    {
+      if (TREE_CODE (arg) == VECTOR_CST)
+	return VECTOR_CST_ELT (arg, i);
+      else if (TREE_CODE (arg) == CONSTRUCTOR)
+	{
+	  if (i >= CONSTRUCTOR_NELTS (arg))
+	    return build_zero_cst (TREE_TYPE (TREE_TYPE (arg)));
+	  return CONSTRUCTOR_ELT (arg, i)->value;
+	}
+    }
+  return NULL_TREE;
+}
+
 /* Return the tree for neg (ARG0) when ARG0 is known to be either
    an integer constant, real, or fixed-point constant.
 
Index: gcc/fold-const.h
===================================================================
--- gcc/fold-const.h	(revision 271412)
+++ gcc/fold-const.h	(working copy)
@@ -100,6 +100,7 @@ extern tree fold_bit_and_mask (tree, tre
 			       tree, enum tree_code, tree, tree,
 			       tree, enum tree_code, tree, tree, tree *);
 extern tree fold_read_from_constant_string (tree);
+extern tree fold_read_from_vector (tree, poly_uint64);
 #if GCC_VEC_PERN_INDICES_H
 extern tree fold_vec_perm (tree, tree, tree, const vec_perm_indices &);
 #endif
Index: gcc/match.pd
===================================================================
--- gcc/match.pd	(revision 271412)
+++ gcc/match.pd	(working copy)
@@ -5406,6 +5406,11 @@ (define_operator_list COND_TERNARY
 	       op0 = op1;
 	       sel.rotate_inputs (1);
 	     }
+	   else if (known_ge (poly_uint64 (sel[0]), nelts))
+	     {
+	       std::swap (op0, op1);
+	       sel.rotate_inputs (1);
+	     }
          }
        gassign *def;
        tree cop0 = op0, cop1 = op1;
@@ -5429,9 +5434,46 @@ (define_operator_list COND_TERNARY
      (with
       {
 	bool changed = (op0 == op1 && !single_arg);
+	tree ins = NULL_TREE;
+	unsigned at = 0;
+
+	/* See if the permutation is performing a single element
+	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
+	   in that case.  But only if the vector mode is supported,
+	   otherwise this is invalid GIMPLE.  */
+        if (TYPE_MODE (type) != BLKmode
+	    && (TREE_CODE (cop0) == VECTOR_CST
+		|| TREE_CODE (cop0) == CONSTRUCTOR
+		|| TREE_CODE (cop1) == VECTOR_CST
+		|| TREE_CODE (cop1) == CONSTRUCTOR))
+          {
+	    if (sel.series_p (1, 1, nelts + 1, 1))
+	      {
+	        /* After canonicalizing the first elt to come from the
+		   first vector we only can insert the first elt from
+		   the first vector.  */
+	        at = 0;
+		ins = fold_read_from_vector (cop0, 0);
+	        op0 = op1;
+	      }
+	    else
+	      {
+	        unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
+		for (at = 0; at < encoded_nelts; ++at)
+		  if (maybe_ne (sel[at], at))
+		    break;
+		if (at < encoded_nelts && sel.series_p (at + 1, 1, at + 1, 1))
+		  {
+		    if (known_lt (at, nelts))
+		      ins = fold_read_from_vector (cop0, sel[at]);
+		    else
+		      ins = fold_read_from_vector (cop1, sel[at] - nelts);
+		  }
+	      }
+	  }
 
 	/* Generate a canonical form of the selector.  */
-	if (sel.encoding () != builder)
+	if (!ins && sel.encoding () != builder)
 	  {
 	    /* Some targets are deficient and fail to expand a single
 	       argument permutation while still allowing an equivalent
@@ -5450,10 +5492,12 @@ (define_operator_list COND_TERNARY
 		     so use the preferred form.  */
 		  op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
 	      }
-	    /* Differences in the encoder do not necessarily mean
-	       differences in the resulting vector.  */
-	    changed = !operand_equal_p (op2, oldop2, 0);
+	    if (!operand_equal_p (op2, oldop2, 0))
+	      changed = true;
 	  }
       }
-      (if (changed)
-       (vec_perm { op0; } { op1; } { op2; })))))))))
+      (if (ins)
+       (bit_insert { op0; } { ins; }
+         { bitsize_int (at * tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))); })
+       (if (changed)
+        (vec_perm { op0; } { op1; } { op2; }))))))))))
Index: gcc/testsuite/gcc.target/i386/pr90510.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr90510.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr90510.c	(working copy)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+typedef double __v2df __attribute__ ((__vector_size__ (16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+
+__v2df
+_mm_add_sd_A (__v2df x, __v2df y)
+{
+  double tem = x[0] + y[0];
+  return __builtin_shuffle ( x, (__v2df) { tem, tem }, (__v2di) { 2, 1 } );
+}
+
+__v2df
+_mm_add_sd_B (__v2df x, __v2df y)
+{
+  __v2df z = { (x[0] + y[0]), x[1] };
+  return z;
+}
+
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 2 "optimized" } } */
+/* { dg-final { scan-assembler-not "unpck" } } */
Thomas Schwinge May 22, 2019, 12:47 p.m. UTC | #4
Hi!

On Tue, 21 May 2019 13:11:54 +0200 (CEST), Richard Biener <rguenther@suse.de> wrote:
> On Mon, 20 May 2019, Richard Biener wrote:
> > On Sun, 19 May 2019, Richard Sandiford wrote:
> > > Richard Biener <rguenther@suse.de> writes:
> > > > This adds, incrementally on top of moving VEC_PERM_EXPR folding
> > > > to match.pd (but not as an incremental patch - sorry...), folding
> > > > of single-element insert permutations to BIT_INSERT_EXPR.

> And the following is what I applied after fixing all sign-compare
> issues.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.

I'm seeing one regression, in 'gcc/testsuite/brig/brig.sum':

    @@ -126,7 +126,7 @@ PASS: packed.hsail.brig scan-tree-dump original "\\+ { 15, 14, 13, 12, 11, 10, 9
    PASS: packed.hsail.brig scan-tree-dump gimple "= { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 };[\n ]+[a-z0-9_]+ = [a-z0-9_]+ \\+ [a-z0-9_]+;"
    PASS: packed.hsail.brig scan-tree-dump gimple "VEC_PERM_EXPR <[a-z0-9_]+, [a-z0-9_]+, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>;"
    PASS: packed.hsail.brig scan-tree-dump gimple "_[0-9]+ = q2 \\+ q3;"
    [-PASS:-]{+FAIL:+} packed.hsail.brig scan-tree-dump gimple "= VEC_PERM_EXPR <[a-z0-9_]+, new_output.[0-9]+_[0-9]+, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;"
    PASS: packed.hsail.brig scan-tree-dump gimple "q4 = (VIEW_CONVERT_EXPR<uint128_t>\\()?s_output.[0-9]+(_[0-9]+)*\\)?;"
    PASS: packed.hsail.brig scan-tree-dump-times gimple "= __builtin___hsail_sat_add_u8" 64
    PASS: packed.hsail.brig scan-tree-dump gimple "= VIEW_CONVERT_EXPR<vector\\(8\\) signed short>\\((s_output.[0-9]+_[0-9]+|q8)\\);[\n ]+q9 = -_[0-9]+;[\n ]+"

The before vs. after tree dump changed as follows:

    --- packed.hsail.brig.005t.gimple	2019-05-22 14:29:48.810860260 +0200
    +++ packed.hsail.brig.005t.gimple	2019-05-22 14:31:32.936670016 +0200
    @@ -112,7 +112,7 @@
       _26 = q2 + q3;
       new_output.11 = _26;
       new_output.21_27 = new_output.11;
    -  _28 = VEC_PERM_EXPR <q4, new_output.21_27, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;
    +  _28 = VEC_PERM_EXPR <new_output.21_27, q4, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
       s_output.12 = _28;
       q4 = s_output.12;
       _29 = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
    @@ -369,7 +369,7 @@
       vec_out.22_273 = vec_out.16;
       new_output.17 = vec_out.22_273;
       new_output.23_274 = new_output.17;
    -  _275 = VEC_PERM_EXPR <q8, new_output.23_274, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;
    +  _275 = VEC_PERM_EXPR <new_output.23_274, q8, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
       s_output.18 = _275;
       q8 = s_output.18;
       _276 = VIEW_CONVERT_EXPR<vector(8) signed short>(q8);

Will it be OK to just commit the obvious patch to adjust the tree dump
scanning, or is something actually wrong?  (As you can tell, so far I
haven't looked up what that actually means -- hoping you can tell from the
little bit of context?)
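
(Judging from the dump diff above, the "obvious patch" would presumably
just swap the operands and renumber the selector in the scan pattern,
i.e. match

    = VEC_PERM_EXPR <new_output.[0-9]+_[0-9]+, [a-z0-9_]+, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;

instead of the current { 16, 1, 2, ... } form.)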


Regards
 Thomas


> 2019-05-21  Richard Biener  <rguenther@suse.de>
> 
> 	PR middle-end/90510
> 	* fold-const.c (fold_read_from_vector): New function.
> 	* fold-const.h (fold_read_from_vector): Declare.
> 	* match.pd (VEC_PERM_EXPR): Build BIT_INSERT_EXPRs for
> 	single-element insert permutations.  Canonicalize selector
> 	further and fix issue with last commit.
> 
> 	* gcc.target/i386/pr90510.c: New testcase.
> 
> Index: gcc/fold-const.c
> ===================================================================
> --- gcc/fold-const.c	(revision 271412)
> +++ gcc/fold-const.c	(working copy)
> @@ -13793,6 +13793,28 @@ fold_read_from_constant_string (tree exp
>    return NULL;
>  }
>  
> +/* Folds a read from vector element at IDX of vector ARG.  */
> +
> +tree
> +fold_read_from_vector (tree arg, poly_uint64 idx)
> +{
> +  unsigned HOST_WIDE_INT i;
> +  if (known_lt (idx, TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)))
> +      && known_ge (idx, 0u)
> +      && idx.is_constant (&i))
> +    {
> +      if (TREE_CODE (arg) == VECTOR_CST)
> +	return VECTOR_CST_ELT (arg, i);
> +      else if (TREE_CODE (arg) == CONSTRUCTOR)
> +	{
> +	  if (i >= CONSTRUCTOR_NELTS (arg))
> +	    return build_zero_cst (TREE_TYPE (TREE_TYPE (arg)));
> +	  return CONSTRUCTOR_ELT (arg, i)->value;
> +	}
> +    }
> +  return NULL_TREE;
> +}
> +
>  /* Return the tree for neg (ARG0) when ARG0 is known to be either
>     an integer constant, real, or fixed-point constant.
>  
> Index: gcc/fold-const.h
> ===================================================================
> --- gcc/fold-const.h	(revision 271412)
> +++ gcc/fold-const.h	(working copy)
> @@ -100,6 +100,7 @@ extern tree fold_bit_and_mask (tree, tre
>  			       tree, enum tree_code, tree, tree,
>  			       tree, enum tree_code, tree, tree, tree *);
>  extern tree fold_read_from_constant_string (tree);
> +extern tree fold_read_from_vector (tree, poly_uint64);
>  #if GCC_VEC_PERN_INDICES_H
>  extern tree fold_vec_perm (tree, tree, tree, const vec_perm_indices &);
>  #endif
> Index: gcc/match.pd
> ===================================================================
> --- gcc/match.pd	(revision 271412)
> +++ gcc/match.pd	(working copy)
> @@ -5406,6 +5406,11 @@ (define_operator_list COND_TERNARY
>  	       op0 = op1;
>  	       sel.rotate_inputs (1);
>  	     }
> +	   else if (known_ge (poly_uint64 (sel[0]), nelts))
> +	     {
> +	       std::swap (op0, op1);
> +	       sel.rotate_inputs (1);
> +	     }
>           }
>         gassign *def;
>         tree cop0 = op0, cop1 = op1;
> @@ -5429,9 +5434,46 @@ (define_operator_list COND_TERNARY
>       (with
>        {
>  	bool changed = (op0 == op1 && !single_arg);
> +	tree ins = NULL_TREE;
> +	unsigned at = 0;
> +
> +	/* See if the permutation is performing a single element
> +	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
> +	   in that case.  But only if the vector mode is supported,
> +	   otherwise this is invalid GIMPLE.  */
> +        if (TYPE_MODE (type) != BLKmode
> +	    && (TREE_CODE (cop0) == VECTOR_CST
> +		|| TREE_CODE (cop0) == CONSTRUCTOR
> +		|| TREE_CODE (cop1) == VECTOR_CST
> +		|| TREE_CODE (cop1) == CONSTRUCTOR))
> +          {
> +	    if (sel.series_p (1, 1, nelts + 1, 1))
> +	      {
> +	        /* After canonicalizing the first elt to come from the
> +		   first vector we only can insert the first elt from
> +		   the first vector.  */
> +	        at = 0;
> +		ins = fold_read_from_vector (cop0, 0);
> +	        op0 = op1;
> +	      }
> +	    else
> +	      {
> +	        unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
> +		for (at = 0; at < encoded_nelts; ++at)
> +		  if (maybe_ne (sel[at], at))
> +		    break;
> +		if (at < encoded_nelts && sel.series_p (at + 1, 1, at + 1, 1))
> +		  {
> +		    if (known_lt (at, nelts))
> +		      ins = fold_read_from_vector (cop0, sel[at]);
> +		    else
> +		      ins = fold_read_from_vector (cop1, sel[at] - nelts);
> +		  }
> +	      }
> +	  }
>  
>  	/* Generate a canonical form of the selector.  */
> -	if (sel.encoding () != builder)
> +	if (!ins && sel.encoding () != builder)
>  	  {
>  	    /* Some targets are deficient and fail to expand a single
>  	       argument permutation while still allowing an equivalent
> @@ -5450,10 +5492,12 @@ (define_operator_list COND_TERNARY
>  		     so use the preferred form.  */
>  		  op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
>  	      }
> -	    /* Differences in the encoder do not necessarily mean
> -	       differences in the resulting vector.  */
> -	    changed = !operand_equal_p (op2, oldop2, 0);
> +	    if (!operand_equal_p (op2, oldop2, 0))
> +	      changed = true;
>  	  }
>        }
> -      (if (changed)
> -       (vec_perm { op0; } { op1; } { op2; })))))))))
> +      (if (ins)
> +       (bit_insert { op0; } { ins; }
> +         { bitsize_int (at * tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))); })
> +       (if (changed)
> +        (vec_perm { op0; } { op1; } { op2; }))))))))))
> Index: gcc/testsuite/gcc.target/i386/pr90510.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/pr90510.c	(revision 0)
> +++ gcc/testsuite/gcc.target/i386/pr90510.c	(working copy)
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
> +
> +typedef double __v2df __attribute__ ((__vector_size__ (16)));
> +typedef long long __v2di __attribute__ ((__vector_size__ (16)));
> +
> +__v2df
> +_mm_add_sd_A (__v2df x, __v2df y)
> +{
> +  double tem = x[0] + y[0];
> +  return __builtin_shuffle ( x, (__v2df) { tem, tem }, (__v2di) { 2, 1 } );
> +}
> +
> +__v2df
> +_mm_add_sd_B (__v2df x, __v2df y)
> +{
> +  __v2df z = { (x[0] + y[0]), x[1] };
> +  return z;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 2 "optimized" } } */
> +/* { dg-final { scan-assembler-not "unpck" } } */
Richard Biener May 22, 2019, 1:39 p.m. UTC | #5
On Wed, 22 May 2019, Thomas Schwinge wrote:

> Hi!
> 
> On Tue, 21 May 2019 13:11:54 +0200 (CEST), Richard Biener <rguenther@suse.de> wrote:
> > On Mon, 20 May 2019, Richard Biener wrote:
> > > On Sun, 19 May 2019, Richard Sandiford wrote:
> > > > Richard Biener <rguenther@suse.de> writes:
> > > > > This adds, incrementally on top of moving VEC_PERM_EXPR folding
> > > > > to match.pd (but not as an incremental patch - sorry...), folding
> > > > > of single-element insert permutations to BIT_INSERT_EXPR.
> 
> > And the following is what I applied after fixing all sign-compare
> > issues.
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> I'm seeing one regression, in 'gcc/testsuite/brig/brig.sum':
> 
>     @@ -126,7 +126,7 @@ PASS: packed.hsail.brig scan-tree-dump original "\\+ { 15, 14, 13, 12, 11, 10, 9
>     PASS: packed.hsail.brig scan-tree-dump gimple "= { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 };[\n ]+[a-z0-9_]+ = [a-z0-9_]+ \\+ [a-z0-9_]+;"
>     PASS: packed.hsail.brig scan-tree-dump gimple "VEC_PERM_EXPR <[a-z0-9_]+, [a-z0-9_]+, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>;"
>     PASS: packed.hsail.brig scan-tree-dump gimple "_[0-9]+ = q2 \\+ q3;"
>     [-PASS:-]{+FAIL:+} packed.hsail.brig scan-tree-dump gimple "= VEC_PERM_EXPR <[a-z0-9_]+, new_output.[0-9]+_[0-9]+, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;"
>     PASS: packed.hsail.brig scan-tree-dump gimple "q4 = (VIEW_CONVERT_EXPR<uint128_t>\\()?s_output.[0-9]+(_[0-9]+)*\\)?;"
>     PASS: packed.hsail.brig scan-tree-dump-times gimple "= __builtin___hsail_sat_add_u8" 64
>     PASS: packed.hsail.brig scan-tree-dump gimple "= VIEW_CONVERT_EXPR<vector\\(8\\) signed short>\\((s_output.[0-9]+_[0-9]+|q8)\\);[\n ]+q9 = -_[0-9]+;[\n ]+"
> 
> The before vs. after tree dump changed as follows:
> 
>     --- packed.hsail.brig.005t.gimple	2019-05-22 14:29:48.810860260 +0200
>     +++ packed.hsail.brig.005t.gimple	2019-05-22 14:31:32.936670016 +0200
>     @@ -112,7 +112,7 @@
>        _26 = q2 + q3;
>        new_output.11 = _26;
>        new_output.21_27 = new_output.11;
>     -  _28 = VEC_PERM_EXPR <q4, new_output.21_27, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;
>     +  _28 = VEC_PERM_EXPR <new_output.21_27, q4, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
>        s_output.12 = _28;
>        q4 = s_output.12;
>        _29 = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
>     @@ -369,7 +369,7 @@
>        vec_out.22_273 = vec_out.16;
>        new_output.17 = vec_out.22_273;
>        new_output.23_274 = new_output.17;
>     -  _275 = VEC_PERM_EXPR <q8, new_output.23_274, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;
>     +  _275 = VEC_PERM_EXPR <new_output.23_274, q8, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
>        s_output.18 = _275;
>        q8 = s_output.18;
>        _276 = VIEW_CONVERT_EXPR<vector(8) signed short>(q8);
> 
> Will it be OK to just commit the obvious patch to adjust the tree dump
> scanning, or is something actually wrong?  (As you can tell, so far I
> haven't looked up what that actually means -- hoping you can tell from the
> little bit of context?)

Yeah, this is because of a new canonicalization suggested by Richard
to have the first element in the permutation come from the first vector.

Richard.

> 
> Regards
>  Thomas
> 
> 
> > 2019-05-21  Richard Biener  <rguenther@suse.de>
> > 
> > 	PR middle-end/90510
> > 	* fold-const.c (fold_read_from_vector): New function.
> > 	* fold-const.h (fold_read_from_vector): Declare.
> > 	* match.pd (VEC_PERM_EXPR): Build BIT_INSERT_EXPRs for
> > 	single-element insert permutations.  Canonicalize selector
> > 	further and fix issue with last commit.
> > 
> > 	* gcc.target/i386/pr90510.c: New testcase.
> > 
> > Index: gcc/fold-const.c
> > ===================================================================
> > --- gcc/fold-const.c	(revision 271412)
> > +++ gcc/fold-const.c	(working copy)
> > @@ -13793,6 +13793,28 @@ fold_read_from_constant_string (tree exp
> >    return NULL;
> >  }
> >  
> > +/* Folds a read from vector element at IDX of vector ARG.  */
> > +
> > +tree
> > +fold_read_from_vector (tree arg, poly_uint64 idx)
> > +{
> > +  unsigned HOST_WIDE_INT i;
> > +  if (known_lt (idx, TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)))
> > +      && known_ge (idx, 0u)
> > +      && idx.is_constant (&i))
> > +    {
> > +      if (TREE_CODE (arg) == VECTOR_CST)
> > +	return VECTOR_CST_ELT (arg, i);
> > +      else if (TREE_CODE (arg) == CONSTRUCTOR)
> > +	{
> > +	  if (i >= CONSTRUCTOR_NELTS (arg))
> > +	    return build_zero_cst (TREE_TYPE (TREE_TYPE (arg)));
> > +	  return CONSTRUCTOR_ELT (arg, i)->value;
> > +	}
> > +    }
> > +  return NULL_TREE;
> > +}
> > +
> >  /* Return the tree for neg (ARG0) when ARG0 is known to be either
> >     an integer constant, real, or fixed-point constant.
> >  
> > Index: gcc/fold-const.h
> > ===================================================================
> > --- gcc/fold-const.h	(revision 271412)
> > +++ gcc/fold-const.h	(working copy)
> > @@ -100,6 +100,7 @@ extern tree fold_bit_and_mask (tree, tre
> >  			       tree, enum tree_code, tree, tree,
> >  			       tree, enum tree_code, tree, tree, tree *);
> >  extern tree fold_read_from_constant_string (tree);
> > +extern tree fold_read_from_vector (tree, poly_uint64);
> >  #if GCC_VEC_PERN_INDICES_H
> >  extern tree fold_vec_perm (tree, tree, tree, const vec_perm_indices &);
> >  #endif
> > Index: gcc/match.pd
> > ===================================================================
> > --- gcc/match.pd	(revision 271412)
> > +++ gcc/match.pd	(working copy)
> > @@ -5406,6 +5406,11 @@ (define_operator_list COND_TERNARY
> >  	       op0 = op1;
> >  	       sel.rotate_inputs (1);
> >  	     }
> > +	   else if (known_ge (poly_uint64 (sel[0]), nelts))
> > +	     {
> > +	       std::swap (op0, op1);
> > +	       sel.rotate_inputs (1);
> > +	     }
> >           }
> >         gassign *def;
> >         tree cop0 = op0, cop1 = op1;
> > @@ -5429,9 +5434,46 @@ (define_operator_list COND_TERNARY
> >       (with
> >        {
> >  	bool changed = (op0 == op1 && !single_arg);
> > +	tree ins = NULL_TREE;
> > +	unsigned at = 0;
> > +
> > +	/* See if the permutation is performing a single element
> > +	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
> > +	   in that case.  But only if the vector mode is supported,
> > +	   otherwise this is invalid GIMPLE.  */
> > +        if (TYPE_MODE (type) != BLKmode
> > +	    && (TREE_CODE (cop0) == VECTOR_CST
> > +		|| TREE_CODE (cop0) == CONSTRUCTOR
> > +		|| TREE_CODE (cop1) == VECTOR_CST
> > +		|| TREE_CODE (cop1) == CONSTRUCTOR))
> > +          {
> > +	    if (sel.series_p (1, 1, nelts + 1, 1))
> > +	      {
> > +	        /* After canonicalizing the first elt to come from the
> > +		   first vector we only can insert the first elt from
> > +		   the first vector.  */
> > +	        at = 0;
> > +		ins = fold_read_from_vector (cop0, 0);
> > +	        op0 = op1;
> > +	      }
> > +	    else
> > +	      {
> > +	        unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
> > +		for (at = 0; at < encoded_nelts; ++at)
> > +		  if (maybe_ne (sel[at], at))
> > +		    break;
> > +		if (at < encoded_nelts && sel.series_p (at + 1, 1, at + 1, 1))
> > +		  {
> > +		    if (known_lt (at, nelts))
> > +		      ins = fold_read_from_vector (cop0, sel[at]);
> > +		    else
> > +		      ins = fold_read_from_vector (cop1, sel[at] - nelts);
> > +		  }
> > +	      }
> > +	  }
> >  
> >  	/* Generate a canonical form of the selector.  */
> > -	if (sel.encoding () != builder)
> > +	if (!ins && sel.encoding () != builder)
> >  	  {
> >  	    /* Some targets are deficient and fail to expand a single
> >  	       argument permutation while still allowing an equivalent
> > @@ -5450,10 +5492,12 @@ (define_operator_list COND_TERNARY
> >  		     so use the preferred form.  */
> >  		  op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
> >  	      }
> > -	    /* Differences in the encoder do not necessarily mean
> > -	       differences in the resulting vector.  */
> > -	    changed = !operand_equal_p (op2, oldop2, 0);
> > +	    if (!operand_equal_p (op2, oldop2, 0))
> > +	      changed = true;
> >  	  }
> >        }
> > -      (if (changed)
> > -       (vec_perm { op0; } { op1; } { op2; })))))))))
> > +      (if (ins)
> > +       (bit_insert { op0; } { ins; }
> > +         { bitsize_int (at * tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))); })
> > +       (if (changed)
> > +        (vec_perm { op0; } { op1; } { op2; }))))))))))
> > Index: gcc/testsuite/gcc.target/i386/pr90510.c
> > ===================================================================
> > --- gcc/testsuite/gcc.target/i386/pr90510.c	(revision 0)
> > +++ gcc/testsuite/gcc.target/i386/pr90510.c	(working copy)
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
> > +
> > +typedef double __v2df __attribute__ ((__vector_size__ (16)));
> > +typedef long long __v2di __attribute__ ((__vector_size__ (16)));
> > +
> > +__v2df
> > +_mm_add_sd_A (__v2df x, __v2df y)
> > +{
> > +  double tem = x[0] + y[0];
> > +  return __builtin_shuffle ( x, (__v2df) { tem, tem }, (__v2di) { 2, 1 } );
> > +}
> > +
> > +__v2df
> > +_mm_add_sd_B (__v2df x, __v2df y)
> > +{
> > +  __v2df z = { (x[0] + y[0]), x[1] };
> > +  return z;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 2 "optimized" } } */
> > +/* { dg-final { scan-assembler-not "unpck" } } */
>
Thomas Schwinge May 23, 2019, 8:25 a.m. UTC | #6
Hi!

On Wed, 22 May 2019 15:39:52 +0200, Richard Biener <rguenther@suse.de> wrote:
> On Wed, 22 May 2019, Thomas Schwinge wrote:
> > On Tue, 21 May 2019 13:11:54 +0200 (CEST), Richard Biener <rguenther@suse.de> wrote:
> > > On Mon, 20 May 2019, Richard Biener wrote:
> > > > On Sun, 19 May 2019, Richard Sandiford wrote:
> > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > > This adds, incrementally on top of moving VEC_PERM_EXPR folding
> > > > > > to match.pd (but not as an incremental patch - sorry...), folding
> > > > > > of single-element insert permutations to BIT_INSERT_EXPR.
> > 
> > > And the following is what I applied after fixing all sign-compare
> > > issues.
> > > 
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > 
> > I'm seeing one regression, in 'gcc/testsuite/brig/brig.sum':
> > 
> >     @@ -126,7 +126,7 @@ PASS: packed.hsail.brig scan-tree-dump original "\\+ { 15, 14, 13, 12, 11, 10, 9
> >     PASS: packed.hsail.brig scan-tree-dump gimple "= { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 };[\n ]+[a-z0-9_]+ = [a-z0-9_]+ \\+ [a-z0-9_]+;"
> >     PASS: packed.hsail.brig scan-tree-dump gimple "VEC_PERM_EXPR <[a-z0-9_]+, [a-z0-9_]+, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }>;"
> >     PASS: packed.hsail.brig scan-tree-dump gimple "_[0-9]+ = q2 \\+ q3;"
> >     [-PASS:-]{+FAIL:+} packed.hsail.brig scan-tree-dump gimple "= VEC_PERM_EXPR <[a-z0-9_]+, new_output.[0-9]+_[0-9]+, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;"
> >     PASS: packed.hsail.brig scan-tree-dump gimple "q4 = (VIEW_CONVERT_EXPR<uint128_t>\\()?s_output.[0-9]+(_[0-9]+)*\\)?;"
> >     PASS: packed.hsail.brig scan-tree-dump-times gimple "= __builtin___hsail_sat_add_u8" 64
> >     PASS: packed.hsail.brig scan-tree-dump gimple "= VIEW_CONVERT_EXPR<vector\\(8\\) signed short>\\((s_output.[0-9]+_[0-9]+|q8)\\);[\n ]+q9 = -_[0-9]+;[\n ]+"
> > 
> > The before vs. after tree dump changed as follows:
> > 
> >     --- packed.hsail.brig.005t.gimple	2019-05-22 14:29:48.810860260 +0200
> >     +++ packed.hsail.brig.005t.gimple	2019-05-22 14:31:32.936670016 +0200
> >     @@ -112,7 +112,7 @@
> >        _26 = q2 + q3;
> >        new_output.11 = _26;
> >        new_output.21_27 = new_output.11;
> >     -  _28 = VEC_PERM_EXPR <q4, new_output.21_27, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;
> >     +  _28 = VEC_PERM_EXPR <new_output.21_27, q4, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
> >        s_output.12 = _28;
> >        q4 = s_output.12;
> >        _29 = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
> >     @@ -369,7 +369,7 @@
> >        vec_out.22_273 = vec_out.16;
> >        new_output.17 = vec_out.22_273;
> >        new_output.23_274 = new_output.17;
> >     -  _275 = VEC_PERM_EXPR <q8, new_output.23_274, { 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }>;
> >     +  _275 = VEC_PERM_EXPR <new_output.23_274, q8, { 0, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
> >        s_output.18 = _275;
> >        q8 = s_output.18;
> >        _276 = VIEW_CONVERT_EXPR<vector(8) signed short>(q8);
> > 
> > Will it be OK to just commit the obvious patch to adjust the tree dump
> > scanning, or is something actually wrong?  (As you can tell, so far I
> > haven't looked up what that actually means -- hoping you can tell from the
> > little bit of context?)
> 
> Yeah, this is because of a new canonicalization suggested by Richard
> to have the first element in the permutation come from the first vector.

Committed to trunk in r271540 "[PR90510] Adjust
'brig.dg/test/gimple/packed.hsail'", see attached.


Regards
 Thomas

Patch

Index: gcc/gimple-match-head.c
===================================================================
--- gcc/gimple-match-head.c	(revision 271321)
+++ gcc/gimple-match-head.c	(working copy)
@@ -27,6 +27,7 @@  along with GCC; see the file COPYING3.
 #include "gimple.h"
 #include "ssa.h"
 #include "cgraph.h"
+#include "vec-perm-indices.h"
 #include "fold-const.h"
 #include "fold-const-call.h"
 #include "stor-layout.h"
Index: gcc/generic-match-head.c
===================================================================
--- gcc/generic-match-head.c	(revision 271321)
+++ gcc/generic-match-head.c	(working copy)
@@ -27,6 +27,7 @@  along with GCC; see the file COPYING3.
 #include "gimple.h"
 #include "ssa.h"
 #include "cgraph.h"
+#include "vec-perm-indices.h"
 #include "fold-const.h"
 #include "stor-layout.h"
 #include "tree-dfa.h"
Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 271321)
+++ gcc/fold-const.c	(working copy)
@@ -9015,7 +9015,7 @@  vec_cst_ctor_to_array (tree arg, unsigne
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
 
-static tree
+tree
 fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
@@ -11763,7 +11763,10 @@  fold_ternary_loc (location_t loc, enum t
       return NULL_TREE;
 
     case VEC_PERM_EXPR:
-      if (TREE_CODE (arg2) == VECTOR_CST)
+      /* Perform constant folding of BIT_INSERT_EXPR.  */
+      if (TREE_CODE (arg2) == VECTOR_CST
+	  && TREE_CODE (op0) == VECTOR_CST
+	  && TREE_CODE (op1) == VECTOR_CST)
 	{
 	  /* Build a vector of integers from the tree mask.  */
 	  vec_perm_builder builder;
@@ -11774,61 +11777,7 @@  fold_ternary_loc (location_t loc, enum t
 	  poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
 	  bool single_arg = (op0 == op1);
 	  vec_perm_indices sel (builder, single_arg ? 1 : 2, nelts);
-
-	  /* Check for cases that fold to OP0 or OP1 in their original
-	     element order.  */
-	  if (sel.series_p (0, 1, 0, 1))
-	    return op0;
-	  if (sel.series_p (0, 1, nelts, 1))
-	    return op1;
-
-	  if (!single_arg)
-	    {
-	      if (sel.all_from_input_p (0))
-		op1 = op0;
-	      else if (sel.all_from_input_p (1))
-		{
-		  op0 = op1;
-		  sel.rotate_inputs (1);
-		}
-	    }
-
-	  if ((TREE_CODE (op0) == VECTOR_CST
-	       || TREE_CODE (op0) == CONSTRUCTOR)
-	      && (TREE_CODE (op1) == VECTOR_CST
-		  || TREE_CODE (op1) == CONSTRUCTOR))
-	    {
-	      tree t = fold_vec_perm (type, op0, op1, sel);
-	      if (t != NULL_TREE)
-		return t;
-	    }
-
-	  bool changed = (op0 == op1 && !single_arg);
-
-	  /* Generate a canonical form of the selector.  */
-	  if (arg2 == op2 && sel.encoding () != builder)
-	    {
-	      /* Some targets are deficient and fail to expand a single
-		 argument permutation while still allowing an equivalent
-		 2-argument version.  */
-	      if (sel.ninputs () == 2
-		  || can_vec_perm_const_p (TYPE_MODE (type), sel, false))
-		op2 = vec_perm_indices_to_tree (TREE_TYPE (arg2), sel);
-	      else
-		{
-		  vec_perm_indices sel2 (builder, 2, nelts);
-		  if (can_vec_perm_const_p (TYPE_MODE (type), sel2, false))
-		    op2 = vec_perm_indices_to_tree (TREE_TYPE (arg2), sel2);
-		  else
-		    /* Not directly supported with either encoding,
-		       so use the preferred form.  */
-		    op2 = vec_perm_indices_to_tree (TREE_TYPE (arg2), sel);
-		}
-	      changed = true;
-	    }
-
-	  if (changed)
-	    return build3_loc (loc, VEC_PERM_EXPR, type, op0, op1, op2);
+	  return fold_vec_perm (type, op0, op1, sel);
 	}
       return NULL_TREE;
 
Index: gcc/fold-const.h
===================================================================
--- gcc/fold-const.h	(revision 271321)
+++ gcc/fold-const.h	(working copy)
@@ -100,6 +100,9 @@  extern tree fold_bit_and_mask (tree, tre
 			       tree, enum tree_code, tree, tree,
 			       tree, enum tree_code, tree, tree, tree *);
 extern tree fold_read_from_constant_string (tree);
+#if GCC_VEC_PERN_INDICES_H
+extern tree fold_vec_perm (tree, tree, tree, const vec_perm_indices &);
+#endif
 extern bool wide_int_binop (wide_int &res, enum tree_code,
 			    const wide_int &arg1, const wide_int &arg2,
 			    signop, wi::overflow_type *);
Index: gcc/match.pd
===================================================================
--- gcc/match.pd	(revision 271321)
+++ gcc/match.pd	(working copy)
@@ -5374,3 +5374,156 @@  (define_operator_list COND_TERNARY
       (bit_and:elt_type
        (BIT_FIELD_REF:elt_type @0 { size; } { pos; })
        { elt; })))))))
+
+(simplify
+ (vec_perm @0 @1 VECTOR_CST@2)
+ (with
+  {
+    tree op0 = @0, op1 = @1, op2 = @2;
+
+    /* Build a vector of integers from the tree mask.  */
+    vec_perm_builder builder;
+    if (!tree_to_vec_perm_builder (&builder, op2))
+      return NULL_TREE;
+
+    /* Create a vec_perm_indices for the integer vector.  */
+    poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
+    bool single_arg = (op0 == op1);
+    vec_perm_indices sel (builder, single_arg ? 1 : 2, nelts);
+  }
+  (if (sel.series_p (0, 1, 0, 1))
+   { op0; }
+   (if (sel.series_p (0, 1, nelts, 1))
+    { op1; }
+    (with
+     {
+       if (!single_arg)
+         {
+	   if (sel.all_from_input_p (0))
+	     op1 = op0;
+	   else if (sel.all_from_input_p (1))
+	     {
+	       op0 = op1;
+	       sel.rotate_inputs (1);
+	     }
+         }
+       gassign *def;
+       tree cop0 = op0, cop1 = op1;
+       if (TREE_CODE (op0) == SSA_NAME
+           && (def = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (op0)))
+	   && gimple_assign_rhs_code (def) == CONSTRUCTOR)
+	 cop0 = gimple_assign_rhs1 (def);
+       if (TREE_CODE (op1) == SSA_NAME
+           && (def = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (op1)))
+	   && gimple_assign_rhs_code (def) == CONSTRUCTOR)
+	 cop1 = gimple_assign_rhs1 (def);
+
+       tree t;
+    }
+    (if ((TREE_CODE (cop0) == VECTOR_CST
+	  || TREE_CODE (cop0) == CONSTRUCTOR)
+	 && (TREE_CODE (cop1) == VECTOR_CST
+	     || TREE_CODE (cop1) == CONSTRUCTOR)
+	 && (t = fold_vec_perm (type, cop0, cop1, sel)))
+     { t; }
+     (with
+      {
+	bool changed = (op0 == op1 && !single_arg);
+	tree ins = NULL_TREE;
+	unsigned at = 0;
+
+	/* See if the permutation is performing a single element
+	   insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
+	   in that case.  */
+	unsigned HOST_WIDE_INT cnelts;
+        if ((TREE_CODE (cop0) == VECTOR_CST
+	     || TREE_CODE (cop0) == CONSTRUCTOR
+	     || TREE_CODE (cop1) == VECTOR_CST
+	     || TREE_CODE (cop1) == CONSTRUCTOR)
+	    && nelts.is_constant (&cnelts))
+          {
+	    unsigned first = 0, first_oo = 0, first_i;
+	    unsigned second = 0, second_oo = 0, second_i;
+	    HOST_WIDE_INT idx;
+	    for (unsigned HOST_WIDE_INT i = 0; i < cnelts; ++i)
+	      if (!sel[i].is_constant (&idx))
+	        {
+		  first = second = 0;
+		  break;
+		}
+	      else if ((unsigned HOST_WIDE_INT)idx < cnelts)
+	        {
+		  first_i = i;
+		  first++;
+		  first_oo += (unsigned HOST_WIDE_INT)idx != i;
+		}
+	      else
+	        {
+		  second_i = i;
+	          second++;
+		  second_oo += (unsigned HOST_WIDE_INT)idx != i + cnelts;
+		}
+	    if (first_oo == 0
+		&& second == 1
+		&& (TREE_CODE (cop1) == VECTOR_CST
+		    || TREE_CODE (cop1) == CONSTRUCTOR))
+	      {
+	        unsigned idx = sel[second_i].to_constant () - cnelts;
+	        at = second_i;
+	        if (TREE_CODE (cop1) == VECTOR_CST)
+		  ins = VECTOR_CST_ELT (cop1, idx);
+		else if (TREE_CODE (cop1) == CONSTRUCTOR)
+		  {
+		    if (CONSTRUCTOR_NELTS (cop1) <= idx)
+		      ins = build_zero_cst (TREE_TYPE (type));
+		    else
+		      ins = CONSTRUCTOR_ELT (cop1, idx)->value;
+		  }
+	      }
+	    else if (second_oo == 0
+	    	     && first == 1
+		     && (TREE_CODE (cop0) == VECTOR_CST
+			 || TREE_CODE (cop0) == CONSTRUCTOR))
+	      {
+	        unsigned idx = sel[first_i].to_constant ();
+	        at = first_i;
+	        if (TREE_CODE (cop0) == VECTOR_CST)
+		  ins = VECTOR_CST_ELT (cop0, idx);
+		else if (TREE_CODE (cop0) == CONSTRUCTOR)
+		  {
+		    if (CONSTRUCTOR_NELTS (cop0) <= idx)
+		      ins = build_zero_cst (TREE_TYPE (type));
+		    else
+		      ins = CONSTRUCTOR_ELT (cop0, idx)->value;
+		  }
+		op0 = op1;
+	      }
+	  }
+
+	/* Generate a canonical form of the selector.  */
+	if (!ins && sel.encoding () != builder)
+	  {
+	    /* Some targets are deficient and fail to expand a single
+	       argument permutation while still allowing an equivalent
+	       2-argument version.  */
+	    if (sel.ninputs () == 2
+	       || can_vec_perm_const_p (TYPE_MODE (type), sel, false))
+	      op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
+	    else
+	      {
+	        vec_perm_indices sel2 (builder, 2, nelts);
+	        if (can_vec_perm_const_p (TYPE_MODE (type), sel2, false))
+	          op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel2);
+	        else
+	          /* Not directly supported with either encoding,
+		     so use the preferred form.  */
+		  op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
+	      }
+	    changed = true;
+	  }
+      }
+      (if (ins)
+       (bit_insert { op0; } { ins; }
+         { bitsize_int (at * tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))); })
+       (if (changed)
+        (vec_perm { op0; } { op1; } { op2; }))))))))))