diff mbox

[rs6000] Fix PR80695 (vec_construct cost modeling issue)

Message ID 600d0ecf-e93d-79d9-1960-28147921fc6b@linux.vnet.ibm.com
State New
Headers show

Commit Message

Bill Schmidt May 11, 2017, 3:59 p.m. UTC
Hi,

PR80695 identifies a case (similar to several others we've seen) where SLP
vectorization is too aggressive about vectorizing stores.  The problem is
that we undervalue the cost of a vec_construct operation.  vec_construct
is the vectorizer's representation for building a vector from scalar
elements.  When we construct an integer vector type from its constituent
parts, it requires a direct move from two GPRs (one instruction on P9,
two direct moves and a merge on P8).  The high cost of this is not
reflected in the current cost calculation, which only counts the cost
of combining the elements using N-1 inserts.  This patch provides a higher
estimate that is closer to reality.  Note that all cost estimation for
vectorization is a bit rough, so this should be viewed as a heuristic.

The patch treats all integer vectors separately from the default case.
There is already special handling for V4SFmode, so this leaves only
V2DFmode in the default case.  It was previously established heuristically
that a cost factor of 2 was appropriate for V2DFmode, so that is left
unchanged here; but since V2DFmode is the only default, we can simplify
the calculation to just return 2.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
Is this ok for trunk?

Thanks,
Bill


[gcc]

2017-05-11  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR target/80695
	* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
	Account for direct move costs for vec_construct of integer
	vectors.

[gcc/testsuite]

2017-05-11  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR target/80695
	* gcc.target/powerpc/pr80695-p8.c: New file.
	* gcc.target/powerpc/pr80695-p9.c: New file.

Comments

Bill Schmidt May 11, 2017, 5:37 p.m. UTC | #1
I meant to mention that I regstrapped this on both POWER8 and POWER9.

Bill

> On May 11, 2017, at 10:59 AM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
> 
> Hi,
> 
> PR80695 identifies a case (similar to several others we've seen) where SLP
> vectorization is too aggressive about vectorizing stores.  The problem is
> that we undervalue the cost of a vec_construct operation.  vec_construct
> is the vectorizer's representation for building a vector from scalar
> elements.  When we construct an integer vector type from its constituent
> parts, it requires a direct move from two GPRs (one instruction on P9,
> two direct moves and a merge on P8).  The high cost of this is not
> reflected in the current cost calculation, which only counts the cost
> of combining the elements using N-1 inserts.  This patch provides a higher
> estimate that is closer to reality.  Note that all cost estimation for
> vectorization is a bit rough, so this should be viewed as a heuristic.
> 
> The patch treats all integer vectors separately from the default case.
> There is already special handling for V4SFmode, so this leaves only
> V2DFmode in the default case.  It was previously established heuristically
> that a cost factor of 2 was appropriate for V2DFmode, so that is left
> unchanged here; but since V2DFmode is the only default, we can simplify
> the calculation to just return 2.
> 
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
> Is this ok for trunk?
> 
> Thanks,
> Bill
> 
> 
> [gcc]
> 
> 2017-05-11  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> 
> 	PR target/80695
> 	* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
> 	Account for direct move costs for vec_construct of integer
> 	vectors.
> 
> [gcc/testsuite]
> 
> 2017-05-11  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> 
> 	PR target/80695
> 	* gcc.target/powerpc/pr80695-p8.c: New file.
> 	* gcc.target/powerpc/pr80695-p9.c: New file.
> 
> 
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c	(revision 247809)
> +++ gcc/config/rs6000/rs6000.c	(working copy)
> @@ -5849,8 +5849,20 @@ rs6000_builtin_vectorization_cost (enum vect_cost_
> 	if (SCALAR_FLOAT_TYPE_P (elem_type)
> 	    && TYPE_PRECISION (elem_type) == 32)
> 	  return 5;
> +	/* On POWER9, integer vector types are built up in GPRs and then
> +           use a direct move (2 cycles).  For POWER8 this is even worse,
> +           as we need two direct moves and a merge, and the direct moves
> +	   are five cycles.  */
> +	else if (INTEGRAL_TYPE_P (elem_type))
> +	  {
> +	    if (TARGET_P9_VECTOR)
> +	      return TYPE_VECTOR_SUBPARTS (vectype) - 1 + 2;
> +	    else
> +	      return TYPE_VECTOR_SUBPARTS (vectype) - 1 + 11;
> +	  }
> 	else
> -	  return max (2, TYPE_VECTOR_SUBPARTS (vectype) - 1);
> +	  /* V2DFmode doesn't need a direct move.  */
> +	  return 2;
> 
>       default:
>         gcc_unreachable ();
> Index: gcc/testsuite/gcc.target/powerpc/pr80695-p8.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/pr80695-p8.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/pr80695-p8.c	(working copy)
> @@ -0,0 +1,18 @@
> +/* { dg-do compile { target { powerpc*-*-* } } } */
> +/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-options "-mcpu=power8 -O3 -fdump-tree-slp-details" } */
> +
> +/* PR80695: Verify cost model for vec_construct on POWER8.  */
> +
> +long a[10] __attribute__((aligned(16)));
> +
> +void foo (long i, long j, long k, long l)
> +{
> +  a[6] = i;
> +  a[7] = j;
> +  a[8] = k;
> +  a[9] = l;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorization is not profitable" 1 "slp2" } } */
> Index: gcc/testsuite/gcc.target/powerpc/pr80695-p9.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/pr80695-p9.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/pr80695-p9.c	(working copy)
> @@ -0,0 +1,18 @@
> +/* { dg-do compile { target { powerpc*-*-* } } } */
> +/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-options "-mcpu=power9 -O3 -fdump-tree-slp-details" } */
> +
> +/* PR80695: Verify cost model for vec_construct on POWER9.  */
> +
> +long a[10] __attribute__((aligned(16)));
> +
> +void foo (long i, long j, long k, long l)
> +{
> +  a[6] = i;
> +  a[7] = j;
> +  a[8] = k;
> +  a[9] = l;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorization is not profitable" 1 "slp2" } } */
>
Segher Boessenkool May 11, 2017, 7:49 p.m. UTC | #2
On Thu, May 11, 2017 at 10:59:41AM -0500, Bill Schmidt wrote:
> PR80695 identifies a case (similar to several others we've seen) where SLP
> vectorization is too aggressive about vectorizing stores.  The problem is
> that we undervalue the cost of a vec_construct operation.  vec_construct
> is the vectorizer's representation for building a vector from scalar
> elements.  When we construct an integer vector type from its constituent
> parts, it requires a direct move from two GPRs (one instruction on P9,
> two direct moves and a merge on P8).  The high cost of this is not
> reflected in the current cost calculation, which only counts the cost
> of combining the elements using N-1 inserts.  This patch provides a higher
> estimate that is closer to reality.  Note that all cost estimation for
> vectorization is a bit rough, so this should be viewed as a heuristic.
> 
> The patch treats all integer vectors separately from the default case.
> There is already special handling for V4SFmode, so this leaves only
> V2DFmode in the default case.  It was previously established heuristically
> that a cost factor of 2 was appropriate for V2DFmode, so that is left
> unchanged here; but since V2DFmode is the only default, we can simplify
> the calculation to just return 2.
> 
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
> Is this ok for trunk?

Seems fine to me (well, minor stuff below).  Thanks,


Segher


> 2017-05-11  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> 
> 	PR target/80695
> 	* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
> 	Account for direct move costs for vec_construct of integer
> 	vectors.
> 
> [gcc/testsuite]
> 
> 2017-05-11  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> 
> 	PR target/80695
> 	* gcc.target/powerpc/pr80695-p8.c: New file.
> 	* gcc.target/powerpc/pr80695-p9.c: New file.


> @@ -5849,8 +5849,20 @@ rs6000_builtin_vectorization_cost (enum vect_cost_
>  	if (SCALAR_FLOAT_TYPE_P (elem_type)
>  	    && TYPE_PRECISION (elem_type) == 32)
>  	  return 5;
> +	/* On POWER9, integer vector types are built up in GPRs and then
> +           use a direct move (2 cycles).  For POWER8 this is even worse,
> +           as we need two direct moves and a merge, and the direct moves
> +	   are five cycles.  */

You're mixing tabs and spaces here.
diff mbox

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 247809)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -5849,8 +5849,20 @@  rs6000_builtin_vectorization_cost (enum vect_cost_
 	if (SCALAR_FLOAT_TYPE_P (elem_type)
 	    && TYPE_PRECISION (elem_type) == 32)
 	  return 5;
+	/* On POWER9, integer vector types are built up in GPRs and then
+           use a direct move (2 cycles).  For POWER8 this is even worse,
+           as we need two direct moves and a merge, and the direct moves
+	   are five cycles.  */
+	else if (INTEGRAL_TYPE_P (elem_type))
+	  {
+	    if (TARGET_P9_VECTOR)
+	      return TYPE_VECTOR_SUBPARTS (vectype) - 1 + 2;
+	    else
+	      return TYPE_VECTOR_SUBPARTS (vectype) - 1 + 11;
+	  }
 	else
-	  return max (2, TYPE_VECTOR_SUBPARTS (vectype) - 1);
+	  /* V2DFmode doesn't need a direct move.  */
+	  return 2;
 
       default:
         gcc_unreachable ();
Index: gcc/testsuite/gcc.target/powerpc/pr80695-p8.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr80695-p8.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/pr80695-p8.c	(working copy)
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-options "-mcpu=power8 -O3 -fdump-tree-slp-details" } */
+
+/* PR80695: Verify cost model for vec_construct on POWER8.  */
+
+long a[10] __attribute__((aligned(16)));
+
+void foo (long i, long j, long k, long l)
+{
+  a[6] = i;
+  a[7] = j;
+  a[8] = k;
+  a[9] = l;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorization is not profitable" 1 "slp2" } } */
Index: gcc/testsuite/gcc.target/powerpc/pr80695-p9.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr80695-p9.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/pr80695-p9.c	(working copy)
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-options "-mcpu=power9 -O3 -fdump-tree-slp-details" } */
+
+/* PR80695: Verify cost model for vec_construct on POWER9.  */
+
+long a[10] __attribute__((aligned(16)));
+
+void foo (long i, long j, long k, long l)
+{
+  a[6] = i;
+  a[7] = j;
+  a[8] = k;
+  a[9] = l;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorization is not profitable" 1 "slp2" } } */