diff mbox

, Improve vector int/long initialization on PowerPC

Message ID 20160804043344.GA8391@ibm-tiger.the-meissners.org
State New
Headers show

Commit Message

Michael Meissner Aug. 4, 2016, 4:33 a.m. UTC
This is a set of 3 patches to improve initializing vectors on the PowerPC.

The first patch changes the initialization of vector int where the
initialization part was not constant.  Previously, the compiler would create
the vector initialization on the stack and load it up, and now on 64-bit power8
and newer systems, it will create the parts in the GPRs.

Before the switch from using the old RELOAD register allocator to the newer LRA
register allocator, this patch had a problem with one of the fortran benchmarks
(cray_pointers_2) on a Power8 system (it works on power7 because the
optimization is not done there, and on power9 because power9 has d-form vector
addressing).  This was due to TImode not being allowed in vector registers.
So, I added a test to disable the optimization on such a system.  Since LRA
enables TImode to go into vector registers, most users will see the benefits of
this optimization.

The second part is cosmetic, in that moves the determination of the true
register number for a REG or SUBREG to a helper function from the
rs6000_adjust_vec_address function.  Previous versions of this patch had
additional callers to regno_or_subregno, but those uses were removed at this
time.  However, as I work on further optimizing vector initialization, set, and
extract, I may wind up using the helper function.

The third patch improves formation of vector long on ISA 3.0 system, to use the
new MTVSRDD instruction (that builds a vector from two 64-bit GPRs).

I built spec 2006 with these patches on a little endian power8 system, and at
least 18 of the benchmarks had vector initializations replaced.  Most
benchmarks only used the initialization in a few places, but gamess, dealII,
h264ref, and wrf each had over 100 initializations changed.

I have tried these patches on a big endian power7 system (both 32-bit and
64-bit targets), on a big endian power8 system (just 64-bit targets), and a
little endian power8 system (just 64-bit targets).  There were no regressions
on any of the systems.  Can I install these patches to the trunk?

[gcc]
2016-08-03  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/rs6000.c (rs6000_expand_vector): On 64-bit systems
	with direct move and TImode registers allowed in VSX, initialize a
	V4SImode vector in the GPRs, rather than creating a temporary
	vector on the stack, doing 4 stores to that temporary vector, and
	then doing a vector load (which causes a pipeline bubble between
	the stores and the load).
	(regno_or_subregno): New helper function to get the register
	number of a REG or SUBREG rtx.
	(rs6000_adjust_vec_address): Use regno_or_subregno.
	* config/rs6000/vsx.md (vsx_concat_<mode>): Add support for the
	ISA 3.0 mtvsrdd instruction if we are moving two gpr registers to
	create on vector register.

[gcc/testsuite]
2016-08-03  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/vec-init-1.c: New tests for vector init.
	* gcc.target/powerpc/vec-init-2.c: Likewise.
	* gcc.target/powerpc/vec-init-3.c: Likewise.

Comments

Segher Boessenkool Aug. 4, 2016, 3:03 p.m. UTC | #1
Hi Mike,

On Thu, Aug 04, 2016 at 12:33:44AM -0400, Michael Meissner wrote:
> I built spec 2006 with these patches on a little endian power8 system, and at
> least 18 of the benchmarks had vector initializations replaced.  Most
> benchmarks only used the initialization in a few places, but gamess, dealII,
> h264ref, and wrf each had over 100 initializations changed.

Did performance change?

> I have tried these patches on a big endian power7 system (both 32-bit and
> 64-bit targets), on a big endian power8 system (just 64-bit targets), and a
> little endian power8 system (just 64-bit targets).  There were no regressions
> on any of the systems.  Can I install these patches to the trunk?

Some questions below, okay for trunk with those taken care of.  Thanks.


> --- gcc/config/rs6000/rs6000.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 239098)
> +++ gcc/config/rs6000/rs6000.c	(.../gcc/config/rs6000)	(working copy)
> @@ -6736,6 +6736,38 @@ rs6000_expand_vector_init (rtx target, r
>        return;
>      }
>  
> +  /* Special case initializing vector int if we are on 64-bit systems with
> +     direct move.  This bug tickles a bug in reload for fortran's
> +     cray_pointers_2 test unless -mvsx-timode is enabled.  */

"This bug"?  It's not clear to me what this says, could you rephrase?
Just say what the code does, not what would happen without the code.  Or
say both.

> +static inline int
> +regno_or_subregno (rtx op)
> +{
> +  if (REG_P (op))
> +    return REGNO (op);
> +  else if (SUBREG_P (op))
> +    return subreg_regno (op);
> +  else
> +    gcc_unreachable ();
> +}

Maybe this should check the subreg is lowpart, too?  For robustness.

>  ;; Build a V2DF/V2DI vector from two scalars
>  (define_insn "vsx_concat_<mode>"
> -  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSr>,?<VSa>")
> +  [(set (match_operand:VSX_D 0 "gpc_reg_operand" "=<VSa>,we")
>  	(vec_concat:VSX_D
> -	 (match_operand:<VS_scalar> 1 "vsx_register_operand" "<VS_64reg>,<VSa>")
> -	 (match_operand:<VS_scalar> 2 "vsx_register_operand" "<VS_64reg>,<VSa>")))]
> +	 (match_operand:<VS_scalar> 1 "gpc_reg_operand" "<VS_64reg>,r")
> +	 (match_operand:<VS_scalar> 2 "gpc_reg_operand" "<VS_64reg>,r")))
> +   (clobber (match_scratch:DI 3 "=X,X"))]

X,X?  How is that useful?

> +   (set_attr "length" "4")])

One insn is the default.


Segher
Pat Haugen Aug. 5, 2016, 10 p.m. UTC | #2
On 08/03/2016 11:33 PM, Michael Meissner wrote:
>  {
> -  if (BYTES_BIG_ENDIAN)
> -    return "xxpermdi %x0,%x1,%x2,0";
> +  if (which_alternative == 0)
> +    return (BYTES_BIG_ENDIAN
> +	    ? "xxpermdi %x0,%x1,%x2,0"
> +	    : "xxpermdi %x0,%x2,%x1,0");
> +
> +  else if (which_alternative == 1)
> +    return (BYTES_BIG_ENDIAN
> +	    ? "mtvsrdd %x0,%1,%2"
> +	    : "mtvsrdd %x0,%2,%1");
> +
>    else
> -    return "xxpermdi %x0,%x2,%x1,0";
> +    gcc_unreachable ();
>  }
> -  [(set_attr "type" "vecperm")])
> +  [(set_attr "type" "vecperm,mftgpr")
> +   (set_attr "length" "4")])

mtvsrdd actually behaves like a permute, so vecperm would be best insn type for it.

-Pat
Michael Meissner Aug. 8, 2016, 10:55 p.m. UTC | #3
On Thu, Aug 04, 2016 at 10:03:36AM -0500, Segher Boessenkool wrote:
> Hi Mike,
> 
> On Thu, Aug 04, 2016 at 12:33:44AM -0400, Michael Meissner wrote:
> > I built spec 2006 with these patches on a little endian power8 system, and at
> > least 18 of the benchmarks had vector initializations replaced.  Most
> > benchmarks only used the initialization in a few places, but gamess, dealII,
> > h264ref, and wrf each had over 100 initializations changed.
> 
> Did performance change?

I ran a selected set of spec benchmarks, and I saw one regression (3.5% in
tonto).  I'll run the full suite, and see if I can see what the slow down is.

> > I have tried these patches on a big endian power7 system (both 32-bit and
> > 64-bit targets), on a big endian power8 system (just 64-bit targets), and a
> > little endian power8 system (just 64-bit targets).  There were no regressions
> > on any of the systems.  Can I install these patches to the trunk?
> 
> Some questions below, okay for trunk with those taken care of.  Thanks.
> 
> 
> > --- gcc/config/rs6000/rs6000.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 239098)
> > +++ gcc/config/rs6000/rs6000.c	(.../gcc/config/rs6000)	(working copy)
> > @@ -6736,6 +6736,38 @@ rs6000_expand_vector_init (rtx target, r
> >        return;
> >      }
> >  
> > +  /* Special case initializing vector int if we are on 64-bit systems with
> > +     direct move.  This bug tickles a bug in reload for fortran's
> > +     cray_pointers_2 test unless -mvsx-timode is enabled.  */
> 
> "This bug"?  It's not clear to me what this says, could you rephrase?
> Just say what the code does, not what would happen without the code.  Or
> say both.

Here is the comment I tried to reword to be clearer.  Since the bug only occurs
if you do -mno-lra -mno-vsx-timode, I made the test to be those conditions.  I
don't think it is worth the time at this point to track down what the reload
issue is.

  /* Special case initializing vector int if we are on 64-bit systems with
     direct move.  This optimization tickles a bug in RELOAD for fortran's
     cray_pointers_2 test unless -mvsx-timode is enabled (the register
     allocator is trying to load up a V4SImode vector in GPRs with a TImode
     address using a SUBREG).  Since RELOAD is no longer the default register
     allocator, just don't do the optimization.  */
  if (mode == V4SImode && TARGET_DIRECT_MOVE_64BIT
      && (TARGET_LRA || TARGET_VSX_TIMODE))


> > +static inline int
> > +regno_or_subregno (rtx op)
> > +{
> > +  if (REG_P (op))
> > +    return REGNO (op);
> > +  else if (SUBREG_P (op))
> > +    return subreg_regno (op);
> > +  else
> > +    gcc_unreachable ();
> > +}
> 
> Maybe this should check the subreg is lowpart, too?  For robustness.

No, subreg_regno already does that.  Since this is just infrastructure for a
potential future change, I can remove this change until the next patch.

> >  ;; Build a V2DF/V2DI vector from two scalars
> >  (define_insn "vsx_concat_<mode>"
> > -  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSr>,?<VSa>")
> > +  [(set (match_operand:VSX_D 0 "gpc_reg_operand" "=<VSa>,we")
> >  	(vec_concat:VSX_D
> > -	 (match_operand:<VS_scalar> 1 "vsx_register_operand" "<VS_64reg>,<VSa>")
> > -	 (match_operand:<VS_scalar> 2 "vsx_register_operand" "<VS_64reg>,<VSa>")))]
> > +	 (match_operand:<VS_scalar> 1 "gpc_reg_operand" "<VS_64reg>,r")
> > +	 (match_operand:<VS_scalar> 2 "gpc_reg_operand" "<VS_64reg>,r")))
> > +   (clobber (match_scratch:DI 3 "=X,X"))]
> 
> X,X?  How is that useful?

Because I had a more elaborate version for vsx_concat_<mode> that did need a
base register for memory addresses.  I missed deleting the useless clobber.  I
deleted it in my source tree that will become the next iteration of the patch.
When I add support for vector insert to/from memory, I will probably need the
clobbers (and this BTW, was the reason to split off regno_or_subregno).

> > +   (set_attr "length" "4")])
> 
> One insn is the default.

Yep.  I had been working on a larger change, and some of the other alternatives
did have other lengths.
Michael Meissner Aug. 8, 2016, 10:56 p.m. UTC | #4
On Fri, Aug 05, 2016 at 05:00:39PM -0500, Pat Haugen wrote:
> On 08/03/2016 11:33 PM, Michael Meissner wrote:
> >  {
> > -  if (BYTES_BIG_ENDIAN)
> > -    return "xxpermdi %x0,%x1,%x2,0";
> > +  if (which_alternative == 0)
> > +    return (BYTES_BIG_ENDIAN
> > +	    ? "xxpermdi %x0,%x1,%x2,0"
> > +	    : "xxpermdi %x0,%x2,%x1,0");
> > +
> > +  else if (which_alternative == 1)
> > +    return (BYTES_BIG_ENDIAN
> > +	    ? "mtvsrdd %x0,%1,%2"
> > +	    : "mtvsrdd %x0,%2,%1");
> > +
> >    else
> > -    return "xxpermdi %x0,%x2,%x1,0";
> > +    gcc_unreachable ();
> >  }
> > -  [(set_attr "type" "vecperm")])
> > +  [(set_attr "type" "vecperm,mftgpr")
> > +   (set_attr "length" "4")])
> 
> mtvsrdd actually behaves like a permute, so vecperm would be best insn type for it.

Ok, when I submit the patch again, I will change the type to "vecperm".  I will
also change it in "vsx_splat_<mode>" which also generates MTVSRDD.  Thanks.
diff mbox

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 239098)
+++ gcc/config/rs6000/rs6000.c	(.../gcc/config/rs6000)	(working copy)
@@ -6736,6 +6736,38 @@  rs6000_expand_vector_init (rtx target, r
       return;
     }
 
+  /* Special case initializing vector int if we are on 64-bit systems with
+     direct move.  This bug tickles a bug in reload for fortran's
+     cray_pointers_2 test unless -mvsx-timode is enabled.  */
+  if (mode == V4SImode && TARGET_DIRECT_MOVE_64BIT && TARGET_VSX_TIMODE)
+    {
+      rtx di_hi, di_lo, elements[4], tmp;
+      size_t i;
+
+      for (i = 0; i < 4; i++)
+	{
+	  rtx element_si = XVECEXP (vals, 0, VECTOR_ELT_ORDER_BIG ? i : 3 - i);
+	  element_si = copy_to_mode_reg (SImode, element_si);
+	  elements[i] = gen_reg_rtx (DImode);
+	  convert_move (elements[i], element_si, true);
+	}
+
+      di_hi = gen_reg_rtx (DImode);
+      tmp = gen_reg_rtx (DImode);
+      emit_insn (gen_ashldi3 (tmp, elements[0], GEN_INT (32)));
+      emit_insn (gen_iordi3 (di_hi, tmp, elements[1]));
+
+      di_lo = gen_reg_rtx (DImode);
+      tmp = gen_reg_rtx (DImode);
+      emit_insn (gen_ashldi3 (tmp, elements[2], GEN_INT (32)));
+      emit_insn (gen_iordi3 (di_lo, tmp, elements[3]));
+
+      emit_insn (gen_rtx_CLOBBER (VOIDmode, target));
+      emit_move_insn (gen_highpart (DImode, target), di_hi);
+      emit_move_insn (gen_lowpart (DImode, target), di_lo);
+      return;
+    }
+
   /* With single precision floating point on VSX, know that internally single
      precision is actually represented as a double, and either make 2 V2DF
      vectors, and convert these vectors to single precision, or do one
@@ -7021,6 +7053,18 @@  rs6000_expand_vector_extract (rtx target
   emit_move_insn (target, adjust_address_nv (mem, inner_mode, 0));
 }
 
+/* Helper function to return the register number of a RTX.  */
+static inline int
+regno_or_subregno (rtx op)
+{
+  if (REG_P (op))
+    return REGNO (op);
+  else if (SUBREG_P (op))
+    return subreg_regno (op);
+  else
+    gcc_unreachable ();
+}
+
 /* Adjust a memory address (MEM) of a vector type to point to a scalar field
    within the vector (ELEMENT) with a mode (SCALAR_MODE).  Use a base register
    temporary (BASE_TMP) to fixup the address.  Return the new memory address
@@ -7136,14 +7180,7 @@  rs6000_adjust_vec_address (rtx scalar_re
     {
       rtx op1 = XEXP (new_addr, 1);
       addr_mask_type addr_mask;
-      int scalar_regno;
-
-      if (REG_P (scalar_reg))
-	scalar_regno = REGNO (scalar_reg);
-      else if (SUBREG_P (scalar_reg))
-	scalar_regno = subreg_regno (scalar_reg);
-      else
-	gcc_unreachable ();
+      int scalar_regno = regno_or_subregno (scalar_reg);
 
       gcc_assert (scalar_regno < FIRST_PSEUDO_REGISTER);
       if (INT_REGNO_P (scalar_regno))
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/config/rs6000)	(revision 239098)
+++ gcc/config/rs6000/vsx.md	(.../gcc/config/rs6000)	(working copy)
@@ -1899,18 +1899,28 @@  (define_insn "*vsx_float_fix_v2df2"
 
 ;; Build a V2DF/V2DI vector from two scalars
 (define_insn "vsx_concat_<mode>"
-  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=<VSr>,?<VSa>")
+  [(set (match_operand:VSX_D 0 "gpc_reg_operand" "=<VSa>,we")
 	(vec_concat:VSX_D
-	 (match_operand:<VS_scalar> 1 "vsx_register_operand" "<VS_64reg>,<VSa>")
-	 (match_operand:<VS_scalar> 2 "vsx_register_operand" "<VS_64reg>,<VSa>")))]
+	 (match_operand:<VS_scalar> 1 "gpc_reg_operand" "<VS_64reg>,r")
+	 (match_operand:<VS_scalar> 2 "gpc_reg_operand" "<VS_64reg>,r")))
+   (clobber (match_scratch:DI 3 "=X,X"))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
 {
-  if (BYTES_BIG_ENDIAN)
-    return "xxpermdi %x0,%x1,%x2,0";
+  if (which_alternative == 0)
+    return (BYTES_BIG_ENDIAN
+	    ? "xxpermdi %x0,%x1,%x2,0"
+	    : "xxpermdi %x0,%x2,%x1,0");
+
+  else if (which_alternative == 1)
+    return (BYTES_BIG_ENDIAN
+	    ? "mtvsrdd %x0,%1,%2"
+	    : "mtvsrdd %x0,%2,%1");
+
   else
-    return "xxpermdi %x0,%x2,%x1,0";
+    gcc_unreachable ();
 }
-  [(set_attr "type" "vecperm")])
+  [(set_attr "type" "vecperm,mftgpr")
+   (set_attr "length" "4")])
 
 ;; Special purpose concat using xxpermdi to glue two single precision values
 ;; together, relying on the fact that internally scalar floats are represented

Index: gcc/testsuite/gcc.target/powerpc/vec-init-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-init-1.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-init-1.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 239099)
@@ -0,0 +1,36 @@ 
+/* { dg-do run { target { powerpc*-*-linux* } } } */
+/* { dg-require-effective-target vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#include <stdlib.h>
+#include <stddef.h>
+#include <altivec.h>
+
+extern void check (vector int a)                    __attribute__((__noinline__));
+extern vector int pack (int a, int b, int c, int d) __attribute__((__noinline__));
+
+void
+check (vector int a)
+{
+  static const int expected[] = { -1, 2, 0, -3 };
+  size_t i;
+
+  for (i = 0; i < 4; i++)
+    if (vec_extract (a, i) != expected[i])
+      abort ();
+}
+
+vector int
+pack (int a, int b, int c, int d)
+{
+  return (vector int) { a, b, c, d };
+}
+
+vector int sv = (vector int) { -1, 2, 0, -3 };
+
+int main (void)
+{
+  check (sv);
+  check (pack (-1, 2, 0, -3));
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/vec-init-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-init-2.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-init-2.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 239099)
@@ -0,0 +1,36 @@ 
+/* { dg-do run { target { powerpc*-*-linux* && lp64 } } } */
+/* { dg-require-effective-target vsx_hw } */
+/* { dg-options "-O2 -mvsx" } */
+
+#include <stdlib.h>
+#include <stddef.h>
+#include <altivec.h>
+
+extern void check (vector long a)        __attribute__((__noinline__));
+extern vector long pack (long a, long b) __attribute__((__noinline__));
+
+void
+check (vector long a)
+{
+  static const long expected[] = { 2L, -3L };
+  size_t i;
+
+  for (i = 0; i < 2; i++)
+    if (vec_extract (a, i) != expected[i])
+      abort ();
+}
+
+vector long
+pack (long a, long b)
+{
+  return (vector long) { a, b };
+}
+
+vector long sv = (vector long) { 2L, -3L };
+
+int main (void)
+{
+  check (sv);
+  check (pack (2L, -3L));
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/vec-init-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vec-init-3.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/trunk/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vec-init-3.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 239099)
@@ -0,0 +1,12 @@ 
+/* { dg-do compile { target { powerpc64*-*-* && lp64 } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-mcpu=power9 -O2 -mupper-regs-di" } */
+
+vector long
+merge (long a, long b)
+{
+  return (vector long) { a, b };
+}
+
+/* { dg-final { scan-assembler "mtvsrdd" } } */