diff mbox

[gomp4] New reduction infrastructure for OpenACC

Message ID 55D4DC16.7020106@codesourcery.com
State New
Headers show

Commit Message

Cesar Philippidis Aug. 19, 2015, 7:42 p.m. UTC
This patch introduces a infrastructure for reductions in OpenACC. This
infrastructure consists of four internal functions,
GOACC_REDUCTION_SETUP, GOACC_REDUCTION_INIT, GOACC_REDUCTION_FINI, and
GOACC_REDUCTION_TEARDOWN, along with a new target hook goacc.reduction.
Each internal function shares a common interface:

  var = ifn (*ref_to_res, local_var, level, op, lid, rid)

var is the intermediate and private result of the reduction. Usually,
var = local_var.

*ref_to_res is a pointer to the resulting reduction. This is only
non-NULL for gang reductions. All other reduction operate on local
variables for which var will suffice.

local_var is a local (private) copy of the reduction variable.

level is the GOMP_DIM of the reduction. Each function call may only
contain one dim. If a loop a combination of gang, worker and vector,
then ifn must be called one per each dim.

op is the reduction operation.

lid is a unique loop ID. It's not 100% unique because it might get reset
in different TUs.

rid is the reduction ID within a loop. E.g., if a loop has two
reductions associated with it, the first could be designated zero and
the second one.

The target hook takes in one argument, the gimple statement containing
the call to the internal reduction function, and it returns true if it
introduces any calls to other target functions. This was necessary for
the nvptx backend, specifically for vector INIT because the thread ID is
necessary.

Each internal function is expanded during execute_oacc_transform using
that goacc reduction target hook. This allows us to generate
target-specific code while lowering it in a target-independent manner.

There are a couple of significant changes in this patch over the
existing OpenMP reduction implementation. The first change is that
reductions no longer rely on special ganglocal mappings. Certain
targets, such as nvptx gpus, have a distributed memory hierarchy. On
nvptx targets, all of the processors are partitioned into blocks. Each
block has a limited amount of shared memory. Because of the OpenACC spec
is written, we were initially mapping nvptx's shared memory into
gang-local memory. However, Nathan's worker and vector state propagator
is robust enough that we were able to eliminate the ganglocal mappings
altogether.

While this new infrastructure allows us to eliminate the ganglocal
mappings, nvptx still needs to use shared memory for worker reductions.
Consider the following example where red is private:

  #pragma acc loop worker reduction (+:red)
  for (...)
    red++;

This loop would expand to this during omp-lower:

  red = GOACC_REDUCTION_SETUP (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);
  GOACC_FORK (GOMP_DIM_WORKER);
  red = GOACC_REDUCTION_INIT (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);

  for (...)
    red++;

  red = GOACC_REDUCTION_FINI (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);
  GOACC_JOIN (GOMP_DIM_WORKER);
  red = GOACC_REDUCTION_TEARDOWN (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);

For nvptx targets, SETUP and TEARDOWN are responsible for allocating and
freeing shared memory. INIT is responsible for initializing the private
reduction variable. This is necessary for vector reductions because we
want thread 0 to contain the original value of local_var, and the other
threads to be initialized to the proper value for 'op'. All of the
intermediate reduction results are combined in FINI and written back to
var or *ref_to_res, whichever is necessary, in TEARDOWN.

I don't want to delve too much into the use of this infrastructure right
now. We do have a design for that, and I intend to present more details
when I post the lowering patch. The next patch will likely be the nvptx
changes though.

One of the reasons why we needed create this generic interface was to
implement vector reductions on nvptx targets. On nvptx targets, we're
mapping vectors to warps. That's fine, but warps cannot use spinlocks or
the warp will deadlock. As a consequence, we can't use the existing
OpenMP atomic reductions in OpenACC. The way I got around the spinlock
problem in 5.0 was by allocating an array of length vector_length, and
stashing all of the intermediate reductions in there. The later on, one
thread would merge all of those reductions together.

This new reduction infrastructure provides a more elegant solution for
OpenACC reduction. And while we're still using atomic operations for
gang and worker reductions, we're no longer using a global lock for
workers. This api allows us to use a lock in shared memory for workers.
That said, this infrastructure does provide sufficient flexibility to
implement tree reductions for gangs and workers later on.

It should be noted that this is not a replacement for the existing
OpenMP reductions. Rather, OpenMP will continue to use
lower_reduction_clauses and friends, while OpenACC will use this
infrastructure. That said, OpenMP could taught to use this infrastructure.

Is this patch OK for gomp-4_0-branch?

Thanks,
Cesar

Comments

Nathan Sidwell Aug. 20, 2015, 8:14 p.m. UTC | #1
Sigh, pdf's get filtered.  Let's try some raw tex ...

Here's the design document for the reduction implementation

nathan
diff mbox

Patch

2015-08-19  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Add a placeholder for TARGET_GOACC_REDUCTION.
	* internal-fn.c (expand_GOACC_REDUCTION_SETUP): New function.
	(expand_GOACC_REDUCTION_INIT): New function.
	(expand_GOACC_REDUCTION_FINI): New function.
	(expand_GOACC_REDUCTION_TEARDOWN): New function.
	* internal-fn.def (DEF_INTERNAL_FN):
	* omp-low.c (GOACC_REDUCTION_SETUP, GOACC_REDUCTION_INIT,
	GOACC_REDUCTION_FINI, GOACC_REDUCTION_TEARDOWN): New internal
	functions.
	* omp-low.c (execute_oacc_transform): Expand those new internal
	functions.
	(make_pass_oacc_transform): Add TODO_cleanup_cfg to todo_flags_finish.
	(default_goacc_reduction_setup): New function.
	(default_goacc_reduction_init_fini): New function.
	(default_goacc_reduction_teardown): New function.
	(default_goacc_reduction): New function.
	* target.def (reduction): New goacc target hook.
	* targhooks.h (default_goacc_reduction): Declare


diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 12e57a7..0c8ba5d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5767,6 +5767,19 @@  pass.  It should return true, if the functions should be deleted.  The
 default hook returns true, if there is no RTL expanders for them.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_GOACC_REDUCTION (gimple @var{call})
+This hook is used by the oacc_transform pass to expand calls to the
+internal functions @var{GOACC_REDUCTION_SETUP},
+@var{GOACC_REDUCTION_INIT},
+ @var{GOACC_REDUCTION_FINI} and
+ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.
+ @var{call} is gimple statement containing the call to the function.  This
+ hook removes statement @var{call} after the expanded sequence has been
+ inserted.  This hook is also responsible for allocating any storage for
+ reductions when necessary.  It returns @var{true} if the expanded
+sequence introduces any calls to OpenACC-specific internal functions.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3bf98be..dfd32a7 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4253,6 +4253,8 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_LOCK_UNLOCK
 
+@hook TARGET_GOACC_REDUCTION
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fcb9c47..9c923ed 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2051,6 +2051,39 @@  expand_GOACC_UNLOCK (gcall *ARG_UNUSED (stmt))
 #endif
 }
 
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_SETUP (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_INIT (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_FINI (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_TEARDOWN (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 0bb8a91..6c5db37 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -89,3 +89,21 @@  DEF_INTERNAL_FN (GOACC_DIM_POS, ECF_PURE | ECF_NOTHROW | ECF_LEAF, ".")
    argument is a loop identifer.  */
 DEF_INTERNAL_FN (GOACC_LOCK, ECF_NOTHROW | ECF_LEAF, "..")
 DEF_INTERNAL_FN (GOACC_UNLOCK, ECF_NOTHROW | ECF_LEAF, "..")
+
+/* REDUCTION_SETUP, REDUCTION_INIT, REDUCTION_FINI and REDUCTION_TEARDOWN
+   together define a generic interface to support gang, worker and vector
+   reductions. All of the functions take the following form
+
+     V = goacc_reduction_foo (REF_TO_RES, LOCAL_VAR, LEVEL, OP, LID, RID)
+
+   where REF_TO_RES is a reference to the original reduction variable for
+   that particular reduction, LOCAL_VAR is the intermediate reduction
+   variable. LEVEL corresponds to the GOMP_DIM of the reduction, OP is a
+   tree code of the reduction operation. LID is a unique identifier of the
+   loop within a TU and RID is a unique id for a reduction within a loop.
+   V is the resulting intermediate reduction variable returned by the
+   function.  In general, V should equal LOCAL_VAR.  */
+DEF_INTERNAL_FN (GOACC_REDUCTION_SETUP, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_INIT, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_FINI, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_TEARDOWN, ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fcf037e..2049eea 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -14671,11 +14671,16 @@  execute_oacc_transform ()
   tree attrs = get_oacc_fn_attrib (current_function_decl);
   int dims[GOMP_DIM_MAX];
   tree purpose[GOMP_DIM_MAX];
+  bool needs_rescan;
   
   if (!attrs)
     /* Not an offloaded function.  */
     return 0;
 
+  /* Offloaded targets may introduce new basic blocks, which require
+     dominance information to update SSA.  */
+  calculate_dominance_info (CDI_DOMINATORS);
+
   {
     unsigned ix;
     tree pos = TREE_VALUE (attrs);
@@ -14725,59 +14730,74 @@  execute_oacc_transform ()
 	replace_oacc_fn_attrib (current_function_decl, pos);
       }
   }
-  
-  FOR_ALL_BB_FN (bb, cfun)
-    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
-      {
-	gimple stmt = gsi_stmt (gsi);
 
-	if (!is_gimple_call (stmt))
-	  ; /* Nothing.  */
-	else if (gimple_call_builtin_p (stmt, BUILT_IN_ACC_ON_DEVICE))
-	  /* acc_on_device must be evaluated at compile time for
-	     constant arguments.  */
-	  {
-	    gsi_next (&gsi);
-	    oacc_xform_on_device (stmt);
-	    continue;
-	  }
-	else if (gimple_call_internal_p (stmt))
+  do
+    {
+      needs_rescan = false;
+
+      FOR_ALL_BB_FN (bb, cfun)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
 	  {
-	    unsigned ifn_code = gimple_call_internal_fn (stmt);
-	    switch (ifn_code)
-	      {
-	      default: break;
+	    gimple stmt = gsi_stmt (gsi);
 
-	      case IFN_GOACC_DIM_POS:
-	      case IFN_GOACC_DIM_SIZE:
+	    if (!is_gimple_call (stmt))
+	      ; /* Nothing.  */
+	    else if (gimple_call_builtin_p (stmt, BUILT_IN_ACC_ON_DEVICE))
+	      /* acc_on_device must be evaluated at compile time for
+		 constant arguments.  */
+	      {
 		gsi_next (&gsi);
-		oacc_xform_dim (stmt, dims, ifn_code == IFN_GOACC_DIM_POS);
+		oacc_xform_on_device (stmt);
 		continue;
+	      }
+	    else if (gimple_call_internal_p (stmt))
+	      {
+		unsigned ifn_code = gimple_call_internal_fn (stmt);
+		int retval = 0;
+		switch (ifn_code)
+		  {
+		  default: break;
 
-	      case IFN_GOACC_LOCK:
-	      case IFN_GOACC_UNLOCK:
-		if (targetm.goacc.lock_unlock
-		    (stmt, dims, ifn_code == IFN_GOACC_LOCK))
-		  goto remove;
-		break;
+		  case IFN_GOACC_DIM_POS:
+		  case IFN_GOACC_DIM_SIZE:
+		    gsi_next (&gsi);
+		    oacc_xform_dim (stmt, dims, ifn_code == IFN_GOACC_DIM_POS);
+		    continue;
 
-	      case IFN_GOACC_FORK:
-	      case IFN_GOACC_JOIN:
-		if (targetm.goacc.fork_join
-		    (stmt, dims, ifn_code == IFN_GOACC_FORK))
-		  {
-		  remove:
-		    replace_uses_by (gimple_vdef (stmt),
-				     gimple_vuse (stmt));
-		    gsi_remove (&gsi, true);
-		    /* Removal will have advanced the iterator.  */
+		  case IFN_GOACC_LOCK:
+		  case IFN_GOACC_UNLOCK:
+		    if (targetm.goacc.lock_unlock
+			(stmt, dims, ifn_code == IFN_GOACC_LOCK))
+		      goto remove;
+		    break;
+
+		  case IFN_GOACC_REDUCTION_SETUP:
+		  case IFN_GOACC_REDUCTION_INIT:
+		  case IFN_GOACC_REDUCTION_FINI:
+		  case IFN_GOACC_REDUCTION_TEARDOWN:
+		    gsi_next (&gsi);
+		    if (targetm.goacc.reduction (stmt))
+		      needs_rescan = true;
 		    continue;
+
+		  case IFN_GOACC_FORK:
+		  case IFN_GOACC_JOIN:
+		    if (targetm.goacc.fork_join
+			(stmt, dims, ifn_code == IFN_GOACC_FORK))
+		      {
+		      remove:
+			replace_uses_by (gimple_vdef (stmt),
+					 gimple_vuse (stmt));
+			gsi_remove (&gsi, true);
+			/* Removal will have advanced the iterator.  */
+			continue;
+		      }
+		    break;
 		  }
-		break;
 	      }
+	    gsi_next (&gsi);
 	  }
-	gsi_next (&gsi);
-      }
+    } while (needs_rescan);
 
   return 0;
 }
@@ -14875,7 +14895,7 @@  const pass_data pass_data_oacc_transform =
   0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
-  TODO_update_ssa, /* todo_flags_finish */
+  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
 };
 
 class pass_oacc_transform : public gimple_opt_pass
@@ -14906,4 +14926,145 @@  make_pass_oacc_transform (gcc::context *ctxt)
   return new pass_oacc_transform (ctxt);
 }
 
+/* Default implementation of targetm.goacc.reduction_setup.  This hook
+   provides a baseline implementation for the internal function
+   GOACC_REDUCTION_SETUP for a single-threaded target.  I.e. num_gangs =
+   num_workers = vector_length = 1.
+
+   Given:
+
+     V = IFN_RED_SETUP (RES_PTR, LOCAL, LEVEL, OP. LID, RID)
+
+   Expand to:
+
+     V = RES_PTR ? *RES_PTR : LOCAL;
+*/
+
+static bool
+default_goacc_reduction_setup (gimple call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree v = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 0);
+  tree local_var = gimple_call_arg (call, 1);
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (!integer_zerop (ref_to_res))
+    {
+      tree x = build_simple_mem_ref (ref_to_res);
+      gimplify_assign (v, x, &seq);
+    }
+  else
+    gimplify_assign (v, local_var, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+
+  return false;
+}
+
+/* Default implementation for both targetm.goacc.reduction_init and
+   reduction_fini.  This hook provides a baseline implementation for the
+   internal functions GOACC_REDUCTION_INIT and GOACC_REDUCTION_FINI for a
+   single-threaded target.
+
+   Given:
+
+     V = IFN_RED_INIT (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   or
+
+     V = IFN_RED_FINI (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   Expand to:
+
+     V = LOCAL;
+*/
+
+static bool
+default_goacc_reduction_init_fini (gimple call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree v = gimple_call_lhs (call);
+  tree local_var = gimple_call_arg (call, 1);
+  gimple g;
+
+  g = gimple_build_assign (v, local_var);
+  gsi_replace (&gsi, g, true);
+
+  return false;
+}
+
+/* Default implementation of targetm.goacc.reduction_teardown.  This hook
+   provides a baseline implementation for the internal function
+   GOACC_REDUCTION_TEARDOWN for a single-threaded target.
+
+   Given:
+
+     IFN_RED_TEARDOWN (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   Expand to:
+
+     if (RES_PTR)
+       *RES_PTR = LOCAL;
+
+    V = LOCAL;
+*/
+
+static bool
+default_goacc_reduction_teardown (gimple call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 0);
+  tree var = gimple_call_arg (call, 1);
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (!integer_zerop (ref_to_res))
+    {
+      tree x = build_simple_mem_ref (ref_to_res);
+      gimplify_assign (x, var, &seq);
+    }
+
+  if (lhs != NULL_TREE)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+
+  return false;
+}
+
+/* Default goacc.reduction early expander.  */
+
+bool
+default_goacc_reduction (gimple call)
+{
+  /* Reductions modify the SSA names in complicated ways.  Let update_ssa
+     correct it.  */
+  mark_virtual_operands_for_renaming (cfun);
+
+  switch (gimple_call_internal_fn (call))
+    {
+    case IFN_GOACC_REDUCTION_SETUP:
+      return default_goacc_reduction_setup (call);
+
+    case IFN_GOACC_REDUCTION_INIT:
+    case IFN_GOACC_REDUCTION_FINI:
+      return default_goacc_reduction_init_fini (call);
+
+    case IFN_GOACC_REDUCTION_TEARDOWN:
+      return default_goacc_reduction_teardown (call);
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
 #include "gt-omp-low.h"
diff --git a/gcc/target.def b/gcc/target.def
index fa5670a..550db6a 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1679,6 +1679,15 @@  default hook returns true, if there is no RTL expanders for them.",
 bool, (gimple, const int[], bool),
 default_goacc_lock_unlock)
 
+DEFHOOK
+(reduction,
+"This hook is used by the oacc_transform pass to expand calls to the\n\
+internal functions @var{GOACC_REDUCTION_SETUP},\n\
+@var{GOACC_REDUCTION_INIT},\n\ @var{GOACC_REDUCTION_FINI} and\n\ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\ @var{call} is gimple statement containing the call to the function.  This\n\ hook removes statement @var{call} after the expanded sequence has been\n\ inserted.  This hook is also responsible for allocating any storage for\n\ reductions when necessary.  It returns @var{true} if the expanded\n\
+sequence introduces any calls to OpenACC-specific internal functions.",
+bool, (gimple call),
+default_goacc_reduction)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 0e7f13d..ddde78d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -107,6 +107,7 @@  extern unsigned default_add_stmt_cost (void *, int, enum vect_cost_for_stmt,
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
+extern bool default_goacc_reduction (gimple);
 extern bool default_goacc_validate_dims (tree, int [], int);
 extern unsigned default_goacc_dim_limit (unsigned);
 extern bool default_goacc_fork_join (gimple, const int [], bool);