[2/3] OpenACC reductions

Message ID 563790D6.2070108@acm.org
State New

Commit Message

Nathan Sidwell Nov. 2, 2015, 4:35 p.m. UTC
This patch contains the PTX backend pieces of OpenACC reduction handling.  The 
reduction calls are lowered to gimple, using a couple of PTX-specific builtins 
for some of the functionality.  Expansion to RTL introduced no new patterns.

We need 3 different schemes for the 3 different partitioning axes, but 
fortunately there is still a lot of commonality.

For the gang level, there is usually a receiver object (if there isn't, the 
program is not well formed).  Each gang of execution has its own instance of 
the reduction variable, which it initializes with an operator-specific init 
value (a zero or all-bits of appropriate type).  To merge, we use a lockless[1] 
update scheme on the receiver object.  Thus, setup and teardown are simply 
copies of the local variable.  INIT is a constant initializer and FINI is the update.

For worker level, we need to allocate a buffer in .shared memory local to the 
CTA.  This buffer is reused for each set of reductions, so we only need to size 
it to the maximum value across the program, in a similar manner to the worker 
propagation buffer.  (Unfortunately both need to be live concurrently, so we 
can't share them).

At the setup call, we copy the incoming value (from receiver object or 
LOCAL_VAR) to a slot in the buffer.  At the init call, we initialize to the 
operator-specific value.  At the fini call we do a lockless update on the 
worker reduction buffer slot.  At the teardown call we read from the reduction 
buffer and possibly write to the receiver object.

The worker reduction buffer slots are the OFFSET argument of the reduction 
calls.  This is the only use we make of this operand for PTX.

For vector level, we can use shuffle instructions to copy from another vector 
lane and arrange a binary tree of combinations to provide a final value.  The 
setup call reads the receiver object, if there is one.  The init call 
initializes all but vector-zero with an operator-specific value.  Vector zero 
carries the incoming value.  The finalize call expands to the binary tree of 
shuffles.  For PTX this is 5 steps[2].  The teardown call writes back to the 
receiver object, if there is one.

[1] We use a lockless update for gang and worker level that looks somewhat like:

actual = INIT_VAL (OP);
do
  {
    guess = actual;
    result = guess OP myval;
    actual = atomic_cmp_exchange (*REF_TO_RES, guess, result);
  }
while (actual != guess);

The reason for this scheme is that a locking scheme doesn't work across workers 
-- it deadlocks, apparently due to resource starvation.  The above is guaranteed 
to make progress.  Further, it is easier to optimize to target-specific 
atomics, should OP be supported directly.

[2] We have to emit an unrolled loop for this tree.  If it were a loop, we'd 
need to mark the loop branch as unified, and we don't have a mechanism for that 
at the gimple level.  I have some ideas as to how to do that, but have not had 
time to investigate.  We're fixing the vector_length to 32 in this manner, which 
is what other compilers appear to do anyway.

Bernd, I think the builtin fn bits you originally wrote got completely rewritten 
by me several times.

nathan

Comments

Jakub Jelinek Nov. 4, 2015, 10:01 a.m. UTC | #1
On Mon, Nov 02, 2015 at 11:35:34AM -0500, Nathan Sidwell wrote:
> 2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
> 	    Cesar Philippidis  <cesar@codesourcery.com>
> 
> 	* config/nvptx/nvptx.c: Include gimple headers.
> 	(worker_red_size, worker_red_align, worker_red_name,
...

I think you can approve this yourself, or do you want Bernd to review it?

	Jakub
Bernd Schmidt Nov. 4, 2015, 1:27 p.m. UTC | #2
On 11/02/2015 05:35 PM, Nathan Sidwell wrote:
>
> +/* Size of buffer needed for worker reductions.  This has to be

Maybe "description" rather than "Size" since there are really four 
variables we're covering with the comment.

> +      worker_red_size = (worker_red_size + worker_red_align - 1)
> +	& ~(worker_red_align - 1);

Formatting. Wrap the entire multi-line expression in parentheses to get 
editors to align the & operator.

> +static rtx
> +nvptx_expand_cmp_swap (tree exp, rtx target,
> +		       machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))

Add a comment. You're using ATTRIBUTE_UNUSED and ARG_UNUSED in this 
patch, it would be good to be consistent - I'm still not sure which 
style is preferred after the switch to C++, so as far as I'm concerned 
just pick one.

> +{
> +#define DEF(ID, NAME, T)						\
> +  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID] =				\
> +   add_builtin_function ("__builtin_nvptx_" NAME,			\
> +			 build_function_type_list T,			\
> +			 NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))

I think the assignment operator should start the line like all others, 
but other code in gcc is pretty inconsistent in that department.

> +static tree
> +nvptx_get_worker_red_addr (tree type, tree offset)

Add a comment.

> +  switch (TYPE_MODE (TREE_TYPE (var)))
> +    {
> +    case SFmode:
> +      code = VIEW_CONVERT_EXPR;
> +      /* FALLTHROUGH */
> +    case SImode:
> +      break;
> +
> +    case DFmode:
> +      code = VIEW_CONVERT_EXPR;
> +      /* FALLTHROUGH  */
> +    case DImode:
> +      type = long_long_unsigned_type_node;
> +      fn = NVPTX_BUILTIN_CMP_SWAPLL;
> +      break;
> +
> +    default:
> +      gcc_unreachable ();
> +    }

There are two such switch statements, and it's possible to write this 
more compactly:
   if (!INTEGRAL_MODE_P (...))
     code = VIEW_CONVERT_EXPR;
   if (GET_MODE_SIZE (...) == 8)
     fn = CMP_SWAPLL;
Not required, you can decide which you like better.

Otherwise I have no objections.


Bernd
Nathan Sidwell Nov. 4, 2015, 1:56 p.m. UTC | #3
On 11/04/15 05:01, Jakub Jelinek wrote:
> On Mon, Nov 02, 2015 at 11:35:34AM -0500, Nathan Sidwell wrote:
>> 2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
>> 	    Cesar Philippidis  <cesar@codesourcery.com>
>>
>> 	* config/nvptx/nvptx.c: Include gimple headers.
>> 	(worker_red_size, worker_red_align, worker_red_name,
> ...
>
> I think you can approve this yourself, or do you want Bernd to review it?

I was posting for completeness, and Bernd often spots things I missed.

thanks all!

nathan
Nathan Sidwell Nov. 4, 2015, 2:09 p.m. UTC | #4
On 11/04/15 08:27, Bernd Schmidt wrote:
> On 11/02/2015 05:35 PM, Nathan Sidwell wrote:
>>

> There are two such switch statements, and it's possible to write this more
> compactly:
>    if (!INTEGRAL_MODE_P (...))
>      code = VIEW_CONVERT_EXPR;
>    if (GET_MODE_SIZE (...) == 8)
>      fn = CMP_SWAPLL;
> Not required, you can decide which you like better.

thanks, I hadn't noticed that (the switch was originally more complex)

nathan

Patch

2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	* config/nvptx/nvptx.c: Include gimple headers.
	(worker_red_size, worker_red_align, worker_red_name,
	worker_red_sym): New.
	(nvptx_option_override): Initialize worker reduction buffer.
	(nvptx_file_end): Write out worker reduction buffer var.
	(nvptx_expand_shuffle, nvptx_expand_worker_addr,
	nvptx_expand_cmp_swap): New builtin expanders.
	(enum nvptx_builtins): New.
	(nvptx_builtin_decls): New.
	(nvptx_builtin_decl, nvptx_init_builtins, nvptx_expand_builtin): New.
	(PTX_VECTOR_LENGTH, PTX_WORKER_LENGTH): New.
	(nvptx_get_worker_red_addr, nvptx_generate_vector_shuffle,
	nvptx_lockless_update): New helpers.
	(nvptx_goacc_reduction_setup, nvptx_goacc_reduction_init,
	nvptx_goacc_reduction_fini, nvptx_goacc_reduction_teardown): New.
	(nvptx_goacc_reduction): New.
	(TARGET_INIT_BUILTINS, TARGET_EXPAND_BUILTIN,
	TARGET_BUILTIN_DECL): Override.
	(TARGET_GOACC_REDUCTION): Override.

Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 229667)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -57,6 +57,15 @@ 
 #include "omp-low.h"
 #include "gomp-constants.h"
 #include "dumpfile.h"
+#include "internal-fn.h"
+#include "gimple-iterator.h"
+#include "stringpool.h"
+#include "tree-ssa-operands.h"
+#include "tree-ssanames.h"
+#include "gimplify.h"
+#include "tree-phinodes.h"
+#include "cfgloop.h"
+#include "fold-const.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -98,6 +107,14 @@  static unsigned worker_bcast_align;
 #define worker_bcast_name "__worker_bcast"
 static GTY(()) rtx worker_bcast_sym;
 
+/* Size of buffer needed for worker reductions.  This has to be
+   distinct from the worker broadcast array, as both may be live
+   concurrently.  */
+static unsigned worker_red_size;
+static unsigned worker_red_align;
+#define worker_red_name "__worker_red"
+static GTY(()) rtx worker_red_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -128,6 +145,9 @@  nvptx_option_override (void)
 
   worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
   worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
+  worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, worker_red_name);
+  worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -3246,8 +3266,199 @@  nvptx_file_end (void)
 	       worker_bcast_align,
 	       worker_bcast_name, worker_bcast_size);
     }
+
+  if (worker_red_size)
+    {
+      /* Define the reduction buffer.  */
+
+      worker_red_size = (worker_red_size + worker_red_align - 1)
+	& ~(worker_red_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_red_name);
+      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
+	       worker_red_align,
+	       worker_red_name, worker_red_size);
+    }
+}
+
+/* Expander for the shuffle builtins.  */
+
+static rtx
+nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
+{
+  if (ignore)
+    return target;
+  
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+
+  rtx idx = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx op = expand_expr (CALL_EXPR_ARG  (exp, 2),
+			NULL_RTX, SImode, EXPAND_NORMAL);
+  
+  if (!REG_P (idx) && GET_CODE (idx) != CONST_INT)
+    idx = copy_to_mode_reg (SImode, idx);
+
+  rtx pat = nvptx_gen_shuffle (target, src, idx, INTVAL (op));
+  if (pat)
+    emit_insn (pat);
+
+  return target;
+}
+
+/* Worker reduction address expander.  */
+
+static rtx
+nvptx_expand_worker_addr (tree exp, rtx target,
+			  machine_mode ARG_UNUSED (mode), int ignore)
+{
+  if (ignore)
+    return target;
+
+  unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
+  if (align > worker_red_align)
+    worker_red_align = align;
+
+  unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
+  unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
+  if (size + offset > worker_red_size)
+    worker_red_size = size + offset;
+
+  emit_insn (gen_rtx_SET (target, worker_red_sym));
+
+  if (offset)
+    emit_insn (gen_rtx_SET (target,
+			    gen_rtx_PLUS (Pmode, target, GEN_INT (offset))));
+
+  emit_insn (gen_rtx_SET (target,
+			  gen_rtx_UNSPEC (Pmode, gen_rtvec (1, target),
+					  UNSPEC_FROM_SHARED)));
+
+  return target;
+}
+
+static rtx
+nvptx_expand_cmp_swap (tree exp, rtx target,
+		       machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))
+{
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  
+  if (!target)
+    target = gen_reg_rtx (mode);
+
+  rtx mem = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, Pmode, EXPAND_NORMAL);
+  rtx cmp = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 2),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx pat;
+
+  mem = gen_rtx_MEM (mode, mem);
+  if (!REG_P (cmp))
+    cmp = copy_to_mode_reg (mode, cmp);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+  
+  if (mode == SImode)
+    pat = gen_atomic_compare_and_swapsi_1 (target, mem, cmp, src, const0_rtx);
+  else
+    pat = gen_atomic_compare_and_swapdi_1 (target, mem, cmp, src, const0_rtx);
+
+  emit_insn (pat);
+
+  return target;
+}
+
+
+/* Codes for all the NVPTX builtins.  */
+enum nvptx_builtins
+{
+  NVPTX_BUILTIN_SHUFFLE,
+  NVPTX_BUILTIN_SHUFFLELL,
+  NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_CMP_SWAP,
+  NVPTX_BUILTIN_CMP_SWAPLL,
+  NVPTX_BUILTIN_MAX
+};
+
+static GTY(()) tree nvptx_builtin_decls[NVPTX_BUILTIN_MAX];
+
+/* Return the NVPTX builtin for CODE.  */
+
+static tree
+nvptx_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
+{
+  if (code >= NVPTX_BUILTIN_MAX)
+    return error_mark_node;
+
+  return nvptx_builtin_decls[code];
+}
+
+/* Set up all builtin functions for this target.  */
+
+static void
+nvptx_init_builtins (void)
+{
+#define DEF(ID, NAME, T)						\
+  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID] =				\
+   add_builtin_function ("__builtin_nvptx_" NAME,			\
+			 build_function_type_list T,			\
+			 NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))
+#define ST sizetype
+#define UINT unsigned_type_node
+#define LLUINT long_long_unsigned_type_node
+#define PTRVOID ptr_type_node
+
+  DEF (SHUFFLE, "shuffle", (UINT, UINT, UINT, UINT, NULL_TREE));
+  DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
+  DEF (WORKER_ADDR, "worker_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
+  DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
+
+#undef DEF
+#undef ST
+#undef UINT
+#undef LLUINT
+#undef PTRVOID
+}
+
+/* Expand an expression EXP that calls a built-in function,
+   with result going to TARGET if that's convenient
+   (and in mode MODE if that's convenient).
+   SUBTARGET may be used as the target for computing one of EXP's operands.
+   IGNORE is nonzero if the value is to be ignored.  */
+
+static rtx
+nvptx_expand_builtin (tree exp, rtx target, rtx subtarget ATTRIBUTE_UNUSED,
+		      machine_mode mode, int ignore)
+{
+  tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
+  switch (DECL_FUNCTION_CODE (fndecl))
+    {
+    case NVPTX_BUILTIN_SHUFFLE:
+    case NVPTX_BUILTIN_SHUFFLELL:
+      return nvptx_expand_shuffle (exp, target, mode, ignore);
+
+    case NVPTX_BUILTIN_WORKER_ADDR:
+      return nvptx_expand_worker_addr (exp, target, mode, ignore);
+
+    case NVPTX_BUILTIN_CMP_SWAP:
+    case NVPTX_BUILTIN_CMP_SWAPLL:
+      return nvptx_expand_cmp_swap (exp, target, mode, ignore);
+
+    default: gcc_unreachable ();
+    }
 }
 
+/* Define dimension sizes for known hardware.  */
+#define PTX_VECTOR_LENGTH 32
+#define PTX_WORKER_LENGTH 32
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  */
@@ -3284,6 +3495,422 @@  nvptx_goacc_fork_join (gcall *call, cons
   return true;
 }
 
+
+static tree
+nvptx_get_worker_red_addr (tree type, tree offset)
+{
+  machine_mode mode = TYPE_MODE (type);
+  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
+  tree align = build_int_cst (unsigned_type_node,
+			      GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
+  tree call = build_call_expr (fndecl, 3, offset, size, align);
+
+  return fold_convert (build_pointer_type (type), call);
+}
+
+/* Emit a SHFL.DOWN using index SHFL of VAR into DEST_VAR.  This function
+   will cast the variable if necessary.  */
+
+static void
+nvptx_generate_vector_shuffle (location_t loc,
+			       tree dest_var, tree var, unsigned shift,
+			       gimple_seq *seq)
+{
+  unsigned fn = NVPTX_BUILTIN_SHUFFLE;
+  tree_code code = NOP_EXPR;
+  tree type = unsigned_type_node;
+
+  switch (TYPE_MODE (TREE_TYPE (var)))
+    {
+    case SFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH */
+    case SImode:
+      break;
+
+    case DFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH  */
+    case DImode:
+      type = long_long_unsigned_type_node;
+      fn = NVPTX_BUILTIN_SHUFFLELL;
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  tree call = nvptx_builtin_decl (fn, true);
+  call = build_call_expr_loc
+    (loc, call, 3, build1 (code, type, var),
+     build_int_cst (unsigned_type_node, shift),
+     build_int_cst (unsigned_type_node, SHUFFLE_DOWN));
+
+  call = fold_build1 (code, TREE_TYPE (dest_var), call);
+
+  gimplify_assign (dest_var, call, seq);
+}
+
+/* Insert code to locklessly update  *PTR with *PTR OP VAR just before
+   GSI.  */
+
+static tree
+nvptx_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
+		       tree ptr, tree var, tree_code op)
+{
+  unsigned fn = NVPTX_BUILTIN_CMP_SWAP;
+  tree_code code = NOP_EXPR;
+  tree type = unsigned_type_node;
+
+  switch (TYPE_MODE (TREE_TYPE (var)))
+    {
+    case SFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH */
+    case SImode:
+      break;
+
+    case DFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH  */
+    case DImode:
+      type = long_long_unsigned_type_node;
+      fn = NVPTX_BUILTIN_CMP_SWAPLL;
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  gimple_seq init_seq = NULL;
+  tree init_var = make_ssa_name (type);
+  tree init_expr = omp_reduction_init_op (loc, op, TREE_TYPE (var));
+  init_expr = fold_build1 (code, type, init_expr);
+  gimplify_assign (init_var, init_expr, &init_seq);
+  gimple *init_end = gimple_seq_last (init_seq);
+
+  gsi_insert_seq_before (gsi, init_seq, GSI_SAME_STMT);
+  
+  gimple_seq loop_seq = NULL;
+  tree expect_var = make_ssa_name (type);
+  tree actual_var = make_ssa_name (type);
+  tree write_var = make_ssa_name (type);
+  
+  tree write_expr = fold_build1 (code, TREE_TYPE (var), expect_var);
+  write_expr = fold_build2 (op, TREE_TYPE (var), write_expr, var);
+  write_expr = fold_build1 (code, type, write_expr);
+  gimplify_assign (write_var, write_expr, &loop_seq);
+
+  tree swap_expr = nvptx_builtin_decl (fn, true);
+  swap_expr = build_call_expr_loc (loc, swap_expr, 3,
+				   ptr, expect_var, write_var);
+  gimplify_assign (actual_var, swap_expr, &loop_seq);
+
+  gcond *cond = gimple_build_cond (EQ_EXPR, actual_var, expect_var,
+				   NULL_TREE, NULL_TREE);
+  gimple_seq_add_stmt (&loop_seq, cond);
+
+  /* Split the block just after the init stmts.  */
+  basic_block pre_bb = gsi_bb (*gsi);
+  edge pre_edge = split_block (pre_bb, init_end);
+  basic_block loop_bb = pre_edge->dest;
+  pre_bb = pre_edge->src;
+  /* Reset the iterator.  */
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  /* Insert the loop statements.  */
+  gimple *loop_end = gimple_seq_last (loop_seq);
+  gsi_insert_seq_before (gsi, loop_seq, GSI_SAME_STMT);
+
+  /* Split the block just after the loop stmts.  */
+  edge post_edge = split_block (loop_bb, loop_end);
+  basic_block post_bb = post_edge->dest;
+  loop_bb = post_edge->src;
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  post_edge->flags ^= EDGE_TRUE_VALUE | EDGE_FALLTHRU;
+  edge loop_edge = make_edge (loop_bb, loop_bb, EDGE_FALSE_VALUE);
+  set_immediate_dominator (CDI_DOMINATORS, loop_bb, pre_bb);
+  set_immediate_dominator (CDI_DOMINATORS, post_bb, loop_bb);
+
+  gphi *phi = create_phi_node (expect_var, loop_bb);
+  add_phi_arg (phi, init_var, pre_edge, loc);
+  add_phi_arg (phi, actual_var, loop_edge, loc);
+
+  loop *loop = alloc_loop ();
+  loop->header = loop_bb;
+  loop->latch = loop_bb;
+  add_loop (loop, loop_bb->loop_father);
+
+  return fold_build1 (code, TREE_TYPE (var), write_var);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
+
+static void
+nvptx_goacc_reduction_setup (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Copy the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	var = build_simple_mem_ref (ref_to_res);
+    }
+  
+  if (level == GOMP_DIM_WORKER)
+    {
+      /* Store incoming value to worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+      tree ptr = make_ssa_name (TREE_TYPE (call));
+
+      gimplify_assign (ptr, call, &seq);
+      tree ref = build_simple_mem_ref (ptr);
+      TREE_THIS_VOLATILE (ref) = 1;
+      gimplify_assign (ref, var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_INIT. */
+
+static void
+nvptx_goacc_reduction_init (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code rcode
+    = (enum tree_code)TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  tree init = omp_reduction_init_op (gimple_location (call), rcode,
+				     TREE_TYPE (var));
+  gimple_seq seq = NULL;
+  
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
+      tree tid = make_ssa_name (integer_type_node);
+      tree dim_vector = gimple_call_arg (call, 3);
+      gimple *tid_call = gimple_build_call_internal (IFN_GOACC_DIM_POS, 1,
+						     dim_vector);
+      gimple *cond_stmt = gimple_build_cond (NE_EXPR, tid, integer_zero_node,
+					     NULL_TREE, NULL_TREE);
+
+      gimple_call_set_lhs (tid_call, tid);
+      gimple_seq_add_stmt (&seq, tid_call);
+      gimple_seq_add_stmt (&seq, cond_stmt);
+
+      /* Split the block just after the call.  */
+      edge init_edge = split_block (gsi_bb (gsi), call);
+      basic_block init_bb = init_edge->dest;
+      basic_block call_bb = init_edge->src;
+
+      /* Fixup flags from call_bb to init_bb.  */
+      init_edge->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+      
+      /* Set the initialization stmts.  */
+      gimple_seq init_seq = NULL;
+      tree init_var = make_ssa_name (TREE_TYPE (var));
+      gimplify_assign (init_var, init, &init_seq);
+      gsi = gsi_start_bb (init_bb);
+      gsi_insert_seq_before (&gsi, init_seq, GSI_SAME_STMT);
+
+      /* Split block just after the init stmt.  */
+      gsi_prev (&gsi);
+      edge inited_edge = split_block (gsi_bb (gsi), gsi_stmt (gsi));
+      basic_block dst_bb = inited_edge->dest;
+      
+      /* Create false edge from call_bb to dst_bb.  */
+      edge nop_edge = make_edge (call_bb, dst_bb, EDGE_FALSE_VALUE);
+
+      /* Create phi node in dst block.  */
+      gphi *phi = create_phi_node (lhs, dst_bb);
+      add_phi_arg (phi, init_var, inited_edge, gimple_location (call));
+      add_phi_arg (phi, var, nop_edge, gimple_location (call));
+
+      /* Reset dominator of dst bb.  */
+      set_immediate_dominator (CDI_DOMINATORS, dst_bb, call_bb);
+
+      /* Reset the gsi.  */
+      gsi = gsi_for_stmt (call);
+    }
+  else
+    {
+      if (level == GOMP_DIM_GANG)
+	{
+	  /* If there's no receiver object, propagate the incoming VAR.  */
+	  tree ref_to_res = gimple_call_arg (call, 1);
+	  if (integer_zerop (ref_to_res))
+	    init = var;
+	}
+
+      gimplify_assign (lhs, init, &seq);
+    }
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_FINI.  */
+
+static void
+nvptx_goacc_reduction_fini (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 1);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code op
+    = (enum tree_code)TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  gimple_seq seq = NULL;
+  tree r = NULL_TREE;;
+
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      /* Emit binary shuffle tree.  TODO. Emit this as an actual loop,
+	 but that requires a method of emitting a unified jump at the
+	 gimple level.  */
+      for (int shfl = PTX_VECTOR_LENGTH / 2; shfl > 0; shfl = shfl >> 1)
+	{
+	  tree other_var = make_ssa_name (TREE_TYPE (var));
+	  nvptx_generate_vector_shuffle (gimple_location (call),
+					 other_var, var, shfl, &seq);
+
+	  r = make_ssa_name (TREE_TYPE (var));
+	  gimplify_assign (r, fold_build2 (op, TREE_TYPE (var),
+					   var, other_var), &seq);
+	  var = r;
+	}
+    }
+  else
+    {
+      tree accum = NULL_TREE;
+
+      if (level == GOMP_DIM_WORKER)
+	{
+	  /* Get reduction buffer address.  */
+	  tree offset = gimple_call_arg (call, 5);
+	  tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+	  tree ptr = make_ssa_name (TREE_TYPE (call));
+
+	  gimplify_assign (ptr, call, &seq);
+	  accum = ptr;
+	}
+      else if (integer_zerop (ref_to_res))
+	r = var;
+      else
+	accum = ref_to_res;
+
+      if (accum)
+	{
+	  /* Locklessly update the accumulator.  */
+	  gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+	  seq = NULL;
+	  r = nvptx_lockless_update (gimple_location (call), &gsi,
+				     accum, var, op);
+	}
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, r, &seq);
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
+
+static void
+nvptx_goacc_reduction_teardown (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+  
+  push_gimplify_context (true);
+  if (level == GOMP_DIM_WORKER)
+    {
+      /* Read the worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree call = nvptx_get_worker_red_addr(TREE_TYPE (var), offset);
+      tree ptr = make_ssa_name (TREE_TYPE (call));
+
+      gimplify_assign (ptr, call, &seq);
+      var = build_simple_mem_ref (ptr);
+      TREE_THIS_VOLATILE (var) = 1;
+    }
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Write to the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	gimplify_assign (build_simple_mem_ref (ref_to_res), var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+  
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX reduction expander.  */
+
+void
+nvptx_goacc_reduction (gcall *call)
+{
+  unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+
+  switch (code)
+    {
+    case IFN_GOACC_REDUCTION_SETUP:
+      nvptx_goacc_reduction_setup (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_INIT:
+      nvptx_goacc_reduction_init (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_FINI:
+      nvptx_goacc_reduction_fini (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_TEARDOWN:
+      nvptx_goacc_reduction_teardown (call);
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -3373,12 +4000,22 @@  nvptx_goacc_fork_join (gcall *call, cons
 #undef TARGET_CANNOT_COPY_INSN_P
 #define TARGET_CANNOT_COPY_INSN_P nvptx_cannot_copy_insn_p
 
+#undef TARGET_INIT_BUILTINS
+#define TARGET_INIT_BUILTINS nvptx_init_builtins
+#undef TARGET_EXPAND_BUILTIN
+#define TARGET_EXPAND_BUILTIN nvptx_expand_builtin
+#undef  TARGET_BUILTIN_DECL
+#define TARGET_BUILTIN_DECL nvptx_builtin_decl
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
+#undef TARGET_GOACC_REDUCTION
+#define TARGET_GOACC_REDUCTION nvptx_goacc_reduction
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"