
[gomp] Move openacc vector & worker single handling to RTL

Message ID 5597120D.2080308@acm.org

Commit Message

Nathan Sidwell July 3, 2015, 10:51 p.m. UTC
This patch reorganizes the handling of vector and worker single modes and their 
transitions to/from partitioned mode out of omp-low and into mach-dep-reorg. 
That allows the regular middle-end optimizers to behave normally -- with two 
exceptions; see below.

There are no libgomp regressions, and a number of progressions -- mainly private 
variables now 'just work'.

The approach taken is to have expand_omp_for_static_(no)chunk emit OpenACC 
builtins at the start and end of the loop -- the points where execution should 
transition into a partitioned mode and back to single mode.  I've actually used 
a single builtin with a constant argument to say whether it is the head or tail 
of the loop.  You could consider these to be like 'fork' and 'join' primitives, 
if that helps.
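
For concreteness, a sketch (not dump-exact) of what the loop region then looks 
like in GIMPLE terms; the argument order follows the new GOACC_loop builtin 
below -- head/tail flag first, then the partitioning level, shown symbolically:

   __builtin_GOACC_loop (0, OACC_vector);  /* head: fork into partitioned mode */
   ... distributed loop body ...
   __builtin_GOACC_loop (1, OACC_vector);  /* tail: join back to single mode */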

We cope with multi-mode loops (over, say, the worker & vector dimensions) by 
emitting two loop heads and tails in nested sequence, i.e. 'head-worker, 
head-vector <loop> tail-vector tail-worker'.  Thus at a transition we only have 
to consider one particular axis.

These builtins are made known to the duplication and merging optimizations as 
not-to-be duplicated or merged (see builtin_unique_p).  For instance, the jump 
threading optimizer already has to check that operations on the potentially 
threaded path are suitable for duplication, and this is an additional test 
there.  The tail-merging optimizer similarly has to determine that tails are 
identical, and that is never true for this particular builtin.  The intent is 
that the loops are then maintained as single-entry-single-exit all the way 
through to RTL expansion.
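
A minimal sketch of the test such a pass now makes (following the gimple.c and 
tree-ssa-threadedge.c hunks below; STMT stands for a call statement on the path 
under consideration):

   tree decl = gimple_call_fndecl (stmt);
   if (decl && DECL_BUILT_IN (decl) && builtin_unique_p (decl))
     /* A unique builtin: two calls are never 'the same' and a call
        must not be duplicated, so threading and tail merging back off.  */
     return false;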

Where and when these builtins are expanded to target-specific code is not 
fixed.  In the case of PTX they go all the way to RTL expansion.

At RTL expansion the builtins are expanded to volatile unspecs.  We insert 'pre' 
markers too, as some code needs to know the last instruction before the 
transition.  These are uncopyable, and AFAICT RTL doesn't do tail merging (or 
at least I've not encountered it), so again these cause the SESE nature of the 
loop to be preserved all the way to mach-dep reorg.
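
As a sketch, for a worker-level loop the marker stream at RTL expansion ends up 
as below (kind encoding per the #defines in nvptx.c: head 0, tail 1, prehead 2, 
pretail 3; the prehead is only emitted for the worker axis, see 
nvptx_expand_oacc_loop):

   emit_insn (gen_nvptx_loop (GEN_INT (2), mode));  /* prehead: spill point */
   emit_insn (gen_nvptx_loop (GEN_INT (0), mode));  /* head: fill point */
   /* ... partitioned loop ... */
   emit_insn (gen_nvptx_loop (GEN_INT (3), mode));  /* pretail */
   emit_insn (gen_nvptx_loop (GEN_INT (1), mode));  /* tail */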

That's where the fun starts.  We scan the CFG looking for the loop markers. 
First we break basic blocks so the head and tail markers are the first insns of 
their block.  That prevents us needing a mode transition mid block.  We then 
rescan the graph discovering loops and adding each block to the loop in which it 
resides.  The entire function is modeled as a NULL loop.

Once that is done we walk the loop structure and insert state propagation code 
at the loop head points.  For vector propagation that'll be a sequence of PTX 
shuffle instructions.  For worker propagation it is a bit more complicated.  At 
the pre-head marker, we insert a spill of state to .shared memory (executed by 
the single active worker) and at the head marker we insert a fill (executed by 
all workers).  We also insert a sync barrier before the fill.  More on where 
that memory comes from later.
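
For reference, this is the insertion sequence from the OACC_worker case of 
nvptx_process_loops below; each sequence lands just after the named insn, so 
the run-time order at the head is spill, head marker, bar.sync, fill:

   nvptx_wpropagate (false, loop->head_block, loop->head_insn);    /* fill */
   nvptx_wpropagate (true, loop->head_block, loop->pre_head_insn); /* spill */
   nvptx_wsync (false, loop->head_insn);  /* barrier, lands before the fill */
   nvptx_wsync (true, loop->pre_tail_insn);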

Finally we walk the loop structure again, inserting block or loop neutering 
code.  Where possible we try to skip entire loops[*], but the basic approach is 
the same.  We insert a branch-around at the start of the initial block and, if 
needed, insert propagation code at the end of the final block (which might be 
the same block).  The vector-propagation case is again a simple shuffle, but 
the worker case is a spill/sync/fill sequence, with the spill done by the 
single active worker.  The subsequent unified branch is marked with an unspec 
operand, rather than relying on detecting the data flow.

Note, the branch around is inserted using hidden branches that appear to the 
rest of the compiler as volatile unspecs referring to a later label.  I don't 
think the expense of creating new blocks is necessary or worthwhile -- this is 
flow control the compiler doesn't need to know about (if it did, I argue that 
we're inserting this too early).
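
A minimal sketch of the inserted skip, following nvptx_single below -- the 
resulting flow is 'if (tid.<axis>) hidden_goto end', with no new basic blocks 
created:

   rtx id = gen_reg_rtx (SImode);
   rtx pred = gen_reg_rtx (BImode);
   rtx_code_label *label = gen_label_rtx ();

   emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
   emit_insn_before (gen_rtx_SET (BImode, pred,
                                  gen_rtx_NE (BImode, id, const0_rtx)), head);
   /* The 'hidden' branch: a volatile unspec referring to the later
      label, so no new CFG edges are created.  */
   emit_insn_before (gen_br_true_hidden (pred, label,
                                         GEN_INT (mode != OACC_vector)), head);
   emit_label_after (label, tail);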

The worker spill/fill storage is a file-scope array variable, sized during 
compilation and emitted directly at the end of the compilation process.  Again, 
this is not registered with the rest of the compiler -- (a) I wasn't sure how 
to, and (b) I considered this an internal bit of the backend.  It is shared by 
all functions in this TU.  Unfortunately PTX doesn't appear to support COMMON, 
so making it shared across all TUs appears difficult -- one can always use LTO 
optimization anyway.

IMHO this is a step towards putting target-dependent handling in the target 
compiler and out of the more generic host-side compiler.

The changelog is separated into 3 parts
- a) general infrastructure
- b) additions
- c) deletions.

comments?

nathan

[*] a possible optimization is to do superblock discovery, and skip those in a 
similar manner to loop skipping.
2015-07-02  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* builtins.h (builtin_unique_p): Declare.
	* builtins.c (builtin_unique_p): New function.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for builtin_unique_p.
	* gimple.c (gimple_call_same_target_p): Add check for
	builtin_unique_p.

	Additions:
	* builtin-types.def (BT_FN_VOID_INT_INT): New.
	* builtins.c (expand_oacc_levels): New.
	(expand_oacc_loop): New.
	(expand_builtin): Adjust.
	* omp-low.c (gen_oacc_loop_head, gen_oacc_loop_tail): New.
	(expand_omp_for_static_nochunk): Add oacc loop head & tail calls.
	(expand_omp_for_static_chunk): Likewise.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add
	BUILT_IN_GOACC_LOOP.
	* omp-builtins.def (BUILT_IN_GOACC_LEVELS, BUILT_IN_GOACC_LOOP): New.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_loop): New.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_LEVELS, UNSPECV_LOOP, UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(oacc_levels, nvptx_loop): New insns.
	(oacc_loop): New expander.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(nvptx_loop_head, nvptx_loop_tail, nvptx_loop_prehead,
	nvptx_loop_pretail, LOOP_MODE_CHANGE_P): New.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_loop): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(struct reorg_unspec, struct reorg_loop): New structs.
	(unspec_map_t): New map.
	(loop_t, work_loop_t): New types.
	(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_loops,
	nvptx_discover_loops): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_loop): New.
	(nvptx_process_loops): New.
	(nvptx_neuter_loops): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_loops, nvptx_process_loops & nvptx_neuter_loops.
	(nvptx_cannot_copy_insn): Check for broadcast, sync & loop insns.
	(nvptx_file_end): Output worker broadcast array definition.

	fortran/
	* types.def (BT_FN_VOID_INT_INT): New function type.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirection checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.

Comments

Jakub Jelinek July 3, 2015, 11:11 p.m. UTC | #1
On Fri, Jul 03, 2015 at 06:51:57PM -0400, Nathan Sidwell wrote:
> IMHO this is a step towards putting target-dependent handling in the target
> compiler and out of the more generic host-side compiler.
> 
> The changelog is separated into 3 parts
> - a) general infrastructure
> - b) additions
> - c) deletions.
> 
> comments?

Thanks for working on it.

If the builtins are not meant to be used by users directly (I assume they
aren't) nor have a 1-1 correspondence to a library routine, it is much
better to emit them as internal calls (see internal-fn.{c,def}) instead of
BUILT_IN_NORMAL functions.

	Jakub
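
For reference, a sketch of the internal-fn route: internal-fn.def entries take 
a name, ECF flags and an optional fnspec string, so a hypothetical replacement 
for the GOACC_loop builtin might read:

   /* internal-fn.def */
   DEF_INTERNAL_FN (GOACC_LOOP, ECF_LEAF | ECF_NOTHROW, NULL)
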
Nathan Sidwell July 4, 2015, 8:41 p.m. UTC | #2
On 07/03/15 19:11, Jakub Jelinek wrote:
> On Fri, Jul 03, 2015 at 06:51:57PM -0400, Nathan Sidwell wrote:
>> IMHO this is a step towards putting target-dependent handling in the target
>> compiler and out of the more generic host-side compiler.
>>
>> The changelog is separated into 3 parts
>> - a) general infrastructure
>> - b) additions
>> - c) deletions.
>>
>> comments?
>
> Thanks for working on it.
>
> If the builtins are not meant to be used by users directly (I assume they
> aren't) nor have a 1-1 correspondence to a library routine, it is much
> better to emit them as internal calls (see internal-fn.{c,def}) instead of
> BUILT_IN_NORMAL functions.

thanks, Cesar pointed me at the internal builtins too -- I'll take a look.

nathan

Patch

Index: gimple.c
===================================================================
--- gimple.c	(revision 225154)
+++ gimple.c	(working copy)
@@ -68,7 +68,7 @@  along with GCC; see the file COPYING3.
 #include "lto-streamer.h"
 #include "cgraph.h"
 #include "gimple-ssa.h"
-
+#include "builtins.h"
 
 /* All the tuples have their operand vector (if present) at the very bottom
    of the structure.  Therefore, the offset required to find the
@@ -1382,10 +1382,22 @@  gimple_call_same_target_p (const_gimple
   if (gimple_call_internal_p (c1))
     return (gimple_call_internal_p (c2)
 	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      /* If it is a unique builtin call, all calls are distinct.  */
+      if (DECL_BUILT_IN (decl) && builtin_unique_p (decl))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225154)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@  struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -5248,25 +5244,6 @@  gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: builtin-types.def
===================================================================
--- builtin-types.def	(revision 225154)
+++ builtin-types.def	(working copy)
@@ -236,6 +236,7 @@  DEF_FUNCTION_TYPE_1 (BT_FN_CONST_PTR_BND
 
 DEF_POINTER_TYPE (BT_PTR_FN_VOID_PTR, BT_FN_VOID_PTR)
 
+DEF_FUNCTION_TYPE_2 (BT_FN_VOID_INT_INT, BT_VOID, BT_INT, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_PTR_INT, BT_VOID, BT_PTR, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_STRING_STRING_CONST_STRING,
 		     BT_STRING, BT_STRING, BT_CONST_STRING)
Index: builtins.c
===================================================================
--- builtins.c	(revision 225154)
+++ builtins.c	(working copy)
@@ -274,6 +274,19 @@  is_builtin_fn (tree decl)
   return TREE_CODE (decl) == FUNCTION_DECL && DECL_BUILT_IN (decl);
 }
 
+bool
+builtin_unique_p (tree decl)
+{
+  if (DECL_BUILT_IN_CLASS (decl) != BUILT_IN_NORMAL)
+    return false;
+
+  /* The OpenACC Loop markers must not be cloned or deleted.  */
+  if (DECL_FUNCTION_CODE (decl) == BUILT_IN_GOACC_LOOP)
+    return true;
+
+  return false;
+}
+
 /* Return true if NODE should be considered for inline expansion regardless
    of the optimization level.  This means whenever a function is invoked with
    its "internal" name, which normally contains the prefix "__builtin".  */
@@ -5947,20 +5960,6 @@  expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6013,64 +6012,65 @@  expand_oacc_id (enum built_in_function f
   return result;
 }
 
-static rtx
-expand_oacc_ganglocal_ptr (rtx target ATTRIBUTE_UNUSED)
+static void
+expand_oacc_levels (tree exp)
 {
-#ifdef HAVE_ganglocal_ptr
-  enum insn_code icode;
-  icode = CODE_FOR_ganglocal_ptr;
-  rtx tmp = target;
-  if (!REG_P (tmp) || GET_MODE (tmp) != Pmode)
-    tmp = gen_reg_rtx (Pmode);
-  rtx insn = GEN_FCN (icode) (tmp);
-  if (insn != NULL_RTX)
+  rtx arg = expand_normal (CALL_EXPR_ARG (exp, 0));
+  unsigned limit = OACC_LOOP_MASK (OACC_HWM);
+
+  if (GET_CODE (arg) != CONST_INT || UINTVAL (arg) >= limit)
     {
-      emit_insn (insn);
-      return tmp;
+      error ("argument to %D must be constant in range 0 to %d",
+	     get_callee_fndecl (exp), limit - 1);
+      return;
     }
+  
+#ifdef HAVE_oacc_levels
+  emit_insn (gen_oacc_levels (arg));
 #endif
-  return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
+static void
+expand_oacc_loop (tree exp)
 {
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
+  rtx arg0 = expand_normal (CALL_EXPR_ARG (exp, 0));
+  if (GET_CODE (arg0) != CONST_INT || UINTVAL (arg0) >= 2)
+    {
+      error ("first argument to %D must be constant in range 0 to 1",
+	     get_callee_fndecl (exp));
+      return;
+    }
+  rtx arg1 = expand_normal (CALL_EXPR_ARG (exp, 1));
 
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
+  if (GET_CODE (arg1) != CONST_INT || UINTVAL (arg1) >= OACC_HWM)
     {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
+      error ("second argument to %D must be constant in range 0 to %d",
+	     get_callee_fndecl (exp), OACC_HWM - 1);
+      return;
     }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+  
+#ifdef HAVE_oacc_loop
+  emit_insn (gen_oacc_loop (arg0, arg1));
+#endif
+}
 
+static rtx
+expand_oacc_ganglocal_ptr (rtx target ATTRIBUTE_UNUSED)
+{
+#ifdef HAVE_ganglocal_ptr
+  enum insn_code icode;
+  icode = CODE_FOR_ganglocal_ptr;
   rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
+  if (!REG_P (tmp) || GET_MODE (tmp) != Pmode)
+    tmp = gen_reg_rtx (Pmode);
+  rtx insn = GEN_FCN (icode) (tmp);
   if (insn != NULL_RTX)
     {
       emit_insn (insn);
       return tmp;
     }
-  return const0_rtx;
+#endif
+  return NULL_RTX;
 }
 
 /* Expand an expression EXP that calls a built-in function,
@@ -7219,20 +7219,20 @@  expand_builtin (tree exp, rtx target, rt
     case BUILT_IN_GOACC_NID:
       return expand_oacc_id (fcode, exp, target);
 
+    case BUILT_IN_GOACC_LEVELS:
+      expand_oacc_levels (exp);
+      return const0_rtx;
+
+    case BUILT_IN_GOACC_LOOP:
+      expand_oacc_loop (exp);
+      return const0_rtx;
+
     case BUILT_IN_GOACC_GET_GANGLOCAL_PTR:
       target = expand_oacc_ganglocal_ptr (target);
       if (target)
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225154)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,10 @@  DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_LEVELS, "GOACC_levels",
+		   BT_FN_VOID_INT, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_LOOP, "GOACC_loop",
+		   BT_FN_VOID_INT_INT, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: builtins.h
===================================================================
--- builtins.h	(revision 225154)
+++ builtins.h	(working copy)
@@ -50,6 +50,7 @@  extern struct target_builtins *this_targ
 extern bool force_folding_builtin_constant_p;
 
 extern bool is_builtin_fn (tree);
+extern bool builtin_unique_p (tree);
 extern bool get_object_alignment_1 (tree, unsigned int *,
 				    unsigned HOST_WIDE_INT *);
 extern unsigned int get_object_alignment (tree);
Index: fortran/types.def
===================================================================
--- fortran/types.def	(revision 225154)
+++ fortran/types.def	(working copy)
@@ -115,6 +115,7 @@  DEF_FUNCTION_TYPE_2 (BT_FN_VOID_VPTR_INT
 DEF_FUNCTION_TYPE_2 (BT_FN_BOOL_VPTR_INT, BT_BOOL, BT_VOLATILE_PTR, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_BOOL_SIZE_CONST_VPTR, BT_BOOL, BT_SIZE,
 		     BT_CONST_VOLATILE_PTR)
+DEF_FUNCTION_TYPE_2 (BT_FN_VOID_INT_INT, BT_VOID, BT_INT, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_BOOL_INT_BOOL, BT_BOOL, BT_INT, BT_BOOL)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_UINT_UINT, BT_VOID, BT_UINT, BT_UINT)
 
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225154)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,23 @@ 
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_LEVELS
+   UNSPECV_LOOP
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +258,8 @@ 
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +820,7 @@ 
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +829,34 @@ 
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1360,72 @@ 
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "oacc_levels"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_LEVELS)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// levels %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+  "// loop %0, %1;"
+)
+
+(define_expand "oacc_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+{
+  nvptx_expand_oacc_loop (operands[0], operands[1]);
 })
 
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64 bit object into 2 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack 2 32-bit ints into a 64 bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1531,8 @@ 
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225154)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@ 
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,15 @@ 
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
+
+#define nvptx_loop_head		0
+#define nvptx_loop_tail		1
+#define LOOP_MODE_CHANGE_P(X) ((X) < 2)
+#define nvptx_loop_prehead 	2
+#define nvptx_loop_pretail 	3
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +107,16 @@  static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +144,8 @@  nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1053,6 +1075,7 @@  nvptx_static_chain (const_tree fndecl, b
     return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM);
 }
 
+
 /* Emit a comparison COMPARE, and return the new test to be used in the
    jump.  */
 
@@ -1066,6 +1089,203 @@  nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc_loop primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_loop (rtx kind, rtx mode)
+{
+  /* Emit pre-tail for all loops and emit pre-head for worker level.  */
+  if (UINTVAL (kind) || UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_loop (GEN_INT (UINTVAL (kind) + 2), mode));
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default:debug_rtx (reg);gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1867,6 @@  nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1880,7 @@  condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,10 +1942,10 @@  nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
-
+      
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2102,7 @@  get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2116,6 @@  nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2131,6 @@  nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2184,780 @@  nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* An unspec of interest and the BB in which it resides.  */
+struct reorg_unspec
+{
+  rtx_insn *insn;
+  basic_block block;
+};
 
-  while (!warp_reg_worklist.is_empty ())
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct reorg_loop
+{
+  /* Parent loop.  */
+  reorg_loop *parent;
+  
+  /* Next sibling loop.  */
+  reorg_loop *next;
+
+  /* First child loop.  */
+  reorg_loop *inner;
+
+  /* Partitioning mode of the loop.  */
+  unsigned mode;
+
+  /* Partitioning used within inner loops. */
+  unsigned inner_mask;
+
+  /* Location of loop head and tail.  The head is the first block in
+     the partitioned loop and the tail is the first block out of the
+     partitioned loop.  */
+  basic_block head_block;
+  basic_block tail_block;
+
+  rtx_insn *head_insn;
+  rtx_insn *tail_insn;
+
+  rtx_insn *pre_head_insn;
+  rtx_insn *pre_tail_insn;
+
+  /* basic blocks in this loop, but not in child loops.  The HEAD and
+     PRETAIL blocks are in the loop.  The PREHEAD and TAIL blocks
+     are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  reorg_loop (reorg_loop *parent, unsigned mode);
+  ~reorg_loop ();
+};
+
+typedef auto_vec<reorg_unspec> unspec_vec_t;
+
+/* Constructor links the new loop into its parent's chain of
+   children.  */
+
+reorg_loop::reorg_loop (reorg_loop *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  head_block = tail_block = 0;
+  head_insn = tail_insn = 0;
+  pre_head_insn = pre_tail_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+reorg_loop::~reorg_loop ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to unspecs.  */
+typedef hash_map<basic_block, rtx_insn *> unspec_map_t;
+
+/* Split basic blocks so that each loop head & tail unspec is at
+   the start of its basic block.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Return the partitioning
+   execution mode of the function as a whole.  Populate MAP with head
+   and tail blocks.  We also clear the BB visited flag, which is
+   used when finding loops.  */
+
+static unsigned
+nvptx_split_blocks (unspec_map_t *map)
+{
+  auto_vec<reorg_unspec> worklist;
+  basic_block block;
+  rtx_insn *insn;
+  unsigned levels = ~0U; // Assume the worst WRT required neutering
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by the loop locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
-	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
+	  if (!INSN_P (insn))
 	    continue;
+	  switch (recog_memoized (insn))
+	    {
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_oacc_levels:
+	      /* We just need to detect this and note its argument.  */
+	      {
+		unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0));
+		/* If we see this multiple times, this should all
+		   agree.  */
+		gcc_assert (levels == ~0U || l == levels);
+		levels = l;
+	      }
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		rtx kind = XVECEXP (PATTERN (insn), 0, 0);
+		if (!LOOP_MODE_CHANGE_P (UINTVAL (kind)))
+		  {
+		    seen_insn = true;
+		    continue;
+		  }
+	      }
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
+	    }
 
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  if (seen_insn)
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	      /* We've found an instruction that must be at the start of
+		 a block, but isn't.  Add it to the worklist.  */
+	      reorg_unspec uns;
+	      uns.insn = insn;
+	      uns.block = block;
+	      worklist.safe_push (uns);
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  reorg_unspec *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->block)
+	{
+	  block = elt->block;
+	  remap = block;
+	}
+      
+      /* Split block before insn.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->insn));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->insn;
+    }
+
+  return levels;
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, unsigned expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == CODE_FOR_nvptx_loop
+	      && (UINTVAL (XVECEXP (PATTERN (pre_insn), 0, 0))
+		  == expected));
+  return pre_insn;
+}
+
+typedef std::pair<basic_block, reorg_loop *> loop_t;
+typedef auto_vec<loop_t> work_loop_t;
+
+/*  Dump this loop and all its inner loops.  */
+
+static void
+nvptx_dump_loops (reorg_loop *loop, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, loop->mode,
+	   loop->head_block ? loop->head_block->index : -1,
+	   loop->tail_block ? loop->tail_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (loop->inner)
+    nvptx_dump_loops (loop->inner, depth + 1);
+
+  if (loop->next)
+    nvptx_dump_loops (loop->next, depth);
+}
+
+/* Walk the CFG looking for loop head & tail markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static reorg_loop *
+nvptx_discover_loops (unspec_map_t *map)
+{
+  reorg_loop *outer_loop = new reorg_loop (0, OACC_null);
+  work_loop_t worklist;
+  basic_block block;
+
+  // Mark the exit block visited; the entry block seeds the worklist.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (loop_t (block, outer_loop));
+
+  while (worklist.length ())
+    {
+      loop_t loop = worklist.pop ();
+      reorg_loop *l = loop.second;
+
+      block = loop.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		unsigned kind = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 1));
+		
+		switch (kind)
+		  {
+		  case nvptx_loop_head:
+		    /* Loop head, create a new inner loop and add it
+		       into our parent's child list.  */
+		    l = new reorg_loop (l, mode);
+		    l->head_block = block;
+		    l->head_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_head_insn
+			= nvptx_discover_pre (block, nvptx_loop_prehead);
+		    break;
+
+		  case nvptx_loop_tail:
+		    /* A loop tail.  Finish the current loop and
+		       return to parent.  */
+		    gcc_assert (l->mode == mode);
+		    l->tail_block = block;
+		    l->tail_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_tail_insn
+			= nvptx_discover_pre (block, nvptx_loop_pretail);
+		    l = l->parent;
+		    break;
+		    
+		  default:
+		    gcc_unreachable ();
+		  }
+	      }
+	      break;
+
+	    default:gcc_unreachable ();
 	    }
 	}
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+
+      loop.second = l;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	{
+	  loop.first = e->dest;
+	  
+	  worklist.safe_push (loop);
+	}
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_loops (outer_loop, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_loop;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.    */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill live state that is live at start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill). Sequence is inserted just after
+   INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find first insn of from block */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find last insn of to block */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default:break;
+	case CODE_FOR_nvptx_loop:
+	case CODE_FOR_oacc_levels:
+	  return;
+	}
 
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & (1 << OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if ((1 << mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if ((1 << OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only. */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
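+	  /* The spill above executes only in the still-active worker;
+	     the barrier and fill below sit at TAIL, after the skip
+	     labels, so every worker synchronizes and reloads the
+	     condition.  */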
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* LOOP is a loop that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at loop head
+   and ending at loop pre-tail.  */
+
+static void
+nvptx_skip_loop (unsigned mask, reorg_loop *loop)
+{
+  basic_block tail = loop->tail_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, loop->head_block, pre_tail);
+}
+
+/* Process the loop LOOP and all its contained loops.  We do
+   everything but the neutering.  Return mask of partition modes used
+   within this loop.  */
+
+static unsigned
+nvptx_process_loops (reorg_loop *loop)
+{
+  unsigned inner_mask = 1 << loop->mode;
+  
+  /* Do the inner loops first.  */
+  if (loop->inner)
+    {
+      loop->inner_mask = nvptx_process_loops (loop->inner);
+      inner_mask |= loop->inner_mask;
+    }
+  
+  switch (loop->mode)
+    {
+    case OACC_null:
+      /* Dummy loop.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (loop->head_block, loop->head_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, loop->head_block, loop->head_insn);
+	nvptx_wpropagate (true, loop->head_block, loop->pre_head_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, loop->head_insn);
+	nvptx_wsync (true, loop->pre_tail_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (loop->next)
+    inner_mask |= nvptx_process_loops (loop->next);
+  return inner_mask;
+}
+
+/* Neuter the loop described by LOOP.  We recurse in depth-first
+   order.  LEVELS is the mask of partitioning levels used by the
+   whole function and OUTER is the mask of partitioning applied by
+   the loops containing this one.  */
+
+static void
+nvptx_neuter_loops (reorg_loop *loop, unsigned levels, unsigned outer)
+{
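+  /* Only the worker and vector axes ever need neutering; gang-level
+     partitioning requires no neutering code.  */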
+  unsigned me = (1 << loop->mode) & ((1 << OACC_worker) | (1 << OACC_vector));
+  unsigned skip_mask = 0, neuter_mask = 0;
+  
+  if (loop->inner)
+    nvptx_neuter_loops (loop->inner, levels, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & (1 << mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(levels & (1 << mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (loop->inner_mask & (1 << mode)
+	       || !loop->head_insn)
+	/* Partitioning inside this loop, or we're not a loop: neuter
+	   individual blocks.  */
+	neuter_mask |= 1 << mode;
+      else if (!loop->parent || !loop->parent->head_insn
+	       || loop->parent->inner_mask & (1 << mode))
+	/* Parent isn't a loop or contains this partitioning: skip
+	   loop at this level.  */
+	skip_mask |= 1 << mode;
+      else
+	{ /* Parent will skip this loop itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+      nvptx_skip_loop (skip_mask, loop);
+  
+  if (loop->next)
+    nvptx_neuter_loops (loop->next, levels, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fix up subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2966,34 @@  nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  unspec_map_t unspec_map;
+  unsigned levels = nvptx_split_blocks (&unspec_map);
+
+  /* Compute live regs.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  reorg_loop *loops = nvptx_discover_loops (&unspec_map);
+
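+  /* Insert state propagation and synchronization at the loop heads,
+     then insert the neutering branches.  */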
+  nvptx_process_loops (loops);
+  nvptx_neuter_loops (loops, levels, 0);
 
+  delete loops;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3042,21 @@  nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.  This is true for the
+   loop marker, barrier and broadcast instructions, which must
+   remain unique for the partitioned regions they delimit to stay
+   intact.  */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_loop:
+      return true;
+    default:
+      return false;
+    }
 }
 
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3096,21 @@  nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
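+      /* Round the high-water mark up to the (power of two)
+	 alignment.  */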
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 
 #undef TARGET_OPTION_OVERRIDE
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225154)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@  struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225154)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,7 @@  extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_loop (rtx, rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225154)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@  record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we cannot thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  tree decl = gimple_call_fndecl (as_a <gcall *> (stmt));
+
+	  if (decl && DECL_BUILT_IN (decl)
+	      && builtin_unique_p (decl))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225154)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@  ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
@@ -1779,6 +1778,7 @@  ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_SECTIONS_END_CANCEL:
 	case BUILT_IN_GOMP_SINGLE_COPY_START:
 	case BUILT_IN_GOMP_SINGLE_COPY_END:
+        case BUILT_IN_GOACC_LOOP:
 	  return true;
 
 	default:
Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225154)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@  struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@  static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3742,15 +3734,6 @@  build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3761,6 +3744,56 @@  maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_loop_head (gimple_seq *seq, unsigned mask)
+{
+  {
+    // TODO: Determine this information from the parallel region itself
+    // and emit it once in the offload function.  Currently the target
+    // geometry definition is being extracted early.  For now inform
+    // the backend we're using all axes of parallelism, which is a
+    // safe default.
+    gcall *call = gimple_build_call
+      (builtin_decl_explicit (BUILT_IN_GOACC_LEVELS), 1,
+       build_int_cst (unsigned_type_node,
+		      OACC_LOOP_MASK (OACC_gang)
+		      | OACC_LOOP_MASK (OACC_vector)
+		      | OACC_LOOP_MASK (OACC_worker)));
+    gimple_seq_add_stmt (seq, call);
+  }
+
+  tree loop_decl = builtin_decl_explicit (BUILT_IN_GOACC_LOOP);
+  tree arg0 = build_int_cst (unsigned_type_node, 0);
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call (loop_decl, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_loop_tail (gimple_seq *seq, unsigned mask)
+{
+  tree loop_decl = builtin_decl_explicit (BUILT_IN_GOACC_LOOP);
+  tree arg0 = build_int_cst (unsigned_type_node, 1);
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call (loop_decl, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
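+
+/* Together the two generators above bracket a loop partitioned over,
+   say, worker and vector as
+
+     GOACC_LOOP (0, OACC_worker); GOACC_LOOP (0, OACC_vector);
+       <loop body>
+     GOACC_LOOP (1, OACC_vector); GOACC_LOOP (1, OACC_worker);
+
+   so each mode transition deals with a single axis.  */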
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -7190,21 +7223,6 @@  expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -7213,6 +7231,7 @@  oacc_loop_needs_threadbarrier_p (int gwv
 
    where COND is "<" or ">", we generate pseudocode
 
+  OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7240,6 +7259,11 @@  oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+ OACC_LOOP_TAIL
+
+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.
 */
 
 static void
@@ -7281,10 +7305,6 @@  expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7306,6 +7326,15 @@  expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7364,6 +7393,7 @@  expand_omp_for_static_nochunk (struct om
     case GF_OMP_FOR_KIND_OACC_LOOP:
       {
 	gimple_seq seq = NULL;
+	
 	nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this);
 	threadid = expand_oacc_get_thread_num (&seq, region->gwv_this);
 	gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
@@ -7547,18 +7577,19 @@  expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+    
   gsi_remove (&gsi, true);
 
   /* Connect all the blocks.  */
@@ -7633,6 +7664,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 
    where COND is "<" or ">", we generate pseudocode
 
+OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7643,6 +7675,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 	else
 	  n = (adj + N2 - N1) / STEP;
 	trip = 0;
+
 	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
 					      here so that V is defined
 					      if the loop is not entered
@@ -7661,6 +7694,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+OACC_LOOP_TAIL
 */
 
 static void
@@ -7694,10 +7728,6 @@  expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7709,7 +7739,7 @@  expand_omp_for_static_chunk (struct omp_
       trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
     }
   exit_bb = region->exit;
-
+  
   /* Trip and adjustment setup goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7731,6 +7761,14 @@  expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7989,18 +8027,20 @@  expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -9571,20 +9611,6 @@  expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -10360,7 +10386,6 @@  find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10504,669 +10529,6 @@  build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -11185,8 +10547,6 @@  execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -13220,10 +12580,7 @@  lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -13510,7 +12867,6 @@  lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -14227,16 +13583,7 @@  make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225154)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@  along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@  enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;