
[gomp] Move openacc vector & worker single handling to RTL

Message ID 559D381C.7020804@acm.org
State New

Commit Message

Nathan Sidwell July 8, 2015, 2:47 p.m. UTC
On 07/07/15 10:22, Jakub Jelinek wrote:

> I agree that fork/join might be less confusing.

This version is the great renaming.  I've added fork & join internal fns.  In 
the PTX backend I've added 4 new unspecs:

fork -- the final single mode insn
forked -- the first partitioned mode insn
joining -- the last partitioned mode insn
join -- the first single mode insn

Not all partitionings need all four markers.  I've renamed the loop data 
structures to 'parallel' and similar, because that's actually what they are 
representing -- parallel regions.  The fact that those regions contain loops 
is irrelevant to the task at hand.



nathan
2015-07-08  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_MODES, GOACC_FORK, GOACC_JOIN): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_FORK, IFN_GOACC_JOIN.
	(expand_GOACC_MODES, expand_GOACC_FORK, expand_GOACC_JOIN): New.
	* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
	(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
	(expand_omp_for_static_chunk): Likewise.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_MODES, UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING,
	UNSPECV_JOIN, UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(oacc_modes, nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join):
	New insns.
	(oacc_fork, oacc_join): New expanders.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(struct parallel): New struct.
	(parallel::parallel, parallel::~parallel): Ctor & dtor.
	(bb_insn_map_t): New map.
	(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
	(nvptx_split_blocks, nvptx_discover_pre): New.
	(bb_par_t, bb_par_vec_t): New tuple & vector of.
	(nvptx_dump_pars, nvptx_discover_pars): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars): New.
	(nvptx_neuter_pars): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
	(nvptx_cannot_copy_insn): Check for broadcast, sync, fork & join insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirectionality checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.

Comments

Jakub Jelinek July 8, 2015, 2:58 p.m. UTC | #1
On Wed, Jul 08, 2015 at 10:47:56AM -0400, Nathan Sidwell wrote:
> +/* Generate loop head markers in outer->inner order.  */
> +
> +static void
> +gen_oacc_fork (gimple_seq *seq, unsigned mask)
> +{
> +  {
> +    // TODDO: Determine this information from the parallel region itself

TODO ?

> +    // and emit it once in the offload function.  Currently the target
> +    // geometry definition is being extracted early.  For now inform
> +    // the backend we're using all axes of parallelism, which is a
> +    // safe default.
> +    gcall *call = gimple_build_call_internal
> +      (IFN_GOACC_MODES, 1, 
> +       build_int_cst (unsigned_type_node,
> +		      OACC_LOOP_MASK (OACC_gang)
> +		      | OACC_LOOP_MASK (OACC_vector)
> +		      | OACC_LOOP_MASK (OACC_worker)));

The formatting is too ugly.  I'd say you just want

    tree arg = build_int_cst (unsigned_type_node,
			      OACC_LOOP_MASK (OACC_gang)
			      | OACC_LOOP_MASK (OACC_vector)
			      | OACC_LOOP_MASK (OACC_worker));
    gcall *call = gimple_build_call_internal (IFN_GOACC_MODES, 1, arg);

> +  for (level = OACC_gang; level != OACC_HWM; level++)
> +    if (mask & OACC_LOOP_MASK (level))
> +      {
> +	tree arg = build_int_cst (unsigned_type_node, level);
> +	gcall *call = gimple_build_call_internal
> +	  (IFN_GOACC_FORK, 1, arg);

Why the line-break?  That should fit into 80 columns just fine.
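
I.e. (just the two lines joined; it is well within 80 columns):

    gcall *call = gimple_build_call_internal (IFN_GOACC_FORK, 1, arg);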

> +	gimple_seq_add_stmt (seq, call);
> +      }
> +}
> +
> +/* Generate loop tail markers in inner->outer order.  */
> +
> +static void
> +gen_oacc_join (gimple_seq *seq, unsigned mask)
> +{
> +  unsigned level;
> +
> +  for (level = OACC_HWM; level-- != OACC_gang; )
> +    if (mask & OACC_LOOP_MASK (level))
> +      {
> +	tree arg = build_int_cst (unsigned_type_node, level);
> +	gcall *call = gimple_build_call_internal
> +	  (IFN_GOACC_JOIN, 1, arg);
> +	gimple_seq_add_stmt (seq, call);
> +      }
> +}
>  
>  /* Find the mapping for DECL in CTX or the immediately enclosing
>     context that has a mapping for DECL.
> @@ -6777,21 +6808,6 @@ expand_omp_for_generic (struct omp_regio
>      }
>  }
>  
> -
> -/* True if a barrier is needed after a loop partitioned over
> -   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
> -   that a (conceptual) barrier is needed after worker and vector-partitioned
> -   loops, but not after gang-partitioned loops.  Currently we are relying on
> -   warp reconvergence to synchronise threads within a warp after vector loops,
> -   so an explicit barrier is not helpful after those.  */
> -
> -static bool
> -oacc_loop_needs_threadbarrier_p (int gwv_bits)
> -{
> -  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
> -    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
> -}
> -
>  /* A subroutine of expand_omp_for.  Generate code for a parallel
>     loop with static schedule and no specified chunk size.  Given
>     parameters:
> @@ -6800,6 +6816,7 @@ oacc_loop_needs_threadbarrier_p (int gwv
>  
>     where COND is "<" or ">", we generate pseudocode
>  
> +  OACC_FORK
>  	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
>  	if (cond is <)
>  	  adj = STEP - 1;
> @@ -6827,6 +6844,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
>  	V += STEP;
>  	if (V cond e) goto L1;
>      L2:
> + OACC_JOIN
> +
> + It'd be better to place the OACC_LOOP markers just inside the outer
> + conditional, so they can be entirely eliminated if the loop is
> + unreachable.

Putting OACC_FORK/OACC_JOIN unconditionally into the comment is very
confusing.  The expand_omp_for_static_nochunk routine is used for
#pragma omp for schedule(static), #pragma omp distribute etc., which
certainly don't want such markers emitted in there.  So perhaps mention
somewhere that, for OpenACC loops, you wrap the whole sequence above
between OACC_FORK/OACC_JOIN markers.
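
E.g. something along these lines (wording illustrative):

    For OpenACC (GF_OMP_FOR_KIND_OACC_LOOP) the entire sequence above is
    additionally wrapped in OACC_FORK ... OACC_JOIN markers; other loop
    kinds do not emit them.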

> @@ -7220,6 +7249,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  
>     where COND is "<" or ">", we generate pseudocode
>  
> +OACC_FORK
>  	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
>  	if (cond is <)
>  	  adj = STEP - 1;
> @@ -7230,6 +7260,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  	else
>  	  n = (adj + N2 - N1) / STEP;
>  	trip = 0;
> +
>  	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
>  					      here so that V is defined
>  					      if the loop is not entered
> @@ -7248,6 +7279,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  	trip += 1;
>  	goto L0;
>      L4:
> +OACC_JOIN
>  */

Likewise.
>  
>  static void
> @@ -7281,10 +7313,6 @@ expand_omp_for_static_chunk (struct omp_
>    gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
>    fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
>  
> -  /* Broadcast variables to OpenACC threads.  */
> -  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
> -  region->entry = entry_bb;
> -
>    gcc_assert (broken_loop
>  	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
>    seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
> @@ -7296,7 +7324,7 @@ expand_omp_for_static_chunk (struct omp_
>        trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
>      }
>    exit_bb = region->exit;
> -
> +  

Please avoid such whitespace changes.

In any case, as it is a gomp-4_0-branch patch, I'll defer full review to the
branch maintainers.

	Jakub

Patch

Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225323)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@  struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@  static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3487,15 +3479,6 @@  build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3506,6 +3489,54 @@  maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_fork (gimple_seq *seq, unsigned mask)
+{
+  {
+    // TODDO: Determine this information from the parallel region itself
+    // and emit it once in the offload function.  Currently the target
+    // geometry definition is being extracted early.  For now inform
+    // the backend we're using all axes of parallelism, which is a
+    // safe default.
+    gcall *call = gimple_build_call_internal
+      (IFN_GOACC_MODES, 1, 
+       build_int_cst (unsigned_type_node,
+		      OACC_LOOP_MASK (OACC_gang)
+		      | OACC_LOOP_MASK (OACC_vector)
+		      | OACC_LOOP_MASK (OACC_worker)));
+    gimple_seq_add_stmt (seq, call);
+  }
+
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_FORK, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_join (gimple_seq *seq, unsigned mask)
+{
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_JOIN, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -6777,21 +6808,6 @@  expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -6800,6 +6816,7 @@  oacc_loop_needs_threadbarrier_p (int gwv
 
    where COND is "<" or ">", we generate pseudocode
 
+  OACC_FORK
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -6827,6 +6844,11 @@  oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+ OACC_JOIN
+
+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.
 */
 
 static void
@@ -6868,10 +6890,6 @@  expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -6893,6 +6911,15 @@  expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -6951,6 +6978,7 @@  expand_omp_for_static_nochunk (struct om
     case GF_OMP_FOR_KIND_OACC_LOOP:
       {
 	gimple_seq seq = NULL;
+	
 	nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this);
 	threadid = expand_oacc_get_thread_num (&seq, region->gwv_this);
 	gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
@@ -7134,18 +7162,19 @@  expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+    
   gsi_remove (&gsi, true);
 
   /* Connect all the blocks.  */
@@ -7220,6 +7249,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 
    where COND is "<" or ">", we generate pseudocode
 
+OACC_FORK
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7230,6 +7260,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 	else
 	  n = (adj + N2 - N1) / STEP;
 	trip = 0;
+
 	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
 					      here so that V is defined
 					      if the loop is not entered
@@ -7248,6 +7279,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+OACC_JOIN
 */
 
 static void
@@ -7281,10 +7313,6 @@  expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7296,7 +7324,7 @@  expand_omp_for_static_chunk (struct omp_
       trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
     }
   exit_bb = region->exit;
-
+  
   /* Trip and adjustment setup goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7318,6 +7346,14 @@  expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7576,18 +7612,20 @@  expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -9158,20 +9196,6 @@  expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -9947,7 +9971,6 @@  find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10091,669 +10114,6 @@  build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -10772,8 +10132,6 @@  execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -12342,10 +11700,7 @@  lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -12631,7 +11986,6 @@  lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -13348,16 +12702,7 @@  make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225323)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@  along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@  enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
 
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 225323)
+++ internal-fn.def	(working copy)
@@ -64,3 +64,6 @@  DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
+DEF_INTERNAL_FN (GOACC_MODES, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".")
Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 225323)
+++ internal-fn.c	(working copy)
@@ -98,6 +98,20 @@  init_internal_fns ()
   internal_fn_fnspec_array[IFN_LAST] = 0;
 }
 
+/* Return true if this internal fn call is a unique marker -- it
+   should not be duplicated or merged.  */
+
+bool
+gimple_call_internal_unique_p (const_gimple gs)
+{
+  switch (gimple_call_internal_fn (gs))
+    {
+    default: return false;
+    case IFN_GOACC_FORK: return true;
+    case IFN_GOACC_JOIN: return true;
+    }
+}
+
 /* ARRAY_TYPE is an array of vector modes.  Return the associated insn
    for load-lanes-style optab OPTAB.  The insn must exist.  */
 
@@ -1990,6 +2004,36 @@  expand_GOACC_DATA_END_WITH_ARG (gcall *s
   gcc_unreachable ();
 }
 
+static void
+expand_GOACC_MODES (gcall *stmt)
+{
+  rtx mask = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_modes
+  emit_insn (gen_oacc_modes (mask));
+#endif
+}
+
+static void
+expand_GOACC_FORK (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_fork
+  emit_insn (gen_oacc_fork (mode));
+#endif
+}
+
+static void
+expand_GOACC_JOIN (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_join
+  emit_insn (gen_oacc_join (mode));
+#endif
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: builtins.c
===================================================================
--- builtins.c	(revision 225323)
+++ builtins.c	(working copy)
@@ -5947,20 +5947,6 @@  expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6032,47 +6018,6 @@  expand_oacc_ganglocal_ptr (rtx target AT
   return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
-{
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
-
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
-    {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
-    }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
-
-  rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-      return tmp;
-    }
-  return const0_rtx;
-}
-
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -7225,14 +7170,6 @@  expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@  ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@  bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+      
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@  struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@  gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true, if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@  gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@  record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: tree-ssa-tail-merge.c
===================================================================
--- tree-ssa-tail-merge.c	(revision 225323)
+++ tree-ssa-tail-merge.c	(working copy)
@@ -608,10 +608,13 @@  same_succ_def::equal (const same_succ_de
     {
       s1 = gsi_stmt (gsi1);
       s2 = gsi_stmt (gsi2);
-      if (gimple_code (s1) != gimple_code (s2))
-	return 0;
-      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
-	return 0;
+      if (s1 != s2)
+	{
+	  if (gimple_code (s1) != gimple_code (s2))
+	    return 0;
+	  if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	    return 0;
+	}
       gsi_next_nondebug (&gsi1);
       gsi_next_nondebug (&gsi2);
       gsi_advance_fw_nondebug_nonlocal (&gsi1);
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225323)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,6 @@  DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225323)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@ 
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,9 @@ 
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +101,16 @@  static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +138,8 @@  nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1053,6 +1069,7 @@  nvptx_static_chain (const_tree fndecl, b
     return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM);
 }
 
+
 /* Emit a comparison COMPARE, and return the new test to be used in the
    jump.  */
 
@@ -1066,6 +1083,210 @@  nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc fork & join primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_fork (rtx mode)
+{
+  /* Emit a fork insn only for the worker level.  */
+  if (UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_fork (mode));
+}
+
+void
+nvptx_expand_oacc_join (rtx mode)
+{
+  /* Emit a joining marker for all parallels.  */
+  emit_insn (gen_nvptx_joining (mode));
+}
+
+/* Generate instruction(s) to unpack a 64-bit object into two 32-bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack two 32-bit objects into a 64-bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
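
(For illustration, not part of the patch: what the pack/unpack pair computes at the value level, with uint64_t standing in for a DImode or DFmode register.  The low/high ordering shown is an assumption about PTX's mov.b64 brace syntax.)

#include <stdint.h>
#include <stdio.h>

static void unpack64 (uint64_t src, uint32_t *dst0, uint32_t *dst1)
{
  *dst0 = (uint32_t) src;         /* low half */
  *dst1 = (uint32_t) (src >> 32); /* high half */
}

static uint64_t pack64 (uint32_t src0, uint32_t src1)
{
  return ((uint64_t) src1 << 32) | src0;
}

int main (void)
{
  uint32_t lo, hi;
  unpack64 (0x123456789abcdef0ULL, &lo, &hi);
  printf ("%d\n", pack64 (lo, hi) == 0x123456789abcdef0ULL); /* prints 1 */
  return 0;
}
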
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default:
+      gcc_unreachable ();
+    }
+  return res;
+}
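
(Another standalone model, not part of the patch: the BImode path above, and the analogous one in nvptx_gen_wcast below, widen a predicate to SImode because the 32-bit shuffle and the shared-memory moves cannot operate on predicates directly.)

#include <stdbool.h>
#include <stdio.h>

static int move32 (int v) { return v; }  /* stands in for the shuffle */

static bool move_bool (bool b)
{
  int tmp = b ? 1 : 0;  /* sel_truesi: widen the predicate */
  tmp = move32 (tmp);   /* broadcast (or spill & fill) the integer */
  return tmp != 0;      /* compare back down to BImode */
}

int main (void)
{
  printf ("%d %d\n", move_bool (true), move_bool (false)); /* prints 1 0 */
  return 0;
}
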
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 if not a loop).  */
+
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1868,6 @@  nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1881,7 @@  condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,10 +1943,10 @@  nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
-
+      
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2103,7 @@  get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2117,6 @@  nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2132,6 @@  nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2185,757 @@  nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* Parallel structure of the function.  The entire function is
+   described as a NULL parallel.  We should be able to extend this
+   to represent superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+  
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mode of the parallel.  */
+  unsigned mode;
+
+  /* Partitioning used within inner parallels.  */
+  unsigned inner_mask;
+
+  /* Location of the parallel's forked and join markers.  The forked
+     block is the first block of the parallel, and the join block is
+     the first block after the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
 
-  while (!warp_reg_worklist.is_empty ())
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
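
(Illustration only: a standalone model of the sibling chaining, with hypothetical plain-C types.  Because each child is pushed at the head of its parent's INNER list, children end up linked in reverse discovery order, which is harmless for the later walks.)

#include <stdio.h>

struct par { struct par *parent, *next, *inner; int mode; };

static struct par *link_par (struct par *parent, int mode, struct par *p)
{
  p->parent = parent;
  p->next = parent ? parent->inner : 0; /* push at head of sibling chain */
  p->inner = 0;
  p->mode = mode;
  if (parent)
    parent->inner = p;
  return p;
}

int main (void)
{
  struct par pool[3];
  struct par *root = link_par (0, -1, &pool[0]);
  link_par (root, 1, &pool[1]);
  link_par (root, 2, &pool[2]);
  for (struct par *p = root->inner; p; p = p->next)
    printf ("child mode %d\n", p->mode);  /* prints 2, then 1 */
  return 0;
}
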
+
+/* Map of basic blocks to insns.  */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks so that the forked and join unspecs are at
+   the start of their basic blocks.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Return the
+   partitioning mode of the function as a whole.  Populate MAP with
+   head and tail blocks.  We also clear the BB visited flag, which is
+   used when finding partitions.  */
+
+static unsigned
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;
+  unsigned modes = ~0U; // Assume the worst WRT required neutering
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      // Clear the visited flag, for use by the parallel locator.
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
+	  if (!INSN_P (insn))
 	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
-	    continue;
-
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  switch (recog_memoized (insn))
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_oacc_modes:
+	      /* We just need to detect this and note its argument.  */
+	      {
+		unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0));
+		/* If we see this multiple times, they should all
+		   agree.  */
+		gcc_assert (modes == ~0U || l == modes);
+		modes = l;
+	      }
+	      continue;
+
+	    case CODE_FOR_nvptx_forked:
+	    case CODE_FOR_nvptx_join:
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 the return insn must be executed by all threads, but
+		 the rest of its block probably need not be.  */
+	      break;
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+
+	  if (seen_insn)
+	    /* We've found an instruction that must be at the start of
+	       a block, but isn't.  Add it to the worklist.  */
+	    worklist.safe_push (insn_bb_t (insn, block));
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+	{
+	  block = elt->second;
+	  remap = block;
+	}
+      
+      /* Split the block before the insn; the insn starts the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+
+  return modes;
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, par->mode,
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+typedef std::pair<basic_block, parallel *> bb_par_t;
+typedef auto_vec<bb_par_t> bb_par_vec_t;
+
+/* Walk the CFG looking for fork & join markers.  Construct a
+   parallel structure for the function.  MAP is a mapping of basic
+   blocks to head & tail markers, discovered when splitting blocks.
+   This speeds up the discovery.  We rely on the BB visited flag
+   having been cleared when splitting blocks.  */
+
+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  parallel *outer_par = new parallel (0, OACC_null);
+  bb_par_vec_t worklist;
+  basic_block block;
+
+  // Mark entry and exit blocks as visited.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (bb_par_t (block, outer_par));
+
+  while (worklist.length ())
+    {
+      bb_par_t bb_par = worklist.pop ();
+      parallel *l = bb_par.second;
+
+      block = bb_par.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_forked:
+	      /* Parallel head: create a new inner parallel and add it
+		 into our parent's child list.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		
+		l = new parallel (l, mode);
+		l->forked_block = block;
+		l->forked_insn = end;
+		if (mode == OACC_worker)
+		  l->fork_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+	      }
+	      break;
+
+	    case CODE_FOR_nvptx_join:
+	      /* A parallel tail.  Finish the current parallel and
+		 return to the parent.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+		gcc_assert (l->mode == mode);
+		l->join_block = block;
+		l->join_insn = end;
+		if (mode == OACC_worker)
+		  l->joining_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+		l = l->parent;
+	      }
+	      break;
+
+	    default:
+	      gcc_unreachable ();
 	    }
 	}
+
+      /* Add this block onto the current parallel's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	worklist.safe_push (bb_par_t (e->dest, l));
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (outer_par, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_par;
+}
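
(A standalone model of the walk, not part of the patch; the CFG and block count are hypothetical.  Each block is popped from a work list, visited once, attributed to the parallel that was current when it was reached, and its successors are pushed.  In the real code the current parallel changes when the popped block's map entry is a forked or join insn.)

#include <stdio.h>

#define NBLOCKS 4

/* succ[a][b] nonzero means an edge a -> b.  */
static int succ[NBLOCKS][NBLOCKS] = {
  { 0, 1, 1, 0 },
  { 0, 0, 0, 1 },
  { 0, 0, 0, 1 },
  { 0, 0, 0, 0 },
};

int main (void)
{
  int visited[NBLOCKS] = { 0 };
  int work_blk[NBLOCKS * NBLOCKS], work_par[NBLOCKS * NBLOCKS];
  int n = 0;

  work_blk[n] = 0, work_par[n] = 0, n++;  /* entry block, outer parallel */
  while (n)
    {
      n--;
      int block = work_blk[n], par = work_par[n];
      if (visited[block])
        continue;
      visited[block] = 1;
      printf ("block %d -> parallel %d\n", block, par);
      for (int dest = 0; dest < NBLOCKS; dest++)
        if (succ[block][dest])
          work_blk[n] = dest, work_par[n] = par, n++;
    }
  return 0;
}
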
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow the worker function to initialize anything it needs.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
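
(Illustration only, under the assumption of a frame that is a whole number of 64-bit words: a standalone C restatement of the copy loop generated above.  The buffers stand in for the stack frame and whatever FN transfers through; the fs == 1 special case mirrors the single-iteration shortcut.)

#include <stdint.h>
#include <stdio.h>

static void copy_frame (uint64_t *dst, const uint64_t *src, unsigned fs)
{
  if (fs == 1)          /* single iteration: no loop machinery */
    {
      *dst = *src;
      return;
    }
  for (unsigned idx = fs; idx > 0; idx--)
    {
      uint64_t tmp = *src++;  /* PM_read: load into the temporary */
      *dst++ = tmp;           /* PM_write: store from the temporary */
    }
}

int main (void)
{
  uint64_t frame[3] = { 1, 2, 3 }, shared[3] = { 0 };
  copy_frame (shared, frame, 3);
  printf ("%llu %llu %llu\n", (unsigned long long) shared[0],
          (unsigned long long) shared[1], (unsigned long long) shared[2]);
  return 0;
}
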
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at the start of BLOCK across the
+   vectors of a single warp.  Propagation is inserted just after
+   INSN.  */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop; initialize the pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill state that is live at the start of BLOCK.  PRE_P
+   indicates whether this is just before the partitioned mode starts
+   (do a spill), or just after it has started (do a fill).  The
+   sequence is inserted just after INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  TAIL_P is true for
+   the barrier at the tail of the partitioned region, and is used
+   directly as the PTX barrier number.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an
+   entire parallel.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find the first insn of the FROM block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find the last insn of the TO block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default: break;
+	case CODE_FOR_nvptx_fork:
+	case CODE_FOR_nvptx_forked:
+	case CODE_FOR_nvptx_joining:
+	case CODE_FOR_nvptx_join:
+	case CODE_FOR_oacc_modes:
+	  return;
+	}
+
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we won't insert anything.  */
+	  if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode: do a spill & fill.  By construction
+	     we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
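
(Not part of the patch: a standalone restatement of which thread survives the neutering tests above.  The mask bit values are hypothetical stand-ins for OACC_LOOP_MASK of the worker and vector axes.)

#include <stdio.h>

#define MASK_WORKER (1u << 1)  /* hypothetical */
#define MASK_VECTOR (1u << 2)  /* hypothetical */

/* Return 1 if the thread at (VECTOR_ID, WORKER_ID) executes a block
   neutered according to MASK: only id zero survives on each
   neutered axis.  */
static int executes_p (unsigned mask, int vector_id, int worker_id)
{
  if ((mask & MASK_WORKER) && worker_id != 0)
    return 0;
  if ((mask & MASK_VECTOR) && vector_id != 0)
    return 0;
  return 1;
}

int main (void)
{
  for (int w = 0; w < 2; w++)
    for (int v = 0; v < 2; v++)
      printf ("worker=%d vector=%d -> %d\n", w, v,
              executes_p (MASK_WORKER | MASK_VECTOR, v, w));
  return 0;
}
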
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
 
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (par->mode);
+  
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+  
+  switch (par->mode)
+    {
+    case OACC_null:
+      /* Dummy parallel.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (par->forked_block, par->forked_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, par->forked_block,
+			  par->forked_insn);
+	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, par->forked_insn);
+	nvptx_wsync (true, par->joining_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES is the mask of partitioning modes used by the
+   function as a whole, and OUTER is the partitioning of the
+   parallels we are contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (par->mode)
+		 & (OACC_LOOP_MASK (OACC_worker)
+		    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned  skip_mask = 0, neuter_mask = 0;
+  
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(modes & OACC_LOOP_MASK (mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (par->inner_mask & OACC_LOOP_MASK (mode)
+	       || !par->forked_insn)
+	/* Partitioned in inner parallels, or we're not partitioned
+	   at all: neuter individual blocks.  */
+	neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+	       || par->parent->inner_mask & OACC_LOOP_MASK (mode))
+	/* Parent isn't a parallel, or already contains this
+	   partitioning: skip the parallel at this level.  */
+	skip_mask |= OACC_LOOP_MASK (mode);
+      else
+	{ /* Parent will skip this parallel itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+      nvptx_skip_par (skip_mask, par);
+  
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
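
(Illustration only: the per-axis decision ladder above, restated as a standalone function whose arguments each mirror one of the tests.  The enum names are hypothetical.)

#include <stdio.h>

enum action { NONE, NEUTER, SKIP, PARENT_SKIPS };

static enum action
classify (int partitioned_here, int mode_used, int inner_partitions,
          int have_forked, int parent_can_skip)
{
  if (partitioned_here)
    return NONE;          /* mode is partitioned: no neutering */
  if (!mode_used)
    return NONE;          /* mode is not used: nothing to do */
  if (inner_partitions || !have_forked)
    return NEUTER;        /* neuter individual blocks */
  if (!parent_can_skip)
    return SKIP;          /* skip the parallel at this level */
  return PARENT_SKIPS;    /* parent will skip this parallel itself */
}

int main (void)
{
  printf ("%d\n", classify (0, 1, 0, 1, 0) == SKIP);  /* prints 1 */
  return 0;
}
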
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fix up subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2944,34 @@  nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+  unsigned modes = nvptx_split_blocks (&bb_insn_map);
+
+  /* Compute live registers.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  parallel *pars = nvptx_discover_pars (&bb_insn_map);
+
+  nvptx_process_pars (pars);
+  nvptx_neuter_pars (pars, modes, 0);
 
+  delete pars;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3020,24 @@  nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.  */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
 }
 
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3077,21 @@  nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 
 #undef TARGET_OPTION_OVERRIDE
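
As a concrete instance of the emission above (hypothetical numbers, not from a real run): a TU whose largest propagation used 12 bytes with at most SImode alignment would leave worker_bcast_align at 4 and worker_bcast_hwm at 12, so the file would end with

// BEGIN VAR DEF: __worker_bcast
.shared.align 4 .u8 __worker_bcast[12];
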
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225323)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@  struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225323)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,8 @@  extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_fork (rtx);
+extern void nvptx_expand_oacc_join (rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225323)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,26 @@ 
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_MODES
+   UNSPECV_FORK
+   UNSPECV_FORKED
+   UNSPECV_JOINING
+   UNSPECV_JOIN
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +264,8 @@ 
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +826,7 @@ 
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +835,34 @@ 
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1366,99 @@ 
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "oacc_modes"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_MODES)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// modes %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORK)]
+  ""
+  "// fork %0;"
+)
+
+(define_insn "nvptx_forked"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+  "// forked %0;"
+)
+
+(define_insn "nvptx_joining"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOINING)]
+  ""
+  "// joining %0;"
+)
+
+(define_insn "nvptx_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+  "// join %0;"
+)
+
+(define_expand "oacc_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+{
+  nvptx_expand_oacc_fork (operands[0]);
 })
 
+(define_expand "oacc_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+{
+  nvptx_expand_oacc_join (operands[0]);
+})
+
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64-bit object into two 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack two 32-bit ints into a 64-bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1564,8 @@ 
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})