
[gomp4.5] depend nowait support for target

Message ID 20151113151150.GQ5675@tucnak.redhat.com
State New

Commit Message

Jakub Jelinek Nov. 13, 2015, 3:11 p.m. UTC
On Fri, Nov 13, 2015 at 11:18:41AM +0100, Jakub Jelinek wrote:
> For the offloading case, I actually see a problematic spot, namely that
> GOMP_PLUGIN_target_task_completion could fire too early and take the
> task_lock before the thread that ran gomp_target_task_fn (doing map_vars
> + async_run for it) does.  I bet I need to add further ttask state kinds
> and deal with that case (so GOMP_PLUGIN_target_task_completion would just
> take the task lock and tweak the ttask state if the task has not been
> added to the queues yet).
> Plus I think I want to improve the case where we are not waiting: in
> gomp_create_target_task, if we are not waiting for dependencies, schedule
> gomp_target_task_fn manually.

These two have been resolved, and the target-34.c issue is resolved too (the
bug was that I had been too lazy and just put the target-33.c test into
#pragma omp parallel #pragma omp single, but that is invalid OpenMP, as single
is a worksharing region and #pragma omp barrier may not be encountered in such
a region).  Fixed by rewriting the testcase.
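
For reference, the invalid shape was roughly this (a minimal sketch, not the
actual testcase):

int
main ()
{
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp target nowait
      ;
    /* Invalid: single is a worksharing region, and a barrier region
       may not be closely nested inside a worksharing region.  */
    #pragma omp barrier
  }
  return 0;
}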

So here is a full patch that passes for me both non-offloading and
offloading, with OMP_NUM_THREADS=16 (implicit on my box) as well as with
OMP_NUM_THREADS=1 (explicit).  I've incorporated your incremental patch.
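
To spell out how the first spot is resolved in the patch: gomp_target_task_fn
sets ttask->state to GOMP_TARGET_TASK_READY_TO_RUN before calling async_run,
and the two sides then synchronize under team->task_lock roughly like this
(condensed from the task.c changes below):

/* Thread that ran gomp_target_task_fn (map_vars + async_run):  */
gomp_mutex_lock (&team->task_lock);
task->kind = GOMP_TASK_ASYNC_RUNNING;
if (ttask->state == GOMP_TARGET_TASK_FINISHED)
  /* The completion callback beat us to it; requeue the task here.  */
  gomp_target_task_completion (team, task);
else
  ttask->state = GOMP_TARGET_TASK_RUNNING;
gomp_mutex_unlock (&team->task_lock);

/* GOMP_PLUGIN_target_task_completion, called from the plugin:  */
gomp_mutex_lock (&team->task_lock);
if (ttask->state == GOMP_TARGET_TASK_READY_TO_RUN)
  {
    /* Not requeued yet; just mark it finished and let the thread
       above do the requeuing.  */
    ttask->state = GOMP_TARGET_TASK_FINISHED;
    gomp_mutex_unlock (&team->task_lock);
    return;
  }
ttask->state = GOMP_TARGET_TASK_FINISHED;
gomp_target_task_completion (team, task);
gomp_mutex_unlock (&team->task_lock);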

One option to avoid the static variable would be to pass two pointers
instead of one (async_data): one would be the callback function pointer,
the other the argument to pass to it.  Another possibility would be to say
that the async_data argument the plugin passes to liboffloadmic is a
pointer to a structure holding a function pointer (the completion callback)
and the data pointer to pass to it; the plugin would then
GOMP_PLUGIN_malloc 2 * sizeof (void *) for it, fill it in, and register
some function of its own that calls GOMP_PLUGIN_target_task_completion
with the second structure element as argument and then frees the structure
pointer.
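
I.e., something like this in the plugin (just an untested sketch; the
completion_data/completion_trampoline names are made up):

#include <stdlib.h>

struct completion_data
{
  void (*fn) (void *);	/* Completion callback.  */
  void *data;		/* Argument to pass to it.  */
};

/* Trampoline the plugin would register with liboffloadmic instead of
   registering GOMP_PLUGIN_target_task_completion directly.  */
static void
completion_trampoline (void *arg)
{
  struct completion_data *cd = (struct completion_data *) arg;
  cd->fn (cd->data);
  free (cd);
}

/* In GOMP_OFFLOAD_async_run, instead of passing async_data through
   unchanged:  */
struct completion_data *cd
  = (struct completion_data *) GOMP_PLUGIN_malloc (2 * sizeof (void *));
cd->fn = GOMP_PLUGIN_target_task_completion;
cd->data = async_data;
/* ... then hand cd to liboffloadmic as the opaque async data.  */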

Do you still get crashes on any of the testcases with this?

2015-11-13  Jakub Jelinek  <jakub@redhat.com>
	    Ilya Verbin  <ilya.verbin@intel.com>

	* parallel.c (gomp_resolve_num_threads): Don't assume that
	if thr->ts.team is non-NULL, then pool must be non-NULL.
	* libgomp-plugin.h (GOMP_PLUGIN_target_task_completion): Declare.
	* team.c (gomp_free_thread): Call gomp_team_end if thr->ts.team
	is artificial team created for target nowait in implicit parallel
	region.
	(gomp_team_start): For nested check, test thr->ts.level instead of
	thr->ts.team != NULL.
	* target.c (GOMP_target): Don't adjust *thr in any way around
	running offloaded task.
	(GOMP_target_ext): Likewise.  Handle target nowait.
	(GOMP_target_update_ext, GOMP_target_enter_exit_data): Check
	return value from gomp_create_target_task, if false, fallthrough
	as if no dependencies exist.
	(gomp_target_task_fn): Change return type to bool, return true
	if the task should have another part scheduled later.  Handle
	target nowait.
	(gomp_load_plugin_for_device): Initialize async_run.
	* libgomp.map (GOMP_PLUGIN_1.1): New symbol version, export
	GOMP_PLUGIN_target_task_completion.
	* task.c (priority_queue_move_task_first,
	gomp_target_task_completion, GOMP_PLUGIN_target_task_completion):
	New functions.
	(gomp_create_target_task): Change return type to bool, add
	state argument, return false if for async {{enter,exit} data,update}
	constructs no dependencies need to be waited for, handle target
	nowait.  Set task->fn to NULL instead of gomp_target_task_fn.
	(gomp_barrier_handle_tasks, GOMP_taskwait,
	gomp_task_maybe_wait_for_dependencies): Handle target nowait target
	tasks specially.
	(GOMP_taskgroup_end): Likewise.  If taskgroup is NULL, and
	thr->ts.level is 0, act as a barrier.
	* priority_queue.c (priority_queue_task_in_queue_p,
	priority_list_verify): Adjust for addition of
	GOMP_TASK_ASYNC_RUNNING kind.
	* libgomp.h (enum gomp_task_kind): Add GOMP_TASK_ASYNC_RUNNING.
	(enum gomp_target_task_state): New enum.
	(struct gomp_target_task): Add state, tgt, task and team fields.
	(gomp_create_target_task): Change return type to bool, add
	state argument.
	(gomp_target_task_fn): Change return type to bool.
	(struct gomp_device_descr): Add async_run_func.
	* testsuite/libgomp.c/target-32.c: New test.
	* testsuite/libgomp.c/target-33.c: New test.
	* testsuite/libgomp.c/target-34.c: New test.

2015-11-13  Ilya Verbin  <ilya.verbin@intel.com>

	* runtime/offload_host.cpp (task_completion_callback): New
	variable.
	(offload_proxy_task_completed_ooo): Call task_completion_callback.
	(__offload_register_task_callback): New function.
	* runtime/offload_host.h (__offload_register_task_callback): New
	declaration.
	* plugin/libgomp-plugin-intelmic.cpp (offload): Add async_data
	argument, handle async offloading.
	(register_main_image): Call __offload_register_task_callback.
	(GOMP_OFFLOAD_init_device, get_target_table, GOMP_OFFLOAD_alloc,
	GOMP_OFFLOAD_free, GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host,
	GOMP_OFFLOAD_dev2dev): Adjust offload callers.
	(GOMP_OFFLOAD_async_run): New function.
	(GOMP_OFFLOAD_run): Implement using GOMP_OFFLOAD_async_run.


	Jakub

Comments

Ilya Verbin Nov. 13, 2015, 4:37 p.m. UTC | #1
On Fri, Nov 13, 2015 at 16:11:50 +0100, Jakub Jelinek wrote:
> On Fri, Nov 13, 2015 at 11:18:41AM +0100, Jakub Jelinek wrote:
> > For the offloading case, I actually see a problematic spot, namely that
> > GOMP_PLUGIN_target_task_completion could fire too early and take the
> > task_lock before the thread that ran gomp_target_task_fn (doing map_vars
> > + async_run for it) does.  I bet I need to add further ttask state kinds
> > and deal with that case (so GOMP_PLUGIN_target_task_completion would just
> > take the task lock and tweak the ttask state if the task has not been
> > added to the queues yet).
> > Plus I think I want to improve the case where we are not waiting: in
> > gomp_create_target_task, if we are not waiting for dependencies, schedule
> > gomp_target_task_fn manually.
> 
> These two have been resolved, and the target-34.c issue is resolved too (the
> bug was that I had been too lazy and just put the target-33.c test into
> #pragma omp parallel #pragma omp single, but that is invalid OpenMP, as
> single is a worksharing region and #pragma omp barrier may not be
> encountered in such a region).  Fixed by rewriting the testcase.
> 
> So here is a full patch that passes for me both non-offloading and
> offloading, with OMP_NUM_THREADS=16 (implicit on my box) as well as with
> OMP_NUM_THREADS=1 (explicit).  I've incorporated your incremental patch.
> 
> One option to avoid the static variable would be to pass two pointers
> instead of one (async_data): one would be the callback function pointer,
> the other the argument to pass to it.  Another possibility would be to say
> that the async_data argument the plugin passes to liboffloadmic is a
> pointer to a structure holding a function pointer (the completion callback)
> and the data pointer to pass to it; the plugin would then
> GOMP_PLUGIN_malloc 2 * sizeof (void *) for it, fill it in, and register
> some function of its own that calls GOMP_PLUGIN_target_task_completion
> with the second structure element as argument and then frees the structure
> pointer.

I don't know which interface to implement to maintain compatibility in the
future.
Anyway, currently it's impossible for a process to use the same liboffloadmic
for 2 different offloading paths (say GCC's in the exec and ICC's in a dso),
because GCC's and ICC's libraries are in fact not the same.  First of all,
they have different names: liboffloadmic in GCC and just liboffload in ICC.
And most importantly, ICC's version contains some references to libiomp5,
which were removed from GCC's version.  In theory, we want to use one library
with all compilers, but I'm not sure when that will be possible.

> Do you still get crashes on any of the testcases with this?

No, all tests now pass using emul.  I'll report when I have any results on HW.

Thanks,
  -- Ilya
Jakub Jelinek Nov. 13, 2015, 4:41 p.m. UTC | #2
On Fri, Nov 13, 2015 at 07:37:17PM +0300, Ilya Verbin wrote:
> I don't know which interface to implement to maintain compatibility in the
> future.
> Anyway, currently it's impossible for a process to use the same liboffloadmic
> for 2 different offloading paths (say GCC's in the exec and ICC's in a dso),
> because GCC's and ICC's libraries are in fact not the same.  First of all,
> they have different names: liboffloadmic in GCC and just liboffload in ICC.
> And most importantly, ICC's version contains some references to libiomp5,
> which were removed from GCC's version.  In theory, we want to use one library
> with all compilers, but I'm not sure when that will be possible.

Ok, in that case it is less of a problem.

> > Do you still get crashes on any of the testcases with this?
> 
> No, all tests now pass using emul.  I'll report when I have any results on HW.

Perfect, I'll commit it to gomp-4_5-branch then.

Thanks.

	Jakub
Ilya Verbin Nov. 13, 2015, 6:37 p.m. UTC | #3
On Fri, Nov 13, 2015 at 17:41:53 +0100, Jakub Jelinek wrote:
> On Fri, Nov 13, 2015 at 07:37:17PM +0300, Ilya Verbin wrote:
> > I don't know which interface to implement to maintain compatibility in the
> > future.
> > Anyway, currently it's impossible for a process to use the same
> > liboffloadmic for 2 different offloading paths (say GCC's in the exec and
> > ICC's in a dso), because GCC's and ICC's libraries are in fact not the
> > same.  First of all, they have different names: liboffloadmic in GCC and
> > just liboffload in ICC.  And most importantly, ICC's version contains some
> > references to libiomp5, which were removed from GCC's version.  In theory,
> > we want to use one library with all compilers, but I'm not sure when that
> > will be possible.
> 
> Ok, in that case it is less of a problem.
> 
> > > Do you still get crashes on any of the testcases with this?
> > 
> > No, all tests now pass using emul.  I'll report when I have any results on HW.
> 
> Perfect, I'll commit it to gomp-4_5-branch then.

make check-target-libgomp with offloading to HW also passed :)

And this:

+++ b/libgomp/testsuite/libgomp.c/target-32.c
@@ -3,6 +3,7 @@
 
 int main ()
 {
+  int x = 1;
   int a = 0, b = 0, c = 0, d[7];
 
   #pragma omp parallel
@@ -18,6 +19,7 @@ int main ()
 
     #pragma omp target nowait map(alloc: b) depend(in: d[2]) depend(out: d[3])
     {
+      while (x);
       usleep (1000);
       #pragma omp atomic update
       b |= 4;
@@ -25,6 +27,7 @@ int main ()
 
     #pragma omp target nowait map(alloc: b) depend(in: d[2]) depend(out: d[4])
     {
+      while (x);
       usleep (5000);
       #pragma omp atomic update
       b |= 1;

demonstrates 200% CPU usage both with emul and on HW, so the 2 target tasks
really do run concurrently.

  -- Ilya

Patch

--- liboffloadmic/runtime/offload_host.cpp.jj	2015-11-05 11:31:05.013916598 +0100
+++ liboffloadmic/runtime/offload_host.cpp	2015-11-13 14:23:54.469798572 +0100
@@ -64,6 +64,8 @@  static void __offload_fini_library(void)
 #define GET_OFFLOAD_NUMBER(timer_data) \
     timer_data? timer_data->offload_number : 0
 
+static void (*task_completion_callback)(void *);
+
 extern "C" {
 #ifdef TARGET_WINNT
 // Windows does not support imports from libraries without actually
@@ -2507,7 +2509,7 @@  extern "C" {
         const void *info
     )
     {
-	/* TODO: Call callback function, pass info.  */
+	task_completion_callback ((void *) info);
     }
 }
 
@@ -5669,6 +5671,11 @@  extern "C" void __offload_unregister_ima
     }
 }
 
+extern "C" void __offload_register_task_callback(void (*cb)(void *))
+{
+    task_completion_callback = cb;
+}
+
 // Runtime trace interface for user programs
 
 void __offload_console_trace(int level)
--- liboffloadmic/runtime/offload_host.h.jj	2015-10-14 10:24:10.904194499 +0200
+++ liboffloadmic/runtime/offload_host.h	2015-11-13 14:23:54.470798557 +0100
@@ -376,6 +376,9 @@  extern "C" bool __offload_target_image_i
 extern "C" bool __offload_register_image(const void* image);
 extern "C" void __offload_unregister_image(const void* image);
 
+// Registers asynchronous task completion callback
+extern "C" void __offload_register_task_callback(void (*cb)(void *));
+
 // Initializes offload runtime library.
 DLL_LOCAL extern int __offload_init_library(void);
 
--- liboffloadmic/plugin/libgomp-plugin-intelmic.cpp.jj	2015-10-14 10:24:10.922194230 +0200
+++ liboffloadmic/plugin/libgomp-plugin-intelmic.cpp	2015-11-13 14:23:54.467798600 +0100
@@ -192,11 +192,23 @@  GOMP_OFFLOAD_get_num_devices (void)
 
 static void
 offload (const char *file, uint64_t line, int device, const char *name,
-	 int num_vars, VarDesc *vars, VarDesc2 *vars2)
+	 int num_vars, VarDesc *vars, VarDesc2 *vars2, const void **async_data)
 {
   OFFLOAD ofld = __offload_target_acquire1 (&device, file, line);
   if (ofld)
-    __offload_offload1 (ofld, name, 0, num_vars, vars, vars2, 0, NULL, NULL);
+    {
+      if (async_data == NULL)
+	__offload_offload1 (ofld, name, 0, num_vars, vars, vars2, 0, NULL,
+			    NULL);
+      else
+	{
+	  OffloadFlags flags;
+	  flags.flags = 0;
+	  flags.bits.omp_async = 1;
+	  __offload_offload3 (ofld, name, 0, num_vars, vars, NULL, 0, NULL,
+			      async_data, 0, NULL, flags, NULL);
+	}
+    }
   else
     {
       fprintf (stderr, "%s:%d: Offload target acquire failed\n", file, line);
@@ -208,6 +220,10 @@  static void
 register_main_image ()
 {
   __offload_register_image (&main_target_image);
+
+  /* liboffloadmic will call GOMP_PLUGIN_target_task_completion when
+     asynchronous task on target is completed.  */
+  __offload_register_task_callback (GOMP_PLUGIN_target_task_completion);
 }
 
 /* liboffloadmic loads and runs offload_target_main on all available devices
@@ -218,7 +234,7 @@  GOMP_OFFLOAD_init_device (int device)
   TRACE ("");
   pthread_once (&main_image_is_registered, register_main_image);
   offload (__FILE__, __LINE__, device, "__offload_target_init_proc", 0,
-	   NULL, NULL);
+	   NULL, NULL, NULL);
 }
 
 extern "C" void
@@ -240,7 +256,7 @@  get_target_table (int device, int &num_f
   VarDesc2 vd1g[2] = { { "num_funcs", 0 }, { "num_vars", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_table_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   int table_size = num_funcs + 2 * num_vars;
   if (table_size > 0)
@@ -254,7 +270,7 @@  get_target_table (int device, int &num_f
       VarDesc2 vd2g = { "table", 0 };
 
       offload (__FILE__, __LINE__, device, "__offload_target_table_p2", 1,
-	       &vd2, &vd2g);
+	       &vd2, &vd2g, NULL);
     }
 }
 
@@ -401,8 +417,8 @@  GOMP_OFFLOAD_alloc (int device, size_t s
   vd1[1].size = sizeof (void *);
   VarDesc2 vd1g[2] = { { "size", 0 }, { "tgt_ptr", 0 } };
 
-  offload (__FILE__, __LINE__, device, "__offload_target_alloc", 2, vd1, vd1g);
-
+  offload (__FILE__, __LINE__, device, "__offload_target_alloc", 2, vd1, vd1g,
+	   NULL);
   return tgt_ptr;
 }
 
@@ -416,7 +432,8 @@  GOMP_OFFLOAD_free (int device, void *tgt
   vd1.size = sizeof (void *);
   VarDesc2 vd1g = { "tgt_ptr", 0 };
 
-  offload (__FILE__, __LINE__, device, "__offload_target_free", 1, &vd1, &vd1g);
+  offload (__FILE__, __LINE__, device, "__offload_target_free", 1, &vd1, &vd1g,
+	   NULL);
 }
 
 extern "C" void *
@@ -435,7 +452,7 @@  GOMP_OFFLOAD_host2dev (int device, void
   VarDesc2 vd1g[2] = { { "tgt_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_host2tgt_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   VarDesc vd2 = vd_host2tgt;
   vd2.ptr = (void *) host_ptr;
@@ -443,7 +460,7 @@  GOMP_OFFLOAD_host2dev (int device, void
   VarDesc2 vd2g = { "var", 0 };
 
   offload (__FILE__, __LINE__, device, "__offload_target_host2tgt_p2", 1,
-	   &vd2, &vd2g);
+	   &vd2, &vd2g, NULL);
 
   return tgt_ptr;
 }
@@ -464,7 +481,7 @@  GOMP_OFFLOAD_dev2host (int device, void
   VarDesc2 vd1g[2] = { { "tgt_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2host_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   VarDesc vd2 = vd_tgt2host;
   vd2.ptr = (void *) host_ptr;
@@ -472,7 +489,7 @@  GOMP_OFFLOAD_dev2host (int device, void
   VarDesc2 vd2g = { "var", 0 };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2host_p2", 1,
-	   &vd2, &vd2g);
+	   &vd2, &vd2g, NULL);
 
   return host_ptr;
 }
@@ -495,22 +512,32 @@  GOMP_OFFLOAD_dev2dev (int device, void *
   VarDesc2 vd1g[3] = { { "dst_ptr", 0 }, { "src_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2tgt", 3, vd1,
-	   vd1g);
+	   vd1g, NULL);
 
   return dst_ptr;
 }
 
 extern "C" void
-GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
+GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
+			void *async_data)
 {
-  TRACE ("(tgt_fn = %p, tgt_vars = %p)", tgt_fn, tgt_vars);
+  TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p, async_data = %p)", device,
+	 tgt_fn, tgt_vars, async_data);
 
-  VarDesc vd1[2] = { vd_host2tgt, vd_host2tgt };
-  vd1[0].ptr = &tgt_fn;
-  vd1[0].size = sizeof (void *);
-  vd1[1].ptr = &tgt_vars;
-  vd1[1].size = sizeof (void *);
-  VarDesc2 vd1g[2] = { { "tgt_fn", 0 }, { "tgt_vars", 0 } };
+  VarDesc vd[2] = { vd_host2tgt, vd_host2tgt };
+  vd[0].ptr = &tgt_fn;
+  vd[0].size = sizeof (void *);
+  vd[1].ptr = &tgt_vars;
+  vd[1].size = sizeof (void *);
+
+  offload (__FILE__, __LINE__, device, "__offload_target_run", 2, vd, NULL,
+	   (const void **) async_data);
+}
+
+extern "C" void
+GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
+{
+  TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p)", device, tgt_fn, tgt_vars);
 
-  offload (__FILE__, __LINE__, device, "__offload_target_run", 2, vd1, vd1g);
+  GOMP_OFFLOAD_async_run (device, tgt_fn, tgt_vars, NULL);
 }
--- libgomp/parallel.c.jj	2015-10-14 10:24:10.000000000 +0200
+++ libgomp/parallel.c	2015-11-12 15:12:38.349901541 +0100
@@ -85,7 +85,7 @@  gomp_resolve_num_threads (unsigned speci
      nested parallel, so there is just one thread in the
      contention group as well, no need to handle it atomically.  */
   pool = thr->thread_pool;
-  if (thr->ts.team == NULL)
+  if (thr->ts.team == NULL || pool == NULL)
     {
       num_threads = max_num_threads;
       if (num_threads > icv->thread_limit_var)
--- libgomp/libgomp-plugin.h.jj	2015-10-14 10:24:10.000000000 +0200
+++ libgomp/libgomp-plugin.h	2015-11-13 11:40:03.366418330 +0100
@@ -63,6 +63,7 @@  struct addr_pair
 extern void *GOMP_PLUGIN_malloc (size_t) __attribute__ ((malloc));
 extern void *GOMP_PLUGIN_malloc_cleared (size_t) __attribute__ ((malloc));
 extern void *GOMP_PLUGIN_realloc (void *, size_t);
+void GOMP_PLUGIN_target_task_completion (void *);
 
 extern void GOMP_PLUGIN_debug (int, const char *, ...)
 	__attribute__ ((format (printf, 2, 3)));
--- libgomp/testsuite/libgomp.c/target-32.c.jj	2015-11-10 12:58:55.087951346 +0100
+++ libgomp/testsuite/libgomp.c/target-32.c	2015-11-12 13:28:55.053380366 +0100
@@ -0,0 +1,54 @@ 
+#include <stdlib.h>
+#include <unistd.h>
+
+int main ()
+{
+  int a = 0, b = 0, c = 0, d[7];
+
+  #pragma omp parallel
+  #pragma omp single
+  {
+    #pragma omp task depend(out: d[0])
+      a = 2;
+
+    #pragma omp target enter data nowait map(to: a,b,c) depend(in: d[0]) depend(out: d[1])
+
+    #pragma omp target nowait map(alloc: a) depend(in: d[1]) depend(out: d[2])
+      a++;
+
+    #pragma omp target nowait map(alloc: b) depend(in: d[2]) depend(out: d[3])
+    {
+      usleep (1000);
+      #pragma omp atomic update
+      b |= 4;
+    }
+
+    #pragma omp target nowait map(alloc: b) depend(in: d[2]) depend(out: d[4])
+    {
+      usleep (5000);
+      #pragma omp atomic update
+      b |= 1;
+    }
+
+    #pragma omp target nowait map(alloc: c) depend(in: d[3], d[4]) depend(out: d[5])
+    {
+      usleep (5000);
+      #pragma omp atomic update
+      c |= 8;
+    }
+
+    #pragma omp target nowait map(alloc: c) depend(in: d[3], d[4]) depend(out: d[6])
+    {
+      usleep (1000);
+      #pragma omp atomic update
+      c |= 2;
+    }
+
+    #pragma omp target exit data map(always,from: a,b,c) depend(in: d[5], d[6])
+  }
+
+  if (a != 3 || b != 5 || c != 10)
+    abort ();
+
+  return 0;
+}
--- libgomp/testsuite/libgomp.c/target-34.c.jj	2015-11-13 08:54:42.607799433 +0100
+++ libgomp/testsuite/libgomp.c/target-34.c	2015-11-13 14:16:09.104425707 +0100
@@ -0,0 +1,112 @@ 
+extern void abort (void);
+
+int
+main ()
+{
+  int a = 1, b = 2, c = 4, d[7];
+  #pragma omp parallel
+  {
+    #pragma omp single
+    {
+      #pragma omp taskgroup
+      {
+	#pragma omp target enter data nowait map (to: a, b, c) depend(out: d[0])
+	#pragma omp target nowait map (alloc: a, b) depend(in: d[0]) depend(out: d[1])
+	{
+	  #pragma omp atomic update
+	  a |= 4;
+	  #pragma omp atomic update
+	  b |= 8;
+	}
+	#pragma omp target nowait map (alloc: a, c) depend(in: d[0]) depend(out: d[2])
+	{
+	  #pragma omp atomic update
+	  a |= 16;
+	  #pragma omp atomic update
+	  c |= 32;
+	}
+	#pragma omp target exit data nowait map (from: a, b, c) depend(in: d[1], d[2])
+      }
+      if (a != 21 || b != 10 || c != 36)
+	abort ();
+      #pragma omp target map (tofrom: a, b) nowait
+      {
+	a &= ~16;
+	b &= ~2;
+      }
+      #pragma omp target map (tofrom: c) nowait
+      {
+	c |= 8;
+      }
+    } /* Implicit barrier here.  */
+    #pragma omp single
+    {
+      if (a != 5 || b != 8 || c != 44)
+	abort ();
+      #pragma omp target map (tofrom: a, b) nowait
+      {
+	a |= 32;
+	b |= 4;
+      }
+      #pragma omp target map (tofrom: c) nowait
+      c &= ~4;
+      #pragma omp taskwait
+      if (a != 37 || b != 12 || c != 40)
+	abort ();
+      #pragma omp target nowait map (tofrom: a, b) depend(out: d[3])
+      {
+	#pragma omp atomic update
+	a = a + 9;
+	b -= 8;
+      }
+      #pragma omp target nowait map (tofrom: a, c) depend(out: d[4])
+      {
+	#pragma omp atomic update
+	a = a + 4;
+	c >>= 1;
+      }
+      #pragma omp task if (0) depend (in: d[3], d[4]) shared (a, b, c)
+      if (a != 50 || b != 4 || c != 20)
+	abort ();
+      #pragma omp task shared (a)
+      a += 50;
+      #pragma omp target nowait map (tofrom: b)
+      b++;
+      #pragma omp target map (tofrom: c) nowait
+      c--;
+      #pragma omp taskwait
+      if (a != 100 || b != 5 || c != 19)
+	abort ();
+      #pragma omp target map (tofrom: a) nowait depend(out: d[5])
+      a++;
+      #pragma omp target map (tofrom: b) nowait depend(out: d[6])
+      b++;
+      #pragma omp target map (tofrom: a, b) depend(in: d[5], d[6])
+      {
+	if (a != 101 || b != 6)
+	  a = -9;
+	else
+	  {
+	    a = 24;
+	    b = 38;
+	  }
+      }
+      if (a != 24 || b != 38)
+	abort ();
+    } /* Implicit barrier here.  */
+    #pragma omp master
+    {
+      #pragma omp target nowait map (tofrom: a, b)
+      {
+	a *= 2;
+	b++;
+      }
+      #pragma omp target map (tofrom: c) nowait
+      c--;
+    }
+    #pragma omp barrier
+    if (a != 48 || b != 39 || c != 18)
+      abort ();
+  }
+  return 0;
+}
--- libgomp/testsuite/libgomp.c/target-33.c.jj	2015-11-12 16:20:23.332860573 +0100
+++ libgomp/testsuite/libgomp.c/target-33.c	2015-11-13 09:45:27.000000000 +0100
@@ -0,0 +1,93 @@ 
+extern void abort (void);
+
+int
+main ()
+{
+  int a = 1, b = 2, c = 4, d[7];
+  #pragma omp taskgroup
+  {
+    #pragma omp target enter data nowait map (to: a, b, c) depend(out: d[0])
+    #pragma omp target nowait map (alloc: a, b) depend(in: d[0]) depend(out: d[1])
+    {
+      #pragma omp atomic update
+      a |= 4;
+      #pragma omp atomic update
+      b |= 8;
+    }
+    #pragma omp target nowait map (alloc: a, c) depend(in: d[0]) depend(out: d[2])
+    {
+      #pragma omp atomic update
+      a |= 16;
+      #pragma omp atomic update
+      c |= 32;
+    }
+    #pragma omp target exit data nowait map (from: a, b, c) depend(in: d[1], d[2])
+  }
+  if (a != 21 || b != 10 || c != 36)
+    abort ();
+  #pragma omp target map (tofrom: a, b) nowait
+  {
+    a &= ~16;
+    b &= ~2;
+  }
+  #pragma omp target map (tofrom: c) nowait
+  {
+    c |= 8;
+  }
+  #pragma omp barrier
+  if (a != 5 || b != 8 || c != 44)
+    abort ();
+  #pragma omp target map (tofrom: a, b) nowait
+  {
+    a |= 32;
+    b |= 4;
+  }
+  #pragma omp target map (tofrom: c) nowait
+  {
+    c &= ~4;
+  }
+  #pragma omp taskwait
+  if (a != 37 || b != 12 || c != 40)
+    abort ();
+  #pragma omp target nowait map (tofrom: a, b) depend(out: d[3])
+  {
+    #pragma omp atomic update
+    a = a + 9;
+    b -= 8;
+  }
+  #pragma omp target nowait map (tofrom: a, c) depend(out: d[4])
+  {
+    #pragma omp atomic update
+    a = a + 4;
+    c >>= 1;
+  }
+  #pragma omp task if (0) depend (in: d[3], d[4]) shared (a, b, c)
+  if (a != 50 || b != 4 || c != 20)
+    abort ();
+  #pragma omp task shared (a)
+  a += 50;
+  #pragma omp target nowait map (tofrom: b)
+  b++;
+  #pragma omp target map (tofrom: c) nowait
+  c--;
+  #pragma omp taskwait
+  if (a != 100 || b != 5 || c != 19)
+    abort ();
+  #pragma omp target map (tofrom: a) nowait depend(out: d[5])
+  a++;
+  #pragma omp target map (tofrom: b) nowait depend(out: d[6])
+  b++;
+  #pragma omp target map (tofrom: a, b) depend(in: d[5], d[6])
+  {
+    if (a != 101 || b != 6)
+      a = -9;
+    else
+      {
+	a = 24;
+	b = 38;
+      }
+  }
+  if (a != 24 || b != 38)
+    abort ();
+  return 0;
+}
--- libgomp/team.c.jj	2015-11-09 11:14:37.000000000 +0100
+++ libgomp/team.c	2015-11-12 15:09:23.584644449 +0100
@@ -272,6 +272,8 @@  gomp_free_thread (void *arg __attribute_
       free (pool);
       thr->thread_pool = NULL;
     }
+  if (thr->ts.level == 0 && __builtin_expect (thr->ts.team != NULL, 0))
+    gomp_team_end ();
   if (thr->task != NULL)
     {
       struct gomp_task *task = thr->task;
@@ -301,7 +303,7 @@  gomp_team_start (void (*fn) (void *), vo
   struct gomp_thread **affinity_thr = NULL;
 
   thr = gomp_thread ();
-  nested = thr->ts.team != NULL;
+  nested = thr->ts.level;
   pool = thr->thread_pool;
   task = thr->task;
   icv = task ? &task->icv : &gomp_global_icv;
--- libgomp/target.c.jj	2015-11-09 11:14:37.325239961 +0100
+++ libgomp/target.c	2015-11-13 11:42:28.255345131 +0100
@@ -1348,17 +1348,7 @@  GOMP_target (int device, void (*fn) (voi
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     GOMP_MAP_VARS_TARGET);
-  struct gomp_thread old_thr, *thr = gomp_thread ();
-  old_thr = *thr;
-  memset (thr, '\0', sizeof (*thr));
-  if (gomp_places_list)
-    {
-      thr->place = old_thr.place;
-      thr->ts.place_partition_len = gomp_places_list_len;
-    }
   devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
-  gomp_free_thread (thr);
-  *thr = old_thr;
   gomp_unmap_vars (tgt_vars, true);
 }
 
@@ -1387,10 +1377,52 @@  GOMP_target_ext (int device, void (*fn)
   (void) num_teams;
   (void) thread_limit;
 
-  /* If there are depend clauses, but nowait is not present,
-     block the parent task until the dependencies are resolved
-     and then just continue with the rest of the function as if it
-     is a merged task.  */
+  if (flags & GOMP_TARGET_FLAG_NOWAIT)
+    {
+      struct gomp_thread *thr = gomp_thread ();
+      /* Create a team if we don't have any around, as nowait
+	 target tasks make sense to run asynchronously even when
+	 outside of any parallel.  */
+      if (__builtin_expect (thr->ts.team == NULL, 0))
+	{
+	  struct gomp_team *team = gomp_new_team (1);
+	  struct gomp_task *task = thr->task;
+	  struct gomp_task_icv *icv = task ? &task->icv : &gomp_global_icv;
+	  team->prev_ts = thr->ts;
+	  thr->ts.team = team;
+	  thr->ts.team_id = 0;
+	  thr->ts.work_share = &team->work_shares[0];
+	  thr->ts.last_work_share = NULL;
+#ifdef HAVE_SYNC_BUILTINS
+	  thr->ts.single_count = 0;
+#endif
+	  thr->ts.static_trip = 0;
+	  thr->task = &team->implicit_task[0];
+	  gomp_init_task (thr->task, NULL, icv);
+	  if (task)
+	    {
+	      thr->task = task;
+	      gomp_end_task ();
+	      free (task);
+	      thr->task = &team->implicit_task[0];
+	    }
+	  else
+	    pthread_setspecific (gomp_thread_destructor, thr);
+	}
+      if (thr->ts.team
+	  && !thr->task->final_task)
+	{
+	  gomp_create_target_task (devicep, fn, mapnum, hostaddrs,
+				   sizes, kinds, flags, depend,
+				   GOMP_TARGET_TASK_BEFORE_MAP);
+	  return;
+	}
+    }
+
+  /* If there are depend clauses, but nowait is not present
+     (or we are in a final task), block the parent task until the
+     dependencies are resolved and then just continue with the rest
+     of the function as if it is a merged task.  */
   if (depend != NULL)
     {
       struct gomp_thread *thr = gomp_thread ();
@@ -1410,17 +1442,7 @@  GOMP_target_ext (int device, void (*fn)
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
 		     GOMP_MAP_VARS_TARGET);
-  struct gomp_thread old_thr, *thr = gomp_thread ();
-  old_thr = *thr;
-  memset (thr, '\0', sizeof (*thr));
-  if (gomp_places_list)
-    {
-      thr->place = old_thr.place;
-      thr->ts.place_partition_len = gomp_places_list_len;
-    }
   devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
-  gomp_free_thread (thr);
-  *thr = old_thr;
   gomp_unmap_vars (tgt_vars, true);
 }
 
@@ -1527,23 +1549,25 @@  GOMP_target_update_ext (int device, size
 	      && thr->ts.team
 	      && !thr->task->final_task)
 	    {
-	      gomp_create_target_task (devicep, (void (*) (void *)) NULL,
-				       mapnum, hostaddrs, sizes, kinds,
-				       flags | GOMP_TARGET_FLAG_UPDATE,
-				       depend);
-	      return;
+	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
+					   mapnum, hostaddrs, sizes, kinds,
+					   flags | GOMP_TARGET_FLAG_UPDATE,
+					   depend, GOMP_TARGET_TASK_DATA))
+		return;
 	    }
+	  else
+	    {
+	      struct gomp_team *team = thr->ts.team;
+	      /* If parallel or taskgroup has been cancelled, don't start new
+		 tasks.  */
+	      if (team
+		  && (gomp_team_barrier_cancelled (&team->barrier)
+		      || (thr->task->taskgroup
+			  && thr->task->taskgroup->cancelled)))
+		return;
 
-	  struct gomp_team *team = thr->ts.team;
-	  /* If parallel or taskgroup has been cancelled, don't start new
-	     tasks.  */
-	  if (team
-	      && (gomp_team_barrier_cancelled (&team->barrier)
-		  || (thr->task->taskgroup
-		      && thr->task->taskgroup->cancelled)))
-	    return;
-
-	  gomp_task_maybe_wait_for_dependencies (depend);
+	      gomp_task_maybe_wait_for_dependencies (depend);
+	    }
 	}
     }
 
@@ -1647,22 +1671,25 @@  GOMP_target_enter_exit_data (int device,
 	      && thr->ts.team
 	      && !thr->task->final_task)
 	    {
-	      gomp_create_target_task (devicep, (void (*) (void *)) NULL,
-				       mapnum, hostaddrs, sizes, kinds,
-				       flags, depend);
-	      return;
+	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
+					   mapnum, hostaddrs, sizes, kinds,
+					   flags, depend,
+					   GOMP_TARGET_TASK_DATA))
+		return;
 	    }
+	  else
+	    {
+	      struct gomp_team *team = thr->ts.team;
+	      /* If parallel or taskgroup has been cancelled, don't start new
+		 tasks.  */
+	      if (team
+		  && (gomp_team_barrier_cancelled (&team->barrier)
+		      || (thr->task->taskgroup
+			  && thr->task->taskgroup->cancelled)))
+		return;
 
-	  struct gomp_team *team = thr->ts.team;
-	  /* If parallel or taskgroup has been cancelled, don't start new
-	     tasks.  */
-	  if (team
-	      && (gomp_team_barrier_cancelled (&team->barrier)
-		  || (thr->task->taskgroup
-		      && thr->task->taskgroup->cancelled)))
-	    return;
-
-	  gomp_task_maybe_wait_for_dependencies (depend);
+	      gomp_task_maybe_wait_for_dependencies (depend);
+	    }
 	}
     }
 
@@ -1694,38 +1721,65 @@  GOMP_target_enter_exit_data (int device,
     gomp_exit_data (devicep, mapnum, hostaddrs, sizes, kinds);
 }
 
-void
+bool
 gomp_target_task_fn (void *data)
 {
   struct gomp_target_task *ttask = (struct gomp_target_task *) data;
+  struct gomp_device_descr *devicep = ttask->devicep;
+
   if (ttask->fn != NULL)
     {
-      /* GOMP_target_ext */
-    }
-  else if (ttask->devicep == NULL
-	   || !(ttask->devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
-    return;
+      if (devicep == NULL
+	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	{
+	  ttask->state = GOMP_TARGET_TASK_FALLBACK;
+	  gomp_target_fallback_firstprivate (ttask->fn, ttask->mapnum,
+					     ttask->hostaddrs, ttask->sizes,
+					     ttask->kinds);
+	  return false;
+	}
+
+      if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+	{
+	  gomp_unmap_vars (ttask->tgt, true);
+	  return false;
+	}
+
+      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
+      ttask->tgt
+	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
+			 ttask->sizes, ttask->kinds, true,
+			 GOMP_MAP_VARS_TARGET);
+      ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
+
+      devicep->async_run_func (devicep->target_id, fn_addr,
+			       (void *) ttask->tgt->tgt_start, (void *) ttask);
+      return true;
+    }
+  else if (devicep == NULL
+	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+    return false;
 
   size_t i;
   if (ttask->flags & GOMP_TARGET_FLAG_UPDATE)
-    gomp_update (ttask->devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
+    gomp_update (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
 		 ttask->kinds, true);
   else if ((ttask->flags & GOMP_TARGET_FLAG_EXIT_DATA) == 0)
     for (i = 0; i < ttask->mapnum; i++)
       if ((ttask->kinds[i] & 0xff) == GOMP_MAP_STRUCT)
 	{
-	  gomp_map_vars (ttask->devicep, ttask->sizes[i] + 1,
-			 &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
-			 &ttask->kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
+	  gomp_map_vars (devicep, ttask->sizes[i] + 1, &ttask->hostaddrs[i],
+			 NULL, &ttask->sizes[i], &ttask->kinds[i], true,
+			 GOMP_MAP_VARS_ENTER_DATA);
 	  i += ttask->sizes[i];
 	}
       else
-	gomp_map_vars (ttask->devicep, 1, &ttask->hostaddrs[i], NULL,
-		       &ttask->sizes[i], &ttask->kinds[i],
-		       true, GOMP_MAP_VARS_ENTER_DATA);
+	gomp_map_vars (devicep, 1, &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
+		       &ttask->kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
   else
-    gomp_exit_data (ttask->devicep, ttask->mapnum, ttask->hostaddrs,
-		    ttask->sizes, ttask->kinds);
+    gomp_exit_data (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
+		    ttask->kinds);
+  return false;
 }
 
 void
@@ -2170,6 +2224,7 @@  gomp_load_plugin_for_device (struct gomp
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {
       DLSYM (run);
+      DLSYM (async_run);
       DLSYM (dev2dev);
     }
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
--- libgomp/libgomp.map.jj	2015-10-26 15:38:20.000000000 +0100
+++ libgomp/libgomp.map	2015-11-11 16:15:23.807818735 +0100
@@ -406,3 +406,8 @@  GOMP_PLUGIN_1.0 {
 	GOMP_PLUGIN_async_unmap_vars;
 	GOMP_PLUGIN_acc_thread;
 };
+
+GOMP_PLUGIN_1.1 {
+  global:
+	GOMP_PLUGIN_target_task_completion;
+} GOMP_PLUGIN_1.0;
--- libgomp/task.c.jj	2015-11-09 11:14:37.332239862 +0100
+++ libgomp/task.c	2015-11-13 15:36:05.954411999 +0100
@@ -480,13 +480,119 @@  ialias (GOMP_taskgroup_end)
 #undef UTYPE
 #undef GOMP_taskloop
 
-/* Called for nowait target tasks.  */
+static void inline
+priority_queue_move_task_first (enum priority_queue_type type,
+				struct priority_queue *head,
+				struct gomp_task *task)
+{
+#if _LIBGOMP_CHECKING_
+  if (!priority_queue_task_in_queue_p (type, head, task))
+    gomp_fatal ("Attempt to move first missing task %p", task);
+#endif
+  struct priority_list *list;
+  if (priority_queue_multi_p (head))
+    {
+      list = priority_queue_lookup_priority (head, task->priority);
+#if _LIBGOMP_CHECKING_
+      if (!list)
+	gomp_fatal ("Unable to find priority %d", task->priority);
+#endif
+    }
+  else
+    list = &head->l;
+  priority_list_remove (list, task_to_priority_node (type, task), 0);
+  priority_list_insert (type, list, task, task->priority,
+			PRIORITY_INSERT_BEGIN, type == PQ_CHILDREN,
+			task->parent_depends_on);
+}
+
+/* Actual body of GOMP_PLUGIN_target_task_completion that is executed
+   with team->task_lock held, or is executed in the thread that called
+   gomp_target_task_fn if GOMP_PLUGIN_target_task_completion has been
+   run before it acquires team->task_lock.  */
+
+static void
+gomp_target_task_completion (struct gomp_team *team, struct gomp_task *task)
+{
+  struct gomp_task *parent = task->parent;
+  if (parent)
+    priority_queue_move_task_first (PQ_CHILDREN, &parent->children_queue,
+				    task);
+
+  struct gomp_taskgroup *taskgroup = task->taskgroup;
+  if (taskgroup)
+    priority_queue_move_task_first (PQ_TASKGROUP, &taskgroup->taskgroup_queue,
+				    task);
+
+  priority_queue_insert (PQ_TEAM, &team->task_queue, task, task->priority,
+			 PRIORITY_INSERT_BEGIN, false,
+			 task->parent_depends_on);
+  task->kind = GOMP_TASK_WAITING;
+  if (parent && parent->taskwait)
+    {
+      if (parent->taskwait->in_taskwait)
+	{
+	  /* One more task has had its dependencies met.
+	     Inform any waiters.  */
+	  parent->taskwait->in_taskwait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+      else if (parent->taskwait->in_depend_wait)
+	{
+	  /* One more task has had its dependencies met.
+	     Inform any waiters.  */
+	  parent->taskwait->in_depend_wait = false;
+	  gomp_sem_post (&parent->taskwait->taskwait_sem);
+	}
+    }
+  if (taskgroup && taskgroup->in_taskgroup_wait)
+    {
+      /* One more task has had its dependencies met.
+	 Inform any waiters.  */
+      taskgroup->in_taskgroup_wait = false;
+      gomp_sem_post (&taskgroup->taskgroup_sem);
+    }
+
+  ++team->task_queued_count;
+  gomp_team_barrier_set_task_pending (&team->barrier);
+  /* I'm afraid this can't be done after releasing team->task_lock,
+     as gomp_target_task_completion is run from unrelated thread and
+     therefore in between gomp_mutex_unlock and gomp_team_barrier_wake
+     the team could be gone already.  */
+  if (team->nthreads > team->task_running_count)
+    gomp_team_barrier_wake (&team->barrier, 1);
+}
+
+/* Signal that a target task TTASK has completed the asynchronously
+   running phase and should be requeued as a task to handle the
+   variable unmapping.  */
 
 void
+GOMP_PLUGIN_target_task_completion (void *data)
+{
+  struct gomp_target_task *ttask = (struct gomp_target_task *) data;
+  struct gomp_task *task = ttask->task;
+  struct gomp_team *team = ttask->team;
+
+  gomp_mutex_lock (&team->task_lock);
+  if (ttask->state == GOMP_TARGET_TASK_READY_TO_RUN)
+    {
+      ttask->state = GOMP_TARGET_TASK_FINISHED;
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+  ttask->state = GOMP_TARGET_TASK_FINISHED;
+  gomp_target_task_completion (team, task);
+  gomp_mutex_unlock (&team->task_lock);
+}
+
+/* Called for nowait target tasks.  */
+
+bool
 gomp_create_target_task (struct gomp_device_descr *devicep,
 			 void (*fn) (void *), size_t mapnum, void **hostaddrs,
 			 size_t *sizes, unsigned short *kinds,
-			 unsigned int flags, void **depend)
+			 unsigned int flags, void **depend,
+			 enum gomp_target_task_state state)
 {
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr->ts.team;
@@ -495,7 +601,7 @@  gomp_create_target_task (struct gomp_dev
   if (team
       && (gomp_team_barrier_cancelled (&team->barrier)
 	  || (thr->task->taskgroup && thr->task->taskgroup->cancelled)))
-    return;
+    return true;
 
   struct gomp_target_task *ttask;
   struct gomp_task *task;
@@ -503,19 +609,45 @@  gomp_create_target_task (struct gomp_dev
   struct gomp_taskgroup *taskgroup = parent->taskgroup;
   bool do_wake;
   size_t depend_size = 0;
+  uintptr_t depend_cnt = 0;
+  size_t tgt_align = 0, tgt_size = 0;
 
   if (depend != NULL)
-    depend_size = ((uintptr_t) depend[0]
-		   * sizeof (struct gomp_task_depend_entry));
+    {
+      depend_cnt = (uintptr_t) depend[0];
+      depend_size = depend_cnt * sizeof (struct gomp_task_depend_entry);
+    }
+  if (fn)
+    {
+      /* GOMP_MAP_FIRSTPRIVATE need to be copied first, as they are
+	 firstprivate on the target task.  */
+      size_t i;
+      for (i = 0; i < mapnum; i++)
+	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+	  {
+	    size_t align = (size_t) 1 << (kinds[i] >> 8);
+	    if (tgt_align < align)
+	      tgt_align = align;
+	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	    tgt_size += sizes[i];
+	  }
+      if (tgt_align)
+	tgt_size += tgt_align - 1;
+      else
+	tgt_size = 0;
+    }
+
   task = gomp_malloc (sizeof (*task) + depend_size
 		      + sizeof (*ttask)
 		      + mapnum * (sizeof (void *) + sizeof (size_t)
-				  + sizeof (unsigned short)));
+				  + sizeof (unsigned short))
+		      + tgt_size);
   gomp_init_task (task, parent, gomp_icv (false));
+  task->priority = 0;
   task->kind = GOMP_TASK_WAITING;
   task->in_tied_task = parent->in_tied_task;
   task->taskgroup = taskgroup;
-  ttask = (struct gomp_target_task *) &task->depend[(uintptr_t) depend[0]];
+  ttask = (struct gomp_target_task *) &task->depend[depend_cnt];
   ttask->devicep = devicep;
   ttask->fn = fn;
   ttask->mapnum = mapnum;
@@ -524,8 +656,29 @@  gomp_create_target_task (struct gomp_dev
   memcpy (ttask->sizes, sizes, mapnum * sizeof (size_t));
   ttask->kinds = (unsigned short *) &ttask->sizes[mapnum];
   memcpy (ttask->kinds, kinds, mapnum * sizeof (unsigned short));
+  if (tgt_align)
+    {
+      char *tgt = (char *) &ttask->kinds[mapnum];
+      size_t i;
+      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
+      if (al)
+	tgt += tgt_align - al;
+      tgt_size = 0;
+      for (i = 0; i < mapnum; i++)
+	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+	  {
+	    size_t align = (size_t) 1 << (kinds[i] >> 8);
+	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
+	    ttask->hostaddrs[i] = tgt + tgt_size;
+	    tgt_size = tgt_size + sizes[i];
+	  }
+    }
   ttask->flags = flags;
-  task->fn = gomp_target_task_fn;
+  ttask->state = state;
+  ttask->task = task;
+  ttask->team = team;
+  task->fn = NULL;
   task->fn_data = ttask;
   task->final_task = 0;
   gomp_mutex_lock (&team->task_lock);
@@ -536,19 +689,65 @@  gomp_create_target_task (struct gomp_dev
       gomp_mutex_unlock (&team->task_lock);
       gomp_finish_task (task);
       free (task);
-      return;
+      return true;
     }
-  if (taskgroup)
-    taskgroup->num_children++;
   if (depend_size)
     {
       gomp_task_handle_depend (task, parent, depend);
       if (task->num_dependees)
 	{
+	  if (taskgroup)
+	    taskgroup->num_children++;
 	  gomp_mutex_unlock (&team->task_lock);
-	  return;
+	  return true;
 	}
     }
+  if (state == GOMP_TARGET_TASK_DATA)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      gomp_finish_task (task);
+      free (task);
+      return false;
+    }
+  if (taskgroup)
+    taskgroup->num_children++;
+  /* For async offloading, if we don't need to wait for dependencies,
+     run the gomp_target_task_fn right away, essentially schedule the
+     mapping part of the task in the current thread.  */
+  if (devicep != NULL
+      && (devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+    {
+      priority_queue_insert (PQ_CHILDREN, &parent->children_queue, task, 0,
+			     PRIORITY_INSERT_END,
+			     /*adjust_parent_depends_on=*/false,
+			     task->parent_depends_on);
+      if (taskgroup)
+	priority_queue_insert (PQ_TASKGROUP, &taskgroup->taskgroup_queue,
+			       task, 0, PRIORITY_INSERT_END,
+			       /*adjust_parent_depends_on=*/false,
+			       task->parent_depends_on);
+      task->pnode[PQ_TEAM].next = NULL;
+      task->pnode[PQ_TEAM].prev = NULL;
+      task->kind = GOMP_TASK_TIED;
+      ++team->task_count;
+      gomp_mutex_unlock (&team->task_lock);
+
+      thr->task = task;
+      gomp_target_task_fn (task->fn_data);
+      thr->task = parent;
+
+      gomp_mutex_lock (&team->task_lock);
+      task->kind = GOMP_TASK_ASYNC_RUNNING;
+      /* If GOMP_PLUGIN_target_task_completion has run already
+	 in between gomp_target_task_fn and the mutex lock,
+	 perform the requeuing here.  */
+      if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+	gomp_target_task_completion (team, task);
+      else
+	ttask->state = GOMP_TARGET_TASK_RUNNING;
+      gomp_mutex_unlock (&team->task_lock);
+      return true;
+    }
   priority_queue_insert (PQ_CHILDREN, &parent->children_queue, task, 0,
 			 PRIORITY_INSERT_BEGIN,
 			 /*adjust_parent_depends_on=*/false,
@@ -570,6 +769,7 @@  gomp_create_target_task (struct gomp_dev
   gomp_mutex_unlock (&team->task_lock);
   if (do_wake)
     gomp_team_barrier_wake (&team->barrier, 1);
+  return true;
 }
 
 /* Given a parent_depends_on task in LIST, move it to the front of its
@@ -1041,7 +1241,29 @@  gomp_barrier_handle_tasks (gomp_barrier_
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  team->task_running_count--;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
@@ -1170,7 +1392,28 @@  GOMP_taskwait (void)
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
@@ -1342,7 +1585,28 @@  gomp_task_maybe_wait_for_dependencies (v
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
@@ -1423,6 +1687,17 @@  GOMP_taskgroup_end (void)
   if (team == NULL)
     return;
   taskgroup = task->taskgroup;
+  if (__builtin_expect (taskgroup == NULL, 0)
+      && thr->ts.level == 0)
+    {
+      /* This can happen if GOMP_taskgroup_start is called when
+	 thr->ts.team == NULL, but inside of the taskgroup there
+	 is #pragma omp target nowait that creates an implicit
+	 team with a single thread.  In this case, we want to wait
+	 for all outstanding tasks in this team.  */
+      gomp_team_barrier_wait (&team->barrier);
+      return;
+    }
 
   /* The acquire barrier on load of taskgroup->num_children here
      synchronizes with the write of 0 in gomp_task_run_post_remove_taskgroup.
@@ -1450,8 +1725,8 @@  GOMP_taskgroup_end (void)
 		= priority_queue_next_task (PQ_CHILDREN, &task->children_queue,
 					    PQ_TEAM, &team->task_queue,
 					    &unused);
-            }
-          else
+	    }
+	  else
 	    {
 	      gomp_mutex_unlock (&team->task_lock);
 	      if (to_free)
@@ -1506,7 +1781,28 @@  GOMP_taskgroup_end (void)
       if (child_task)
 	{
 	  thr->task = child_task;
-	  child_task->fn (child_task->fn_data);
+	  if (__builtin_expect (child_task->fn == NULL, 0))
+	    {
+	      if (gomp_target_task_fn (child_task->fn_data))
+		{
+		  thr->task = task;
+		  gomp_mutex_lock (&team->task_lock);
+		  child_task->kind = GOMP_TASK_ASYNC_RUNNING;
+		  struct gomp_target_task *ttask
+		    = (struct gomp_target_task *) child_task->fn_data;
+		  /* If GOMP_PLUGIN_target_task_completion has run already
+		     in between gomp_target_task_fn and the mutex lock,
+		     perform the requeuing here.  */
+		  if (ttask->state == GOMP_TARGET_TASK_FINISHED)
+		    gomp_target_task_completion (team, child_task);
+		  else
+		    ttask->state = GOMP_TARGET_TASK_RUNNING;
+		  child_task = NULL;
+		  continue;
+		}
+	    }
+	  else
+	    child_task->fn (child_task->fn_data);
 	  thr->task = task;
 	}
       else
--- libgomp/priority_queue.c.jj	2015-11-09 11:15:33.000000000 +0100
+++ libgomp/priority_queue.c	2015-11-10 17:52:33.769414428 +0100
@@ -85,7 +85,7 @@  priority_queue_task_in_queue_p (enum pri
    order.  LIST is a priority list of type TYPE.
 
    The expected order is that GOMP_TASK_WAITING tasks come before
-   GOMP_TASK_TIED ones.
+   GOMP_TASK_TIED/GOMP_TASK_ASYNC_RUNNING ones.
 
    If CHECK_DEPS is TRUE, we also check that parent_depends_on WAITING
    tasks come before !parent_depends_on WAITING tasks.  This is only
@@ -104,7 +104,7 @@  priority_list_verify (enum priority_queu
       struct gomp_task *t = priority_node_to_task (type, p);
       if (seen_tied && t->kind == GOMP_TASK_WAITING)
 	gomp_fatal ("priority_queue_verify: WAITING task after TIED");
-      if (t->kind == GOMP_TASK_TIED)
+      if (t->kind >= GOMP_TASK_TIED)
 	seen_tied = true;
       else if (check_deps && t->kind == GOMP_TASK_WAITING)
 	{
--- libgomp/libgomp.h.jj	2015-11-09 11:14:37.326239947 +0100
+++ libgomp/libgomp.h	2015-11-13 11:41:46.743939113 +0100
@@ -373,7 +373,12 @@  enum gomp_task_kind
   /* Task created by GOMP_task and waiting to be run.  */
   GOMP_TASK_WAITING,
   /* Task currently executing or scheduled and about to execute.  */
-  GOMP_TASK_TIED
+  GOMP_TASK_TIED,
+  /* Used for target tasks that have vars mapped and async run started,
+     but not yet completed.  Once that completes, they will be readded
+     into the queues as GOMP_TASK_WAITING in order to perform the var
+     unmapping.  */
+  GOMP_TASK_ASYNC_RUNNING
 };
 
 struct gomp_task_depend_entry
@@ -453,6 +458,8 @@  struct gomp_task
   struct gomp_task_depend_entry depend[];
 };
 
+/* This structure describes a single #pragma omp taskgroup.  */
+
 struct gomp_taskgroup
 {
   struct gomp_taskgroup *prev;
@@ -464,6 +471,19 @@  struct gomp_taskgroup
   size_t num_children;
 };
 
+/* Various state of OpenMP async offloading tasks.  */
+enum gomp_target_task_state
+{
+  GOMP_TARGET_TASK_DATA,
+  GOMP_TARGET_TASK_BEFORE_MAP,
+  GOMP_TARGET_TASK_FALLBACK,
+  GOMP_TARGET_TASK_READY_TO_RUN,
+  GOMP_TARGET_TASK_RUNNING,
+  GOMP_TARGET_TASK_FINISHED
+};
+
+/* This structure describes a target task.  */
+
 struct gomp_target_task
 {
   struct gomp_device_descr *devicep;
@@ -472,6 +492,10 @@  struct gomp_target_task
   size_t *sizes;
   unsigned short *kinds;
   unsigned int flags;
+  enum gomp_target_task_state state;
+  struct target_mem_desc *tgt;
+  struct gomp_task *task;
+  struct gomp_team *team;
   void *hostaddrs[];
 };
 
@@ -723,10 +747,10 @@  extern void gomp_init_task (struct gomp_
 extern void gomp_end_task (void);
 extern void gomp_barrier_handle_tasks (gomp_barrier_state_t);
 extern void gomp_task_maybe_wait_for_dependencies (void **);
-extern void gomp_create_target_task (struct gomp_device_descr *,
+extern bool gomp_create_target_task (struct gomp_device_descr *,
 				     void (*) (void *), size_t, void **,
 				     size_t *, unsigned short *, unsigned int,
-				     void **);
+				     void **, enum gomp_target_task_state);
 
 static void inline
 gomp_finish_task (struct gomp_task *task)
@@ -747,7 +771,7 @@  extern void gomp_free_thread (void *);
 
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
-extern void gomp_target_task_fn (void *);
+extern bool gomp_target_task_fn (void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
@@ -901,6 +925,7 @@  struct gomp_device_descr
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void *(*dev2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
+  void (*async_run_func) (int, void *, void *, void *);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;