diff mbox

[gomp4] OpenACC async re-work

Message ID f3fe2923-fd7f-64e5-326d-c20f161aa205@mentor.com
State New
Headers show

Commit Message

Chung-Lin Tang June 24, 2017, 7:54 a.m. UTC
Hi Cesar, Thomas,
This patch is the re-implementation of OpenACC async we talked about.
The changes are rather large, so I am putting it here for a few days before
actually committing them to gomp-4_0-branch. Would appreciate if you guys
take a look.

To overall describe the highlights of the changes:

(1) Instead of essentially implementing the entire OpenACC async support
inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
plugin functions. Most of the OpenACC specific logic is pulled into
libgomp/oacc-async.c

(2) CUDA events are no longer used. The limitation of no CUDA calls inside
CUDA callbacks were a problem for resource freeing, but we now stash
them onto the ptx_device and free them later.

(3) For 'wait + async', we now add a local thread synchronize, instead
of just ordering the streams.

(4) To work with the (3) change, some front end changes were added to
propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
represent a 'wait all'.

Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
this after the weekend (or Tues.)

Thanks,
Chung-Lin

2017-06-24  Chung-Lin Tang  <cltang@codesourcery.com>

        gcc/c/
        * c-parser.c (c_parser_oacc_clause_wait): Add representation of wait
        clause without argument as 'wait (GOMP_ASYNC_NOVAL)', adjust comments.

        gcc/cp/
        * parser.c (cp_parser_oacc_clause_wait): Add representation of wait
        clause without argument as 'wait (GOMP_ASYNC_NOVAL)', adjust comments.

        gcc/fortran/
        * trans-openmp.c (gfc_trans_omp_clauses_1): Add representation of wait
        clause without argument as 'wait (GOMP_ASYNC_NOVAL)'.

        gcc/
        * omp-low.c (expand_omp_target): Add middle-end support for handling
        OMP_CLAUSE_WAIT clause with a GOMP_ASYNC_NOVAL(-1) as the argument.

        gcc/testsuite/
        * c-c++-common/goacc/dtype-1.c: Adjust testcase.
        * gfortran.dg/goacc/dtype-1.f95: Likewise.

        include/
        * gomp-constants.h (GOMP_LAUNCH_OP_MASK): Define.
        (GOMP_LAUNCH_PACK): Add bitwise-and of GOMP_LAUNCH_OP_MASK.
        (GOMP_LAUNCH_OP): Likewise.


        libgomp/
	* libgomp-plugin.h (struct goacc_asyncqueue): Declare.
        (struct goacc_asyncqueue_list): Likewise.
        (goacc_aq): Likewise.
        (goacc_aq_list): Likewise.
	(GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_async_test): Remove.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust declaration.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Likewise.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Likewise.

        (GOMP_OFFLOAD_openacc_async_exec): Declare.
        (GOMP_OFFLOAD_openacc_async_construct): Declare.
        (GOMP_OFFLOAD_openacc_async_destruct): Declare.
        (GOMP_OFFLOAD_openacc_async_test): Declare.
        (GOMP_OFFLOAD_openacc_async_synchronize): Declare.
        (GOMP_OFFLOAD_openacc_async_serialize): Declare.
        (GOMP_OFFLOAD_openacc_async_queue_callback): Declare.
        (GOMP_OFFLOAD_openacc_async_host2dev): Declare.
        (GOMP_OFFLOAD_openacc_async_dev2host): Declare.

	* libgomp.h (struct acc_dispatch_t): Define 'async' sub-struct.
        (gomp_acc_insert_pointer): Adjust declaration.
        (gomp_copy_host2dev): New declaration.
        (gomp_copy_dev2host): Likewise.
        (gomp_map_vars_async): Likewise.
        (gomp_unmap_tgt): Likewise.
        (gomp_unmap_vars_async): Likewise.
        (gomp_fini_device): Likewise.

	* oacc-async.c (get_goacc_thread): New function.
        (get_goacc_thread_device): New function.
        (lookup_goacc_asyncqueue): New function.
        (get_goacc_asyncqueue): New function.
        (acc_async_test): Adjust code to use new async design.
        (acc_async_test_all): Likewise.
        (acc_wait): Likewise.
        (acc_wait_async): Likewise.

	* oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
        design.
        (acc_set_cuda_stream): Likewise.

	* oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
        (host_openacc_register_async_cleanup): Remove.
        (host_openacc_async_exec): New function.
        (host_openacc_async_test): Adjust parameters.
        (host_openacc_async_test_all): Remove.
        (host_openacc_async_wait): Remove.
        (host_openacc_async_wait_async): Remove.
        (host_openacc_async_wait_all): Remove.
        (host_openacc_async_wait_all_async): Remove.
        (host_openacc_async_set_async): Remove.
        (host_openacc_async_synchronize): New function.
        (host_openacc_async_serialize): New function.
        (host_openacc_async_host2dev): New function.
        (host_openacc_async_dev2host): New function.
        (host_openacc_async_queue_callback): New function.
        (host_openacc_async_construct): New function.
        (host_openacc_async_destruct): New function.
        (struct gomp_device_descr host_dispatch): Remove initialization of old
        interface, add intialization of new async sub-struct.
        * oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
        (goacc_attach_host_thread_to_device): Remove old async code usage.
        * oacc-int.h (goacc_init_asyncqueues): New declaration.
        (goacc_fini_asyncqueues): Likewise.
        (goacc_async_copyout_unmap_vars): Likewise.
        (goacc_async_free): Likewise.
        (get_goacc_asyncqueue): Likewise.
        (lookup_goacc_asyncqueue): Likewise.

        * oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
        design.
        (acc_is_present): Explicitly use 1/0 as return value;
        (present_create_copy): Adjust code to use new async design.
        (delete_copyout): Likewise.
        (update_dev_host): Likewise.
        (gomp_acc_insert_pointer): Add async parameter, adjust code to use new
        async design.
        (gomp_acc_remove_pointer): Adjust code to use new async design.
        * oacc-parallel.c (GOACC_parallel_keyed): Adjust code to use new async
        design, adjust profiling bits, interpret launch op as signed 16-bit
        field.
        (GOACC_enter_exit_data): Handle -1 as waits num, adjust code to use new
        async design.
        (goacc_wait): Adjust code to use new async design.
        (GOACC_update): Likewise.
        * oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.

	* target.c (goacc_device_copy_async): New function.
        (gomp_copy_host2dev): Remove 'static', add goacc_asyncqueue parameter,
        add goacc_device_copy_async case.
        (gomp_copy_dev2host): Likewise.
        (gomp_map_vars_existing): Add goacc_asyncqueue parameter, adjust code.
        (gomp_map_pointer): Likewise.
        (gomp_map_fields_existing): Likewise.
        (gomp_map_vars): Add function for compatiblity.
        (gomp_map_vars_async): Adapt from gomp_map_vars, add goacc_asyncqueue
        parameter.
        (gomp_unmap_tgt): Remove statis, add attribute_hidden.
        (gomp_unmap_vars): Add function for compatiblity.
        (gomp_unmap_vars_async): Adapt from gomp_unmap_vars, add
        goacc_asyncqueue parameter.
        (gomp_fini_device): New function.
        (gomp_exit_data): Adjust gomp_copy_dev2host call.
        (gomp_load_plugin_for_device): Remove old interface, adjust to load
        new async interface.
        (gomp_target_fini): Adjust code to call gomp_fini_device.

	* plugin/plugin-nvptx.c (struct cuda_map): Remove.
        (struct ptx_stream): Remove.
        (struct nvptx_thread): Remove current_stream field.
        (cuda_map_create): Remove.
        (cuda_map_destroy): Remove.
        (map_init): Remove.
        (map_fini): Remove.
        (map_pop): Remove.
        (map_push): Remove.
        (struct goacc_asyncqueue): Define.
        (struct nvptx_callback): Define.
        (struct ptx_free_block): Define.
        (struct ptx_device): Remove null_stream, active_streams, async_streams,
        stream_lock, and next fields.
        (enum ptx_event_type): Remove.
        (struct ptx_event): Remove.
        (ptx_event_lock): Remove.
        (ptx_events): Remove.
        (init_streams_for_device): Remove.
        (fini_streams_for_device): Remove.
        (select_stream_for_async): Remove.
        (nvptx_init): Remove ptx_events and ptx_event_lock references.
        (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
        case.
        (nvptx_open_device): Add free_blocks initialization, remove
        init_streams_for_device call.
        (nvptx_close_device): Remove fini_streams_for_device call, add
        free_blocks destruct code.
        (event_gc): Remove.
        (event_add): Remove.
        (nvptx_exec): Adjust parameters and code.
        (nvptx_free): Likewise.
        (nvptx_host2dev): Remove.
        (nvptx_dev2host): Remove.
        (nvptx_set_async): Remove.
        (nvptx_async_test): Remove.
        (nvptx_async_test_all): Remove.
        (nvptx_wait): Remove.
        (nvptx_wait_async): Remove.
        (nvptx_wait_all): Remove.
        (nvptx_wait_all_async): Remove.
        (nvptx_get_cuda_stream): Remove.
        (nvptx_set_cuda_stream): Remove.
        (GOMP_OFFLOAD_alloc): Adjust code.
        (GOMP_OFFLOAD_free): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (cuda_free_argmem): New function.
        (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
        (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
        (cuda_callback_wrapper): New function.
        (cuda_memcpy_sanity_check): New function.
        (GOMP_OFFLOAD_host2dev): Remove and re-implement.
        (GOMP_OFFLOAD_dev2host): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.

	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
        (cb_enter_data_start): Adjust testcase.
        * testsuite/libgomp.oacc-c-c++-common/data-2-lib.c: Adjust testcase.
        * testsuite/libgomp.oacc-c-c++-common/data-2.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/data-3.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-71.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-77.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-79.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-81.c: Likewise.
        * testsuite/libgomp.oacc-fortran/lib-12.f90: Likewise.

Comments

Cesar Philippidis June 26, 2017, 10:45 p.m. UTC | #1
I still need more time to review this, but ...

On 06/24/2017 12:54 AM, Chung-Lin Tang wrote:
> Hi Cesar, Thomas,
> This patch is the re-implementation of OpenACC async we talked about.
> The changes are rather large, so I am putting it here for a few days before
> actually committing them to gomp-4_0-branch. Would appreciate if you guys
> take a look.
>
> To overall describe the highlights of the changes:
> 
> (1) Instead of essentially implementing the entire OpenACC async support
> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
> plugin functions. Most of the OpenACC specific logic is pulled into
> libgomp/oacc-async.c

I'm not sure if plugins need to maintain backwards compatibility.
However, I don't see any changes inside libgomp.map, so maybe it's not
required.

> (2) CUDA events are no longer used. The limitation of no CUDA calls inside
> CUDA callbacks were a problem for resource freeing, but we now stash
> them onto the ptx_device and free them later.

Yay!

> (3) For 'wait + async', we now add a local thread synchronize, instead
> of just ordering the streams.
>
> (4) To work with the (3) change, some front end changes were added to
> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
> represent a 'wait all'.

What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
easier to make that change in the gimplifier?

> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
> this after the weekend (or Tues.)

> 	* plugin/plugin-nvptx.c (struct cuda_map): Remove.
>         (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>         (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.

These two functions seem extremely similar.  I wonder if you should
consolidate them.

Overall, I like how you were able eliminate the externally managed map_*
data structure which was used to pass in arguments to nvptx_exec.
Although I wonder if we should just pass in those individual arguments
directly to cuLaunchKernel. But that's a big change in itself.

Cesar
Chung-Lin Tang June 27, 2017, 10:56 a.m. UTC | #2
On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>> (1) Instead of essentially implementing the entire OpenACC async support
>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>> plugin functions. Most of the OpenACC specific logic is pulled into
>> libgomp/oacc-async.c
> I'm not sure if plugins need to maintain backwards compatibility.
> However, I don't see any changes inside libgomp.map, so maybe it's not
> required.

This patch is pretty large, but only inner workings (including libgomp vs. plugin interface) were modified.
No user API compatibility was altered.

>> (3) For 'wait + async', we now add a local thread synchronize, instead
>> of just ordering the streams.
>>
>> (4) To work with the (3) change, some front end changes were added to
>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>> represent a 'wait all'.
> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
> easier to make that change in the gimplifier?

Actually, we were basically throwing away argument-less wait clauses in front-ends
before this patch; i.e. '#pragma acc parallel async' and '#pragma acc parallel wait async'
were internally the same.

The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' clause representation
convention.

>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
>> this after the weekend (or Tues.)
>> 	* plugin/plugin-nvptx.c (struct cuda_map): Remove.
>>         (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>>         (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
> These two functions seem extremely similar.  I wonder if you should
> consolidate them.

It would be nice to have a proper set of pthreads based host fallback hooks
for the openacc.async substruct later. Ideally, an accelerator plugin can
just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based
GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of
the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines
would make this less clean.

> Overall, I like how you were able eliminate the externally managed map_*
> data structure which was used to pass in arguments to nvptx_exec.
> Although I wonder if we should just pass in those individual arguments
> directly to cuLaunchKernel. But that's a big change in itself.

I didn't think of that when working on the current patch, maybe later.

Thanks,
Chung-Lin
Cesar Philippidis June 28, 2017, 10:31 p.m. UTC | #3
On 06/27/2017 03:56 AM, Chung-Lin Tang wrote:
> On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>>> (1) Instead of essentially implementing the entire OpenACC async support
>>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>>> plugin functions. Most of the OpenACC specific logic is pulled into
>>> libgomp/oacc-async.c
>> I'm not sure if plugins need to maintain backwards compatibility.
>> However, I don't see any changes inside libgomp.map, so maybe it's not
>> required.
> 
> This patch is pretty large, but only inner workings (including libgomp vs. plugin interface) were modified.
> No user API compatibility was altered.
> 
>>> (3) For 'wait + async', we now add a local thread synchronize, instead
>>> of just ordering the streams.
>>>
>>> (4) To work with the (3) change, some front end changes were added to
>>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>>> represent a 'wait all'.
>> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
>> easier to make that change in the gimplifier?
> 
> Actually, we were basically throwing away argument-less wait clauses in front-ends
> before this patch; i.e. '#pragma acc parallel async' and '#pragma acc parallel wait async'
> were internally the same.
>
> The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' clause representation
> convention.

So then then wait was implied before? Or maybe that's why 'wait async'
didn't work.

>>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
>>> this after the weekend (or Tues.)
>>> 	* plugin/plugin-nvptx.c (struct cuda_map): Remove.
>>>         (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>>>         (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
>> These two functions seem extremely similar.  I wonder if you should
>> consolidate them.
> 
> It would be nice to have a proper set of pthreads based host fallback hooks
> for the openacc.async substruct later. Ideally, an accelerator plugin can
> just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based
> GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of
> the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines
> would make this less clean.

After looking at this some more, I like how your patch simplifies
things. This small bit of somewhat duplicated code is much better than
what we had before. So I'm ok with it.

>> Overall, I like how you were able eliminate the externally managed map_*
>> data structure which was used to pass in arguments to nvptx_exec.
>> Although I wonder if we should just pass in those individual arguments
>> directly to cuLaunchKernel. But that's a big change in itself.
> 
> I didn't think of that when working on the current patch, maybe later.

Here's some more comments regarding the code below. One high-level
comment regarding the usage of async-specific locks. Can't you get by
with using the global device lock, instead of a special async queue or
would that cause a deadlock?

> Index: libgomp/oacc-async.c
> ===================================================================
> --- libgomp/oacc-async.c	(revision 249620)
> +++ libgomp/oacc-async.c	(working copy)
> @@ -27,10 +27,85 @@
>     <http://www.gnu.org/licenses/>.  */
>
>  #include <assert.h>
> +#include <string.h>
>  #include "openacc.h"
>  #include "libgomp.h"
>  #include "oacc-int.h"
>
> +static struct goacc_thread *
> +get_goacc_thread (void)
> +{
> +  struct goacc_thread *thr = goacc_thread ();
> +  if (!thr || !thr->dev)
> +    gomp_fatal ("no device active");
> +  return thr;
> +}
> +
> +static struct gomp_device_descr *
> +get_goacc_thread_device (void)
> +{
> +  struct goacc_thread *thr = goacc_thread ();
> +
> +  if (!thr || !thr->dev)
> +    gomp_fatal ("no device active");
> +
> +  return thr->dev;
> +}

These two functions can be made public because a lot of other functioncs
can use them too. I don't know where to stash them though. You can
change that later though.

> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int
async)
> +{
> +  /* The special value acc_async_noval (-1) maps to the thread-specific
> +     default async stream.  */
> +  if (async == acc_async_noval)
> +    async = thr->default_async;

Is the default async queue device independent? I thought the default
async queue is defined in the acc_async_t enum. Maybe set

 async = acc_async_default

?

> +  if (async == acc_async_sync)
> +    return NULL;
> +
> +  if (async < 0)
> +    gomp_fatal ("bad async %d", async);
> +
> +  struct gomp_device_descr *dev = thr->dev;
> +
> +  if (!create
> +      && (async >= dev->openacc.async.nasyncqueue
> +	  || !dev->openacc.async.asyncqueue[async]))
> +    return NULL;
> +
> +  gomp_mutex_lock (&dev->openacc.async.lock);
Is this lock sufficient? What happens if the device is released?

> +  if (async >= dev->openacc.async.nasyncqueue)
> +    {
Not your fault, but I wonder if we would be better off just hard-capping
the number of async queues. What happens if the user does something like
wait (1<<30)? That can be addressed later.

> +      int diff = async + 1 - dev->openacc.async.nasyncqueue;
> +      dev->openacc.async.asyncqueue
> +	= gomp_realloc (dev->openacc.async.asyncqueue,
> +			sizeof (goacc_aq) * (async + 1));
> +      memset (dev->openacc.async.asyncqueue +
dev->openacc.async.nasyncqueue,
> +	      0, sizeof (goacc_aq) * diff);
> +      dev->openacc.async.nasyncqueue = async + 1;
> +    }
> +
> +  if (!dev->openacc.async.asyncqueue[async])
> +    {
> +      dev->openacc.async.asyncqueue[async] =
dev->openacc.async.construct_func ();
> +
> +      /* Link new async queue into active list.  */
> +      goacc_aq_list n = gomp_malloc (sizeof (struct
goacc_asyncqueue_list));
> +      n->aq = dev->openacc.async.asyncqueue[async];
> +      n->next = dev->openacc.async.active;
> +      dev->openacc.async.active = n;
> +    }
> +  gomp_mutex_unlock (&dev->openacc.async.lock);
> +  return dev->openacc.async.asyncqueue[async];
> +}
> +
> +attribute_hidden struct goacc_asyncqueue *
> +get_goacc_asyncqueue (int async)
> +{
> +  struct goacc_thread *thr = get_goacc_thread ();
> +  return lookup_goacc_asyncqueue (thr, true, async);
> +}
> +
>  int
>  acc_async_test (int async)
>  {
> @@ -54,15 +129,14 @@ acc_async_test (int async)
>    if (!thr || !thr->dev)
>      gomp_fatal ("no device active");
>
> -  int res = thr->dev->openacc.async_test_func (async);
> -
>    if (profiling_setup_p)
>      {
>        thr->prof_info = NULL;
>        thr->api_info = NULL;
>      }
> -
> -  return res;
> +
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> +  return thr->dev->openacc.async.test_func (aq);
I'm not sure how the profling stuff works. Should the profling state be
state be set after calling thr->dev->openacc.async.test_func?

>  }
>
>  int
> @@ -69,7 +143,6 @@ int
>  acc_async_test_all (void)
>  {
>    struct goacc_thread *thr = goacc_thread ();
> -
>    acc_prof_info prof_info;
>    acc_api_info api_info;
>    bool profiling_setup_p
> @@ -79,8 +152,6 @@ acc_async_test_all (void)
>    if (!thr || !thr->dev)
>      gomp_fatal ("no device active");
>
> -  int res = thr->dev->openacc.async_test_all_func ();
> -
>    if (profiling_setup_p)
>      {
>        thr->prof_info = NULL;
> @@ -87,7 +158,17 @@ acc_async_test_all (void)
>        thr->api_info = NULL;
>      }
>
> -  return res;
> +  int ret = 1;
> +  /*struct goacc_thread *thr = get_goacc_thread ();*/
> +  gomp_mutex_lock (&thr->dev->openacc.async.lock);
> +  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
> +    if (!thr->dev->openacc.async.test_func (l->aq))
> +      {
> +	ret = 0;
> +	break;
> +      }
> +  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> +  return ret;
Likewise.

>  }
>
>  void
> @@ -113,7 +194,8 @@ acc_wait (int async)
>    if (!thr || !thr->dev)
>      gomp_fatal ("no device active");
>
> -  thr->dev->openacc.async_wait_func (async);
> +  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> +  thr->dev->openacc.async.synchronize_func (aq);
>
>    if (profiling_setup_p)
>      {

> Index: libgomp/oacc-cuda.c
> ===================================================================
> --- libgomp/oacc-cuda.c	(revision 249620)
> +++ libgomp/oacc-cuda.c	(working copy)
> @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
>        prof_info.async_queue = prof_info.async;
>      }
>
> -  void *ret = NULL;
>    if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
> -    ret = thr->dev->openacc.cuda.get_stream_func (async);
> -
> -  if (profiling_setup_p)
>      {
> -      thr->prof_info = NULL;
> -      thr->api_info = NULL;
> +      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> +      return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
Again, strange ordering fo profiling_setup_p.

By the way, why not use get_goacc_thread here and other places in this
function? Again, that's a problem for another day.

>      }
> -
> -  return ret;
> +  return NULL;
>  }
>
>  int
> @@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream)
>
>    int ret = -1;
>    if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> -    ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
> +    {
> +      goacc_aq aq = get_goacc_asyncqueue (async);
> +      gomp_mutex_lock (&thr->dev->openacc.async.lock);
> +      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
> +      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> +    }
>
>    if (profiling_setup_p)
>      {
> Index: libgomp/oacc-int.h
> ===================================================================
> --- libgomp/oacc-int.h	(revision 249620)
> +++ libgomp/oacc-int.h	(working copy)
> @@ -109,6 +109,15 @@ void goacc_restore_bind (void);
>  void goacc_lazy_initialize (void);
>  void goacc_host_init (void);
>
> +void goacc_init_asyncqueues (struct gomp_device_descr *);
> +bool goacc_fini_asyncqueues (struct gomp_device_descr *);
> +void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
> +				     struct goacc_asyncqueue *);
> +void goacc_async_free (struct gomp_device_descr *,
> +		       struct goacc_asyncqueue *, void *);
> +struct goacc_asyncqueue *get_goacc_asyncqueue (int);
> +struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread
*, bool, int);
> +
>  void goacc_profiling_initialize (void);
>  bool goacc_profiling_setup_p (struct goacc_thread *,
>  			      acc_prof_info *, acc_api_info *);
> Index: libgomp/oacc-mem.c
> ===================================================================
> --- libgomp/oacc-mem.c	(revision 249620)
> +++ libgomp/oacc-mem.c	(working copy)
> @@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h,
>        goto out;
>      }
>
> -  if (async > acc_async_sync)
> -    thr->dev->openacc.async_set_async_func (async);
> +  goacc_aq aq = get_goacc_asyncqueue (async);
> +  if (from)
> +    gomp_copy_dev2host (thr->dev, aq, h, d, s);
> +  else
> +    gomp_copy_host2dev (thr->dev, aq, d, h, s);
>
> -  bool ret = (from
> -	      ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
> -	      : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
> -
> -  if (async > acc_async_sync)
> -    thr->dev->openacc.async_set_async_func (acc_async_sync);
> -
> -  if (!ret)
> -    gomp_fatal ("error in %s", libfnname);
> -
>   out:
>    if (profiling_setup_p)
>      {
> @@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s)
>
>    gomp_mutex_unlock (&acc_dev->lock);
>
> -  return n != NULL;
> +  return (n ? 1 : 0);
>  }
>
>  /* Create a mapping for host [H,+S] -> device [D,+S] */
> @@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s
>
>        gomp_mutex_unlock (&acc_dev->lock);
>
> -      if (async > acc_async_sync)
> -	acc_dev->openacc.async_set_async_func (async);
> +      goacc_aq aq = get_goacc_asyncqueue (async);

Do you want to call async_set_async_func outside of the protection of
acc_dev->lock?

> -      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s,
&kinds, true,
> -			   GOMP_MAP_VARS_OPENACC);
> +      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs,
NULL, &s,
> +				 &kinds, true, GOMP_MAP_VARS_OPENACC);
>        /* Initialize dynamic refcount.  */
>        tgt->list[0].key->dynamic_refcount = 1;
>
> -      if (async > acc_async_sync)
> -	acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
>        gomp_mutex_lock (&acc_dev->lock);
>
>        d = tgt->to_free;
> @@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int
>
>        if (f & FLAG_COPYOUT)
>  	{
> -	  if (async > acc_async_sync)
> -	    acc_dev->openacc.async_set_async_func (async);
> -	  acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> -	  if (async > acc_async_sync)
> -	    acc_dev->openacc.async_set_async_func (acc_async_sync);
> +	  goacc_aq aq = get_goacc_asyncqueue (async);
> +	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
>  	}
>        gomp_remove_var (acc_dev, n);
>      }
> @@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in
>    d = (void *) (n->tgt->tgt_start + n->tgt_offset
>  		+ (uintptr_t) h - n->host_start);
>
> -  if (async > acc_async_sync)
> -    acc_dev->openacc.async_set_async_func (async);
> +  goacc_aq aq = get_goacc_asyncqueue (async);
>
>    if (is_dev)
> -    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
> +    gomp_copy_host2dev (acc_dev, aq, d, h, s);
>    else
> -    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> +    gomp_copy_dev2host (acc_dev, aq, h, d, s);
>
> -  if (async > acc_async_sync)
> -    acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
Why did you remove this, but not add a clal to set_goacc_asyncqueue?
Maybe it's redundant.

>    gomp_mutex_unlock (&acc_dev->lock);
> -
> +
>    if (profiling_setup_p)
>      {
>        thr->prof_info = NULL;
> @@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m
>
>  void
>  gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
> -			 void *kinds)
> +			 void *kinds, int async)
>  {
>    struct target_mem_desc *tgt;
>    struct goacc_thread *thr = goacc_thread ();
> @@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
>      }
>
>    gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
> -  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
> -		       NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
> +  goacc_aq aq = get_goacc_asyncqueue (async);
> +  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
> +			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
>    gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
>
>    /* Initialize dynamic refcount.  */
> @@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
>  	    t->list[i].copy_from = force_copyfrom ? 1 : 0;
>  	    break;
>  	  }
> -      if (async > acc_async_sync)
> -	acc_dev->openacc.async_set_async_func (async);
> -      gomp_unmap_vars (t, true);
> -      if (async > acc_async_sync)
> -	acc_dev->openacc.async_set_async_func (acc_async_sync);
> +
> +      /* If running synchronously, unmap immediately.  */
> +      if (async < acc_async_noval)
> +	gomp_unmap_vars (t, true);
> +      else
> +	{
> +	  goacc_aq aq = get_goacc_asyncqueue (async);
> +	  goacc_async_copyout_unmap_vars (t, aq);
> +	}
>      }
>
>    gomp_mutex_unlock (&acc_dev->lock);
> Index: libgomp/oacc-parallel.c
> ===================================================================
> --- libgomp/oacc-parallel.c	(revision 249620)
> +++ libgomp/oacc-parallel.c	(working copy)
> @@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>        fn (hostaddrs);
>        goto out;
>      }
> -
> +  else if (profiling_dispatch_p)
> +    api_info.device_api = acc_device_api_cuda;
> +
That seems target specific. Does that belong in the generic code path?

>    /* Default: let the runtime choose.  */
>    for (i = 0; i != GOMP_DIM_MAX; i++)
>      dims[i] = 0;
> @@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>
>  	case GOMP_LAUNCH_WAIT:
>  	  {
> -	    unsigned num_waits = GOMP_LAUNCH_OP (tag);
> +	    /* Be careful to cast the op field as a signed 16-bit, and
> +	       sign-extend to full integer.  */
> +	    int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
>
> -	    if (num_waits)
> +	    if (num_waits > 0)
>  	      goacc_wait (async, num_waits, &ap);
> +	    else if (num_waits == acc_async_noval)
> +	      acc_wait_all_async (async);
>  	    break;
>  	  }
>

Cesar
diff mbox

Patch

Index: gcc/c/c-parser.c
===================================================================
--- gcc/c/c-parser.c	(revision 249620)
+++ gcc/c/c-parser.c	(working copy)
@@ -11941,7 +11941,7 @@  c_parser_oacc_clause_tile (c_parser *parser, tree
 }
 
 /* OpenACC:
-   wait ( int-expr-list ) */
+   wait [( int-expr-list )] */
 
 static tree
 c_parser_oacc_clause_wait (c_parser *parser, tree list)
@@ -11950,7 +11950,15 @@  c_parser_oacc_clause_wait (c_parser *parser, tree
 
   if (c_parser_peek_token (parser)->type == CPP_OPEN_PAREN)
     list = c_parser_oacc_wait_list (parser, clause_loc, list);
+  else
+    {
+      tree c = build_omp_clause (clause_loc, OMP_CLAUSE_WAIT);
 
+      OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+      OMP_CLAUSE_CHAIN (c) = list;
+      list = c;
+    }
+  
   return list;
 }
 
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c	(revision 249620)
+++ gcc/cp/parser.c	(working copy)
@@ -30619,7 +30619,7 @@  cp_parser_oacc_wait_list (cp_parser *parser, locat
 }
 
 /* OpenACC:
-   wait ( int-expr-list ) */
+   wait [( int-expr-list )] */
 
 static tree
 cp_parser_oacc_clause_wait (cp_parser *parser, tree list)
@@ -30626,11 +30626,17 @@  cp_parser_oacc_clause_wait (cp_parser *parser, tre
 {
   location_t location = cp_lexer_peek_token (parser->lexer)->location;
 
-  if (cp_lexer_peek_token (parser->lexer)->type != CPP_OPEN_PAREN)
-    return list;
+  if (cp_lexer_peek_token (parser->lexer)->type == CPP_OPEN_PAREN)
+    list = cp_parser_oacc_wait_list (parser, location, list);
+  else
+    {
+      tree c = build_omp_clause (location, OMP_CLAUSE_WAIT);
+ 
+      OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+      OMP_CLAUSE_CHAIN (c) = list;
+      list = c;
+    }
 
-  list = cp_parser_oacc_wait_list (parser, location, list);
-
   return list;
 }
 
Index: gcc/fortran/trans-openmp.c
===================================================================
--- gcc/fortran/trans-openmp.c	(revision 249620)
+++ gcc/fortran/trans-openmp.c	(working copy)
@@ -2962,6 +2962,13 @@  gfc_trans_omp_clauses_1 (stmtblock_t *block, gfc_o
 	  omp_clauses = c;
 	}
     }
+  else if (clauses->wait)
+    {
+      c = build_omp_clause (where.lb->location, OMP_CLAUSE_WAIT);
+      OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+      OMP_CLAUSE_CHAIN (c) = omp_clauses;
+      omp_clauses = c;
+    }
   if (clauses->num_gangs_expr)
     {
       tree num_gangs_var
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 249620)
+++ gcc/omp-low.c	(working copy)
@@ -14226,9 +14226,18 @@  expand_omp_target (struct omp_region *region)
 	  /* ... push a placeholder.  */
 	  args.safe_push (integer_zero_node);
 
+	bool noval_seen = false;
+	tree noval = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+	
 	for (; c; c = OMP_CLAUSE_CHAIN (c))
 	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
 	    {
+	      if (tree_int_cst_compare (OMP_CLAUSE_WAIT_EXPR (c), noval) == 0)
+		{
+		  noval_seen = true;
+		  continue;
+		}
+
 	      args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
 						integer_type_node,
 						OMP_CLAUSE_WAIT_EXPR (c)));
@@ -14235,7 +14244,12 @@  expand_omp_target (struct omp_region *region)
 	      num_waits++;
 	    }
 
-	if (!tagging || num_waits)
+	if (noval_seen && num_waits == 0)
+	  args[t_wait_idx] =
+	    (tagging
+	     ? oacc_launch_pack (GOMP_LAUNCH_WAIT, NULL_TREE, GOMP_ASYNC_NOVAL)
+	     : noval);
+	else if (!tagging || num_waits)
 	  {
 	    tree len;
 
Index: gcc/testsuite/c-c++-common/goacc/dtype-1.c
===================================================================
--- gcc/testsuite/c-c++-common/goacc/dtype-1.c	(revision 249620)
+++ gcc/testsuite/c-c++-common/goacc/dtype-1.c	(working copy)
@@ -96,11 +96,11 @@  test ()
 
 /* { dg-final { scan-tree-dump-times "oacc_parallel device_type\\(\\*\\) \\\[ wait\\(10\\) vector_length\\(10\\) num_workers\\(10\\) num_gangs\\(10\\) async\\(10\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(3\\) vector_length\\(128\\) num_workers\\(300\\) num_gangs\\(300\\) async\\(3\\) \\\] wait\\(1\\) vector_length\\(1\\) num_workers\\(1\\) num_gangs\\(1\\) async\\(1\\)" 1 "omplower" } } */
 
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) \\\]" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(-1\\) async\\(-1\\) \\\]" 1 "omplower" } } */
 
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(1\\) async\\(1\\) \\\] async\\(-1\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(1\\) async\\(1\\) \\\] wait\\(-1\\) async\\(-1\\)" 1 "omplower" } } */
 
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ wait\\(0\\) async\\(0\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(2\\) async\\(2\\) \\\] async\\(-1\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ wait\\(0\\) async\\(0\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(2\\) async\\(2\\) \\\] wait\\(-1\\) async\\(-1\\)" 1 "omplower" } } */
 
 /* { dg-final { scan-tree-dump-times "acc loop device_type\\(nvidia\\) \\\[ tile\\(1\\) gang \\\] private\\(i1\\.0\\) private\\(i1\\)" 1 "omplower" } } */
 
Index: gcc/testsuite/gfortran.dg/goacc/dtype-1.f95
===================================================================
--- gcc/testsuite/gfortran.dg/goacc/dtype-1.f95	(revision 249620)
+++ gcc/testsuite/gfortran.dg/goacc/dtype-1.f95	(working copy)
@@ -175,13 +175,13 @@  end subroutine sr5b
 
 ! { dg-final { scan-tree-dump-times "oacc_parallel device_type\\(\\*\\) \\\[ async\\(10\\) wait\\(10\\) num_gangs\\(10\\) num_workers\\(10\\) vector_length\\(10\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(3\\) wait\\(3\\) num_gangs\\(300\\) num_workers\\(300\\) vector_length\\(128\\) \\\] async\\(1\\) wait\\(1\\) num_gangs\\(1\\) num_workers\\(1\\) vector_length\\(1\\)" 1 "omplower" } }
 
-! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) \\\]" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) wait\\(-1\\) \\\]" 1 "omplower" } }
 
 ! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(1\\) wait\\(1\\) \\\]" 1 "omplower" } }
 
 ! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia\\) \\\[ async\\(2\\) wait\\(2\\) \\\]" 1 "omplower" } }
 
-! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(1\\) wait\\(1\\) \\\] async\\(-1\\)" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(1\\) wait\\(1\\) \\\] async\\(-1\\) wait\\(-1\\)" 1 "omplower" } }
 
 ! { dg-final { scan-tree-dump-times "acc loop device_type\\(nvidia\\) \\\[ tile\\(1\\) gang \\\] private\\(i1\\) private\\(i1\\.1\\)" 1 "omplower" } }
 
Index: include/gomp-constants.h
===================================================================
--- include/gomp-constants.h	(revision 249620)
+++ include/gomp-constants.h	(working copy)
@@ -249,13 +249,14 @@  enum gomp_map_kind
 #define GOMP_LAUNCH_CODE_SHIFT	28
 #define GOMP_LAUNCH_DEVICE_SHIFT 16
 #define GOMP_LAUNCH_OP_SHIFT 0
+#define GOMP_LAUNCH_OP_MASK 0xffff
 #define GOMP_LAUNCH_PACK(CODE,DEVICE,OP)	\
   (((CODE) << GOMP_LAUNCH_CODE_SHIFT)		\
    | ((DEVICE) << GOMP_LAUNCH_DEVICE_SHIFT)	\
-   | ((OP) << GOMP_LAUNCH_OP_SHIFT))
+   | (((OP) & GOMP_LAUNCH_OP_MASK) << GOMP_LAUNCH_OP_SHIFT))
 #define GOMP_LAUNCH_CODE(X) (((X) >> GOMP_LAUNCH_CODE_SHIFT) & 0xf)
 #define GOMP_LAUNCH_DEVICE(X) (((X) >> GOMP_LAUNCH_DEVICE_SHIFT) & 0xfff)
-#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
+#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & GOMP_LAUNCH_OP_MASK)
 #define GOMP_LAUNCH_OP_MAX 0xffff
 
 /* Bitmask to apply in order to find out the intended device of a target
Index: libgomp/libgomp-plugin.h
===================================================================
--- libgomp/libgomp-plugin.h	(revision 249620)
+++ libgomp/libgomp-plugin.h	(working copy)
@@ -55,6 +55,20 @@  enum offload_target_type
   OFFLOAD_TARGET_TYPE_HSA = 7
 };
 
+/* Opaque type to represent plugin-dependent implementation of an
+   OpenACC asynchronous queue.  */
+struct goacc_asyncqueue;  
+
+/* Used to keep a list of active asynchronous queues.  */
+struct goacc_asyncqueue_list
+{
+  struct goacc_asyncqueue *aq;
+  struct goacc_asyncqueue_list *next;
+};
+
+typedef struct goacc_asyncqueue *goacc_aq;
+typedef struct goacc_asyncqueue_list *goacc_aq_list;
+
 /* Auxiliary struct, used for transferring pairs of addresses from plugin
    to libgomp.  */
 struct addr_pair
@@ -99,22 +113,31 @@  extern bool GOMP_OFFLOAD_dev2dev (int, void *, con
 extern bool GOMP_OFFLOAD_can_run (void *);
 extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
 extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *);
+
 extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
-				       void **, int, unsigned *, void *);
-extern void GOMP_OFFLOAD_openacc_register_async_cleanup (void *, int);
-extern int GOMP_OFFLOAD_openacc_async_test (int);
-extern int GOMP_OFFLOAD_openacc_async_test_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait (int);
-extern void GOMP_OFFLOAD_openacc_async_wait_async (int, int);
-extern void GOMP_OFFLOAD_openacc_async_wait_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait_all_async (int);
-extern void GOMP_OFFLOAD_openacc_async_set_async (int);
+				       void **, unsigned *, void *);
+extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
+					     void **, unsigned *, void *,
+					     struct goacc_asyncqueue *);
+extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
+extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
+						  struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
+						       void (*)(void *), void *);
+extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t,
+						 struct goacc_asyncqueue *);
+extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t,
+						 struct goacc_asyncqueue *);
 extern void *GOMP_OFFLOAD_openacc_create_thread_data (int);
 extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
 extern void *GOMP_OFFLOAD_openacc_cuda_get_current_device (void);
 extern void *GOMP_OFFLOAD_openacc_cuda_get_current_context (void);
-extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (int);
-extern int GOMP_OFFLOAD_openacc_cuda_set_stream (int, void *);
+extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *,
+						 void *);
 
 #ifdef __cplusplus
 }
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h	(revision 249620)
+++ libgomp/libgomp.h	(working copy)
@@ -870,19 +870,23 @@  typedef struct acc_dispatch_t
   /* Execute.  */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
 
-  /* Async cleanup callback registration.  */
-  __typeof (GOMP_OFFLOAD_openacc_register_async_cleanup)
-    *register_async_cleanup_func;
+  struct {
+    gomp_mutex_t lock;
+    int nasyncqueue;
+    struct goacc_asyncqueue **asyncqueue;
+    struct goacc_asyncqueue_list *active;
+    
+    __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_test) *test_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_synchronize) *synchronize_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_serialize) *serialize_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
 
-  /* Asynchronous routines.  */
-  __typeof (GOMP_OFFLOAD_openacc_async_test) *async_test_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_test_all) *async_test_all_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait) *async_wait_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait_async) *async_wait_async_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait_all) *async_wait_all_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait_all_async)
-    *async_wait_all_async_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_set_async) *async_set_async_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
+  } async;
 
   /* Create/destroy TLS data.  */
   __typeof (GOMP_OFFLOAD_openacc_create_thread_data) *create_thread_data_func;
@@ -974,17 +978,31 @@  enum gomp_map_vars_kind
   GOMP_MAP_VARS_ENTER_DATA
 };
 
-extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *);
+extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int);
 extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int);
 extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
 				       unsigned short *);
-
+extern void gomp_copy_host2dev (struct gomp_device_descr *,
+				struct goacc_asyncqueue *,
+				void *, const void *, size_t);
+extern void gomp_copy_dev2host (struct gomp_device_descr *,
+				struct goacc_asyncqueue *,
+				void *, const void *, size_t);
 extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 					      size_t, void **, void **,
 					      size_t *, void *, bool,
 					      enum gomp_map_vars_kind);
+extern struct target_mem_desc *gomp_map_vars_async (struct gomp_device_descr *,
+						    struct goacc_asyncqueue *,
+						    size_t, void **, void **,
+						    size_t *, void *, bool,
+						    enum gomp_map_vars_kind);
+extern void gomp_unmap_tgt (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
+extern void gomp_unmap_vars_async (struct target_mem_desc *, bool,
+				   struct goacc_asyncqueue *);
 extern void gomp_init_device (struct gomp_device_descr *);
+extern bool gomp_fini_device (struct gomp_device_descr *);
 extern void gomp_unload_device (struct gomp_device_descr *);
 extern bool gomp_offload_target_available_p (int);
 extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key);
Index: libgomp/oacc-async.c
===================================================================
--- libgomp/oacc-async.c	(revision 249620)
+++ libgomp/oacc-async.c	(working copy)
@@ -27,10 +27,85 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #include <assert.h>
+#include <string.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+     default async stream.  */
+  if (async == acc_async_noval)
+    async = thr->default_async;
+
+  if (async == acc_async_sync)
+    return NULL;
+
+  if (async < 0)
+    gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  if (!create
+      && (async >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[async]))
+    return NULL;
+
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  if (async >= dev->openacc.async.nasyncqueue)
+    {
+      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      dev->openacc.async.asyncqueue
+	= gomp_realloc (dev->openacc.async.asyncqueue,
+			sizeof (goacc_aq) * (async + 1));
+      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+	      0, sizeof (goacc_aq) * diff);
+      dev->openacc.async.nasyncqueue = async + 1;
+    }
+
+  if (!dev->openacc.async.asyncqueue[async])
+    {
+      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+      /* Link new async queue into active list.  */
+      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+      n->aq = dev->openacc.async.asyncqueue[async];
+      n->next = dev->openacc.async.active;
+      dev->openacc.async.active = n;
+    }
+  gomp_mutex_unlock (&dev->openacc.async.lock);
+  return dev->openacc.async.asyncqueue[async];
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return lookup_goacc_asyncqueue (thr, true, async);
+}
+
 int
 acc_async_test (int async)
 {
@@ -54,15 +129,14 @@  acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  int res = thr->dev->openacc.async_test_func (async);
-
   if (profiling_setup_p)
     {
       thr->prof_info = NULL;
       thr->api_info = NULL;
     }
-
-  return res;
+    
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  return thr->dev->openacc.async.test_func (aq);
 }
 
 int
@@ -69,7 +143,6 @@  int
 acc_async_test_all (void)
 {
   struct goacc_thread *thr = goacc_thread ();
-
   acc_prof_info prof_info;
   acc_api_info api_info;
   bool profiling_setup_p
@@ -79,8 +152,6 @@  acc_async_test_all (void)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  int res = thr->dev->openacc.async_test_all_func ();
-
   if (profiling_setup_p)
     {
       thr->prof_info = NULL;
@@ -87,7 +158,17 @@  acc_async_test_all (void)
       thr->api_info = NULL;
     }
 
-  return res;
+  int ret = 1;
+  /*struct goacc_thread *thr = get_goacc_thread ();*/
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    if (!thr->dev->openacc.async.test_func (l->aq))
+      {
+	ret = 0;
+	break;
+      }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+  return ret;
 }
 
 void
@@ -113,7 +194,8 @@  acc_wait (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  thr->dev->openacc.async_wait_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);
 
   if (profiling_setup_p)
     {
@@ -153,8 +235,16 @@  acc_wait_async (int async1, int async2)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  thr->dev->openacc.async_wait_async_func (async1, async2);
+  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+  if (!aq1)
+    gomp_fatal ("invalid async 1");
+  if (aq1 == aq2)
+    gomp_fatal ("identical parameters");
 
+  thr->dev->openacc.async.synchronize_func (aq1);
+  thr->dev->openacc.async.serialize_func (aq1, aq2);
+
   if (profiling_setup_p)
     {
       thr->prof_info = NULL;
@@ -176,8 +266,13 @@  acc_wait_all (void)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  thr->dev->openacc.async_wait_all_func ();
+  struct gomp_device_descr *dev = get_goacc_thread_device ();
 
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+    dev->openacc.async.synchronize_func (l->aq);
+  gomp_mutex_unlock (&dev->openacc.async.lock);
+
   if (profiling_setup_p)
     {
       thr->prof_info = NULL;
@@ -219,8 +314,17 @@  acc_wait_all_async (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  thr->dev->openacc.async_wait_all_async_func (async);
+  goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    {
+      thr->dev->openacc.async.synchronize_func (l->aq);
+      if (waiting_queue)
+	thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+    }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
   if (profiling_setup_p)
     {
       thr->prof_info = NULL;
@@ -251,10 +355,72 @@  acc_set_default_async (int async)
   if (async < acc_async_sync)
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
+  thr->default_async = async;
+}
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+static void
+goacc_async_unmap_tgt (void *ptr)
+{
+  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
 
-  thr->default_async = async;
+  if (tgt->refcount > 1)
+    tgt->refcount--;
+  else
+    gomp_unmap_tgt (tgt);
 }
+
+attribute_hidden void
+goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt,
+				struct goacc_asyncqueue *aq)
+{
+  struct gomp_device_descr *devicep = tgt->device_descr;
+
+  /* Increment reference to delay freeing of device memory until callback
+     has triggered.  */
+  tgt->refcount++;
+  gomp_unmap_vars_async (tgt, true, aq);
+  devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
+					      (void *) tgt);
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+		  struct goacc_asyncqueue *aq, void *ptr)
+{
+  if (!aq)
+    free (ptr);
+  else
+    devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+  gomp_mutex_init (&devicep->openacc.async.lock);
+  devicep->openacc.async.nasyncqueue = 0;
+  devicep->openacc.async.asyncqueue = NULL;
+  devicep->openacc.async.active = NULL;
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+  bool ret = true;
+  if (devicep->openacc.async.nasyncqueue > 0)
+    {
+      goacc_aq_list next;
+      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+	{
+	  ret &= devicep->openacc.async.destruct_func (l->aq);
+	  next = l->next;
+	  free (l);
+	}
+      free (devicep->openacc.async.asyncqueue);
+      devicep->openacc.async.nasyncqueue = 0;
+      devicep->openacc.async.asyncqueue = NULL;
+      devicep->openacc.async.active = NULL;
+    }
+  gomp_mutex_destroy (&devicep->openacc.async.lock);
+  return ret;
+}
Index: libgomp/oacc-cuda.c
===================================================================
--- libgomp/oacc-cuda.c	(revision 249620)
+++ libgomp/oacc-cuda.c	(working copy)
@@ -99,17 +99,12 @@  acc_get_cuda_stream (int async)
       prof_info.async_queue = prof_info.async;
     }
 
-  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
-    ret = thr->dev->openacc.cuda.get_stream_func (async);
- 
-  if (profiling_setup_p)
     {
-      thr->prof_info = NULL;
-      thr->api_info = NULL;
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+      return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
     }
-
-  return ret;
+  return NULL;
 }
 
 int
@@ -138,7 +133,12 @@  acc_set_cuda_stream (int async, void *stream)
 
   int ret = -1;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
-    ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
+    {
+      goacc_aq aq = get_goacc_asyncqueue (async);
+      gomp_mutex_lock (&thr->dev->openacc.async.lock);
+      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+    }
 
   if (profiling_setup_p)
     {
Index: libgomp/oacc-host.c
===================================================================
--- libgomp/oacc-host.c	(revision 249620)
+++ libgomp/oacc-host.c	(working copy)
@@ -140,7 +140,6 @@  host_openacc_exec (void (*fn) (void *),
 		   size_t mapnum __attribute__ ((unused)),
 		   void **hostaddrs,
 		   void **devaddrs __attribute__ ((unused)),
-		   int async __attribute__ ((unused)),
 		   unsigned *dims __attribute__ ((unused)),
 		   void *targ_mem_desc __attribute__ ((unused)))
 {
@@ -148,49 +147,81 @@  host_openacc_exec (void (*fn) (void *),
 }
 
 static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
-				     int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+			 size_t mapnum __attribute__ ((unused)),
+			 void **hostaddrs,
+			 void **devaddrs __attribute__ ((unused)),
+			 unsigned *dims __attribute__ ((unused)),
+			 void *targ_mem_desc __attribute__ ((unused)),
+			 struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
+  fn (hostaddrs);
 }
 
 static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
   return 1;
 }
 
-static int
-host_openacc_async_test_all (void)
+static void
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+				__attribute__ ((unused)))
 {
-  return 1;
 }
 
 static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+			      __attribute__ ((unused)),
+			      struct goacc_asyncqueue *aq2
+			      __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
-			       int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
 static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+				   __attribute__ ((unused)),
+				   void (*callback_fn)(void *)
+				   __attribute__ ((unused)),
+				   void *userptr __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
 {
+  return NULL;
 }
 
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
+{
+  return true;
+}
+
 static void *
 host_openacc_create_thread_data (int ord __attribute__ ((unused)))
 {
@@ -235,16 +266,18 @@  static struct gomp_device_descr host_dispatch =
 
       .exec_func = host_openacc_exec,
 
-      .register_async_cleanup_func = host_openacc_register_async_cleanup,
+      .async = {
+	.construct_func = host_openacc_async_construct,
+	.destruct_func = host_openacc_async_destruct,
+	.test_func = host_openacc_async_test,
+	.synchronize_func = host_openacc_async_synchronize,
+	.serialize_func = host_openacc_async_serialize,
+	.queue_callback_func = host_openacc_async_queue_callback,
+	.exec_func = host_openacc_async_exec,
+	.dev2host_func = host_openacc_async_dev2host,
+	.host2dev_func = host_openacc_async_host2dev,
+      },
 
-      .async_test_func = host_openacc_async_test,
-      .async_test_all_func = host_openacc_async_test_all,
-      .async_wait_func = host_openacc_async_wait,
-      .async_wait_async_func = host_openacc_async_wait_async,
-      .async_wait_all_func = host_openacc_async_wait_all,
-      .async_wait_all_async_func = host_openacc_async_wait_all_async,
-      .async_set_async_func = host_openacc_async_set_async,
-
       .create_thread_data_func = host_openacc_create_thread_data,
       .destroy_thread_data_func = host_openacc_destroy_thread_data,
 
Index: libgomp/oacc-init.c
===================================================================
--- libgomp/oacc-init.c	(revision 249620)
+++ libgomp/oacc-init.c	(working copy)
@@ -390,7 +390,7 @@  acc_shutdown_1 (acc_device_t d)
       if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
         {
 	  devices_active = true;
-	  ret &= acc_dev->fini_device_func (acc_dev->target_id);
+	  ret &= gomp_fini_device (acc_dev);
 	  acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
 	}
       gomp_mutex_unlock (&acc_dev->lock);
@@ -513,8 +513,6 @@  goacc_attach_host_thread_to_device (int ord)
     = acc_dev->openacc.create_thread_data_func (ord);
 
   thr->default_async = acc_async_default;
-  
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
Index: libgomp/oacc-int.h
===================================================================
--- libgomp/oacc-int.h	(revision 249620)
+++ libgomp/oacc-int.h	(working copy)
@@ -109,6 +109,15 @@  void goacc_restore_bind (void);
 void goacc_lazy_initialize (void);
 void goacc_host_init (void);
 
+void goacc_init_asyncqueues (struct gomp_device_descr *);
+bool goacc_fini_asyncqueues (struct gomp_device_descr *);
+void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
+				     struct goacc_asyncqueue *);
+void goacc_async_free (struct gomp_device_descr *,
+		       struct goacc_asyncqueue *, void *);
+struct goacc_asyncqueue *get_goacc_asyncqueue (int);
+struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread *, bool, int);
+
 void goacc_profiling_initialize (void);
 bool goacc_profiling_setup_p (struct goacc_thread *,
 			      acc_prof_info *, acc_api_info *);
Index: libgomp/oacc-mem.c
===================================================================
--- libgomp/oacc-mem.c	(revision 249620)
+++ libgomp/oacc-mem.c	(working copy)
@@ -224,19 +224,12 @@  memcpy_tofrom_device (bool from, void *d, void *h,
       goto out;
     }
 
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (async);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  if (from)
+    gomp_copy_dev2host (thr->dev, aq, h, d, s);
+  else
+    gomp_copy_host2dev (thr->dev, aq, d, h, s);
 
-  bool ret = (from
-	      ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
-	      : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (acc_async_sync);
-
-  if (!ret)
-    gomp_fatal ("error in %s", libfnname);
-
  out:
   if (profiling_setup_p)
     {
@@ -381,7 +374,7 @@  acc_is_present (void *h, size_t s)
 
   gomp_mutex_unlock (&acc_dev->lock);
 
-  return n != NULL;
+  return (n ? 1 : 0);
 }
 
 /* Create a mapping for host [H,+S] -> device [D,+S] */
@@ -613,17 +606,13 @@  present_create_copy (unsigned f, void *h, size_t s
 
       gomp_mutex_unlock (&acc_dev->lock);
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (async);
+      goacc_aq aq = get_goacc_asyncqueue (async);
 
-      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
-			   GOMP_MAP_VARS_OPENACC);
+      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+				 &kinds, true, GOMP_MAP_VARS_OPENACC);
       /* Initialize dynamic refcount.  */
       tgt->list[0].key->dynamic_refcount = 1;
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (acc_async_sync);
-
       gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
@@ -798,11 +787,8 @@  delete_copyout (unsigned f, void *h, size_t s, int
 
       if (f & FLAG_COPYOUT)
 	{
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (async);
-	  acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (acc_async_sync);
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
 	}
       gomp_remove_var (acc_dev, n);
     }
@@ -904,19 +890,15 @@  update_dev_host (int is_dev, void *h, size_t s, in
   d = (void *) (n->tgt->tgt_start + n->tgt_offset
 		+ (uintptr_t) h - n->host_start);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (async);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s);
   else
-    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
+    gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (acc_async_sync);
-
   gomp_mutex_unlock (&acc_dev->lock);
-
+  
   if (profiling_setup_p)
     {
       thr->prof_info = NULL;
@@ -978,7 +960,7 @@  gomp_acc_declare_allocate (bool allocate, size_t m
 
 void
 gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
-			 void *kinds)
+			 void *kinds, int async)
 {
   struct target_mem_desc *tgt;
   struct goacc_thread *thr = goacc_thread ();
@@ -1008,8 +990,9 @@  gomp_acc_insert_pointer (size_t mapnum, void **hos
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
-		       NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
   gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
 
   /* Initialize dynamic refcount.  */
@@ -1098,11 +1081,15 @@  gomp_acc_remove_pointer (void *h, size_t s, bool f
 	    t->list[i].copy_from = force_copyfrom ? 1 : 0;
 	    break;
 	  }
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (async);
-      gomp_unmap_vars (t, true);
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+      /* If running synchronously, unmap immediately.  */
+      if (async < acc_async_noval)
+	gomp_unmap_vars (t, true);
+      else
+	{
+	  goacc_aq aq = get_goacc_asyncqueue (async);        
+	  goacc_async_copyout_unmap_vars (t, aq);
+	}
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c	(revision 249620)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -215,7 +215,9 @@  GOACC_parallel_keyed (int device, void (*fn) (void
       fn (hostaddrs);
       goto out;
     }
-
+  else if (profiling_dispatch_p)
+    api_info.device_api = acc_device_api_cuda;
+    
   /* Default: let the runtime choose.  */
   for (i = 0; i != GOMP_DIM_MAX; i++)
     dims[i] = 0;
@@ -260,10 +262,14 @@  GOACC_parallel_keyed (int device, void (*fn) (void
 
 	case GOMP_LAUNCH_WAIT:
 	  {
-	    unsigned num_waits = GOMP_LAUNCH_OP (tag);
+	    /* Be careful to cast the op field as a signed 16-bit, and
+	       sign-extend to full integer.  */
+	    int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
 
-	    if (num_waits)
+	    if (num_waits > 0)
 	      goacc_wait (async, num_waits, &ap);
+	    else if (num_waits == acc_async_noval)
+	      acc_wait_all_async (async);
 	    break;
 	  }
 
@@ -274,8 +280,6 @@  GOACC_parallel_keyed (int device, void (*fn) (void
     }
   va_end (ap);
   
-  acc_dev->openacc.async_set_async_func (async);
-
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -307,8 +311,11 @@  GOACC_parallel_keyed (int device, void (*fn) (void
       goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
 				&api_info);
     }
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		       GOMP_MAP_VARS_OPENACC);
+
+  goacc_aq aq = get_goacc_asyncqueue (async);
+
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+			     true, GOMP_MAP_VARS_OPENACC);
   if (profiling_dispatch_p)
     {
       prof_info.event_type = acc_ev_enter_data_end;
@@ -329,14 +336,10 @@  GOACC_parallel_keyed (int device, void (*fn) (void
 	devaddrs[i] = NULL;
     }
 
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-			      async, dims, tgt);
-
-  /* If running synchronously, unmap immediately.  */
-  bool copyfrom = true;
-  if (async < acc_async_noval)
+  if (aq == NULL)
     {
-    unmap:
+      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				  dims, tgt);
       if (profiling_dispatch_p)
 	{
 	  prof_info.event_type = acc_ev_exit_data_start;
@@ -346,7 +349,8 @@  GOACC_parallel_keyed (int device, void (*fn) (void
 	  goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
 				    &api_info);
 	}
-      gomp_unmap_vars (tgt, copyfrom);
+      /* If running synchronously, unmap immediately.  */
+      gomp_unmap_vars (tgt, true);
       if (profiling_dispatch_p)
 	{
 	  prof_info.event_type = acc_ev_exit_data_end;
@@ -358,27 +362,11 @@  GOACC_parallel_keyed (int device, void (*fn) (void
     }
   else
     {
-      bool async_unmap = false;
-      for (size_t i = 0; i < tgt->list_count; i++)
-	{
-	  splay_tree_key k = tgt->list[i].key;
-	  if (k && k->refcount == 1)
-	    {
-	      async_unmap = true;
-	      break;
-	    }
-	}
-      if (async_unmap)
-	tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
-      else
-	{
-	  copyfrom = false;
-	  goto unmap;
-	}
+      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					dims, tgt, aq);
+      goacc_async_copyout_unmap_vars (tgt, aq);
     }
 
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
-
  out:
   if (profiling_dispatch_p)
     {
@@ -724,7 +712,7 @@  GOACC_enter_exit_data (int device, size_t mapnum,
       goto out;
     }
 
-  if (num_waits)
+  if (num_waits > 0)
     {
       va_list ap;
 
@@ -732,9 +720,9 @@  GOACC_enter_exit_data (int device, size_t mapnum,
       goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
+  else if (num_waits == acc_async_noval)
+    acc_wait_all_async (async);
 
-  acc_dev->openacc.async_set_async_func (async);
-
   /* In c, non-pointers and arrays are represented by a single data clause.
      Dynamically allocated arrays and subarrays are represented by a data
      clause followed by an internal GOMP_MAP_POINTER.
@@ -783,7 +771,7 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 					   &sizes[i], &kinds[i]);
 	      else
 		gomp_acc_insert_pointer (pointer, &hostaddrs[i],
-					 &sizes[i], &kinds[i]);
+					 &sizes[i], &kinds[i], async);
 	      /* Increment 'i' by two because OpenACC requires fortran
 		 arrays to be contiguous, so each PSET is associated with
 		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -808,9 +796,9 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 		if (acc_is_present (hostaddrs[i], sizes[i]))
 		  {
 		    if (finalize)
-		      acc_delete_finalize (hostaddrs[i], sizes[i]);
+		      acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
 		    else
-		      acc_delete (hostaddrs[i], sizes[i]);
+		      acc_delete_async (hostaddrs[i], sizes[i], async);
 		  }
 		break;
 	      case GOMP_MAP_DECLARE_DEALLOCATE:
@@ -817,9 +805,9 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_FORCE_FROM:
 		if (finalize)
-		  acc_copyout_finalize (hostaddrs[i], sizes[i]);
+		  acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
 		else
-		  acc_copyout (hostaddrs[i], sizes[i]);
+		  acc_copyout_async (hostaddrs[i], sizes[i], async);
 		break;
 	      default:
 		gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -844,8 +832,6 @@  GOACC_enter_exit_data (int device, size_t mapnum,
 	  }
       }
 
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
-
  out:
   if (profiling_dispatch_p)
     {
@@ -868,18 +854,22 @@  goacc_wait (int async, int num_waits, va_list *ap)
   while (num_waits--)
     {
       int qid = va_arg (*ap, int);
-      
-      if (acc_async_test (qid))
+      goacc_aq aq = get_goacc_asyncqueue (qid);
+      if (acc_dev->openacc.async.test_func (aq))
 	continue;
-
       if (async == acc_async_sync)
-	acc_wait (qid);
+	acc_dev->openacc.async.synchronize_func (aq);
       else if (qid == async)
-	;/* If we're waiting on the same asynchronous queue as we're
+      /* If we're waiting on the same asynchronous queue as we're
 	    launching on, the queue itself will order work as
 	    required, so there's no need to wait explicitly.  */
+	;
       else
-	acc_dev->openacc.async_wait_async_func (qid, async);
+	{
+	  goacc_aq aq2 = get_goacc_asyncqueue (async);
+	  acc_dev->openacc.async.synchronize_func (aq);
+	  acc_dev->openacc.async.serialize_func (aq, aq2);
+	}
     }
 }
 
@@ -957,7 +947,7 @@  GOACC_update (int device, size_t mapnum,
       goto out;
     }
 
-  if (num_waits)
+  if (num_waits > 0)
     {
       va_list ap;
 
@@ -965,9 +955,9 @@  GOACC_update (int device, size_t mapnum,
       goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
+  else if (num_waits == acc_async_noval)
+    acc_wait_all_async (async);
 
-  acc_dev->openacc.async_set_async_func (async);
-
   bool update_device = false;
   for (i = 0; i < mapnum; ++i)
     {
@@ -1007,7 +997,7 @@  GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_TO:
 	  update_device = true;
-	  acc_update_device (hostaddrs[i], sizes[i]);
+	  acc_update_device_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	case GOMP_MAP_FROM:
@@ -1019,7 +1009,7 @@  GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_FROM:
 	  update_device = false;
-	  acc_update_self (hostaddrs[i], sizes[i]);
+	  acc_update_self_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	default:
@@ -1028,8 +1018,6 @@  GOACC_update (int device, size_t mapnum,
 	}
     }
 
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
-
  out:
   if (profiling_dispatch_p)
     {
@@ -1075,7 +1063,7 @@  GOACC_wait (int async, int num_waits, ...)
   else if (async == acc_async_sync)
     acc_wait_all ();
   else if (async == acc_async_noval)
-    thr->dev->openacc.async_wait_all_async_func (acc_async_noval);
+    acc_wait_all_async (async);
 
   if (profiling_setup_p)
     {
Index: libgomp/oacc-plugin.c
===================================================================
--- libgomp/oacc-plugin.c	(revision 249620)
+++ libgomp/oacc-plugin.c	(working copy)
@@ -30,17 +30,6 @@ 
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 
-void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
-{
-  struct target_mem_desc *tgt = ptr;
-  struct gomp_device_descr *devicep = tgt->device_descr;
-
-  devicep->openacc.async_set_async_func (async);
-  gomp_unmap_vars (tgt, true);
-  devicep->openacc.async_set_async_func (acc_async_sync);
-}
-
 /* Return the target-specific part of the TLS data for the current thread.  */
 
 void *
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 249620)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -96,21 +96,19 @@  cuda_error (CUresult r)
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
-struct cuda_map
+/* NVPTX/CUDA specific definition of asynchronous queues.  */
+struct goacc_asyncqueue
 {
-  CUdeviceptr d;
-  size_t size;
-  bool active;
-  struct cuda_map *next;
+  CUstream cuda_stream;
+  pthread_mutex_t lock;
 };
 
-struct ptx_stream
+struct nvptx_callback
 {
-  CUstream stream;
-  pthread_t host_thread;
-  bool multithreaded;
-  struct cuda_map *map;
-  struct ptx_stream *next;
+  void (*fn) (void *);
+  void *ptr;
+  struct goacc_asyncqueue *aq;
+  struct nvptx_callback *next;
 };
 
 /* Thread-specific data for PTX.  */
@@ -117,179 +115,12 @@  static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTE
 
 struct nvptx_thread
 {
-  struct ptx_stream *current_stream;
+  /* We currently have this embedded inside the plugin because libgomp manages
+     devices through integer target_ids.  This might be better if using an
+     opaque target-specific pointer directly from gomp_device_descr.  */
   struct ptx_device *ptx_dev;
 };
 
-static struct cuda_map *
-cuda_map_create (struct goacc_thread *thr, size_t size)
-{
-  struct cuda_map *map = GOMP_PLUGIN_malloc (sizeof (struct cuda_map));
-
-  assert (map);
-
-  map->next = NULL;
-  map->size = size;
-  map->active = false;
-
-  CUDA_CALL_ERET (NULL, cuMemAlloc, &map->d, size);
-  assert (map->d);
-
-  bool profiling_dispatch_p
-    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      acc_prof_info *prof_info = thr->prof_info;
-      acc_event_info data_event_info;
-      acc_api_info *api_info = thr->api_info;
-
-      prof_info->event_type = acc_ev_alloc;
-
-      data_event_info.data_event.event_type = prof_info->event_type;
-      data_event_info.data_event.valid_bytes
-	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
-      data_event_info.data_event.parent_construct
-	= acc_construct_parallel; //TODO
-      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
-      data_event_info.data_event.implicit = 1;
-      data_event_info.data_event.tool_info = NULL;
-      data_event_info.data_event.var_name = NULL; //TODO
-      data_event_info.data_event.bytes = size;
-      data_event_info.data_event.host_ptr = NULL;
-      data_event_info.data_event.device_ptr = (void *) map->d;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-  return map;
-}
-
-static void
-cuda_map_destroy (struct goacc_thread *thr, struct cuda_map *map)
-{
-  CUDA_CALL_ASSERT (cuMemFree, map->d);
-
-  bool profiling_dispatch_p
-    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      acc_prof_info *prof_info = thr->prof_info;
-      acc_event_info data_event_info;
-      acc_api_info *api_info = thr->api_info;
-
-      prof_info->event_type = acc_ev_free;
-
-      data_event_info.data_event.event_type = prof_info->event_type;
-      data_event_info.data_event.valid_bytes
-	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
-      data_event_info.data_event.parent_construct
-	= acc_construct_parallel; //TODO
-      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
-      data_event_info.data_event.implicit = 1;
-      data_event_info.data_event.tool_info = NULL;
-      data_event_info.data_event.var_name = NULL; //TODO
-      data_event_info.data_event.bytes = map->size;
-      data_event_info.data_event.host_ptr = NULL;
-      data_event_info.data_event.device_ptr = (void *) map->d;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-  free (map);
-}
-
-/* The following map_* routines manage the CUDA device memory that
-   contains the data mapping arguments for cuLaunchKernel.  Each
-   asynchronous PTX stream may have multiple pending kernel
-   invocations, which are launched in a FIFO order.  As such, the map
-   routines maintains a queue of cuLaunchKernel arguments.
-
-   Calls to map_push and map_pop must be guarded by ptx_event_lock.
-   Likewise, calls to map_init and map_fini are guarded by
-   ptx_dev_lock inside GOMP_OFFLOAD_init_device and
-   GOMP_OFFLOAD_fini_device, respectively.  */
-
-static bool
-map_init (struct goacc_thread *thr, struct ptx_stream *s)
-{
-  int size = getpagesize ();
-
-  assert (s);
-
-  s->map = cuda_map_create (thr, size);
-
-  return true;
-}
-
-static bool
-map_fini (struct goacc_thread *thr, struct ptx_stream *s)
-{
-  assert (s->map->next == NULL);
-  assert (!s->map->active);
-
-  cuda_map_destroy (thr, s->map);
-
-  return true;
-}
-
-static void
-map_pop (struct goacc_thread *thr, struct ptx_stream *s)
-{
-  struct cuda_map *next;
-
-  assert (s != NULL);
-
-  if (s->map->next == NULL)
-    {
-      s->map->active = false;
-      return;
-    }
-
-  next = s->map->next;
-  cuda_map_destroy (thr, s->map);
-  s->map = next;
-}
-
-static CUdeviceptr
-map_push (struct goacc_thread *thr, struct ptx_stream *s, size_t size)
-{
-  struct cuda_map *map = NULL, *t = NULL;
-
-  assert (s);
-  assert (s->map);
-
-  /* Each PTX stream requires a separate data region to store the
-     launch arguments for cuLaunchKernel.  Allocate a new
-     cuda_map and push it to the end of the list.  */
-  if (s->map->active)
-    {
-      map = cuda_map_create (thr, size);
-
-      for (t = s->map; t->next != NULL; t = t->next)
-	;
-
-      t->next = map;
-    }
-  else if (s->map->size < size)
-    {
-      cuda_map_destroy (thr, s->map);
-      map = cuda_map_create (thr, size);
-    }
-  else
-    map = s->map;
-
-  s->map = map;
-  s->map->active = true;
-
-  return s->map->d;
-}
-
 /* Target data function launch information.  */
 
 struct targ_fn_launch
@@ -342,22 +173,18 @@  struct ptx_image_data
   struct ptx_image_data *next;
 };
 
+struct ptx_free_block
+{
+  void *ptr;
+  struct ptx_free_block *next;
+};
+
 struct ptx_device
 {
   CUcontext ctx;
   bool ctx_shared;
   CUdevice dev;
-  struct ptx_stream *null_stream;
-  /* All non-null streams associated with this device (actually context),
-     either created implicitly or passed in from the user (via
-     acc_set_cuda_stream).  */
-  struct ptx_stream *active_streams;
-  struct {
-    struct ptx_stream **arr;
-    int size;
-  } async_streams;
-  /* A lock for use when manipulating the above stream list and array.  */
-  pthread_mutex_t stream_lock;
+
   int ord;
   bool overlap;
   bool map;
@@ -381,32 +208,13 @@  struct ptx_device
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
-  
-  struct ptx_device *next;
-};
 
-enum ptx_event_type
-{
-  PTX_EVT_MEM,
-  PTX_EVT_KNL,
-  PTX_EVT_SYNC,
-  PTX_EVT_ASYNC_CLEANUP
-};
+  struct ptx_free_block *free_blocks;
+  pthread_mutex_t free_blocks_lock;
 
-struct ptx_event
-{
-  CUevent *evt;
-  int type;
-  void *addr;
-  int ord;
-  int val;
-
-  struct ptx_event *next;
+  struct ptx_device *next;
 };
 
-static pthread_mutex_t ptx_event_lock;
-static struct ptx_event *ptx_events;
-
 static struct ptx_device **ptx_devices;
 
 static inline struct nvptx_thread *
@@ -415,190 +223,6 @@  nvptx_thread (void)
   return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
 }
 
-static bool
-init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
-{
-  int i;
-  struct ptx_stream *null_stream
-    = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
-  null_stream->stream = NULL;
-  null_stream->host_thread = pthread_self ();
-  null_stream->multithreaded = true;
-  if (!map_init (NULL, null_stream))
-    return false;
-
-  ptx_dev->null_stream = null_stream;
-  ptx_dev->active_streams = NULL;
-  pthread_mutex_init (&ptx_dev->stream_lock, NULL);
-
-  if (concurrency < 1)
-    concurrency = 1;
-
-  /* This is just a guess -- make space for as many async streams as the
-     current device is capable of concurrently executing.  This can grow
-     later as necessary.  No streams are created yet.  */
-  ptx_dev->async_streams.arr
-    = GOMP_PLUGIN_malloc (concurrency * sizeof (struct ptx_stream *));
-  ptx_dev->async_streams.size = concurrency;
-
-  for (i = 0; i < concurrency; i++)
-    ptx_dev->async_streams.arr[i] = NULL;
-
-  return true;
-}
-
-static bool
-fini_streams_for_device (struct ptx_device *ptx_dev)
-{
-  free (ptx_dev->async_streams.arr);
-
-  bool ret = true;
-  while (ptx_dev->active_streams != NULL)
-    {
-      struct ptx_stream *s = ptx_dev->active_streams;
-      ptx_dev->active_streams = ptx_dev->active_streams->next;
-
-      ret &= map_fini (NULL, s);
-
-      CUresult r = cuStreamDestroy (s->stream);
-      if (r != CUDA_SUCCESS)
-	{
-	  GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r));
-	  ret = false;
-	}
-      free (s);
-    }
-
-  ret &= map_fini (NULL, ptx_dev->null_stream);
-  free (ptx_dev->null_stream);
-  return ret;
-}
-
-/* Select a stream for (OpenACC-semantics) ASYNC argument for the current
-   thread THREAD (and also current device/context).  If CREATE is true, create
-   the stream if it does not exist (or use EXISTING if it is non-NULL), and
-   associate the stream with the same thread argument.  Returns stream to use
-   as result.  */
-
-static struct ptx_stream *
-select_stream_for_async (int async, pthread_t thread, bool create,
-			 CUstream existing)
-{
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-  /* Local copy of TLS variable.  */
-  struct ptx_device *ptx_dev = nvthd->ptx_dev;
-  struct ptx_stream *stream = NULL;
-  int orig_async = async;
-
-  /* The special value acc_async_noval (-1) maps to the thread-specific
-     default async stream.  */
-  if (async == acc_async_noval)
-    async = GOMP_PLUGIN_acc_thread_default_async ();
-
-  if (create)
-    pthread_mutex_lock (&ptx_dev->stream_lock);
-
-  /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the
-     null stream, and in fact better performance may be obtainable if it doesn't
-     (because the null stream enforces overly-strict synchronisation with
-     respect to other streams for legacy reasons, and that's probably not
-     needed with OpenACC).  Maybe investigate later.  */
-  if (async == acc_async_sync)
-    stream = ptx_dev->null_stream;
-  else if (async >= 0 && async < ptx_dev->async_streams.size
-	   && ptx_dev->async_streams.arr[async] && !(create && existing))
-    stream = ptx_dev->async_streams.arr[async];
-  else if (async >= 0 && create)
-    {
-      if (async >= ptx_dev->async_streams.size)
-	{
-	  int i, newsize = ptx_dev->async_streams.size * 2;
-
-	  if (async >= newsize)
-	    newsize = async + 1;
-
-	  ptx_dev->async_streams.arr
-	    = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr,
-				   newsize * sizeof (struct ptx_stream *));
-
-	  for (i = ptx_dev->async_streams.size; i < newsize; i++)
-	    ptx_dev->async_streams.arr[i] = NULL;
-
-	  ptx_dev->async_streams.size = newsize;
-	}
-
-      /* Create a new stream on-demand if there isn't one already, or if we're
-	 setting a particular async value to an existing (externally-provided)
-	 stream.  */
-      if (!ptx_dev->async_streams.arr[async] || existing)
-        {
-	  CUresult r;
-	  struct ptx_stream *s
-	    = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
-	  if (existing)
-	    s->stream = existing;
-	  else
-	    {
-	      r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT);
-	      if (r != CUDA_SUCCESS)
-		{
-		  pthread_mutex_unlock (&ptx_dev->stream_lock);
-		  GOMP_PLUGIN_fatal ("cuStreamCreate error: %s",
-				     cuda_error (r));
-		}
-	    }
-
-	  /* If CREATE is true, we're going to be queueing some work on this
-	     stream.  Associate it with the current host thread.  */
-	  s->host_thread = thread;
-	  s->multithreaded = false;
-
-	  if (!map_init (thr, s))
-	    {
-	      pthread_mutex_unlock (&ptx_dev->stream_lock);
-	      GOMP_PLUGIN_fatal ("map_init fail");
-	    }
-
-	  s->next = ptx_dev->active_streams;
-	  ptx_dev->active_streams = s;
-	  ptx_dev->async_streams.arr[async] = s;
-	}
-
-      stream = ptx_dev->async_streams.arr[async];
-    }
-  else if (async < 0)
-    {
-      if (create)
-	pthread_mutex_unlock (&ptx_dev->stream_lock);
-      GOMP_PLUGIN_fatal ("bad async %d", async);
-    }
-
-  if (create)
-    {
-      assert (stream != NULL);
-
-      /* If we're trying to use the same stream from different threads
-	 simultaneously, set stream->multithreaded to true.  This affects the
-	 behaviour of acc_async_test_all and acc_wait_all, which are supposed to
-	 only wait for asynchronous launches from the same host thread they are
-	 invoked on.  If multiple threads use the same async value, we make note
-	 of that here and fall back to testing/waiting for all threads in those
-	 functions.  */
-      if (thread != stream->host_thread)
-        stream->multithreaded = true;
-
-      pthread_mutex_unlock (&ptx_dev->stream_lock);
-    }
-  else if (stream && !stream->multithreaded
-	   && !pthread_equal (stream->host_thread, thread))
-    GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async);
-
-  return stream;
-}
-
 /* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
    should be locked on entry and remains locked on exit.  */
 
@@ -611,9 +235,6 @@  nvptx_init (void)
     return true;
 
   CUDA_CALL (cuInit, 0);
-  ptx_events = NULL;
-  pthread_mutex_init (&ptx_event_lock, NULL);
-
   CUDA_CALL (cuDeviceGetCount, &ndevs);
   ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
 					    * ndevs);
@@ -632,6 +253,11 @@  nvptx_attach_host_thread_to_device (int n)
   CUcontext thd_ctx;
 
   r = cuCtxGetDevice (&dev);
+  if (r == CUDA_ERROR_NOT_PERMITTED)
+    {
+      /* Assume we're in a CUDA callback, just return true.  */
+      return true;
+    }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
@@ -759,6 +385,9 @@  nvptx_open_device (int n)
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
+  ptx_dev->free_blocks = NULL;
+  pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
+
   GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
 		     "\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
 		     "\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
@@ -775,9 +404,6 @@  nvptx_open_device (int n)
 		     ptx_dev->max_registers_per_multiprocessor,
 		     ptx_dev->max_shared_memory_per_multiprocessor);
 
-  if (!init_streams_for_device (ptx_dev, async_engines))
-    return NULL;
-
   return ptx_dev;
 }
 
@@ -787,9 +413,15 @@  nvptx_close_device (struct ptx_device *ptx_dev)
   if (!ptx_dev)
     return true;
 
-  if (!fini_streams_for_device (ptx_dev))
-    return false;
-  
+  for (struct ptx_free_block *b = ptx_dev->free_blocks; b;)
+    {
+      struct ptx_free_block *b_next = b->next;
+      CUDA_CALL (cuMemFree, (CUdeviceptr) b->ptr);
+      free (b);
+      b = b_next;
+    }
+
+  pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
   if (!ptx_dev->ctx_shared)
@@ -905,134 +537,14 @@  link_ptx (CUmodule *module, const struct targ_ptx_
 }
 
 static void
-event_gc (bool memmap_lockable)
-{
-  struct ptx_event *ptx_event = ptx_events;
-  struct ptx_event *async_cleanups = NULL;
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
-  pthread_mutex_lock (&ptx_event_lock);
-
-  while (ptx_event != NULL)
-    {
-      CUresult r;
-      struct ptx_event *e = ptx_event;
-
-      ptx_event = ptx_event->next;
-
-      if (e->ord != nvthd->ptx_dev->ord)
-	continue;
-
-      r = cuEventQuery (*e->evt);
-      if (r == CUDA_SUCCESS)
-	{
-	  bool append_async = false;
-	  CUevent *te;
-
-	  te = e->evt;
-
-	  switch (e->type)
-	    {
-	    case PTX_EVT_MEM:
-	    case PTX_EVT_SYNC:
-	      break;
-
-	    case PTX_EVT_KNL:
-	      map_pop (thr, e->addr);
-	      break;
-
-	    case PTX_EVT_ASYNC_CLEANUP:
-	      {
-		/* The function gomp_plugin_async_unmap_vars needs to claim the
-		   memory-map splay tree lock for the current device, so we
-		   can't call it when one of our callers has already claimed
-		   the lock.  In that case, just delay the GC for this event
-		   until later.  */
-		if (!memmap_lockable)
-		  continue;
-
-		append_async = true;
-	      }
-	      break;
-	    }
-
-	  cuEventDestroy (*te);
-	  free ((void *)te);
-
-	  /* Unlink 'e' from ptx_events list.  */
-	  if (ptx_events == e)
-	    ptx_events = ptx_events->next;
-	  else
-	    {
-	      struct ptx_event *e_ = ptx_events;
-	      while (e_->next != e)
-		e_ = e_->next;
-	      e_->next = e_->next->next;
-	    }
-
-	  if (append_async)
-	    {
-	      e->next = async_cleanups;
-	      async_cleanups = e;
-	    }
-	  else
-	    free (e);
-	}
-    }
-
-  pthread_mutex_unlock (&ptx_event_lock);
-
-  /* We have to do these here, after ptx_event_lock is released.  */
-  while (async_cleanups)
-    {
-      struct ptx_event *e = async_cleanups;
-      async_cleanups = async_cleanups->next;
-
-      GOMP_PLUGIN_async_unmap_vars (e->addr, e->val);
-      free (e);
-    }
-}
-
-static void
-event_add (enum ptx_event_type type, CUevent *e, void *h, int val)
-{
-  struct ptx_event *ptx_event;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC
-	  || type == PTX_EVT_ASYNC_CLEANUP);
-
-  ptx_event = GOMP_PLUGIN_malloc (sizeof (struct ptx_event));
-  ptx_event->type = type;
-  ptx_event->evt = e;
-  ptx_event->addr = h;
-  ptx_event->ord = nvthd->ptx_dev->ord;
-  ptx_event->val = val;
-
-  pthread_mutex_lock (&ptx_event_lock);
-
-  ptx_event->next = ptx_events;
-  ptx_events = ptx_event;
-
-  pthread_mutex_unlock (&ptx_event_lock);
-}
-
-static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    int async, unsigned *dims, void *targ_mem_desc)
+	    unsigned *dims, void *targ_mem_desc,
+	    CUdeviceptr dp, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
-  CUresult r;
   int i;
-  struct ptx_stream *dev_str;
   void *kargs[1];
-  void *hp;
-  CUdeviceptr dp;
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-  const char *maybe_abort_msg = "(perhaps abort was called)";
   int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
   int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
   int dev_size = nvptx_thread ()->ptx_dev->multiprocessor_count;
@@ -1044,9 +556,6 @@  nvptx_exec (void (*fn), size_t mapnum, void **host
 
   function = targ_fn->fn;
 
-  dev_str = select_stream_for_async (async, pthread_self (), false, NULL);
-  assert (dev_str == nvthd->current_stream);
-
   /* Initialize the launch dimensions.  Typically this is constant,
      provided by the device compiler, but we must permit runtime
      values.  */
@@ -1175,61 +684,6 @@  nvptx_exec (void (*fn), size_t mapnum, void **host
 			   threads_per_block);
     }
 
-  /* This reserves a chunk of a pre-allocated page of memory mapped on both
-     the host and the device. HP is a host pointer to the new chunk, and DP is
-     the corresponding device pointer.  */
-  pthread_mutex_lock (&ptx_event_lock);
-  dp = map_push (thr, dev_str, mapnum * sizeof (void *));
-  pthread_mutex_unlock (&ptx_event_lock);
-
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-
-  /* Copy the array of arguments to the mapped page.  */
-  hp = alloca(sizeof(void *) * mapnum);
-  for (i = 0; i < mapnum; i++)
-    ((void **) hp)[i] = devaddrs[i] != 0 ? devaddrs[i] : hostaddrs[i];
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_enqueue_upload_start;
-
-      data_event_info.data_event.event_type = prof_info->event_type;
-      data_event_info.data_event.valid_bytes
-	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
-      data_event_info.data_event.parent_construct
-	= acc_construct_parallel; //TODO
-      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
-      data_event_info.data_event.implicit = 1;
-      data_event_info.data_event.tool_info = NULL;
-      data_event_info.data_event.var_name = NULL; //TODO
-      data_event_info.data_event.bytes = mapnum * sizeof (void *);
-      data_event_info.data_event.host_ptr = hp;
-      data_event_info.data_event.device_ptr = (void *) dp;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-  CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, hp,
-		    mapnum * sizeof (void *));
-
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_enqueue_upload_end;
-      data_event_info.data_event.event_type = prof_info->event_type;
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " gangs=%u, workers=%u, vectors=%u\n",
 		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
@@ -1241,7 +695,11 @@  nvptx_exec (void (*fn), size_t mapnum, void **host
   // num_workers	ntid.y
   // vector length	ntid.x
 
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
   acc_event_info enqueue_launch_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
   if (profiling_dispatch_p)
     {
       prof_info->event_type = acc_ev_enqueue_launch_start;
@@ -1269,11 +727,13 @@  nvptx_exec (void (*fn), size_t mapnum, void **host
       GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
 					    api_info);
     }
+  
   kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, dev_str->stream, kargs, 0);
+		    0, stream, kargs, 0);
+
   if (profiling_dispatch_p)
     {
       prof_info->event_type = acc_ev_enqueue_launch_end;
@@ -1283,91 +743,8 @@  nvptx_exec (void (*fn), size_t mapnum, void **host
 					    api_info);
     }
 
-  acc_event_info wait_event_info;
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_wait_start;
-
-      wait_event_info.other_event.event_type = prof_info->event_type;
-      wait_event_info.other_event.valid_bytes
-	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
-      wait_event_info.other_event.parent_construct
-	/* TODO = compute_construct_event_info.other_event.parent_construct */
-	= acc_construct_parallel; //TODO: kernels...
-      wait_event_info.other_event.implicit = 1;
-      wait_event_info.other_event.tool_info = NULL;
-
-      api_info->device_api = acc_device_api_cuda;
-    }
-#ifndef DISABLE_ASYNC
-  if (async < acc_async_noval)
-    {
-      if (profiling_dispatch_p)
-	{
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
-						api_info);
-	}
-      r = cuStreamSynchronize (dev_str->stream);
-      if (profiling_dispatch_p)
-	{
-	  prof_info->event_type = acc_ev_wait_end;
-	  wait_event_info.other_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
-						api_info);
-	}
-      if (r == CUDA_ERROR_LAUNCH_FAILED)
-	GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
-			   maybe_abort_msg);
-      else if (r != CUDA_SUCCESS)
-        GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-    }
-  else
-    {
-      CUevent *e;
-
-      e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-      r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
-      if (r == CUDA_ERROR_LAUNCH_FAILED)
-	GOMP_PLUGIN_fatal ("cuEventCreate error: %s %s\n", cuda_error (r),
-			   maybe_abort_msg);
-      else if (r != CUDA_SUCCESS)
-        GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
-
-      event_gc (true);
-
-      CUDA_CALL_ASSERT (cuEventRecord, *e, dev_str->stream);
-
-      event_add (PTX_EVT_KNL, e, (void *)dev_str, 0);
-    }
-#else
-  if (profiling_dispatch_p)
-    {
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
-					    api_info);
-    }
-  r = cuCtxSynchronize ();
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_wait_end;
-      wait_event_info.other_event.event_type = prof_info->event_type;
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
-					    api_info);
-    }
-  if (r == CUDA_ERROR_LAUNCH_FAILED)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
-		       maybe_abort_msg);
-  else if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-#endif
-
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: finished\n", __FUNCTION__,
 		     targ_fn->launch->fn);
-
-#ifndef DISABLE_ASYNC
-  if (async < acc_async_noval)
-#endif
-    map_pop (thr, dev_str);
 }
 
 void * openacc_get_current_cuda_context (void);
@@ -1410,8 +787,21 @@  nvptx_alloc (size_t s)
 }
 
 static bool
-nvptx_free (void *p)
+nvptx_free (void *p, struct ptx_device *ptx_dev)
 {
+  /* Assume callback context if this is null.  */
+  if (GOMP_PLUGIN_goacc_thread () == NULL)
+    {
+      struct ptx_free_block *n
+	= GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
+      n->ptr = p;
+      pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+      n->next = ptx_dev->free_blocks;
+      ptx_dev->free_blocks = n;
+      pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+      return true;
+    }
+
   CUdeviceptr pb;
   size_t ps;
 
@@ -1423,478 +813,9 @@  static bool
     }
 
   CUDA_CALL (cuMemFree, (CUdeviceptr) p);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_free;
-
-      acc_event_info data_event_info;
-      data_event_info.data_event.event_type = prof_info->event_type;
-      data_event_info.data_event.valid_bytes
-	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
-      data_event_info.data_event.parent_construct
-	= acc_construct_parallel; //TODO
-      data_event_info.data_event.implicit = 1; //TODO
-      data_event_info.data_event.tool_info = NULL;
-      data_event_info.data_event.var_name = NULL; //TODO
-      data_event_info.data_event.bytes = ps;
-      data_event_info.data_event.host_ptr = NULL;
-      data_event_info.data_event.device_ptr = p;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
   return true;
 }
 
-
-static bool
-nvptx_host2dev (void *d, const void *h, size_t s)
-{
-  CUdeviceptr pb;
-  size_t ps;
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
-  if (!s)
-    return true;
-  if (!d)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-
-  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
-  if (!pb)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-  if (!h)
-    {
-      GOMP_PLUGIN_error ("invalid host address");
-      return false;
-    }
-  if (d == h)
-    {
-      GOMP_PLUGIN_error ("invalid host or device address");
-      return false;
-    }
-  if ((void *)(d + s) > (void *)(pb + ps))
-    {
-      GOMP_PLUGIN_error ("invalid size");
-      return false;
-    }
-
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_enqueue_upload_start;
-
-      data_event_info.data_event.event_type = prof_info->event_type;
-      data_event_info.data_event.valid_bytes
-	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
-      data_event_info.data_event.parent_construct
-	= acc_construct_parallel; //TODO
-      data_event_info.data_event.implicit = 1; //TODO
-      data_event_info.data_event.tool_info = NULL;
-      data_event_info.data_event.var_name = NULL; //TODO
-      data_event_info.data_event.bytes = s;
-      data_event_info.data_event.host_ptr = /* TODO */ (void *) h;
-      data_event_info.data_event.device_ptr = d;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-#ifndef DISABLE_ASYNC
-  if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
-    {
-      CUevent *e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-      CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-      event_gc (false);
-      CUDA_CALL (cuMemcpyHtoDAsync,
-		 (CUdeviceptr) d, h, s, nvthd->current_stream->stream);
-      CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
-      event_add (PTX_EVT_MEM, e, (void *)h, 0);
-    }
-  else
-#endif
-    CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) d, h, s);
-
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_enqueue_upload_end;
-      data_event_info.data_event.event_type = prof_info->event_type;
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-  return true;
-}
-
-static bool
-nvptx_dev2host (void *h, const void *d, size_t s)
-{
-  CUdeviceptr pb;
-  size_t ps;
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
-  if (!s)
-    return true;
-  if (!d)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-
-  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
-  if (!pb)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-  if (!h)
-    {
-      GOMP_PLUGIN_error ("invalid host address");
-      return false;
-    }
-  if (d == h)
-    {
-      GOMP_PLUGIN_error ("invalid host or device address");
-      return false;
-    }
-  if ((void *)(d + s) > (void *)(pb + ps))
-    {
-      GOMP_PLUGIN_error ("invalid size");
-      return false;
-    }
-
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_enqueue_download_start;
-
-      data_event_info.data_event.event_type = prof_info->event_type;
-      data_event_info.data_event.valid_bytes
-	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
-      data_event_info.data_event.parent_construct
-	= acc_construct_parallel; //TODO
-      data_event_info.data_event.implicit = 1; //TODO
-      data_event_info.data_event.tool_info = NULL;
-      data_event_info.data_event.var_name = NULL; //TODO
-      data_event_info.data_event.bytes = s;
-      data_event_info.data_event.host_ptr = h;
-      data_event_info.data_event.device_ptr = /* TODO */ (void *) d;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-#ifndef DISABLE_ASYNC
-  if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
-    {
-      CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-      CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-      event_gc (false);
-      CUDA_CALL (cuMemcpyDtoHAsync,
-		 h, (CUdeviceptr) d, s, nvthd->current_stream->stream);
-      CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
-      event_add (PTX_EVT_MEM, e, (void *)h, 0);
-    }
-  else
-#endif
-    CUDA_CALL (cuMemcpyDtoH, h, (CUdeviceptr) d, s);
-
-  if (profiling_dispatch_p)
-    {
-      prof_info->event_type = acc_ev_enqueue_download_end;
-      data_event_info.data_event.event_type = prof_info->event_type;
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-					    api_info);
-    }
-
-  return true;
-}
-
-static void
-nvptx_set_async (int async)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  nvthd->current_stream
-    = select_stream_for_async (async, pthread_self (), true, NULL);
-}
-
-static int
-nvptx_async_test (int async)
-{
-  CUresult r;
-  struct ptx_stream *s;
-
-  s = select_stream_for_async (async, pthread_self (), false, NULL);
-
-  if (!s)
-    GOMP_PLUGIN_fatal ("unknown async %d", async);
-
-  r = cuStreamQuery (s->stream);
-  if (r == CUDA_SUCCESS)
-    {
-      /* The oacc-parallel.c:goacc_wait function calls this hook to determine
-	 whether all work has completed on this stream, and if so omits the call
-	 to the wait hook.  If that happens, event_gc might not get called
-	 (which prevents variables from getting unmapped and their associated
-	 device storage freed), so call it here.  */
-      event_gc (true);
-      return 1;
-    }
-  else if (r == CUDA_ERROR_NOT_READY)
-    return 0;
-
-  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
-  return 0;
-}
-
-static int
-nvptx_async_test_all (void)
-{
-  struct ptx_stream *s;
-  pthread_t self = pthread_self ();
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
-    {
-      if ((s->multithreaded || pthread_equal (s->host_thread, self))
-	  && cuStreamQuery (s->stream) == CUDA_ERROR_NOT_READY)
-	{
-	  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-	  return 0;
-	}
-    }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
-  event_gc (true);
-
-  return 1;
-}
-
-static void
-nvptx_wait (int async)
-{
-  struct ptx_stream *s;
-
-  s = select_stream_for_async (async, pthread_self (), false, NULL);
-  if (!s)
-    GOMP_PLUGIN_fatal ("unknown async %d", async);
-
-  GOMP_PLUGIN_debug (0, "  %s: waiting on async=%d\n", __FUNCTION__, async);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  bool profiling_dispatch_p
-    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
-  acc_event_info wait_event_info;
-  if (profiling_dispatch_p)
-    {
-      acc_prof_info *prof_info = thr->prof_info;
-      acc_api_info *api_info = thr->api_info;
-
-      prof_info->event_type = acc_ev_wait_start;
-
-      wait_event_info.other_event.event_type = prof_info->event_type;
-      wait_event_info.other_event.valid_bytes
-	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
-      wait_event_info.other_event.parent_construct
-	/* TODO = compute_construct_event_info.other_event.parent_construct */
-	= acc_construct_parallel; //TODO: kernels...
-      wait_event_info.other_event.implicit = 1;
-      wait_event_info.other_event.tool_info = NULL;
-
-      api_info->device_api = acc_device_api_cuda;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
-					    api_info);
-    }
-  CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
-  if (profiling_dispatch_p)
-    {
-      acc_prof_info *prof_info = thr->prof_info;
-      acc_api_info *api_info = thr->api_info;
-
-      prof_info->event_type = acc_ev_wait_end;
-
-      wait_event_info.other_event.event_type = prof_info->event_type;
-
-      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
-					    api_info);
-    }
-
-  event_gc (true);
-}
-
-static void
-nvptx_wait_async (int async1, int async2)
-{
-  CUevent *e;
-  struct ptx_stream *s1, *s2;
-  pthread_t self = pthread_self ();
-
-  /* The stream that is waiting (rather than being waited for) doesn't
-     necessarily have to exist already.  */
-  s2 = select_stream_for_async (async2, self, true, NULL);
-
-  s1 = select_stream_for_async (async1, self, false, NULL);
-  if (!s1)
-    GOMP_PLUGIN_fatal ("invalid async 1\n");
-
-  if (s1 == s2)
-    GOMP_PLUGIN_fatal ("identical parameters");
-
-  e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
-  event_gc (true);
-
-  CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream);
-
-  event_add (PTX_EVT_SYNC, e, NULL, 0);
-
-  CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0);
-}
-
-static void
-nvptx_wait_all (void)
-{
-  CUresult r;
-  struct ptx_stream *s;
-  pthread_t self = pthread_self ();
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info wait_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
-  if (profiling_dispatch_p)
-    {
-      wait_event_info.other_event.valid_bytes
-	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
-      wait_event_info.other_event.parent_construct
-	/* TODO = compute_construct_event_info.other_event.parent_construct */
-	= acc_construct_parallel; //TODO: kernels...
-      wait_event_info.other_event.implicit = 1;
-      wait_event_info.other_event.tool_info = NULL;
-
-      api_info->device_api = acc_device_api_cuda;
-    }
-
-  /* Wait for active streams initiated by this thread (or by multiple threads)
-     to complete.  */
-  for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
-    {
-      if (s->multithreaded || pthread_equal (s->host_thread, self))
-	{
-	  r = cuStreamQuery (s->stream);
-	  if (r == CUDA_SUCCESS)
-	    continue;
-	  else if (r != CUDA_ERROR_NOT_READY)
-	    GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
-	  if (profiling_dispatch_p)
-	    {
-	      prof_info->event_type = acc_ev_wait_start;
-	      wait_event_info.other_event.event_type = prof_info->event_type;
-	      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info,
-						    &wait_event_info,
-						    api_info);
-	    }
-	  CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
-	  if (profiling_dispatch_p)
-	    {
-	      prof_info->event_type = acc_ev_wait_end;
-	      wait_event_info.other_event.event_type = prof_info->event_type;
-	      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info,
-						    &wait_event_info,
-						    api_info);
-	    }
-	}
-    }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
-  event_gc (true);
-}
-
-static void
-nvptx_wait_all_async (int async)
-{
-  struct ptx_stream *waiting_stream, *other_stream;
-  CUevent *e;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  pthread_t self = pthread_self ();
-
-  /* The stream doing the waiting.  This could be the first mention of the
-     stream, so create it if necessary.  */
-  waiting_stream
-    = select_stream_for_async (async, pthread_self (), true, NULL);
-
-  /* Launches on the null stream already block on other streams in the
-     context.  */
-  if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream)
-    return;
-
-  event_gc (true);
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  for (other_stream = nvthd->ptx_dev->active_streams;
-       other_stream != NULL;
-       other_stream = other_stream->next)
-    {
-      if (!other_stream->multithreaded
-	  && !pthread_equal (other_stream->host_thread, self))
-	continue;
-
-      e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-      CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
-      /* Record an event on the waited-for stream.  */
-      CUDA_CALL_ASSERT (cuEventRecord, *e, other_stream->stream);
-
-      event_add (PTX_EVT_SYNC, e, NULL, 0);
-
-      CUDA_CALL_ASSERT (cuStreamWaitEvent, waiting_stream->stream, *e, 0);
-   }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-}
-
 static void *
 nvptx_get_current_cuda_device (void)
 {
@@ -1917,70 +838,6 @@  nvptx_get_current_cuda_context (void)
   return nvthd->ptx_dev->ctx;
 }
 
-static void *
-nvptx_get_cuda_stream (int async)
-{
-  struct ptx_stream *s;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (!nvthd || !nvthd->ptx_dev)
-    return NULL;
-
-  s = select_stream_for_async (async, pthread_self (), false, NULL);
-
-  return s ? s->stream : NULL;
-}
-
-static int
-nvptx_set_cuda_stream (int async, void *stream)
-{
-  struct ptx_stream *oldstream;
-  pthread_t self = pthread_self ();
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
-  if (async < 0)
-    GOMP_PLUGIN_fatal ("bad async %d", async);
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  /* We have a list of active streams and an array mapping async values to
-     entries of that list.  We need to take "ownership" of the passed-in stream,
-     and add it to our list, removing the previous entry also (if there was one)
-     in order to prevent resource leaks.  Note the potential for surprise
-     here: maybe we should keep track of passed-in streams and leave it up to
-     the user to tidy those up, but that doesn't work for stream handles
-     returned from acc_get_cuda_stream above...  */
-
-  oldstream = select_stream_for_async (async, self, false, NULL);
-
-  if (oldstream)
-    {
-      if (nvthd->ptx_dev->active_streams == oldstream)
-	nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next;
-      else
-	{
-	  struct ptx_stream *s = nvthd->ptx_dev->active_streams;
-	  while (s->next != oldstream)
-	    s = s->next;
-	  s->next = s->next->next;
-	}
-
-      CUDA_CALL_ASSERT (cuStreamDestroy, oldstream->stream);
-
-      if (!map_fini (thr, oldstream))
-	GOMP_PLUGIN_fatal ("error when freeing host memory");
-
-      free (oldstream);
-    }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
-  (void) select_stream_for_async (async, self, true, (CUstream) stream);
-
-  return 1;
-}
-
 /* Plugin entry points.  */
 
 const char *
@@ -2223,6 +1080,23 @@  GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
   if (!nvptx_attach_host_thread_to_device (ord))
     return NULL;
+
+  struct ptx_device *ptx_dev = ptx_devices[ord];
+  struct ptx_free_block *blocks, *tmp;
+
+  pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+  blocks = ptx_dev->free_blocks;
+  ptx_dev->free_blocks = NULL;
+  pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+
+  while (blocks)
+    {
+      tmp = blocks->next;
+      nvptx_free (blocks->ptr, ptx_dev);
+      free (blocks);
+      blocks = tmp;
+    }
+
   return nvptx_alloc (size);
 }
 
@@ -2230,84 +1104,165 @@  bool
 GOMP_OFFLOAD_free (int ord, void *ptr)
 {
   return (nvptx_attach_host_thread_to_device (ord)
-	  && nvptx_free (ptr));
+	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-bool
-GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
-{
-  return (nvptx_attach_host_thread_to_device (ord)
-	  && nvptx_dev2host (dst, src, n));
-}
-
-bool
-GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
-{
-  return (nvptx_attach_host_thread_to_device (ord)
-	  && nvptx_host2dev (dst, src, n));
-}
-
-void (*device_run) (int n, void *fn_ptr, void *vars) = NULL;
-
 void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 			   void **hostaddrs, void **devaddrs,
-			   int async, unsigned *dims, void *targ_mem_desc)
+			   unsigned *dims, void *targ_mem_desc)
 {
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, async, dims, targ_mem_desc);
-}
+  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
-void
-GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc, int async)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
+  void **hp = NULL;
+  CUdeviceptr dp = 0;
 
-  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-  CUDA_CALL_ASSERT (cuEventRecord, *e, nvthd->current_stream->stream);
-  event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc, async);
-}
+  if (mapnum > 0)
+    {
+      hp = alloca (mapnum * sizeof (void *));
+      for (int i = 0; i < mapnum; i++)
+	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+    }
 
-int
-GOMP_OFFLOAD_openacc_async_test (int async)
-{
-  return nvptx_async_test (async);
-}
+  /* Copy the (device) pointers to arguments to the device (dp and hp might in
+     fact have the same value on a unified-memory system).  */
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_start;
 
-int
-GOMP_OFFLOAD_openacc_async_test_all (void)
-{
-  return nvptx_async_test_all ();
-}
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
+      data_event_info.data_event.implicit = 1;
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = mapnum * sizeof (void *);
+      data_event_info.data_event.host_ptr = hp;
+      data_event_info.data_event.device_ptr = (void *) dp;
 
-void
-GOMP_OFFLOAD_openacc_async_wait (int async)
-{
-  nvptx_wait (async);
-}
+      api_info->device_api = acc_device_api_cuda;
 
-void
-GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2)
-{
-  nvptx_wait_async (async1, async2);
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
+  if (mapnum > 0)
+    CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
+		      mapnum * sizeof (void *));
+
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_end;
+      data_event_info.data_event.event_type = prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+	      dp, NULL);
+
+  CUresult r = cuStreamSynchronize (NULL);
+  const char *maybe_abort_msg = "(perhaps abort was called)";
+  if (r == CUDA_ERROR_LAUNCH_FAILED)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+		       maybe_abort_msg);
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuMemFree, dp);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_wait_all (void)
+static void
+cuda_free_argmem (void *ptr)
 {
-  nvptx_wait_all ();
+  void **block = (void **) ptr;
+  nvptx_free (block[0], (struct ptx_device *) block[1]);
+  free (block);
 }
 
 void
-GOMP_OFFLOAD_openacc_async_wait_all_async (int async)
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
 {
-  nvptx_wait_all_async (async);
-}
+  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
-void
-GOMP_OFFLOAD_openacc_async_set_async (int async)
-{
-  nvptx_set_async (async);
+  void **hp = NULL;
+  CUdeviceptr dp = 0;
+  void **block = NULL;
+
+  if (mapnum > 0)
+    {
+      block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+      hp = block + 2;
+      for (int i = 0; i < mapnum; i++)
+	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+    }
+
+  /* Copy the (device) pointers to arguments to the device (dp and hp might in
+     fact have the same value on a unified-memory system).  */
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_start;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
+      data_event_info.data_event.implicit = 1;
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = mapnum * sizeof (void *);
+      data_event_info.data_event.host_ptr = hp;
+      data_event_info.data_event.device_ptr = (void *) dp;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
+  if (mapnum > 0)
+    {
+      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
+			mapnum * sizeof (void *), aq->cuda_stream);
+      block[0] = (void *) dp;
+
+      struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+      struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
+      block[1] = (void *) nvthd->ptx_dev;
+    }
+
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_end;
+      data_event_info.data_event.event_type = prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+  
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+	      dp, aq->cuda_stream);
+
+  if (mapnum > 0)
+    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
 void *
@@ -2329,7 +1284,6 @@  GOMP_OFFLOAD_openacc_create_thread_data (int ord)
   if (!thd_ctx)
     CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx);
 
-  nvthd->current_stream = ptx_dev->null_stream;
   nvthd->ptx_dev = ptx_dev;
 
   return (void *) nvthd;
@@ -2354,17 +1308,169 @@  GOMP_OFFLOAD_openacc_cuda_get_current_context (voi
 }
 
 /* NOTE: This returns a CUstream, not a ptx_stream pointer.  */
-
 void *
-GOMP_OFFLOAD_openacc_cuda_get_stream (int async)
+GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
 {
-  return nvptx_get_cuda_stream (async);
+  return (void *) aq->cuda_stream;
 }
 
 /* NOTE: This takes a CUstream, not a ptx_stream pointer.  */
+int
+GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
+{
+  if (aq->cuda_stream)
+    {
+      CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+      CUDA_CALL_ASSERT (cuStreamDestroy, aq->cuda_stream);
+    }
 
+  aq->cuda_stream = (CUstream) stream;
+  return 1;
+}
+
+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+  struct goacc_asyncqueue *aq
+    = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
+  return aq;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+{
+  CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
+  free (aq);
+  return true;
+}
+
 int
-GOMP_OFFLOAD_openacc_cuda_set_stream (int async, void *stream)
+GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 {
-  return nvptx_set_cuda_stream (async, stream);
+  CUresult r = cuStreamQuery (aq->cuda_stream);
+  if (r == CUDA_SUCCESS)
+    return 1;
+  if (r == CUDA_ERROR_NOT_READY)
+    return 0;
+
+  GOMP_PLUGIN_error ("cuStreamQuery error: %s", cuda_error (r));
+  return -1;
 }
+
+void
+GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+{
+  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
+				      struct goacc_asyncqueue *aq2)
+{
+  CUevent e;
+  CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
+  CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
+  CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+}
+
+static void
+cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
+{
+  if (res != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+  struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
+  cb->fn (cb->ptr);
+  free (ptr);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
+					   void (*callback_fn)(void *),
+					   void *userptr)
+{
+  struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
+  b->fn = callback_fn;
+  b->ptr = userptr;
+  b->aq = aq;
+  CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
+		    cuda_callback_wrapper, (void *) b, 0);
+}
+
+static bool
+cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
+{
+  CUdeviceptr pb;
+  size_t ps;
+  if (!s)
+    return true;
+  if (!d)
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
+  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
+  if (!pb)
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
+  if (!h)
+    {
+      GOMP_PLUGIN_error ("invalid host address");
+      return false;
+    }
+  if (d == h)
+    {
+      GOMP_PLUGIN_error ("invalid host or device address");
+      return false;
+    }
+  if ((void *)(d + s) > (void *)(pb + ps))
+    {
+      GOMP_PLUGIN_error ("invalid size");
+      return false;
+    }
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (src, dst, n))
+    return false;
+  CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (dst, src, n))
+    return false;
+  CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
+				     size_t n, struct goacc_asyncqueue *aq)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (src, dst, n))
+    return false;
+  CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
+				     size_t n, struct goacc_asyncqueue *aq)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (dst, src, n))
+    return false;
+  CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
+  return true;
+}
Index: libgomp/target.c
===================================================================
--- libgomp/target.c	(revision 249620)
+++ libgomp/target.c	(working copy)
@@ -187,18 +187,44 @@  gomp_device_copy (struct gomp_device_descr *device
     }
 }
 
-static void
+static inline void
+goacc_device_copy_async (struct gomp_device_descr *devicep,
+			 bool (*copy_func) (int, void *, const void *, size_t,
+					    struct goacc_asyncqueue *),
+			 const char *dst, void *dstaddr,
+			 const char *src, const void *srcaddr,
+			 size_t size, struct goacc_asyncqueue *aq)
+{
+  if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
+    {
+      gomp_mutex_unlock (&devicep->lock);
+      gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
+		  src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
+    }
+}
+
+attribute_hidden void
 gomp_copy_host2dev (struct gomp_device_descr *devicep,
+		    struct goacc_asyncqueue *aq,
 		    void *d, const void *h, size_t sz)
 {
-  gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
+  if (aq)
+    goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+			     "dev", d, "host", h, sz, aq);
+  else
+    gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
 }
 
-static void
+attribute_hidden void
 gomp_copy_dev2host (struct gomp_device_descr *devicep,
+		    struct goacc_asyncqueue *aq,
 		    void *h, const void *d, size_t sz)
 {
-  gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
+  if (aq)
+    goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
+			     "host", h, "dev", d, sz, aq);
+  else
+    gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
 }
 
 static void
@@ -216,7 +242,8 @@  gomp_free_device_memory (struct gomp_device_descr
    Helper function of gomp_map_vars.  */
 
 static inline void
-gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
+gomp_map_vars_existing (struct gomp_device_descr *devicep,
+			struct goacc_asyncqueue *aq, splay_tree_key oldn,
 			splay_tree_key newn, struct target_var_desc *tgt_var,
 			unsigned char kind)
 {
@@ -238,7 +265,7 @@  static inline void
     }
 
   if (GOMP_MAP_ALWAYS_TO_P (kind))
-    gomp_copy_host2dev (devicep,
+    gomp_copy_host2dev (devicep, aq,
 			(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
 				  + newn->host_start - oldn->host_start),
 			(void *) newn->host_start,
@@ -256,8 +283,8 @@  get_kind (bool short_mapkind, void *kinds, int idx
 }
 
 static void
-gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr,
-		  uintptr_t target_offset, uintptr_t bias)
+gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+		  uintptr_t host_ptr, uintptr_t target_offset, uintptr_t bias)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
   struct splay_tree_s *mem_map = &devicep->mem_map;
@@ -268,7 +295,7 @@  static void
     {
       cur_node.tgt_offset = (uintptr_t) NULL;
       /* FIXME: see comment about coalescing host/dev transfers below.  */
-      gomp_copy_host2dev (devicep,
+      gomp_copy_host2dev (devicep, aq,
 			  (void *) (tgt->tgt_start + target_offset),
 			  (void *) &cur_node.tgt_offset,
 			  sizeof (void *));
@@ -291,7 +318,7 @@  static void
      to initialize the pointer with.  */
   cur_node.tgt_offset -= bias;
   /* FIXME: see comment about coalescing host/dev transfers below.  */
-  gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + target_offset),
+  gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
 		      (void *) &cur_node.tgt_offset, sizeof (void *));
 }
 
@@ -329,9 +356,9 @@  gomp_map_pset (struct target_mem_desc *tgt, uintpt
 }
 
 static void
-gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
-			  size_t first, size_t i, void **hostaddrs,
-			  size_t *sizes, void *kinds)
+gomp_map_fields_existing (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+			  splay_tree_key n, size_t first, size_t i,
+			  void **hostaddrs, size_t *sizes, void *kinds)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
   struct splay_tree_s *mem_map = &devicep->mem_map;
@@ -348,7 +375,7 @@  static void
       && n2->tgt == n->tgt
       && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
     {
-      gomp_map_vars_existing (devicep, n2, &cur_node,
+      gomp_map_vars_existing (devicep, aq, n2, &cur_node,
 			      &tgt->list[i], kind & typemask);
       return;
     }
@@ -364,7 +391,7 @@  static void
 	      && n2->host_start - n->host_start
 		 == n2->tgt_offset - n->tgt_offset)
 	    {
-	      gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+	      gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
 				      kind & typemask);
 	      return;
 	    }
@@ -376,7 +403,7 @@  static void
 	  && n2->tgt == n->tgt
 	  && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
 	{
-	  gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+	  gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
 				  kind & typemask);
 	  return;
 	}
@@ -547,6 +574,18 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	       void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
 	       bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
 {
+  struct target_mem_desc *tgt;
+  tgt = gomp_map_vars_async (devicep, NULL, mapnum, hostaddrs, devaddrs,
+			     sizes, kinds, short_mapkind, pragma_kind);
+  return tgt;
+}
+
+attribute_hidden struct target_mem_desc *
+gomp_map_vars_async (struct gomp_device_descr *devicep,
+		     struct goacc_asyncqueue *aq, size_t mapnum,
+		     void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
+		     bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
+{
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   bool has_firstprivate = false;
   const int rshift = short_mapkind ? 8 : 3;
@@ -665,7 +704,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	      continue;
 	    }
 	  for (i = first; i <= last; i++)
-	    gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+	    gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
 				      sizes, kinds);
 	  i--;
 	  continue;
@@ -722,7 +761,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
       else
 	n = splay_tree_lookup (mem_map, &cur_node);
       if (n && n->refcount != REFCOUNT_LINK)
-	gomp_map_vars_existing (devicep, n, &cur_node, &tgt->list[i],
+	gomp_map_vars_existing (devicep, aq, n, &cur_node, &tgt->list[i],
 				kind & typemask);
       else
 	{
@@ -790,7 +829,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	  if (n)
 	    {
 	      assert (n->refcount != REFCOUNT_LINK);
-	      gomp_map_vars_existing (devicep, n, &cur_node, row_desc,
+	      gomp_map_vars_existing (devicep, aq, n, &cur_node, row_desc,
 				      kind & typemask);	      
 	    }
 	  else
@@ -866,7 +905,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		tgt_size = (tgt_size + align - 1) & ~(align - 1);
 		tgt->list[i].offset = tgt_size;
 		len = sizes[i];
-		gomp_copy_host2dev (devicep,
+		gomp_copy_host2dev (devicep, aq,
 				    (void *) (tgt->tgt_start + tgt_size),
 				    (void *) hostaddrs[i], len);
 		tgt_size += len;
@@ -900,7 +939,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		    continue;
 		  }
 		for (i = first; i <= last; i++)
-		  gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+		  gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
 					    sizes, kinds);
 		i--;
 		continue;
@@ -920,7 +959,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		  cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i - 1);
 		if (cur_node.tgt_offset)
 		  cur_node.tgt_offset -= sizes[i];
-		gomp_copy_host2dev (devicep,
+		gomp_copy_host2dev (devicep, aq,
 				    (void *) (n->tgt->tgt_start
 					      + n->tgt_offset
 					      + cur_node.host_start
@@ -950,7 +989,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	      k->host_end = k->host_start + sizeof (void *);
 	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n && n->refcount != REFCOUNT_LINK)
-	      gomp_map_vars_existing (devicep, n, k, &tgt->list[i],
+	      gomp_map_vars_existing (devicep, aq, n, k, &tgt->list[i],
 				      kind & typemask);
 	    else
 	      {
@@ -1006,7 +1045,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		    /* FIXME: Perhaps add some smarts, like if copying
 		       several adjacent fields from host to target, use some
 		       host buffer to avoid sending each var individually.  */
-		    gomp_copy_host2dev (devicep,
+		    gomp_copy_host2dev (devicep, aq,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
@@ -1013,7 +1052,8 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 					k->host_end - k->host_start);
 		    break;
 		  case GOMP_MAP_POINTER:
-		    gomp_map_pointer (tgt, (uintptr_t) *(void **) k->host_start,
+		    gomp_map_pointer (tgt, aq,
+				      (uintptr_t) *(void **) k->host_start,
 				      k->tgt_offset, sizes[i]);
 		    break;
 		  case GOMP_MAP_TO_PSET:
@@ -1042,7 +1082,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 					     sizes[j]);
 			    tptr = *(uintptr_t *) hostaddrs[i];
 			    *(uintptr_t *) hostaddrs[i]= toffset;
-			    gomp_copy_host2dev (devicep,
+			    gomp_copy_host2dev (devicep, aq,
 						(void *) (tgt->tgt_start
 							  + k->tgt_offset),
 						(void *) k->host_start,
@@ -1052,7 +1092,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 			    found_pointer = true;
 			  }
 		      if (!found_pointer)
-			gomp_copy_host2dev (devicep,
+			gomp_copy_host2dev (devicep, aq,
 					    (void *) (tgt->tgt_start
 						      + k->tgt_offset),
 					    (void *) k->host_start,
@@ -1079,7 +1119,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		    break;
 		  case GOMP_MAP_FORCE_DEVICEPTR:
 		    assert (k->host_end - k->host_start == sizeof (void *));
-		    gomp_copy_host2dev (devicep,
+		    gomp_copy_host2dev (devicep, aq,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
@@ -1096,9 +1136,8 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		    /* Set link pointer on target to the device address of the
 		       mapped object.  */
 		    void *tgt_addr = (void *) (tgt->tgt_start + k->tgt_offset);
-		    devicep->host2dev_func (devicep->target_id,
-					    (void *) n->tgt_offset,
-					    &tgt_addr, sizeof (void *));
+		    gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
+					&tgt_addr, sizeof (void *));
 		  }
 		array++;
 	      }
@@ -1142,7 +1181,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	      if (n)
 		{
 		  assert (n->refcount != REFCOUNT_LINK);
-		  gomp_map_vars_existing (devicep, n, &cur_node, row_desc,
+		  gomp_map_vars_existing (devicep, aq, n, &cur_node, row_desc,
 					  kind & typemask);
 		  target_row_addr = n->tgt->tgt_start + n->tgt_offset;
 		}
@@ -1166,7 +1205,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		  row_desc->copy_from
 		    = GOMP_MAP_COPY_FROM_P (kind & typemask);
 		  row_desc->always_copy_from
-		    = GOMP_MAP_COPY_FROM_P (kind & typemask);
+		    = GOMP_MAP_ALWAYS_FROM_P (kind & typemask);
 		  row_desc->offset = 0;
 		  row_desc->length = da->data_row_size;
 
@@ -1175,7 +1214,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 		  splay_tree_insert (mem_map, array);
 
 		  if (GOMP_MAP_COPY_TO_P (kind & typemask))
-		    gomp_copy_host2dev (devicep,
+		    gomp_copy_host2dev (devicep, aq,
 					(void *) tgt->tgt_start + k->tgt_offset,
 					(void *) k->host_start,
 					da->data_row_size);
@@ -1191,9 +1230,11 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	    {
 	      void *ptrblock = gomp_dynamic_array_create_ptrblock
 		(da, target_ptrblock, target_data_rows + row_start);
-	      gomp_copy_host2dev (devicep, target_ptrblock, ptrblock,
+	      gomp_copy_host2dev (devicep, aq, target_ptrblock, ptrblock,
 				  da->ptrblock_size);
-	      free (ptrblock);
+	      /* Freeing of the ptrblock must be scheduled after the host2dev
+		 copy completes.  */
+	      goacc_async_free (devicep, aq, ptrblock);
 	    }
 
 	  row_start += da->data_row_num;
@@ -1213,7 +1254,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
 	{
 	  cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
 	  /* FIXME: see above FIXME comment.  */
-	  gomp_copy_host2dev (devicep,
+	  gomp_copy_host2dev (devicep, aq,
 			      (void *) (tgt->tgt_start + i * sizeof (void *)),
 			      (void *) &cur_node.tgt_offset, sizeof (void *));
 	}
@@ -1232,7 +1273,7 @@  gomp_map_vars (struct gomp_device_descr *devicep,
   return tgt;
 }
 
-static void
+attribute_hidden void
 gomp_unmap_tgt (struct target_mem_desc *tgt)
 {
   /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region.  */
@@ -1267,6 +1308,13 @@  gomp_remove_var (struct gomp_device_descr *devicep
 attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
+  gomp_unmap_vars_async (tgt, do_copyfrom, NULL);
+}
+
+attribute_hidden void
+gomp_unmap_vars_async (struct target_mem_desc *tgt, bool do_copyfrom,
+		       struct goacc_asyncqueue *aq)
+{
   struct gomp_device_descr *devicep = tgt->device_descr;
 
   if (tgt->list_count == 0)
@@ -1302,7 +1350,7 @@  gomp_unmap_vars (struct target_mem_desc *tgt, bool
 
       if ((do_unmap && do_copyfrom && tgt->list[i].copy_from)
 	  || tgt->list[i].always_copy_from)
-	gomp_copy_dev2host (devicep,
+	gomp_copy_dev2host (devicep, aq,
 			    (void *) (k->host_start + tgt->list[i].offset),
 			    (void *) (k->tgt->tgt_start + k->tgt_offset
 				      + tgt->list[i].offset),
@@ -1368,9 +1416,9 @@  gomp_update (struct gomp_device_descr *devicep, si
 	    size_t size = cur_node.host_end - cur_node.host_start;
 
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
-	      gomp_copy_host2dev (devicep, devaddr, hostaddr, size);
+	      gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size);
 	    if (GOMP_MAP_COPY_FROM_P (kind & typemask))
-	      gomp_copy_dev2host (devicep, hostaddr, devaddr, size);
+	      gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
 	  }
       }
   gomp_mutex_unlock (&devicep->lock);
@@ -1691,9 +1739,21 @@  gomp_init_device (struct gomp_device_descr *device
 				   false);
     }
 
+  /* Initialize OpenACC asynchronous queues.  */
+  goacc_init_asyncqueues (devicep);
+
   devicep->state = GOMP_DEVICE_INITIALIZED;
 }
 
+attribute_hidden bool
+gomp_fini_device (struct gomp_device_descr *devicep)
+{
+  devicep->state = GOMP_DEVICE_FINALIZED;
+  bool ret = goacc_fini_asyncqueues (devicep);
+  ret &= devicep->fini_device_func (devicep->target_id);
+  return ret;
+}
+
 attribute_hidden void
 gomp_unload_device (struct gomp_device_descr *devicep)
 {
@@ -2222,7 +2282,7 @@  gomp_exit_data (struct gomp_device_descr *devicep,
 
 	  if ((kind == GOMP_MAP_FROM && k->refcount == 0)
 	      || kind == GOMP_MAP_ALWAYS_FROM)
-	    gomp_copy_dev2host (devicep, (void *) cur_node.host_start,
+	    gomp_copy_dev2host (devicep, NULL, (void *) cur_node.host_start,
 				(void *) (k->tgt->tgt_start + k->tgt_offset
 					  + cur_node.host_start
 					  - k->host_start),
@@ -2848,20 +2908,20 @@  gomp_load_plugin_for_device (struct gomp_device_de
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
     {
       if (!DLSYM_OPT (openacc.exec, openacc_exec)
-	  || !DLSYM_OPT (openacc.register_async_cleanup,
-			 openacc_register_async_cleanup)
-	  || !DLSYM_OPT (openacc.async_test, openacc_async_test)
-	  || !DLSYM_OPT (openacc.async_test_all, openacc_async_test_all)
-	  || !DLSYM_OPT (openacc.async_wait, openacc_async_wait)
-	  || !DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async)
-	  || !DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all)
-	  || !DLSYM_OPT (openacc.async_wait_all_async,
-			 openacc_async_wait_all_async)
-	  || !DLSYM_OPT (openacc.async_set_async, openacc_async_set_async)
 	  || !DLSYM_OPT (openacc.create_thread_data,
 			 openacc_create_thread_data)
 	  || !DLSYM_OPT (openacc.destroy_thread_data,
-			 openacc_destroy_thread_data))
+			 openacc_destroy_thread_data)
+	  || !DLSYM_OPT (openacc.async.construct, openacc_async_construct)
+	  || !DLSYM_OPT (openacc.async.destruct, openacc_async_destruct)
+	  || !DLSYM_OPT (openacc.async.test, openacc_async_test)
+	  || !DLSYM_OPT (openacc.async.synchronize, openacc_async_synchronize)
+	  || !DLSYM_OPT (openacc.async.serialize, openacc_async_serialize)
+	  || !DLSYM_OPT (openacc.async.queue_callback,
+			 openacc_async_queue_callback)
+	  || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+	  || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
+	  || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
 	{
 	  /* Require all the OpenACC handlers if we have
 	     GOMP_OFFLOAD_CAP_OPENACC_200.  */
@@ -2912,10 +2972,7 @@  gomp_target_fini (void)
       struct gomp_device_descr *devicep = &devices[i];
       gomp_mutex_lock (&devicep->lock);
       if (devicep->state == GOMP_DEVICE_INITIALIZED)
-	{
-	  ret = devicep->fini_device_func (devicep->target_id);
-	  devicep->state = GOMP_DEVICE_FINALIZED;
-	}
+	ret = gomp_fini_device (devicep);
       gomp_mutex_unlock (&devicep->lock);
       if (!ret)
 	gomp_fatal ("device finalization failed");
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c	(working copy)
@@ -206,11 +206,6 @@  void cb_enter_data_start (acc_prof_info *prof_info
   assert (event_info->other_event.implicit == 1);
   assert (event_info->other_event.tool_info == NULL);
 
-  if (acc_device_type == acc_device_host
-      || state < 100) //TODO
-    assert (api_info->device_api == acc_device_api_none);
-  else
-    assert (api_info->device_api == acc_device_api_cuda);
   assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
   assert (api_info->device_type == prof_info->device_type);
   assert (api_info->vendor == -1);
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c	(working copy)
@@ -151,7 +151,7 @@  main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N], N) \
-  async (4)
+  wait (1, 2, 3) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c	(working copy)
@@ -162,7 +162,7 @@  main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
-  wait (1) async (4)
+  wait (1, 2, 3) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c	(working copy)
@@ -138,7 +138,7 @@  main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
-  wait (1,5) async (4)
+  wait (1, 2, 3, 5) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c	(working copy)
@@ -46,16 +46,22 @@  main (int argc, char **argv)
       abort ();
     }
 
-  fprintf (stderr, "CheCKpOInT\n");
-  if (acc_async_test (1) != 0)
+  if (acc_async_test (0) != 0)
     {
       fprintf (stderr, "asynchronous operation not running\n");
       abort ();
     }
 
+  /* Test unseen async number.  */
+  if (acc_async_test (1) != 1)
+    {
+      fprintf (stderr, "acc_async_test failed on unseen number\n");
+      abort ();
+    }
+  
   sleep (1);
 
-  if (acc_async_test (1) != 1)
+  if (acc_async_test (0) != 1)
     {
       fprintf (stderr, "found asynchronous operation still running\n");
       abort ();
@@ -65,7 +71,3 @@  main (int argc, char **argv)
 
   return 0;
 }
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c	(working copy)
@@ -72,14 +72,13 @@  main (int argc, char **argv)
       abort ();
     }
 
-  fprintf (stderr, "CheCKpOInT\n");
-  acc_wait (1);
+  acc_wait (0);
 
   gettimeofday (&tv2, NULL);
 
   t2 = ((tv2.tv_sec - tv1.tv_sec) * 1000000) + (tv2.tv_usec - tv1.tv_usec);
 
-  if (t2 > t1)
+  if (t2 - t1 > 100)
     {
       fprintf (stderr, "too long 1\n");
       abort ();
@@ -87,7 +86,7 @@  main (int argc, char **argv)
 
   gettimeofday (&tv1, NULL);
 
-  acc_wait (1);
+  acc_wait (0);
 
   gettimeofday (&tv2, NULL);
 
@@ -103,7 +102,3 @@  main (int argc, char **argv)
 
   return 0;
 }
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c	(working copy)
@@ -84,6 +84,7 @@  main (int argc, char **argv)
 
   for (i = 0; i < N; i++)
     {
+      stream = (CUstream) acc_get_cuda_stream (i & 1);
       r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, NULL, 0);
       if (r != CUDA_SUCCESS)
 	{
@@ -92,10 +93,10 @@  main (int argc, char **argv)
 	}
     }
 
-  acc_wait_async (0, 1);
-
   if (acc_async_test (0) != 0)
     abort ();
+  
+  acc_wait_async (0, 1);
 
   if (acc_async_test (1) != 0)
     abort ();
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c	(working copy)
@@ -109,7 +109,7 @@  main (int argc, char **argv)
 
   for (i = 0; i <= N; i++)
     {
-      if (acc_async_test (i) != 0)
+      if (acc_async_test (i) == 0)
 	abort ();
     }
 
Index: libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90
===================================================================
--- libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90	(revision 249620)
+++ libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90	(working copy)
@@ -1,4 +1,5 @@ 
 ! { dg-do run }
+! { dg-xfail-run-if "n/a" { openacc_host_selected } }
 
 program main
   use openacc