[gomp4.1] depend nowait support for target {update,{enter,exit} data}

Message ID 20151002192801.GA24765@msticlxl57.ims.intel.com
State New

Commit Message

Ilya Verbin Oct. 2, 2015, 7:28 p.m. UTC
Hi!

On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> nowait support for #pragma omp target is not implemented yet, supposedly we
> need to mark those somehow (some flag) already in the struct gomp_task
> structure, essentially it will need either 2 or 3 callbacks
> (the current one, executed when the dependencies are resolved (it actually
> waits until some thread schedules it after that point, I think it is
> undesirable to run it with the tasking lock held), which would perform
> the gomp_map_vars and initiate the running of the region, and then some
> query routine which would poll the plugin whether the task is done or not,
> and either perform the finalization (unmap_vars) if it is done (and in any
> case return bool whether it should be polled again or not), and if the
> finalization is not done there, also another callback for the finalization.
> Also, there is the issue that if we are waiting for task that needs to be
> polled, and we don't have any further tasks to run, we shouldn't really
> attempt to sleep on some semaphore (e.g. in taskwait, end of
> taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> call the query hook with some argument that it should sleep in there until
> the work is done by the offloading device.
> Also, there needs to be a way for the target nowait first callback to say
> that it is using host fallback and thus acts as a normal task, therefore
> once the task fn finishes, the task is done.

Here is my WIP patch.  target.c part is obviously incorrect, but it demonstrates
a possible libgomp <-> plugin interface for running a target task function
asynchronously and checking whether it is completed or not.
(Refactored liboffloadmic/runtime/emulator from trunk is required to run
target-tmp.c testcase.)



  -- Ilya

Comments

Jakub Jelinek Oct. 15, 2015, 2:01 p.m. UTC | #1
Hi!

CCing various people, because I'd like to have something that won't work on
XeonPhi only.

On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> > nowait support for #pragma omp target is not implemented yet, supposedly we
> > need to mark those somehow (some flag) already in the struct gomp_task
> > structure, essentially it will need either 2 or 3 callbacks
> > (the current one, executed when the dependencies are resolved (it actually
> > waits until some thread schedules it after that point, I think it is
> > undesirable to run it with the tasking lock held), which would perform
> > the gomp_map_vars and initiate the running of the region, and then some
> > query routine which would poll the plugin whether the task is done or not,
> > and either perform the finalization (unmap_vars) if it is done (and in any
> > case return bool whether it should be polled again or not), and if the
> > finalization is not done there, also another callback for the finalization.
> > Also, there is the issue that if we are waiting for task that needs to be
> > polled, and we don't have any further tasks to run, we shouldn't really
> > attempt to sleep on some semaphore (e.g. in taskwait, end of
> > taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> > call the query hook with some argument that it should sleep in there until
> > the work is done by the offloading device.
> > Also, there needs to be a way for the target nowait first callback to say
> > that it is using host fallback and thus acts as a normal task, therefore
> > once the task fn finishes, the task is done.
> 
> Here is my WIP patch.  target.c part is obviously incorrect, but it demonstrates
> a possible libgomp <-> plugin interface for running a target task function
> asynchronously and checking whether it is completed or not.
> (Refactored liboffloadmic/runtime/emulator from trunk is required to run
> target-tmp.c testcase.)

The difficulty is designing something that will work (if possible fast) on the
various devices we want to eventually support (at least XeonPhi, XeonPhi emul,
PTX/Cuda and HSA), ideally without too much busy waiting.

The OpenMP 4.5 spec says that there is a special "target task" on the host
side around the target region, that the "target task" is mergeable and, if
nowait is not specified, included (otherwise it may be), and that the
mapping operations (which include target device memory allocation,
refcount management and mapping data structure updates as well as the
memory copying to the target device) happen only after the (optional) dependencies
are satisfied.  After the memory mapping operations are done, the offloading
kernel starts, and when it finishes, the unmapping operations are performed
(which includes memory copying from the target device, refcount management
and mapping data structure updates, and finally memory deallocation).
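
In libgomp terms the synchronous shape of that lifecycle is roughly the
following (just a sketch to pin down the phases in terms of the existing
gomp_map_vars/gomp_unmap_vars helpers, not proposed code):

  /* Sketch only: the three phases of a "target task" once its depend()
     dependencies are satisfied.  */
  void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
  /* 1. Mapping: device allocation, refcount and splay tree updates,
     host -> device copying.  */
  struct target_mem_desc *tgt
    = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
                     GOMP_MAP_VARS_TARGET);
  /* 2. Run the offloading kernel.  */
  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt->tgt_start);
  /* 3. Unmapping: device -> host copying, refcount and mapping structure
     updates, device deallocation.  */
  gomp_unmap_vars (tgt, true);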

Right now on the OpenMP side everything is synchronous, e.g. target
enter/exit data and update are asynchronous only in that the mapping or
unmapping operation is scheduled as a task, but the whole mapping or
unmapping operations including all the above mentioned subparts are
performed while holding the particular device's lock.

To make that more asynchronous, e.g. for Cuda we might want to use Cuda
(non-default) streams, and perform the allocation, refcount management and
mapping data structure updates, and perform the data copying to device
already as part of the stream.  Except that it means that if another target
mapping/unmapping operation is enqueued at that point and it refers to any
of the affected objects, it could acquire the device lock, yet the data
copying could still be in flux.  I'm not sure whether it would be acceptable
to e.g. add some flags to the mapping data structures, saying this memory range
has either pending data transfers or enqueued data transfers that depend on
whether the refcount will become zero or not.  When mapping, if we wanted to
touch any of the regions marked with such an in_flux flag, we'd need to wait
until all of the other stream's operations finish and the unmapping
operations are performed (and the device lock released again) before
continuing.  That way we could get good performance if either concurrent
async regions touch different variables, or target data or non-async target
enter data or exit data has been put around the mappings, so the streams can
be independent; in the worst case we'd make them non-concurrent.
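
Roughly what I have in mind (hand-wavy sketch; the in_flux flag and the
stream-synchronization hook are invented, nothing like that exists today):

  /* Sketch: mark splay tree keys whose data transfers are still enqueued
     in some device stream.  */
  struct splay_tree_key_s
  {
    /* ... existing fields (host_start, host_end, tgt, refcount, ...) ... */
    bool in_flux;
  };

  /* While mapping under the device lock: if we hit an in-flux region,
     drop the lock, wait for the owning stream to drain (hypothetical
     plugin hook), and retry.  */
  if (k->in_flux)
    {
      gomp_mutex_unlock (&devicep->lock);
      devicep->synchronize_streams_func (devicep->target_id);
      gomp_mutex_lock (&devicep->lock);
      /* Re-do the lookup, the region may be gone by now.  */
    }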

Anyway, let's put the asynchronous memory data transfers (which also implies
the ability to enqueue multiple different target regions into a stream for
the device to operate on independently from the host) on the side for now
and just discuss what we want for the actual async execution and for now
keep a device lock around all the mapping or unmapping operations.

If the "target task" has unresolved dependencies, then it will use existing
task.c waiting code first (if the above is resolved somehow, there could be
exceptions of "target task" depending on another "target task").
When the dependencies are resolved, we can run the gomp_target_task_fn
callback (but not with the team's tasking lock held), which can perform
the gomp_map_vars call and start the async execution.  For host fallback,
that is all we do, the task is at this point a normal task.
For offloading task, we now want the host to continue scheduling other tasks
if there are any, which means (not currently implemented on the task.c side)
we want to move the task somewhere that we don't consider it finished, and
that we'll need to schedule it again at some point to perform the unmapping
(perhaps requeue it again in a WAITING or some other state).
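
In other words, the flow I have in mind for the offloading case is roughly
(sketch; the ASYNC_RUNNING state, the ttask->state/ttask->tgt fields and the
requeueing are not implemented yet, and it assumes FN rather than FN_ADDR is
stored in the target task):

  void
  gomp_target_task_fn (void *data)
  {
    struct gomp_target_task *ttask = (struct gomp_target_task *) data;
    struct gomp_device_descr *devicep = ttask->devicep;

    if (devicep == NULL
        || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
      {
        /* Host fallback: from here on this is a normal task.  */
        gomp_target_fallback_firstprivate (ttask->fn, ttask->mapnum,
                                           ttask->hostaddrs, ttask->sizes,
                                           ttask->kinds);
        return;
      }

    void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
    ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
                                ttask->sizes, ttask->kinds, true,
                                GOMP_MAP_VARS_TARGET);
    ttask->state = GOMP_TARGET_TASK_ASYNC_RUNNING;
    devicep->async_run_func (devicep->target_id, fn_addr,
                             (void *) ttask->tgt->tgt_start, ttask);
    /* Return without unmapping; the task is requeued in a WAITING-like
       state and gomp_unmap_vars (ttask->tgt, true) runs only once the
       device reports completion.  */
  }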

Right now, the tasking code would in that case try to schedule another task,
and if there are none or none that are runnable among the tasks of interest,
it can go to sleep and expect to be woken up when some task it is waiting for
completes.

And the main question is how to find out on the various devices whether
the async execution has completed already.

From what I can see in the liboffloadmic plugin, you have an extra host
thread that can run a callback function on the host.  Such a callback could,
say, tweak the state of the "target task", could take the team's tasking
lock, and even wake sleepers, maybe even take the device lock and perform
unmapping of vars?  The function would need to be in task.c
so that it can access everything defined in there.  Or the callback could
just change something in the "target task" state and let the tasking poll
for the change.
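
In its simplest form such a callback would do little more than this (sketch;
the ttask->state and ttask->team fields and the FINISHED value are invented):

  /* Sketch: completion callback run by the plugin's extra host thread.  */
  static void
  gomp_target_task_completion (struct gomp_target_task *ttask)
  {
    struct gomp_team *team = ttask->team;
    __atomic_store_n (&ttask->state, GOMP_TARGET_TASK_FINISHED,
                      __ATOMIC_RELEASE);
    /* Wake up threads possibly sleeping in taskwait/barrier.  */
    gomp_mutex_lock (&team->task_lock);
    gomp_team_barrier_wake (&team->barrier, 1);
    gomp_mutex_unlock (&team->task_lock);
  }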

Looking at Cuda, for async target region kernels we'd probably use
a non-default stream and enqueue the async kernel in there.  I see
we can e.g. cudaEventRecord into the stream and then either cudaEventQuery
to busy poll the event, or cudaEventSynchronize to block until the event
occurs, plus there is cudaStreamWaitEvent that perhaps might be even used to
resolve the above mentioned mapping/unmapping async issues for Cuda
- like add an event after the mapping operations that the other target tasks
could wait for if they see any in_flux stuff, and wait for an event etc.
I don't see a possibility to have something like a callback on stream
completion though, so it has to be handled with polling.  If that is true,
it means the tasking code can't go to sleep if there are any pending target
tasks (at least for devices that can't do a callback) it wants to wait for,
it would need to call the poll methods of the plugins that it wants to wait
for in a loop (unless there are no host tasks left and only a single device
is involved, in which case it could call a blocking method).
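
For reference, the event-based polling I mean would look roughly like this on
the plugin side (sketch against the CUDA runtime API, error checking omitted):

  #include <cuda_runtime.h>
  #include <stdbool.h>

  /* Sketch: record an event after the (async) transfers and kernel launch
     enqueued into STREAM for one target region.  */
  static cudaEvent_t
  record_region_done (cudaStream_t stream)
  {
    cudaEvent_t done;
    cudaEventCreateWithFlags (&done, cudaEventDisableTiming);
    /* ... host->device copies and the kernel go into STREAM before this ... */
    cudaEventRecord (done, stream);
    return done;
  }

  /* Poll hook: true once unmapping may be performed.  */
  static bool
  region_is_completed (cudaEvent_t done)
  {
    return cudaEventQuery (done) == cudaSuccess;  /* cudaErrorNotReady otherwise */
  }

  /* Blocking variant for when no other tasks are runnable.  */
  static void
  region_wait (cudaEvent_t done)
  {
    cudaEventSynchronize (done);
  }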

For HSA I have no idea.

Now, for the polling case, the question is how expensive the polling is,
whether it can be performed with the team's lock held or not.  If XeonPhi
doesn't do the full host callback, but polling, it could just read some
memory from target_task struct and thus be fast enough to run it with the
lock held.  How expensive is cudaEventQuery?

> diff --git a/libgomp/target.c b/libgomp/target.c
> index 77bd442..31f034c 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -45,6 +45,10 @@
>  #include "plugin-suffix.h"
>  #endif
>  
> +/* FIXME: TMP */
> +#include <stdio.h>
> +#include <unistd.h>

I hope you mean to remove this later on.

> @@ -1227,6 +1231,44 @@ gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
>    *thr = old_thr;
>  }
>  
> +/* Host fallback with firstprivate map-type handling.  */
> +
> +static void
> +gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> +				   void **hostaddrs, size_t *sizes,
> +				   unsigned short *kinds)
> +{
> +  size_t i, tgt_align = 0, tgt_size = 0;
> +  char *tgt = NULL;
> +  for (i = 0; i < mapnum; i++)
> +    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> +      {
> +	size_t align = (size_t) 1 << (kinds[i] >> 8);
> +	if (tgt_align < align)
> +	  tgt_align = align;
> +	tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +	tgt_size += sizes[i];
> +      }
> +  if (tgt_align)
> +    {
> +      tgt = gomp_alloca (tgt_size + tgt_align - 1);
> +      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> +      if (al)
> +	tgt += tgt_align - al;
> +      tgt_size = 0;
> +      for (i = 0; i < mapnum; i++)
> +	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> +	  {
> +	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> +	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> +	    hostaddrs[i] = tgt + tgt_size;
> +	    tgt_size = tgt_size + sizes[i];
> +	  }
> +    }
> +  gomp_target_fallback (fn, hostaddrs);
> +}

This is ok.

> +
>  /* Helper function of GOMP_target{,_41} routines.  */
>  
>  static void *
> @@ -1311,40 +1353,19 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
>    if (devicep == NULL
>        || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
>      {
> -      size_t i, tgt_align = 0, tgt_size = 0;
> -      char *tgt = NULL;
> -      for (i = 0; i < mapnum; i++)
> -	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> -	  {
> -	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> -	    if (tgt_align < align)
> -	      tgt_align = align;
> -	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> -	    tgt_size += sizes[i];
> -	  }
> -      if (tgt_align)
> -	{
> -	  tgt = gomp_alloca (tgt_size + tgt_align - 1);
> -	  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> -	  if (al)
> -	    tgt += tgt_align - al;
> -	  tgt_size = 0;
> -	  for (i = 0; i < mapnum; i++)
> -	    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> -	      {
> -		size_t align = (size_t) 1 << (kinds[i] >> 8);
> -		tgt_size = (tgt_size + align - 1) & ~(align - 1);
> -		memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> -		hostaddrs[i] = tgt + tgt_size;
> -		tgt_size = tgt_size + sizes[i];
> -	      }
> -	}
> -      gomp_target_fallback (fn, hostaddrs);
> +      gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes, kinds);
>        return;
>      }

This too.

>    void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
>  
> +  if (flags & GOMP_TARGET_FLAG_NOWAIT)
> +    {
> +      gomp_create_target_task (devicep, fn_addr, mapnum, hostaddrs, sizes,
> +			       kinds, flags, depend);
> +      return;
> +    }

But this is not ok.  You need to do this far earlier, already before the
if (depend != NULL) code in GOMP_target_41.  And, I think you should just
not pass fn_addr, but fn itself.
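
I.e. something along these lines at the top of GOMP_target_41 (sketch):

  if (flags & GOMP_TARGET_FLAG_NOWAIT)
    {
      /* Let the tasking code handle the depend() clauses; pass FN itself so
         that both host fallback and the device lookup of the address can be
         done later in gomp_target_task_fn.  */
      gomp_create_target_task (devicep, fn, mapnum, hostaddrs, sizes,
                               kinds, flags, depend);
      return;
    }

  /* ... the existing "if (depend != NULL)" waiting code and the synchronous
     path follow here ... */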

> @@ -1636,34 +1657,58 @@ void
>  gomp_target_task_fn (void *data)
>  {
>    struct gomp_target_task *ttask = (struct gomp_target_task *) data;
> +  struct gomp_device_descr *devicep = ttask->devicep;
> +
>    if (ttask->fn != NULL)
>      {
> -      /* GOMP_target_41 */
> +      if (devicep == NULL
> +	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
> +	{
> +	  /* FIXME: Save host fn addr into gomp_target_task?  */
> +	  gomp_target_fallback_firstprivate (NULL, ttask->mapnum,

If you pass above fn instead of fn_addr, ttask->fn is what you want
to pass to gomp_target_fallback_firstprivate here and remove the FIXME.

> +					     ttask->hostaddrs, ttask->sizes,
> +					     ttask->kinds);
> +	  return;
> +	}
> +
> +      struct target_mem_desc *tgt_vars
> +	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
> +			 ttask->sizes, ttask->kinds, true,
> +			 GOMP_MAP_VARS_TARGET);
> +      devicep->async_run_func (devicep->target_id, ttask->fn,
> +			       (void *) tgt_vars->tgt_start, data);

You need to void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
first obviously, and pass fn_addr.

> +
> +      /* FIXME: TMP example of checking for completion.
> +	 Alternatively the plugin can set some completion flag in ttask.  */
> +      while (!devicep->async_is_completed_func (devicep->target_id, data))
> +	{
> +	  fprintf (stderr, "-");
> +	  usleep (100000);
> +	}

This obviously doesn't belong here.

>    if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> diff --git a/libgomp/testsuite/libgomp.c/target-tmp.c b/libgomp/testsuite/libgomp.c/target-tmp.c
> new file mode 100644
> index 0000000..23a739c
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.c/target-tmp.c
> @@ -0,0 +1,40 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#pragma omp declare target
> +void foo (int n)
> +{
> +  printf ("Start tgt %d\n", n);
> +  usleep (5000000);

5s is too long.  Not to mention that I'm not sure if PTX can do printf
and especially usleep.

> diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> index 26ac6fe..c843710 100644
> --- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> +++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
...
> +/* Set of asynchronously running target tasks.  */
> +static std::set<const void *> *async_tasks;
> +
>  /* Thread-safe registration of the main image.  */
>  static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
>  
> +/* Mutex for protecting async_tasks.  */
> +static pthread_mutex_t async_tasks_lock = PTHREAD_MUTEX_INITIALIZER;
> +
>  static VarDesc vd_host2tgt = {
>    { 1, 1 },		      /* dst, src			      */
>    { 1, 0 },		      /* in, out			      */
> @@ -156,6 +163,8 @@ init (void)
>  
>  out:
>    address_table = new ImgDevAddrMap;
> +  async_tasks = new std::set<const void *>;
> +  pthread_mutex_init (&async_tasks_lock, NULL);

PTHREAD_MUTEX_INITIALIZER should already initialize the lock.
But, do you really need async_tasks and the lock?  Better to store
something into some plugin-owned field in the target_task struct and
let the plugin callback be passed the address of that field rather than the
whole target_task?
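
E.g. something like (sketch; the field name is made up):

  /* Sketch: one plugin-owned word in gomp_target_task instead of the
     async_tasks set plus lock.  */
  struct gomp_target_task
  {
    /* ... existing fields ... */
    uintptr_t plugin_done;   /* 0 = still running, 1 = completed.  */
  };

  /* libgomp would then pass &ttask->plugin_done to the async hooks, and the
     liboffloadmic completion callback would only need:  */
  __atomic_store_n ((uintptr_t *) async_data, 1, __ATOMIC_RELEASE);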

> diff --git a/liboffloadmic/runtime/offload_host.cpp b/liboffloadmic/runtime/offload_host.cpp
> index 08f626f..8cee12c 100644
> --- a/liboffloadmic/runtime/offload_host.cpp
> +++ b/liboffloadmic/runtime/offload_host.cpp
> @@ -64,6 +64,9 @@ static void __offload_fini_library(void);
>  #define GET_OFFLOAD_NUMBER(timer_data) \
>      timer_data? timer_data->offload_number : 0
>  
> +extern "C" void
> +__gomp_offload_intelmic_async_completed (const void *);
> +
>  extern "C" {
>  #ifdef TARGET_WINNT
>  // Windows does not support imports from libraries without actually
> @@ -2507,7 +2510,7 @@ extern "C" {
>          const void *info
>      )
>      {
> -	/* TODO: Call callback function, pass info.  */
> +	__gomp_offload_intelmic_async_completed (info);
>      }
>  }

Is this for the emul only, or KNL only, or both?
In any case, I'm not sure how this works: this is in liboffloadmic.so and
the function is defined in the plugin?

	Jakub
Alexander Monakov Oct. 15, 2015, 4:18 p.m. UTC | #2
On Thu, 15 Oct 2015, Jakub Jelinek wrote:
> Looking at Cuda, for async target region kernels we'd probably use
> a non-default stream and enqueue the async kernel in there.  I see
> we can e.g. cudaEventRecord into the stream and then either cudaEventQuery
> to busy poll the event, or cudaEventSynchronize to block until the event
> occurs, plus there is cudaStreamWaitEvent that perhaps might be even used to
> resolve the above mentioned mapping/unmapping async issues for Cuda
> - like add an event after the mapping operations that the other target tasks
> could wait for if they see any in_flux stuff, and wait for an event etc.
> I don't see a possibility to have something like a callback on stream
> completion though, so it has to be handled with polling.

Not sure why you say so.  There's cu[da]StreamAddCallback, which exists
exactly for registering a completion callback, but there are restrictions:

  - this functionality doesn't currently work through CUDA MPS ("multi-process
    server", for funneling CUDA calls from different processes through a
    single "server" process, avoiding context-switch overhead on the device,
    sometimes used for CUDA-with-MPI applications);

  - it is explicitly forbidden to invoke CUDA API calls from the callback;
    perhaps understandable, as the callback may be running in a signal-handler
    context (unlikely), or, more plausibly, in a different thread than the one
    that registered the callback.

Ideally we'd queue all accelerator work up front via
EventRecord/StreamWaitEvent, and not rely on callbacks.  If host-side work
must be done on completion, we could spawn a helper thread waiting on
cudaEventSynchronize.
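
For completeness, registration would look roughly like this (sketch; note that
the callback itself must not call back into CUDA):

  #include <cuda_runtime.h>

  /* Sketch: run when all preceding work in STREAM has completed; may run in
     a different thread, so only flip a completion word here.  */
  static void CUDART_CB
  region_done_cb (cudaStream_t stream, cudaError_t status, void *user_data)
  {
    __atomic_store_n ((int *) user_data, 1, __ATOMIC_RELEASE);
  }

  static void
  register_completion (cudaStream_t stream, int *completion_word)
  {
    cudaStreamAddCallback (stream, region_done_cb, completion_word, 0);
  }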

> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.c/target-tmp.c
> > @@ -0,0 +1,40 @@
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +
> > +#pragma omp declare target
> > +void foo (int n)
> > +{
> > +  printf ("Start tgt %d\n", n);
> > +  usleep (5000000);
> 
> 5s is too long.  Not to mention that not sure if PTX can do printf
> and especially usleep.

printf is available, usleep is not (but presumably use of usleep needs to be
revised anyway)

Alexander
Ilya Verbin Oct. 15, 2015, 4:42 p.m. UTC | #3
On Thu, Oct 15, 2015 at 16:01:56 +0200, Jakub Jelinek wrote:
> On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> > Here is my WIP patch.  target.c part is obviously incorrect, but it demonstrates
> > a possible libgomp <-> plugin interface for running a target task function
> > asynchronously and checking whether it is completed or not.
> > (Refactored liboffloadmic/runtime/emulator from trunk is required to run
> > target-tmp.c testcase.)
> 
> > diff --git a/libgomp/target.c b/libgomp/target.c
> > index 77bd442..31f034c 100644
> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> > @@ -45,6 +45,10 @@
> >  #include "plugin-suffix.h"
> >  #endif
> >  
> > +/* FIXME: TMP */
> > +#include <stdio.h>
> > +#include <unistd.h>
> 
> I hope you mean to remove this later on.

Sure, this is just a prototype, not for committing.


> > @@ -1227,6 +1231,44 @@ gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
> >    *thr = old_thr;
> >  }
> >  
> > +/* Host fallback with firstprivate map-type handling.  */
> > +
> > +static void
> > +gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> > +				   void **hostaddrs, size_t *sizes,
> > +				   unsigned short *kinds)
> > +{
> > +  size_t i, tgt_align = 0, tgt_size = 0;
> > +  char *tgt = NULL;
> > +  for (i = 0; i < mapnum; i++)
> > +    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> > +      {
> > +	size_t align = (size_t) 1 << (kinds[i] >> 8);
> > +	if (tgt_align < align)
> > +	  tgt_align = align;
> > +	tgt_size = (tgt_size + align - 1) & ~(align - 1);
> > +	tgt_size += sizes[i];
> > +      }
> > +  if (tgt_align)
> > +    {
> > +      tgt = gomp_alloca (tgt_size + tgt_align - 1);
> > +      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> > +      if (al)
> > +	tgt += tgt_align - al;
> > +      tgt_size = 0;
> > +      for (i = 0; i < mapnum; i++)
> > +	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> > +	  {
> > +	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> > +	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> > +	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> > +	    hostaddrs[i] = tgt + tgt_size;
> > +	    tgt_size = tgt_size + sizes[i];
> > +	  }
> > +    }
> > +  gomp_target_fallback (fn, hostaddrs);
> > +}
> 
> This is ok.
> 
> >  /* Helper function of GOMP_target{,_41} routines.  */
> >  
> >  static void *
> > @@ -1311,40 +1353,19 @@ GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
> >    if (devicep == NULL
> >        || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
> >      {
> > -      size_t i, tgt_align = 0, tgt_size = 0;
> > -      char *tgt = NULL;
> > -      for (i = 0; i < mapnum; i++)
> > -	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> > -	  {
> > -	    size_t align = (size_t) 1 << (kinds[i] >> 8);
> > -	    if (tgt_align < align)
> > -	      tgt_align = align;
> > -	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
> > -	    tgt_size += sizes[i];
> > -	  }
> > -      if (tgt_align)
> > -	{
> > -	  tgt = gomp_alloca (tgt_size + tgt_align - 1);
> > -	  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> > -	  if (al)
> > -	    tgt += tgt_align - al;
> > -	  tgt_size = 0;
> > -	  for (i = 0; i < mapnum; i++)
> > -	    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
> > -	      {
> > -		size_t align = (size_t) 1 << (kinds[i] >> 8);
> > -		tgt_size = (tgt_size + align - 1) & ~(align - 1);
> > -		memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
> > -		hostaddrs[i] = tgt + tgt_size;
> > -		tgt_size = tgt_size + sizes[i];
> > -	      }
> > -	}
> > -      gomp_target_fallback (fn, hostaddrs);
> > +      gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes, kinds);
> >        return;
> >      }
> 
> This too.

I will commit this small part to gomp-4_5-branch separately.


> > diff --git a/libgomp/testsuite/libgomp.c/target-tmp.c b/libgomp/testsuite/libgomp.c/target-tmp.c
> > new file mode 100644
> > index 0000000..23a739c
> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.c/target-tmp.c
> > @@ -0,0 +1,40 @@
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +
> > +#pragma omp declare target
> > +void foo (int n)
> > +{
> > +  printf ("Start tgt %d\n", n);
> > +  usleep (5000000);
> 
> 5s is too long.  Not to mention that not sure if PTX can do printf
> and especially usleep.

This testcase is also for demonstration only.


> > diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> > index 26ac6fe..c843710 100644
> > --- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> > +++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
> ...
> > +/* Set of asynchronously running target tasks.  */
> > +static std::set<const void *> *async_tasks;
> > +
> >  /* Thread-safe registration of the main image.  */
> >  static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
> >  
> > +/* Mutex for protecting async_tasks.  */
> > +static pthread_mutex_t async_tasks_lock = PTHREAD_MUTEX_INITIALIZER;
> > +
> >  static VarDesc vd_host2tgt = {
> >    { 1, 1 },		      /* dst, src			      */
> >    { 1, 0 },		      /* in, out			      */
> > @@ -156,6 +163,8 @@ init (void)
> >  
> >  out:
> >    address_table = new ImgDevAddrMap;
> > +  async_tasks = new std::set<const void *>;
> > +  pthread_mutex_init (&async_tasks_lock, NULL);
> 
> PTHREAD_MUTEX_INITIALIZER should already initialize the lock.
> But, do you really need async_tasks and the lock?  Better store
> something into some plugin's owned field in target_task struct and
> let the plugin callback be passed address of that field rather than the
> whole target_task?

OK, that should work.


> > diff --git a/liboffloadmic/runtime/offload_host.cpp b/liboffloadmic/runtime/offload_host.cpp
> > index 08f626f..8cee12c 100644
> > --- a/liboffloadmic/runtime/offload_host.cpp
> > +++ b/liboffloadmic/runtime/offload_host.cpp
> > @@ -64,6 +64,9 @@ static void __offload_fini_library(void);
> >  #define GET_OFFLOAD_NUMBER(timer_data) \
> >      timer_data? timer_data->offload_number : 0
> >  
> > +extern "C" void
> > +__gomp_offload_intelmic_async_completed (const void *);
> > +
> >  extern "C" {
> >  #ifdef TARGET_WINNT
> >  // Windows does not support imports from libraries without actually
> > @@ -2507,7 +2510,7 @@ extern "C" {
> >          const void *info
> >      )
> >      {
> > -	/* TODO: Call callback function, pass info.  */
> > +	__gomp_offload_intelmic_async_completed (info);
> >      }
> >  }
> 
> Is this for the emul only, or KNL only, or both?

This is for both.  liboffloadmic doesn't know whether the target process is running
on a real KNL or on the host using the emulator; only the underlying libcoi matters.

> In any case, not sure how it works, this is in liboffloadmic.so and
> the function defined in the plugin?

Yes, this is in liboffloadmic_host.so, and the function is defined in the
plugin.  Both are loaded into the host process.
We can replace it with a callback directly into libgomp, if needed.

  -- Ilya
Jakub Jelinek Oct. 15, 2015, 5:18 p.m. UTC | #4
On Thu, Oct 15, 2015 at 07:18:53PM +0300, Alexander Monakov wrote:
> On Thu, 15 Oct 2015, Jakub Jelinek wrote:
> > Looking at Cuda, for async target region kernels we'd probably use
> > a non-default stream and enqueue the async kernel in there.  I see
> > we can e.g. cudaEventRecord into the stream and then either cudaEventQuery
> > to busy poll the event, or cudaEventSynchronize to block until the event
> > occurs, plus there is cudaStreamWaitEvent that perhaps might be even used to
> > resolve the above mentioned mapping/unmapping async issues for Cuda
> > - like add an event after the mapping operations that the other target tasks
> > could wait for if they see any in_flux stuff, and wait for an event etc.
> > I don't see a possibility to have something like a callback on stream
> > completion though, so it has to be handled with polling.
> 
> Not sure why you say so.  There's cu[da]StreamAddCallback, which exists
> exactly for registering completion callback, but there are restrictions:

Ah, thanks.

>   - this functionality doesn't currently work through CUDA MPS ("multi-process
>     server", for funneling CUDA calls from different processes through a
>     single "server" process, avoiding context-switch overhead on the device,
>     sometimes used for CUDA-with-MPI applications);

That shouldn't be an issue for the OpenMP 4.5 / PTX offloading, right?

>   - it is explicitely forbidden to invoke CUDA API calls from the callback;
>     perhaps understandable, as the callback may be running in a signal-handler
>     context (unlikely), or, more plausibly, in a different thread than the one
>     that registered the callback.

So, is it run from async signal handlers, or just could be?
If all we need to achieve is just to change some word in the target_task struct,
then it should be enough to just asynchronously memcpy the value there,
or e.g. use the events.  If we need to also gomp_sem_post, then for
config/linux/ that is also something that can be done from async signal
contexts, but not for other OSes (but perhaps we could just not go to sleep
on those OSes if there are pending offloading tasks).
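
I.e. all the completion context would need to do in the simplest scheme is
something like (sketch; the completed field and the semaphore are illustrative
only):

  static void
  target_task_completed (struct gomp_target_task *ttask, gomp_sem_t *sem)
  {
    /* Plain atomic store, async-signal-safe everywhere.  */
    __atomic_store_n (&ttask->completed, 1, __ATOMIC_RELEASE);
    /* Async-signal-safe on config/linux (futex based), not necessarily on
       other OSes.  */
    gomp_sem_post (sem);
  }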

> Ideally we'd queue all accelerator work up front via
> EventRecord/StreamWaitEvent, and not rely on callbacks.

> If host-side work
> must be done on completion, we could spawn a helper thread waiting on
> cudaEventSynchronize.

Spawning a helper thread is very expensive and we need something to be done
upon completion pretty much always.  Perhaps we can optimize and somehow
deal with merging multiple async tasks that are waiting on each other, but
the user could have intermixed the offloading tasks with host tasks and have
dependencies in between them, plus there are all the various spots where the
user wants to wait for both host and offloading tasks, or e.g. offloading
tasks from two different devices, or multiple offloading tasks from the same
device (multiple streams), etc.

	Jakub
Alexander Monakov Oct. 15, 2015, 6:11 p.m. UTC | #5
On Thu, 15 Oct 2015, Jakub Jelinek wrote:
> >   - this functionality doesn't currently work through CUDA MPS ("multi-process
> >     server", for funneling CUDA calls from different processes through a
> >     single "server" process, avoiding context-switch overhead on the device,
> >     sometimes used for CUDA-with-MPI applications);
> 
> That shouldn't be an issue for the OpenMP 4.5 / PTX offloading, right?

I think it can be an issue for applications employing MPI for (coarse-grain)
parallelism and OpenMP for simd/offloading.  It can be a non-issue if PTX
offloading conflicts with MPS in some other way, but at the moment I'm not
aware of such (as long as dynamic parallelism is not a hard requirement).

> >   - it is explicitely forbidden to invoke CUDA API calls from the callback;
> >     perhaps understandable, as the callback may be running in a signal-handler
> >     context (unlikely), or, more plausibly, in a different thread than the one
> >     that registered the callback.
> 
> So, is it run from async signal handlers, or just could be?

The documentation doesn't say.  I could find out experimentally, but that would
only tell how the current implementation behaves; it could change in the
future.  Like I said in the quote, I expect it runs asynchronously in a
different thread, rather than in an async signal context.

> Spawning a helper thread is very expensive and we need something to be done
> upon completion pretty much always.  Perhaps we can optimize and somehow
> deal with merging multiple async tasks that are waiting on each other, but
> the user could have intermixed the offloading tasks with host tasks and have
> dependencies in between them, plus there are all the various spots where
> user wants to wait for both host and offloading tasks, or e.g. offloading
> tasks from two different devices, or multiple offloading tasks from the same
> devices (multiple streams), etc.

I think we should avoid involving the host in "reasonable" cases, and for the
rest just have something minimally acceptable (either with callbacks, or
polling).

Alexander
Martin Jambor Oct. 16, 2015, 11:32 a.m. UTC | #6
Hi,

On Thu, Oct 15, 2015 at 04:01:56PM +0200, Jakub Jelinek wrote:
> Hi!
> 
> CCing various people, because I'd like to have something that won't work on
> XeonPhi only.

thanks.  However, I have not paid too much attention to OMP tasks
yet.  Nevertheless, let me try to answer some of the questions.

> 
> On Fri, Oct 02, 2015 at 10:28:01PM +0300, Ilya Verbin wrote:
> > On Tue, Sep 08, 2015 at 11:20:14 +0200, Jakub Jelinek wrote:
> > > nowait support for #pragma omp target is not implemented yet, supposedly we
> > > need to mark those somehow (some flag) already in the struct gomp_task
> > > structure, essentially it will need either 2 or 3 callbacks
> > > (the current one, executed when the dependencies are resolved (it actually
> > > waits until some thread schedules it after that point, I think it is
> > > undesirable to run it with the tasking lock held), which would perform
> > > the gomp_map_vars and initiate the running of the region, and then some
> > > query routine which would poll the plugin whether the task is done or not,
> > > and either perform the finalization (unmap_vars) if it is done (and in any
> > > case return bool whether it should be polled again or not), and if the
> > > finalization is not done there, also another callback for the finalization.
> > > Also, there is the issue that if we are waiting for task that needs to be
> > > polled, and we don't have any further tasks to run, we shouldn't really
> > > attempt to sleep on some semaphore (e.g. in taskwait, end of
> > > taskgroup, etc.) or barrier, but rather either need to keep polling it, or
> > > call the query hook with some argument that it should sleep in there until
> > > the work is done by the offloading device.
> > > Also, there needs to be a way for the target nowait first callback to say
> > > that it is using host fallback and thus acts as a normal task, therefore
> > > once the task fn finishes, the task is done.
> > 
> > Here is my WIP patch.  target.c part is obviously incorrect, but it demonstrates
> > a possible libgomp <-> plugin interface for running a target task function
> > asynchronously and checking whether it is completed or not.
> > (Refactored liboffloadmic/runtime/emulator from trunk is required to run
> > target-tmp.c testcase.)
> 
> The difficulty is designing something that will work (if possible fast) on the
> various devices we want to eventually support (at least XeonPhi, XeonPhi emul,
> PTX/Cuda and HSA), ideally without too much busy waiting.
> 
> The OpenMP 4.5 spec says that there is a special "target task" on the host
> side around the target region, and that the "target task" is mergeable and
> if nowait is not specified is included (otherwise it may be), and that the
> mapping operations (which include target device memory allocation,
> refcount management and mapping data structure updates as well as the
> memory copying to target device) happens only after the (optional) dependencies
> are satisfied.  After the memory mapping operations are done, the offloading
> kernel starts, and when it finishes, the unmapping operations are performed
> (which includes memory copying from the target device, refcount management
> and mapping data structure updates, and finally memory deallocation).
> 
> Right now on the OpenMP side everything is synchronous, e.g. target
> enter/exit data and update are asynchronous only in that the mapping or
> unmapping operation is scheduled as a task, but the whole mapping or
> unmapping operations including all the above mentioned subparts are
> performed while holding the particular device's lock.

Memory mapping and unmapping is a no-op on HSA, so this is fortunately
not a concern for us.  (I'm assuming that ref-counting is also something
device-specific and not part of running a task here.)

> Anyway, let's put the asynchronous memory data transfers (which also implies
> the ability to enqueue multiple different target regions into a stream for
> the device to operate on independently from the host) on the side for now
> and just discuss what we want for the actual async execution and for now
> keep a device lock around all the mapping or unmapping operations.
> 
> If the "target task" has unresolved dependencies, then it will use existing
> task.c waiting code first (if the above is resolved somehow, there could be
> exceptions of "target task" depending on another "target task").
> When the dependencies are resolved, we can run the gomp_target_task_fn
> callback (but not with the team's tasking lock held), which can perform
> the gomp_map_vars call and start the async execution.  For host fallback,
> that is all we do, the task is at this point a normal task.
> For offloading task, we now want the host to continue scheduling other tasks
> if there are any, which means (not currently implemented on the task.c side)
> we want to move the task somewhere that we don't consider it finished, and
> that we'll need to schedule it again at some point to perform the unmapping
> (perhaps requeue it again in a WAITING or some other state).
> 
> Right now, the tasking code would in that case try to schedule another task,
> and if there are none or none that are runnable among the tasks of interest,
> it can go to sleep and expect to be awaken when some task it is waiting for
> is awaken.
> 
> And the main question is how to find out on the various devices whether
> the async execution has completed already.
> 
> From what I can see in the liboffloadmic plugin, you have an extra host
> thread that can run a callback function on the host.  Such a callback could
> say tweak the state of the "target task", could take the team's tasking
> lock, and even awake sleepers, maybe even take the device lock and perform
> unmapping of vars?  The function would need to be in task.c
> so that it can access everything defined in there.  Or the callback could
> just change something in the "target task" state and let the tasking poll
> for the change.
> 

...

> 
> For HSA I have no idea.
> 

In HSA, task completion is signaled via so-called "signals," which are
basically (long) integers that you can atomically read/write (etc.) with
the given API, and on which you can wait until a specified condition (eq,
ne, le, gte) happens.  Atomic reading should be very cheap.  I do not see
a way to wait on multiple signals but we can arrange it so that
completions of a number of kernels are communicated with a single
signal.
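
For reference, checking completion would then be roughly (sketch against the
HSA run-time API; the real plugin code will differ):

  #include <hsa.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Sketch: SIG starts at the number of outstanding kernels and is
     decremented by each completing dispatch packet.  */
  static bool
  kernels_done (hsa_signal_t sig, bool can_block)
  {
    if (!can_block)
      /* Cheap atomic read, suitable for polling from the tasking code.  */
      return hsa_signal_load_acquire (sig) == 0;
    /* Blocking wait for when nothing else is runnable.  */
    hsa_signal_wait_acquire (sig, HSA_SIGNAL_CONDITION_EQ, 0,
                             UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    return true;
  }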

At the moment we wait and do not create any special servicing threads
in our libgomp plugin and, as far as I know, the run-time itself does not
offer a way of registering a callback to announce kernel completion.

So polling is certainly a possibility, and a blocking wait if HSA task(s)
are the last ones we wait for is also simple.  Sleeping until either a
CPU or a HSA task completes might be tricky.

I hope this helps,

Martin

Patch

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index d798321..8e2b5aa 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -872,6 +872,8 @@  struct gomp_device_descr
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void *(*dev2dev_func) (int, void *, const void *, size_t);
   void (*run_func) (int, void *, void *);
+  void (*async_run_func) (int, void *, void *, const void *);
+  bool (*async_is_completed_func) (int, const void *);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;
diff --git a/libgomp/target.c b/libgomp/target.c
index 77bd442..31f034c 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -45,6 +45,10 @@ 
 #include "plugin-suffix.h"
 #endif
 
+/* FIXME: TMP */
+#include <stdio.h>
+#include <unistd.h>
+
 static void gomp_target_init (void);
 
 /* The whole initialization code for offloading plugins is only run one.  */
@@ -1227,6 +1231,44 @@  gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
   *thr = old_thr;
 }
 
+/* Host fallback with firstprivate map-type handling.  */
+
+static void
+gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
+				   void **hostaddrs, size_t *sizes,
+				   unsigned short *kinds)
+{
+  size_t i, tgt_align = 0, tgt_size = 0;
+  char *tgt = NULL;
+  for (i = 0; i < mapnum; i++)
+    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+      {
+	size_t align = (size_t) 1 << (kinds[i] >> 8);
+	if (tgt_align < align)
+	  tgt_align = align;
+	tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	tgt_size += sizes[i];
+      }
+  if (tgt_align)
+    {
+      tgt = gomp_alloca (tgt_size + tgt_align - 1);
+      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
+      if (al)
+	tgt += tgt_align - al;
+      tgt_size = 0;
+      for (i = 0; i < mapnum; i++)
+	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
+	  {
+	    size_t align = (size_t) 1 << (kinds[i] >> 8);
+	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	    memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
+	    hostaddrs[i] = tgt + tgt_size;
+	    tgt_size = tgt_size + sizes[i];
+	  }
+    }
+  gomp_target_fallback (fn, hostaddrs);
+}
+
 /* Helper function of GOMP_target{,_41} routines.  */
 
 static void *
@@ -1311,40 +1353,19 @@  GOMP_target_41 (int device, void (*fn) (void *), size_t mapnum,
   if (devicep == NULL
       || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
     {
-      size_t i, tgt_align = 0, tgt_size = 0;
-      char *tgt = NULL;
-      for (i = 0; i < mapnum; i++)
-	if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
-	  {
-	    size_t align = (size_t) 1 << (kinds[i] >> 8);
-	    if (tgt_align < align)
-	      tgt_align = align;
-	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	    tgt_size += sizes[i];
-	  }
-      if (tgt_align)
-	{
-	  tgt = gomp_alloca (tgt_size + tgt_align - 1);
-	  uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
-	  if (al)
-	    tgt += tgt_align - al;
-	  tgt_size = 0;
-	  for (i = 0; i < mapnum; i++)
-	    if ((kinds[i] & 0xff) == GOMP_MAP_FIRSTPRIVATE)
-	      {
-		size_t align = (size_t) 1 << (kinds[i] >> 8);
-		tgt_size = (tgt_size + align - 1) & ~(align - 1);
-		memcpy (tgt + tgt_size, hostaddrs[i], sizes[i]);
-		hostaddrs[i] = tgt + tgt_size;
-		tgt_size = tgt_size + sizes[i];
-	      }
-	}
-      gomp_target_fallback (fn, hostaddrs);
+      gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes, kinds);
       return;
     }
 
   void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
 
+  if (flags & GOMP_TARGET_FLAG_NOWAIT)
+    {
+      gomp_create_target_task (devicep, fn_addr, mapnum, hostaddrs, sizes,
+			       kinds, flags, depend);
+      return;
+    }
+
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
 		     GOMP_MAP_VARS_TARGET);
@@ -1636,34 +1657,58 @@  void
 gomp_target_task_fn (void *data)
 {
   struct gomp_target_task *ttask = (struct gomp_target_task *) data;
+  struct gomp_device_descr *devicep = ttask->devicep;
+
   if (ttask->fn != NULL)
     {
-      /* GOMP_target_41 */
+      if (devicep == NULL
+	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	{
+	  /* FIXME: Save host fn addr into gomp_target_task?  */
+	  gomp_target_fallback_firstprivate (NULL, ttask->mapnum,
+					     ttask->hostaddrs, ttask->sizes,
+					     ttask->kinds);
+	  return;
+	}
+
+      struct target_mem_desc *tgt_vars
+	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
+			 ttask->sizes, ttask->kinds, true,
+			 GOMP_MAP_VARS_TARGET);
+      devicep->async_run_func (devicep->target_id, ttask->fn,
+			       (void *) tgt_vars->tgt_start, data);
+
+      /* FIXME: TMP example of checking for completion.
+	 Alternatively the plugin can set some completion flag in ttask.  */
+      while (!devicep->async_is_completed_func (devicep->target_id, data))
+	{
+	  fprintf (stderr, "-");
+	  usleep (100000);
+	}
     }
-  else if (ttask->devicep == NULL
-	   || !(ttask->devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+  else if (devicep == NULL
+	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
     return;
 
   size_t i;
   if (ttask->flags & GOMP_TARGET_FLAG_UPDATE)
-    gomp_update (ttask->devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
+    gomp_update (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
 		 ttask->kinds, true);
   else if ((ttask->flags & GOMP_TARGET_FLAG_EXIT_DATA) == 0)
     for (i = 0; i < ttask->mapnum; i++)
       if ((ttask->kinds[i] & 0xff) == GOMP_MAP_STRUCT)
 	{
-	  gomp_map_vars (ttask->devicep, ttask->sizes[i] + 1,
-			 &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
-			 &ttask->kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
+	  gomp_map_vars (devicep, ttask->sizes[i] + 1, &ttask->hostaddrs[i],
+			 NULL, &ttask->sizes[i], &ttask->kinds[i], true,
+			 GOMP_MAP_VARS_ENTER_DATA);
 	  i += ttask->sizes[i];
 	}
       else
-	gomp_map_vars (ttask->devicep, 1, &ttask->hostaddrs[i], NULL,
-		       &ttask->sizes[i], &ttask->kinds[i],
-		       true, GOMP_MAP_VARS_ENTER_DATA);
+	gomp_map_vars (devicep, 1, &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
+		       &ttask->kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
   else
-    gomp_exit_data (ttask->devicep, ttask->mapnum, ttask->hostaddrs,
-		    ttask->sizes, ttask->kinds);
+    gomp_exit_data (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
+		    ttask->kinds);
 }
 
 void
@@ -2108,6 +2153,8 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {
       DLSYM (run);
+      DLSYM (async_run);
+      DLSYM (async_is_completed);
       DLSYM (dev2dev);
     }
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
diff --git a/libgomp/testsuite/libgomp.c/target-tmp.c b/libgomp/testsuite/libgomp.c/target-tmp.c
new file mode 100644
index 0000000..23a739c
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/target-tmp.c
@@ -0,0 +1,40 @@ 
+#include <stdio.h>
+#include <unistd.h>
+
+#pragma omp declare target
+void foo (int n)
+{
+  printf ("Start tgt %d\n", n);
+  usleep (5000000);
+  printf ("End tgt %d\n", n);
+}
+#pragma omp end declare target
+
+int x, y, z;
+
+int main ()
+{
+  #pragma omp parallel
+  #pragma omp single
+    {
+      #pragma omp task depend(out: x)
+      printf ("Host task\n");
+
+      #pragma omp target nowait depend(in: x) depend(out: y)
+      foo (1);
+
+      #pragma omp target nowait depend(in: y)
+      foo (2);
+
+      #pragma omp target nowait depend(in: y)
+      foo (3);
+
+      while (1)
+	{
+	  usleep (333333);
+	  fprintf (stderr, ".");
+	}
+    }
+
+  return 0;
+}
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index 26ac6fe..c843710 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -34,6 +34,7 @@ 
 #include <string.h>
 #include <utility>
 #include <vector>
+#include <set>
 #include <map>
 #include "libgomp-plugin.h"
 #include "compiler_if_host.h"
@@ -76,9 +77,15 @@  static int num_images;
    second key is number of device.  Contains a vector of pointer pairs.  */
 static ImgDevAddrMap *address_table;
 
+/* Set of asynchronously running target tasks.  */
+static std::set<const void *> *async_tasks;
+
 /* Thread-safe registration of the main image.  */
 static pthread_once_t main_image_is_registered = PTHREAD_ONCE_INIT;
 
+/* Mutex for protecting async_tasks.  */
+static pthread_mutex_t async_tasks_lock = PTHREAD_MUTEX_INITIALIZER;
+
 static VarDesc vd_host2tgt = {
   { 1, 1 },		      /* dst, src			      */
   { 1, 0 },		      /* in, out			      */
@@ -156,6 +163,8 @@  init (void)
 
 out:
   address_table = new ImgDevAddrMap;
+  async_tasks = new std::set<const void *>;
+  pthread_mutex_init (&async_tasks_lock, NULL);
   num_devices = _Offload_number_of_devices ();
 }
 
@@ -192,11 +201,27 @@  GOMP_OFFLOAD_get_num_devices (void)
 
 static void
 offload (const char *file, uint64_t line, int device, const char *name,
-	 int num_vars, VarDesc *vars, VarDesc2 *vars2)
+	 int num_vars, VarDesc *vars, VarDesc2 *vars2, const void *async_data)
 {
   OFFLOAD ofld = __offload_target_acquire1 (&device, file, line);
   if (ofld)
-    __offload_offload1 (ofld, name, 0, num_vars, vars, vars2, 0, NULL, NULL);
+    {
+      if (async_data == NULL)
+	__offload_offload1 (ofld, name, 0, num_vars, vars, vars2, 0, NULL,
+			    NULL);
+      else
+	{
+	  pthread_mutex_lock (&async_tasks_lock);
+	  async_tasks->insert (async_data);
+	  pthread_mutex_unlock (&async_tasks_lock);
+
+	  OffloadFlags flags;
+	  flags.flags = 0;
+	  flags.bits.omp_async = 1;
+	  __offload_offload3 (ofld, name, 0, num_vars, vars, NULL, 0, NULL,
+			      (const void **) async_data, 0, NULL, flags, NULL);
+	}
+    }
   else
     {
       fprintf (stderr, "%s:%d: Offload target acquire failed\n", file, line);
@@ -218,7 +243,7 @@  GOMP_OFFLOAD_init_device (int device)
   TRACE ("");
   pthread_once (&main_image_is_registered, register_main_image);
   offload (__FILE__, __LINE__, device, "__offload_target_init_proc", 0,
-	   NULL, NULL);
+	   NULL, NULL, NULL);
 }
 
 extern "C" void
@@ -240,7 +265,7 @@  get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
   VarDesc2 vd1g[2] = { { "num_funcs", 0 }, { "num_vars", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_table_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   int table_size = num_funcs + 2 * num_vars;
   if (table_size > 0)
@@ -254,7 +279,7 @@  get_target_table (int device, int &num_funcs, int &num_vars, void **&table)
       VarDesc2 vd2g = { "table", 0 };
 
       offload (__FILE__, __LINE__, device, "__offload_target_table_p2", 1,
-	       &vd2, &vd2g);
+	       &vd2, &vd2g, NULL);
     }
 }
 
@@ -401,8 +426,8 @@  GOMP_OFFLOAD_alloc (int device, size_t size)
   vd1[1].size = sizeof (void *);
   VarDesc2 vd1g[2] = { { "size", 0 }, { "tgt_ptr", 0 } };
 
-  offload (__FILE__, __LINE__, device, "__offload_target_alloc", 2, vd1, vd1g);
-
+  offload (__FILE__, __LINE__, device, "__offload_target_alloc", 2, vd1, vd1g,
+	   NULL);
   return tgt_ptr;
 }
 
@@ -416,7 +441,8 @@  GOMP_OFFLOAD_free (int device, void *tgt_ptr)
   vd1.size = sizeof (void *);
   VarDesc2 vd1g = { "tgt_ptr", 0 };
 
-  offload (__FILE__, __LINE__, device, "__offload_target_free", 1, &vd1, &vd1g);
+  offload (__FILE__, __LINE__, device, "__offload_target_free", 1, &vd1, &vd1g,
+	   NULL);
 }
 
 extern "C" void *
@@ -435,7 +461,7 @@  GOMP_OFFLOAD_host2dev (int device, void *tgt_ptr, const void *host_ptr,
   VarDesc2 vd1g[2] = { { "tgt_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_host2tgt_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   VarDesc vd2 = vd_host2tgt;
   vd2.ptr = (void *) host_ptr;
@@ -443,7 +469,7 @@  GOMP_OFFLOAD_host2dev (int device, void *tgt_ptr, const void *host_ptr,
   VarDesc2 vd2g = { "var", 0 };
 
   offload (__FILE__, __LINE__, device, "__offload_target_host2tgt_p2", 1,
-	   &vd2, &vd2g);
+	   &vd2, &vd2g, NULL);
 
   return tgt_ptr;
 }
@@ -464,7 +490,7 @@  GOMP_OFFLOAD_dev2host (int device, void *host_ptr, const void *tgt_ptr,
   VarDesc2 vd1g[2] = { { "tgt_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2host_p1", 2,
-	   vd1, vd1g);
+	   vd1, vd1g, NULL);
 
   VarDesc vd2 = vd_tgt2host;
   vd2.ptr = (void *) host_ptr;
@@ -472,7 +498,7 @@  GOMP_OFFLOAD_dev2host (int device, void *host_ptr, const void *tgt_ptr,
   VarDesc2 vd2g = { "var", 0 };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2host_p2", 1,
-	   &vd2, &vd2g);
+	   &vd2, &vd2g, NULL);
 
   return host_ptr;
 }
@@ -495,22 +521,56 @@  GOMP_OFFLOAD_dev2dev (int device, void *dst_ptr, const void *src_ptr,
   VarDesc2 vd1g[3] = { { "dst_ptr", 0 }, { "src_ptr", 0 }, { "size", 0 } };
 
   offload (__FILE__, __LINE__, device, "__offload_target_tgt2tgt", 3, vd1,
-	   vd1g);
+	   vd1g, NULL);
 
   return dst_ptr;
 }
 
 extern "C" void
+GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
+			const void *async_data)
+{
+  TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p, async_data = %p)", device,
+	 tgt_fn, tgt_vars, async_data);
+
+  VarDesc vd[2] = { vd_host2tgt, vd_host2tgt };
+  vd[0].ptr = &tgt_fn;
+  vd[0].size = sizeof (void *);
+  vd[1].ptr = &tgt_vars;
+  vd[1].size = sizeof (void *);
+
+  offload (__FILE__, __LINE__, device, "__offload_target_run", 2, vd, NULL,
+	   async_data);
+}
+
+extern "C" void
 GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
 {
-  TRACE ("(tgt_fn = %p, tgt_vars = %p)", tgt_fn, tgt_vars);
+  TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p)", device, tgt_fn, tgt_vars);
 
-  VarDesc vd1[2] = { vd_host2tgt, vd_host2tgt };
-  vd1[0].ptr = &tgt_fn;
-  vd1[0].size = sizeof (void *);
-  vd1[1].ptr = &tgt_vars;
-  vd1[1].size = sizeof (void *);
-  VarDesc2 vd1g[2] = { { "tgt_fn", 0 }, { "tgt_vars", 0 } };
+  GOMP_OFFLOAD_async_run (device, tgt_fn, tgt_vars, NULL);
+}
+
+extern "C" bool
+GOMP_OFFLOAD_async_is_completed (int device, const void *async_data)
+{
+  TRACE ("(device = %d, async_data = %p)", device, async_data);
+
+  bool res;
+  pthread_mutex_lock (&async_tasks_lock);
+  res = async_tasks->count (async_data) == 0;
+  pthread_mutex_unlock (&async_tasks_lock);
+  return res;
+}
+
+/* Called by liboffloadmic when asynchronous function is completed.  */
+
+extern "C" void
+__gomp_offload_intelmic_async_completed (const void *async_data)
+{
+  TRACE ("(async_data = %p)", async_data);
 
-  offload (__FILE__, __LINE__, device, "__offload_target_run", 2, vd1, vd1g);
+  pthread_mutex_lock (&async_tasks_lock);
+  async_tasks->erase (async_data);
+  pthread_mutex_unlock (&async_tasks_lock);
 }
diff --git a/liboffloadmic/runtime/offload_host.cpp b/liboffloadmic/runtime/offload_host.cpp
index 08f626f..8cee12c 100644
--- a/liboffloadmic/runtime/offload_host.cpp
+++ b/liboffloadmic/runtime/offload_host.cpp
@@ -64,6 +64,9 @@  static void __offload_fini_library(void);
 #define GET_OFFLOAD_NUMBER(timer_data) \
     timer_data? timer_data->offload_number : 0
 
+extern "C" void
+__gomp_offload_intelmic_async_completed (const void *);
+
 extern "C" {
 #ifdef TARGET_WINNT
 // Windows does not support imports from libraries without actually
@@ -2507,7 +2510,7 @@  extern "C" {
         const void *info
     )
     {
-	/* TODO: Call callback function, pass info.  */
+	__gomp_offload_intelmic_async_completed (info);
     }
 }