libgomp: Handle OpenMP's reverse offloads

Message ID 0567b7c6-fede-72b8-63d1-1fc10dca36a0@codesourcery.com
State New

Commit Message

Tobias Burnus Dec. 6, 2022, 7:45 a.m. UTC
This patch finally handles reverse offload. Due to the prep work,
it essentially only adds content to libgomp/target.c's gomp_target_rev(),
except that it additionally saves the reverse-offload-function table
in gomp_load_image_to_device.

In the comment to "[Patch] libgomp: Add reverse-offload splay tree",
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/601368.html ,
it was suggested not to keep track of all the variable mappings and
to reconstruct the mapping from the normal splay tree, which this
patch does.
(Albeit in the very slow walk-everything way. Given that reverse-offload
target regions likely have only few map items, and that programs should use
only few reverse-offload regions and not expect them to be fast, that might
be okay.)

Specification references:
- For pointer attachment, I assume that the pointer is already fine on
   the host (if it existed on the host before) and does not need to be
   updated. I think the spec lacks wording for this; cf. OpenMP Spec Issue #3424.
- There are plans to permit 'nowait'. I think it wouldn't change anything
   except for not spin-waiting for the result - and (only for shared memory)
   the argument lists (addr, kinds, sizes) need to be copied to have a sufficient
   lifetime. (To be implemented in the future; cf. OpenMP Spec Pull Req. 3423
   for Issue 2038.)

  * * *

32bit vs. 64bit: libgomp itself is compiled with both -m32 and -m64; however,
nvptx and gcn require -m64 on the device side and assume that the device
pointers are representable on the host (i.e. all are 64bit). The new code
tries to be in principle compatible with uint32_t pointers and uses uint64_t
to represent them consistently. – The code should be mostly fine, except that
one called function requires an array of void* and size_t. Instead of handling
that case, I added some code to permit optimizing away the function content
without offloading - and a run-time assert in case this function ever gets
called on a 32bit host from the target side.
It is a run-time failure because '#if TARGET_OFFLOAD == ""' does not work
(string comparison is, unfortunately, not supported by the C preprocessor).

Comments, suggestions, OK for mainline, ... ?

Tobias

PS:
* As a follow-up, libgomp.texi must be updated
* For GCN, it currently does not work until stack variables are accessible
   from the host. (Prep work for this is in newlib + GCC 13.) Once done, a
   similar one-line change to plugin-gcn.c's GOMP_OFFLOAD_get_num_devices is
   required.

PPS: (Off topic remark to 32bit host)
While a 32bit host with a 32bit device will mostly work, having a 32bit host
with a 64bit device becomes interesting as the 'void *' returned by omp_target_alloc(...)
can't represent a device pointer. The solution is a 32bit pointer pointing to a 64bit
variable, e.g.
   uint64_t *devptr = malloc(sizeof(uint64_t));
   *devptr = internal_device_alloc ();
   return devptr;
with all the fun of translating this correctly with {use,has}_device_ptr etc.

Actually supporting this would require some larger changes to libgomp, which I
do not see happening unless a device system with sizeof(void*) > 64 bit shows
up - or some compelling reason to use 32bit on the host; but not for x86-64 or arm64
(or PowerPC). (There exist 128bit-pointer systems, which use the upper bits for extra
purposes - but for unified-shared-address purposes, it seems unlikely that
accelerator devices head in this direction.)
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

Comments

Jakub Jelinek Dec. 9, 2022, 2:44 p.m. UTC | #1
On Tue, Dec 06, 2022 at 08:45:07AM +0100, Tobias Burnus wrote:
> 32bit vs. 64bit: libgomp itself is compiled with both -m32 and -m64; however,
> nvptx and gcn require -m64 on the device side and assume that the device
> pointers are representable on the host (i.e. all are 64bit). The new code
> tries to be in principle compatible with uint32_t pointers and uses uint64_t
> to represent them consistently. – The code should be mostly fine, except that
> one called function requires an array of void* and size_t. Instead of handling
> that case, I added some code to permit optimizing away the function content
> without offloading - and a run-time assert in case this function ever gets
> called on a 32bit host from the target side.

I think we just shouldn't support libgomp plugins for 32-bit libgomp, only
host fallback.  If you want offloading, use 64-bit host...

> libgomp: Handle OpenMP's reverse offloads
> 
> This commit enables reverse offload for nvptx such that gomp_target_rev
> actually gets called.  And it fills the latter function to do all of
> the following: finding the host function for the device function pointer,
> copying the arguments to the host, processing the mapping/firstprivate,
> calling the host function, copying back the data and freeing as needed.
> 
> The data handling is made easier by assuming that all host variables
> either existed before (and are in the mapping) or that they are
> device variables not yet available on the host. Thus, the reverse
> mapping can do without refcounts etc. Note that the spec disallows,
> inside a target region, device-affecting constructs other than target
> with the ancestor device-modifier, and it also limits the clauses permitted
> on this construct.
> 
> For the function addresses, an additional splay tree is used; for
> the lookup of mapped variables, the existing splay tree is used.
> Unfortunately, its data structure requires a full walk of the tree;
> additionally, the just-mapped variables are recorded in a separate
> data structure, requiring an extra lookup. While the lookup is slow, assuming
> that only few variables get mapped in each reverse offload construct
> and that reverse offload is the exception and not performance critical,
> this seems to be acceptable.
> 
> libgomp/ChangeLog:
> 
> 	* libgomp.h (struct target_mem_desc): Predeclare; move
> 	below after 'reverse_splay_tree_node' and add rev_array
> 	member.
> 	(struct reverse_splay_tree_key_s, reverse_splay_compare): New.
> 	(reverse_splay_tree_node, reverse_splay_tree,
> 	reverse_splay_tree_key): New typedef.
> 	(struct gomp_device_descr): Add mem_map_rev member.
> 	* oacc-host.c (host_dispatch): NULL init .mem_map_rev.
> 	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices): Claim
> 	support for GOMP_REQUIRES_REVERSE_OFFLOAD.
> 	* splay-tree.h (splay_tree_callback_stop): New typedef; like
> 	splay_tree_callback but returning int not void.
> 	(splay_tree_foreach_lazy): Define; like splay_tree_foreach but
> 	taking splay_tree_callback_stop as argument.
> 	* splay-tree.c (splay_tree_foreach_internal_lazy,
> 	splay_tree_foreach_lazy): New; but early exit if callback returns
> 	nonzero.
> 	* target.c: Instantiate splay_tree_c with splay_tree_prefix 'reverse'.
> 	(gomp_map_lookup_rev): New.
> 	(gomp_load_image_to_device): Handle reverse-offload function
> 	lookup table.
> 	(gomp_unload_image_from_device): Free devicep->mem_map_rev.
> 	(struct gomp_splay_tree_rev_lookup_data, gomp_splay_tree_rev_lookup,
> 	gomp_map_rev_lookup, struct cpy_data, gomp_map_cdata_lookup_int,
> 	gomp_map_cdata_lookup): New auxiliary structs and functions for
> 	gomp_target_rev.
> 	(gomp_target_rev): Implement reverse offloading and its mapping.
> 	(gomp_target_init): Init current_device.mem_map_rev.root.
> 	* testsuite/libgomp.fortran/reverse-offload-2.f90: New test.
> 	* testsuite/libgomp.fortran/reverse-offload-3.f90: New test.
> 	* testsuite/libgomp.fortran/reverse-offload-4.f90: New test.
> 	* testsuite/libgomp.fortran/reverse-offload-5.f90: New test.
> 	* testsuite/libgomp.fortran/reverse-offload-5a.f90: New test without
> 	mapping of on-device allocated variables.

> +  /* Likeverse for the reverse lookup device->host for reverse offload. */

Likewise

> +  reverse_splay_tree_node rev_array;

Do we need reverse_splay_tree* stuff in libgomp.h?
As splay_tree_node is just a pointer, perhaps just
struct reverse_splay_tree_node_s;
early and
  struct reverse_splay_tree_node_s *rev_array;
in libgomp.h and include the extra splay-tree.h only in target.c?
Unless one needs it anywhere else...

Otherwise LGTM.

	Jakub
Tobias Burnus Dec. 10, 2022, 8:11 a.m. UTC | #2
On 09.12.22 15:44, Jakub Jelinek wrote:
> On Tue, Dec 06, 2022 at 08:45:07AM +0100, Tobias Burnus wrote:
>> [...]
> I think we just shouldn't support libgomp plugins for 32-bit libgomp, only
> host fallback.  If you want offloading, use 64-bit host...
(I concur.)
>
>> libgomp: Handle OpenMP's reverse offloads
>>
>> +  /* Likeverse for the reverse lookup device->host for reverse offload. */
> Likewise
>
>> +  reverse_splay_tree_node rev_array;
> Do we need reverse_splay_tree* stuff in libgomp.h?
> As splay_tree_node is just a pointer, perhaps just
> struct reverse_splay_tree_node_s;
> early and
>    struct reverse_splay_tree_node_s *rev_array;
> in libgomp.h and include the extra splay-tree.h only in target.c?
> Unless one needs it anywhere else...

It is used as 'typedef struct reverse_splay_tree_node_s *reverse_splay_tree_node;' in

struct target_mem_desc {
....
   reverse_splay_tree_node rev_array;
}

but also as

struct gomp_device_descr
{
   ...
   struct reverse_splay_tree_s mem_map_rev;
}

The latter is

struct reverse_splay_tree_key_s {
   /* Address of the device object.  */
   uint64_t dev;
   splay_tree_key k;
};

which in turn needs 'splay_tree_key'.

Thus, I could either commit it as is – or turn the latter also
into a pointer and malloc it. Currently, it is accessed as
mem_map.k.root = NULL for init and later through the splay-tree
functions indirectly.

Thoughts?

Unless there are further comments, I will later commit it as is.

Tobias

Jakub Jelinek Dec. 10, 2022, 8:28 a.m. UTC | #3
On Sat, Dec 10, 2022 at 09:11:24AM +0100, Tobias Burnus wrote:
> It is used as 'typedef struct reverse_splay_tree_node_s *reverse_splay_tree_node;' in
> 
> struct target_mem_desc {
> ....
>   reverse_splay_tree_node rev_array;
> }
> 
> but also as
> 
> struct gomp_device_descr
> {
>   ...
>   struct reverse_splay_tree_s mem_map_rev;
> }
> 
> The latter is
> 
> struct reverse_splay_tree_key_s {
>   /* Address of the device object.  */
>   uint64_t dev;
>   splay_tree_key k;
> };
> 
> which in turn needs 'splay_tree_key'.
> 
> Thus, I could either commit it as is – or turn the latter also
> into a pointer and malloc it. Currently, it is accessed as
> mem_map.k.root = NULL for init and later through the splay-tree
> functions indirectly.
> 
> Thoughts?
> 
> Unless there are further comments, I will later commit it as is.

Ok; just wanted to save compile time by not including too much for all the
other files...

	Jakub
Thomas Schwinge Dec. 15, 2022, 5:34 p.m. UTC | #4
Hi!

On 2022-12-06T08:45:07+0100, Tobias Burnus <tobias@codesourcery.com> wrote:
> This patch finally handles reverse offload.

Yay!  \o/


The 'libgomp.fortran/reverse-offload-5.f90' test case for nvptx
offloading runs into an error condition (thus, XFAILed) -- but then
blocks until timeout, instead of terminating promptly:

    libgomp: cuMemcpyDtoHAsync error: invalid argument
    WARNING: program timed out.
    XFAIL: libgomp.fortran/reverse-offload-5.f90   -O  execution test

Attempt to fix that by not calling 'CUDA_CALL_ASSERT'/'GOMP_PLUGIN_fatal'
with the device lock held; see
"libgomp: Handle OpenMP's reverse offloads, unlocking on error paths, pt. 1"
attached.  OK to push?  That unfortunately doesn't resolve the issue (I get
another hang elsewhere, later on), but it does seem like an improvement
anyway?


We then hang when tearing down the device:

    #7  0x00007ffff72f7137 in nvptx_close_device (ptx_dev=0x6cf580) at [...]/libgomp/plugin/plugin-nvptx.c:576
    #8  0x00007ffff72f88e6 in GOMP_OFFLOAD_fini_device (n=0) at [...]/libgomp/plugin/plugin-nvptx.c:1238
    #9  0x00007ffff78d767e in gomp_fini_device (devicep=0x6cf400) at [...]/libgomp/target.c:2648
    #10 0x00007ffff78de69f in gomp_target_fini () at [...]/libgomp/target.c:5026
    #11 0x00007ffff76b88a7 in __run_exit_handlers (status=1, listp=0x7ffff785e718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
    #12 0x00007ffff76b8a60 in __GI_exit (status=<optimized out>) at exit.c:139
    #13 0x00007ffff78d9df5 in gomp_target_rev (fn_ptr=2870272, mapnum=21, devaddrs_ptr=140736641236016, sizes_ptr=140736641235848, kinds_ptr=140736666274304, dev_num=0, dev_to_host_cpy=0x7ffff72fae00 <rev_off_dev_to_host_cpy>, host_to_dev_cpy=0x7ffff72faec0 <rev_off_host_to_dev_cpy>, token=0xb0f910) at [...]/libgomp/target.c:3508
    #14 0x00007ffff72fb3c1 in GOMP_OFFLOAD_run (ord=0, tgt_fn=0xad0fd8, tgt_vars=0x7fffcd800000, args=0x7fffffffd140) at [...]/libgomp/plugin/plugin-nvptx.c:2155
    #15 0x00007ffff78d8b91 in GOMP_target_ext (device=-1, fn=0x4012aa <MAIN__._omp_fn.1>, mapnum=11, hostaddrs=0x7fffffffd560, sizes=0x7fffffffd500, kinds=0x6adb20 <omp_data_kinds.146>, flags=0, depend=0x0, args=0x7fffffffd670) at [...]/libgomp/target.c:3146
    #16 0x0000000000400f44 in MAIN__ ()

That is:

    (gdb) frame 7
    #7  0x00007ffff72f7137 in nvptx_close_device (ptx_dev=0x6cf580) at [...]/libgomp/plugin/plugin-nvptx.c:576
    576         CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);

At the point where we called 'gomp_target_rev' via
'GOMP_PLUGIN_target_rev' in 'GOMP_OFFLOAD_run', 'omp_stacks' was still in
active use, and so the 'cuMemFree' here waits for that CUDA stream to
complete?  (I haven't looked up the exact details, because...)

..., then I found that I generally don't understand the locking scheme
applied here.

If the libgomp plugin doesn't request special
'host_to_dev_cpy'/'dev_to_host_cpy' for 'gomp_target_rev', then standard
'gomp_copy_host2dev'/'gomp_copy_dev2host' are used, which use
'gomp_device_copy', which expects the device to be locked.  (As can be
told by the unconditional 'gomp_mutex_unlock (&devicep->lock);' before
'gomp_fatal'.)  However, in a number of the
'gomp_copy_host2dev'/'gomp_copy_dev2host' calls from 'gomp_target_rev',
the device definitely is not locked; see the calls adjacent to the TODO
markers in my patch.  That means, we potentially
'gomp_mutex_unlock (&devicep->lock);' when not actually holding that
lock?

How, generally, is libgomp device locking supposed to interact with
OpenMP reverse offloading?


Grüße
 Thomas


> [...]

> libgomp: Handle OpenMP's reverse offloads
>
> [...]
>
>  libgomp/libgomp.h                                  |  77 ++-
>  libgomp/oacc-host.c                                |   1 +
>  libgomp/plugin/plugin-nvptx.c                      |   4 +-
>  libgomp/splay-tree.c                               |  22 +
>  libgomp/splay-tree.h                               |   8 +
>  libgomp/target.c                                   | 599 ++++++++++++++++++++-
>  .../libgomp.fortran/reverse-offload-2.f90          |  72 +++
>  .../libgomp.fortran/reverse-offload-3.f90          |  68 +++
>  .../libgomp.fortran/reverse-offload-4.f90          | 129 +++++
>  .../libgomp.fortran/reverse-offload-5.f90          | 100 ++++
>  .../libgomp.fortran/reverse-offload-5a.f90         |  98 ++++
>  11 files changed, 1139 insertions(+), 39 deletions(-)
>
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 5803683afc4..19d7d170091 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -1157,29 +1157,7 @@ struct target_var_desc {
>    uintptr_t length;
>  };
>
> -struct target_mem_desc {
> -  /* Reference count.  */
> -  uintptr_t refcount;
> -  /* All the splay nodes allocated together.  */
> -  splay_tree_node array;
> -  /* Start of the target region.  */
> -  uintptr_t tgt_start;
> -  /* End of the targer region.  */
> -  uintptr_t tgt_end;
> -  /* Handle to free.  */
> -  void *to_free;
> -  /* Previous target_mem_desc.  */
> -  struct target_mem_desc *prev;
> -  /* Number of items in following list.  */
> -  size_t list_count;
> -
> -  /* Corresponding target device descriptor.  */
> -  struct gomp_device_descr *device_descr;
> -
> -  /* List of target items to remove (or decrease refcount)
> -     at the end of region.  */
> -  struct target_var_desc list[];
> -};
> +struct target_mem_desc;
>
>  /* Special value for refcount - mask to indicate existence of special
>     values. Right now we allocate 3 bits.  */
> @@ -1273,6 +1251,58 @@ splay_compare (splay_tree_key x, splay_tree_key y)
>
>  #include "splay-tree.h"
>
> +/* Reverse offload splay-tree handling (functions only). */
> +
> +struct reverse_splay_tree_key_s {
> +  /* Address of the device object.  */
> +  uint64_t dev;
> +  splay_tree_key k;
> +};
> +
> +typedef struct reverse_splay_tree_node_s *reverse_splay_tree_node;
> +typedef struct reverse_splay_tree_s *reverse_splay_tree;
> +typedef struct reverse_splay_tree_key_s *reverse_splay_tree_key;
> +
> +static inline int
> +reverse_splay_compare (reverse_splay_tree_key x, reverse_splay_tree_key y)
> +{
> +  if (x->dev < y->dev)
> +    return -1;
> +  if (x->dev > y->dev)
> +    return 1;
> +  return 0;
> +}
> +
> +#define splay_tree_prefix reverse
> +#include "splay-tree.h"
> +
> +struct target_mem_desc {
> +  /* Reference count.  */
> +  uintptr_t refcount;
> +  /* All the splay nodes allocated together.  */
> +  splay_tree_node array;
> +  /* Likeverse for the reverse lookup device->host for reverse offload. */
> +  reverse_splay_tree_node rev_array;
> +  /* Start of the target region.  */
> +  uintptr_t tgt_start;
> +  /* End of the targer region.  */
> +  uintptr_t tgt_end;
> +  /* Handle to free.  */
> +  void *to_free;
> +  /* Previous target_mem_desc.  */
> +  struct target_mem_desc *prev;
> +  /* Number of items in following list.  */
> +  size_t list_count;
> +
> +  /* Corresponding target device descriptor.  */
> +  struct gomp_device_descr *device_descr;
> +
> +  /* List of target items to remove (or decrease refcount)
> +     at the end of region.  */
> +  struct target_var_desc list[];
> +};
> +
> +
>  typedef struct acc_dispatch_t
>  {
>    /* Execute.  */
> @@ -1367,6 +1397,7 @@ struct gomp_device_descr
>
>    /* Splay tree containing information about mapped memory regions.  */
>    struct splay_tree_s mem_map;
> +  struct reverse_splay_tree_s mem_map_rev;
>
>    /* Mutex for the mutable data.  */
>    gomp_mutex_t lock;
> diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
> index 4e3971ae1a9..77a24279e2f 100644
> --- a/libgomp/oacc-host.c
> +++ b/libgomp/oacc-host.c
> @@ -284,6 +284,7 @@ static struct gomp_device_descr host_dispatch =
>      .run_func = host_run,
>
>      .mem_map = { NULL },
> +    .mem_map_rev = { NULL },
>      /* .lock initialized in goacc_host_init.  */
>      .state = GOMP_DEVICE_UNINITIALIZED,
>
> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index e803f083591..4f4c25a90ba 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1190,7 +1190,9 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
>       devices were present.  Unified-shared address: see comment in
>       nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
>    if (num_devices > 0
> -      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
> +      && ((omp_requires_mask
> +        & ~(GOMP_REQUIRES_UNIFIED_ADDRESS
> +            | GOMP_REQUIRES_REVERSE_OFFLOAD)) != 0))
>      return -1;
>    return num_devices;
>  }
> diff --git a/libgomp/splay-tree.c b/libgomp/splay-tree.c
> index 2d8742fd19a..3809e90b68d 100644
> --- a/libgomp/splay-tree.c
> +++ b/libgomp/splay-tree.c
> @@ -236,3 +236,25 @@ splay_tree_foreach (splay_tree sp, splay_tree_callback func, void *data)
>  {
>    splay_tree_foreach_internal (sp->root, func, data);
>  }
> +
> +/* Like above, except when func returns != 0, stop early.  */
> +
> +static int
> +splay_tree_foreach_internal_lazy (splay_tree_node node,
> +                               splay_tree_callback_stop func, void *data)
> +{
> +  if (!node)
> +    return 0;
> +  if (func (&node->key, data))
> +    return 1;
> +  if (splay_tree_foreach_internal_lazy (node->left, func, data))
> +    return 1;
> +  /* Yeah, whatever.  GCC can fix my tail recursion.  */
> +  return splay_tree_foreach_internal_lazy (node->right, func, data);
> +}
> +
> +attribute_hidden void
> +splay_tree_foreach_lazy (splay_tree sp, splay_tree_callback_stop func, void *data)
> +{
> +  splay_tree_foreach_internal_lazy (sp->root, func, data);
> +}
> diff --git a/libgomp/splay-tree.h b/libgomp/splay-tree.h
> index 992381c4000..7dea0cc87f7 100644
> --- a/libgomp/splay-tree.h
> +++ b/libgomp/splay-tree.h
> @@ -78,8 +78,12 @@ typedef struct splay_tree_key_s *splay_tree_key;
>      splay_tree_name (splay_tree_prefix, splay_tree_remove)
>  # define splay_tree_foreach  \
>      splay_tree_name (splay_tree_prefix, splay_tree_foreach)
> +# define splay_tree_foreach_lazy \
> +    splay_tree_name (splay_tree_prefix, splay_tree_foreach_lazy)
>  # define splay_tree_callback \
>      splay_tree_name (splay_tree_prefix, splay_tree_callback)
> +# define splay_tree_callback_stop    \
> +    splay_tree_name (splay_tree_prefix, splay_tree_callback_stop)
>  #endif
>
>  #ifndef splay_tree_c
> @@ -99,11 +103,13 @@ struct splay_tree_s {
>  };
>
>  typedef void (*splay_tree_callback) (splay_tree_key, void *);
> +typedef int (*splay_tree_callback_stop) (splay_tree_key, void *);
>
>  extern splay_tree_key splay_tree_lookup (splay_tree, splay_tree_key);
>  extern void splay_tree_insert (splay_tree, splay_tree_node);
>  extern void splay_tree_remove (splay_tree, splay_tree_key);
>  extern void splay_tree_foreach (splay_tree, splay_tree_callback, void *);
> +extern void splay_tree_foreach_lazy (splay_tree, splay_tree_callback_stop, void *);
>  #else  /* splay_tree_c */
>  #  ifdef splay_tree_prefix
>  #    include "splay-tree.c"
> @@ -125,6 +131,8 @@ extern void splay_tree_foreach (splay_tree, splay_tree_callback, void *);
>  #  undef splay_tree_insert
>  #  undef splay_tree_remove
>  #  undef splay_tree_foreach
> +#  undef splay_tree_foreach_lazy
>  #  undef splay_tree_callback
> +#  undef splay_tree_callback_stop
>  #  undef splay_tree_prefix
>  #endif
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 189b201e344..1d8bca9ffe8 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -45,6 +45,12 @@
>  #include "plugin-suffix.h"
>  #endif
>
> +/* Define another splay tree instantiation - for reverse offload.  */
> +#define splay_tree_prefix reverse
> +#define splay_tree_c
> +#include "splay-tree.h"
> +
> +
>  typedef uintptr_t *hash_entry_type;
>  static inline void * htab_alloc (size_t size) { return gomp_malloc (size); }
>  static inline void htab_free (void *ptr) { free (ptr); }
> @@ -200,6 +206,12 @@ gomp_map_lookup (splay_tree mem_map, splay_tree_key key)
>    return splay_tree_lookup (mem_map, key);
>  }
>
> +static inline reverse_splay_tree_key
> +gomp_map_lookup_rev (reverse_splay_tree mem_map_rev, reverse_splay_tree_key key)
> +{
> +  return reverse_splay_tree_lookup (mem_map_rev, key);
> +}
> +
>  static inline splay_tree_key
>  gomp_map_0len_lookup (splay_tree mem_map, splay_tree_key key)
>  {
> @@ -1674,7 +1686,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
>                                                - k->host_start),
>                                             sizes[j], cbufp, false);
>                         }
> -                     }
> +                   }
>                   i = j - 1;
>                   break;
>                 case GOMP_MAP_FORCE_PRESENT:
> @@ -2192,11 +2204,16 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>
>    /* Load image to device and get target addresses for the image.  */
>    struct addr_pair *target_table = NULL;
> +  uint64_t *rev_target_fn_table = NULL;
>    int i, num_target_entries;
>
> +  /* With reverse offload, insert also target-host addresses. */
> +  bool rev_lookup = omp_requires_mask & GOMP_REQUIRES_REVERSE_OFFLOAD;
> +
>    num_target_entries
>      = devicep->load_image_func (devicep->target_id, version,
> -                             target_data, &target_table, NULL);
> +                             target_data, &target_table,
> +                             rev_lookup ? &rev_target_fn_table : NULL);
>
>    if (num_target_entries != num_funcs + num_vars
>        /* "+1" due to the additional ICV struct.  */
> @@ -2215,6 +2232,10 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>    /* "+1" due to the additional ICV struct.  */
>    tgt->array = gomp_malloc ((num_funcs + num_vars + 1)
>                           * sizeof (*tgt->array));
> +  if (rev_target_fn_table)
> +    tgt->rev_array = gomp_malloc (num_funcs * sizeof (*tgt->rev_array));
> +  else
> +    tgt->rev_array = NULL;
>    tgt->refcount = REFCOUNT_INFINITY;
>    tgt->tgt_start = 0;
>    tgt->tgt_end = 0;
> @@ -2223,6 +2244,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>    tgt->list_count = 0;
>    tgt->device_descr = devicep;
>    splay_tree_node array = tgt->array;
> +  reverse_splay_tree_node rev_array = tgt->rev_array;
>
>    for (i = 0; i < num_funcs; i++)
>      {
> @@ -2237,6 +2259,17 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>        array->left = NULL;
>        array->right = NULL;
>        splay_tree_insert (&devicep->mem_map, array);
> +      if (rev_target_fn_table)
> +     {
> +       reverse_splay_tree_key k2 = &rev_array->key;
> +       k2->dev = rev_target_fn_table[i];
> +       k2->k = k;
> +       rev_array->left = NULL;
> +       rev_array->right = NULL;
> +       if (k2->dev != 0)
> +         reverse_splay_tree_insert (&devicep->mem_map_rev, rev_array);
> +       rev_array++;
> +     }
>        array++;
>      }
>
> @@ -2349,6 +2382,15 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
>        gomp_mutex_unlock (&devicep->lock);
>        gomp_fatal ("image unload fail");
>      }
> +  if (devicep->mem_map_rev.root)
> +    {
> +      /* Free reverse offload splay tree + data; 'tgt->rev_array' is the only
> +      real allocation.  */
> +      assert (node && node->tgt && node->tgt->rev_array);
> +      assert (devicep->mem_map_rev.root->key.k->tgt == node->tgt);
> +      free (node->tgt->rev_array);
> +      devicep->mem_map_rev.root = NULL;
> +    }
>
>    /* Remove mappings from splay tree.  */
>    int i;
> @@ -2934,23 +2976,549 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>      htab_free (refcount_set);
>  }
>
> +
> +/* Reverse lookup (device addr -> host addr) for reverse offload.  We avoid
> +   keeping track of all variable mappings, assuming that reverse offload
> +   occurs only very rarely.  The downside is that the reverse search is
> +   slow.  */
> +
> +struct gomp_splay_tree_rev_lookup_data {
> +  uintptr_t tgt_start;
> +  uintptr_t tgt_end;
> +  splay_tree_key key;
> +};
> +
> +static int
> +gomp_splay_tree_rev_lookup (splay_tree_key key, void *d)
> +{
> +  struct gomp_splay_tree_rev_lookup_data *data;
> +  data = (struct gomp_splay_tree_rev_lookup_data *)d;
> +  uintptr_t tgt_start = key->tgt->tgt_start + key->tgt_offset;
> +
> +  if (tgt_start > data->tgt_start || key->tgt->list_count == 0)
> +    return 0;
> +
> +  size_t j;
> +  for (j = 0; j < key->tgt->list_count; j++)
> +    if (key->tgt->list[j].key == key)
> +      break;
> +  assert (j < key->tgt->list_count);
> +  uintptr_t tgt_end = tgt_start + key->tgt->list[j].length;
> +
> +  if ((tgt_start == data->tgt_start && tgt_end == data->tgt_end)
> +      || (tgt_end > data->tgt_start && tgt_start < data->tgt_end))
> +    {
> +      data->key = key;
> +      return 1;
> +    }
> +  return 0;
> +}
> +
> +static inline splay_tree_key
> +gomp_map_rev_lookup (splay_tree mem_map, uint64_t tgt_start, uint64_t tgt_end,
> +                  bool zero_len)
> +{
> +  struct gomp_splay_tree_rev_lookup_data data;
> +  data.key = NULL;
> +  data.tgt_start = tgt_start;
> +  data.tgt_end = tgt_end;
> +
> +  if (tgt_start != tgt_end)
> +    {
> +      splay_tree_foreach_lazy (mem_map, gomp_splay_tree_rev_lookup, &data);
> +      return data.key;
> +    }
> +
> +  data.tgt_end++;
> +  splay_tree_foreach_lazy (mem_map, gomp_splay_tree_rev_lookup, &data);
> +  if (data.key != NULL || zero_len)
> +    return data.key;
> +  data.tgt_end--;
> +
> +  data.tgt_start--;
> +  splay_tree_foreach_lazy (mem_map, gomp_splay_tree_rev_lookup, &data);
> +  return data.key;
> +}
> +
> +struct cpy_data
> +{
> +  uint64_t devaddr;
> +  bool present, aligned;
> +};
> +
> +
> +/* Search the data just mapped for this reverse offload; return its index
> +   if found, otherwise a value >= n.  */
> +
> +static inline int
> +gomp_map_cdata_lookup_int (struct cpy_data *d, uint64_t *devaddrs,
> +                        unsigned short *kinds, uint64_t *sizes, size_t n,
> +                        uint64_t tgt_start, uint64_t tgt_end)
> +{
> +  const bool short_mapkind = true;
> +  const int typemask = short_mapkind ? 0xff : 0x7;
> +  size_t i;
> +  for (i = 0; i < n; i++)
> +    {
> +      bool is_struct = ((get_kind (short_mapkind, kinds, i) & typemask)
> +                     == GOMP_MAP_STRUCT);
> +      uint64_t dev_end;
> +      if (!is_struct)
> +     dev_end = d[i].devaddr + sizes[i];
> +      else
> +     {
> +       if (i + sizes[i] < n)
> +         dev_end = d[i + sizes[i]].devaddr + sizes[i + sizes[i]];
> +       else
> +         dev_end = devaddrs[i + sizes[i]] + sizes[i + sizes[i]];
> +     }
> +      if ((d[i].devaddr == tgt_start && dev_end == tgt_end)
> +       || (dev_end > tgt_start && d[i].devaddr < tgt_end))
> +     break;
> +      if (is_struct)
> +     i += sizes[i];
> +    }
> +  return i;
> +}
> +
> +static inline int
> +gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
> +                    unsigned short *kinds, uint64_t *sizes,
> +                    size_t n, uint64_t tgt_start, uint64_t tgt_end,
> +                    bool zero_len)
> +{
> +  size_t i;
> +  if (tgt_start != tgt_end)
> +    return gomp_map_cdata_lookup_int (d, devaddrs, kinds, sizes, n,
> +                                   tgt_start, tgt_end);
> +  tgt_end++;
> +  i = gomp_map_cdata_lookup_int (d, devaddrs, kinds, sizes, n,
> +                              tgt_start, tgt_end);
> +  if (i < n || zero_len)
> +    return i;
> +  tgt_end--;
> +
> +  tgt_start--;
> +  return gomp_map_cdata_lookup_int (d, devaddrs, kinds, sizes, n,
> +                                 tgt_start, tgt_end);
> +}
> +
>  /* Handle reverse offload.  This is called by the device plugins for a
> -   reverse offload; it is not called if the outer target runs on the host.  */
> +   reverse offload; it is not called if the outer target runs on the host.
> +   The mapping is simplified, as device-affecting constructs other than
> +   'target' with device(ancestor:1) must not be encountered; in particular,
> +   no 'target (enter/exit) data'.  */
>
>  void
> -gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
> -              uint64_t mapnum __attribute__ ((unused)),
> -              uint64_t devaddrs_ptr __attribute__ ((unused)),
> -              uint64_t sizes_ptr __attribute__ ((unused)),
> -              uint64_t kinds_ptr __attribute__ ((unused)),
> -              int dev_num __attribute__ ((unused)),
> -              void (*dev_to_host_cpy) (void *, const void *, size_t,
> -                                       void *) __attribute__ ((unused)),
> -              void (*host_to_dev_cpy) (void *, const void *, size_t,
> -                                       void *) __attribute__ ((unused)),
> -              void *token __attribute__ ((unused)))
> +gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
> +              uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> +              void (*dev_to_host_cpy) (void *, const void *, size_t, void*),
> +              void (*host_to_dev_cpy) (void *, const void *, size_t, void*),
> +              void *token)
>  {
> -  __builtin_unreachable ();
> +  /* Return early if there is no offload code.  */
> +  if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
> +    return;
> +  /* Currently, this fails because of calculate_firstprivate_requirements
> +     below; it could be fixed, but additional code would need to be updated
> +     to handle 32-bit hosts; thus, it is not worthwhile.  */
> +  if (sizeof (void *) != sizeof (uint64_t))
> +    gomp_fatal ("Reverse offload on 32-bit hosts is not supported.");
> +
> +  struct cpy_data *cdata = NULL;
> +  uint64_t *devaddrs;
> +  uint64_t *sizes;
> +  unsigned short *kinds;
> +  const bool short_mapkind = true;
> +  const int typemask = short_mapkind ? 0xff : 0x7;
> +  struct gomp_device_descr *devicep = resolve_device (dev_num, false);
> +
> +  reverse_splay_tree_key n;
> +  struct reverse_splay_tree_key_s k;
> +  k.dev = fn_ptr;
> +
> +  gomp_mutex_lock (&devicep->lock);
> +  n = gomp_map_lookup_rev (&devicep->mem_map_rev, &k);
> +  gomp_mutex_unlock (&devicep->lock);
> +
> +  if (n == NULL)
> +    gomp_fatal ("Cannot find reverse-offload function");
> +  void (*host_fn)() = (void (*)()) n->k->host_start;
> +
> +  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> +    {
> +      devaddrs = (uint64_t *) (uintptr_t) devaddrs_ptr;
> +      sizes = (uint64_t *) (uintptr_t) sizes_ptr;
> +      kinds = (unsigned short *) (uintptr_t) kinds_ptr;
> +    }
> +  else
> +    {
> +      devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
> +      sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
> +      kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
> +      if (dev_to_host_cpy)
> +     {
> +       dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
> +                        mapnum * sizeof (uint64_t), token);
> +       dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
> +                        mapnum * sizeof (uint64_t), token);
> +       dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
> +                        mapnum * sizeof (unsigned short), token);
> +     }
> +      else
> +     {
> +       gomp_copy_dev2host (devicep, NULL, devaddrs,
> +                           (const void *) (uintptr_t) devaddrs_ptr,
> +                           mapnum * sizeof (uint64_t));
> +       gomp_copy_dev2host (devicep, NULL, sizes,
> +                           (const void *) (uintptr_t) sizes_ptr,
> +                           mapnum * sizeof (uint64_t));
> +       gomp_copy_dev2host (devicep, NULL, kinds, (const void *) (uintptr_t) kinds_ptr,
> +                           mapnum * sizeof (unsigned short));
> +     }
> +    }
> +
> +  size_t tgt_align = 0, tgt_size = 0;
> +
> +  /* If actually executed on a 32-bit system, the casts would lead to wrong
> +     code; but 32-bit hosts with offloading are not supported; see the top
> +     of this function.  */
> +  calculate_firstprivate_requirements (mapnum, (void *) (uintptr_t) sizes,
> +                                    (void *) (uintptr_t) kinds,
> +                                    &tgt_align, &tgt_size);
> +
> +  if (tgt_align)
> +    {
> +      char *tgt = gomp_alloca (tgt_size + tgt_align - 1);
> +      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
> +      if (al)
> +     tgt += tgt_align - al;
> +      tgt_size = 0;
> +      for (uint64_t i = 0; i < mapnum; i++)
> +     if (get_kind (short_mapkind, kinds, i) == GOMP_MAP_FIRSTPRIVATE
> +         && devaddrs[i] != 0)
> +       {
> +         size_t align = (size_t) 1 << (kinds[i] >> 8);
> +         tgt_size = (tgt_size + align - 1) & ~(align - 1);
> +         if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> +           memcpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
> +                   (size_t) sizes[i]);
> +         else if (dev_to_host_cpy)
> +           dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
> +                            (size_t) sizes[i], token);
> +         else
> +           gomp_copy_dev2host (devicep, NULL, tgt + tgt_size,
> +                               (void *) (uintptr_t) devaddrs[i],
> +                               (size_t) sizes[i]);
> +         devaddrs[i] = (uint64_t) (uintptr_t) tgt + tgt_size;
> +         tgt_size = tgt_size + sizes[i];
> +         if ((devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> +             && i + 1 < mapnum
> +             && ((get_kind (short_mapkind, kinds, i) & typemask)
> +                 == GOMP_MAP_ATTACH))
> +           {
> +             *(uint64_t*) (uintptr_t) (devaddrs[i+1] + sizes[i+1])
> +               = (uint64_t) devaddrs[i];
> +             ++i;
> +           }
> +       }
> +    }
> +
> +  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM))
> +    {
> +      size_t j, struct_cpy = 0;
> +      splay_tree_key n2;
> +      cdata = gomp_alloca (sizeof (*cdata) * mapnum);
> +      memset (cdata, '\0', sizeof (*cdata) * mapnum);
> +      gomp_mutex_lock (&devicep->lock);
> +      for (uint64_t i = 0; i < mapnum; i++)
> +     {
> +       if (devaddrs[i] == 0)
> +         continue;
> +       n = NULL;
> +       int kind = get_kind (short_mapkind, kinds, i) & typemask;
> +       switch (kind)
> +         {
> +           case GOMP_MAP_FIRSTPRIVATE:
> +           case GOMP_MAP_FIRSTPRIVATE_INT:
> +             continue;
> +
> +           case GOMP_MAP_DELETE:
> +           case GOMP_MAP_RELEASE:
> +           case GOMP_MAP_DELETE_ZERO_LEN_ARRAY_SECTION:
> +             /* Assume it is present; look it up, but otherwise ignore it.  */
> +           case GOMP_MAP_ALLOC:
> +           case GOMP_MAP_FROM:
> +           case GOMP_MAP_FORCE_ALLOC:
> +           case GOMP_MAP_FORCE_FROM:
> +           case GOMP_MAP_ALWAYS_FROM:
> +           case GOMP_MAP_TO:
> +           case GOMP_MAP_TOFROM:
> +           case GOMP_MAP_FORCE_TO:
> +           case GOMP_MAP_FORCE_TOFROM:
> +           case GOMP_MAP_ALWAYS_TO:
> +           case GOMP_MAP_ALWAYS_TOFROM:
> +           case GOMP_MAP_ZERO_LEN_ARRAY_SECTION:
> +             cdata[i].devaddr = devaddrs[i];
> +             bool zero_len = (kind == GOMP_MAP_DELETE_ZERO_LEN_ARRAY_SECTION
> +                              || kind == GOMP_MAP_ZERO_LEN_ARRAY_SECTION);
> +             j = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, i,
> +                                        devaddrs[i],
> +                                        devaddrs[i] + sizes[i], zero_len);
> +             if (j < i)
> +               {
> +                 n2 = NULL;
> +                 cdata[i].present = true;
> +                 devaddrs[i] = devaddrs[j] + devaddrs[i] - cdata[j].devaddr;
> +               }
> +             else
> +               {
> +                 n2 = gomp_map_rev_lookup (&devicep->mem_map,
> +                                           devaddrs[i],
> +                                           devaddrs[i] + sizes[i], zero_len);
> +                 cdata[i].present = n2 != NULL;
> +               }
> +             if (!cdata[i].present
> +                 && kind != GOMP_MAP_DELETE
> +                 && kind != GOMP_MAP_RELEASE
> +                 && kind != GOMP_MAP_DELETE_ZERO_LEN_ARRAY_SECTION)
> +               {
> +                 cdata[i].aligned = true;
> +                 size_t align = (size_t) 1 << (kinds[i] >> 8);
> +                 devaddrs[i]
> +                   = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
> +                                                                sizes[i]);
> +               }
> +             else if (n2 != NULL)
> +               devaddrs[i] = (n2->host_start + cdata[i].devaddr
> +                              - (n2->tgt->tgt_start + n2->tgt_offset));
> +             if (((!cdata[i].present || struct_cpy)
> +                  && (kind == GOMP_MAP_TO || kind == GOMP_MAP_TOFROM))
> +                 || kind == GOMP_MAP_FORCE_TO
> +                 || kind == GOMP_MAP_FORCE_TOFROM
> +                 || kind == GOMP_MAP_ALWAYS_TO
> +                 || kind == GOMP_MAP_ALWAYS_TOFROM)
> +               {
> +                 if (dev_to_host_cpy)
> +                   dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
> +                                    (void *) (uintptr_t) cdata[i].devaddr,
> +                                    sizes[i], token);
> +                 else
> +                   gomp_copy_dev2host (devicep, NULL,
> +                                       (void *) (uintptr_t) devaddrs[i],
> +                                       (void *) (uintptr_t) cdata[i].devaddr,
> +                                       sizes[i]);
> +               }
> +             if (struct_cpy)
> +               struct_cpy--;
> +             break;
> +           case GOMP_MAP_ATTACH:
> +           case GOMP_MAP_POINTER:
> +           case GOMP_MAP_ALWAYS_POINTER:
> +             n2 = gomp_map_rev_lookup (&devicep->mem_map,
> +                                       devaddrs[i] + sizes[i],
> +                                       devaddrs[i] + sizes[i]
> +                                       + sizeof (void*), false);
> +             cdata[i].present = n2 != NULL;
> +             cdata[i].devaddr = devaddrs[i];
> +             if (n2)
> +               devaddrs[i] = (n2->host_start + cdata[i].devaddr
> +                              - (n2->tgt->tgt_start + n2->tgt_offset));
> +             else
> +               {
> +                 j = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, i,
> +                                            devaddrs[i] + sizes[i],
> +                                            devaddrs[i] + sizes[i]
> +                                            + sizeof (void*), false);
> +                 if (j < i)
> +                   {
> +                     cdata[i].present = true;
> +                     devaddrs[i] = (devaddrs[j] + devaddrs[i]
> +                                    - cdata[j].devaddr);
> +                   }
> +               }
> +             if (!cdata[i].present)
> +               devaddrs[i] = (uintptr_t) gomp_malloc (sizeof (void*));
> +             /* Assume that when present, the pointer is already correct.  */
> +             if (!n2)
> +               *(uint64_t *) (uintptr_t) (devaddrs[i] + sizes[i])
> +                 = devaddrs[i-1];
> +             break;
> +           case GOMP_MAP_TO_PSET:
> +             /* Assume that when present, the pointers are fine and no 'to:'
> +                is required.  */
> +             n2 = gomp_map_rev_lookup (&devicep->mem_map,
> +                                       devaddrs[i], devaddrs[i] + sizes[i],
> +                                       false);
> +             cdata[i].present = n2 != NULL;
> +             cdata[i].devaddr = devaddrs[i];
> +             if (n2)
> +               devaddrs[i] = (n2->host_start + cdata[i].devaddr
> +                              - (n2->tgt->tgt_start + n2->tgt_offset));
> +             else
> +               {
> +                 j = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, i,
> +                                            devaddrs[i],
> +                                            devaddrs[i] + sizes[i], false);
> +                 if (j < i)
> +                   {
> +                     cdata[i].present = true;
> +                     devaddrs[i] = (devaddrs[j] + devaddrs[i]
> +                                    - cdata[j].devaddr);
> +                   }
> +               }
> +             if (!cdata[i].present)
> +               {
> +                 cdata[i].aligned = true;
> +                 size_t align = (size_t) 1 << (kinds[i] >> 8);
> +                 devaddrs[i]
> +                   = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
> +                                                                sizes[i]);
> +                 if (dev_to_host_cpy)
> +                   dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
> +                                    (void *) (uintptr_t) cdata[i].devaddr,
> +                                    sizes[i], token);
> +                 else
> +                   gomp_copy_dev2host (devicep, NULL,
> +                                       (void *) (uintptr_t) devaddrs[i],
> +                                       (void *) (uintptr_t) cdata[i].devaddr,
> +                                       sizes[i]);
> +               }
> +             for (j = i + 1; j < mapnum; j++)
> +               {
> +                 kind = get_kind (short_mapkind, kinds, j) & typemask;
> +                 if (!GOMP_MAP_ALWAYS_POINTER_P (kind)
> +                     && !GOMP_MAP_POINTER_P (kind))
> +                   break;
> +                 if (devaddrs[j] < devaddrs[i])
> +                   break;
> +                 if (cdata[i].present)
> +                   continue;
> +                 if (devaddrs[j] == 0)
> +                   {
> +                     *(uint64_t *) (uintptr_t) (devaddrs[i] + sizes[j]) = 0;
> +                     continue;
> +                   }
> +                 int k;
> +                 n2 = NULL;
> +                 cdata[i].present = true;
> +                 cdata[j].devaddr = devaddrs[j];
> +                 k = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, j,
> +                                            devaddrs[j],
> +                                            devaddrs[j] + sizeof (void*),
> +                                            false);
> +                 if (k < j)
> +                   devaddrs[j] = (devaddrs[k] + devaddrs[j]
> +                                  - cdata[k].devaddr);
> +                 else
> +                   {
> +                     n2 = gomp_map_rev_lookup (&devicep->mem_map,
> +                                               devaddrs[j],
> +                                               devaddrs[j] + sizeof (void*),
> +                                               false);
> +                     if (n2 == NULL)
> +                       {
> +                         gomp_mutex_unlock (&devicep->lock);
> +                         gomp_fatal ("Pointer target wasn't mapped");
> +                       }
> +                     devaddrs[j] = (n2->host_start + cdata[j].devaddr
> +                                    - (n2->tgt->tgt_start + n2->tgt_offset));
> +                   }
> +                 *(void **) (uintptr_t) (devaddrs[i] + sizes[j])
> +                   = (void *) (uintptr_t) devaddrs[j];
> +               }
> +             i = j - 1;
> +             break;
> +           case GOMP_MAP_STRUCT:
> +             n2 = gomp_map_rev_lookup (&devicep->mem_map, devaddrs[i+1],
> +                                       devaddrs[i + sizes[i]]
> +                                       + sizes[i + sizes[i]], false);
> +             cdata[i].present = n2 != NULL;
> +             cdata[i].devaddr = devaddrs[i];
> +             struct_cpy = cdata[i].present ? 0 : sizes[i];
> +             if (!n2)
> +               {
> +                 size_t sz = (size_t) (devaddrs[i + sizes[i]]
> +                                       - devaddrs[i+1]
> +                                       + sizes[i + sizes[i]]);
> +                 size_t align = (size_t) 1 << (kinds[i] >> 8);
> +                 cdata[i].aligned = true;
> +                 devaddrs[i] = (uintptr_t) gomp_aligned_alloc (align, sz);
> +                 devaddrs[i] -= devaddrs[i+1] - cdata[i].devaddr;
> +               }
> +             else
> +               devaddrs[i] = (n2->host_start + cdata[i].devaddr
> +                              - (n2->tgt->tgt_start + n2->tgt_offset));
> +             break;
> +           default:
> +             gomp_mutex_unlock (&devicep->lock);
> +             gomp_fatal ("gomp_target_rev unhandled kind 0x%.4x", kinds[i]);
> +         }
> +     }
> +      gomp_mutex_unlock (&devicep->lock);
> +    }
> +
> +  host_fn (devaddrs);
> +
> +  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM))
> +    {
> +      uint64_t struct_cpy = 0;
> +      bool clean_struct = false;
> +      for (uint64_t i = 0; i < mapnum; i++)
> +     {
> +       if (cdata[i].devaddr == 0)
> +         continue;
> +       int kind = get_kind (short_mapkind, kinds, i) & typemask;
> +       bool copy = !cdata[i].present || struct_cpy;
> +       switch (kind)
> +         {
> +           case GOMP_MAP_FORCE_FROM:
> +           case GOMP_MAP_FORCE_TOFROM:
> +           case GOMP_MAP_ALWAYS_FROM:
> +           case GOMP_MAP_ALWAYS_TOFROM:
> +             copy = true;
> +             /* FALLTHRU */
> +           case GOMP_MAP_FROM:
> +           case GOMP_MAP_TOFROM:
> +             if (copy && host_to_dev_cpy)
> +               host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
> +                                (void *) (uintptr_t) devaddrs[i],
> +                                sizes[i], token);
> +             else if (copy)
> +               gomp_copy_host2dev (devicep, NULL,
> +                                   (void *) (uintptr_t) cdata[i].devaddr,
> +                                   (void *) (uintptr_t) devaddrs[i],
> +                                   sizes[i], false, NULL);
> +           default:
> +             break;
> +         }
> +       if (struct_cpy)
> +         {
> +           struct_cpy--;
> +           continue;
> +         }
> +       if (kind == GOMP_MAP_STRUCT && !cdata[i].present)
> +         {
> +           clean_struct = true;
> +           struct_cpy = sizes[i];
> +         }
> +       else if (cdata[i].aligned)
> +         gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]);
> +       else if (!cdata[i].present)
> +         free ((void *) (uintptr_t) devaddrs[i]);
> +     }
> +      if (clean_struct)
> +     for (uint64_t i = 0; i < mapnum; i++)
> +       if (!cdata[i].present
> +           && ((get_kind (short_mapkind, kinds, i) & typemask)
> +               == GOMP_MAP_STRUCT))
> +         {
> +           devaddrs[i] += cdata[i+1].devaddr - cdata[i].devaddr;
> +           gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]);
> +         }
> +
> +      free (devaddrs);
> +      free (sizes);
> +      free (kinds);
> +    }
>  }
>
>  /* Host fallback for GOMP_target_data{,_ext} routines.  */
> @@ -4330,6 +4898,7 @@ gomp_target_init (void)
>               /* current_device.capabilities has already been set.  */
>               current_device.type = current_device.get_type_func ();
>               current_device.mem_map.root = NULL;
> +             current_device.mem_map_rev.root = NULL;
>               current_device.state = GOMP_DEVICE_UNINITIALIZED;
>               for (i = 0; i < new_num_devs; i++)
>                 {
> diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-2.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-2.f90
> new file mode 100644
> index 00000000000..067639bc67e
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-2.f90
> @@ -0,0 +1,72 @@
> +! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
> +
> +implicit none
> +!$omp requires reverse_offload
> +integer :: A(50), A2(50)
> +integer :: i, error
> +logical :: shared_mem
> +
> +error = 0
> +shared_mem = .false.
> +A  = [(i, i=1,50)]
> +A2 = [(i, i=1,50)]
> +
> +!$omp target map(to: shared_mem)
> +  shared_mem = .true.
> +!$omp end target
> +
> +!$omp target map(to: A(20:40), A2(20:40)) map(from: error)
> +block
> +  integer :: B(10), C(10)
> +  B = 99
> +  C = 88
> +  A(20:40)  = -A(20:40)
> +  A2(20:40) = -A2(20:40)
> +
> +  !$omp target device (ancestor:1)  &
> +  !$omp&       map(to: A(25:35)) map(always, to: A2(25:35))  &
> +  !$omp&       map(from:B(4:8)) map(tofrom:C(4:8))
> +     if (shared_mem) then
> +       if (any (A(25:35)  /= [(-i,i=25,35)])) stop 20
> +     else
> +       if (any (A(25:35)  /= [( i,i=25,35)])) stop 21
> +     end if
> +    if (any (A2(25:35) /= [(-i,i=25,35)])) stop 22
> +    if (any (C(4:8) /= 88)) stop 23
> +
> +    A(25:35) = -A(25:35)*10
> +    A2(25:35) = -A2(25:35)*10
> +    B(4:8) = [4,5,6,7,8]
> +    C(4:8) = [-4,-5,-6,-7,-8]
> +  !$omp end target
> +
> +  if (any (B(1:3) /= 99) .or. any (B(9:10) /= 99)) then
> +    error = 30
> +  elseif (any (B(4:8) /= [4,5,6,7,8])) then
> +    error = 31
> +  elseif (any (C(1:3) /= 88) .or. any (C(9:10) /= 88)) then
> +    error = 32
> +  elseif (any (C(4:8) /= [-4,-5,-6,-7,-8])) then
> +    error = 33
> +  else
> +    error = 0
> +  endif
> +end block
> +
> +if (error /= 0) stop error
> +
> +if (shared_mem) then
> +  if (any (A(1:19)  /= [( i, i=1,19)])) stop 1
> +  if (any (A(20:24) /= [(-i, i=20,24)])) stop 2
> +  if (any (A(36:40) /= [(-i, i=36,40)])) stop 3
> +  if (any (A(41:50) /= [( i, i=41,50)])) stop 4
> +
> +  if (any (A(25:35) /= [( 10*i, i=25,35)])) stop 5
> +else
> +  if (any (A(1:24)  /= [( i, i=1,24)])) stop 6
> +  if (any (A(36:50) /= [( i, i=36,50)])) stop 7
> +
> +  if (any (A(25:35) /= [(-10*i, i=25,35)])) stop 8
> +end if
> +if (any (A2(25:35) /= [( 10*i, i=25,35)])) stop 9
> +end
> diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-3.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-3.f90
> new file mode 100644
> index 00000000000..2fd2f5b6589
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-3.f90
> @@ -0,0 +1,68 @@
> +! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
> +
> +implicit none
> +!$omp requires reverse_offload
> +integer :: A(50), A2(50), A3(50)
> +integer :: i
> +logical :: shared_mem
> +
> +shared_mem = .false.
> +A  = [(3*i, i=1,50)]
> +A2 = [(7*i, i=1,50)]
> +A3 = [(11*i, i=1,50)]
> +
> +!$omp target map(to: shared_mem)
> +  shared_mem = .true.
> +!$omp end target
> +
> +!$omp target map(to: A(20:40), A2(20:40), A3(20:40))
> +block
> +  integer :: C(10)
> +  C = 88
> +  A(20:40)  = -2*A(20:40)
> +  A2(20:40) = -9*A2(20:40)
> +  A3(20:40) = -13*A3(20:40)
> +
> +  !$omp target device (ancestor:1)  &
> +  !$omp&       map(from: A(25:35)) map(always, from: A2(25:35))  &
> +  !$omp&       map(alloc: A3(25:35)) map(alloc:C(4:8))
> +    if (shared_mem) then
> +      if (any (A(25:35)  /= [(-2*3*i, i=25,35)])) stop 1
> +      if (any (A2(25:35) /= [(-9*7*i, i=25,35)])) stop 2
> +      if (any (A3(25:35) /= [(-13*11*i, i=25,35)])) stop 3
> +    else
> +      if (any (A(25:35)  /= [(3*i, i=25,35)])) stop 4
> +      if (any (A2(25:35) /= [(7*i, i=25,35)])) stop 5
> +      if (any (A3(25:35) /= [(11*i, i=25,35)])) stop 6
> +    end if
> +
> +    A(25:35) = A(25:35)*5
> +    A2(25:35) = A2(25:35)*8
> +    A3(25:35) = A3(25:35)*18
> +    C(4:8) = [4,5,6,7,8]
> +  !$omp end target
> +
> +  if (shared_mem) then
> +    if (any (A(25:35)  /= [(-2*3*5*i, i=25,35)])) stop 7
> +    if (any (A2(25:35) /= [(-9*7*8*i, i=25,35)])) stop 8
> +    if (any (A3(25:35) /= [(-13*11*18*i, i=25,35)])) stop 9
> +    if (any (C(4:8) /= [4,5,6,7,8])) stop 10
> +  else
> +    if (any (A(25:35)  /= [(-2*3*i, i=25,35)])) stop 11
> +    if (any (A2(25:35) /= [(7*8*i, i=25,35)])) stop 12
> +    if (any (A3(25:35) /= [(-13*11*i, i=25,35)])) stop 13
> +    if (any (C(4:8) /= 88)) stop 14
> +  end if
> +end block
> +
> +if (shared_mem) then
> +  if (any (A(25:35)  /= [(-2*3*5*i, i=25,35)])) stop
> +  if (any (A2(25:35) /= [(-9*7*8*i, i=25,35)])) stop
> +  if (any (A3(25:35) /= [(-13*11*18*i, i=25,35)])) stop
> +else
> +  if (any (A(25:35)  /= [(3*5*i, i=25,35)])) stop
> +  if (any (A2(25:35) /= [(7*8*i, i=25,35)])) stop
> +  if (any (A3(25:35) /= [(11*18*i, i=25,35)])) stop
> +end if
> +
> +end
> diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-4.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-4.f90
> new file mode 100644
> index 00000000000..fb27aa73eba
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-4.f90
> @@ -0,0 +1,129 @@
> +! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
> +
> +implicit none
> +!$omp requires reverse_offload
> +
> +type t2
> +  integer :: a, b, c
> +end type t2
> +
> +type t
> + integer :: A(5), B(5), C(5)
> + integer, pointer :: P(:), P2 !Just some padding
> + type(t2) :: tt !Just some padding
> +end type t
> +
> +type(t) :: S1, S2
> +logical :: shared_mem
> +
> +shared_mem = .false.
> +
> +!$omp target map(to: shared_mem)
> +  shared_mem = .true.
> +!$omp end target
> +
> +s1%A = [1,2,3,4,5]
> +s1%B = [10,20,30,40,50]
> +s1%C = [11,22,33,44,55]
> +s2%A = 2*s1%A
> +s2%B = 2*s1%B
> +s2%C = 2*s1%C
> +
> +!$omp target &
> +!$omp&       map(to: s1%b, s1%c) &
> +!$omp&       map(to: s2%b, s2%c)
> +block
> +  type(t) :: si1, si2, si3, si4
> +
> +  s1%B = -10 * s1%B
> +  s1%C = -10 * s1%C
> +  s2%B = -15 * s2%B
> +  s2%C = -15 * s2%C
> +
> +  si1%A = -1 * [1,2,3,4,5]
> +  si1%B = -1 * [10,20,30,40,50]
> +  si1%C = -1 * [11,22,33,44,55]
> +  si2%A = -23 * [1,2,3,4,5]
> +  si2%B = -23 * [10,20,30,40,50]
> +  si2%C = -23 * [11,22,33,44,55]
> +
> +  !$omp target device (ancestor:1)  &
> +  !$omp&       map(to: si1%C, si1%B)  &
> +  !$omp&       map(tofrom: si2%C, si2%B)  &
> +  !$omp&       map(always, to: s1%B)  &
> +  !$omp&       map(        to: s2%B)
> +    if (any (s1%A /=       [1,2,3,4,5])) stop 1
> +    if (any (s1%B /= -10 * [10,20,30,40,50])) stop 2
> +    if (shared_mem) then
> +      if (any (s1%C /= -10 * [11,22,33,44,55])) stop 4
> +    else
> +      if (any (s1%C /=       [11,22,33,44,55])) stop 3
> +    endif
> +    if (any (s2%A /=   2 * [1,2,3,4,5])) stop 4
> +    if (shared_mem) then
> +      if (any (s2%B /= -15 * 2 * [10,20,30,40,50])) stop 5
> +      if (any (s2%C /= -15 * 2 * [11,22,33,44,55])) stop 6
> +    else
> +      if (any (s2%B /=   2 * [10,20,30,40,50])) stop 7
> +      if (any (s2%C /=   2 * [11,22,33,44,55])) stop 8
> +    endif
> +    if (any (si1%B /= -1 * [10,20,30,40,50])) stop 9
> +    if (any (si1%C /= -1 * [11,22,33,44,55])) stop 10
> +    if (any (si2%B /= -23 * [10,20,30,40,50])) stop 10
> +    if (any (si2%C /= -23 * [11,22,33,44,55])) stop 11
> +
> +    s1%A = 5 * s1%A
> +    s1%B = 7 * s1%B
> +    s1%C = 13 * s1%C
> +    s2%A = 9 * s2%A
> +    s2%B = 21 * s2%B
> +    s2%C = 31 * s2%C
> +    si1%B = -11 * si1%B
> +    si1%C = -13 * si1%C
> +    si2%B = -27 * si2%B
> +    si2%C = -29 * si2%C
> +  !$omp end target
> +
> +  if (shared_mem) then
> +    if (any (s1%B /= -10 * 7 * [10,20,30,40,50])) stop 20
> +    if (any (s1%C /= -10 * 13 * [11,22,33,44,55])) stop 21
> +  else
> +    if (any (s1%B /= -10 * [10,20,30,40,50])) stop 22
> +    if (any (s1%C /= -10 * [11,22,33,44,55])) stop 23
> +  endif
> +  if (shared_mem) then
> +    if (any (s2%B /= -15 * 2 * 21 * [10,20,30,40,50])) stop 24
> +    if (any (s2%C /= -15 * 2 * 31 * [11,22,33,44,55])) stop 25
> +  else
> +    if (any (s2%B /= -15 * 2 * [10,20,30,40,50])) stop 26
> +    if (any (s2%C /= -15 * 2 * [11,22,33,44,55])) stop 27
> +  endif
> +  if (any (si1%A /= -1 * [1,2,3,4,5])) stop 28
> +  if (shared_mem) then
> +    if (any (si1%B /= -1 * (-11) * [10,20,30,40,50])) stop 29
> +    if (any (si1%C /= -1 * (-13) * [11,22,33,44,55])) stop 30
> +  else
> +    if (any (si1%B /= -1 * [10,20,30,40,50])) stop 31
> +    if (any (si1%C /= -1 * [11,22,33,44,55])) stop 32
> +  endif
> +  if (any (si2%A /= -23 * [1,2,3,4,5])) stop 33
> +  if (any (si2%B /= -23 * (-27) * [10,20,30,40,50])) stop 34
> +  if (any (si2%C /= -23 * (-29) * [11,22,33,44,55])) stop 35
> +end block
> +
> +if (any (s1%A /=       5 * [1,2,3,4,5])) stop 40
> +if (any (s1%B /= -10 * 7 * [10,20,30,40,50])) stop 41
> +if (shared_mem) then
> +  if (any (s1%C /= -10 * 13 * [11,22,33,44,55])) stop 42
> +else
> +  if (any (s1%C /= 13 *    [11,22,33,44,55])) stop 43
> +endif
> +if (any (s2%A /=   2 * 9 * [1,2,3,4,5])) stop 44
> +if (shared_mem) then
> +  if (any (s2%B /= -15 * 2 * 21 * [10,20,30,40,50])) stop 45
> +  if (any (s2%C /= -15 * 2 * 31 * [11,22,33,44,55])) stop 46
> +else
> +  if (any (s2%B /=   2 * 21 * [10,20,30,40,50])) stop 47
> +  if (any (s2%C /=   2 * 31 * [11,22,33,44,55])) stop 48
> +endif
> +end
> diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-5.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-5.f90
> new file mode 100644
> index 00000000000..ef7eb7bdd52
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-5.f90
> @@ -0,0 +1,100 @@
> +! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
> +! { dg-xfail-run-if "Copying on-device allocated memory fails with cuMemcpyDtoHAsync error: invalid argument" { offload_device_nvptx } }
> +
> +! Because of the nvptx fail, a non-device alloc version has been
> +! created: reverse-offload-5a.f90
> +
> +implicit none
> +!$omp requires reverse_offload
> +
> +integer, allocatable :: A(:), A2(:), s1, s2
> +integer :: i
> +logical :: shared_mem
> +
> +shared_mem = .false.
> +
> +a = [1,2,3,4]
> +a2 = [8,7,6,5]
> +s1 = 532
> +s2 = 55
> +
> +!$omp target map(to: shared_mem)
> +  shared_mem = .true.
> +!$omp end target
> +
> +!$omp target map(to: A, A2, s1, s2)
> +block
> +  integer, allocatable :: ai(:), ai2(:), si1, si2
> +
> +  a = a * 2
> +  a2 = a2 * 3
> +  s1 = s1 * 4
> +  s2 = s2 * 5
> +
> +  ai = [23,35,86,43]
> +  ai2 = [8,4,7,1]
> +  si1 = 64
> +  si2 = 765
> +
> +  !$omp target device (ancestor:1)  &
> +  !$omp&       map(to: A, s1, ai, si1) map(always, to: a2, s2)  &
> +  !$omp&       map(tofrom: ai2, si2)
> +    if (shared_mem) then
> +      if (any (a  /= 2 * [1,2,3,4])) stop 1
> +      if (s1 /= 4 * 532) stop 2
> +    else
> +      if (any (a  /=     [1,2,3,4])) stop 3
> +      if (s1 /=     532) stop 4
> +    endif
> +    if (any (a2 /= 3 * [8,7,6,5])) stop 5
> +    if (s2 /= 5 * 55) stop 6
> +    if (any (ai /= [23,35,86,43])) stop 7
> +    if (any (ai2 /= [8,4,7,1])) stop 8
> +    if (si1 /= 64) stop 9
> +    if (si2 /= 765) stop 10
> +
> +    a = a*3
> +    a2 = a2*7
> +    s1 = s1*11
> +    s2 = s2*5
> +    ai = ai*13
> +    ai2 = ai2*21
> +    si1 = si1*27
> +    si2 = si2*31
> +  !$omp end target
> +
> +  if (shared_mem) then
> +    if (any (a  /= 3 * 2 * [1,2,3,4])) stop 11
> +    if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 12
> +    if (s1 /= 11 * 4 * 532) stop 13
> +    if (s2 /= 5 * 5 * 55) stop 14
> +    if (any (ai /= 13 * [23,35,86,43])) stop 15
> +    if (si1 /= 27 * 64) stop 16
> +  else
> +    if (any (a  /= 2 * [1,2,3,4])) stop 17
> +    if (any (a2 /= 3 * [8,7,6,5])) stop 18
> +    if (s1 /= 4 * 532) stop 19
> +    if (s2 /= 5 * 55) stop 20
> +    if (any (ai /= [23,35,86,43])) stop 22
> +    if (si1 /= 64) stop 23
> +  endif
> +  if (any (ai2 /= 21 * [8,4,7,1])) stop 24
> +  if (si2 /= 31 * 765) stop 25
> +
> +  deallocate (ai, ai2, si1, si2)
> +end block
> +
> +if (shared_mem) then
> +  if (any (a  /= 3 * 2 * [1,2,3,4])) stop 30
> +  if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 31
> +  if (s1 /= 11 * 4 * 532) stop 32
> +  if (s2 /= 5 * 5 * 55) stop 33
> +else
> +  if (any (a  /= 3 * [1,2,3,4])) stop 34
> +  if (any (a2 /= 3 * 7 * [8,7,6,5])) stop 35
> +  if (s1 /= 11 * 532) stop 36
> +  if (s2 /= 5 * 5 * 55) stop 37
> +endif
> +
> +deallocate (a, a2, s1, s2)
> +end
> diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-5a.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-5a.f90
> new file mode 100644
> index 00000000000..914d10d8144
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-5a.f90
> @@ -0,0 +1,98 @@
> +! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
> +
> +! Copying on-device allocated memory fails with cuMemcpyDtoHAsync error: invalid argument
> +! Hence, reverse-offload-5.f90 has been copied to *this* file,
> +! which uses host-allocated vars - but, on the device side, only stack variables
> +
> +implicit none
> +!$omp requires reverse_offload
> +
> +integer, allocatable :: A(:), A2(:), s1, s2
> +integer :: i,D(4)
> +logical :: shared_mem
> +
> +shared_mem = .false.
> +
> +a = [1,2,3,4]
> +a2 = [8,7,6,5]
> +s1 = 532
> +s2 = 55
> +
> +!$omp target map(to: shared_mem)
> +  shared_mem = .true.
> +!$omp end target
> +
> +!$omp target map(to: A, A2, s1, s2)
> +block
> +  ! stack variables:
> +  integer :: ai(4), ai2(4), si1, si2
> +
> +  a = a * 2
> +  a2 = a2 * 3
> +  s1 = s1 * 4
> +  s2 = s2 * 5
> +
> +  ai = [23,35,86,43]
> +  ai2 = [8,4,7,1]
> +  si1 = 64
> +  si2 = 765
> +
> +  !$omp target device (ancestor:1)  &
> +  !$omp&       map(to: A, s1, ai, si1) map(always, to: a2, s2)  &
> +  !$omp&       map(tofrom: ai2, si2)
> +    if (shared_mem) then
> +      if (any (a  /= 2 * [1,2,3,4])) stop 1
> +      if (s1 /= 4 * 532) stop 2
> +    else
> +      if (any (a  /=     [1,2,3,4])) stop 3
> +      if (s1 /=     532) stop 4
> +    endif
> +    if (any (a2 /= 3 * [8,7,6,5])) stop 5
> +    if (s2 /= 5 * 55) stop 6
> +    if (any (ai /= [23,35,86,43])) stop 7
> +    if (any (ai2 /= [8,4,7,1])) stop 8
> +    if (si1 /= 64) stop 9
> +    if (si2 /= 765) stop 10
> +
> +    a = a*3
> +    a2 = a2*7
> +    s1 = s1*11
> +    s2 = s2*5
> +    ai = ai*13
> +    ai2 = ai2*21
> +    si1 = si1*27
> +    si2 = si2*31
> +  !$omp end target
> +
> +  if (shared_mem) then
> +    if (any (a  /= 3 * 2 * [1,2,3,4])) stop 11
> +    if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 12
> +    if (s1 /= 11 * 4 * 532) stop 13
> +    if (s2 /= 5 * 5 * 55) stop 14
> +    if (any (ai /= 13 * [23,35,86,43])) stop 15
> +    if (si1 /= 27 * 64) stop 16
> +  else
> +    if (any (a  /= 2 * [1,2,3,4])) stop 17
> +    if (any (a2 /= 3 * [8,7,6,5])) stop 18
> +    if (s1 /= 4 * 532) stop 19
> +    if (s2 /= 5 * 55) stop 20
> +    if (any (ai /= [23,35,86,43])) stop 22
> +    if (si1 /= 64) stop 23
> +  endif
> +  if (any (ai2 /= 21 * [8,4,7,1])) stop 24
> +  if (si2 /= 31 * 765) stop 25
> +end block
> +if (shared_mem) then
> +  if (any (a  /= 3 * 2 * [1,2,3,4])) stop 30
> +  if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 31
> +  if (s1 /= 11 * 4 * 532) stop 32
> +  if (s2 /= 5 * 5 * 55) stop 33
> +else
> +  if (any (a  /= 3 * [1,2,3,4])) stop 34
> +  if (any (a2 /= 3 * 7 * [8,7,6,5])) stop 35
> +  if (s1 /= 11 * 532) stop 36
> +  if (s2 /= 5 * 5 * 55) stop 37
> +endif
> +
> +deallocate (a, a2, s1, s2)
> +end


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Jakub Jelinek Dec. 15, 2022, 5:49 p.m. UTC | #5
On Thu, Dec 15, 2022 at 06:34:30PM +0100, Thomas Schwinge wrote:
> --- a/libgomp/libgomp-plugin.c
> +++ b/libgomp/libgomp-plugin.c
> @@ -82,9 +82,9 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
>  void
>  GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>  			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> -			void (*dev_to_host_cpy) (void *, const void *, size_t,
> +			bool (*dev_to_host_cpy) (void *, const void *, size_t,
>  						 void *),
> -			void (*host_to_dev_cpy) (void *, const void *, size_t,
> +			bool (*host_to_dev_cpy) (void *, const void *, size_t,
>  						 void *), void *token)
>  {
>    gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
> diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
> index ac3878289506..fb533164bf9b 100644
> --- a/libgomp/libgomp-plugin.h
> +++ b/libgomp/libgomp-plugin.h
> @@ -122,9 +122,9 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
>  
>  extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
>  				    uint64_t, int,
> -				    void (*) (void *, const void *, size_t,
> +				    bool (*) (void *, const void *, size_t,
>  					      void *),
> -				    void (*) (void *, const void *, size_t,
> +				    bool (*) (void *, const void *, size_t,
>  					      void *), void *);
>  
>  /* Prototypes for functions implemented by libgomp plugins.  */
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -1130,8 +1130,8 @@ extern int gomp_get_num_devices (void);
>  extern bool gomp_target_task_fn (void *);
>  extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
>  			     int,
> -			     void (*) (void *, const void *, size_t, void *),
> -			     void (*) (void *, const void *, size_t, void *),
> +			     bool (*) (void *, const void *, size_t, void *),
> +			     bool (*) (void *, const void *, size_t, void *),
>  			     void *);
>  
>  /* Splay tree definitions.  */

I think returning bool from those is fine.

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1,3 +1,5 @@
> +#pragma GCC optimize "O0"

But the pragmas are not.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -1,3 +1,5 @@
> +#pragma GCC optimize "O0"

Neither here.

> @@ -3340,12 +3342,21 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>        kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
>        if (dev_to_host_cpy)
>         {
> -         dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
> -                          mapnum * sizeof (uint64_t), token);
> -         dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
> -                          mapnum * sizeof (uint64_t), token);
> -         dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
> -                          mapnum * sizeof (unsigned short), token);
> +         bool ok = true;
> +         ok = ok && dev_to_host_cpy (devaddrs,
> +                                     (const void *) (uintptr_t) devaddrs_ptr,
> +                                     mapnum * sizeof (uint64_t), token);
> +         ok = ok && dev_to_host_cpy (sizes,
> +                                     (const void *) (uintptr_t) sizes_ptr,
> +                                     mapnum * sizeof (uint64_t), token);
> +         ok = ok && dev_to_host_cpy (kinds,
> +                                     (const void *) (uintptr_t) kinds_ptr,
> +                                     mapnum * sizeof (unsigned short), token);

Why not just
	if (!dev_to_host_cpy (...)
	    || !dev_to_host_cpy (...)
	    || !dev_to_host_cpy (...))
?
	
> +         if (!ok)
> +           {
> +             /*TODO gomp_mutex_unlock (&devicep->lock); */

Why the comment?  That makes no sense, devicep->lock isn't locked here.

>  	    else if (dev_to_host_cpy)
> -	      dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
> -			       (size_t) sizes[i], token);
> +	      {
> +		if (!dev_to_host_cpy (tgt + tgt_size,
> +				      (void *) (uintptr_t) devaddrs[i],
> +				      (size_t) sizes[i], token))
> +		  {
> +		    /*TODO gomp_mutex_unlock (&devicep->lock); */

Neither here.
> @@ -3662,9 +3692,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>  	      case GOMP_MAP_FROM:
>  	      case GOMP_MAP_TOFROM:
>  		if (copy && host_to_dev_cpy)
> -		  host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
> -				   (void *) (uintptr_t) devaddrs[i],
> -				   sizes[i], token);
> +		  {
> +		    if (!host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
> +					  (void *) (uintptr_t) devaddrs[i],
> +					  sizes[i], token))
> +		      {
> +			/*TODO gomp_mutex_unlock (&devicep->lock); */

And neither here.
> +			exit (EXIT_FAILURE);
> +		      }
> +		  }

	Jakub
Tobias Burnus Dec. 15, 2022, 7:42 p.m. UTC | #6
Hi,

I have not yet fully tried to understand it.

(A) Regarding the issue of stalling, see also Andrew's patch and the
discussion about it in

"[PATCH] libgomp: fix hang on fatal error",
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603616.html

and in particular Jakub's two replies.

(b) I think you want to remove this:

On 15.12.22 18:34, Thomas Schwinge wrote:
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1,3 +1,5 @@
> +#pragma GCC optimize "O0"
> +
>   /* Plugin for NVPTX execution.

(c)

> If the libgomp plugin doesn't request special
> 'host_to_dev_cpy'/'dev_to_host_cpy' for 'gomp_target_rev', then standard
> 'gomp_copy_host2dev'/'gomp_copy_dev2host' are used, which use
> 'gomp_device_copy', which expects the device to be locked.  (As can be
> told by the unconditional 'gomp_mutex_unlock (&devicep->lock);' before
> 'gomp_fatal'.)  However, in a number of the
> 'gomp_copy_host2dev'/'gomp_copy_dev2host' calls from 'gomp_target_rev',
> the device definitely is not locked; see the calls adjacent to the TODO

The question is what unlocks the device – it is surely locked in gomp_target_rev by:

   if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM))
     ...
       gomp_mutex_lock (&devicep->lock);
       for (uint64_t i = 0; i < mapnum; i++)
...
         }
       gomp_mutex_unlock (&devicep->lock);
     }

Except for code like:
                 gomp_mutex_unlock (&devicep->lock);
                 gomp_fatal ("gomp_target_rev unhandled kind 0x%.4x", kinds[i]);

The only functions that know about the pointer and get called are those behind
the dev_to_host_cpy and host_to_dev_cpy - thus, they seemingly mess about with the
unlocking?!?

  * * *

Regarding your patch, I do not understand why you call unlock twice and
have three TODO-unlock comments; that does not seem to make any sense.

I think it is worthwhile to understand why plugin-nvptx.c unlocks the lock in
the non-error case - as you observe that it is not locked in the error case.

Additionally, it seems to make more sense to look into a revised patch of
Andrew's patch, your patch looks like a rather bad band aid.

Tobias

Tobias Burnus Dec. 15, 2022, 8:13 p.m. UTC | #7
Hi,

On 15.12.22 20:42, Tobias Burnus wrote:
>> If the libgomp plugin doesn't request special
>> 'host_to_dev_cpy'/'dev_to_host_cpy' for 'gomp_target_rev', then standard
>> 'gomp_copy_host2dev'/'gomp_copy_dev2host' are used, which use
>> 'gomp_device_copy', which expects the device to be locked.  (As can be
>> told by the unconditional 'gomp_mutex_unlock (&devicep->lock);' before
>> 'gomp_fatal'.)  However, in a number of the
>> 'gomp_copy_host2dev'/'gomp_copy_dev2host' calls from 'gomp_target_rev',
>> the device definitely is not locked; see

Actually, reading it + the source code again, I think it makes sense to
return a boolean – similar to devicep->host2dev_func and
devicep->dev2host_func — and possibly wrap it into some convenience
function, similar to gomp_device_copy – at least a bare exit() without
further diagnostic does not seem too user-friendly.

BTW: In line with the other code, you could use CUDA_CALL instead of
CUDA_CALL_ERET; the former already calls the latter with 'false' as first
argument + is used elsewhere.

Regarding the lock: It seems the problem is the copying of
devaddrs/sizes/kinds; this does not need any lock as the stack variables
are on the device and only used for this reverse offload. Thus, there is
no need for a lock as there are no races.

However, as the existing gomp_copy_dev2host removes the lock, we could
simply keep this lock – and probably should move it down to just before
the user-function call – removing all (non-error) locks and unlocks on
the way. — I mean something like the attached patch.

Finally, I think we need to find a solution for the issue Andrew tried
to address. — The current code invokes CUDA_CALL_ASSERT – which calls
GOMP_PLUGIN_fatal.

Tobias
diff mbox series

Patch

libgomp: Handle OpenMP's reverse offloads

This commit enables reverse offload for nvptx such that gomp_target_rev
actually gets called.  It fills in the latter function to do all of
the following: finding the host function for the device function
pointer, copying the arguments to the host, processing the
mapping/firstprivate, calling the host function, copying back the
data, and freeing as needed.

The data handling is made easier by assuming that all host variables
either existed before (and are in the mapping) or are device
variables not yet available on the host.  Thus, the reverse mapping
can do without refcounts etc.  Note that the spec disallows, inside a
target region, device-affecting constructs other than 'target' with
the 'ancestor' device-modifier, and it also limits the clauses
permitted on this construct.

For the function addresses, an additional splay tree is used; for
the lookup of mapped variables, the existing splay-tree is used.
Unfortunately, its data structure requires a full walk of the tree;
additionally, the just-mapped variables are recorded in a separate
data structure, requiring an extra lookup.  While the lookup is slow,
assuming that only a few variables get mapped in each reverse offload
construct and that reverse offload is the exception and not
performance critical, this seems to be acceptable.

libgomp/ChangeLog:

	* libgomp.h (struct target_mem_desc): Predeclare; move
	below after 'reverse_splay_tree_node' and add rev_array
	member.
	(struct reverse_splay_tree_key_s, reverse_splay_compare): New.
	(reverse_splay_tree_node, reverse_splay_tree,
	reverse_splay_tree_key): New typedef.
	(struct gomp_device_descr): Add mem_map_rev member.
	* oacc-host.c (host_dispatch): NULL init .mem_map_rev.
	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices): Claim
	support for GOMP_REQUIRES_REVERSE_OFFLOAD.
	* splay-tree.h (splay_tree_callback_stop): New typedef; like
	splay_tree_callback but returning int not void.
	(splay_tree_foreach_lazy): Define; like splay_tree_foreach but
	taking splay_tree_callback_stop as argument.
	* splay-tree.c (splay_tree_foreach_internal_lazy,
	splay_tree_foreach_lazy): New; but early exit if callback returns
	nonzero.
	* target.c: Instantiate splay_tree_c with splay_tree_prefix 'reverse'.
	(gomp_map_lookup_rev): New.
	(gomp_load_image_to_device): Handle reverse-offload function
	lookup table.
	(gomp_unload_image_from_device): Free devicep->mem_map_rev.
	(struct gomp_splay_tree_rev_lookup_data, gomp_splay_tree_rev_lookup,
	gomp_map_rev_lookup, struct cpy_data, gomp_map_cdata_lookup_int,
	gomp_map_cdata_lookup): New auxiliary structs and functions for
	gomp_target_rev.
	(gomp_target_rev): Implement reverse offloading and its mapping.
	(gomp_target_init): Init current_device.mem_map_rev.root.
	* testsuite/libgomp.fortran/reverse-offload-2.f90: New test.
	* testsuite/libgomp.fortran/reverse-offload-3.f90: New test.
	* testsuite/libgomp.fortran/reverse-offload-4.f90: New test.
	* testsuite/libgomp.fortran/reverse-offload-5.f90: New test.
	* testsuite/libgomp.fortran/reverse-offload-5a.f90: New test without
	mapping of on-device allocated variables.

 libgomp/libgomp.h                                  |  77 ++-
 libgomp/oacc-host.c                                |   1 +
 libgomp/plugin/plugin-nvptx.c                      |   4 +-
 libgomp/splay-tree.c                               |  22 +
 libgomp/splay-tree.h                               |   8 +
 libgomp/target.c                                   | 599 ++++++++++++++++++++-
 .../libgomp.fortran/reverse-offload-2.f90          |  72 +++
 .../libgomp.fortran/reverse-offload-3.f90          |  68 +++
 .../libgomp.fortran/reverse-offload-4.f90          | 129 +++++
 .../libgomp.fortran/reverse-offload-5.f90          | 100 ++++
 .../libgomp.fortran/reverse-offload-5a.f90         |  98 ++++
 11 files changed, 1139 insertions(+), 39 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 5803683afc4..19d7d170091 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1157,29 +1157,7 @@  struct target_var_desc {
   uintptr_t length;
 };
 
-struct target_mem_desc {
-  /* Reference count.  */
-  uintptr_t refcount;
-  /* All the splay nodes allocated together.  */
-  splay_tree_node array;
-  /* Start of the target region.  */
-  uintptr_t tgt_start;
-  /* End of the targer region.  */
-  uintptr_t tgt_end;
-  /* Handle to free.  */
-  void *to_free;
-  /* Previous target_mem_desc.  */
-  struct target_mem_desc *prev;
-  /* Number of items in following list.  */
-  size_t list_count;
-
-  /* Corresponding target device descriptor.  */
-  struct gomp_device_descr *device_descr;
-
-  /* List of target items to remove (or decrease refcount)
-     at the end of region.  */
-  struct target_var_desc list[];
-};
+struct target_mem_desc;
 
 /* Special value for refcount - mask to indicate existence of special
    values. Right now we allocate 3 bits.  */
@@ -1273,6 +1251,58 @@  splay_compare (splay_tree_key x, splay_tree_key y)
 
 #include "splay-tree.h"
 
+/* Reverse offload splay-tree handling (functions only). */
+
+struct reverse_splay_tree_key_s {
+  /* Address of the device object.  */
+  uint64_t dev;
+  splay_tree_key k;
+};
+
+typedef struct reverse_splay_tree_node_s *reverse_splay_tree_node;
+typedef struct reverse_splay_tree_s *reverse_splay_tree;
+typedef struct reverse_splay_tree_key_s *reverse_splay_tree_key;
+
+static inline int
+reverse_splay_compare (reverse_splay_tree_key x, reverse_splay_tree_key y)
+{
+  if (x->dev < y->dev)
+    return -1;
+  if (x->dev > y->dev)
+    return 1;
+  return 0;
+}
+
+#define splay_tree_prefix reverse
+#include "splay-tree.h"
+
+struct target_mem_desc {
+  /* Reference count.  */
+  uintptr_t refcount;
+  /* All the splay nodes allocated together.  */
+  splay_tree_node array;
+  /* Likewise for the reverse lookup (device->host) for reverse offload.  */
+  reverse_splay_tree_node rev_array;
+  /* Start of the target region.  */
+  uintptr_t tgt_start;
+  /* End of the targer region.  */
+  uintptr_t tgt_end;
+  /* Handle to free.  */
+  void *to_free;
+  /* Previous target_mem_desc.  */
+  struct target_mem_desc *prev;
+  /* Number of items in following list.  */
+  size_t list_count;
+
+  /* Corresponding target device descriptor.  */
+  struct gomp_device_descr *device_descr;
+
+  /* List of target items to remove (or decrease refcount)
+     at the end of region.  */
+  struct target_var_desc list[];
+};
+
+
 typedef struct acc_dispatch_t
 {
   /* Execute.  */
@@ -1367,6 +1397,7 @@  struct gomp_device_descr
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;
+  struct reverse_splay_tree_s mem_map_rev;
 
   /* Mutex for the mutable data.  */
   gomp_mutex_t lock;
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 4e3971ae1a9..77a24279e2f 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -284,6 +284,7 @@  static struct gomp_device_descr host_dispatch =
     .run_func = host_run,
 
     .mem_map = { NULL },
+    .mem_map_rev = { NULL },
     /* .lock initialized in goacc_host_init.  */
     .state = GOMP_DEVICE_UNINITIALIZED,
 
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index e803f083591..4f4c25a90ba 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1190,7 +1190,9 @@  GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
      devices were present.  Unified-shared address: see comment in
      nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
   if (num_devices > 0
-      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
+      && ((omp_requires_mask
+	   & ~(GOMP_REQUIRES_UNIFIED_ADDRESS
+	       | GOMP_REQUIRES_REVERSE_OFFLOAD)) != 0))
     return -1;
   return num_devices;
 }
diff --git a/libgomp/splay-tree.c b/libgomp/splay-tree.c
index 2d8742fd19a..3809e90b68d 100644
--- a/libgomp/splay-tree.c
+++ b/libgomp/splay-tree.c
@@ -236,3 +236,25 @@  splay_tree_foreach (splay_tree sp, splay_tree_callback func, void *data)
 {
   splay_tree_foreach_internal (sp->root, func, data);
 }
+
+/* Like above, except when func returns != 0, stop early.  */
+
+static int
+splay_tree_foreach_internal_lazy (splay_tree_node node,
+				  splay_tree_callback_stop func, void *data)
+{
+  if (!node)
+    return 0;
+  if (func (&node->key, data))
+    return 1;
+  if (splay_tree_foreach_internal_lazy (node->left, func, data))
+    return 1;
+  /* Yeah, whatever.  GCC can fix my tail recursion.  */
+  return splay_tree_foreach_internal_lazy (node->right, func, data);
+}
+
+attribute_hidden void
+splay_tree_foreach_lazy (splay_tree sp, splay_tree_callback_stop func, void *data)
+{
+  splay_tree_foreach_internal_lazy (sp->root, func, data);
+}
diff --git a/libgomp/splay-tree.h b/libgomp/splay-tree.h
index 992381c4000..7dea0cc87f7 100644
--- a/libgomp/splay-tree.h
+++ b/libgomp/splay-tree.h
@@ -78,8 +78,12 @@  typedef struct splay_tree_key_s *splay_tree_key;
     splay_tree_name (splay_tree_prefix, splay_tree_remove)
 # define splay_tree_foreach	\
     splay_tree_name (splay_tree_prefix, splay_tree_foreach)
+# define splay_tree_foreach_lazy \
+    splay_tree_name (splay_tree_prefix, splay_tree_foreach_lazy)
 # define splay_tree_callback	\
     splay_tree_name (splay_tree_prefix, splay_tree_callback)
+# define splay_tree_callback_stop	\
+    splay_tree_name (splay_tree_prefix, splay_tree_callback_stop)
 #endif
 
 #ifndef splay_tree_c
@@ -99,11 +103,13 @@  struct splay_tree_s {
 };
 
 typedef void (*splay_tree_callback) (splay_tree_key, void *);
+typedef int (*splay_tree_callback_stop) (splay_tree_key, void *);
 
 extern splay_tree_key splay_tree_lookup (splay_tree, splay_tree_key);
 extern void splay_tree_insert (splay_tree, splay_tree_node);
 extern void splay_tree_remove (splay_tree, splay_tree_key);
 extern void splay_tree_foreach (splay_tree, splay_tree_callback, void *);
+extern void splay_tree_foreach_lazy (splay_tree, splay_tree_callback_stop, void *);
 #else  /* splay_tree_c */
 #  ifdef splay_tree_prefix
 #    include "splay-tree.c"
@@ -125,6 +131,8 @@  extern void splay_tree_foreach (splay_tree, splay_tree_callback, void *);
 #  undef splay_tree_insert
 #  undef splay_tree_remove
 #  undef splay_tree_foreach
+#  undef splay_tree_foreach_lazy
 #  undef splay_tree_callback
+#  undef splay_tree_callback_stop
 #  undef splay_tree_prefix
 #endif
diff --git a/libgomp/target.c b/libgomp/target.c
index 189b201e344..1d8bca9ffe8 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -45,6 +45,12 @@ 
 #include "plugin-suffix.h"
 #endif
 
+/* Define another splay tree instantiation - for reverse offload.  */
+#define splay_tree_prefix reverse
+#define splay_tree_c
+#include "splay-tree.h"
+
+
 typedef uintptr_t *hash_entry_type;
 static inline void * htab_alloc (size_t size) { return gomp_malloc (size); }
 static inline void htab_free (void *ptr) { free (ptr); }
@@ -200,6 +206,12 @@  gomp_map_lookup (splay_tree mem_map, splay_tree_key key)
   return splay_tree_lookup (mem_map, key);
 }
 
+static inline reverse_splay_tree_key
+gomp_map_lookup_rev (reverse_splay_tree mem_map_rev, reverse_splay_tree_key key)
+{
+  return reverse_splay_tree_lookup (mem_map_rev, key);
+}
+
 static inline splay_tree_key
 gomp_map_0len_lookup (splay_tree mem_map, splay_tree_key key)
 {
@@ -1674,7 +1686,7 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
 						 - k->host_start),
 					      sizes[j], cbufp, false);
 			  }
-			}
+		      }
 		    i = j - 1;
 		    break;
 		  case GOMP_MAP_FORCE_PRESENT:
@@ -2192,11 +2204,16 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
 
   /* Load image to device and get target addresses for the image.  */
   struct addr_pair *target_table = NULL;
+  uint64_t *rev_target_fn_table = NULL;
   int i, num_target_entries;
 
+  /* With reverse offload, insert also target-host addresses. */
+  bool rev_lookup = omp_requires_mask & GOMP_REQUIRES_REVERSE_OFFLOAD;
+
   num_target_entries
     = devicep->load_image_func (devicep->target_id, version,
-				target_data, &target_table, NULL);
+				target_data, &target_table,
+				rev_lookup ? &rev_target_fn_table : NULL);
 
   if (num_target_entries != num_funcs + num_vars
       /* "+1" due to the additional ICV struct.  */
@@ -2215,6 +2232,10 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
   /* "+1" due to the additional ICV struct.  */
   tgt->array = gomp_malloc ((num_funcs + num_vars + 1)
 			    * sizeof (*tgt->array));
+  if (rev_target_fn_table)
+    tgt->rev_array = gomp_malloc (num_funcs * sizeof (*tgt->rev_array));
+  else
+    tgt->rev_array = NULL;
   tgt->refcount = REFCOUNT_INFINITY;
   tgt->tgt_start = 0;
   tgt->tgt_end = 0;
@@ -2223,6 +2244,7 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
   tgt->list_count = 0;
   tgt->device_descr = devicep;
   splay_tree_node array = tgt->array;
+  reverse_splay_tree_node rev_array = tgt->rev_array;
 
   for (i = 0; i < num_funcs; i++)
     {
@@ -2237,6 +2259,17 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
+      if (rev_target_fn_table)
+	{
+	  reverse_splay_tree_key k2 = &rev_array->key;
+	  k2->dev = rev_target_fn_table[i];
+	  k2->k = k;
+	  rev_array->left = NULL;
+	  rev_array->right = NULL;
+	  if (k2->dev != 0)
+	    reverse_splay_tree_insert (&devicep->mem_map_rev, rev_array);
+	  rev_array++;
+	}
       array++;
     }
 
@@ -2349,6 +2382,15 @@  gomp_unload_image_from_device (struct gomp_device_descr *devicep,
       gomp_mutex_unlock (&devicep->lock);
       gomp_fatal ("image unload fail");
     }
+  if (devicep->mem_map_rev.root)
+    {
+      /* Free reverse offload splay tree + data; 'tgt->rev_array' is the only
+	 real allocation.  */
+      assert (node && node->tgt && node->tgt->rev_array);
+      assert (devicep->mem_map_rev.root->key.k->tgt == node->tgt);
+      free (node->tgt->rev_array);
+      devicep->mem_map_rev.root = NULL;
+    }
 
   /* Remove mappings from splay tree.  */
   int i;
@@ -2934,23 +2976,549 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+
+/* Reverse lookup (device addr -> host addr) for reverse offload.  We avoid
+   keeping track of all variable mappings - assuming that reverse offload
+   occurs only very rarely.  Downside is that the reverse search is slow.  */
+
+struct gomp_splay_tree_rev_lookup_data {
+  uintptr_t tgt_start;
+  uintptr_t tgt_end;
+  splay_tree_key key;
+};
+
+static int
+gomp_splay_tree_rev_lookup (splay_tree_key key, void *d)
+{
+  struct gomp_splay_tree_rev_lookup_data *data;
+  data = (struct gomp_splay_tree_rev_lookup_data *)d;
+  uintptr_t tgt_start = key->tgt->tgt_start + key->tgt_offset;
+
+  if (tgt_start > data->tgt_start || key->tgt->list_count == 0)
+    return 0;
+
+  size_t j;
+  for (j = 0; j < key->tgt->list_count; j++)
+    if (key->tgt->list[j].key == key)
+      break;
+  assert (j < key->tgt->list_count);
+  uintptr_t tgt_end = tgt_start + key->tgt->list[j].length;
+
+  if ((tgt_start == data->tgt_start && tgt_end == data->tgt_end)
+      || (tgt_end > data->tgt_start && tgt_start < data->tgt_end))
+    {
+      data->key = key;
+      return 1;
+    }
+  return 0;
+}
+
+static inline splay_tree_key
+gomp_map_rev_lookup (splay_tree mem_map, uint64_t tgt_start, uint64_t tgt_end,
+		     bool zero_len)
+{
+  struct gomp_splay_tree_rev_lookup_data data;
+  data.key = NULL;
+  data.tgt_start = tgt_start;
+  data.tgt_end = tgt_end;
+
+  if (tgt_start != tgt_end)
+    {
+      splay_tree_foreach_lazy (mem_map, gomp_splay_tree_rev_lookup, &data);
+      return data.key;
+    }
+
+  data.tgt_end++;
+  splay_tree_foreach_lazy (mem_map, gomp_splay_tree_rev_lookup, &data);
+  if (data.key != NULL || zero_len)
+    return data.key;
+  data.tgt_end--;
+
+  data.tgt_start--;
+  splay_tree_foreach_lazy (mem_map, gomp_splay_tree_rev_lookup, &data);
+  return data.key;
+}
+
+struct cpy_data
+{
+  uint64_t devaddr;
+  bool present, aligned;
+};
+
+
+/* Search the just-mapped reverse-offload data; returns the index if found,
+   otherwise a value >= n.  */
+
+static inline int
+gomp_map_cdata_lookup_int (struct cpy_data *d, uint64_t *devaddrs,
+			   unsigned short *kinds, uint64_t *sizes, size_t n,
+			   uint64_t tgt_start, uint64_t tgt_end)
+{
+  const bool short_mapkind = true;
+  const int typemask = short_mapkind ? 0xff : 0x7;
+  size_t i;
+  for (i = 0; i < n; i++)
+    {
+      bool is_struct = ((get_kind (short_mapkind, kinds, i) & typemask)
+			== GOMP_MAP_STRUCT);
+      uint64_t dev_end;
+      if (!is_struct)
+	dev_end = d[i].devaddr + sizes[i];
+      else
+	{
+	  if (i + sizes[i] < n)
+	    dev_end = d[i + sizes[i]].devaddr + sizes[i + sizes[i]];
+	  else
+	    dev_end = devaddrs[i + sizes[i]] + sizes[i + sizes[i]];
+	}
+      if ((d[i].devaddr == tgt_start && dev_end == tgt_end)
+	  || (dev_end > tgt_start && d[i].devaddr < tgt_end))
+	break;
+      if (is_struct)
+	i += sizes[i];
+    }
+  return i;
+}
+
+static inline int
+gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
+		       unsigned short *kinds, uint64_t *sizes,
+		       size_t n, uint64_t tgt_start, uint64_t tgt_end,
+		       bool zero_len)
+{
+  size_t i;
+  if (tgt_start != tgt_end)
+    return gomp_map_cdata_lookup_int (d, devaddrs, kinds, sizes, n,
+				      tgt_start, tgt_end);
+  tgt_end++;
+  i = gomp_map_cdata_lookup_int (d, devaddrs, kinds, sizes, n,
+				 tgt_start, tgt_end);
+  if (i < n || zero_len)
+    return i;
+  tgt_end--;
+
+  tgt_start--;
+  return gomp_map_cdata_lookup_int (d, devaddrs, kinds, sizes, n,
+				    tgt_start, tgt_end);
+}
+
 /* Handle reverse offload.  This is called by the device plugins for a
-   reverse offload; it is not called if the outer target runs on the host.  */
+   reverse offload; it is not called if the outer target runs on the host.
+   The mapping is simplified: device-affecting constructs (except for target
+   with device(ancestor:1)) must not be encountered; in particular, no
+   target (enter/exit) data.  */
 
 void
-gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
-		 uint64_t mapnum __attribute__ ((unused)),
-		 uint64_t devaddrs_ptr __attribute__ ((unused)),
-		 uint64_t sizes_ptr __attribute__ ((unused)),
-		 uint64_t kinds_ptr __attribute__ ((unused)),
-		 int dev_num __attribute__ ((unused)),
-		 void (*dev_to_host_cpy) (void *, const void *, size_t,
-					  void *) __attribute__ ((unused)),
-		 void (*host_to_dev_cpy) (void *, const void *, size_t,
-					  void *) __attribute__ ((unused)),
-		 void *token __attribute__ ((unused)))
+gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+		 void (*dev_to_host_cpy) (void *, const void *, size_t, void*),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t, void*),
+		 void *token)
 {
-  __builtin_unreachable ();
+  /* Return early if there is no offload code.  */
+  if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
+    return;
+  /* Currently, this fails because of calculate_firstprivate_requirements
+     below; it could be fixed but additional code needs to be updated to
+     handle 32bit hosts - thus, it is not worthwhile.  */
+  if (sizeof (void *) != sizeof (uint64_t))
+    gomp_fatal ("Reverse offload of 32bit hosts not supported.");
+
+  struct cpy_data *cdata = NULL;
+  uint64_t *devaddrs;
+  uint64_t *sizes;
+  unsigned short *kinds;
+  const bool short_mapkind = true;
+  const int typemask = short_mapkind ? 0xff : 0x7;
+  struct gomp_device_descr *devicep = resolve_device (dev_num, false);
+
+  reverse_splay_tree_key n;
+  struct reverse_splay_tree_key_s k;
+  k.dev = fn_ptr;
+
+  gomp_mutex_lock (&devicep->lock);
+  n = gomp_map_lookup_rev (&devicep->mem_map_rev, &k);
+  gomp_mutex_unlock (&devicep->lock);
+
+  if (n == NULL)
+    gomp_fatal ("Cannot find reverse-offload function");
+  void (*host_fn)() = (void (*)()) n->k->host_start;
+
+  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+    {
+      devaddrs = (uint64_t *) (uintptr_t) devaddrs_ptr;
+      sizes = (uint64_t *) (uintptr_t) sizes_ptr;
+      kinds = (unsigned short *) (uintptr_t) kinds_ptr;
+    }
+  else
+    {
+      devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
+      sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
+      kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
+      if (dev_to_host_cpy)
+	{
+	  dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
+			   mapnum * sizeof (uint64_t), token);
+	  dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
+			   mapnum * sizeof (uint64_t), token);
+	  dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
+			   mapnum * sizeof (unsigned short), token);
+	}
+      else
+	{
+	  gomp_copy_dev2host (devicep, NULL, devaddrs,
+			      (const void *) (uintptr_t) devaddrs_ptr,
+			      mapnum * sizeof (uint64_t));
+	  gomp_copy_dev2host (devicep, NULL, sizes,
+			      (const void *) (uintptr_t) sizes_ptr,
+			      mapnum * sizeof (uint64_t));
+	  gomp_copy_dev2host (devicep, NULL, kinds, (const void *) (uintptr_t) kinds_ptr,
+			      mapnum * sizeof (unsigned short));
+	}
+    }
+
+  size_t tgt_align = 0, tgt_size = 0;
+
+  /* If actually executed on 32bit systems, the casts lead to wrong code;
+     but 32bit with offloading is not supported; see top of this function.  */
+  calculate_firstprivate_requirements (mapnum, (void *) (uintptr_t) sizes,
+				       (void *) (uintptr_t) kinds,
+				       &tgt_align, &tgt_size);
+
+  if (tgt_align)
+    {
+      char *tgt = gomp_alloca (tgt_size + tgt_align - 1);
+      uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
+      if (al)
+	tgt += tgt_align - al;
+      tgt_size = 0;
+      for (uint64_t i = 0; i < mapnum; i++)
+	if (get_kind (short_mapkind, kinds, i) == GOMP_MAP_FIRSTPRIVATE
+	    && devaddrs[i] != 0)
+	  {
+	    size_t align = (size_t) 1 << (kinds[i] >> 8);
+	    tgt_size = (tgt_size + align - 1) & ~(align - 1);
+	    if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+	      memcpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
+		      (size_t) sizes[i]);
+	    else if (dev_to_host_cpy)
+	      dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
+			       (size_t) sizes[i], token);
+	    else
+	      gomp_copy_dev2host (devicep, NULL, tgt + tgt_size,
+				  (void *) (uintptr_t) devaddrs[i],
+				  (size_t) sizes[i]);
+	    devaddrs[i] = (uint64_t) (uintptr_t) tgt + tgt_size;
+	    tgt_size = tgt_size + sizes[i];
+	    if ((devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+		&& i + 1 < mapnum
+		&& ((get_kind (short_mapkind, kinds, i) & typemask)
+		    == GOMP_MAP_ATTACH))
+	      {
+		*(uint64_t*) (uintptr_t) (devaddrs[i+1] + sizes[i+1])
+		  = (uint64_t) devaddrs[i];
+		++i;
+	      }
+	  }
+    }
+
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM))
+    {
+      size_t j, struct_cpy = 0;
+      splay_tree_key n2;
+      cdata = gomp_alloca (sizeof (*cdata) * mapnum);
+      memset (cdata, '\0', sizeof (*cdata) * mapnum);
+      gomp_mutex_lock (&devicep->lock);
+      for (uint64_t i = 0; i < mapnum; i++)
+	{
+	  if (devaddrs[i] == 0)
+	    continue;
+	  n = NULL;
+	  int kind = get_kind (short_mapkind, kinds, i) & typemask;
+	  switch (kind)
+	    {
+	      case GOMP_MAP_FIRSTPRIVATE:
+	      case GOMP_MAP_FIRSTPRIVATE_INT:
+		continue;
+
+	      case GOMP_MAP_DELETE:
+	      case GOMP_MAP_RELEASE:
+	      case GOMP_MAP_DELETE_ZERO_LEN_ARRAY_SECTION:
+		/* Assume it is present; look it up - but ignore otherwise. */
+	      case GOMP_MAP_ALLOC:
+	      case GOMP_MAP_FROM:
+	      case GOMP_MAP_FORCE_ALLOC:
+	      case GOMP_MAP_FORCE_FROM:
+	      case GOMP_MAP_ALWAYS_FROM:
+	      case GOMP_MAP_TO:
+	      case GOMP_MAP_TOFROM:
+	      case GOMP_MAP_FORCE_TO:
+	      case GOMP_MAP_FORCE_TOFROM:
+	      case GOMP_MAP_ALWAYS_TO:
+	      case GOMP_MAP_ALWAYS_TOFROM:
+	      case GOMP_MAP_ZERO_LEN_ARRAY_SECTION:
+		cdata[i].devaddr = devaddrs[i];
+		bool zero_len = (kind == GOMP_MAP_DELETE_ZERO_LEN_ARRAY_SECTION
+				 || kind == GOMP_MAP_ZERO_LEN_ARRAY_SECTION);
+		j = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, i,
+					   devaddrs[i],
+					   devaddrs[i] + sizes[i], zero_len);
+		if (j < i)
+		  {
+		    n2 = NULL;
+		    cdata[i].present = true;
+		    devaddrs[i] = devaddrs[j] + devaddrs[i] - cdata[j].devaddr;
+		  }
+		else
+		  {
+		    n2 = gomp_map_rev_lookup (&devicep->mem_map,
+					      devaddrs[i],
+					      devaddrs[i] + sizes[i], zero_len);
+		    cdata[i].present = n2 != NULL;
+		  }
+		if (!cdata[i].present
+		    && kind != GOMP_MAP_DELETE
+		    && kind != GOMP_MAP_RELEASE
+		    && kind != GOMP_MAP_DELETE_ZERO_LEN_ARRAY_SECTION)
+		  {
+		    cdata[i].aligned = true;
+		    size_t align = (size_t) 1 << (kinds[i] >> 8);
+		    devaddrs[i]
+		      = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
+								   sizes[i]);
+		  }
+		else if (n2 != NULL)
+		  devaddrs[i] = (n2->host_start + cdata[i].devaddr
+				 - (n2->tgt->tgt_start + n2->tgt_offset));
+		if (((!cdata[i].present || struct_cpy)
+		     && (kind == GOMP_MAP_TO || kind == GOMP_MAP_TOFROM))
+		    || kind == GOMP_MAP_FORCE_TO
+		    || kind == GOMP_MAP_FORCE_TOFROM
+		    || kind == GOMP_MAP_ALWAYS_TO
+		    || kind == GOMP_MAP_ALWAYS_TOFROM)
+		  {
+		    if (dev_to_host_cpy)
+		      dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
+				       (void *) (uintptr_t) cdata[i].devaddr,
+				       sizes[i], token);
+		    else
+		      gomp_copy_dev2host (devicep, NULL,
+					  (void *) (uintptr_t) devaddrs[i],
+					  (void *) (uintptr_t) cdata[i].devaddr,
+					  sizes[i]);
+		  }
+		if (struct_cpy)
+		  struct_cpy--;
+		break;
+	      case GOMP_MAP_ATTACH:
+	      case GOMP_MAP_POINTER:
+	      case GOMP_MAP_ALWAYS_POINTER:
+		n2 = gomp_map_rev_lookup (&devicep->mem_map,
+					  devaddrs[i] + sizes[i],
+					  devaddrs[i] + sizes[i]
+					  + sizeof (void*), false);
+		cdata[i].present = n2 != NULL;
+		cdata[i].devaddr = devaddrs[i];
+		if (n2)
+		  devaddrs[i] = (n2->host_start + cdata[i].devaddr
+				 - (n2->tgt->tgt_start + n2->tgt_offset));
+		else
+		  {
+		    j = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, i,
+					       devaddrs[i] + sizes[i],
+					       devaddrs[i] + sizes[i]
+					       + sizeof (void*), false);
+		    if (j < i)
+		      {
+			cdata[i].present = true;
+			devaddrs[i] = (devaddrs[j] + devaddrs[i]
+				       - cdata[j].devaddr);
+		      }
+		  }
+		if (!cdata[i].present)
+		  devaddrs[i] = (uintptr_t) gomp_malloc (sizeof (void*));
+		/* Assume that when present, the pointer is already correct.  */
+		if (!n2)
+		  *(uint64_t *) (uintptr_t) (devaddrs[i] + sizes[i])
+		    = devaddrs[i-1];
+		break;
+	      case GOMP_MAP_TO_PSET:
+		/* Assume that when present, the pointers are fine and no 'to:'
+		   is required.  */
+		n2 = gomp_map_rev_lookup (&devicep->mem_map,
+					  devaddrs[i], devaddrs[i] + sizes[i],
+					  false);
+		cdata[i].present = n2 != NULL;
+		cdata[i].devaddr = devaddrs[i];
+		if (n2)
+		  devaddrs[i] = (n2->host_start + cdata[i].devaddr
+				 - (n2->tgt->tgt_start + n2->tgt_offset));
+		else
+		  {
+		    j = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, i,
+					       devaddrs[i],
+					       devaddrs[i] + sizes[i], false);
+		    if (j < i)
+		      {
+			cdata[i].present = true;
+			devaddrs[i] = (devaddrs[j] + devaddrs[i]
+				       - cdata[j].devaddr);
+		      }
+		  }
+		if (!cdata[i].present)
+		  {
+		    cdata[i].aligned = true;
+		    size_t align = (size_t) 1 << (kinds[i] >> 8);
+		    devaddrs[i]
+		      = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
+								   sizes[i]);
+		    if (dev_to_host_cpy)
+		      dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
+				       (void *) (uintptr_t) cdata[i].devaddr,
+				       sizes[i], token);
+		    else
+		      gomp_copy_dev2host (devicep, NULL,
+					  (void *) (uintptr_t) devaddrs[i],
+					  (void *) (uintptr_t) cdata[i].devaddr,
+					  sizes[i]);
+		  }
+		for (j = i + 1; j < mapnum; j++)
+		  {
+		    kind = get_kind (short_mapkind, kinds, j) & typemask;
+		    if (!GOMP_MAP_ALWAYS_POINTER_P (kind)
+			&& !GOMP_MAP_POINTER_P (kind))
+		      break;
+		    if (devaddrs[j] < devaddrs[i])
+		      break;
+		    if (cdata[i].present)
+		      continue;
+		    if (devaddrs[j] == 0)
+		      {
+			*(uint64_t *) (uintptr_t) (devaddrs[i] + sizes[j]) = 0;
+			continue;
+		      }
+		    int k;
+		    n2 = NULL;
+		    cdata[i].present = true;
+		    cdata[j].devaddr = devaddrs[j];
+		    k = gomp_map_cdata_lookup (cdata, devaddrs, kinds, sizes, j,
+					       devaddrs[j],
+					       devaddrs[j] + sizeof (void*),
+					       false);
+		    if (k < j)
+		      devaddrs[j] = (devaddrs[k] + devaddrs[j]
+				     - cdata[k].devaddr);
+		    else
+		      {
+			n2 = gomp_map_rev_lookup (&devicep->mem_map,
+						  devaddrs[j],
+						  devaddrs[j] + sizeof (void*),
+						  false);
+			if (n2 == NULL)
+			  {
+			    gomp_mutex_unlock (&devicep->lock);
+			    gomp_fatal ("Pointer target wasn't mapped");
+			  }
+			devaddrs[j] = (n2->host_start + cdata[j].devaddr
+				       - (n2->tgt->tgt_start + n2->tgt_offset));
+		      }
+		    *(void **) (uintptr_t) (devaddrs[i] + sizes[j])
+		      = (void *) (uintptr_t) devaddrs[j];
+		  }
+		i = j - 1;
+		break;
+	      case GOMP_MAP_STRUCT:
+		n2 = gomp_map_rev_lookup (&devicep->mem_map, devaddrs[i+1],
+					  devaddrs[i + sizes[i]]
+					  + sizes[i + sizes[i]], false);
+		cdata[i].present = n2 != NULL;
+		cdata[i].devaddr = devaddrs[i];
+		struct_cpy = cdata[i].present ? 0 : sizes[i];
+		if (!n2)
+		  {
+		    size_t sz = (size_t) (devaddrs[i + sizes[i]]
+					  - devaddrs[i+1]
+					  + sizes[i + sizes[i]]);
+		    size_t align = (size_t) 1 << (kinds[i] >> 8);
+		    cdata[i].aligned = true;
+		    devaddrs[i] = (uintptr_t) gomp_aligned_alloc (align, sz);
+		    devaddrs[i] -= devaddrs[i+1] - cdata[i].devaddr;
+		  }
+		else
+		  devaddrs[i] = (n2->host_start + cdata[i].devaddr
+				 - (n2->tgt->tgt_start + n2->tgt_offset));
+		break;
+	      default:
+		gomp_mutex_unlock (&devicep->lock);
+		gomp_fatal ("gomp_target_rev unhandled kind 0x%.4x", kinds[i]);
+	    }
+	}
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  host_fn (devaddrs);
+
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM))
+    {
+      uint64_t struct_cpy = 0;
+      bool clean_struct = false;
+      for (uint64_t i = 0; i < mapnum; i++)
+	{
+	  if (cdata[i].devaddr == 0)
+	    continue;
+	  int kind = get_kind (short_mapkind, kinds, i) & typemask;
+	  bool copy = !cdata[i].present || struct_cpy;
+	  switch (kind)
+	    {
+	      case GOMP_MAP_FORCE_FROM:
+	      case GOMP_MAP_FORCE_TOFROM:
+	      case GOMP_MAP_ALWAYS_FROM:
+	      case GOMP_MAP_ALWAYS_TOFROM:
+		copy = true;
+		/* FALLTHRU */
+	      case GOMP_MAP_FROM:
+	      case GOMP_MAP_TOFROM:
+		if (copy && host_to_dev_cpy)
+		  host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
+				   (void *) (uintptr_t) devaddrs[i],
+				   sizes[i], token);
+		else if (copy)
+		  gomp_copy_host2dev (devicep, NULL,
+				      (void *) (uintptr_t) cdata[i].devaddr,
+				      (void *) (uintptr_t) devaddrs[i],
+				      sizes[i], false, NULL);
+	      default:
+		break;
+	    }
+	  if (struct_cpy)
+	    {
+	      struct_cpy--;
+	      continue;
+	    }
+	  if (kind == GOMP_MAP_STRUCT && !cdata[i].present)
+	    {
+	      clean_struct = true;
+	      struct_cpy = sizes[i];
+	    }
+	  else if (cdata[i].aligned)
+	    gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]);
+	  else if (!cdata[i].present)
+	    free ((void *) (uintptr_t) devaddrs[i]);
+	}
+      if (clean_struct)
+	for (uint64_t i = 0; i < mapnum; i++)
+	  if (!cdata[i].present
+	      && ((get_kind (short_mapkind, kinds, i) & typemask)
+		  == GOMP_MAP_STRUCT))
+	    {
+	      devaddrs[i] += cdata[i+1].devaddr - cdata[i].devaddr;
+	      gomp_aligned_free ((void *) (uintptr_t) devaddrs[i]);
+	    }
+
+      free (devaddrs);
+      free (sizes);
+      free (kinds);
+    }
 }
 
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
@@ -4330,6 +4898,7 @@  gomp_target_init (void)
 		/* current_device.capabilities has already been set.  */
 		current_device.type = current_device.get_type_func ();
 		current_device.mem_map.root = NULL;
+		current_device.mem_map_rev.root = NULL;
 		current_device.state = GOMP_DEVICE_UNINITIALIZED;
 		for (i = 0; i < new_num_devs; i++)
 		  {
diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-2.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-2.f90
new file mode 100644
index 00000000000..067639bc67e
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-2.f90
@@ -0,0 +1,72 @@ 
+! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
+
+implicit none
+!$omp requires reverse_offload
+integer :: A(50), A2(50)
+integer :: i, error
+logical :: shared_mem
+
+error = 0
+shared_mem = .false.
+A  = [(i, i=1,50)]
+A2 = [(i, i=1,50)]
+
+!$omp target map(to: shared_mem)
+  shared_mem = .true.
+!$omp end target
+
+!$omp target map(to: A(20:40), A2(20:40)) map(from: error)
+block
+  integer :: B(10), C(10)
+  B = 99
+  C = 88
+  A(20:40)  = -A(20:40)
+  A2(20:40) = -A2(20:40)
+
+  !$omp target device (ancestor:1)  &
+  !$omp&       map(to: A(25:35)) map(always, to: A2(25:35))  &
+  !$omp&       map(from:B(4:8)) map(tofrom:C(4:8))
+     if (shared_mem) then
+       if (any (A(25:35)  /= [(-i,i=25,35)])) stop 20
+     else
+       if (any (A(25:35)  /= [( i,i=25,35)])) stop 21
+     end if
+    if (any (A2(25:35) /= [(-i,i=25,35)])) stop 22
+    if (any (C(4:8) /= 88)) stop 23
+
+    A(25:35) = -A(25:35)*10
+    A2(25:35) = -A2(25:35)*10
+    B(4:8) = [4,5,6,7,8]
+    C(4:8) = [-4,-5,-6,-7,-8]
+  !$omp end target
+
+  if (any (B(1:3) /= 99) .or. any (B(9:10) /= 99)) then
+    error = 30
+  elseif (any (B(4:8) /= [4,5,6,7,8])) then
+    error = 31
+  elseif (any (C(1:3) /= 88) .or. any (C(9:10) /= 88)) then
+    error = 32
+  elseif (any (C(4:8) /= [-4,-5,-6,-7,-8])) then
+    error = 33
+  else
+    error = 0
+  endif
+end block
+
+if (error /= 0) stop error
+
+if (shared_mem) then
+  if (any (A(1:19)  /= [( i, i=1,19)])) stop 1
+  if (any (A(20:24) /= [(-i, i=20,24)])) stop 2
+  if (any (A(36:40) /= [(-i, i=36,40)])) stop 3
+  if (any (A(41:50) /= [( i, i=41,50)])) stop 4
+
+  if (any (A(25:35) /= [( 10*i, i=25,35)])) stop 5
+else
+  if (any (A(1:24)  /= [( i, i=1,24)])) stop 6
+  if (any (A(36:50) /= [( i, i=36,50)])) stop 7
+
+  if (any (A(25:35) /= [(-10*i, i=25,35)])) stop 8
+end if
+if (any (A2(25:35) /= [( 10*i, i=25,35)])) stop 9
+end
diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-3.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-3.f90
new file mode 100644
index 00000000000..2fd2f5b6589
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-3.f90
@@ -0,0 +1,68 @@ 
+! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
+
+implicit none
+!$omp requires reverse_offload
+integer :: A(50), A2(50), A3(50)
+integer :: i
+logical :: shared_mem
+
+shared_mem = .false.
+A  = [(3*i, i=1,50)]
+A2 = [(7*i, i=1,50)]
+A3 = [(11*i, i=1,50)]
+
+!$omp target map(to: shared_mem)
+  shared_mem = .true.
+!$omp end target
+
+!$omp target map(to: A(20:40), A2(20:40), A3(20:40))
+block
+  integer :: C(10)
+  C = 88
+  A(20:40)  = -2*A(20:40)
+  A2(20:40) = -9*A2(20:40)
+  A3(20:40) = -13*A3(20:40)
+
+  !$omp target device (ancestor:1)  &
+  !$omp&       map(from: A(25:35)) map(always, from: A2(25:35))  &
+  !$omp&       map(alloc: A3(25:35)) map(alloc:C(4:8))
+    if (shared_mem) then
+      if (any (A(25:35)  /= [(-2*3*i, i=25,35)])) stop 1
+      if (any (A2(25:35) /= [(-9*7*i, i=25,35)])) stop 2
+      if (any (A3(25:35) /= [(-13*11*i, i=25,35)])) stop 3
+    else
+      if (any (A(25:35)  /= [(3*i, i=25,35)])) stop 4
+      if (any (A2(25:35) /= [(7*i, i=25,35)])) stop 5
+      if (any (A3(25:35) /= [(11*i, i=25,35)])) stop 6
+    end if
+
+    A(25:35) = A(25:35)*5
+    A2(25:35) = A2(25:35)*8
+    A3(25:35) = A3(25:35)*18
+    C(4:8) = [4,5,6,7,8]
+  !$omp end target
+
+  if (shared_mem) then
+    if (any (A(25:35)  /= [(-2*3*5*i, i=25,35)])) stop 7
+    if (any (A2(25:35) /= [(-9*7*8*i, i=25,35)])) stop 8
+    if (any (A3(25:35) /= [(-13*11*18*i, i=25,35)])) stop 9
+    if (any (C(4:8) /= [4,5,6,7,8])) stop 10
+  else
+    if (any (A(25:35)  /= [(-2*3*i, i=25,35)])) stop 11
+    if (any (A2(25:35) /= [(7*8*i, i=25,35)])) stop 12
+    if (any (A3(25:35) /= [(-13*11*i, i=25,35)])) stop 13
+    if (any (C(4:8) /= 88)) stop 14
+  end if
+end block
+
+if (shared_mem) then
+  if (any (A(25:35)  /= [(-2*3*5*i, i=25,35)])) stop
+  if (any (A2(25:35) /= [(-9*7*8*i, i=25,35)])) stop
+  if (any (A3(25:35) /= [(-13*11*18*i, i=25,35)])) stop
+else
+  if (any (A(25:35)  /= [(3*5*i, i=25,35)])) stop
+  if (any (A2(25:35) /= [(7*8*i, i=25,35)])) stop
+  if (any (A3(25:35) /= [(11*18*i, i=25,35)])) stop
+end if
+
+end
diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-4.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-4.f90
new file mode 100644
index 00000000000..fb27aa73eba
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-4.f90
@@ -0,0 +1,129 @@ 
+! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
+
+implicit none
+!$omp requires reverse_offload
+
+type t2
+  integer :: a, b, c
+end type t2
+
+type t
+ integer :: A(5), B(5), C(5)
+ integer, pointer :: P(:), P2 !Just some padding
+ type(t2) :: tt !Just some padding
+end type t
+
+type(t) :: S1, S2
+logical :: shared_mem
+
+shared_mem = .false.
+
+!$omp target map(to: shared_mem)
+  shared_mem = .true.
+!$omp end target
+
+s1%A = [1,2,3,4,5]
+s1%B = [10,20,30,40,50]
+s1%C = [11,22,33,44,55]
+s2%A = 2*s1%A
+s2%B = 2*s1%B
+s2%C = 2*s1%C
+
+!$omp target &
+!$omp&       map(to: s1%b, s1%c) &
+!$omp&       map(to: s2%b, s2%c)
+block
+  type(t) :: si1, si2, si3, si4
+
+  s1%B = -10 * s1%B
+  s1%C = -10 * s1%C
+  s2%B = -15 * s2%B
+  s2%C = -15 * s2%C
+
+  si1%A = -1 * [1,2,3,4,5]
+  si1%B = -1 * [10,20,30,40,50]
+  si1%C = -1 * [11,22,33,44,55]
+  si2%A = -23 * [1,2,3,4,5]
+  si2%B = -23 * [10,20,30,40,50]
+  si2%C = -23 * [11,22,33,44,55]
+
+  !$omp target device (ancestor:1)  &
+  !$omp&       map(to: si1%C, si1%B)  &
+  !$omp&       map(tofrom: si2%C, si2%B)  &
+  !$omp&       map(always, to: s1%B)  &
+  !$omp&       map(        to: s2%B)
+    if (any (s1%A /=       [1,2,3,4,5])) stop 1
+    if (any (s1%B /= -10 * [10,20,30,40,50])) stop 2
+    if (shared_mem) then
+      if (any (s1%C /= -10 * [11,22,33,44,55])) stop 4
+    else
+      if (any (s1%C /=       [11,22,33,44,55])) stop 3
+    endif
+    if (any (s2%A /=   2 * [1,2,3,4,5])) stop 4
+    if (shared_mem) then
+      if (any (s2%B /= -15 * 2 * [10,20,30,40,50])) stop 5
+      if (any (s2%C /= -15 * 2 * [11,22,33,44,55])) stop 6
+    else
+      if (any (s2%B /=   2 * [10,20,30,40,50])) stop 7
+      if (any (s2%C /=   2 * [11,22,33,44,55])) stop 8
+    endif
+    if (any (si1%B /= -1 * [10,20,30,40,50])) stop 9
+    if (any (si1%C /= -1 * [11,22,33,44,55])) stop 10
+    if (any (si2%B /= -23 * [10,20,30,40,50])) stop 10
+    if (any (si2%C /= -23 * [11,22,33,44,55])) stop 11
+
+    s1%A = 5 * s1%A
+    s1%B = 7 * s1%B
+    s1%C = 13 * s1%C
+    s2%A = 9 * s2%A
+    s2%B = 21 * s2%B
+    s2%C = 31 * s2%C
+    si1%B = -11 * si1%B
+    si1%C = -13 * si1%C
+    si2%B = -27 * si2%B
+    si2%C = -29 * si2%C
+  !$omp end target
+
+  if (shared_mem) then
+    if (any (s1%B /= -10 * 7 * [10,20,30,40,50])) stop 20
+    if (any (s1%C /= -10 * 13 * [11,22,33,44,55])) stop 21
+  else
+    if (any (s1%B /= -10 * [10,20,30,40,50])) stop 22
+    if (any (s1%C /= -10 * [11,22,33,44,55])) stop 23
+  endif
+  if (shared_mem) then
+    if (any (s2%B /= -15 * 2 * 21 * [10,20,30,40,50])) stop 24
+    if (any (s2%C /= -15 * 2 * 31 * [11,22,33,44,55])) stop 25
+  else
+    if (any (s2%B /= -15 * 2 * [10,20,30,40,50])) stop 26
+    if (any (s2%C /= -15 * 2 * [11,22,33,44,55])) stop 27
+  endif
+  if (any (si1%A /= -1 * [1,2,3,4,5])) stop 28
+  if (shared_mem) then
+    if (any (si1%B /= -1 * (-11) * [10,20,30,40,50])) stop 29
+    if (any (si1%C /= -1 * (-13) * [11,22,33,44,55])) stop 30
+  else
+    if (any (si1%B /= -1 * [10,20,30,40,50])) stop 31
+    if (any (si1%C /= -1 * [11,22,33,44,55])) stop 32
+  endif
+  if (any (si2%A /= -23 * [1,2,3,4,5])) stop 33
+  if (any (si2%B /= -23 * (-27) * [10,20,30,40,50])) stop 34
+  if (any (si2%C /= -23 * (-29) * [11,22,33,44,55])) stop 35
+end block
+
+if (any (s1%A /=       5 * [1,2,3,4,5])) stop 40
+if (any (s1%B /= -10 * 7 * [10,20,30,40,50])) stop 41
+if (shared_mem) then
+  if (any (s1%C /= -10 * 13 * [11,22,33,44,55])) stop 42
+else
+  if (any (s1%C /= 13 *    [11,22,33,44,55])) stop 43
+endif
+if (any (s2%A /=   2 * 9 * [1,2,3,4,5])) stop 44
+if (shared_mem) then
+  if (any (s2%B /= -15 * 2 * 21 * [10,20,30,40,50])) stop 45
+  if (any (s2%C /= -15 * 2 * 31 * [11,22,33,44,55])) stop 46
+else
+  if (any (s2%B /=   2 * 21 * [10,20,30,40,50])) stop 47
+  if (any (s2%C /=   2 * 31 * [11,22,33,44,55])) stop 48
+endif
+end
diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-5.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-5.f90
new file mode 100644
index 00000000000..ef7eb7bdd52
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-5.f90
@@ -0,0 +1,100 @@ 
+! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
+! { dg-xfail-run-if "Copying on-device allocated memory fails with cuMemcpyDtoHAsync error: invalid argument" { offload_device_nvptx } }
+
+! Because of the nvptx fail, a non-device alloc version has been
+! created: reverse-offload-5a.f90
+
+implicit none
+!$omp requires reverse_offload
+
+integer, allocatable :: A(:), A2(:), s1, s2
+integer :: i
+logical :: shared_mem
+
+shared_mem = .false.
+
+a = [1,2,3,4]
+a2 = [8,7,6,5]
+s1 = 532
+s2 = 55
+
+!$omp target map(to: shared_mem)
+  shared_mem = .true.
+!$omp end target
+
+!$omp target map(to: A, A2, s1, s2)
+block
+  integer, allocatable :: ai(:), ai2(:), si1, si2
+
+  a = a * 2
+  a2 = a2 * 3
+  s1 = s1 * 4
+  s2 = s2 * 5
+
+  ai = [23,35,86,43]
+  ai2 = [8,4,7,1]
+  si1 = 64
+  si2 = 765
+
+  !$omp target device (ancestor:1)  &
+  !$omp&       map(to: A, s1, ai, si1) map(always, to: a2, s2)  &
+  !$omp&       map(tofrom: ai2, si2)
+    if (shared_mem) then
+      if (any (a  /= 2 * [1,2,3,4])) stop 1
+      if (s1 /= 4 * 532) stop 2
+    else
+      if (any (a  /=     [1,2,3,4])) stop 3
+      if (s1 /=     532) stop 4
+    endif
+    if (any (a2 /= 3 * [8,7,6,5])) stop 5
+    if (s2 /= 5 * 55) stop 6
+    if (any (ai /= [23,35,86,43])) stop 7
+    if (any (ai2 /= [8,4,7,1])) stop 8
+    if (si1 /= 64) stop 9
+    if (si2 /= 765) stop 10
+
+    a = a*3
+    a2 = a2*7
+    s1 = s1*11
+    s2 = s2*5
+    ai = ai*13
+    ai2 = ai2*21
+    si1 = si1*27
+    si2 = si2*31
+  !$omp end target
+
+  if (shared_mem) then
+    if (any (a  /= 3 * 2 * [1,2,3,4])) stop 11
+    if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 12
+    if (s1 /= 11 * 4 * 532) stop 13
+    if (s2 /= 5 * 5 * 55) stop 14
+    if (any (ai /= 13 * [23,35,86,43])) stop 15
+    if (si1 /= 27 * 64) stop 16
+  else
+    if (any (a  /= 2 * [1,2,3,4])) stop 17
+    if (any (a2 /= 3 * [8,7,6,5])) stop 18
+    if (s1 /= 4 * 532) stop 19
+    if (s2 /= 5 * 55) stop 20
+    if (any (ai /= [23,35,86,43])) stop 22
+    if (si1 /= 64) stop 23
+  endif
+  if (any (ai2 /= 21 * [8,4,7,1])) stop 24
+  if (si2 /= 31 * 765) stop 25
+
+  deallocate (ai, ai2, si1, si2)
+end block
+
+if (shared_mem) then
+  if (any (a  /= 3 * 2 * [1,2,3,4])) stop 30
+  if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 31
+  if (s1 /= 11 * 4 * 532) stop 32
+  if (s2 /= 5 * 5 * 55) stop 33
+else
+  if (any (a  /= 3 * [1,2,3,4])) stop 34
+  if (any (a2 /= 3 * 7 * [8,7,6,5])) stop 35
+  if (s1 /= 11 * 532) stop 36
+  if (s2 /= 5 * 5 * 55) stop 37
+endif
+
+deallocate (a, a2, s1, s2)
+end
diff --git a/libgomp/testsuite/libgomp.fortran/reverse-offload-5a.f90 b/libgomp/testsuite/libgomp.fortran/reverse-offload-5a.f90
new file mode 100644
index 00000000000..914d10d8144
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/reverse-offload-5a.f90
@@ -0,0 +1,98 @@ 
+! { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } }
+
+! Copying on-device allocated memory fails with a cuMemcpyDtoHAsync error: invalid argument.
+! Hence, reverse-offload-5.f90 has been copied to *this* file,
+! which keeps the host-allocated variables but uses only stack variables on the device side.
+
+implicit none
+!$omp requires reverse_offload
+
+integer, allocatable :: A(:), A2(:), s1, s2
+integer :: i,D(4)
+logical :: shared_mem
+
+shared_mem = .false.
+
+a = [1,2,3,4]
+a2 = [8,7,6,5]
+s1 = 532
+s2 = 55
+
+!$omp target map(to: shared_mem)
+  shared_mem = .true.
+!$omp end target
+
+!$omp target map(to: A, A2, s1, s2)
+block
+  ! stack variables:
+  integer :: ai(4), ai2(4), si1, si2
+
+  a = a * 2
+  a2 = a2 * 3
+  s1 = s1 * 4
+  s2 = s2 * 5
+
+  ai = [23,35,86,43]
+  ai2 = [8,4,7,1]
+  si1 = 64
+  si2 = 765
+
+  !$omp target device (ancestor:1)  &
+  !$omp&       map(to: A, s1, ai, si1) map(always, to: a2, s2)  &
+  !$omp&       map(tofrom: ai2, si2)
+    if (shared_mem) then
+      if (any (a  /= 2 * [1,2,3,4])) stop 1
+      if (s1 /= 4 * 532) stop 2
+    else
+      if (any (a  /=     [1,2,3,4])) stop 3
+      if (s1 /=     532) stop 4
+    endif
+    if (any (a2 /= 3 * [8,7,6,5])) stop 5
+    if (s2 /= 5 * 55) stop 6
+    if (any (ai /= [23,35,86,43])) stop 7
+    if (any (ai2 /= [8,4,7,1])) stop 8
+    if (si1 /= 64) stop 9
+    if (si2 /= 765) stop 10
+
+    a = a*3
+    a2 = a2*7
+    s1 = s1*11
+    s2 = s2*5
+    ai = ai*13
+    ai2 = ai2*21
+    si1 = si1*27
+    si2 = si2*31
+  !$omp end target
+
+  if (shared_mem) then
+    if (any (a  /= 3 * 2 * [1,2,3,4])) stop 11
+    if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 12
+    if (s1 /= 11 * 4 * 532) stop 13
+    if (s2 /= 5 * 5 * 55) stop 14
+    if (any (ai /= 13 * [23,35,86,43])) stop 15
+    if (si1 /= 27 * 64) stop 16
+  else
+    if (any (a  /= 2 * [1,2,3,4])) stop 17
+    if (any (a2 /= 3 * [8,7,6,5])) stop 18
+    if (s1 /= 4 * 532) stop 19
+    if (s2 /= 5 * 55) stop 20
+    if (any (ai /= [23,35,86,43])) stop 22
+    if (si1 /= 64) stop 23
+  endif
+  if (any (ai2 /= 21 * [8,4,7,1])) stop 24
+  if (si2 /= 31 * 765) stop 25
+end block
+!$omp end target
+if (shared_mem) then
+  if (any (a  /= 3 * 2 * [1,2,3,4])) stop 30
+  if (any (a2 /= 7 * 3 * [8,7,6,5])) stop 31
+  if (s1 /= 11 * 4 * 532) stop 32
+  if (s2 /= 5 * 5 * 55) stop 33
+else
+  if (any (a  /= 3 * [1,2,3,4])) stop 34
+  if (any (a2 /= 3 * 7 * [8,7,6,5])) stop 35
+  if (s1 /= 11 * 532) stop 36
+  if (s2 /= 5 * 5 * 55) stop 37
+endif
+
+deallocate (a, a2, s1, s2)
+end