diff mbox

OpenACC async clause regressions (was: [gomp4.1] Add new versions of GOMP_target{,_data,_update} and GOMP_target_enter_exit_data)

Message ID 87pp0aaksc.fsf@kepler.schwinge.homeip.net
State New
Headers show

Commit Message

Thomas Schwinge Oct. 19, 2015, 4:24 p.m. UTC
Hi!

Chung-Lin, would you please have a look at the following (on
gomp-4_0-branch)?  Also, anyone else got any ideas off-hand?

On Tue, 23 Jun 2015 13:51:39 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 23, 2015 at 02:40:43PM +0300, Ilya Verbin wrote:
> > On Sat, Jun 20, 2015 at 00:35:14 +0300, Ilya Verbin wrote:
> > > Given that a mapped variable in 4.1 can have different kinds across nested data
> > > regions, we need to store map-type not only for each var, but also for each
> > > structured mapping.  Here is my WIP patch, is it sane? :)
> > > Attached testcase works OK on the device with non-shared memory.
> > 
> > A bit updated version with a fix for GOMP_MAP_TO_PSET.
> > make check-target-libgomp passed.
> 
> Ok, thanks.
> 
> > include/gcc/
> > 	* gomp-constants.h (GOMP_MAP_ALWAYS_TO_P,
> > 	GOMP_MAP_ALWAYS_FROM_P): Define.
> > libgomp/
> > 	* libgomp.h (struct target_var_desc): New.
> > 	(struct target_mem_desc): Replace array of splay_tree_key with array of
> > 	target_var_desc.
> > 	(struct splay_tree_key_s): Move copy_from to target_var_desc.
> > 	* oacc-mem.c (gomp_acc_remove_pointer): Use copy_from from
> > 	target_var_desc.
> > 	* oacc-parallel.c (GOACC_parallel): Use copy_from from target_var_desc.
> > 	* target.c (gomp_map_vars_existing): Copy data to device if map-type is
> > 	'always to' or 'always tofrom'.
> > 	(gomp_map_vars): Use key from target_var_desc.  Set copy_from and
> > 	always_copy_from.
> > 	(gomp_copy_from_async): Use key and copy_from from target_var_desc.
> > 	(gomp_unmap_vars): Copy data from device if always_copy_from is set.
> > 	(gomp_offload_image_to_device): Do not use copy_from.
> > 	* testsuite/libgomp.c/target-11.c: New test.

(That's gomp-4_1-branch r224838.  The attached
gomp-4_1-branch-r224838.patch is a variant that applies on top of
gomp-4_0-branch r228972.)  This change introduces regressions in OpenACC
async clause handling.

Testing on gomp-4_1-branch r224838:

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test

Same for C++.

Testing on gomp-4_0-branch r228972 plus the attached
gomp-4_1-branch-r224838.patch:

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/asyncwait-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/asyncwait-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none execution test

Same for C++.

As I mentioned in
<http://news.gmane.org/find-root.php?message_id=%3C87lhb3b11q.fsf%40kepler.schwinge.homeip.net%3E>,
all three regressions are visible when testing on trunk r228777.  I have
not analyzed why the three different branches show different sets of
regressions -- I'm hoping they're all manifestations of the same
underlying problem: they're all using the OpenACC async clause.

Looking at gomp-4_0-branch r228972 plus the attached
gomp-4_1-branch-r224838.patch, clearly there is "some kind of data
corruption":

    $ gdb -q a.out 
    Reading symbols from a.out...done.
    (gdb) start
    [...]
    25          a = (float *) malloc (nbytes);
    (gdb) n
    26          b = (float *) malloc (nbytes);
    (gdb) print a
    $1 = (float *) 0xab12c0
    (gdb) c
    Continuing.
    
    Program received signal SIGSEGV, Segmentation fault.
    0x00000000004015d2 in main (argc=1, argv=0x7fffffffd408) at source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c:133
    133             if (a[i] != 3.0)
    (gdb) print a
    $2 = (float *) 0x500680620

0x500680620 looks like a nvptx device pointer to me, which is a) wrong
(after the "malloc", "a" shouldn't change its value throughout program
execution), and b) that "explains" the segmentation fault (device pointer
dereferenced in host code).

So, maybe data is erroneously being copied back to the host from device,
or from libgomp internal data structures.  Maybe some copy_from flag
handling needs to be adjusted or added in the OpenACC code in libgomp?


I have no idea whether that's related, but I noticed that currently we're
not in any way handling async_refcount in libgomp/oacc-*.c -- do we have
to?  (Its name certainly makes me believe it's related to asynchronous
data (un-)mapping.)  Should we be able to drop some of the
OpenACC-specific async implementation in libgomp, and use new/generic
target.c code instead?


Please note that there will be further libgomp changes (target.c, and
other files) coming in later merges from gomp-4_1-branch, so please for
now just work on identifying/resolving the regression, and let any code
refactoring wait for later.


Grüße
 Thomas

Comments

Ilya Verbin Oct. 19, 2015, 4:43 p.m. UTC | #1
On Mon, Oct 19, 2015 at 18:24:35 +0200, Thomas Schwinge wrote:
> Chung-Lin, would you please have a look at the following (on
> gomp-4_0-branch)?  Also, anyone else got any ideas off-hand?
> 
>     PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 (test for excess errors)
>     [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
>     PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 (test for excess errors)
>     [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test

Maybe it was caused by this change in gomp_unmap_vars?
https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01376.html

Looking at the code, I don't see any difference in async_refcount handling, but
I was unable to test it without having hardware :(

  -- Ilya
Jakub Jelinek Oct. 20, 2015, 7:35 a.m. UTC | #2
On Mon, Oct 19, 2015 at 07:43:59PM +0300, Ilya Verbin wrote:
> On Mon, Oct 19, 2015 at 18:24:35 +0200, Thomas Schwinge wrote:
> > Chung-Lin, would you please have a look at the following (on
> > gomp-4_0-branch)?  Also, anyone else got any ideas off-hand?
> > 
> >     PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 (test for excess errors)
> >     [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> >     PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 (test for excess errors)
> >     [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/data-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> 
> Maybe it was caused by this change in gomp_unmap_vars?
> https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01376.html
> 
> Looking at the code, I don't see any difference in async_refcount handling, but
> I was unable to test it without having hardware :(

I think that is the only patch that could have affected it.
The copy_from change is from the old behavior, where basically all
concurrent mappings ored into the copy_from flag and when refcount went to
0, if there were any mappings with from or tofrom, it copied back,
the OpenMP 4.5 behavior is that whether data is copied from the device
is determined solely by the mapping kind of the mapping that performs the
refcount decrease to 0.  Plus there is the always flag which requests
the data copying operation always, no matter what the refcount is (either on
the mapping/refcount increase side, or unmapping/refcount decrease size).

	Jakub
Thomas Schwinge Oct. 22, 2015, 6:27 p.m. UTC | #3
Hi!

On Mon, 19 Oct 2015 18:24:35 +0200, I wrote:
> Chung-Lin, would you please have a look at the following (on
> gomp-4_0-branch)?  Also, anyone else got any ideas off-hand?

Ilya, Jakub, thanks for your comments!

> On Tue, 23 Jun 2015 13:51:39 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Tue, Jun 23, 2015 at 02:40:43PM +0300, Ilya Verbin wrote:
> > > On Sat, Jun 20, 2015 at 00:35:14 +0300, Ilya Verbin wrote:
> > > > Given that a mapped variable in 4.1 can have different kinds across nested data
> > > > regions, we need to store map-type not only for each var, but also for each
> > > > structured mapping.  Here is my WIP patch, is it sane? :)
> > > > Attached testcase works OK on the device with non-shared memory.
> > > 
> > > A bit updated version with a fix for GOMP_MAP_TO_PSET.
> > > make check-target-libgomp passed.
> > 
> > Ok, thanks.
> > 
> > > include/gcc/
> > > 	* gomp-constants.h (GOMP_MAP_ALWAYS_TO_P,
> > > 	GOMP_MAP_ALWAYS_FROM_P): Define.
> > > libgomp/
> > > 	* libgomp.h (struct target_var_desc): New.
> > > 	(struct target_mem_desc): Replace array of splay_tree_key with array of
> > > 	target_var_desc.
> > > 	(struct splay_tree_key_s): Move copy_from to target_var_desc.
> > > 	* oacc-mem.c (gomp_acc_remove_pointer): Use copy_from from
> > > 	target_var_desc.
> > > 	* oacc-parallel.c (GOACC_parallel): Use copy_from from target_var_desc.
> > > 	* target.c (gomp_map_vars_existing): Copy data to device if map-type is
> > > 	'always to' or 'always tofrom'.
> > > 	(gomp_map_vars): Use key from target_var_desc.  Set copy_from and
> > > 	always_copy_from.
> > > 	(gomp_copy_from_async): Use key and copy_from from target_var_desc.
> > > 	(gomp_unmap_vars): Copy data from device if always_copy_from is set.
> > > 	(gomp_offload_image_to_device): Do not use copy_from.
> > > 	* testsuite/libgomp.c/target-11.c: New test.
> 
> (That's gomp-4_1-branch r224838.  The attached
> gomp-4_1-branch-r224838.patch is a variant that applies on top of
> gomp-4_0-branch r228972.)  This change introduces regressions in OpenACC
> async clause handling.

> Testing on gomp-4_0-branch r228972 plus the attached
> gomp-4_1-branch-r224838.patch:
> 
>     PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/asyncwait-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none (test for excess errors)
>     [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/asyncwait-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none execution test
> 
> Same for C++.

With an XFAIL added (Chung-Lin, please remove that one once you come up
with a fix), and merge conflicts resolved as follows, I have now merged
gomp-4_1-branch r224838 in gomp-4_0-branch r229178:

commit cbef8ef8e3b6bf7ea3705b1fae5462be9e619a56
Merge: 3596aeb a568354
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Oct 22 17:50:08 2015 +0000

    svn merge -r 224607:224838 svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_1-branch
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@229178 138bc75d-0d04-0410-961f-82ee72b054a4

 include/ChangeLog.gomp41                           |   5 +
 include/gomp-constants.h                           |   6 ++
 libgomp/ChangeLog.gomp41                           |  18 ++++
 libgomp/libgomp.h                                  |  15 ++-
 libgomp/oacc-mem.c                                 |   2 +-
 libgomp/oacc-parallel.c                            |   6 +-
 libgomp/target.c                                   | 106 +++++++++++++--------
 libgomp/testsuite/libgomp.c/target-11.c            |  51 ++++++++++
 .../libgomp.oacc-c-c++-common/asyncwait-1.c        |   2 +
 9 files changed, 162 insertions(+), 49 deletions(-)

diff --cc libgomp/oacc-mem.c
index 7fcf199,c0fcb07..a90c912
--- libgomp/oacc-mem.c
+++ libgomp/oacc-mem.c
@@@ -685,7 -650,8 +685,7 @@@ gomp_acc_remove_pointer (void *h, bool 
  	}
      }
  
-   t->list[0]->copy_from = force_copyfrom ? 1 : 0;
 -  if (force_copyfrom)
 -    t->list[0].copy_from = 1;
++  t->list[0].copy_from = force_copyfrom ? 1 : 0;
  
    gomp_mutex_unlock (&acc_dev->lock);
  
diff --cc libgomp/oacc-parallel.c
index 2b90c9f,8ea3dd1..e4ecc87
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@@ -261,16 -135,12 +261,16 @@@ GOACC_parallel_keyed (int device, void 
  
    devaddrs = gomp_alloca (sizeof (void *) * mapnum);
    for (i = 0; i < mapnum; i++)
 -    devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
 -			    + tgt->list[i].key->tgt_offset);
 +    {
-       if (tgt->list[i] != NULL)
- 	devaddrs[i] = (void *) (tgt->list[i]->tgt->tgt_start
- 				+ tgt->list[i]->tgt_offset);
++      if (tgt->list[i].key != NULL)
++	devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
++				+ tgt->list[i].key->tgt_offset);
 +      else
 +	devaddrs[i] = NULL;
 +    }
  
 -  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, sizes, kinds,
 -			      num_gangs, num_workers, vector_length, async,
 -			      tgt);
 +  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
 +			      async, dims, tgt);
  
    /* If running synchronously, unmap immediately.  */
    if (async < acc_async_noval)
diff --cc libgomp/target.c
index 4587361,05c9b71..c2e1996
--- libgomp/target.c
+++ libgomp/target.c
@@@ -714,9 -721,6 +740,8 @@@ gomp_load_image_to_device (struct gomp_
        k->tgt_offset = target_table[i].start;
        k->refcount = 1;
        k->async_refcount = 0;
-       k->copy_from = false;
-       tgt->list[i] = k;
++      tgt->list[i].key = k;
 +      tgt->refcount++;
        array->left = NULL;
        array->right = NULL;
        splay_tree_insert (&devicep->mem_map, array);
@@@ -742,9 -746,6 +767,8 @@@
        k->tgt_offset = target_var->start;
        k->refcount = 1;
        k->async_refcount = 0;
-       k->copy_from = false;
-       tgt->list[i] = k;
++      tgt->list[i].key = k;
 +      tgt->refcount++;
        array->left = NULL;
        array->right = NULL;
        splay_tree_insert (&devicep->mem_map, array);
diff --cc libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
index d478ce2,22cef6d..f3b490a
--- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
@@@ -1,4 -1,4 +1,6 @@@
  /* { dg-do run { target openacc_nvidia_accel_selected } } */
++/* <http://news.gmane.org/find-root.php?message_id=%3C87pp0aaksc.fsf%40kepler.schwinge.homeip.net%3E>.
++   { dg-xfail-run-if "TODO" { *-*-* } } */
  /* { dg-additional-options "-lcuda" } */
  
  #include <openacc.h>


Grüße
 Thomas
Tom de Vries Nov. 18, 2015, 3:17 p.m. UTC | #4
On 22/10/15 20:27, Thomas Schwinge wrote:
> diff --cc libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> index d478ce2,22cef6d..f3b490a
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> @@@ -1,4 -1,4 +1,6 @@@
>    /* { dg-do run { target openacc_nvidia_accel_selected } } */
> ++/*<http://news.gmane.org/find-root.php?message_id=%3C87pp0aaksc.fsf%40kepler.schwinge.homeip.net%3E>.
> ++   { dg-xfail-run-if "TODO" { *-*-* } } */
>    /* { dg-additional-options "-lcuda" } */
>
>    #include <openacc.h>

This failure shows up on trunk. Should it also be xfailed there?

Thanks,
- Tom
Thomas Schwinge March 30, 2016, 4:23 p.m. UTC | #5
Hi!

On Wed, 18 Nov 2015 16:17:39 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> On 22/10/15 20:27, Thomas Schwinge wrote:
> > diff --cc libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> > index d478ce2,22cef6d..f3b490a
> > --- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> > +++ libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-1.c
> > @@@ -1,4 -1,4 +1,6 @@@
> >    /* { dg-do run { target openacc_nvidia_accel_selected } } */
> > ++/*<http://news.gmane.org/find-root.php?message_id=%3C87pp0aaksc.fsf%40kepler.schwinge.homeip.net%3E>.
> > ++   { dg-xfail-run-if "TODO" { *-*-* } } */
> >    /* { dg-additional-options "-lcuda" } */
> >
> >    #include <openacc.h>
> 
> This failure shows up on trunk. Should it also be xfailed there?

I added the XFAIL as part of my recent r234575 "Update OpenACC test
cases" commit,
<http://news.gmane.org/find-root.php?message_id=%3C878u109ew4.fsf%40hertz.schwinge.homeip.net%3E>.
Chung-Lin, for avoidance of doubt, please remove that XFAIL once you get
to commit your fix for this issue (currently waiting for Jakub's
approval).


Grüße
 Thomas
diff mbox

Patch

diff --git include/gomp-constants.h include/gomp-constants.h
index b55f68b..540a31e 100644
--- include/gomp-constants.h
+++ include/gomp-constants.h
@@ -111,6 +111,12 @@  enum gomp_map_kind
 #define GOMP_MAP_POINTER_P(X) \
   ((X) == GOMP_MAP_POINTER)
 
+#define GOMP_MAP_ALWAYS_TO_P(X) \
+  (((X) == GOMP_MAP_ALWAYS_TO) || ((X) == GOMP_MAP_ALWAYS_TOFROM))
+
+#define GOMP_MAP_ALWAYS_FROM_P(X) \
+  (((X) == GOMP_MAP_ALWAYS_FROM) || ((X) == GOMP_MAP_ALWAYS_TOFROM))
+
 
 /* Asynchronous behavior.  Keep in sync with
    libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t.  */
diff --git libgomp/libgomp.h libgomp/libgomp.h
index d86da7d..8fd7d08 100644
--- libgomp/libgomp.h
+++ libgomp/libgomp.h
@@ -641,6 +641,15 @@  typedef struct splay_tree_node_s *splay_tree_node;
 typedef struct splay_tree_s *splay_tree;
 typedef struct splay_tree_key_s *splay_tree_key;
 
+struct target_var_desc {
+  /* Splay key.  */
+  splay_tree_key key;
+  /* True if data should be copied from device to host at the end.  */
+  bool copy_from;
+  /* True if data always should be copied from device to host at the end.  */
+  bool always_copy_from;
+};
+
 struct target_mem_desc {
   /* Reference count.  */
   uintptr_t refcount;
@@ -660,9 +669,9 @@  struct target_mem_desc {
   /* Corresponding target device descriptor.  */
   struct gomp_device_descr *device_descr;
 
-  /* List of splay keys to remove (or decrease refcount)
+  /* List of target items to remove (or decrease refcount)
      at the end of region.  */
-  splay_tree_key list[];
+  struct target_var_desc list[];
 };
 
 struct splay_tree_key_s {
@@ -678,8 +687,6 @@  struct splay_tree_key_s {
   uintptr_t refcount;
   /* Asynchronous reference count.  */
   uintptr_t async_refcount;
-  /* True if data should be copied from device to host at the end.  */
-  bool copy_from;
 };
 
 #include "splay-tree.h"
diff --git libgomp/oacc-mem.c libgomp/oacc-mem.c
index 7fcf199..a90c912 100644
--- libgomp/oacc-mem.c
+++ libgomp/oacc-mem.c
@@ -685,7 +685,7 @@  gomp_acc_remove_pointer (void *h, bool force_copyfrom, int async, int mapnum)
 	}
     }
 
-  t->list[0]->copy_from = force_copyfrom ? 1 : 0;
+  t->list[0].copy_from = force_copyfrom ? 1 : 0;
 
   gomp_mutex_unlock (&acc_dev->lock);
 
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 2b90c9f..e4ecc87 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -262,9 +262,9 @@  GOACC_parallel_keyed (int device, void (*fn) (void *),
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     {
-      if (tgt->list[i] != NULL)
-	devaddrs[i] = (void *) (tgt->list[i]->tgt->tgt_start
-				+ tgt->list[i]->tgt_offset);
+      if (tgt->list[i].key != NULL)
+	devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
+				+ tgt->list[i].key->tgt_offset);
       else
 	devaddrs[i] = NULL;
     }
diff --git libgomp/target.c libgomp/target.c
index 4587361..c2e1996 100644
--- libgomp/target.c
+++ libgomp/target.c
@@ -168,6 +168,12 @@  gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
 		  (void *) newn->host_start, (void *) newn->host_end,
 		  (void *) oldn->host_start, (void *) oldn->host_end);
     }
+
+  if (GOMP_MAP_ALWAYS_TO_P (kind))
+    devicep->host2dev_func (devicep->target_id,
+			    (void *) (oldn->tgt->tgt_start + oldn->tgt_offset),
+			    (void *) newn->host_start,
+			    newn->host_end - newn->host_start);
   oldn->refcount++;
 }
 
@@ -267,7 +273,7 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       int kind = get_kind (short_mapkind, kinds, i);
       if (hostaddrs[i] == NULL)
 	{
-	  tgt->list[i] = NULL;
+	  tgt->list[i].key = NULL;
 	  continue;
 	}
       cur_node.host_start = (uintptr_t) hostaddrs[i];
@@ -278,12 +284,15 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       splay_tree_key n = splay_tree_lookup (mem_map, &cur_node);
       if (n)
 	{
-	  tgt->list[i] = n;
+	  tgt->list[i].key = n;
+	  tgt->list[i].copy_from = GOMP_MAP_COPY_FROM_P (kind & typemask);
+	  tgt->list[i].always_copy_from
+	    = GOMP_MAP_ALWAYS_FROM_P (kind & typemask);
 	  gomp_map_vars_existing (devicep, n, &cur_node, kind & typemask);
 	}
       else
 	{
-	  tgt->list[i] = NULL;
+	  tgt->list[i].key = NULL;
 
 	  size_t align = (size_t) 1 << (kind >> rshift);
 	  not_found_cnt++;
@@ -304,7 +313,7 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		  break;
 		else
 		  {
-		    tgt->list[j] = NULL;
+		    tgt->list[j].key = NULL;
 		    i++;
 		  }
 	    }
@@ -352,7 +361,7 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       size_t j;
 
       for (i = 0; i < mapnum; i++)
-	if (tgt->list[i] == NULL)
+	if (tgt->list[i].key == NULL)
 	  {
 	    int kind = get_kind (short_mapkind, kinds, i);
 	    if (hostaddrs[i] == NULL)
@@ -366,18 +375,23 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n)
 	      {
-		tgt->list[i] = n;
+		tgt->list[i].key = n;
+		tgt->list[i].copy_from = GOMP_MAP_COPY_FROM_P (kind & typemask);
+		tgt->list[i].always_copy_from
+		  = GOMP_MAP_ALWAYS_FROM_P (kind & typemask);
 		gomp_map_vars_existing (devicep, n, k, kind & typemask);
 	      }
 	    else
 	      {
 		size_t align = (size_t) 1 << (kind >> rshift);
-		tgt->list[i] = k;
+		tgt->list[i].key = k;
 		tgt_size = (tgt_size + align - 1) & ~(align - 1);
 		k->tgt = tgt;
 		k->tgt_offset = tgt_size;
 		tgt_size += k->host_end - k->host_start;
-		k->copy_from = GOMP_MAP_COPY_FROM_P (kind & typemask);
+		tgt->list[i].copy_from = GOMP_MAP_COPY_FROM_P (kind & typemask);
+		tgt->list[i].always_copy_from
+		  = GOMP_MAP_ALWAYS_FROM_P (kind & typemask);
 		k->refcount = 1;
 		k->async_refcount = 0;
 		tgt->refcount++;
@@ -395,6 +409,8 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		  case GOMP_MAP_TOFROM:
 		  case GOMP_MAP_FORCE_TO:
 		  case GOMP_MAP_FORCE_TOFROM:
+		  case GOMP_MAP_ALWAYS_TO:
+		  case GOMP_MAP_ALWAYS_TOFROM:
 		    /* FIXME: Perhaps add some smarts, like if copying
 		       several adjacent fields from host to target, use some
 		       host buffer to avoid sending each var individually.  */
@@ -427,7 +443,9 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			break;
 		      else
 			{
-			  tgt->list[j] = k;
+			  tgt->list[j].key = k;
+			  tgt->list[j].copy_from = false;
+			  tgt->list[j].always_copy_from = false;
 			  k->refcount++;
 			  gomp_map_pointer (tgt,
 					    (uintptr_t) *(void **) hostaddrs[j],
@@ -479,11 +497,11 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
     {
       for (i = 0; i < mapnum; i++)
 	{
-	  if (tgt->list[i] == NULL)
+	  if (tgt->list[i].key == NULL)
 	    cur_node.tgt_offset = (uintptr_t) NULL;
 	  else
-	    cur_node.tgt_offset = tgt->list[i]->tgt->tgt_start
-				  + tgt->list[i]->tgt_offset;
+	    cur_node.tgt_offset = tgt->list[i].key->tgt->tgt_start
+				  + tgt->list[i].key->tgt_offset;
 	  /* FIXME: see above FIXME comment.  */
 	  devicep->host2dev_func (devicep->target_id,
 				  (void *) (tgt->tgt_start
@@ -523,17 +541,17 @@  gomp_copy_from_async (struct target_mem_desc *tgt)
   gomp_mutex_lock (&devicep->lock);
 
   for (i = 0; i < tgt->list_count; i++)
-    if (tgt->list[i] == NULL)
+    if (tgt->list[i].key == NULL)
       ;
-    else if (tgt->list[i]->refcount > 1)
+    else if (tgt->list[i].key->refcount > 1)
       {
-	tgt->list[i]->refcount--;
-	tgt->list[i]->async_refcount++;
+	tgt->list[i].key->refcount--;
+	tgt->list[i].key->async_refcount++;
       }
     else
       {
-	splay_tree_key k = tgt->list[i];
-	if (k->copy_from)
+	splay_tree_key k = tgt->list[i].key;
+	if (tgt->list[i].copy_from)
 	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
 				  (void *) (k->tgt->tgt_start + k->tgt_offset),
 				  k->host_end - k->host_start);
@@ -561,25 +579,33 @@  gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 
   size_t i;
   for (i = 0; i < tgt->list_count; i++)
-    if (tgt->list[i] == NULL)
-      ;
-    else if (tgt->list[i]->refcount > 1)
-      tgt->list[i]->refcount--;
-    else if (tgt->list[i]->async_refcount > 0)
-      tgt->list[i]->async_refcount--;
-    else
-      {
-	splay_tree_key k = tgt->list[i];
-	if (k->copy_from && do_copyfrom)
-	  devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
-				  (void *) (k->tgt->tgt_start + k->tgt_offset),
-				  k->host_end - k->host_start);
-	splay_tree_remove (&devicep->mem_map, k);
-	if (k->tgt->refcount > 1)
-	  k->tgt->refcount--;
-	else
-	  gomp_unmap_tgt (k->tgt);
-      }
+    {
+      splay_tree_key k = tgt->list[i].key;
+      if (k == NULL)
+	continue;
+
+      bool do_unmap = false;
+      if (k->refcount > 1)
+	k->refcount--;
+      else if (k->async_refcount > 0)
+	k->async_refcount--;
+      else
+	do_unmap = true;
+
+      if ((do_unmap && do_copyfrom && tgt->list[i].copy_from)
+	  || tgt->list[i].always_copy_from)
+	devicep->dev2host_func (devicep->target_id, (void *) k->host_start,
+				(void *) (k->tgt->tgt_start + k->tgt_offset),
+				k->host_end - k->host_start);
+      if (do_unmap)
+	{
+	  splay_tree_remove (&devicep->mem_map, k);
+	  if (k->tgt->refcount > 1)
+	    k->tgt->refcount--;
+	  else
+	    gomp_unmap_tgt (k->tgt);
+	}
+    }
 
   if (tgt->refcount > 1)
     tgt->refcount--;
@@ -714,8 +740,7 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       k->tgt_offset = target_table[i].start;
       k->refcount = 1;
       k->async_refcount = 0;
-      k->copy_from = false;
-      tgt->list[i] = k;
+      tgt->list[i].key = k;
       tgt->refcount++;
       array->left = NULL;
       array->right = NULL;
@@ -742,8 +767,7 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       k->tgt_offset = target_var->start;
       k->refcount = 1;
       k->async_refcount = 0;
-      k->copy_from = false;
-      tgt->list[i] = k;
+      tgt->list[i].key = k;
       tgt->refcount++;
       array->left = NULL;
       array->right = NULL;
diff --git libgomp/testsuite/libgomp.c/target-11.c libgomp/testsuite/libgomp.c/target-11.c
new file mode 100644
index 0000000..4562d88
--- /dev/null
+++ libgomp/testsuite/libgomp.c/target-11.c
@@ -0,0 +1,51 @@ 
+/* { dg-require-effective-target offload_device } */
+
+#include <assert.h>
+
+int main ()
+{
+  int aa = 0, bb = 0, cc = 0, dd = 0;
+
+  #pragma omp target data map(tofrom: aa) map(to: bb) map(from: cc, dd)
+    {
+      int ok;
+      aa = bb = cc = 1;
+
+      /* Set dd on target to 0 for the further check.  */
+      #pragma omp target map(always to: dd)
+	{ dd; }
+
+      dd = 1;
+      #pragma omp target map(tofrom: aa) map(always to: bb) \
+	map(always from: cc) map(to: dd) map(from: ok)
+	{
+	  /* bb is always to, aa and dd are not.  */
+	  ok = (aa == 0) && (bb == 1) && (dd == 0);
+	  aa = bb = cc = dd = 2;
+	}
+
+      assert (ok);
+      assert (aa == 1);
+      assert (bb == 1);
+      assert (cc == 2); /* cc is always from.  */
+      assert (dd == 1);
+
+      dd = 3;
+      #pragma omp target map(from: cc) map(always to: dd) map(from: ok)
+	{
+	  ok = (dd == 3); /* dd is always to.  */
+	  cc = dd = 4;
+	}
+
+      assert (ok);
+      assert (cc == 2);
+      assert (dd == 3);
+    }
+
+  assert (aa == 2);
+  assert (bb == 1);
+  assert (cc == 4);
+  assert (dd == 4);
+
+  return 0;
+}