diff mbox series

OpenACC reference count overhaul

Message ID 20191003163505.49997-2-julian@codesourcery.com
State New
Headers show
Series OpenACC reference count overhaul | expand

Commit Message

Julian Brown Oct. 3, 2019, 4:35 p.m. UTC
This patch has been broken out of the patch supporting OpenACC 2.6 manual
deep copy last posted here:

  https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html

As part of developing that patch, libgomp's OpenACC reference counting
implementation proved to be somewhat inconsistent, especially when
used in combination with the deep copy support which exercises it
more thoroughly.

So, this patch contains just the changes to reference-counting behaviour,
for ease of (re-)review.  The other parts of OpenACC 2.6 manual deep
copy support are forthcoming, but some changes in this patch anticipate
that support.

Tested with offloading to NVPTX, with good results (though a couple of
tests need fixing also).

A follow-up patch provides self-checking of the reference count
implementation.

OK for trunk?

Thanks,

Julian

2019-10-02  Julian Brown  <julian@codesourcery.com>

	libgomp/
	* libgomp.h (VREFCOUNT_LINK_KEY): New macro.
	(struct splay_tree_key_s): Put link_key field into a new union.
	Substitute dynamic_refcount field for virtual_refcount.
	(struct acc_dispatch_t): Remove data_environ field.
	(enum gomp_map_vars_kind): Add GOMP_MAP_VARS_OPENACC_ENTER_DATA.
	(gomp_acc_insert_pointer): Remove prototype.
	(gomp_acc_remove_pointer): Update prototype.
	(gomp_free_memmap): Remove prototype.
	(gomp_remove_var_async): Add prototype.
	* oacc-host.c (host_dispatch): Don't initialise removed data_environ
	field.
	* oacc-init.c (acc_shutdown_1): Use gomp_remove_var instead of
	gomp_free_memmap.
	* oacc-mem.c (lookup_dev_1): New function.
	(lookup_dev): Reimplement using above.
	(acc_free, acc_hostptr): Update calls to lookup_dev.
	(acc_map_data): Likewise.  Don't add to data_environ list.
	(acc_unmap_data): Remove call to gomp_unmap_vars.  Fix semantics to
	remove mapping, but not mapped data.
	(present_create_copy): Use virtual_refcount instead of
	dynamic_refcount.  Don't manipulate data_environ.  Fix target pointer
	return value.
	(delete_copyout): Update for virtual_refcount semantics.  Use
	goacc_remove_var_async for asynchronous delete/copyouts.
	(gomp_acc_insert_pointer): Remove function.
	(gomp_acc_remove_pointer): Reimplement.
	* oacc-parallel.c (find_pointer): Make a little more strict.
	(GOACC_enter_exit_data): Call gomp_map_vars_async directly instead of
	calling gomp_acc_insert_pointer.  Update call to
	gomp_acc_remove_pointer.
	* target.c (gomp_map_vars_internal): Handle
	GOMP_MAP_VARS_OPENACC_ENTER_DATA.  Update for virtual_refcount
	semantics.
	(gomp_remove_var): Reimplement in terms of...
	(gomp_remove_var_internal): ...this new helper function.
	(gomp_remove_var_async): Implement using above helper funciton.
	(gomp_unref_tgt): Reimplement.
	(gomp_unref_tgt_void): New function.
	(gomp_unmap_vars_internal): Update for virtual_refcount
	semantics.  Check for special virtual_refcount tag value before using
	link_key.
	(gomp_load_image_to_device): Zero-initialise virtual_refcount fields.
	(gomp_free_memmap): Remove function.
	(gomp_exit_data): Check virtual_refcount for tag value before using
	link_key.
	(omp_target_associate_ptr): Zero-initialise virtual_refcount and
	link_key splay tree key fields.
	(gomp_target_init): Don't initialise removed data_environ field.
	* testsuite/libgomp.oacc-c-c++-common/context-2.c: Use correct API to
	deallocate acc_copyin'd data.
	* testsuite/libgomp.oacc-c-c++-common/context-4.c: Likewise.
	* testsuite/libgomp.oacc-fortran/data-2.f90: Update test.
---
 libgomp/libgomp.h                             |  36 +-
 libgomp/oacc-host.c                           |   2 -
 libgomp/oacc-init.c                           |  10 +-
 libgomp/oacc-mem.c                            | 347 ++++++------------
 libgomp/oacc-parallel.c                       |  44 ++-
 libgomp/target.c                              | 127 ++++---
 .../libgomp.oacc-c-c++-common/context-2.c     |   6 +-
 .../libgomp.oacc-c-c++-common/context-4.c     |   6 +-
 .../testsuite/libgomp.oacc-fortran/data-2.f90 |   7 +-
 9 files changed, 278 insertions(+), 307 deletions(-)

Comments

Thomas Schwinge Oct. 15, 2019, 3:30 p.m. UTC | #1
Hi Julian!

On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote:
> This patch has been broken out of the patch supporting OpenACC 2.6 manual
> deep copy last posted here:
>
>   https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html

Thanks.


> a couple of
> tests need fixing also

Let's look at these first, and independently.

The overall goal not being to bend test cases until they (again) work,
but rather to verify what they're testing, so that they're valid OpenACC
code, or if not that, then they're testing specifics of the GCC
implementation (for example, the 'dg-shouldfail' test cases).

> 	* testsuite/libgomp.oacc-c-c++-common/context-2.c: Use correct API to
> 	deallocate acc_copyin'd data.
> 	* testsuite/libgomp.oacc-c-c++-common/context-4.c: Likewise.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c

> +    acc_delete (&h_X[0], N * sizeof (float));
> +    acc_delete (&h_Y1[0], N * sizeof (float));
> +
>      free (h_X);
>      free (h_Y1);
>      free (h_Y2);
>  
> -    acc_free (d_X);
> -    acc_free (d_Y);

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c

> +    acc_delete (&h_X[0], N * sizeof (float));
> +    acc_delete (&h_Y1[0], N * sizeof (float));
> +
>      free (h_X);
>      free (h_Y1);
>      free (h_Y2);
>  
> -    acc_free (d_X);
> -    acc_free (d_Y);

ACK -- but do we understand why the same shouldn't be applied to the very
similar 'libgomp.oacc-c-c++-common/context-1.c' and
'libgomp.oacc-c-c++-common/context-3.c', too?

I suppose your testing of the "OpenACC reference count overhaul" tripped
over these constructs?  (Why just some, then?)

The same pattern ('acc_copyin', 'acc_free') also appears in
'libgomp.oacc-c-c++-common/clauses-1.c', does that also need to be
corrected?  Same in 'libgomp.oacc-c-c++-common/lib-13.c' (... where that
test case actually is titled "Check acc_is_present and acc_delete"
instead of "[...] acc_free", huh), 'libgomp.oacc-c-c++-common/lib-14.c',
'libgomp.oacc-c-c++-common/lib-18.c'.

Then, the 'acc_deviceptr', 'acc_unmap_data', 'acc_free' usage in
'libgomp.oacc-c-c++-common/clauses-1.c' also seems strange, as the
respective 'acc_free' argument certainly is not (at least not directly) a
"pointer value that was returned by a call to 'acc_malloc'".  Does it
make sense to (continue to) support that, assuming that's how it's
implemented internally, or should these be corrected to valid OpenACC,
too?  Same in 'libgomp.oacc-c-c++-common/present-1.c'.

Same in 'libgomp.oacc-c-c++-common/clauses-2.c' (we 'dg-shouldfail'
earlier, but the later code should otherwise be made correct anyway).

Several of these things again in 'libgomp.oacc-c-c++-common/nested-1.c'.

(The other 'libgomp.oacc-c-c++-common/lib-*.c' ones are correctly pairing
'acc_malloc', 'acc_free', as far as I can tell.)


> --- a/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90

> @@ -70,10 +71,14 @@ program test
>      end do
>    !$acc end parallel
>    
> -  !$acc exit data copyout (d(1:N)) async
> +  !$acc exit data delete (c(1:N)) copyout (d(1:N)) async
>    !$acc exit data async
>    !$acc wait

ACK, but also it seems to me as if the '!$acc exit data async' (currently
"clause-less") was meant to carry the 'delete (c(1:N))' clause?

> @@ -1,4 +1,5 @@
>  ! { dg-do run }
> +! { dg-additional-options "-cpp" }

> [...]

> +#if !ACC_MEM_SHARED
> +  if (acc_is_present (c) .eqv. .TRUE.) call abort
> +#endif

;-) Should be able to simplify that one to 'if (acc_is_present (c))', no?

But is that a really useful test here: don't we elsewhere have enough of
such 'acc_is_present' testing?  (That is, OK to keep that, but likewise
OK to drop that.)

And, just for background information: per PR84381, it has been suggested
to use the Fortran standard 'stop' (or was it 'error stop'?) instead of
'call abort'.  But no need to change that here individually; the libgomp
testsuite still (or, again?)  contains a lot of 'call abort'.

> +
>    do i = 1, N
>      if (d(i) .ne. 4.0) call abort
>    end do

..., for example, here.  ;-) (For avoidance of doubt, I'm not asking you
to change these now.)


So, please address these items first, as separate "Fix OpenACC test cases
regarding 'acc_malloc', 'acc_free' pairing", and "Fix OpenACC test case
for unstructured data regions" (or similar) commits.  If you're confident
you're doing "the obvious", feel free to commit without further review.


Grüße
 Thomas
Julian Brown Oct. 21, 2019, 12:53 p.m. UTC | #2
On Tue, 15 Oct 2019 17:30:06 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi Julian!
> 
> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
> wrote:
> > This patch has been broken out of the patch supporting OpenACC 2.6
> > manual deep copy last posted here:
> >
> >   https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html  
> 
> Thanks.
> 
> 
> > a couple of
> > tests need fixing also  
> 
> Let's look at these first, and independently.
> 
> The overall goal not being to bend test cases until they (again) work,
> but rather to verify what they're testing, so that they're valid
> OpenACC code, or if not that, then they're testing specifics of the
> GCC implementation (for example, the 'dg-shouldfail' test cases).

Indeed, the tests looked "obviously wrong", but actually none of them
should have failed with the reference-count overhaul patch. As far as I
can tell, only the context-2.c test now fails with the current og9
branch, intermittently, with the last version of the patch sent. Turns
out that was a real bug! So, good catch.

> ACK -- but do we understand why the same shouldn't be applied to the
> very similar 'libgomp.oacc-c-c++-common/context-1.c' and
> 'libgomp.oacc-c-c++-common/context-3.c', too?
> 
> I suppose your testing of the "OpenACC reference count overhaul"
> tripped over these constructs?  (Why just some, then?)

Yeah. Just blind luck, I think.

> The same pattern ('acc_copyin', 'acc_free') also appears in
> 'libgomp.oacc-c-c++-common/clauses-1.c', does that also need to be
> corrected?  Same in 'libgomp.oacc-c-c++-common/lib-13.c' (... where
> that test case actually is titled "Check acc_is_present and
> acc_delete" instead of "[...] acc_free", huh),
> 'libgomp.oacc-c-c++-common/lib-14.c',
> 'libgomp.oacc-c-c++-common/lib-18.c'.
> 
> Then, the 'acc_deviceptr', 'acc_unmap_data', 'acc_free' usage in
> 'libgomp.oacc-c-c++-common/clauses-1.c' also seems strange, as the
> respective 'acc_free' argument certainly is not (at least not
> directly) a "pointer value that was returned by a call to
> 'acc_malloc'".  Does it make sense to (continue to) support that,
> assuming that's how it's implemented internally, or should these be
> corrected to valid OpenACC, too?  Same in
> 'libgomp.oacc-c-c++-common/present-1.c'.
> 
> Same in 'libgomp.oacc-c-c++-common/clauses-2.c' (we 'dg-shouldfail'
> earlier, but the later code should otherwise be made correct anyway).
> 
> Several of these things again in
> 'libgomp.oacc-c-c++-common/nested-1.c'.

I'm not sure if *all* of those are wrong. I have a patch (forthcoming)
that fixes some of the pedantically-wrong OpenACC usage, but none of
the tests now regress with this version of the patch, so the urgency
is gone.

This version of the patch fixes the lookup_dev_1 helper function --
previously I had:

static splay_tree_key
lookup_dev_1 (splay_tree_node node, uintptr_t d, size_t s)
{
  splay_tree_key k = &node->key;
  struct target_mem_desc *t = k->tgt;

  if (d >= t->tgt_start && d + s <= t->tgt_end)
    return k;

  if (node->left)
    return lookup_dev_1 (node->left, d, s);

  if (node->right)
    return lookup_dev_1 (node->right, d, s);

  return NULL;
}

which would never recurse into a right-hand branch if there was a
left-hand node! Oops. So, device-address lookups would sometimes fail
when there was a valid mapping, depending on the balance of the splay
tree. (As an aside, I think calling lookup_dev unconditionally in
several of the OpenACC API calls as we do is a bad idea -- it takes time
linear to the number of mappings, with no way to avoid that overhead.
But that's another matter.)

Re-testing shows that the previously-regressing tests no longer
regress, but I haven't yet made any changes to VREFCOUNT_LINK_KEY, etc.
as suggested in the review of the attach/detach patch:

https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01374.html

OK? (ChangeLog as before.)

Julian
Thomas Schwinge Oct. 21, 2019, 2:14 p.m. UTC | #3
Hi!

On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote:
> This patch has been broken out of the patch supporting OpenACC 2.6 manual
> deep copy last posted here:
>
>   https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html

Thanks.

Remeber to look into <https://gcc.gnu.org/PR92116> "Potential null
pointer dereference in 'gomp_acc_remove_pointer'", which may be relevant
here.

I see you've merged in the relevant parts of my incremental patch '[WIP]
OpenACC 2.6 manual deep copy support (attach/detach): adjust for
"goacc_async_unmap_tgt" removal', that I included in
<http://mid.mail-archive.com/yxfpftuqpakv.fsf@hertz.schwinge.homeip.net>,
which tells me that I supposedly understood that part alright.  ;-D

> As part of developing that patch, libgomp's OpenACC reference counting
> implementation proved to be somewhat inconsistent, especially when
> used in combination with the deep copy support which exercises it
> more thoroughly.
>
> So, this patch contains just the changes to reference-counting behaviour,
> for ease of (re-)review.  The other parts of OpenACC 2.6 manual deep
> copy support are forthcoming, but some changes in this patch anticipate
> that support.

As we're discussing these separately, please for now remove the changes
related to the 'VREFCOUNT_LINK_KEY' toggle flag, and moving 'link_key'
into an union (to later be shared with 'attach_count');
<http://mid.mail-archive.com/87pniuuhkj.fsf@euler.schwinge.homeip.net>.

> Tested with offloading to NVPTX, with good results (though a couple of
> tests need fixing also).

The testsuite changes we're discussing separately, and need to go in
before this one, obviously.

> OK for trunk?

I haven't understood all the changes related to replacing
'dynamic_refcount' with 'virtual_refcount', getting rid of
'data_environ', the 'lookup_dev' rework, but I trust you got that right.
In particular, these seem to remove special-case OpenACC code in favor of
generic OMP code, which is good.

A few more comments:

> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h

>  typedef struct acc_dispatch_t
>  {
> -  /* This is a linked list of data mapped using the
> -     acc_map_data/acc_unmap_data or "acc enter data"/"acc exit data" pragmas.
> -     Unlike mapped_data in the goacc_thread struct, unmapping can
> -     happen out-of-order with respect to mapping.  */
> -  /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
> -  struct target_mem_desc *data_environ;

As mentioned before, please also accordingly update the comment attached
to 'acc_dispatch_t openacc' in 'struct gomp_device_descr'.

That code:

> -/* Free address mapping tables.  MM must be locked on entry, and remains locked
> -   on return.  */
> -
> -attribute_hidden void
> -gomp_free_memmap (struct splay_tree_s *mem_map)
> -{
> -  while (mem_map->root)
> -    {
> -      struct target_mem_desc *tgt = mem_map->root->key.tgt;
> -
> -      splay_tree_remove (mem_map, &mem_map->root->key);
> -      free (tgt->array);
> -      free (tgt);
> -    }
> -}

... kind-of gets inlined here:

> --- a/libgomp/oacc-init.c
> +++ b/libgomp/oacc-init.c
> @@ -356,9 +356,13 @@ acc_shutdown_1 (acc_device_t d)
>  
>        if (walk->dev)
>  	{
> -	  gomp_mutex_lock (&walk->dev->lock);
> -	  gomp_free_memmap (&walk->dev->mem_map);
> -	  gomp_mutex_unlock (&walk->dev->lock);
> +	  while (walk->dev->mem_map.root)
> +	    {
> +	      splay_tree_key k = &walk->dev->mem_map.root->key;
> +	      gomp_remove_var (walk->dev, k);
> +	    }
>  
>  	  walk->dev = NULL;
>  	  walk->base_dev = NULL;

It's not obvious to me why it's OK to remove the locking?  Don't all
operations on the 'mem_map' have to have the device locked?

Does that code now still have the previous (and expected?) "finalize"
semantics (don't consider 'refcount', always unmap)?  (Should we assert
here that 'gomp_remove_var' always returns 'true'?  And/or, if it
doesn't, what does that mean then?)  Or am I confused?  ;-)

> --- a/libgomp/oacc-mem.c
> +++ b/libgomp/oacc-mem.c

> @@ -427,6 +418,7 @@ acc_unmap_data (void *h)
>  {
>    struct goacc_thread *thr = goacc_thread ();
>    struct gomp_device_descr *acc_dev = thr->dev;
> +  struct splay_tree_key_s cur_node;

I know it's often not the case in existing code, but when adding new
code, please move definitions next to their first use.

> @@ -438,12 +430,11 @@ acc_unmap_data (void *h)
>    acc_api_info api_info;
>    bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
>  
>    gomp_mutex_lock (&acc_dev->lock);
>  
> -  splay_tree_key n = lookup_host (acc_dev, h, 1);
> -  struct target_mem_desc *t;
> +  cur_node.host_start = (uintptr_t) h;
> +  cur_node.host_end = cur_node.host_start + 1;
> +  splay_tree_key n = splay_tree_lookup (&acc_dev->mem_map, &cur_node);
>  
>    if (!n)
>      {

Isn't this just inlining 'lookup_host'?  There may be a good reason to do
that, but what is it?

> @@ -451,47 +442,28 @@ acc_unmap_data (void *h)

> -  /* Mark for removal.  */
> -  n->refcount = 1;
> +  splay_tree_remove (&acc_dev->mem_map, n);
>  
> -  t = n->tgt;
> +  struct target_mem_desc *tgt = n->tgt;
>  
> -  if (t->refcount == 2)
> +  if (tgt->refcount > 0)
> +    tgt->refcount--;
> +  else
>      {
> -[...]
> +      free (tgt->array);
> +      free (tgt);
>      }

Shouldn't that be 'if (tgt->refcount > 1)' (instead of '> 0'), like in
'gomp_unref_tgt' -- or actually use that function?

>  
>    gomp_mutex_unlock (&acc_dev->lock);
>  
> -  gomp_unmap_vars (t, true);
> -
>    if (profiling_p)
>      {
>        thr->prof_info = NULL;

Hmm, I don't understand the changes leading to this, but again, I shall
trust that you've got that right.

Or, was that a bug in the existing code, and we don't have proper test
coverage?

> @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, size_t s, int async)

> -      d = tgt->to_free;

> +      n = lookup_host (acc_dev, h, s);
> +      assert (n != NULL);
> +      d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h
> +		    - n->host_start);

|   return d;

Again, it's not obvious to me how that is semantically equivalent to what
we've returned before?

>  void
> -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async,
> -			 int finalize, int mapnum)
> +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void **hostaddrs,
> +			 size_t *sizes, unsigned short *kinds, int async,
> +			 bool finalize, int mapnum)
>  {

> +      switch (kind)
> +        {
> +	case GOMP_MAP_FROM:
> +	case GOMP_MAP_FORCE_FROM:
> +	case GOMP_MAP_ALWAYS_FROM:
> +	  copyfrom = true;
> +	  /* Fallthrough.  */
> +
> +	case GOMP_MAP_TO_PSET:
> +	case GOMP_MAP_POINTER:
> +	case GOMP_MAP_DELETE:
> +	case GOMP_MAP_RELEASE:
> [...]
> +	default:
> +	  gomp_mutex_unlock (&acc_dev->lock);
> +	  gomp_fatal ("gomp_acc_remove_pointer unhandled kind 0x%.2x",
> +		      kind);

Thanks for being explicit about the expected mapping kinds, etc.

> -      /* If running synchronously, unmap immediately.  */
> -      if (async < acc_async_noval)
> -	gomp_unmap_vars (t, true);
> -      else
> -	{
> -	  goacc_aq aq = get_goacc_asyncqueue (async);
> -	  gomp_unmap_vars_async (t, true, aq);

As mentioned before, 'gomp_acc_remove_pointer' now "has an unused 'async'
formal parameter.  Is that meant to be resolved to an asyncqueue, and
pass that one to 'gomp_copy_dev2host', and call 'gomp_remove_var_async'
instead of 'gomp_remove_var'"?  That's here:

> +	  if (copyfrom)
> +	    gomp_copy_dev2host (acc_dev, NULL, (void *) cur_node.host_start,
> +				(void *) (n->tgt->tgt_start + n->tgt_offset
> +					  + cur_node.host_start
> +					  - n->host_start),
> +				cur_node.host_end - cur_node.host_start);
> +
> +	  if (n->refcount == 0)
> +	    gomp_remove_var (acc_dev, n);
> +	  break;

> --- a/libgomp/oacc-parallel.c
> +++ b/libgomp/oacc-parallel.c
> @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned short *kinds)

I've always been confused by this function (before known as 'find_pset');
this feels wrong, but I've never gotten to the bottom of it.

I'll trust that your changes there can only improve the current
situation, not worsen it.  ;-)

And, again, thanks for being explicit about the expected mapping kinds,
etc.

> @@ -745,8 +762,14 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
>  	    }
>  	  else
>  	    {
> -	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
> -				       &sizes[i], &kinds[i], async);
> +	      goacc_aq aq = get_goacc_asyncqueue (async);
> +	      for (int j = 0; j < 2; j++)

Should this magic constant '2' be derived from 'pointer' or some such?

> +		gomp_map_vars_async (acc_dev, aq,
> +				     (j == 0 || pointer == 2) ? 1 : 2,
> +				     &hostaddrs[i + j], NULL,
> +				     &sizes[i + j], &kinds[i + j], true,
> +				     GOMP_MAP_VARS_OPENACC_ENTER_DATA);

;-) Yuck.  As requested before: "Can we get a comment added to such
'magic', please?"

I just wish that eventually we'll be able to can get rid of that stuff,
and just let 'gomp_map_vars' do its thing.  Similar to
<https://gcc.gnu.org/PR90596> "'GOACC_parallel_keyed' should use
'GOMP_MAP_VARS_TARGET'".

(For avoidance of doubt, that's not your task right now.)

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -536,7 +536,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
>    struct target_mem_desc *tgt
>      = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
>    tgt->list_count = mapnum;
> -  tgt->refcount = pragma_kind == GOMP_MAP_VARS_ENTER_DATA ? 0 : 1;
> +  tgt->refcount = (pragma_kind == GOMP_MAP_VARS_ENTER_DATA
> +		   || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) ? 0 : 1;
>    tgt->device_descr = devicep;
>    struct gomp_coalesce_buf cbuf, *cbufp = NULL;
>  
> @@ -1051,8 +1053,20 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
>    /* If the variable from "omp target enter data" map-list was already mapped,
>       tgt is not needed.  Otherwise tgt will be freed by gomp_unmap_vars or
>       gomp_exit_data.  */
> -  if (pragma_kind == GOMP_MAP_VARS_ENTER_DATA && tgt->refcount == 0)
> -    {
> +  if ((pragma_kind == GOMP_MAP_VARS_ENTER_DATA
> +       || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA)
> +      && tgt->refcount == 0)
> +    {
> +      /* If we're about to discard a target_mem_desc with no "structural"
> +	 references (tgt->refcount == 0), any splay keys linked in the tgt's
> +	 list must have their virtual refcount incremented to represent that
> +	 "lost" reference in order to implement the semantics of the OpenACC
> +	 "present increment" operation properly.  */
> +      if (pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA)
> +	for (i = 0; i < tgt->list_count; i++)
> +	  if (tgt->list[i].key)
> +	    tgt->list[i].key->virtual_refcount++;
> +
>        free (tgt);
>        tgt = NULL;
>      }

So that last item is the only difference between
'GOMP_MAP_VARS_ENTER_DATA' and 'GOMP_MAP_VARS_OPENACC_ENTER_DATA'.  Again
I have not digested that one, but will trust you.

> @@ -1310,7 +1366,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>        k->tgt = tgt;
>        k->tgt_offset = target_table[i].start;
>        k->refcount = REFCOUNT_INFINITY;
> -      k->link_key = NULL;
> +      k->virtual_refcount = 0;
>        array->left = NULL;
>        array->right = NULL;
>        splay_tree_insert (&devicep->mem_map, array);
> @@ -1342,7 +1398,7 @@ gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
>        k->tgt = tgt;
>        k->tgt_offset = target_var->start;
>        k->refcount = target_size & link_bit ? REFCOUNT_LINK : REFCOUNT_INFINITY;
> -      k->link_key = NULL;
> +      k->virtual_refcount = 0;
>        array->left = NULL;
>        array->right = NULL;
>        splay_tree_insert (&devicep->mem_map, array);

Why no longer initialize 'link_key' here?

I'd expect that always all fields of 'k' ('struct splay_tree_key_s') get
initialized, so like:

> @@ -2612,6 +2652,8 @@ omp_target_associate_ptr (const void *host_ptr, const void *device_ptr,
>        k->tgt = tgt;
>        k->tgt_offset = (uintptr_t) device_ptr + device_offset;
>        k->refcount = REFCOUNT_INFINITY;
> +      k->virtual_refcount = 0;
> +      k->u.link_key = NULL;
>        array->left = NULL;
>        array->right = NULL;
>        splay_tree_insert (&devicep->mem_map, array);

(I haven't verified whether that's always done, please verify.)


Grüße
 Thomas
Thomas Schwinge Oct. 23, 2019, 11:37 a.m. UTC | #4
Hi Julian!

On 2019-10-21T16:14:11+0200, I wrote:
> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote:
>> This patch has been broken out of the patch supporting OpenACC 2.6 manual
>> deep copy last posted here:
>>
>>   https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html
>
> Thanks.

I meanwhile re-discovered that an earlier submission,
<http://mid.mail-archive.com/cover.1543438190.git.julian@codesourcery.com>,
had included some documentation/rationale for:

> I haven't understood all the changes related to replacing
> 'dynamic_refcount' with 'virtual_refcount', getting rid of
> 'data_environ', the 'lookup_dev' rework, but I trust you got that right.
> In particular, these seem to remove special-case OpenACC code in favor of
> generic OMP code, which is good.

... these changes.  Please in the future remember to refer to such
existing documentation/rationale, or again include in any re-submissions,
thanks.


>> Tested with offloading to NVPTX, with good results

I noticed that when testing with
'-foffload=x86_64-intelmicemul-linux-gnu', the x86_64-pc-linux-gnu '-m32'
multilib (but not default '-m64', huh) then reproducibly regresses:

    PASS: libgomp.c/target-link-1.c (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.c/target-link-1.c execution test

..., with an un-helpful message: "offload error: process on the device 0
unexpectedly exited with code 0".

So non-OpenACC code paths seem to be negatively affected in some way?

Hopefully that'll go away when backing out the 'VREFCOUNT_LINK_KEY'
etc. changes, as discussed elsewhere.  (I can easily test patches for
you, no need for you to set up Intel MIC (emulated) offloading testing.)


Grüße
 Thomas
Julian Brown Oct. 29, 2019, 12:15 p.m. UTC | #5
Hi!

This is a new version of the patch which hopefully addresses all review
comments. Further commentary below.

On Mon, 21 Oct 2019 16:14:11 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
> wrote:
> > This patch has been broken out of the patch supporting OpenACC 2.6
> > manual deep copy last posted here:
> >
> >   https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01084.html  
> 
> Thanks.
> 
> Remeber to look into <https://gcc.gnu.org/PR92116> "Potential null
> pointer dereference in 'gomp_acc_remove_pointer'", which may be
> relevant here.

I've deleted the whole function (see below) so nothing to do there, I
don't think, even if that code had still been live in the last version
of the patch.

> I see you've merged in the relevant parts of my incremental patch
> '[WIP] OpenACC 2.6 manual deep copy support (attach/detach): adjust
> for "goacc_async_unmap_tgt" removal', that I included in
> <http://mid.mail-archive.com/yxfpftuqpakv.fsf@hertz.schwinge.homeip.net>,
> which tells me that I supposedly understood that part alright.  ;-D

Yes I think so -- I'll add you as co-author to the ChangeLog. Apologies
for the omission!

> > As part of developing that patch, libgomp's OpenACC reference
> > counting implementation proved to be somewhat inconsistent,
> > especially when used in combination with the deep copy support
> > which exercises it more thoroughly.
> >
> > So, this patch contains just the changes to reference-counting
> > behaviour, for ease of (re-)review.  The other parts of OpenACC 2.6
> > manual deep copy support are forthcoming, but some changes in this
> > patch anticipate that support.  
> 
> As we're discussing these separately, please for now remove the
> changes related to the 'VREFCOUNT_LINK_KEY' toggle flag, and moving
> 'link_key' into an union (to later be shared with 'attach_count');
> <http://mid.mail-archive.com/87pniuuhkj.fsf@euler.schwinge.homeip.net>.

Done (I have a plan for the link_key/attach_count fields, but it's not
in this patch, and I'm not sure how well it'll work out yet).

> > Tested with offloading to NVPTX, with good results (though a couple
> > of tests need fixing also).  
> 
> The testsuite changes we're discussing separately, and need to go in
> before this one, obviously.

Those tests no longer regress, so no testsuite changes are strictly
necessary for this patch.

> > OK for trunk?  
> 
> I haven't understood all the changes related to replacing
> 'dynamic_refcount' with 'virtual_refcount', getting rid of
> 'data_environ', the 'lookup_dev' rework, but I trust you got that
> right. In particular, these seem to remove special-case OpenACC code
> in favor of generic OMP code, which is good.

Yep -- the previous email you dug up included the following rationale:

 - reference counts in the linked memory-mapping splay tree structure
   can be self-checked for consistency using optional (i.e.
   development-only) code.  This survives a libgomp test run (with
   offloading to nvptx), so I'm reasonably confident it's good.

 - the "data_environ" field in the device descriptor -- a linear linked
   list containing a target memory descriptor for each "acc enter data"
   mapping -- has been removed.  This brings OpenACC closer to the
   OpenMP implementation for non-lexically-scoped data mapping
   (GOMP_target_enter_exit_data), and is potentially a performance win
   if lots of data is mapped in this way.

 - the semantics of the "dynamic_refcount" field in the splay_tree_key
   structure have shifted slightly, so I've renamed the field.  It now
   represents references that are excess to those represented by actual
   pointers in the linked splay tree/target-memory descriptor structure.
   That might have been the intention before in fact, but the
   implementation was inconsistent.

The big thing here is the auto-checking of refcounting behaviour. There
were quite a few corner cases that were broken before.

> A few more comments:
> 
> > --- a/libgomp/libgomp.h
> > +++ b/libgomp/libgomp.h  
> 
> >  typedef struct acc_dispatch_t
> >  {
> > -  /* This is a linked list of data mapped using the
> > -     acc_map_data/acc_unmap_data or "acc enter data"/"acc exit
> > data" pragmas.
> > -     Unlike mapped_data in the goacc_thread struct, unmapping can
> > -     happen out-of-order with respect to mapping.  */
> > -  /* This is guarded by the lock in the "outer" struct
> > gomp_device_descr.  */
> > -  struct target_mem_desc *data_environ;  
> 
> As mentioned before, please also accordingly update the comment
> attached to 'acc_dispatch_t openacc' in 'struct gomp_device_descr'.

Done.

> That code:
> 
> > -/* Free address mapping tables.  MM must be locked on entry, and
> > remains locked
> > -   on return.  */
> > -
> > -attribute_hidden void
> > -gomp_free_memmap (struct splay_tree_s *mem_map)
> > -{
> > -  while (mem_map->root)
> > -    {
> > -      struct target_mem_desc *tgt = mem_map->root->key.tgt;
> > -
> > -      splay_tree_remove (mem_map, &mem_map->root->key);
> > -      free (tgt->array);
> > -      free (tgt);
> > -    }
> > -}  
> 
> ... kind-of gets inlined here:
> 
> > --- a/libgomp/oacc-init.c
> > +++ b/libgomp/oacc-init.c
> > @@ -356,9 +356,13 @@ acc_shutdown_1 (acc_device_t d)
> >  
> >        if (walk->dev)
> >  	{
> > -	  gomp_mutex_lock (&walk->dev->lock);
> > -	  gomp_free_memmap (&walk->dev->mem_map);
> > -	  gomp_mutex_unlock (&walk->dev->lock);
> > +	  while (walk->dev->mem_map.root)
> > +	    {
> > +	      splay_tree_key k = &walk->dev->mem_map.root->key;
> > +	      gomp_remove_var (walk->dev, k);
> > +	    }
> >  
> >  	  walk->dev = NULL;
> >  	  walk->base_dev = NULL;  
> 
> It's not obvious to me why it's OK to remove the locking?  Don't all
> operations on the 'mem_map' have to have the device locked?

You're probably right about this -- good catch. Although if the user is
shutting down the device whilst it is still active (from some other
thread?) it's just a case of how ugly their crash is going to be
either way, I suspect!

> Does that code now still have the previous (and expected?) "finalize"
> semantics (don't consider 'refcount', always unmap)?  (Should we
> assert here that 'gomp_remove_var' always returns 'true'?  And/or, if
> it doesn't, what does that mean then?)  Or am I confused?  ;-)

Yeah. The splay tree keys are removed one at a time (without paying
attention to the refcounts for those), and the linked target_mem_descs
are freed when their refcounts drop to zero. Hence is_tgt_unmapped
won't always be true -- only when one of the linked target_mem_descs
gets freed.

> > --- a/libgomp/oacc-mem.c
> > +++ b/libgomp/oacc-mem.c  
> 
> > @@ -427,6 +418,7 @@ acc_unmap_data (void *h)
> >  {
> >    struct goacc_thread *thr = goacc_thread ();
> >    struct gomp_device_descr *acc_dev = thr->dev;
> > +  struct splay_tree_key_s cur_node;  
> 
> I know it's often not the case in existing code, but when adding new
> code, please move definitions next to their first use.

Done.

> > @@ -438,12 +430,11 @@ acc_unmap_data (void *h)
> >    acc_api_info api_info;
> >    bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info,
> > &api_info); 
> >    gomp_mutex_lock (&acc_dev->lock);
> >  
> > -  splay_tree_key n = lookup_host (acc_dev, h, 1);
> > -  struct target_mem_desc *t;
> > +  cur_node.host_start = (uintptr_t) h;
> > +  cur_node.host_end = cur_node.host_start + 1;
> > +  splay_tree_key n = splay_tree_lookup (&acc_dev->mem_map,
> > &cur_node); 
> >    if (!n)
> >      {  
> 
> Isn't this just inlining 'lookup_host'?  There may be a good reason
> to do that, but what is it?

Yeah, looks like it. I changed the code to use lookup_host.

> > @@ -451,47 +442,28 @@ acc_unmap_data (void *h)  
> 
> > -  /* Mark for removal.  */
> > -  n->refcount = 1;
> > +  splay_tree_remove (&acc_dev->mem_map, n);
> >  
> > -  t = n->tgt;
> > +  struct target_mem_desc *tgt = n->tgt;
> >  
> > -  if (t->refcount == 2)
> > +  if (tgt->refcount > 0)
> > +    tgt->refcount--;
> > +  else
> >      {
> > -[...]
> > +      free (tgt->array);
> > +      free (tgt);
> >      }  
> 
> Shouldn't that be 'if (tgt->refcount > 1)' (instead of '> 0'), like in
> 'gomp_unref_tgt' -- or actually use that function?

I think you're right about the condition -- well spotted! We can't use
gomp_unref_tgt here because acc_unmap_data isn't supposed to free the
device memory.

> >  
> >    gomp_mutex_unlock (&acc_dev->lock);
> >  
> > -  gomp_unmap_vars (t, true);
> > -
> >    if (profiling_p)
> >      {
> >        thr->prof_info = NULL;  
> 
> Hmm, I don't understand the changes leading to this, but again, I
> shall trust that you've got that right.
> 
> Or, was that a bug in the existing code, and we don't have proper test
> coverage?

I think that was a bug in the original code.

> > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h,
> > size_t s, int async)  
> 
> > -      d = tgt->to_free;  
> 
> > +      n = lookup_host (acc_dev, h, s);
> > +      assert (n != NULL);
> > +      d = (void *) (n->tgt->tgt_start + n->tgt_offset +
> > (uintptr_t) h
> > +		    - n->host_start);  
> 
> |   return d;
> 
> Again, it's not obvious to me how that is semantically equivalent to
> what we've returned before?

This is a bug fix (it's mentioned in the ChangeLog).

> >  void
> > -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom,
> > int async,
> > -			 int finalize, int mapnum)
> > +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void
> > **hostaddrs,
> > +			 size_t *sizes, unsigned short *kinds, int
> > async,
> > +			 bool finalize, int mapnum)
> >  {  
> 
> > +      switch (kind)
> > +        {
> > +	case GOMP_MAP_FROM:
> > +	case GOMP_MAP_FORCE_FROM:
> > +	case GOMP_MAP_ALWAYS_FROM:
> > +	  copyfrom = true;
> > +	  /* Fallthrough.  */
> > +
> > +	case GOMP_MAP_TO_PSET:
> > +	case GOMP_MAP_POINTER:
> > +	case GOMP_MAP_DELETE:
> > +	case GOMP_MAP_RELEASE:
> > [...]
> > +	default:
> > +	  gomp_mutex_unlock (&acc_dev->lock);
> > +	  gomp_fatal ("gomp_acc_remove_pointer unhandled kind
> > 0x%.2x",
> > +		      kind);  
> 
> Thanks for being explicit about the expected mapping kinds, etc.

That code's all gone with this version...

> > -      /* If running synchronously, unmap immediately.  */
> > -      if (async < acc_async_noval)
> > -	gomp_unmap_vars (t, true);
> > -      else
> > -	{
> > -	  goacc_aq aq = get_goacc_asyncqueue (async);
> > -	  gomp_unmap_vars_async (t, true, aq);  
> 
> As mentioned before, 'gomp_acc_remove_pointer' now "has an unused
> 'async' formal parameter.  Is that meant to be resolved to an
> asyncqueue, and pass that one to 'gomp_copy_dev2host', and call
> 'gomp_remove_var_async' instead of 'gomp_remove_var'"?  That's here:

Hmm yeah, that's all gone however.

> > +	  if (copyfrom)
> > +	    gomp_copy_dev2host (acc_dev, NULL, (void *)
> > cur_node.host_start,
> > +				(void *) (n->tgt->tgt_start +
> > n->tgt_offset
> > +					  + cur_node.host_start
> > +					  - n->host_start),
> > +				cur_node.host_end -
> > cur_node.host_start); +
> > +	  if (n->refcount == 0)
> > +	    gomp_remove_var (acc_dev, n);
> > +	  break;  
> 
> > --- a/libgomp/oacc-parallel.c
> > +++ b/libgomp/oacc-parallel.c
> > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned
> > short *kinds)  
> 
> I've always been confused by this function (before known as
> 'find_pset'); this feels wrong, but I've never gotten to the bottom
> of it.

This version removes that function in favour of a function that finds
groups of consecutive mappings that should be kept together for a
single gomp_map_vars invocation. I think that fits better with my
findings as written up on the wiki page
https://gcc.gnu.org/wiki/LibgompPointerMappingKinds.

> I'll trust that your changes there can only improve the current
> situation, not worsen it.  ;-)
> 
> And, again, thanks for being explicit about the expected mapping
> kinds, etc.
> 
> > @@ -745,8 +762,14 @@ GOACC_enter_exit_data (int flags_m, size_t
> > mapnum, }
> >  	  else
> >  	    {
> > -	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
> > -				       &sizes[i], &kinds[i],
> > async);
> > +	      goacc_aq aq = get_goacc_asyncqueue (async);
> > +	      for (int j = 0; j < 2; j++)  
> 
> Should this magic constant '2' be derived from 'pointer' or some such?
> 
> > +		gomp_map_vars_async (acc_dev, aq,
> > +				     (j == 0 || pointer == 2) ?
> > 1 : 2,
> > +				     &hostaddrs[i + j], NULL,
> > +				     &sizes[i + j], &kinds[i + j],
> > true,
> > +
> > GOMP_MAP_VARS_OPENACC_ENTER_DATA);  
> 
> ;-) Yuck.  As requested before: "Can we get a comment added to such
> 'magic', please?"

That magic is gone now. 

> I just wish that eventually we'll be able to can get rid of that
> stuff, and just let 'gomp_map_vars' do its thing.  Similar to
> <https://gcc.gnu.org/PR90596> "'GOACC_parallel_keyed' should use
> 'GOMP_MAP_VARS_TARGET'".
> 
> (For avoidance of doubt, that's not your task right now.)

Does this version look better? I've removed the special-case handling
of pointers in the enter/exit data code, and combined the
gomp_acc_remove_pointer code (which now iterated over mappings
one-at-a-time anyway) with the loop iterating over mappings in the
new goacc_exit_data_internal function. It was a bit nonsensical to have
the "exit data" code split over two files, as before.

> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> > @@ -536,7 +536,8 @@ gomp_map_vars_internal (struct
> > gomp_device_descr *devicep, struct target_mem_desc *tgt
> >      = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
> >    tgt->list_count = mapnum;
> > -  tgt->refcount = pragma_kind == GOMP_MAP_VARS_ENTER_DATA ? 0 : 1;
> > +  tgt->refcount = (pragma_kind == GOMP_MAP_VARS_ENTER_DATA
> > +		   || pragma_kind ==
> > GOMP_MAP_VARS_OPENACC_ENTER_DATA) ? 0 : 1; tgt->device_descr =
> > devicep; struct gomp_coalesce_buf cbuf, *cbufp = NULL;
> >  
> > @@ -1051,8 +1053,20 @@ gomp_map_vars_internal (struct
> > gomp_device_descr *devicep, /* If the variable from "omp target
> > enter data" map-list was already mapped, tgt is not needed.
> > Otherwise tgt will be freed by gomp_unmap_vars or gomp_exit_data.
> > */
> > -  if (pragma_kind == GOMP_MAP_VARS_ENTER_DATA && tgt->refcount ==
> > 0)
> > -    {
> > +  if ((pragma_kind == GOMP_MAP_VARS_ENTER_DATA
> > +       || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA)
> > +      && tgt->refcount == 0)
> > +    {
> > +      /* If we're about to discard a target_mem_desc with no
> > "structural"
> > +	 references (tgt->refcount == 0), any splay keys linked in
> > the tgt's
> > +	 list must have their virtual refcount incremented to
> > represent that
> > +	 "lost" reference in order to implement the semantics of
> > the OpenACC
> > +	 "present increment" operation properly.  */
> > +      if (pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA)
> > +	for (i = 0; i < tgt->list_count; i++)
> > +	  if (tgt->list[i].key)
> > +	    tgt->list[i].key->virtual_refcount++;
> > +
> >        free (tgt);
> >        tgt = NULL;
> >      }  
> 
> So that last item is the only difference between
> 'GOMP_MAP_VARS_ENTER_DATA' and 'GOMP_MAP_VARS_OPENACC_ENTER_DATA'.
> Again I have not digested that one, but will trust you.

Yeah, because of the OpenACC reference counting & finalize semantics,
which I don't think are applicable to OpenMP.

> > @@ -1310,7 +1366,7 @@ gomp_load_image_to_device (struct
> > gomp_device_descr *devicep, unsigned version, k->tgt = tgt;
> >        k->tgt_offset = target_table[i].start;
> >        k->refcount = REFCOUNT_INFINITY;
> > -      k->link_key = NULL;
> > +      k->virtual_refcount = 0;
> >        array->left = NULL;
> >        array->right = NULL;
> >        splay_tree_insert (&devicep->mem_map, array);
> > @@ -1342,7 +1398,7 @@ gomp_load_image_to_device (struct
> > gomp_device_descr *devicep, unsigned version, k->tgt = tgt;
> >        k->tgt_offset = target_var->start;
> >        k->refcount = target_size & link_bit ? REFCOUNT_LINK :
> > REFCOUNT_INFINITY;
> > -      k->link_key = NULL;
> > +      k->virtual_refcount = 0;
> >        array->left = NULL;
> >        array->right = NULL;
> >        splay_tree_insert (&devicep->mem_map, array);  
> 
> Why no longer initialize 'link_key' here?
> 
> I'd expect that always all fields of 'k' ('struct splay_tree_key_s')
> get initialized, so like:
> 
> > @@ -2612,6 +2652,8 @@ omp_target_associate_ptr (const void
> > *host_ptr, const void *device_ptr, k->tgt = tgt;
> >        k->tgt_offset = (uintptr_t) device_ptr + device_offset;
> >        k->refcount = REFCOUNT_INFINITY;
> > +      k->virtual_refcount = 0;
> > +      k->u.link_key = NULL;
> >        array->left = NULL;
> >        array->right = NULL;
> >        splay_tree_insert (&devicep->mem_map, array);  
> 
> (I haven't verified whether that's always done, please verify.)

This version (without the link_key union, etc.) should avoid those
problems. I've added some missing initialisations, too.

Re-tested with offloading to nvptx. OK for trunk?

Thanks,

Julian
Thomas Schwinge Oct. 31, 2019, 6:11 p.m. UTC | #6
Hi Julian!

On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote:
> This is a new version of the patch which hopefully addresses all review
> comments. Further commentary below.

Thanks, great, looking into that one -- I see you're removing more and
more special-case, strange code, replacing it with generic and/or
well-explained code.


Question, for my understanding:

> On Mon, 21 Oct 2019 16:14:11 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
>> wrote:

>> > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h, size_t s, int async)  
>> 
>> > -      d = tgt->to_free;  
>> 
>> > +      n = lookup_host (acc_dev, h, s);
>> > +      assert (n != NULL);
>> > +      d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h
>> > +		    - n->host_start);  
>> 
>> |   return d;
>> 
>> Again, it's not obvious to me how that is semantically equivalent to
>> what we've returned before?
>
> This is a bug fix (it's mentioned in the ChangeLog).

Eh, well hidden.  Indeed that mentions:

	(present_create_copy): [...] Fix target pointer
	return value.

So that's not related to reference counting, needs to be discussed
separately.

..., and while I do agree that the current code is a bit "strange"
(returning 'tgt->to_free'), I couldn't quickly find or come up with a
test cases where this would actually do the wrong thing.  After all, this
is the code path taken for "not present", and 'tgt' is built anew for one
single mapping, with no alignment set (which would cause 'to_free' to
differ from 'tgt_start'); 'tgt_offset' should always be zero, and 'h'
always the same as 'host_start'.  What am I missing?  That is, given the
current set of libgomp test cases, the attached never triggeres.


Grüße
 Thomas
Julian Brown Nov. 9, 2019, 1:28 a.m. UTC | #7
On Thu, 31 Oct 2019 19:11:57 +0100
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi Julian!
> 
> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com>
> wrote:
> > This is a new version of the patch which hopefully addresses all
> > review comments. Further commentary below.  
> 
> Thanks, great, looking into that one -- I see you're removing more and
> more special-case, strange code, replacing it with generic and/or
> well-explained code.
> 
> 
> Question, for my understanding:
> 
> > On Mon, 21 Oct 2019 16:14:11 +0200
> > Thomas Schwinge <thomas@codesourcery.com> wrote:  
> >> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
> >> wrote:  
> 
> >> > @@ -577,17 +551,14 @@ present_create_copy (unsigned f, void *h,
> >> > size_t s, int async)    
> >>   
> >> > -      d = tgt->to_free;    
> >>   
> >> > +      n = lookup_host (acc_dev, h, s);
> >> > +      assert (n != NULL);
> >> > +      d = (void *) (n->tgt->tgt_start + n->tgt_offset +
> >> > (uintptr_t) h
> >> > +		    - n->host_start);    
> >> 
> >> |   return d;
> >> 
> >> Again, it's not obvious to me how that is semantically equivalent
> >> to what we've returned before?  
> >
> > This is a bug fix (it's mentioned in the ChangeLog).  
> 
> Eh, well hidden.  Indeed that mentions:
> 
> 	(present_create_copy): [...] Fix target pointer
> 	return value.
> 
> So that's not related to reference counting, needs to be discussed
> separately.
> 
> ..., and while I do agree that the current code is a bit "strange"
> (returning 'tgt->to_free'), I couldn't quickly find or come up with a
> test cases where this would actually do the wrong thing.  After all,
> this is the code path taken for "not present", and 'tgt' is built
> anew for one single mapping, with no alignment set (which would cause
> 'to_free' to differ from 'tgt_start'); 'tgt_offset' should always be
> zero, and 'h' always the same as 'host_start'.  What am I missing?
> That is, given the current set of libgomp test cases, the attached
> never triggeres.

The code can't stay exactly as it is with this patch, because the tgt
return value from gomp_map_vars_async with
GOMP_MAP_VARS_OPENACC_ENTER_DATA is a null pointer.

So, the device pointer calculation needed to be re-done -- although it's
not quite a bug fix, as you point out, and some of the offsets will
always be zero or cancel out in practice.

*However*, it looks like the device pointer calculation for the
"present" case is wrong in the preceding code. I've addressed that in
the patch posted here:

https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00661.html

The patch attached here applies on top of that one, and attempts to
keep the device pointer calculation "the same" for the non-present
case, modulo an extra lookup_host -- and also adds some assertions to
make sure the assumptions about zero/cancelled-out offsets stay true.

OK for trunk? Re-tested with offloading to nvptx.

Thanks,

Julian
Julian Brown Nov. 22, 2019, 11:21 p.m. UTC | #8
On Sat, 9 Nov 2019 01:28:51 +0000
Julian Brown <julian@codesourcery.com> wrote:

> On Thu, 31 Oct 2019 19:11:57 +0100
> Thomas Schwinge <thomas@codesourcery.com> wrote:
> 
> > So that's not related to reference counting, needs to be discussed
> > separately.
> > 
> > ..., and while I do agree that the current code is a bit "strange"
> > (returning 'tgt->to_free'), I couldn't quickly find or come up with
> > a test cases where this would actually do the wrong thing.  After
> > all, this is the code path taken for "not present", and 'tgt' is
> > built anew for one single mapping, with no alignment set (which
> > would cause 'to_free' to differ from 'tgt_start'); 'tgt_offset'
> > should always be zero, and 'h' always the same as 'host_start'.
> > What am I missing? That is, given the current set of libgomp test
> > cases, the attached never triggeres.  
> 
> The code can't stay exactly as it is with this patch, because the tgt
> return value from gomp_map_vars_async with
> GOMP_MAP_VARS_OPENACC_ENTER_DATA is a null pointer.
> 
> So, the device pointer calculation needed to be re-done -- although
> it's not quite a bug fix, as you point out, and some of the offsets
> will always be zero or cancel out in practice.
> 
> *However*, it looks like the device pointer calculation for the
> "present" case is wrong in the preceding code. I've addressed that in
> the patch posted here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00661.html
> 
> The patch attached here applies on top of that one, and attempts to
> keep the device pointer calculation "the same" for the non-present
> case, modulo an extra lookup_host -- and also adds some assertions to
> make sure the assumptions about zero/cancelled-out offsets stay true.

Here's another iteration that applies over the version of the
present/subarray patch committed, and also addresses the use of
REFCOUNT_INFINITY on target blocks as queried in the following message:

https://gcc.gnu.org/ml/gcc-patches/2019-11/msg01146.html

Most uses of REFCOUNT_INFINITY indeed appear to be unreachable (as in,
a target_mem_desc with refcount == REFCOUNT_INFINITY will most of the
time be linked from a splay tree key with refcount ==
REFCOUNT_INFINITY, and the code to decrement the former's refcount
and/or free the block will never be called).

I found one case (for OpenACC) where a runtime check/error can be
added -- attempting to free a mapped target block corresponding to a
device_resident global variable using an API routine. I don't think
there's a code path using directives (for either OpenACC or OpenMP) that
exhibits any problematic behaviour in that regard. I've added a couple
of test cases, and a couple of assertions.

OK now? (Or perhaps the REFCOUNT_INFINITY bits want splitting out? It
all still arguably comes under the "refcount overhaul" umbrella!).

Thanks,

Julian
Thomas Schwinge Dec. 9, 2019, 2:44 p.m. UTC | #9
Hi Julian!

On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com> wrote:
> --- a/libgomp/oacc-mem.c
> +++ b/libgomp/oacc-mem.c

> @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)

>        if (f & FLAG_COPYOUT)
> [...]
>  	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
>  	}
> -      gomp_remove_var (acc_dev, n);
> +      gomp_remove_var_async (acc_dev, n, aq);

Conceptually, I understand correctly that we need to use this (new)
'gomp_remove_var_async' to make sure that we don't
'gomp_free_device_memory' while the 'gomp_copy_dev2host' cited above is
still in process?

I'm curious why this isn't causing any problems for nvptx offloading
already, any thoughts on that?  Or, is this just missing test coverage?
(Always difficult for 'async' stuff, of course.)  By chance, is this
right now already causing problems with AMD GCN offloading?  (I really
need to set up AMD GCN offloading testing...)


I'm citing below the changes introducing 'gomp_remove_var_async',
modelled similar to the existing 'gomp_unmap_vars_async'.


Also for both these, do I understand correctly, that it's actually not
the 'gomp_unref_tgt' that needs to be "delayed" via 'goacc_asyncqueue',
but rather really only the 'gomp_free_device_memory', called via
'gomp_unmap_tgt', called via 'gomp_unref_tgt'?  In other words: why do we
need to keep the 'struct target_mem_desc' alive?  Per my understanding,
that one is one component of the mapping table, and not relevant anymore
(thus can be 'free'd) as soon as it has been determined that
'tgt->refcount == 0'?  Am I missing something there?

It will be OK to clean that up later, but I'd like to understand this
now.  Well, or, stating that you just blindly copied that from the
existing 'gomp_unmap_vars_async' is fine, too!  ;-P


Grüße
 Thomas


> --- a/libgomp/target.c
> +++ b/libgomp/target.c

> @@ -1092,32 +1106,66 @@ gomp_unmap_tgt (struct target_mem_desc *tgt)
>    free (tgt);
>  }
>  
> -attribute_hidden bool
> -gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k)
> +static bool
> +gomp_unref_tgt (void *ptr)
>  {
>    bool is_tgt_unmapped = false;
> -  splay_tree_remove (&devicep->mem_map, k);
> -  if (k->link_key)
> -    splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->link_key);
> -  if (k->tgt->refcount > 1)
> -    k->tgt->refcount--;
> +
> +  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
> +
> +  if (tgt->refcount > 1)
> +    tgt->refcount--;
>    else
>      {
> +      gomp_unmap_tgt (tgt);
>        is_tgt_unmapped = true;
> -      gomp_unmap_tgt (k->tgt);
>      }
> +
>    return is_tgt_unmapped;
>  }
>  
>  static void
> -gomp_unref_tgt (void *ptr)
> +gomp_unref_tgt_void (void *ptr)
>  {
> -  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
> +  (void) gomp_unref_tgt (ptr);
> +}
>  
> -  if (tgt->refcount > 1)
> -    tgt->refcount--;
> +static inline __attribute__((always_inline)) bool
> +gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k,
> +			  struct goacc_asyncqueue *aq)
> +{
> +  bool is_tgt_unmapped = false;
> +  splay_tree_remove (&devicep->mem_map, k);
> +  if (k->virtual_refcount == VREFCOUNT_LINK_KEY)
> +    {
> +      if (k->u.link_key)
> +	splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->u.link_key);
> +    }
> +  if (aq)
> +    devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void,
> +						(void *) k->tgt);
>    else
> -    gomp_unmap_tgt (tgt);
> +    is_tgt_unmapped = gomp_unref_tgt ((void *) k->tgt);
> +  return is_tgt_unmapped;
> +}
> +
> +attribute_hidden bool
> +gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k)
> +{
> +  return gomp_remove_var_internal (devicep, k, NULL);
> +}
> +
> +/* Remove a variable asynchronously.  This actually removes the variable
> +   mapping immediately, but retains the linked target_mem_desc until the
> +   asynchronous operation has completed (as it may still refer to target
> +   memory).  The device lock must be held before entry, and remains locked on
> +   exit.  */
> +
> +attribute_hidden void
> +gomp_remove_var_async (struct gomp_device_descr *devicep, splay_tree_key k,
> +		       struct goacc_asyncqueue *aq)
> +{
> +  (void) gomp_remove_var_internal (devicep, k, aq);
>  }
Julian Brown Dec. 9, 2019, 3:04 p.m. UTC | #10
On Mon, 9 Dec 2019 15:44:25 +0100
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi Julian!
> 
> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
> wrote:
> > --- a/libgomp/oacc-mem.c
> > +++ b/libgomp/oacc-mem.c  
> 
> > @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t
> > s, int async, const char *libfnname)  
> 
> >        if (f & FLAG_COPYOUT)
> > [...]
> >  	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
> >  	}
> > -      gomp_remove_var (acc_dev, n);
> > +      gomp_remove_var_async (acc_dev, n, aq);  
> 
> Conceptually, I understand correctly that we need to use this (new)
> 'gomp_remove_var_async' to make sure that we don't
> 'gomp_free_device_memory' while the 'gomp_copy_dev2host' cited above
> is still in process?

Yep.

> I'm curious why this isn't causing any problems for nvptx offloading
> already, any thoughts on that?  Or, is this just missing test
> coverage? (Always difficult for 'async' stuff, of course.)  By
> chance, is this right now already causing problems with AMD GCN
> offloading?  (I really need to set up AMD GCN offloading testing...)

In a few cases, async stuff on nvidia seems to "just work" even in
cases where we wouldn't expect it to via inspection (either because the
driver/hardware is doing something "magic", or because we're
somehow driving async operations in such a way that they run
synchronously in practice). One such case is with the "ephemeral"
asynchronous host-to-device memory copy patch.

The AMD side seems much more sensitive to improper async behaviour --
but I don't actually remember if I hit problems with this code in
particular.

> I'm citing below the changes introducing 'gomp_remove_var_async',
> modelled similar to the existing 'gomp_unmap_vars_async'.
> 
> 
> Also for both these, do I understand correctly, that it's actually not
> the 'gomp_unref_tgt' that needs to be "delayed" via
> 'goacc_asyncqueue', but rather really only the
> 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via
> 'gomp_unref_tgt'?  In other words: why do we need to keep the 'struct
> target_mem_desc' alive?  Per my understanding, that one is one
> component of the mapping table, and not relevant anymore (thus can be
> 'free'd) as soon as it has been determined that 'tgt->refcount ==
> 0'?  Am I missing something there?

IIRC, that was Chung-Lin's choice. I'll CC him in. I think delaying
freeing of the target_mem_desc isn't really a huge problem, in practice.

> It will be OK to clean that up later, but I'd like to understand this
> now.  Well, or, stating that you just blindly copied that from the
> existing 'gomp_unmap_vars_async' is fine, too!  ;-P

Some changes arose via the porting to AMD GCN, and some may have been
drive-by fixes (e.g. where a synchronous call was used in a context
where it is obvious that an asynchronous call is really needed). Like
you mentioned, test coverage could probably be better, and writing
reliable tests for async behaviour is challenging.

Julian
Thomas Schwinge Dec. 10, 2019, 8:25 a.m. UTC | #11
Hi Julian!

On 2019-12-09T15:04:15+0000, Julian Brown <julian@codesourcery.com> wrote:
> On Mon, 9 Dec 2019 15:44:25 +0100
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
>> wrote:
>> > --- a/libgomp/oacc-mem.c
>> > +++ b/libgomp/oacc-mem.c  
>> 
>> > @@ -715,48 +684,34 @@ delete_copyout (unsigned f, void *h, size_t
>> > s, int async, const char *libfnname)  
>> 
>> >        if (f & FLAG_COPYOUT)
>> > [...]
>> >  	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
>> >  	}
>> > -      gomp_remove_var (acc_dev, n);
>> > +      gomp_remove_var_async (acc_dev, n, aq);  
>> 
>> Conceptually, I understand correctly that we need to use this (new)
>> 'gomp_remove_var_async' to make sure that we don't
>> 'gomp_free_device_memory' while the 'gomp_copy_dev2host' cited above
>> is still in process?
>
> Yep.

OK, so please prepare a patch changing just that, referencing PR92881:
's%gomp_remove_var%gomp_remove_var_async%' as cited above and also in
'libgomp/target.c:gomp_unmap_vars_internal' (for clarity, even though it
doesn't matter in practice as that call will never
'gomp_free_device_memory'; see
<http://mid.mail-archive.com/871rtg43me.fsf@euler.schwinge.homeip.net>),
plus the addition of 'libgomp/target.c:gomp_remove_var_async' etc.


>> I'm curious why this isn't causing any problems for nvptx offloading
>> already, any thoughts on that?  Or, is this just missing test
>> coverage? (Always difficult for 'async' stuff, of course.)  By
>> chance, is this right now already causing problems with AMD GCN
>> offloading?  (I really need to set up AMD GCN offloading testing...)
>
> In a few cases, async stuff on nvidia seems to "just work" even in
> cases where we wouldn't expect it to via inspection (either because the
> driver/hardware is doing something "magic"

Yeah, I too wondered whether there might be some such "magic" going on,
to "help" users...

> or because we're
> somehow driving async operations in such a way that they run
> synchronously in practice).

Hope that's not that case.  ;-)

> One such case is with the "ephemeral"
> asynchronous host-to-device memory copy patch.

(Yeah, I still need to look into that.)

> The AMD side seems much more sensitive to improper async behaviour --
> but I don't actually remember if I hit problems with this code in
> particular.


>> I'm citing below the changes introducing 'gomp_remove_var_async',
>> modelled similar to the existing 'gomp_unmap_vars_async'.
>> 
>> 
>> Also for both these, do I understand correctly, that it's actually not
>> the 'gomp_unref_tgt' that needs to be "delayed" via
>> 'goacc_asyncqueue', but rather really only the
>> 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via
>> 'gomp_unref_tgt'?  In other words: why do we need to keep the 'struct
>> target_mem_desc' alive?  Per my understanding, that one is one
>> component of the mapping table, and not relevant anymore (thus can be
>> 'free'd) as soon as it has been determined that 'tgt->refcount ==
>> 0'?  Am I missing something there?
>
> IIRC, that was Chung-Lin's choice. I'll CC him in.

;-) Or even mine; see 'gomp_unmap_vars_async' description and incremental
patch in <https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01620.html>.

> I think delaying
> freeing of the target_mem_desc isn't really a huge problem, in practice.

It certainly isn't a problem (only small bits of host memory "delayed"),
but it still isn't the most clean design.  Anyway:

>> It will be OK to clean that up later

>> but I'd like to understand this
>> now.  Well, or, stating that you just blindly copied that from the
>> existing 'gomp_unmap_vars_async' is fine, too!  ;-P
>
> Some changes arose via the porting to AMD GCN, and some may have been
> drive-by fixes (e.g. where a synchronous call was used in a context
> where it is obvious that an asynchronous call is really needed).

Please, again, for sake of easy review, always do such changes separately
from whatever else you're working on.  This of course will add a bit of
delay during your original development, but will make review and
reasoning much, much easier -- at that time, and also when someone
(yourself even?) needs to look up again something from the development
history.


> Like
> you mentioned, test coverage could probably be better, and writing
> reliable tests for async behaviour is challenging.

Thus we need to invent something, eventually.  Not testing stuff because
it's challenging is not a good excuse for shipping un-tested code.


Grüße
 Thomas
Chung-Lin Tang Dec. 11, 2019, 7:36 a.m. UTC | #12
On 2019/12/10 12:04 AM, Julian Brown wrote:
>> I'm citing below the changes introducing 'gomp_remove_var_async',
>> modelled similar to the existing 'gomp_unmap_vars_async'.
>>
>>
>> Also for both these, do I understand correctly, that it's actually not
>> the 'gomp_unref_tgt' that needs to be "delayed" via
>> 'goacc_asyncqueue', but rather really only the
>> 'gomp_free_device_memory', called via 'gomp_unmap_tgt', called via
>> 'gomp_unref_tgt'?  In other words: why do we need to keep the 'struct
>> target_mem_desc' alive?  Per my understanding, that one is one
>> component of the mapping table, and not relevant anymore (thus can be
>> 'free'd) as soon as it has been determined that 'tgt->refcount ==
>> 0'?  Am I missing something there?
> IIRC, that was Chung-Lin's choice. I'll CC him in. I think delaying
> freeing of the target_mem_desc isn't really a huge problem, in practice.

I don't clearly remember all the details. It could be possible that not
asyncqueue-ifying gomp_remove_var was simply an overlook.

The 'target_mem_desc' is supposed to represent the piece of device memory
inside libgomp, so unref/freeing it only after all dev-to-host copying is
done seems logical.

Chung-Lin
Thomas Schwinge Dec. 11, 2019, 5:22 p.m. UTC | #13
Hi!

On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote:
> On Mon, 21 Oct 2019 16:14:11 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
>> wrote:
>> >  void
>> > -gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom,
>> > int async,
>> > -			 int finalize, int mapnum)
>> > +gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void
>> > **hostaddrs,
>> > +			 size_t *sizes, unsigned short *kinds, int
>> > async,
>> > +			 bool finalize, int mapnum)
>> >  {  
>> > [...]

> That code's all gone with this version...

\o/ Yay!

>> > --- a/libgomp/oacc-parallel.c
>> > +++ b/libgomp/oacc-parallel.c
>> > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned
>> > short *kinds)  
>> 
>> I've always been confused by this function (before known as
>> 'find_pset'); this feels wrong, but I've never gotten to the bottom
>> of it.
>
> This version removes that function in favour of a function that finds
> groups of consecutive mappings that should be kept together for a
> single gomp_map_vars invocation. I think that fits better with my
> findings as written up on the wiki page
> https://gcc.gnu.org/wiki/LibgompPointerMappingKinds.

\o/ Yay!

>> > [...]
>> 
>> ;-) Yuck.  As requested before: "Can we get a comment added to such
>> 'magic', please?"
>
> That magic is gone now. 

\o/ Yay!

>> I just wish that eventually we'll be able to can get rid of that
>> stuff, and just let 'gomp_map_vars' do its thing.  Similar to
>> <https://gcc.gnu.org/PR90596> "'GOACC_parallel_keyed' should use
>> 'GOMP_MAP_VARS_TARGET'".
>> 
>> (For avoidance of doubt, that's not your task right now.)

> I've removed the special-case handling
> of pointers in the enter/exit data code, and combined the
> gomp_acc_remove_pointer code (which now iterated over mappings
> one-at-a-time anyway) with the loop iterating over mappings in the
> new goacc_exit_data_internal function. It was a bit nonsensical to have
> the "exit data" code split over two files, as before.

Yes, I like that very much, and we shall tackle that next intermediate
step once your patch for <https://gcc.gnu.org/PR92881> "[OpenACC] In
async context, need to use 'gomp_remove_var_async' instead of
'gomp_remove_var'" is done,
<http://mid.mail-archive.com/87tv681tb3.fsf@euler.schwinge.homeip.net>.

One thing:

>             libgomp/

>             * oacc-parallel.c (find_pointer): Remove function.
>             (find_group_last, goacc_enter_data_internal,
>             goacc_exit_data_internal): New functions.
>             (GOACC_enter_exit_data): Use goacc_enter_data_internal and
>             goacc_exit_data_internal helper functions.

It makes much sense to move all that into 'libgomp/oacc-mem.c', and as a
preparational step, see attached "[OpenACC] Consolidate
'GOACC_enter_exit_data' and its helper functions in
'libgomp/oacc-mem.c'", committed to trunk in r279233.


Grüße
 Thomas
Thomas Schwinge Dec. 13, 2019, 3:25 p.m. UTC | #14
Hi Julian!

On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote:
> On Mon, 21 Oct 2019 16:14:11 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>
>> On 2019-10-03T09:35:04-0700, Julian Brown <julian@codesourcery.com>
>> wrote:
>> > --- a/libgomp/oacc-parallel.c
>> > +++ b/libgomp/oacc-parallel.c
>> > @@ -56,12 +56,29 @@ find_pointer (int pos, size_t mapnum, unsigned
>> > short *kinds)  
>> 
>> I've always been confused by this function (before known as
>> 'find_pset'); this feels wrong, but I've never gotten to the bottom
>> of it.
>
> This version removes that function in favour of a function that finds
> groups of consecutive mappings that should be kept together for a
> single gomp_map_vars invocation. I think that fits better with my
> findings as written up on the wiki page
> https://gcc.gnu.org/wiki/LibgompPointerMappingKinds.

:-) Please guide my trying to understand the changes there:

> --- a/libgomp/oacc-parallel.c
> +++ b/libgomp/oacc-parallel.c
> @@ -47,23 +47,39 @@ _Static_assert (GOACC_FLAGS_UNMARSHAL (GOMP_DEVICE_HOST_FALLBACK)
>  		"legacy GOMP_DEVICE_HOST_FALLBACK broken");
>  
>  
> -/* Returns the number of mappings associated with the pointer or pset. PSET
> -   have three mappings, whereas pointer have two.  */
> +/* Some types of (pointer) variables use several consecutive mappings, which
> +   must be treated as a group for enter/exit data directives.  This function
> +   returns the last mapping in such a group (inclusive), or POS for singleton
> +   mappings.  */
>  
>  static int
> -find_pointer (int pos, size_t mapnum, unsigned short *kinds)
> +find_group_last (int pos, size_t mapnum, unsigned short *kinds)
>  {
> -  if (pos + 1 >= mapnum)
> -    return 0;
> +  unsigned char kind0 = kinds[pos] & 0xff;
> +  int first_pos = pos, last_pos = pos;
>  
> -  unsigned char kind = kinds[pos+1] & 0xff;
> -
> -  if (kind == GOMP_MAP_TO_PSET)
> -    return 3;
> -  else if (kind == GOMP_MAP_POINTER)
> -    return 2;
> +  if (kind0 == GOMP_MAP_TO_PSET)
> +    {
> +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER)
> +	last_pos = ++pos;
> +      /* We expect at least one GOMP_MAP_POINTER after a GOMP_MAP_TO_PSET.  */
> +      assert (last_pos > first_pos);
> +    }
> +  else
> +    {
> +      /* GOMP_MAP_ALWAYS_POINTER can only appear directly after some other
> +	 mapping.  */
> +      if (pos + 1 < mapnum
> +	  && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER)
> +	return pos + 1;
> +
> +      /* We can have one or several GOMP_MAP_POINTER mappings after a to/from
> +	 (etc.) mapping.  */
> +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER)
> +	last_pos = ++pos;
> +    }
>  
> -  return 0;
> +  return last_pos;
>  }

So I ran a simple experiment where I did:

    assert (find_group_last (i, mapnum, kinds) == i + pointer);

... where 'pointer' is the current 'find_pointer' function.  (That is,
compare that the old and new way are doing the same things, given the
current GCC code generation/test cases.)

This 'assert' triggers for a few test cases:
'libgomp.oacc-fortran/allocatable-array-1.f90',
'libgomp.oacc-fortran/data-2.f90', 'libgomp.oacc-fortran/data-3.f90',
'libgomp.oacc-fortran/data-4-2.f90', 'libgomp.oacc-fortran/data-4.f90',
'libgomp.oacc-fortran/data-5.f90', 'libgomp.oacc-fortran/if-1.f90',
'libgomp.oacc-fortran/optional-data-enter-exit.f90'.  (Maybe those are
the only ones actually using that stuff?)

I looked into the first one
('libgomp.oacc-fortran/allocatable-array-1.f90'), and for:

    integer, parameter :: n = 40
    integer, allocatable :: ar(:,:,:)

    allocate (ar(1:n,0:n-1,0:n-1))
    !$acc enter data copyin (ar)

... found:

    (gdb) print mapnum
    $2 = 3
    (gdb) print kinds[0]
    $3 = 1 // GOMP_MAP_TO
    (gdb) print kinds[1]
    $4 = 773
    (gdb) print kinds[1] & 0xff
    $5 = 5 // GOMP_MAP_TO_PSET
    (gdb) print kinds[2]
    $6 = 772
    (gdb) print kinds[2] & 0xff
    $7 = 4 // GOMP_MAP_POINTER

Current behavior: 'find_pointer (0, mapnum, kinds) == 3', so all three
get mapped as one group.

New behavior: 'find_group_last (0, mapnum, kinds) == 0', so the
'GOMP_MAP_TO' gets mapped alone.  Then, 'find_group_last (1, mapnum,
kinds) == 2', so the 'GOMP_MAP_TO_PSET', 'GOMP_MAP_POINTER' get mapped as
one group.

Is that intentional?

Any then, compating that to
'libgomp/target.c:GOMP_target_enter_exit_data', where (aside from
'GOMP_MAP_STRUCT'; not relevant for us right now, yes?) everything always
gets mapped alone:

    for (i = 0; i < mapnum; i++)
      if ((kinds[i] & 0xff) == GOMP_MAP_STRUCT)
        { [...] }
      else
        gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i], &kinds[i],
                       true, GOMP_MAP_VARS_ENTER_DATA);

Is it just an "accident" that for OpenACC we were and still are going to
do this differently, or is there an actual reason?


I'm not objecting to changing any of that, but would like to understand
this better.


Grüße
 Thomas
Julian Brown Dec. 14, 2019, 12:19 a.m. UTC | #15
On Fri, 13 Dec 2019 16:25:25 +0100
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi Julian!
> 
> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com>
> wrote:
> >  static int
> > -find_pointer (int pos, size_t mapnum, unsigned short *kinds)
> > +find_group_last (int pos, size_t mapnum, unsigned short *kinds)
> >  {
> > -  if (pos + 1 >= mapnum)
> > -    return 0;
> > +  unsigned char kind0 = kinds[pos] & 0xff;
> > +  int first_pos = pos, last_pos = pos;
> >  
> > -  unsigned char kind = kinds[pos+1] & 0xff;
> > -
> > -  if (kind == GOMP_MAP_TO_PSET)
> > -    return 3;
> > -  else if (kind == GOMP_MAP_POINTER)
> > -    return 2;
> > +  if (kind0 == GOMP_MAP_TO_PSET)
> > +    {
> > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) ==
> > GOMP_MAP_POINTER)
> > +	last_pos = ++pos;
> > +      /* We expect at least one GOMP_MAP_POINTER after a
> > GOMP_MAP_TO_PSET.  */
> > +      assert (last_pos > first_pos);
> > +    }
> > +  else
> > +    {
> > +      /* GOMP_MAP_ALWAYS_POINTER can only appear directly after
> > some other
> > +	 mapping.  */
> > +      if (pos + 1 < mapnum
> > +	  && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER)
> > +	return pos + 1;
> > +
> > +      /* We can have one or several GOMP_MAP_POINTER mappings
> > after a to/from
> > +	 (etc.) mapping.  */
> > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) ==
> > GOMP_MAP_POINTER)
> > +	last_pos = ++pos;
> > +    }
> >  
> > -  return 0;
> > +  return last_pos;
> >  }  
> 
> So I ran a simple experiment where I did:
> 
>     assert (find_group_last (i, mapnum, kinds) == i + pointer);
> 
> ... where 'pointer' is the current 'find_pointer' function.  (That is,
> compare that the old and new way are doing the same things, given the
> current GCC code generation/test cases.)
> 
> This 'assert' triggers for a few test cases:
> 'libgomp.oacc-fortran/allocatable-array-1.f90',
> 'libgomp.oacc-fortran/data-2.f90', 'libgomp.oacc-fortran/data-3.f90',
> 'libgomp.oacc-fortran/data-4-2.f90',
> 'libgomp.oacc-fortran/data-4.f90', 'libgomp.oacc-fortran/data-5.f90',
> 'libgomp.oacc-fortran/if-1.f90',
> 'libgomp.oacc-fortran/optional-data-enter-exit.f90'.  (Maybe those
> are the only ones actually using that stuff?)
> 
> I looked into the first one
> ('libgomp.oacc-fortran/allocatable-array-1.f90'), and for:
> 
>     integer, parameter :: n = 40
>     integer, allocatable :: ar(:,:,:)
> 
>     allocate (ar(1:n,0:n-1,0:n-1))
>     !$acc enter data copyin (ar)
> 
> ... found:
> 
>     (gdb) print mapnum
>     $2 = 3
>     (gdb) print kinds[0]
>     $3 = 1 // GOMP_MAP_TO
>     (gdb) print kinds[1]
>     $4 = 773
>     (gdb) print kinds[1] & 0xff
>     $5 = 5 // GOMP_MAP_TO_PSET
>     (gdb) print kinds[2]
>     $6 = 772
>     (gdb) print kinds[2] & 0xff
>     $7 = 4 // GOMP_MAP_POINTER
> 
> Current behavior: 'find_pointer (0, mapnum, kinds) == 3', so all three
> get mapped as one group.
> 
> New behavior: 'find_group_last (0, mapnum, kinds) == 0', so the
> 'GOMP_MAP_TO' gets mapped alone.  Then, 'find_group_last (1, mapnum,
> kinds) == 2', so the 'GOMP_MAP_TO_PSET', 'GOMP_MAP_POINTER' get
> mapped as one group.
> 
> Is that intentional?

Yes. In a previous iteration of the refcount overhaul patch, we had the
"magic" code fragment:

> +	      for (int j = 0; j < 2; j++)  
> +		gomp_map_vars_async (acc_dev, aq,
> +				     (j == 0 || pointer == 2) ? 1 : 2,
> +				     &hostaddrs[i + j], NULL,
> +				     &sizes[i + j], &kinds[i + j], true,
> +				     GOMP_MAP_VARS_OPENACC_ENTER_DATA);  

The "pointer == 3" case here will do precisely the same thing as the
current iteration of the patch: pass the GOMP_MAP_TO to one
gomp_map_vars_async call, and pass the GOMP_MAP_TO_PSET +
GOMP_MAP_POINTER as a pair in a second call.

The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a
GOMP_MAP_POINTER) will also handle the mappings separately in both the
earlier patch iteration and this one.

That's different from the current behaviour, because we don't want all
three mappings to be bound together. The problematic cases of doing
so might only appear with the manual deep copy patch applied also,
though (and/or with the refcount-checking patch applied/enabled). (I
don't remember exactly which test cases this affected, but I can check.)

The GOMP_MAP_TO_PSET plus the following GOMP_MAP_POINTER mappings are
treated as a group within gomp_map_vars_internal. So I'm not sure... 

> Any then, compating that to
> 'libgomp/target.c:GOMP_target_enter_exit_data', where (aside from
> 'GOMP_MAP_STRUCT'; not relevant for us right now, yes?) everything
> always gets mapped alone:
> 
>     for (i = 0; i < mapnum; i++)
>       if ((kinds[i] & 0xff) == GOMP_MAP_STRUCT)
>         { [...] }
>       else
>         gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i],
> &kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
> 
> Is it just an "accident" that for OpenACC we were and still are going
> to do this differently, or is there an actual reason?

...why mapping one-at-a-time is the right thing to do here. Maybe the
OpenMP version never sees GOMP_MAP_TO_PSET (or GOMP_MAP_ALWAYS_POINTER,
which has a hard-wired dependency on the previous clause)? (I can try
to check that too.)

Thanks,

Julian
Julian Brown Dec. 17, 2019, 3:25 a.m. UTC | #16
On Sat, 14 Dec 2019 00:19:04 +0000
Julian Brown <julian@codesourcery.com> wrote:

> On Fri, 13 Dec 2019 16:25:25 +0100
> Thomas Schwinge <thomas@codesourcery.com> wrote:
> 
> > Hi Julian!
> > 
> > On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com>
> > wrote:  
> > >  static int
> > > -find_pointer (int pos, size_t mapnum, unsigned short *kinds)
> > > +find_group_last (int pos, size_t mapnum, unsigned short *kinds)
> > >  {
> > > -  if (pos + 1 >= mapnum)
> > > -    return 0;
> > > +  unsigned char kind0 = kinds[pos] & 0xff;
> > > +  int first_pos = pos, last_pos = pos;
> > >  
> > > -  unsigned char kind = kinds[pos+1] & 0xff;
> > > -
> > > -  if (kind == GOMP_MAP_TO_PSET)
> > > -    return 3;
> > > -  else if (kind == GOMP_MAP_POINTER)
> > > -    return 2;
> > > +  if (kind0 == GOMP_MAP_TO_PSET)
> > > +    {
> > > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) ==
> > > GOMP_MAP_POINTER)
> > > +	last_pos = ++pos;
> > > +      /* We expect at least one GOMP_MAP_POINTER after a
> > > GOMP_MAP_TO_PSET.  */
> > > +      assert (last_pos > first_pos);
> > > +    }
> > > +  else
> > > +    {
> > > +      /* GOMP_MAP_ALWAYS_POINTER can only appear directly after
> > > some other
> > > +	 mapping.  */
> > > +      if (pos + 1 < mapnum
> > > +	  && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER)
> > > +	return pos + 1;
> > > +
> > > +      /* We can have one or several GOMP_MAP_POINTER mappings
> > > after a to/from
> > > +	 (etc.) mapping.  */
> > > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) ==
> > > GOMP_MAP_POINTER)
> > > +	last_pos = ++pos;
> > > +    }
> > >  
> > > -  return 0;
> > > +  return last_pos;
> > >  }    
> > 
> > So I ran a simple experiment where I did:
> > 
> >     assert (find_group_last (i, mapnum, kinds) == i + pointer);
> > 
> > ... where 'pointer' is the current 'find_pointer' function.  (That
> > is, compare that the old and new way are doing the same things,
> > given the current GCC code generation/test cases.)
> > 
> > This 'assert' triggers for a few test cases:
> > 'libgomp.oacc-fortran/allocatable-array-1.f90',
> > 'libgomp.oacc-fortran/data-2.f90',
> > 'libgomp.oacc-fortran/data-3.f90',
> > 'libgomp.oacc-fortran/data-4-2.f90',
> > 'libgomp.oacc-fortran/data-4.f90',
> > 'libgomp.oacc-fortran/data-5.f90', 'libgomp.oacc-fortran/if-1.f90',
> > 'libgomp.oacc-fortran/optional-data-enter-exit.f90'.  (Maybe those
> > are the only ones actually using that stuff?)
> > 
> > I looked into the first one
> > ('libgomp.oacc-fortran/allocatable-array-1.f90'), and for:
> > 
> >     integer, parameter :: n = 40
> >     integer, allocatable :: ar(:,:,:)
> > 
> >     allocate (ar(1:n,0:n-1,0:n-1))
> >     !$acc enter data copyin (ar)
> > 
> > ... found:
> > 
> >     (gdb) print mapnum
> >     $2 = 3
> >     (gdb) print kinds[0]
> >     $3 = 1 // GOMP_MAP_TO
> >     (gdb) print kinds[1]
> >     $4 = 773
> >     (gdb) print kinds[1] & 0xff
> >     $5 = 5 // GOMP_MAP_TO_PSET
> >     (gdb) print kinds[2]
> >     $6 = 772
> >     (gdb) print kinds[2] & 0xff
> >     $7 = 4 // GOMP_MAP_POINTER
> > 
> > Current behavior: 'find_pointer (0, mapnum, kinds) == 3', so all
> > three get mapped as one group.
> > 
> > New behavior: 'find_group_last (0, mapnum, kinds) == 0', so the
> > 'GOMP_MAP_TO' gets mapped alone.  Then, 'find_group_last (1, mapnum,
> > kinds) == 2', so the 'GOMP_MAP_TO_PSET', 'GOMP_MAP_POINTER' get
> > mapped as one group.
> > 
> > Is that intentional?  
> 
> Yes. In a previous iteration of the refcount overhaul patch, we had
> the "magic" code fragment:
> 
> > +	      for (int j = 0; j < 2; j++)  
> > +		gomp_map_vars_async (acc_dev, aq,
> > +				     (j == 0 || pointer == 2) ?
> > 1 : 2,
> > +				     &hostaddrs[i + j], NULL,
> > +				     &sizes[i + j], &kinds[i + j],
> > true,
> > +
> > GOMP_MAP_VARS_OPENACC_ENTER_DATA);    
> 
> The "pointer == 3" case here will do precisely the same thing as the
> current iteration of the patch: pass the GOMP_MAP_TO to one
> gomp_map_vars_async call, and pass the GOMP_MAP_TO_PSET +
> GOMP_MAP_POINTER as a pair in a second call.
> 
> The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a
> GOMP_MAP_POINTER) will also handle the mappings separately in both the
> earlier patch iteration and this one.
> 
> That's different from the current behaviour, because we don't want all
> three mappings to be bound together. The problematic cases of doing
> so might only appear with the manual deep copy patch applied also,
> though (and/or with the refcount-checking patch applied/enabled). (I
> don't remember exactly which test cases this affected, but I can
> check.)

To follow up from this: the change in this patch is really to ensure
that reference counts are correct/consistent for *all* mappings at all
times. Contrast the behaviour described in the following comment in the
existing code (goacc_insert_pointer):

   /* ...

   Only the first mapping is considered in reference counting; the
   following ones implicitly follow suit.  */

This is problematic with automated checking since the "hidden" mappings
will have incorrect counts, and the problem becomes worse when the
GOMP_MAP_ATTACH, etc. mappings are added by the manual deep copy patch.

I tweaked the patch together with some debug-dumping code, and the
change from "find_pointer-like" behaviour and "find_group_last-like"
behaviour can be seen as follows (from deep-copy-8.c):

with find_pointer:

mapping group 0-4
  0 : gomp_map_struct                           0x7ffd5aa10ce0          4
  1 : gomp_map_to                               0x7ffd5aa10ce0          4
  2 : gomp_map_alloc                            0x7ffd5aa10ce8          8
  3 : gomp_map_alloc                            0x7ffd5aa10cf0          8
  4 : gomp_map_alloc                            0x7ffd5aa10cf8          8
mapping group 5-6
  0 : gomp_map_to                                    0x14ee050        400
  1 : gomp_map_attach                           0x7ffd5aa10ce8          0
mapping group 7-8
  0 : gomp_map_to                                    0x14ee1f0        400
  1 : gomp_map_attach                           0x7ffd5aa10cf0          0
mapping group 9-10
  0 : gomp_map_to                                    0x14ee390        400
  1 : gomp_map_attach                           0x7ffd5aa10cf8          0

with find_group_last:

mapping group 0-4
  0 : gomp_map_struct                           0x7ffc9011c3b0          4
  1 : gomp_map_to                               0x7ffc9011c3b0          4
  2 : gomp_map_alloc                            0x7ffc9011c3b8          8
  3 : gomp_map_alloc                            0x7ffc9011c3c0          8
  4 : gomp_map_alloc                            0x7ffc9011c3c8          8
mapping group 5-5
  0 : gomp_map_to                                    0x10e0050        400
mapping group 6-6
  0 : gomp_map_attach                           0x7ffc9011c3b8          0
mapping group 7-7
  0 : gomp_map_to                                    0x10e01f0        400
mapping group 8-8
  0 : gomp_map_attach                           0x7ffc9011c3c0          0
mapping group 9-9
  0 : gomp_map_to                                    0x10e0390        400
mapping group 10-10
  0 : gomp_map_attach                           0x7ffc9011c3c8          0

In the former case, each grouped "gomp_map_to/gomp_map_attach" will
form a single target_mem_desc. Then, goacc_exit_data_internal (or the
previous code it replaces) performs unmapping one mapping (splay tree
key) at a time.

If any of these splay trees reference count hits zero,
gomp_remove_var_async will be called, and then (I think) that
grouping-together becomes problematic: the reference for the "other"
splay tree key in the target_mem_desc's list gets lost.

> > Any then, compating that to
> > 'libgomp/target.c:GOMP_target_enter_exit_data', where (aside from
> > 'GOMP_MAP_STRUCT'; not relevant for us right now, yes?) everything
> > always gets mapped alone:
> > 
> >     for (i = 0; i < mapnum; i++)
> >       if ((kinds[i] & 0xff) == GOMP_MAP_STRUCT)
> >         { [...] }
> >       else
> >         gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i],
> > &kinds[i], true, GOMP_MAP_VARS_ENTER_DATA);
> > 
> > Is it just an "accident" that for OpenACC we were and still are
> > going to do this differently, or is there an actual reason?  
> 
> ...why mapping one-at-a-time is the right thing to do here. Maybe the
> OpenMP version never sees GOMP_MAP_TO_PSET (or
> GOMP_MAP_ALWAYS_POINTER, which has a hard-wired dependency on the
> previous clause)? (I can try to check that too.)

Actually it looks like GOMP_MAP_TO_PSET can occur in
GOMP_target_enter_exit_data, but it seems that only a single test case
exercises that (libgomp.fortran/target9.f90). I'd guess probably either
way works -- either with GOMP_MAP_POINTER grouped together after
a related GOMP_MAP_TO_PSET, or not.

Thanks,

Julian
Thomas Schwinge Dec. 18, 2019, 9:18 a.m. UTC | #17
Hi Julian!

Thanks for walking me through this.

On 2019-12-14T00:19:04+0000, Julian Brown <julian@codesourcery.com> wrote:
> On Fri, 13 Dec 2019 16:25:25 +0100
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com>
>> wrote:
>> >  static int
>> > -find_pointer (int pos, size_t mapnum, unsigned short *kinds)
>> > +find_group_last (int pos, size_t mapnum, unsigned short *kinds)
>> >  {
>> > -  if (pos + 1 >= mapnum)
>> > -    return 0;
>> > +  unsigned char kind0 = kinds[pos] & 0xff;
>> > +  int first_pos = pos, last_pos = pos;
>> >  
>> > -  unsigned char kind = kinds[pos+1] & 0xff;
>> > -
>> > -  if (kind == GOMP_MAP_TO_PSET)
>> > -    return 3;
>> > -  else if (kind == GOMP_MAP_POINTER)
>> > -    return 2;
>> > +  if (kind0 == GOMP_MAP_TO_PSET)
>> > +    {
>> > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER)
>> > +	last_pos = ++pos;
>> > +      /* We expect at least one GOMP_MAP_POINTER after a GOMP_MAP_TO_PSET.  */
>> > +      assert (last_pos > first_pos);
>> > +    }
>> > +  else
>> > +    {
>> > +      /* GOMP_MAP_ALWAYS_POINTER can only appear directly after some other
>> > +	 mapping.  */
>> > +      if (pos + 1 < mapnum
>> > +	  && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER)
>> > +	return pos + 1;
>> > +
>> > +      /* We can have one or several GOMP_MAP_POINTER mappings after a to/from
>> > +	 (etc.) mapping.  */
>> > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) == GOMP_MAP_POINTER)
>> > +	last_pos = ++pos;
>> > +    }
>> >  
>> > -  return 0;
>> > +  return last_pos;
>> >  }  

Given:

    program test
      implicit none
    
      integer, parameter :: n = 64
      integer :: a(n)
    
      call test_array(a)
    
    contains
      subroutine test_array(a)
        implicit none
    
        integer :: a(n)
    
        !$acc enter data copyin(a)
    
        !$acc exit data delete(a)
      end subroutine test_array
    end program test

..., we get a 'GOMP_MAP_TO' followed by a 'GOMP_MAP_POINTER'.  That got
us 'find_pointer () == 2', and now we get 'find_group_last (i) == i + 1'
(so, the same).

> In a previous iteration of the refcount overhaul patch, we had the
> "magic" code fragment:
>
>> +	      for (int j = 0; j < 2; j++)  
>> +		gomp_map_vars_async (acc_dev, aq,
>> +				     (j == 0 || pointer == 2) ? 1 : 2,
>> +				     &hostaddrs[i + j], NULL,
>> +				     &sizes[i + j], &kinds[i + j], true,
>> +				     GOMP_MAP_VARS_OPENACC_ENTER_DATA);  

> The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a
> GOMP_MAP_POINTER)

So, that's the example given above.

> will also handle the mappings separately in both the
> earlier patch iteration

ACK, given the "previous iteration" code presented above.

> and this one.

NACK?  Given 'find_group_last (i) == i + 1', that means that
'GOMP_MAP_TO' and 'GOMP_MAP_POINTER' get mapped as one group?

On the other hand, it still does match the current 'find_pointer'
behavior?

But what should the behavior here be: 'GOMP_MAP_TO', 'GOMP_MAP_POINTER'
each separate, or as one group?

Confusing stuff.  :-|


Grüße
 Thomas
Julian Brown Dec. 18, 2019, 1:52 p.m. UTC | #18
On Wed, 18 Dec 2019 10:18:14 +0100
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi Julian!
> 
> Thanks for walking me through this.
> 
> On 2019-12-14T00:19:04+0000, Julian Brown <julian@codesourcery.com>
> wrote:
> > On Fri, 13 Dec 2019 16:25:25 +0100
> > Thomas Schwinge <thomas@codesourcery.com> wrote:  
> >> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com>
> >> wrote:  
> >> >  static int
> >> > -find_pointer (int pos, size_t mapnum, unsigned short *kinds)
> >> > +find_group_last (int pos, size_t mapnum, unsigned short *kinds)
> >> >  {
> >> > -  if (pos + 1 >= mapnum)
> >> > -    return 0;
> >> > +  unsigned char kind0 = kinds[pos] & 0xff;
> >> > +  int first_pos = pos, last_pos = pos;
> >> >  
> >> > -  unsigned char kind = kinds[pos+1] & 0xff;
> >> > -
> >> > -  if (kind == GOMP_MAP_TO_PSET)
> >> > -    return 3;
> >> > -  else if (kind == GOMP_MAP_POINTER)
> >> > -    return 2;
> >> > +  if (kind0 == GOMP_MAP_TO_PSET)
> >> > +    {
> >> > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) ==
> >> > GOMP_MAP_POINTER)
> >> > +	last_pos = ++pos;
> >> > +      /* We expect at least one GOMP_MAP_POINTER after a
> >> > GOMP_MAP_TO_PSET.  */
> >> > +      assert (last_pos > first_pos);
> >> > +    }
> >> > +  else
> >> > +    {
> >> > +      /* GOMP_MAP_ALWAYS_POINTER can only appear directly after
> >> > some other
> >> > +	 mapping.  */
> >> > +      if (pos + 1 < mapnum
> >> > +	  && (kinds[pos + 1] & 0xff) == GOMP_MAP_ALWAYS_POINTER)
> >> > +	return pos + 1;
> >> > +
> >> > +      /* We can have one or several GOMP_MAP_POINTER mappings
> >> > after a to/from
> >> > +	 (etc.) mapping.  */
> >> > +      while (pos + 1 < mapnum && (kinds[pos + 1] & 0xff) ==
> >> > GOMP_MAP_POINTER)
> >> > +	last_pos = ++pos;
> >> > +    }
> >> >  
> >> > -  return 0;
> >> > +  return last_pos;
> >> >  }    
> 
> Given:
> 
>     program test
>       implicit none
>     
>       integer, parameter :: n = 64
>       integer :: a(n)
>     
>       call test_array(a)
>     
>     contains
>       subroutine test_array(a)
>         implicit none
>     
>         integer :: a(n)
>     
>         !$acc enter data copyin(a)
>     
>         !$acc exit data delete(a)
>       end subroutine test_array
>     end program test
> 
> ..., we get a 'GOMP_MAP_TO' followed by a 'GOMP_MAP_POINTER'.  That
> got us 'find_pointer () == 2', and now we get 'find_group_last (i) ==
> i + 1' (so, the same).
> 
> > In a previous iteration of the refcount overhaul patch, we had the
> > "magic" code fragment:
> >  
> >> +	      for (int j = 0; j < 2; j++)  
> >> +		gomp_map_vars_async (acc_dev, aq,
> >> +				     (j == 0 || pointer == 2) ?
> >> 1 : 2,
> >> +				     &hostaddrs[i + j], NULL,
> >> +				     &sizes[i + j], &kinds[i +
> >> j], true,
> >> +
> >> GOMP_MAP_VARS_OPENACC_ENTER_DATA);    
> 
> > The "pointer == 2" case (i.e. with a GOMP_MAP_TO and a
> > GOMP_MAP_POINTER)  
> 
> So, that's the example given above.
> 
> > will also handle the mappings separately in both the
> > earlier patch iteration  
> 
> ACK, given the "previous iteration" code presented above.
> 
> > and this one.  
> 
> NACK?  Given 'find_group_last (i) == i + 1', that means that
> 'GOMP_MAP_TO' and 'GOMP_MAP_POINTER' get mapped as one group?
> 
> On the other hand, it still does match the current 'find_pointer'
> behavior?
> 
> But what should the behavior here be: 'GOMP_MAP_TO',
> 'GOMP_MAP_POINTER' each separate, or as one group?
> 
> Confusing stuff.  :-|

Hmm.

I think that GOMP_MAP_POINTER is only intended to be used after some
other mapping (TO/TOFROM/TO_PSET/etc.). In the follow-up patch
supporting deep copy, this code is extended and refactored a little
more:

https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01256.html

One of the changes made there is to disallow GOMP_MAP{,_ALWAYS}_POINTER
from appearing by itself. By my reading, that must be the case for
GOMP_MAP_ALWAYS_POINTER because it has a hard-wired dependency on the
previous mapping. GOMP_MAP_POINTER is slightly more questionable: at
least according to the comment in gomp-constants.h, these are "an
internal only map kind, used for pointer based array sections" -- so
it's a little surprising they now reach the libgomp runtime at all.
Maybe it was a mistake?

The GOMP_MAP_ATTACH mapping (as in the example upthread) is different --
that one *can* appear by itself. Perhaps the difference (wrt. reference
counting here) is that GOMP_MAP_POINTER refers to the same
target_mem_desc as the previous (grouped-together) mapping, but
GOMP_MAP_ATTACH does not (rather, referring to the location of the
*pointer* to the data of a previous mapping, rather than the data
itself).

For GOMP_MAP_TO_PSET, a subsequent GOMP_MAP_POINTER will refer to the
pointer set itself. So, same thing, and it's not problematic to group
the mappings together.

Anyway: thinking about it some more, I don't think any of the ways
these types of mappings get grouped together should really be causing
refcount-checking failures, so maybe something's wrong (at least
academically) in goacc_exit_data_internal. The "real" problem with
parasitical groupings is if we have multiple "enter data" mappings that
get bound together in a single target_mem_desc, and are unmapped at
different times:

#pragma acc enter data copyin(arr1) copyin(arr2)
...
#pragma acc exit data copyout(arr1)
#pragma acc exit data copyout(arr2)

That's clearly not what's happening here though.

I will investigate further.

Thanks,

Julian
Thomas Schwinge Dec. 18, 2019, 5:17 p.m. UTC | #19
Hi!

On 2019-12-11T18:22:00+0100, I wrote:
> On 2019-10-29T12:15:01+0000, Julian Brown <julian@codesourcery.com> wrote:
>> I've removed the special-case handling
>> of pointers in the enter/exit data code, and combined the
>> gomp_acc_remove_pointer code (which now iterated over mappings
>> one-at-a-time anyway) with the loop iterating over mappings in the
>> new goacc_exit_data_internal function. It was a bit nonsensical to have
>> the "exit data" code split over two files, as before.
>
> Yes, I like that very much, and we shall tackle that next intermediate
> step

> One thing:
>
>>             libgomp/
>
>>             * oacc-parallel.c (find_pointer): Remove function.
>>             (find_group_last, goacc_enter_data_internal,
>>             goacc_exit_data_internal): New functions.
>>             (GOACC_enter_exit_data): Use goacc_enter_data_internal and
>>             goacc_exit_data_internal helper functions.
>
> It makes much sense to move all that into 'libgomp/oacc-mem.c', and as a
> preparational step, see attached "[OpenACC] Consolidate
> 'GOACC_enter_exit_data' and its helper functions in
> 'libgomp/oacc-mem.c'", committed to trunk in r279233.

Working incrementally towards the goal of unifying all that mapping
handling code, I did some refactoring ("No functional changes"): see the
attached "[OpenACC] Refactor 'present_create_copy' into
'goacc_enter_data'", "[OpenACC] Refactor 'delete_copyout' into
'goacc_exit_data'", "[OpenACC] Refactor 'GOACC_enter_exit_data' to call
'goacc_enter_data', 'goacc_exit_data'", "[OpenACC] Refactor
'goacc_remove_pointer' interface", "[OpenACC] Refactor 'goacc_enter_data'
so that it can be called from 'goacc_insert_pointer', "not present"
case", "[OpenACC] Refactor 'goacc_enter_data' so that it can be called
from 'goacc_insert_pointer', "present" case, and simplify"; committed to
trunk in r279535, r279536, r279537, r279538, r279539, r279540.


Grüße
 Thomas
diff mbox series

Patch

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 178eb600ccd..6b7ed7248a1 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -908,6 +908,10 @@  struct target_mem_desc {
 #define OFFSET_POINTER (~(uintptr_t) 1)
 #define OFFSET_STRUCT (~(uintptr_t) 2)
 
+/* A special tag value for "virtual_refcount" in the splay_tree_key_s structure
+   below.  */
+#define VREFCOUNT_LINK_KEY (~(uintptr_t) 0)
+
 struct splay_tree_key_s {
   /* Address of the host object.  */
   uintptr_t host_start;
@@ -919,10 +923,18 @@  struct splay_tree_key_s {
   uintptr_t tgt_offset;
   /* Reference count.  */
   uintptr_t refcount;
-  /* Dynamic reference count.  */
-  uintptr_t dynamic_refcount;
-  /* Pointer to the original mapping of "omp declare target link" object.  */
-  splay_tree_key link_key;
+  /* Reference counts beyond those that represent genuine references in the
+     linked splay tree key/target memory structures, e.g. for multiple OpenACC
+     "present increment" operations (via "acc enter data") referring to the same
+     host-memory block.
+     If set to VREFCOUNT_LINK_KEY (for OpenMP, where this field is not otherwise
+     needed), the union below represents a link key.  */
+  uintptr_t virtual_refcount;
+  union {
+    /* Pointer to the original mapping of "omp declare target link" object.
+       Only used for OpenMP.  */
+    splay_tree_key link_key;
+  } u;
 };
 
 /* The comparison function.  */
@@ -944,13 +956,6 @@  splay_compare (splay_tree_key x, splay_tree_key y)
 
 typedef struct acc_dispatch_t
 {
-  /* This is a linked list of data mapped using the
-     acc_map_data/acc_unmap_data or "acc enter data"/"acc exit data" pragmas.
-     Unlike mapped_data in the goacc_thread struct, unmapping can
-     happen out-of-order with respect to mapping.  */
-  /* This is guarded by the lock in the "outer" struct gomp_device_descr.  */
-  struct target_mem_desc *data_environ;
-
   /* Execute.  */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
 
@@ -1060,13 +1065,15 @@  struct gomp_device_descr
 enum gomp_map_vars_kind
 {
   GOMP_MAP_VARS_OPENACC,
+  GOMP_MAP_VARS_OPENACC_ENTER_DATA,
   GOMP_MAP_VARS_TARGET,
   GOMP_MAP_VARS_DATA,
   GOMP_MAP_VARS_ENTER_DATA
 };
 
-extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int);
-extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int);
+extern void gomp_acc_remove_pointer (struct gomp_device_descr *, void **,
+				     size_t *, unsigned short *, int, bool,
+				     int);
 extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
 				       unsigned short *);
 struct gomp_coalesce_buf;
@@ -1092,9 +1099,10 @@  extern void gomp_unmap_vars_async (struct target_mem_desc *, bool,
 				   struct goacc_asyncqueue *);
 extern void gomp_init_device (struct gomp_device_descr *);
 extern bool gomp_fini_device (struct gomp_device_descr *);
-extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_unload_device (struct gomp_device_descr *);
 extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key);
+extern void gomp_remove_var_async (struct gomp_device_descr *, splay_tree_key,
+				   struct goacc_asyncqueue *);
 
 /* work.c */
 
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 12299aee65d..1b9adcec774 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -264,8 +264,6 @@  static struct gomp_device_descr host_dispatch =
     .state = GOMP_DEVICE_UNINITIALIZED,
 
     .openacc = {
-      .data_environ = NULL,
-
       .exec_func = host_openacc_exec,
 
       .create_thread_data_func = host_openacc_create_thread_data,
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index e1568c535b3..e0395ef43b2 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -356,9 +356,13 @@  acc_shutdown_1 (acc_device_t d)
 
       if (walk->dev)
 	{
-	  gomp_mutex_lock (&walk->dev->lock);
-	  gomp_free_memmap (&walk->dev->mem_map);
-	  gomp_mutex_unlock (&walk->dev->lock);
+	  while (walk->dev->mem_map.root)
+	    {
+	      splay_tree_key k = &walk->dev->mem_map.root->key;
+	      if (k->virtual_refcount == VREFCOUNT_LINK_KEY)
+		k->u.link_key = NULL;
+	      gomp_remove_var (walk->dev, k);
+	    }
 
 	  walk->dev = NULL;
 	  walk->base_dev = NULL;
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 2f271009fb8..25084b71a2d 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -50,6 +50,25 @@  lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
   return key;
 }
 
+/* Helper for lookup_dev.  Iterate over splay tree.  */
+
+static splay_tree_key
+lookup_dev_1 (splay_tree_node node, uintptr_t d, size_t s)
+{
+  splay_tree_key k = &node->key;
+  struct target_mem_desc *t = k->tgt;
+
+  if (d >= t->tgt_start && d + s <= t->tgt_end)
+    return k;
+
+  if (node->left)
+    return lookup_dev_1 (node->left, d, s);
+  if (node->right)
+    return lookup_dev_1 (node->right, d, s);
+
+  return NULL;
+}
+
 /* Return block containing [D->S), or NULL if not contained.
    The list isn't ordered by device address, so we have to iterate
    over the whole array.  This is not expected to be a common
@@ -57,35 +76,12 @@  lookup_host (struct gomp_device_descr *dev, void *h, size_t s)
    remains locked on exit.  */
 
 static splay_tree_key
-lookup_dev (struct target_mem_desc *tgt, void *d, size_t s)
+lookup_dev (splay_tree mem_map, void *d, size_t s)
 {
-  int i;
-  struct target_mem_desc *t;
-
-  if (!tgt)
+  if (!mem_map || !mem_map->root)
     return NULL;
 
-  for (t = tgt; t != NULL; t = t->prev)
-    {
-      if (t->tgt_start <= (uintptr_t) d && t->tgt_end >= (uintptr_t) d + s)
-        break;
-    }
-
-  if (!t)
-    return NULL;
-
-  for (i = 0; i < t->list_count; i++)
-    {
-      void * offset;
-
-      splay_tree_key k = &t->array[i].key;
-      offset = d - t->tgt_start + k->tgt_offset;
-
-      if (k->host_start + offset <= (void *) k->host_end)
-        return k;
-    }
-
-  return NULL;
+  return lookup_dev_1 (mem_map->root, (uintptr_t) d, s);
 }
 
 /* OpenACC is silent on how memory exhaustion is indicated.  We return
@@ -150,7 +146,7 @@  acc_free (void *d)
   /* We don't have to call lazy open here, as the ptr value must have
      been returned by acc_malloc.  It's not permitted to pass NULL in
      (unless you got that null from acc_malloc).  */
-  if ((k = lookup_dev (acc_dev->openacc.data_environ, d, 1)))
+  if ((k = lookup_dev (&acc_dev->mem_map, d, 1)))
     {
       void *offset;
 
@@ -301,7 +297,7 @@  acc_hostptr (void *d)
 
   gomp_mutex_lock (&acc_dev->lock);
 
-  n = lookup_dev (acc_dev->openacc.data_environ, d, 1);
+  n = lookup_dev (&acc_dev->mem_map, d, 1);
 
   if (!n)
     {
@@ -396,7 +392,7 @@  acc_map_data (void *h, void *d, size_t s)
 		      (int)s);
 	}
 
-      if (lookup_dev (thr->dev->openacc.data_environ, d, s))
+      if (lookup_dev (&thr->dev->mem_map, d, s))
         {
 	  gomp_mutex_unlock (&acc_dev->lock);
 	  gomp_fatal ("device address [%p, +%d] is already mapped", (void *)d,
@@ -415,11 +411,6 @@  acc_map_data (void *h, void *d, size_t s)
 	  thr->api_info = NULL;
 	}
     }
-
-  gomp_mutex_lock (&acc_dev->lock);
-  tgt->prev = acc_dev->openacc.data_environ;
-  acc_dev->openacc.data_environ = tgt;
-  gomp_mutex_unlock (&acc_dev->lock);
 }
 
 void
@@ -427,6 +418,7 @@  acc_unmap_data (void *h)
 {
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
+  struct splay_tree_key_s cur_node;
 
   /* No need to call lazy open, as the address must have been mapped.  */
 
@@ -438,12 +430,11 @@  acc_unmap_data (void *h)
   acc_api_info api_info;
   bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
 
-  size_t host_size;
-
   gomp_mutex_lock (&acc_dev->lock);
 
-  splay_tree_key n = lookup_host (acc_dev, h, 1);
-  struct target_mem_desc *t;
+  cur_node.host_start = (uintptr_t) h;
+  cur_node.host_end = cur_node.host_start + 1;
+  splay_tree_key n = splay_tree_lookup (&acc_dev->mem_map, &cur_node);
 
   if (!n)
     {
@@ -451,47 +442,28 @@  acc_unmap_data (void *h)
       gomp_fatal ("%p is not a mapped block", (void *)h);
     }
 
-  host_size = n->host_end - n->host_start;
-
   if (n->host_start != (uintptr_t) h)
     {
+      size_t host_size = n->host_end - n->host_start;
       gomp_mutex_unlock (&acc_dev->lock);
       gomp_fatal ("[%p,%d] surrounds %p",
 		  (void *) n->host_start, (int) host_size, (void *) h);
     }
 
-  /* Mark for removal.  */
-  n->refcount = 1;
+  splay_tree_remove (&acc_dev->mem_map, n);
 
-  t = n->tgt;
+  struct target_mem_desc *tgt = n->tgt;
 
-  if (t->refcount == 2)
+  if (tgt->refcount > 0)
+    tgt->refcount--;
+  else
     {
-      struct target_mem_desc *tp;
-
-      /* This is the last reference, so pull the descriptor off the
-         chain. This avoids gomp_unmap_vars via gomp_unmap_tgt from
-         freeing the device memory. */
-      t->tgt_end = 0;
-      t->to_free = 0;
-
-      for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
-	   tp = t, t = t->prev)
-	if (n->tgt == t)
-	  {
-	    if (tp)
-	      tp->prev = t->prev;
-	    else
-	      acc_dev->openacc.data_environ = t->prev;
-
-	    break;
-	  }
+      free (tgt->array);
+      free (tgt);
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
 
-  gomp_unmap_vars (t, true);
-
   if (profiling_p)
     {
       thr->prof_info = NULL;
@@ -549,11 +521,14 @@  present_create_copy (unsigned f, void *h, size_t s, int async)
 	  gomp_fatal ("[%p,+%d] not mapped", (void *)h, (int)s);
 	}
 
+      assert (n->virtual_refcount != VREFCOUNT_LINK_KEY);
+
       if (n->refcount != REFCOUNT_INFINITY)
 	{
 	  n->refcount++;
-	  n->dynamic_refcount++;
+	  n->virtual_refcount++;
 	}
+
       gomp_mutex_unlock (&acc_dev->lock);
     }
   else if (!(f & FLAG_CREATE))
@@ -563,7 +538,6 @@  present_create_copy (unsigned f, void *h, size_t s, int async)
     }
   else
     {
-      struct target_mem_desc *tgt;
       size_t mapnum = 1;
       unsigned short kinds;
       void *hostaddrs = h;
@@ -577,17 +551,14 @@  present_create_copy (unsigned f, void *h, size_t s, int async)
 
       goacc_aq aq = get_goacc_asyncqueue (async);
 
-      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
-				 &kinds, true, GOMP_MAP_VARS_OPENACC);
-      /* Initialize dynamic refcount.  */
-      tgt->list[0].key->dynamic_refcount = 1;
+      gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s, &kinds,
+			   true, GOMP_MAP_VARS_OPENACC_ENTER_DATA);
 
       gomp_mutex_lock (&acc_dev->lock);
-
-      d = tgt->to_free;
-      tgt->prev = acc_dev->openacc.data_environ;
-      acc_dev->openacc.data_environ = tgt;
-
+      n = lookup_host (acc_dev, h, s);
+      assert (n != NULL);
+      d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h
+		    - n->host_start);
       gomp_mutex_unlock (&acc_dev->lock);
     }
 
@@ -671,7 +642,6 @@  delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
 {
   size_t host_size;
   splay_tree_key n;
-  void *d;
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
@@ -700,8 +670,7 @@  delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
       gomp_fatal ("[%p,%d] is not mapped", (void *)h, (int)s);
     }
 
-  d = (void *) (n->tgt->tgt_start + n->tgt_offset
-		+ (uintptr_t) h - n->host_start);
+  assert (n->virtual_refcount != VREFCOUNT_LINK_KEY);
 
   host_size = n->host_end - n->host_start;
 
@@ -715,48 +684,34 @@  delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
   if (n->refcount == REFCOUNT_INFINITY)
     {
       n->refcount = 0;
-      n->dynamic_refcount = 0;
-    }
-  if (n->refcount < n->dynamic_refcount)
-    {
-      gomp_mutex_unlock (&acc_dev->lock);
-      gomp_fatal ("Dynamic reference counting assert fail\n");
+      n->virtual_refcount = 0;
     }
 
   if (f & FLAG_FINALIZE)
     {
-      n->refcount -= n->dynamic_refcount;
-      n->dynamic_refcount = 0;
+      n->refcount -= n->virtual_refcount;
+      n->virtual_refcount = 0;
     }
-  else if (n->dynamic_refcount)
+
+  if (n->virtual_refcount > 0)
     {
-      n->dynamic_refcount--;
       n->refcount--;
+      n->virtual_refcount--;
     }
+  else if (n->refcount > 0)
+    n->refcount--;
 
   if (n->refcount == 0)
     {
-      if (n->tgt->refcount == 2)
-	{
-	  struct target_mem_desc *tp, *t;
-	  for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
-	       tp = t, t = t->prev)
-	    if (n->tgt == t)
-	      {
-		if (tp)
-		  tp->prev = t->prev;
-		else
-		  acc_dev->openacc.data_environ = t->prev;
-		break;
-	      }
-	}
+      goacc_aq aq = get_goacc_asyncqueue (async);
 
       if (f & FLAG_COPYOUT)
-	{
-	  goacc_aq aq = get_goacc_asyncqueue (async);
+        {
+	  void *d = (void *) (n->tgt->tgt_start + n->tgt_offset
+			      + (uintptr_t) h - n->host_start);
 	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
 	}
-      gomp_remove_var (acc_dev, n);
+      gomp_remove_var_async (acc_dev, n, aq);
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
@@ -894,140 +849,80 @@  acc_update_self_async (void *h, size_t s, int async)
 }
 
 void
-gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
-			 void *kinds, int async)
-{
-  struct target_mem_desc *tgt;
-  struct goacc_thread *thr = goacc_thread ();
-  struct gomp_device_descr *acc_dev = thr->dev;
-
-  if (acc_is_present (*hostaddrs, *sizes))
-    {
-      splay_tree_key n;
-      gomp_mutex_lock (&acc_dev->lock);
-      n = lookup_host (acc_dev, *hostaddrs, *sizes);
-      gomp_mutex_unlock (&acc_dev->lock);
-
-      tgt = n->tgt;
-      for (size_t i = 0; i < tgt->list_count; i++)
-	if (tgt->list[i].key == n)
-	  {
-	    for (size_t j = 0; j < mapnum; j++)
-	      if (i + j < tgt->list_count && tgt->list[i + j].key)
-		{
-		  tgt->list[i + j].key->refcount++;
-		  tgt->list[i + j].key->dynamic_refcount++;
-		}
-	    return;
-	  }
-      /* Should not reach here.  */
-      gomp_fatal ("Dynamic refcount incrementing failed for pointer/pset");
-    }
-
-  gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-  goacc_aq aq = get_goacc_asyncqueue (async);
-  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
-			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
-  gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
-
-  /* Initialize dynamic refcount.  */
-  tgt->list[0].key->dynamic_refcount = 1;
-
-  gomp_mutex_lock (&acc_dev->lock);
-  tgt->prev = acc_dev->openacc.data_environ;
-  acc_dev->openacc.data_environ = tgt;
-  gomp_mutex_unlock (&acc_dev->lock);
-}
-
-void
-gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async,
-			 int finalize, int mapnum)
+gomp_acc_remove_pointer (struct gomp_device_descr *acc_dev, void **hostaddrs,
+			 size_t *sizes, unsigned short *kinds, int async,
+			 bool finalize, int mapnum)
 {
-  struct goacc_thread *thr = goacc_thread ();
-  struct gomp_device_descr *acc_dev = thr->dev;
+  struct splay_tree_key_s cur_node;
   splay_tree_key n;
-  struct target_mem_desc *t;
-  int minrefs = (mapnum == 1) ? 2 : 3;
-
-  if (!acc_is_present (h, s))
-    return;
 
   gomp_mutex_lock (&acc_dev->lock);
 
-  n = lookup_host (acc_dev, h, 1);
-
-  if (!n)
-    {
-      gomp_mutex_unlock (&acc_dev->lock);
-      gomp_fatal ("%p is not a mapped block", (void *)h);
-    }
-
-  gomp_debug (0, "  %s: restore mappings\n", __FUNCTION__);
-
-  t = n->tgt;
-
-  if (n->refcount < n->dynamic_refcount)
+  for (int i = 0; i < mapnum; i++)
     {
-      gomp_mutex_unlock (&acc_dev->lock);
-      gomp_fatal ("Dynamic reference counting assert fail\n");
-    }
+      int kind = kinds[i] & 0xff;
+      bool copyfrom = false;
 
-  if (finalize)
-    {
-      n->refcount -= n->dynamic_refcount;
-      n->dynamic_refcount = 0;
-    }
-  else if (n->dynamic_refcount)
-    {
-      n->dynamic_refcount--;
-      n->refcount--;
-    }
+      switch (kind)
+        {
+	case GOMP_MAP_FROM:
+	case GOMP_MAP_FORCE_FROM:
+	case GOMP_MAP_ALWAYS_FROM:
+	  copyfrom = true;
+	  /* Fallthrough.  */
+
+	case GOMP_MAP_TO_PSET:
+	case GOMP_MAP_POINTER:
+	case GOMP_MAP_DELETE:
+	case GOMP_MAP_RELEASE:
+	  cur_node.host_start = (uintptr_t) hostaddrs[i];
+	  cur_node.host_end = cur_node.host_start
+			      + (kind == GOMP_MAP_POINTER
+				 ? sizeof (void *) : sizes[i]);
+	  n = splay_tree_lookup (&acc_dev->mem_map, &cur_node);
+
+	  if (n == NULL)
+	    continue;
+
+	  assert (n->virtual_refcount != VREFCOUNT_LINK_KEY);
+
+	  if (n->refcount == REFCOUNT_INFINITY)
+	    {
+	      n->refcount = 1;
+	      n->virtual_refcount = 0;
+	    }
 
-  gomp_mutex_unlock (&acc_dev->lock);
+	  if (finalize)
+	    {
+	      n->refcount -= n->virtual_refcount;
+	      n->virtual_refcount = 0;
+	    }
 
-  if (n->refcount == 0)
-    {
-      if (t->refcount == minrefs)
-	{
-	  /* This is the last reference, so pull the descriptor off the
-	     chain. This prevents gomp_unmap_vars via gomp_unmap_tgt from
-	     freeing the device memory. */
-	  struct target_mem_desc *tp;
-	  for (tp = NULL, t = acc_dev->openacc.data_environ; t != NULL;
-	       tp = t, t = t->prev)
+	  if (n->virtual_refcount > 0)
 	    {
-	      if (n->tgt == t)
-		{
-		  if (tp)
-		    tp->prev = t->prev;
-		  else
-		    acc_dev->openacc.data_environ = t->prev;
-		  break;
-		}
+	      n->refcount--;
+	      n->virtual_refcount--;
 	    }
-	}
+	  else if (n->refcount > 0)
+	    n->refcount--;
 
-      /* Set refcount to 1 to allow gomp_unmap_vars to unmap it.  */
-      n->refcount = 1;
-      t->refcount = minrefs;
-      for (size_t i = 0; i < t->list_count; i++)
-	if (t->list[i].key == n)
-	  {
-	    t->list[i].copy_from = force_copyfrom ? 1 : 0;
-	    break;
-	  }
-
-      /* If running synchronously, unmap immediately.  */
-      if (async < acc_async_noval)
-	gomp_unmap_vars (t, true);
-      else
-	{
-	  goacc_aq aq = get_goacc_asyncqueue (async);
-	  gomp_unmap_vars_async (t, true, aq);
+	  if (copyfrom)
+	    gomp_copy_dev2host (acc_dev, NULL, (void *) cur_node.host_start,
+				(void *) (n->tgt->tgt_start + n->tgt_offset
+					  + cur_node.host_start
+					  - n->host_start),
+				cur_node.host_end - cur_node.host_start);
+
+	  if (n->refcount == 0)
+	    gomp_remove_var (acc_dev, n);
+	  break;
+
+	default:
+	  gomp_mutex_unlock (&acc_dev->lock);
+	  gomp_fatal ("gomp_acc_remove_pointer unhandled kind 0x%.2x",
+		      kind);
 	}
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
-
-  gomp_debug (0, "  %s: mappings restored\n", __FUNCTION__);
 }
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 68a60de24fa..7e72d9c6b24 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -56,12 +56,29 @@  find_pointer (int pos, size_t mapnum, unsigned short *kinds)
   if (pos + 1 >= mapnum)
     return 0;
 
-  unsigned char kind = kinds[pos+1] & 0xff;
+  unsigned char kind0 = kinds[pos] & 0xff;
 
-  if (kind == GOMP_MAP_TO_PSET)
-    return 3;
-  else if (kind == GOMP_MAP_POINTER)
-    return 2;
+  switch (kind0)
+    {
+    case GOMP_MAP_TO:
+    case GOMP_MAP_FORCE_TO:
+    case GOMP_MAP_FROM:
+    case GOMP_MAP_FORCE_FROM:
+    case GOMP_MAP_TOFROM:
+    case GOMP_MAP_FORCE_TOFROM:
+    case GOMP_MAP_ALLOC:
+    case GOMP_MAP_RELEASE:
+      {
+	unsigned char kind1 = kinds[pos + 1] & 0xff;
+	if (kind1 == GOMP_MAP_POINTER
+	    || kind1 == GOMP_MAP_ALWAYS_POINTER)
+	  return 2;
+	else if (kind1 == GOMP_MAP_TO_PSET)
+	  return 3;
+      }
+    default:
+      /* empty.  */;
+    }
 
   return 0;
 }
@@ -745,8 +762,14 @@  GOACC_enter_exit_data (int flags_m, size_t mapnum,
 	    }
 	  else
 	    {
-	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
-				       &sizes[i], &kinds[i], async);
+	      goacc_aq aq = get_goacc_asyncqueue (async);
+	      for (int j = 0; j < 2; j++)
+		gomp_map_vars_async (acc_dev, aq,
+				     (j == 0 || pointer == 2) ? 1 : 2,
+				     &hostaddrs[i + j], NULL,
+				     &sizes[i + j], &kinds[i + j], true,
+				     GOMP_MAP_VARS_OPENACC_ENTER_DATA);
+
 	      /* Increment 'i' by two because OpenACC requires fortran
 		 arrays to be contiguous, so each PSET is associated with
 		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -791,11 +814,8 @@  GOACC_enter_exit_data (int flags_m, size_t mapnum,
 	  }
 	else
 	  {
-	    bool copyfrom = (kind == GOMP_MAP_FORCE_FROM
-			     || kind == GOMP_MAP_FROM);
-	    gomp_acc_remove_pointer (hostaddrs[i], sizes[i], copyfrom, async,
-				     finalize, pointer);
-	    /* See the above comment.  */
+	    gomp_acc_remove_pointer (acc_dev, &hostaddrs[i], &sizes[i],
+				     &kinds[i], async, finalize, pointer);
 	    i += pointer - 1;
 	  }
       }
diff --git a/libgomp/target.c b/libgomp/target.c
index a83cb48108a..b42b4ad2448 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -536,7 +536,8 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
   struct target_mem_desc *tgt
     = gomp_malloc (sizeof (*tgt) + sizeof (tgt->list[0]) * mapnum);
   tgt->list_count = mapnum;
-  tgt->refcount = pragma_kind == GOMP_MAP_VARS_ENTER_DATA ? 0 : 1;
+  tgt->refcount = (pragma_kind == GOMP_MAP_VARS_ENTER_DATA
+		   || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA) ? 0 : 1;
   tgt->device_descr = devicep;
   struct gomp_coalesce_buf cbuf, *cbufp = NULL;
 
@@ -883,13 +884,14 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
 				      kind & typemask, cbufp);
 	    else
 	      {
-		k->link_key = NULL;
+		k->u.link_key = NULL;
 		if (n && n->refcount == REFCOUNT_LINK)
 		  {
 		    /* Replace target address of the pointer with target address
 		       of mapped object in the splay tree.  */
 		    splay_tree_remove (mem_map, n);
-		    k->link_key = n;
+		    k->u.link_key = n;
+		    k->virtual_refcount = VREFCOUNT_LINK_KEY;
 		  }
 		size_t align = (size_t) 1 << (kind >> rshift);
 		tgt->list[i].key = k;
@@ -913,7 +915,7 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		tgt->list[i].offset = 0;
 		tgt->list[i].length = k->host_end - k->host_start;
 		k->refcount = 1;
-		k->dynamic_refcount = 0;
+		k->virtual_refcount = 0;
 		tgt->refcount++;
 		array->left = NULL;
 		array->right = NULL;
@@ -1007,7 +1009,7 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
 				kind);
 		  }
 
-		if (k->link_key)
+		if (k->virtual_refcount == VREFCOUNT_LINK_KEY && k->u.link_key)
 		  {
 		    /* Set link pointer on target to the device address of the
 		       mapped object.  */
@@ -1051,8 +1053,20 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
   /* If the variable from "omp target enter data" map-list was already mapped,
      tgt is not needed.  Otherwise tgt will be freed by gomp_unmap_vars or
      gomp_exit_data.  */
-  if (pragma_kind == GOMP_MAP_VARS_ENTER_DATA && tgt->refcount == 0)
-    {
+  if ((pragma_kind == GOMP_MAP_VARS_ENTER_DATA
+       || pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA)
+      && tgt->refcount == 0)
+    {
+      /* If we're about to discard a target_mem_desc with no "structural"
+	 references (tgt->refcount == 0), any splay keys linked in the tgt's
+	 list must have their virtual refcount incremented to represent that
+	 "lost" reference in order to implement the semantics of the OpenACC
+	 "present increment" operation properly.  */
+      if (pragma_kind == GOMP_MAP_VARS_OPENACC_ENTER_DATA)
+	for (i = 0; i < tgt->list_count; i++)
+	  if (tgt->list[i].key)
+	    tgt->list[i].key->virtual_refcount++;
+
       free (tgt);
       tgt = NULL;
     }
@@ -1092,32 +1106,66 @@  gomp_unmap_tgt (struct target_mem_desc *tgt)
   free (tgt);
 }
 
-attribute_hidden bool
-gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k)
+static bool
+gomp_unref_tgt (void *ptr)
 {
   bool is_tgt_unmapped = false;
-  splay_tree_remove (&devicep->mem_map, k);
-  if (k->link_key)
-    splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->link_key);
-  if (k->tgt->refcount > 1)
-    k->tgt->refcount--;
+
+  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
+
+  if (tgt->refcount > 1)
+    tgt->refcount--;
   else
     {
+      gomp_unmap_tgt (tgt);
       is_tgt_unmapped = true;
-      gomp_unmap_tgt (k->tgt);
     }
+
   return is_tgt_unmapped;
 }
 
 static void
-gomp_unref_tgt (void *ptr)
+gomp_unref_tgt_void (void *ptr)
 {
-  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
+  (void) gomp_unref_tgt (ptr);
+}
 
-  if (tgt->refcount > 1)
-    tgt->refcount--;
+static inline __attribute__((always_inline)) bool
+gomp_remove_var_internal (struct gomp_device_descr *devicep, splay_tree_key k,
+			  struct goacc_asyncqueue *aq)
+{
+  bool is_tgt_unmapped = false;
+  splay_tree_remove (&devicep->mem_map, k);
+  if (k->virtual_refcount == VREFCOUNT_LINK_KEY)
+    {
+      if (k->u.link_key)
+	splay_tree_insert (&devicep->mem_map, (splay_tree_node) k->u.link_key);
+    }
+  if (aq)
+    devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void,
+						(void *) k->tgt);
   else
-    gomp_unmap_tgt (tgt);
+    is_tgt_unmapped = gomp_unref_tgt ((void *) k->tgt);
+  return is_tgt_unmapped;
+}
+
+attribute_hidden bool
+gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k)
+{
+  return gomp_remove_var_internal (devicep, k, NULL);
+}
+
+/* Remove a variable asynchronously.  This actually removes the variable
+   mapping immediately, but retains the linked target_mem_desc until the
+   asynchronous operation has completed (as it may still refer to target
+   memory).  The device lock must be held before entry, and remains locked on
+   exit.  */
+
+attribute_hidden void
+gomp_remove_var_async (struct gomp_device_descr *devicep, splay_tree_key k,
+		       struct goacc_asyncqueue *aq)
+{
+  (void) gomp_remove_var_internal (devicep, k, aq);
 }
 
 /* Unmap variables described by TGT.  If DO_COPYFROM is true, copy relevant
@@ -1153,7 +1201,15 @@  gomp_unmap_vars_internal (struct target_mem_desc *tgt, bool do_copyfrom,
 	continue;
 
       bool do_unmap = false;
-      if (k->refcount > 1 && k->refcount != REFCOUNT_INFINITY)
+      if (k->tgt == tgt
+	  && k->virtual_refcount > 0
+	  && k->virtual_refcount != VREFCOUNT_LINK_KEY
+	  && k->refcount != REFCOUNT_INFINITY)
+	{
+	  k->virtual_refcount--;
+	  k->refcount--;
+	}
+      else if (k->refcount > 1 && k->refcount != REFCOUNT_INFINITY)
 	k->refcount--;
       else if (k->refcount == 1)
 	{
@@ -1173,7 +1229,7 @@  gomp_unmap_vars_internal (struct target_mem_desc *tgt, bool do_copyfrom,
     }
 
   if (aq)
-    devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt,
+    devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt_void,
 						(void *) tgt);
   else
     gomp_unref_tgt ((void *) tgt);
@@ -1310,7 +1366,7 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       k->tgt = tgt;
       k->tgt_offset = target_table[i].start;
       k->refcount = REFCOUNT_INFINITY;
-      k->link_key = NULL;
+      k->virtual_refcount = 0;
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
@@ -1342,7 +1398,7 @@  gomp_load_image_to_device (struct gomp_device_descr *devicep, unsigned version,
       k->tgt = tgt;
       k->tgt_offset = target_var->start;
       k->refcount = target_size & link_bit ? REFCOUNT_LINK : REFCOUNT_INFINITY;
-      k->link_key = NULL;
+      k->virtual_refcount = 0;
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
@@ -1576,22 +1632,6 @@  gomp_unload_device (struct gomp_device_descr *devicep)
     }
 }
 
-/* Free address mapping tables.  MM must be locked on entry, and remains locked
-   on return.  */
-
-attribute_hidden void
-gomp_free_memmap (struct splay_tree_s *mem_map)
-{
-  while (mem_map->root)
-    {
-      struct target_mem_desc *tgt = mem_map->root->key.tgt;
-
-      splay_tree_remove (mem_map, &mem_map->root->key);
-      free (tgt->array);
-      free (tgt);
-    }
-}
-
 /* Host fallback for GOMP_target{,_ext} routines.  */
 
 static void
@@ -2073,9 +2113,9 @@  gomp_exit_data (struct gomp_device_descr *devicep, size_t mapnum,
 	  if (k->refcount == 0)
 	    {
 	      splay_tree_remove (&devicep->mem_map, k);
-	      if (k->link_key)
+	      if (k->virtual_refcount == VREFCOUNT_LINK_KEY && k->u.link_key)
 		splay_tree_insert (&devicep->mem_map,
-				   (splay_tree_node) k->link_key);
+				   (splay_tree_node) k->u.link_key);
 	      if (k->tgt->refcount > 1)
 		k->tgt->refcount--;
 	      else
@@ -2612,6 +2652,8 @@  omp_target_associate_ptr (const void *host_ptr, const void *device_ptr,
       k->tgt = tgt;
       k->tgt_offset = (uintptr_t) device_ptr + device_offset;
       k->refcount = REFCOUNT_INFINITY;
+      k->virtual_refcount = 0;
+      k->u.link_key = NULL;
       array->left = NULL;
       array->right = NULL;
       splay_tree_insert (&devicep->mem_map, array);
@@ -2882,7 +2924,6 @@  gomp_target_init (void)
 		current_device.type = current_device.get_type_func ();
 		current_device.mem_map.root = NULL;
 		current_device.state = GOMP_DEVICE_UNINITIALIZED;
-		current_device.openacc.data_environ = NULL;
 		for (i = 0; i < new_num_devices; i++)
 		  {
 		    current_device.target_id = i;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c
index 6a52f746dcb..6bdcfe7d429 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-2.c
@@ -182,13 +182,13 @@  main (int argc, char **argv)
         exit (EXIT_FAILURE);
     }
 
+    acc_delete (&h_X[0], N * sizeof (float));
+    acc_delete (&h_Y1[0], N * sizeof (float));
+
     free (h_X);
     free (h_Y1);
     free (h_Y2);
 
-    acc_free (d_X);
-    acc_free (d_Y);
-
     context_check (pctx);
 
     s = cublasDestroy (h);
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c
index 71365e8ed32..b403a5cf5cb 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/context-4.c
@@ -176,13 +176,13 @@  main (int argc, char **argv)
         exit (EXIT_FAILURE);
     }
 
+    acc_delete (&h_X[0], N * sizeof (float));
+    acc_delete (&h_Y1[0], N * sizeof (float));
+
     free (h_X);
     free (h_Y1);
     free (h_Y2);
 
-    acc_free (d_X);
-    acc_free (d_Y);
-
     context_check (pctx);
 
     s = cublasDestroy (h);
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90
index 83a540070e6..6bb92c12ed1 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/data-2.f90
@@ -1,4 +1,5 @@ 
 ! { dg-do run }
+! { dg-additional-options "-cpp" }
 
 program test
   use openacc
@@ -70,10 +71,14 @@  program test
     end do
   !$acc end parallel
   
-  !$acc exit data copyout (d(1:N)) async
+  !$acc exit data delete (c(1:N)) copyout (d(1:N)) async
   !$acc exit data async
   !$acc wait
 
+#if !ACC_MEM_SHARED
+  if (acc_is_present (c) .eqv. .TRUE.) call abort
+#endif
+
   do i = 1, N
     if (d(i) .ne. 4.0) call abort
   end do