diff mbox series

[nvptx,libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free()

Message ID bd5d7ef4-55dd-13be-849b-daa09c2801db@codesourcery.com
State New
Headers show
Series [nvptx,libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free() | expand

Commit Message

Chung-Lin Tang Aug. 20, 2020, 1:03 p.m. UTC
Hi Tom,
this patch adjusts nvptx_free() in libgomp/plugin/plugin-nvptx.c to avoid a
"GOMP_PLUGIN_acc_thread() == NULL" check that was causing problems under
OpenMP offloading.

This check was originally used to determine if nvptx_free() was running under
CUDA callback context, when freeing resources from an OpenACC asynchronous compute
region. Since CUDA API calls are not allowed inside callback context, we have
to save the freed block to ptx_dev->free_blocks, and cuMemFree it later.

The check to see if GOMP_PLUGIN_acc_thread() exists to determine normal host thread
vs. callback thread worked under -fopenacc, but since the OpenACC per-thread data
is not created under -fopenmp, and always returned NULL, we have a leak situation
where OpenMP offloading kept accumulating freed device memory blocks until failing;
nvptx_free() never reaches the part where cuMemFree() is actually called.

I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is returned
for such CUDA calls inside callback context, and it appears to be enough to replace
the current check, so the new code sees if this error is returned on the first
cuMemGetAddressRange call to determine callback context. This should now work
for both OpenACC/OpenMP.

(Tobias, Catherine, the earlier internal patch to re-organize this callback context
checking did not work in general, since OpenACC also uses the .queue_callback
functionality to free the struct target_mem_desc asynchronously, so in general we
have to ensure nvptx_free() could be used under both normal/callback context)

This patch has been libgomp tested for x86_64-linux with nvptx offloading without
regressions, and should be applied for mainline and GCC10. Is this okay?

Thanks,
Chung-Lin

2020-08-20  Chung-Lin Tang  <cltang@codesourcery.com>

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_free): Change "GOMP_PLUGIN_acc_thread () == NULL"
	test into check of CUDA_ERROR_NOT_PERMITTED status for cuMemGetAddressRange.
	Adjust comments.

Comments

Tom de Vries Aug. 20, 2020, 1:33 p.m. UTC | #1
On 8/20/20 3:03 PM, Chung-Lin Tang wrote:
> Hi Tom,
> this patch adjusts nvptx_free() in libgomp/plugin/plugin-nvptx.c to avoid a
> "GOMP_PLUGIN_acc_thread() == NULL" check that was causing problems under
> OpenMP offloading.
> 
> This check was originally used to determine if nvptx_free() was running
> under
> CUDA callback context, when freeing resources from an OpenACC
> asynchronous compute
> region. Since CUDA API calls are not allowed inside callback context, we
> have
> to save the freed block to ptx_dev->free_blocks, and cuMemFree it later.
> 
> The check to see if GOMP_PLUGIN_acc_thread() exists to determine normal
> host thread
> vs. callback thread worked under -fopenacc, but since the OpenACC
> per-thread data
> is not created under -fopenmp, and always returned NULL, we have a leak
> situation
> where OpenMP offloading kept accumulating freed device memory blocks
> until failing;
> nvptx_free() never reaches the part where cuMemFree() is actually called.
> 
> I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is
> returned
> for such CUDA calls inside callback context,

Right, that's "Callbacks must not make any CUDA API calls. Attempting to
use a CUDA API will result in CUDA_ERROR_NOT_PERMITTED" at
cuStreamAddCallback.  Perhaps mention more precisely where you found this.

> and it appears to be enough
> to replace
> the current check, so the new code sees if this error is returned on the
> first
> cuMemGetAddressRange call to determine callback context. This should now
> work
> for both OpenACC/OpenMP.
> 
> (Tobias, Catherine, the earlier internal patch to re-organize this
> callback context
> checking did not work in general, since OpenACC also uses the
> .queue_callback
> functionality to free the struct target_mem_desc asynchronously, so in
> general we
> have to ensure nvptx_free() could be used under both normal/callback
> context)
> 
> This patch has been libgomp tested for x86_64-linux with nvptx
> offloading without
> regressions, and should be applied for mainline and GCC10. Is this okay?
> 

Ok, thanks for fixing this.

- Tom

> Thanks,
> Chung-Lin
> 
> 2020-08-20  Chung-Lin Tang  <cltang@codesourcery.com>
> 
>     libgomp/
>     * plugin/plugin-nvptx.c (nvptx_free): Change "GOMP_PLUGIN_acc_thread
> () == NULL"
>     test into check of CUDA_ERROR_NOT_PERMITTED status for
> cuMemGetAddressRange.
>     Adjust comments.


> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index ec103a2f40b..188a34f1d04 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1038,27 +1038,34 @@ goacc_profiling_acc_ev_free (struct goacc_thread *thr, void *p)
>  }
>  
>  static bool
>  nvptx_free (void *p, struct ptx_device *ptx_dev)
>  {
> -  /* Assume callback context if this is null.  */
> -  if (GOMP_PLUGIN_acc_thread () == NULL)
> +  CUdeviceptr pb;
> +  size_t ps;
> +
> +  CUresult r = CUDA_CALL_NOCHECK (cuMemGetAddressRange, &pb, &ps,
> +				  (CUdeviceptr) p);
> +  if (r == CUDA_ERROR_NOT_PERMITTED)
>      {
> +      /* We assume that this error indicates we are in a CUDA callback context,
> +	 where all CUDA calls are not allowed. Arrange to free this piece of
> +	 device memory later.  */
>        struct ptx_free_block *n
>  	= GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
>        n->ptr = p;
>        pthread_mutex_lock (&ptx_dev->free_blocks_lock);
>        n->next = ptx_dev->free_blocks;
>        ptx_dev->free_blocks = n;
>        pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
>        return true;
>      }
> -
> -  CUdeviceptr pb;
> -  size_t ps;
> -
> -  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) p);
> +  else if (r != CUDA_SUCCESS)
> +    {
> +      GOMP_PLUGIN_error ("cuMemGetAddressRange error: %s", cuda_error (r));
> +      return false;
> +    }
>    if ((CUdeviceptr) p != pb)
>      {
>        GOMP_PLUGIN_error ("invalid device address");
>        return false;
>      }
Chung-Lin Tang Aug. 20, 2020, 2:32 p.m. UTC | #2
Hi Tom,
thanks for the extremely quick review :)

On 2020/8/20 9:33 PM, Tom de Vries wrote:
>> I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is
>> returned
>> for such CUDA calls inside callback context,
> Right, that's "Callbacks must not make any CUDA API calls. Attempting to
> use a CUDA API will result in CUDA_ERROR_NOT_PERMITTED" at
> cuStreamAddCallback.  Perhaps mention more precisely where you found this.

I've added a little bit more in the comments in the final patch pushed.

>> and it appears to be enough
>> to replace
>> the current check, so the new code sees if this error is returned on the
>> first
>> cuMemGetAddressRange call to determine callback context. This should now
>> work
>> for both OpenACC/OpenMP.
>>
>> (Tobias, Catherine, the earlier internal patch to re-organize this
>> callback context
>> checking did not work in general, since OpenACC also uses the
>> .queue_callback
>> functionality to free the struct target_mem_desc asynchronously, so in
>> general we
>> have to ensure nvptx_free() could be used under both normal/callback
>> context)
>>
>> This patch has been libgomp tested for x86_64-linux with nvptx
>> offloading without
>> regressions, and should be applied for mainline and GCC10. Is this okay?
>>
> Ok, thanks for fixing this.

Just pushed to master, releases/gcc-10, and devel/omp/gcc-10.

Chung-Lin
diff mbox series

Patch

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ec103a2f40b..188a34f1d04 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1038,27 +1038,34 @@  goacc_profiling_acc_ev_free (struct goacc_thread *thr, void *p)
 }
 
 static bool
 nvptx_free (void *p, struct ptx_device *ptx_dev)
 {
-  /* Assume callback context if this is null.  */
-  if (GOMP_PLUGIN_acc_thread () == NULL)
+  CUdeviceptr pb;
+  size_t ps;
+
+  CUresult r = CUDA_CALL_NOCHECK (cuMemGetAddressRange, &pb, &ps,
+				  (CUdeviceptr) p);
+  if (r == CUDA_ERROR_NOT_PERMITTED)
     {
+      /* We assume that this error indicates we are in a CUDA callback context,
+	 where all CUDA calls are not allowed. Arrange to free this piece of
+	 device memory later.  */
       struct ptx_free_block *n
 	= GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
       n->ptr = p;
       pthread_mutex_lock (&ptx_dev->free_blocks_lock);
       n->next = ptx_dev->free_blocks;
       ptx_dev->free_blocks = n;
       pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
       return true;
     }
-
-  CUdeviceptr pb;
-  size_t ps;
-
-  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) p);
+  else if (r != CUDA_SUCCESS)
+    {
+      GOMP_PLUGIN_error ("cuMemGetAddressRange error: %s", cuda_error (r));
+      return false;
+    }
   if ((CUdeviceptr) p != pb)
     {
       GOMP_PLUGIN_error ("invalid device address");
       return false;
     }