diff mbox

gomp_target_fini (was: [gomp4.5] Handle #pragma omp declare target link)

Message ID 87r3impode.fsf@kepler.schwinge.homeip.net
State New
Headers show

Commit Message

Thomas Schwinge Dec. 16, 2015, 12:30 p.m. UTC
Hi!

On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iverbin@gmail.com> wrote:
> On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > +/* This function finalizes all initialized devices.  */
> > > +
> > > +static void
> > > +gomp_target_fini (void)
> > > +{
> > > +  [...]
> > 
> > The question is what will this do if there are async target tasks still
> > running on some of the devices at this point (forgotten #pragma omp taskwait
> > or similar if target nowait regions are started outside of parallel region,
> > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > Also there is the question that the 
> > So I think the patch is ok with the above mentioned changes.
> 
> Here is what I've committed to trunk.

> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
>    } cuda;
>  } acc_dispatch_t;
>  
> +/* Various state of the accelerator device.  */
> +enum gomp_device_state
> +{
> +  GOMP_DEVICE_UNINITIALIZED,
> +  GOMP_DEVICE_INITIALIZED,
> +  GOMP_DEVICE_FINALIZED
> +};
> +
>  /* This structure describes accelerator device.
>     It contains name of the corresponding libgomp plugin, function handlers for
>     interaction with the device, ID-number of the device, and information about
> @@ -933,8 +941,10 @@ struct gomp_device_descr
>    /* Mutex for the mutable data.  */
>    gomp_mutex_t lock;
>  
> -  /* Set to true when device is initialized.  */
> -  bool is_initialized;
> +  /* Current state of the device.  OpenACC allows to move from INITIALIZED state
> +     back to UNINITIALIZED state.  OpenMP allows only to move from INITIALIZED
> +     to FINALIZED state (at program shutdown).  */
> +  enum gomp_device_state state;

(ACK, but I assume we'll want to make sure that an OpenACC device is
never re-initialized if we're in/after the libgomp finalization phase.)


The issue mentioned above: "exit inside of parallel" is actually a
problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
test cases now run into annoying "WARNING: program timed out".  Here is
what's happening, as I understand it: in
libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c

> +/* This function finalizes all initialized devices.  */
> +
> +static void
> +gomp_target_fini (void)
> +{
> +  int i;
> +  for (i = 0; i < num_devices; i++)
> +    {
> +      struct gomp_device_descr *devicep = &devices[i];
> +      gomp_mutex_lock (&devicep->lock);
> +      if (devicep->state == GOMP_DEVICE_INITIALIZED)
> +	{
> +	  devicep->fini_device_func (devicep->target_id);
> +	  devicep->state = GOMP_DEVICE_FINALIZED;
> +	}
> +      gomp_mutex_unlock (&devicep->lock);
> +    }
> +}

> @@ -2387,6 +2433,9 @@ gomp_target_init (void)
>        if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
>  	goacc_register (&devices[i]);
>      }
> +
> +  if (atexit (gomp_target_fini) != 0)
> +    gomp_fatal ("atexit failed");
>  }

Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
atexit handler, gomp_target_fini, which, with the device lock held, will
call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
clean up.

Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
context is now in an inconsistent state, see
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>:

    CUDA_ERROR_LAUNCH_FAILED = 719
        An exception occurred on the device while executing a
        kernel. Common causes include dereferencing an invalid device
        pointer and accessing out of bounds shared memory. The context
        cannot be used, so it must be destroyed (and a new one should be
        created). All existing device memory allocations from this
        context are invalid and must be reconstructed if the program is
        to continue using CUDA.

Thus, any cuMemFreeHost invocations that are run during clean-up will now
also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
GOMP_PLUGIN_fatal, which again will trigger the same or another
(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
trying to lock the device again, which is still locked.

(Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
errors, should we destroy the context right away, or toggle a boolean
flag to mark it as unusable, and use that as an indication to avoid the
follow-on failures of cuMemFreeHost just described above, for example?)

<http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html>
tells us:

    Since the behavior is undefined if the exit() function is called more
    than once, portable applications calling atexit() must ensure that the
    exit() function is not called at normal process termination when all
    functions registered by the atexit() function are called.

... which we're violating here, at least in the nvptx plugin.  I have not
analyzed the intermic one.

As it happens, Chung-Lin has been working in that area:
<http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>,
which he recently re-posted:
<http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>,
<http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>,
<http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>.
I have not analyzed whether his changes would completely resolve the
problem just described, but at least conceptually they seem to be a step
into the right direction?  (Jakub?)

Now, to resolve the immediate problem, what is the right thing for us to
do?  Is the following simple change OK, or is there a reason to still run
atexit handlers if terminating under error conditions?

commit b1733e8f9df6ae7d6828e2194df1b314772701c5
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Wed Dec 16 13:10:39 2015 +0100

    Avoid deadlocks in libgomp due to competing atexit handlers
    
    	libgomp/
    	* error.c (gomp_vfatal): Call _exit instead of exit.
---
 libgomp/error.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)



Grüße
 Thomas

Comments

Thomas Schwinge Dec. 23, 2015, 11:05 a.m. UTC | #1
Hi!

Ping.

On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iverbin@gmail.com> wrote:
> > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > +/* This function finalizes all initialized devices.  */
> > > > +
> > > > +static void
> > > > +gomp_target_fini (void)
> > > > +{
> > > > +  [...]
> > > 
> > > The question is what will this do if there are async target tasks still
> > > running on some of the devices at this point (forgotten #pragma omp taskwait
> > > or similar if target nowait regions are started outside of parallel region,
> > > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > > Also there is the question that the 
> > > So I think the patch is ok with the above mentioned changes.
> > 
> > Here is what I've committed to trunk.
> 
> > --- a/libgomp/libgomp.h
> > +++ b/libgomp/libgomp.h
> > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> >    } cuda;
> >  } acc_dispatch_t;
> >  
> > +/* Various state of the accelerator device.  */
> > +enum gomp_device_state
> > +{
> > +  GOMP_DEVICE_UNINITIALIZED,
> > +  GOMP_DEVICE_INITIALIZED,
> > +  GOMP_DEVICE_FINALIZED
> > +};
> > +
> >  /* This structure describes accelerator device.
> >     It contains name of the corresponding libgomp plugin, function handlers for
> >     interaction with the device, ID-number of the device, and information about
> > @@ -933,8 +941,10 @@ struct gomp_device_descr
> >    /* Mutex for the mutable data.  */
> >    gomp_mutex_t lock;
> >  
> > -  /* Set to true when device is initialized.  */
> > -  bool is_initialized;
> > +  /* Current state of the device.  OpenACC allows to move from INITIALIZED state
> > +     back to UNINITIALIZED state.  OpenMP allows only to move from INITIALIZED
> > +     to FINALIZED state (at program shutdown).  */
> > +  enum gomp_device_state state;
> 
> (ACK, but I assume we'll want to make sure that an OpenACC device is
> never re-initialized if we're in/after the libgomp finalization phase.)
> 
> 
> The issue mentioned above: "exit inside of parallel" is actually a
> problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> test cases now run into annoying "WARNING: program timed out".  Here is
> what's happening, as I understand it: in
> libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> 
> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> 
> > +/* This function finalizes all initialized devices.  */
> > +
> > +static void
> > +gomp_target_fini (void)
> > +{
> > +  int i;
> > +  for (i = 0; i < num_devices; i++)
> > +    {
> > +      struct gomp_device_descr *devicep = &devices[i];
> > +      gomp_mutex_lock (&devicep->lock);
> > +      if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > +	{
> > +	  devicep->fini_device_func (devicep->target_id);
> > +	  devicep->state = GOMP_DEVICE_FINALIZED;
> > +	}
> > +      gomp_mutex_unlock (&devicep->lock);
> > +    }
> > +}
> 
> > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> >        if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> >  	goacc_register (&devices[i]);
> >      }
> > +
> > +  if (atexit (gomp_target_fini) != 0)
> > +    gomp_fatal ("atexit failed");
> >  }
> 
> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> atexit handler, gomp_target_fini, which, with the device lock held, will
> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> clean up.
> 
> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> context is now in an inconsistent state, see
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>:
> 
>     CUDA_ERROR_LAUNCH_FAILED = 719
>         An exception occurred on the device while executing a
>         kernel. Common causes include dereferencing an invalid device
>         pointer and accessing out of bounds shared memory. The context
>         cannot be used, so it must be destroyed (and a new one should be
>         created). All existing device memory allocations from this
>         context are invalid and must be reconstructed if the program is
>         to continue using CUDA.
> 
> Thus, any cuMemFreeHost invocations that are run during clean-up will now
> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> GOMP_PLUGIN_fatal, which again will trigger the same or another
> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> trying to lock the device again, which is still locked.
> 
> (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
> errors, should we destroy the context right away, or toggle a boolean
> flag to mark it as unusable, and use that as an indication to avoid the
> follow-on failures of cuMemFreeHost just described above, for example?)
> 
> <http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html>
> tells us:
> 
>     Since the behavior is undefined if the exit() function is called more
>     than once, portable applications calling atexit() must ensure that the
>     exit() function is not called at normal process termination when all
>     functions registered by the atexit() function are called.
> 
> ... which we're violating here, at least in the nvptx plugin.  I have not
> analyzed the intermic one.
> 
> As it happens, Chung-Lin has been working in that area:
> <http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>,
> which he recently re-posted:
> <http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>,
> <http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>.
> I have not analyzed whether his changes would completely resolve the
> problem just described, but at least conceptually they seem to be a step
> into the right direction?  (Jakub?)
> 
> Now, to resolve the immediate problem, what is the right thing for us to
> do?  Is the following simple change OK, or is there a reason to still run
> atexit handlers if terminating under error conditions?
> 
> commit b1733e8f9df6ae7d6828e2194df1b314772701c5
> Author: Thomas Schwinge <thomas@codesourcery.com>
> Date:   Wed Dec 16 13:10:39 2015 +0100
> 
>     Avoid deadlocks in libgomp due to competing atexit handlers
>     
>     	libgomp/
>     	* error.c (gomp_vfatal): Call _exit instead of exit.
> ---
>  libgomp/error.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git libgomp/error.c libgomp/error.c
> index 094c24a..1ef7491 100644
> --- libgomp/error.c
> +++ libgomp/error.c
> @@ -34,6 +34,7 @@
>  #include <stdarg.h>
>  #include <stdio.h>
>  #include <stdlib.h>
> +#include <unistd.h>
>  
>  
>  #undef gomp_vdebug
> @@ -77,7 +78,7 @@ void
>  gomp_vfatal (const char *fmt, va_list list)
>  {
>    gomp_verror (fmt, list);
> -  exit (EXIT_FAILURE);
> +  _exit (EXIT_FAILURE);
>  }
>  
>  void


Grüße
 Thomas
Thomas Schwinge Jan. 11, 2016, 10:39 a.m. UTC | #2
Hi!

Ping.

On Wed, 23 Dec 2015 12:05:32 +0100, I wrote:
> Ping.
> 
> On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> > On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iverbin@gmail.com> wrote:
> > > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > > +/* This function finalizes all initialized devices.  */
> > > > > +
> > > > > +static void
> > > > > +gomp_target_fini (void)
> > > > > +{
> > > > > +  [...]
> > > > 
> > > > The question is what will this do if there are async target tasks still
> > > > running on some of the devices at this point (forgotten #pragma omp taskwait
> > > > or similar if target nowait regions are started outside of parallel region,
> > > > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > > > Also there is the question that the 
> > > > So I think the patch is ok with the above mentioned changes.
> > > 
> > > Here is what I've committed to trunk.
> > 
> > > --- a/libgomp/libgomp.h
> > > +++ b/libgomp/libgomp.h
> > > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> > >    } cuda;
> > >  } acc_dispatch_t;
> > >  
> > > +/* Various state of the accelerator device.  */
> > > +enum gomp_device_state
> > > +{
> > > +  GOMP_DEVICE_UNINITIALIZED,
> > > +  GOMP_DEVICE_INITIALIZED,
> > > +  GOMP_DEVICE_FINALIZED
> > > +};
> > > +
> > >  /* This structure describes accelerator device.
> > >     It contains name of the corresponding libgomp plugin, function handlers for
> > >     interaction with the device, ID-number of the device, and information about
> > > @@ -933,8 +941,10 @@ struct gomp_device_descr
> > >    /* Mutex for the mutable data.  */
> > >    gomp_mutex_t lock;
> > >  
> > > -  /* Set to true when device is initialized.  */
> > > -  bool is_initialized;
> > > +  /* Current state of the device.  OpenACC allows to move from INITIALIZED state
> > > +     back to UNINITIALIZED state.  OpenMP allows only to move from INITIALIZED
> > > +     to FINALIZED state (at program shutdown).  */
> > > +  enum gomp_device_state state;
> > 
> > (ACK, but I assume we'll want to make sure that an OpenACC device is
> > never re-initialized if we're in/after the libgomp finalization phase.)
> > 
> > 
> > The issue mentioned above: "exit inside of parallel" is actually a
> > problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> > libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> > test cases now run into annoying "WARNING: program timed out".  Here is
> > what's happening, as I understand it: in
> > libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> > returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> > 
> > > --- a/libgomp/target.c
> > > +++ b/libgomp/target.c
> > 
> > > +/* This function finalizes all initialized devices.  */
> > > +
> > > +static void
> > > +gomp_target_fini (void)
> > > +{
> > > +  int i;
> > > +  for (i = 0; i < num_devices; i++)
> > > +    {
> > > +      struct gomp_device_descr *devicep = &devices[i];
> > > +      gomp_mutex_lock (&devicep->lock);
> > > +      if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > > +	{
> > > +	  devicep->fini_device_func (devicep->target_id);
> > > +	  devicep->state = GOMP_DEVICE_FINALIZED;
> > > +	}
> > > +      gomp_mutex_unlock (&devicep->lock);
> > > +    }
> > > +}
> > 
> > > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> > >        if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> > >  	goacc_register (&devices[i]);
> > >      }
> > > +
> > > +  if (atexit (gomp_target_fini) != 0)
> > > +    gomp_fatal ("atexit failed");
> > >  }
> > 
> > Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > atexit handler, gomp_target_fini, which, with the device lock held, will
> > call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > clean up.
> > 
> > Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > context is now in an inconsistent state, see
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>:
> > 
> >     CUDA_ERROR_LAUNCH_FAILED = 719
> >         An exception occurred on the device while executing a
> >         kernel. Common causes include dereferencing an invalid device
> >         pointer and accessing out of bounds shared memory. The context
> >         cannot be used, so it must be destroyed (and a new one should be
> >         created). All existing device memory allocations from this
> >         context are invalid and must be reconstructed if the program is
> >         to continue using CUDA.
> > 
> > Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > GOMP_PLUGIN_fatal, which again will trigger the same or another
> > (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> > trying to lock the device again, which is still locked.
> > 
> > (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
> > errors, should we destroy the context right away, or toggle a boolean
> > flag to mark it as unusable, and use that as an indication to avoid the
> > follow-on failures of cuMemFreeHost just described above, for example?)
> > 
> > <http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html>
> > tells us:
> > 
> >     Since the behavior is undefined if the exit() function is called more
> >     than once, portable applications calling atexit() must ensure that the
> >     exit() function is not called at normal process termination when all
> >     functions registered by the atexit() function are called.
> > 
> > ... which we're violating here, at least in the nvptx plugin.  I have not
> > analyzed the intermic one.
> > 
> > As it happens, Chung-Lin has been working in that area:
> > <http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>,
> > which he recently re-posted:
> > <http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>,
> > <http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>,
> > <http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>.
> > I have not analyzed whether his changes would completely resolve the
> > problem just described, but at least conceptually they seem to be a step
> > into the right direction?  (Jakub?)
> > 
> > Now, to resolve the immediate problem, what is the right thing for us to
> > do?  Is the following simple change OK, or is there a reason to still run
> > atexit handlers if terminating under error conditions?
> > 
> > commit b1733e8f9df6ae7d6828e2194df1b314772701c5
> > Author: Thomas Schwinge <thomas@codesourcery.com>
> > Date:   Wed Dec 16 13:10:39 2015 +0100
> > 
> >     Avoid deadlocks in libgomp due to competing atexit handlers
> >     
> >     	libgomp/
> >     	* error.c (gomp_vfatal): Call _exit instead of exit.
> > ---
> >  libgomp/error.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git libgomp/error.c libgomp/error.c
> > index 094c24a..1ef7491 100644
> > --- libgomp/error.c
> > +++ libgomp/error.c
> > @@ -34,6 +34,7 @@
> >  #include <stdarg.h>
> >  #include <stdio.h>
> >  #include <stdlib.h>
> > +#include <unistd.h>
> >  
> >  
> >  #undef gomp_vdebug
> > @@ -77,7 +78,7 @@ void
> >  gomp_vfatal (const char *fmt, va_list list)
> >  {
> >    gomp_verror (fmt, list);
> > -  exit (EXIT_FAILURE);
> > +  _exit (EXIT_FAILURE);
> >  }
> >  
> >  void


Grüße
 Thomas
Thomas Schwinge Jan. 21, 2016, 6:17 a.m. UTC | #3
Hi!

Ping.

On Mon, 11 Jan 2016 11:39:46 +0100, I wrote:
> Ping.
> 
> On Wed, 23 Dec 2015 12:05:32 +0100, I wrote:
> > Ping.
> > 
> > On Wed, 16 Dec 2015 13:30:21 +0100, I wrote:
> > > On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin <iverbin@gmail.com> wrote:
> > > > On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > > > > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > > > > +/* This function finalizes all initialized devices.  */
> > > > > > +
> > > > > > +static void
> > > > > > +gomp_target_fini (void)
> > > > > > +{
> > > > > > +  [...]
> > > > > 
> > > > > The question is what will this do if there are async target tasks still
> > > > > running on some of the devices at this point (forgotten #pragma omp taskwait
> > > > > or similar if target nowait regions are started outside of parallel region,
> > > > > or exit inside of parallel, etc.  But perhaps it can be handled incrementally.
> > > > > Also there is the question that the 
> > > > > So I think the patch is ok with the above mentioned changes.
> > > > 
> > > > Here is what I've committed to trunk.
> > > 
> > > > --- a/libgomp/libgomp.h
> > > > +++ b/libgomp/libgomp.h
> > > > @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
> > > >    } cuda;
> > > >  } acc_dispatch_t;
> > > >  
> > > > +/* Various state of the accelerator device.  */
> > > > +enum gomp_device_state
> > > > +{
> > > > +  GOMP_DEVICE_UNINITIALIZED,
> > > > +  GOMP_DEVICE_INITIALIZED,
> > > > +  GOMP_DEVICE_FINALIZED
> > > > +};
> > > > +
> > > >  /* This structure describes accelerator device.
> > > >     It contains name of the corresponding libgomp plugin, function handlers for
> > > >     interaction with the device, ID-number of the device, and information about
> > > > @@ -933,8 +941,10 @@ struct gomp_device_descr
> > > >    /* Mutex for the mutable data.  */
> > > >    gomp_mutex_t lock;
> > > >  
> > > > -  /* Set to true when device is initialized.  */
> > > > -  bool is_initialized;
> > > > +  /* Current state of the device.  OpenACC allows to move from INITIALIZED state
> > > > +     back to UNINITIALIZED state.  OpenMP allows only to move from INITIALIZED
> > > > +     to FINALIZED state (at program shutdown).  */
> > > > +  enum gomp_device_state state;
> > > 
> > > (ACK, but I assume we'll want to make sure that an OpenACC device is
> > > never re-initialized if we're in/after the libgomp finalization phase.)
> > > 
> > > 
> > > The issue mentioned above: "exit inside of parallel" is actually a
> > > problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
> > > libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
> > > test cases now run into annoying "WARNING: program timed out".  Here is
> > > what's happening, as I understand it: in
> > > libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
> > > returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.
> > > 
> > > > --- a/libgomp/target.c
> > > > +++ b/libgomp/target.c
> > > 
> > > > +/* This function finalizes all initialized devices.  */
> > > > +
> > > > +static void
> > > > +gomp_target_fini (void)
> > > > +{
> > > > +  int i;
> > > > +  for (i = 0; i < num_devices; i++)
> > > > +    {
> > > > +      struct gomp_device_descr *devicep = &devices[i];
> > > > +      gomp_mutex_lock (&devicep->lock);
> > > > +      if (devicep->state == GOMP_DEVICE_INITIALIZED)
> > > > +	{
> > > > +	  devicep->fini_device_func (devicep->target_id);
> > > > +	  devicep->state = GOMP_DEVICE_FINALIZED;
> > > > +	}
> > > > +      gomp_mutex_unlock (&devicep->lock);
> > > > +    }
> > > > +}
> > > 
> > > > @@ -2387,6 +2433,9 @@ gomp_target_init (void)
> > > >        if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
> > > >  	goacc_register (&devices[i]);
> > > >      }
> > > > +
> > > > +  if (atexit (gomp_target_fini) != 0)
> > > > +    gomp_fatal ("atexit failed");
> > > >  }
> > > 
> > > Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > > atexit handler, gomp_target_fini, which, with the device lock held, will
> > > call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > > clean up.
> > > 
> > > Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > > context is now in an inconsistent state, see
> > > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html>:
> > > 
> > >     CUDA_ERROR_LAUNCH_FAILED = 719
> > >         An exception occurred on the device while executing a
> > >         kernel. Common causes include dereferencing an invalid device
> > >         pointer and accessing out of bounds shared memory. The context
> > >         cannot be used, so it must be destroyed (and a new one should be
> > >         created). All existing device memory allocations from this
> > >         context are invalid and must be reconstructed if the program is
> > >         to continue using CUDA.
> > > 
> > > Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > > also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > > GOMP_PLUGIN_fatal, which again will trigger the same or another
> > > (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> > > trying to lock the device again, which is still locked.
> > > 
> > > (Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
> > > errors, should we destroy the context right away, or toggle a boolean
> > > flag to mark it as unusable, and use that as an indication to avoid the
> > > follow-on failures of cuMemFreeHost just described above, for example?)
> > > 
> > > <http://pubs.opengroup.org/onlinepubs/9699919799/functions/atexit.html>
> > > tells us:
> > > 
> > >     Since the behavior is undefined if the exit() function is called more
> > >     than once, portable applications calling atexit() must ensure that the
> > >     exit() function is not called at normal process termination when all
> > >     functions registered by the atexit() function are called.
> > > 
> > > ... which we're violating here, at least in the nvptx plugin.  I have not
> > > analyzed the intermic one.
> > > 
> > > As it happens, Chung-Lin has been working in that area:
> > > <http://news.gmane.org/find-root.php?message_id=%3C55DF1452.9050501%40codesourcery.com%3E>,
> > > which he recently re-posted:
> > > <http://news.gmane.org/find-root.php?message_id=%3C566EE49A.3050403%40codesourcery.com%3E>,
> > > <http://news.gmane.org/find-root.php?message_id=%3C566EC310.8000403%40codesourcery.com%3E>,
> > > <http://news.gmane.org/find-root.php?message_id=%3C566EC324.9050505%40codesourcery.com%3E>.
> > > I have not analyzed whether his changes would completely resolve the
> > > problem just described, but at least conceptually they seem to be a step
> > > into the right direction?  (Jakub?)
> > > 
> > > Now, to resolve the immediate problem, what is the right thing for us to
> > > do?  Is the following simple change OK, or is there a reason to still run
> > > atexit handlers if terminating under error conditions?
> > > 
> > > commit b1733e8f9df6ae7d6828e2194df1b314772701c5
> > > Author: Thomas Schwinge <thomas@codesourcery.com>
> > > Date:   Wed Dec 16 13:10:39 2015 +0100
> > > 
> > >     Avoid deadlocks in libgomp due to competing atexit handlers
> > >     
> > >     	libgomp/
> > >     	* error.c (gomp_vfatal): Call _exit instead of exit.
> > > ---
> > >  libgomp/error.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git libgomp/error.c libgomp/error.c
> > > index 094c24a..1ef7491 100644
> > > --- libgomp/error.c
> > > +++ libgomp/error.c
> > > @@ -34,6 +34,7 @@
> > >  #include <stdarg.h>
> > >  #include <stdio.h>
> > >  #include <stdlib.h>
> > > +#include <unistd.h>
> > >  
> > >  
> > >  #undef gomp_vdebug
> > > @@ -77,7 +78,7 @@ void
> > >  gomp_vfatal (const char *fmt, va_list list)
> > >  {
> > >    gomp_verror (fmt, list);
> > > -  exit (EXIT_FAILURE);
> > > +  _exit (EXIT_FAILURE);
> > >  }
> > >  
> > >  void


Grüße
 Thomas
Bernd Schmidt Jan. 21, 2016, 3:24 p.m. UTC | #4
Thomas, I've mentioned this issue before - there is sometimes just too 
much irrelevant stuff to wade through in your patch submissions, and it 
discourages review. The discussion of the actual problem begins more 
than halfway through your multi-page mail. Please try to be more concise.

On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> atexit handler, gomp_target_fini, which, with the device lock held, will
> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> clean up.
>
> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> context is now in an inconsistent state

> Thus, any cuMemFreeHost invocations that are run during clean-up will now
> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> GOMP_PLUGIN_fatal, which again will trigger the same or another
> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> trying to lock the device again, which is still locked.

>      	libgomp/
>      	* error.c (gomp_vfatal): Call _exit instead of exit.

It seems unfortunate to disable the atexit handlers for everything for 
what seems purely an nvptx problem.

What exactly happens if you don't register the cleanups with atexit in 
the first place? Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in 
the cleanup functions?


Bernd
Jakub Jelinek Jan. 22, 2016, 10:16 a.m. UTC | #5
On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
> Thomas, I've mentioned this issue before - there is sometimes just too much
> irrelevant stuff to wade through in your patch submissions, and it
> discourages review. The discussion of the actual problem begins more than
> halfway through your multi-page mail. Please try to be more concise.
> 
> On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> >Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> >atexit handler, gomp_target_fini, which, with the device lock held, will
> >call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> >clean up.
> >
> >Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> >context is now in an inconsistent state
> 
> >Thus, any cuMemFreeHost invocations that are run during clean-up will now
> >also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> >GOMP_PLUGIN_fatal, which again will trigger the same or another
> >(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> >trying to lock the device again, which is still locked.
> 
> >     	libgomp/
> >     	* error.c (gomp_vfatal): Call _exit instead of exit.
> 
> It seems unfortunate to disable the atexit handlers for everything for what
> seems purely an nvptx problem.
> 
> What exactly happens if you don't register the cleanups with atexit in the
> first place? Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in the
> cleanup functions?

I agree, _exit is just wrong, there could be important atexit hooks from the
application.  You can set some flag that the libgomp or nvptx plugin atexit
hooks should not do anything, or should do things differently.  But
bypassing all atexit handlers is risky.

	Jakub
Mike Stump Jan. 25, 2016, 6:20 p.m. UTC | #6
On Jan 22, 2016, at 2:16 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
>> Thomas, I've mentioned this issue before - there is sometimes just too much
>> irrelevant stuff to wade through in your patch submissions, and it
>> discourages review. The discussion of the actual problem begins more than
>> halfway through your multi-page mail. Please try to be more concise.
>> 
>> On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
>>> Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
>>> atexit handler, gomp_target_fini, which, with the device lock held, will
>>> call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
>>> clean up.
>>> 
>>> Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
>>> context is now in an inconsistent state
>> 
>>> Thus, any cuMemFreeHost invocations that are run during clean-up will now
>>> also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
>>> GOMP_PLUGIN_fatal, which again will trigger the same or another
>>> (GOMP_offload_unregister_ver) atexit handler, which will then deadlock
>>> trying to lock the device again, which is still locked.
>> 
>>>    	libgomp/
>>>    	* error.c (gomp_vfatal): Call _exit instead of exit.
>> 
>> It seems unfortunate to disable the atexit handlers for everything for what
>> seems purely an nvptx problem.
>> 
>> What exactly happens if you don't register the cleanups with atexit in the
>> first place? Or maybe you can query for CUDA_ERROR_LAUNCH_FAILED in the
>> cleanup functions?
> 
> I agree, _exit is just wrong, there could be important atexit hooks from the
> application.  You can set some flag that the libgomp or nvptx plugin atexit
> hooks should not do anything, or should do things differently.  But
> bypassing all atexit handlers is risky.

I’d use the phrase, is wrong.

Just create a semaphore that says that init was fully done, and at the end of init, set it, and at the beginning of the cleanup, just test it and anytime you want to cancel the cleanup, reset the semaphore.  Think of it, as a is_valid predicate.  Any operation that needs it to be valid can query it first, and fail otherwise.
diff mbox

Patch

diff --git libgomp/error.c libgomp/error.c
index 094c24a..1ef7491 100644
--- libgomp/error.c
+++ libgomp/error.c
@@ -34,6 +34,7 @@ 
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <unistd.h>
 
 
 #undef gomp_vdebug
@@ -77,7 +78,7 @@  void
 gomp_vfatal (const char *fmt, va_list list)
 {
   gomp_verror (fmt, list);
-  exit (EXIT_FAILURE);
+  _exit (EXIT_FAILURE);
 }
 
 void