diff mbox

gomp_target_fini

Message ID 87d1pl3dgd.fsf@hertz.schwinge.homeip.net
State New
Headers show

Commit Message

Thomas Schwinge April 19, 2016, 2:01 p.m. UTC
Hi!

On Fri, 22 Jan 2016 11:16:07 +0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Jan 21, 2016 at 04:24:46PM +0100, Bernd Schmidt wrote:
> > On 12/16/2015 01:30 PM, Thomas Schwinge wrote:
> > >Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
> > >atexit handler, gomp_target_fini, which, with the device lock held, will
> > >call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
> > >clean up.
> > >
> > >Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
> > >context is now in an inconsistent state
> > 
> > >Thus, any cuMemFreeHost invocations that are run during clean-up will now
> > >also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
> > >GOMP_PLUGIN_fatal, which again will trigger the same or another
> > >(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
> > >trying to lock the device again, which is still locked.

(... causing "WARNING: program timed out" for the affected libgomp test
cases, as well as deadlocks for any such user code, too.)

> > >     	libgomp/
> > >     	* error.c (gomp_vfatal): Call _exit instead of exit.
> > 
> > It seems unfortunate to disable the atexit handlers for everything for what
> > seems purely an nvptx problem.  [...]

> I agree, _exit is just wrong, there could be important atexit hooks from the
> application.  You can set some flag that the libgomp or nvptx plugin atexit
> hooks should not do anything, or should do things differently.  But
> bypassing all atexit handlers is risky.

Well, I certainly had done at least some thinking before proposing this:
we're talking about the libgomp "fatal exit" function, called when
something has gone very wrong, and we're about to terminate the process,
because there's no hope to recover.  In this situation/consideration it
didn't seem important to me to have atexit handlers called.  Just like
these are also not called when we run into a SIGSEGV, or the kernel kills
the process for other reasons.  So I'm not completely convinced by your
assessment that calling "_exit is just wrong".  Anyway, I can certainly
accept that my understanding of the seriousness of a libgomp "fatal exit"
has been too pessimistic, and that we can do better than my proposed
_exit solution.

Two other solutions have been proposed in the past months: Chung-Lin's
patches with subject: "Adjust offload plugin interface for avoiding
deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
later: "Resolve deadlock on plugin exit" (still pending review/approval),
and Alexander's much smaller patch with subject: "libgomp plugin: make
cuMemFreeHost error non-fatal",
<http://news.gmane.org/find-root.php?message_id=%3C1458323327-9908-4-git-send-email-amonakov%40ispras.ru%3E>.
(Both of which I have not reviewed in detail.)  Assuming that Chung-Lin's
patches are considered too invasive for gcc-6-branch, can we at least get
Alexander's patch committed to gcc-6-branch as well as on trunk, please?

commit d86a582bd9c21451dc888695ee6ecef37b5fb6ac
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Fri Mar 11 15:31:33 2016 +0300

    libgomp plugin: make cuMemFreeHost error non-fatal
    
    Unlike cuMemFree and other resource-releasing functions called on exit,
    cuMemFreeHost appears to re-report errors encountered in kernel launch.
    This leads to a deadlock after GOMP_PLUGIN_fatal is reentered.
    
    While the behavior on libgomp side is suboptimal (there's no need to
    call resource-releasing functions if we're about to destroy the CUDA
    context anyway), this behavior on cuMemFreeHost part is not useful
    and just makes error "recovery" harder.  This was reported to NVIDIA
    (bug ref. 1737876), but we can work around it by simply reporting the
    error without making it fatal.
    
    	* plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
---
 libgomp/ChangeLog.gomp-nvptx  | 4 ++++
 libgomp/plugin/plugin-nvptx.c | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)



Grüße
 Thomas

Comments

Jakub Jelinek April 19, 2016, 2:04 p.m. UTC | #1
On Tue, Apr 19, 2016 at 04:01:06PM +0200, Thomas Schwinge wrote:
> Two other solutions have been proposed in the past months: Chung-Lin's
> patches with subject: "Adjust offload plugin interface for avoiding
> deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
> later: "Resolve deadlock on plugin exit" (still pending review/approval),
> and Alexander's much smaller patch with subject: "libgomp plugin: make
> cuMemFreeHost error non-fatal",
> <http://news.gmane.org/find-root.php?message_id=%3C1458323327-9908-4-git-send-email-amonakov%40ispras.ru%3E>.
> (Both of which I have not reviewed in detail.)  Assuming that Chung-Lin's
> patches are considered too invasive for gcc-6-branch, can we at least get
> Alexander's patch committed to gcc-6-branch as well as on trunk, please?

Yeah, Alex' patch is IMHO fine, even for gcc-6-branch.

> --- libgomp/ChangeLog.gomp-nvptx
> +++ libgomp/ChangeLog.gomp-nvptx
> @@ -1,3 +1,7 @@
> +2016-03-11  Alexander Monakov  <amonakov@ispras.ru>
> +
> +	* plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
> +
>  2016-03-04  Alexander Monakov  <amonakov@ispras.ru>
>  
>  	* config/nvptx/bar.c: Remove wrong invocation of
> diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
> index adf57b1..4e44242 100644
> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -135,7 +135,7 @@ map_fini (struct ptx_stream *s)
>  
>    r = cuMemFreeHost (s->h);
>    if (r != CUDA_SUCCESS)
> -    GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
> +    GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r));
>  }
>  
>  static void

	Jakub
Alexander Monakov April 19, 2016, 3:23 p.m. UTC | #2
On Tue, 19 Apr 2016, Thomas Schwinge wrote:
> Well, I certainly had done at least some thinking before proposing this:
> we're talking about the libgomp "fatal exit" function, called when
> something has gone very wrong, and we're about to terminate the process,
> because there's no hope to recover.

By the way, this relates to something I wanted to bring up for a while now.

The OpenMP spec does not talk about error conditions arising in well-formed
programs due to resource exhaustion (OOM, in particular).  My understanding
is that an implementation always has a "way out": if e.g. it fails to allocate
memory required for a thread, it could run with reduced parallelism.
Ultimately the implementation can "fail gracefully" all the way back to
running the program sequentially.

Offloading makes that unclear due to how host fallbacks for target regions are
observable (which I don't understand, and I hope we get a chance to discuss
it), but is the above understanding generally correct?  Today libgomp is
clearly "trigger happy" to crash the process when something goes slightly
wrong, but was graceful failure ever considered as a design [non-]goal?

In that light, can a general policy of avoiding aborting the program be in
place, and should plugin authors work towards introducing fallback paths
instead of [over-]using GOMP_PLUGIN_fatal?

Thanks.
Alexander
Alexander Monakov April 21, 2016, 1:43 p.m. UTC | #3
On Tue, 19 Apr 2016, Jakub Jelinek wrote:
> On Tue, Apr 19, 2016 at 04:01:06PM +0200, Thomas Schwinge wrote:
> > Two other solutions have been proposed in the past months: Chung-Lin's
> > patches with subject: "Adjust offload plugin interface for avoiding
> > deadlock on exit", later: "Resolve libgomp plugin deadlock on exit",
> > later: "Resolve deadlock on plugin exit" (still pending review/approval),
> > and Alexander's much smaller patch with subject: "libgomp plugin: make
> > cuMemFreeHost error non-fatal",
> > <http://news.gmane.org/find-root.php?message_id=%3C1458323327-9908-4-git-send-email-amonakov%40ispras.ru%3E>.
> > (Both of which I have not reviewed in detail.)  Assuming that Chung-Lin's
> > patches are considered too invasive for gcc-6-branch, can we at least get
> > Alexander's patch committed to gcc-6-branch as well as on trunk, please?
> 
> Yeah, Alex' patch is IMHO fine, even for gcc-6-branch.

Applied to both.

Thanks.
Alexander
diff mbox

Patch

diff --git libgomp/ChangeLog.gomp-nvptx libgomp/ChangeLog.gomp-nvptx
index 7eefe0b..6bd9e5e 100644
--- libgomp/ChangeLog.gomp-nvptx
+++ libgomp/ChangeLog.gomp-nvptx
@@ -1,3 +1,7 @@ 
+2016-03-11  Alexander Monakov  <amonakov@ispras.ru>
+
+	* plugin/plugin-nvptx.c (map_fini): Make cuMemFreeHost error non-fatal.
+
 2016-03-04  Alexander Monakov  <amonakov@ispras.ru>
 
 	* config/nvptx/bar.c: Remove wrong invocation of
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index adf57b1..4e44242 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -135,7 +135,7 @@  map_fini (struct ptx_stream *s)
 
   r = cuMemFreeHost (s->h);
   if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFreeHost error: %s", cuda_error (r));
+    GOMP_PLUGIN_error ("cuMemFreeHost error: %s", cuda_error (r));
 }
 
 static void