Message ID | 20180924085658.26505-1-blomqvist.janne@gmail.com |
---|---|
State | New |
Headers | show |
Series | [libgfortran] Remove recursion check | expand |
Hi Janne, > libgfortran has a recursion check in the error handling paths. This > works by checking the value of a static variable, and if it matches, > aborting immediately instead of continuing error processing. > Unfortunately, in a multi-threaded program, if two threads report an > error at the same time, this can trigger this check, and thus the > underlying reason for the error is never reported. I agree that this is a problem that should be addressed. Hm... What do you think, would it be desirable / possible to keep the recursive error check, but ensure thread safety by a suitable locking mechanism? As a first step, we should probably specify what exactly we would like to happen. Let us assume that the aim is to keep the current behavior for normal processes and allow concurrent error processing for multiple threads. This could be done by making the static variable thread-local. I'm not sure that this would work reliably, or if some sort of locking mechanism (probably involving PTHREAD_MUTEX_RECURSIVE) would be required - only have one thread at a time process an error. Do we actually want to keep the current behavior for non-threaded programs? I'd be in favor, but I do not feel strongly about that. Regards Thomas
On Mon, Sep 24, 2018 at 7:48 PM Thomas Koenig <tkoenig@netcologne.de> wrote: > Hi Janne, > > > libgfortran has a recursion check in the error handling paths. This > > works by checking the value of a static variable, and if it matches, > > aborting immediately instead of continuing error processing. > > Unfortunately, in a multi-threaded program, if two threads report an > > error at the same time, this can trigger this check, and thus the > > underlying reason for the error is never reported. > > I agree that this is a problem that should be addressed. Hm... > > What do you think, would it be desirable / possible to keep the > recursive error check, but ensure thread safety by a suitable > locking mechanism? As a first step, we should probably specify > what exactly we would like to happen. > > Let us assume that the aim is to keep the current behavior for normal > processes and allow concurrent error processing for multiple threads. > > This could be done by making the static variable thread-local. > I'm not sure that this would work reliably, or if some sort of > locking mechanism (probably involving PTHREAD_MUTEX_RECURSIVE) > would be required - only have one thread at a time process > an error. > > Do we actually want to keep the current behavior for non-threaded > programs? I'd be in favor, but I do not feel strongly about that. > What I briefly was thinking about doing, was to use TLS. Or rather, since TLS is not supported on all targets, something like I did in intrinsics/random.c:get_rand_state(). But, since the error handling stuff should be async-signal-safe, the allocation in the setup path should be done separately, e.g. as part of library initialization. In the end I decided against it because 1) It's more complicated, in order to handle a quite unlikely edge case. 2) If recursion happens anyway, it's a bug in our error handling flow which should be fixed and not be papered over. Anyway, I'm not massively against this, if people have any particular opinion lets hear it.
On Mon, Sep 24, 2018 at 10:18 PM Janne Blomqvist <blomqvist.janne@gmail.com> wrote: > On Mon, Sep 24, 2018 at 7:48 PM Thomas Koenig <tkoenig@netcologne.de> > wrote: > >> Hi Janne, >> >> > libgfortran has a recursion check in the error handling paths. This >> > works by checking the value of a static variable, and if it matches, >> > aborting immediately instead of continuing error processing. >> > Unfortunately, in a multi-threaded program, if two threads report an >> > error at the same time, this can trigger this check, and thus the >> > underlying reason for the error is never reported. >> >> I agree that this is a problem that should be addressed. Hm... >> >> What do you think, would it be desirable / possible to keep the >> recursive error check, but ensure thread safety by a suitable >> locking mechanism? As a first step, we should probably specify >> what exactly we would like to happen. >> >> Let us assume that the aim is to keep the current behavior for normal >> processes and allow concurrent error processing for multiple threads. >> >> This could be done by making the static variable thread-local. >> I'm not sure that this would work reliably, or if some sort of >> locking mechanism (probably involving PTHREAD_MUTEX_RECURSIVE) >> would be required - only have one thread at a time process >> an error. >> >> Do we actually want to keep the current behavior for non-threaded >> programs? I'd be in favor, but I do not feel strongly about that. >> > > What I briefly was thinking about doing, was to use TLS. Or rather, since > TLS is not supported on all targets, something like I did in > intrinsics/random.c:get_rand_state(). But, since the error handling stuff > should be async-signal-safe, the allocation in the setup path should be > done separately, e.g. as part of library initialization. > > In the end I decided against it because > > 1) It's more complicated, in order to handle a quite unlikely edge case. > > 2) If recursion happens anyway, it's a bug in our error handling flow > which should be fixed and not be papered over. > > Anyway, I'm not massively against this, if people have any particular > opinion lets hear it. > PING! Any opinions on the above?
On 10/6/18 10:00 AM, Janne Blomqvist wrote: > On Mon, Sep 24, 2018 at 10:18 PM Janne Blomqvist <blomqvist.janne@gmail.com> > wrote: > >> On Mon, Sep 24, 2018 at 7:48 PM Thomas Koenig <tkoenig@netcologne.de> >> wrote: >> >>> Hi Janne, >>> >>>> libgfortran has a recursion check in the error handling paths. This >>>> works by checking the value of a static variable, and if it matches, >>>> aborting immediately instead of continuing error processing. >>>> Unfortunately, in a multi-threaded program, if two threads report an >>>> error at the same time, this can trigger this check, and thus the >>>> underlying reason for the error is never reported. >>> >>> I agree that this is a problem that should be addressed. Hm... >>> >>> What do you think, would it be desirable / possible to keep the >>> recursive error check, but ensure thread safety by a suitable >>> locking mechanism? As a first step, we should probably specify >>> what exactly we would like to happen. >>> >>> Let us assume that the aim is to keep the current behavior for normal >>> processes and allow concurrent error processing for multiple threads. >>> >>> This could be done by making the static variable thread-local. >>> I'm not sure that this would work reliably, or if some sort of >>> locking mechanism (probably involving PTHREAD_MUTEX_RECURSIVE) >>> would be required - only have one thread at a time process >>> an error. >>> >>> Do we actually want to keep the current behavior for non-threaded >>> programs? I'd be in favor, but I do not feel strongly about that. >>> >> >> What I briefly was thinking about doing, was to use TLS. Or rather, since >> TLS is not supported on all targets, something like I did in >> intrinsics/random.c:get_rand_state(). But, since the error handling stuff >> should be async-signal-safe, the allocation in the setup path should be >> done separately, e.g. as part of library initialization. >> >> In the end I decided against it because >> >> 1) It's more complicated, in order to handle a quite unlikely edge case. >> >> 2) If recursion happens anyway, it's a bug in our error handling flow >> which should be fixed and not be papered over. >> >> Anyway, I'm not massively against this, if people have any particular >> opinion lets hear it. >> > > PING! Any opinions on the above? > Agree it should be fixed. I would think a mutex lock should work. Lock, issue error message, unlock. Even if there is recursion, since there is at least one error somewhere, It should be OK to pause the world to issue the error, and come back and finish. AS I think about it, maybe one global lock variable so that only one thread can get it at a time and others wait until it is unlcoked to proceed with their lock and error. Would there be a problem with some sort of endless loop waiting for the unlock? Jerry
Hi Jerry, > Agree it should be fixed. I would think a mutex lock should work. Lock, > issue error message, unlock. Even if there is recursion, since there is > at least one error somewhere, It should be OK to pause the world to > issue the error, and come back and finish. AS I think about it, maybe > one global lock variable so that only one thread can get it at a time > and others wait until it is unlcoked to proceed with their lock and error. > Would there be a problem with some sort of endless loop waiting for the > unlock? I think that as long as we a) make the lock recursive b) make sure to unlock it afterwards we should be fine, we also should not enter an endless loop on the rare case of aa double error. I can prepare a patch, unless somebody else wants to do it. Regards Thomas
On Sat, Oct 6, 2018 at 8:55 PM Thomas Koenig <tkoenig@netcologne.de> wrote: > Hi Jerry, > > > Agree it should be fixed. I would think a mutex lock should work. Lock, > > issue error message, unlock. Even if there is recursion, since there is > > at least one error somewhere, It should be OK to pause the world to > > issue the error, and come back and finish. AS I think about it, maybe > > one global lock variable so that only one thread can get it at a time > > and others wait until it is unlcoked to proceed with their lock and > error. > > > Would there be a problem with some sort of endless loop waiting for the > > unlock? > > I think that as long as we > > a) make the lock recursive > > b) make sure to unlock it afterwards > > we should be fine, we also should not enter an endless loop on > the rare case of aa double error. > > I can prepare a patch, unless somebody else wants to do it. > The error handling functions can be called from a signal handler, so they need to be async-signal-safe. See a list e.g. at http://www.man7.org/linux/man-pages/man7/signal-safety.7.html . Notably, none of the pthread mutex functions are there. And come to think of it, neither are pthread_{g,s}etspecific(), so my idea of using TLS a few messages earlier in this thread is probably not Ok either..
Hi Janne, > The error handling functions can be called from a signal handler, so they > need to be async-signal-safe. I didn't know that. How can this happen? Regards Thomas
On Sun, Oct 7, 2018 at 12:36 AM Thomas Koenig <tkoenig@netcologne.de> wrote: > Hi Janne, > > > The error handling functions can be called from a signal handler, so they > > need to be async-signal-safe. > > I didn't know that. How can this happen? > Hmm, seems I was imagining things, I can't find anything like that in the code. Anyway, the more I think about it the better I like my original patch 1) It's KISS 2) I can't find anything in the code that would lead to endless recursive invocation of the error printing functions. So, Ok for trunk?
Hi Janne, > 1) It's KISS > > 2) I can't find anything in the code that would lead to endless > recursive invocation of the error printing functions. > > So, Ok for trunk? With async I/O, I think the possibilities of hitting concurrent errors have increased, so I'd still prefer the solution with wrapping a recursive lock around error handling. Regards Thomas
diff --git a/libgfortran/runtime/error.c b/libgfortran/runtime/error.c index b07a4c0b12a..b6cbd14ff6d 100644 --- a/libgfortran/runtime/error.c +++ b/libgfortran/runtime/error.c @@ -330,25 +330,6 @@ show_locus (st_parameter_common *cmp) } -/* recursion_check()-- It's possible for additional errors to occur - * during fatal error processing. We detect this condition here and - * exit with code 4 immediately. */ - -#define MAGIC 0x20DE8101 - -static void -recursion_check (void) -{ - static int magic = 0; - - /* Don't even try to print something at this point */ - if (magic == MAGIC) - sys_abort (); - - magic = MAGIC; -} - - #define STRERR_MAXSZ 256 /* os_error()-- Operating system error. We get a message from the @@ -360,7 +341,6 @@ os_error (const char *message) { char errmsg[STRERR_MAXSZ]; struct iovec iov[5]; - recursion_check (); iov[0].iov_base = (char*) "Operating system error: "; iov[0].iov_len = strlen (iov[0].iov_base); iov[1].iov_base = gf_strerror (errno, errmsg, STRERR_MAXSZ); @@ -388,7 +368,6 @@ runtime_error (const char *message, ...) va_list ap; int written; - recursion_check (); iov[0].iov_base = (char*) "Fortran runtime error: "; iov[0].iov_len = strlen (iov[0].iov_base); va_start (ap, message); @@ -417,7 +396,6 @@ runtime_error_at (const char *where, const char *message, ...) struct iovec iov[4]; int written; - recursion_check (); iov[0].iov_base = (char*) where; iov[0].iov_len = strlen (where); iov[1].iov_base = (char*) "\nFortran runtime error: "; @@ -473,7 +451,6 @@ internal_error (st_parameter_common *cmp, const char *message) { struct iovec iov[3]; - recursion_check (); show_locus (cmp); iov[0].iov_base = (char*) "Internal Error: "; iov[0].iov_len = strlen (iov[0].iov_base); @@ -674,7 +651,6 @@ generate_error_common (st_parameter_common *cmp, int family, const char *message /* Return code, caller is responsible for terminating the program if necessary. */ - recursion_check (); show_locus (cmp); struct iovec iov[3]; iov[0].iov_base = (char*) "Fortran runtime error: "; @@ -766,7 +742,6 @@ notify_std (st_parameter_common *cmp, int std, const char * message) if (!warning) { - recursion_check (); show_locus (cmp); iov[0].iov_base = (char*) "Fortran runtime error: "; iov[0].iov_len = strlen (iov[0].iov_base); diff --git a/libgfortran/runtime/minimal.c b/libgfortran/runtime/minimal.c index b6d26fd2228..44ebd99af07 100644 --- a/libgfortran/runtime/minimal.c +++ b/libgfortran/runtime/minimal.c @@ -43,24 +43,6 @@ options_t options; static int argc_save; static char **argv_save; -/* recursion_check()-- It's possible for additional errors to occur - * during fatal error processing. We detect this condition here and - * exit with code 4 immediately. */ - -#define MAGIC 0x20DE8101 - -static void -recursion_check (void) -{ - static int magic = 0; - - /* Don't even try to print something at this point */ - if (magic == MAGIC) - sys_abort (); - - magic = MAGIC; -} - /* os_error()-- Operating system error. We get a message from the * operating system, show it and leave. Some operating system errors @@ -69,7 +51,6 @@ recursion_check (void) void os_error (const char *message) { - recursion_check (); printf ("Operating system error: "); printf ("%s\n", message); exit (1); @@ -85,7 +66,6 @@ runtime_error (const char *message, ...) { va_list ap; - recursion_check (); printf ("Fortran runtime error: "); va_start (ap, message); vprintf (message, ap); @@ -103,7 +83,6 @@ runtime_error_at (const char *where, const char *message, ...) { va_list ap; - recursion_check (); printf ("%s", where); printf ("\nFortran runtime error: "); va_start (ap, message); @@ -136,7 +115,6 @@ iexport(runtime_warning_at); void internal_error (st_parameter_common *cmp, const char *message) { - recursion_check (); printf ("Internal Error: "); printf ("%s", message); printf ("\n");