Message ID | CAPJVwB=xFZuuxxMX92Ru-EVM2APOjvrTLmuNfLsRL9e0UO7TDA@mail.gmail.com |
---|---|
State | New |
Headers | show |
Ping^2. On Tue, Oct 28, 2014 at 6:17 PM, Nathaniel Smith <njs@pobox.com> wrote: > Ping. > > On 19 Oct 2014 23:44, "Nathaniel Smith" <njs@pobox.com> wrote: >> >> Hi Jakub, >> >> Thanks for your feedback! See below. >> >> On Thu, Oct 16, 2014 at 4:52 PM, Jakub Jelinek <jakub@redhat.com> wrote: >> > On Mon, Oct 13, 2014 at 10:16:19PM +0100, Nathaniel Smith wrote: >> >> Got total silence the last 4 times I posted this, and users have been >> >> bugging me about it offline, so trying again. >> >> >> >> This patch fixes a showstopper problem preventing the transparent use >> >> of OpenMP in scientific libraries, esp. with Python. Specifically, it >> >> is currently not possible to use GNU OpenMP -- even in a limited, >> >> temporary manner -- in any program that uses (or might use) fork() for >> >> parallelism, even if the fork() and the use of OpenMP occur at totally >> >> different times. This limitation is unique to GNU OpenMP -- every >> >> competing OpenMP implementation already contains something like this >> >> patch. While technically not fully POSIX-compliant (because POSIX >> >> gives much much weaker guarantees around fork() than any real Unix), >> >> the approach used in this patch (a) performs only POSIX-compliant >> >> operations when the host program is itself fully POSIX-compliant, and >> >> (b) actually works perfectly reliably in practice on all commonly used >> >> platforms I'm aware of. >> > >> > 1) gomp_we_are_forked in your patch will attempt to free the pool >> > of the thread that encounters it, which is racy; consider a program >> > after fork calling pthread_create several times, each thread >> > thusly created then ~ at the same time doing #pragma omp parallel >> > and the initial thread too. You really should clean up the pool >> > data structure only in the initial thread and nowhere else; >> > for native TLS (non-emulated, IE model) the best would be to have a >> > flag >> > in the gomp_thread_pool structure, >> > struct gomp_thread *thr = gomp_thread (); >> > if (thr && thr->thread_pool) >> > thr->thread_pool->after_fork = true; >> > should in that case be safe in the atfork child handler. >> > For !HAVE_TLS or emulated TLS not sure if it is completely safe, >> > it would call pthread_getspecific. Perhaps just don't register >> > atfork handler on those targets at all? >> >> Good point. The updated patch below takes a slightly different >> approach. I moved we_are_forked to the per-thread struct, and then I >> moved the setting of it into the *parent* process's fork handlers -- >> the before-fork handler toggles it to true, then the child spawns off >> and inherits this setting, and then the parent after-fork handler >> toggles it back again. (Since it's per-thread, there's no race >> condition here.) This lets us remove the child after-fork handler >> entirely, and -- since the parent handlers aren't subject to any >> restrictions on what they can call -- it works on all platforms >> regardless of the TLS implementation. >> >> > 2) can you explain why are you removing the cleanups from >> > gomp_free_pool_helper ? >> >> They aren't removed, but rather moved from the helper function (which >> runs in the helper threads) into gomp_free_thread_pool (which runs in >> the main thread) -- which makes it easier to run the appropriate >> cleanups even in the case where the helper threads aren't running. >> (But see below -- we might prefer to drop this part of the patch >> entirely.) >> >> > 3) you can call pthread_atfork many times (once for each pthread >> > that creates a thread pool), that is undesirable, you want to do that >> > only if the initial thread creates thread pool >> >> Good point. I've moved the pthread_atfork call to initialize_team, >> which is an __attribute__((constructor)). >> >> I am a little uncertain whether this is the best approach, though, >> because of the comment in team_destructor about wanting to correctly >> handle dlopen/dlclose. One of pthread_atfork's many (many) limitations >> is that there's no way to unregister handlers, so if dlopen/dlclose is >> important (is it?) then we can't call pthread_atfork from >> initialize_team. >> >> If this is a problem, then we could delay the pthread_atfork until >> e.g. the first thread pool is spawned -- would this be preferred? >> >> > 4) the testcase is clearly not portable enough, should be probably >> > limited >> > to *-*-linux* only, fork etc. will likely not work on many targets. >> >> I think it should work on pretty much any target that has fork(); we >> definitely care about having this functionality on e.g. OS X. I've >> added some genericish target specifications. >> >> > In any case, even with the patch, are you aware that you'll leak >> > megabytes >> > of thread stacks etc.? >> >> Well, err, I wasn't, no :-). Thanks for pointing that out. >> >> To me this does clinch the argument that a better approach would be >> the one I suggested in >> https://gcc.gnu.org/ml/gcc-patches/2014-02/msg00979.html >> i.e., of tracking whether any threadprivate variables were present, >> and if not then simply shutting down the thread pools before forking. >> But this would be a much more invasive change to gomp (I wouldn't know >> where to start). >> >> In the mean time, the current patch is still worthwhile. The cost is >> not that bad: I wouldn't think of it as "leaking" so much as "overhead >> of supporting OMP->fork->OMP". No-one forks a child which forks a >> child which forks a child etc., so the cost is pretty much bounded in >> practice. The most common use case is probably using fork() to spawn a >> set of worker processes, which will end up COW-sharing the thread >> stacks (which will just end up resting peacefully in swap). And when >> doing computational work where working set sizes are often in the >> gigabytes, spending a few megabytes is small change -- esp. compared >> to the current cost, which involves first wasting hours of programmer >> time figuring out why things are just locking up, and then (in many >> cases) having to rewrite the code entirely because there's no fix for >> this, you just have to redesign you parallelization architecture to >> either avoid OMP to avoid fork(). >> >> However, the thread stack issue does make me wonder if it's worth >> spending so much effort on cleaning up a few semaphores and mutexes. >> So I split the patch into two parts. The first enables the basic >> functionality and passes the test, but it doesn't even try to clean up >> the thread pool -- it just forgets that it existed and moves on. The >> second patch goes on top of the first, and adds the best-effort >> cleanup of synchronization objects and easily free-able heap. So patch >> #1 alone will do the job, and patch #2 is optional -- applying means >> we leak a bit yes, but does increase the chance of portability >> gremlins cropping up. >> >> Changelog for patch #1: >> >> 2014-10-19 Nathaniel J. Smith <njs@pobox.com> >> >> * libgomp.h (struct gomp_thread): New member we_are_forked. >> * team.c (gomp_thread_start): Add we_are_forked to gomp_thread >> initialization. >> (gomp_before_fork_callback) >> (gomp_after_fork_parent_callback): New functions. >> (initialize_team): Register atfork handlers. >> (gomp_team_start): Check for fork on entry, and clear thread state >> if found. >> * testsuite/libgomp.c/fork-1.c: New test. >> >> Changelog for patch #2: >> >> 2014-10-19 Nathaniel J. Smith <njs@pobox.com> >> >> * team.c (gomp_free_pool_helper): Move per-thread cleanup to main >> thread. >> (gomp_free_thread): Delegate implementation to... >> (gomp_free_thread_pool): ...this new function. Like old >> gomp_free_thread, but does per-thread cleanup, and has option to >> skip everything that involves interacting with actual threads, >> which is useful when called after fork. >> (gomp_team_start): Call gomp_free_thread_pool to release (some) >> resources after fork. >> >> -n >> >> -- >> Nathaniel J. Smith >> Postdoctoral researcher - Informatics - University of Edinburgh >> http://vorpus.org
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index a1482cc..ef3a7f4 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -441,6 +441,9 @@ struct gomp_thread /* User pthread thread pool */ struct gomp_thread_pool *thread_pool; + + /* This is to enable best-effort cleanup after fork. */ + int we_are_forked; }; diff --git a/libgomp/team.c b/libgomp/team.c index e6a6d8f..19b3cc8 100644 --- a/libgomp/team.c +++ b/libgomp/team.c @@ -86,6 +86,7 @@ gomp_thread_start (void *xdata) thr->ts = data->ts; thr->task = data->task; thr->place = data->place; + thr->we_are_forked = 0; thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release; @@ -266,6 +267,62 @@ gomp_free_thread (void *arg __attribute__((unused))) } } +/* According to POSIX, if a process which uses threads calls fork(), then + there are very few things that the resulting child process can do safely -- + mostly just exec(). + + However, in practice, (almost?) all POSIX implementations do allow + arbitrary code to run inside the child, *if* the parent process's threads + are in a well-defined state when the fork occurs. And this circumstance can + easily arise in OMP-using programs, e.g. when a library function like DGEMM + uses OMP internally, and some other unrelated part of the program calls + fork() at some other time, when no OMP sections are running. + + Therefore, we make a best effort attempt to handle the case: + + OMP section (in parent) -> OMP quiesce -> fork -> OMP section (in child) + + "Best-effort" here means that: + - Your system may or may not be able to handle this kind of code at all; + our goal is just to make sure that if it fails it's not gomp's fault. + - All threadprivate variables will be reset in the child. Fortunately this + is entirely compliant with the spec, according to the rule of nasal + demons. + - We must have minimal speed impact, and no correctness impact, on + compliant programs. + + Our approach is to use pthread_atfork to make a note whenever a fork() has + occurred. *All* we do is make a note. We can't actually shut down our + thread pool in the parent, because this would violate the OMP spec (it + would cause threadprivate variables to disappear whenever the process did + fork+exec). And we can't immediately shut it down in the child, because + that requires calling non-async-signal-safe functions, and thus would + violate POSIX in the case where the host program is just trying to + fork+exec. In fact, we can't even access our we_are_forked flag from the + child, because it's stored in TLS and accessing TLS on some platforms + requires a non-async-signal-safe call to pthread_getspecific(). So what we + do is set the flag in the parent just before calling fork, let the child + process inherit the flag, and then unset the flag in the parent. This is + safe, because in the parent the flag is only visible to the thread calling + fork(), and by the time fork() has returned the flag is set back to its + correct value. +*/ +static void +gomp_before_fork_callback (void) +{ + struct gomp_thread *thr = gomp_thread (); + /* Use increment/decrement to handle the case where the child of our child + enters an OMP section. */ + thr->we_are_forked++; +} + +static void +gomp_after_fork_parent_callback (void) +{ + struct gomp_thread *thr = gomp_thread (); + thr->we_are_forked--; +} + /* Launch a team. */ void @@ -287,6 +344,15 @@ gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads, struct gomp_thread **affinity_thr = NULL; thr = gomp_thread (); + if (__builtin_expect (thr->we_are_forked, 0)) + { + /* There was some parent process who was using OMP, and then called + fork(). We are the main thread of the resulting child process. Our + thread structure contains stale data referring to the parent thread + who called fork(). Reset it to reflect our new main-thread + status. (This leaks, but that's better than deadlocking.) */ + memset (thr, 0, sizeof(struct gomp_thread)); + } nested = thr->ts.team != NULL; if (__builtin_expect (thr->thread_pool == NULL, 0)) { @@ -925,6 +991,10 @@ initialize_team (void) if (pthread_key_create (&gomp_thread_destructor, gomp_free_thread) != 0) gomp_fatal ("could not create thread pool destructor."); + + pthread_atfork (gomp_before_fork_callback, + gomp_after_fork_parent_callback, + NULL); } static void __attribute__((destructor)) diff --git a/libgomp/testsuite/libgomp.c/fork-1.c b/libgomp/testsuite/libgomp.c/fork-1.c new file mode 100644 index 0000000..97bb391 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/fork-1.c @@ -0,0 +1,80 @@ +/* This test requires fork(). It ought to work everywhere that fork() does, + though. Unfortunately that is not so easy to write down... */ +/* { dg-do run + {target *-*-linux* *-*-gnu* *-*-freebsd* *-*-darwin* *-*-solaris* } } */ +/* { dg-timeout 10 } */ + +#include <omp.h> +#include <string.h> +#include <sys/wait.h> +#include <unistd.h> +#include <assert.h> + +static int saw[4]; + +static void +check_parallel (int exit_on_failure) +{ + memset (saw, 0, sizeof (saw)); + #pragma omp parallel num_threads (2) + { + int iam = omp_get_thread_num (); + saw[iam] = 1; + } + + // Encode failure in status code to report to parent process + if (exit_on_failure) + { + if (saw[0] != 1) + _exit(1); + else if (saw[1] != 1) + _exit(2); + else if (saw[2] != 0) + _exit(3); + else if (saw[3] != 0) + _exit(4); + else + _exit(0); + } + // Use regular assertions + else + { + assert (saw[0] == 1); + assert (saw[1] == 1); + assert (saw[2] == 0); + assert (saw[3] == 0); + } +} + +int +main () +{ + // Initialize the OMP thread pool in the parent process + check_parallel (0); + pid_t fork_pid = fork(); + if (fork_pid == -1) + return 1; + else if (fork_pid == 0) + { + // Call OMP again in the child process and encode failures in exit + // code. + check_parallel (1); + } + else + { + // Check that OMP runtime is still functional in parent process after + // the fork. + check_parallel (0); + + // Wait for the child to finish and check the exit code. + int child_status = 0; + pid_t wait_pid = wait(&child_status); + assert (wait_pid == fork_pid); + assert (WEXITSTATUS (child_status) == 0); + + // Check that the termination of the child process did not impact + // OMP in parent process. + check_parallel (0); + } + return 0; +}