diff mbox

ping x 7: [PATCH] [libgomp] make it possible to use OMP on both sides of a fork

Message ID CAPJVwB=xFZuuxxMX92Ru-EVM2APOjvrTLmuNfLsRL9e0UO7TDA@mail.gmail.com
State New
Headers show

Commit Message

Nathaniel Smith Oct. 19, 2014, 10:44 p.m. UTC
Hi Jakub,

Thanks for your feedback! See below.

On Thu, Oct 16, 2014 at 4:52 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Mon, Oct 13, 2014 at 10:16:19PM +0100, Nathaniel Smith wrote:
>> Got total silence the last 4 times I posted this, and users have been
>> bugging me about it offline, so trying again.
>>
>> This patch fixes a showstopper problem preventing the transparent use
>> of OpenMP in scientific libraries, esp. with Python. Specifically, it
>> is currently not possible to use GNU OpenMP -- even in a limited,
>> temporary manner -- in any program that uses (or might use) fork() for
>> parallelism, even if the fork() and the use of OpenMP occur at totally
>> different times. This limitation is unique to GNU OpenMP -- every
>> competing OpenMP implementation already contains something like this
>> patch. While technically not fully POSIX-compliant (because POSIX
>> gives much much weaker guarantees around fork() than any real Unix),
>> the approach used in this patch (a) performs only POSIX-compliant
>> operations when the host program is itself fully POSIX-compliant, and
>> (b) actually works perfectly reliably in practice on all commonly used
>> platforms I'm aware of.
>
> 1) gomp_we_are_forked in your patch will attempt to free the pool
>    of the thread that encounters it, which is racy; consider a program
>    after fork calling pthread_create several times, each thread
>    thusly created then ~ at the same time doing #pragma omp parallel
>    and the initial thread too.  You really should clean up the pool
>    data structure only in the initial thread and nowhere else;
>    for native TLS (non-emulated, IE model) the best would be to have a flag
>    in the gomp_thread_pool structure,
>    struct gomp_thread *thr = gomp_thread ();
>    if (thr && thr->thread_pool)
>      thr->thread_pool->after_fork = true;
>    should in that case be safe in the atfork child handler.
>    For !HAVE_TLS or emulated TLS not sure if it is completely safe,
>    it would call pthread_getspecific.  Perhaps just don't register
>    atfork handler on those targets at all?

Good point. The updated patch below takes a slightly different
approach. I moved we_are_forked to the per-thread struct, and then I
moved the setting of it into the *parent* process's fork handlers --
the before-fork handler toggles it to true, then the child spawns off
and inherits this setting, and then the parent after-fork handler
toggles it back again. (Since it's per-thread, there's no race
condition here.) This lets us remove the child after-fork handler
entirely, and -- since the parent handlers aren't subject to any
restrictions on what they can call -- it works on all platforms
regardless of the TLS implementation.

> 2) can you explain why are you removing the cleanups from
>    gomp_free_pool_helper ?

They aren't removed, but rather moved from the helper function (which
runs in the helper threads) into gomp_free_thread_pool (which runs in
the main thread) -- which makes it easier to run the appropriate
cleanups even in the case where the helper threads aren't running.
(But see below -- we might prefer to drop this part of the patch
entirely.)

> 3) you can call pthread_atfork many times (once for each pthread
>    that creates a thread pool), that is undesirable, you want to do that
>    only if the initial thread creates thread pool

Good point. I've moved the pthread_atfork call to initialize_team,
which is an __attribute__((constructor)).

I am a little uncertain whether this is the best approach, though,
because of the comment in team_destructor about wanting to correctly
handle dlopen/dlclose. One of pthread_atfork's many (many) limitations
is that there's no way to unregister handlers, so if dlopen/dlclose is
important (is it?) then we can't call pthread_atfork from
initialize_team.

If this is a problem, then we could delay the pthread_atfork until
e.g. the first thread pool is spawned -- would this be preferred?

> 4) the testcase is clearly not portable enough, should be probably limited
>    to *-*-linux* only, fork etc. will likely not work on many targets.

I think it should work on pretty much any target that has fork(); we
definitely care about having this functionality on e.g. OS X. I've
added some genericish target specifications.

> In any case, even with the patch, are you aware that you'll leak megabytes
> of thread stacks etc.?

Well, err, I wasn't, no :-). Thanks for pointing that out.

To me this does clinch the argument that a better approach would be
the one I suggested in
   https://gcc.gnu.org/ml/gcc-patches/2014-02/msg00979.html
i.e., of tracking whether any threadprivate variables were present,
and if not then simply shutting down the thread pools before forking.
But this would be a much more invasive change to gomp (I wouldn't know
where to start).

In the mean time, the current patch is still worthwhile. The cost is
not that bad: I wouldn't think of it as "leaking" so much as "overhead
of supporting OMP->fork->OMP". No-one forks a child which forks a
child which forks a child etc., so the cost is pretty much bounded in
practice. The most common use case is probably using fork() to spawn a
set of worker processes, which will end up COW-sharing the thread
stacks (which will just end up resting peacefully in swap). And when
doing computational work where working set sizes are often in the
gigabytes, spending a few megabytes is small change -- esp. compared
to the current cost, which involves first wasting hours of programmer
time figuring out why things are just locking up, and then (in many
cases) having to rewrite the code entirely because there's no fix for
this, you just have to redesign you parallelization architecture to
either avoid OMP to avoid fork().

However, the thread stack issue does make me wonder if it's worth
spending so much effort on cleaning up a few semaphores and mutexes.
So I split the patch into two parts. The first enables the basic
functionality and passes the test, but it doesn't even try to clean up
the thread pool -- it just forgets that it existed and moves on. The
second patch goes on top of the first, and adds the best-effort
cleanup of synchronization objects and easily free-able heap. So patch
#1 alone will do the job, and patch #2 is optional -- applying means
we leak a bit yes, but does increase the chance of portability
gremlins cropping up.

Changelog for patch #1:

2014-10-19  Nathaniel J. Smith  <njs@pobox.com>

        * libgomp.h (struct gomp_thread): New member we_are_forked.
        * team.c (gomp_thread_start): Add we_are_forked to gomp_thread
        initialization.
        (gomp_before_fork_callback)
        (gomp_after_fork_parent_callback): New functions.
        (initialize_team): Register atfork handlers.
        (gomp_team_start): Check for fork on entry, and clear thread state
        if found.
        * testsuite/libgomp.c/fork-1.c: New test.

Changelog for patch #2:

2014-10-19  Nathaniel J. Smith  <njs@pobox.com>

        * team.c (gomp_free_pool_helper): Move per-thread cleanup to main
        thread.
        (gomp_free_thread): Delegate implementation to...
        (gomp_free_thread_pool): ...this new function. Like old
        gomp_free_thread, but does per-thread cleanup, and has option to
        skip everything that involves interacting with actual threads,
        which is useful when called after fork.
        (gomp_team_start): Call gomp_free_thread_pool to release (some)
        resources after fork.

-n

Comments

Nathaniel Smith Nov. 6, 2014, 5:01 p.m. UTC | #1
Ping^2.

On Tue, Oct 28, 2014 at 6:17 PM, Nathaniel Smith <njs@pobox.com> wrote:
> Ping.
>
> On 19 Oct 2014 23:44, "Nathaniel Smith" <njs@pobox.com> wrote:
>>
>> Hi Jakub,
>>
>> Thanks for your feedback! See below.
>>
>> On Thu, Oct 16, 2014 at 4:52 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> > On Mon, Oct 13, 2014 at 10:16:19PM +0100, Nathaniel Smith wrote:
>> >> Got total silence the last 4 times I posted this, and users have been
>> >> bugging me about it offline, so trying again.
>> >>
>> >> This patch fixes a showstopper problem preventing the transparent use
>> >> of OpenMP in scientific libraries, esp. with Python. Specifically, it
>> >> is currently not possible to use GNU OpenMP -- even in a limited,
>> >> temporary manner -- in any program that uses (or might use) fork() for
>> >> parallelism, even if the fork() and the use of OpenMP occur at totally
>> >> different times. This limitation is unique to GNU OpenMP -- every
>> >> competing OpenMP implementation already contains something like this
>> >> patch. While technically not fully POSIX-compliant (because POSIX
>> >> gives much much weaker guarantees around fork() than any real Unix),
>> >> the approach used in this patch (a) performs only POSIX-compliant
>> >> operations when the host program is itself fully POSIX-compliant, and
>> >> (b) actually works perfectly reliably in practice on all commonly used
>> >> platforms I'm aware of.
>> >
>> > 1) gomp_we_are_forked in your patch will attempt to free the pool
>> >    of the thread that encounters it, which is racy; consider a program
>> >    after fork calling pthread_create several times, each thread
>> >    thusly created then ~ at the same time doing #pragma omp parallel
>> >    and the initial thread too.  You really should clean up the pool
>> >    data structure only in the initial thread and nowhere else;
>> >    for native TLS (non-emulated, IE model) the best would be to have a
>> > flag
>> >    in the gomp_thread_pool structure,
>> >    struct gomp_thread *thr = gomp_thread ();
>> >    if (thr && thr->thread_pool)
>> >      thr->thread_pool->after_fork = true;
>> >    should in that case be safe in the atfork child handler.
>> >    For !HAVE_TLS or emulated TLS not sure if it is completely safe,
>> >    it would call pthread_getspecific.  Perhaps just don't register
>> >    atfork handler on those targets at all?
>>
>> Good point. The updated patch below takes a slightly different
>> approach. I moved we_are_forked to the per-thread struct, and then I
>> moved the setting of it into the *parent* process's fork handlers --
>> the before-fork handler toggles it to true, then the child spawns off
>> and inherits this setting, and then the parent after-fork handler
>> toggles it back again. (Since it's per-thread, there's no race
>> condition here.) This lets us remove the child after-fork handler
>> entirely, and -- since the parent handlers aren't subject to any
>> restrictions on what they can call -- it works on all platforms
>> regardless of the TLS implementation.
>>
>> > 2) can you explain why are you removing the cleanups from
>> >    gomp_free_pool_helper ?
>>
>> They aren't removed, but rather moved from the helper function (which
>> runs in the helper threads) into gomp_free_thread_pool (which runs in
>> the main thread) -- which makes it easier to run the appropriate
>> cleanups even in the case where the helper threads aren't running.
>> (But see below -- we might prefer to drop this part of the patch
>> entirely.)
>>
>> > 3) you can call pthread_atfork many times (once for each pthread
>> >    that creates a thread pool), that is undesirable, you want to do that
>> >    only if the initial thread creates thread pool
>>
>> Good point. I've moved the pthread_atfork call to initialize_team,
>> which is an __attribute__((constructor)).
>>
>> I am a little uncertain whether this is the best approach, though,
>> because of the comment in team_destructor about wanting to correctly
>> handle dlopen/dlclose. One of pthread_atfork's many (many) limitations
>> is that there's no way to unregister handlers, so if dlopen/dlclose is
>> important (is it?) then we can't call pthread_atfork from
>> initialize_team.
>>
>> If this is a problem, then we could delay the pthread_atfork until
>> e.g. the first thread pool is spawned -- would this be preferred?
>>
>> > 4) the testcase is clearly not portable enough, should be probably
>> > limited
>> >    to *-*-linux* only, fork etc. will likely not work on many targets.
>>
>> I think it should work on pretty much any target that has fork(); we
>> definitely care about having this functionality on e.g. OS X. I've
>> added some genericish target specifications.
>>
>> > In any case, even with the patch, are you aware that you'll leak
>> > megabytes
>> > of thread stacks etc.?
>>
>> Well, err, I wasn't, no :-). Thanks for pointing that out.
>>
>> To me this does clinch the argument that a better approach would be
>> the one I suggested in
>>    https://gcc.gnu.org/ml/gcc-patches/2014-02/msg00979.html
>> i.e., of tracking whether any threadprivate variables were present,
>> and if not then simply shutting down the thread pools before forking.
>> But this would be a much more invasive change to gomp (I wouldn't know
>> where to start).
>>
>> In the mean time, the current patch is still worthwhile. The cost is
>> not that bad: I wouldn't think of it as "leaking" so much as "overhead
>> of supporting OMP->fork->OMP". No-one forks a child which forks a
>> child which forks a child etc., so the cost is pretty much bounded in
>> practice. The most common use case is probably using fork() to spawn a
>> set of worker processes, which will end up COW-sharing the thread
>> stacks (which will just end up resting peacefully in swap). And when
>> doing computational work where working set sizes are often in the
>> gigabytes, spending a few megabytes is small change -- esp. compared
>> to the current cost, which involves first wasting hours of programmer
>> time figuring out why things are just locking up, and then (in many
>> cases) having to rewrite the code entirely because there's no fix for
>> this, you just have to redesign you parallelization architecture to
>> either avoid OMP to avoid fork().
>>
>> However, the thread stack issue does make me wonder if it's worth
>> spending so much effort on cleaning up a few semaphores and mutexes.
>> So I split the patch into two parts. The first enables the basic
>> functionality and passes the test, but it doesn't even try to clean up
>> the thread pool -- it just forgets that it existed and moves on. The
>> second patch goes on top of the first, and adds the best-effort
>> cleanup of synchronization objects and easily free-able heap. So patch
>> #1 alone will do the job, and patch #2 is optional -- applying means
>> we leak a bit yes, but does increase the chance of portability
>> gremlins cropping up.
>>
>> Changelog for patch #1:
>>
>> 2014-10-19  Nathaniel J. Smith  <njs@pobox.com>
>>
>>         * libgomp.h (struct gomp_thread): New member we_are_forked.
>>         * team.c (gomp_thread_start): Add we_are_forked to gomp_thread
>>         initialization.
>>         (gomp_before_fork_callback)
>>         (gomp_after_fork_parent_callback): New functions.
>>         (initialize_team): Register atfork handlers.
>>         (gomp_team_start): Check for fork on entry, and clear thread state
>>         if found.
>>         * testsuite/libgomp.c/fork-1.c: New test.
>>
>> Changelog for patch #2:
>>
>> 2014-10-19  Nathaniel J. Smith  <njs@pobox.com>
>>
>>         * team.c (gomp_free_pool_helper): Move per-thread cleanup to main
>>         thread.
>>         (gomp_free_thread): Delegate implementation to...
>>         (gomp_free_thread_pool): ...this new function. Like old
>>         gomp_free_thread, but does per-thread cleanup, and has option to
>>         skip everything that involves interacting with actual threads,
>>         which is useful when called after fork.
>>         (gomp_team_start): Call gomp_free_thread_pool to release (some)
>>         resources after fork.
>>
>> -n
>>
>> --
>> Nathaniel J. Smith
>> Postdoctoral researcher - Informatics - University of Edinburgh
>> http://vorpus.org
diff mbox

Patch

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index a1482cc..ef3a7f4 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -441,6 +441,9 @@  struct gomp_thread
 
   /* User pthread thread pool */
   struct gomp_thread_pool *thread_pool;
+
+  /* This is to enable best-effort cleanup after fork.  */
+  int we_are_forked;
 };
 
 
diff --git a/libgomp/team.c b/libgomp/team.c
index e6a6d8f..19b3cc8 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -86,6 +86,7 @@  gomp_thread_start (void *xdata)
   thr->ts = data->ts;
   thr->task = data->task;
   thr->place = data->place;
+  thr->we_are_forked = 0;
 
   thr->ts.team->ordered_release[thr->ts.team_id] = &thr->release;
 
@@ -266,6 +267,62 @@  gomp_free_thread (void *arg __attribute__((unused)))
     }
 }
 
+/* According to POSIX, if a process which uses threads calls fork(), then
+   there are very few things that the resulting child process can do safely --
+   mostly just exec().
+
+   However, in practice, (almost?) all POSIX implementations do allow
+   arbitrary code to run inside the child, *if* the parent process's threads
+   are in a well-defined state when the fork occurs. And this circumstance can
+   easily arise in OMP-using programs, e.g. when a library function like DGEMM
+   uses OMP internally, and some other unrelated part of the program calls
+   fork() at some other time, when no OMP sections are running.
+
+   Therefore, we make a best effort attempt to handle the case:
+
+     OMP section (in parent) -> OMP quiesce -> fork -> OMP section (in child)
+
+   "Best-effort" here means that:
+   - Your system may or may not be able to handle this kind of code at all;
+     our goal is just to make sure that if it fails it's not gomp's fault.
+   - All threadprivate variables will be reset in the child. Fortunately this
+     is entirely compliant with the spec, according to the rule of nasal
+     demons.
+   - We must have minimal speed impact, and no correctness impact, on
+     compliant programs.
+
+   Our approach is to use pthread_atfork to make a note whenever a fork() has
+   occurred. *All* we do is make a note. We can't actually shut down our
+   thread pool in the parent, because this would violate the OMP spec (it
+   would cause threadprivate variables to disappear whenever the process did
+   fork+exec). And we can't immediately shut it down in the child, because
+   that requires calling non-async-signal-safe functions, and thus would
+   violate POSIX in the case where the host program is just trying to
+   fork+exec. In fact, we can't even access our we_are_forked flag from the
+   child, because it's stored in TLS and accessing TLS on some platforms
+   requires a non-async-signal-safe call to pthread_getspecific(). So what we
+   do is set the flag in the parent just before calling fork, let the child
+   process inherit the flag, and then unset the flag in the parent. This is
+   safe, because in the parent the flag is only visible to the thread calling
+   fork(), and by the time fork() has returned the flag is set back to its
+   correct value.
+*/
+static void
+gomp_before_fork_callback (void)
+{
+  struct gomp_thread *thr = gomp_thread ();
+  /* Use increment/decrement to handle the case where the child of our child
+     enters an OMP section. */
+  thr->we_are_forked++;
+}
+
+static void
+gomp_after_fork_parent_callback (void)
+{
+  struct gomp_thread *thr = gomp_thread ();
+  thr->we_are_forked--;
+}
+
 /* Launch a team.  */
 
 void
@@ -287,6 +344,15 @@  gomp_team_start (void (*fn) (void *), void *data, unsigned nthreads,
   struct gomp_thread **affinity_thr = NULL;
 
   thr = gomp_thread ();
+  if (__builtin_expect (thr->we_are_forked, 0))
+    {
+      /* There was some parent process who was using OMP, and then called
+	 fork(). We are the main thread of the resulting child process. Our
+	 thread structure contains stale data referring to the parent thread
+	 who called fork(). Reset it to reflect our new main-thread
+	 status. (This leaks, but that's better than deadlocking.)  */
+      memset (thr, 0, sizeof(struct gomp_thread));
+    }
   nested = thr->ts.team != NULL;
   if (__builtin_expect (thr->thread_pool == NULL, 0))
     {
@@ -925,6 +991,10 @@  initialize_team (void)
 
   if (pthread_key_create (&gomp_thread_destructor, gomp_free_thread) != 0)
     gomp_fatal ("could not create thread pool destructor.");
+
+  pthread_atfork (gomp_before_fork_callback,
+		  gomp_after_fork_parent_callback,
+		  NULL);
 }
 
 static void __attribute__((destructor))
diff --git a/libgomp/testsuite/libgomp.c/fork-1.c b/libgomp/testsuite/libgomp.c/fork-1.c
new file mode 100644
index 0000000..97bb391
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/fork-1.c
@@ -0,0 +1,80 @@ 
+/* This test requires fork(). It ought to work everywhere that fork() does,
+   though. Unfortunately that is not so easy to write down... */
+/* { dg-do run
+     {target *-*-linux* *-*-gnu* *-*-freebsd* *-*-darwin* *-*-solaris* } } */
+/* { dg-timeout 10 } */
+
+#include <omp.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <assert.h>
+
+static int saw[4];
+
+static void
+check_parallel (int exit_on_failure)
+{
+  memset (saw, 0, sizeof (saw));
+  #pragma omp parallel num_threads (2)
+  {
+    int iam = omp_get_thread_num ();
+    saw[iam] = 1;
+  }
+
+  // Encode failure in status code to report to parent process
+  if (exit_on_failure)
+    {
+      if (saw[0] != 1)
+        _exit(1);
+      else if (saw[1] != 1)
+        _exit(2);
+      else if (saw[2] != 0)
+        _exit(3);
+      else if (saw[3] != 0)
+        _exit(4);
+      else
+        _exit(0);
+  }
+  // Use regular assertions
+  else
+    {
+      assert (saw[0] == 1);
+      assert (saw[1] == 1);
+      assert (saw[2] == 0);
+      assert (saw[3] == 0);
+    }
+}
+
+int
+main ()
+{
+  // Initialize the OMP thread pool in the parent process
+  check_parallel (0);
+  pid_t fork_pid = fork();
+  if (fork_pid == -1)
+    return 1;
+  else if (fork_pid == 0)
+    {
+      // Call OMP again in the child process and encode failures in exit
+      // code.
+      check_parallel (1);
+    }
+  else
+    {
+      // Check that OMP runtime is still functional in parent process after
+      // the fork.
+      check_parallel (0);
+
+      // Wait for the child to finish and check the exit code.
+      int child_status = 0;
+      pid_t wait_pid = wait(&child_status);
+      assert (wait_pid == fork_pid);
+      assert (WEXITSTATUS (child_status) == 0);
+
+      // Check that the termination of the child process did not impact
+      // OMP in parent process.
+      check_parallel (0);
+    }
+  return 0;
+}