[RFC] pthread support for FUTEX_WAIT_MULTIPLE

Message ID: a56dd13f-910c-6ec2-648e-0a6fd46c1189@valvesoftware.com
State: New

Commit Message

Pierre-Loup A. Griffais July 31, 2019, 12:07 a.m. UTC
Hello,

I started putting together a patch to expose the new Linux futex 
functionality that recently got proposed for upstream inclusion. [1]

I pushed the patch to a repository [2] and am also including it in this 
email for early comments and feedback.

Since I'm unfamiliar with the generally accepted practice, I've held 
off on finishing it until I could get some feedback on a variety of points 
I'm unsure about.

The gist of it is that it adds a new function, 
pthread_mutex_timedlock_any(), that can take a list of mutexes. It 
returns when one of the mutexes has been locked (and tells you which one), 
or when the timeout expires. We would use this to reduce unnecessary 
wakeups and spinning in our thread pool synchronization code, compared 
to the current eventfd hack we rely on.

All mutexes passed to the list need to have matching types/kinds, as the 
kernel interface can only accept one PRIVATE flag for the whole operation.

Here is a grab bag of my questions, please bear with me:

  - I assume whichever name it might end up as should end with _np? Is 
there any specific process to follow for this sort of non-standard 
inclusion, other than just writing the code and documentation?

  - What is the standard way for an application to discover whether it 
can use an entrypoint dependent on a certain Linux kernel version? With 
our proposed use, we'd be fine running the function once at startup to 
pick which path to take, e.g. pthread_mutex_lock_any( NULL, 0, NULL, 
NULL ). If it returns 0, we'd enable the new path, otherwise we'd fall 
back to eventfd(). I have a TODO in the code where we could do that, but 
it's probably not the right way to do things.

  - I assume the way I'm exposing it as a 2.2.5-versioned symbol for 
local testing is wrong; what is the right way to do this?

  - I was planning on implementing the various mutex types with 
branches, and possibly having a couple of callsites into a version that 
takes flags, relying on compiler inlining to do the right thing. I see 
other implementations of similar functions opt 
for much more explicit inlining based on the mutex type. Would a version 
with branches be acceptable initially? We could further work on 
inlining/tuning if we found hotspots in the future.

  - In my tree I have a placeholder test application that should be 
replaced by a new mutex test. However, it would also be a good idea to 
leverage other mutex tests to test this new thing, since it's a superset 
of many other mutex calls. Could we make the normal mutex test suite run 
a second time, with a macro wrapping the normal pthread_lock with this 
implementation instead?

Thanks,

  - Pierre-Loup

[1] https://lkml.org/lkml/2019/7/30/1399

[2] 
https://github.com/Plagman/glibc/commit/3b01145fa25987f2f93e7eda7f3e7d0f2f77b290

From 3b01145fa25987f2f93e7eda7f3e7d0f2f77b290 Mon Sep 17 00:00:00 2001
From: "Pierre-Loup A. Griffais" <pgriffais@valvesoftware.com>
Date: Wed, 10 Jul 2019 19:47:52 -0700
Subject: [PATCH 1/2] Implement pthread_mutex_lock_any() and
  pthread_mutex_timedlock_any().

On newer Linux kernels, futex supports a WAIT_MULTIPLE operation that can be
used to implement new locking functionality that allows a given application
thread to sleep while waiting for one of multiple given locks to become
available.
---
  nptl/Makefile                                 |   1 +
  nptl/Versions                                 |   2 +
  nptl/pthreadP.h                               |   5 +
  nptl/pthread_mutex_lock_any.c                 |  37 ++++
  nptl/pthread_mutex_timedlock_any.c            | 193 ++++++++++++++++++
  sysdeps/nptl/pthread.h                        |  10 +
  sysdeps/unix/sysv/linux/lowlevellock-futex.h  |  15 ++
  .../sysv/linux/x86_64/64/libpthread.abilist   |   1 +
  8 files changed, 264 insertions(+)
  create mode 100644 nptl/pthread_mutex_lock_any.c
  create mode 100644 nptl/pthread_mutex_timedlock_any.c

Comments

Florian Weimer July 31, 2019, 8:14 a.m. UTC | #1
* Pierre-Loup A. Griffais:

> The gist of it is that it adds a new function,
> pthread_mutex_timedlock_any(), that can take a list of mutexes. It
> returns when one of the mutexes has been locked (and tells you which
> one), or when the timeout expires. We would use this to reduce
> unnecessary wakeups and spinning in our thread pool synchronization
> code, compared to the current eventfd hack we rely on.

This explains why the mutexes have to be contiguous in memory.  For
other applications, this looks like an unnecessary restriction.

>  - I assume whichever name it might end up as should end with _np? Is
> there any specific process to follow for this sort of non-standard
> inclusion, other than just writing the code and documentation?

It seems unlikely that this is ever going to be standardized, so I think
we'd need the _np suffix, yes.

>  - What is the standard way for an application to discover whether it
> can use an entrypoint dependent on a certain Linux kernel version?
> With our proposed use, we'd be fine running the function once at
> startup to pick which path to take, e.g. pthread_mutex_lock_any( NULL,
> 0, NULL, NULL ). If it returns 0, we'd enable the new path, otherwise
> we'd fall back to eventfd(). I have a TODO in the code where we could
> do that, but it's probably not the right way to do things.

I think you would have to probe on first use inside
pthread_mutex_lock_any, using a dummy call.

>  - I assume the way I'm exposing it as a 2.2.5-versioned symbol for
> local testing is wrong; what is the right way to do this?

This patch could be targeted at glibc 2.31, then you would have to use
GLIBC_2.31.
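
Concretely, that would mean a stanza along these lines in nptl/Versions (a sketch; placement and the exact target release are for the maintainers to decide):

```
libpthread {
  ...
  GLIBC_2.31 {
    pthread_mutex_lock_any; pthread_mutex_timedlock_any;
  }
}
```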

>  - In my tree I have a placeholder test application that should be
> replaced by a new mutex test. However, it would also be a good idea to
> leverage other mutex tests to test this new thing, since it's a
> superset of many other mutex calls. Could we make the normal mutex
> test suite run a second time, with a macro wrapping the normal
> pthread_lock with this implementation instead?

Due to the new ENOMEM error, the new function is not a strict superset.

I'm wondering if the current design is really the right one,
particularly for thread pools.  The implementation performs multiple
scans of the mutex lists, which look rather costly for large pools.
That's probably unavoidable if the list is dynamic and potentially
different for every call, but given the array-based interface, I don't
think this is the intended use.  Something that uses pre-registration of
the participating futexes could avoid that.  I also find it difficult to
believe that this approach beats something that involves queues, where a
worker thread that becomes available identifies itself directly to a
submitter thread, or can directly consume the submitted work item.

Thanks,
Florian
Szabolcs Nagy July 31, 2019, 10:01 a.m. UTC | #2
On 31/07/2019 01:07, Pierre-Loup A. Griffais wrote:
> I started putting together a patch to expose the new Linux futex functionality that recently got proposed for upstream inclusion. [1]
...
> 
> [1] https://lkml.org/lkml/2019/7/30/1399

i don't see that patch on the linux-api list where
userspace api related patches are discussed.

syscalls that have a time argument need extra care now
that 32bit targets will get a new 64bit time_t abi.

the futex syscall is multiplexed and intricately
related to the pthread implementation so there are
many reasons why such a patch should not be accepted
into linux before agreement with userspace.
Adhemerval Zanella July 31, 2019, 1:53 p.m. UTC | #3
On 31/07/2019 05:14, Florian Weimer wrote:
> * Pierre-Loup A. Griffais:
> 
>> The gist of it is that it adds a new function,
>> pthread_mutex_timedlock_any(), that can take a list of mutexes. It
>> returns when one of the mutexes has been locked (and tells you which
>> one), or when the timeout expires. We would use this to reduce
>> unnecessary wakeups and spinning in our thread pool synchronization
>> code, compared to the current eventfd hack we rely on.
> 
> This explains why the mutexes have to be contiguous in memory.  For
> other applications, this looks like an unnecessary restriction.
> 
>>  - I assume whichever name it might end up as should end with _np? Is
>> there any specific process to follow for this sort of non-standard
>> inclusion, other than just writing the code and documentation?
> 
> It seems unlikely that this is ever going to be standardized, so I think
> we'd need the _np suffix, yes.
> 
>>  - What is the standard way for an application to discover whether it
>> can use an entrypoint dependent on a certain Linux kernel version?
>> With our proposed use, we'd be fine running the function once at
>> startup to pick which path to take, e.g. pthread_mutex_lock_any( NULL,
>> 0, NULL, NULL ). If it returns 0, we'd enable the new path, otherwise
>> we'd fall back to eventfd(). I have a TODO in the code where we could
>> do that, but it's probably not the right way to do things.
> 
> I think you would have to probe on first use inside
> pthread_mutex_lock_any, using a dummy call.
> 
>>  - I assume the way I'm exposing it as a 2.2.5-versioned symbol for
>> local testing is wrong; what is the right way to do this?
> 
> This patch could be targeted at glibc 2.31, then you would have to use
> GLIBC_2.31.
> 
>>  - In my tree I have a placeholder test application that should be
>> replaced by a new mutex test. However, it would also be a good idea to
>> leverage other mutex tests to test this new thing, since it's a
>> superset of many other mutex calls. Could we make the normal mutex
>> test suite run a second time, with a macro wrapping the normal
>> pthread_lock with this implementation instead?
> 
> Due to the new ENOMEM error, the new function is not a strict superset.
> 
> I'm wondering if the current design is really the right one,
> particularly for thread pools.  The implementation performs multiple
> scans of the mutex lists, which look rather costly for large pools.
> That's probably unavoidable if the list is dynamic and potentially
> different for every call, but given the array-based interface, I don't
>> think this is the intended use.  Something that uses pre-registration of
> the participating futexes could avoid that.  I also find it difficult to
> believe that this approach beats something that involves queues, where a
> worker thread that becomes available identifies itself directly to a
> submitter thread, or can directly consume the submitted work item.

Also, I am trying to understand exactly what the new futex operation tries
to optimize.  The kernel patch adds some information about a Wine-specific
implementation, but it is not clear what exactly was tested regarding the
"Wine fsync implementation and executing multi-threaded applications", nor
by the "example code that exercises the FUTEX_WAIT_MULTIPLE path on a
multi-threaded" application.

Could you provide some actual source code for your 'thread pool
synchronization code', or at least point to the wine code that you are
aiming to optimize?
Pierre-Loup A. Griffais July 31, 2019, 9:58 p.m. UTC | #4
[replying to both inline below, let me know if just branching the thread
is generally preferable in the future]

On 7/31/19 1:14 AM, Florian Weimer wrote:
> * Pierre-Loup A. Griffais:
> 
>> The gist of it is that it adds a new function, 
>> pthread_mutex_timedlock_any(), that can take a list of mutexes. It 
>> returns when one of the mutexes has been locked (and tells you which 
>> one), or when the timeout expires. We would use this to reduce 
>> unnecessary wakeups and spinning in our thread pool
>> synchronization code, compared to the current eventfd hack we rely
>> on.
> 
> This explains why the mutexes have to be contiguous in memory.  For 
> other applications, this looks like an unnecessary restriction.

Agreed. Will switch the first argument to an array of pointers instead.

> 
>> - I assume whichever name it might end up as should end with _np?
>> Is there any specific process to follow for this sort of
>> non-standard inclusion, other than just writing the code and
>> documentation?
> 
> It seems unlikely that this is ever going to be standardized, so I
> think we'd need the _np suffix, yes.

Thanks for clarifying.

> 
>> - What is the standard way for an application to discover whether
>> it can use an entrypoint dependent on a certain Linux kernel
>> version? With our proposed use, we'd be fine running the function
>> once at startup to pick which path to take, e.g.
>> pthread_mutex_lock_any( NULL, 0, NULL, NULL ). If it returns 0,
>> we'd enable the new path, otherwise we'd fall back to eventfd(). I
>> have a TODO in the code where we could do that, but it's probably
>> not the right way to do things.
> 
> I think you would have to probe on first use inside 
> pthread_mutex_lock_any, using a dummy call.

OK, that was my original plan, I can finish writing that up.

> 
>> - I assume the way I'm exposing it as a 2.2.5-versioned symbol for 
>> local testing is wrong; what is the right way to do this?
> 
> This patch could be targeted at glibc 2.31, then you would have to
> use GLIBC_2.31.
> 
>> - In my tree I have a placeholder test application that should be 
>> replaced by a new mutex test. However, it would also be a good idea
>> to leverage other mutex tests to test this new thing, since it's a 
>> superset of many other mutex calls. Could we make the normal mutex 
>> test suite run a second time, with a macro wrapping the normal 
>> pthread_lock with this implementation instead?
> 
> Due to the new ENOMEM error, the new function is not a strict
> superset.

True, thanks for pointing that out. Any sort of attempted wrapping would
have to account for that somehow.

> 
> I'm wondering if the current design is really the right one, 
> particularly for thread pools.  The implementation performs multiple 
> scans of the mutex lists, which look rather costly for large pools. 
> That's probably unavoidable if the list is dynamic and potentially 
> different for every call, but given the array-based interface, I
> don't think this is the intended use.  Something that uses
> pre-registration of the participating futexes could avoid that.  I
> also find it difficult to believe that this approach beats something
> that involves queues, where a worker thread that becomes available
> identifies itself directly to a submitter thread, or can directly
> consume the submitted work item.

That might be interesting if walking the lists turns out to be a
hotspot. I think the mutex count per operation would typically not be 
thousands, and probably not hundreds either.

I would think there's still a queue somewhere to acquire jobs, this 
would be used before and after. For instance, job threads want to sleep 
until work has been queued, or another system event occurs that might 
require them to wake up, like app shutdown or scene transition. 
Similarly, after firing off N jobs, the job manager will want to sleep 
until one of the jobs is complete to perform some accounting and publish 
the results to other systems. For both of these use cases, using eventfd 
to wait for multiple events seems to result in more CPU spinning than 
the futex-based solution, both in userspace and the kernel.

> 
> Thanks, Florian
> 

On 7/31/19 3:01 AM, Szabolcs Nagy wrote:
> On 31/07/2019 01:07, Pierre-Loup A. Griffais wrote:
>> I started putting together a patch to expose the new Linux futex
>> functionality that recently got proposed for upstream inclusion.
>> [1]
> ...
>> 
>> [1] https://lkml.org/lkml/2019/7/30/1399
> 
> i don't see that patch on the linux-api list where userspace api
> related patches are discussed.
> 
> syscalls that have a time argument need extra care now that 32bit
> targets will get a new 64bit time_t abi.

Thanks, looks like there were compat concerns raised on the kernel side 
as well; we can copy linux-api for the next patch iteration.

> 
> the futex syscall is multiplexed and intricately related to the
> pthread implementation so there are many reasons why such a patch
> should not be accepted into linux before agreement with userspace.

What does that process typically look like, other than raising it on 
both ends like we did?

Thanks for all the feedback!
  - Pierre-Loup

>
Szabolcs Nagy Aug. 1, 2019, 8:49 a.m. UTC | #5
On 31/07/2019 22:58, Pierre-Loup A. Griffais wrote:
> On 7/31/19 3:01 AM, Szabolcs Nagy wrote:
>> the futex syscall is multiplexed and intricately related to the
>> pthread implementation so there are many reasons why such a patch
>> should not be accepted into linux before agreement with userspace.
> 
> What does that process typically look like, other than raising it on both ends like we did?

such a kernel patch has to be visible to people who care about
the linux userspace api; i think the linux-api list is the
closest, and if the patch needs eyes from glibc developers
then cc libc-alpha too (e.g. a patch documenting new syscall
behaviour on the kernel side should ideally be reviewed on
libc-alpha too, to ensure the semantics are ok from a libc pov)
Florian Weimer Aug. 1, 2019, 9:39 a.m. UTC | #6
* Pierre-Loup A. Griffais:

> I would think there's still a queue somewhere to acquire jobs, this
> would be used before and after. For instance, job threads want to
> sleep until work has been queued, or another system event occurs that
> might require them to wake up, like app shutdown or scene
> transition. Similarly, after firing off N jobs, the job manager will
> want to sleep until one of the jobs is complete to perform some
> accounting and publish the results to other systems. For both of these
> use cases, using eventfd to wait for multiple events seems to result in
> more CPU spinning than the futex-based solution, both in userspace and
> the kernel.

Why do you consider eventfd the only viable alternative?  If you want a
futex-based solution today, you can use condition variables.  It won't
give you the theoretical minimum of context switches, but neither does
FUTEX_WAIT_MULTIPLE, as far as I can tell.

Thanks,
Florian
Pierre-Loup A. Griffais Aug. 1, 2019, 9:29 p.m. UTC | #7
On 8/1/19 2:39 AM, Florian Weimer wrote:
> * Pierre-Loup A. Griffais:
> 
>> I would think there's still a queue somewhere to acquire jobs, this
>> would be used before and after. For instance, job threads want to
>> sleep until work has been queued, or another system event occurs that
>> might require them to wake up, like app shutdown or scene
>> transition. Similarly, after firing off N jobs, the job manager will
>> want to sleep until one of the jobs is complete to perform some
>> accounting and publish the results to other systems. For both of these
>> use cases, using eventfd to wait for multiple events seems to result in
>> more CPU spinning than the futex-based solution, both in userspace and
>> the kernel.
> 
> Why do you consider eventfd the only viable alternative?  If you want a
> futex-based solution today, you can use condition variables.  It won't
> give you the theoretical minimum of context switches, but neither does
> FUTEX_WAIT_MULTIPLE, as far as I can tell.

I think there are two main aspects to this; one is purely technical and I 
can try to speak to it a bit below:

Unless I'm missing something, in the scenario where I have N job threads 
(where N is probably close to the CPU count on the machine) processing 
work and needing to report back to a job system, then promptly get back 
to work, wouldn't trying to implement this with condition variables 
introduce unwanted contention between the job threads at reporting time 
that wouldn't exist otherwise? Or unwanted spinning on the job manager side.

So I do think it would give us quite a few more context switches and more 
overhead, which we're trying to reduce. Spinning more for the same 
latency is not desirable for gaming, as power usage also factors in.

When you say the proposed futex approach is still not necessarily 
optimal, do you mean compared to a completely lockless design on the app 
side? Do you agree it would still be more efficient than using cond 
vars? (And that would require some redesign work on the app side as well.)

The other aspect is that we're dealing with a bunch of applications that 
have their threading model already defined by the time they target 
Linux. They can use WaitForMultiple() on Windows and kqueue on macOS; 
they typically opt for lower-performance "emulation" of the desired 
behavior using multiple mutexes, condition variables, or eventfd. I 
think this new primitive would let them quickly port to something that 
is as efficient as, or more efficient than, their starting point.

Thanks,
  - Pierre-Loup

> 
> Thanks,
> Florian
>
Florian Weimer Aug. 2, 2019, 10:12 a.m. UTC | #8
* Pierre-Loup A. Griffais:

> On 8/1/19 2:39 AM, Florian Weimer wrote:
>> * Pierre-Loup A. Griffais:
>>
>>> I would think there's still a queue somewhere to acquire jobs, this
>>> would be used before and after. For instance, job threads want to
>>> sleep until work has been queued, or another system event occurs that
>>> might require them to wake up, like app shutdown or scene
>>> transition. Similarly, after firing off N jobs, the job manager will
>>> want to sleep until one of the jobs is complete to perform some
>>> accounting and publish the results to other systems. For both of these
>>> use cases, using eventfd to wait for multiple events seems to result in
>>> more CPU spinning than the futex-based solution, both in userspace and
>>> the kernel.
>>
>> Why do you consider eventfd the only viable alternative?  If you want a
>> futex-based solution today, you can use condition variables.  It won't
>> give you the theoretical minimum of context switches, but neither does
>> FUTEX_WAIT_MULTIPLE, as far as I can tell.
>
> I think there are two main aspects to this; one is purely technical and
> I can try to speak to it a bit below:
>
> Unless I'm missing something, in the scenario where I have N (where N
> is probably close to the CPU count on the machine) job threads
> processing work and needing to report back to a job system, then
> promptly get back to work, wouldn't trying to implement this with
> condition variables introduce unwanted contention between the job
> threads at reporting that wouldn't exist otherwise? Or, unwanted
> spinning on the job manager side.

I don't really see how FUTEX_WAIT_MULTIPLE supports this nicely.  Since
the mutex you get back is arbitrary, you cannot unconditionally enqueue
a job on that worker thread because you might always get back the same
worker thread, leading to unbounded queue growth.  So the submitting
thread needs to synchronize with the chosen worker thread anyway.

And the contention is there with FUTEX_WAIT_MULTIPLE as well, except
that you don't have one very hot cache line, but several cache lines
bouncing around between CPUs.

I guess the architecture just isn't clear to me.

> The other aspect is that we're dealing with a bunch of applications
> that have their threading model already defined by the time they
> target Linux. They can use WaitForMultiple() on Windows and kqueue on
> macOS; they typically opt for lower-performance "emulation" of the
> desired behavior using multiple mutexes, condition variables, or
> eventfd. I think this new primitive would let them quickly port to
> something that is as efficient as, or more efficient than, their
> starting point.

I really doubt that a futex-based WaitForMultipleObjects implementation
can have equally efficient performance, unless the Windows
implementation is really, really bad.  It's hard to optimize
WaitForMultipleObjects (just as it is hard to optimize select/poll), but
I think it's impossible if all you have is an array of concrete futexes
with fixed semantics.

Thanks,
Florian

Patch

diff --git a/nptl/Makefile b/nptl/Makefile
index 0567e77a790..990315c292f 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -68,6 +68,7 @@  libpthread-routines = nptl-init nptlfreeres vars events version pt-interp \
                pthread_getattr_np \
                pthread_mutex_init pthread_mutex_destroy \
                pthread_mutex_lock pthread_mutex_trylock \
+              pthread_mutex_lock_any pthread_mutex_timedlock_any \
                pthread_mutex_timedlock pthread_mutex_unlock \
                pthread_mutex_cond_lock \
                pthread_mutexattr_init pthread_mutexattr_destroy \
diff --git a/nptl/Versions b/nptl/Versions
index f2ea2b32a1b..1ec7ffffeb6 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -274,6 +274,8 @@  libpthread {
     mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy;
      call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal;
      cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set;
+
+    pthread_mutex_lock_any; pthread_mutex_timedlock_any;
    }

    GLIBC_2.30 {
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index d80662ab07b..d6e89c0a53d 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -390,6 +390,11 @@  extern int __pthread_mutex_trylock (pthread_mutex_t *_mutex);
  extern int __pthread_mutex_lock (pthread_mutex_t *__mutex);
  extern int __pthread_mutex_timedlock (pthread_mutex_t *__mutex,
       const struct timespec *__abstime);
+extern int __pthread_mutex_lock_any (pthread_mutex_t *__mutex, int mutexcount,
+                     int *outlocked);
+extern int __pthread_mutex_timedlock_any (pthread_mutex_t *__mutex, int count,
+                      const struct timespec *__abstime,
+                      int *outlocked);
  extern int __pthread_mutex_cond_lock (pthread_mutex_t *__mutex)
       attribute_hidden;
  extern void __pthread_mutex_cond_lock_adjust (pthread_mutex_t *__mutex)
diff --git a/nptl/pthread_mutex_lock_any.c b/nptl/pthread_mutex_lock_any.c
new file mode 100644
index 00000000000..485c213a172
--- /dev/null
+++ b/nptl/pthread_mutex_lock_any.c
@@ -0,0 +1,37 @@ 
+/* Copyright (C) 2002-2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/param.h>
+#include <not-cancel.h>
+#include "pthreadP.h"
+#include <atomic.h>
+#include <lowlevellock.h>
+#include <stap-probe.h>
+
+int
+__pthread_mutex_lock_any (pthread_mutex_t *mutexlist, int mutexcount,
+              int *outlocked)
+{
+  return __pthread_mutex_timedlock_any(mutexlist, mutexcount, NULL, outlocked);
+}
+
+weak_alias (__pthread_mutex_lock_any, pthread_mutex_lock_any)
diff --git a/nptl/pthread_mutex_timedlock_any.c b/nptl/pthread_mutex_timedlock_any.c
new file mode 100644
index 00000000000..a95687ce8e1
--- /dev/null
+++ b/nptl/pthread_mutex_timedlock_any.c
@@ -0,0 +1,193 @@ 
+/* Copyright (C) 2002-2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/param.h>
+#include <not-cancel.h>
+#include "pthreadP.h"
+#include <atomic.h>
+#include <lowlevellock.h>
+#include <stap-probe.h>
+
+/* TODO: this probably comes from a kernel header when upstream? */
+struct futex_wait_block {
+    int *uaddr;
+    int val;
+    int bitset;
+} __attribute__((packed));
+
+int
+__pthread_mutex_timedlock_any (pthread_mutex_t *mutexlist, int mutexcount,
+                   const struct timespec *abstime, int *outlocked)
+{
+  /* This requires futex support */
+#ifndef __NR_futex
+  return ENOTSUP;
+#endif
+
+  if (mutexlist == NULL)
+  {
+    /* User is asking us if kernel supports the feature. */
+
+    /* TODO: how does one check if supported?
+     * I was thinking of trying the ioctl once and then returning the static
+     * cached value, is that OK?
+     */
+    return 0;
+  }
+
+  if (mutexlist != NULL && mutexcount <= 0)
+    return EINVAL;
+
+  if (outlocked == NULL)
+    return EINVAL;
+
+  int type = PTHREAD_MUTEX_TYPE (mutexlist);
+
+  for (int i = 1; i < mutexcount; i++)
+  {
+    /* Types have to match, since the PRIVATE flag is OP-global. */
+    if (PTHREAD_MUTEX_TYPE (&mutexlist[i]) != type)
+      return EINVAL;
+  }
+
+  int kind = type & PTHREAD_MUTEX_KIND_MASK_NP;
+
+  /* TODO: implement recursive, errorcheck and adaptive. */
+  if (kind != PTHREAD_MUTEX_NORMAL)
+    return EINVAL;
+
+  /* TODO: implement robust. */
+  if (type & PTHREAD_MUTEX_ROBUST_NORMAL_NP)
+    return EINVAL;
+
+  /* TODO: implement PI. */
+  if (type & PTHREAD_MUTEX_PRIO_INHERIT_NP)
+    return EINVAL;
+
+  /* TODO: implement PP. */
+  if (type & PTHREAD_MUTEX_PRIO_PROTECT_NP)
+    return EINVAL;
+
+  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
+  int result;
+
+  result = -1;
+
+  for (int i = 0; i < mutexcount; i++)
+  {
+    if (__lll_trylock (&mutexlist[i].__data.__lock) == 0)
+    {
+      result = i;
+      break;
+    }
+  }
+
+  while (result == -1)
+  {
+    for (int i = 0; i < mutexcount; i++)
+    {
+      int oldval = atomic_exchange_acq (&mutexlist[i].__data.__lock, 2);
+
+      if (oldval == 0)
+      {
+    result = i;
+    break;
+      }
+    }
+
+    if (result == -1)
+    {
+      /* Couldn't get one of the locks immediately, we have to sleep now. */
+      struct timespec *timeout = NULL;
+      struct timespec rt;
+
+      if (abstime != NULL)
+      {
+    /* Reject invalid timeouts. */
+    if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
+      return EINVAL;
+
+    struct timeval tv;
+
+    /* Get the current time.  */
+    (void) __gettimeofday (&tv, NULL);
+
+    /* Compute relative timeout.  */
+    rt.tv_sec = abstime->tv_sec - tv.tv_sec;
+    rt.tv_nsec = abstime->tv_nsec - tv.tv_usec * 1000;
+    if (rt.tv_nsec < 0)
+    {
+      rt.tv_nsec += 1000000000;
+      --rt.tv_sec;
+    }
+
+    if (rt.tv_sec < 0)
+      return ETIMEDOUT;
+
+    timeout = &rt;
+      }
+
+      struct futex_wait_block waitblock[mutexcount];
+
+      for (int i = 0; i < mutexcount; i++)
+      {
+    waitblock[i].uaddr = &mutexlist[i].__data.__lock;
+    waitblock[i].val = 2;
+    waitblock[i].bitset = ~0;
+      }
+
+      long int __ret;
+
+      /* Safe to use the flag for the first one, since all their types match. */
+      int private_flag = PTHREAD_MUTEX_PSHARED (&mutexlist[0]);
+
+      __ret = lll_futex_timed_wait_multiple (waitblock, mutexcount, timeout,
+                                             private_flag);
+
+      if (__ret < 0)
+    return -__ret; /* TODO is this correct? */
+
+      /* Have slept, try grabbing the one that woke us up? */
+      if (atomic_exchange_acq (&mutexlist[__ret].__data.__lock, 2) == 0)
+      {
+    /* We got it, done, loop will end below. */
+    result = __ret;
+      }
+    }
+  }
+
+  if (result != -1)
+  {
+    /* Record the ownership. */
+    mutexlist[result].__data.__owner = id;
+    ++mutexlist[result].__data.__nusers;
+
+    /* Let the user know which mutex is now locked. */
+    *outlocked = result;
+
+    result = 0;
+  }
+
+  return result;
+}
+
+weak_alias (__pthread_mutex_timedlock_any, pthread_mutex_timedlock_any)
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index a767d6f1808..d9e500f9877 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -763,7 +763,17 @@  extern int pthread_mutex_trylock (pthread_mutex_t *__mutex)
  extern int pthread_mutex_lock (pthread_mutex_t *__mutex)
       __THROWNL __nonnull ((1));

+/* Lock any one of several mutexes.  */
+extern int pthread_mutex_lock_any (pthread_mutex_t *__mutexlist,
+                                   int __mutexcount, int *__outlocked);
+
  #ifdef __USE_XOPEN2K
+/* Lock any one of several mutexes, with timeout.  */
+extern int pthread_mutex_timedlock_any (pthread_mutex_t *__mutexlist,
+                                        int __mutexcount,
+                                        const struct timespec *__restrict
+                                        __abstime, int *__outlocked);
+
  /* Wait until lock becomes available, or specified time passes. */
  extern int pthread_mutex_timedlock (pthread_mutex_t *__restrict __mutex,
                      const struct timespec *__restrict
diff --git a/sysdeps/unix/sysv/linux/lowlevellock-futex.h b/sysdeps/unix/sysv/linux/lowlevellock-futex.h
index cfa796be2bd..28140721648 100644
--- a/sysdeps/unix/sysv/linux/lowlevellock-futex.h
+++ b/sysdeps/unix/sysv/linux/lowlevellock-futex.h
@@ -38,6 +38,7 @@ 
  #define FUTEX_WAKE_BITSET    10
  #define FUTEX_WAIT_REQUEUE_PI   11
  #define FUTEX_CMP_REQUEUE_PI    12
+#define FUTEX_WAIT_MULTIPLE     13
  #define FUTEX_PRIVATE_FLAG    128
  #define FUTEX_CLOCK_REALTIME    256

@@ -74,6 +75,15 @@ 
       ? -INTERNAL_SYSCALL_ERRNO (__ret, __err) : 0);                     \
    })

+#define lll_futex_syscall_ret(nargs, futexp, op, ...)                   \
+  ({                                                                    \
+    INTERNAL_SYSCALL_DECL (__err);                                      \
+    long int __ret = INTERNAL_SYSCALL (futex, __err, nargs, futexp, op, \
+                       __VA_ARGS__);                    \
+    (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret, __err))         \
+     ? -INTERNAL_SYSCALL_ERRNO (__ret, __err) : __ret);                 \
+  })
+
  #define lll_futex_wait(futexp, val, private) \
    lll_futex_timed_wait (futexp, val, NULL, private)

@@ -82,6 +92,11 @@ 
               __lll_private_flag (FUTEX_WAIT, private),  \
               val, timeout)

+#define lll_futex_timed_wait_multiple(futexp, val, timeout, private)     \
+  lll_futex_syscall_ret (4, futexp,                                      \
+             __lll_private_flag (FUTEX_WAIT_MULTIPLE, private), \
+             val, timeout)
+
  /* Verify whether the supplied clockid is supported by
     lll_futex_clock_wait_bitset.  */
  #define lll_futex_supported_clockid(clockid)            \
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
index 297fec9686d..f9bf1a40509 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
@@ -126,6 +126,7 @@  GLIBC_2.2.5 pthread_kill_other_threads_np F
  GLIBC_2.2.5 pthread_mutex_destroy F
  GLIBC_2.2.5 pthread_mutex_init F
  GLIBC_2.2.5 pthread_mutex_lock F
+GLIBC_2.2.5 pthread_mutex_lock_any F
  GLIBC_2.2.5 pthread_mutex_timedlock F
  GLIBC_2.2.5 pthread_mutex_trylock F
  GLIBC_2.2.5 pthread_mutex_unlock F