[v2,1/3] Mutex: Accelerate lock acquisition by queuing spinner

Message ID 1545286557-20901-1-git-send-email-kemi.wang@intel.com
State New
Headers show
Series
  • [v2,1/3] Mutex: Accelerate lock acquisition by queuing spinner
Related show

Commit Message

kemi Dec. 20, 2018, 6:15 a.m.
Adaptive mutex indicates the semantic of spinning for a while before
calling into the kernel to block. Thus, the lock of adaptive mutex could be
acquired via immediate getting, spinning getting or waking up.

Currently, the spin-waiting algorithm of adaptive mutex is for each
processor to repeatedly execute an test_and_set instruction until either
the maximum spin count is reached or the  lock is acquired. However, the
lock performance via spinning getting will degrade significantly as the
number of spinning processors increases. Two factors at least cause this
degradation[1]. First, in order to release the lock, the lock holder has to
contend with spinning processors for exclusive access of lock cache line.
(E.g. "lock; decl 0%" of lll_unlock() in
sysdeps/unix/sysv/linux/x86_64/lowlevellock.h for pthread_mutex_unlock()).
For most multiprocessor architectures, it has to wait behind those
test_and_set instructions from spinning processors.
Furthermore, on these invalidation-based cache coherency system,
test_and_set instruction will trigger a "read-for-ownership" request for
exclusive access to the lock cache line, it will potentially slow the
access of other locations(shares the cache line with the lock at least) by
the lock holder.

In this patch, we propose another spin-waiting algorithm to accelerate
lock acquisition by queuing spinners based-on MCS[2] lock. With MCS
algorithm, only the header of queue is allowed to spin on the mutex lock,
others spin on locally-accessible flag variables. Thus, the previous
negative factors are eliminated.

The implementation of this MCS-based spin-waiting algorithm requires an
additional pointer to hold the tail of queue. To keep the size of mutex
data structure consistent and maintain the user space ABI unchanged, the
__list field which is originally for implementing robust futex will be
reused. Therefore, we propose a new type mutex with GNU extension
PTHREAD_MUTEX_QUEUESPINNER_NP in this patch.

Pass the ABI compatibility test by running "make check-abi"

The benchmark is available at branch mutex_testing in
https://github.com/kemicoco/tst-spinlock.git.
The tunable pthread.mutex_spin_count is set to 10000 which is big enough in
our testing for fair comparison.

The first test case emulates a practical workload with mathematical
calculation.
The second test case provided by our customer emulates the lock contention
in a distributed file system.

Each workload runs with multiple threads in parallel, each workload does
a) Acquire the lock;
b) Do some work in critical section;
c) Unlock
d) Wait a random time (0~3000 TSCs)
in a loop until 5 seconds, and obtain the total iterations.

Testing on 2s-Skylake platform (112 cores with 62G RAM, HT=on).
TC1: Hashwork
Thread num      adaptive mutex          mcs mutex
	1               7297792             7298755 (+0.01%)
	2               9332105             9752901 (+4.51%)
	3               10428251            11029771 (+5.7%)
	4               10572596            11203997 (+5.9%)
	5               10496815            11008433 (+4.8%)
	6               10292946            10569467 (+2.6%)
	7               9861111             10153538 (+2.97%)
	14              5845303             9756283  (+66.91%)
	28              4299209             8985135  (+109.0%)
	56              3298821             5747645  (+74.23%)
	112             2941309             5629700  (+91.4%)
	448             2821056             3716799  (+31.75%)

TC2: Test_and_set instruction on shared variables
Thread num      adaptive mutex          mcs mutex
	1               7748765             7751963 (+0.04%)
	2               8521596             9251649 (+8.57%)
	3               9594653             9890211 (+3.08%)
	4               9934361             9800205 (-1.35%)
	5               8146007             9597982 (+17.82%)
	6               6896436             9367882 (+35.84%)
	7               5943880             9159893 (+54.11%)
	14              4194305             8623029 (+105.59%)
	28              3374218             7818631 (+131.72%)
	56              2533912             4836622 (+90.88%)
	112             2541950             4850938 (+90.84%)
	448             2507000             5345149 (+113.21%)

Test result on workstation(16 cores with 32G RAM, HT=on)
TC1: Hashwork
Thread num      adaptive mutex          mcs mutex
	1               11419169            11430369 (+0.1%)
	2               15364452            15873667 (+3.31%)
	3               17234014            17547329 (+1.82%)
	4               17711736            17613548 (-0.55%)
	5               16583392            17626707 (+6.29%)
	6               14855586            17305468 (+16.49%)
	7               12948851            17130807 (+32.3%)
	14              8698172             15322237 (+76.15%)
	16              8123930             14937645 (+83.87%)
	64              7946006             5685132 (-28.45%)

TC2: Test_and_set instruction on shared variables
Thread num      adaptive mutex          mcs mutex
	1               12535156            12595629 (+0.48%)
	2               15665576            15929652 (+1.69%)
	3               17469331            16881225 (-3.37%)
	4               14833035            15777572 (+6.37%)
	5               12376033            15256528 (+23.27%)
	6               10568899            14693661 (+39.03%)
	7               9657775             14486039 (+49.99%)
	14              8015061             14112508 (+76.07%)
	16              7641725             13935473 (+82.36%)
	64              7571112             7735482 (+2.17%)

Potential issues:
a) As the preemption can't be disabled at userland during spinning, MCS
style lock potentially has the risk to collapse lock performance when CPUs
are heavily oversubscribed. But generally, MCS-based spin-waiting algorithm
performs much better than the existed one. We may consider to mitigate this
issue by using a cancellable MCS to prevent unnecessary active waiting in
future.

Reference:
[1]"The performance of spin lock alternatives for shared-memory
multiprocessors"
https://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/anderson-spinlock.pdf.
[2]"Algorithms for scalable synchronization on shared-memory
multiprocessors"
http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf

V1->V2:
  a) Add one line description before copyright in new files
  b) Add one entry to introduce this new type of mutex with GNU
  PTHREAD_MUTEX_QUEUESPINNER_NP extension in NEWS

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 NEWS                                    |  6 +++
 nptl/Makefile                           |  3 +-
 nptl/allocatestack.c                    |  2 +-
 nptl/descr.h                            | 26 ++++++------
 nptl/mcs_lock.c                         | 72 +++++++++++++++++++++++++++++++++
 nptl/mcs_lock.h                         | 23 +++++++++++
 nptl/nptl-init.c                        |  2 +-
 nptl/pthreadP.h                         |  2 +-
 nptl/pthread_mutex_init.c               |  5 +++
 nptl/pthread_mutex_lock.c               | 34 +++++++++++++++-
 nptl/pthread_mutex_timedlock.c          | 31 ++++++++++++--
 nptl/pthread_mutex_trylock.c            |  5 ++-
 nptl/pthread_mutex_unlock.c             |  7 +++-
 nptl/pthread_mutexattr_settype.c        |  2 +-
 sysdeps/nptl/bits/thread-shared-types.h | 21 +++++++---
 sysdeps/nptl/pthread.h                  | 15 ++++---
 sysdeps/unix/sysv/linux/hppa/pthread.h  |  4 ++
 17 files changed, 223 insertions(+), 37 deletions(-)
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h

Comments

kemi Dec. 21, 2018, 10:17 a.m. | #1
Hi, Dear maintainers
  I hope this patchset can catch up 2.29 release cycle, could you help
to get it reviewed. If any question, I will resolve it ASAP. Thanks

On 2018/12/20 下午2:15, Kemi Wang wrote:
> Adaptive mutex indicates the semantic of spinning for a while before
> calling into the kernel to block. Thus, the lock of adaptive mutex could be
> acquired via immediate getting, spinning getting or waking up.
> 
> Currently, the spin-waiting algorithm of adaptive mutex is for each
> processor to repeatedly execute an test_and_set instruction until either
> the maximum spin count is reached or the  lock is acquired. However, the
> lock performance via spinning getting will degrade significantly as the
> number of spinning processors increases. Two factors at least cause this
> degradation[1]. First, in order to release the lock, the lock holder has to
> contend with spinning processors for exclusive access of lock cache line.
> (E.g. "lock; decl 0%" of lll_unlock() in
> sysdeps/unix/sysv/linux/x86_64/lowlevellock.h for pthread_mutex_unlock()).
> For most multiprocessor architectures, it has to wait behind those
> test_and_set instructions from spinning processors.
> Furthermore, on these invalidation-based cache coherency system,
> test_and_set instruction will trigger a "read-for-ownership" request for
> exclusive access to the lock cache line, it will potentially slow the
> access of other locations(shares the cache line with the lock at least) by
> the lock holder.
>
Siddhesh Poyarekar Dec. 31, 2018, 5:40 p.m. | #2
On 20/12/18 11:45 AM, Kemi Wang wrote:
> Adaptive mutex indicates the semantic of spinning for a while before
> calling into the kernel to block. Thus, the lock of adaptive mutex could be
> acquired via immediate getting, spinning getting or waking up.
> 
> Currently, the spin-waiting algorithm of adaptive mutex is for each
> processor to repeatedly execute an test_and_set instruction until either
> the maximum spin count is reached or the  lock is acquired. However, the
> lock performance via spinning getting will degrade significantly as the
> number of spinning processors increases. Two factors at least cause this
> degradation[1]. First, in order to release the lock, the lock holder has to
> contend with spinning processors for exclusive access of lock cache line.
> (E.g. "lock; decl 0%" of lll_unlock() in
> sysdeps/unix/sysv/linux/x86_64/lowlevellock.h for pthread_mutex_unlock()).
> For most multiprocessor architectures, it has to wait behind those
> test_and_set instructions from spinning processors.
> Furthermore, on these invalidation-based cache coherency system,
> test_and_set instruction will trigger a "read-for-ownership" request for
> exclusive access to the lock cache line, it will potentially slow the
> access of other locations(shares the cache line with the lock at least) by
> the lock holder.
> 
> In this patch, we propose another spin-waiting algorithm to accelerate
> lock acquisition by queuing spinners based-on MCS[2] lock. With MCS
> algorithm, only the header of queue is allowed to spin on the mutex lock,
> others spin on locally-accessible flag variables. Thus, the previous
> negative factors are eliminated.
> 
> The implementation of this MCS-based spin-waiting algorithm requires an
> additional pointer to hold the tail of queue. To keep the size of mutex
> data structure consistent and maintain the user space ABI unchanged, the
> __list field which is originally for implementing robust futex will be
> reused. Therefore, we propose a new type mutex with GNU extension
> PTHREAD_MUTEX_QUEUESPINNER_NP in this patch.
> 
> Pass the ABI compatibility test by running "make check-abi"
> 
> The benchmark is available at branch mutex_testing in
> https://github.com/kemicoco/tst-spinlock.git.
> The tunable pthread.mutex_spin_count is set to 10000 which is big enough in
> our testing for fair comparison.

Outdated comment?  It seems to refer to a tunable that doesn't exist in 
the patchset.

> The first test case emulates a practical workload with mathematical
> calculation.
> The second test case provided by our customer emulates the lock contention
> in a distributed file system.
> 
> Each workload runs with multiple threads in parallel, each workload does
> a) Acquire the lock;
> b) Do some work in critical section;
> c) Unlock
> d) Wait a random time (0~3000 TSCs)
> in a loop until 5 seconds, and obtain the total iterations.

It would be great to add these or some variant of these tests to 
benchtests.  That would help other architectures test if this feature is 
more broadly useful.

> Testing on 2s-Skylake platform (112 cores with 62G RAM, HT=on).
> TC1: Hashwork
> Thread num      adaptive mutex          mcs mutex
> 	1               7297792             7298755 (+0.01%)
> 	2               9332105             9752901 (+4.51%)
> 	3               10428251            11029771 (+5.7%)
> 	4               10572596            11203997 (+5.9%)
> 	5               10496815            11008433 (+4.8%)
> 	6               10292946            10569467 (+2.6%)
> 	7               9861111             10153538 (+2.97%)
> 	14              5845303             9756283  (+66.91%)
> 	28              4299209             8985135  (+109.0%)
> 	56              3298821             5747645  (+74.23%)
> 	112             2941309             5629700  (+91.4%)
> 	448             2821056             3716799  (+31.75%)
> 
> TC2: Test_and_set instruction on shared variables
> Thread num      adaptive mutex          mcs mutex
> 	1               7748765             7751963 (+0.04%)
> 	2               8521596             9251649 (+8.57%)
> 	3               9594653             9890211 (+3.08%)
> 	4               9934361             9800205 (-1.35%)
> 	5               8146007             9597982 (+17.82%)
> 	6               6896436             9367882 (+35.84%)
> 	7               5943880             9159893 (+54.11%)
> 	14              4194305             8623029 (+105.59%)
> 	28              3374218             7818631 (+131.72%)
> 	56              2533912             4836622 (+90.88%)
> 	112             2541950             4850938 (+90.84%)
> 	448             2507000             5345149 (+113.21%)
> 
> Test result on workstation(16 cores with 32G RAM, HT=on)
> TC1: Hashwork
> Thread num      adaptive mutex          mcs mutex
> 	1               11419169            11430369 (+0.1%)
> 	2               15364452            15873667 (+3.31%)
> 	3               17234014            17547329 (+1.82%)
> 	4               17711736            17613548 (-0.55%)
> 	5               16583392            17626707 (+6.29%)
> 	6               14855586            17305468 (+16.49%)
> 	7               12948851            17130807 (+32.3%)
> 	14              8698172             15322237 (+76.15%)
> 	16              8123930             14937645 (+83.87%)
> 	64              7946006             5685132 (-28.45%)
> 
> TC2: Test_and_set instruction on shared variables
> Thread num      adaptive mutex          mcs mutex
> 	1               12535156            12595629 (+0.48%)
> 	2               15665576            15929652 (+1.69%)
> 	3               17469331            16881225 (-3.37%)
> 	4               14833035            15777572 (+6.37%)
> 	5               12376033            15256528 (+23.27%)
> 	6               10568899            14693661 (+39.03%)
> 	7               9657775             14486039 (+49.99%)
> 	14              8015061             14112508 (+76.07%)
> 	16              7641725             13935473 (+82.36%)
> 	64              7571112             7735482 (+2.17%)
> 
> Potential issues:
> a) As the preemption can't be disabled at userland during spinning, MCS
> style lock potentially has the risk to collapse lock performance when CPUs
> are heavily oversubscribed. But generally, MCS-based spin-waiting algorithm
> performs much better than the existed one. We may consider to mitigate this
> issue by using a cancellable MCS to prevent unnecessary active waiting in
> future.
> 
> Reference:
> [1]"The performance of spin lock alternatives for shared-memory
> multiprocessors"
> https://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/anderson-spinlock.pdf.
> [2]"Algorithms for scalable synchronization on shared-memory
> multiprocessors"
> http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf
> 
> V1->V2:
>    a) Add one line description before copyright in new files
>    b) Add one entry to introduce this new type of mutex with GNU
>    PTHREAD_MUTEX_QUEUESPINNER_NP extension in NEWS
> 
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>   NEWS                                    |  6 +++
>   nptl/Makefile                           |  3 +-
>   nptl/allocatestack.c                    |  2 +-
>   nptl/descr.h                            | 26 ++++++------
>   nptl/mcs_lock.c                         | 72 +++++++++++++++++++++++++++++++++
>   nptl/mcs_lock.h                         | 23 +++++++++++
>   nptl/nptl-init.c                        |  2 +-
>   nptl/pthreadP.h                         |  2 +-
>   nptl/pthread_mutex_init.c               |  5 +++
>   nptl/pthread_mutex_lock.c               | 34 +++++++++++++++-
>   nptl/pthread_mutex_timedlock.c          | 31 ++++++++++++--
>   nptl/pthread_mutex_trylock.c            |  5 ++-
>   nptl/pthread_mutex_unlock.c             |  7 +++-
>   nptl/pthread_mutexattr_settype.c        |  2 +-
>   sysdeps/nptl/bits/thread-shared-types.h | 21 +++++++---
>   sysdeps/nptl/pthread.h                  | 15 ++++---
>   sysdeps/unix/sysv/linux/hppa/pthread.h  |  4 ++
>   17 files changed, 223 insertions(+), 37 deletions(-)
>   create mode 100644 nptl/mcs_lock.c
>   create mode 100644 nptl/mcs_lock.h

Change needs a ChangeLog entry.  Also, adding a new type means that you 
also have to update the pretty printing python script, i.e. 
nptl/nptl-printers.py.  In general I don't think this is suitable for 
inclusion at this stage but I've done a review to help you move forward. 
  The change is isolated enough that you may be able to backport it 
after the release but that's a decision to make once the final version 
of the patch is included, hopefully in 2.30.

> 
> diff --git a/NEWS b/NEWS
> index ae80818..ff58d15 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -46,6 +46,12 @@ Major new features:
>     incosistent mutex state after fork call in multithread environment.
>     In both popen and system there is no direct access to user-defined mutexes.
>   
> +* The mutex with PTHREAD_MUTEX_QUEUESPINNER_NP GNU extension has been added,
> +  which accelerates lock acquisition by queuing spinners. This type of
> +  mutex does not support PTHREAD_MUTEXATTR_FLAG_ROBUST attribution and it
> +  has a potential risk to collapse lock performance when CPUs are heavily
> +  oversubscribed.
> +

This is probably too much information in NEWS.  I'd point to the manual 
for more details and only specify the extension here.

>   Deprecated and removed features, and other changes affecting compatibility:
>   
>   * The glibc.tune tunable namespace has been renamed to glibc.cpu and the
> diff --git a/nptl/Makefile b/nptl/Makefile
> index 34ae830..997da5e 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -145,7 +145,8 @@ libpthread-routines = nptl-init nptlfreeres vars events version pt-interp \
>   		      mtx_destroy mtx_init mtx_lock mtx_timedlock \
>   		      mtx_trylock mtx_unlock call_once cnd_broadcast \
>   		      cnd_destroy cnd_init cnd_signal cnd_timedwait cnd_wait \
> -		      tss_create tss_delete tss_get tss_set pthread_mutex_conf
> +		      tss_create tss_delete tss_get tss_set pthread_mutex_conf \
> +              mcs_lock
>   #		      pthread_setuid pthread_seteuid pthread_setreuid \
>   #		      pthread_setresuid \
>   #		      pthread_setgid pthread_setegid pthread_setregid \
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index 04e3f08..9f47129 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -749,7 +749,7 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>        might have happened in the kernel.  */
>     pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
>   				  - offsetof (pthread_mutex_t,
> -					      __data.__list.__next));
> +					      __data.__list.__list_t.__next));
>     pd->robust_head.list_op_pending = NULL;
>   #if __PTHREAD_MUTEX_HAVE_PREV
>     pd->robust_prev = &pd->robust_head;
> diff --git a/nptl/descr.h b/nptl/descr.h
> index 9c01e1b..dc24dd8 100644
> --- a/nptl/descr.h
> +++ b/nptl/descr.h
> @@ -184,38 +184,38 @@ struct pthread
>        FIXME We should use relaxed MO atomic operations here and signal fences
>        because this kind of concurrency is similar to synchronizing with a
>        signal handler.  */
> -# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next))
> +# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next))
>   
>   # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
>     do {									      \
>       __pthread_list_t *next = (__pthread_list_t *)			      \
>         ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul)   \
>          - QUEUE_PTR_ADJUST);						      \
> -    next->__prev = (void *) &mutex->__data.__list.__next;		      \
> -    mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF,		      \
> +    next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next;		      \
> +    mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF,		      \
>   						 robust_head.list);	      \
> -    mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head;	      \
> +    mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head;	      \
>       /* Ensure that the new list entry is ready before we insert it.  */	      \
>       __asm ("" ::: "memory");						      \
>       THREAD_SETMEM (THREAD_SELF, robust_head.list,			      \
> -		   (void *) (((uintptr_t) &mutex->__data.__list.__next)	      \
> +		   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)	      \
>   			     | val));					      \
>     } while (0)
>   # define DEQUEUE_MUTEX(mutex) \
>     do {									      \
>       __pthread_list_t *next = (__pthread_list_t *)			      \
> -      ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul)	      \
> +      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul)	      \
>          - QUEUE_PTR_ADJUST);						      \
> -    next->__prev = mutex->__data.__list.__prev;				      \
> +    next->__list_t.__prev = mutex->__data.__list.__list_t.__prev;				      \
>       __pthread_list_t *prev = (__pthread_list_t *)			      \
> -      ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul)	      \
> +      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul)	      \
>          - QUEUE_PTR_ADJUST);						      \
> -    prev->__next = mutex->__data.__list.__next;				      \
> +    prev->__list_t.__next = mutex->__data.__list.__list_t.__next;				      \
>       /* Ensure that we remove the entry from the list before we change the     \
>          __next pointer of the entry, which is read by the kernel.  */	      \
>       __asm ("" ::: "memory");						      \
> -    mutex->__data.__list.__prev = NULL;					      \
> -    mutex->__data.__list.__next = NULL;					      \
> +    mutex->__data.__list.__list_t.__prev = NULL;					      \
> +    mutex->__data.__list.__list_t.__next = NULL;					      \
>     } while (0)
>   #else
>     union
> @@ -226,7 +226,7 @@ struct pthread
>   
>   # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
>     do {									      \
> -    mutex->__data.__list.__next						      \
> +    mutex->__data.__list.__list_t.__next						      \
>         = THREAD_GETMEM (THREAD_SELF, robust_list.__next);		      \
>       /* Ensure that the new list entry is ready before we insert it.  */	      \
>       __asm ("" ::: "memory");						      \
> @@ -253,7 +253,7 @@ struct pthread
>   	/* Ensure that we remove the entry from the list before we change the \
>   	   __next pointer of the entry, which is read by the kernel.  */      \
>   	    __asm ("" ::: "memory");					      \
> -	mutex->__data.__list.__next = NULL;				      \
> +	mutex->__data.__list.__list_t.__next = NULL;				      \
>         }									      \
>     } while (0)
>   #endif
> diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
> new file mode 100644
> index 0000000..43f90dc
> --- /dev/null
> +++ b/nptl/mcs_lock.c
> @@ -0,0 +1,72 @@
> +/* MCS-style lock for queuing spinners
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "pthreadP.h"
> +#include <atomic.h>
> +
> +static __thread mcs_lock_t node = {
> +  NULL,
> +  0,
> +};
> +
> +void mcs_lock (mcs_lock_t **lock)

Incorrect formatting, should be:

void
mcs_lock (mcs_lock_t **lock)

> +{
> +  mcs_lock_t *prev;
> +
> +  /* Initalize node.  */
> +  node.next = NULL;
> +  node.locked = 0;
> +
> +  prev = atomic_exchange_acquire(lock, &node);

Please add a more detailed explanation on what this would synchronize 
with and why the semantics you chose are necessary and/or sufficient. 
In general we tend to be as verbose as possible about synchronization 
points to ensure that the concurrency issues associated with those 
points are well understood.

This applies to all synchronization points in the two functions defined 
in this file.

> +
> +  /* No spinners waiting in the queue, lock is acquired immediately.  */
> +  if (prev == NULL)
> +    {
> +      node.locked = 1;
> +      return;
> +    }
> +
> +  /* Add current spinner into the queue.  */
> +  atomic_store_release (&prev->next, &node);
> +  atomic_full_barrier ();
> +  /* Waiting until waken up by the previous spinner.  */
> +  while (!atomic_load_relaxed (&node.locked))
> +    atomic_spin_nop ();
> +}
> +
> +void mcs_unlock (mcs_lock_t **lock)

Incorrect formatting.

> +{
> +  mcs_lock_t *next = node.next;
> +
> +  if (next == NULL)
> +    {
> +      /* Check the tail of the queue:
> +         a) Release the lock and return if current node is the tail.  */
> +      if (atomic_compare_and_exchange_val_acq(lock, NULL, &node) == &node)
> +        return;
> +
> +      /* b) Waiting until new node is added to the queue if current node is
> +         not the tail (lock != node).  */
> +      while (! (next = atomic_load_relaxed (&node.next)))
> +        atomic_spin_nop ();
> +    }
> +
> +  /* Wake up the next spinner.  */
> +  atomic_store_release (&next->locked, 1);
> +  atomic_full_barrier ();
> +}
> diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h
> new file mode 100644
> index 0000000..5ee32f5
> --- /dev/null
> +++ b/nptl/mcs_lock.h
> @@ -0,0 +1,23 @@
> +/* MCS-style lock for queuing spinners
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +extern void mcs_lock (mcs_lock_t **lock)
> +  __attribute__ ((visibility ("hidden")));
> +
> +extern void mcs_unlock (mcs_lock_t **lock)
> +  __attribute__ ((visibility ("hidden")));

Why is the attribute needed?

> diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
> index 20ff3fd..d20481a 100644
> --- a/nptl/nptl-init.c
> +++ b/nptl/nptl-init.c
> @@ -289,7 +289,7 @@ __pthread_initialize_minimal_internal (void)
>   #ifdef __NR_set_robust_list
>       pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
>   				    - offsetof (pthread_mutex_t,
> -						__data.__list.__next));
> +						__data.__list.__list_t.__next));
>       INTERNAL_SYSCALL_DECL (err);
>       int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head,
>   				sizeof (struct robust_list_head));
> diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
> index 7f16ba9..1179864 100644
> --- a/nptl/pthreadP.h
> +++ b/nptl/pthreadP.h
> @@ -67,7 +67,7 @@ static inline short max_adaptive_count (void)
>   /* Internal mutex type value.  */
>   enum
>   {
> -  PTHREAD_MUTEX_KIND_MASK_NP = 3,
> +  PTHREAD_MUTEX_KIND_MASK_NP = 7,
>   
>     PTHREAD_MUTEX_ELISION_NP    = 256,
>     PTHREAD_MUTEX_NO_ELISION_NP = 512,
> diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
> index 5cf290c..99f1707 100644
> --- a/nptl/pthread_mutex_init.c
> +++ b/nptl/pthread_mutex_init.c
> @@ -111,6 +111,11 @@ __pthread_mutex_init (pthread_mutex_t *mutex,
>   	return ENOTSUP;
>   #endif
>   
> +      /* Robust mutex does not support the PTHREAD_MUTEX_QUEUESPINNER_NP
> +         GNU extension.  */
> +      if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0)
> +        return ENOTSUP;
> +
>         mutex_kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP;
>       }
>   
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 474b4df..7b81470 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -26,6 +26,7 @@
>   #include <atomic.h>
>   #include <lowlevellock.h>
>   #include <stap-probe.h>
> +#include <mcs_lock.h>
>   
>   #ifndef lll_lock_elision
>   #define lll_lock_elision(lock, try_lock, private)	({ \
> @@ -118,6 +119,35 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>         mutex->__data.__count = 1;
>       }
>     else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
> +			 == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
> +    {
> +      if (! __is_smp)
> +        goto simple;
> +
> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
> +        {
> +          int cnt = 0;
> +          int max_cnt = MIN (max_adaptive_count (),
> +                            mutex->__data.__spins * 2 + 10);
> +          int val = 0;
> +
> +          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
> +
> +          do
> +            {
> +              atomic_spin_nop ();
> +              val = atomic_load_relaxed (&mutex->__data.__lock);
> +            }
> +          while (val != 0 && ++cnt < max_cnt);
> +
> +          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
> +          LLL_MUTEX_LOCK (mutex);

The loop reads the lock value till it finds that it is unlocked.  It 
then proceeds to a MUTEX_LOCK (after mcs_unlock), which means that 
there's a race window where a thread proceeds to LLL_MUTEX_LOCK only to 
find that another thread has raced ahead and taken the mutex.  Is that 
expected behaviour?

> +
> +          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> +        }
> +      assert (mutex->__data.__owner == 0);
> +    }
> +  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
>   			  == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
>       {
>         if (! __is_smp)
> @@ -179,7 +209,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
>       case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -		     &mutex->__data.__list.__next);
> +		     &mutex->__data.__list.__list_t.__next);
>         /* We need to set op_pending before starting the operation.  Also
>   	 see comments at ENQUEUE_MUTEX.  */
>         __asm ("" ::: "memory");
> @@ -365,7 +395,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>   	  {
>   	    /* Note: robust PI futexes are signaled by setting bit 0.  */
>   	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
> +			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>   				     | 1));
>   	    /* We need to set op_pending before starting the operation.  Also
>   	       see comments at ENQUEUE_MUTEX.  */
> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> index 453b824..edf0415 100644
> --- a/nptl/pthread_mutex_timedlock.c
> +++ b/nptl/pthread_mutex_timedlock.c
> @@ -25,6 +25,7 @@
>   #include <atomic.h>
>   #include <lowlevellock.h>
>   #include <not-cancel.h>
> +#include <mcs_lock.h>
>   
>   #include <stap-probe.h>
>   
> @@ -135,13 +136,37 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>   	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>   	}
>         break;
> -
> +    case PTHREAD_MUTEX_QUEUESPINNER_NP:
> +      if (! __is_smp)
> +        goto simple;
> +
> +      if (lll_trylock (mutex) != 0)
> +        {
> +          int cnt = 0;
> +          int max_cnt = MIN (max_adaptive_count (),
> +                            mutex->__data.__spins * 2 + 10);
> +          int val = 0;
> +
> +          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
> +          do
> +            {
> +              atomic_spin_nop ();
> +              val = atomic_load_relaxed (&mutex->__data.__lock);
> +            }
> +          while (val != 0 && ++cnt < max_cnt);
> +          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
> +          result = lll_timedlock(mutex->__data.__lock, abstime,
> +                      PTHREAD_MUTEX_PSHARED (mutex));

Similar situation here: there seems to be a race window similar to the 
above.

> +
> +          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> +        }
> +      break;
>       case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
>       case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
>       case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
>       case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -		     &mutex->__data.__list.__next);
> +		     &mutex->__data.__list.__list_t.__next);
>         /* We need to set op_pending before starting the operation.  Also
>   	 see comments at ENQUEUE_MUTEX.  */
>         __asm ("" ::: "memory");
> @@ -353,7 +378,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>   	  {
>   	    /* Note: robust PI futexes are signaled by setting bit 0.  */
>   	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
> +			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>   				     | 1));
>   	    /* We need to set op_pending before starting the operation.  Also
>   	       see comments at ENQUEUE_MUTEX.  */
> diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
> index fa90c1d..a22de5e 100644
> --- a/nptl/pthread_mutex_trylock.c
> +++ b/nptl/pthread_mutex_trylock.c
> @@ -78,6 +78,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
>         FORCE_ELISION (mutex, goto elision);
>         /*FALL THROUGH*/
>       case PTHREAD_MUTEX_ADAPTIVE_NP:
> +    case PTHREAD_MUTEX_QUEUESPINNER_NP:
>       case PTHREAD_MUTEX_ERRORCHECK_NP:
>         if (lll_trylock (mutex->__data.__lock) != 0)
>   	break;
> @@ -93,7 +94,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
>       case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
>       case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -		     &mutex->__data.__list.__next);
> +		     &mutex->__data.__list.__list_t.__next);
>   
>         oldval = mutex->__data.__lock;
>         do
> @@ -213,7 +214,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
>   	if (robust)
>   	  /* Note: robust PI futexes are signaled by setting bit 0.  */
>   	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
> +			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>   				   | 1));
>   
>   	oldval = mutex->__data.__lock;
> diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
> index 68d04d5..c3e8ef4 100644
> --- a/nptl/pthread_mutex_unlock.c
> +++ b/nptl/pthread_mutex_unlock.c
> @@ -78,6 +78,9 @@ __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
>         goto normal;
>       }
>     else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
> +			      == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
> +    goto normal;
> +  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
>   			      == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
>       goto normal;
>     else
> @@ -142,7 +145,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
>       robust:
>         /* Remove mutex from the list.  */
>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -		     &mutex->__data.__list.__next);
> +		     &mutex->__data.__list.__list_t.__next);
>         /* We must set op_pending before we dequeue the mutex.  Also see
>   	 comments at ENQUEUE_MUTEX.  */
>         __asm ("" ::: "memory");
> @@ -242,7 +245,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
>   	  /* Remove mutex from the list.
>   	     Note: robust PI futexes are signaled by setting bit 0.  */
>   	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> -			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
> +			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>   				   | 1));
>   	  /* We must set op_pending before we dequeue the mutex.  Also see
>   	     comments at ENQUEUE_MUTEX.  */
> diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c
> index 7d36cc3..c2382b4 100644
> --- a/nptl/pthread_mutexattr_settype.c
> +++ b/nptl/pthread_mutexattr_settype.c
> @@ -25,7 +25,7 @@ __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind)
>   {
>     struct pthread_mutexattr *iattr;
>   
> -  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP)
> +  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP)
>       return EINVAL;
>   
>     /* Cannot distinguish between DEFAULT and NORMAL. So any settype
> diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
> index 05c94e7..1cf8874 100644
> --- a/sysdeps/nptl/bits/thread-shared-types.h
> +++ b/sysdeps/nptl/bits/thread-shared-types.h
> @@ -79,15 +79,19 @@
>   /* Common definition of pthread_mutex_t. */
>   
>   #if !__PTHREAD_MUTEX_USE_UNION
> -typedef struct __pthread_internal_list
> +typedef union __pthread_internal_list
>   {
> -  struct __pthread_internal_list *__prev;
> -  struct __pthread_internal_list *__next;
> +  struct {
> +    union __pthread_internal_list *__prev;
> +    union __pthread_internal_list *__next;
> +  }__list_t;
> +  void *mcs_lock;
>   } __pthread_list_t;
>   #else
> -typedef struct __pthread_internal_slist
> +typedef union __pthread_internal_slist
>   {
> -  struct __pthread_internal_slist *__next;
> +  union __pthread_internal_slist *__next;
> +  void *mcs_lock;
>   } __pthread_slist_t;
>   #endif
>   
> @@ -165,6 +169,13 @@ struct __pthread_mutex_s
>     __PTHREAD_COMPAT_PADDING_END
>   };
>   
> +struct mcs_lock
> +{
> +  struct mcs_lock *next;
> +  int locked;
> +};
> +
> +typedef struct mcs_lock mcs_lock_t;
>   
>   /* Common definition of pthread_cond_t. */
>   
> diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
> index df049ab..4b4b80a 100644
> --- a/sysdeps/nptl/pthread.h
> +++ b/sysdeps/nptl/pthread.h
> @@ -45,7 +45,8 @@ enum
>     PTHREAD_MUTEX_TIMED_NP,
>     PTHREAD_MUTEX_RECURSIVE_NP,
>     PTHREAD_MUTEX_ERRORCHECK_NP,
> -  PTHREAD_MUTEX_ADAPTIVE_NP
> +  PTHREAD_MUTEX_ADAPTIVE_NP,
> +  PTHREAD_MUTEX_QUEUESPINNER_NP
>   #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
>     ,
>     PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
> @@ -85,14 +86,16 @@ enum
>   
>   #if __PTHREAD_MUTEX_HAVE_PREV
>   # define PTHREAD_MUTEX_INITIALIZER \
> -  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } }
> +  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } }
>   # ifdef __USE_GNU
>   #  define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
> -  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>   #  define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
> -  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } }
> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>   #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
> -  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
> +#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>   
>   # endif
>   #else
> @@ -105,6 +108,8 @@ enum
>     { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } }
>   #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
>     { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } }
> +#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
> +  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } }
>   
>   # endif
>   #endif
> diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h
> index 11a024d..57c101c 100644
> --- a/sysdeps/unix/sysv/linux/hppa/pthread.h
> +++ b/sysdeps/unix/sysv/linux/hppa/pthread.h
> @@ -46,6 +46,7 @@ enum
>     PTHREAD_MUTEX_RECURSIVE_NP,
>     PTHREAD_MUTEX_ERRORCHECK_NP,
>     PTHREAD_MUTEX_ADAPTIVE_NP
> +  PTHREAD_MUTEX_QUEUESPINNER_NP,
>   #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
>     ,
>     PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
> @@ -95,6 +96,9 @@ enum
>   # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
>     { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \
>         { __PTHREAD_SPINS }, { 0, 0 } } }
> +# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
> +  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \
> +      { __PTHREAD_SPINS }, { 0, 0 } } }
>   #endif
>   
>   
>
kemi Jan. 2, 2019, 2:24 a.m. | #3
On 2019/1/1 上午1:40, Siddhesh Poyarekar wrote:
> On 20/12/18 11:45 AM, Kemi Wang wrote:
>> Adaptive mutex indicates the semantic of spinning for a while before
>> calling into the kernel to block. Thus, the lock of adaptive mutex could be
>> acquired via immediate getting, spinning getting or waking up.
>>
>> Currently, the spin-waiting algorithm of adaptive mutex is for each
>> processor to repeatedly execute an test_and_set instruction until either
>> the maximum spin count is reached or the  lock is acquired. However, the
>> lock performance via spinning getting will degrade significantly as the
>> number of spinning processors increases. Two factors at least cause this
>> degradation[1]. First, in order to release the lock, the lock holder has to
>> contend with spinning processors for exclusive access of lock cache line.
>> (E.g. "lock; decl 0%" of lll_unlock() in
>> sysdeps/unix/sysv/linux/x86_64/lowlevellock.h for pthread_mutex_unlock()).
>> For most multiprocessor architectures, it has to wait behind those
>> test_and_set instructions from spinning processors.
>> Furthermore, on these invalidation-based cache coherency system,
>> test_and_set instruction will trigger a "read-for-ownership" request for
>> exclusive access to the lock cache line, it will potentially slow the
>> access of other locations(shares the cache line with the lock at least) by
>> the lock holder.
>>
>> In this patch, we propose another spin-waiting algorithm to accelerate
>> lock acquisition by queuing spinners based-on MCS[2] lock. With MCS
>> algorithm, only the header of queue is allowed to spin on the mutex lock,
>> others spin on locally-accessible flag variables. Thus, the previous
>> negative factors are eliminated.
>>
>> The implementation of this MCS-based spin-waiting algorithm requires an
>> additional pointer to hold the tail of queue. To keep the size of mutex
>> data structure consistent and maintain the user space ABI unchanged, the
>> __list field which is originally for implementing robust futex will be
>> reused. Therefore, we propose a new type mutex with GNU extension
>> PTHREAD_MUTEX_QUEUESPINNER_NP in this patch.
>>
>> Pass the ABI compatibility test by running "make check-abi"
>>
>> The benchmark is available at branch mutex_testing in
>> https://github.com/kemicoco/tst-spinlock.git.
>> The tunable pthread.mutex_spin_count is set to 10000 which is big enough in
>> our testing for fair comparison.
> 
> Outdated comment?  It seems to refer to a tunable that doesn't exist in the patchset.
> 

No.
It referred to the glibc.pthread.mutex_spin_count tunable patch(Commit:6310e6be9b7c322d56a45729b3ebcd22e26dd0c2)
which has been merged in master recently.

>> The first test case emulates a practical workload with mathematical
>> calculation.
>> The second test case provided by our customer emulates the lock contention
>> in a distributed file system.
>>
>> Each workload runs with multiple threads in parallel, each workload does
>> a) Acquire the lock;
>> b) Do some work in critical section;
>> c) Unlock
>> d) Wait a random time (0~3000 TSCs)
>> in a loop until 5 seconds, and obtain the total iterations.
> 
> It would be great to add these or some variant of these tests to benchtests.  That would help other architectures test if this feature is more broadly useful.
> 

Sounds good.

>> Testing on 2s-Skylake platform (112 cores with 62G RAM, HT=on).
>> TC1: Hashwork
>> Thread num      adaptive mutex          mcs mutex
>>     1               7297792             7298755 (+0.01%)
>>     2               9332105             9752901 (+4.51%)
>>     3               10428251            11029771 (+5.7%)
>>     4               10572596            11203997 (+5.9%)
>>     5               10496815            11008433 (+4.8%)
>>     6               10292946            10569467 (+2.6%)
>>     7               9861111             10153538 (+2.97%)
>>     14              5845303             9756283  (+66.91%)
>>     28              4299209             8985135  (+109.0%)
>>     56              3298821             5747645  (+74.23%)
>>     112             2941309             5629700  (+91.4%)
>>     448             2821056             3716799  (+31.75%)
>>
>> TC2: Test_and_set instruction on shared variables
>> Thread num      adaptive mutex          mcs mutex
>>     1               7748765             7751963 (+0.04%)
>>     2               8521596             9251649 (+8.57%)
>>     3               9594653             9890211 (+3.08%)
>>     4               9934361             9800205 (-1.35%)
>>     5               8146007             9597982 (+17.82%)
>>     6               6896436             9367882 (+35.84%)
>>     7               5943880             9159893 (+54.11%)
>>     14              4194305             8623029 (+105.59%)
>>     28              3374218             7818631 (+131.72%)
>>     56              2533912             4836622 (+90.88%)
>>     112             2541950             4850938 (+90.84%)
>>     448             2507000             5345149 (+113.21%)
>>
>> Test result on workstation(16 cores with 32G RAM, HT=on)
>> TC1: Hashwork
>> Thread num      adaptive mutex          mcs mutex
>>     1               11419169            11430369 (+0.1%)
>>     2               15364452            15873667 (+3.31%)
>>     3               17234014            17547329 (+1.82%)
>>     4               17711736            17613548 (-0.55%)
>>     5               16583392            17626707 (+6.29%)
>>     6               14855586            17305468 (+16.49%)
>>     7               12948851            17130807 (+32.3%)
>>     14              8698172             15322237 (+76.15%)
>>     16              8123930             14937645 (+83.87%)
>>     64              7946006             5685132 (-28.45%)
>>
>> TC2: Test_and_set instruction on shared variables
>> Thread num      adaptive mutex          mcs mutex
>>     1               12535156            12595629 (+0.48%)
>>     2               15665576            15929652 (+1.69%)
>>     3               17469331            16881225 (-3.37%)
>>     4               14833035            15777572 (+6.37%)
>>     5               12376033            15256528 (+23.27%)
>>     6               10568899            14693661 (+39.03%)
>>     7               9657775             14486039 (+49.99%)
>>     14              8015061             14112508 (+76.07%)
>>     16              7641725             13935473 (+82.36%)
>>     64              7571112             7735482 (+2.17%)
>>
>> Potential issues:
>> a) As the preemption can't be disabled at userland during spinning, MCS
>> style lock potentially has the risk to collapse lock performance when CPUs
>> are heavily oversubscribed. But generally, MCS-based spin-waiting algorithm
>> performs much better than the existed one. We may consider to mitigate this
>> issue by using a cancellable MCS to prevent unnecessary active waiting in
>> future.
>>
>> Reference:
>> [1]"The performance of spin lock alternatives for shared-memory
>> multiprocessors"
>> https://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/anderson-spinlock.pdf.
>> [2]"Algorithms for scalable synchronization on shared-memory
>> multiprocessors"
>> http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf
>>
>> V1->V2:
>>    a) Add one line description before copyright in new files
>>    b) Add one entry to introduce this new type of mutex with GNU
>>    PTHREAD_MUTEX_QUEUESPINNER_NP extension in NEWS
>>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>   NEWS                                    |  6 +++
>>   nptl/Makefile                           |  3 +-
>>   nptl/allocatestack.c                    |  2 +-
>>   nptl/descr.h                            | 26 ++++++------
>>   nptl/mcs_lock.c                         | 72 +++++++++++++++++++++++++++++++++
>>   nptl/mcs_lock.h                         | 23 +++++++++++
>>   nptl/nptl-init.c                        |  2 +-
>>   nptl/pthreadP.h                         |  2 +-
>>   nptl/pthread_mutex_init.c               |  5 +++
>>   nptl/pthread_mutex_lock.c               | 34 +++++++++++++++-
>>   nptl/pthread_mutex_timedlock.c          | 31 ++++++++++++--
>>   nptl/pthread_mutex_trylock.c            |  5 ++-
>>   nptl/pthread_mutex_unlock.c             |  7 +++-
>>   nptl/pthread_mutexattr_settype.c        |  2 +-
>>   sysdeps/nptl/bits/thread-shared-types.h | 21 +++++++---
>>   sysdeps/nptl/pthread.h                  | 15 ++++---
>>   sysdeps/unix/sysv/linux/hppa/pthread.h  |  4 ++
>>   17 files changed, 223 insertions(+), 37 deletions(-)
>>   create mode 100644 nptl/mcs_lock.c
>>   create mode 100644 nptl/mcs_lock.h
> 
> Change needs a ChangeLog entry.  Also, adding a new type means that you also have to update the pretty printing python script, i.e. nptl/nptl-printers.py.  In general I don't think this is suitable for inclusion at this stage but I've done a review to help you move forward. 

Thanks for your review.

> The change is isolated enough that you may be able to backport it after the release but that's a decision to make once the final version of the patch is included, hopefully in 2.30.
> 
>>
>> diff --git a/NEWS b/NEWS
>> index ae80818..ff58d15 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -46,6 +46,12 @@ Major new features:
>>     incosistent mutex state after fork call in multithread environment.
>>     In both popen and system there is no direct access to user-defined mutexes.
>>   +* The mutex with PTHREAD_MUTEX_QUEUESPINNER_NP GNU extension has been added,
>> +  which accelerates lock acquisition by queuing spinners. This type of
>> +  mutex does not support PTHREAD_MUTEXATTR_FLAG_ROBUST attribution and it
>> +  has a potential risk to collapse lock performance when CPUs are heavily
>> +  oversubscribed.
>> +
> 
> This is probably too much information in NEWS.  I'd point to the manual for more details and only specify the extension here.
> 
>>   Deprecated and removed features, and other changes affecting compatibility:
>>     * The glibc.tune tunable namespace has been renamed to glibc.cpu and the
>> diff --git a/nptl/Makefile b/nptl/Makefile
>> index 34ae830..997da5e 100644
>> --- a/nptl/Makefile
>> +++ b/nptl/Makefile
>> @@ -145,7 +145,8 @@ libpthread-routines = nptl-init nptlfreeres vars events version pt-interp \
>>                 mtx_destroy mtx_init mtx_lock mtx_timedlock \
>>                 mtx_trylock mtx_unlock call_once cnd_broadcast \
>>                 cnd_destroy cnd_init cnd_signal cnd_timedwait cnd_wait \
>> -              tss_create tss_delete tss_get tss_set pthread_mutex_conf
>> +              tss_create tss_delete tss_get tss_set pthread_mutex_conf \
>> +              mcs_lock
>>   #              pthread_setuid pthread_seteuid pthread_setreuid \
>>   #              pthread_setresuid \
>>   #              pthread_setgid pthread_setegid pthread_setregid \
>> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
>> index 04e3f08..9f47129 100644
>> --- a/nptl/allocatestack.c
>> +++ b/nptl/allocatestack.c
>> @@ -749,7 +749,7 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>>        might have happened in the kernel.  */
>>     pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
>>                     - offsetof (pthread_mutex_t,
>> -                          __data.__list.__next));
>> +                          __data.__list.__list_t.__next));
>>     pd->robust_head.list_op_pending = NULL;
>>   #if __PTHREAD_MUTEX_HAVE_PREV
>>     pd->robust_prev = &pd->robust_head;
>> diff --git a/nptl/descr.h b/nptl/descr.h
>> index 9c01e1b..dc24dd8 100644
>> --- a/nptl/descr.h
>> +++ b/nptl/descr.h
>> @@ -184,38 +184,38 @@ struct pthread
>>        FIXME We should use relaxed MO atomic operations here and signal fences
>>        because this kind of concurrency is similar to synchronizing with a
>>        signal handler.  */
>> -# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next))
>> +# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next))
>>     # define ENQUEUE_MUTEX_BOTH(mutex, val)                          \
>>     do {                                          \
>>       __pthread_list_t *next = (__pthread_list_t *)                  \
>>         ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul)   \
>>          - QUEUE_PTR_ADJUST);                              \
>> -    next->__prev = (void *) &mutex->__data.__list.__next;              \
>> -    mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF,              \
>> +    next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next;              \
>> +    mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF,              \
>>                            robust_head.list);          \
>> -    mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head;          \
>> +    mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head;          \
>>       /* Ensure that the new list entry is ready before we insert it.  */          \
>>       __asm ("" ::: "memory");                              \
>>       THREAD_SETMEM (THREAD_SELF, robust_head.list,                  \
>> -           (void *) (((uintptr_t) &mutex->__data.__list.__next)          \
>> +           (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)          \
>>                    | val));                          \
>>     } while (0)
>>   # define DEQUEUE_MUTEX(mutex) \
>>     do {                                          \
>>       __pthread_list_t *next = (__pthread_list_t *)                  \
>> -      ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul)          \
>> +      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul)          \
>>          - QUEUE_PTR_ADJUST);                              \
>> -    next->__prev = mutex->__data.__list.__prev;                      \
>> +    next->__list_t.__prev = mutex->__data.__list.__list_t.__prev;                      \
>>       __pthread_list_t *prev = (__pthread_list_t *)                  \
>> -      ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul)          \
>> +      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul)          \
>>          - QUEUE_PTR_ADJUST);                              \
>> -    prev->__next = mutex->__data.__list.__next;                      \
>> +    prev->__list_t.__next = mutex->__data.__list.__list_t.__next;                      \
>>       /* Ensure that we remove the entry from the list before we change the     \
>>          __next pointer of the entry, which is read by the kernel.  */          \
>>       __asm ("" ::: "memory");                              \
>> -    mutex->__data.__list.__prev = NULL;                          \
>> -    mutex->__data.__list.__next = NULL;                          \
>> +    mutex->__data.__list.__list_t.__prev = NULL;                          \
>> +    mutex->__data.__list.__list_t.__next = NULL;                          \
>>     } while (0)
>>   #else
>>     union
>> @@ -226,7 +226,7 @@ struct pthread
>>     # define ENQUEUE_MUTEX_BOTH(mutex, val)                          \
>>     do {                                          \
>> -    mutex->__data.__list.__next                              \
>> +    mutex->__data.__list.__list_t.__next                              \
>>         = THREAD_GETMEM (THREAD_SELF, robust_list.__next);              \
>>       /* Ensure that the new list entry is ready before we insert it.  */          \
>>       __asm ("" ::: "memory");                              \
>> @@ -253,7 +253,7 @@ struct pthread
>>       /* Ensure that we remove the entry from the list before we change the \
>>          __next pointer of the entry, which is read by the kernel.  */      \
>>           __asm ("" ::: "memory");                          \
>> -    mutex->__data.__list.__next = NULL;                      \
>> +    mutex->__data.__list.__list_t.__next = NULL;                      \
>>         }                                          \
>>     } while (0)
>>   #endif
>> diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
>> new file mode 100644
>> index 0000000..43f90dc
>> --- /dev/null
>> +++ b/nptl/mcs_lock.c
>> @@ -0,0 +1,72 @@
>> +/* MCS-style lock for queuing spinners
>> +   Copyright (C) 2018 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include "pthreadP.h"
>> +#include <atomic.h>
>> +
>> +static __thread mcs_lock_t node = {
>> +  NULL,
>> +  0,
>> +};
>> +
>> +void mcs_lock (mcs_lock_t **lock)
> 
> Incorrect formatting, should be:
> 
> void
> mcs_lock (mcs_lock_t **lock)
> 

OK

>> +{
>> +  mcs_lock_t *prev;
>> +
>> +  /* Initalize node.  */
>> +  node.next = NULL;
>> +  node.locked = 0;
>> +
>> +  prev = atomic_exchange_acquire(lock, &node);
> 
> Please add a more detailed explanation on what this would synchronize with and why the semantics you chose are necessary and/or sufficient. In general we tend to be as verbose as possible about synchronization points to ensure that the concurrency issues associated with those points are well understood.
> 
> This applies to all synchronization points in the two functions defined in this file.
> 

Got it. Will add more comments.

>> +
>> +  /* No spinners waiting in the queue, lock is acquired immediately.  */
>> +  if (prev == NULL)
>> +    {
>> +      node.locked = 1;
>> +      return;
>> +    }
>> +
>> +  /* Add current spinner into the queue.  */
>> +  atomic_store_release (&prev->next, &node);
>> +  atomic_full_barrier ();
>> +  /* Waiting until waken up by the previous spinner.  */
>> +  while (!atomic_load_relaxed (&node.locked))
>> +    atomic_spin_nop ();
>> +}
>> +
>> +void mcs_unlock (mcs_lock_t **lock)
> 
> Incorrect formatting.
> 

OK

>> +{
>> +  mcs_lock_t *next = node.next;
>> +
>> +  if (next == NULL)
>> +    {
>> +      /* Check the tail of the queue:
>> +         a) Release the lock and return if current node is the tail.  */
>> +      if (atomic_compare_and_exchange_val_acq(lock, NULL, &node) == &node)
>> +        return;
>> +
>> +      /* b) Waiting until new node is added to the queue if current node is
>> +         not the tail (lock != node).  */
>> +      while (! (next = atomic_load_relaxed (&node.next)))
>> +        atomic_spin_nop ();
>> +    }
>> +
>> +  /* Wake up the next spinner.  */
>> +  atomic_store_release (&next->locked, 1);
>> +  atomic_full_barrier ();
>> +}
>> diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h
>> new file mode 100644
>> index 0000000..5ee32f5
>> --- /dev/null
>> +++ b/nptl/mcs_lock.h
>> @@ -0,0 +1,23 @@
>> +/* MCS-style lock for queuing spinners
>> +   Copyright (C) 2018 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +extern void mcs_lock (mcs_lock_t **lock)
>> +  __attribute__ ((visibility ("hidden")));
>> +
>> +extern void mcs_unlock (mcs_lock_t **lock)
>> +  __attribute__ ((visibility ("hidden")));
> 
> Why is the attribute needed?
> 

MCS lock should only be used in lock mechanism to queue spinner. The hidden attribution can help to
avoid other executable or shared library touch it.

>> diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
>> index 20ff3fd..d20481a 100644
>> --- a/nptl/nptl-init.c
>> +++ b/nptl/nptl-init.c
>> @@ -289,7 +289,7 @@ __pthread_initialize_minimal_internal (void)
>>   #ifdef __NR_set_robust_list
>>       pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
>>                       - offsetof (pthread_mutex_t,
>> -                        __data.__list.__next));
>> +                        __data.__list.__list_t.__next));
>>       INTERNAL_SYSCALL_DECL (err);
>>       int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head,
>>                   sizeof (struct robust_list_head));
>> diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
>> index 7f16ba9..1179864 100644
>> --- a/nptl/pthreadP.h
>> +++ b/nptl/pthreadP.h
>> @@ -67,7 +67,7 @@ static inline short max_adaptive_count (void)
>>   /* Internal mutex type value.  */
>>   enum
>>   {
>> -  PTHREAD_MUTEX_KIND_MASK_NP = 3,
>> +  PTHREAD_MUTEX_KIND_MASK_NP = 7,
>>       PTHREAD_MUTEX_ELISION_NP    = 256,
>>     PTHREAD_MUTEX_NO_ELISION_NP = 512,
>> diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
>> index 5cf290c..99f1707 100644
>> --- a/nptl/pthread_mutex_init.c
>> +++ b/nptl/pthread_mutex_init.c
>> @@ -111,6 +111,11 @@ __pthread_mutex_init (pthread_mutex_t *mutex,
>>       return ENOTSUP;
>>   #endif
>>   +      /* Robust mutex does not support the PTHREAD_MUTEX_QUEUESPINNER_NP
>> +         GNU extension.  */
>> +      if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0)
>> +        return ENOTSUP;
>> +
>>         mutex_kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP;
>>       }
>>   diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>> index 474b4df..7b81470 100644
>> --- a/nptl/pthread_mutex_lock.c
>> +++ b/nptl/pthread_mutex_lock.c
>> @@ -26,6 +26,7 @@
>>   #include <atomic.h>
>>   #include <lowlevellock.h>
>>   #include <stap-probe.h>
>> +#include <mcs_lock.h>
>>     #ifndef lll_lock_elision
>>   #define lll_lock_elision(lock, try_lock, private)    ({ \
>> @@ -118,6 +119,35 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
>>         mutex->__data.__count = 1;
>>       }
>>     else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
>> +             == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
>> +    {
>> +      if (! __is_smp)
>> +        goto simple;
>> +
>> +      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
>> +        {
>> +          int cnt = 0;
>> +          int max_cnt = MIN (max_adaptive_count (),
>> +                            mutex->__data.__spins * 2 + 10);
>> +          int val = 0;
>> +
>> +          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
>> +
>> +          do
>> +            {
>> +              atomic_spin_nop ();
>> +              val = atomic_load_relaxed (&mutex->__data.__lock);
>> +            }
>> +          while (val != 0 && ++cnt < max_cnt);
>> +
>> +          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
>> +          LLL_MUTEX_LOCK (mutex);
> 
> The loop reads the lock value till it finds that it is unlocked.  It then proceeds to a MUTEX_LOCK (after mcs_unlock), which means that there's a race window where a thread proceeds to LLL_MUTEX_LOCK only to find that another thread has raced ahead and taken the mutex.  Is that expected behaviour?

Yes. It's the expected behavior. 
Actually, in that race window(rarely happen), the mutex lock may be stealed by
a) the next spinner(we have called mcs_unlock() to wake up it), or b) the subsequent
thread which just right calls LLL_MUTEX_TRYLOCK().

If a) or b) happens, the current spinner will probably fail to the get the lock and call 
futex_wait to block. It does not a matter unless the caller hopes a strict FIFO grantunee.
IMHO, I don't think the existed spinlock or mutex gives that grantunee.

BTW, the LLL_MUTEX_LOCK semantic indicates a) trying to get the lock via cmpxchg first, and 
b) call futex_wait if fails
>> +
>> +          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>> +        }
>> +      assert (mutex->__data.__owner == 0);
>> +    }
>> +  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
>>                 == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
>>       {
>>         if (! __is_smp)
>> @@ -179,7 +209,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>>       case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
>>       case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
>>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -             &mutex->__data.__list.__next);
>> +             &mutex->__data.__list.__list_t.__next);
>>         /* We need to set op_pending before starting the operation.  Also
>>        see comments at ENQUEUE_MUTEX.  */
>>         __asm ("" ::: "memory");
>> @@ -365,7 +395,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>>         {
>>           /* Note: robust PI futexes are signaled by setting bit 0.  */
>>           THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -               (void *) (((uintptr_t) &mutex->__data.__list.__next)
>> +               (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>>                        | 1));
>>           /* We need to set op_pending before starting the operation.  Also
>>              see comments at ENQUEUE_MUTEX.  */
>> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
>> index 453b824..edf0415 100644
>> --- a/nptl/pthread_mutex_timedlock.c
>> +++ b/nptl/pthread_mutex_timedlock.c
>> @@ -25,6 +25,7 @@
>>   #include <atomic.h>
>>   #include <lowlevellock.h>
>>   #include <not-cancel.h>
>> +#include <mcs_lock.h>
>>     #include <stap-probe.h>
>>   @@ -135,13 +136,37 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>>         mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>       }
>>         break;
>> -
>> +    case PTHREAD_MUTEX_QUEUESPINNER_NP:
>> +      if (! __is_smp)
>> +        goto simple;
>> +
>> +      if (lll_trylock (mutex) != 0)
>> +        {
>> +          int cnt = 0;
>> +          int max_cnt = MIN (max_adaptive_count (),
>> +                            mutex->__data.__spins * 2 + 10);
>> +          int val = 0;
>> +
>> +          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
>> +          do
>> +            {
>> +              atomic_spin_nop ();
>> +              val = atomic_load_relaxed (&mutex->__data.__lock);
>> +            }
>> +          while (val != 0 && ++cnt < max_cnt);
>> +          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
>> +          result = lll_timedlock(mutex->__data.__lock, abstime,
>> +                      PTHREAD_MUTEX_PSHARED (mutex));
> 
> Similar situation here: there seems to be a race window similar to the above.
> 

See above.

>> +
>> +          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>> +        }
>> +      break;
>>       case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
>>       case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
>>       case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
>>       case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
>>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -             &mutex->__data.__list.__next);
>> +             &mutex->__data.__list.__list_t.__next);
>>         /* We need to set op_pending before starting the operation.  Also
>>        see comments at ENQUEUE_MUTEX.  */
>>         __asm ("" ::: "memory");
>> @@ -353,7 +378,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
>>         {
>>           /* Note: robust PI futexes are signaled by setting bit 0.  */
>>           THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -               (void *) (((uintptr_t) &mutex->__data.__list.__next)
>> +               (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>>                        | 1));
>>           /* We need to set op_pending before starting the operation.  Also
>>              see comments at ENQUEUE_MUTEX.  */
>> diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
>> index fa90c1d..a22de5e 100644
>> --- a/nptl/pthread_mutex_trylock.c
>> +++ b/nptl/pthread_mutex_trylock.c
>> @@ -78,6 +78,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
>>         FORCE_ELISION (mutex, goto elision);
>>         /*FALL THROUGH*/
>>       case PTHREAD_MUTEX_ADAPTIVE_NP:
>> +    case PTHREAD_MUTEX_QUEUESPINNER_NP:
>>       case PTHREAD_MUTEX_ERRORCHECK_NP:
>>         if (lll_trylock (mutex->__data.__lock) != 0)
>>       break;
>> @@ -93,7 +94,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
>>       case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
>>       case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
>>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -             &mutex->__data.__list.__next);
>> +             &mutex->__data.__list.__list_t.__next);
>>           oldval = mutex->__data.__lock;
>>         do
>> @@ -213,7 +214,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex)
>>       if (robust)
>>         /* Note: robust PI futexes are signaled by setting bit 0.  */
>>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -             (void *) (((uintptr_t) &mutex->__data.__list.__next)
>> +             (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>>                      | 1));
>>         oldval = mutex->__data.__lock;
>> diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
>> index 68d04d5..c3e8ef4 100644
>> --- a/nptl/pthread_mutex_unlock.c
>> +++ b/nptl/pthread_mutex_unlock.c
>> @@ -78,6 +78,9 @@ __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
>>         goto normal;
>>       }
>>     else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
>> +                  == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
>> +    goto normal;
>> +  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
>>                     == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
>>       goto normal;
>>     else
>> @@ -142,7 +145,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
>>       robust:
>>         /* Remove mutex from the list.  */
>>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -             &mutex->__data.__list.__next);
>> +             &mutex->__data.__list.__list_t.__next);
>>         /* We must set op_pending before we dequeue the mutex.  Also see
>>        comments at ENQUEUE_MUTEX.  */
>>         __asm ("" ::: "memory");
>> @@ -242,7 +245,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
>>         /* Remove mutex from the list.
>>            Note: robust PI futexes are signaled by setting bit 0.  */
>>         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
>> -             (void *) (((uintptr_t) &mutex->__data.__list.__next)
>> +             (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
>>                      | 1));
>>         /* We must set op_pending before we dequeue the mutex.  Also see
>>            comments at ENQUEUE_MUTEX.  */
>> diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c
>> index 7d36cc3..c2382b4 100644
>> --- a/nptl/pthread_mutexattr_settype.c
>> +++ b/nptl/pthread_mutexattr_settype.c
>> @@ -25,7 +25,7 @@ __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind)
>>   {
>>     struct pthread_mutexattr *iattr;
>>   -  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP)
>> +  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP)
>>       return EINVAL;
>>       /* Cannot distinguish between DEFAULT and NORMAL. So any settype
>> diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
>> index 05c94e7..1cf8874 100644
>> --- a/sysdeps/nptl/bits/thread-shared-types.h
>> +++ b/sysdeps/nptl/bits/thread-shared-types.h
>> @@ -79,15 +79,19 @@
>>   /* Common definition of pthread_mutex_t. */
>>     #if !__PTHREAD_MUTEX_USE_UNION
>> -typedef struct __pthread_internal_list
>> +typedef union __pthread_internal_list
>>   {
>> -  struct __pthread_internal_list *__prev;
>> -  struct __pthread_internal_list *__next;
>> +  struct {
>> +    union __pthread_internal_list *__prev;
>> +    union __pthread_internal_list *__next;
>> +  }__list_t;
>> +  void *mcs_lock;
>>   } __pthread_list_t;
>>   #else
>> -typedef struct __pthread_internal_slist
>> +typedef union __pthread_internal_slist
>>   {
>> -  struct __pthread_internal_slist *__next;
>> +  union __pthread_internal_slist *__next;
>> +  void *mcs_lock;
>>   } __pthread_slist_t;
>>   #endif
>>   @@ -165,6 +169,13 @@ struct __pthread_mutex_s
>>     __PTHREAD_COMPAT_PADDING_END
>>   };
>>   +struct mcs_lock
>> +{
>> +  struct mcs_lock *next;
>> +  int locked;
>> +};
>> +
>> +typedef struct mcs_lock mcs_lock_t;
>>     /* Common definition of pthread_cond_t. */
>>   diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
>> index df049ab..4b4b80a 100644
>> --- a/sysdeps/nptl/pthread.h
>> +++ b/sysdeps/nptl/pthread.h
>> @@ -45,7 +45,8 @@ enum
>>     PTHREAD_MUTEX_TIMED_NP,
>>     PTHREAD_MUTEX_RECURSIVE_NP,
>>     PTHREAD_MUTEX_ERRORCHECK_NP,
>> -  PTHREAD_MUTEX_ADAPTIVE_NP
>> +  PTHREAD_MUTEX_ADAPTIVE_NP,
>> +  PTHREAD_MUTEX_QUEUESPINNER_NP
>>   #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
>>     ,
>>     PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
>> @@ -85,14 +86,16 @@ enum
>>     #if __PTHREAD_MUTEX_HAVE_PREV
>>   # define PTHREAD_MUTEX_INITIALIZER \
>> -  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } }
>> +  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } }
>>   # ifdef __USE_GNU
>>   #  define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
>> -  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
>> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>>   #  define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
>> -  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } }
>> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>>   #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
>> -  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
>> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>> +#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
>> +  { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
>>     # endif
>>   #else
>> @@ -105,6 +108,8 @@ enum
>>     { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } }
>>   #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
>>     { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } }
>> +#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
>> +  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } }
>>     # endif
>>   #endif
>> diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h
>> index 11a024d..57c101c 100644
>> --- a/sysdeps/unix/sysv/linux/hppa/pthread.h
>> +++ b/sysdeps/unix/sysv/linux/hppa/pthread.h
>> @@ -46,6 +46,7 @@ enum
>>     PTHREAD_MUTEX_RECURSIVE_NP,
>>     PTHREAD_MUTEX_ERRORCHECK_NP,
>>     PTHREAD_MUTEX_ADAPTIVE_NP
>> +  PTHREAD_MUTEX_QUEUESPINNER_NP,
>>   #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
>>     ,
>>     PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
>> @@ -95,6 +96,9 @@ enum
>>   # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
>>     { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \
>>         { __PTHREAD_SPINS }, { 0, 0 } } }
>> +# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
>> +  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \
>> +      { __PTHREAD_SPINS }, { 0, 0 } } }
>>   #endif
>>    
>

Patch

diff --git a/NEWS b/NEWS
index ae80818..ff58d15 100644
--- a/NEWS
+++ b/NEWS
@@ -46,6 +46,12 @@  Major new features:
   incosistent mutex state after fork call in multithread environment.
   In both popen and system there is no direct access to user-defined mutexes.
 
+* The mutex with PTHREAD_MUTEX_QUEUESPINNER_NP GNU extension has been added,
+  which accelerates lock acquisition by queuing spinners. This type of
+  mutex does not support PTHREAD_MUTEXATTR_FLAG_ROBUST attribution and it
+  has a potential risk to collapse lock performance when CPUs are heavily
+  oversubscribed.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * The glibc.tune tunable namespace has been renamed to glibc.cpu and the
diff --git a/nptl/Makefile b/nptl/Makefile
index 34ae830..997da5e 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -145,7 +145,8 @@  libpthread-routines = nptl-init nptlfreeres vars events version pt-interp \
 		      mtx_destroy mtx_init mtx_lock mtx_timedlock \
 		      mtx_trylock mtx_unlock call_once cnd_broadcast \
 		      cnd_destroy cnd_init cnd_signal cnd_timedwait cnd_wait \
-		      tss_create tss_delete tss_get tss_set pthread_mutex_conf
+		      tss_create tss_delete tss_get tss_set pthread_mutex_conf \
+              mcs_lock
 #		      pthread_setuid pthread_seteuid pthread_setreuid \
 #		      pthread_setresuid \
 #		      pthread_setgid pthread_setegid pthread_setregid \
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 04e3f08..9f47129 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -749,7 +749,7 @@  allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
      might have happened in the kernel.  */
   pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				  - offsetof (pthread_mutex_t,
-					      __data.__list.__next));
+					      __data.__list.__list_t.__next));
   pd->robust_head.list_op_pending = NULL;
 #if __PTHREAD_MUTEX_HAVE_PREV
   pd->robust_prev = &pd->robust_head;
diff --git a/nptl/descr.h b/nptl/descr.h
index 9c01e1b..dc24dd8 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -184,38 +184,38 @@  struct pthread
      FIXME We should use relaxed MO atomic operations here and signal fences
      because this kind of concurrency is similar to synchronizing with a
      signal handler.  */
-# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next))
+# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next))
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
       ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul)   \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = (void *) &mutex->__data.__list.__next;		      \
-    mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF,		      \
+    next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next;		      \
+    mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF,		      \
 						 robust_head.list);	      \
-    mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head;	      \
+    mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head;	      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
     THREAD_SETMEM (THREAD_SELF, robust_head.list,			      \
-		   (void *) (((uintptr_t) &mutex->__data.__list.__next)	      \
+		   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)	      \
 			     | val));					      \
   } while (0)
 # define DEQUEUE_MUTEX(mutex) \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = mutex->__data.__list.__prev;				      \
+    next->__list_t.__prev = mutex->__data.__list.__list_t.__prev;				      \
     __pthread_list_t *prev = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    prev->__next = mutex->__data.__list.__next;				      \
+    prev->__list_t.__next = mutex->__data.__list.__list_t.__next;				      \
     /* Ensure that we remove the entry from the list before we change the     \
        __next pointer of the entry, which is read by the kernel.  */	      \
     __asm ("" ::: "memory");						      \
-    mutex->__data.__list.__prev = NULL;					      \
-    mutex->__data.__list.__next = NULL;					      \
+    mutex->__data.__list.__list_t.__prev = NULL;					      \
+    mutex->__data.__list.__list_t.__next = NULL;					      \
   } while (0)
 #else
   union
@@ -226,7 +226,7 @@  struct pthread
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
-    mutex->__data.__list.__next						      \
+    mutex->__data.__list.__list_t.__next						      \
       = THREAD_GETMEM (THREAD_SELF, robust_list.__next);		      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
@@ -253,7 +253,7 @@  struct pthread
 	/* Ensure that we remove the entry from the list before we change the \
 	   __next pointer of the entry, which is read by the kernel.  */      \
 	    __asm ("" ::: "memory");					      \
-	mutex->__data.__list.__next = NULL;				      \
+	mutex->__data.__list.__list_t.__next = NULL;				      \
       }									      \
   } while (0)
 #endif
diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
new file mode 100644
index 0000000..43f90dc
--- /dev/null
+++ b/nptl/mcs_lock.c
@@ -0,0 +1,72 @@ 
+/* MCS-style lock for queuing spinners
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <atomic.h>
+
+static __thread mcs_lock_t node = {
+  NULL,
+  0,
+};
+
+void mcs_lock (mcs_lock_t **lock)
+{
+  mcs_lock_t *prev;
+
+  /* Initalize node.  */
+  node.next = NULL;
+  node.locked = 0;
+
+  prev = atomic_exchange_acquire(lock, &node);
+
+  /* No spinners waiting in the queue, lock is acquired immediately.  */
+  if (prev == NULL)
+    {
+      node.locked = 1;
+      return;
+    }
+
+  /* Add current spinner into the queue.  */
+  atomic_store_release (&prev->next, &node);
+  atomic_full_barrier ();
+  /* Waiting until waken up by the previous spinner.  */
+  while (!atomic_load_relaxed (&node.locked))
+    atomic_spin_nop ();
+}
+
+void mcs_unlock (mcs_lock_t **lock)
+{
+  mcs_lock_t *next = node.next;
+
+  if (next == NULL)
+    {
+      /* Check the tail of the queue:
+         a) Release the lock and return if current node is the tail.  */
+      if (atomic_compare_and_exchange_val_acq(lock, NULL, &node) == &node)
+        return;
+
+      /* b) Waiting until new node is added to the queue if current node is
+         not the tail (lock != node).  */
+      while (! (next = atomic_load_relaxed (&node.next)))
+        atomic_spin_nop ();
+    }
+
+  /* Wake up the next spinner.  */
+  atomic_store_release (&next->locked, 1);
+  atomic_full_barrier ();
+}
diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h
new file mode 100644
index 0000000..5ee32f5
--- /dev/null
+++ b/nptl/mcs_lock.h
@@ -0,0 +1,23 @@ 
+/* MCS-style lock for queuing spinners
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+extern void mcs_lock (mcs_lock_t **lock)
+  __attribute__ ((visibility ("hidden")));
+
+extern void mcs_unlock (mcs_lock_t **lock)
+  __attribute__ ((visibility ("hidden")));
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index 20ff3fd..d20481a 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -289,7 +289,7 @@  __pthread_initialize_minimal_internal (void)
 #ifdef __NR_set_robust_list
     pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				    - offsetof (pthread_mutex_t,
-						__data.__list.__next));
+						__data.__list.__list_t.__next));
     INTERNAL_SYSCALL_DECL (err);
     int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head,
 				sizeof (struct robust_list_head));
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 7f16ba9..1179864 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -67,7 +67,7 @@  static inline short max_adaptive_count (void)
 /* Internal mutex type value.  */
 enum
 {
-  PTHREAD_MUTEX_KIND_MASK_NP = 3,
+  PTHREAD_MUTEX_KIND_MASK_NP = 7,
 
   PTHREAD_MUTEX_ELISION_NP    = 256,
   PTHREAD_MUTEX_NO_ELISION_NP = 512,
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index 5cf290c..99f1707 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -111,6 +111,11 @@  __pthread_mutex_init (pthread_mutex_t *mutex,
 	return ENOTSUP;
 #endif
 
+      /* Robust mutex does not support the PTHREAD_MUTEX_QUEUESPINNER_NP
+         GNU extension.  */
+      if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0)
+        return ENOTSUP;
+
       mutex_kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP;
     }
 
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 474b4df..7b81470 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <mcs_lock.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -118,6 +119,35 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
       mutex->__data.__count = 1;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			 == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    {
+      if (! __is_smp)
+        goto simple;
+
+      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (max_adaptive_count (),
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+          LLL_MUTEX_LOCK (mutex);
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      assert (mutex->__data.__owner == 0);
+    }
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			  == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
       if (! __is_smp)
@@ -179,7 +209,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -365,7 +395,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 453b824..edf0415 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,6 +25,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
+#include <mcs_lock.h>
 
 #include <stap-probe.h>
 
@@ -135,13 +136,37 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       break;
-
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
+      if (! __is_smp)
+        goto simple;
+
+      if (lll_trylock (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (max_adaptive_count (),
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+          result = lll_timedlock(mutex->__data.__lock, abstime,
+                      PTHREAD_MUTEX_PSHARED (mutex));
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      break;
     case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -353,7 +378,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
index fa90c1d..a22de5e 100644
--- a/nptl/pthread_mutex_trylock.c
+++ b/nptl/pthread_mutex_trylock.c
@@ -78,6 +78,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
       FORCE_ELISION (mutex, goto elision);
       /*FALL THROUGH*/
     case PTHREAD_MUTEX_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_ERRORCHECK_NP:
       if (lll_trylock (mutex->__data.__lock) != 0)
 	break;
@@ -93,7 +94,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
 
       oldval = mutex->__data.__lock;
       do
@@ -213,7 +214,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	if (robust)
 	  /* Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 
 	oldval = mutex->__data.__lock;
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index 68d04d5..c3e8ef4 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -78,6 +78,9 @@  __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
       goto normal;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			      == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    goto normal;
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			      == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     goto normal;
   else
@@ -142,7 +145,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
     robust:
       /* Remove mutex from the list.  */
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We must set op_pending before we dequeue the mutex.  Also see
 	 comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -242,7 +245,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 	  /* Remove mutex from the list.
 	     Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 	  /* We must set op_pending before we dequeue the mutex.  Also see
 	     comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c
index 7d36cc3..c2382b4 100644
--- a/nptl/pthread_mutexattr_settype.c
+++ b/nptl/pthread_mutexattr_settype.c
@@ -25,7 +25,7 @@  __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind)
 {
   struct pthread_mutexattr *iattr;
 
-  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP)
+  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP)
     return EINVAL;
 
   /* Cannot distinguish between DEFAULT and NORMAL. So any settype
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 05c94e7..1cf8874 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -79,15 +79,19 @@ 
 /* Common definition of pthread_mutex_t. */
 
 #if !__PTHREAD_MUTEX_USE_UNION
-typedef struct __pthread_internal_list
+typedef union __pthread_internal_list
 {
-  struct __pthread_internal_list *__prev;
-  struct __pthread_internal_list *__next;
+  struct {
+    union __pthread_internal_list *__prev;
+    union __pthread_internal_list *__next;
+  }__list_t;
+  void *mcs_lock;
 } __pthread_list_t;
 #else
-typedef struct __pthread_internal_slist
+typedef union __pthread_internal_slist
 {
-  struct __pthread_internal_slist *__next;
+  union __pthread_internal_slist *__next;
+  void *mcs_lock;
 } __pthread_slist_t;
 #endif
 
@@ -165,6 +169,13 @@  struct __pthread_mutex_s
   __PTHREAD_COMPAT_PADDING_END
 };
 
+struct mcs_lock
+{
+  struct mcs_lock *next;
+  int locked;
+};
+
+typedef struct mcs_lock mcs_lock_t;
 
 /* Common definition of pthread_cond_t. */
 
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index df049ab..4b4b80a 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -45,7 +45,8 @@  enum
   PTHREAD_MUTEX_TIMED_NP,
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
-  PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_QUEUESPINNER_NP
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -85,14 +86,16 @@  enum
 
 #if __PTHREAD_MUTEX_HAVE_PREV
 # define PTHREAD_MUTEX_INITIALIZER \
-  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } }
 # ifdef __USE_GNU
 #  define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 
 # endif
 #else
@@ -105,6 +108,8 @@  enum
   { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } }
 
 # endif
 #endif
diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h
index 11a024d..57c101c 100644
--- a/sysdeps/unix/sysv/linux/hppa/pthread.h
+++ b/sysdeps/unix/sysv/linux/hppa/pthread.h
@@ -46,6 +46,7 @@  enum
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
   PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_QUEUESPINNER_NP,
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -95,6 +96,9 @@  enum
 # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \
       { __PTHREAD_SPINS }, { 0, 0 } } }
+# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \
+      { __PTHREAD_SPINS }, { 0, 0 } } }
 #endif