From patchwork Mon Jul 2 08:11:53 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: kemi X-Patchwork-Id: 937722 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=sourceware.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=libc-alpha-return-93893-incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="EciUZ+uE"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 41K0VK72lkz9s3q for ; Mon, 2 Jul 2018 18:17:29 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; q=dns; s=default; b=HQTnN/gzhBj42oVM0e1wDIQHVR4ct0/ sHw6OQAPIMo9mIf+M3LUhYaBHJ9JzyQ6/qW9TrGoqA2Wa2RburvH2mwoqulfRiVw ap465LyW0MQmPAByxmq7nAoIHvIyygdawBQJRZv/c1VzHvlxT27zpE4Xi/UVWZTl D6ybcd2g6TJc= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; s=default; bh=Wq8PvXd4XpkthxiXbzd/lLbust4=; b=EciUZ +uEdi5d1pnpReknF2IlRvJ0JjKR6AY49FhvC9Idn4NIX0YOAkeU2zaOUiveuZBJq QG3iFzDijfz3Uyn1uVlolNLFOyilTR06WwwyIg1IjQ9+/gj5XdJX5aA8/FLZyXaV QXfN3TXVgKY/18YaaN1Xw52Zx+LmwBREPDKI2o= Received: (qmail 24181 invoked by alias); 2 Jul 2018 08:17:04 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 23656 invoked by uid 89); 2 Jul 2018 08:16:14 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-25.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, TIME_LIMIT_EXCEEDED autolearn=unavailable version=3.3.2 spammy=Waiting, Head, 0.3, arrival X-HELO: mga09.intel.com From: Kemi Wang To: Adhemerval Zanella , Florian Weimer , Rical Jason , Carlos Donell , Glibc alpha Cc: Dave Hansen , Tim Chen , Andi Kleen , Ying Huang , Aaron Lu , Lu Aubrey , Kemi Wang Subject: [PATCH 1/4] Mutex: Queue spinners to reduce cache line bouncing and ensure fairness Date: Mon, 2 Jul 2018 16:11:53 +0800 Message-Id: <1530519116-13103-2-git-send-email-kemi.wang@intel.com> In-Reply-To: <1530519116-13103-1-git-send-email-kemi.wang@intel.com> References: <1530519116-13103-1-git-send-email-kemi.wang@intel.com> There are two main problems in current adaptive mutex. The first one is fairness, multiple spinners contend for the lock simultaneously and there is no any guarantee for a spinner to acquire the lock no matter how long it has been waiting for. The other is the heavy cache line bouncing. Since the cache line including the mutex is shared among all of the spinners, when lock is released, each spinner will try to acquire the lock via cmpxchg instruction which constantly floods the system via "read-for-ownership" requests. As a result, there will be a lot of cache line bouncing in a large system with a lots of CPUs. This patch introduces a new type of mutex called PTHREAD_MUTEX_QUEUESPINNER_NP which puts mutex spinners into a queue before spinning on the mutex lock and only allows the first spinner to spin on the mutex lock. This reduces the overhead of cache line bouncing when lock holder is transferred and allow the tasks to move forward faster, because there is only one spinner and the cache line of mutex lock is contended between lock holder and current spinner. At the same time, lock fairness is also guaranteed in some degree. However, there may be a potential issue in this proposal if CPUs are oversubscribed. When the lock holder is transferred to the next spinner in the queue, but it may not be running at that moment (CPU is scheduled to run other task), the lock performance would collapse (see more details at the end of commit log), this is why we introduced a new type of mutex rather than do the optimization based-on the existed mutex discipline. The implementation of queuing spinner mutex is based-on MCS lock and an additional pointer is required to hold MCS lock. To keep the size of mutex data structure consistent and maintain the user space ABI unchanged, the __list field which is originally for implementing robust futex will be reused for that purpose. Therefore, the queue spinner mutex with robust futex would not be supported. Pass the ABI compatibility test by running "make check-abi" Test machine: 2-sockets Skylake platform, 112 cores with 62G RAM Test case: mutex-adaptive-thread/mutex-queuespinner-thread Usage: make bench BENCHSET="mutex-adaptive-thread mutex-queuespinner-thread" Test result: +----------------+-----------------+-----------------+------------+ | Configuration | Base | Head | % Change | | | Total iteration | Total iteration | base->head | +----------------+-----------------+-----------------+------------+ | | Critical section size: 1x | +----------------+-----------------+-----------------+------------+ | 1 thread | 9304170 | 9318160 | 0.2% | +----------------+-----------------+-----------------+------------+ | 2 threads | 14718000 | 14947600 | 1.6% | +----------------+-----------------+-----------------+------------+ | 3 threads | 21436800 | 20249800 | -5.5% | +----------------+-----------------+-----------------+------------+ | 4 threads | 16657600 | 15656500 | -6.0% | +----------------+-----------------+-----------------+------------+ | 28 threads | 4020620 | 14757000 | 267.0% | +----------------+-----------------+-----------------+------------+ | 56 threads | 3489400 | 8996000 | 157.8% | +----------------+-----------------+-----------------+------------+ | 112 threads | 3102040 | 9106490 | 193.6% | +----------------+-----------------+-----------------+------------+ | | Critical section size: 10x | +----------------+-----------------+-----------------+------------+ | 1 thread | 5226360 | 5228880 | 0.0% | +----------------+-----------------+-----------------+------------+ | 2 threads | 6875240 | 7016720 | 2.1% | +----------------+-----------------+-----------------+------------+ | 3 threads | 6323230 | 6053060 | -4.3% | +----------------+-----------------+-----------------+------------+ | 4 threads | 6215860 | 6388180 | 2.8% | +----------------+-----------------+-----------------+------------+ | 28 threads | 3921620 | 5249650 | 33.9% | +----------------+-----------------+-----------------+------------+ | 56 threads | 2855460 | 4308940 | 50.9% | +----------------+-----------------+-----------------+------------+ | 112 threads | 2572420 | 4166650 | 62.0% | +----------------+-----------------+-----------------+------------+ | | Critical section size: 100x | +----------------+-----------------+-----------------+------------+ | 1 thread | 968946 | 969081 | 0.0% | +----------------+-----------------+-----------------+------------+ | 2 threads | 772844 | 776187 | 0.4% | +----------------+-----------------+-----------------+------------+ | 3 threads | 808041 | 812314 | 0.5% | +----------------+-----------------+-----------------+------------+ | 4 threads | 802213 | 794792 | -0.9% | +----------------+-----------------+-----------------+------------+ | 28 threads | 338170 | 339024 | 0.3% | +----------------+-----------------+-----------------+------------+ | 56 threads | 339900 | 339932 | 0.0% | +----------------+-----------------+-----------------+------------+ | 112 threads | 331791 | 335243 | 1.0% | +----------------+-----------------+-----------------+------------+ | | Critical section size: 1000x | +----------------+-----------------+-----------------+------------+ | 1 thread | 106082 | 106102 | 0.0% | +----------------+-----------------+-----------------+------------+ | 2 threads | 100833 | 100823 | -0.0% | +----------------+-----------------+-----------------+------------+ | 3 threads | 100965 | 100842 | -0.1% | +----------------+-----------------+-----------------+------------+ | 4 threads | 96813 | 96846 | 0.0% | +----------------+-----------------+-----------------+------------+ | 28 threads | 52230 | 52024 | -0.4% | +----------------+-----------------+-----------------+------------+ | 56 threads | 48298 | 46427 | -3.9% | +----------------+-----------------+-----------------+------------+ | 112 threads | 45865 | 44405 | -3.2% | +----------------+-----------------+-----------------+------------+ Though, the queue spinner mutex performs better than adaptive mutex, people probably ask why we need this new type of mutex since we have already had pthread spin lock and pthread mutex. Therefore, we designed several test cases below and explore how they perform. Let's assume the size of critical section is represented by *s*, the size of non-critical section is represented by *t*, and let t = k*s. Then, on a single thread, the arrival rate at which a single core will try to acquire the lock, in the absence of contention, is 1/(k+1). We also assume there are *n* threads contending for a lock, each thread binds to an individual CPU core, and does the following: 1) lock 2) spend *s* nanoseconds in the critical section 3) unlock 4) spend *t* nanoseconds in the non-critical section in a loop until 5 seconds, and the lock performance is measured by the total throughput. To emulate different usage scenarios, we let k=6, s=100ns and run this workload using different lock disciplines. Then increase *s* and repeat. In our workload, 4 threads contending for a lock emulates little lock contention, and 28 threads contending for a lock emulates severe lock contention within a socket, and 56 threads contending for a lock emulates severe lock contention across sockets. The benchmark is provided by Andi Kleen. +-------+-------------+--------------+----------------+---------------+ | Num | Spin Lock | Normal Mutex | Adaptive Mutex | Queue Spinner | +-------+-------------+--------------+------------- --+---------------+ | | s=100ns t=600ns | +-------+-------------+--------------+----------------+---------------+ | 4 | 12117662 | 7124320 | 10372184 | 9557689 | +-------+-------------+--------------+----------------+---------------+ | 28 | 2695783 | 6385815 | 3927942 | 7182092 | +-------+-------------+--------------+----------------+---------------+ | 56 | 2203519 | 4555164 | 3143599 | 4690016 | +-------+-------------+--------------+----------------+---------------+ | | s=1000ns t=6000ns | +-------+-------------+--------------+----------------+---------------+ | 4 | 1529542 | 1380643 | 1495118 | 1503344 | +-------+-------------+--------------+----------------+---------------+ | 28 | 2063929 | 1695128 | 2064940 | 2205245 | +-------+-------------+--------------+----------------+---------------+ | 56 | 1507764 | 1427931 | 1704105 | 1720832 | +-------+-------------+--------------+----------------+---------------+ | | s=10000ns t=60000ns | +-------+-------------+--------------+----------------+---------------+ | 4 | 159407 | 159213 | 159213 | 159215 | +-------+-------------+--------------+----------------+---------------+ | 28 | 272062 | 153567 | 223229 | 224948 | +-------+-------------+--------------+----------------+---------------+ | 56 | 269920 | 157287 | 239814 | 239887 | +-------+-------------+--------------+----------------+---------------+ | | s=100000ns t=600000ns | +-------+-------------+--------------+----------------+---------------+ | 4 | 16024 | 16023 | 16021 | 16021 | +-------+-------------+--------------+----------------+---------------+ | 28 | 27990 | 20421 | 20372 | 20378 | +-------+-------------+--------------+----------------+---------------+ | 56 | 27987 | 20395 | 20322 | 20348 | +-------+-------------+--------------+----------------+---------------+ | | s=1000000ns t=6000000ns | +-------+-------------+--------------+----------------+---------------+ | 4 | 1604 | 1604 | 1604 | 1604 | +-------+-------------+--------------+----------------+---------------+ | 28 | 2826 | 2748 | 2748 | 2748 | +-------+-------------+--------------+----------------+---------------+ | 56 | 2853 | 2773 | 2774 | 2773 | +-------+-------------+--------------+----------------+---------------+ Generally, we can get some conclusion from the rest result above: a) In case of little lock contention, spin lock performs best, queue spinner mutex performs similar to adaptive mutex, and both perform a little better than pthread mutex; b) In the case of severe lock contention with large number of CPUs when protecting a small critical section (less than 1000ns). Most of lock acquisition is got via spinning. Queue spinner mutex performs much better than spin lock and adaptive mutex. This is because the overhead of heavy cache line bouncing plays a big role on lock performance. c) With the increase of the size of a critical section, the advantage of queue spinner mutex on performance in reduced gradually. This is because the overhead of cache line bouncing will not become the bottleneck of lock performance, instead, the overhead of futex_wait and futex_wake plays a big role. When the critical section is increased to 1ms, even the latency of futex syscall would be ignorable compared to the total time of lock acquisition. As we can see above, queue spinner mutex performs well in kinds of workload, but there would be a potential risk to use this type of mutex. When the lock holder is transformed to the next spinner in the queue, but it is not running (the CPU is scheduled to run other task). And, the other spinners have to wait in the queue, this would probably collapse the lock performance. To emulate this case, we run two same processes simultaneously, there are 28 threads running in parallel in the process, each thread sets CPU affinity to an individual CPU according to the thread id. Thus, CPU [0~27] are subscribed by two threads at the same time. We run this test with different workload mentioned above, in the worst case (s=1000ns, t=6000ns), the lock performance is reduced by 58.1% (2205245->924263). Therefore, queue spinner mutex would be carefully used for applications to pursue fairness and performance without oversubscribing CPU resource. E.g. Run a application within a container in public cloud infrastructure. May to do list: a) Tuning the threshold of spin count At last, I would like to extend my appreciation sincerely to Andi Kleen for his guidance and support during the development. Signed-off-by: Kemi Wang --- nptl/Makefile | 2 +- nptl/allocatestack.c | 2 +- nptl/descr.h | 26 ++++++------- nptl/mcs_lock.c | 68 +++++++++++++++++++++++++++++++++ nptl/mcs_lock.h | 21 ++++++++++ nptl/nptl-init.c | 2 +- nptl/pthreadP.h | 2 +- nptl/pthread_mutex_init.c | 3 +- nptl/pthread_mutex_lock.c | 35 ++++++++++++++++- nptl/pthread_mutex_timedlock.c | 35 +++++++++++++++-- nptl/pthread_mutex_trylock.c | 5 ++- nptl/pthread_mutex_unlock.c | 7 +++- nptl/pthread_mutexattr_settype.c | 2 +- sysdeps/nptl/bits/thread-shared-types.h | 21 +++++++--- sysdeps/nptl/pthread.h | 15 +++++--- sysdeps/unix/sysv/linux/hppa/pthread.h | 4 ++ 16 files changed, 212 insertions(+), 38 deletions(-) create mode 100644 nptl/mcs_lock.c create mode 100644 nptl/mcs_lock.h diff --git a/nptl/Makefile b/nptl/Makefile index bd1096f..7559907 100644 --- a/nptl/Makefile +++ b/nptl/Makefile @@ -140,7 +140,7 @@ libpthread-routines = nptl-init vars events version pt-interp \ pthread_mutex_setprioceiling \ pthread_setname pthread_getname \ pthread_setattr_default_np pthread_getattr_default_np \ - pthread_mutex_conf + pthread_mutex_conf mcs_lock # pthread_setuid pthread_seteuid pthread_setreuid \ # pthread_setresuid \ # pthread_setgid pthread_setegid pthread_setregid \ diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c index 9c10b99..cb0fad7 100644 --- a/nptl/allocatestack.c +++ b/nptl/allocatestack.c @@ -743,7 +743,7 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp, might have happened in the kernel. */ pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock) - offsetof (pthread_mutex_t, - __data.__list.__next)); + __data.__list.__list_t.__next)); pd->robust_head.list_op_pending = NULL; #if __PTHREAD_MUTEX_HAVE_PREV pd->robust_prev = &pd->robust_head; diff --git a/nptl/descr.h b/nptl/descr.h index 0a0abb4..ddad47c 100644 --- a/nptl/descr.h +++ b/nptl/descr.h @@ -184,38 +184,38 @@ struct pthread FIXME We should use relaxed MO atomic operations here and signal fences because this kind of concurrency is similar to synchronizing with a signal handler. */ -# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next)) +# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next)) # define ENQUEUE_MUTEX_BOTH(mutex, val) \ do { \ __pthread_list_t *next = (__pthread_list_t *) \ ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul) \ - QUEUE_PTR_ADJUST); \ - next->__prev = (void *) &mutex->__data.__list.__next; \ - mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF, \ + next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next; \ + mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF, \ robust_head.list); \ - mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head; \ + mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head; \ /* Ensure that the new list entry is ready before we insert it. */ \ __asm ("" ::: "memory"); \ THREAD_SETMEM (THREAD_SELF, robust_head.list, \ - (void *) (((uintptr_t) &mutex->__data.__list.__next) \ + (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next) \ | val)); \ } while (0) # define DEQUEUE_MUTEX(mutex) \ do { \ __pthread_list_t *next = (__pthread_list_t *) \ - ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul) \ + ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul) \ - QUEUE_PTR_ADJUST); \ - next->__prev = mutex->__data.__list.__prev; \ + next->__list_t.__prev = mutex->__data.__list.__list_t.__prev; \ __pthread_list_t *prev = (__pthread_list_t *) \ - ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul) \ + ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul) \ - QUEUE_PTR_ADJUST); \ - prev->__next = mutex->__data.__list.__next; \ + prev->__list_t.__next = mutex->__data.__list.__list_t.__next; \ /* Ensure that we remove the entry from the list before we change the \ __next pointer of the entry, which is read by the kernel. */ \ __asm ("" ::: "memory"); \ - mutex->__data.__list.__prev = NULL; \ - mutex->__data.__list.__next = NULL; \ + mutex->__data.__list.__list_t.__prev = NULL; \ + mutex->__data.__list.__list_t.__next = NULL; \ } while (0) #else union @@ -226,7 +226,7 @@ struct pthread # define ENQUEUE_MUTEX_BOTH(mutex, val) \ do { \ - mutex->__data.__list.__next \ + mutex->__data.__list.__list_t.__next \ = THREAD_GETMEM (THREAD_SELF, robust_list.__next); \ /* Ensure that the new list entry is ready before we insert it. */ \ __asm ("" ::: "memory"); \ @@ -253,7 +253,7 @@ struct pthread /* Ensure that we remove the entry from the list before we change the \ __next pointer of the entry, which is read by the kernel. */ \ __asm ("" ::: "memory"); \ - mutex->__data.__list.__next = NULL; \ + mutex->__data.__list.__list_t.__next = NULL; \ } \ } while (0) #endif diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c new file mode 100644 index 0000000..21d20cf --- /dev/null +++ b/nptl/mcs_lock.c @@ -0,0 +1,68 @@ +/* Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + Contributed by Ulrich Drepper , 2002. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include "pthreadP.h" +#include + +void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node) +{ + mcs_lock_t *prev; + + /* Initalize node. */ + node->next = NULL; + node->locked = 0; + + prev = atomic_exchange_acquire(lock, node); + + /* No spinners waiting in the queue, lock is acquired immediately. */ + if (prev == NULL) + { + node->locked = 1; + return; + } + + /* Add current spinner into the queue. */ + atomic_store_release (&prev->next, node); + atomic_full_barrier (); + /* Waiting until waken up by the previous spinner. */ + while (!atomic_load_relaxed (&node->locked)) + atomic_spin_nop (); +} + +void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node) +{ + mcs_lock_t *next = node->next; + + if (next == NULL) + { + /* Check the tail of the queue: + * a) Release the lock and return if current node is the tail + * (lock == node). */ + if (atomic_compare_and_exchange_val_acq(lock, NULL, node) == node) + return; + + /* b) Waiting until new node is added to the queue if current node is + * not the tail (lock != node). */ + while (! (next = atomic_load_relaxed (&node->next))) + atomic_spin_nop (); + } + + /* Wake up the next spinner. */ + atomic_store_release (&next->locked, 1); + atomic_full_barrier (); +} diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h new file mode 100644 index 0000000..b779824 --- /dev/null +++ b/nptl/mcs_lock.h @@ -0,0 +1,21 @@ +/* Copyright (C) 2002-2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + Contributed by Ulrich Drepper , 2002. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node); + +void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node); diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c index 1d3790f..5e8643f 100644 --- a/nptl/nptl-init.c +++ b/nptl/nptl-init.c @@ -303,7 +303,7 @@ __pthread_initialize_minimal_internal (void) #ifdef __NR_set_robust_list pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock) - offsetof (pthread_mutex_t, - __data.__list.__next)); + __data.__list.__list_t.__next)); INTERNAL_SYSCALL_DECL (err); int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head, sizeof (struct robust_list_head)); diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h index 075530c..0dad1dc 100644 --- a/nptl/pthreadP.h +++ b/nptl/pthreadP.h @@ -62,7 +62,7 @@ /* Internal mutex type value. */ enum { - PTHREAD_MUTEX_KIND_MASK_NP = 3, + PTHREAD_MUTEX_KIND_MASK_NP = 7, PTHREAD_MUTEX_ELISION_NP = 256, PTHREAD_MUTEX_NO_ELISION_NP = 512, diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c index d8fe473..e682de3 100644 --- a/nptl/pthread_mutex_init.c +++ b/nptl/pthread_mutex_init.c @@ -110,6 +110,8 @@ __pthread_mutex_init (pthread_mutex_t *mutex, && __set_robust_list_avail < 0) return ENOTSUP; #endif + if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0) + return ENOTSUP; mutex->__data.__kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP; } @@ -146,7 +148,6 @@ __pthread_mutex_init (pthread_mutex_t *mutex, if ((imutexattr->mutexkind & (PTHREAD_MUTEXATTR_FLAG_PSHARED | PTHREAD_MUTEXATTR_FLAG_ROBUST)) != 0) mutex->__data.__kind |= PTHREAD_MUTEX_PSHARED_BIT; - /* Default values: mutex not used yet. */ // mutex->__count = 0; already done by memset // mutex->__owner = 0; already done by memset diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c index 26bcebf..5fe6038 100644 --- a/nptl/pthread_mutex_lock.c +++ b/nptl/pthread_mutex_lock.c @@ -26,6 +26,7 @@ #include #include #include +#include "mcs_lock.h" #ifndef lll_lock_elision #define lll_lock_elision(lock, try_lock, private) ({ \ @@ -116,6 +117,36 @@ __pthread_mutex_lock (pthread_mutex_t *mutex) mutex->__data.__count = 1; } else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex) + == PTHREAD_MUTEX_QUEUESPINNER_NP, 1)) + { + if (! __is_smp) + goto simple; + + if (LLL_MUTEX_TRYLOCK (mutex) != 0) + { + int cnt = 0; + int max_cnt = MIN (MAX_ADAPTIVE_COUNT, + mutex->__data.__spins * 2 + 10); + int val = 0; + mcs_lock_t node; + + mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node); + + do + { + atomic_spin_nop (); + val = atomic_load_relaxed (&mutex->__data.__lock); + } + while (val != 0 && ++cnt < max_cnt); + + mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node); + LLL_MUTEX_LOCK (mutex); + + mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8; + } + assert (mutex->__data.__owner == 0); + } + else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex) == PTHREAD_MUTEX_ADAPTIVE_NP, 1)) { if (! __is_smp) @@ -192,7 +223,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex) case PTHREAD_MUTEX_ROBUST_NORMAL_NP: case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP: THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - &mutex->__data.__list.__next); + &mutex->__data.__list.__list_t.__next); /* We need to set op_pending before starting the operation. Also see comments at ENQUEUE_MUTEX. */ __asm ("" ::: "memory"); @@ -372,7 +403,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex) { /* Note: robust PI futexes are signaled by setting bit 0. */ THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - (void *) (((uintptr_t) &mutex->__data.__list.__next) + (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next) | 1)); /* We need to set op_pending before starting the operation. Also see comments at ENQUEUE_MUTEX. */ diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c index 66efd39..07fa52a 100644 --- a/nptl/pthread_mutex_timedlock.c +++ b/nptl/pthread_mutex_timedlock.c @@ -25,6 +25,8 @@ #include #include #include +#include "mcs_lock.h" +#include "pthread_mutex_conf.h" #include @@ -133,13 +135,40 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex, mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8; } break; - + case PTHREAD_MUTEX_QUEUESPINNER_NP: + if (! __is_smp) + goto simple; + + if (lll_trylock (mutex) != 0) + { + int cnt = 0; + int max_cnt = MIN (MAX_ADAPTIVE_COUNT, + mutex->__data.__spins * 2 + 10); + int val = 0; + mcs_lock_t node; + + mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node); + + do + { + atomic_spin_nop (); + val = atomic_load_relaxed (&mutex->__data.__lock); + } + while (val != 0 && ++cnt < max_cnt); + + mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node); + result = lll_timedlock(mutex->__data.__lock, abstime, + PTHREAD_MUTEX_PSHARED (mutex)); + + mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8; + } + break; case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP: case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP: case PTHREAD_MUTEX_ROBUST_NORMAL_NP: case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP: THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - &mutex->__data.__list.__next); + &mutex->__data.__list.__list_t.__next); /* We need to set op_pending before starting the operation. Also see comments at ENQUEUE_MUTEX. */ __asm ("" ::: "memory"); @@ -345,7 +374,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex, { /* Note: robust PI futexes are signaled by setting bit 0. */ THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - (void *) (((uintptr_t) &mutex->__data.__list.__next) + (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next) | 1)); /* We need to set op_pending before starting the operation. Also see comments at ENQUEUE_MUTEX. */ diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c index 7de61f4..6979abd 100644 --- a/nptl/pthread_mutex_trylock.c +++ b/nptl/pthread_mutex_trylock.c @@ -76,6 +76,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex) FORCE_ELISION (mutex, goto elision); /*FALL THROUGH*/ case PTHREAD_MUTEX_ADAPTIVE_NP: + case PTHREAD_MUTEX_QUEUESPINNER_NP: case PTHREAD_MUTEX_ERRORCHECK_NP: if (lll_trylock (mutex->__data.__lock) != 0) break; @@ -91,7 +92,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex) case PTHREAD_MUTEX_ROBUST_NORMAL_NP: case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP: THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - &mutex->__data.__list.__next); + &mutex->__data.__list.__list_t.__next); oldval = mutex->__data.__lock; do @@ -205,7 +206,7 @@ __pthread_mutex_trylock (pthread_mutex_t *mutex) if (robust) /* Note: robust PI futexes are signaled by setting bit 0. */ THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - (void *) (((uintptr_t) &mutex->__data.__list.__next) + (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next) | 1)); oldval = mutex->__data.__lock; diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c index 9ea6294..5a5c511 100644 --- a/nptl/pthread_mutex_unlock.c +++ b/nptl/pthread_mutex_unlock.c @@ -76,6 +76,9 @@ __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr) goto normal; } else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex) + == PTHREAD_MUTEX_QUEUESPINNER_NP, 1)) + goto normal; + else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex) == PTHREAD_MUTEX_ADAPTIVE_NP, 1)) goto normal; else @@ -140,7 +143,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr) robust: /* Remove mutex from the list. */ THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - &mutex->__data.__list.__next); + &mutex->__data.__list.__list_t.__next); /* We must set op_pending before we dequeue the mutex. Also see comments at ENQUEUE_MUTEX. */ __asm ("" ::: "memory"); @@ -234,7 +237,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr) /* Remove mutex from the list. Note: robust PI futexes are signaled by setting bit 0. */ THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, - (void *) (((uintptr_t) &mutex->__data.__list.__next) + (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next) | 1)); /* We must set op_pending before we dequeue the mutex. Also see comments at ENQUEUE_MUTEX. */ diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c index 7d36cc3..c2382b4 100644 --- a/nptl/pthread_mutexattr_settype.c +++ b/nptl/pthread_mutexattr_settype.c @@ -25,7 +25,7 @@ __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind) { struct pthread_mutexattr *iattr; - if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP) + if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP) return EINVAL; /* Cannot distinguish between DEFAULT and NORMAL. So any settype diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h index 1e2092a..9d3c4de 100644 --- a/sysdeps/nptl/bits/thread-shared-types.h +++ b/sysdeps/nptl/bits/thread-shared-types.h @@ -79,15 +79,19 @@ /* Common definition of pthread_mutex_t. */ #if !__PTHREAD_MUTEX_USE_UNION -typedef struct __pthread_internal_list +typedef union __pthread_internal_list { - struct __pthread_internal_list *__prev; - struct __pthread_internal_list *__next; + struct { + union __pthread_internal_list *__prev; + union __pthread_internal_list *__next; + }__list_t; + void *mcs_lock; } __pthread_list_t; #else -typedef struct __pthread_internal_slist +typedef union __pthread_internal_slist { - struct __pthread_internal_slist *__next; + union __pthread_internal_slist *__next; + void *mcs_lock; } __pthread_slist_t; #endif @@ -145,6 +149,13 @@ struct __pthread_mutex_s __PTHREAD_COMPAT_PADDING_END }; +struct mcs_lock +{ + struct mcs_lock *next; + int locked; +}; + +typedef struct mcs_lock mcs_lock_t; /* Common definition of pthread_cond_t. */ diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h index df049ab..4b4b80a 100644 --- a/sysdeps/nptl/pthread.h +++ b/sysdeps/nptl/pthread.h @@ -45,7 +45,8 @@ enum PTHREAD_MUTEX_TIMED_NP, PTHREAD_MUTEX_RECURSIVE_NP, PTHREAD_MUTEX_ERRORCHECK_NP, - PTHREAD_MUTEX_ADAPTIVE_NP + PTHREAD_MUTEX_ADAPTIVE_NP, + PTHREAD_MUTEX_QUEUESPINNER_NP #if defined __USE_UNIX98 || defined __USE_XOPEN2K8 , PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP, @@ -85,14 +86,16 @@ enum #if __PTHREAD_MUTEX_HAVE_PREV # define PTHREAD_MUTEX_INITIALIZER \ - { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } } + { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } } # ifdef __USE_GNU # define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \ - { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } } + { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } } # define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \ - { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } } + { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } } # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \ - { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } } + { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } } +# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \ + { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } } # endif #else @@ -105,6 +108,8 @@ enum { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } } # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \ { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } } +# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \ + { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } } # endif #endif diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h index 11a024d..57c101c 100644 --- a/sysdeps/unix/sysv/linux/hppa/pthread.h +++ b/sysdeps/unix/sysv/linux/hppa/pthread.h @@ -46,6 +46,7 @@ enum PTHREAD_MUTEX_RECURSIVE_NP, PTHREAD_MUTEX_ERRORCHECK_NP, PTHREAD_MUTEX_ADAPTIVE_NP + PTHREAD_MUTEX_QUEUESPINNER_NP, #if defined __USE_UNIX98 || defined __USE_XOPEN2K8 , PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP, @@ -95,6 +96,9 @@ enum # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \ { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \ { __PTHREAD_SPINS }, { 0, 0 } } } +# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \ + { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \ + { __PTHREAD_SPINS }, { 0, 0 } } } #endif