From patchwork Mon Jul 2 08:11:52 2018
X-Patchwork-Submitter: kemi
X-Patchwork-Id: 937719
From: Kemi Wang
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey, Kemi Wang
Subject: [RFC 0/4] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
Date: Mon, 2 Jul 2018 16:11:52 +0800
Message-Id: <1530519116-13103-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive mutex is designed around the observation that, in most
cases, a lock is released after only a few CPU cycles. This is especially
true for well-behaved application code with a short, highly contended
critical section. In that situation, spinning on the lock for a while,
instead of immediately calling into the kernel to block, helps to improve
system performance.
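For reference, the adaptive spin in current glibc looks roughly like the
following. This is a simplified sketch of the PTHREAD_MUTEX_ADAPTIVE_NP
path in nptl/pthread_mutex_lock.c, not the exact code:

/* Sketch of the existing adaptive spin (simplified).  Each trylock is a
   cmpxchg on the shared lock word, which is what generates the
   read-for-ownership traffic discussed below.  */
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
  {
    int cnt = 0;
    int max_cnt = MIN (max_adaptive_count (),
                       mutex->__data.__spins * 2 + 10);
    do
      {
        if (cnt++ >= max_cnt)
          {
            /* Spun long enough; give up and block via futex.  */
            LLL_MUTEX_LOCK (mutex);
            break;
          }
        atomic_spin_nop ();
      }
    while (LLL_MUTEX_TRYLOCK (mutex) != 0);

    /* Adapt the next spin budget to how long we actually spun.  */
    mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
  }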
But the current adaptive mutex has two main problems. The first is
fairness: multiple spinners contend for the lock simultaneously, and there
is no guarantee that a spinner will ever acquire the lock, no matter how
long it has been waiting. The other is heavy cache line bouncing: the
cache line containing the lock is shared among all of the spinners, and
when the lock is released, every spinner tries to acquire it via a cmpxchg
instruction, constantly flooding the system with read-for-ownership
requests. As a result, there is a lot of cache line bouncing on a large
system with many CPUs.

To address these problems, this series queues mutex spinners with an MCS
lock, following the mutex implementation in the Linux kernel, and
introduces a new mutex type, PTHREAD_MUTEX_QUEUESPINNER_NP (a minimal
sketch of the MCS queuing idea appears below, after the test summary).
Compared to the adaptive mutex (read-only while spinning), the test
results on a 2-socket Intel Skylake platform show a significant
performance improvement (see the first patch for details).

Although queuing spinners with an MCS lock improves adaptive mutex
performance when multiple threads contend for a lock, people will probably
want to know how the queue spinner mutex performs compared to the other
widely known locking disciplines, the pthread spin lock and the normal
pthread mutex. The details of the test results are in the first patch; in
summary:

a) With little lock contention, the spin lock performs best, the queue
spinner mutex performs similarly to the adaptive mutex, and both perform a
little better than the normal pthread mutex.

b) With severe lock contention on a large number of CPUs and a small
critical section (less than 1000ns), most lock acquisitions succeed via
spinning, and the queue spinner mutex performs much better than the spin
lock and the adaptive mutex. This is because the overhead of heavy cache
line bouncing dominates lock performance in this regime.

c) As the critical section grows, the performance advantage of the queue
spinner mutex gradually shrinks: cache line bouncing stops being the
bottleneck, and the overhead of futex_wait and futex_wake dominates
instead. By the time the critical section reaches 1ms, even the latency of
the futex syscalls is negligible compared to the total time of lock
acquisition.

As we can see above, the queue spinner mutex performs well across these
workloads, but using it carries a potential risk: the lock may be handed
to the next spinner in the queue while that spinner is not running (its
CPU has been scheduled to run another task). The remaining spinners then
have to keep waiting in the queue behind it, which can collapse lock
performance. To emulate this case, we ran two identical processes
simultaneously; each process has 28 threads, and each thread pins itself
to an individual CPU according to its thread id, so CPUs 0~27 are each
subscribed by two threads. In the worst case (s=1000ns, t=6000ns), lock
performance drops by 58.1% (2205245->924263). Therefore, the queue spinner
mutex should be reserved for applications that pursue fairness and
performance without oversubscribing CPU resources, e.g. containers in
public cloud infrastructure.
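As promised above, here is a minimal sketch of the MCS queuing idea, for
illustration only; the actual implementation in this series is
nptl/mcs_lock.c, and the names below are made up. Each spinner enqueues
its own node and spins on a flag in that node, so the spin loop touches a
private cache line instead of hammering the shared lock word, and lock
handover is FIFO, which is also what provides the fairness guarantee:

struct mcs_node
{
  struct mcs_node *next;
  int locked;
};

static void
mcs_acquire (struct mcs_node **tail, struct mcs_node *me)
{
  me->next = NULL;
  me->locked = 0;

  /* Join the queue by swapping ourselves into the tail pointer.  */
  struct mcs_node *prev = __atomic_exchange_n (tail, me, __ATOMIC_ACQ_REL);
  if (prev == NULL)
    return;			/* Queue was empty: we hold the lock.  */

  __atomic_store_n (&prev->next, me, __ATOMIC_RELEASE);
  /* Spin locally until our predecessor hands the lock over.  */
  while (__atomic_load_n (&me->locked, __ATOMIC_ACQUIRE) == 0)
    ;				/* Ideally a pause/cpu_relax here.  */
}

static void
mcs_release (struct mcs_node **tail, struct mcs_node *me)
{
  struct mcs_node *next = __atomic_load_n (&me->next, __ATOMIC_ACQUIRE);
  if (next == NULL)
    {
      /* No visible successor: if we are still the tail, empty the
	 queue and return.  */
      struct mcs_node *expected = me;
      if (__atomic_compare_exchange_n (tail, &expected, NULL, 0,
				       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
	return;
      /* A successor is mid-enqueue; wait for it to link itself.  */
      while ((next = __atomic_load_n (&me->next, __ATOMIC_ACQUIRE)) == NULL)
	;
    }
  /* Hand the lock directly to the next spinner in FIFO order.  */
  __atomic_store_n (&next->locked, 1, __ATOMIC_RELEASE);
}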
Kemi Wang (4):
  Mutex: Queue spinners to reduce cache line bouncing and ensure
    fairness
  Mutex: add unit tests for new type
  BENCHMARK: add a benchmark for testing new type of mutex
  Manual: Add manual for pthread mutex

 benchtests/Makefile                          |  4 +-
 benchtests/bench-mutex-adaptive-thread.c     |  8 +++-
 benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++++
 manual/Makefile                              |  2 +-
 manual/mutex.texi                            | 68 ++++++++++++++++++++++++++++
 nptl/Makefile                                |  8 ++--
 nptl/allocatestack.c                         |  2 +-
 nptl/descr.h                                 | 26 +++++------
 nptl/mcs_lock.c                              | 68 ++++++++++++++++++++++++++++
 nptl/mcs_lock.h                              | 21 +++++++++
 nptl/nptl-init.c                             |  2 +-
 nptl/pthreadP.h                              |  2 +-
 nptl/pthread_mutex_init.c                    |  3 +-
 nptl/pthread_mutex_lock.c                    | 35 +++++++++++++-
 nptl/pthread_mutex_timedlock.c               | 35 ++++++++++++--
 nptl/pthread_mutex_trylock.c                 |  5 +-
 nptl/pthread_mutex_unlock.c                  |  7 ++-
 nptl/pthread_mutexattr_settype.c             |  2 +-
 nptl/tst-initializers1.c                     | 11 +++--
 nptl/tst-mutex5b.c                           |  2 +
 nptl/tst-mutex7b.c                           |  2 +
 sysdeps/nptl/bits/thread-shared-types.h      | 21 +++++++-
 sysdeps/nptl/pthread.h                       | 15 ++++-
 sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
 24 files changed, 324 insertions(+), 50 deletions(-)
 create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
 create mode 100644 manual/mutex.texi
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
 create mode 100644 nptl/tst-mutex5b.c
 create mode 100644 nptl/tst-mutex7b.c
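For completeness, selecting the proposed type from application code would
look like the following sketch. It assumes a glibc with this series
applied (PTHREAD_MUTEX_QUEUESPINNER_NP does not exist in released glibc)
and is built with _GNU_SOURCE, like the other _NP mutex types:

#include <pthread.h>

static pthread_mutex_t lock;

static void
init_lock (void)
{
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  /* Ask for the proposed queue-spinner behavior instead of the
     default or adaptive type.  */
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_QUEUESPINNER_NP);
  pthread_mutex_init (&lock, &attr);
  pthread_mutexattr_destroy (&attr);
}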