From patchwork Fri Aug 17 08:10:32 2018
X-Patchwork-Submitter: kemi
X-Patchwork-Id: 958686
From: Kemi Wang <kemi.wang@intel.com>
To: Glibc alpha <libc-alpha@sourceware.org>
Cc: Kemi Wang <kemi.wang@intel.com>
Subject: [RESEND PATCH] Mutex: Use test and cmpxchg instead of cmpxchg while spinning
Date: Fri, 17 Aug 2018 16:10:32 +0800
Message-Id: <1534493432-20242-2-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1534493432-20242-1-git-send-email-kemi.wang@intel.com>
References: <1534493432-20242-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock with cmpxchg rather than
test and cmpxchg. The former is very unfriendly to the uncore because it
constantly floods the system with "read for ownership" (RFO) requests,
which are much more expensive to process than plain reads.

============================Some Background===============================
On the Intel x86 architecture, memory coherence between cores is handled
by the "uncore" (roughly, the logic outside the CPU cores but residing on
the same die). Cores can ask the uncore to "read" or "read for ownership"
a cache line.
If the request is "read", the cache line generally winds up marked SHARED
in all caches that hold a copy, or EXCLUSIVE in the requesting core if no
one else had it before. If the request is "read for ownership", any other
copies are invalidated (or written back and then invalidated if MODIFIED)
and the requesting core gets the line in EXCLUSIVE state.

When a read for ownership comes to the uncore, all copies in other caches
must be invalidated. If there is only one other copy, it may be MODIFIED
and will have to be written back before being invalidated. The uncore does
this by sending "snoops" to the affected cores, on the same or other
sockets, and waiting for the answers.

When a read comes to the uncore and more than one copy already exists in
other cores, adding another copy does not require a snoop. If there is
only one other copy, a snoop is required to find out what state it is in
and to force a transition to the SHARED state.

A lock cmpxchg prevents any other core from modifying the line between the
read phase of the cmpxchg and the (potential) write phase. The way this is
done is to "read for ownership" the line, which makes it EXCLUSIVE; the
core then defers responding to snoops from the uncore until the write
phase is complete. This prevents any other core from acquiring the
necessary exclusive access to the cache line until the instruction is
complete. With this background, we can see the benefit of test and cmpxchg
over plain cmpxchg when implementing locks.

============================Performance Impact============================
Now think about the case where a lock is used alternately by two cores. At
the conclusion of an unlock, the lock's cache line is in MODIFIED state in
core A, which did the unlock. If core B tries for the lock while core A is
holding it, core B takes the cache line away from core A and only then
finds out that the lock is held. Core B then spins, testing the lock over
and over and making no progress until core A finally releases it. The
release is expensive because core A has to issue an RFO to get the line
back, then hold off core B's snoops until it can release the lock. Core B
then grabs the line back and successfully acquires the lock.

In the case of two contending cores, lock cmpxchg is not too bad. The real
problem happens with three or more contenders: every time one of the
contenders attempts to take the lock with a lock cmpxchg, a flurry of RFO
transactions happens.

Generally, test and test-and-set is better. To acquire the lock, you first
"test" with an ordinary MOV and only when the lock appears free do you try
the lock cmpxchg. The MOV puts the line into SHARED state, and everyone
waiting can poll locally in their own caches without causing any uncore
traffic. When core A finally unlocks, the uncore has to invalidate
everyone, then grant ownership to A for the unlock. Then all the other
cores pile on again and one of them wins.

Now, let's look at the impact of test and cmpxchg on the adaptive mutex.
With no or little lock contention, there is almost no performance
difference between the two approaches, because the lock is usually
acquired immediately in the fast path. With slight lock contention, where
the lock is acquired either immediately or by spinning, test and cmpxchg
performs better than plain cmpxchg even though it does one extra test.
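To make the two spinning styles concrete, here is a small standalone
sketch. It is not part of this patch and does not use glibc's internal
atomic macros; the C11 atomics and the function names are purely
illustrative.

/* Standalone illustration of the two spinning styles.  */
#include <stdatomic.h>

static atomic_int lock;			/* 0 = free, 1 = held.  */

/* Plain cmpxchg spinning: every failed attempt issues an RFO that steals
   the cache line from the lock holder.  */
void
spin_cmpxchg (void)
{
  int expected;
  do
    expected = 0;
  while (!atomic_compare_exchange_weak (&lock, &expected, 1));
}

/* Test and cmpxchg: poll with a plain relaxed load, which keeps the line
   SHARED in the waiter's own cache, and only issue the cmpxchg (an RFO)
   once the lock has been observed free.  */
void
spin_test_cmpxchg (void)
{
  for (;;)
    {
      while (atomic_load_explicit (&lock, memory_order_relaxed) != 0)
        ;				/* No uncore traffic while waiting.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (&lock, &expected, 1))
        return;
    }
}

int
main (void)
{
  spin_cmpxchg ();			/* Take the lock.  */
  atomic_store (&lock, 0);		/* Release it.  */
  spin_test_cmpxchg ();
  atomic_store (&lock, 0);
  return 0;
}

In the first loop every failed attempt generates an RFO; in the second,
waiters generate an RFO only at the moment the lock is observed free.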
The improvement in the slight-contention case is probably because, for
both lock acquisition (which decodes to a lock cmpxchg) and lock release
(which decodes to a lock dec), the uncore snoops originated by the
requesting core's RFOs can be answered immediately, since fewer snoops are
in flight.

For the case of severe lock contention (e.g. a large number of contending
threads), the lock performance improvement may not be obvious even though
the number of RFOs is significantly reduced. This is because the behavior
of the adaptive mutex changes: some lock acquisitions shift from spinning
to sleeping and being woken up, and lock performance is then dominated by
the overhead of the futex syscall and the context switch. Even so, test
and cmpxchg still has value, because it reduces unnecessary uncore traffic
in the whole system.

================================Testing===================================
Test Machine: Intel 2-socket Skylake platform (56 cores, 62G RAM)

Methodology: Let the size of the critical section be *s* and the size of
the non-critical section be *t*, with t = k*s. Each iteration then takes
s + t = (k+1)*s and makes one lock attempt, so, in the absence of
contention, a single core tries to acquire the lock at a rate of 1/(k+1)
(one attempt per (k+1)*s of time). We assume *n* threads contend for the
lock; each thread is bound to its own CPU core and does the following in a
loop until it reaches 100,000 iterations:
 1) lock
 2) spend *s* nanoseconds in the critical section
 3) unlock
 4) spend *t* nanoseconds in the non-critical section
Performance is measured by the number of RFOs and by the average latency
of lock acquisition and lock release.

*Note*: latency is measured in CPU cycles with the help of the RDTSCP
instruction. The start timestamp is taken before calling lock/unlock and
the end timestamp once lock/unlock returns; the delta gives the latency of
lock acquisition and lock release, respectively. Invalid samples (e.g.
when an interrupt is handled on that core during the lock/unlock call) are
discarded from the statistics.

To emulate different usage scenarios, we let k=6 and s=200ns and run this
workload with the two spinning methods. In our workload, 1-5 threads
contending for the lock emulate little or no lock contention, 6-15 threads
emulate slight lock contention, and 16-56 threads emulate severe lock
contention across sockets (the benchmark is provided by Andi Kleen).

Test Command: perf stat -a -e offcore_requests.demand_rfo ./run_workload

Result: Generally, test and cmpxchg performs consistently better than
cmpxchg under lock contention; see details below.

For the case with little or no lock contention, there is no obvious
difference in lock performance between the two approaches.

For the case with slight lock contention (using 8 threads as an example),
test and cmpxchg performs better than cmpxchg: the average latency of lock
acquisition is reduced from 7501 cycles to 5396 cycles (-28.1%), the
average latency of lock release is reduced from 1704 cycles to 1095 cycles
(-35.7%), and the total number of RFOs is reduced from 22239643 to
12484358 (-43.9%).

For the case with severe lock contention (using 56 threads as an example),
test and cmpxchg again performs better: the average latency of lock
acquisition is reduced from 136787 cycles to 106050 cycles (-22.5%), the
average latency of lock release is reduced from 57787 cycles to 49895
cycles (-13.7%), and the total number of RFOs is reduced from 317483283 to
215756961 (-32.0%).
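For reference, here is a minimal sketch of the per-thread loop described
in the methodology above. It is a reconstruction under assumptions, not
the actual benchmark provided by Andi Kleen: the busy-wait helper spend
(), the adaptive-mutex setup, and the per-thread reporting are
illustrative, CPU-affinity handling is omitted, and the RFO count comes
from perf as shown in the test command.

#define _GNU_SOURCE			/* for PTHREAD_MUTEX_ADAPTIVE_NP  */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <x86intrin.h>			/* __rdtscp  */

#define ITERS  100000
#define S_NSEC 200			/* critical section length, s  */
#define K      6			/* t = k * s  */

static pthread_mutex_t mutex;

/* Busy-wait for roughly NSEC nanoseconds.  */
static void
spend (long nsec)
{
  struct timespec start, now;
  clock_gettime (CLOCK_MONOTONIC, &start);
  do
    clock_gettime (CLOCK_MONOTONIC, &now);
  while ((now.tv_sec - start.tv_sec) * 1000000000L
	 + (now.tv_nsec - start.tv_nsec) < nsec);
}

static void *
worker (void *arg)
{
  uint64_t lock_cycles = 0, unlock_cycles = 0;
  unsigned int aux;

  for (int i = 0; i < ITERS; i++)
    {
      uint64_t t0 = __rdtscp (&aux);
      pthread_mutex_lock (&mutex);
      lock_cycles += __rdtscp (&aux) - t0;	/* acquisition latency  */

      spend (S_NSEC);				/* critical section  */

      uint64_t t1 = __rdtscp (&aux);
      pthread_mutex_unlock (&mutex);
      unlock_cycles += __rdtscp (&aux) - t1;	/* release latency  */

      spend ((long) K * S_NSEC);		/* non-critical section  */
    }
  printf ("avg lock %llu cycles, avg unlock %llu cycles\n",
	  (unsigned long long) (lock_cycles / ITERS),
	  (unsigned long long) (unlock_cycles / ITERS));
  return arg;
}

int
main (int argc, char **argv)
{
  int nthreads = argc > 1 ? atoi (argv[1]) : 8;
  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&mutex, &attr);

  pthread_t tids[nthreads];
  for (int i = 0; i < nthreads; i++)
    pthread_create (&tids[i], NULL, worker, NULL);
  for (int i = 0; i < nthreads; i++)
    pthread_join (tids[i], NULL);
  return 0;
}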
Many thanks to H.J. Lu for helping refine this patch, and to Lawrence
Stewart for explaining these "lock cmpxchg" matters to me in detail.

* nptl/pthread_mutex_lock.c: Use architecture-specific atomic spin API.
* nptl/pthread_mutex_timedlock.c: Likewise.
* nptl/pthread_spinlock.h: New file.
* sysdeps/unix/sysv/linux/x86/pthread_spinlock.h: New file.

Suggested-by: Andi Kleen
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c                      |  3 ++-
 nptl/pthread_mutex_timedlock.c                 |  4 ++--
 nptl/pthread_spinlock.h                        | 23 +++++++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h | 31 ++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 nptl/pthread_spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c910ec4 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -25,6 +25,7 @@
 #include "pthreadP.h"
 #include
 #include
+#include
 #include

 #ifndef lll_lock_elision
@@ -133,7 +134,7 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);

diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 28237b0..2ede5a0 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,7 +25,7 @@
 #include
 #include
 #include
-
+#include
 #include

 #ifndef lll_timedlock_elision
@@ -126,7 +126,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 				  PTHREAD_MUTEX_PSHARED (mutex));
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (lll_trylock (mutex->__data.__lock) != 0);

diff --git a/nptl/pthread_spinlock.h b/nptl/pthread_spinlock.h
new file mode 100644
index 0000000..8bd7c16
--- /dev/null
+++ b/nptl/pthread_spinlock.h
@@ -0,0 +1,23 @@
+/* Functions for pthread_spinlock_t.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  atomic_spin_nop ();
+}
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
new file mode 100644
index 0000000..5ca84d1
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
@@ -0,0 +1,31 @@
+/* Functions for pthread_spinlock_t.  X86 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  int val = 0;
+  int cnt = *cnt_p;
+  do
+    {
+      atomic_spin_nop ();
+      val = atomic_load_relaxed (lock);
+    }
+  while (val != 0 && ++cnt < max_cnt);
+  *cnt_p = cnt;
+}