From: Kemi Wang <kemi.wang@intel.com>
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
 Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey,
 Kemi Wang
Subject: [PATCH v3 3/3] Mutex: Replace trylock by read only while spinning
Date: Wed, 23 May 2018 17:22:34 +0800
Message-Id: <1527067354-13333-3-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1527067354-13333-1-git-send-email-kemi.wang@intel.com>
References: <1527067354-13333-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (cmpxchg)
when the lock is contended. That is not a good idea on many targets, as
it forces expensive memory synchronization among processors and
penalizes other running threads. For example, it constantly floods the
system with "read for ownership" requests, which are much more
expensive to process than a single read. Thus, we only use a relaxed MO
read until we observe that the lock is not acquired anymore, as
suggested by Andi Kleen.
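To make the read-only spin concrete, here is a standalone sketch of the
pattern (the classic "test and test-and-set" idea). It is illustration
only, not part of the patch: it uses C11 atomics instead of glibc's
internal atomic_* macros, and SPIN_COUNT, try_acquire and acquire are
made-up names standing in for the adaptive spin bound and the lock
primitives; a real mutex would block in the kernel once spinning fails
instead of retrying forever.

#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_COUNT 100	/* Made-up bound; glibc computes it from the
			   spin-count tunable and the per-mutex
			   __spins history.  */

static bool
try_acquire (atomic_int *lock)
{
  /* The cmpxchg needs the cache line in exclusive state, i.e. a
     "read for ownership", even when the attempt fails.  */
  int expected = 0;
  return atomic_compare_exchange_strong (lock, &expected, 1);
}

static void
acquire (atomic_int *lock)
{
  if (try_acquire (lock))
    return;

  int cnt = 0;
  do
    {
      if (cnt >= SPIN_COUNT)
	{
	  /* Out of spins; a real adaptive mutex would block in the
	     kernel (futex) here.  The sketch simply keeps retrying.  */
	  while (!try_acquire (lock))
	    ;
	  return;
	}
      /* Read-only wait: repeated relaxed loads are satisfied from the
	 local cache while the line stays in shared state.  (glibc
	 additionally issues a pause hint, atomic_spin_nop, in this
	 loop.)  */
      int val;
      do
	val = atomic_load_explicit (lock, memory_order_relaxed);
      while (val != 0 && ++cnt < SPIN_COUNT);
    }
  while (!try_acquire (lock));
}

The point of the inner loop is that the relaxed loads generate no
coherence traffic while the lock is held, whereas every cmpxchg,
successful or not, forces a "read for ownership" broadcast.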
Performance impact: significant mutex performance improvement is not
expected from this patch, though it probably brings some benefit in
scenarios with severe lock contention. On many architectures, whole
system performance can benefit from this modification, because a number
of unnecessary "read for ownership" requests, which stress the cache
system via read and invalidate broadcasts, are eliminated during
spinning. Meanwhile, it may cause a tiny performance regression in
lock-holder handover when the lock is acquired via spinning, because
the lock state is checked before acquiring the lock via trylock. In the
worst case, the extra latency of a read and a pause is added on top of
that of the trylock when the lock is available. A similar mechanism has
already been implemented for the pthread spin lock.

Test machine: 2-socket Skylake platform, 112 cores with 62 GB RAM

Test case: mutex-adaptive-thread (contended pthread adaptive spin mutex
with global update)

Usage: make bench BENCHSET=mutex-adaptive-thread

Test result:
+----------------+-----------------+-----------------+------------+
| Configuration  |      Base       |      Head       |  % Change  |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
|1 thread        | 2.76681e+07     | 2.7965e+07      | +1.1%      |
|2 threads       | 3.29905e+07     | 3.55279e+07     | +7.7%      |
|3 threads       | 4.38102e+07     | 3.98567e+07     | -9.0%      |
|4 threads       | 1.72172e+07     | 2.09498e+07     | +21.7%     |
|28 threads      | 1.03732e+07     | 1.05133e+07     | +1.4%      |
|56 threads      | 1.06308e+07     | 5.06522e+07     | +14.6%     |
|112 threads     | 8.55177e+06     | 1.02954e+07     | +20.4%     |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
|1 thread        | 1.57006e+07     | 1.54727e+07     | -1.5%      |
|2 threads       | 1.8044e+07      | 1.75601e+07     | -2.7%      |
|3 threads       | 1.35634e+07     | 1.46384e+07     | +7.9%      |
|4 threads       | 1.21257e+07     | 1.32046e+07     | +8.9%      |
|28 threads      | 8.09593e+06     | 1.02713e+07     | +26.9%     |
|56 threads      | 9.09907e+06     | 4.16203e+07     | +16.4%     |
|112 threads     | 7.09731e+06     | 8.62406e+06     | +21.5%     |
+----------------+------------------------------------------------+
|                |          Critical section size: 100x           |
+----------------+------------------------------------------------+
|1 thread        | 2.87116e+06     | 2.89188e+06     | +0.7%      |
|2 threads       | 2.23409e+06     | 2.24216e+06     | +0.4%      |
|3 threads       | 2.29888e+06     | 2.29964e+06     | +0.0%      |
|4 threads       | 2.26898e+06     | 2.21394e+06     | -2.4%      |
|28 threads      | 1.03228e+06     | 1.0051e+06      | -2.6%      |
|56 threads      | 1.02953e+06     | 1.6344e+07      | -2.3%      |
|112 threads     | 1.01615e+06     | 1.00134e+06     | -1.5%      |
+----------------+------------------------------------------------+
|                |          Critical section size: 1000x          |
+----------------+------------------------------------------------+
|1 thread        | 316392          | 315635          | -0.2%      |
|2 threads       | 302806          | 303469          | +0.2%      |
|3 threads       | 298506          | 294281          | -1.4%      |
|4 threads       | 292037          | 289945          | -0.7%      |
|28 threads      | 155188          | 155250          | +0.0%      |
|56 threads      | 190657          | 183106          | -4.0%      |
|112 threads     | 210818          | 220342          | +4.5%      |
+----------------+-----------------+-----------------+------------+

    * nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex

ChangeLog:
V2->V3:
  a) Drop the idea of blocking spinners if they fail to acquire the
  lock, since this idea would not be a universal winner. E.g. when
  several threads contend for a lock that protects a small critical
  section, probably any thread can acquire the lock via spinning.
  b) Fix the format issue, AFAIC.

V1->V2: fix format issue

Suggested-by: Andi Kleen
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 ChangeLog                 |  5 +++++
 nptl/pthread_mutex_lock.c | 41 +++++++++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index e2991e9..3bafb0e 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,10 @@
 2018-05-23  Kemi Wang  <kemi.wang@intel.com>
 
+	* nptl/pthread_mutex_lock.c: Replace trylock by read only while
+	spinning.
+
+2018-05-23  Kemi Wang  <kemi.wang@intel.com>
+
 	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for
 	adaptive spin mutex.
 	* benchmark/Makefile: Add adaptive spin mutex benchmark.

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..7ce50f6 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <mutex-conf.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private) ({ \
@@ -123,22 +124,30 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
-	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
-	  do
-	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
-
-	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
-	}
+	  int val = 0;
+	  int cnt = 0;
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			     mutex->__data.__spins * 2 + 10);
+
+	  do
+	    {
+	      if (cnt >= max_cnt)
+		{
+		  LLL_MUTEX_LOCK (mutex);
+		  break;
+		}
+	      /* Read only while spinning unless lock is available.  */
+	      do
+		{
+		  atomic_spin_nop ();
+		  val = atomic_load_relaxed (&mutex->__data.__lock);
+		}
+	      while (val != 0 && ++cnt < max_cnt);
+	    }
+	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+
+	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+	}
       assert (mutex->__data.__owner == 0);
     }
   else
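For context (illustration only, not part of the patch): the path
modified above is only taken for mutexes of type
PTHREAD_MUTEX_ADAPTIVE_NP, which is what the benchmark above exercises.
A minimal example of opting into that type, with error checking
omitted, looks like this; build with -pthread, and note that
_GNU_SOURCE is needed because the adaptive type is a GNU extension:

#define _GNU_SOURCE
#include <pthread.h>

int
main (void)
{
  pthread_mutexattr_t attr;
  pthread_mutex_t m;

  /* Select the spin-then-block (adaptive) lock path.  */
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&m, &attr);

  pthread_mutex_lock (&m);
  /* ... critical section ...  */
  pthread_mutex_unlock (&m);

  pthread_mutex_destroy (&m);
  pthread_mutexattr_destroy (&attr);
  return 0;
}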