From patchwork Wed Apr 25 02:56:28 2018
From: Kemi Wang
To: Adhemerval Zanella, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey, Kemi Wang
Subject: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
Date: Wed, 25 Apr 2018 10:56:28 +0800
Message-Id: <1524624988-29141-3-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1524624988-29141-1-git-send-email-kemi.wang@intel.com>
References: <1524624988-29141-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block.  In the current implementation of
spinning, however, the spinners go straight back to LLL_MUTEX_TRYLOCK
(a cmpxchg) while the lock is contended.  That is a poor choice on many
targets, because the atomic read-modify-write forces expensive memory
synchronization among processors and penalizes other running threads.
For example, it constantly floods the system with "read for ownership"
requests, which are much more expensive to process than a plain read.
Thus, we spin using only an MO (relaxed) read until we observe that the
lock is no longer held, as suggested by Andi Kleen.

Furthermore, it is usually pointless to keep spinning on the lock after
failing to acquire it at the moment it was observed to be available:
under severe lock contention, the spinner is unlikely to win the lock
later in the same spin window.
Therefore, it is better to call into the kernel to block the thread, as
suggested by Tim Chen.  The benefit is at least threefold: a) CPU time
is saved; b) power budget is saved; c) the overhead of cache line
bouncing during spinning is reduced.

Test machine: 2-socket Skylake platform, 112 cores, 62G RAM

Test case: mutex-adaptive-thread (contended pthread adaptive spin mutex
with global update)

Usage: make bench BENCHSET=mutex-adaptive-thread

Test result:
+----------------+-----------------+-----------------+------------+
| Configuration  | Base            | Head            | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+------------------------------------------------+
|                   Critical section size: 1x                     |
+----------------+------------------------------------------------+
|1 thread        | 7.06542e+08     | 7.08998e+08     | +0.3%      |
|2 threads       | 5.73018e+07     | 7.20815e+07     | +25.6%     |
|3 threads       | 3.78511e+07     | 1.15544e+08     | +205.3%    |
|4 threads       | 2.28214e+07     | 6.57055e+07     | +187.9%    |
|28 threads      | 1.68839e+07     | 5.19314e+07     | +207.6%    |
|56 threads      | 1.84983e+07     | 5.06522e+07     | +173.8%    |
|112 threads     | 2.3568e+07      | 4.95375e+07     | +110.2%    |
+----------------+------------------------------------------------+
|                   Critical section size: 10x                    |
+----------------+------------------------------------------------+
|1 thread        | 5.40274e+08     | 5.47189e+08     | +1.3%      |
|2 threads       | 4.55684e+07     | 6.03275e+07     | +32.4%     |
|3 threads       | 3.05702e+07     | 1.04035e+08     | +240.3%    |
|4 threads       | 2.17341e+07     | 5.57264e+07     | +156.4%    |
|28 threads      | 1.39503e+07     | 4.53525e+07     | +225.1%    |
|56 threads      | 1.50154e+07     | 4.16203e+07     | +177.2%    |
|112 threads     | 1.90175e+07     | 3.88308e+07     | +104.2%    |
+----------------+------------------------------------------------+
|                   Critical section size: 100x                   |
+----------------+------------------------------------------------+
|1 thread        | 7.23372e+07     | 7.25654e+07     | +0.3%      |
|2 threads       | 2.67302e+07     | 2.40265e+07     | -10.1%     |
|3 threads       | 1.89936e+07     | 2.70759e+07     | +42.6%     |
|4 threads       | 1.62423e+07     | 2.25097e+07     | +38.6%     |
|28 threads      | 9.85977e+06     | 1.59003e+07     | +61.3%     |
|56 threads      | 8.11471e+06     | 1.6344e+07      | +101.4%    |
|112 threads     | 8.58044e+06     | 1.53827e+07     | +79.3%     |
+----------------+------------------------------------------------+
|                   Critical section size: 1000x                  |
+----------------+------------------------------------------------+
|1 thread        | 8.16913e+06     | 8.16126e+06     | -0.1%      |
|2 threads       | 5.82987e+06     | 5.92752e+06     | +1.7%      |
|3 threads       | 6.05125e+06     | 6.37068e+06     | +5.3%      |
|4 threads       | 5.91259e+06     | 6.27616e+06     | +6.1%      |
|28 threads      | 2.40584e+06     | 2.60738e+06     | +8.4%      |
|56 threads      | 2.32643e+06     | 2.3245e+06      | -0.1%      |
|112 threads     | 2.32366e+06     | 2.30271e+06     | -0.9%      |
+----------------+-----------------+-----------------+------------+

	* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.

ChangeLog:
	V1->V2: fix format issue

Suggested-by: Andi Kleen
Suggested-by: Tim Chen
Signed-off-by: Kemi Wang
---
 ChangeLog                 |  4 ++++
 nptl/pthread_mutex_lock.c | 32 ++++++++++++++++++--------------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 76d2628..4c81693 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2018-04-24  Kemi Wang

+	* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.
+
+2018-04-24  Kemi Wang
+
 	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for
 	adaptive spin mutex.
 	* benchmark/Makefile: Add adaptive spin mutex benchmark.
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..3442c58 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include

 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private) ({ \
@@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
 	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
-	  do
-	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			     mutex->__data.__spins * 2 + 100);
+
+	  /* MO read while spinning */
+	  do
+	    {
+	      atomic_spin_nop ();
+	    }
+	  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+		 ++cnt < max_cnt);
+	  /* Try to acquire the lock if lock is available or the spin count
+	   * is run out, call into kernel to block if fails
+	   */
+	  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+	    LLL_MUTEX_LOCK (mutex);

-	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
-	}
+	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+	}
       assert (mutex->__data.__owner == 0);
     }
   else