From patchwork Mon Jun 11 10:18:59 2018
From: Kemi Wang <kemi.wang@intel.com>
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
	Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu,
	Lu Aubrey, Kemi Wang
Subject: [PATCH v4 3/3] Mutex: Replace trylock by read only while spinning
Date: Mon, 11 Jun 2018 18:18:59 +0800
Message-Id: <1528712339-32299-3-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1528712339-32299-1-git-send-email-kemi.wang@intel.com>
References: <1528712339-32299-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block.
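
(Illustration, not part of the patch: applications select this
spin-then-block behavior with the GNU extension mutex type
PTHREAD_MUTEX_ADAPTIVE_NP. A minimal sketch, compiled with -pthread;
error checking omitted for brevity:)

#define _GNU_SOURCE
#include <pthread.h>

int
main (void)
{
  pthread_mutexattr_t attr;
  pthread_mutex_t mutex;

  pthread_mutexattr_init (&attr);
  /* Request the adaptive type whose spin loop this patch modifies.  */
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&mutex, &attr);
  pthread_mutexattr_destroy (&attr);

  pthread_mutex_lock (&mutex);
  /* ... critical section ... */
  pthread_mutex_unlock (&mutex);

  pthread_mutex_destroy (&mutex);
  return 0;
}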
In the current implementation of spinning, however, the spinners go
straight back to LLL_MUTEX_TRYLOCK (a cmpxchg) while the lock is
contended. That is not a good idea on many targets, because the cmpxchg
forces expensive memory synchronization among processors and penalizes
other running threads. For example, it constantly floods the system with
"read for ownership" requests, which are much more expensive to process
than a single read. Thus, spin using a plain memory-order-relaxed read
until the lock is observed to be free, and only then attempt the trylock
again, as suggested by Andi Kleen.

Performance impact:
No significant mutex performance improvement is expected from this patch
by itself, but scenarios with severe lock contention should benefit on
many architectures: whole-system performance can improve because the
unnecessary "read for ownership" requests, which stress the cache system
by broadcasting cache line invalidations, are eliminated during spinning.
Meanwhile, there may be a tiny performance regression in lock-holder
transition for the case where the lock is acquired via spinning, because
the lock state is now checked before acquiring the lock via trylock.
A similar mechanism has already been implemented for pthread spin locks.

Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM

Test case: mutex-adaptive-thread (contended pthread adaptive spin mutex
with global update)
Usage: make bench BENCHSET=mutex-adaptive-thread

Test result:
+----------------+-----------------+-----------------+------------+
| Configuration  |      Base       |      Head       |  % Change  |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                   Critical section size: 1x                     |
+----------------+------------------------------------------------+
| 1 thread       |   2.76681e+07   |   2.7965e+07    |   +1.1%    |
| 2 threads      |   3.29905e+07   |   3.55279e+07   |   +7.7%    |
| 3 threads      |   4.38102e+07   |   3.98567e+07   |   -9.0%    |
| 4 threads      |   1.72172e+07   |   2.09498e+07   |  +21.7%    |
| 28 threads     |   1.03732e+07   |   1.05133e+07   |   +1.4%    |
| 56 threads     |   1.06308e+07   |   5.06522e+07   |  +14.6%    |
| 112 threads    |   8.55177e+06   |   1.02954e+07   |  +20.4%    |
+----------------+------------------------------------------------+
|                   Critical section size: 10x                    |
+----------------+------------------------------------------------+
| 1 thread       |   1.57006e+07   |   1.54727e+07   |   -1.5%    |
| 2 threads      |   1.8044e+07    |   1.75601e+07   |   -2.7%    |
| 3 threads      |   1.35634e+07   |   1.46384e+07   |   +7.9%    |
| 4 threads      |   1.21257e+07   |   1.32046e+07   |   +8.9%    |
| 28 threads     |   8.09593e+06   |   1.02713e+07   |  +26.9%    |
| 56 threads     |   9.09907e+06   |   4.16203e+07   |  +16.4%    |
| 112 threads    |   7.09731e+06   |   8.62406e+06   |  +21.5%    |
+----------------+------------------------------------------------+
|                   Critical section size: 100x                   |
+----------------+------------------------------------------------+
| 1 thread       |   2.87116e+06   |   2.89188e+06   |   +0.7%    |
| 2 threads      |   2.23409e+06   |   2.24216e+06   |   +0.4%    |
| 3 threads      |   2.29888e+06   |   2.29964e+06   |   +0.0%    |
| 4 threads      |   2.26898e+06   |   2.21394e+06   |   -2.4%    |
| 28 threads     |   1.03228e+06   |   1.0051e+06    |   -2.6%    |
| 56 threads     |   1.02953e+06   |   1.6344e+07    |   -2.3%    |
| 112 threads    |   1.01615e+06   |   1.00134e+06   |   -1.5%    |
+----------------+------------------------------------------------+
|                   Critical section size: 1000x                  |
+----------------+------------------------------------------------+
| 1 thread       |   316392        |   315635        |   -0.2%    |
| 2 threads      |   302806        |   303469        |   +0.2%    |
| 3 threads      |   298506        |   294281        |   -1.4%    |
| 4 threads      |   292037        |   289945        |   -0.7%    |
| 28 threads     |   155188        |   155250        |   +0.0%    |
| 56 threads     |   190657        |   183106        |   -4.0%    |
| 112 threads    |   210818        |   220342        |   +4.5%    |
+----------------+-----------------+-----------------+------------+
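
(Illustration, not part of the patch: a standalone C11 sketch of the two
spinning styles discussed above. The names lock_word, spin_trylock_style
and spin_read_only_style are invented for this sketch; glibc itself uses
LLL_MUTEX_TRYLOCK, atomic_spin_nop and atomic_load_relaxed.)

#include <stdatomic.h>

static atomic_int lock_word;  /* 0 = free, 1 = held; stands in for
                                 mutex->__data.__lock.  */

/* Old style: spin directly on the atomic compare-and-swap.  Every
   failed attempt still issues a "read for ownership" request.  */
static void
spin_trylock_style (void)
{
  int expected = 0;
  while (!atomic_compare_exchange_weak (&lock_word, &expected, 1))
    expected = 0;  /* A failed CAS overwrote it with the current value.  */
}

/* New style (test and test-and-set): wait with a relaxed read until the
   lock is observed free, and only then retry the expensive CAS.  */
static void
spin_read_only_style (void)
{
  for (;;)
    {
      while (atomic_load_explicit (&lock_word, memory_order_relaxed) != 0)
        ;  /* Plain read only; a real implementation also pauses here.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (&lock_word, &expected, 1))
        break;
    }
}

int
main (void)
{
  spin_read_only_style ();  /* Acquire (succeeds at once: uncontended).  */
  atomic_store_explicit (&lock_word, 0, memory_order_release);
  return 0;
}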
* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex
* nptl/pthread_mutex_conf.h: Add READ_ONLY_SPIN macro definition
* sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable read only
  while spinning for the x86 architecture
* sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise

ChangeLog:
V3->V4: a) Make the optimization opt-in, and enable it by default for
           the x86 architecture, as suggested by Florian Weimer.
V2->V3: a) Drop the idea of blocking spinners if they fail to acquire
           the lock, since that idea would not be a universal winner.
           E.g. when several threads contend for a lock that protects a
           small critical section, probably any thread can acquire the
           lock via spinning.
        b) Fix format issues AFAIC
V1->V2: Fix format issues

Suggested-by: Andi Kleen
Signed-off-by: Kemi Wang
---
 nptl/pthread_mutex_conf.h                             |  2 ++
 nptl/pthread_mutex_lock.c                             | 15 +++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c |  1 +
 sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c      |  1 +
 4 files changed, 19 insertions(+)

diff --git a/nptl/pthread_mutex_conf.h b/nptl/pthread_mutex_conf.h
index e5b027c..f2d6ca9 100644
--- a/nptl/pthread_mutex_conf.h
+++ b/nptl/pthread_mutex_conf.h
@@ -28,4 +28,6 @@ struct mutex_config
 
 extern struct mutex_config __mutex_aconf attribute_hidden;
 
+#define READ_ONLY_SPIN 1
+
 #endif

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..26bcebf 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -124,8 +124,14 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
         {
           int cnt = 0;
+#ifdef READ_ONLY_SPIN
+          int val = 0;
+          int max_cnt = MIN (__mutex_aconf.spin_count,
+                             mutex->__data.__spins * 2 + 10);
+#else
           int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                              mutex->__data.__spins * 2 + 10);
+#endif
           do
             {
               if (cnt++ >= max_cnt)
@@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
                   LLL_MUTEX_LOCK (mutex);
                   break;
                 }
+#ifdef READ_ONLY_SPIN
+              do
+                {
+                  atomic_spin_nop ();
+                  val = atomic_load_relaxed (&mutex->__data.__lock);
+                }
+              while (val != 0 && ++cnt < max_cnt);
+#else
               atomic_spin_nop ();
+#endif
             }
           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 

diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
index 967d007..a44c48c 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
@@ -19,4 +19,5 @@
    already elided locks.  */
 
 #include <elision-conf.h>
+#include <nptl/pthread_mutex_conf.h>
 #include <nptl/pthread_mutex_cond_lock.c>

diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
index c23678f..29d20e8 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
@@ -19,4 +19,5 @@
 
 #include <elision-conf.h>
 #include "force-elision.h"
+#include "nptl/pthread_mutex_conf.h"
 #include "nptl/pthread_mutex_lock.c"
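
(Illustration, not part of the patch: another architecture could opt in
the same way the x86 wrappers above do, by pulling in
nptl/pthread_mutex_conf.h, which defines READ_ONLY_SPIN, before
including the generic lock code. The <arch> path below is a
placeholder for a hypothetical port, not a real file:)

/* sysdeps/unix/sysv/linux/<arch>/pthread_mutex_lock.c */
#include "nptl/pthread_mutex_conf.h"   /* Defines READ_ONLY_SPIN.  */
#include "nptl/pthread_mutex_lock.c"   /* Generic implementation.  */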