From patchwork Fri Jun 22 09:32:11 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: kemi X-Patchwork-Id: 933171 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=sourceware.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=libc-alpha-return-93546-incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="xUhWLgAF"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 41Bthw3Pn4z9s2L for ; Fri, 22 Jun 2018 19:35:27 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id; q=dns; s= default; b=srSVlitGMScM8sSjtDv5YalI4qRzdDbOkf1k+yM5S+CFH1Bpj4f3G +BY7D6ImAcs0FnrQRU8au7Ix6pnsZIhPYIO7lZF/Dapqt9zZf92gRu3OPSqt5Siu k1L2JU7OAuonyIf32seHy6Z4ezchogGqiB36OYmWix+tKWrQ+tb31U= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id; s=default; bh=ff8q6MMlCb69LPHrnW133G58Whg=; b=xUhWLgAFAppfQawvyJziHQPyTwJe AkmfGVxLj9kTZ10lURZxDy9W7pEq3sPYKChM1WWq94rpGZ5ZdEnh2N5GY/9q2Dyo 82cEDJJmuu7ymF4H3Ggic5IPLmOSQVe/bO0go2DWQD7X0LB4QKxDx3cFs/CAWlad KkWs0NScXoglgIo= Received: (qmail 91770 invoked by alias); 22 Jun 2018 09:35:20 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 91364 invoked by uid 89); 22 Jun 2018 09:35:19 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-26.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT, SPF_PASS autolearn=ham version=3.3.2 spammy=acquired, capabilities, H*MI:wang, out X-HELO: mga07.intel.com X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 From: Kemi Wang To: Adhemerval Zanella , Florian Weimer , Rical Jason , Carlos Donell , Glibc alpha Cc: Dave Hansen , Tim Chen , Andi Kleen , Ying Huang , Aaron Lu , Lu Aubrey , Kemi Wang Subject: [PATCH v5 1/3] Tunables: Add tunables of spin count for pthread adaptive spin mutex Date: Fri, 22 Jun 2018 17:32:11 +0800 Message-Id: <1529659933-9549-1-git-send-email-kemi.wang@intel.com> This patch does not have any functionality change, we only provide a spin count tunes for pthread adaptive spin mutex. The tunable glibc.mutex.spin_count tunes can be used by system administrator to squeeze system performance according to different hardware capabilities and workload characteristics. The maximum value of spin count is limited to 30000 to avoid the overflow of mutex->__data.__spins variable with the possible type of short in pthread_mutex_lock (). The default value of spin count is set to 100 with the reference to the previous number of times of spinning via trylock. This value would be architecture-specific and can be tuned with kinds of benchmarks to fit most cases in future. This is the preparation work for the next patch, in which the way of adaptive spin would be changed from an expensive cmpxchg to read while spinning. * elf/dl-tunables.list: Add glibc.mutex.spin_count entry. * manual/tunables.texi: Add glibc.mutex.spin_count description. * nptl/Makefile: Add pthread_mutex_conf.c for compilation. * nptl/pthread_mutex_conf.h: New file. * nptl/pthread_mutex_conf.c: New file. * nptl/nptl-init.c: Put mutex tunable initialization in pthread initialization. ChangeLog: V4->V5 a) Put mutex tunable (glibc.mutex.spin_count) initialization as part of overall pthread initialization, that would avoid the extra relocation, as suggested by Florian Weimer. Thanks for pointing it out! b) Move the READ_ONLY_SPIN macro definition from the third patch to this patch V3->V4 a) Add comments in elf/dl-tunables.list V2->V3 a) Polish the description of glibc.mutex.spin_count tunable with the help from Rical Jasan. b) Get rid of the TUNABLE_CALLBACK_FNDECL macros in pthread_mutex_conf.c, as suggested by Florian Weimer. c) Adjust the default value of spin count to 100 with the reference of the previous spinning way via trylock. V1->V2 a) Renamed nptl/mutex-conf.h -> nptl/pthread_mutex_conf.h b) Renamed nptl/mutex-conf.c -> nptl/pthread_mutex_conf.c c) Change the Makefile to compile pthread_mutex_conf.c d) Modify the copyright "2013-2018" -> "2018" for new added files e) Fix the indentation issue (tab -> double space) in elf/dl-tunables.list f) Remove the env alias LD_SPIN_COUNT in elf/dl-tunables.list g) Fix the typo errors and refresh glibc.mutex.spin_count tunable description in manual/tunables.texi. h) Fix the indentation issue in nptl/pthread_mutex_conf.c i) Fix the indentation issue for nested preprocessor (add one space for each level) Suggested-by: Andi Kleen Signed-off-by: Kemi Wang --- elf/dl-tunables.list | 17 ++++++++++++++ manual/tunables.texi | 22 ++++++++++++++++++ nptl/Makefile | 3 ++- nptl/nptl-init.c | 5 +++++ nptl/pthread_mutex_conf.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++ nptl/pthread_mutex_conf.h | 35 +++++++++++++++++++++++++++++ 6 files changed, 138 insertions(+), 1 deletion(-) create mode 100644 nptl/pthread_mutex_conf.c create mode 100644 nptl/pthread_mutex_conf.h diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list index 1f8ecb8..2c5a13f 100644 --- a/elf/dl-tunables.list +++ b/elf/dl-tunables.list @@ -121,4 +121,21 @@ glibc { default: 3 } } + +# The maximum value of spin count is limited to 30000 to avoid the overflow +# of mutex->__data.__spins variable with the possible type of short in +# pthread_mutex_lock (). +# +# The default value of spin count is set to 100 with the reference to the +# previous number of times of spinning via trylock. This value would be +# architecture-specific and can be tuned with kinds of benchmarks to fit +# most cases in future. + mutex { + spin_count { + type: INT_32 + minval: 0 + maxval: 30000 + default: 100 + } + } } diff --git a/manual/tunables.texi b/manual/tunables.texi index be33c9f..05c5e4a 100644 --- a/manual/tunables.texi +++ b/manual/tunables.texi @@ -281,6 +281,28 @@ of try lock attempts. The default value of this tunable is @samp{3}. @end deftp +@node Pthread Mutex Tunables +@section Pthread Mutex Tunables +@cindex pthread mutex tunables + +@deftp {Tunable namespace} glibc.mutex +The behavior of pthread mutexes can be tuned to gain performance improvements +according to specific hardware capabilities and workload characteristics by +setting the following tunables in the @code{mutex} namespace: +@end deftp + +@deftp Tunable glibc.mutex.spin_count +The @code{glibc.mutex.spin_count} tunable sets the maximum number of times +a thread should spin on the lock before calling into the kernel to block. +Adaptive spin is used for mutexes initialized with the PTHREAD_MUTEX_ADAPTIVE_NP +GNU extension. It affects both pthread_mutex_lock and pthread_mutex_timedlock. + +The spinning is done until either the maximum spin times is reached or +the lock is acquired. + +The default value of this tunable is @samp{100}. +@end deftp + @node Hardware Capability Tunables @section Hardware Capability Tunables @cindex hardware capability tunables diff --git a/nptl/Makefile b/nptl/Makefile index 94be92c..bd1096f 100644 --- a/nptl/Makefile +++ b/nptl/Makefile @@ -139,7 +139,8 @@ libpthread-routines = nptl-init vars events version pt-interp \ pthread_mutex_getprioceiling \ pthread_mutex_setprioceiling \ pthread_setname pthread_getname \ - pthread_setattr_default_np pthread_getattr_default_np + pthread_setattr_default_np pthread_getattr_default_np \ + pthread_mutex_conf # pthread_setuid pthread_seteuid pthread_setreuid \ # pthread_setresuid \ # pthread_setgid pthread_setegid pthread_setregid \ diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c index 1d3790f..3e6e2e1 100644 --- a/nptl/nptl-init.c +++ b/nptl/nptl-init.c @@ -38,6 +38,7 @@ #include #include #include +#include #ifndef TLS_MULTIPLE_THREADS_IN_TCB /* Pointer to the corresponding variable in libc. */ @@ -446,6 +447,10 @@ __pthread_initialize_minimal_internal (void) /* Determine whether the machine is SMP or not. */ __is_smp = is_smp_system (); + +#if HAVE_TUNABLES + mutex_tunables_init (); +#endif } strong_alias (__pthread_initialize_minimal_internal, __pthread_initialize_minimal) diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c new file mode 100644 index 0000000..9b2c5d1 --- /dev/null +++ b/nptl/pthread_mutex_conf.c @@ -0,0 +1,57 @@ +/* pthread_mutex_conf.c: Pthread mutex tunable parameters. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include "config.h" +#include +#include +#include +#include + +#if HAVE_TUNABLES +# define TUNABLE_NAMESPACE mutex +#endif +#include + + +struct mutex_config __mutex_aconf = +{ + /* The maximum number of times a thread should spin on the lock before + calling into kernel to block. */ + .spin_count = 100, +}; + +#if HAVE_TUNABLES +static inline void __always_inline +do_set_mutex_spin_count (int32_t value) +{ + __mutex_aconf.spin_count = value; +} + +void +TUNABLE_CALLBACK (set_mutex_spin_count) (tunable_val_t *valp) +{ + int32_t value = (int32_t) (valp)->numval; + do_set_mutex_spin_count (value); +} + +void mutex_tunables_init (void) +{ + TUNABLE_GET (spin_count, int32_t, + TUNABLE_CALLBACK (set_mutex_spin_count)); +} +#endif diff --git a/nptl/pthread_mutex_conf.h b/nptl/pthread_mutex_conf.h new file mode 100644 index 0000000..74a0735 --- /dev/null +++ b/nptl/pthread_mutex_conf.h @@ -0,0 +1,35 @@ +/* pthread_mutex_conf.h: Pthread mutex tunable parameters. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ +#ifndef _PTHREAD_MUTEX_CONF_H +#define _PTHREAD_MUTEX_CONF_H 1 + +#include +#include + +struct mutex_config +{ + int spin_count; +}; + +extern struct mutex_config __mutex_aconf attribute_hidden; + +void mutex_tunables_init (void); + +#define READ_ONLY_SPIN 1 + +#endif From patchwork Fri Jun 22 09:32:12 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: kemi X-Patchwork-Id: 933172 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=sourceware.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=libc-alpha-return-93547-incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="Xy21daiu"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 41Btj50Sj9z9s2L for ; Fri, 22 Jun 2018 19:35:36 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; q=dns; s=default; b=iztHfdfdbT8yQcsLCTHBMI0SsG71XZc Z4EqJ3yu0Q0dm9f/fT7ihOI/Ksr2U89k1A8XJRmGz5X/fJDsM5BFI/dk/tNfPk0v BLUFiiwFUl3+3tqwJEBHSrUQzh3Lz7V/YPCyhFiOfTy23Dm9NQ6E57h2/to5BcGd 2aFeGVE53EiE= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; s=default; bh=n/31wMKjYGnyo6M1zh7+CWmmB/Q=; b=Xy21d aiu7BGfAAFBxCh6CI9mxkSfh+vuBVx6HHwIKyFaCM95fK8dRUvARCUlIHUDCTenr lHuR4l0ethyal2snMrVQaRJWDTjMFL0IHMtJxQR6m6G1PbUiZoc/Ps6YUsRUVrTB 8rcEl0NcoPFWRxoRT+ACtpc2NmI9MZPg7kEucA= Received: (qmail 92622 invoked by alias); 22 Jun 2018 09:35:22 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 92036 invoked by uid 89); 22 Jun 2018 09:35:21 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-26.9 required=5.0 tests=BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT, SPF_PASS autolearn=ham version=3.3.2 spammy=protecting, 1667, bench, spin X-HELO: mga07.intel.com X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 From: Kemi Wang To: Adhemerval Zanella , Florian Weimer , Rical Jason , Carlos Donell , Glibc alpha Cc: Dave Hansen , Tim Chen , Andi Kleen , Ying Huang , Aaron Lu , Lu Aubrey , Kemi Wang Subject: [PATCH v5 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark Date: Fri, 22 Jun 2018 17:32:12 +0800 Message-Id: <1529659933-9549-2-git-send-email-kemi.wang@intel.com> In-Reply-To: <1529659933-9549-1-git-send-email-kemi.wang@intel.com> References: <1529659933-9549-1-git-send-email-kemi.wang@intel.com> Add a microbenchmark for measuring mutex lock and unlock performance with varying numbers of threads and varying size of a critical section. The benchmark leverages the mutex lock and unlock operation for protecting the critical section and measures the minimum iterations, the maximum iterations, and the total iterations within a fixed duration. Variants of benchmark are run with 1, 2, 3, 4, nproc/4, nproc/2, nproc threads. The size of critical section is determined by the number of times of pause (x86 only) and read which is intended to emulate the scenarios in real applications. In this microbenchmark, the number 1, 10, 100, and 1000 are used to represent different size of critical sections in the working set. * benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for adaptive spin mutex * benchmark/Makefile: Add adaptive spin mutex benchmark ChangLog: V4->V5: a) Add sanity check in benchtests/Makefile to avoid redundant execution of test case. E.g. we intend to run test case with 1, 2, 3, 4, nproc/4, nproc/2, nproc threads. It is unnecessary to run test case with nproc/4, nproc/2 and nproc if the thread number of a system is less than 4. V3->V4: no change V2->V3: a) Add some delay after mutex unlock to reduce the possibility of lock acquisition by the previous lock holder, and that makes more sense for practical applications b) Sync threads to start simultaneously c) Set CPU affinity for threads V1->V2: New added microbenchmark, as requested by Adhemerval Zanella Signed-off-by: Kemi Wang --- benchtests/Makefile | 38 ++++- benchtests/bench-mutex-adaptive-thread.c | 251 +++++++++++++++++++++++++++++++ 2 files changed, 282 insertions(+), 7 deletions(-) create mode 100644 benchtests/bench-mutex-adaptive-thread.c diff --git a/benchtests/Makefile b/benchtests/Makefile index bcd6a9c..b6f20be 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -95,10 +95,17 @@ else bench-malloc := $(filter malloc-%,${BENCHSET}) endif +ifeq (${BENCHSET},) +bench-mutex := mutex-adaptive-thread +else +bench-mutex := $(filter mutex-%,${BENCHSET}) +endif + $(addprefix $(objpfx)bench-,$(bench-math)): $(libm) $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm) $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library) $(objpfx)bench-malloc-thread: $(shared-thread-library) +$(addprefix $(objpfx)bench-,$(bench-mutex)): $(shared-thread-library) @@ -119,6 +126,7 @@ include ../Rules binaries-bench := $(addprefix $(objpfx)bench-,$(bench)) binaries-benchset := $(addprefix $(objpfx)bench-,$(benchset)) binaries-bench-malloc := $(addprefix $(objpfx)bench-,$(bench-malloc)) +binaries-bench-mutex := $(addprefix $(objpfx)bench-,$(bench-mutex)) # The default duration: 10 seconds. ifndef BENCH_DURATION @@ -142,7 +150,7 @@ endif # This makes sure CPPFLAGS-nonlib and CFLAGS-nonlib are passed # for all these modules. cpp-srcs-left := $(binaries-benchset:=.c) $(binaries-bench:=.c) \ - $(binaries-bench-malloc:=.c) + $(binaries-bench-malloc:=.c) $(binaries-bench-mutex:=.c) lib := nonlib include $(patsubst %,$(..)libof-iterator.mk,$(cpp-srcs-left)) @@ -158,6 +166,7 @@ bench-clean: rm -f $(binaries-bench) $(addsuffix .o,$(binaries-bench)) rm -f $(binaries-benchset) $(addsuffix .o,$(binaries-benchset)) rm -f $(binaries-bench-malloc) $(addsuffix .o,$(binaries-bench-malloc)) + rm -f $(binaries-bench-mutex) $(addsuffix .o,$(binaries-bench-mutex)) rm -f $(timing-type) $(addsuffix .o,$(timing-type)) rm -f $(addprefix $(objpfx),$(bench-extra-objs)) @@ -165,7 +174,7 @@ bench-clean: ifneq ($(strip ${BENCHSET}),) VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \ wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \ - malloc-thread + malloc-thread mutex-adaptive-thread INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET}) ifneq (${INVALIDBENCHSETNAMES},) $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES}) @@ -176,7 +185,7 @@ endif # Define the bench target only if the target has a usable python installation. ifdef PYTHON -bench: bench-build bench-set bench-func bench-malloc +bench: bench-build bench-set bench-func bench-malloc bench-mutex else bench: @echo "The bench target needs python to run." @@ -187,10 +196,10 @@ endif # only if we're building natively. ifeq (no,$(cross-compiling)) bench-build: $(gen-locales) $(timing-type) $(binaries-bench) \ - $(binaries-benchset) $(binaries-bench-malloc) + $(binaries-benchset) $(binaries-bench-malloc) $(binaries-bench-mutex) else bench-build: $(timing-type) $(binaries-bench) $(binaries-benchset) \ - $(binaries-bench-malloc) + $(binaries-bench-malloc) $(binaries-bench-mutex) endif bench-set: $(binaries-benchset) @@ -207,6 +216,21 @@ bench-malloc: $(binaries-bench-malloc) done;\ done +# Run benchmark with 1, 2, 3, nproc/2, nproc threads +bench-mutex: $(binaries-bench-mutex) + for run in $^; do \ + prev=0; \ + for thr in 1 2 3 4 $$((`nproc` / 4)) $$((`nproc` / 2)) `nproc`; do \ + if [ $$thr -gt $$prev -a $$thr -le `nproc` ]; then \ + echo "Running $${run} $${thr}"; \ + else \ + continue; \ + fi; \ + prev=$$thr; \ + $(run-bench) $${thr} > $${run}-$${thr}.out; \ + done;\ + done + # Build and execute the benchmark functions. This target generates JSON # formatted bench.out. Each of the programs produce independent JSON output, # so one could even execute them individually and process it using any JSON @@ -236,8 +260,8 @@ bench-func: $(binaries-bench) fi $(timing-type) $(binaries-bench) $(binaries-benchset) \ - $(binaries-bench-malloc): %: %.o $(objpfx)json-lib.o \ - $(link-extra-libs-tests) \ + $(binaries-bench-malloc) $(binaries-bench-mutex): \ + %: %.o $(objpfx)json-lib.o $(link-extra-libs-tests) \ $(sort $(filter $(common-objpfx)lib%,$(link-libc))) \ $(addprefix $(csu-objpfx),start.o) $(+preinit) $(+postinit) $(+link-tests) diff --git a/benchtests/bench-mutex-adaptive-thread.c b/benchtests/bench-mutex-adaptive-thread.c new file mode 100644 index 0000000..ce9c40e --- /dev/null +++ b/benchtests/bench-mutex-adaptive-thread.c @@ -0,0 +1,251 @@ +/* Benchmark pthread adaptive spin mutex lock and unlock functions. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "bench-timing.h" +#include "json-lib.h" + +/* Benchmark duration in seconds. */ +#define BENCHMARK_DURATION 15 +#define TYPE PTHREAD_MUTEX_ADAPTIVE_NP +#define mb() asm ("" ::: "memory") +#define UNLOCK_DELAY 10 + +#if defined (__i386__) || defined (__x86_64__) +# define cpu_relax() asm ("rep; nop") +#else +# define cpu_relax() do { } while (0) +#endif + +static volatile int start_thread; +static unsigned long long val; +static pthread_mutexattr_t attr; +static pthread_mutex_t mutex; + +#define WORKING_SET_SIZE 4 +int working_set[] = {1, 10, 100, 1000}; + +struct thread_args +{ + unsigned long long iters; + int working_set; + timing_t elapsed; +}; + +static void +init_mutex (void) +{ + pthread_mutexattr_init (&attr); + pthread_mutexattr_settype (&attr, TYPE); + pthread_mutex_init (&mutex, &attr); +} + +static void +init_parameter (int size, struct thread_args *args, int num_thread) +{ + int i; + for (i = 0; i < num_thread; i++) + { + memset(&args[i], 0, sizeof(struct thread_args)); + args[i].working_set = size; + } +} + +static volatile bool timeout; + +static void +alarm_handler (int signum) +{ + timeout = true; +} + +static inline void +delay (int number) +{ + while (number > 0) + { + cpu_relax (); + number--; + } +} + +/* Lock and unlock for protecting the critical section. */ +static unsigned long long +mutex_benchmark_loop (int number) +{ + unsigned long long iters = 0; + + while (!start_thread) + cpu_relax (); + + while (!timeout) + { + pthread_mutex_lock (&mutex); + val++; + delay (number); + pthread_mutex_unlock (&mutex); + iters++; + delay (UNLOCK_DELAY); + } + return iters; +} + +static void * +benchmark_thread (void *arg) +{ + struct thread_args *args = (struct thread_args *) arg; + unsigned long long iters; + timing_t start, stop; + + TIMING_NOW (start); + iters = mutex_benchmark_loop (args->working_set); + TIMING_NOW (stop); + + TIMING_DIFF (args->elapsed, start, stop); + args->iters = iters; + + return NULL; +} + +static void +do_benchmark (size_t num_thread, struct thread_args *args) +{ + + pthread_t threads[num_thread]; + + for (size_t i = 0; i < num_thread; i++) + { + pthread_attr_t attr; + cpu_set_t set; + + pthread_attr_init (&attr); + CPU_ZERO (&set); + CPU_SET (i, &set); + pthread_attr_setaffinity_np (&attr, sizeof(cpu_set_t), &set); + pthread_create (&threads[i], &attr, benchmark_thread, args + i); + pthread_attr_destroy (&attr); + } + + mb (); + start_thread = 1; + mb (); + sched_yield (); + for (size_t i = 0; i < num_thread; i++) + pthread_join(threads[i], NULL); +} + +static void +usage(const char *name) +{ + fprintf (stderr, "%s: \n", name); + exit (1); +} + +int +main (int argc, char **argv) +{ + int i, j, num_thread = 1; + json_ctx_t json_ctx; + struct sigaction act; + + if (argc == 1) + num_thread = 1; + else if (argc == 2) + { + long ret; + + errno = 0; + ret = strtol(argv[1], NULL, 10); + + if (errno || ret == 0) + usage(argv[0]); + + num_thread = ret; + } + else + usage(argv[0]); + + /* Benchmark for different critical section size. */ + for (i = 0; i < WORKING_SET_SIZE; i++) + { + int size = working_set[i]; + struct thread_args args[num_thread]; + unsigned long long iters = 0, min_iters = -1ULL, max_iters = 0; + double d_total_s = 0, d_total_i = 0; + + timeout = false; + init_mutex (); + init_parameter (size, args, num_thread); + + json_init (&json_ctx, 0, stdout); + + json_document_begin (&json_ctx); + + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); + + json_attr_object_begin (&json_ctx, "functions"); + + json_attr_object_begin (&json_ctx, "mutex"); + + json_attr_object_begin (&json_ctx, ""); + + memset (&act, 0, sizeof (act)); + act.sa_handler = &alarm_handler; + + sigaction (SIGALRM, &act, NULL); + + alarm (BENCHMARK_DURATION); + + do_benchmark (num_thread, args); + + for (j = 0; j < num_thread; j++) + { + iters = args[j].iters; + if (iters < min_iters) + min_iters = iters; + if (iters >= max_iters) + max_iters = iters; + d_total_i += iters; + TIMING_ACCUM (d_total_s, args[j].elapsed); + } + json_attr_double (&json_ctx, "duration", d_total_s); + json_attr_double (&json_ctx, "total_iterations", d_total_i); + json_attr_double (&json_ctx, "min_iteration", min_iters); + json_attr_double (&json_ctx, "max_iteration", max_iters); + json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i); + json_attr_double (&json_ctx, "threads", num_thread); + json_attr_double (&json_ctx, "critical_section_size", size); + + json_attr_object_end (&json_ctx); + json_attr_object_end (&json_ctx); + json_attr_object_end (&json_ctx); + + json_document_end (&json_ctx); + fputs("\n", (&json_ctx)->fp); + } + return 0; +} From patchwork Fri Jun 22 09:32:13 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: kemi X-Patchwork-Id: 933173 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=sourceware.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=libc-alpha-return-93548-incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="L+RM6Jaj"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 41BtjG1xxTz9s3C for ; Fri, 22 Jun 2018 19:35:46 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; q=dns; s=default; b=TtbktUi5gOrNl2x6KjFrawWCJI94/fd K/CTujvynKrh+te6NE3qaj6miQXRkacqQFQauJiSNRLZgrO2XWcRTx1jeDXHbinc pLfp9QR1b7jXwnQMgS6SeyebkID0ytUTNMGWtj0AicUYcYJ7B3KpcoSG+cIRoFGO Rf9XJRBuZsiY= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:in-reply-to :references; s=default; bh=Og6eeuXimq04+XY+GiJPsiWm44w=; b=L+RM6 Jajwbcj+6FAJA6gPLKo4k9jT3nmBGNUmYf3s1PYH5m0VD6YCT8tbsmBXdf36IXxG Ym7WX7+r7zkc7eitasz6U9b5vAxIITeV25BSBFxc2RzyIRIUSd0O5ReeNGPjPsQO 3QzfgvjB7YAmNrHydrhilHShozNT7ymPey+AtA= Received: (qmail 92883 invoked by alias); 22 Jun 2018 09:35:23 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 92636 invoked by uid 89); 22 Jun 2018 09:35:22 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-25.1 required=5.0 tests=BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_ASCII_DIVIDERS, KAM_LAZY_DOMAIN_SECURITY autolearn=ham version=3.3.2 spammy=0.7, Head, universal, 04 X-HELO: mga07.intel.com X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False X-ExtLoop1: 1 From: Kemi Wang To: Adhemerval Zanella , Florian Weimer , Rical Jason , Carlos Donell , Glibc alpha Cc: Dave Hansen , Tim Chen , Andi Kleen , Ying Huang , Aaron Lu , Lu Aubrey , Kemi Wang Subject: [PATCH v5 3/3] Mutex: Replace trylock by read only while spinning Date: Fri, 22 Jun 2018 17:32:13 +0800 Message-Id: <1529659933-9549-3-git-send-email-kemi.wang@intel.com> In-Reply-To: <1529659933-9549-1-git-send-email-kemi.wang@intel.com> References: <1529659933-9549-1-git-send-email-kemi.wang@intel.com> The pthread adaptive spin mutex spins on the lock for a while before calling into the kernel to block. But, in the current implementation of spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK(cmpxchg) when the lock is contended, it is not a good idea on many targets as that will force expensive memory synchronization among processors and penalize other running threads. For example, it constantly floods the system with "read for ownership" requests, which are much more expensive to process than a single read. Thus, we only use MO read until we observe the lock to not be acquired anymore, as suggested by Andi Kleen. Performance impact: Significant mutex performance improvement is not expected for this patch, though, it probably bring some benefit for the scenarios with severe lock contention on many architectures, the whole system performance can benefit from this modification because a number of unnecessary "read for ownership" requests which stress the cache system by broadcasting cache line invalidity are eliminated during spinning. Meanwhile, it may have some tiny performance regression on the lock holder transformation for the case of lock acquisition via spinning gets, because the lock state is checked before acquiring the lock via trylock. Similar mechanism has been implemented for pthread spin lock. Test machine: 2-sockets Skylake platform, 112 cores with 62G RAM Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex with global update) Usage: make bench BENCHSET=mutex-adaptive-thread Test result: +----------------+-----------------+-----------------+------------+ | Configuration | Base | Head | % Change | | | Total iteration | Total iteration | base->head | +----------------+-----------------+-----------------+------------+ | | Critical section size: 1x | +----------------+------------------------------------------------+ |1 thread | 2.76681e+07 | 2.7965e+07 | +1.1% | |2 threads | 3.29905e+07 | 3.55279e+07 | +7.7% | |3 threads | 4.38102e+07 | 3.98567e+07 | -9.0% | |4 threads | 1.72172e+07 | 2.09498e+07 | +21.7% | |28 threads | 1.03732e+07 | 1.05133e+07 | +1.4% | |56 threads | 1.06308e+07 | 5.06522e+07 | +14.6% | |112 threads | 8.55177e+06 | 1.02954e+07 | +20.4% | +----------------+------------------------------------------------+ | | Critical section size: 10x | +----------------+------------------------------------------------+ |1 thread | 1.57006e+07 | 1.54727e+07 | -1.5% | |2 threads | 1.8044e+07 | 1.75601e+07 | -2.7% | |3 threads | 1.35634e+07 | 1.46384e+07 | +7.9% | |4 threads | 1.21257e+07 | 1.32046e+07 | +8.9% | |28 threads | 8.09593e+06 | 1.02713e+07 | +26.9% | |56 threads | 9.09907e+06 | 4.16203e+07 | +16.4% | |112 threads | 7.09731e+06 | 8.62406e+06 | +21.5% | +----------------+------------------------------------------------+ | | Critical section size: 100x | +----------------+------------------------------------------------+ |1 thread | 2.87116e+06 | 2.89188e+06 | +0.7% | |2 threads | 2.23409e+06 | 2.24216e+06 | +0.4% | |3 threads | 2.29888e+06 | 2.29964e+06 | +0.0% | |4 threads | 2.26898e+06 | 2.21394e+06 | -2.4% | |28 threads | 1.03228e+06 | 1.0051e+06 | -2.6% | |56 threads | 1.02953 +06 | 1.6344e+07 | -2.3% | |112 threads | 1.01615e+06 | 1.00134e+06 | -1.5% | +----------------+------------------------------------------------+ | | Critical section size: 1000x | +----------------+------------------------------------------------+ |1 thread | 316392 | 315635 | -0.2% | |2 threads | 302806 | 303469 | +0.2% | |3 threads | 298506 | 294281 | -1.4% | |4 threads | 292037 | 289945 | -0.7% | |28 threads | 155188 | 155250 | +0.0% | |56 threads | 190657 | 183106 | -4.0% | |112 threads | 210818 | 220342 | +4.5% | +----------------+-----------------+-----------------+------------+ * nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex * nptl/pthread_mutex_timedlock.c: Likewise * sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable read only while spinning for x86 architecture * sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise * sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c: Likewise ChangLog: V4->V5: a) Make the optimization work for pthread mutex_timedlock() in x86 architecture. b) Move the READ_ONLY_SPIN macro definition from this patch to the first patch which adds glibc.mutex.spin_count tunable entry V3->V4: a) Make the optimization opt-in, and enable for x86 architecture as default, as suggested by Florian Weimer. V2->V3: a) Drop the idea of blocking spinners if fail to acquire a lock, since this idea would not be an universal winner. E.g. several threads contend for a lock which protects a small critical section, thus, probably any thread can acquire the lock via spinning. b) Fix the format issue AFAIC V1->V2: fix format issue Suggested-by: Andi Kleen Signed-off-by: Kemi Wang --- nptl/pthread_mutex_lock.c | 15 +++++++++++++++ nptl/pthread_mutex_timedlock.c | 15 +++++++++++++++ sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c | 1 + sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c | 1 + sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c | 1 + 5 files changed, 33 insertions(+) diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c index 1519c14..9a7b5c2 100644 --- a/nptl/pthread_mutex_lock.c +++ b/nptl/pthread_mutex_lock.c @@ -124,8 +124,14 @@ __pthread_mutex_lock (pthread_mutex_t *mutex) if (LLL_MUTEX_TRYLOCK (mutex) != 0) { int cnt = 0; +#ifdef READ_ONLY_SPIN + int val = 0; + int max_cnt = MIN (__mutex_aconf.spin_count, + mutex->__data.__spins * 2 + 10); +#else int max_cnt = MIN (MAX_ADAPTIVE_COUNT, mutex->__data.__spins * 2 + 10); +#endif do { if (cnt++ >= max_cnt) @@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex) LLL_MUTEX_LOCK (mutex); break; } +#ifdef READ_ONLY_SPIN + do + { + atomic_spin_nop (); + val = atomic_load_relaxed (&mutex->__data.__lock); + } + while (val != 0 && ++cnt < max_cnt); +#else atomic_spin_nop (); +#endif } while (LLL_MUTEX_TRYLOCK (mutex) != 0); diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c index 66efd39..dcaeca8 100644 --- a/nptl/pthread_mutex_timedlock.c +++ b/nptl/pthread_mutex_timedlock.c @@ -116,8 +116,14 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex, if (lll_trylock (mutex->__data.__lock) != 0) { int cnt = 0; +#ifdef READ_ONLY_SPIN + int val = 0; + int max_cnt = MIN (__mutex_aconf.spin_count, + mutex->__data.__spins * 2 + 10); +#else int max_cnt = MIN (MAX_ADAPTIVE_COUNT, mutex->__data.__spins * 2 + 10); +#endif do { if (cnt++ >= max_cnt) @@ -126,7 +132,16 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex, PTHREAD_MUTEX_PSHARED (mutex)); break; } +#ifdef READ_ONLY_SPIN + do + { + atomic_spin_nop (); + val = atomic_load_relaxed (&mutex->__data.__lock); + } + while (val != 0 && ++cnt < max_cnt); +#else atomic_spin_nop (); +#endif } while (lll_trylock (mutex->__data.__lock) != 0); diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c index 967d007..a44c48c 100644 --- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c +++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c @@ -19,4 +19,5 @@ already elided locks. */ #include +#include #include diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c index c23678f..29d20e8 100644 --- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c +++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c @@ -19,4 +19,5 @@ #include #include "force-elision.h" +#include "nptl/pthread_mutex_conf.h" #include "nptl/pthread_mutex_lock.c" diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c index 7997580..52345f5 100644 --- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c +++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c @@ -19,4 +19,5 @@ #include #include "force-elision.h" +#include "nptl/pthread_mutex_conf.h" #include "nptl/pthread_mutex_timedlock.c"