From patchwork Mon Jul 2 08:27:25 2018
X-Patchwork-Submitter: kemi
X-Patchwork-Id: 937731
From: Kemi Wang
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey, Kemi Wang
Subject: [PATCH v6 2/3] benchtests: Add pthread adaptive spin mutex microbenchmark
Date: Mon, 2 Jul 2018 16:27:25 +0800
Message-Id: <1530520046-18343-2-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1530520046-18343-1-git-send-email-kemi.wang@intel.com>
References: <1530520046-18343-1-git-send-email-kemi.wang@intel.com>

Add a microbenchmark that measures mutex lock and unlock performance with
varying numbers of threads and varying critical section sizes. Each thread
repeatedly locks the mutex, executes the critical section, and unlocks the
mutex. The benchmark reports the minimum and maximum per-thread iteration
counts and the total iteration count within a fixed duration. Variants of
the benchmark are run with 1, 2, 3, 4, nproc/4, nproc/2, and nproc threads.

The size of the critical section is determined by the number of pause
(x86 only) and read operations executed inside it, which is intended to
emulate scenarios in real applications. In this microbenchmark, the values
1, 10, 100, and 1000 represent the different critical section sizes in the
working set.
	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for adaptive
	spin mutex
	* benchtests/Makefile: Add adaptive spin mutex benchmark

ChangeLog:
    V5->V6:
    no change

    V4->V5:
    a) Add a sanity check in benchtests/Makefile to avoid redundant
    execution of the test case. E.g. we intend to run the test case with
    1, 2, 3, 4, nproc/4, nproc/2, and nproc threads. It is unnecessary to
    run the test case with nproc/4, nproc/2, and nproc threads if the
    system has fewer than 4 hardware threads.

    V3->V4:
    no change

    V2->V3:
    a) Add some delay after mutex unlock to reduce the chance that the
    previous lock holder immediately reacquires the lock, which better
    reflects practical applications
    b) Synchronize threads so that they start simultaneously
    c) Set CPU affinity for threads

    V1->V2:
    Newly added microbenchmark, as requested by Adhemerval Zanella

Signed-off-by: Kemi Wang
---
 benchtests/Makefile                      |  38 ++++-
 benchtests/bench-mutex-adaptive-thread.c | 251 +++++++++++++++++++++++++++++++
 2 files changed, 282 insertions(+), 7 deletions(-)
 create mode 100644 benchtests/bench-mutex-adaptive-thread.c

diff --git a/benchtests/Makefile b/benchtests/Makefile
index bcd6a9c..b6f20be 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -95,10 +95,17 @@ else
 bench-malloc := $(filter malloc-%,${BENCHSET})
 endif
 
+ifeq (${BENCHSET},)
+bench-mutex := mutex-adaptive-thread
+else
+bench-mutex := $(filter mutex-%,${BENCHSET})
+endif
+
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
 $(objpfx)bench-malloc-thread: $(shared-thread-library)
+$(addprefix $(objpfx)bench-,$(bench-mutex)): $(shared-thread-library)
 
 
 
@@ -119,6 +126,7 @@ include ../Rules
 binaries-bench := $(addprefix $(objpfx)bench-,$(bench))
 binaries-benchset := $(addprefix $(objpfx)bench-,$(benchset))
 binaries-bench-malloc := $(addprefix $(objpfx)bench-,$(bench-malloc))
+binaries-bench-mutex := $(addprefix $(objpfx)bench-,$(bench-mutex))
 
 # The default duration: 10 seconds.
 ifndef BENCH_DURATION
@@ -142,7 +150,7 @@ endif
 # This makes sure CPPFLAGS-nonlib and CFLAGS-nonlib are passed
 # for all these modules.
 cpp-srcs-left := $(binaries-benchset:=.c) $(binaries-bench:=.c) \
-		 $(binaries-bench-malloc:=.c)
+		 $(binaries-bench-malloc:=.c) $(binaries-bench-mutex:=.c)
 lib := nonlib
 include $(patsubst %,$(..)libof-iterator.mk,$(cpp-srcs-left))
 
@@ -158,6 +166,7 @@ bench-clean:
 	rm -f $(binaries-bench) $(addsuffix .o,$(binaries-bench))
 	rm -f $(binaries-benchset) $(addsuffix .o,$(binaries-benchset))
 	rm -f $(binaries-bench-malloc) $(addsuffix .o,$(binaries-bench-malloc))
+	rm -f $(binaries-bench-mutex) $(addsuffix .o,$(binaries-bench-mutex))
 	rm -f $(timing-type) $(addsuffix .o,$(timing-type))
 	rm -f $(addprefix $(objpfx),$(bench-extra-objs))
 
@@ -165,7 +174,7 @@ bench-clean:
 ifneq ($(strip ${BENCHSET}),)
 VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \
    wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \
-   malloc-thread
+   malloc-thread mutex-adaptive-thread
 INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET})
 ifneq (${INVALIDBENCHSETNAMES},)
 $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES})
@@ -176,7 +185,7 @@ endif
 
 # Define the bench target only if the target has a usable python installation.
 ifdef PYTHON
-bench: bench-build bench-set bench-func bench-malloc
+bench: bench-build bench-set bench-func bench-malloc bench-mutex
 else
 bench:
 	@echo "The bench target needs python to run."
@@ -187,10 +196,10 @@ endif
 # only if we're building natively.
 ifeq (no,$(cross-compiling))
 bench-build: $(gen-locales) $(timing-type) $(binaries-bench) \
-	$(binaries-benchset) $(binaries-bench-malloc)
+	$(binaries-benchset) $(binaries-bench-malloc) $(binaries-bench-mutex)
 else
 bench-build: $(timing-type) $(binaries-bench) $(binaries-benchset) \
-	$(binaries-bench-malloc)
+	$(binaries-bench-malloc) $(binaries-bench-mutex)
 endif
 
 bench-set: $(binaries-benchset)
@@ -207,6 +216,21 @@ bench-malloc: $(binaries-bench-malloc)
 	  done;\
 	done
 
+# Run benchmark with 1, 2, 3, 4, nproc/4, nproc/2, nproc threads
+bench-mutex: $(binaries-bench-mutex)
+	for run in $^; do \
+	  prev=0; \
+	  for thr in 1 2 3 4 $$((`nproc` / 4)) $$((`nproc` / 2)) `nproc`; do \
+	    if [ $$thr -gt $$prev -a $$thr -le `nproc` ]; then \
+	      echo "Running $${run} $${thr}"; \
+	    else \
+	      continue; \
+	    fi; \
+	    prev=$$thr; \
+	    $(run-bench) $${thr} > $${run}-$${thr}.out; \
+	  done;\
+	done
+
 # Build and execute the benchmark functions.  This target generates JSON
 # formatted bench.out.  Each of the programs produce independent JSON output,
 # so one could even execute them individually and process it using any JSON
@@ -236,8 +260,8 @@ bench-func: $(binaries-bench)
 	fi
 
 $(timing-type) $(binaries-bench) $(binaries-benchset) \
-	$(binaries-bench-malloc): %: %.o $(objpfx)json-lib.o \
-	$(link-extra-libs-tests) \
+	$(binaries-bench-malloc) $(binaries-bench-mutex): \
+	%: %.o $(objpfx)json-lib.o $(link-extra-libs-tests) \
   $(sort $(filter $(common-objpfx)lib%,$(link-libc))) \
   $(addprefix $(csu-objpfx),start.o) $(+preinit) $(+postinit)
 	$(+link-tests)
diff --git a/benchtests/bench-mutex-adaptive-thread.c b/benchtests/bench-mutex-adaptive-thread.c
new file mode 100644
index 0000000..ce9c40e
--- /dev/null
+++ b/benchtests/bench-mutex-adaptive-thread.c
@@ -0,0 +1,251 @@
+/* Benchmark pthread adaptive spin mutex lock and unlock functions.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "bench-timing.h"
+#include "json-lib.h"
+
+/* Benchmark duration in seconds.  */
+#define BENCHMARK_DURATION 15
+#define TYPE PTHREAD_MUTEX_ADAPTIVE_NP
+#define mb() asm ("" ::: "memory")
+#define UNLOCK_DELAY 10
+
+#if defined (__i386__) || defined (__x86_64__)
+# define cpu_relax() asm ("rep; nop")
+#else
+# define cpu_relax() do { } while (0)
+#endif
+
+static volatile int start_thread;
+static unsigned long long val;
+static pthread_mutexattr_t attr;
+static pthread_mutex_t mutex;
+
+#define WORKING_SET_SIZE 4
+int working_set[] = {1, 10, 100, 1000};
+
+struct thread_args
+{
+  unsigned long long iters;
+  int working_set;
+  timing_t elapsed;
+};
+
+static void
+init_mutex (void)
+{
+  pthread_mutexattr_init (&attr);
+  pthread_mutexattr_settype (&attr, TYPE);
+  pthread_mutex_init (&mutex, &attr);
+}
+
+static void
+init_parameter (int size, struct thread_args *args, int num_thread)
+{
+  int i;
+  for (i = 0; i < num_thread; i++)
+    {
+      memset (&args[i], 0, sizeof (struct thread_args));
+      args[i].working_set = size;
+    }
+}
+
+static volatile bool timeout;
+
+static void
+alarm_handler (int signum)
+{
+  timeout = true;
+}
+
+static inline void
+delay (int number)
+{
+  while (number > 0)
+    {
+      cpu_relax ();
+      number--;
+    }
+}
+
+/* Lock and unlock for protecting the critical section.  */
+static unsigned long long
+mutex_benchmark_loop (int number)
+{
+  unsigned long long iters = 0;
+
+  while (!start_thread)
+    cpu_relax ();
+
+  while (!timeout)
+    {
+      pthread_mutex_lock (&mutex);
+      val++;
+      delay (number);
+      pthread_mutex_unlock (&mutex);
+      iters++;
+      delay (UNLOCK_DELAY);
+    }
+  return iters;
+}
+
+static void *
+benchmark_thread (void *arg)
+{
+  struct thread_args *args = (struct thread_args *) arg;
+  unsigned long long iters;
+  timing_t start, stop;
+
+  TIMING_NOW (start);
+  iters = mutex_benchmark_loop (args->working_set);
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (args->elapsed, start, stop);
+  args->iters = iters;
+
+  return NULL;
+}
+
+static void
+do_benchmark (size_t num_thread, struct thread_args *args)
+{
+
+  pthread_t threads[num_thread];
+
+  for (size_t i = 0; i < num_thread; i++)
+    {
+      pthread_attr_t attr;
+      cpu_set_t set;
+
+      pthread_attr_init (&attr);
+      CPU_ZERO (&set);
+      CPU_SET (i, &set);
+      pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set);
+      pthread_create (&threads[i], &attr, benchmark_thread, args + i);
+      pthread_attr_destroy (&attr);
+    }
+
+  mb ();
+  start_thread = 1;
+  mb ();
+  sched_yield ();
+  for (size_t i = 0; i < num_thread; i++)
+    pthread_join (threads[i], NULL);
+}
+
+static void
+usage (const char *name)
+{
+  fprintf (stderr, "%s: \n", name);
+  exit (1);
+}
+
+int
+main (int argc, char **argv)
+{
+  int i, j, num_thread = 1;
+  json_ctx_t json_ctx;
+  struct sigaction act;
+
+  if (argc == 1)
+    num_thread = 1;
+  else if (argc == 2)
+    {
+      long ret;
+
+      errno = 0;
+      ret = strtol (argv[1], NULL, 10);
+
+      if (errno || ret <= 0)
+	usage (argv[0]);
+
+      num_thread = ret;
+    }
+  else
+    usage (argv[0]);
+
+  /* Benchmark for different critical section size.  */
+  for (i = 0; i < WORKING_SET_SIZE; i++)
+    {
+      int size = working_set[i];
+      struct thread_args args[num_thread];
+      unsigned long long iters = 0, min_iters = -1ULL, max_iters = 0;
+      double d_total_s = 0, d_total_i = 0;
+
+      timeout = false;
+      init_mutex ();
+      init_parameter (size, args, num_thread);
+
+      json_init (&json_ctx, 0, stdout);
+
+      json_document_begin (&json_ctx);
+
+      json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+      json_attr_object_begin (&json_ctx, "functions");
+
+      json_attr_object_begin (&json_ctx, "mutex");
+
+      json_attr_object_begin (&json_ctx, "");
+
+      memset (&act, 0, sizeof (act));
+      act.sa_handler = &alarm_handler;
+
+      sigaction (SIGALRM, &act, NULL);
+
+      alarm (BENCHMARK_DURATION);
+
+      do_benchmark (num_thread, args);
+
+      for (j = 0; j < num_thread; j++)
+	{
+	  iters = args[j].iters;
+	  if (iters < min_iters)
+	    min_iters = iters;
+	  if (iters >= max_iters)
+	    max_iters = iters;
+	  d_total_i += iters;
+	  TIMING_ACCUM (d_total_s, args[j].elapsed);
+	}
+      json_attr_double (&json_ctx, "duration", d_total_s);
+      json_attr_double (&json_ctx, "total_iterations", d_total_i);
+      json_attr_double (&json_ctx, "min_iteration", min_iters);
+      json_attr_double (&json_ctx, "max_iteration", max_iters);
+      json_attr_double (&json_ctx, "time_per_iteration", d_total_s / d_total_i);
+      json_attr_double (&json_ctx, "threads", num_thread);
+      json_attr_double (&json_ctx, "critical_section_size", size);
+
+      json_attr_object_end (&json_ctx);
+      json_attr_object_end (&json_ctx);
+      json_attr_object_end (&json_ctx);
+
+      json_document_end (&json_ctx);
+      fputs ("\n", (&json_ctx)->fp);
+    }
+  return 0;
+}