Message ID | 20181226025019.38752-1-ling.ma@MacBook-Pro-8.local |
---|---|
State | New |
Headers | show |
Series | NUMA spinlock [BZ #23962] | expand |
Hi all, Hi all, In the patch we provide the test case (update on shared data) to simulate our real workload, the test case shows data as below: (Testing on 2 sockets platform , 104 cores with 256G RAM, HT=on) 1. original spinlock #./tst-variable-overhead Number of processors: 104, Single thread time 24913702 Number of threads: 2, Total time 138311026, Overhead: 2.78 Number of threads: 4, Total time 643308290, Overhead: 6.46 Number of threads: 8, Total time 2607024872, Overhead: 13.08 Number of threads: 16, Total time 12196852896, Overhead: 30.60 Number of threads: 32, Total time 80141230706, Overhead: 100.52 Number of threads: 64, Total time 385495457274, Overhead: 241.77 Number of threads: 104, Total time 885472451762, Overhead: 341.75 Number of threads: 128, Total time 1543267108944, Overhead: 483.94 Number of threads: 256, Total time 11780541636054, Overhead: 1847.09 Number of threads: 360, Total time 32000897533528, Overhead: 3567.97 Number of threads: 416, Total time 48862502147056, Overhead: 4714.59 2. numa spinlock #./tst-numa-variable-overhead Number of processors: 104, Single thread time 27906178 Number of threads: 2, Total time 137911942, Overhead: 2.47 Number of threads: 4, Total time 339251710, Overhead: 3.04 Number of threads: 8, Total time 723663466, Overhead: 3.24 Number of threads: 16, Total time 2380050348, Overhead: 5.33 Number of threads: 32, Total time 8084855738, Overhead: 9.05 Number of threads: 64, Total time 27646507880, Overhead: 15.48 Number of threads: 104, Total time 65470567548, Overhead: 22.56 Number of threads: 128, Total time 97966785076, Overhead: 27.43 Number of threads: 256, Total time 1110168309448, Overhead: 155.40 Number of threads: 360, Total time 2516015158740, Overhead: 250.44 Number of threads: 416, Total time 4219233078472, Overhead: 363.45 The above data tell us the numa spinlock improve performance upto 15X, Our real online workload indicate numa spinlock improve the system throughput 14.5%. Appreciate your any comments. Thanks a lot Ling 在 2018/12/26 上午10:50,“Ma Ling”<ling.ma.program@gmail.com> 写入: From: "ling.ma" <ling.ml@antfin.com> On multi-socket systems, memory is shared across the entire system. Data access to the local socket is much faster than the remote socket and data access to the local core is faster than sibling cores on the same socket. For serialized workloads with conventional spinlock, when there is high spinlock contention between threads, lock ping-pong among sockets becomes the bottleneck and threads spend majority of their time in spinlock overhead. On multi-socket systems, the keys to our NUMA spinlock performance are to minimize cross-socket traffic as well as localize the serialized workload to one core for execution. The basic principles of NUMA spinlock are mainly consisted of following approaches, which reduce data movement and accelerate critical section, eventually give us significant performance improvement. 1. MCS spinlock MCS spinlock help us to reduce the useless lock movement in the spinning state. This paper provides a good description for this kind of lock: <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf> 2. Critical Section Integration (CSI) Essentially spinlock is similar to that one core complete critical sections one by one. So when contention happen, the serialized works are sent to the core who is the lock owner and responsible to execute them, that can save much time and power, because all shared data are located in private cache of the lock owner. We implemented this mechanism based on queued spinlock in kernel, that speeds up critical section, and reduces the probability of contention. The paper provides a good description for this kind of lock: <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf> 3. NUMA Aware Spinlock (NAS) Currently multi-socket systems give us better performance per watt, however that also involves more complex synchronization requirement, because off-chip data movement is much slower. We use distributed synchronization mechanism to decrease Lock cache line to and from different nodes. The paper provides a good description for this kind of lock: <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf> 4. Yield Schedule When threads are applying for Critical Section Integration(CSI) with known contention, they will delegate work to the thread who is the lock owner, and wait for work to be completed. The resources which they are using should be transferred to other threads. In order to accelerate the scenario, we introduce yield_sched function during spinning stage. 5. Optimization when NUMA is ON or OFF. Although programs can access memory with lower latency when NUMA is enabled, some programs may need more memory bandwidth for computation with NUMA disabled. We also optimize multi-socket systems with NUMA disabled. NUMA spinlock flow chart (assuming there are 2 CPU nodes): 1. Threads from node_0/node_1 acquire local lock for node_0/1 respectively. If the thread succeeds in acquiring local lock, it goes to step 2, otherwise pushes critical function into current local work queue, and enters into spinning stage with MCS mode. 2. Threads from node_0/node_1 acquire the global lock. If it succeeds in acquiring the global lock as the lock owner, it goes to step 3, otherwise waits until the lock owner thread releases the global lock. 3. The lock owner thread from node_0/1 enters into critical section, cleans up work queue by performing all local critical functions pushed at step 1 with CSI on behalf of other threads and informs those spinning threads that their works have been done. It then releases the local lock. 4. The lock owner thread frees global lock. If another thread is waiting at step 2, the lock owner thread passes the global lock to the waiting thread and returns. The new lock owner thread enters into step 3. If no threads are waiting, the lock owner thread releases the global lock and returns. The whole critical section process is completed. Steps 1 and 2 mitigate global lock contention. Only one thread from different nodes will compete for the global lock in step 2. Step 3 reduces the global lock & shared data movement because they are located in the same node as well as the same core. Our data shows that Critical Section Integration (CSI) improves data locality and NUMA-aware spinlock (NAS) helps CSI balance the workload. NUMA spinlock can greatly speed up critical section on multi-socket systems. It should improve spinlock performance on all multi-socket systems. 2018-12-24 Ling Ma <ling.ml@antfin.com> H.J. Lu <hongjiu.lu@intel.com> Wei Xiao <wei3.xiao@intel.com> [BZ #23962] * NEWS: Mention NUMA spinlock. * manual/examples/numa-spinlock.c: New file. * sysdeps/unix/sysv/linux/numa-spinlock-private.h: Likewise. * sysdeps/unix/sysv/linux/numa-spinlock.c: Likewise. * sysdeps/unix/sysv/linux/numa-spinlock.h: Likewise. * sysdeps/unix/sysv/linux/numa_spinlock_alloc.c: Likewise. * sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c: Likewise. * sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c: Likewise. * sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c: Likewise. * manual/threads.texi: Document NUMA spinlock. * sysdeps/unix/sysv/linux/Makefile (libpthread-sysdep_routines): Add numa_spinlock_alloc and numa-spinlock. (sysdep_headers): Add numa-spinlock.h. * sysdeps/unix/sysv/linux/Versions (libpthread): Add numa_spinlock_alloc, numa_spinlock_free, numa_spinlock_init and numa_spinlock_apply to GLIBC_2.29. * sysdeps/unix/sysv/linux/aarch64/libpthread.abilist: Add numa_spinlock_alloc, numa_spinlock_apply, numa_spinlock_free and numa_spinlock_init. * sysdeps/unix/sysv/linux/alpha/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/arm/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/csky/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/hppa/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/i386/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/ia64/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/microblaze/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/nios2/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/sh/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist: Likewise. * sysdeps/unix/sysv/linux/x86/Makefile (xtests): Add tst-variable-overhead and tst-numa-variable-overhead. Signed-off-by: Ling Ma <ling.ml@antfin.com> Signed-off-by: H.J. Lu <hongjiu.lu@intel.com> Signed-off-by: Wei Xiao <wei3.xiao@intel.com> --- NEWS | 3 + manual/examples/numa-spinlock.c | 99 ++++++ manual/threads.texi | 105 ++++++ sysdeps/unix/sysv/linux/Makefile | 2 + sysdeps/unix/sysv/linux/Versions | 9 + sysdeps/unix/sysv/linux/aarch64/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/alpha/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/arm/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/csky/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/hppa/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/i386/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/ia64/libpthread.abilist | 4 + .../sysv/linux/m68k/coldfire/libpthread.abilist | 4 + .../unix/sysv/linux/m68k/m680x0/libpthread.abilist | 4 + .../unix/sysv/linux/microblaze/libpthread.abilist | 4 + .../unix/sysv/linux/mips/mips32/libpthread.abilist | 4 + .../unix/sysv/linux/mips/mips64/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/nios2/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/numa-spinlock-private.h | 38 ++ sysdeps/unix/sysv/linux/numa-spinlock.c | 327 ++++++++++++++++++ sysdeps/unix/sysv/linux/numa-spinlock.h | 64 ++++ sysdeps/unix/sysv/linux/numa_spinlock_alloc.c | 304 ++++++++++++++++ .../linux/powerpc/powerpc32/libpthread.abilist | 4 + .../linux/powerpc/powerpc64/be/libpthread.abilist | 4 + .../linux/powerpc/powerpc64/le/libpthread.abilist | 4 + .../unix/sysv/linux/riscv/rv64/libpthread.abilist | 4 + .../sysv/linux/s390/s390-32/libpthread.abilist | 4 + .../sysv/linux/s390/s390-64/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/sh/libpthread.abilist | 4 + .../sysv/linux/sparc/sparc32/libpthread.abilist | 4 + .../sysv/linux/sparc/sparc64/libpthread.abilist | 4 + sysdeps/unix/sysv/linux/x86/Makefile | 1 + .../sysv/linux/x86/tst-numa-variable-overhead.c | 53 +++ .../linux/x86/tst-variable-overhead-skeleton.c | 384 +++++++++++++++++++++ .../unix/sysv/linux/x86/tst-variable-overhead.c | 47 +++ .../unix/sysv/linux/x86_64/64/libpthread.abilist | 4 + .../unix/sysv/linux/x86_64/x32/libpthread.abilist | 4 + 37 files changed, 1532 insertions(+) create mode 100644 manual/examples/numa-spinlock.c create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c diff --git a/NEWS b/NEWS index cd29ec7..0764d0d 100644 --- a/NEWS +++ b/NEWS @@ -9,6 +9,9 @@ Version 2.29 Major new features: +* NUMA spinlock is added to provide a spinlock implementation optimized + for multi-socket NUMA systems. + * The getcpu wrapper function has been added, which returns the currently used CPU and NUMA node. This function is Linux-specific. diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c new file mode 100644 index 0000000..ca98443 --- /dev/null +++ b/manual/examples/numa-spinlock.c @@ -0,0 +1,99 @@ +/* NUMA spinlock example. + Copyright (C) 2018 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. +*/ + +#include <pthread.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> +#include <numa-spinlock.h> + +#define NUM_THREADS 20 + +struct numa_spinlock *lock; + +struct work_todo_argument +{ + void *arg; +}; + +static void * +work_todo (void *v) +{ + /* Do the real work with p->arg. */ + struct work_todo_argument *p = v; + /* Return value is set to lock_info.result. */ + return NULL; +} + +void * +work_thread (void *arg) +{ + struct work_todo_argument work_todo_arg; + struct numa_spinlock_info lock_info; + + if (numa_spinlock_init (lock, &lock_info)) + { + printf ("numa_spinlock_init failure: %m\n"); + exit (1); + } + + work_todo_arg.arg = arg; + lock_info.argument = &work_todo_arg; + lock_info.workload = work_todo; + + numa_spinlock_apply (&lock_info); + + return lock_info.result; +} + +int +main (int argc, char **argv) +{ + lock = numa_spinlock_alloc (); + pthread_t thr[NUM_THREADS]; + void *res[NUM_THREADS]; + int numthreads = NUM_THREADS; + int i; + + for (i = 0; i < NUM_THREADS; i++) + { + int err_ret = pthread_create (&thr[i], NULL, work_thread, + (void *) (intptr_t) i); + if (err_ret != 0) + { + printf ("pthread_create failed: %d, %s\n", + i, strerror (i)); + numthreads = i; + break; + } + } + + for (i = 0; i < numthreads; i++) + { + if (pthread_join (thr[i], (void *) &res[i]) == 0) + free (res[i]); + else + printf ("pthread_join failure: %m\n"); + } + + numa_spinlock_free (lock); + + return 0; +} diff --git a/manual/threads.texi b/manual/threads.texi index 87fda7d..e82ae0d 100644 --- a/manual/threads.texi +++ b/manual/threads.texi @@ -625,6 +625,9 @@ the standard. @menu * Default Thread Attributes:: Setting default attributes for threads in a process. +* NUMA Spinlock:: Spinlock optimized for + multi-socket NUMA platform. +* NUMA Spinlock Example:: A NUMA spinlock example. @end menu @node Default Thread Attributes @@ -669,6 +672,108 @@ The system does not have sufficient memory. @end table @end deftypefun +@node NUMA Spinlock +@subsubsection Spinlock optimized for multi-node NUMA systems + +To improve performance on multi-socket NUMA platforms for serialized +region protected by spinlock, @theglibc{} implements a NUMA spinlock +object, which minimizes cross-socket traffic and sends the protected +serialized region to one core for execution to reduce spinlock contention +overhead. + +The fundamental data types for a NUMA spinlock are +@code{numa_spinlock} and @code{numa_spinlock_info}: + +@deftp {Data Type} {struct numa_spinlock} +@standards{Linux, numa-spinlock.h} +This data type is an opaque structure. A @code{numa_spinlock} pointer +uniquely identifies a NUMA spinlock object. +@end deftp + +@deftp {Data Type} {struct numa_spinlock_info} +@standards{Linux, numa-spinlock.h} + +This data type uniquely identifies a NUMA spinlock information object for +a thread. It has the following members, and others internal to NUMA +spinlock implemenation: + +@table @code +@item void *(*workload) (void *) +A function pointer to the workload function serialized by spinlock. +@item void *argument +A pointer to argument passed to the @var{workload} function pointer. +@item void *result +Return value from the @var{workload} function pointer. +@end table + +@end deftp + +The following functions are provided for NUMA spinlock objects: + +@deftypefun struct numa_spinlock *numa_spinlock_alloc (void) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} + +This function returns a pointer to a newly allocated NUMA spinlock or a +null pointer if the NUMA spinlock could not be allocated, setting +@code{errno} to @code{ENOMEM}. Caller should call +@code{numa_spinlock_free} on the NUMA spinlock pointer to free the +memory space when it is no longer needed. + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock}) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} + +Free the memory space pointed to by @var{lock}, which must have been +returned by a previous call to @code{numa_spinlock_alloc}. Otherwise, +or if @code{numa_spinlock_free (@var{lock})} has already been called +before, undefined behavior occurs. + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock}, +struct numa_spinlock_info *@var{info}) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} + +Initialize the NUMA spinlock information block pointed to by @var{info} +with a NUMA spinlock pointer @var{lock}. The return value is @code{0} on +success and @code{-1} on failure. The following @code{errno} error +codes are defined for this function: + +@table @code +@item ENOSYS +The operating system does not support the @code{getcpu} function. +@end table + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info}) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} + +Apply for spinlock with a NUMA spinlock information block pointed to by +@var{info}. When @code{numa_spinlock_apply} returns, the spinlock is +released and the @var{result} member of @var{info} contains the return +value of the @var{workload} member. + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@node NUMA Spinlock Example +@subsubsection NUMA Spinlock Example + +A NUMA spinlock example: + +@smallexample +@include numa-spinlock.c.texi +@end smallexample + @c FIXME these are undocumented: @c pthread_atfork @c pthread_attr_destroy diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile index f827455..3361597 100644 --- a/sysdeps/unix/sysv/linux/Makefile +++ b/sysdeps/unix/sysv/linux/Makefile @@ -222,6 +222,8 @@ CFLAGS-gai.c += -DNEED_NETLINK endif ifeq ($(subdir),nptl) +libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock +sysdep_headers += numa-spinlock.h tests += tst-align-clone tst-getpid1 \ tst-thread-affinity-pthread tst-thread-affinity-pthread2 \ tst-thread-affinity-sched diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions index f1e12d9..7ce7e2b 100644 --- a/sysdeps/unix/sysv/linux/Versions +++ b/sysdeps/unix/sysv/linux/Versions @@ -185,3 +185,12 @@ libc { __netlink_assert_response; } } + +libpthread { + GLIBC_2.29 { + numa_spinlock_alloc; + numa_spinlock_free; + numa_spinlock_init; + numa_spinlock_apply; + } +} diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist index 9a9e4ce..eb54a83 100644 --- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist index b413007..dd08796 100644 --- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist index af82a4c..45a5c5a 100644 --- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.4 _IO_flockfile F GLIBC_2.4 _IO_ftrylockfile F GLIBC_2.4 _IO_funlockfile F diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist index ea4b79a..cf65f72 100644 --- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist @@ -73,6 +73,10 @@ GLIBC_2.29 mtx_timedlock F GLIBC_2.29 mtx_trylock F GLIBC_2.29 mtx_unlock F GLIBC_2.29 nanosleep F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.29 open F GLIBC_2.29 open64 F GLIBC_2.29 pause F diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist index bcba07f..a80475f 100644 --- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist index bece86d..40ac05a 100644 --- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist index ccc9449..5b190f6 100644 --- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist index af82a4c..45a5c5a 100644 --- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.4 _IO_flockfile F GLIBC_2.4 _IO_ftrylockfile F GLIBC_2.4 _IO_funlockfile F diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist index bece86d..40ac05a 100644 --- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist index 5067375..e6539bf 100644 --- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist index 0214496..76edcb8 100644 --- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist index 0214496..76edcb8 100644 --- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist index 78cac2a..3141d08 100644 --- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist @@ -241,3 +241,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h new file mode 100644 index 0000000..0f7a3be --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h @@ -0,0 +1,38 @@ +/* Internal definitions and declarations for NUMA spinlock. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include "numa-spinlock.h" + +/* The global NUMA spinlock. */ +struct numa_spinlock +{ + /* List of threads who owns the global NUMA spinlock. */ + struct numa_spinlock_info *owner; + /* The maximium NUMA node number. */ + unsigned int max_node; + /* Non-zero for single node system. */ + unsigned int single_node; + /* The maximium CPU number. Used only when NUMA is disabled. */ + unsigned int max_cpu; + /* Array of physical_package_id of each core if it isn't NULL. Used + only when NUMA is disabled.*/ + unsigned int *physical_package_id_p; + /* Arrays of lists of threads who are spinning for the local NUMA lock + on NUMA nodes indexed by NUMA node number. */ + struct numa_spinlock_info *lists[]; +}; diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c new file mode 100644 index 0000000..a141e7d --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa-spinlock.c @@ -0,0 +1,327 @@ +/* NUMA spinlock + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <config.h> +#include <string.h> +#include <stdlib.h> +#include <sched.h> +#ifndef HAVE_GETCPU +#include <unistd.h> +#include <syscall.h> +#endif +#include <errno.h> +#include <atomic.h> +#include "numa-spinlock-private.h" + +#if !defined HAVE_GETCPU && defined _LIBC +# define HAVE_GETCPU +#endif + +/* On multi-socket systems, memory is shared across the entire system. + Data access to the local socket is much faster than the remote socket + and data access to the local core is faster than sibling cores on the + same socket. For serialized workloads with conventional spinlock, + when there is high spinlock contention between threads, lock ping-pong + among sockets becomes the bottleneck and threads spend majority of + their time in spinlock overhead. + + On multi-socket systems, the keys to our NUMA spinlock performance + are to minimize cross-socket traffic as well as localize the serialized + workload to one core for execution. The basic principles of NUMA + spinlock are mainly consisted of following approaches, which reduce + data movement and accelerate critical section, eventually give us + significant performance improvement. + + 1. MCS spinlock + MCS spinlock help us to reduce the useless lock movement in the + spinning state. This paper provides a good description for this + kind of lock: + <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf> + + 2. Critical Section Integration (CSI) + Essentially spinlock is similar to that one core complete critical + sections one by one. So when contention happen, the serialized works + are sent to the core who is the lock owner and responsible to execute + them, that can save much time and power, because all shared data are + located in private cache of the lock owner. + + We implemented this mechanism based on queued spinlock in kernel, that + speeds up critical section, and reduces the probability of contention. + The paper provides a good description for this kind of lock: + <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf> + + 3. NUMA Aware Spinlock (NAS) + Currently multi-socket systems give us better performance per watt, + however that also involves more complex synchronization requirement, + because off-chip data movement is much slower. We use distributed + synchronization mechanism to decrease Lock cache line to and from + different nodes. The paper provides a good description for this kind + of lock: + <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf> + + 4. Yield Schedule + When threads are applying for Critical Section Integration(CSI) with + known contention, they will delegate work to the thread who is the + lock owner, and wait for work to be completed. The resources which + they are using should be transferred to other threads. In order to + accelerate the scenario, we introduce yield_sched function during + spinning stage. + + 5. Optimization when NUMA is ON or OFF. + Although programs can access memory with lower latency when NUMA is + enabled, some programs may need more memory bandwidth for computation + with NUMA disabled. We also optimize multi-socket systems with NUMA + disabled. + + NUMA spinlock flow chart (assuming there are 2 CPU nodes): + + 1. Threads from node_0/node_1 acquire local lock for node_0/1 + respectively. If the thread succeeds in acquiring local lock, it + goes to step 2, otherwise pushes critical function into current + local work queue, and enters into spinning stage with MCS mode. + + 2. Threads from node_0/node_1 acquire the global lock. If it succeeds + in acquiring the global lock as the lock owner, it goes to step 3, + otherwise waits until the lock owner thread releases the global lock. + + 3. The lock owner thread from node_0/1 enters into critical section, + cleans up work queue by performing all local critical functions + pushed at step 1 with CSI on behalf of other threads and informs + those spinning threads that their works have been done. It then + releases the local lock. + + 4. The lock owner thread frees global lock. If another thread is + waiting at step 2, the lock owner thread passes the global lock to + the waiting thread and returns. The new lock owner thread enters + into step 3. If no threads are waiting, the lock owner thread + releases the global lock and returns. The whole critical section + process is completed. + + Steps 1 and 2 mitigate global lock contention. Only one thread + from different nodes will compete for the global lock in step 2. + Step 3 reduces the global lock & shared data movement because they + are located in the same node as well as the same core. Our data + shows that Critical Section Integration (CSI) improves data locality + and NUMA-aware spinlock (NAS) helps CSI balance the workload. + + NUMA spinlock can greatly speed up critical section on multi-socket + systems. It should improve spinlock performance on all multi-socket + systems. + + NOTE: LiTL <https://github.com/multicore-locks/litl>, is an open-source + project that provides implementations of dozens of various locks, + including several state-of-the-art NUMA-aware spinlocks. Among them + + 1. Hierarchical MCS (HMCS) spinlock. Milind Chabbi, Michael Fagan, + and John Mellor-Crummey. High Performance Locks for Multi-level NUMA + Systems. In Proceedings of the ACM SIGPLAN Symposium on Principles + and Practice of Parallel Programming (PPoPP), pages 215–226, 2015. + + 2. Cohort-MCS (C-MCS) spinlock. Dave Dice, Virendra J. Marathe, and + Nir Shavit. Lock Cohorting: A General Technique for Designing NUMA + Locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015. + */ + +/* Get the next thread pointed to by *NEXT_P. NB: We must use a while + spin loop to load NEXT_P since there is a small window before *NEXT_P + is updated. */ + +static inline struct numa_spinlock_info * +get_numa_spinlock_info_next (struct numa_spinlock_info **next_p) +{ + struct numa_spinlock_info *next; + while (!(next = atomic_load_relaxed (next_p))) + atomic_spin_nop (); + return next; +} + +/* While holding the global NUMA spinlock, run the workload of the + thread pointed to by SELF first, then run the workload for each + thread on the thread list pointed to by HEAD_P and wake up the + thread so that all workloads run on a single processor. */ + +static inline void +run_numa_spinlock (struct numa_spinlock_info *self, + struct numa_spinlock_info **head_p) +{ + struct numa_spinlock_info *next, *current; + + /* Run the SELF's workload. */ + self->result = self->workload (self->argument); + + /* Process workloads for the rest of threads on the thread list. + NB: The thread list may be prepended by other threads at the + same time. */ + +retry: + /* If SELF is the first thread of the thread list pointed to by + HEAD_P, clear the thread list. */ + current = atomic_compare_and_exchange_val_acq (head_p, NULL, self); + if (current == self) + { + /* Since SELF is the only thread on the list, clear SELF's pending + field and return. */ + atomic_store_release (¤t->pending, 0); + return; + } + + /* CURRENT will have the previous first thread of the thread list + pointed to by HEAD_P and *HEAD_P will point to SELF. */ + current = atomic_exchange_acquire (head_p, self); + + /* NB: No need to check if CURRENT == SELF here since SELF can never + be CURRENT. */ + +repeat: + /* Get the next thread. */ + next = get_numa_spinlock_info_next (¤t->next); + + /* Run the CURRENT's workload and clear CURRENT's pending field. */ + current->result = current->workload (current->argument); + current->pending = 0; + + /* Process the workload for each thread from CURRENT to SELF on the + thread list. Don't pass beyond SELF since SELF is the last thread + on the list. */ + if (next == self) + goto retry; + current = next; + goto repeat; +} + +/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed + to by SELF. */ + +void +numa_spinlock_apply (struct numa_spinlock_info *self) +{ + struct numa_spinlock *lock = self->lock; + struct numa_spinlock_info *first, *next; + struct numa_spinlock_info **head_p; + + self->next = NULL; + /* We want the global NUMA spinlock. */ + self->pending = 1; + /* Select the local NUMA spinlock list by the NUMA node number. */ + head_p = &lock->lists[self->node]; + /* FIRST will have the previous first thread of the local NUMA spinlock + list and *HEAD_P will point to SELF. */ + first = atomic_exchange_acquire (head_p, self); + if (first) + { + /* SELF has been prepended to the thread list pointed to by + HEAD_P. NB: There is a small window between updating + *HEAD_P and self->next. */ + atomic_store_release (&self->next, first); + /* Let other threads run first since another thread will run our + workload for us. */ + sched_yield (); + /* Spin until our PENDING is cleared. */ + while (atomic_load_relaxed (&self->pending)) + atomic_spin_nop (); + return; + } + + /* NB: Now SELF must be the only thread on the thread list pointed + to by HEAD_P. Since thread is always prepended to HEAD_P, we + can use *HEAD_P == SELF to check if SELF is the only thread on + the thread list. */ + + if (__glibc_unlikely (lock->single_node)) + { + /* If there is only one node, there is no need for the global + NUMA spinlock. */ + run_numa_spinlock (self, head_p); + return; + } + + /* FIRST will have the previous first thread of the local NUMA spinlock + list of threads which holds the global NUMA spinlock, which will + point to SELF. */ + first = atomic_exchange_acquire (&lock->owner, self); + if (first) + { + /* SELF has been prepended to the thread list pointed to by + lock->owner. NB: There is a small window between updating + *HEAD_P and first->next. */ + atomic_store_release (&first->next, self); + /* Spin until the list of threads which holds the global NUMA + spinlock clears our PENDING. */ + while (atomic_load_relaxed (&self->pending)) + atomic_spin_nop (); + } + + /* We get the global NUMA spinlock now. Run our workload. */ + run_numa_spinlock (self, head_p); + + /* SELF is the only thread on the list if SELF is the first thread + of the thread list pointed to by lock->owner. In this case, we + simply return. */ + if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self)) + return; + + /* Wake up the next thread. */ + next = get_numa_spinlock_info_next (&self->next); + atomic_store_release (&next->pending, 0); +} + +/* Initialize the NUMA spinlock info data pointed to by INFO from a + pointer to the NUMA spinlock, LOCK. */ + +int +numa_spinlock_init (struct numa_spinlock *lock, + struct numa_spinlock_info *info) +{ + memset (info, 0, sizeof (*info)); + info->lock = lock; + /* For single node system, use 0 as the NUMA node number. */ + if (lock->single_node) + return 0; + /* NB: Use the NUMA node number from getcpu to select the local NUMA + spinlock list. */ + unsigned int cpu; + unsigned int node; +#ifdef HAVE_GETCPU + int err_ret = getcpu (&cpu, &node); +#else + int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL); +#endif + if (err_ret) + return err_ret; + if (lock->physical_package_id_p) + { + /* Can it ever happen? */ + if (cpu > lock->max_cpu) + cpu = lock->max_cpu; + /* NB: If NUMA is disabled, use physical_package_id. */ + node = lock->physical_package_id_p[cpu]; + } + /* Can it ever happen? */ + if (node > lock->max_node) + node = lock->max_node; + info->node = node; + return err_ret; +} + +void +numa_spinlock_free (struct numa_spinlock *lock) +{ + if (lock->physical_package_id_p) + free (lock->physical_package_id_p); + free (lock); +} diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h new file mode 100644 index 0000000..b17bda5 --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa-spinlock.h @@ -0,0 +1,64 @@ +/* Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#ifndef _NUMA_SPINLOCK_H +#define _NUMA_SPINLOCK_H + +#include <features.h> + +__BEGIN_DECLS + +/* The NUMA spinlock. */ +struct numa_spinlock; + +/* The NUMA spinlock information for each thread. */ +struct numa_spinlock_info +{ + /* The workload function of this thread. */ + void *(*workload) (void *); + /* The argument pointer passed to the workload function. */ + void *argument; + /* The return value of the workload function. */ + void *result; + /* The pointer to the NUMA spinlock. */ + struct numa_spinlock *lock; + /* The next thread on the local NUMA spinlock thread list. */ + struct numa_spinlock_info *next; + /* The NUMA node number. */ + unsigned int node; + /* Non-zero to indicate that the thread wants the NUMA spinlock. */ + int pending; + /* Reserved for future use. */ + void *__reserved[4]; +}; + +/* Return a pointer to a newly allocated NUMA spinlock. */ +extern struct numa_spinlock *numa_spinlock_alloc (void); + +/* Free the memory space of the NUMA spinlock. */ +extern void numa_spinlock_free (struct numa_spinlock *); + +/* Initialize the NUMA spinlock information block. */ +extern int numa_spinlock_init (struct numa_spinlock *, + struct numa_spinlock_info *); + +/* Apply for spinlock with a NUMA spinlock information block. */ +extern void numa_spinlock_apply (struct numa_spinlock_info *); + +__END_DECLS + +#endif /* numa-spinlock.h */ diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c new file mode 100644 index 0000000..8ff4e1a --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c @@ -0,0 +1,304 @@ +/* Initialization of NUMA spinlock. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <assert.h> +#include <ctype.h> +#include <string.h> +#include <dirent.h> +#include <stdio.h> +#include <limits.h> +#ifdef _LIBC +# include <not-cancel.h> +#else +# include <stdlib.h> +# include <unistd.h> +# include <fcntl.h> +# define __open_nocancel open +# define __close_nocancel_nostatus close +# define __read_nocancel read +#endif + +#include "numa-spinlock-private.h" + +static char * +next_line (int fd, char *const buffer, char **cp, char **re, + char *const buffer_end) +{ + char *res = *cp; + char *nl = memchr (*cp, '\n', *re - *cp); + if (nl == NULL) + { + if (*cp != buffer) + { + if (*re == buffer_end) + { + memmove (buffer, *cp, *re - *cp); + *re = buffer + (*re - *cp); + *cp = buffer; + + ssize_t n = __read_nocancel (fd, *re, buffer_end - *re); + if (n < 0) + return NULL; + + *re += n; + + nl = memchr (*cp, '\n', *re - *cp); + while (nl == NULL && *re == buffer_end) + { + /* Truncate too long lines. */ + *re = buffer + 3 * (buffer_end - buffer) / 4; + n = __read_nocancel (fd, *re, buffer_end - *re); + if (n < 0) + return NULL; + + nl = memchr (*re, '\n', n); + **re = '\n'; + *re += n; + } + } + else + nl = memchr (*cp, '\n', *re - *cp); + + res = *cp; + } + + if (nl == NULL) + nl = *re - 1; + } + + *cp = nl + 1; + assert (*cp <= *re); + + return res == *re ? NULL : res; +} + +static int +select_cpu (const struct dirent *d) +{ + /* Return 1 for "cpuXXX" where XXX are digits. */ + if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0) + { + const char *p = d->d_name + 3; + + if (*p == '\0') + return 0; + + do + { + if (!isdigit (*p)) + return 0; + p++; + } + while (*p != '\0'); + + return 1; + } + return 0; +} + +/* Allocate a NUMA spinlock and return a pointer to it. Caller should + call numa_spinlock_free on the NUMA spinlock pointer to free the + memory when it is no longer needed. */ + +struct numa_spinlock * +numa_spinlock_alloc (void) +{ + const size_t buffer_size = 1024; + char buffer[buffer_size]; + char *buffer_end = buffer + buffer_size; + char *cp = buffer_end; + char *re = buffer_end; + + const int flags = O_RDONLY | O_CLOEXEC; + int fd = __open_nocancel ("/sys/devices/system/node/online", flags); + char *l; + unsigned int max_node = 0; + unsigned int node_count = 0; + if (fd != -1) + { + l = next_line (fd, buffer, &cp, &re, buffer_end); + if (l != NULL) + do + { + char *endp; + unsigned long int n = strtoul (l, &endp, 10); + if (l == endp) + { + node_count = 1; + break; + } + + unsigned long int m = n; + if (*endp == '-') + { + l = endp + 1; + m = strtoul (l, &endp, 10); + if (l == endp) + { + node_count = 1; + break; + } + } + + node_count += m - n + 1; + + if (m >= max_node) + max_node = m; + + l = endp; + while (l < re && isspace (*l)) + ++l; + } + while (l < re); + + __close_nocancel_nostatus (fd); + } + + /* NB: Some NUMA nodes may not be available, if the number of NUMA + nodes is 1, set the maximium NUMA node number to 0. */ + if (node_count == 1) + max_node = 0; + + unsigned int max_cpu = 0; + unsigned int *physical_package_id_p = NULL; + + if (max_node == 0) + { + /* There is at least 1 node. */ + node_count = 1; + + /* If NUMA is disabled, use physical_package_id instead. */ + struct dirent **cpu_list; + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list, + select_cpu, NULL); + if (nprocs > 0) + { + int i; + unsigned int *cpu_id_p = NULL; + + /* Find the maximum CPU number. */ + if (posix_memalign ((void **) &cpu_id_p, + __alignof__ (void *), + nprocs * sizeof (unsigned int)) == 0) + { + for (i = 0; i < nprocs; i++) + { + unsigned int cpu_id + = strtoul (cpu_list[i]->d_name + 3, NULL, 10); + cpu_id_p[i] = cpu_id; + if (cpu_id > max_cpu) + max_cpu = cpu_id; + } + + if (posix_memalign ((void **) &physical_package_id_p, + __alignof__ (void *), + ((max_cpu + 1) + * sizeof (unsigned int))) == 0) + { + memset (physical_package_id_p, 0, + ((max_cpu + 1) * sizeof (unsigned int))); + + max_node = UINT_MAX; + + /* Get physical_package_id. */ + char path[(sizeof ("/sys/devices/system/cpu") + + 3 * sizeof (unsigned long int) + + sizeof ("/topology/physical_package_id"))]; + for (i = 0; i < nprocs; i++) + { + struct dirent *d = cpu_list[i]; + if (snprintf (path, sizeof (path), + "/sys/devices/system/cpu/%s/topology/physical_package_id", + d->d_name) > 0) + { + fd = __open_nocancel (path, flags); + if (fd != -1) + { + if (__read_nocancel (fd, buffer, + buffer_size) > 0) + { + char *endp; + unsigned long int package_id + = strtoul (buffer, &endp, 10); + if (package_id != ULONG_MAX + && *buffer != '\0' + && (*endp == '\0' || *endp == '\n')) + { + physical_package_id_p[cpu_id_p[i]] + = package_id; + if (max_node == UINT_MAX) + { + /* This is the first node. */ + max_node = package_id; + } + else if (package_id != max_node) + { + /* NB: We only need to know if + NODE_COUNT > 1. */ + node_count = 2; + if (package_id > max_node) + max_node = package_id; + } + } + } + __close_nocancel_nostatus (fd); + } + } + + free (d); + } + } + + free (cpu_id_p); + } + else + { + for (i = 0; i < nprocs; i++) + free (cpu_list[i]); + } + + free (cpu_list); + } + } + + if (physical_package_id_p != NULL && node_count == 1) + { + /* There is only one node. No need for physical_package_id_p. */ + free (physical_package_id_p); + physical_package_id_p = NULL; + max_cpu = 0; + } + + /* Allocate an array of struct numa_spinlock_info pointers to hold info + for all NUMA nodes with NUMA node number from getcpu () as index. */ + size_t size = (sizeof (struct numa_spinlock) + + ((max_node + 1) + * sizeof (struct numa_spinlock_info *))); + struct numa_spinlock *lock; + if (posix_memalign ((void **) &lock, + __alignof__ (struct numa_spinlock_info *), size)) + return NULL; + memset (lock, 0, size); + + lock->max_node = max_node; + lock->single_node = node_count == 1; + lock->max_cpu = max_cpu; + lock->physical_package_id_p = physical_package_id_p; + + return lock; +} diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist index 09e8447..dba7df6 100644 --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist index 8300958..a763c0a 100644 --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3 _IO_flockfile F GLIBC_2.3 _IO_ftrylockfile F GLIBC_2.3 _IO_funlockfile F diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist index 9a9e4ce..eb54a83 100644 --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist index c370fda..366fcac 100644 --- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist @@ -235,3 +235,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist index d05468f..786d8e1 100644 --- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist @@ -229,6 +229,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist index e8161aa..dd7c52f 100644 --- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist @@ -221,6 +221,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist index bcba07f..a80475f 100644 --- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist index b413007..dd08796 100644 --- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist index ccc9449..5b190f6 100644 --- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/x86/Makefile b/sysdeps/unix/sysv/linux/x86/Makefile index 02ca36c..29d41ad 100644 --- a/sysdeps/unix/sysv/linux/x86/Makefile +++ b/sysdeps/unix/sysv/linux/x86/Makefile @@ -14,6 +14,7 @@ endif ifeq ($(subdir),nptl) libpthread-sysdep_routines += elision-lock elision-unlock elision-timed \ elision-trylock +xtests += tst-variable-overhead tst-numa-variable-overhead CFLAGS-elision-lock.c += -mrtm CFLAGS-elision-unlock.c += -mrtm CFLAGS-elision-timed.c += -mrtm diff --git a/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c new file mode 100644 index 0000000..7cb8542 --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c @@ -0,0 +1,53 @@ +/* Test case for NUMA spinlock overhead. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#ifndef _GNU_SOURCE +# define _GNU_SOURCE +#endif +#include "numa-spinlock.h" + +struct numa_spinlock *lock; + +struct work_todo_argument +{ + unsigned long *v1; + unsigned long *v2; + unsigned long *v3; + unsigned long *v4; +}; + +static void * +work_todo (void *v) +{ + struct work_todo_argument *p = v; + unsigned long ret; + *p->v1 = *p->v1 + 1; + *p->v2 = *p->v2 + 1; + ret = __sync_val_compare_and_swap (p->v4, 0, 1); + *p->v3 = *p->v3 + ret; + return (void *) 2; +} + +static inline void +do_work (struct numa_spinlock_info *lock_info) +{ + numa_spinlock_apply (lock_info); +} + +#define USE_NUMA_SPINLOCK +#include "tst-variable-overhead-skeleton.c" diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c new file mode 100644 index 0000000..4b83dfb --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c @@ -0,0 +1,384 @@ +/* Test case skeleton for spinlock overhead. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +/* Check spinlock overhead with large number threads. Critical region is + very smmall. Critical region + spinlock overhead aren't noticeable + when number of threads is small. When thread number increases, + spinlock overhead become the bottleneck. It shows up in wall time + of thread execution. */ + +#ifndef _GNU_SOURCE +# define _GNU_SOURCE +#endif +#include <config.h> +#include <unistd.h> +#include <stdio.h> +#include <pthread.h> +#include <sched.h> +#include <stdlib.h> +#include <string.h> +#include <stdint.h> +#include <sys/time.h> +#include <sys/param.h> +#include <errno.h> +#ifdef MODULE_NAME +# include <cpu-features.h> +# include <support/test-driver.h> +#endif + +#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP +# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1 +#endif + +#define memory_barrier() __asm ("" ::: "memory") +#define pause() __asm ("rep ; nop" ::: "memory") + +#define CACHELINE_SIZE 64 +#define CACHE_ALIGNED __attribute__((aligned(CACHELINE_SIZE))) + +#define constant_time 5 +unsigned long g_val CACHE_ALIGNED; +unsigned long g_val2 CACHE_ALIGNED; +unsigned long g_val3 CACHE_ALIGNED; +unsigned long cmplock CACHE_ALIGNED; +struct count +{ + unsigned long long total; + unsigned long long spinlock; + unsigned long long wall; +} __attribute__((aligned(128))); + +struct count *gcount; + +/* The time consumed by one update is about 200 TSCs. */ +static int delay_time_unlocked = 400; + +struct ops +{ + void *(*test) (void *arg); + void (*print_thread) (void *res, int); +} *ops; + +struct stats_result +{ + unsigned long num; +}; + +void *work_thread (void *arg); + +#define iterations (10000 * 5) + +static volatile int start_thread; + +/* Delay some fixed time */ +static void +delay_tsc (unsigned n) +{ + unsigned long long start, current, diff; + unsigned int aux; + start = __builtin_ia32_rdtscp (&aux); + while (1) + { + current = __builtin_ia32_rdtscp (&aux); + diff = current - start; + if (diff < n) + pause (); + else + break; + } +} + +static void +wait_a_bit (int delay_time) +{ + if (delay_time > 0) + delay_tsc (delay_time); +} + +#ifndef USE_NUMA_SPINLOCK +static inline void +work_todo (void) +{ + unsigned long ret; + g_val = g_val + 1; + g_val2 = g_val2 + 1; + ret = __sync_val_compare_and_swap (&cmplock, 0, 1); + g_val3 = g_val3 + 1 + ret; +} +#endif + +void * +work_thread (void *arg) +{ + long i; + unsigned long pid = (unsigned long) arg; + struct stats_result *res; + unsigned long long start, end; + int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE, + roundup (sizeof (*res), CACHELINE_SIZE)); + if (err_ret) + { + printf ("posix_memalign failure: %s\n", strerror (err_ret)); + exit (err_ret); + } + long num = 0; + +#ifdef USE_NUMA_SPINLOCK + struct work_todo_argument work_todo_arg; + struct numa_spinlock_info lock_info; + + if (numa_spinlock_init (lock, &lock_info)) + { + printf ("numa_spinlock_init failure: %m\n"); + exit (1); + } + + work_todo_arg.v1 = &g_val; + work_todo_arg.v2 = &g_val2; + work_todo_arg.v3 = &g_val3; + work_todo_arg.v4 = &cmplock; + lock_info.argument = &work_todo_arg; + lock_info.workload = work_todo; +#endif + + while (!start_thread) + pause (); + + unsigned int aux; + start = __builtin_ia32_rdtscp (&aux); + for (i = 0; i < iterations; i++) + { +#ifdef USE_NUMA_SPINLOCK + do_work (&lock_info); +#else + do_work (); +#endif + wait_a_bit (delay_time_unlocked); + num++; + } + end = __builtin_ia32_rdtscp (&aux); + gcount[pid].total = end - start; + res->num = num; + + return res; +} + +void +init_global_data(void) +{ + g_val = 0; + g_val2 = 0; + g_val3 = 0; + cmplock = 0; +} + +void +test_threads (int numthreads, int numprocs, unsigned long time) +{ + start_thread = 0; + +#ifdef USE_NUMA_SPINLOCK + lock = numa_spinlock_alloc (); +#endif + + memory_barrier (); + + pthread_t thr[numthreads]; + void *res[numthreads]; + int i; + + init_global_data (); + for (i = 0; i < numthreads; i++) + { + pthread_attr_t attr; + const pthread_attr_t *attrp = NULL; + if (USE_PTHREAD_ATTR_SETAFFINITY_NP) + { + attrp = &attr; + pthread_attr_init (&attr); + cpu_set_t set; + CPU_ZERO (&set); + int cpu = i % numprocs; + (void) CPU_SET (cpu, &set); + pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set); + } + int err_ret = pthread_create (&thr[i], attrp, ops->test, + (void *)(uintptr_t) i); + if (err_ret != 0) + { + printf ("pthread_create failed: %d, %s\n", + i, strerror (i)); + numthreads = i; + break; + } + } + + memory_barrier (); + start_thread = 1; + memory_barrier (); + sched_yield (); + + if (time) + { + struct timespec ts = + { + ts.tv_sec = time, + ts.tv_nsec = 0 + }; + clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL); + memory_barrier (); + } + + for (i = 0; i < numthreads; i++) + { + if (pthread_join (thr[i], (void *) &res[i]) == 0) + free (res[i]); + else + printf ("pthread_join failure: %m\n"); + } + +#ifdef USE_NUMA_SPINLOCK + numa_spinlock_free (lock); +#endif +} + +struct ops hashwork_ops = +{ + .test = work_thread, +}; + +struct ops *ops; + +static struct count +total_cost (int numthreads, int numprocs) +{ + int i; + unsigned long long total = 0; + unsigned long long spinlock = 0; + + memset (gcount, 0, sizeof(gcount[0]) * numthreads); + + unsigned long long start, end, diff; + unsigned int aux; + + start = __builtin_ia32_rdtscp (&aux); + test_threads (numthreads, numprocs, constant_time); + end = __builtin_ia32_rdtscp (&aux); + diff = end - start; + + for (i = 0; i < numthreads; i++) + { + total += gcount[i].total; + spinlock += gcount[i].spinlock; + } + + struct count cost = { total, spinlock, diff }; + return cost; +} + +#ifdef MODULE_NAME +static int +do_test (void) +{ + if (!CPU_FEATURE_USABLE (RDTSCP)) + return EXIT_UNSUPPORTED; +#else +int +main (void) +{ +#endif + int numprocs = sysconf (_SC_NPROCESSORS_ONLN); + + /* Oversubscribe CPU. */ + int numthreads = 4 * numprocs; + + ops = &hashwork_ops; + + int err_ret = posix_memalign ((void **)&gcount, 4096, + sizeof(gcount[0]) * numthreads); + if (err_ret) + { + printf ("posix_memalign failure: %s\n", strerror (err_ret)); + exit (err_ret); + } + + struct count cost, cost1; + double overhead; + int i, last; + int last_increment = numprocs < 16 ? 16 : numprocs; + int numprocs_done = 0; + int numprocs_reset = 0; + cost1 = total_cost (1, numprocs); + + printf ("Number of processors: %d, Single thread time %lld\n\n", + numprocs, cost1.total); + + for (last = i = 2; i <= numthreads;) + { + last = i; + cost = total_cost (i, numprocs); + overhead = cost.total; + overhead /= i; + overhead /= cost1.total; + printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n", + i, cost.total, overhead); + if ((i * 2) < numprocs) + i = i * 2; + else if (numprocs_done) + { + if (numprocs_reset) + { + i = numprocs_reset; + numprocs_reset = 0; + } + else + { + if ((i * 2) < numthreads) + i = i * 2; + else + i = i + last_increment; + } + } + else + { + if (numprocs != 2 * i) + numprocs_reset = 2 * i; + i = numprocs; + numprocs_done = 1; + } + } + + if (last != numthreads) + { + i = numthreads; + cost = total_cost (i, numprocs); + overhead = cost.total; + overhead /= i; + overhead /= cost1.total; + printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n", + i, cost.total, overhead); + } + + free (gcount); + return 0; +} + +#ifdef MODULE_NAME +# define TIMEOUT 900 +# include <support/test-driver.c> +#endif diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c new file mode 100644 index 0000000..b3ce567 --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c @@ -0,0 +1,47 @@ +/* Test case for spinlock overhead. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#ifndef _GNU_SOURCE +# define _GNU_SOURCE +#endif +#include <pthread.h> + +struct +{ + pthread_spinlock_t testlock; + char pad[64 - sizeof (pthread_spinlock_t)]; +} test __attribute__((aligned(64))); + +static void +__attribute__((constructor)) +init_spin (void) +{ + pthread_spin_init (&test.testlock, 0); +} + +static void work_todo (void); + +static inline void +do_work (void) +{ + pthread_spin_lock(&test.testlock); + work_todo (); + pthread_spin_unlock(&test.testlock); +} + +#include "tst-variable-overhead-skeleton.c" diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist index 931c827..e90532e 100644 --- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist index c09c9b0..c74febb 100644 --- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F -- 1.8.3.1
On 03/01/2019 05:35, 马凌(彦军) wrote: > create mode 100644 manual/examples/numa-spinlock.c > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h > create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c as far as i can tell the new code is generic (other than the presence of efficient getcpu), so i think the test should be generic too. > --- /dev/null > +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c > @@ -0,0 +1,384 @@ ... > +/* Check spinlock overhead with large number threads. Critical region is > + very smmall. Critical region + spinlock overhead aren't noticeable > + when number of threads is small. When thread number increases, > + spinlock overhead become the bottleneck. It shows up in wall time > + of thread execution. */ yeah, this is not easy to do in a generic way, i think even on x86 such measurement is problematic, you don't know what goes on a system (or vm) when the glibc test is running. but doing precise timing is not that important for checking the correctness of the locks, so i think a simplified version can be generic test code.
On Thu, Jan 3, 2019 at 6:52 AM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote: > > On 03/01/2019 05:35, 马凌(彦军) wrote: > > create mode 100644 manual/examples/numa-spinlock.c > > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h > > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c > > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h > > create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c > > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c > > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c > > as far as i can tell the new code is generic > (other than the presence of efficient getcpu), > so i think the test should be generic too. > > > --- /dev/null > > +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c > > @@ -0,0 +1,384 @@ > ... > > +/* Check spinlock overhead with large number threads. Critical region is > > + very smmall. Critical region + spinlock overhead aren't noticeable > > + when number of threads is small. When thread number increases, > > + spinlock overhead become the bottleneck. It shows up in wall time > > + of thread execution. */ > > yeah, this is not easy to do in a generic way, i think > even on x86 such measurement is problematic, you don't > know what goes on a system (or vm) when the glibc test > is running. > > but doing precise timing is not that important for > checking the correctness of the locks, so i think a > simplified version can be generic test code. Here is the updated patch to make tests generic.
On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote: > From: "ling.ma" <ling.ml@antfin.com> > > On multi-socket systems, memory is shared across the entire system. > Data access to the local socket is much faster than the remote socket > and data access to the local core is faster than sibling cores on the > same socket. For serialized workloads with conventional spinlock, > when there is high spinlock contention between threads, lock ping-pong > among sockets becomes the bottleneck and threads spend majority of > their time in spinlock overhead. > > On multi-socket systems, the keys to our NUMA spinlock performance > are to minimize cross-socket traffic as well as localize the serialized > workload to one core for execution. The basic principles of NUMA > spinlock are mainly consisted of following approaches, which reduce > data movement and accelerate critical section, eventually give us > significant performance improvement. I question whether this belongs in glibc. It seems highly application- and kernel-specific. Is there a reason you wouldn't prefer to implement and maintain it in a library for use in the kind of application that needs it? Some specific review inline below: > [...] > diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > new file mode 100644 > index 0000000..8ff4e1a > --- /dev/null > +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > @@ -0,0 +1,304 @@ > +/* Initialization of NUMA spinlock. > + Copyright (C) 2018 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. */ > + > +#include <assert.h> > +#include <ctype.h> > +#include <string.h> > +#include <dirent.h> > +#include <stdio.h> > +#include <limits.h> > +#ifdef _LIBC > +# include <not-cancel.h> > +#else > +# include <stdlib.h> > +# include <unistd.h> > +# include <fcntl.h> > +# define __open_nocancel open > +# define __close_nocancel_nostatus close > +# define __read_nocancel read > +#endif > + > +#include "numa-spinlock-private.h" > + > +static char * > +next_line (int fd, char *const buffer, char **cp, char **re, > + char *const buffer_end) > +{ > + char *res = *cp; > + char *nl = memchr (*cp, '\n', *re - *cp); > + if (nl == NULL) > + { > + if (*cp != buffer) > + { > + if (*re == buffer_end) > + { > + memmove (buffer, *cp, *re - *cp); > + *re = buffer + (*re - *cp); > + *cp = buffer; > + > + ssize_t n = __read_nocancel (fd, *re, buffer_end - *re); > + if (n < 0) > + return NULL; > + > + *re += n; > + > + nl = memchr (*cp, '\n', *re - *cp); > + while (nl == NULL && *re == buffer_end) > + { > + /* Truncate too long lines. */ > + *re = buffer + 3 * (buffer_end - buffer) / 4; > + n = __read_nocancel (fd, *re, buffer_end - *re); > + if (n < 0) > + return NULL; > + > + nl = memchr (*re, '\n', n); > + **re = '\n'; > + *re += n; > + } > + } > + else > + nl = memchr (*cp, '\n', *re - *cp); > + > + res = *cp; > + } > + > + if (nl == NULL) > + nl = *re - 1; > + } > + > + *cp = nl + 1; > + assert (*cp <= *re); > + > + return res == *re ? NULL : res; > +} This looks like fragile duplication of stdio-like buffering logic that's not at all specific to this file. Does glibc have a policy on whether things needing this should use stdio or some other shared code rather than open-coding it like this? > [...] > +/* Allocate a NUMA spinlock and return a pointer to it. Caller should > + call numa_spinlock_free on the NUMA spinlock pointer to free the > + memory when it is no longer needed. */ > + > +struct numa_spinlock * > +numa_spinlock_alloc (void) > +{ > + const size_t buffer_size = 1024; > + char buffer[buffer_size]; > + char *buffer_end = buffer + buffer_size; > + char *cp = buffer_end; > + char *re = buffer_end; > + > + const int flags = O_RDONLY | O_CLOEXEC; > + int fd = __open_nocancel ("/sys/devices/system/node/online", flags); > + char *l; > + unsigned int max_node = 0; > + unsigned int node_count = 0; > + if (fd != -1) > + { > + l = next_line (fd, buffer, &cp, &re, buffer_end); > + if (l != NULL) > + do > + { > + char *endp; > + unsigned long int n = strtoul (l, &endp, 10); > + if (l == endp) > + { > + node_count = 1; > + break; > + } > + > + unsigned long int m = n; > + if (*endp == '-') > + { > + l = endp + 1; > + m = strtoul (l, &endp, 10); > + if (l == endp) > + { > + node_count = 1; > + break; > + } > + } > + > + node_count += m - n + 1; > + > + if (m >= max_node) > + max_node = m; > + > + l = endp; > + while (l < re && isspace (*l)) > + ++l; > + } > + while (l < re); > + > + __close_nocancel_nostatus (fd); > + } > + > + /* NB: Some NUMA nodes may not be available, if the number of NUMA > + nodes is 1, set the maximium NUMA node number to 0. */ > + if (node_count == 1) > + max_node = 0; > + > + unsigned int max_cpu = 0; > + unsigned int *physical_package_id_p = NULL; > + > + if (max_node == 0) > + { > + /* There is at least 1 node. */ > + node_count = 1; > + > + /* If NUMA is disabled, use physical_package_id instead. */ > + struct dirent **cpu_list; > + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list, > + select_cpu, NULL); > + if (nprocs > 0) > + { > + int i; > + unsigned int *cpu_id_p = NULL; > + > + /* Find the maximum CPU number. */ > + if (posix_memalign ((void **) &cpu_id_p, > + __alignof__ (void *), > + nprocs * sizeof (unsigned int)) == 0) Using posix_memalign to get memory with the alignment of __alignof__(void*) makes no sense. All allocations via malloc are suitably aligned for any standard type. > + { > + for (i = 0; i < nprocs; i++) > + { > + unsigned int cpu_id > + = strtoul (cpu_list[i]->d_name + 3, NULL, 10); > + cpu_id_p[i] = cpu_id; > + if (cpu_id > max_cpu) > + max_cpu = cpu_id; > + } > + > + if (posix_memalign ((void **) &physical_package_id_p, > + __alignof__ (void *), > + ((max_cpu + 1) > + * sizeof (unsigned int))) == 0) Again. > + { > + memset (physical_package_id_p, 0, > + ((max_cpu + 1) * sizeof (unsigned int))); > + > + max_node = UINT_MAX; > + > + /* Get physical_package_id. */ > + char path[(sizeof ("/sys/devices/system/cpu") > + + 3 * sizeof (unsigned long int) > + + sizeof ("/topology/physical_package_id"))]; > + for (i = 0; i < nprocs; i++) > + { > + struct dirent *d = cpu_list[i]; > + if (snprintf (path, sizeof (path), > + "/sys/devices/system/cpu/%s/topology/physical_package_id", > + d->d_name) > 0) Are these sysfs pathnames documented as stable/permanent by the kernel? > + { > + fd = __open_nocancel (path, flags); > + if (fd != -1) > + { > + if (__read_nocancel (fd, buffer, > + buffer_size) > 0) > + { > + char *endp; > + unsigned long int package_id > + = strtoul (buffer, &endp, 10); > + if (package_id != ULONG_MAX > + && *buffer != '\0' > + && (*endp == '\0' || *endp == '\n')) > + { > + physical_package_id_p[cpu_id_p[i]] > + = package_id; > + if (max_node == UINT_MAX) > + { > + /* This is the first node. */ > + max_node = package_id; > + } > + else if (package_id != max_node) > + { > + /* NB: We only need to know if > + NODE_COUNT > 1. */ > + node_count = 2; > + if (package_id > max_node) > + max_node = package_id; > + } > + } > + } > + __close_nocancel_nostatus (fd); > + } > + } > + > + free (d); > + } > + } > + > + free (cpu_id_p); > + } > + else > + { > + for (i = 0; i < nprocs; i++) > + free (cpu_list[i]); > + } > + > + free (cpu_list); > + } > + } > + > + if (physical_package_id_p != NULL && node_count == 1) > + { > + /* There is only one node. No need for physical_package_id_p. */ > + free (physical_package_id_p); > + physical_package_id_p = NULL; > + max_cpu = 0; > + } > + > + /* Allocate an array of struct numa_spinlock_info pointers to hold info > + for all NUMA nodes with NUMA node number from getcpu () as index. */ > + size_t size = (sizeof (struct numa_spinlock) > + + ((max_node + 1) > + * sizeof (struct numa_spinlock_info *))); > + struct numa_spinlock *lock; > + if (posix_memalign ((void **) &lock, > + __alignof__ (struct numa_spinlock_info *), size)) Another gratuitous posix_memalign.
On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote: > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote: > > From: "ling.ma" <ling.ml@antfin.com> > > > > On multi-socket systems, memory is shared across the entire system. > > Data access to the local socket is much faster than the remote socket > > and data access to the local core is faster than sibling cores on the > > same socket. For serialized workloads with conventional spinlock, > > when there is high spinlock contention between threads, lock ping-pong > > among sockets becomes the bottleneck and threads spend majority of > > their time in spinlock overhead. > > > > On multi-socket systems, the keys to our NUMA spinlock performance > > are to minimize cross-socket traffic as well as localize the serialized > > workload to one core for execution. The basic principles of NUMA > > spinlock are mainly consisted of following approaches, which reduce > > data movement and accelerate critical section, eventually give us > > significant performance improvement. > > I question whether this belongs in glibc. It seems highly application- > and kernel-specific. Is there a reason you wouldn't prefer to > implement and maintain it in a library for use in the kind of > application that needs it? This is a good question. On the other hand, the current spinlock in glibc hasn't been changed for many years. It doesn't scale for today's hardware. Having a scalable spinlock in glibc is desirable. > Some specific review inline below: > > > [...] > > diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > > new file mode 100644 > > index 0000000..8ff4e1a > > --- /dev/null > > +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > > @@ -0,0 +1,304 @@ > > +/* Initialization of NUMA spinlock. > > + Copyright (C) 2018 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <http://www.gnu.org/licenses/>. */ > > + > > +#include <assert.h> > > +#include <ctype.h> > > +#include <string.h> > > +#include <dirent.h> > > +#include <stdio.h> > > +#include <limits.h> > > +#ifdef _LIBC > > +# include <not-cancel.h> > > +#else > > +# include <stdlib.h> > > +# include <unistd.h> > > +# include <fcntl.h> > > +# define __open_nocancel open > > +# define __close_nocancel_nostatus close > > +# define __read_nocancel read > > +#endif > > + > > +#include "numa-spinlock-private.h" > > + > > +static char * > > +next_line (int fd, char *const buffer, char **cp, char **re, > > + char *const buffer_end) > > +{ > > + char *res = *cp; > > + char *nl = memchr (*cp, '\n', *re - *cp); > > + if (nl == NULL) > > + { > > + if (*cp != buffer) > > + { > > + if (*re == buffer_end) > > + { > > + memmove (buffer, *cp, *re - *cp); > > + *re = buffer + (*re - *cp); > > + *cp = buffer; > > + > > + ssize_t n = __read_nocancel (fd, *re, buffer_end - *re); > > + if (n < 0) > > + return NULL; > > + > > + *re += n; > > + > > + nl = memchr (*cp, '\n', *re - *cp); > > + while (nl == NULL && *re == buffer_end) > > + { > > + /* Truncate too long lines. */ > > + *re = buffer + 3 * (buffer_end - buffer) / 4; > > + n = __read_nocancel (fd, *re, buffer_end - *re); > > + if (n < 0) > > + return NULL; > > + > > + nl = memchr (*re, '\n', n); > > + **re = '\n'; > > + *re += n; > > + } > > + } > > + else > > + nl = memchr (*cp, '\n', *re - *cp); > > + > > + res = *cp; > > + } > > + > > + if (nl == NULL) > > + nl = *re - 1; > > + } > > + > > + *cp = nl + 1; > > + assert (*cp <= *re); > > + > > + return res == *re ? NULL : res; > > +} > > This looks like fragile duplication of stdio-like buffering logic > that's not at all specific to this file. Does glibc have a policy on > whether things needing this should use stdio or some other shared code > rather than open-coding it like this? This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c. Should it be exported in GLIBC_PRIVATE name space? > > [...] > > > +/* Allocate a NUMA spinlock and return a pointer to it. Caller should > > + call numa_spinlock_free on the NUMA spinlock pointer to free the > > + memory when it is no longer needed. */ > > + > > +struct numa_spinlock * > > +numa_spinlock_alloc (void) > > +{ > > + const size_t buffer_size = 1024; > > + char buffer[buffer_size]; > > + char *buffer_end = buffer + buffer_size; > > + char *cp = buffer_end; > > + char *re = buffer_end; > > + > > + const int flags = O_RDONLY | O_CLOEXEC; > > + int fd = __open_nocancel ("/sys/devices/system/node/online", flags); > > + char *l; > > + unsigned int max_node = 0; > > + unsigned int node_count = 0; > > + if (fd != -1) > > + { > > + l = next_line (fd, buffer, &cp, &re, buffer_end); > > + if (l != NULL) > > + do > > + { > > + char *endp; > > + unsigned long int n = strtoul (l, &endp, 10); > > + if (l == endp) > > + { > > + node_count = 1; > > + break; > > + } > > + > > + unsigned long int m = n; > > + if (*endp == '-') > > + { > > + l = endp + 1; > > + m = strtoul (l, &endp, 10); > > + if (l == endp) > > + { > > + node_count = 1; > > + break; > > + } > > + } > > + > > + node_count += m - n + 1; > > + > > + if (m >= max_node) > > + max_node = m; > > + > > + l = endp; > > + while (l < re && isspace (*l)) > > + ++l; > > + } > > + while (l < re); > > + > > + __close_nocancel_nostatus (fd); > > + } > > + > > + /* NB: Some NUMA nodes may not be available, if the number of NUMA > > + nodes is 1, set the maximium NUMA node number to 0. */ > > + if (node_count == 1) > > + max_node = 0; > > + > > + unsigned int max_cpu = 0; > > + unsigned int *physical_package_id_p = NULL; > > + > > + if (max_node == 0) > > + { > > + /* There is at least 1 node. */ > > + node_count = 1; > > + > > + /* If NUMA is disabled, use physical_package_id instead. */ > > + struct dirent **cpu_list; > > + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list, > > + select_cpu, NULL); > > + if (nprocs > 0) > > + { > > + int i; > > + unsigned int *cpu_id_p = NULL; > > + > > + /* Find the maximum CPU number. */ > > + if (posix_memalign ((void **) &cpu_id_p, > > + __alignof__ (void *), > > + nprocs * sizeof (unsigned int)) == 0) > > Using posix_memalign to get memory with the alignment of > __alignof__(void*) makes no sense. All allocations via malloc are > suitably aligned for any standard type. Does glibc prefer malloc over posix_memalign? > > + { > > + for (i = 0; i < nprocs; i++) > > + { > > + unsigned int cpu_id > > + = strtoul (cpu_list[i]->d_name + 3, NULL, 10); > > + cpu_id_p[i] = cpu_id; > > + if (cpu_id > max_cpu) > > + max_cpu = cpu_id; > > + } > > + > > + if (posix_memalign ((void **) &physical_package_id_p, > > + __alignof__ (void *), > > + ((max_cpu + 1) > > + * sizeof (unsigned int))) == 0) > > Again. > > > + { > > + memset (physical_package_id_p, 0, > > + ((max_cpu + 1) * sizeof (unsigned int))); > > + > > + max_node = UINT_MAX; > > + > > + /* Get physical_package_id. */ > > + char path[(sizeof ("/sys/devices/system/cpu") > > + + 3 * sizeof (unsigned long int) > > + + sizeof ("/topology/physical_package_id"))]; > > + for (i = 0; i < nprocs; i++) > > + { > > + struct dirent *d = cpu_list[i]; > > + if (snprintf (path, sizeof (path), > > + "/sys/devices/system/cpu/%s/topology/physical_package_id", > > + d->d_name) > 0) > > Are these sysfs pathnames documented as stable/permanent by the > kernel? I believe so. > > + { > > + fd = __open_nocancel (path, flags); > > + if (fd != -1) > > + { > > + if (__read_nocancel (fd, buffer, > > + buffer_size) > 0) > > + { > > + char *endp; > > + unsigned long int package_id > > + = strtoul (buffer, &endp, 10); > > + if (package_id != ULONG_MAX > > + && *buffer != '\0' > > + && (*endp == '\0' || *endp == '\n')) > > + { > > + physical_package_id_p[cpu_id_p[i]] > > + = package_id; > > + if (max_node == UINT_MAX) > > + { > > + /* This is the first node. */ > > + max_node = package_id; > > + } > > + else if (package_id != max_node) > > + { > > + /* NB: We only need to know if > > + NODE_COUNT > 1. */ > > + node_count = 2; > > + if (package_id > max_node) > > + max_node = package_id; > > + } > > + } > > + } > > + __close_nocancel_nostatus (fd); > > + } > > + } > > + > > + free (d); > > + } > > + } > > + > > + free (cpu_id_p); > > + } > > + else > > + { > > + for (i = 0; i < nprocs; i++) > > + free (cpu_list[i]); > > + } > > + > > + free (cpu_list); > > + } > > + } > > + > > + if (physical_package_id_p != NULL && node_count == 1) > > + { > > + /* There is only one node. No need for physical_package_id_p. */ > > + free (physical_package_id_p); > > + physical_package_id_p = NULL; > > + max_cpu = 0; > > + } > > + > > + /* Allocate an array of struct numa_spinlock_info pointers to hold info > > + for all NUMA nodes with NUMA node number from getcpu () as index. */ > > + size_t size = (sizeof (struct numa_spinlock) > > + + ((max_node + 1) > > + * sizeof (struct numa_spinlock_info *))); > > + struct numa_spinlock *lock; > > + if (posix_memalign ((void **) &lock, > > + __alignof__ (struct numa_spinlock_info *), size)) > > Another gratuitous posix_memalign.
On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote: > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote: > > > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote: > > > From: "ling.ma" <ling.ml@antfin.com> > > > > > > On multi-socket systems, memory is shared across the entire system. > > > Data access to the local socket is much faster than the remote socket > > > and data access to the local core is faster than sibling cores on the > > > same socket. For serialized workloads with conventional spinlock, > > > when there is high spinlock contention between threads, lock ping-pong > > > among sockets becomes the bottleneck and threads spend majority of > > > their time in spinlock overhead. > > > > > > On multi-socket systems, the keys to our NUMA spinlock performance > > > are to minimize cross-socket traffic as well as localize the serialized > > > workload to one core for execution. The basic principles of NUMA > > > spinlock are mainly consisted of following approaches, which reduce > > > data movement and accelerate critical section, eventually give us > > > significant performance improvement. > > > > I question whether this belongs in glibc. It seems highly application- > > and kernel-specific. Is there a reason you wouldn't prefer to > > implement and maintain it in a library for use in the kind of > > application that needs it? > > This is a good question. On the other hand, the current spinlock > in glibc hasn't been changed for many years. It doesn't scale for > today's hardware. Having a scalable spinlock in glibc is desirable. "Scalable spinlock" is something of an oxymoron. Spinlocks are for situations where contention is extremely rare, since they inherently blow up badly under contention. If this is happening it means you wanted a mutex not a spinlock. > > This looks like fragile duplication of stdio-like buffering logic > > that's not at all specific to this file. Does glibc have a policy on > > whether things needing this should use stdio or some other shared code > > rather than open-coding it like this? > > This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c. Should it > be exported in GLIBC_PRIVATE name space? Possibly. I don't have a strong opinion here. At least it's good to know it's copied from code elsewhere that's believed to be correct/safe. > > > > [...] > > > > > +/* Allocate a NUMA spinlock and return a pointer to it. Caller should > > > + call numa_spinlock_free on the NUMA spinlock pointer to free the > > > + memory when it is no longer needed. */ > > > + > > > +struct numa_spinlock * > > > +numa_spinlock_alloc (void) > > > +{ > > > + const size_t buffer_size = 1024; > > > + char buffer[buffer_size]; > > > + char *buffer_end = buffer + buffer_size; > > > + char *cp = buffer_end; > > > + char *re = buffer_end; > > > + > > > + const int flags = O_RDONLY | O_CLOEXEC; > > > + int fd = __open_nocancel ("/sys/devices/system/node/online", flags); > > > + char *l; > > > + unsigned int max_node = 0; > > > + unsigned int node_count = 0; > > > + if (fd != -1) > > > + { > > > + l = next_line (fd, buffer, &cp, &re, buffer_end); > > > + if (l != NULL) > > > + do > > > + { > > > + char *endp; > > > + unsigned long int n = strtoul (l, &endp, 10); > > > + if (l == endp) > > > + { > > > + node_count = 1; > > > + break; > > > + } > > > + > > > + unsigned long int m = n; > > > + if (*endp == '-') > > > + { > > > + l = endp + 1; > > > + m = strtoul (l, &endp, 10); > > > + if (l == endp) > > > + { > > > + node_count = 1; > > > + break; > > > + } > > > + } > > > + > > > + node_count += m - n + 1; > > > + > > > + if (m >= max_node) > > > + max_node = m; > > > + > > > + l = endp; > > > + while (l < re && isspace (*l)) > > > + ++l; > > > + } > > > + while (l < re); > > > + > > > + __close_nocancel_nostatus (fd); > > > + } > > > + > > > + /* NB: Some NUMA nodes may not be available, if the number of NUMA > > > + nodes is 1, set the maximium NUMA node number to 0. */ > > > + if (node_count == 1) > > > + max_node = 0; > > > + > > > + unsigned int max_cpu = 0; > > > + unsigned int *physical_package_id_p = NULL; > > > + > > > + if (max_node == 0) > > > + { > > > + /* There is at least 1 node. */ > > > + node_count = 1; > > > + > > > + /* If NUMA is disabled, use physical_package_id instead. */ > > > + struct dirent **cpu_list; > > > + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list, > > > + select_cpu, NULL); > > > + if (nprocs > 0) > > > + { > > > + int i; > > > + unsigned int *cpu_id_p = NULL; > > > + > > > + /* Find the maximum CPU number. */ > > > + if (posix_memalign ((void **) &cpu_id_p, > > > + __alignof__ (void *), > > > + nprocs * sizeof (unsigned int)) == 0) > > > > Using posix_memalign to get memory with the alignment of > > __alignof__(void*) makes no sense. All allocations via malloc are > > suitably aligned for any standard type. > > Does glibc prefer malloc over posix_memalign? If not, I think it *should*. Use of posix_memalign suggests to the reader that some nonstandard alignment is needed, prompting one to ask "what?" only to figure out it was gratuitous. What's worse -- and I missed this on the first pass -- posix_memalign's signature is broken by design and almost universally gets the programmer to invoke undefined behavior, which has happened here. cpu_id_p has type unsigned int *, but due to the cast, posix_memalign accesses it as if it had type void *, which is an aliasing violation. For this reason alone, I would go so far as to say the function should never be used unless there's no way to avoid it, and even then it should be wrapped with a function that *returns* the pointer as void* so that this error can't be made. > > > + if (snprintf (path, sizeof (path), > > > + "/sys/devices/system/cpu/%s/topology/physical_package_id", > > > + d->d_name) > 0) > > > > Are these sysfs pathnames documented as stable/permanent by the > > kernel? > > I believe so. OK. It is worth checking though since otherwise you have an interface that will suddenly break when things change on the kernel side. Rich
On Thu, Jan 3, 2019 at 1:21 PM Rich Felker <dalias@libc.org> wrote: > > On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote: > > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote: > > > > > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote: > > > > From: "ling.ma" <ling.ml@antfin.com> > > > > > > > > On multi-socket systems, memory is shared across the entire system. > > > > Data access to the local socket is much faster than the remote socket > > > > and data access to the local core is faster than sibling cores on the > > > > same socket. For serialized workloads with conventional spinlock, > > > > when there is high spinlock contention between threads, lock ping-pong > > > > among sockets becomes the bottleneck and threads spend majority of > > > > their time in spinlock overhead. > > > > > > > > On multi-socket systems, the keys to our NUMA spinlock performance > > > > are to minimize cross-socket traffic as well as localize the serialized > > > > workload to one core for execution. The basic principles of NUMA > > > > spinlock are mainly consisted of following approaches, which reduce > > > > data movement and accelerate critical section, eventually give us > > > > significant performance improvement. > > > > > > I question whether this belongs in glibc. It seems highly application- > > > and kernel-specific. Is there a reason you wouldn't prefer to > > > implement and maintain it in a library for use in the kind of > > > application that needs it? > > > > This is a good question. On the other hand, the current spinlock > > in glibc hasn't been changed for many years. It doesn't scale for > > today's hardware. Having a scalable spinlock in glibc is desirable. > > "Scalable spinlock" is something of an oxymoron. Spinlocks are for > situations where contention is extremely rare, since they inherently > blow up badly under contention. If this is happening it means you > wanted a mutex not a spinlock. The critical region serialized by spinlock is very small. Overhead of mutex is much bigger than the critical region itself. > > > This looks like fragile duplication of stdio-like buffering logic > > > that's not at all specific to this file. Does glibc have a policy on > > > whether things needing this should use stdio or some other shared code > > > rather than open-coding it like this? > > > > This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c. Should it > > be exported in GLIBC_PRIVATE name space? > > Possibly. I don't have a strong opinion here. At least it's good to > know it's copied from code elsewhere that's believed to be > correct/safe. > > > > > > > [...] > > > > > > > +/* Allocate a NUMA spinlock and return a pointer to it. Caller should > > > > + call numa_spinlock_free on the NUMA spinlock pointer to free the > > > > + memory when it is no longer needed. */ > > > > + > > > > +struct numa_spinlock * > > > > +numa_spinlock_alloc (void) > > > > +{ > > > > + const size_t buffer_size = 1024; > > > > + char buffer[buffer_size]; > > > > + char *buffer_end = buffer + buffer_size; > > > > + char *cp = buffer_end; > > > > + char *re = buffer_end; > > > > + > > > > + const int flags = O_RDONLY | O_CLOEXEC; > > > > + int fd = __open_nocancel ("/sys/devices/system/node/online", flags); > > > > + char *l; > > > > + unsigned int max_node = 0; > > > > + unsigned int node_count = 0; > > > > + if (fd != -1) > > > > + { > > > > + l = next_line (fd, buffer, &cp, &re, buffer_end); > > > > + if (l != NULL) > > > > + do > > > > + { > > > > + char *endp; > > > > + unsigned long int n = strtoul (l, &endp, 10); > > > > + if (l == endp) > > > > + { > > > > + node_count = 1; > > > > + break; > > > > + } > > > > + > > > > + unsigned long int m = n; > > > > + if (*endp == '-') > > > > + { > > > > + l = endp + 1; > > > > + m = strtoul (l, &endp, 10); > > > > + if (l == endp) > > > > + { > > > > + node_count = 1; > > > > + break; > > > > + } > > > > + } > > > > + > > > > + node_count += m - n + 1; > > > > + > > > > + if (m >= max_node) > > > > + max_node = m; > > > > + > > > > + l = endp; > > > > + while (l < re && isspace (*l)) > > > > + ++l; > > > > + } > > > > + while (l < re); > > > > + > > > > + __close_nocancel_nostatus (fd); > > > > + } > > > > + > > > > + /* NB: Some NUMA nodes may not be available, if the number of NUMA > > > > + nodes is 1, set the maximium NUMA node number to 0. */ > > > > + if (node_count == 1) > > > > + max_node = 0; > > > > + > > > > + unsigned int max_cpu = 0; > > > > + unsigned int *physical_package_id_p = NULL; > > > > + > > > > + if (max_node == 0) > > > > + { > > > > + /* There is at least 1 node. */ > > > > + node_count = 1; > > > > + > > > > + /* If NUMA is disabled, use physical_package_id instead. */ > > > > + struct dirent **cpu_list; > > > > + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list, > > > > + select_cpu, NULL); > > > > + if (nprocs > 0) > > > > + { > > > > + int i; > > > > + unsigned int *cpu_id_p = NULL; > > > > + > > > > + /* Find the maximum CPU number. */ > > > > + if (posix_memalign ((void **) &cpu_id_p, > > > > + __alignof__ (void *), > > > > + nprocs * sizeof (unsigned int)) == 0) > > > > > > Using posix_memalign to get memory with the alignment of > > > __alignof__(void*) makes no sense. All allocations via malloc are > > > suitably aligned for any standard type. > > > > Does glibc prefer malloc over posix_memalign? > > If not, I think it *should*. Use of posix_memalign suggests to the > reader that some nonstandard alignment is needed, prompting one to ask > "what?" only to figure out it was gratuitous. > > What's worse -- and I missed this on the first pass -- > posix_memalign's signature is broken by design and almost universally > gets the programmer to invoke undefined behavior, which has happened > here. cpu_id_p has type unsigned int *, but due to the cast, > posix_memalign accesses it as if it had type void *, which is an > aliasing violation. For this reason alone, I would go so far as to say > the function should never be used unless there's no way to avoid it, > and even then it should be wrapped with a function that *returns* the > pointer as void* so that this error can't be made. We will change it to malloc. > > > > + if (snprintf (path, sizeof (path), > > > > + "/sys/devices/system/cpu/%s/topology/physical_package_id", > > > > + d->d_name) > 0) > > > > > > Are these sysfs pathnames documented as stable/permanent by the > > > kernel? > > > > I believe so. > > OK. It is worth checking though since otherwise you have an interface > that will suddenly break when things change on the kernel side. > Documentation/cputopology.txt has =========================================== How CPU topology info is exported via sysfs =========================================== Export CPU topology info via sysfs. Items (attributes) are similar to /proc/cpuinfo output of some architectures: 1) /sys/devices/system/cpu/cpuX/topology/physical_package_id: physical package id of cpuX. Typically corresponds to a physical socket number, but the actual value is architecture and platform dependent. ....
在 2019/1/3 下午10:52,“Szabolcs Nagy”<Szabolcs.Nagy@arm.com> 写入: On 03/01/2019 05:35, 马凌(彦军) wrote: > create mode 100644 manual/examples/numa-spinlock.c > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c > create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h > create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c > create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c as far as i can tell the new code is generic (other than the presence of efficient getcpu), so i think the test should be generic too. > --- /dev/null > +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c > @@ -0,0 +1,384 @@ ... > +/* Check spinlock overhead with large number threads. Critical region is > + very smmall. Critical region + spinlock overhead aren't noticeable > + when number of threads is small. When thread number increases, > + spinlock overhead become the bottleneck. It shows up in wall time > + of thread execution. */ yeah, this is not easy to do in a generic way, i think even on x86 such measurement is problematic, you don't know what goes on a system (or vm) when the glibc test is running. but doing precise timing is not that important for checking the correctness of the locks, so i think a simplified version can be generic test code. Ling: It is good idea, we will try to send generic test case in next version. Thanks Ling
On 1/3/19 2:58 PM, H.J. Lu wrote: > +libpthread { > + GLIBC_2.29 { > + numa_spinlock_alloc; > + numa_spinlock_free; > + numa_spinlock_init; > + numa_spinlock_apply; > + } > +} Why are we adding these non-standard interfaces to glibc? The API implementation doesn't rely on any special glibc internals. It could be implemented as a distinct library, allowed to evolve quickly in response to customer need, and eventually integrated into glibc if the API proves stable. A similar model has been setup by Boost and C++ just to draw some parallels. I'm not happy to see new APIs like this go directly into glibc without much more discussion about *why* they have to be in glibc initially. Just to be clear I have a sustained objection to this new set of APIs being added to glibc until I can be convinced that they have to go in glibc.
On Sat, Jan 5, 2019 at 4:34 AM Carlos O'Donell <carlos@redhat.com> wrote: > > On 1/3/19 2:58 PM, H.J. Lu wrote: > > +libpthread { > > + GLIBC_2.29 { > > + numa_spinlock_alloc; > > + numa_spinlock_free; > > + numa_spinlock_init; > > + numa_spinlock_apply; > > + } > > +} > > Why are we adding these non-standard interfaces to glibc? > > The API implementation doesn't rely on any special glibc internals. > > It could be implemented as a distinct library, allowed to evolve quickly > in response to customer need, and eventually integrated into glibc if the > API proves stable. A similar model has been setup by Boost and C++ just to > draw some parallels. > > I'm not happy to see new APIs like this go directly into glibc without > much more discussion about *why* they have to be in glibc initially. > > Just to be clear I have a sustained objection to this new set of APIs > being added to glibc until I can be convinced that they have to go in > glibc. > Should glibc have scalable spinlock, in libc.so or a separate shared object? Or should we tell people that if they want scalable spinlock, they look elsewhere?
* H. J. Lu: > Should glibc have scalable spinlock, in libc.so or a separate shared object? > Or should we tell people that if they want scalable spinlock, they look > elsewhere? I think non-polymorphic, small lock types with scoped locking could make sense for glibc. A lock specific to a certain machine architecture seems strange. We currently lack any of the kernel NUMA interfaces in glibc, which makes this stand out even more. Thanks, Florian
On Mon, Jan 7, 2019 at 11:12 AM Florian Weimer <fweimer@redhat.com> wrote: > > * H. J. Lu: > > > Should glibc have scalable spinlock, in libc.so or a separate shared object? > > Or should we tell people that if they want scalable spinlock, they look > > elsewhere? > > I think non-polymorphic, small lock types with scoped locking could make > sense for glibc. > > A lock specific to a certain machine architecture seems strange. We > currently lack any of the kernel NUMA interfaces in glibc, which makes > this stand out even more. > And this lack of support doesn't make problems to go away.
On 1/7/19 2:48 PM, H.J. Lu wrote: > On Mon, Jan 7, 2019 at 11:12 AM Florian Weimer <fweimer@redhat.com> wrote: >> >> * H. J. Lu: >> >>> Should glibc have scalable spinlock, in libc.so or a separate shared object? >>> Or should we tell people that if they want scalable spinlock, they look >>> elsewhere? >> >> I think non-polymorphic, small lock types with scoped locking could make >> sense for glibc. >> >> A lock specific to a certain machine architecture seems strange. We >> currently lack any of the kernel NUMA interfaces in glibc, which makes >> this stand out even more. >> > > And this lack of support doesn't make problems to go away. No. But it is a strong indicator that the solution space hasn't been explored thoroughly enough for us to provide a long-term stable interface that will remain useful. My opinion is that for the health and evolution of a NUMA-aware spinlock and MCS lock, that we should create a distinct project and library that should have those locks, and then work to put them into downstream distributions. This will support key users being able to use supported versions of those libraries, and give the needed feedback about the API and the performance. It may take 1-2 years to get that feedback and every piece of feedback will improve the final API/ABI we put into glibc or even into the next ISO C standard as pat of the C thread interface. My objection to the NUMA-aware spinlock API is because I feel we are doing a disservice to the work by formalizing it and freezing it as part of the ABI/API that glibc is using. In fact this NUMA-aware discussion touches on a deeply complex issue, which is: How do we create/design, and evolve interfaces that we want to one-day have in stable glibc? But this is a discussion for another thread. Roland once said he wished we had put every function into it's own library ;-) Does this explain in more detail why I don't think it's a good idea to put these interfaces into glibc?
* Carlos O'Donell: > My opinion is that for the health and evolution of a NUMA-aware spinlock > and MCS lock, that we should create a distinct project and library that > should have those locks, and then work to put them into downstream > distributions. This will support key users being able to use supported > versions of those libraries, and give the needed feedback about the API > and the performance. It may take 1-2 years to get that feedback and every > piece of feedback will improve the final API/ABI we put into glibc or > even into the next ISO C standard as pat of the C thread interface. I think it's something taht could land in tbb, for which many distributions already have mechanisms to ship updated versions after a release. Thanks, Florian
On 1/10/19 11:32 AM, Florian Weimer wrote: > * Carlos O'Donell: > >> My opinion is that for the health and evolution of a NUMA-aware spinlock >> and MCS lock, that we should create a distinct project and library that >> should have those locks, and then work to put them into downstream >> distributions. This will support key users being able to use supported >> versions of those libraries, and give the needed feedback about the API >> and the performance. It may take 1-2 years to get that feedback and every >> piece of feedback will improve the final API/ABI we put into glibc or >> even into the next ISO C standard as pat of the C thread interface. > > I think it's something taht could land in tbb, for which many > distributions already have mechanisms to ship updated versions after a > release. Absolutely. That's a great idea.
On 10/01/2019 16:41, Carlos O'Donell wrote: > On 1/10/19 11:32 AM, Florian Weimer wrote: >> * Carlos O'Donell: >> >>> My opinion is that for the health and evolution of a NUMA-aware spinlock >>> and MCS lock, that we should create a distinct project and library that >>> should have those locks, and then work to put them into downstream >>> distributions. This will support key users being able to use supported >>> versions of those libraries, and give the needed feedback about the API >>> and the performance. It may take 1-2 years to get that feedback and every >>> piece of feedback will improve the final API/ABI we put into glibc or >>> even into the next ISO C standard as pat of the C thread interface. >> >> I think it's something taht could land in tbb, for which many >> distributions already have mechanisms to ship updated versions after a >> release. > > Absolutely. That's a great idea. > in principle the pthread_spin_lock api can use this algorithm assuming we can keep the pthread_spinlock_t abi and keep the POSIX semantics. (presumably users ran into issues with the existing posix api.. or how did this come up in the first place?)
On 1/10/19 12:52 PM, Szabolcs Nagy wrote: > On 10/01/2019 16:41, Carlos O'Donell wrote: >> On 1/10/19 11:32 AM, Florian Weimer wrote: >>> * Carlos O'Donell: >>> >>>> My opinion is that for the health and evolution of a NUMA-aware spinlock >>>> and MCS lock, that we should create a distinct project and library that >>>> should have those locks, and then work to put them into downstream >>>> distributions. This will support key users being able to use supported >>>> versions of those libraries, and give the needed feedback about the API >>>> and the performance. It may take 1-2 years to get that feedback and every >>>> piece of feedback will improve the final API/ABI we put into glibc or >>>> even into the next ISO C standard as pat of the C thread interface. >>> >>> I think it's something taht could land in tbb, for which many >>> distributions already have mechanisms to ship updated versions after a >>> release. >> >> Absolutely. That's a great idea. >> > > in principle the pthread_spin_lock api can use this algorithm > assuming we can keep the pthread_spinlock_t abi and keep the > POSIX semantics. (presumably users ran into issues with the > existing posix api.. or how did this come up in the first place?) Correct, but meeting the ABI contract of the pthread_spinlck_t turns out to be hard, there isn't much space. I've spoken with Kemi Wang (Intel) about this specific issue, and he has some ideas to share, but I'll leave it for him to describe.
On 2019/1/11 上午3:24, Carlos O'Donell wrote: > On 1/10/19 12:52 PM, Szabolcs Nagy wrote: >> On 10/01/2019 16:41, Carlos O'Donell wrote: >>> On 1/10/19 11:32 AM, Florian Weimer wrote: >>>> * Carlos O'Donell: >>>> >>>>> My opinion is that for the health and evolution of a NUMA-aware spinlock >>>>> and MCS lock, that we should create a distinct project and library that >>>>> should have those locks, and then work to put them into downstream >>>>> distributions. This will support key users being able to use supported >>>>> versions of those libraries, and give the needed feedback about the API >>>>> and the performance. It may take 1-2 years to get that feedback and every >>>>> piece of feedback will improve the final API/ABI we put into glibc or >>>>> even into the next ISO C standard as pat of the C thread interface. >>>> >>>> I think it's something taht could land in tbb, for which many >>>> distributions already have mechanisms to ship updated versions after a >>>> release. >>> >>> Absolutely. That's a great idea. >>> >> >> in principle the pthread_spin_lock api can use this algorithm >> assuming we can keep the pthread_spinlock_t abi and keep the >> POSIX semantics. (presumably users ran into issues with the >> existing posix api.. or how did this come up in the first place?) > > Correct, but meeting the ABI contract of the pthread_spinlck_t turns > out to be hard, there isn't much space. I've spoken with Kemi Wang > (Intel) about this specific issue, and he has some ideas to share, > but I'll leave it for him to describe. > It may be possible because we can make better use of size of pthread_spinlock_t. MCS lock is a well known method to reduce spinlock overhead by queuing spinner, the spinlock cache line is only contended between spinlock holder and a active spinner, other spinners are spinning on local-accessible flag until the previous spinner pass mcs lock holder down. Usually, a classical MCS implementation requires an extra pointer *mcs_lock* to track the tail of queue. When a new spinner is adding into the queue, we first get the current tail of queue, and move the mcs_lock pointer to point to this new spinner(a new tail of queue). If we can squeeze some space in pthread_spinlock_t to store this tail info, and update this tail info when a new spinner is added into the queue, then the MCS algorithm can be reimplemented without breaking ABI. That's possible because *lock* itself don't have to occupy 32 bits (8 bits or even one bit should be enough). Then the pthread_spinlock_t structure may be like this(Similar to qspinlock in kernel): struct pthread_spinlock_t { union { struct { u8 locked; // lock byte u8 reserve; u16 cpuid; // CPU id used by the last spinner, and using per-cpu infrastructure to convert it a pointer which points to the tail of queue. E.g per_cpu_var(qnode, cpuid) } int lock; } } PER-CPU struct qnode { struct qnode *next; // point to next spinner int flag; // local spinning flag } But they are two problems here. a) Lack of per-cpu infrastructure support in Glibc, so we can't do this cpuid->per-cpu-variable transition b) Can't disable preemption at userland. When a new spinner is adding to the queue, we need update the cpuid of pthread_spinlock_t to a new one. Pseudo-code: newid = get_current_cpuid(); prev = atomic_exchange_acquire(&cpuid, newid); // update cpuid to the new cpuid, and return back the previous one tail_node = per_cpu_var(qnode, prev); //get the last tail node of queue There is a problem when preemption happens at a time window between get_current_cpuid() and atomic_exchange_acquire(). When the thread is rescheduled back, it maybe on another cpu with different cpuid. ===============================CUT HERE================================== Another way is to store thread-specific info(e.g. tid) in pthread_spinlock_t instead of cpuid, then, we can avoid the issue b), but it seems that we break the semantic of TLS? Comments?
On Thu, Jan 10, 2019 at 8:32 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Carlos O'Donell: > > > My opinion is that for the health and evolution of a NUMA-aware spinlock > > and MCS lock, that we should create a distinct project and library that > > should have those locks, and then work to put them into downstream > > distributions. This will support key users being able to use supported > > versions of those libraries, and give the needed feedback about the API > > and the performance. It may take 1-2 years to get that feedback and every > > piece of feedback will improve the final API/ABI we put into glibc or > > even into the next ISO C standard as pat of the C thread interface. > > I think it's something taht could land in tbb, for which many > distributions already have mechanisms to ship updated versions after a > release. We will work on a standalone library: https://gitlab.com/numa-spinlock/numa-spinlock The implementation is done. But many pieces are missing: 1. Documentation. 2. Make tests portable for non-x86 platforms. 3. Do we need symbol versioning? 4. ... We are looking for contributors. Thanks.
On Thu, 2019-01-03 at 12:54 -0800, H.J. Lu wrote: > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote: > > > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote: > > > From: "ling.ma" <ling.ml@antfin.com> > > > > > > On multi-socket systems, memory is shared across the entire system. > > > Data access to the local socket is much faster than the remote socket > > > and data access to the local core is faster than sibling cores on the > > > same socket. For serialized workloads with conventional spinlock, > > > when there is high spinlock contention between threads, lock ping-pong > > > among sockets becomes the bottleneck and threads spend majority of > > > their time in spinlock overhead. > > > > > > On multi-socket systems, the keys to our NUMA spinlock performance > > > are to minimize cross-socket traffic as well as localize the serialized > > > workload to one core for execution. The basic principles of NUMA > > > spinlock are mainly consisted of following approaches, which reduce > > > data movement and accelerate critical section, eventually give us > > > significant performance improvement. > > > > I question whether this belongs in glibc. It seems highly application- > > and kernel-specific. Is there a reason you wouldn't prefer to > > implement and maintain it in a library for use in the kind of > > application that needs it? > > This is a good question. On the other hand, the current spinlock > in glibc hasn't been changed for many years. It doesn't scale for > today's hardware. Having a scalable spinlock in glibc is desirable. I agree the spinlocks need to improve, but let's do first things first: add proper back-off. The biggest problem there is find a way to select and maintain the tuning values that is simple for the glibc developers; there should be good benchmarks that can be used to automatically check that the tuning values make sense, and adaptation at runtime would be even better (if it can be shown to improve performance). There are also several other synchronization algorithms in glibc where proper (limited) spinning and back-off would help. Look for comments such as "TODO Back-off." throughout the code, in particular the synchronization code I have rewritten in the past. And this applies to the normal mutexes too, obviously.
On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote: > On 1/10/19 11:32 AM, Florian Weimer wrote: > > * Carlos O'Donell: > > > > > My opinion is that for the health and evolution of a NUMA-aware spinlock > > > and MCS lock, that we should create a distinct project and library that > > > should have those locks, and then work to put them into downstream > > > distributions. This will support key users being able to use supported > > > versions of those libraries, and give the needed feedback about the API > > > and the performance. It may take 1-2 years to get that feedback and every > > > piece of feedback will improve the final API/ABI we put into glibc or > > > even into the next ISO C standard as pat of the C thread interface. > > > > I think it's something taht could land in tbb, for which many > > distributions already have mechanisms to ship updated versions after a > > release. > > Absolutely. That's a great idea. > I don't think tbb is a useful vehicle. It would require that many applications use the tbb mutexes, which I doubt is the case.
On Sat, 2019-01-05 at 07:34 -0500, Carlos O'Donell wrote: > On 1/3/19 2:58 PM, H.J. Lu wrote: > > +libpthread { > > + GLIBC_2.29 { > > + numa_spinlock_alloc; > > + numa_spinlock_free; > > + numa_spinlock_init; > > + numa_spinlock_apply; > > + } > > +} > > Why are we adding these non-standard interfaces to glibc? I also think that there shouldn't be a new API added for these. There would be the option of adding a new pthread_mutex_t type, and thus even get some more space for the implementation, but I wouldn't like that either. What I think we should be doing instead: (1) Finally add proper spinning and back-off throughout our synchronization abstractions (spinlocks, mutexes, etc.; see my other comment on this thread). This should improve performance significantly, either through less contention during spinning, or through doing more spinning and thus less trips through futexes and the kernel. (2) Change the synchronization abstractions (especially mutexes) so that they can efficiently run different code paths when they are of non-process- shared type, and then start with doing things like MCS in the non-process- shared mutexes. For both of these, add proper benchmarks so that tuning decisions can be checked automatically and tested for regressions. IOW, we want tests for whether the tuning decisions are still the correct ones (in the future, or on future HW). The main reason why I think that's a better approach is that the biggest return on the investments by the glibc community is improving performance for as many users as possible, even if it may not be ideal performance for those unchanged programs. Adding a special API+implementation instead might enable somewhat larger performance improvements, but it requires programs to change and programmers to be aware of it, so will likely remain a niche case in practice. > It could be implemented as a distinct library, allowed to evolve quickly > in response to customer need, and eventually integrated into glibc if the > API proves stable. A similar model has been setup by Boost and C++ just to > draw some parallels. I think first of all, we should try hard to get as much as performance out of the interfaces, POSIX semantics constraints, and ABI constraints we have today. Then maybe change the ABI if it really unlocks further benefits. If that's not sufficient to get decent performance, then I think the next venue to look at is C++. That is, if implementing these ideas to the full extent is not possible in glibc, go to libstdc++ instead and see what's possible there. C++'s synchronization constructs have saner semantics than POSIX's, and ABI breaks in the future are more likely for C++ than for glibc. If the current C++ synchronization constructs have semantics that inhibit performance, then ISO C++ Study Group 1 will likely want to hear about it. And they are a much better group to discuss this than the glibc community is, simply because they are focused on just parallelism and concurrency.
On Thu, 2019-01-03 at 16:21 -0500, Rich Felker wrote: > On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote: > > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote: > > > > > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote: > > > > From: "ling.ma" <ling.ml@antfin.com> > > > > > > > > On multi-socket systems, memory is shared across the entire system. > > > > Data access to the local socket is much faster than the remote socket > > > > and data access to the local core is faster than sibling cores on the > > > > same socket. For serialized workloads with conventional spinlock, > > > > when there is high spinlock contention between threads, lock ping-pong > > > > among sockets becomes the bottleneck and threads spend majority of > > > > their time in spinlock overhead. > > > > > > > > On multi-socket systems, the keys to our NUMA spinlock performance > > > > are to minimize cross-socket traffic as well as localize the serialized > > > > workload to one core for execution. The basic principles of NUMA > > > > spinlock are mainly consisted of following approaches, which reduce > > > > data movement and accelerate critical section, eventually give us > > > > significant performance improvement. > > > > > > I question whether this belongs in glibc. It seems highly application- > > > and kernel-specific. Is there a reason you wouldn't prefer to > > > implement and maintain it in a library for use in the kind of > > > application that needs it? > > > > This is a good question. On the other hand, the current spinlock > > in glibc hasn't been changed for many years. It doesn't scale for > > today's hardware. Having a scalable spinlock in glibc is desirable. > > "Scalable spinlock" is something of an oxymoron. No, that's not true at all. Most high-performance shared-memory synchronization constructs (on typical HW we have today) will do some kind of spinning (and back-off), and there's nothing wrong about it. This can scale very well. > Spinlocks are for > situations where contention is extremely rare, No, the question is rather whether the program needs blocking through the OS (for performance, or for semantics such as PI) or not. Energy may be another factor. For example, glibc's current mutexes don't scale well on short critical because there's not enough spinning being done. In particular, in cases where there aren't more threads than cores (ie, what lots of high-performance parallel applications will ensure), it's better to just spin (and back off) than to eagerly block using the OS. > since they inherently > blow up badly under contention. Did I mention back-off before? ;) > If this is happening it means you > wanted a mutex not a spinlock. It doesn't make that much sense to have different interfaces for those, if it weren't for PThreads mutexes being dynamically typed, unfortunately. I believe that a high-performance default lock (C++11 or C11 semantics, non-process-shared) would beat both our spinlocks and mutex implementations that we have today. We could tune the C11 mutex implementation, but the number of users would still be small in the foreseeable future, I guess. Tuning libstdc++'s mutex implementation (so that it doesn't just use nptl mutexes but does something that's closer to the state of the art) would reach more users.
On Wed, 2018-12-26 at 10:50 +0800, Ma Ling wrote: > 2. Critical Section Integration (CSI) > Essentially spinlock is similar to that one core complete critical > sections one by one. So when contention happen, the serialized works > are sent to the core who is the lock owner and responsible to execute > them, that can save much time and power, because all shared data are > located in private cache of the lock owner. I agree that this can improve performance because of potentially both increasing data locality for the critical sections themselves and decreasing contention in the lock. However, this will mess with thread- local storage and assumptions about what OS thread a critical section runs on. Maybe it's better to first experiment with this change in semantics in C++; ISO C++ Study Group 1 on parallelism and concurrency is much deeper into this topic than the glibc community is. This isn't really a typical lock anymore when you do that, but rather a special kind of execution service for small functions; the study group has talked about executors that execute in guaranteed sequential fashion.
>> "Scalable spinlock" is something of an oxymoron. > > No, that's not true at all. Most high-performance shared-memory > synchronization constructs (on typical HW we have today) will do some kind > of spinning (and back-off), and there's nothing wrong about it. This can > scale very well. > >> Spinlocks are for >> situations where contention is extremely rare, > > No, the question is rather whether the program needs blocking through the > OS (for performance, or for semantics such as PI) or not. Energy may be > another factor. For example, glibc's current mutexes don't scale well on > short critical because there's not enough spinning being done. > yes. That's why we need pthread.mutex.spin_count tunable interface before. But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting algorithm of current adaptive mutex is the major negative factor which degrades mutex performance. That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive mutex. https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html Also, if with very small critical section in the worklad, this new type of mutex with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs much better than original spinlock. So, in some day, if adaptive mutex is tuned good enough, it should act like mcs-spinlock (or NUMA spinlock) if workload has small critical section, and performs like normal mutex if the critical section is too big to spinning-wait.
On 2018/12/26 上午10:50, Ma Ling wrote: > From: "ling.ma" <ling.ml@antfin.com> > > On multi-socket systems, memory is shared across the entire system. > Data access to the local socket is much faster than the remote socket > and data access to the local core is faster than sibling cores on the > same socket. For serialized workloads with conventional spinlock, > when there is high spinlock contention between threads, lock ping-pong > among sockets becomes the bottleneck and threads spend majority of > their time in spinlock overhead. > > On multi-socket systems, the keys to our NUMA spinlock performance > are to minimize cross-socket traffic as well as localize the serialized > workload to one core for execution. The basic principles of NUMA > spinlock are mainly consisted of following approaches, which reduce > data movement and accelerate critical section, eventually give us > significant performance improvement. > > 1. MCS spinlock > MCS spinlock help us to reduce the useless lock movement in the > spinning state. This paper provides a good description for this > kind of lock: That's not the truth. No matter generic spinlock(or x86 version) spinlock has used the way of test and test_and_set to reduce the useless lock movement in the spinning state. See glibc/nptl/pthread_spin_lock.c glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S What MCS-spinlock really helps is to accelerate lock release and lock acquisition by reducing lots of cache line bouncing. > NUMA spinlock can greatly speed up critical section on multi-socket > systems. It should improve spinlock performance on all multi-socket > systems. > This is out-of-question that NUMA spinlock helps a lot in case of heavy lock contention. But, we should also propose the data for non-contented case and slight contended case. It's expected that extra code complexity may degrade lock performance a bit for slight contended case, I would like to see the data for that. Also, the lock starvation would be possible if the running core is always busy with heavy lock contention. More explanation is expected.
> 1. MCS spinlock > MCS spinlock help us to reduce the useless lock movement in the > spinning state. This paper provides a good description for this > kind of lock: That's not the truth. No matter generic spinlock(or x86 version) spinlock has used the way of test and test_and_set to reduce the useless lock movement in the spinning state. See glibc/nptl/pthread_spin_lock.c glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S What MCS-spinlock really helps is to accelerate lock release and lock acquisition by reducing lots of cache line bouncing. Ling: Thanks for your comments. 1.in spinning stage current generic spinlock will read lock, then lock is stored in private cache among cores with shared state. when one core update the lock ,those cores containing the shared cache line will be noted ,invalidated lock then response to the cache controller the messages transmission spend much time, but they can be reduced by mcs lock 2. atomic_compare_exchange_weak_acquire cause plenty of useless lock movement because only one core can acquire lock mcs will transmit the lock one by one with private data, so we say it reduce useless lock movement from above scenario 1 and 2. > NUMA spinlock can greatly speed up critical section on multi-socket > systems. It should improve spinlock performance on all multi-socket > systems. > This is out-of-question that NUMA spinlock helps a lot in case of heavy lock contention. But, we should also propose the data for non-contented case and slight contended case. Ling: we will also present report hashwork like adaptive-mcs patch.
在 2019/1/15 上午7:26,“Torvald Riegel”<triegel@redhat.com> 写入: On Wed, 2018-12-26 at 10:50 +0800, Ma Ling wrote: > 2. Critical Section Integration (CSI) > Essentially spinlock is similar to that one core complete critical > sections one by one. So when contention happen, the serialized works > are sent to the core who is the lock owner and responsible to execute > them, that can save much time and power, because all shared data are > located in private cache of the lock owner. I agree that this can improve performance because of potentially both increasing data locality for the critical sections themselves and decreasing contention in the lock. However, this will mess with thread- local storage and assumptions about what OS thread a critical section runs on. Ling: yes, we have to consider it when applying numa spinlock. Maybe it's better to first experiment with this change in semantics in C++; ISO C++ Study Group 1 on parallelism and concurrency is much deeper into this topic than the glibc community is. This isn't really a typical lock anymore when you do that, but rather a special kind of execution service for small functions; the study group has talked about executors that execute in guaranteed sequential fashion. Ling: thanks for your suggestion, we would like to think about it seriously.
* Torvald Riegel: > On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote: >> On 1/10/19 11:32 AM, Florian Weimer wrote: >> > * Carlos O'Donell: >> > >> > > My opinion is that for the health and evolution of a NUMA-aware spinlock >> > > and MCS lock, that we should create a distinct project and library that >> > > should have those locks, and then work to put them into downstream >> > > distributions. This will support key users being able to use supported >> > > versions of those libraries, and give the needed feedback about the API >> > > and the performance. It may take 1-2 years to get that feedback and every >> > > piece of feedback will improve the final API/ABI we put into glibc or >> > > even into the next ISO C standard as pat of the C thread interface. >> > >> > I think it's something taht could land in tbb, for which many >> > distributions already have mechanisms to ship updated versions after a >> > release. >> >> Absolutely. That's a great idea. >> > > I don't think tbb is a useful vehicle. It would require that many > applications use the tbb mutexes, which I doubt is the case. That doesn't really matter because it's a new API anyway. Thanks, Florian
On Tue, 2019-01-15 at 10:32 +0100, Florian Weimer wrote: > * Torvald Riegel: > > > On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote: > > > On 1/10/19 11:32 AM, Florian Weimer wrote: > > > > * Carlos O'Donell: > > > > > > > > > My opinion is that for the health and evolution of a NUMA-aware spinlock > > > > > and MCS lock, that we should create a distinct project and library that > > > > > should have those locks, and then work to put them into downstream > > > > > distributions. This will support key users being able to use supported > > > > > versions of those libraries, and give the needed feedback about the API > > > > > and the performance. It may take 1-2 years to get that feedback and every > > > > > piece of feedback will improve the final API/ABI we put into glibc or > > > > > even into the next ISO C standard as pat of the C thread interface. > > > > > > > > I think it's something taht could land in tbb, for which many > > > > distributions already have mechanisms to ship updated versions after a > > > > release. > > > > > > Absolutely. That's a great idea. > > > > > > > I don't think tbb is a useful vehicle. It would require that many > > applications use the tbb mutexes, which I doubt is the case. > > That doesn't really matter because it's a new API anyway. What I mean is that applications would have to want to use locks provided by tbb, whether those are locks/mutexes that exist in tbb today or a new API that would be added. Put differently, I'm not optimistic about tbb being a good way to get feedback.
* Torvald Riegel: > What I mean is that applications would have to want to use locks provided > by tbb, whether those are locks/mutexes that exist in tbb today or a new > API that would be added. > > Put differently, I'm not optimistic about tbb being a good way to get > feedback. Do you want to run existing workloads with a new mutex implementation? Then we can't add new flags or change ABI in any way and would have to use a tunable. And to get feedback, we would have to make the new implementation the default, with a tunable to get back the old implementation. Thanks, Florian
On Tue, 2019-01-15 at 13:17 +0100, Florian Weimer wrote: > * Torvald Riegel: > > > What I mean is that applications would have to want to use locks provided > > by tbb, whether those are locks/mutexes that exist in tbb today or a new > > API that would be added. > > > > Put differently, I'm not optimistic about tbb being a good way to get > > feedback. > > Do you want to run existing workloads with a new mutex implementation? We need to get there. > Then we can't add new flags or change ABI in any way Yes. > and would have to > use a tunable. And to get feedback, we would have to make the new > implementation the default, with a tunable to get back the old > implementation. I wouldn't be too concerned to getting back the old implementation, so maybe we don't even need a tunable right now. The old implementation is just no spinning, so the cases where I can imagine the tunable to be useful is either (1) experimentation to compare performance without using different glibc's and (2) going back to old behavior in cases where we really screwed up. But how many users will have the time to investigate (2), really? Nobody should have to tune their spinlocks, or the back-off in mutexes. It's our duty to make sure this has good average-case performance.
On Tue, 2019-01-15 at 10:28 +0800, kemi wrote: > > > "Scalable spinlock" is something of an oxymoron. > > > > No, that's not true at all. Most high-performance shared-memory > > synchronization constructs (on typical HW we have today) will do some kind > > of spinning (and back-off), and there's nothing wrong about it. This can > > scale very well. > > > > > Spinlocks are for > > > situations where contention is extremely rare, > > > > No, the question is rather whether the program needs blocking through the > > OS (for performance, or for semantics such as PI) or not. Energy may be > > another factor. For example, glibc's current mutexes don't scale well on > > short critical because there's not enough spinning being done. > > > > yes. That's why we need pthread.mutex.spin_count tunable interface before. I don't think we need the tunable interface before that. Where we need to improve performance most is for applications that don't want to bother tuning their mutexes -- that's where the broadest gains are overall, I think. In turn, that means that we have spinning and back-off that give good average-case performance -- whether that's through automatic tuning of those two things at runtime, or through static default values that we do regular performance checks for in the glibc community. From that perspective, the tunable interface is a nice addition that can allow users to fine-tune the setting, but it's not how users would enable it. > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting > algorithm of current adaptive mutex is the major negative factor which degrades mutex > performance. Note that I'm not advocating for focusing on just the adaptive mutex type. IMO, adding this type was a mistake because whether to spin or not does not affect semantics of the mutexes. Performance hints shouldn't be done via a mutex' type, and all mutex implementations should consider to spin at least a little. If we just do something about the adaptive mutexes, then I guess this will reach few users. I believe most applications just don't use them, and the current implementation of adaptive mutexes is so simplistic that there's not much performance to be had by changing to adaptive mutexes (which is another reason for it having few users). > That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive > mutex. MCS-style spinning (ie, spinning on memory local to the spinning thread) is helpful, but I think we should tackle spinning on global memory first (ie, on a location in the mutex, which is shared by all the threads trying to acquire it). Of course, always including back-off. > https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html > > Also, if with very small critical section in the worklad, this new type of mutex > with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs > much better than original spinlock. I don't think we want to have a new type for that. It maybe useful for experimenting with it, but it shouldn't be exposed to users as a stable interface. Also, have you experimented with different kinds/settings of exponential back-off? I just saw normal spinning in your implementation, no varying amounts of back-off. The performance comparison should include back-off though, as that's one way to work around the contention problems (with a bigger hammer than local spinning of course, but can be effective nonetheless, and faster in low-contention cases). My guess is that a mix of local spinning on memory shared by a few threads running on cores that are close to each other would perform best (eg, similar to what's done in flat combining). > So, in some day, if adaptive mutex is tuned good enough, it should act like > mcs-spinlock (or NUMA spinlock) if workload has small critical section, and > performs like normal mutex if the critical section is too big to spinning-wait. I agree in some way, but I think that the adaptive mutex type should just be an alias of the normal mutex type (for API compatibility reasons only). And there could be other reasons than just critical-section-size that determine whether a thread should block using futexes or not.
On Tue, Jan 15, 2019 at 01:36:56PM +0100, Torvald Riegel wrote: > > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting > > algorithm of current adaptive mutex is the major negative factor which degrades mutex > > performance. > > Note that I'm not advocating for focusing on just the adaptive mutex type. > IMO, adding this type was a mistake because whether to spin or not does not > affect semantics of the mutexes. Performance hints shouldn't be done via a > mutex' type, and all mutex implementations should consider to spin at least > a little. Strongly agreed that this was a mistake. This same sentiment is why I don't like the nomenclature "NUMA spinlock". If there are semantic differences from a normal spinlock, it should be named for the semantics, not for the form of extreme performance tuning it does. If there are no semantic differences, NUMA-optimized spinlock implementation should just be a drop-in replacement for the standard spinlock API. Rich
On 2019/1/15 下午8:36, Torvald Riegel wrote: > On Tue, 2019-01-15 at 10:28 +0800, kemi wrote: >>>> "Scalable spinlock" is something of an oxymoron. >>> >>> No, that's not true at all. Most high-performance shared-memory >>> synchronization constructs (on typical HW we have today) will do some kind >>> of spinning (and back-off), and there's nothing wrong about it. This can >>> scale very well. >>> >>>> Spinlocks are for >>>> situations where contention is extremely rare, >>> >>> No, the question is rather whether the program needs blocking through the >>> OS (for performance, or for semantics such as PI) or not. Energy may be >>> another factor. For example, glibc's current mutexes don't scale well on >>> short critical because there's not enough spinning being done. >>> >> >> yes. That's why we need pthread.mutex.spin_count tunable interface before. > > I don't think we need the tunable interface before that. Where we need to > improve performance most is for applications that don't want to bother > tuning their mutexes -- that's where the broadest gains are overall, I > think. > > In turn, that means that we have spinning and back-off that give good > average-case performance -- whether that's through automatic tuning of > those two things at runtime, or through static default values that we do > regular performance checks for in the glibc community. > Spinning and proportional back-off with auto tuning has been proposed for several years and never got it merged in upstream kernel. IMHO, this is because MCS-style lock wins that battle. Could you tell me why we should consider backoff rather than MCS-style lock? > From that perspective, the tunable interface is a nice addition that can > allow users to fine-tune the setting, but it's not how users would enable > it. > >> But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting >> algorithm of current adaptive mutex is the major negative factor which degrades mutex >> performance. > > Note that I'm not advocating for focusing on just the adaptive mutex type. > IMO, adding this type was a mistake because whether to spin or not does not > affect semantics of the mutexes. Performance hints shouldn't be done via a > mutex' type, and all mutex implementations should consider to spin at least > a little. > > If we just do something about the adaptive mutexes, then I guess this will > reach few users. I believe most applications just don't use them, and the > current implementation of adaptive mutexes is so simplistic that there's > not much performance to be had by changing to adaptive mutexes (which is > another reason for it having few users). > Generally, I agree with you. May we tune adaptive mutex before applying these optimization to normal mutex. >> That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive >> mutex. > > MCS-style spinning (ie, spinning on memory local to the spinning thread) is > helpful, but I think we should tackle spinning on global memory first (ie, > on a location in the mutex, which is shared by all the threads trying to > acquire it). Of course, always including back-off. > >> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html >> >> Also, if with very small critical section in the worklad, this new type of mutex >> with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs >> much better than original spinlock. > > I don't think we want to have a new type for that. It maybe useful for > experimenting with it, but it shouldn't be exposed to users as a stable > interface. > I don't like to add a new type either. As I said in the commit log, that's a trade-off to avoid ABI changed. I am very glad to see that MCS-style lock can be used gracefully without introducing a new type. > Also, have you experimented with different kinds/settings of exponential > back-off? I just saw normal spinning in your implementation, no varying > amounts of back-off. The performance comparison should include back-off > though, as that's one way to work around the contention problems (with a > bigger hammer than local spinning of course, but can be effective > nonetheless, and faster in low-contention cases). > I didn't try back-off, because we don't have to include it if MCS-style lock is used. > My guess is that a mix of local spinning on memory shared by a few threads > running on cores that are close to each other would perform best (eg, > similar to what's done in flat combining). > >> So, in some day, if adaptive mutex is tuned good enough, it should act like >> mcs-spinlock (or NUMA spinlock) if workload has small critical section, and >> performs like normal mutex if the critical section is too big to spinning-wait. > > I agree in some way, but I think that the adaptive mutex type should just > be an alias of the normal mutex type (for API compatibility reasons only). > And there could be other reasons than just critical-section-size that > determine whether a thread should block using futexes or not. > Agree. I am justing moving toward that step by step.
On Thu, 2019-01-17 at 11:05 +0800, kemi wrote: > > On 2019/1/15 下午8:36, Torvald Riegel wrote: > > On Tue, 2019-01-15 at 10:28 +0800, kemi wrote: > > > > > "Scalable spinlock" is something of an oxymoron. > > > > > > > > No, that's not true at all. Most high-performance shared-memory > > > > synchronization constructs (on typical HW we have today) will do some kind > > > > of spinning (and back-off), and there's nothing wrong about it. This can > > > > scale very well. > > > > > > > > > Spinlocks are for > > > > > situations where contention is extremely rare, > > > > > > > > No, the question is rather whether the program needs blocking through the > > > > OS (for performance, or for semantics such as PI) or not. Energy may be > > > > another factor. For example, glibc's current mutexes don't scale well on > > > > short critical because there's not enough spinning being done. > > > > > > > > > > yes. That's why we need pthread.mutex.spin_count tunable interface before. > > > > I don't think we need the tunable interface before that. Where we need to > > improve performance most is for applications that don't want to bother > > tuning their mutexes -- that's where the broadest gains are overall, I > > think. > > > > In turn, that means that we have spinning and back-off that give good > > average-case performance -- whether that's through automatic tuning of > > those two things at runtime, or through static default values that we do > > regular performance checks for in the glibc community. > > > > Spinning and proportional back-off with auto tuning has been proposed for several years > and never got it merged in upstream kernel. > IMHO, this is because MCS-style lock wins that battle. > > Could you tell me why we should consider backoff rather than MCS-style lock? The kernel has no concerns over ABI stability of their locks, AFAIK. But pthread_mutex_t is exposed to users, and there is the problem of process- shared mutexes, so we can't just go to an external list. Furthermore, if we had proper back-off, we could use it for other synchronization primitives too (grep for comments containing "back-off"). Doe you have any data that would show that MCS-style locks always win on current machines? > > From that perspective, the tunable interface is a nice addition that can > > allow users to fine-tune the setting, but it's not how users would enable > > it. > > > > > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting > > > algorithm of current adaptive mutex is the major negative factor which degrades mutex > > > performance. > > > > Note that I'm not advocating for focusing on just the adaptive mutex type. > > IMO, adding this type was a mistake because whether to spin or not does not > > affect semantics of the mutexes. Performance hints shouldn't be done via a > > mutex' type, and all mutex implementations should consider to spin at least > > a little. > > > > If we just do something about the adaptive mutexes, then I guess this will > > reach few users. I believe most applications just don't use them, and the > > current implementation of adaptive mutexes is so simplistic that there's > > not much performance to be had by changing to adaptive mutexes (which is > > another reason for it having few users). > > > > Generally, I agree with you. > May we tune adaptive mutex before applying these optimization to normal mutex. That may be a way to start experimenting with it, as it gives you an already existing "tunable". However, I would guess that it won't give you wide testing because I would guess that most programs don't bother changing their locks to be of the adaptive type. > > > That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive > > > mutex. > > > > MCS-style spinning (ie, spinning on memory local to the spinning thread) is > > helpful, but I think we should tackle spinning on global memory first (ie, > > on a location in the mutex, which is shared by all the threads trying to > > acquire it). Of course, always including back-off. > > > > > https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html > > > > > > Also, if with very small critical section in the worklad, this new type of mutex > > > with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs > > > much better than original spinlock. > > > > I don't think we want to have a new type for that. It maybe useful for > > experimenting with it, but it shouldn't be exposed to users as a stable > > interface. > > > > I don't like to add a new type either. Good. > As I said in the commit log, that's a trade-off to avoid ABI changed. > I am very glad to see that MCS-style lock can be used gracefully without > introducing a new type. Well, we'd need to introduce MCS-style locks in a way that does not change the ABI. If that's possible and done, and performance improves, I don't see a reason why this shouldn't be accepted. > > Also, have you experimented with different kinds/settings of exponential > > back-off? I just saw normal spinning in your implementation, no varying > > amounts of back-off. The performance comparison should include back-off > > though, as that's one way to work around the contention problems (with a > > bigger hammer than local spinning of course, but can be effective > > nonetheless, and faster in low-contention cases). > > > > I didn't try back-off, because we don't have to include it if MCS-style lock is used. But you should at least try to get some decent performance, because it's a competitor to MCS-style locking. Even if you can't get the same performance as MCS in some cases, we still need to know so that we can assess the pros and cons of choosing MCS. Also, some initial spinning may be useful even with MCS, I think (eg, short critical sections with just two threads but enough delay between critical sections by each thread so that there doesn't need to be a lot of contention actually).
diff --git a/NEWS b/NEWS index cd29ec7..0764d0d 100644 --- a/NEWS +++ b/NEWS @@ -9,6 +9,9 @@ Version 2.29 Major new features: +* NUMA spinlock is added to provide a spinlock implementation optimized + for multi-socket NUMA systems. + * The getcpu wrapper function has been added, which returns the currently used CPU and NUMA node. This function is Linux-specific. diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c new file mode 100644 index 0000000..ca98443 --- /dev/null +++ b/manual/examples/numa-spinlock.c @@ -0,0 +1,99 @@ +/* NUMA spinlock example. + Copyright (C) 2018 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + as published by the Free Software Foundation; either version 2 + of the License, or (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. +*/ + +#include <pthread.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> +#include <numa-spinlock.h> + +#define NUM_THREADS 20 + +struct numa_spinlock *lock; + +struct work_todo_argument +{ + void *arg; +}; + +static void * +work_todo (void *v) +{ + /* Do the real work with p->arg. */ + struct work_todo_argument *p = v; + /* Return value is set to lock_info.result. */ + return NULL; +} + +void * +work_thread (void *arg) +{ + struct work_todo_argument work_todo_arg; + struct numa_spinlock_info lock_info; + + if (numa_spinlock_init (lock, &lock_info)) + { + printf ("numa_spinlock_init failure: %m\n"); + exit (1); + } + + work_todo_arg.arg = arg; + lock_info.argument = &work_todo_arg; + lock_info.workload = work_todo; + + numa_spinlock_apply (&lock_info); + + return lock_info.result; +} + +int +main (int argc, char **argv) +{ + lock = numa_spinlock_alloc (); + pthread_t thr[NUM_THREADS]; + void *res[NUM_THREADS]; + int numthreads = NUM_THREADS; + int i; + + for (i = 0; i < NUM_THREADS; i++) + { + int err_ret = pthread_create (&thr[i], NULL, work_thread, + (void *) (intptr_t) i); + if (err_ret != 0) + { + printf ("pthread_create failed: %d, %s\n", + i, strerror (i)); + numthreads = i; + break; + } + } + + for (i = 0; i < numthreads; i++) + { + if (pthread_join (thr[i], (void *) &res[i]) == 0) + free (res[i]); + else + printf ("pthread_join failure: %m\n"); + } + + numa_spinlock_free (lock); + + return 0; +} diff --git a/manual/threads.texi b/manual/threads.texi index 87fda7d..e82ae0d 100644 --- a/manual/threads.texi +++ b/manual/threads.texi @@ -625,6 +625,9 @@ the standard. @menu * Default Thread Attributes:: Setting default attributes for threads in a process. +* NUMA Spinlock:: Spinlock optimized for + multi-socket NUMA platform. +* NUMA Spinlock Example:: A NUMA spinlock example. @end menu @node Default Thread Attributes @@ -669,6 +672,108 @@ The system does not have sufficient memory. @end table @end deftypefun +@node NUMA Spinlock +@subsubsection Spinlock optimized for multi-node NUMA systems + +To improve performance on multi-socket NUMA platforms for serialized +region protected by spinlock, @theglibc{} implements a NUMA spinlock +object, which minimizes cross-socket traffic and sends the protected +serialized region to one core for execution to reduce spinlock contention +overhead. + +The fundamental data types for a NUMA spinlock are +@code{numa_spinlock} and @code{numa_spinlock_info}: + +@deftp {Data Type} {struct numa_spinlock} +@standards{Linux, numa-spinlock.h} +This data type is an opaque structure. A @code{numa_spinlock} pointer +uniquely identifies a NUMA spinlock object. +@end deftp + +@deftp {Data Type} {struct numa_spinlock_info} +@standards{Linux, numa-spinlock.h} + +This data type uniquely identifies a NUMA spinlock information object for +a thread. It has the following members, and others internal to NUMA +spinlock implemenation: + +@table @code +@item void *(*workload) (void *) +A function pointer to the workload function serialized by spinlock. +@item void *argument +A pointer to argument passed to the @var{workload} function pointer. +@item void *result +Return value from the @var{workload} function pointer. +@end table + +@end deftp + +The following functions are provided for NUMA spinlock objects: + +@deftypefun struct numa_spinlock *numa_spinlock_alloc (void) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} + +This function returns a pointer to a newly allocated NUMA spinlock or a +null pointer if the NUMA spinlock could not be allocated, setting +@code{errno} to @code{ENOMEM}. Caller should call +@code{numa_spinlock_free} on the NUMA spinlock pointer to free the +memory space when it is no longer needed. + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock}) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}} + +Free the memory space pointed to by @var{lock}, which must have been +returned by a previous call to @code{numa_spinlock_alloc}. Otherwise, +or if @code{numa_spinlock_free (@var{lock})} has already been called +before, undefined behavior occurs. + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock}, +struct numa_spinlock_info *@var{info}) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} + +Initialize the NUMA spinlock information block pointed to by @var{info} +with a NUMA spinlock pointer @var{lock}. The return value is @code{0} on +success and @code{-1} on failure. The following @code{errno} error +codes are defined for this function: + +@table @code +@item ENOSYS +The operating system does not support the @code{getcpu} function. +@end table + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info}) +@standards{Linux, numa-spinlock.h} +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}} + +Apply for spinlock with a NUMA spinlock information block pointed to by +@var{info}. When @code{numa_spinlock_apply} returns, the spinlock is +released and the @var{result} member of @var{info} contains the return +value of the @var{workload} member. + +This function is Linux-specific and is declared in @file{numa-spinlock.h}. +@end deftypefun + +@node NUMA Spinlock Example +@subsubsection NUMA Spinlock Example + +A NUMA spinlock example: + +@smallexample +@include numa-spinlock.c.texi +@end smallexample + @c FIXME these are undocumented: @c pthread_atfork @c pthread_attr_destroy diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile index f827455..3361597 100644 --- a/sysdeps/unix/sysv/linux/Makefile +++ b/sysdeps/unix/sysv/linux/Makefile @@ -222,6 +222,8 @@ CFLAGS-gai.c += -DNEED_NETLINK endif ifeq ($(subdir),nptl) +libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock +sysdep_headers += numa-spinlock.h tests += tst-align-clone tst-getpid1 \ tst-thread-affinity-pthread tst-thread-affinity-pthread2 \ tst-thread-affinity-sched diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions index f1e12d9..7ce7e2b 100644 --- a/sysdeps/unix/sysv/linux/Versions +++ b/sysdeps/unix/sysv/linux/Versions @@ -185,3 +185,12 @@ libc { __netlink_assert_response; } } + +libpthread { + GLIBC_2.29 { + numa_spinlock_alloc; + numa_spinlock_free; + numa_spinlock_init; + numa_spinlock_apply; + } +} diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist index 9a9e4ce..eb54a83 100644 --- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist index b413007..dd08796 100644 --- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist index af82a4c..45a5c5a 100644 --- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.4 _IO_flockfile F GLIBC_2.4 _IO_ftrylockfile F GLIBC_2.4 _IO_funlockfile F diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist index ea4b79a..cf65f72 100644 --- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist @@ -73,6 +73,10 @@ GLIBC_2.29 mtx_timedlock F GLIBC_2.29 mtx_trylock F GLIBC_2.29 mtx_unlock F GLIBC_2.29 nanosleep F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.29 open F GLIBC_2.29 open64 F GLIBC_2.29 pause F diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist index bcba07f..a80475f 100644 --- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist index bece86d..40ac05a 100644 --- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist index ccc9449..5b190f6 100644 --- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist index af82a4c..45a5c5a 100644 --- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.4 _IO_flockfile F GLIBC_2.4 _IO_ftrylockfile F GLIBC_2.4 _IO_funlockfile F diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist index bece86d..40ac05a 100644 --- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist index 5067375..e6539bf 100644 --- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist index 0214496..76edcb8 100644 --- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist index 0214496..76edcb8 100644 --- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist index 78cac2a..3141d08 100644 --- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist @@ -241,3 +241,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h new file mode 100644 index 0000000..0f7a3be --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h @@ -0,0 +1,38 @@ +/* Internal definitions and declarations for NUMA spinlock. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include "numa-spinlock.h" + +/* The global NUMA spinlock. */ +struct numa_spinlock +{ + /* List of threads who owns the global NUMA spinlock. */ + struct numa_spinlock_info *owner; + /* The maximium NUMA node number. */ + unsigned int max_node; + /* Non-zero for single node system. */ + unsigned int single_node; + /* The maximium CPU number. Used only when NUMA is disabled. */ + unsigned int max_cpu; + /* Array of physical_package_id of each core if it isn't NULL. Used + only when NUMA is disabled.*/ + unsigned int *physical_package_id_p; + /* Arrays of lists of threads who are spinning for the local NUMA lock + on NUMA nodes indexed by NUMA node number. */ + struct numa_spinlock_info *lists[]; +}; diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c new file mode 100644 index 0000000..a141e7d --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa-spinlock.c @@ -0,0 +1,327 @@ +/* NUMA spinlock + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <config.h> +#include <string.h> +#include <stdlib.h> +#include <sched.h> +#ifndef HAVE_GETCPU +#include <unistd.h> +#include <syscall.h> +#endif +#include <errno.h> +#include <atomic.h> +#include "numa-spinlock-private.h" + +#if !defined HAVE_GETCPU && defined _LIBC +# define HAVE_GETCPU +#endif + +/* On multi-socket systems, memory is shared across the entire system. + Data access to the local socket is much faster than the remote socket + and data access to the local core is faster than sibling cores on the + same socket. For serialized workloads with conventional spinlock, + when there is high spinlock contention between threads, lock ping-pong + among sockets becomes the bottleneck and threads spend majority of + their time in spinlock overhead. + + On multi-socket systems, the keys to our NUMA spinlock performance + are to minimize cross-socket traffic as well as localize the serialized + workload to one core for execution. The basic principles of NUMA + spinlock are mainly consisted of following approaches, which reduce + data movement and accelerate critical section, eventually give us + significant performance improvement. + + 1. MCS spinlock + MCS spinlock help us to reduce the useless lock movement in the + spinning state. This paper provides a good description for this + kind of lock: + <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf> + + 2. Critical Section Integration (CSI) + Essentially spinlock is similar to that one core complete critical + sections one by one. So when contention happen, the serialized works + are sent to the core who is the lock owner and responsible to execute + them, that can save much time and power, because all shared data are + located in private cache of the lock owner. + + We implemented this mechanism based on queued spinlock in kernel, that + speeds up critical section, and reduces the probability of contention. + The paper provides a good description for this kind of lock: + <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf> + + 3. NUMA Aware Spinlock (NAS) + Currently multi-socket systems give us better performance per watt, + however that also involves more complex synchronization requirement, + because off-chip data movement is much slower. We use distributed + synchronization mechanism to decrease Lock cache line to and from + different nodes. The paper provides a good description for this kind + of lock: + <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf> + + 4. Yield Schedule + When threads are applying for Critical Section Integration(CSI) with + known contention, they will delegate work to the thread who is the + lock owner, and wait for work to be completed. The resources which + they are using should be transferred to other threads. In order to + accelerate the scenario, we introduce yield_sched function during + spinning stage. + + 5. Optimization when NUMA is ON or OFF. + Although programs can access memory with lower latency when NUMA is + enabled, some programs may need more memory bandwidth for computation + with NUMA disabled. We also optimize multi-socket systems with NUMA + disabled. + + NUMA spinlock flow chart (assuming there are 2 CPU nodes): + + 1. Threads from node_0/node_1 acquire local lock for node_0/1 + respectively. If the thread succeeds in acquiring local lock, it + goes to step 2, otherwise pushes critical function into current + local work queue, and enters into spinning stage with MCS mode. + + 2. Threads from node_0/node_1 acquire the global lock. If it succeeds + in acquiring the global lock as the lock owner, it goes to step 3, + otherwise waits until the lock owner thread releases the global lock. + + 3. The lock owner thread from node_0/1 enters into critical section, + cleans up work queue by performing all local critical functions + pushed at step 1 with CSI on behalf of other threads and informs + those spinning threads that their works have been done. It then + releases the local lock. + + 4. The lock owner thread frees global lock. If another thread is + waiting at step 2, the lock owner thread passes the global lock to + the waiting thread and returns. The new lock owner thread enters + into step 3. If no threads are waiting, the lock owner thread + releases the global lock and returns. The whole critical section + process is completed. + + Steps 1 and 2 mitigate global lock contention. Only one thread + from different nodes will compete for the global lock in step 2. + Step 3 reduces the global lock & shared data movement because they + are located in the same node as well as the same core. Our data + shows that Critical Section Integration (CSI) improves data locality + and NUMA-aware spinlock (NAS) helps CSI balance the workload. + + NUMA spinlock can greatly speed up critical section on multi-socket + systems. It should improve spinlock performance on all multi-socket + systems. + + NOTE: LiTL <https://github.com/multicore-locks/litl>, is an open-source + project that provides implementations of dozens of various locks, + including several state-of-the-art NUMA-aware spinlocks. Among them + + 1. Hierarchical MCS (HMCS) spinlock. Milind Chabbi, Michael Fagan, + and John Mellor-Crummey. High Performance Locks for Multi-level NUMA + Systems. In Proceedings of the ACM SIGPLAN Symposium on Principles + and Practice of Parallel Programming (PPoPP), pages 215–226, 2015. + + 2. Cohort-MCS (C-MCS) spinlock. Dave Dice, Virendra J. Marathe, and + Nir Shavit. Lock Cohorting: A General Technique for Designing NUMA + Locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015. + */ + +/* Get the next thread pointed to by *NEXT_P. NB: We must use a while + spin loop to load NEXT_P since there is a small window before *NEXT_P + is updated. */ + +static inline struct numa_spinlock_info * +get_numa_spinlock_info_next (struct numa_spinlock_info **next_p) +{ + struct numa_spinlock_info *next; + while (!(next = atomic_load_relaxed (next_p))) + atomic_spin_nop (); + return next; +} + +/* While holding the global NUMA spinlock, run the workload of the + thread pointed to by SELF first, then run the workload for each + thread on the thread list pointed to by HEAD_P and wake up the + thread so that all workloads run on a single processor. */ + +static inline void +run_numa_spinlock (struct numa_spinlock_info *self, + struct numa_spinlock_info **head_p) +{ + struct numa_spinlock_info *next, *current; + + /* Run the SELF's workload. */ + self->result = self->workload (self->argument); + + /* Process workloads for the rest of threads on the thread list. + NB: The thread list may be prepended by other threads at the + same time. */ + +retry: + /* If SELF is the first thread of the thread list pointed to by + HEAD_P, clear the thread list. */ + current = atomic_compare_and_exchange_val_acq (head_p, NULL, self); + if (current == self) + { + /* Since SELF is the only thread on the list, clear SELF's pending + field and return. */ + atomic_store_release (¤t->pending, 0); + return; + } + + /* CURRENT will have the previous first thread of the thread list + pointed to by HEAD_P and *HEAD_P will point to SELF. */ + current = atomic_exchange_acquire (head_p, self); + + /* NB: No need to check if CURRENT == SELF here since SELF can never + be CURRENT. */ + +repeat: + /* Get the next thread. */ + next = get_numa_spinlock_info_next (¤t->next); + + /* Run the CURRENT's workload and clear CURRENT's pending field. */ + current->result = current->workload (current->argument); + current->pending = 0; + + /* Process the workload for each thread from CURRENT to SELF on the + thread list. Don't pass beyond SELF since SELF is the last thread + on the list. */ + if (next == self) + goto retry; + current = next; + goto repeat; +} + +/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed + to by SELF. */ + +void +numa_spinlock_apply (struct numa_spinlock_info *self) +{ + struct numa_spinlock *lock = self->lock; + struct numa_spinlock_info *first, *next; + struct numa_spinlock_info **head_p; + + self->next = NULL; + /* We want the global NUMA spinlock. */ + self->pending = 1; + /* Select the local NUMA spinlock list by the NUMA node number. */ + head_p = &lock->lists[self->node]; + /* FIRST will have the previous first thread of the local NUMA spinlock + list and *HEAD_P will point to SELF. */ + first = atomic_exchange_acquire (head_p, self); + if (first) + { + /* SELF has been prepended to the thread list pointed to by + HEAD_P. NB: There is a small window between updating + *HEAD_P and self->next. */ + atomic_store_release (&self->next, first); + /* Let other threads run first since another thread will run our + workload for us. */ + sched_yield (); + /* Spin until our PENDING is cleared. */ + while (atomic_load_relaxed (&self->pending)) + atomic_spin_nop (); + return; + } + + /* NB: Now SELF must be the only thread on the thread list pointed + to by HEAD_P. Since thread is always prepended to HEAD_P, we + can use *HEAD_P == SELF to check if SELF is the only thread on + the thread list. */ + + if (__glibc_unlikely (lock->single_node)) + { + /* If there is only one node, there is no need for the global + NUMA spinlock. */ + run_numa_spinlock (self, head_p); + return; + } + + /* FIRST will have the previous first thread of the local NUMA spinlock + list of threads which holds the global NUMA spinlock, which will + point to SELF. */ + first = atomic_exchange_acquire (&lock->owner, self); + if (first) + { + /* SELF has been prepended to the thread list pointed to by + lock->owner. NB: There is a small window between updating + *HEAD_P and first->next. */ + atomic_store_release (&first->next, self); + /* Spin until the list of threads which holds the global NUMA + spinlock clears our PENDING. */ + while (atomic_load_relaxed (&self->pending)) + atomic_spin_nop (); + } + + /* We get the global NUMA spinlock now. Run our workload. */ + run_numa_spinlock (self, head_p); + + /* SELF is the only thread on the list if SELF is the first thread + of the thread list pointed to by lock->owner. In this case, we + simply return. */ + if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self)) + return; + + /* Wake up the next thread. */ + next = get_numa_spinlock_info_next (&self->next); + atomic_store_release (&next->pending, 0); +} + +/* Initialize the NUMA spinlock info data pointed to by INFO from a + pointer to the NUMA spinlock, LOCK. */ + +int +numa_spinlock_init (struct numa_spinlock *lock, + struct numa_spinlock_info *info) +{ + memset (info, 0, sizeof (*info)); + info->lock = lock; + /* For single node system, use 0 as the NUMA node number. */ + if (lock->single_node) + return 0; + /* NB: Use the NUMA node number from getcpu to select the local NUMA + spinlock list. */ + unsigned int cpu; + unsigned int node; +#ifdef HAVE_GETCPU + int err_ret = getcpu (&cpu, &node); +#else + int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL); +#endif + if (err_ret) + return err_ret; + if (lock->physical_package_id_p) + { + /* Can it ever happen? */ + if (cpu > lock->max_cpu) + cpu = lock->max_cpu; + /* NB: If NUMA is disabled, use physical_package_id. */ + node = lock->physical_package_id_p[cpu]; + } + /* Can it ever happen? */ + if (node > lock->max_node) + node = lock->max_node; + info->node = node; + return err_ret; +} + +void +numa_spinlock_free (struct numa_spinlock *lock) +{ + if (lock->physical_package_id_p) + free (lock->physical_package_id_p); + free (lock); +} diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h new file mode 100644 index 0000000..b17bda5 --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa-spinlock.h @@ -0,0 +1,64 @@ +/* Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#ifndef _NUMA_SPINLOCK_H +#define _NUMA_SPINLOCK_H + +#include <features.h> + +__BEGIN_DECLS + +/* The NUMA spinlock. */ +struct numa_spinlock; + +/* The NUMA spinlock information for each thread. */ +struct numa_spinlock_info +{ + /* The workload function of this thread. */ + void *(*workload) (void *); + /* The argument pointer passed to the workload function. */ + void *argument; + /* The return value of the workload function. */ + void *result; + /* The pointer to the NUMA spinlock. */ + struct numa_spinlock *lock; + /* The next thread on the local NUMA spinlock thread list. */ + struct numa_spinlock_info *next; + /* The NUMA node number. */ + unsigned int node; + /* Non-zero to indicate that the thread wants the NUMA spinlock. */ + int pending; + /* Reserved for future use. */ + void *__reserved[4]; +}; + +/* Return a pointer to a newly allocated NUMA spinlock. */ +extern struct numa_spinlock *numa_spinlock_alloc (void); + +/* Free the memory space of the NUMA spinlock. */ +extern void numa_spinlock_free (struct numa_spinlock *); + +/* Initialize the NUMA spinlock information block. */ +extern int numa_spinlock_init (struct numa_spinlock *, + struct numa_spinlock_info *); + +/* Apply for spinlock with a NUMA spinlock information block. */ +extern void numa_spinlock_apply (struct numa_spinlock_info *); + +__END_DECLS + +#endif /* numa-spinlock.h */ diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c new file mode 100644 index 0000000..8ff4e1a --- /dev/null +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c @@ -0,0 +1,304 @@ +/* Initialization of NUMA spinlock. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <assert.h> +#include <ctype.h> +#include <string.h> +#include <dirent.h> +#include <stdio.h> +#include <limits.h> +#ifdef _LIBC +# include <not-cancel.h> +#else +# include <stdlib.h> +# include <unistd.h> +# include <fcntl.h> +# define __open_nocancel open +# define __close_nocancel_nostatus close +# define __read_nocancel read +#endif + +#include "numa-spinlock-private.h" + +static char * +next_line (int fd, char *const buffer, char **cp, char **re, + char *const buffer_end) +{ + char *res = *cp; + char *nl = memchr (*cp, '\n', *re - *cp); + if (nl == NULL) + { + if (*cp != buffer) + { + if (*re == buffer_end) + { + memmove (buffer, *cp, *re - *cp); + *re = buffer + (*re - *cp); + *cp = buffer; + + ssize_t n = __read_nocancel (fd, *re, buffer_end - *re); + if (n < 0) + return NULL; + + *re += n; + + nl = memchr (*cp, '\n', *re - *cp); + while (nl == NULL && *re == buffer_end) + { + /* Truncate too long lines. */ + *re = buffer + 3 * (buffer_end - buffer) / 4; + n = __read_nocancel (fd, *re, buffer_end - *re); + if (n < 0) + return NULL; + + nl = memchr (*re, '\n', n); + **re = '\n'; + *re += n; + } + } + else + nl = memchr (*cp, '\n', *re - *cp); + + res = *cp; + } + + if (nl == NULL) + nl = *re - 1; + } + + *cp = nl + 1; + assert (*cp <= *re); + + return res == *re ? NULL : res; +} + +static int +select_cpu (const struct dirent *d) +{ + /* Return 1 for "cpuXXX" where XXX are digits. */ + if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0) + { + const char *p = d->d_name + 3; + + if (*p == '\0') + return 0; + + do + { + if (!isdigit (*p)) + return 0; + p++; + } + while (*p != '\0'); + + return 1; + } + return 0; +} + +/* Allocate a NUMA spinlock and return a pointer to it. Caller should + call numa_spinlock_free on the NUMA spinlock pointer to free the + memory when it is no longer needed. */ + +struct numa_spinlock * +numa_spinlock_alloc (void) +{ + const size_t buffer_size = 1024; + char buffer[buffer_size]; + char *buffer_end = buffer + buffer_size; + char *cp = buffer_end; + char *re = buffer_end; + + const int flags = O_RDONLY | O_CLOEXEC; + int fd = __open_nocancel ("/sys/devices/system/node/online", flags); + char *l; + unsigned int max_node = 0; + unsigned int node_count = 0; + if (fd != -1) + { + l = next_line (fd, buffer, &cp, &re, buffer_end); + if (l != NULL) + do + { + char *endp; + unsigned long int n = strtoul (l, &endp, 10); + if (l == endp) + { + node_count = 1; + break; + } + + unsigned long int m = n; + if (*endp == '-') + { + l = endp + 1; + m = strtoul (l, &endp, 10); + if (l == endp) + { + node_count = 1; + break; + } + } + + node_count += m - n + 1; + + if (m >= max_node) + max_node = m; + + l = endp; + while (l < re && isspace (*l)) + ++l; + } + while (l < re); + + __close_nocancel_nostatus (fd); + } + + /* NB: Some NUMA nodes may not be available, if the number of NUMA + nodes is 1, set the maximium NUMA node number to 0. */ + if (node_count == 1) + max_node = 0; + + unsigned int max_cpu = 0; + unsigned int *physical_package_id_p = NULL; + + if (max_node == 0) + { + /* There is at least 1 node. */ + node_count = 1; + + /* If NUMA is disabled, use physical_package_id instead. */ + struct dirent **cpu_list; + int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list, + select_cpu, NULL); + if (nprocs > 0) + { + int i; + unsigned int *cpu_id_p = NULL; + + /* Find the maximum CPU number. */ + if (posix_memalign ((void **) &cpu_id_p, + __alignof__ (void *), + nprocs * sizeof (unsigned int)) == 0) + { + for (i = 0; i < nprocs; i++) + { + unsigned int cpu_id + = strtoul (cpu_list[i]->d_name + 3, NULL, 10); + cpu_id_p[i] = cpu_id; + if (cpu_id > max_cpu) + max_cpu = cpu_id; + } + + if (posix_memalign ((void **) &physical_package_id_p, + __alignof__ (void *), + ((max_cpu + 1) + * sizeof (unsigned int))) == 0) + { + memset (physical_package_id_p, 0, + ((max_cpu + 1) * sizeof (unsigned int))); + + max_node = UINT_MAX; + + /* Get physical_package_id. */ + char path[(sizeof ("/sys/devices/system/cpu") + + 3 * sizeof (unsigned long int) + + sizeof ("/topology/physical_package_id"))]; + for (i = 0; i < nprocs; i++) + { + struct dirent *d = cpu_list[i]; + if (snprintf (path, sizeof (path), + "/sys/devices/system/cpu/%s/topology/physical_package_id", + d->d_name) > 0) + { + fd = __open_nocancel (path, flags); + if (fd != -1) + { + if (__read_nocancel (fd, buffer, + buffer_size) > 0) + { + char *endp; + unsigned long int package_id + = strtoul (buffer, &endp, 10); + if (package_id != ULONG_MAX + && *buffer != '\0' + && (*endp == '\0' || *endp == '\n')) + { + physical_package_id_p[cpu_id_p[i]] + = package_id; + if (max_node == UINT_MAX) + { + /* This is the first node. */ + max_node = package_id; + } + else if (package_id != max_node) + { + /* NB: We only need to know if + NODE_COUNT > 1. */ + node_count = 2; + if (package_id > max_node) + max_node = package_id; + } + } + } + __close_nocancel_nostatus (fd); + } + } + + free (d); + } + } + + free (cpu_id_p); + } + else + { + for (i = 0; i < nprocs; i++) + free (cpu_list[i]); + } + + free (cpu_list); + } + } + + if (physical_package_id_p != NULL && node_count == 1) + { + /* There is only one node. No need for physical_package_id_p. */ + free (physical_package_id_p); + physical_package_id_p = NULL; + max_cpu = 0; + } + + /* Allocate an array of struct numa_spinlock_info pointers to hold info + for all NUMA nodes with NUMA node number from getcpu () as index. */ + size_t size = (sizeof (struct numa_spinlock) + + ((max_node + 1) + * sizeof (struct numa_spinlock_info *))); + struct numa_spinlock *lock; + if (posix_memalign ((void **) &lock, + __alignof__ (struct numa_spinlock_info *), size)) + return NULL; + memset (lock, 0, size); + + lock->max_node = max_node; + lock->single_node = node_count == 1; + lock->max_cpu = max_cpu; + lock->physical_package_id_p = physical_package_id_p; + + return lock; +} diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist index 09e8447..dba7df6 100644 --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist index 8300958..a763c0a 100644 --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3 _IO_flockfile F GLIBC_2.3 _IO_ftrylockfile F GLIBC_2.3 _IO_funlockfile F diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist index 9a9e4ce..eb54a83 100644 --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist index c370fda..366fcac 100644 --- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist @@ -235,3 +235,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist index d05468f..786d8e1 100644 --- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist @@ -229,6 +229,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist index e8161aa..dd7c52f 100644 --- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist @@ -221,6 +221,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist index bcba07f..a80475f 100644 --- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist index b413007..dd08796 100644 --- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist index ccc9449..5b190f6 100644 --- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/x86/Makefile b/sysdeps/unix/sysv/linux/x86/Makefile index 02ca36c..29d41ad 100644 --- a/sysdeps/unix/sysv/linux/x86/Makefile +++ b/sysdeps/unix/sysv/linux/x86/Makefile @@ -14,6 +14,7 @@ endif ifeq ($(subdir),nptl) libpthread-sysdep_routines += elision-lock elision-unlock elision-timed \ elision-trylock +xtests += tst-variable-overhead tst-numa-variable-overhead CFLAGS-elision-lock.c += -mrtm CFLAGS-elision-unlock.c += -mrtm CFLAGS-elision-timed.c += -mrtm diff --git a/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c new file mode 100644 index 0000000..7cb8542 --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c @@ -0,0 +1,53 @@ +/* Test case for NUMA spinlock overhead. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#ifndef _GNU_SOURCE +# define _GNU_SOURCE +#endif +#include "numa-spinlock.h" + +struct numa_spinlock *lock; + +struct work_todo_argument +{ + unsigned long *v1; + unsigned long *v2; + unsigned long *v3; + unsigned long *v4; +}; + +static void * +work_todo (void *v) +{ + struct work_todo_argument *p = v; + unsigned long ret; + *p->v1 = *p->v1 + 1; + *p->v2 = *p->v2 + 1; + ret = __sync_val_compare_and_swap (p->v4, 0, 1); + *p->v3 = *p->v3 + ret; + return (void *) 2; +} + +static inline void +do_work (struct numa_spinlock_info *lock_info) +{ + numa_spinlock_apply (lock_info); +} + +#define USE_NUMA_SPINLOCK +#include "tst-variable-overhead-skeleton.c" diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c new file mode 100644 index 0000000..4b83dfb --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c @@ -0,0 +1,384 @@ +/* Test case skeleton for spinlock overhead. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +/* Check spinlock overhead with large number threads. Critical region is + very smmall. Critical region + spinlock overhead aren't noticeable + when number of threads is small. When thread number increases, + spinlock overhead become the bottleneck. It shows up in wall time + of thread execution. */ + +#ifndef _GNU_SOURCE +# define _GNU_SOURCE +#endif +#include <config.h> +#include <unistd.h> +#include <stdio.h> +#include <pthread.h> +#include <sched.h> +#include <stdlib.h> +#include <string.h> +#include <stdint.h> +#include <sys/time.h> +#include <sys/param.h> +#include <errno.h> +#ifdef MODULE_NAME +# include <cpu-features.h> +# include <support/test-driver.h> +#endif + +#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP +# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1 +#endif + +#define memory_barrier() __asm ("" ::: "memory") +#define pause() __asm ("rep ; nop" ::: "memory") + +#define CACHELINE_SIZE 64 +#define CACHE_ALIGNED __attribute__((aligned(CACHELINE_SIZE))) + +#define constant_time 5 +unsigned long g_val CACHE_ALIGNED; +unsigned long g_val2 CACHE_ALIGNED; +unsigned long g_val3 CACHE_ALIGNED; +unsigned long cmplock CACHE_ALIGNED; +struct count +{ + unsigned long long total; + unsigned long long spinlock; + unsigned long long wall; +} __attribute__((aligned(128))); + +struct count *gcount; + +/* The time consumed by one update is about 200 TSCs. */ +static int delay_time_unlocked = 400; + +struct ops +{ + void *(*test) (void *arg); + void (*print_thread) (void *res, int); +} *ops; + +struct stats_result +{ + unsigned long num; +}; + +void *work_thread (void *arg); + +#define iterations (10000 * 5) + +static volatile int start_thread; + +/* Delay some fixed time */ +static void +delay_tsc (unsigned n) +{ + unsigned long long start, current, diff; + unsigned int aux; + start = __builtin_ia32_rdtscp (&aux); + while (1) + { + current = __builtin_ia32_rdtscp (&aux); + diff = current - start; + if (diff < n) + pause (); + else + break; + } +} + +static void +wait_a_bit (int delay_time) +{ + if (delay_time > 0) + delay_tsc (delay_time); +} + +#ifndef USE_NUMA_SPINLOCK +static inline void +work_todo (void) +{ + unsigned long ret; + g_val = g_val + 1; + g_val2 = g_val2 + 1; + ret = __sync_val_compare_and_swap (&cmplock, 0, 1); + g_val3 = g_val3 + 1 + ret; +} +#endif + +void * +work_thread (void *arg) +{ + long i; + unsigned long pid = (unsigned long) arg; + struct stats_result *res; + unsigned long long start, end; + int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE, + roundup (sizeof (*res), CACHELINE_SIZE)); + if (err_ret) + { + printf ("posix_memalign failure: %s\n", strerror (err_ret)); + exit (err_ret); + } + long num = 0; + +#ifdef USE_NUMA_SPINLOCK + struct work_todo_argument work_todo_arg; + struct numa_spinlock_info lock_info; + + if (numa_spinlock_init (lock, &lock_info)) + { + printf ("numa_spinlock_init failure: %m\n"); + exit (1); + } + + work_todo_arg.v1 = &g_val; + work_todo_arg.v2 = &g_val2; + work_todo_arg.v3 = &g_val3; + work_todo_arg.v4 = &cmplock; + lock_info.argument = &work_todo_arg; + lock_info.workload = work_todo; +#endif + + while (!start_thread) + pause (); + + unsigned int aux; + start = __builtin_ia32_rdtscp (&aux); + for (i = 0; i < iterations; i++) + { +#ifdef USE_NUMA_SPINLOCK + do_work (&lock_info); +#else + do_work (); +#endif + wait_a_bit (delay_time_unlocked); + num++; + } + end = __builtin_ia32_rdtscp (&aux); + gcount[pid].total = end - start; + res->num = num; + + return res; +} + +void +init_global_data(void) +{ + g_val = 0; + g_val2 = 0; + g_val3 = 0; + cmplock = 0; +} + +void +test_threads (int numthreads, int numprocs, unsigned long time) +{ + start_thread = 0; + +#ifdef USE_NUMA_SPINLOCK + lock = numa_spinlock_alloc (); +#endif + + memory_barrier (); + + pthread_t thr[numthreads]; + void *res[numthreads]; + int i; + + init_global_data (); + for (i = 0; i < numthreads; i++) + { + pthread_attr_t attr; + const pthread_attr_t *attrp = NULL; + if (USE_PTHREAD_ATTR_SETAFFINITY_NP) + { + attrp = &attr; + pthread_attr_init (&attr); + cpu_set_t set; + CPU_ZERO (&set); + int cpu = i % numprocs; + (void) CPU_SET (cpu, &set); + pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set); + } + int err_ret = pthread_create (&thr[i], attrp, ops->test, + (void *)(uintptr_t) i); + if (err_ret != 0) + { + printf ("pthread_create failed: %d, %s\n", + i, strerror (i)); + numthreads = i; + break; + } + } + + memory_barrier (); + start_thread = 1; + memory_barrier (); + sched_yield (); + + if (time) + { + struct timespec ts = + { + ts.tv_sec = time, + ts.tv_nsec = 0 + }; + clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL); + memory_barrier (); + } + + for (i = 0; i < numthreads; i++) + { + if (pthread_join (thr[i], (void *) &res[i]) == 0) + free (res[i]); + else + printf ("pthread_join failure: %m\n"); + } + +#ifdef USE_NUMA_SPINLOCK + numa_spinlock_free (lock); +#endif +} + +struct ops hashwork_ops = +{ + .test = work_thread, +}; + +struct ops *ops; + +static struct count +total_cost (int numthreads, int numprocs) +{ + int i; + unsigned long long total = 0; + unsigned long long spinlock = 0; + + memset (gcount, 0, sizeof(gcount[0]) * numthreads); + + unsigned long long start, end, diff; + unsigned int aux; + + start = __builtin_ia32_rdtscp (&aux); + test_threads (numthreads, numprocs, constant_time); + end = __builtin_ia32_rdtscp (&aux); + diff = end - start; + + for (i = 0; i < numthreads; i++) + { + total += gcount[i].total; + spinlock += gcount[i].spinlock; + } + + struct count cost = { total, spinlock, diff }; + return cost; +} + +#ifdef MODULE_NAME +static int +do_test (void) +{ + if (!CPU_FEATURE_USABLE (RDTSCP)) + return EXIT_UNSUPPORTED; +#else +int +main (void) +{ +#endif + int numprocs = sysconf (_SC_NPROCESSORS_ONLN); + + /* Oversubscribe CPU. */ + int numthreads = 4 * numprocs; + + ops = &hashwork_ops; + + int err_ret = posix_memalign ((void **)&gcount, 4096, + sizeof(gcount[0]) * numthreads); + if (err_ret) + { + printf ("posix_memalign failure: %s\n", strerror (err_ret)); + exit (err_ret); + } + + struct count cost, cost1; + double overhead; + int i, last; + int last_increment = numprocs < 16 ? 16 : numprocs; + int numprocs_done = 0; + int numprocs_reset = 0; + cost1 = total_cost (1, numprocs); + + printf ("Number of processors: %d, Single thread time %lld\n\n", + numprocs, cost1.total); + + for (last = i = 2; i <= numthreads;) + { + last = i; + cost = total_cost (i, numprocs); + overhead = cost.total; + overhead /= i; + overhead /= cost1.total; + printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n", + i, cost.total, overhead); + if ((i * 2) < numprocs) + i = i * 2; + else if (numprocs_done) + { + if (numprocs_reset) + { + i = numprocs_reset; + numprocs_reset = 0; + } + else + { + if ((i * 2) < numthreads) + i = i * 2; + else + i = i + last_increment; + } + } + else + { + if (numprocs != 2 * i) + numprocs_reset = 2 * i; + i = numprocs; + numprocs_done = 1; + } + } + + if (last != numthreads) + { + i = numthreads; + cost = total_cost (i, numprocs); + overhead = cost.total; + overhead /= i; + overhead /= cost1.total; + printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n", + i, cost.total, overhead); + } + + free (gcount); + return 0; +} + +#ifdef MODULE_NAME +# define TIMEOUT 900 +# include <support/test-driver.c> +#endif diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c new file mode 100644 index 0000000..b3ce567 --- /dev/null +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c @@ -0,0 +1,47 @@ +/* Test case for spinlock overhead. + Copyright (C) 2018 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#ifndef _GNU_SOURCE +# define _GNU_SOURCE +#endif +#include <pthread.h> + +struct +{ + pthread_spinlock_t testlock; + char pad[64 - sizeof (pthread_spinlock_t)]; +} test __attribute__((aligned(64))); + +static void +__attribute__((constructor)) +init_spin (void) +{ + pthread_spin_init (&test.testlock, 0); +} + +static void work_todo (void); + +static inline void +do_work (void) +{ + pthread_spin_lock(&test.testlock); + work_todo (); + pthread_spin_unlock(&test.testlock); +} + +#include "tst-variable-overhead-skeleton.c" diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist index 931c827..e90532e 100644 --- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F GLIBC_2.3.2 pthread_cond_broadcast F GLIBC_2.3.2 pthread_cond_destroy F GLIBC_2.3.2 pthread_cond_init F diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist index c09c9b0..c74febb 100644 --- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F GLIBC_2.28 tss_delete F GLIBC_2.28 tss_get F GLIBC_2.28 tss_set F +GLIBC_2.29 numa_spinlock_alloc F +GLIBC_2.29 numa_spinlock_apply F +GLIBC_2.29 numa_spinlock_free F +GLIBC_2.29 numa_spinlock_init F