NUMA spinlock [BZ #23962]

Message ID 20181226025019.38752-1-ling.ma@MacBook-Pro-8.local
State New
Headers show
Series
  • NUMA spinlock [BZ #23962]
Related show

Commit Message

Ma Ling Dec. 26, 2018, 2:50 a.m.
From: "ling.ma" <ling.ml@antfin.com>

On multi-socket systems, memory is shared across the entire system.
Data access to the local socket is much faster than the remote socket
and data access to the local core is faster than sibling cores on the
same socket.  For serialized workloads with conventional spinlock,
when there is high spinlock contention between threads, lock ping-pong
among sockets becomes the bottleneck and threads spend majority of
their time in spinlock overhead.

On multi-socket systems, the keys to our NUMA spinlock performance
are to minimize cross-socket traffic as well as localize the serialized
workload to one core for execution.  The basic principles of NUMA
spinlock are mainly consisted of following approaches, which reduce
data movement and accelerate critical section, eventually give us
significant performance improvement.

1. MCS spinlock
MCS spinlock help us to reduce the useless lock movement in the
spinning state.  This paper provides a good description for this
kind of lock:
<http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>

2. Critical Section Integration (CSI)
Essentially spinlock is similar to that one core complete critical
sections one by one. So when contention happen, the serialized works
are sent to the core who is the lock owner and responsible to execute
them, that can save much time and power, because all shared data are
located in private cache of the lock owner.

We implemented this mechanism based on queued spinlock in kernel, that
speeds up critical section, and reduces the probability of contention.
The paper provides a good description for this kind of lock:
<https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>

3. NUMA Aware Spinlock (NAS)
Currently multi-socket systems give us better performance per watt,
however that also involves more complex synchronization requirement,
because off-chip data movement is much slower. We use distributed
synchronization mechanism to decrease Lock cache line to and from
different nodes. The paper provides a good description for this kind
of lock:
<https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>

4. Yield Schedule
When threads are applying for Critical Section Integration(CSI) with
known contention, they will delegate work to the thread who is the
lock owner, and wait for work to be completed.  The resources which
they are using should be transferred to other threads. In order to
accelerate the scenario, we introduce yield_sched function during
spinning stage.

5. Optimization when NUMA is ON or OFF.
Although programs can access memory with lower latency when NUMA is
enabled, some programs may need more memory bandwidth for computation
with NUMA disabled.  We also optimize multi-socket systems with NUMA
disabled.

NUMA spinlock flow chart (assuming there are 2 CPU nodes):

1. Threads from node_0/node_1 acquire local lock for node_0/1
respectively.  If the thread succeeds in acquiring local lock, it
goes to step 2, otherwise pushes critical function into current
local work queue, and enters into spinning stage with MCS mode.

2. Threads from node_0/node_1 acquire the global lock.  If it succeeds
in acquiring the global lock as the lock owner, it goes to step 3,
otherwise waits until the lock owner thread releases the global lock.

3. The lock owner thread from node_0/1 enters into critical section,
cleans up work queue by performing all local critical functions
pushed at step 1 with CSI on behalf of other threads and informs
those spinning threads that their works have been done.  It then
releases the local lock.

4. The lock owner thread frees global lock.  If another thread is
waiting at step 2, the lock owner thread passes the global lock to
the waiting thread and returns.  The new lock owner thread enters
into step 3.  If no threads are waiting, the lock owner thread
releases the global lock and returns.  The whole critical section
process is completed.

Steps 1 and 2 mitigate global lock contention.  Only one thread
from different nodes will compete for the global lock in step 2.
Step 3 reduces the global lock & shared data movement because they
are located in the same node as well as the same core.  Our data
shows that Critical Section Integration (CSI) improves data locality
and NUMA-aware spinlock (NAS) helps CSI balance the workload.

NUMA spinlock can greatly speed up critical section on multi-socket
systems.  It should improve spinlock performance on all multi-socket
systems. 

2018-12-24  Ling Ma  <ling.ml@antfin.com>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Wei Xiao  <wei3.xiao@intel.com>

	[BZ #23962]
	* NEWS: Mention NUMA spinlock.
	* manual/examples/numa-spinlock.c: New file.
	* sysdeps/unix/sysv/linux/numa-spinlock-private.h: Likewise.
	* sysdeps/unix/sysv/linux/numa-spinlock.c: Likewise.
	* sysdeps/unix/sysv/linux/numa-spinlock.h: Likewise.
	* sysdeps/unix/sysv/linux/numa_spinlock_alloc.c: Likewise.
	* sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c:
	Likewise.
	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c:
	Likewise.
	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c: Likewise.
	* manual/threads.texi: Document NUMA spinlock.
	* sysdeps/unix/sysv/linux/Makefile (libpthread-sysdep_routines):
	Add numa_spinlock_alloc and numa-spinlock.
	(sysdep_headers): Add numa-spinlock.h.
	* sysdeps/unix/sysv/linux/Versions (libpthread): Add
	numa_spinlock_alloc, numa_spinlock_free, numa_spinlock_init
	and numa_spinlock_apply to GLIBC_2.29.
	* sysdeps/unix/sysv/linux/aarch64/libpthread.abilist: Add
	numa_spinlock_alloc, numa_spinlock_apply, numa_spinlock_free
	and numa_spinlock_init.
	* sysdeps/unix/sysv/linux/alpha/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/arm/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/csky/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/hppa/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/i386/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/ia64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/nios2/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/sh/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86/Makefile (xtests): Add
	tst-variable-overhead and tst-numa-variable-overhead.
	

Signed-off-by: Ling Ma  <ling.ml@antfin.com>
Signed-off-by: H.J. Lu  <hongjiu.lu@intel.com>
Signed-off-by: Wei Xiao  <wei3.xiao@intel.com>
---
 NEWS                                               |   3 +
 manual/examples/numa-spinlock.c                    |  99 ++++++
 manual/threads.texi                                | 105 ++++++
 sysdeps/unix/sysv/linux/Makefile                   |   2 +
 sysdeps/unix/sysv/linux/Versions                   |   9 +
 sysdeps/unix/sysv/linux/aarch64/libpthread.abilist |   4 +
 sysdeps/unix/sysv/linux/alpha/libpthread.abilist   |   4 +
 sysdeps/unix/sysv/linux/arm/libpthread.abilist     |   4 +
 sysdeps/unix/sysv/linux/csky/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/hppa/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/i386/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/ia64/libpthread.abilist    |   4 +
 .../sysv/linux/m68k/coldfire/libpthread.abilist    |   4 +
 .../unix/sysv/linux/m68k/m680x0/libpthread.abilist |   4 +
 .../unix/sysv/linux/microblaze/libpthread.abilist  |   4 +
 .../unix/sysv/linux/mips/mips32/libpthread.abilist |   4 +
 .../unix/sysv/linux/mips/mips64/libpthread.abilist |   4 +
 sysdeps/unix/sysv/linux/nios2/libpthread.abilist   |   4 +
 sysdeps/unix/sysv/linux/numa-spinlock-private.h    |  38 ++
 sysdeps/unix/sysv/linux/numa-spinlock.c            | 327 ++++++++++++++++++
 sysdeps/unix/sysv/linux/numa-spinlock.h            |  64 ++++
 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c      | 304 ++++++++++++++++
 .../linux/powerpc/powerpc32/libpthread.abilist     |   4 +
 .../linux/powerpc/powerpc64/be/libpthread.abilist  |   4 +
 .../linux/powerpc/powerpc64/le/libpthread.abilist  |   4 +
 .../unix/sysv/linux/riscv/rv64/libpthread.abilist  |   4 +
 .../sysv/linux/s390/s390-32/libpthread.abilist     |   4 +
 .../sysv/linux/s390/s390-64/libpthread.abilist     |   4 +
 sysdeps/unix/sysv/linux/sh/libpthread.abilist      |   4 +
 .../sysv/linux/sparc/sparc32/libpthread.abilist    |   4 +
 .../sysv/linux/sparc/sparc64/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/x86/Makefile               |   1 +
 .../sysv/linux/x86/tst-numa-variable-overhead.c    |  53 +++
 .../linux/x86/tst-variable-overhead-skeleton.c     | 384 +++++++++++++++++++++
 .../unix/sysv/linux/x86/tst-variable-overhead.c    |  47 +++
 .../unix/sysv/linux/x86_64/64/libpthread.abilist   |   4 +
 .../unix/sysv/linux/x86_64/x32/libpthread.abilist  |   4 +
 37 files changed, 1532 insertions(+)
 create mode 100644 manual/examples/numa-spinlock.c
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
 create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
 create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
 create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c

Comments

=?UTF-8?B?6ams5YeMKOW9puWGmyk=?= Jan. 3, 2019, 4:06 a.m. | #1
Hi all,
Hi all,
In the patch we provide the test case (update on shared data) to simulate our real workload, 
the test case shows data as below:
(Testing on 2 sockets platform , 104 cores with 256G RAM, HT=on)
1. original spinlock
#./tst-variable-overhead 
Number of processors: 104, Single thread time 24913702

Number of threads:    2, Total time      138311026, Overhead: 2.78
Number of threads:    4, Total time      643308290, Overhead: 6.46
Number of threads:    8, Total time     2607024872, Overhead: 13.08
Number of threads:   16, Total time    12196852896, Overhead: 30.60
Number of threads:   32, Total time    80141230706, Overhead: 100.52
Number of threads:   64, Total time   385495457274, Overhead: 241.77
Number of threads:  104, Total time   885472451762, Overhead: 341.75
Number of threads:  128, Total time  1543267108944, Overhead: 483.94
Number of threads:  256, Total time 11780541636054, Overhead: 1847.09
Number of threads:  360, Total time 32000897533528, Overhead: 3567.97
Number of threads:  416, Total time 48862502147056, Overhead: 4714.59

2. numa spinlock
#./tst-numa-variable-overhead
Number of processors: 104, Single thread time 27906178

Number of threads:    2, Total time      137911942, Overhead: 2.47
Number of threads:    4, Total time      339251710, Overhead: 3.04
Number of threads:    8, Total time      723663466, Overhead: 3.24
Number of threads:   16, Total time     2380050348, Overhead: 5.33
Number of threads:   32, Total time     8084855738, Overhead: 9.05
Number of threads:   64, Total time    27646507880, Overhead: 15.48
Number of threads:  104, Total time    65470567548, Overhead: 22.56
Number of threads:  128, Total time    97966785076, Overhead: 27.43
Number of threads:  256, Total time  1110168309448, Overhead: 155.40
Number of threads:  360, Total time  2516015158740, Overhead: 250.44
Number of threads:  416, Total time  4219233078472, Overhead: 363.45

The above data tell us the numa spinlock improve performance upto 15X, 
Our real online workload indicate numa spinlock improve the system throughput 14.5%.

Appreciate your any comments.

Thanks a lot
Ling

在 2018/12/26 上午10:50,“Ma Ling”<ling.ma.program@gmail.com> 写入:

    From: "ling.ma" <ling.ml@antfin.com>
    
    On multi-socket systems, memory is shared across the entire system.
    Data access to the local socket is much faster than the remote socket
    and data access to the local core is faster than sibling cores on the
    same socket.  For serialized workloads with conventional spinlock,
    when there is high spinlock contention between threads, lock ping-pong
    among sockets becomes the bottleneck and threads spend majority of
    their time in spinlock overhead.
    
    On multi-socket systems, the keys to our NUMA spinlock performance
    are to minimize cross-socket traffic as well as localize the serialized
    workload to one core for execution.  The basic principles of NUMA
    spinlock are mainly consisted of following approaches, which reduce
    data movement and accelerate critical section, eventually give us
    significant performance improvement.
    
    1. MCS spinlock
    MCS spinlock help us to reduce the useless lock movement in the
    spinning state.  This paper provides a good description for this
    kind of lock:
    <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
    
    2. Critical Section Integration (CSI)
    Essentially spinlock is similar to that one core complete critical
    sections one by one. So when contention happen, the serialized works
    are sent to the core who is the lock owner and responsible to execute
    them, that can save much time and power, because all shared data are
    located in private cache of the lock owner.
    
    We implemented this mechanism based on queued spinlock in kernel, that
    speeds up critical section, and reduces the probability of contention.
    The paper provides a good description for this kind of lock:
    <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
    
    3. NUMA Aware Spinlock (NAS)
    Currently multi-socket systems give us better performance per watt,
    however that also involves more complex synchronization requirement,
    because off-chip data movement is much slower. We use distributed
    synchronization mechanism to decrease Lock cache line to and from
    different nodes. The paper provides a good description for this kind
    of lock:
    <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
    
    4. Yield Schedule
    When threads are applying for Critical Section Integration(CSI) with
    known contention, they will delegate work to the thread who is the
    lock owner, and wait for work to be completed.  The resources which
    they are using should be transferred to other threads. In order to
    accelerate the scenario, we introduce yield_sched function during
    spinning stage.
    
    5. Optimization when NUMA is ON or OFF.
    Although programs can access memory with lower latency when NUMA is
    enabled, some programs may need more memory bandwidth for computation
    with NUMA disabled.  We also optimize multi-socket systems with NUMA
    disabled.
    
    NUMA spinlock flow chart (assuming there are 2 CPU nodes):
    
    1. Threads from node_0/node_1 acquire local lock for node_0/1
    respectively.  If the thread succeeds in acquiring local lock, it
    goes to step 2, otherwise pushes critical function into current
    local work queue, and enters into spinning stage with MCS mode.
    
    2. Threads from node_0/node_1 acquire the global lock.  If it succeeds
    in acquiring the global lock as the lock owner, it goes to step 3,
    otherwise waits until the lock owner thread releases the global lock.
    
    3. The lock owner thread from node_0/1 enters into critical section,
    cleans up work queue by performing all local critical functions
    pushed at step 1 with CSI on behalf of other threads and informs
    those spinning threads that their works have been done.  It then
    releases the local lock.
    
    4. The lock owner thread frees global lock.  If another thread is
    waiting at step 2, the lock owner thread passes the global lock to
    the waiting thread and returns.  The new lock owner thread enters
    into step 3.  If no threads are waiting, the lock owner thread
    releases the global lock and returns.  The whole critical section
    process is completed.
    
    Steps 1 and 2 mitigate global lock contention.  Only one thread
    from different nodes will compete for the global lock in step 2.
    Step 3 reduces the global lock & shared data movement because they
    are located in the same node as well as the same core.  Our data
    shows that Critical Section Integration (CSI) improves data locality
    and NUMA-aware spinlock (NAS) helps CSI balance the workload.
    
    NUMA spinlock can greatly speed up critical section on multi-socket
    systems.  It should improve spinlock performance on all multi-socket
    systems. 
    
    2018-12-24  Ling Ma  <ling.ml@antfin.com>
    	    H.J. Lu  <hongjiu.lu@intel.com>
    	    Wei Xiao  <wei3.xiao@intel.com>
    
    	[BZ #23962]
    	* NEWS: Mention NUMA spinlock.
    	* manual/examples/numa-spinlock.c: New file.
    	* sysdeps/unix/sysv/linux/numa-spinlock-private.h: Likewise.
    	* sysdeps/unix/sysv/linux/numa-spinlock.c: Likewise.
    	* sysdeps/unix/sysv/linux/numa-spinlock.h: Likewise.
    	* sysdeps/unix/sysv/linux/numa_spinlock_alloc.c: Likewise.
    	* sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c:
    	Likewise.
    	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c:
    	Likewise.
    	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c: Likewise.
    	* manual/threads.texi: Document NUMA spinlock.
    	* sysdeps/unix/sysv/linux/Makefile (libpthread-sysdep_routines):
    	Add numa_spinlock_alloc and numa-spinlock.
    	(sysdep_headers): Add numa-spinlock.h.
    	* sysdeps/unix/sysv/linux/Versions (libpthread): Add
    	numa_spinlock_alloc, numa_spinlock_free, numa_spinlock_init
    	and numa_spinlock_apply to GLIBC_2.29.
    	* sysdeps/unix/sysv/linux/aarch64/libpthread.abilist: Add
    	numa_spinlock_alloc, numa_spinlock_apply, numa_spinlock_free
    	and numa_spinlock_init.
    	* sysdeps/unix/sysv/linux/alpha/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/arm/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/csky/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/hppa/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/i386/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/ia64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/microblaze/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/nios2/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/sh/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/x86/Makefile (xtests): Add
    	tst-variable-overhead and tst-numa-variable-overhead.
    	
    
    Signed-off-by: Ling Ma  <ling.ml@antfin.com>
    Signed-off-by: H.J. Lu  <hongjiu.lu@intel.com>
    Signed-off-by: Wei Xiao  <wei3.xiao@intel.com>
    ---
     NEWS                                               |   3 +
     manual/examples/numa-spinlock.c                    |  99 ++++++
     manual/threads.texi                                | 105 ++++++
     sysdeps/unix/sysv/linux/Makefile                   |   2 +
     sysdeps/unix/sysv/linux/Versions                   |   9 +
     sysdeps/unix/sysv/linux/aarch64/libpthread.abilist |   4 +
     sysdeps/unix/sysv/linux/alpha/libpthread.abilist   |   4 +
     sysdeps/unix/sysv/linux/arm/libpthread.abilist     |   4 +
     sysdeps/unix/sysv/linux/csky/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/hppa/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/i386/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/ia64/libpthread.abilist    |   4 +
     .../sysv/linux/m68k/coldfire/libpthread.abilist    |   4 +
     .../unix/sysv/linux/m68k/m680x0/libpthread.abilist |   4 +
     .../unix/sysv/linux/microblaze/libpthread.abilist  |   4 +
     .../unix/sysv/linux/mips/mips32/libpthread.abilist |   4 +
     .../unix/sysv/linux/mips/mips64/libpthread.abilist |   4 +
     sysdeps/unix/sysv/linux/nios2/libpthread.abilist   |   4 +
     sysdeps/unix/sysv/linux/numa-spinlock-private.h    |  38 ++
     sysdeps/unix/sysv/linux/numa-spinlock.c            | 327 ++++++++++++++++++
     sysdeps/unix/sysv/linux/numa-spinlock.h            |  64 ++++
     sysdeps/unix/sysv/linux/numa_spinlock_alloc.c      | 304 ++++++++++++++++
     .../linux/powerpc/powerpc32/libpthread.abilist     |   4 +
     .../linux/powerpc/powerpc64/be/libpthread.abilist  |   4 +
     .../linux/powerpc/powerpc64/le/libpthread.abilist  |   4 +
     .../unix/sysv/linux/riscv/rv64/libpthread.abilist  |   4 +
     .../sysv/linux/s390/s390-32/libpthread.abilist     |   4 +
     .../sysv/linux/s390/s390-64/libpthread.abilist     |   4 +
     sysdeps/unix/sysv/linux/sh/libpthread.abilist      |   4 +
     .../sysv/linux/sparc/sparc32/libpthread.abilist    |   4 +
     .../sysv/linux/sparc/sparc64/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/x86/Makefile               |   1 +
     .../sysv/linux/x86/tst-numa-variable-overhead.c    |  53 +++
     .../linux/x86/tst-variable-overhead-skeleton.c     | 384 +++++++++++++++++++++
     .../unix/sysv/linux/x86/tst-variable-overhead.c    |  47 +++
     .../unix/sysv/linux/x86_64/64/libpthread.abilist   |   4 +
     .../unix/sysv/linux/x86_64/x32/libpthread.abilist  |   4 +
     37 files changed, 1532 insertions(+)
     create mode 100644 manual/examples/numa-spinlock.c
     create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
     create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
     create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
     create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
     create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
     create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
     create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    
    diff --git a/NEWS b/NEWS
    index cd29ec7..0764d0d 100644
    --- a/NEWS
    +++ b/NEWS
    @@ -9,6 +9,9 @@ Version 2.29
     
     Major new features:
     
    +* NUMA spinlock is added to provide a spinlock implementation optimized
    +  for multi-socket NUMA systems.
    +
     * The getcpu wrapper function has been added, which returns the currently
       used CPU and NUMA node.  This function is Linux-specific.
     
    diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c
    new file mode 100644
    index 0000000..ca98443
    --- /dev/null
    +++ b/manual/examples/numa-spinlock.c
    @@ -0,0 +1,99 @@
    +/* NUMA spinlock example.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +
    +   This program is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU General Public License
    +   as published by the Free Software Foundation; either version 2
    +   of the License, or (at your option) any later version.
    +
    +   This program is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    +   GNU General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.
    +*/
    +
    +#include <pthread.h>
    +#include <stdio.h>
    +#include <stdlib.h>
    +#include <unistd.h>
    +#include <errno.h>
    +#include <string.h>
    +#include <numa-spinlock.h>
    +
    +#define NUM_THREADS	20
    +
    +struct numa_spinlock *lock;
    +
    +struct work_todo_argument
    +{
    +  void *arg;
    +};
    +
    +static void *
    +work_todo (void *v)
    +{
    +  /* Do the real work with p->arg. */
    +  struct work_todo_argument *p = v;
    +  /* Return value is set to lock_info.result. */
    +  return NULL;
    +}
    +
    +void *
    +work_thread (void *arg)
    +{
    +  struct work_todo_argument work_todo_arg;
    +  struct numa_spinlock_info lock_info;
    +
    +  if (numa_spinlock_init (lock, &lock_info))
    +    {
    +      printf ("numa_spinlock_init failure: %m\n");
    +      exit (1);
    +    }
    +
    +  work_todo_arg.arg = arg;
    +  lock_info.argument = &work_todo_arg;
    +  lock_info.workload = work_todo;
    +
    +  numa_spinlock_apply (&lock_info);
    +
    +  return lock_info.result;
    +}
    +
    +int
    +main (int argc, char **argv)
    +{
    +  lock = numa_spinlock_alloc ();
    +  pthread_t thr[NUM_THREADS];
    +  void *res[NUM_THREADS];
    +  int numthreads = NUM_THREADS;
    +  int i;
    +
    +  for (i = 0; i < NUM_THREADS; i++)
    +    {
    +      int err_ret = pthread_create (&thr[i], NULL, work_thread,
    +				    (void *) (intptr_t) i);
    +      if (err_ret != 0)
    +	{
    +	  printf ("pthread_create failed: %d, %s\n",
    +		  i, strerror (i));
    +	  numthreads = i;
    +	  break;
    +	}
    +    }
    +
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      if (pthread_join (thr[i], (void *) &res[i]) == 0)
    +	free (res[i]);
    +      else
    +	printf ("pthread_join failure: %m\n");
    +    }
    +
    +  numa_spinlock_free (lock);
    +
    +  return 0;
    +}
    diff --git a/manual/threads.texi b/manual/threads.texi
    index 87fda7d..e82ae0d 100644
    --- a/manual/threads.texi
    +++ b/manual/threads.texi
    @@ -625,6 +625,9 @@ the standard.
     @menu
     * Default Thread Attributes::             Setting default attributes for
     					  threads in a process.
    +* NUMA Spinlock::                         Spinlock optimized for
    +					  multi-socket NUMA platform.
    +* NUMA Spinlock Example::                 A NUMA spinlock example.
     @end menu
     
     @node Default Thread Attributes
    @@ -669,6 +672,108 @@ The system does not have sufficient memory.
     @end table
     @end deftypefun
     
    +@node NUMA Spinlock
    +@subsubsection Spinlock optimized for multi-node NUMA systems
    +
    +To improve performance on multi-socket NUMA platforms for serialized
    +region protected by spinlock, @theglibc{} implements a NUMA spinlock
    +object, which minimizes cross-socket traffic and sends the protected
    +serialized region to one core for execution to reduce spinlock contention
    +overhead.
    +
    +The fundamental data types for a NUMA spinlock are
    +@code{numa_spinlock} and @code{numa_spinlock_info}:
    +
    +@deftp {Data Type} {struct numa_spinlock}
    +@standards{Linux, numa-spinlock.h}
    +This data type is an opaque structure.  A @code{numa_spinlock} pointer
    +uniquely identifies a NUMA spinlock object.
    +@end deftp
    +
    +@deftp {Data Type} {struct numa_spinlock_info}
    +@standards{Linux, numa-spinlock.h}
    +
    +This data type uniquely identifies a NUMA spinlock information object for
    +a thread.  It has the following members, and others internal to NUMA
    +spinlock implemenation:
    +
    +@table @code
    +@item void *(*workload) (void *)
    +A function pointer to the workload function serialized by spinlock.
    +@item void *argument
    +A pointer to argument passed to the @var{workload} function pointer.
    +@item void *result
    +Return value from the @var{workload} function pointer.
    +@end table
    +
    +@end deftp
    +
    +The following functions are provided for NUMA spinlock objects:
    +
    +@deftypefun struct numa_spinlock *numa_spinlock_alloc (void)
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
    +
    +This function returns a pointer to a newly allocated NUMA spinlock or a
    +null pointer if the NUMA spinlock could not be allocated, setting
    +@code{errno} to @code{ENOMEM}.  Caller should call
    +@code{numa_spinlock_free} on the NUMA spinlock pointer to free the
    +memory space when it is no longer needed.
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock})
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
    +
    +Free the memory space pointed to by @var{lock}, which must have been
    +returned by a previous call to @code{numa_spinlock_alloc}.  Otherwise,
    +or if @code{numa_spinlock_free (@var{lock})} has already been called
    +before, undefined behavior occurs.
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock},
    +struct numa_spinlock_info *@var{info})
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
    +
    +Initialize the NUMA spinlock information block pointed to by @var{info}
    +with a NUMA spinlock pointer @var{lock}.  The return value is @code{0} on
    +success and @code{-1} on failure.  The following @code{errno} error
    +codes are defined for this function:
    +
    +@table @code
    +@item ENOSYS
    +The operating system does not support the @code{getcpu} function.
    +@end table
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info})
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
    +
    +Apply for spinlock with a NUMA spinlock information block pointed to by
    +@var{info}.  When @code{numa_spinlock_apply} returns, the spinlock is
    +released and the @var{result} member of @var{info} contains the return
    +value of the @var{workload} member.
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@node NUMA Spinlock Example
    +@subsubsection NUMA Spinlock Example
    +
    +A NUMA spinlock example:
    +
    +@smallexample
    +@include numa-spinlock.c.texi
    +@end smallexample
    +
     @c FIXME these are undocumented:
     @c pthread_atfork
     @c pthread_attr_destroy
    diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
    index f827455..3361597 100644
    --- a/sysdeps/unix/sysv/linux/Makefile
    +++ b/sysdeps/unix/sysv/linux/Makefile
    @@ -222,6 +222,8 @@ CFLAGS-gai.c += -DNEED_NETLINK
     endif
     
     ifeq ($(subdir),nptl)
    +libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock
    +sysdep_headers += numa-spinlock.h
     tests += tst-align-clone tst-getpid1 \
     	tst-thread-affinity-pthread tst-thread-affinity-pthread2 \
     	tst-thread-affinity-sched
    diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
    index f1e12d9..7ce7e2b 100644
    --- a/sysdeps/unix/sysv/linux/Versions
    +++ b/sysdeps/unix/sysv/linux/Versions
    @@ -185,3 +185,12 @@ libc {
         __netlink_assert_response;
       }
     }
    +
    +libpthread {
    +  GLIBC_2.29 {
    +    numa_spinlock_alloc;
    +    numa_spinlock_free;
    +    numa_spinlock_init;
    +    numa_spinlock_apply;
    +  }
    +}
    diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
    index 9a9e4ce..eb54a83 100644
    --- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
    index b413007..dd08796 100644
    --- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
    index af82a4c..45a5c5a 100644
    --- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
    @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.4 _IO_flockfile F
     GLIBC_2.4 _IO_ftrylockfile F
     GLIBC_2.4 _IO_funlockfile F
    diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
    index ea4b79a..cf65f72 100644
    --- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
    @@ -73,6 +73,10 @@ GLIBC_2.29 mtx_timedlock F
     GLIBC_2.29 mtx_trylock F
     GLIBC_2.29 mtx_unlock F
     GLIBC_2.29 nanosleep F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.29 open F
     GLIBC_2.29 open64 F
     GLIBC_2.29 pause F
    diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
    index bcba07f..a80475f 100644
    --- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
    index bece86d..40ac05a 100644
    --- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
    index ccc9449..5b190f6 100644
    --- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
    index af82a4c..45a5c5a 100644
    --- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
    @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.4 _IO_flockfile F
     GLIBC_2.4 _IO_ftrylockfile F
     GLIBC_2.4 _IO_funlockfile F
    diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
    index bece86d..40ac05a 100644
    --- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
    index 5067375..e6539bf 100644
    --- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
    index 0214496..76edcb8 100644
    --- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
    index 0214496..76edcb8 100644
    --- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
    index 78cac2a..3141d08 100644
    --- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
    @@ -241,3 +241,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
    new file mode 100644
    index 0000000..0f7a3be
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
    @@ -0,0 +1,38 @@
    +/* Internal definitions and declarations for NUMA spinlock.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#include "numa-spinlock.h"
    +
    +/* The global NUMA spinlock.  */
    +struct numa_spinlock
    +{
    +  /* List of threads who owns the global NUMA spinlock.  */
    +  struct numa_spinlock_info *owner;
    +  /* The maximium NUMA node number.  */
    +  unsigned int max_node;
    +  /* Non-zero for single node system.  */
    +  unsigned int single_node;
    +  /* The maximium CPU number.  Used only when NUMA is disabled.  */
    +  unsigned int max_cpu;
    +  /* Array of physical_package_id of each core if it isn't NULL.  Used
    +     only when NUMA is disabled.*/
    +  unsigned int *physical_package_id_p;
    +  /* Arrays of lists of threads who are spinning for the local NUMA lock
    +     on NUMA nodes indexed by NUMA node number.  */
    +  struct numa_spinlock_info *lists[];
    +};
    diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c
    new file mode 100644
    index 0000000..a141e7d
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa-spinlock.c
    @@ -0,0 +1,327 @@
    +/* NUMA spinlock
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#include <config.h>
    +#include <string.h>
    +#include <stdlib.h>
    +#include <sched.h>
    +#ifndef HAVE_GETCPU
    +#include <unistd.h>
    +#include <syscall.h>
    +#endif
    +#include <errno.h>
    +#include <atomic.h>
    +#include "numa-spinlock-private.h"
    +
    +#if !defined HAVE_GETCPU && defined _LIBC
    +# define HAVE_GETCPU
    +#endif
    +
    +/* On multi-socket systems, memory is shared across the entire system.
    +   Data access to the local socket is much faster than the remote socket
    +   and data access to the local core is faster than sibling cores on the
    +   same socket.  For serialized workloads with conventional spinlock,
    +   when there is high spinlock contention between threads, lock ping-pong
    +   among sockets becomes the bottleneck and threads spend majority of
    +   their time in spinlock overhead.
    +
    +   On multi-socket systems, the keys to our NUMA spinlock performance
    +   are to minimize cross-socket traffic as well as localize the serialized
    +   workload to one core for execution.  The basic principles of NUMA
    +   spinlock are mainly consisted of following approaches, which reduce
    +   data movement and accelerate critical section, eventually give us
    +   significant performance improvement.
    +
    +   1. MCS spinlock
    +   MCS spinlock help us to reduce the useless lock movement in the
    +   spinning state.  This paper provides a good description for this
    +   kind of lock:
    +   <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
    +
    +   2. Critical Section Integration (CSI)
    +   Essentially spinlock is similar to that one core complete critical
    +   sections one by one. So when contention happen, the serialized works
    +   are sent to the core who is the lock owner and responsible to execute
    +   them, that can save much time and power, because all shared data are
    +   located in private cache of the lock owner.
    +
    +   We implemented this mechanism based on queued spinlock in kernel, that
    +   speeds up critical section, and reduces the probability of contention.
    +   The paper provides a good description for this kind of lock:
    +   <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
    +
    +   3. NUMA Aware Spinlock (NAS)
    +   Currently multi-socket systems give us better performance per watt,
    +   however that also involves more complex synchronization requirement,
    +   because off-chip data movement is much slower. We use distributed
    +   synchronization mechanism to decrease Lock cache line to and from
    +   different nodes. The paper provides a good description for this kind
    +   of lock:
    +   <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
    +
    +   4. Yield Schedule
    +   When threads are applying for Critical Section Integration(CSI) with
    +   known contention, they will delegate work to the thread who is the
    +   lock owner, and wait for work to be completed.  The resources which
    +   they are using should be transferred to other threads. In order to
    +   accelerate the scenario, we introduce yield_sched function during
    +   spinning stage.
    +
    +   5. Optimization when NUMA is ON or OFF.
    +   Although programs can access memory with lower latency when NUMA is
    +   enabled, some programs may need more memory bandwidth for computation
    +   with NUMA disabled.  We also optimize multi-socket systems with NUMA
    +   disabled.
    +
    +   NUMA spinlock flow chart (assuming there are 2 CPU nodes):
    +
    +   1. Threads from node_0/node_1 acquire local lock for node_0/1
    +   respectively.  If the thread succeeds in acquiring local lock, it
    +   goes to step 2, otherwise pushes critical function into current
    +   local work queue, and enters into spinning stage with MCS mode.
    +
    +   2. Threads from node_0/node_1 acquire the global lock.  If it succeeds
    +   in acquiring the global lock as the lock owner, it goes to step 3,
    +   otherwise waits until the lock owner thread releases the global lock.
    +
    +   3. The lock owner thread from node_0/1 enters into critical section,
    +   cleans up work queue by performing all local critical functions
    +   pushed at step 1 with CSI on behalf of other threads and informs
    +   those spinning threads that their works have been done.  It then
    +   releases the local lock.
    +
    +   4. The lock owner thread frees global lock.  If another thread is
    +   waiting at step 2, the lock owner thread passes the global lock to
    +   the waiting thread and returns.  The new lock owner thread enters
    +   into step 3.  If no threads are waiting, the lock owner thread
    +   releases the global lock and returns.  The whole critical section
    +   process is completed.
    +
    +   Steps 1 and 2 mitigate global lock contention.  Only one thread
    +   from different nodes will compete for the global lock in step 2.
    +   Step 3 reduces the global lock & shared data movement because they
    +   are located in the same node as well as the same core.  Our data
    +   shows that Critical Section Integration (CSI) improves data locality
    +   and NUMA-aware spinlock (NAS) helps CSI balance the workload.
    +
    +   NUMA spinlock can greatly speed up critical section on multi-socket
    +   systems.  It should improve spinlock performance on all multi-socket
    +   systems.
    +
    +   NOTE: LiTL <https://github.com/multicore-locks/litl>, is an open-source
    +   project that provides implementations of dozens of various locks,
    +   including several state-of-the-art NUMA-aware spinlocks.  Among them
    +
    +   1. Hierarchical MCS (HMCS) spinlock.  Milind Chabbi, Michael Fagan,
    +   and John Mellor-Crummey. High Performance Locks for Multi-level NUMA
    +   Systems.  In Proceedings of the ACM SIGPLAN Symposium on Principles
    +   and Practice of Parallel Programming (PPoPP), pages 215–226, 2015.
    +
    +   2. Cohort-MCS (C-MCS) spinlock.  Dave Dice, Virendra J. Marathe, and
    +   Nir Shavit.  Lock Cohorting: A General Technique for Designing NUMA
    +   Locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015.
    + */
    +
    +/* Get the next thread pointed to by *NEXT_P.  NB: We must use a while
    +   spin loop to load NEXT_P since there is a small window before *NEXT_P
    +   is updated.  */
    +
    +static inline struct numa_spinlock_info *
    +get_numa_spinlock_info_next (struct numa_spinlock_info **next_p)
    +{
    +  struct numa_spinlock_info *next;
    +  while (!(next = atomic_load_relaxed (next_p)))
    +    atomic_spin_nop ();
    +  return next;
    +}
    +
    +/* While holding the global NUMA spinlock, run the workload of the
    +   thread pointed to by SELF first, then run the workload for each
    +   thread on the thread list pointed to by HEAD_P and wake up the
    +   thread so that all workloads run on a single processor.  */
    +
    +static inline void
    +run_numa_spinlock (struct numa_spinlock_info *self,
    +		   struct numa_spinlock_info **head_p)
    +{
    +  struct numa_spinlock_info *next, *current;
    +
    +  /* Run the SELF's workload. */
    +  self->result = self->workload (self->argument);
    +
    +  /* Process workloads for the rest of threads on the thread list.
    +     NB: The thread list may be prepended by other threads at the
    +     same time.  */
    +
    +retry:
    +   /* If SELF is the first thread of the thread list pointed to by
    +      HEAD_P, clear the thread list.  */
    +  current = atomic_compare_and_exchange_val_acq (head_p, NULL, self);
    +  if (current == self)
    +    {
    +      /* Since SELF is the only thread on the list, clear SELF's pending
    +         field and return.  */
    +      atomic_store_release (&current->pending, 0);
    +      return;
    +    }
    +
    +  /* CURRENT will have the previous first thread of the thread list
    +     pointed to by HEAD_P and *HEAD_P will point to SELF.  */
    +  current = atomic_exchange_acquire (head_p, self);
    +
    +  /* NB: No need to check if CURRENT == SELF here since SELF can never
    +     be CURRENT.  */
    +
    +repeat:
    +  /* Get the next thread.  */
    +  next = get_numa_spinlock_info_next (&current->next);
    +
    +  /* Run the CURRENT's workload and clear CURRENT's pending field. */
    +  current->result = current->workload (current->argument);
    +  current->pending = 0;
    +
    +  /* Process the workload for each thread from CURRENT to SELF on the
    +     thread list.  Don't pass beyond SELF since SELF is the last thread
    +     on the list.  */
    +  if (next == self)
    +    goto retry;
    +  current = next;
    +  goto repeat;
    +}
    +
    +/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed
    +   to by SELF.  */
    +
    +void
    +numa_spinlock_apply (struct numa_spinlock_info *self)
    +{
    +  struct numa_spinlock *lock = self->lock;
    +  struct numa_spinlock_info *first, *next;
    +  struct numa_spinlock_info **head_p;
    +
    +  self->next = NULL;
    +  /* We want the global NUMA spinlock.  */
    +  self->pending = 1;
    +  /* Select the local NUMA spinlock list by the NUMA node number.  */
    +  head_p = &lock->lists[self->node];
    +  /* FIRST will have the previous first thread of the local NUMA spinlock
    +     list and *HEAD_P will point to SELF.  */
    +  first = atomic_exchange_acquire (head_p, self);
    +  if (first)
    +    {
    +      /* SELF has been prepended to the thread list pointed to by
    +	 HEAD_P.  NB: There is a small window between updating
    +	 *HEAD_P and self->next.  */
    +      atomic_store_release (&self->next, first);
    +      /* Let other threads run first since another thread will run our
    +	 workload for us.  */
    +      sched_yield ();
    +      /* Spin until our PENDING is cleared.  */
    +      while (atomic_load_relaxed (&self->pending))
    +	atomic_spin_nop ();
    +      return;
    +    }
    +
    +  /* NB: Now SELF must be the only thread on the thread list pointed
    +     to by HEAD_P.  Since thread is always prepended to HEAD_P, we
    +     can use *HEAD_P == SELF to check if SELF is the only thread on
    +     the thread list.  */
    +
    +  if (__glibc_unlikely (lock->single_node))
    +    {
    +      /* If there is only one node, there is no need for the global
    +         NUMA spinlock.  */
    +      run_numa_spinlock (self, head_p);
    +      return;
    +    }
    +
    +  /* FIRST will have the previous first thread of the local NUMA spinlock
    +     list of threads which holds the global NUMA spinlock, which will
    +     point to SELF.  */
    +  first = atomic_exchange_acquire (&lock->owner, self);
    +  if (first)
    +    {
    +      /* SELF has been prepended to the thread list pointed to by
    +	 lock->owner.  NB: There is a small window between updating
    +	 *HEAD_P and first->next.  */
    +      atomic_store_release (&first->next, self);
    +      /* Spin until the list of threads which holds the global NUMA
    +	 spinlock clears our PENDING.  */
    +      while (atomic_load_relaxed (&self->pending))
    +	atomic_spin_nop ();
    +    }
    +
    +  /* We get the global NUMA spinlock now.  Run our workload.  */
    +  run_numa_spinlock (self, head_p);
    +
    +  /* SELF is the only thread on the list if SELF is the first thread
    +     of the thread list pointed to by lock->owner.  In this case, we
    +     simply return.  */
    +  if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self))
    +    return;
    +
    +  /* Wake up the next thread.  */
    +  next = get_numa_spinlock_info_next (&self->next);
    +  atomic_store_release (&next->pending, 0);
    +}
    +
    +/* Initialize the NUMA spinlock info data pointed to by INFO from a
    +   pointer to the NUMA spinlock, LOCK.  */
    +
    +int
    +numa_spinlock_init (struct numa_spinlock *lock,
    +			 struct numa_spinlock_info *info)
    +{
    +  memset (info, 0, sizeof (*info));
    +  info->lock = lock;
    +  /* For single node system, use 0 as the NUMA node number.  */
    +  if (lock->single_node)
    +    return 0;
    +  /* NB: Use the NUMA node number from getcpu to select the local NUMA
    +     spinlock list.  */
    +  unsigned int cpu;
    +  unsigned int node;
    +#ifdef HAVE_GETCPU
    +  int err_ret = getcpu (&cpu, &node);
    +#else
    +  int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL);
    +#endif
    +  if (err_ret)
    +    return err_ret;
    +  if (lock->physical_package_id_p)
    +    {
    +      /* Can it ever happen?  */
    +      if (cpu > lock->max_cpu)
    +	cpu = lock->max_cpu;
    +      /* NB: If NUMA is disabled, use physical_package_id.  */
    +      node = lock->physical_package_id_p[cpu];
    +    }
    +  /* Can it ever happen?  */
    +  if (node > lock->max_node)
    +    node = lock->max_node;
    +  info->node = node;
    +  return err_ret;
    +}
    +
    +void
    +numa_spinlock_free (struct numa_spinlock *lock)
    +{
    +  if (lock->physical_package_id_p)
    +    free (lock->physical_package_id_p);
    +  free (lock);
    +}
    diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h
    new file mode 100644
    index 0000000..b17bda5
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa-spinlock.h
    @@ -0,0 +1,64 @@
    +/* Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#ifndef _NUMA_SPINLOCK_H
    +#define _NUMA_SPINLOCK_H
    +
    +#include <features.h>
    +
    +__BEGIN_DECLS
    +
    +/* The NUMA spinlock.  */
    +struct numa_spinlock;
    +
    +/* The NUMA spinlock information for each thread.  */
    +struct numa_spinlock_info
    +{
    +  /* The workload function of this thread.  */
    +  void *(*workload) (void *);
    +  /* The argument pointer passed to the workload function.  */
    +  void *argument;
    +  /* The return value of the workload function.  */
    +  void *result;
    +  /* The pointer to the NUMA spinlock.  */
    +  struct numa_spinlock *lock;
    +  /* The next thread on the local NUMA spinlock thread list.  */
    +  struct numa_spinlock_info *next;
    +  /* The NUMA node number.  */
    +  unsigned int node;
    +  /* Non-zero to indicate that the thread wants the NUMA spinlock.  */
    +  int pending;
    +  /* Reserved for future use.  */
    +  void *__reserved[4];
    +};
    +
    +/* Return a pointer to a newly allocated NUMA spinlock.  */
    +extern struct numa_spinlock *numa_spinlock_alloc (void);
    +
    +/* Free the memory space of the NUMA spinlock.  */
    +extern void numa_spinlock_free (struct numa_spinlock *);
    +
    +/* Initialize the NUMA spinlock information block.  */
    +extern int numa_spinlock_init (struct numa_spinlock *,
    +			       struct numa_spinlock_info *);
    +
    +/* Apply for spinlock with a NUMA spinlock information block.  */
    +extern void numa_spinlock_apply (struct numa_spinlock_info *);
    +
    +__END_DECLS
    +
    +#endif /* numa-spinlock.h */
    diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
    new file mode 100644
    index 0000000..8ff4e1a
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
    @@ -0,0 +1,304 @@
    +/* Initialization of NUMA spinlock.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#include <assert.h>
    +#include <ctype.h>
    +#include <string.h>
    +#include <dirent.h>
    +#include <stdio.h>
    +#include <limits.h>
    +#ifdef _LIBC
    +# include <not-cancel.h>
    +#else
    +# include <stdlib.h>
    +# include <unistd.h>
    +# include <fcntl.h>
    +# define __open_nocancel		open
    +# define __close_nocancel_nostatus	close
    +# define __read_nocancel		read
    +#endif
    +
    +#include "numa-spinlock-private.h"
    +
    +static char *
    +next_line (int fd, char *const buffer, char **cp, char **re,
    +	   char *const buffer_end)
    +{
    +  char *res = *cp;
    +  char *nl = memchr (*cp, '\n', *re - *cp);
    +  if (nl == NULL)
    +    {
    +      if (*cp != buffer)
    +	{
    +	  if (*re == buffer_end)
    +	    {
    +	      memmove (buffer, *cp, *re - *cp);
    +	      *re = buffer + (*re - *cp);
    +	      *cp = buffer;
    +
    +	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
    +	      if (n < 0)
    +		return NULL;
    +
    +	      *re += n;
    +
    +	      nl = memchr (*cp, '\n', *re - *cp);
    +	      while (nl == NULL && *re == buffer_end)
    +		{
    +		  /* Truncate too long lines.  */
    +		  *re = buffer + 3 * (buffer_end - buffer) / 4;
    +		  n = __read_nocancel (fd, *re, buffer_end - *re);
    +		  if (n < 0)
    +		    return NULL;
    +
    +		  nl = memchr (*re, '\n', n);
    +		  **re = '\n';
    +		  *re += n;
    +		}
    +	    }
    +	  else
    +	    nl = memchr (*cp, '\n', *re - *cp);
    +
    +	  res = *cp;
    +	}
    +
    +      if (nl == NULL)
    +	nl = *re - 1;
    +    }
    +
    +  *cp = nl + 1;
    +  assert (*cp <= *re);
    +
    +  return res == *re ? NULL : res;
    +}
    +
    +static int
    +select_cpu (const struct dirent *d)
    +{
    +  /* Return 1 for "cpuXXX" where XXX are digits.  */
    +  if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0)
    +    {
    +      const char *p = d->d_name + 3;
    +
    +      if (*p == '\0')
    +	return 0;
    +
    +      do
    +	{
    +	  if (!isdigit (*p))
    +	    return 0;
    +	  p++;
    +	}
    +      while (*p != '\0');
    +
    +      return 1;
    +    }
    +  return 0;
    +}
    +
    +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
    +   call numa_spinlock_free on the NUMA spinlock pointer to free the
    +   memory when it is no longer needed.  */
    +
    +struct numa_spinlock *
    +numa_spinlock_alloc (void)
    +{
    +  const size_t buffer_size = 1024;
    +  char buffer[buffer_size];
    +  char *buffer_end = buffer + buffer_size;
    +  char *cp = buffer_end;
    +  char *re = buffer_end;
    +
    +  const int flags = O_RDONLY | O_CLOEXEC;
    +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
    +  char *l;
    +  unsigned int max_node = 0;
    +  unsigned int node_count = 0;
    +  if (fd != -1)
    +    {
    +      l = next_line (fd, buffer, &cp, &re, buffer_end);
    +      if (l != NULL)
    +	do
    +	  {
    +	    char *endp;
    +	    unsigned long int n = strtoul (l, &endp, 10);
    +	    if (l == endp)
    +	      {
    +		node_count = 1;
    +		break;
    +	      }
    +
    +	    unsigned long int m = n;
    +	    if (*endp == '-')
    +	      {
    +		l = endp + 1;
    +		m = strtoul (l, &endp, 10);
    +		if (l == endp)
    +		  {
    +		    node_count = 1;
    +		    break;
    +		  }
    +	      }
    +
    +	    node_count += m - n + 1;
    +
    +	    if (m >= max_node)
    +	      max_node = m;
    +
    +	    l = endp;
    +	    while (l < re && isspace (*l))
    +	      ++l;
    +	  }
    +	while (l < re);
    +
    +      __close_nocancel_nostatus (fd);
    +    }
    +
    +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
    +     nodes is 1, set the maximium NUMA node number to 0.  */
    +  if (node_count == 1)
    +    max_node = 0;
    +
    +  unsigned int max_cpu = 0;
    +  unsigned int *physical_package_id_p = NULL;
    +
    +  if (max_node == 0)
    +    {
    +      /* There is at least 1 node.  */
    +      node_count = 1;
    +
    +      /* If NUMA is disabled, use physical_package_id instead.  */
    +      struct dirent **cpu_list;
    +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
    +			    select_cpu, NULL);
    +      if (nprocs > 0)
    +	{
    +	  int i;
    +	  unsigned int *cpu_id_p = NULL;
    +
    +	  /* Find the maximum CPU number.  */
    +	  if (posix_memalign ((void **) &cpu_id_p,
    +			      __alignof__ (void *),
    +			      nprocs * sizeof (unsigned int)) == 0)
    +	    {
    +	      for (i = 0; i < nprocs; i++)
    +		{
    +		  unsigned int cpu_id
    +		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
    +		  cpu_id_p[i] = cpu_id;
    +		  if (cpu_id > max_cpu)
    +		    max_cpu = cpu_id;
    +		}
    +
    +	      if (posix_memalign ((void **) &physical_package_id_p,
    +				  __alignof__ (void *),
    +				  ((max_cpu + 1)
    +				   * sizeof (unsigned int))) == 0)
    +		{
    +		  memset (physical_package_id_p, 0,
    +			  ((max_cpu + 1) * sizeof (unsigned int)));
    +
    +		  max_node = UINT_MAX;
    +
    +		  /* Get physical_package_id.  */
    +		  char path[(sizeof ("/sys/devices/system/cpu")
    +			     + 3 * sizeof (unsigned long int)
    +			     + sizeof ("/topology/physical_package_id"))];
    +		  for (i = 0; i < nprocs; i++)
    +		    {
    +		      struct dirent *d = cpu_list[i];
    +		      if (snprintf (path, sizeof (path),
    +				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
    +				    d->d_name) > 0)
    +			{
    +			  fd = __open_nocancel (path, flags);
    +			  if (fd != -1)
    +			    {
    +			      if (__read_nocancel (fd, buffer,
    +						   buffer_size) > 0)
    +				{
    +				  char *endp;
    +				  unsigned long int package_id
    +				    = strtoul (buffer, &endp, 10);
    +				  if (package_id != ULONG_MAX
    +				      && *buffer != '\0'
    +				      && (*endp == '\0' || *endp == '\n'))
    +				    {
    +				      physical_package_id_p[cpu_id_p[i]]
    +					= package_id;
    +				      if (max_node == UINT_MAX)
    +					{
    +					  /* This is the first node.  */
    +					  max_node = package_id;
    +					}
    +				      else if (package_id != max_node)
    +					{
    +					  /* NB: We only need to know if
    +					     NODE_COUNT > 1.  */
    +					  node_count = 2;
    +					  if (package_id > max_node)
    +					    max_node = package_id;
    +					}
    +				    }
    +				}
    +			      __close_nocancel_nostatus (fd);
    +			    }
    +			}
    +
    +		      free (d);
    +		    }
    +		}
    +
    +	      free (cpu_id_p);
    +	    }
    +	  else
    +	    {
    +	      for (i = 0; i < nprocs; i++)
    +		free (cpu_list[i]);
    +	    }
    +
    +	  free (cpu_list);
    +	}
    +    }
    +
    +  if (physical_package_id_p != NULL && node_count == 1)
    +    {
    +      /* There is only one node.  No need for physical_package_id_p.  */
    +      free (physical_package_id_p);
    +      physical_package_id_p = NULL;
    +      max_cpu = 0;
    +    }
    +
    +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
    +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
    +  size_t size = (sizeof (struct numa_spinlock)
    +		 + ((max_node + 1)
    +		    * sizeof (struct numa_spinlock_info *)));
    +  struct numa_spinlock *lock;
    +  if (posix_memalign ((void **) &lock,
    +		      __alignof__ (struct numa_spinlock_info *), size))
    +    return NULL;
    +  memset (lock, 0, size);
    +
    +  lock->max_node = max_node;
    +  lock->single_node = node_count == 1;
    +  lock->max_cpu = max_cpu;
    +  lock->physical_package_id_p = physical_package_id_p;
    +
    +  return lock;
    +}
    diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
    index 09e8447..dba7df6 100644
    --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
    index 8300958..a763c0a 100644
    --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
    @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3 _IO_flockfile F
     GLIBC_2.3 _IO_ftrylockfile F
     GLIBC_2.3 _IO_funlockfile F
    diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
    index 9a9e4ce..eb54a83 100644
    --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
    index c370fda..366fcac 100644
    --- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
    @@ -235,3 +235,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
    index d05468f..786d8e1 100644
    --- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
    @@ -229,6 +229,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
    index e8161aa..dd7c52f 100644
    --- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
    @@ -221,6 +221,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
    index bcba07f..a80475f 100644
    --- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
    index b413007..dd08796 100644
    --- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
    index ccc9449..5b190f6 100644
    --- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/x86/Makefile b/sysdeps/unix/sysv/linux/x86/Makefile
    index 02ca36c..29d41ad 100644
    --- a/sysdeps/unix/sysv/linux/x86/Makefile
    +++ b/sysdeps/unix/sysv/linux/x86/Makefile
    @@ -14,6 +14,7 @@ endif
     ifeq ($(subdir),nptl)
     libpthread-sysdep_routines += elision-lock elision-unlock elision-timed \
     			      elision-trylock
    +xtests += tst-variable-overhead tst-numa-variable-overhead
     CFLAGS-elision-lock.c += -mrtm
     CFLAGS-elision-unlock.c += -mrtm
     CFLAGS-elision-timed.c += -mrtm
    diff --git a/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
    new file mode 100644
    index 0000000..7cb8542
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
    @@ -0,0 +1,53 @@
    +/* Test case for NUMA spinlock overhead.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#ifndef _GNU_SOURCE
    +# define _GNU_SOURCE
    +#endif
    +#include "numa-spinlock.h"
    +
    +struct numa_spinlock *lock;
    +
    +struct work_todo_argument
    +{
    +  unsigned long *v1;
    +  unsigned long *v2;
    +  unsigned long *v3;
    +  unsigned long *v4;
    +};
    +
    +static void *
    +work_todo (void *v)
    +{
    +  struct work_todo_argument *p = v;
    +  unsigned long ret;
    +  *p->v1 = *p->v1 + 1;
    +  *p->v2 = *p->v2 + 1;
    +  ret = __sync_val_compare_and_swap (p->v4, 0, 1);
    +  *p->v3 = *p->v3 + ret;
    +  return (void *) 2;
    +}
    +
    +static inline void
    +do_work (struct numa_spinlock_info *lock_info)
    +{
    +  numa_spinlock_apply (lock_info);
    +}
    +
    +#define USE_NUMA_SPINLOCK
    +#include "tst-variable-overhead-skeleton.c"
    diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    new file mode 100644
    index 0000000..4b83dfb
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    @@ -0,0 +1,384 @@
    +/* Test case skeleton for spinlock overhead.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +/* Check spinlock overhead with large number threads.  Critical region is
    +   very smmall.  Critical region + spinlock overhead aren't noticeable
    +   when number of threads is small.  When thread number increases,
    +   spinlock overhead become the bottleneck.  It shows up in wall time
    +   of thread execution.  */
    +
    +#ifndef _GNU_SOURCE
    +# define _GNU_SOURCE
    +#endif
    +#include <config.h>
    +#include <unistd.h>
    +#include <stdio.h>
    +#include <pthread.h>
    +#include <sched.h>
    +#include <stdlib.h>
    +#include <string.h>
    +#include <stdint.h>
    +#include <sys/time.h>
    +#include <sys/param.h>
    +#include <errno.h>
    +#ifdef MODULE_NAME
    +# include <cpu-features.h>
    +# include <support/test-driver.h>
    +#endif
    +
    +#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP
    +# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1
    +#endif
    +
    +#define memory_barrier() __asm ("" ::: "memory")
    +#define pause() __asm  ("rep ; nop" ::: "memory")
    +
    +#define CACHELINE_SIZE	64
    +#define CACHE_ALIGNED	__attribute__((aligned(CACHELINE_SIZE)))
    +
    +#define constant_time 5
    +unsigned long g_val CACHE_ALIGNED;
    +unsigned long g_val2 CACHE_ALIGNED;
    +unsigned long g_val3 CACHE_ALIGNED;
    +unsigned long cmplock CACHE_ALIGNED;
    +struct count
    +{
    +  unsigned long long total;
    +  unsigned long long spinlock;
    +  unsigned long long wall;
    +} __attribute__((aligned(128)));
    +
    +struct count *gcount;
    +
    +/* The time consumed by one update is about 200 TSCs.  */
    +static int delay_time_unlocked = 400;
    +
    +struct ops
    +{
    +  void *(*test) (void *arg);
    +  void (*print_thread) (void *res, int);
    +} *ops;
    +
    +struct stats_result
    +{
    +  unsigned long num;
    +};
    +
    +void *work_thread (void *arg);
    +
    +#define iterations (10000 * 5)
    +
    +static volatile int start_thread;
    +
    +/* Delay some fixed time */
    +static void
    +delay_tsc (unsigned n)
    +{
    +  unsigned long long start, current, diff;
    +  unsigned int aux;
    +  start = __builtin_ia32_rdtscp (&aux);
    +  while (1)
    +    {
    +      current = __builtin_ia32_rdtscp (&aux);
    +      diff = current - start;
    +      if (diff < n)
    +	pause ();
    +      else
    +	break;
    +    }
    +}
    +
    +static void
    +wait_a_bit (int delay_time)
    +{
    +  if (delay_time > 0)
    +    delay_tsc (delay_time);
    +}
    +
    +#ifndef USE_NUMA_SPINLOCK
    +static inline void
    +work_todo (void)
    +{
    +  unsigned long ret;
    +  g_val = g_val + 1;
    +  g_val2 = g_val2 + 1;
    +  ret = __sync_val_compare_and_swap (&cmplock, 0, 1);
    +  g_val3 = g_val3 + 1 + ret;
    +}
    +#endif
    +
    +void *
    +work_thread (void *arg)
    +{
    +  long i;
    +  unsigned long pid = (unsigned long) arg;
    +  struct stats_result *res;
    +  unsigned long long start, end;
    +  int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE,
    +				roundup (sizeof (*res), CACHELINE_SIZE));
    +  if (err_ret)
    +    {
    +      printf ("posix_memalign failure: %s\n", strerror (err_ret));
    +      exit (err_ret);
    +    }
    +  long num = 0;
    +
    +#ifdef USE_NUMA_SPINLOCK
    +  struct work_todo_argument work_todo_arg;
    +  struct numa_spinlock_info lock_info;
    +
    +  if (numa_spinlock_init (lock, &lock_info))
    +    {
    +      printf ("numa_spinlock_init failure: %m\n");
    +      exit (1);
    +    }
    +
    +  work_todo_arg.v1 = &g_val;
    +  work_todo_arg.v2 = &g_val2;
    +  work_todo_arg.v3 = &g_val3;
    +  work_todo_arg.v4 = &cmplock;
    +  lock_info.argument = &work_todo_arg;
    +  lock_info.workload = work_todo;
    +#endif
    +
    +  while (!start_thread)
    +    pause ();
    +
    +  unsigned int aux;
    +  start = __builtin_ia32_rdtscp (&aux);
    +  for (i = 0; i < iterations; i++)
    +    {
    +#ifdef USE_NUMA_SPINLOCK
    +      do_work (&lock_info);
    +#else
    +      do_work ();
    +#endif
    +      wait_a_bit (delay_time_unlocked);
    +      num++;
    +    }
    +  end = __builtin_ia32_rdtscp (&aux);
    +  gcount[pid].total = end - start;
    +  res->num = num;
    +
    +  return res;
    +}
    +
    +void
    +init_global_data(void)
    +{
    +  g_val = 0;
    +  g_val2 = 0;
    +  g_val3 = 0;
    +  cmplock = 0;
    +}
    +
    +void
    +test_threads (int numthreads, int numprocs, unsigned long time)
    +{
    +  start_thread = 0;
    +
    +#ifdef USE_NUMA_SPINLOCK
    +  lock = numa_spinlock_alloc ();
    +#endif
    +
    +  memory_barrier ();
    +
    +  pthread_t thr[numthreads];
    +  void *res[numthreads];
    +  int i;
    +
    +  init_global_data ();
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      pthread_attr_t attr;
    +      const pthread_attr_t *attrp = NULL;
    +      if (USE_PTHREAD_ATTR_SETAFFINITY_NP)
    +	{
    +	  attrp = &attr;
    +	  pthread_attr_init (&attr);
    +	  cpu_set_t set;
    +	  CPU_ZERO (&set);
    +	  int cpu = i % numprocs;
    +	  (void) CPU_SET (cpu, &set);
    +	  pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set);
    +	}
    +      int err_ret = pthread_create (&thr[i], attrp, ops->test,
    +				    (void *)(uintptr_t) i);
    +      if (err_ret != 0)
    +	{
    +	  printf ("pthread_create failed: %d, %s\n",
    +		  i, strerror (i));
    +	  numthreads = i;
    +	  break;
    +	}
    +    }
    +
    +  memory_barrier ();
    +  start_thread = 1;
    +  memory_barrier ();
    +  sched_yield ();
    +
    +  if (time)
    +    {
    +      struct timespec ts =
    +	{
    +	  ts.tv_sec = time,
    +	  ts.tv_nsec = 0
    +	};
    +      clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL);
    +      memory_barrier ();
    +    }
    +
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      if (pthread_join (thr[i], (void *) &res[i]) == 0)
    +	free (res[i]);
    +      else
    +	printf ("pthread_join failure: %m\n");
    +    }
    +
    +#ifdef USE_NUMA_SPINLOCK
    +  numa_spinlock_free (lock);
    +#endif
    +}
    +
    +struct ops hashwork_ops =
    +{
    +  .test = work_thread,
    +};
    +
    +struct ops *ops;
    +
    +static struct count
    +total_cost (int numthreads, int numprocs)
    +{
    +  int i;
    +  unsigned long long total = 0;
    +  unsigned long long spinlock = 0;
    +
    +  memset (gcount, 0, sizeof(gcount[0]) * numthreads);
    +
    +  unsigned long long start, end, diff;
    +  unsigned int aux;
    +
    +  start = __builtin_ia32_rdtscp (&aux);
    +  test_threads (numthreads, numprocs, constant_time);
    +  end = __builtin_ia32_rdtscp (&aux);
    +  diff = end - start;
    +
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      total += gcount[i].total;
    +      spinlock += gcount[i].spinlock;
    +    }
    +
    +  struct count cost = { total, spinlock, diff };
    +  return cost;
    +}
    +
    +#ifdef MODULE_NAME
    +static int
    +do_test (void)
    +{
    +  if (!CPU_FEATURE_USABLE (RDTSCP))
    +    return EXIT_UNSUPPORTED;
    +#else
    +int
    +main (void)
    +{
    +#endif
    +  int numprocs = sysconf (_SC_NPROCESSORS_ONLN);
    +
    +  /* Oversubscribe CPU.  */
    +  int numthreads = 4 * numprocs;
    +
    +  ops = &hashwork_ops;
    +
    +  int err_ret = posix_memalign ((void **)&gcount, 4096,
    +				sizeof(gcount[0]) * numthreads);
    +  if (err_ret)
    +    {
    +      printf ("posix_memalign failure: %s\n", strerror (err_ret));
    +      exit (err_ret);
    +    }
    +
    +  struct count cost, cost1;
    +  double overhead;
    +  int i, last;
    +  int last_increment = numprocs < 16 ? 16 : numprocs;
    +  int numprocs_done = 0;
    +  int numprocs_reset = 0;
    +  cost1 = total_cost (1, numprocs);
    +
    +  printf ("Number of processors: %d, Single thread time %lld\n\n",
    +	  numprocs, cost1.total);
    +
    +  for (last = i = 2; i <= numthreads;)
    +    {
    +      last = i;
    +      cost = total_cost (i, numprocs);
    +      overhead = cost.total;
    +      overhead /= i;
    +      overhead /= cost1.total;
    +      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
    +	      i, cost.total, overhead);
    +      if ((i * 2) < numprocs)
    +	i = i * 2;
    +      else if (numprocs_done)
    +	{
    +	  if (numprocs_reset)
    +	    {
    +	      i = numprocs_reset;
    +	      numprocs_reset = 0;
    +	    }
    +	  else
    +	    {
    +	      if ((i * 2) < numthreads)
    +		i = i * 2;
    +	      else
    +		i = i + last_increment;
    +	    }
    +	}
    +      else
    +	{
    +	  if (numprocs != 2 * i)
    +	    numprocs_reset = 2 * i;
    +	  i = numprocs;
    +	  numprocs_done = 1;
    +	}
    +    }
    +
    +  if (last != numthreads)
    +    {
    +      i = numthreads;
    +      cost = total_cost (i, numprocs);
    +      overhead = cost.total;
    +      overhead /= i;
    +      overhead /= cost1.total;
    +      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
    +	      i, cost.total, overhead);
    +    }
    +
    +  free (gcount);
    +  return 0;
    +}
    +
    +#ifdef MODULE_NAME
    +# define TIMEOUT 900
    +# include <support/test-driver.c>
    +#endif
    diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    new file mode 100644
    index 0000000..b3ce567
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    @@ -0,0 +1,47 @@
    +/* Test case for spinlock overhead.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#ifndef _GNU_SOURCE
    +# define _GNU_SOURCE
    +#endif
    +#include <pthread.h>
    +
    +struct
    +{
    +  pthread_spinlock_t testlock;
    +  char pad[64 - sizeof (pthread_spinlock_t)];
    +} test __attribute__((aligned(64)));
    +
    +static void
    +__attribute__((constructor))
    +init_spin (void)
    +{
    +  pthread_spin_init (&test.testlock, 0);
    +}
    +
    +static void work_todo (void);
    +
    +static inline void
    +do_work (void)
    +{
    +  pthread_spin_lock(&test.testlock);
    +  work_todo ();
    +  pthread_spin_unlock(&test.testlock);
    +}
    +
    +#include "tst-variable-overhead-skeleton.c"
    diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
    index 931c827..e90532e 100644
    --- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
    index c09c9b0..c74febb 100644
    --- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    -- 
    1.8.3.1
Szabolcs Nagy Jan. 3, 2019, 2:52 p.m. | #2
On 03/01/2019 05:35, 马凌(彦军) wrote:
>      create mode 100644 manual/examples/numa-spinlock.c
>      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
>      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
>      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
>      create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
>      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
>      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
>      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c

as far as i can tell the new code is generic
(other than the presence of efficient getcpu),
so i think the test should be generic too.

>     --- /dev/null
>     +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
>     @@ -0,0 +1,384 @@
...
>     +/* Check spinlock overhead with large number threads.  Critical region is
>     +   very smmall.  Critical region + spinlock overhead aren't noticeable
>     +   when number of threads is small.  When thread number increases,
>     +   spinlock overhead become the bottleneck.  It shows up in wall time
>     +   of thread execution.  */

yeah, this is not easy to do in a generic way, i think
even on x86 such measurement is problematic, you don't
know what goes on a system (or vm) when the glibc test
is running.

but doing precise timing is not that important for
checking the correctness of the locks, so i think a
simplified version can be generic test code.
H.J. Lu Jan. 3, 2019, 7:58 p.m. | #3
On Thu, Jan 3, 2019 at 6:52 AM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
>
> On 03/01/2019 05:35, 马凌(彦军) wrote:
> >      create mode 100644 manual/examples/numa-spinlock.c
> >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
> >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
> >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
> >      create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
> >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
> >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
>
> as far as i can tell the new code is generic
> (other than the presence of efficient getcpu),
> so i think the test should be generic too.
>
> >     --- /dev/null
> >     +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
> >     @@ -0,0 +1,384 @@
> ...
> >     +/* Check spinlock overhead with large number threads.  Critical region is
> >     +   very smmall.  Critical region + spinlock overhead aren't noticeable
> >     +   when number of threads is small.  When thread number increases,
> >     +   spinlock overhead become the bottleneck.  It shows up in wall time
> >     +   of thread execution.  */
>
> yeah, this is not easy to do in a generic way, i think
> even on x86 such measurement is problematic, you don't
> know what goes on a system (or vm) when the glibc test
> is running.
>
> but doing precise timing is not that important for
> checking the correctness of the locks, so i think a
> simplified version can be generic test code.

Here is the updated patch to make tests generic.
Rich Felker Jan. 3, 2019, 8:43 p.m. | #4
On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> From: "ling.ma" <ling.ml@antfin.com>
> 
> On multi-socket systems, memory is shared across the entire system.
> Data access to the local socket is much faster than the remote socket
> and data access to the local core is faster than sibling cores on the
> same socket.  For serialized workloads with conventional spinlock,
> when there is high spinlock contention between threads, lock ping-pong
> among sockets becomes the bottleneck and threads spend majority of
> their time in spinlock overhead.
> 
> On multi-socket systems, the keys to our NUMA spinlock performance
> are to minimize cross-socket traffic as well as localize the serialized
> workload to one core for execution.  The basic principles of NUMA
> spinlock are mainly consisted of following approaches, which reduce
> data movement and accelerate critical section, eventually give us
> significant performance improvement.

I question whether this belongs in glibc. It seems highly application-
and kernel-specific. Is there a reason you wouldn't prefer to
implement and maintain it in a library for use in the kind of
application that needs it?

Some specific review inline below:

> [...]
> diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> new file mode 100644
> index 0000000..8ff4e1a
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> @@ -0,0 +1,304 @@
> +/* Initialization of NUMA spinlock.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <assert.h>
> +#include <ctype.h>
> +#include <string.h>
> +#include <dirent.h>
> +#include <stdio.h>
> +#include <limits.h>
> +#ifdef _LIBC
> +# include <not-cancel.h>
> +#else
> +# include <stdlib.h>
> +# include <unistd.h>
> +# include <fcntl.h>
> +# define __open_nocancel		open
> +# define __close_nocancel_nostatus	close
> +# define __read_nocancel		read
> +#endif
> +
> +#include "numa-spinlock-private.h"
> +
> +static char *
> +next_line (int fd, char *const buffer, char **cp, char **re,
> +	   char *const buffer_end)
> +{
> +  char *res = *cp;
> +  char *nl = memchr (*cp, '\n', *re - *cp);
> +  if (nl == NULL)
> +    {
> +      if (*cp != buffer)
> +	{
> +	  if (*re == buffer_end)
> +	    {
> +	      memmove (buffer, *cp, *re - *cp);
> +	      *re = buffer + (*re - *cp);
> +	      *cp = buffer;
> +
> +	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
> +	      if (n < 0)
> +		return NULL;
> +
> +	      *re += n;
> +
> +	      nl = memchr (*cp, '\n', *re - *cp);
> +	      while (nl == NULL && *re == buffer_end)
> +		{
> +		  /* Truncate too long lines.  */
> +		  *re = buffer + 3 * (buffer_end - buffer) / 4;
> +		  n = __read_nocancel (fd, *re, buffer_end - *re);
> +		  if (n < 0)
> +		    return NULL;
> +
> +		  nl = memchr (*re, '\n', n);
> +		  **re = '\n';
> +		  *re += n;
> +		}
> +	    }
> +	  else
> +	    nl = memchr (*cp, '\n', *re - *cp);
> +
> +	  res = *cp;
> +	}
> +
> +      if (nl == NULL)
> +	nl = *re - 1;
> +    }
> +
> +  *cp = nl + 1;
> +  assert (*cp <= *re);
> +
> +  return res == *re ? NULL : res;
> +}

This looks like fragile duplication of stdio-like buffering logic
that's not at all specific to this file. Does glibc have a policy on
whether things needing this should use stdio or some other shared code
rather than open-coding it like this?

> [...]

> +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> +   memory when it is no longer needed.  */
> +
> +struct numa_spinlock *
> +numa_spinlock_alloc (void)
> +{
> +  const size_t buffer_size = 1024;
> +  char buffer[buffer_size];
> +  char *buffer_end = buffer + buffer_size;
> +  char *cp = buffer_end;
> +  char *re = buffer_end;
> +
> +  const int flags = O_RDONLY | O_CLOEXEC;
> +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> +  char *l;
> +  unsigned int max_node = 0;
> +  unsigned int node_count = 0;
> +  if (fd != -1)
> +    {
> +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> +      if (l != NULL)
> +	do
> +	  {
> +	    char *endp;
> +	    unsigned long int n = strtoul (l, &endp, 10);
> +	    if (l == endp)
> +	      {
> +		node_count = 1;
> +		break;
> +	      }
> +
> +	    unsigned long int m = n;
> +	    if (*endp == '-')
> +	      {
> +		l = endp + 1;
> +		m = strtoul (l, &endp, 10);
> +		if (l == endp)
> +		  {
> +		    node_count = 1;
> +		    break;
> +		  }
> +	      }
> +
> +	    node_count += m - n + 1;
> +
> +	    if (m >= max_node)
> +	      max_node = m;
> +
> +	    l = endp;
> +	    while (l < re && isspace (*l))
> +	      ++l;
> +	  }
> +	while (l < re);
> +
> +      __close_nocancel_nostatus (fd);
> +    }
> +
> +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> +     nodes is 1, set the maximium NUMA node number to 0.  */
> +  if (node_count == 1)
> +    max_node = 0;
> +
> +  unsigned int max_cpu = 0;
> +  unsigned int *physical_package_id_p = NULL;
> +
> +  if (max_node == 0)
> +    {
> +      /* There is at least 1 node.  */
> +      node_count = 1;
> +
> +      /* If NUMA is disabled, use physical_package_id instead.  */
> +      struct dirent **cpu_list;
> +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> +			    select_cpu, NULL);
> +      if (nprocs > 0)
> +	{
> +	  int i;
> +	  unsigned int *cpu_id_p = NULL;
> +
> +	  /* Find the maximum CPU number.  */
> +	  if (posix_memalign ((void **) &cpu_id_p,
> +			      __alignof__ (void *),
> +			      nprocs * sizeof (unsigned int)) == 0)

Using posix_memalign to get memory with the alignment of
__alignof__(void*) makes no sense. All allocations via malloc are
suitably aligned for any standard type.

> +	    {
> +	      for (i = 0; i < nprocs; i++)
> +		{
> +		  unsigned int cpu_id
> +		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
> +		  cpu_id_p[i] = cpu_id;
> +		  if (cpu_id > max_cpu)
> +		    max_cpu = cpu_id;
> +		}
> +
> +	      if (posix_memalign ((void **) &physical_package_id_p,
> +				  __alignof__ (void *),
> +				  ((max_cpu + 1)
> +				   * sizeof (unsigned int))) == 0)

Again.

> +		{
> +		  memset (physical_package_id_p, 0,
> +			  ((max_cpu + 1) * sizeof (unsigned int)));
> +
> +		  max_node = UINT_MAX;
> +
> +		  /* Get physical_package_id.  */
> +		  char path[(sizeof ("/sys/devices/system/cpu")
> +			     + 3 * sizeof (unsigned long int)
> +			     + sizeof ("/topology/physical_package_id"))];
> +		  for (i = 0; i < nprocs; i++)
> +		    {
> +		      struct dirent *d = cpu_list[i];
> +		      if (snprintf (path, sizeof (path),
> +				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
> +				    d->d_name) > 0)

Are these sysfs pathnames documented as stable/permanent by the
kernel?

> +			{
> +			  fd = __open_nocancel (path, flags);
> +			  if (fd != -1)
> +			    {
> +			      if (__read_nocancel (fd, buffer,
> +						   buffer_size) > 0)
> +				{
> +				  char *endp;
> +				  unsigned long int package_id
> +				    = strtoul (buffer, &endp, 10);
> +				  if (package_id != ULONG_MAX
> +				      && *buffer != '\0'
> +				      && (*endp == '\0' || *endp == '\n'))
> +				    {
> +				      physical_package_id_p[cpu_id_p[i]]
> +					= package_id;
> +				      if (max_node == UINT_MAX)
> +					{
> +					  /* This is the first node.  */
> +					  max_node = package_id;
> +					}
> +				      else if (package_id != max_node)
> +					{
> +					  /* NB: We only need to know if
> +					     NODE_COUNT > 1.  */
> +					  node_count = 2;
> +					  if (package_id > max_node)
> +					    max_node = package_id;
> +					}
> +				    }
> +				}
> +			      __close_nocancel_nostatus (fd);
> +			    }
> +			}
> +
> +		      free (d);
> +		    }
> +		}
> +
> +	      free (cpu_id_p);
> +	    }
> +	  else
> +	    {
> +	      for (i = 0; i < nprocs; i++)
> +		free (cpu_list[i]);
> +	    }
> +
> +	  free (cpu_list);
> +	}
> +    }
> +
> +  if (physical_package_id_p != NULL && node_count == 1)
> +    {
> +      /* There is only one node.  No need for physical_package_id_p.  */
> +      free (physical_package_id_p);
> +      physical_package_id_p = NULL;
> +      max_cpu = 0;
> +    }
> +
> +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
> +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
> +  size_t size = (sizeof (struct numa_spinlock)
> +		 + ((max_node + 1)
> +		    * sizeof (struct numa_spinlock_info *)));
> +  struct numa_spinlock *lock;
> +  if (posix_memalign ((void **) &lock,
> +		      __alignof__ (struct numa_spinlock_info *), size))

Another gratuitous posix_memalign.
H.J. Lu Jan. 3, 2019, 8:54 p.m. | #5
On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
>
> On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > From: "ling.ma" <ling.ml@antfin.com>
> >
> > On multi-socket systems, memory is shared across the entire system.
> > Data access to the local socket is much faster than the remote socket
> > and data access to the local core is faster than sibling cores on the
> > same socket.  For serialized workloads with conventional spinlock,
> > when there is high spinlock contention between threads, lock ping-pong
> > among sockets becomes the bottleneck and threads spend majority of
> > their time in spinlock overhead.
> >
> > On multi-socket systems, the keys to our NUMA spinlock performance
> > are to minimize cross-socket traffic as well as localize the serialized
> > workload to one core for execution.  The basic principles of NUMA
> > spinlock are mainly consisted of following approaches, which reduce
> > data movement and accelerate critical section, eventually give us
> > significant performance improvement.
>
> I question whether this belongs in glibc. It seems highly application-
> and kernel-specific. Is there a reason you wouldn't prefer to
> implement and maintain it in a library for use in the kind of
> application that needs it?

This is a good question.  On the other hand,  the current spinlock
in glibc hasn't been changed for many years.  It doesn't scale for
today's hardware.  Having a scalable spinlock in glibc is desirable.

> Some specific review inline below:
>
> > [...]
> > diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> > new file mode 100644
> > index 0000000..8ff4e1a
> > --- /dev/null
> > +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> > @@ -0,0 +1,304 @@
> > +/* Initialization of NUMA spinlock.
> > +   Copyright (C) 2018 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <assert.h>
> > +#include <ctype.h>
> > +#include <string.h>
> > +#include <dirent.h>
> > +#include <stdio.h>
> > +#include <limits.h>
> > +#ifdef _LIBC
> > +# include <not-cancel.h>
> > +#else
> > +# include <stdlib.h>
> > +# include <unistd.h>
> > +# include <fcntl.h>
> > +# define __open_nocancel             open
> > +# define __close_nocancel_nostatus   close
> > +# define __read_nocancel             read
> > +#endif
> > +
> > +#include "numa-spinlock-private.h"
> > +
> > +static char *
> > +next_line (int fd, char *const buffer, char **cp, char **re,
> > +        char *const buffer_end)
> > +{
> > +  char *res = *cp;
> > +  char *nl = memchr (*cp, '\n', *re - *cp);
> > +  if (nl == NULL)
> > +    {
> > +      if (*cp != buffer)
> > +     {
> > +       if (*re == buffer_end)
> > +         {
> > +           memmove (buffer, *cp, *re - *cp);
> > +           *re = buffer + (*re - *cp);
> > +           *cp = buffer;
> > +
> > +           ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
> > +           if (n < 0)
> > +             return NULL;
> > +
> > +           *re += n;
> > +
> > +           nl = memchr (*cp, '\n', *re - *cp);
> > +           while (nl == NULL && *re == buffer_end)
> > +             {
> > +               /* Truncate too long lines.  */
> > +               *re = buffer + 3 * (buffer_end - buffer) / 4;
> > +               n = __read_nocancel (fd, *re, buffer_end - *re);
> > +               if (n < 0)
> > +                 return NULL;
> > +
> > +               nl = memchr (*re, '\n', n);
> > +               **re = '\n';
> > +               *re += n;
> > +             }
> > +         }
> > +       else
> > +         nl = memchr (*cp, '\n', *re - *cp);
> > +
> > +       res = *cp;
> > +     }
> > +
> > +      if (nl == NULL)
> > +     nl = *re - 1;
> > +    }
> > +
> > +  *cp = nl + 1;
> > +  assert (*cp <= *re);
> > +
> > +  return res == *re ? NULL : res;
> > +}
>
> This looks like fragile duplication of stdio-like buffering logic
> that's not at all specific to this file. Does glibc have a policy on
> whether things needing this should use stdio or some other shared code
> rather than open-coding it like this?

This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should it
be exported in GLIBC_PRIVATE name space?

> > [...]
>
> > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> > +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> > +   memory when it is no longer needed.  */
> > +
> > +struct numa_spinlock *
> > +numa_spinlock_alloc (void)
> > +{
> > +  const size_t buffer_size = 1024;
> > +  char buffer[buffer_size];
> > +  char *buffer_end = buffer + buffer_size;
> > +  char *cp = buffer_end;
> > +  char *re = buffer_end;
> > +
> > +  const int flags = O_RDONLY | O_CLOEXEC;
> > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > +  char *l;
> > +  unsigned int max_node = 0;
> > +  unsigned int node_count = 0;
> > +  if (fd != -1)
> > +    {
> > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > +      if (l != NULL)
> > +     do
> > +       {
> > +         char *endp;
> > +         unsigned long int n = strtoul (l, &endp, 10);
> > +         if (l == endp)
> > +           {
> > +             node_count = 1;
> > +             break;
> > +           }
> > +
> > +         unsigned long int m = n;
> > +         if (*endp == '-')
> > +           {
> > +             l = endp + 1;
> > +             m = strtoul (l, &endp, 10);
> > +             if (l == endp)
> > +               {
> > +                 node_count = 1;
> > +                 break;
> > +               }
> > +           }
> > +
> > +         node_count += m - n + 1;
> > +
> > +         if (m >= max_node)
> > +           max_node = m;
> > +
> > +         l = endp;
> > +         while (l < re && isspace (*l))
> > +           ++l;
> > +       }
> > +     while (l < re);
> > +
> > +      __close_nocancel_nostatus (fd);
> > +    }
> > +
> > +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> > +     nodes is 1, set the maximium NUMA node number to 0.  */
> > +  if (node_count == 1)
> > +    max_node = 0;
> > +
> > +  unsigned int max_cpu = 0;
> > +  unsigned int *physical_package_id_p = NULL;
> > +
> > +  if (max_node == 0)
> > +    {
> > +      /* There is at least 1 node.  */
> > +      node_count = 1;
> > +
> > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > +      struct dirent **cpu_list;
> > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > +                         select_cpu, NULL);
> > +      if (nprocs > 0)
> > +     {
> > +       int i;
> > +       unsigned int *cpu_id_p = NULL;
> > +
> > +       /* Find the maximum CPU number.  */
> > +       if (posix_memalign ((void **) &cpu_id_p,
> > +                           __alignof__ (void *),
> > +                           nprocs * sizeof (unsigned int)) == 0)
>
> Using posix_memalign to get memory with the alignment of
> __alignof__(void*) makes no sense. All allocations via malloc are
> suitably aligned for any standard type.

Does glibc prefer malloc over posix_memalign?

> > +         {
> > +           for (i = 0; i < nprocs; i++)
> > +             {
> > +               unsigned int cpu_id
> > +                 = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
> > +               cpu_id_p[i] = cpu_id;
> > +               if (cpu_id > max_cpu)
> > +                 max_cpu = cpu_id;
> > +             }
> > +
> > +           if (posix_memalign ((void **) &physical_package_id_p,
> > +                               __alignof__ (void *),
> > +                               ((max_cpu + 1)
> > +                                * sizeof (unsigned int))) == 0)
>
> Again.
>
> > +             {
> > +               memset (physical_package_id_p, 0,
> > +                       ((max_cpu + 1) * sizeof (unsigned int)));
> > +
> > +               max_node = UINT_MAX;
> > +
> > +               /* Get physical_package_id.  */
> > +               char path[(sizeof ("/sys/devices/system/cpu")
> > +                          + 3 * sizeof (unsigned long int)
> > +                          + sizeof ("/topology/physical_package_id"))];
> > +               for (i = 0; i < nprocs; i++)
> > +                 {
> > +                   struct dirent *d = cpu_list[i];
> > +                   if (snprintf (path, sizeof (path),
> > +                                 "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > +                                 d->d_name) > 0)
>
> Are these sysfs pathnames documented as stable/permanent by the
> kernel?

I believe so.

> > +                     {
> > +                       fd = __open_nocancel (path, flags);
> > +                       if (fd != -1)
> > +                         {
> > +                           if (__read_nocancel (fd, buffer,
> > +                                                buffer_size) > 0)
> > +                             {
> > +                               char *endp;
> > +                               unsigned long int package_id
> > +                                 = strtoul (buffer, &endp, 10);
> > +                               if (package_id != ULONG_MAX
> > +                                   && *buffer != '\0'
> > +                                   && (*endp == '\0' || *endp == '\n'))
> > +                                 {
> > +                                   physical_package_id_p[cpu_id_p[i]]
> > +                                     = package_id;
> > +                                   if (max_node == UINT_MAX)
> > +                                     {
> > +                                       /* This is the first node.  */
> > +                                       max_node = package_id;
> > +                                     }
> > +                                   else if (package_id != max_node)
> > +                                     {
> > +                                       /* NB: We only need to know if
> > +                                          NODE_COUNT > 1.  */
> > +                                       node_count = 2;
> > +                                       if (package_id > max_node)
> > +                                         max_node = package_id;
> > +                                     }
> > +                                 }
> > +                             }
> > +                           __close_nocancel_nostatus (fd);
> > +                         }
> > +                     }
> > +
> > +                   free (d);
> > +                 }
> > +             }
> > +
> > +           free (cpu_id_p);
> > +         }
> > +       else
> > +         {
> > +           for (i = 0; i < nprocs; i++)
> > +             free (cpu_list[i]);
> > +         }
> > +
> > +       free (cpu_list);
> > +     }
> > +    }
> > +
> > +  if (physical_package_id_p != NULL && node_count == 1)
> > +    {
> > +      /* There is only one node.  No need for physical_package_id_p.  */
> > +      free (physical_package_id_p);
> > +      physical_package_id_p = NULL;
> > +      max_cpu = 0;
> > +    }
> > +
> > +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
> > +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
> > +  size_t size = (sizeof (struct numa_spinlock)
> > +              + ((max_node + 1)
> > +                 * sizeof (struct numa_spinlock_info *)));
> > +  struct numa_spinlock *lock;
> > +  if (posix_memalign ((void **) &lock,
> > +                   __alignof__ (struct numa_spinlock_info *), size))
>
> Another gratuitous posix_memalign.
Rich Felker Jan. 3, 2019, 9:21 p.m. | #6
On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > From: "ling.ma" <ling.ml@antfin.com>
> > >
> > > On multi-socket systems, memory is shared across the entire system.
> > > Data access to the local socket is much faster than the remote socket
> > > and data access to the local core is faster than sibling cores on the
> > > same socket.  For serialized workloads with conventional spinlock,
> > > when there is high spinlock contention between threads, lock ping-pong
> > > among sockets becomes the bottleneck and threads spend majority of
> > > their time in spinlock overhead.
> > >
> > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > are to minimize cross-socket traffic as well as localize the serialized
> > > workload to one core for execution.  The basic principles of NUMA
> > > spinlock are mainly consisted of following approaches, which reduce
> > > data movement and accelerate critical section, eventually give us
> > > significant performance improvement.
> >
> > I question whether this belongs in glibc. It seems highly application-
> > and kernel-specific. Is there a reason you wouldn't prefer to
> > implement and maintain it in a library for use in the kind of
> > application that needs it?
> 
> This is a good question.  On the other hand,  the current spinlock
> in glibc hasn't been changed for many years.  It doesn't scale for
> today's hardware.  Having a scalable spinlock in glibc is desirable.

"Scalable spinlock" is something of an oxymoron. Spinlocks are for
situations where contention is extremely rare, since they inherently
blow up badly under contention. If this is happening it means you
wanted a mutex not a spinlock.

> > This looks like fragile duplication of stdio-like buffering logic
> > that's not at all specific to this file. Does glibc have a policy on
> > whether things needing this should use stdio or some other shared code
> > rather than open-coding it like this?
> 
> This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should it
> be exported in GLIBC_PRIVATE name space?

Possibly. I don't have a strong opinion here. At least it's good to
know it's copied from code elsewhere that's believed to be
correct/safe.

> 
> > > [...]
> >
> > > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> > > +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> > > +   memory when it is no longer needed.  */
> > > +
> > > +struct numa_spinlock *
> > > +numa_spinlock_alloc (void)
> > > +{
> > > +  const size_t buffer_size = 1024;
> > > +  char buffer[buffer_size];
> > > +  char *buffer_end = buffer + buffer_size;
> > > +  char *cp = buffer_end;
> > > +  char *re = buffer_end;
> > > +
> > > +  const int flags = O_RDONLY | O_CLOEXEC;
> > > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > > +  char *l;
> > > +  unsigned int max_node = 0;
> > > +  unsigned int node_count = 0;
> > > +  if (fd != -1)
> > > +    {
> > > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > +      if (l != NULL)
> > > +     do
> > > +       {
> > > +         char *endp;
> > > +         unsigned long int n = strtoul (l, &endp, 10);
> > > +         if (l == endp)
> > > +           {
> > > +             node_count = 1;
> > > +             break;
> > > +           }
> > > +
> > > +         unsigned long int m = n;
> > > +         if (*endp == '-')
> > > +           {
> > > +             l = endp + 1;
> > > +             m = strtoul (l, &endp, 10);
> > > +             if (l == endp)
> > > +               {
> > > +                 node_count = 1;
> > > +                 break;
> > > +               }
> > > +           }
> > > +
> > > +         node_count += m - n + 1;
> > > +
> > > +         if (m >= max_node)
> > > +           max_node = m;
> > > +
> > > +         l = endp;
> > > +         while (l < re && isspace (*l))
> > > +           ++l;
> > > +       }
> > > +     while (l < re);
> > > +
> > > +      __close_nocancel_nostatus (fd);
> > > +    }
> > > +
> > > +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> > > +     nodes is 1, set the maximium NUMA node number to 0.  */
> > > +  if (node_count == 1)
> > > +    max_node = 0;
> > > +
> > > +  unsigned int max_cpu = 0;
> > > +  unsigned int *physical_package_id_p = NULL;
> > > +
> > > +  if (max_node == 0)
> > > +    {
> > > +      /* There is at least 1 node.  */
> > > +      node_count = 1;
> > > +
> > > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > > +      struct dirent **cpu_list;
> > > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > +                         select_cpu, NULL);
> > > +      if (nprocs > 0)
> > > +     {
> > > +       int i;
> > > +       unsigned int *cpu_id_p = NULL;
> > > +
> > > +       /* Find the maximum CPU number.  */
> > > +       if (posix_memalign ((void **) &cpu_id_p,
> > > +                           __alignof__ (void *),
> > > +                           nprocs * sizeof (unsigned int)) == 0)
> >
> > Using posix_memalign to get memory with the alignment of
> > __alignof__(void*) makes no sense. All allocations via malloc are
> > suitably aligned for any standard type.
> 
> Does glibc prefer malloc over posix_memalign?

If not, I think it *should*. Use of posix_memalign suggests to the
reader that some nonstandard alignment is needed, prompting one to ask
"what?" only to figure out it was gratuitous.

What's worse -- and I missed this on the first pass --
posix_memalign's signature is broken by design and almost universally
gets the programmer to invoke undefined behavior, which has happened
here. cpu_id_p has type unsigned int *, but due to the cast,
posix_memalign accesses it as if it had type void *, which is an
aliasing violation. For this reason alone, I would go so far as to say
the function should never be used unless there's no way to avoid it,
and even then it should be wrapped with a function that *returns* the
pointer as void* so that this error can't be made.

> > > +                   if (snprintf (path, sizeof (path),
> > > +                                 "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > +                                 d->d_name) > 0)
> >
> > Are these sysfs pathnames documented as stable/permanent by the
> > kernel?
> 
> I believe so.

OK. It is worth checking though since otherwise you have an interface
that will suddenly break when things change on the kernel side.

Rich
H.J. Lu Jan. 3, 2019, 9:27 p.m. | #7
On Thu, Jan 3, 2019 at 1:21 PM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma" <ling.ml@antfin.com>
> > > >
> > > > On multi-socket systems, memory is shared across the entire system.
> > > > Data access to the local socket is much faster than the remote socket
> > > > and data access to the local core is faster than sibling cores on the
> > > > same socket.  For serialized workloads with conventional spinlock,
> > > > when there is high spinlock contention between threads, lock ping-pong
> > > > among sockets becomes the bottleneck and threads spend majority of
> > > > their time in spinlock overhead.
> > > >
> > > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > > are to minimize cross-socket traffic as well as localize the serialized
> > > > workload to one core for execution.  The basic principles of NUMA
> > > > spinlock are mainly consisted of following approaches, which reduce
> > > > data movement and accelerate critical section, eventually give us
> > > > significant performance improvement.
> > >
> > > I question whether this belongs in glibc. It seems highly application-
> > > and kernel-specific. Is there a reason you wouldn't prefer to
> > > implement and maintain it in a library for use in the kind of
> > > application that needs it?
> >
> > This is a good question.  On the other hand,  the current spinlock
> > in glibc hasn't been changed for many years.  It doesn't scale for
> > today's hardware.  Having a scalable spinlock in glibc is desirable.
>
> "Scalable spinlock" is something of an oxymoron. Spinlocks are for
> situations where contention is extremely rare, since they inherently
> blow up badly under contention. If this is happening it means you
> wanted a mutex not a spinlock.

The critical region serialized by spinlock is very small.  Overhead
of mutex is much bigger than the critical region itself.

> > > This looks like fragile duplication of stdio-like buffering logic
> > > that's not at all specific to this file. Does glibc have a policy on
> > > whether things needing this should use stdio or some other shared code
> > > rather than open-coding it like this?
> >
> > This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should it
> > be exported in GLIBC_PRIVATE name space?
>
> Possibly. I don't have a strong opinion here. At least it's good to
> know it's copied from code elsewhere that's believed to be
> correct/safe.
>
> >
> > > > [...]
> > >
> > > > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> > > > +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> > > > +   memory when it is no longer needed.  */
> > > > +
> > > > +struct numa_spinlock *
> > > > +numa_spinlock_alloc (void)
> > > > +{
> > > > +  const size_t buffer_size = 1024;
> > > > +  char buffer[buffer_size];
> > > > +  char *buffer_end = buffer + buffer_size;
> > > > +  char *cp = buffer_end;
> > > > +  char *re = buffer_end;
> > > > +
> > > > +  const int flags = O_RDONLY | O_CLOEXEC;
> > > > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > > > +  char *l;
> > > > +  unsigned int max_node = 0;
> > > > +  unsigned int node_count = 0;
> > > > +  if (fd != -1)
> > > > +    {
> > > > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > > +      if (l != NULL)
> > > > +     do
> > > > +       {
> > > > +         char *endp;
> > > > +         unsigned long int n = strtoul (l, &endp, 10);
> > > > +         if (l == endp)
> > > > +           {
> > > > +             node_count = 1;
> > > > +             break;
> > > > +           }
> > > > +
> > > > +         unsigned long int m = n;
> > > > +         if (*endp == '-')
> > > > +           {
> > > > +             l = endp + 1;
> > > > +             m = strtoul (l, &endp, 10);
> > > > +             if (l == endp)
> > > > +               {
> > > > +                 node_count = 1;
> > > > +                 break;
> > > > +               }
> > > > +           }
> > > > +
> > > > +         node_count += m - n + 1;
> > > > +
> > > > +         if (m >= max_node)
> > > > +           max_node = m;
> > > > +
> > > > +         l = endp;
> > > > +         while (l < re && isspace (*l))
> > > > +           ++l;
> > > > +       }
> > > > +     while (l < re);
> > > > +
> > > > +      __close_nocancel_nostatus (fd);
> > > > +    }
> > > > +
> > > > +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> > > > +     nodes is 1, set the maximium NUMA node number to 0.  */
> > > > +  if (node_count == 1)
> > > > +    max_node = 0;
> > > > +
> > > > +  unsigned int max_cpu = 0;
> > > > +  unsigned int *physical_package_id_p = NULL;
> > > > +
> > > > +  if (max_node == 0)
> > > > +    {
> > > > +      /* There is at least 1 node.  */
> > > > +      node_count = 1;
> > > > +
> > > > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > > > +      struct dirent **cpu_list;
> > > > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > > +                         select_cpu, NULL);
> > > > +      if (nprocs > 0)
> > > > +     {
> > > > +       int i;
> > > > +       unsigned int *cpu_id_p = NULL;
> > > > +
> > > > +       /* Find the maximum CPU number.  */
> > > > +       if (posix_memalign ((void **) &cpu_id_p,
> > > > +                           __alignof__ (void *),
> > > > +                           nprocs * sizeof (unsigned int)) == 0)
> > >
> > > Using posix_memalign to get memory with the alignment of
> > > __alignof__(void*) makes no sense. All allocations via malloc are
> > > suitably aligned for any standard type.
> >
> > Does glibc prefer malloc over posix_memalign?
>
> If not, I think it *should*. Use of posix_memalign suggests to the
> reader that some nonstandard alignment is needed, prompting one to ask
> "what?" only to figure out it was gratuitous.
>
> What's worse -- and I missed this on the first pass --
> posix_memalign's signature is broken by design and almost universally
> gets the programmer to invoke undefined behavior, which has happened
> here. cpu_id_p has type unsigned int *, but due to the cast,
> posix_memalign accesses it as if it had type void *, which is an
> aliasing violation. For this reason alone, I would go so far as to say
> the function should never be used unless there's no way to avoid it,
> and even then it should be wrapped with a function that *returns* the
> pointer as void* so that this error can't be made.

We will change it to malloc.

> > > > +                   if (snprintf (path, sizeof (path),
> > > > +                                 "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > > +                                 d->d_name) > 0)
> > >
> > > Are these sysfs pathnames documented as stable/permanent by the
> > > kernel?
> >
> > I believe so.
>
> OK. It is worth checking though since otherwise you have an interface
> that will suddenly break when things change on the kernel side.
>

Documentation/cputopology.txt has

===========================================
How CPU topology info is exported via sysfs
===========================================

Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures:

1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:

physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
....
=?UTF-8?B?6ams5YeMKOW9puWGmyk=?= Jan. 4, 2019, 4:13 a.m. | #8
在 2019/1/3 下午10:52,“Szabolcs Nagy”<Szabolcs.Nagy@arm.com> 写入:

    On 03/01/2019 05:35, 马凌(彦军) wrote:
    >      create mode 100644 manual/examples/numa-spinlock.c
    >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
    >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
    >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
    >      create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
    >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
    >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    
    as far as i can tell the new code is generic
    (other than the presence of efficient getcpu),
    so i think the test should be generic too.

    
    >     --- /dev/null
    >     +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    >     @@ -0,0 +1,384 @@
    ...
    >     +/* Check spinlock overhead with large number threads.  Critical region is
    >     +   very smmall.  Critical region + spinlock overhead aren't noticeable
    >     +   when number of threads is small.  When thread number increases,
    >     +   spinlock overhead become the bottleneck.  It shows up in wall time
    >     +   of thread execution.  */
    
    yeah, this is not easy to do in a generic way, i think
    even on x86 such measurement is problematic, you don't
    know what goes on a system (or vm) when the glibc test
    is running.
    
    but doing precise timing is not that important for
    checking the correctness of the locks, so i think a
    simplified version can be generic test code.

Ling: It is good idea, we will try to send generic test case in next version.

Thanks
Ling
Carlos O'Donell Jan. 5, 2019, 12:34 p.m. | #9
On 1/3/19 2:58 PM, H.J. Lu wrote:
> +libpthread {
> +  GLIBC_2.29 {
> +    numa_spinlock_alloc;
> +    numa_spinlock_free;
> +    numa_spinlock_init;
> +    numa_spinlock_apply;
> +  }
> +}

Why are we adding these non-standard interfaces to glibc?

The API implementation doesn't rely on any special glibc internals.

It could be implemented as a distinct library, allowed to evolve quickly
in response to customer need, and eventually integrated into glibc if the
API proves stable. A similar model has been setup by Boost and C++ just to
draw some parallels.

I'm not happy to see new APIs like this go directly into glibc without
much more discussion about *why* they have to be in glibc initially.

Just to be clear I have a sustained objection to this new set of APIs
being added to glibc until I can be convinced that they have to go in
glibc.
H.J. Lu Jan. 5, 2019, 4:35 p.m. | #10
On Sat, Jan 5, 2019 at 4:34 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 1/3/19 2:58 PM, H.J. Lu wrote:
> > +libpthread {
> > +  GLIBC_2.29 {
> > +    numa_spinlock_alloc;
> > +    numa_spinlock_free;
> > +    numa_spinlock_init;
> > +    numa_spinlock_apply;
> > +  }
> > +}
>
> Why are we adding these non-standard interfaces to glibc?
>
> The API implementation doesn't rely on any special glibc internals.
>
> It could be implemented as a distinct library, allowed to evolve quickly
> in response to customer need, and eventually integrated into glibc if the
> API proves stable. A similar model has been setup by Boost and C++ just to
> draw some parallels.
>
> I'm not happy to see new APIs like this go directly into glibc without
> much more discussion about *why* they have to be in glibc initially.
>
> Just to be clear I have a sustained objection to this new set of APIs
> being added to glibc until I can be convinced that they have to go in
> glibc.
>

Should glibc have scalable spinlock, in libc.so or a separate shared object?
Or should we tell people that if they want scalable spinlock, they look
elsewhere?
Florian Weimer Jan. 7, 2019, 7:12 p.m. | #11
* H. J. Lu:

> Should glibc have scalable spinlock, in libc.so or a separate shared object?
> Or should we tell people that if they want scalable spinlock, they look
> elsewhere?

I think non-polymorphic, small lock types with scoped locking could make
sense for glibc.

A lock specific to a certain machine architecture seems strange.  We
currently lack any of the kernel NUMA interfaces in glibc, which makes
this stand out even more.

Thanks,
Florian
H.J. Lu Jan. 7, 2019, 7:48 p.m. | #12
On Mon, Jan 7, 2019 at 11:12 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > Should glibc have scalable spinlock, in libc.so or a separate shared object?
> > Or should we tell people that if they want scalable spinlock, they look
> > elsewhere?
>
> I think non-polymorphic, small lock types with scoped locking could make
> sense for glibc.
>
> A lock specific to a certain machine architecture seems strange.  We
> currently lack any of the kernel NUMA interfaces in glibc, which makes
> this stand out even more.
>

And this lack of support doesn't make problems to go away.
Carlos O'Donell Jan. 10, 2019, 4:31 p.m. | #13
On 1/7/19 2:48 PM, H.J. Lu wrote:
> On Mon, Jan 7, 2019 at 11:12 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * H. J. Lu:
>>
>>> Should glibc have scalable spinlock, in libc.so or a separate shared object?
>>> Or should we tell people that if they want scalable spinlock, they look
>>> elsewhere?
>>
>> I think non-polymorphic, small lock types with scoped locking could make
>> sense for glibc.
>>
>> A lock specific to a certain machine architecture seems strange.  We
>> currently lack any of the kernel NUMA interfaces in glibc, which makes
>> this stand out even more.
>>
> 
> And this lack of support doesn't make problems to go away.
 
No. But it is a strong indicator that the solution space hasn't been
explored thoroughly enough for us to provide a long-term stable interface
that will remain useful.

My opinion is that for the health and evolution of a NUMA-aware spinlock
and MCS lock, that we should create a distinct project and library that
should have those locks, and then work to put them into downstream
distributions. This will support key users being able to use supported
versions of those libraries, and give the needed feedback about the API
and the performance. It may take 1-2 years to get that feedback and every
piece of feedback will improve the final API/ABI we put into glibc or
even into the next ISO C standard as pat of the C thread interface.

My objection to the NUMA-aware spinlock API is because I feel we are doing
a disservice to the work by formalizing it and freezing it as part of the
ABI/API that glibc is using.

In fact this NUMA-aware discussion touches on a deeply complex issue,
which is: How do we create/design, and evolve interfaces that we want
to one-day have in stable glibc? But this is a discussion for another
thread. Roland once said he wished we had put every function into it's
own library ;-)

Does this explain in more detail why I don't think it's a good idea to
put these interfaces into glibc?
Florian Weimer Jan. 10, 2019, 4:32 p.m. | #14
* Carlos O'Donell:

> My opinion is that for the health and evolution of a NUMA-aware spinlock
> and MCS lock, that we should create a distinct project and library that
> should have those locks, and then work to put them into downstream
> distributions. This will support key users being able to use supported
> versions of those libraries, and give the needed feedback about the API
> and the performance. It may take 1-2 years to get that feedback and every
> piece of feedback will improve the final API/ABI we put into glibc or
> even into the next ISO C standard as pat of the C thread interface.

I think it's something taht could land in tbb, for which many
distributions already have mechanisms to ship updated versions after a
release.

Thanks,
Florian
Carlos O'Donell Jan. 10, 2019, 4:41 p.m. | #15
On 1/10/19 11:32 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>> and MCS lock, that we should create a distinct project and library that
>> should have those locks, and then work to put them into downstream
>> distributions. This will support key users being able to use supported
>> versions of those libraries, and give the needed feedback about the API
>> and the performance. It may take 1-2 years to get that feedback and every
>> piece of feedback will improve the final API/ABI we put into glibc or
>> even into the next ISO C standard as pat of the C thread interface.
> 
> I think it's something taht could land in tbb, for which many
> distributions already have mechanisms to ship updated versions after a
> release.

Absolutely. That's a great idea.
Szabolcs Nagy Jan. 10, 2019, 5:52 p.m. | #16
On 10/01/2019 16:41, Carlos O'Donell wrote:
> On 1/10/19 11:32 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>>> and MCS lock, that we should create a distinct project and library that
>>> should have those locks, and then work to put them into downstream
>>> distributions. This will support key users being able to use supported
>>> versions of those libraries, and give the needed feedback about the API
>>> and the performance. It may take 1-2 years to get that feedback and every
>>> piece of feedback will improve the final API/ABI we put into glibc or
>>> even into the next ISO C standard as pat of the C thread interface.
>>
>> I think it's something taht could land in tbb, for which many
>> distributions already have mechanisms to ship updated versions after a
>> release.
> 
> Absolutely. That's a great idea.
> 

in principle the pthread_spin_lock api can use this algorithm
assuming we can keep the pthread_spinlock_t abi and keep the
POSIX semantics. (presumably users ran into issues with the
existing posix api.. or how did this come up in the first place?)
Carlos O'Donell Jan. 10, 2019, 7:24 p.m. | #17
On 1/10/19 12:52 PM, Szabolcs Nagy wrote:
> On 10/01/2019 16:41, Carlos O'Donell wrote:
>> On 1/10/19 11:32 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>>>> and MCS lock, that we should create a distinct project and library that
>>>> should have those locks, and then work to put them into downstream
>>>> distributions. This will support key users being able to use supported
>>>> versions of those libraries, and give the needed feedback about the API
>>>> and the performance. It may take 1-2 years to get that feedback and every
>>>> piece of feedback will improve the final API/ABI we put into glibc or
>>>> even into the next ISO C standard as pat of the C thread interface.
>>>
>>> I think it's something taht could land in tbb, for which many
>>> distributions already have mechanisms to ship updated versions after a
>>> release.
>>
>> Absolutely. That's a great idea.
>>
> 
> in principle the pthread_spin_lock api can use this algorithm
> assuming we can keep the pthread_spinlock_t abi and keep the
> POSIX semantics. (presumably users ran into issues with the
> existing posix api.. or how did this come up in the first place?)
 
Correct, but meeting the ABI contract of the pthread_spinlck_t turns
out to be hard, there isn't much space. I've spoken with Kemi Wang 
(Intel) about this specific issue, and he has some ideas to share,
but I'll leave it for him to describe.
kemi Jan. 11, 2019, 11:56 a.m. | #18
On 2019/1/11 上午3:24, Carlos O'Donell wrote:
> On 1/10/19 12:52 PM, Szabolcs Nagy wrote:
>> On 10/01/2019 16:41, Carlos O'Donell wrote:
>>> On 1/10/19 11:32 AM, Florian Weimer wrote:
>>>> * Carlos O'Donell:
>>>>
>>>>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>>>>> and MCS lock, that we should create a distinct project and library that
>>>>> should have those locks, and then work to put them into downstream
>>>>> distributions. This will support key users being able to use supported
>>>>> versions of those libraries, and give the needed feedback about the API
>>>>> and the performance. It may take 1-2 years to get that feedback and every
>>>>> piece of feedback will improve the final API/ABI we put into glibc or
>>>>> even into the next ISO C standard as pat of the C thread interface.
>>>>
>>>> I think it's something taht could land in tbb, for which many
>>>> distributions already have mechanisms to ship updated versions after a
>>>> release.
>>>
>>> Absolutely. That's a great idea.
>>>
>>
>> in principle the pthread_spin_lock api can use this algorithm
>> assuming we can keep the pthread_spinlock_t abi and keep the
>> POSIX semantics. (presumably users ran into issues with the
>> existing posix api.. or how did this come up in the first place?)
>  
> Correct, but meeting the ABI contract of the pthread_spinlck_t turns
> out to be hard, there isn't much space. I've spoken with Kemi Wang 
> (Intel) about this specific issue, and he has some ideas to share,
> but I'll leave it for him to describe.
> 

It may be possible because we can make better use of size of pthread_spinlock_t.

MCS lock is a well known method to reduce spinlock overhead by queuing spinner, the spinlock 
cache line is only contended between spinlock holder and a active spinner, other spinners are
spinning on local-accessible flag until the previous spinner pass mcs lock holder down.

Usually, a classical MCS implementation requires an extra pointer *mcs_lock* to track the tail of queue.
When a new spinner is adding into the queue, we first get the current tail of queue, and move the mcs_lock
pointer to point to this new spinner(a new tail of queue). 
If we can squeeze some space in pthread_spinlock_t to store this tail info, and update this tail info
when a new spinner is added into the queue, then the MCS algorithm can be reimplemented without breaking ABI.
That's possible because *lock* itself don't have to occupy 32 bits (8 bits or even one bit should be enough).

Then the pthread_spinlock_t structure may be like this(Similar to qspinlock in kernel):
struct pthread_spinlock_t
{
   union {
      struct {
         u8 locked; // lock byte
         u8 reserve; 
         u16 cpuid; // CPU id used by the last spinner, and using per-cpu infrastructure to convert it
         a pointer which points to the tail of queue. E.g per_cpu_var(qnode, cpuid)
      }
   int lock;
   }
}

PER-CPU struct qnode {
    struct qnode *next; // point to next spinner
    int flag;  // local spinning flag
}

But they are two problems here.
a) Lack of per-cpu infrastructure support in Glibc, so we can't do this cpuid->per-cpu-variable transition
b) Can't disable preemption at userland. 
   When a new spinner is adding to the queue, we need update the cpuid of pthread_spinlock_t to a new one.
   Pseudo-code:
	newid = get_current_cpuid();  
        prev = atomic_exchange_acquire(&cpuid, newid); // update cpuid to the new cpuid, and return
							 back the previous one
        tail_node = per_cpu_var(qnode, prev);  //get the last tail node of queue

There is a problem when preemption happens at a time window between get_current_cpuid() and atomic_exchange_acquire().
When the thread is rescheduled back, it maybe on another cpu with different cpuid.

===============================CUT HERE==================================
Another way is to store thread-specific info(e.g. tid) in pthread_spinlock_t instead of cpuid, then, we can avoid the
issue b), but it seems that we break the semantic of TLS? Comments?
H.J. Lu Jan. 11, 2019, 4:23 p.m. | #19
On Thu, Jan 10, 2019 at 8:32 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Carlos O'Donell:
>
> > My opinion is that for the health and evolution of a NUMA-aware spinlock
> > and MCS lock, that we should create a distinct project and library that
> > should have those locks, and then work to put them into downstream
> > distributions. This will support key users being able to use supported
> > versions of those libraries, and give the needed feedback about the API
> > and the performance. It may take 1-2 years to get that feedback and every
> > piece of feedback will improve the final API/ABI we put into glibc or
> > even into the next ISO C standard as pat of the C thread interface.
>
> I think it's something taht could land in tbb, for which many
> distributions already have mechanisms to ship updated versions after a
> release.

We will work on a standalone library:

https://gitlab.com/numa-spinlock/numa-spinlock

The implementation is done.  But many pieces are missing:

1. Documentation.
2. Make tests portable for non-x86 platforms.
3. Do we need symbol versioning?
4. ...

We are looking for contributors.

Thanks.
Torvald Riegel Jan. 14, 2019, 10:40 p.m. | #20
On Thu, 2019-01-03 at 12:54 -0800, H.J. Lu wrote:
> On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > 
> > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > From: "ling.ma" <ling.ml@antfin.com>
> > > 
> > > On multi-socket systems, memory is shared across the entire system.
> > > Data access to the local socket is much faster than the remote socket
> > > and data access to the local core is faster than sibling cores on the
> > > same socket.  For serialized workloads with conventional spinlock,
> > > when there is high spinlock contention between threads, lock ping-pong
> > > among sockets becomes the bottleneck and threads spend majority of
> > > their time in spinlock overhead.
> > > 
> > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > are to minimize cross-socket traffic as well as localize the serialized
> > > workload to one core for execution.  The basic principles of NUMA
> > > spinlock are mainly consisted of following approaches, which reduce
> > > data movement and accelerate critical section, eventually give us
> > > significant performance improvement.
> > 
> > I question whether this belongs in glibc. It seems highly application-
> > and kernel-specific. Is there a reason you wouldn't prefer to
> > implement and maintain it in a library for use in the kind of
> > application that needs it?
> 
> This is a good question.  On the other hand,  the current spinlock
> in glibc hasn't been changed for many years.  It doesn't scale for
> today's hardware.  Having a scalable spinlock in glibc is desirable.

I agree the spinlocks need to improve, but let's do first things first: add
proper back-off.  The biggest problem there is find a way to select and
maintain the tuning values that is simple for the glibc developers; there
should be good benchmarks that can be used to automatically check that the
tuning values make sense, and adaptation at runtime would be even better
(if it can be shown to improve performance).

There are also several other synchronization algorithms in glibc where
proper (limited) spinning and back-off would help.  Look for comments such
as "TODO Back-off." throughout the code, in particular the synchronization
code I have rewritten in the past.  And this applies to the normal mutexes
too, obviously.
Torvald Riegel Jan. 14, 2019, 10:45 p.m. | #21
On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote:
> On 1/10/19 11:32 AM, Florian Weimer wrote:
> > * Carlos O'Donell:
> > 
> > > My opinion is that for the health and evolution of a NUMA-aware spinlock
> > > and MCS lock, that we should create a distinct project and library that
> > > should have those locks, and then work to put them into downstream
> > > distributions. This will support key users being able to use supported
> > > versions of those libraries, and give the needed feedback about the API
> > > and the performance. It may take 1-2 years to get that feedback and every
> > > piece of feedback will improve the final API/ABI we put into glibc or
> > > even into the next ISO C standard as pat of the C thread interface.
> > 
> > I think it's something taht could land in tbb, for which many
> > distributions already have mechanisms to ship updated versions after a
> > release.
> 
> Absolutely. That's a great idea.
> 

I don't think tbb is a useful vehicle.  It would require that many
applications use the tbb mutexes, which I doubt is the case.
Torvald Riegel Jan. 14, 2019, 11:03 p.m. | #22
On Sat, 2019-01-05 at 07:34 -0500, Carlos O'Donell wrote:
> On 1/3/19 2:58 PM, H.J. Lu wrote:
> > +libpthread {
> > +  GLIBC_2.29 {
> > +    numa_spinlock_alloc;
> > +    numa_spinlock_free;
> > +    numa_spinlock_init;
> > +    numa_spinlock_apply;
> > +  }
> > +}
> 
> Why are we adding these non-standard interfaces to glibc?

I also think that there shouldn't be a new API added for these.  There
would be the option of adding a new pthread_mutex_t type, and thus even get
some more space for the implementation, but I wouldn't like that either.

What I think we should be doing instead:

(1) Finally add proper spinning and back-off throughout our synchronization
abstractions (spinlocks, mutexes, etc.; see my other comment on this
thread).  This should improve performance significantly, either through
less contention during spinning, or through doing more spinning and thus
less trips through futexes and the kernel.

(2) Change the synchronization abstractions (especially mutexes) so that
they can efficiently run different code paths when they are of non-process-
shared type, and then start with doing things like MCS in the non-process-
shared mutexes.

For both of these, add proper benchmarks so that tuning decisions can be
checked automatically and tested for regressions.  IOW, we want tests for
whether the tuning decisions are still the correct ones (in the future, or
on future HW).

The main reason why I think that's a better approach is that the biggest
return on the investments by the glibc community is improving performance
for as many users as possible, even if it may not be ideal performance for
those unchanged programs.  Adding a special API+implementation instead
might enable somewhat larger performance improvements, but it requires
programs to change and programmers to be aware of it, so will likely remain
a niche case in practice.

> It could be implemented as a distinct library, allowed to evolve quickly
> in response to customer need, and eventually integrated into glibc if the
> API proves stable. A similar model has been setup by Boost and C++ just to
> draw some parallels.

I think first of all, we should try hard to get as much as performance out
of the interfaces, POSIX semantics constraints, and ABI constraints we have
today.  Then maybe change the ABI if it really unlocks further benefits.

If that's not sufficient to get decent performance, then I think the next
venue to look at is C++.  That is, if implementing these ideas to the full
extent is not possible in glibc, go to libstdc++ instead and see what's
possible there.  C++'s synchronization constructs have saner semantics than
POSIX's, and ABI breaks in the future are more likely for C++ than for
glibc.
If the current C++ synchronization constructs have semantics that inhibit
performance, then ISO C++ Study Group 1 will likely want to hear about it. 
And they are a much better group to discuss this than the glibc community
is, simply because they are focused on just parallelism and concurrency.
Torvald Riegel Jan. 14, 2019, 11:18 p.m. | #23
On Thu, 2019-01-03 at 16:21 -0500, Rich Felker wrote:
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > > 
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma" <ling.ml@antfin.com>
> > > > 
> > > > On multi-socket systems, memory is shared across the entire system.
> > > > Data access to the local socket is much faster than the remote socket
> > > > and data access to the local core is faster than sibling cores on the
> > > > same socket.  For serialized workloads with conventional spinlock,
> > > > when there is high spinlock contention between threads, lock ping-pong
> > > > among sockets becomes the bottleneck and threads spend majority of
> > > > their time in spinlock overhead.
> > > > 
> > > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > > are to minimize cross-socket traffic as well as localize the serialized
> > > > workload to one core for execution.  The basic principles of NUMA
> > > > spinlock are mainly consisted of following approaches, which reduce
> > > > data movement and accelerate critical section, eventually give us
> > > > significant performance improvement.
> > > 
> > > I question whether this belongs in glibc. It seems highly application-
> > > and kernel-specific. Is there a reason you wouldn't prefer to
> > > implement and maintain it in a library for use in the kind of
> > > application that needs it?
> > 
> > This is a good question.  On the other hand,  the current spinlock
> > in glibc hasn't been changed for many years.  It doesn't scale for
> > today's hardware.  Having a scalable spinlock in glibc is desirable.
> 
> "Scalable spinlock" is something of an oxymoron.

No, that's not true at all.  Most high-performance shared-memory
synchronization constructs (on typical HW we have today) will do some kind
of spinning (and back-off), and there's nothing wrong about it.  This can
scale very well. 

> Spinlocks are for
> situations where contention is extremely rare,

No, the question is rather whether the program needs blocking through the
OS (for performance, or for semantics such as PI) or not.  Energy may be
another factor.  For example, glibc's current mutexes don't scale well on
short critical because there's not enough spinning being done.

In particular, in cases where there aren't more threads than cores (ie,
what lots of high-performance parallel applications will ensure), it's
better to just spin (and back off) than to eagerly block using the OS.

> since they inherently
> blow up badly under contention.

Did I mention back-off before? ;)

> If this is happening it means you
> wanted a mutex not a spinlock.

It doesn't make that much sense to have different interfaces for those, if
it weren't for PThreads mutexes being dynamically typed, unfortunately.

I believe that a high-performance default lock (C++11 or C11 semantics,
non-process-shared) would beat both our spinlocks and mutex implementations
that we have today.

We could tune the C11 mutex implementation, but the number of users would
still be small in the foreseeable future, I guess.  Tuning libstdc++'s
mutex implementation (so that it doesn't just use nptl mutexes but does
something that's closer to the state of the art) would reach more users.
Torvald Riegel Jan. 14, 2019, 11:26 p.m. | #24
On Wed, 2018-12-26 at 10:50 +0800, Ma Ling wrote:
> 2. Critical Section Integration (CSI)
> Essentially spinlock is similar to that one core complete critical
> sections one by one. So when contention happen, the serialized works
> are sent to the core who is the lock owner and responsible to execute
> them, that can save much time and power, because all shared data are
> located in private cache of the lock owner.

I agree that this can improve performance because of potentially both
increasing data locality for the critical sections themselves and
decreasing contention in the lock.  However, this will mess with thread-
local storage and assumptions about what OS thread a critical section runs
on.

Maybe it's better to first experiment with this change in semantics in C++;
ISO C++ Study Group 1 on parallelism and concurrency is much deeper into
this topic than the glibc community is.  This isn't really a typical lock
anymore when you do that, but rather a special kind of execution service
for small functions; the study group has talked about executors that
execute in guaranteed sequential fashion.
kemi Jan. 15, 2019, 2:28 a.m. | #25
>> "Scalable spinlock" is something of an oxymoron.
> 
> No, that's not true at all.  Most high-performance shared-memory
> synchronization constructs (on typical HW we have today) will do some kind
> of spinning (and back-off), and there's nothing wrong about it.  This can
> scale very well. 
> 
>> Spinlocks are for
>> situations where contention is extremely rare,
> 
> No, the question is rather whether the program needs blocking through the
> OS (for performance, or for semantics such as PI) or not.  Energy may be
> another factor.  For example, glibc's current mutexes don't scale well on
> short critical because there's not enough spinning being done.
> 

yes. That's why we need pthread.mutex.spin_count tunable interface before.
But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
algorithm of current adaptive mutex is the major negative factor which degrades mutex
performance. That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
mutex.

https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html

Also, if with very small critical section in the worklad, this new type of mutex 
with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
much better than original spinlock.

So, in some day, if adaptive mutex is tuned good enough, it should act like
mcs-spinlock (or NUMA spinlock) if workload has small critical section, and
performs like normal mutex if the critical section is too big to spinning-wait.
kemi Jan. 15, 2019, 2:52 a.m. | #26
On 2018/12/26 上午10:50, Ma Ling wrote:
> From: "ling.ma" <ling.ml@antfin.com>
> 
> On multi-socket systems, memory is shared across the entire system.
> Data access to the local socket is much faster than the remote socket
> and data access to the local core is faster than sibling cores on the
> same socket.  For serialized workloads with conventional spinlock,
> when there is high spinlock contention between threads, lock ping-pong
> among sockets becomes the bottleneck and threads spend majority of
> their time in spinlock overhead.
> 
> On multi-socket systems, the keys to our NUMA spinlock performance
> are to minimize cross-socket traffic as well as localize the serialized
> workload to one core for execution.  The basic principles of NUMA
> spinlock are mainly consisted of following approaches, which reduce
> data movement and accelerate critical section, eventually give us
> significant performance improvement.
> 
> 1. MCS spinlock
> MCS spinlock help us to reduce the useless lock movement in the
> spinning state.  This paper provides a good description for this
> kind of lock:

That's not the truth.
No matter generic spinlock(or x86 version) spinlock has used the way of 
test and test_and_set to reduce the useless lock movement in the spinning 
state.

See
glibc/nptl/pthread_spin_lock.c
glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S

What MCS-spinlock really helps is to accelerate lock release and lock acquisition
by reducing lots of cache line bouncing.

> NUMA spinlock can greatly speed up critical section on multi-socket
> systems.  It should improve spinlock performance on all multi-socket
> systems. 
> 

This is out-of-question that NUMA spinlock helps a lot in case of heavy lock
contention. But, we should also propose the data for non-contented case and slight
contended case.

It's expected that extra code complexity may degrade lock performance a bit for
slight contended case, I would like to see the data for that.

Also, the lock starvation would be possible if the running core is always busy with
heavy lock contention. More explanation is expected.
=?UTF-8?B?6ams5YeMKOW9puWGmyk=?= Jan. 15, 2019, 4:28 a.m. | #27
> 1. MCS spinlock
    > MCS spinlock help us to reduce the useless lock movement in the
    > spinning state.  This paper provides a good description for this
    > kind of lock:
    
    That's not the truth.
    No matter generic spinlock(or x86 version) spinlock has used the way of 
    test and test_and_set to reduce the useless lock movement in the spinning 
    state.
    
    See
    glibc/nptl/pthread_spin_lock.c
    glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S
  
    What MCS-spinlock really helps is to accelerate lock release and lock acquisition
    by reducing lots of cache line bouncing.
  Ling: 
	Thanks for your comments.
	1.in spinning stage current generic spinlock will read lock, then lock is stored in private cache among cores with shared state.
     when one core update the lock ,those cores containing the shared cache line will be noted ,invalidated lock then response to the cache controller
	 the messages transmission spend much time, but they can be reduced by mcs lock
	2. atomic_compare_exchange_weak_acquire cause plenty of useless lock movement because only one core can acquire lock 
    
	mcs will transmit the lock one by one with private data, so we say it reduce useless lock movement from above scenario 1 and 2.

    > NUMA spinlock can greatly speed up critical section on multi-socket
    > systems.  It should improve spinlock performance on all multi-socket
    > systems. 
    > 
    
    This is out-of-question that NUMA spinlock helps a lot in case of heavy lock
    contention. But, we should also propose the data for non-contented case and slight
    contended case.

   Ling: we will also present report hashwork like adaptive-mcs patch.
=?UTF-8?B?6ams5YeMKOW9puWGmyk=?= Jan. 15, 2019, 4:48 a.m. | #28
在 2019/1/15 上午7:26,“Torvald Riegel”<triegel@redhat.com> 写入:

    On Wed, 2018-12-26 at 10:50 +0800, Ma Ling wrote:
    > 2. Critical Section Integration (CSI)
    > Essentially spinlock is similar to that one core complete critical
    > sections one by one. So when contention happen, the serialized works
    > are sent to the core who is the lock owner and responsible to execute
    > them, that can save much time and power, because all shared data are
    > located in private cache of the lock owner.
    
    I agree that this can improve performance because of potentially both
    increasing data locality for the critical sections themselves and
    decreasing contention in the lock.  However, this will mess with thread-
    local storage and assumptions about what OS thread a critical section runs
    on.
  Ling: yes, we have to consider it when applying numa spinlock.  
    Maybe it's better to first experiment with this change in semantics in C++;
    ISO C++ Study Group 1 on parallelism and concurrency is much deeper into
    this topic than the glibc community is.  This isn't really a typical lock
    anymore when you do that, but rather a special kind of execution service
    for small functions; the study group has talked about executors that
    execute in guaranteed sequential fashion.
 Ling: thanks for your suggestion, we would like to think about it seriously.
Florian Weimer Jan. 15, 2019, 9:32 a.m. | #29
* Torvald Riegel:

> On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote:
>> On 1/10/19 11:32 AM, Florian Weimer wrote:
>> > * Carlos O'Donell:
>> > 
>> > > My opinion is that for the health and evolution of a NUMA-aware spinlock
>> > > and MCS lock, that we should create a distinct project and library that
>> > > should have those locks, and then work to put them into downstream
>> > > distributions. This will support key users being able to use supported
>> > > versions of those libraries, and give the needed feedback about the API
>> > > and the performance. It may take 1-2 years to get that feedback and every
>> > > piece of feedback will improve the final API/ABI we put into glibc or
>> > > even into the next ISO C standard as pat of the C thread interface.
>> > 
>> > I think it's something taht could land in tbb, for which many
>> > distributions already have mechanisms to ship updated versions after a
>> > release.
>> 
>> Absolutely. That's a great idea.
>> 
>
> I don't think tbb is a useful vehicle.  It would require that many
> applications use the tbb mutexes, which I doubt is the case.

That doesn't really matter because it's a new API anyway.

Thanks,
Florian
Torvald Riegel Jan. 15, 2019, 12:01 p.m. | #30
On Tue, 2019-01-15 at 10:32 +0100, Florian Weimer wrote:
> * Torvald Riegel:
> 
> > On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote:
> > > On 1/10/19 11:32 AM, Florian Weimer wrote:
> > > > * Carlos O'Donell:
> > > > 
> > > > > My opinion is that for the health and evolution of a NUMA-aware spinlock
> > > > > and MCS lock, that we should create a distinct project and library that
> > > > > should have those locks, and then work to put them into downstream
> > > > > distributions. This will support key users being able to use supported
> > > > > versions of those libraries, and give the needed feedback about the API
> > > > > and the performance. It may take 1-2 years to get that feedback and every
> > > > > piece of feedback will improve the final API/ABI we put into glibc or
> > > > > even into the next ISO C standard as pat of the C thread interface.
> > > > 
> > > > I think it's something taht could land in tbb, for which many
> > > > distributions already have mechanisms to ship updated versions after a
> > > > release.
> > > 
> > > Absolutely. That's a great idea.
> > > 
> > 
> > I don't think tbb is a useful vehicle.  It would require that many
> > applications use the tbb mutexes, which I doubt is the case.
> 
> That doesn't really matter because it's a new API anyway.

What I mean is that applications would have to want to use locks provided
by tbb, whether those are locks/mutexes that exist in tbb today or a new
API that would be added.

Put differently, I'm not optimistic about tbb being a good way to get
feedback.
Florian Weimer Jan. 15, 2019, 12:17 p.m. | #31
* Torvald Riegel:

> What I mean is that applications would have to want to use locks provided
> by tbb, whether those are locks/mutexes that exist in tbb today or a new
> API that would be added.
>
> Put differently, I'm not optimistic about tbb being a good way to get
> feedback.

Do you want to run existing workloads with a new mutex implementation?

Then we can't add new flags or change ABI in any way and would have to
use a tunable.  And to get feedback, we would have to make the new
implementation the default, with a tunable to get back the old
implementation.

Thanks,
Florian
Torvald Riegel Jan. 15, 2019, 12:30 p.m. | #32
On Tue, 2019-01-15 at 13:17 +0100, Florian Weimer wrote:
> * Torvald Riegel:
> 
> > What I mean is that applications would have to want to use locks provided
> > by tbb, whether those are locks/mutexes that exist in tbb today or a new
> > API that would be added.
> > 
> > Put differently, I'm not optimistic about tbb being a good way to get
> > feedback.
> 
> Do you want to run existing workloads with a new mutex implementation?

We need to get there.

> Then we can't add new flags or change ABI in any way

Yes.

> and would have to
> use a tunable.  And to get feedback, we would have to make the new
> implementation the default, with a tunable to get back the old
> implementation.

I wouldn't be too concerned to getting back the old implementation, so
maybe we don't even need a tunable right now.  The old implementation is
just no spinning, so the cases where I can imagine the tunable to be useful
is either (1) experimentation to compare performance without using
different glibc's and (2) going back to old behavior in cases where we
really screwed up.  But how many users will have the time to investigate
(2), really?

Nobody should have to tune their spinlocks, or the back-off in mutexes. 
It's our duty to make sure this has good average-case performance.
Torvald Riegel Jan. 15, 2019, 12:36 p.m. | #33
On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > "Scalable spinlock" is something of an oxymoron.
> > 
> > No, that's not true at all.  Most high-performance shared-memory
> > synchronization constructs (on typical HW we have today) will do some kind
> > of spinning (and back-off), and there's nothing wrong about it.  This can
> > scale very well. 
> > 
> > > Spinlocks are for
> > > situations where contention is extremely rare,
> > 
> > No, the question is rather whether the program needs blocking through the
> > OS (for performance, or for semantics such as PI) or not.  Energy may be
> > another factor.  For example, glibc's current mutexes don't scale well on
> > short critical because there's not enough spinning being done.
> > 
> 
> yes. That's why we need pthread.mutex.spin_count tunable interface before.

I don't think we need the tunable interface before that.  Where we need to
improve performance most is for applications that don't want to bother
tuning their mutexes -- that's where the broadest gains are overall, I
think.

In turn, that means that we have spinning and back-off that give good
average-case performance -- whether that's through automatic tuning of
those two things at runtime, or through static default values that we do
regular performance checks for in the glibc community. 

From that perspective, the tunable interface is a nice addition that can
allow users to fine-tune the setting, but it's not how users would enable
it.

> But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> algorithm of current adaptive mutex is the major negative factor which degrades mutex
> performance.

Note that I'm not advocating for focusing on just the adaptive mutex type. 
IMO, adding this type was a mistake because whether to spin or not does not
affect semantics of the mutexes.  Performance hints shouldn't be done via a
mutex' type, and all mutex implementations should consider to spin at least
a little.

If we just do something about the adaptive mutexes, then I guess this will
reach few users.  I believe most applications just don't use them, and the
current implementation of adaptive mutexes is so simplistic that there's
not much performance to be had by changing to adaptive mutexes (which is
another reason for it having few users).

> That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
> mutex.

MCS-style spinning (ie, spinning on memory local to the spinning thread) is
helpful, but I think we should tackle spinning on global memory first (ie,
on a location in the mutex, which is shared by all the threads trying to
acquire it).  Of course, always including back-off.

> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> 
> Also, if with very small critical section in the worklad, this new type of mutex 
> with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
> much better than original spinlock.

I don't think we want to have a new type for that.  It maybe useful for
experimenting with it, but it shouldn't be exposed to users as a stable
interface.

Also, have you experimented with different kinds/settings of exponential
back-off?  I just saw normal spinning in your implementation, no varying
amounts of back-off.  The performance comparison should include back-off
though, as that's one way to work around the contention problems (with a
bigger hammer than local spinning of course, but can be effective
nonetheless, and faster in low-contention cases).

My guess is that a mix of local spinning on memory shared by a few threads
running on cores that are close to each other would perform best (eg,
similar to what's done in flat combining). 

> So, in some day, if adaptive mutex is tuned good enough, it should act like
> mcs-spinlock (or NUMA spinlock) if workload has small critical section, and
> performs like normal mutex if the critical section is too big to spinning-wait.

I agree in some way, but I think that the adaptive mutex type should just
be an alias of the normal mutex type (for API compatibility reasons only). 
And there could be other reasons than just critical-section-size that
determine whether a thread should block using futexes or not.
Rich Felker Jan. 15, 2019, 4:44 p.m. | #34
On Tue, Jan 15, 2019 at 01:36:56PM +0100, Torvald Riegel wrote:
> > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> > algorithm of current adaptive mutex is the major negative factor which degrades mutex
> > performance.
> 
> Note that I'm not advocating for focusing on just the adaptive mutex type. 
> IMO, adding this type was a mistake because whether to spin or not does not
> affect semantics of the mutexes.  Performance hints shouldn't be done via a
> mutex' type, and all mutex implementations should consider to spin at least
> a little.

Strongly agreed that this was a mistake. This same sentiment is why I
don't like the nomenclature "NUMA spinlock". If there are semantic
differences from a normal spinlock, it should be named for the
semantics, not for the form of extreme performance tuning it does. If
there are no semantic differences, NUMA-optimized spinlock
implementation should just be a drop-in replacement for the standard
spinlock API.

Rich
kemi Jan. 17, 2019, 3:05 a.m. | #35
On 2019/1/15 下午8:36, Torvald Riegel wrote:
> On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
>>>> "Scalable spinlock" is something of an oxymoron.
>>>
>>> No, that's not true at all.  Most high-performance shared-memory
>>> synchronization constructs (on typical HW we have today) will do some kind
>>> of spinning (and back-off), and there's nothing wrong about it.  This can
>>> scale very well. 
>>>
>>>> Spinlocks are for
>>>> situations where contention is extremely rare,
>>>
>>> No, the question is rather whether the program needs blocking through the
>>> OS (for performance, or for semantics such as PI) or not.  Energy may be
>>> another factor.  For example, glibc's current mutexes don't scale well on
>>> short critical because there's not enough spinning being done.
>>>
>>
>> yes. That's why we need pthread.mutex.spin_count tunable interface before.
> 
> I don't think we need the tunable interface before that.  Where we need to
> improve performance most is for applications that don't want to bother
> tuning their mutexes -- that's where the broadest gains are overall, I
> think.
> 
> In turn, that means that we have spinning and back-off that give good
> average-case performance -- whether that's through automatic tuning of
> those two things at runtime, or through static default values that we do
> regular performance checks for in the glibc community. 
> 

Spinning and proportional back-off with auto tuning has been proposed for several years
and never got it merged in upstream kernel.
IMHO, this is because MCS-style lock wins that battle.

Could you tell me why we should consider backoff rather than MCS-style lock?

> From that perspective, the tunable interface is a nice addition that can
> allow users to fine-tune the setting, but it's not how users would enable
> it.
> 
>> But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
>> algorithm of current adaptive mutex is the major negative factor which degrades mutex
>> performance.
> 
> Note that I'm not advocating for focusing on just the adaptive mutex type. 
> IMO, adding this type was a mistake because whether to spin or not does not
> affect semantics of the mutexes.  Performance hints shouldn't be done via a
> mutex' type, and all mutex implementations should consider to spin at least
> a little.
> 
> If we just do something about the adaptive mutexes, then I guess this will
> reach few users.  I believe most applications just don't use them, and the
> current implementation of adaptive mutexes is so simplistic that there's
> not much performance to be had by changing to adaptive mutexes (which is
> another reason for it having few users).
> 

Generally, I agree with you.
May we tune adaptive mutex before applying these optimization to normal mutex.

>> That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
>> mutex.
> 
> MCS-style spinning (ie, spinning on memory local to the spinning thread) is
> helpful, but I think we should tackle spinning on global memory first (ie,
> on a location in the mutex, which is shared by all the threads trying to
> acquire it).  Of course, always including back-off.
> 
>> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
>>
>> Also, if with very small critical section in the worklad, this new type of mutex 
>> with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
>> much better than original spinlock.
> 
> I don't think we want to have a new type for that.  It maybe useful for
> experimenting with it, but it shouldn't be exposed to users as a stable
> interface.
> 

I don't like to add a new type either.
As I said in the commit log, that's a trade-off to avoid ABI changed.
I am very glad to see that MCS-style lock can be used gracefully without
introducing a new type.

> Also, have you experimented with different kinds/settings of exponential
> back-off?  I just saw normal spinning in your implementation, no varying
> amounts of back-off.  The performance comparison should include back-off
> though, as that's one way to work around the contention problems (with a
> bigger hammer than local spinning of course, but can be effective
> nonetheless, and faster in low-contention cases).
> 

I didn't try back-off, because we don't have to include it if MCS-style lock is used.

> My guess is that a mix of local spinning on memory shared by a few threads
> running on cores that are close to each other would perform best (eg,
> similar to what's done in flat combining). 
> 
>> So, in some day, if adaptive mutex is tuned good enough, it should act like
>> mcs-spinlock (or NUMA spinlock) if workload has small critical section, and
>> performs like normal mutex if the critical section is too big to spinning-wait.
> 
> I agree in some way, but I think that the adaptive mutex type should just
> be an alias of the normal mutex type (for API compatibility reasons only). 
> And there could be other reasons than just critical-section-size that
> determine whether a thread should block using futexes or not.
> 

Agree.
I am justing moving toward that step by step.
Torvald Riegel Feb. 4, 2019, 5:23 p.m. | #36
On Thu, 2019-01-17 at 11:05 +0800, kemi wrote:
> 
> On 2019/1/15 下午8:36, Torvald Riegel wrote:
> > On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > > > "Scalable spinlock" is something of an oxymoron.
> > > > 
> > > > No, that's not true at all.  Most high-performance shared-memory
> > > > synchronization constructs (on typical HW we have today) will do some kind
> > > > of spinning (and back-off), and there's nothing wrong about it.  This can
> > > > scale very well. 
> > > > 
> > > > > Spinlocks are for
> > > > > situations where contention is extremely rare,
> > > > 
> > > > No, the question is rather whether the program needs blocking through the
> > > > OS (for performance, or for semantics such as PI) or not.  Energy may be
> > > > another factor.  For example, glibc's current mutexes don't scale well on
> > > > short critical because there's not enough spinning being done.
> > > > 
> > > 
> > > yes. That's why we need pthread.mutex.spin_count tunable interface before.
> > 
> > I don't think we need the tunable interface before that.  Where we need to
> > improve performance most is for applications that don't want to bother
> > tuning their mutexes -- that's where the broadest gains are overall, I
> > think.
> > 
> > In turn, that means that we have spinning and back-off that give good
> > average-case performance -- whether that's through automatic tuning of
> > those two things at runtime, or through static default values that we do
> > regular performance checks for in the glibc community. 
> > 
> 
> Spinning and proportional back-off with auto tuning has been proposed for several years
> and never got it merged in upstream kernel.
> IMHO, this is because MCS-style lock wins that battle.
> 
> Could you tell me why we should consider backoff rather than MCS-style lock?

The kernel has no concerns over ABI stability of their locks, AFAIK.  But
pthread_mutex_t is exposed to users, and there is the problem of process-
shared mutexes, so we can't just go to an external list.

Furthermore, if we had proper back-off, we could use it for other
synchronization primitives too (grep for comments containing "back-off").

Doe you have any data that would show that MCS-style locks always win on
current machines?

> > From that perspective, the tunable interface is a nice addition that can
> > allow users to fine-tune the setting, but it's not how users would enable
> > it.
> > 
> > > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> > > algorithm of current adaptive mutex is the major negative factor which degrades mutex
> > > performance.
> > 
> > Note that I'm not advocating for focusing on just the adaptive mutex type. 
> > IMO, adding this type was a mistake because whether to spin or not does not
> > affect semantics of the mutexes.  Performance hints shouldn't be done via a
> > mutex' type, and all mutex implementations should consider to spin at least
> > a little.
> > 
> > If we just do something about the adaptive mutexes, then I guess this will
> > reach few users.  I believe most applications just don't use them, and the
> > current implementation of adaptive mutexes is so simplistic that there's
> > not much performance to be had by changing to adaptive mutexes (which is
> > another reason for it having few users).
> > 
> 
> Generally, I agree with you.
> May we tune adaptive mutex before applying these optimization to normal mutex.

That may be a way to start experimenting with it, as it gives you an
already existing "tunable".  However, I would guess that it won't give you
wide testing because I would guess that most programs don't bother changing
their locks to be of the adaptive type.

> > > That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
> > > mutex.
> > 
> > MCS-style spinning (ie, spinning on memory local to the spinning thread) is
> > helpful, but I think we should tackle spinning on global memory first (ie,
> > on a location in the mutex, which is shared by all the threads trying to
> > acquire it).  Of course, always including back-off.
> > 
> > > https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> > > 
> > > Also, if with very small critical section in the worklad, this new type of mutex 
> > > with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
> > > much better than original spinlock.
> > 
> > I don't think we want to have a new type for that.  It maybe useful for
> > experimenting with it, but it shouldn't be exposed to users as a stable
> > interface.
> > 
> 
> I don't like to add a new type either.

Good.

> As I said in the commit log, that's a trade-off to avoid ABI changed.
> I am very glad to see that MCS-style lock can be used gracefully without
> introducing a new type.

Well, we'd need to introduce MCS-style locks in a way that does not change
the ABI.  If that's possible and done, and performance improves, I don't
see a reason why this shouldn't be accepted.

> > Also, have you experimented with different kinds/settings of exponential
> > back-off?  I just saw normal spinning in your implementation, no varying
> > amounts of back-off.  The performance comparison should include back-off
> > though, as that's one way to work around the contention problems (with a
> > bigger hammer than local spinning of course, but can be effective
> > nonetheless, and faster in low-contention cases).
> > 
> 
> I didn't try back-off, because we don't have to include it if MCS-style lock is used.

But you should at least try to get some decent performance, because it's a
competitor to MCS-style locking.  Even if you can't get the same
performance as MCS in some cases, we still need to know so that we can
assess the pros and cons of choosing MCS.

Also, some initial spinning may be useful even with MCS, I think (eg, short
critical sections with just two threads but enough delay between critical
sections by each thread so that there doesn't need to be a lot of
contention actually).

Patch

diff --git a/NEWS b/NEWS
index cd29ec7..0764d0d 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,9 @@  Version 2.29
 
 Major new features:
 
+* NUMA spinlock is added to provide a spinlock implementation optimized
+  for multi-socket NUMA systems.
+
 * The getcpu wrapper function has been added, which returns the currently
   used CPU and NUMA node.  This function is Linux-specific.
 
diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c
new file mode 100644
index 0000000..ca98443
--- /dev/null
+++ b/manual/examples/numa-spinlock.c
@@ -0,0 +1,99 @@ 
+/* NUMA spinlock example.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or
+   modify it under the terms of the GNU General Public License
+   as published by the Free Software Foundation; either version 2
+   of the License, or (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.
+*/
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <numa-spinlock.h>
+
+#define NUM_THREADS	20
+
+struct numa_spinlock *lock;
+
+struct work_todo_argument
+{
+  void *arg;
+};
+
+static void *
+work_todo (void *v)
+{
+  /* Do the real work with p->arg. */
+  struct work_todo_argument *p = v;
+  /* Return value is set to lock_info.result. */
+  return NULL;
+}
+
+void *
+work_thread (void *arg)
+{
+  struct work_todo_argument work_todo_arg;
+  struct numa_spinlock_info lock_info;
+
+  if (numa_spinlock_init (lock, &lock_info))
+    {
+      printf ("numa_spinlock_init failure: %m\n");
+      exit (1);
+    }
+
+  work_todo_arg.arg = arg;
+  lock_info.argument = &work_todo_arg;
+  lock_info.workload = work_todo;
+
+  numa_spinlock_apply (&lock_info);
+
+  return lock_info.result;
+}
+
+int
+main (int argc, char **argv)
+{
+  lock = numa_spinlock_alloc ();
+  pthread_t thr[NUM_THREADS];
+  void *res[NUM_THREADS];
+  int numthreads = NUM_THREADS;
+  int i;
+
+  for (i = 0; i < NUM_THREADS; i++)
+    {
+      int err_ret = pthread_create (&thr[i], NULL, work_thread,
+				    (void *) (intptr_t) i);
+      if (err_ret != 0)
+	{
+	  printf ("pthread_create failed: %d, %s\n",
+		  i, strerror (i));
+	  numthreads = i;
+	  break;
+	}
+    }
+
+  for (i = 0; i < numthreads; i++)
+    {
+      if (pthread_join (thr[i], (void *) &res[i]) == 0)
+	free (res[i]);
+      else
+	printf ("pthread_join failure: %m\n");
+    }
+
+  numa_spinlock_free (lock);
+
+  return 0;
+}
diff --git a/manual/threads.texi b/manual/threads.texi
index 87fda7d..e82ae0d 100644
--- a/manual/threads.texi
+++ b/manual/threads.texi
@@ -625,6 +625,9 @@  the standard.
 @menu
 * Default Thread Attributes::             Setting default attributes for
 					  threads in a process.
+* NUMA Spinlock::                         Spinlock optimized for
+					  multi-socket NUMA platform.
+* NUMA Spinlock Example::                 A NUMA spinlock example.
 @end menu
 
 @node Default Thread Attributes
@@ -669,6 +672,108 @@  The system does not have sufficient memory.
 @end table
 @end deftypefun
 
+@node NUMA Spinlock
+@subsubsection Spinlock optimized for multi-node NUMA systems
+
+To improve performance on multi-socket NUMA platforms for serialized
+region protected by spinlock, @theglibc{} implements a NUMA spinlock
+object, which minimizes cross-socket traffic and sends the protected
+serialized region to one core for execution to reduce spinlock contention
+overhead.
+
+The fundamental data types for a NUMA spinlock are
+@code{numa_spinlock} and @code{numa_spinlock_info}:
+
+@deftp {Data Type} {struct numa_spinlock}
+@standards{Linux, numa-spinlock.h}
+This data type is an opaque structure.  A @code{numa_spinlock} pointer
+uniquely identifies a NUMA spinlock object.
+@end deftp
+
+@deftp {Data Type} {struct numa_spinlock_info}
+@standards{Linux, numa-spinlock.h}
+
+This data type uniquely identifies a NUMA spinlock information object for
+a thread.  It has the following members, and others internal to NUMA
+spinlock implemenation:
+
+@table @code
+@item void *(*workload) (void *)
+A function pointer to the workload function serialized by spinlock.
+@item void *argument
+A pointer to argument passed to the @var{workload} function pointer.
+@item void *result
+Return value from the @var{workload} function pointer.
+@end table
+
+@end deftp
+
+The following functions are provided for NUMA spinlock objects:
+
+@deftypefun struct numa_spinlock *numa_spinlock_alloc (void)
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
+
+This function returns a pointer to a newly allocated NUMA spinlock or a
+null pointer if the NUMA spinlock could not be allocated, setting
+@code{errno} to @code{ENOMEM}.  Caller should call
+@code{numa_spinlock_free} on the NUMA spinlock pointer to free the
+memory space when it is no longer needed.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
+
+Free the memory space pointed to by @var{lock}, which must have been
+returned by a previous call to @code{numa_spinlock_alloc}.  Otherwise,
+or if @code{numa_spinlock_free (@var{lock})} has already been called
+before, undefined behavior occurs.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock},
+struct numa_spinlock_info *@var{info})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+Initialize the NUMA spinlock information block pointed to by @var{info}
+with a NUMA spinlock pointer @var{lock}.  The return value is @code{0} on
+success and @code{-1} on failure.  The following @code{errno} error
+codes are defined for this function:
+
+@table @code
+@item ENOSYS
+The operating system does not support the @code{getcpu} function.
+@end table
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+Apply for spinlock with a NUMA spinlock information block pointed to by
+@var{info}.  When @code{numa_spinlock_apply} returns, the spinlock is
+released and the @var{result} member of @var{info} contains the return
+value of the @var{workload} member.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@node NUMA Spinlock Example
+@subsubsection NUMA Spinlock Example
+
+A NUMA spinlock example:
+
+@smallexample
+@include numa-spinlock.c.texi
+@end smallexample
+
 @c FIXME these are undocumented:
 @c pthread_atfork
 @c pthread_attr_destroy
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index f827455..3361597 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -222,6 +222,8 @@  CFLAGS-gai.c += -DNEED_NETLINK
 endif
 
 ifeq ($(subdir),nptl)
+libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock
+sysdep_headers += numa-spinlock.h
 tests += tst-align-clone tst-getpid1 \
 	tst-thread-affinity-pthread tst-thread-affinity-pthread2 \
 	tst-thread-affinity-sched
diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
index f1e12d9..7ce7e2b 100644
--- a/sysdeps/unix/sysv/linux/Versions
+++ b/sysdeps/unix/sysv/linux/Versions
@@ -185,3 +185,12 @@  libc {
     __netlink_assert_response;
   }
 }
+
+libpthread {
+  GLIBC_2.29 {
+    numa_spinlock_alloc;
+    numa_spinlock_free;
+    numa_spinlock_init;
+    numa_spinlock_apply;
+  }
+}
diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
index 9a9e4ce..eb54a83 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
index b413007..dd08796 100644
--- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
index af82a4c..45a5c5a 100644
--- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.4 _IO_flockfile F
 GLIBC_2.4 _IO_ftrylockfile F
 GLIBC_2.4 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
index ea4b79a..cf65f72 100644
--- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
@@ -73,6 +73,10 @@  GLIBC_2.29 mtx_timedlock F
 GLIBC_2.29 mtx_trylock F
 GLIBC_2.29 mtx_unlock F
 GLIBC_2.29 nanosleep F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.29 open F
 GLIBC_2.29 open64 F
 GLIBC_2.29 pause F
diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
index bcba07f..a80475f 100644
--- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
index bece86d..40ac05a 100644
--- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
index ccc9449..5b190f6 100644
--- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
index af82a4c..45a5c5a 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.4 _IO_flockfile F
 GLIBC_2.4 _IO_ftrylockfile F
 GLIBC_2.4 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
index bece86d..40ac05a 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
index 5067375..e6539bf 100644
--- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
index 0214496..76edcb8 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
index 0214496..76edcb8 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
index 78cac2a..3141d08 100644
--- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
@@ -241,3 +241,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
new file mode 100644
index 0000000..0f7a3be
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
@@ -0,0 +1,38 @@ 
+/* Internal definitions and declarations for NUMA spinlock.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "numa-spinlock.h"
+
+/* The global NUMA spinlock.  */
+struct numa_spinlock
+{
+  /* List of threads who owns the global NUMA spinlock.  */
+  struct numa_spinlock_info *owner;
+  /* The maximium NUMA node number.  */
+  unsigned int max_node;
+  /* Non-zero for single node system.  */
+  unsigned int single_node;
+  /* The maximium CPU number.  Used only when NUMA is disabled.  */
+  unsigned int max_cpu;
+  /* Array of physical_package_id of each core if it isn't NULL.  Used
+     only when NUMA is disabled.*/
+  unsigned int *physical_package_id_p;
+  /* Arrays of lists of threads who are spinning for the local NUMA lock
+     on NUMA nodes indexed by NUMA node number.  */
+  struct numa_spinlock_info *lists[];
+};
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c
new file mode 100644
index 0000000..a141e7d
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock.c
@@ -0,0 +1,327 @@ 
+/* NUMA spinlock
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+#include <string.h>
+#include <stdlib.h>
+#include <sched.h>
+#ifndef HAVE_GETCPU
+#include <unistd.h>
+#include <syscall.h>
+#endif
+#include <errno.h>
+#include <atomic.h>
+#include "numa-spinlock-private.h"
+
+#if !defined HAVE_GETCPU && defined _LIBC
+# define HAVE_GETCPU
+#endif
+
+/* On multi-socket systems, memory is shared across the entire system.
+   Data access to the local socket is much faster than the remote socket
+   and data access to the local core is faster than sibling cores on the
+   same socket.  For serialized workloads with conventional spinlock,
+   when there is high spinlock contention between threads, lock ping-pong
+   among sockets becomes the bottleneck and threads spend majority of
+   their time in spinlock overhead.
+
+   On multi-socket systems, the keys to our NUMA spinlock performance
+   are to minimize cross-socket traffic as well as localize the serialized
+   workload to one core for execution.  The basic principles of NUMA
+   spinlock are mainly consisted of following approaches, which reduce
+   data movement and accelerate critical section, eventually give us
+   significant performance improvement.
+
+   1. MCS spinlock
+   MCS spinlock help us to reduce the useless lock movement in the
+   spinning state.  This paper provides a good description for this
+   kind of lock:
+   <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
+
+   2. Critical Section Integration (CSI)
+   Essentially spinlock is similar to that one core complete critical
+   sections one by one. So when contention happen, the serialized works
+   are sent to the core who is the lock owner and responsible to execute
+   them, that can save much time and power, because all shared data are
+   located in private cache of the lock owner.
+
+   We implemented this mechanism based on queued spinlock in kernel, that
+   speeds up critical section, and reduces the probability of contention.
+   The paper provides a good description for this kind of lock:
+   <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
+
+   3. NUMA Aware Spinlock (NAS)
+   Currently multi-socket systems give us better performance per watt,
+   however that also involves more complex synchronization requirement,
+   because off-chip data movement is much slower. We use distributed
+   synchronization mechanism to decrease Lock cache line to and from
+   different nodes. The paper provides a good description for this kind
+   of lock:
+   <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
+
+   4. Yield Schedule
+   When threads are applying for Critical Section Integration(CSI) with
+   known contention, they will delegate work to the thread who is the
+   lock owner, and wait for work to be completed.  The resources which
+   they are using should be transferred to other threads. In order to
+   accelerate the scenario, we introduce yield_sched function during
+   spinning stage.
+
+   5. Optimization when NUMA is ON or OFF.
+   Although programs can access memory with lower latency when NUMA is
+   enabled, some programs may need more memory bandwidth for computation
+   with NUMA disabled.  We also optimize multi-socket systems with NUMA
+   disabled.
+
+   NUMA spinlock flow chart (assuming there are 2 CPU nodes):
+
+   1. Threads from node_0/node_1 acquire local lock for node_0/1
+   respectively.  If the thread succeeds in acquiring local lock, it
+   goes to step 2, otherwise pushes critical function into current
+   local work queue, and enters into spinning stage with MCS mode.
+
+   2. Threads from node_0/node_1 acquire the global lock.  If it succeeds
+   in acquiring the global lock as the lock owner, it goes to step 3,
+   otherwise waits until the lock owner thread releases the global lock.
+
+   3. The lock owner thread from node_0/1 enters into critical section,
+   cleans up work queue by performing all local critical functions
+   pushed at step 1 with CSI on behalf of other threads and informs
+   those spinning threads that their works have been done.  It then
+   releases the local lock.
+
+   4. The lock owner thread frees global lock.  If another thread is
+   waiting at step 2, the lock owner thread passes the global lock to
+   the waiting thread and returns.  The new lock owner thread enters
+   into step 3.  If no threads are waiting, the lock owner thread
+   releases the global lock and returns.  The whole critical section
+   process is completed.
+
+   Steps 1 and 2 mitigate global lock contention.  Only one thread
+   from different nodes will compete for the global lock in step 2.
+   Step 3 reduces the global lock & shared data movement because they
+   are located in the same node as well as the same core.  Our data
+   shows that Critical Section Integration (CSI) improves data locality
+   and NUMA-aware spinlock (NAS) helps CSI balance the workload.
+
+   NUMA spinlock can greatly speed up critical section on multi-socket
+   systems.  It should improve spinlock performance on all multi-socket
+   systems.
+
+   NOTE: LiTL <https://github.com/multicore-locks/litl>, is an open-source
+   project that provides implementations of dozens of various locks,
+   including several state-of-the-art NUMA-aware spinlocks.  Among them
+
+   1. Hierarchical MCS (HMCS) spinlock.  Milind Chabbi, Michael Fagan,
+   and John Mellor-Crummey. High Performance Locks for Multi-level NUMA
+   Systems.  In Proceedings of the ACM SIGPLAN Symposium on Principles
+   and Practice of Parallel Programming (PPoPP), pages 215–226, 2015.
+
+   2. Cohort-MCS (C-MCS) spinlock.  Dave Dice, Virendra J. Marathe, and
+   Nir Shavit.  Lock Cohorting: A General Technique for Designing NUMA
+   Locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015.
+ */
+
+/* Get the next thread pointed to by *NEXT_P.  NB: We must use a while
+   spin loop to load NEXT_P since there is a small window before *NEXT_P
+   is updated.  */
+
+static inline struct numa_spinlock_info *
+get_numa_spinlock_info_next (struct numa_spinlock_info **next_p)
+{
+  struct numa_spinlock_info *next;
+  while (!(next = atomic_load_relaxed (next_p)))
+    atomic_spin_nop ();
+  return next;
+}
+
+/* While holding the global NUMA spinlock, run the workload of the
+   thread pointed to by SELF first, then run the workload for each
+   thread on the thread list pointed to by HEAD_P and wake up the
+   thread so that all workloads run on a single processor.  */
+
+static inline void
+run_numa_spinlock (struct numa_spinlock_info *self,
+		   struct numa_spinlock_info **head_p)
+{
+  struct numa_spinlock_info *next, *current;
+
+  /* Run the SELF's workload. */
+  self->result = self->workload (self->argument);
+
+  /* Process workloads for the rest of threads on the thread list.
+     NB: The thread list may be prepended by other threads at the
+     same time.  */
+
+retry:
+   /* If SELF is the first thread of the thread list pointed to by
+      HEAD_P, clear the thread list.  */
+  current = atomic_compare_and_exchange_val_acq (head_p, NULL, self);
+  if (current == self)
+    {
+      /* Since SELF is the only thread on the list, clear SELF's pending
+         field and return.  */
+      atomic_store_release (&current->pending, 0);
+      return;
+    }
+
+  /* CURRENT will have the previous first thread of the thread list
+     pointed to by HEAD_P and *HEAD_P will point to SELF.  */
+  current = atomic_exchange_acquire (head_p, self);
+
+  /* NB: No need to check if CURRENT == SELF here since SELF can never
+     be CURRENT.  */
+
+repeat:
+  /* Get the next thread.  */
+  next = get_numa_spinlock_info_next (&current->next);
+
+  /* Run the CURRENT's workload and clear CURRENT's pending field. */
+  current->result = current->workload (current->argument);
+  current->pending = 0;
+
+  /* Process the workload for each thread from CURRENT to SELF on the
+     thread list.  Don't pass beyond SELF since SELF is the last thread
+     on the list.  */
+  if (next == self)
+    goto retry;
+  current = next;
+  goto repeat;
+}
+
+/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed
+   to by SELF.  */
+
+void
+numa_spinlock_apply (struct numa_spinlock_info *self)
+{
+  struct numa_spinlock *lock = self->lock;
+  struct numa_spinlock_info *first, *next;
+  struct numa_spinlock_info **head_p;
+
+  self->next = NULL;
+  /* We want the global NUMA spinlock.  */
+  self->pending = 1;
+  /* Select the local NUMA spinlock list by the NUMA node number.  */
+  head_p = &lock->lists[self->node];
+  /* FIRST will have the previous first thread of the local NUMA spinlock
+     list and *HEAD_P will point to SELF.  */
+  first = atomic_exchange_acquire (head_p, self);
+  if (first)
+    {
+      /* SELF has been prepended to the thread list pointed to by
+	 HEAD_P.  NB: There is a small window between updating
+	 *HEAD_P and self->next.  */
+      atomic_store_release (&self->next, first);
+      /* Let other threads run first since another thread will run our
+	 workload for us.  */
+      sched_yield ();
+      /* Spin until our PENDING is cleared.  */
+      while (atomic_load_relaxed (&self->pending))
+	atomic_spin_nop ();
+      return;
+    }
+
+  /* NB: Now SELF must be the only thread on the thread list pointed
+     to by HEAD_P.  Since thread is always prepended to HEAD_P, we
+     can use *HEAD_P == SELF to check if SELF is the only thread on
+     the thread list.  */
+
+  if (__glibc_unlikely (lock->single_node))
+    {
+      /* If there is only one node, there is no need for the global
+         NUMA spinlock.  */
+      run_numa_spinlock (self, head_p);
+      return;
+    }
+
+  /* FIRST will have the previous first thread of the local NUMA spinlock
+     list of threads which holds the global NUMA spinlock, which will
+     point to SELF.  */
+  first = atomic_exchange_acquire (&lock->owner, self);
+  if (first)
+    {
+      /* SELF has been prepended to the thread list pointed to by
+	 lock->owner.  NB: There is a small window between updating
+	 *HEAD_P and first->next.  */
+      atomic_store_release (&first->next, self);
+      /* Spin until the list of threads which holds the global NUMA
+	 spinlock clears our PENDING.  */
+      while (atomic_load_relaxed (&self->pending))
+	atomic_spin_nop ();
+    }
+
+  /* We get the global NUMA spinlock now.  Run our workload.  */
+  run_numa_spinlock (self, head_p);
+
+  /* SELF is the only thread on the list if SELF is the first thread
+     of the thread list pointed to by lock->owner.  In this case, we
+     simply return.  */
+  if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self))
+    return;
+
+  /* Wake up the next thread.  */
+  next = get_numa_spinlock_info_next (&self->next);
+  atomic_store_release (&next->pending, 0);
+}
+
+/* Initialize the NUMA spinlock info data pointed to by INFO from a
+   pointer to the NUMA spinlock, LOCK.  */
+
+int
+numa_spinlock_init (struct numa_spinlock *lock,
+			 struct numa_spinlock_info *info)
+{
+  memset (info, 0, sizeof (*info));
+  info->lock = lock;
+  /* For single node system, use 0 as the NUMA node number.  */
+  if (lock->single_node)
+    return 0;
+  /* NB: Use the NUMA node number from getcpu to select the local NUMA
+     spinlock list.  */
+  unsigned int cpu;
+  unsigned int node;
+#ifdef HAVE_GETCPU
+  int err_ret = getcpu (&cpu, &node);
+#else
+  int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL);
+#endif
+  if (err_ret)
+    return err_ret;
+  if (lock->physical_package_id_p)
+    {
+      /* Can it ever happen?  */
+      if (cpu > lock->max_cpu)
+	cpu = lock->max_cpu;
+      /* NB: If NUMA is disabled, use physical_package_id.  */
+      node = lock->physical_package_id_p[cpu];
+    }
+  /* Can it ever happen?  */
+  if (node > lock->max_node)
+    node = lock->max_node;
+  info->node = node;
+  return err_ret;
+}
+
+void
+numa_spinlock_free (struct numa_spinlock *lock)
+{
+  if (lock->physical_package_id_p)
+    free (lock->physical_package_id_p);
+  free (lock);
+}
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h
new file mode 100644
index 0000000..b17bda5
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock.h
@@ -0,0 +1,64 @@ 
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _NUMA_SPINLOCK_H
+#define _NUMA_SPINLOCK_H
+
+#include <features.h>
+
+__BEGIN_DECLS
+
+/* The NUMA spinlock.  */
+struct numa_spinlock;
+
+/* The NUMA spinlock information for each thread.  */
+struct numa_spinlock_info
+{
+  /* The workload function of this thread.  */
+  void *(*workload) (void *);
+  /* The argument pointer passed to the workload function.  */
+  void *argument;
+  /* The return value of the workload function.  */
+  void *result;
+  /* The pointer to the NUMA spinlock.  */
+  struct numa_spinlock *lock;
+  /* The next thread on the local NUMA spinlock thread list.  */
+  struct numa_spinlock_info *next;
+  /* The NUMA node number.  */
+  unsigned int node;
+  /* Non-zero to indicate that the thread wants the NUMA spinlock.  */
+  int pending;
+  /* Reserved for future use.  */
+  void *__reserved[4];
+};
+
+/* Return a pointer to a newly allocated NUMA spinlock.  */
+extern struct numa_spinlock *numa_spinlock_alloc (void);
+
+/* Free the memory space of the NUMA spinlock.  */
+extern void numa_spinlock_free (struct numa_spinlock *);
+
+/* Initialize the NUMA spinlock information block.  */
+extern int numa_spinlock_init (struct numa_spinlock *,
+			       struct numa_spinlock_info *);
+
+/* Apply for spinlock with a NUMA spinlock information block.  */
+extern void numa_spinlock_apply (struct numa_spinlock_info *);
+
+__END_DECLS
+
+#endif /* numa-spinlock.h */
diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
new file mode 100644
index 0000000..8ff4e1a
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
@@ -0,0 +1,304 @@ 
+/* Initialization of NUMA spinlock.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <ctype.h>
+#include <string.h>
+#include <dirent.h>
+#include <stdio.h>
+#include <limits.h>
+#ifdef _LIBC
+# include <not-cancel.h>
+#else
+# include <stdlib.h>
+# include <unistd.h>
+# include <fcntl.h>
+# define __open_nocancel		open
+# define __close_nocancel_nostatus	close
+# define __read_nocancel		read
+#endif
+
+#include "numa-spinlock-private.h"
+
+static char *
+next_line (int fd, char *const buffer, char **cp, char **re,
+	   char *const buffer_end)
+{
+  char *res = *cp;
+  char *nl = memchr (*cp, '\n', *re - *cp);
+  if (nl == NULL)
+    {
+      if (*cp != buffer)
+	{
+	  if (*re == buffer_end)
+	    {
+	      memmove (buffer, *cp, *re - *cp);
+	      *re = buffer + (*re - *cp);
+	      *cp = buffer;
+
+	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
+	      if (n < 0)
+		return NULL;
+
+	      *re += n;
+
+	      nl = memchr (*cp, '\n', *re - *cp);
+	      while (nl == NULL && *re == buffer_end)
+		{
+		  /* Truncate too long lines.  */
+		  *re = buffer + 3 * (buffer_end - buffer) / 4;
+		  n = __read_nocancel (fd, *re, buffer_end - *re);
+		  if (n < 0)
+		    return NULL;
+
+		  nl = memchr (*re, '\n', n);
+		  **re = '\n';
+		  *re += n;
+		}
+	    }
+	  else
+	    nl = memchr (*cp, '\n', *re - *cp);
+
+	  res = *cp;
+	}
+
+      if (nl == NULL)
+	nl = *re - 1;
+    }
+
+  *cp = nl + 1;
+  assert (*cp <= *re);
+
+  return res == *re ? NULL : res;
+}
+
+static int
+select_cpu (const struct dirent *d)
+{
+  /* Return 1 for "cpuXXX" where XXX are digits.  */
+  if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0)
+    {
+      const char *p = d->d_name + 3;
+
+      if (*p == '\0')
+	return 0;
+
+      do
+	{
+	  if (!isdigit (*p))
+	    return 0;
+	  p++;
+	}
+      while (*p != '\0');
+
+      return 1;
+    }
+  return 0;
+}
+
+/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
+   call numa_spinlock_free on the NUMA spinlock pointer to free the
+   memory when it is no longer needed.  */
+
+struct numa_spinlock *
+numa_spinlock_alloc (void)
+{
+  const size_t buffer_size = 1024;
+  char buffer[buffer_size];
+  char *buffer_end = buffer + buffer_size;
+  char *cp = buffer_end;
+  char *re = buffer_end;
+
+  const int flags = O_RDONLY | O_CLOEXEC;
+  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
+  char *l;
+  unsigned int max_node = 0;
+  unsigned int node_count = 0;
+  if (fd != -1)
+    {
+      l = next_line (fd, buffer, &cp, &re, buffer_end);
+      if (l != NULL)
+	do
+	  {
+	    char *endp;
+	    unsigned long int n = strtoul (l, &endp, 10);
+	    if (l == endp)
+	      {
+		node_count = 1;
+		break;
+	      }
+
+	    unsigned long int m = n;
+	    if (*endp == '-')
+	      {
+		l = endp + 1;
+		m = strtoul (l, &endp, 10);
+		if (l == endp)
+		  {
+		    node_count = 1;
+		    break;
+		  }
+	      }
+
+	    node_count += m - n + 1;
+
+	    if (m >= max_node)
+	      max_node = m;
+
+	    l = endp;
+	    while (l < re && isspace (*l))
+	      ++l;
+	  }
+	while (l < re);
+
+      __close_nocancel_nostatus (fd);
+    }
+
+  /* NB: Some NUMA nodes may not be available, if the number of NUMA
+     nodes is 1, set the maximium NUMA node number to 0.  */
+  if (node_count == 1)
+    max_node = 0;
+
+  unsigned int max_cpu = 0;
+  unsigned int *physical_package_id_p = NULL;
+
+  if (max_node == 0)
+    {
+      /* There is at least 1 node.  */
+      node_count = 1;
+
+      /* If NUMA is disabled, use physical_package_id instead.  */
+      struct dirent **cpu_list;
+      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
+			    select_cpu, NULL);
+      if (nprocs > 0)
+	{
+	  int i;
+	  unsigned int *cpu_id_p = NULL;
+
+	  /* Find the maximum CPU number.  */
+	  if (posix_memalign ((void **) &cpu_id_p,
+			      __alignof__ (void *),
+			      nprocs * sizeof (unsigned int)) == 0)
+	    {
+	      for (i = 0; i < nprocs; i++)
+		{
+		  unsigned int cpu_id
+		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
+		  cpu_id_p[i] = cpu_id;
+		  if (cpu_id > max_cpu)
+		    max_cpu = cpu_id;
+		}
+
+	      if (posix_memalign ((void **) &physical_package_id_p,
+				  __alignof__ (void *),
+				  ((max_cpu + 1)
+				   * sizeof (unsigned int))) == 0)
+		{
+		  memset (physical_package_id_p, 0,
+			  ((max_cpu + 1) * sizeof (unsigned int)));
+
+		  max_node = UINT_MAX;
+
+		  /* Get physical_package_id.  */
+		  char path[(sizeof ("/sys/devices/system/cpu")
+			     + 3 * sizeof (unsigned long int)
+			     + sizeof ("/topology/physical_package_id"))];
+		  for (i = 0; i < nprocs; i++)
+		    {
+		      struct dirent *d = cpu_list[i];
+		      if (snprintf (path, sizeof (path),
+				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
+				    d->d_name) > 0)
+			{
+			  fd = __open_nocancel (path, flags);
+			  if (fd != -1)
+			    {
+			      if (__read_nocancel (fd, buffer,
+						   buffer_size) > 0)
+				{
+				  char *endp;
+				  unsigned long int package_id
+				    = strtoul (buffer, &endp, 10);
+				  if (package_id != ULONG_MAX
+				      && *buffer != '\0'
+				      && (*endp == '\0' || *endp == '\n'))
+				    {
+				      physical_package_id_p[cpu_id_p[i]]
+					= package_id;
+				      if (max_node == UINT_MAX)
+					{
+					  /* This is the first node.  */
+					  max_node = package_id;
+					}
+				      else if (package_id != max_node)
+					{
+					  /* NB: We only need to know if
+					     NODE_COUNT > 1.  */
+					  node_count = 2;
+					  if (package_id > max_node)
+					    max_node = package_id;
+					}
+				    }
+				}
+			      __close_nocancel_nostatus (fd);
+			    }
+			}
+
+		      free (d);
+		    }
+		}
+
+	      free (cpu_id_p);
+	    }
+	  else
+	    {
+	      for (i = 0; i < nprocs; i++)
+		free (cpu_list[i]);
+	    }
+
+	  free (cpu_list);
+	}
+    }
+
+  if (physical_package_id_p != NULL && node_count == 1)
+    {
+      /* There is only one node.  No need for physical_package_id_p.  */
+      free (physical_package_id_p);
+      physical_package_id_p = NULL;
+      max_cpu = 0;
+    }
+
+  /* Allocate an array of struct numa_spinlock_info pointers to hold info
+     for all NUMA nodes with NUMA node number from getcpu () as index.  */
+  size_t size = (sizeof (struct numa_spinlock)
+		 + ((max_node + 1)
+		    * sizeof (struct numa_spinlock_info *)));
+  struct numa_spinlock *lock;
+  if (posix_memalign ((void **) &lock,
+		      __alignof__ (struct numa_spinlock_info *), size))
+    return NULL;
+  memset (lock, 0, size);
+
+  lock->max_node = max_node;
+  lock->single_node = node_count == 1;
+  lock->max_cpu = max_cpu;
+  lock->physical_package_id_p = physical_package_id_p;
+
+  return lock;
+}
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
index 09e8447..dba7df6 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
index 8300958..a763c0a 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3 _IO_flockfile F
 GLIBC_2.3 _IO_ftrylockfile F
 GLIBC_2.3 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
index 9a9e4ce..eb54a83 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
index c370fda..366fcac 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
@@ -235,3 +235,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
index d05468f..786d8e1 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
@@ -229,6 +229,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
index e8161aa..dd7c52f 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
@@ -221,6 +221,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
index bcba07f..a80475f 100644
--- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
index b413007..dd08796 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
index ccc9449..5b190f6 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/x86/Makefile b/sysdeps/unix/sysv/linux/x86/Makefile
index 02ca36c..29d41ad 100644
--- a/sysdeps/unix/sysv/linux/x86/Makefile
+++ b/sysdeps/unix/sysv/linux/x86/Makefile
@@ -14,6 +14,7 @@  endif
 ifeq ($(subdir),nptl)
 libpthread-sysdep_routines += elision-lock elision-unlock elision-timed \
 			      elision-trylock
+xtests += tst-variable-overhead tst-numa-variable-overhead
 CFLAGS-elision-lock.c += -mrtm
 CFLAGS-elision-unlock.c += -mrtm
 CFLAGS-elision-timed.c += -mrtm
diff --git a/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
new file mode 100644
index 0000000..7cb8542
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
@@ -0,0 +1,53 @@ 
+/* Test case for NUMA spinlock overhead.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include "numa-spinlock.h"
+
+struct numa_spinlock *lock;
+
+struct work_todo_argument
+{
+  unsigned long *v1;
+  unsigned long *v2;
+  unsigned long *v3;
+  unsigned long *v4;
+};
+
+static void *
+work_todo (void *v)
+{
+  struct work_todo_argument *p = v;
+  unsigned long ret;
+  *p->v1 = *p->v1 + 1;
+  *p->v2 = *p->v2 + 1;
+  ret = __sync_val_compare_and_swap (p->v4, 0, 1);
+  *p->v3 = *p->v3 + ret;
+  return (void *) 2;
+}
+
+static inline void
+do_work (struct numa_spinlock_info *lock_info)
+{
+  numa_spinlock_apply (lock_info);
+}
+
+#define USE_NUMA_SPINLOCK
+#include "tst-variable-overhead-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
new file mode 100644
index 0000000..4b83dfb
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
@@ -0,0 +1,384 @@ 
+/* Test case skeleton for spinlock overhead.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Check spinlock overhead with large number threads.  Critical region is
+   very smmall.  Critical region + spinlock overhead aren't noticeable
+   when number of threads is small.  When thread number increases,
+   spinlock overhead become the bottleneck.  It shows up in wall time
+   of thread execution.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include <config.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <sys/time.h>
+#include <sys/param.h>
+#include <errno.h>
+#ifdef MODULE_NAME
+# include <cpu-features.h>
+# include <support/test-driver.h>
+#endif
+
+#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP
+# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1
+#endif
+
+#define memory_barrier() __asm ("" ::: "memory")
+#define pause() __asm  ("rep ; nop" ::: "memory")
+
+#define CACHELINE_SIZE	64
+#define CACHE_ALIGNED	__attribute__((aligned(CACHELINE_SIZE)))
+
+#define constant_time 5
+unsigned long g_val CACHE_ALIGNED;
+unsigned long g_val2 CACHE_ALIGNED;
+unsigned long g_val3 CACHE_ALIGNED;
+unsigned long cmplock CACHE_ALIGNED;
+struct count
+{
+  unsigned long long total;
+  unsigned long long spinlock;
+  unsigned long long wall;
+} __attribute__((aligned(128)));
+
+struct count *gcount;
+
+/* The time consumed by one update is about 200 TSCs.  */
+static int delay_time_unlocked = 400;
+
+struct ops
+{
+  void *(*test) (void *arg);
+  void (*print_thread) (void *res, int);
+} *ops;
+
+struct stats_result
+{
+  unsigned long num;
+};
+
+void *work_thread (void *arg);
+
+#define iterations (10000 * 5)
+
+static volatile int start_thread;
+
+/* Delay some fixed time */
+static void
+delay_tsc (unsigned n)
+{
+  unsigned long long start, current, diff;
+  unsigned int aux;
+  start = __builtin_ia32_rdtscp (&aux);
+  while (1)
+    {
+      current = __builtin_ia32_rdtscp (&aux);
+      diff = current - start;
+      if (diff < n)
+	pause ();
+      else
+	break;
+    }
+}
+
+static void
+wait_a_bit (int delay_time)
+{
+  if (delay_time > 0)
+    delay_tsc (delay_time);
+}
+
+#ifndef USE_NUMA_SPINLOCK
+static inline void
+work_todo (void)
+{
+  unsigned long ret;
+  g_val = g_val + 1;
+  g_val2 = g_val2 + 1;
+  ret = __sync_val_compare_and_swap (&cmplock, 0, 1);
+  g_val3 = g_val3 + 1 + ret;
+}
+#endif
+
+void *
+work_thread (void *arg)
+{
+  long i;
+  unsigned long pid = (unsigned long) arg;
+  struct stats_result *res;
+  unsigned long long start, end;
+  int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE,
+				roundup (sizeof (*res), CACHELINE_SIZE));
+  if (err_ret)
+    {
+      printf ("posix_memalign failure: %s\n", strerror (err_ret));
+      exit (err_ret);
+    }
+  long num = 0;
+
+#ifdef USE_NUMA_SPINLOCK
+  struct work_todo_argument work_todo_arg;
+  struct numa_spinlock_info lock_info;
+
+  if (numa_spinlock_init (lock, &lock_info))
+    {
+      printf ("numa_spinlock_init failure: %m\n");
+      exit (1);
+    }
+
+  work_todo_arg.v1 = &g_val;
+  work_todo_arg.v2 = &g_val2;
+  work_todo_arg.v3 = &g_val3;
+  work_todo_arg.v4 = &cmplock;
+  lock_info.argument = &work_todo_arg;
+  lock_info.workload = work_todo;
+#endif
+
+  while (!start_thread)
+    pause ();
+
+  unsigned int aux;
+  start = __builtin_ia32_rdtscp (&aux);
+  for (i = 0; i < iterations; i++)
+    {
+#ifdef USE_NUMA_SPINLOCK
+      do_work (&lock_info);
+#else
+      do_work ();
+#endif
+      wait_a_bit (delay_time_unlocked);
+      num++;
+    }
+  end = __builtin_ia32_rdtscp (&aux);
+  gcount[pid].total = end - start;
+  res->num = num;
+
+  return res;
+}
+
+void
+init_global_data(void)
+{
+  g_val = 0;
+  g_val2 = 0;
+  g_val3 = 0;
+  cmplock = 0;
+}
+
+void
+test_threads (int numthreads, int numprocs, unsigned long time)
+{
+  start_thread = 0;
+
+#ifdef USE_NUMA_SPINLOCK
+  lock = numa_spinlock_alloc ();
+#endif
+
+  memory_barrier ();
+
+  pthread_t thr[numthreads];
+  void *res[numthreads];
+  int i;
+
+  init_global_data ();
+  for (i = 0; i < numthreads; i++)
+    {
+      pthread_attr_t attr;
+      const pthread_attr_t *attrp = NULL;
+      if (USE_PTHREAD_ATTR_SETAFFINITY_NP)
+	{
+	  attrp = &attr;
+	  pthread_attr_init (&attr);
+	  cpu_set_t set;
+	  CPU_ZERO (&set);
+	  int cpu = i % numprocs;
+	  (void) CPU_SET (cpu, &set);
+	  pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set);
+	}
+      int err_ret = pthread_create (&thr[i], attrp, ops->test,
+				    (void *)(uintptr_t) i);
+      if (err_ret != 0)
+	{
+	  printf ("pthread_create failed: %d, %s\n",
+		  i, strerror (i));
+	  numthreads = i;
+	  break;
+	}
+    }
+
+  memory_barrier ();
+  start_thread = 1;
+  memory_barrier ();
+  sched_yield ();
+
+  if (time)
+    {
+      struct timespec ts =
+	{
+	  ts.tv_sec = time,
+	  ts.tv_nsec = 0
+	};
+      clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL);
+      memory_barrier ();
+    }
+
+  for (i = 0; i < numthreads; i++)
+    {
+      if (pthread_join (thr[i], (void *) &res[i]) == 0)
+	free (res[i]);
+      else
+	printf ("pthread_join failure: %m\n");
+    }
+
+#ifdef USE_NUMA_SPINLOCK
+  numa_spinlock_free (lock);
+#endif
+}
+
+struct ops hashwork_ops =
+{
+  .test = work_thread,
+};
+
+struct ops *ops;
+
+static struct count
+total_cost (int numthreads, int numprocs)
+{
+  int i;
+  unsigned long long total = 0;
+  unsigned long long spinlock = 0;
+
+  memset (gcount, 0, sizeof(gcount[0]) * numthreads);
+
+  unsigned long long start, end, diff;
+  unsigned int aux;
+
+  start = __builtin_ia32_rdtscp (&aux);
+  test_threads (numthreads, numprocs, constant_time);
+  end = __builtin_ia32_rdtscp (&aux);
+  diff = end - start;
+
+  for (i = 0; i < numthreads; i++)
+    {
+      total += gcount[i].total;
+      spinlock += gcount[i].spinlock;
+    }
+
+  struct count cost = { total, spinlock, diff };
+  return cost;
+}
+
+#ifdef MODULE_NAME
+static int
+do_test (void)
+{
+  if (!CPU_FEATURE_USABLE (RDTSCP))
+    return EXIT_UNSUPPORTED;
+#else
+int
+main (void)
+{
+#endif
+  int numprocs = sysconf (_SC_NPROCESSORS_ONLN);
+
+  /* Oversubscribe CPU.  */
+  int numthreads = 4 * numprocs;
+
+  ops = &hashwork_ops;
+
+  int err_ret = posix_memalign ((void **)&gcount, 4096,
+				sizeof(gcount[0]) * numthreads);
+  if (err_ret)
+    {
+      printf ("posix_memalign failure: %s\n", strerror (err_ret));
+      exit (err_ret);
+    }
+
+  struct count cost, cost1;
+  double overhead;
+  int i, last;
+  int last_increment = numprocs < 16 ? 16 : numprocs;
+  int numprocs_done = 0;
+  int numprocs_reset = 0;
+  cost1 = total_cost (1, numprocs);
+
+  printf ("Number of processors: %d, Single thread time %lld\n\n",
+	  numprocs, cost1.total);
+
+  for (last = i = 2; i <= numthreads;)
+    {
+      last = i;
+      cost = total_cost (i, numprocs);
+      overhead = cost.total;
+      overhead /= i;
+      overhead /= cost1.total;
+      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
+	      i, cost.total, overhead);
+      if ((i * 2) < numprocs)
+	i = i * 2;
+      else if (numprocs_done)
+	{
+	  if (numprocs_reset)
+	    {
+	      i = numprocs_reset;
+	      numprocs_reset = 0;
+	    }
+	  else
+	    {
+	      if ((i * 2) < numthreads)
+		i = i * 2;
+	      else
+		i = i + last_increment;
+	    }
+	}
+      else
+	{
+	  if (numprocs != 2 * i)
+	    numprocs_reset = 2 * i;
+	  i = numprocs;
+	  numprocs_done = 1;
+	}
+    }
+
+  if (last != numthreads)
+    {
+      i = numthreads;
+      cost = total_cost (i, numprocs);
+      overhead = cost.total;
+      overhead /= i;
+      overhead /= cost1.total;
+      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
+	      i, cost.total, overhead);
+    }
+
+  free (gcount);
+  return 0;
+}
+
+#ifdef MODULE_NAME
+# define TIMEOUT 900
+# include <support/test-driver.c>
+#endif
diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
new file mode 100644
index 0000000..b3ce567
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
@@ -0,0 +1,47 @@ 
+/* Test case for spinlock overhead.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include <pthread.h>
+
+struct
+{
+  pthread_spinlock_t testlock;
+  char pad[64 - sizeof (pthread_spinlock_t)];
+} test __attribute__((aligned(64)));
+
+static void
+__attribute__((constructor))
+init_spin (void)
+{
+  pthread_spin_init (&test.testlock, 0);
+}
+
+static void work_todo (void);
+
+static inline void
+do_work (void)
+{
+  pthread_spin_lock(&test.testlock);
+  work_todo ();
+  pthread_spin_unlock(&test.testlock);
+}
+
+#include "tst-variable-overhead-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
index 931c827..e90532e 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
index c09c9b0..c74febb 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F