From patchwork Fri Aug 17 08:10:32 2018
X-Patchwork-Submitter: kemi
X-Patchwork-Id: 958686
From: Kemi Wang <kemi.wang@intel.com>
To: Glibc alpha <libc-alpha@sourceware.org>
Cc: Kemi Wang <kemi.wang@intel.com>
Subject: [RESEND PATCH] Mutex: Use test and cmpxchg instead of cmpxchg while spinning
Date: Fri, 17 Aug 2018 16:10:32 +0800
Message-Id: <1534493432-20242-2-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1534493432-20242-1-git-send-email-kemi.wang@intel.com>
References: <1534493432-20242-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock with cmpxchg rather than
test and cmpxchg. The former is very unfriendly to the uncore because it
constantly floods the system with "read for ownership" (RFO) requests,
which are much more expensive to process than plain reads.

============================Some Background===============================
On the Intel x86 architecture, memory coherence between cores is handled
by the "uncore" (roughly, the logic outside the CPU cores but residing on
the same die). Cores can ask the uncore to "read" or "read for ownership"
a cache line.
If the request is "read", the cache line generally winds up marked SHARED
in all caches that hold a copy, or EXCLUSIVE in the requesting core if no
one else had it before. If the request is "read for ownership", any other
copies are invalidated (or written back and then invalidated if MODIFIED)
and the requesting core gets the line in EXCLUSIVE state.

When a read for ownership comes to the uncore, all copies in other caches
must be invalidated. If there is only one other copy, it may be MODIFIED
and will have to be written back before being invalidated. The uncore does
this by sending "snoops" to the affected cores, on the same or other
sockets, and waiting for the answers.

When a read comes to the uncore and more than one copy already exists in
other cores, adding another copy does not require a snoop. If there is
only one other copy, a snoop is required to find out what state it is in
and to force a transition to the SHARED state.

A lock cmpxchg prevents any other core from modifying the line between the
read phase of the cmpxchg and the (potential) write phase. The way this is
done is to "read for ownership" the line, which makes it EXCLUSIVE; the
core then defers responding to snoops from the uncore until the write
phase is complete. This prevents any other core from acquiring the
necessary exclusive access to the cache line until the instruction is
complete. With this background, we can see the benefit of test and cmpxchg
over plain cmpxchg when implementing locks.

============================Performance Impact============================
Now think about the case where a lock is used alternately by two cores. At
the conclusion of an unlock, the lock's cache line is in MODIFIED state in
core A, which did the unlock. If core B tries for the lock while core A is
holding it, core B takes the cache line away from core A and only then
finds out that the lock is held. Core B then spins, testing the lock over
and over and making no progress until core A finally releases it. The
release is expensive because core A has to issue an RFO to get the line
back, then hold off core B's snoops until it can release the lock. Core B
then grabs the line back and successfully acquires the lock.

In the case of two contending cores, lock cmpxchg is not too bad. The real
problem happens with three or more contenders: every time one of the
contenders attempts to take the lock with a lock cmpxchg, a flurry of RFO
transactions happens.

Generally, test and test-and-set is better. To acquire the lock, you first
"test" with an ordinary MOV and only when the lock appears free do you try
the lock cmpxchg. The MOV puts the line into SHARED state, and everyone
waiting can poll locally in their own caches without causing any uncore
traffic. When core A finally unlocks, the uncore has to invalidate
everyone, then grant ownership to A for the unlock. Then all the other
cores pile on again and one of them wins.

Now, let's look at the impact of test and cmpxchg on the adaptive mutex.
With no or little lock contention, there is almost no performance
difference between the two approaches, because the lock is usually
acquired immediately in the fast path. With slight lock contention, where
the lock is acquired either immediately or by spinning, test and cmpxchg
performs better than plain cmpxchg even though it does one extra test.
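To make the two spinning styles concrete, here is a small standalone
sketch. It is not part of this patch and does not use glibc's internal
atomic macros; the C11 atomics and the function names are purely
illustrative.

/* Standalone illustration of the two spinning styles.  */
#include <stdatomic.h>

static atomic_int lock;			/* 0 = free, 1 = held.  */

/* Plain cmpxchg spinning: every failed attempt issues an RFO that steals
   the cache line from the lock holder.  */
void
spin_cmpxchg (void)
{
  int expected;
  do
    expected = 0;
  while (!atomic_compare_exchange_weak (&lock, &expected, 1));
}

/* Test and cmpxchg: poll with a plain relaxed load, which keeps the line
   SHARED in the waiter's own cache, and only issue the cmpxchg (an RFO)
   once the lock has been observed free.  */
void
spin_test_cmpxchg (void)
{
  for (;;)
    {
      while (atomic_load_explicit (&lock, memory_order_relaxed) != 0)
        ;				/* No uncore traffic while waiting.  */
      int expected = 0;
      if (atomic_compare_exchange_weak (&lock, &expected, 1))
        return;
    }
}

int
main (void)
{
  spin_cmpxchg ();			/* Take the lock.  */
  atomic_store (&lock, 0);		/* Release it.  */
  spin_test_cmpxchg ();
  atomic_store (&lock, 0);
  return 0;
}

In the first loop every failed attempt generates an RFO; in the second,
waiters generate an RFO only at the moment the lock is observed free.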
The improvement in the slight-contention case is probably because, for
both lock acquisition (which decodes to a lock cmpxchg) and lock release
(which decodes to a lock dec), the uncore snoops originated by the
requesting core's RFOs can be answered immediately, since fewer snoops are
in flight.

For the case of severe lock contention (e.g. a large number of contending
threads), the lock performance improvement may not be obvious even though
the number of RFOs is significantly reduced. This is because the behavior
of the adaptive mutex changes: some lock acquisitions shift from spinning
to sleeping and being woken up, and lock performance is then dominated by
the overhead of the futex syscall and the context switch. Even so, test
and cmpxchg still has value, because it reduces unnecessary uncore traffic
in the whole system.

================================Testing===================================
Test Machine: Intel 2-socket Skylake platform (56 cores, 62G RAM)

Methodology: Let the size of the critical section be *s* and the size of
the non-critical section be *t*, with t = k*s. Each iteration then takes
s + t = (k+1)*s and makes one lock attempt, so, in the absence of
contention, a single core tries to acquire the lock at a rate of 1/(k+1)
(one attempt per (k+1)*s of time). We assume *n* threads contend for the
lock; each thread is bound to its own CPU core and does the following in a
loop until it reaches 100,000 iterations:
 1) lock
 2) spend *s* nanoseconds in the critical section
 3) unlock
 4) spend *t* nanoseconds in the non-critical section
Performance is measured by the number of RFOs and by the average latency
of lock acquisition and lock release.

*Note*: latency is measured in CPU cycles with the help of the RDTSCP
instruction. The start timestamp is taken before calling lock/unlock and
the end timestamp once lock/unlock returns; the delta gives the latency of
lock acquisition and lock release, respectively. Invalid samples (e.g.
when an interrupt is handled on that core during the lock/unlock call) are
discarded from the statistics.

To emulate different usage scenarios, we let k=6 and s=200ns and run this
workload with the two spinning methods. In our workload, 1-5 threads
contending for the lock emulate little or no lock contention, 6-15 threads
emulate slight lock contention, and 16-56 threads emulate severe lock
contention across sockets (the benchmark is provided by Andi Kleen).

Test Command: perf stat -a -e offcore_requests.demand_rfo ./run_workload

Result: Generally, test and cmpxchg performs consistently better than
cmpxchg under lock contention; see details below.

For the case with little or no lock contention, there is no obvious
difference in lock performance between the two approaches.

For the case with slight lock contention (using 8 threads as an example),
test and cmpxchg performs better than cmpxchg: the average latency of lock
acquisition is reduced from 7501 cycles to 5396 cycles (-28.1%), the
average latency of lock release is reduced from 1704 cycles to 1095 cycles
(-35.7%), and the total number of RFOs is reduced from 22239643 to
12484358 (-43.9%).

For the case with severe lock contention (using 56 threads as an example),
test and cmpxchg again performs better: the average latency of lock
acquisition is reduced from 136787 cycles to 106050 cycles (-22.5%), the
average latency of lock release is reduced from 57787 cycles to 49895
cycles (-13.7%), and the total number of RFOs is reduced from 317483283 to
215756961 (-32.0%).
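For reference, here is a minimal sketch of the per-thread loop described
in the methodology above. It is a reconstruction under assumptions, not
the actual benchmark provided by Andi Kleen: the busy-wait helper spend
(), the adaptive-mutex setup, and the per-thread reporting are
illustrative, CPU-affinity handling is omitted, and the RFO count comes
from perf as shown in the test command.

#define _GNU_SOURCE			/* for PTHREAD_MUTEX_ADAPTIVE_NP  */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <x86intrin.h>			/* __rdtscp  */

#define ITERS  100000
#define S_NSEC 200			/* critical section length, s  */
#define K      6			/* t = k * s  */

static pthread_mutex_t mutex;

/* Busy-wait for roughly NSEC nanoseconds.  */
static void
spend (long nsec)
{
  struct timespec start, now;
  clock_gettime (CLOCK_MONOTONIC, &start);
  do
    clock_gettime (CLOCK_MONOTONIC, &now);
  while ((now.tv_sec - start.tv_sec) * 1000000000L
	 + (now.tv_nsec - start.tv_nsec) < nsec);
}

static void *
worker (void *arg)
{
  uint64_t lock_cycles = 0, unlock_cycles = 0;
  unsigned int aux;

  for (int i = 0; i < ITERS; i++)
    {
      uint64_t t0 = __rdtscp (&aux);
      pthread_mutex_lock (&mutex);
      lock_cycles += __rdtscp (&aux) - t0;	/* acquisition latency  */

      spend (S_NSEC);				/* critical section  */

      uint64_t t1 = __rdtscp (&aux);
      pthread_mutex_unlock (&mutex);
      unlock_cycles += __rdtscp (&aux) - t1;	/* release latency  */

      spend ((long) K * S_NSEC);		/* non-critical section  */
    }
  printf ("avg lock %llu cycles, avg unlock %llu cycles\n",
	  (unsigned long long) (lock_cycles / ITERS),
	  (unsigned long long) (unlock_cycles / ITERS));
  return arg;
}

int
main (int argc, char **argv)
{
  int nthreads = argc > 1 ? atoi (argv[1]) : 8;
  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&mutex, &attr);

  pthread_t tids[nthreads];
  for (int i = 0; i < nthreads; i++)
    pthread_create (&tids[i], NULL, worker, NULL);
  for (int i = 0; i < nthreads; i++)
    pthread_join (tids[i], NULL);
  return 0;
}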
Many thanks to H.J. Lu for helping refine this patch, and to Lawrence
Stewart for explaining these "lock cmpxchg" matters to me in detail.

* nptl/pthread_mutex_lock.c: Use architecture-specific atomic spin API.
* nptl/pthread_mutex_timedlock.c: Likewise.
* nptl/pthread_spinlock.h: New file.
* sysdeps/unix/sysv/linux/x86/pthread_spinlock.h: New file.

Suggested-by: Andi Kleen
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c                      |  3 ++-
 nptl/pthread_mutex_timedlock.c                 |  4 ++--
 nptl/pthread_spinlock.h                        | 23 +++++++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h | 31 ++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 nptl/pthread_spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c910ec4 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -25,6 +25,7 @@
 #include "pthreadP.h"
 #include
 #include
+#include
 #include

 #ifndef lll_lock_elision
@@ -133,7 +134,7 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);

diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 28237b0..2ede5a0 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,7 +25,7 @@
 #include
 #include
 #include
-
+#include
 #include

 #ifndef lll_timedlock_elision
@@ -126,7 +126,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 				  PTHREAD_MUTEX_PSHARED (mutex));
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (lll_trylock (mutex->__data.__lock) != 0);

diff --git a/nptl/pthread_spinlock.h b/nptl/pthread_spinlock.h
new file mode 100644
index 0000000..8bd7c16
--- /dev/null
+++ b/nptl/pthread_spinlock.h
@@ -0,0 +1,23 @@
+/* Functions for pthread_spinlock_t.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  atomic_spin_nop ();
+}
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
new file mode 100644
index 0000000..5ca84d1
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
@@ -0,0 +1,31 @@
+/* Functions for pthread_spinlock_t.  X86 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  int val = 0;
+  int cnt = *cnt_p;
+  do
+    {
+      atomic_spin_nop ();
+      val = atomic_load_relaxed (lock);
+    }
+  while (val != 0 && ++cnt < max_cnt);
+  *cnt_p = cnt;
+}