From patchwork Fri Jul 6 07:50:09 2018
X-Patchwork-Submitter: kemi
X-Patchwork-Id: 940311
From: Kemi Wang <kemi.wang@intel.com>
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell,
    Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey,
    Kemi Wang
Subject: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning
Date: Fri, 6 Jul 2018 15:50:09 +0800
Message-Id: <1530863409-326-2-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1530863409-326-1-git-send-email-kemi.wang@intel.com>
References: <1530863409-326-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block.  But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (cmpxchg)
while the lock is contended.  That is a bad idea on many targets, because
the atomic read-modify-write forces expensive memory synchronization among
processors and penalizes other running threads.  For example, it
constantly floods the system with "read for ownership" requests, which are
much more expensive to process than a single read.  Thus, as suggested by
Andi Kleen, we spin with plain memory-order-relaxed reads until we observe
that the lock is no longer held, and only then retry the trylock.
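To illustrate the idea (this sketch is not part of the patch; it uses
standalone C11 atomics rather than glibc's internal atomic macros):

  #include <stdatomic.h>

  /* Test-and-set: every iteration issues an atomic read-modify-write
     ("read for ownership"), bouncing the cache line among spinners.  */
  static void
  spin_tas (atomic_int *lock)
  {
    while (atomic_exchange_explicit (lock, 1, memory_order_acquire) != 0)
      ;
  }

  /* Test-and-test-and-set: wait with cheap relaxed loads, which keep
     the cache line in shared state; retry the read-modify-write only
     after observing the lock free.  This is the behavior the patch
     gives the adaptive mutex spin loop.  */
  static void
  spin_ttas (atomic_int *lock)
  {
    for (;;)
      {
        while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
          ;
        if (atomic_exchange_explicit (lock, 1, memory_order_acquire) == 0)
          break;
      }
  }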
Performance impact: this should bring some benefit in scenarios with
severe lock contention on many architectures (a significant performance
improvement is not expected), and whole-system performance can benefit
because spinning no longer stresses the cache system with a stream of
"read for ownership" requests that broadcast cache line invalidations.
Meanwhile, there may be a tiny regression in lock handover when the lock
is acquired via spinning, because the lock state is now checked before
attempting the trylock.  A similar mechanism is already implemented for
pthread spin locks.

Test machine: 2-socket Skylake platform, 112 cores, 62 GB RAM.

Test case: multiple threads contend for an adaptive spin mutex.  Each
thread binds to an individual CPU core and loops over the following:
1) lock; 2) spend about 50 nanoseconds (~1 pause instruction on Skylake)
in the critical section; 3) unlock; 4) spend 500 nanoseconds in the
non-critical section.  The loop runs for 15 seconds, and lock performance
is measured by the total number of iterations.  Then the critical section
is enlarged and the test repeated.  A sketch of the per-thread loop
follows; the results are shown after it.
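The per-thread measurement loop is roughly the following (a minimal
sketch of the described workload, not the actual test harness; struct
ctx and delay_ns are hypothetical names):

  #include <pthread.h>
  #include <stdatomic.h>

  struct ctx                     /* hypothetical harness context */
  {
    pthread_mutex_t *mutex;      /* of type PTHREAD_MUTEX_ADAPTIVE_NP */
    atomic_bool *stop;           /* set by the main thread after 15 s */
    unsigned long iterations;    /* the reported metric */
  };

  extern void delay_ns (int ns); /* hypothetical calibrated busy-wait */

  static void *
  worker (void *arg)             /* one thread, pinned to its own core */
  {
    struct ctx *c = arg;
    while (!atomic_load (c->stop))
      {
        pthread_mutex_lock (c->mutex);
        delay_ns (50);           /* ~50 ns critical section */
        pthread_mutex_unlock (c->mutex);
        delay_ns (500);          /* 500 ns non-critical section */
        c->iterations++;
      }
    return NULL;
  }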
Test results:

+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       |  % Change  |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
|1 thread        |   2.76681e+07   |   2.7965e+07    |   +1.1%    |
|2 threads       |   3.29905e+07   |   3.55279e+07   |   +7.7%    |
|3 threads       |   4.38102e+07   |   3.98567e+07   |   -9.0%    |
|4 threads       |   1.72172e+07   |   2.09498e+07   |   +21.7%   |
|28 threads      |   1.03732e+07   |   1.05133e+07   |   +1.4%    |
|56 threads      |   1.06308e+07   |   5.06522e+07   |   +14.6%   |
|112 threads     |   8.55177e+06   |   1.02954e+07   |   +20.4%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
|1 thread        |   1.57006e+07   |   1.54727e+07   |   -1.5%    |
|2 threads       |   1.8044e+07    |   1.75601e+07   |   -2.7%    |
|3 threads       |   1.35634e+07   |   1.46384e+07   |   +7.9%    |
|4 threads       |   1.21257e+07   |   1.32046e+07   |   +8.9%    |
|28 threads      |   8.09593e+06   |   1.02713e+07   |   +26.9%   |
|56 threads      |   9.09907e+06   |   4.16203e+07   |   +16.4%   |
|112 threads     |   7.09731e+06   |   8.62406e+06   |   +21.5%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 100x          |
+----------------+------------------------------------------------+
|1 thread        |   2.87116e+06   |   2.89188e+06   |   +0.7%    |
|2 threads       |   2.23409e+06   |   2.24216e+06   |   +0.4%    |
|3 threads       |   2.29888e+06   |   2.29964e+06   |   +0.0%    |
|4 threads       |   2.26898e+06   |   2.21394e+06   |   -2.4%    |
|28 threads      |   1.03228e+06   |   1.0051e+06    |   -2.6%    |
|56 threads      |   1.02953e+06   |   1.6344e+07    |   -2.3%    |
|112 threads     |   1.01615e+06   |   1.00134e+06   |   -1.5%    |
+----------------+------------------------------------------------+
|                |           Critical section size: 1000x         |
+----------------+------------------------------------------------+
|1 thread        |     316392      |     315635      |   -0.2%    |
|2 threads       |     302806      |     303469      |   +0.2%    |
|3 threads       |     298506      |     294281      |   -1.4%    |
|4 threads       |     292037      |     289945      |   -0.7%    |
|28 threads      |     155188      |     155250      |   +0.0%    |
|56 threads      |     190657      |     183106      |   -4.0%    |
|112 threads     |     210818      |     220342      |   +4.5%    |
+----------------+-----------------+-----------------+------------+

    * nptl/pthread_mutex_lock.c: Use the architecture-specific atomic
      spin API.
    * nptl/pthread_mutex_timedlock.c: Likewise.
    * nptl/pthread_spinlock.h: New file.
    * sysdeps/unix/sysv/linux/x86/pthread_spinlock.h: New file.

ChangeLog:
    V6->V7: a) Patch refined by H.J. Lu.
    V5->V6: no change.
    V4->V5: a) Make the optimization work for pthread_mutex_timedlock ()
               on x86.
            b) Move the READ_ONLY_SPIN macro definition from this patch
               to the first patch, which adds the glibc.mutex.spin_count
               tunable entry.
    V3->V4: a) Make the optimization opt-in, enabled by default on x86,
               as suggested by Florian Weimer.
    V2->V3: a) Drop the idea of blocking spinners that fail to acquire
               the lock, since it would not be a universal winner.
               E.g. when several threads contend for a lock protecting a
               small critical section, probably any thread can acquire
               the lock via spinning.
            b) Fix format issues.
    V1->V2: Fix a format issue.

Suggested-by: Andi Kleen
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
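With this patch applied, the adaptive spin path in __pthread_mutex_lock ()
effectively becomes the loop below (paraphrased from the patched source;
LLL_MUTEX_LOCK, LLL_MUTEX_TRYLOCK and max_adaptive_count () are existing
glibc internals):

  int cnt = 0;
  int max_cnt = MIN (max_adaptive_count (),
                     mutex->__data.__spins * 2 + 10);
  do
    {
      if (cnt++ >= max_cnt)
        {
          LLL_MUTEX_LOCK (mutex);  /* Spin budget exhausted: block.  */
          break;
        }
      /* Previously a single atomic_spin_nop ().  Now: wait with
         read-only loads until the lock is observed free or the budget
         runs out, charging the polled iterations against cnt.  */
      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);

Note that the generic nptl/pthread_spinlock.h below keeps the old
behavior (one atomic_spin_nop () per iteration, no early exit), so only
targets with a sysdeps override, x86 here, get the read-only wait.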
---
 nptl/pthread_mutex_lock.c                      |  3 ++-
 nptl/pthread_mutex_timedlock.c                 |  4 ++--
 nptl/pthread_spinlock.h                        | 23 +++++++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h | 31 ++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 nptl/pthread_spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c910ec4 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -25,6 +25,7 @@
 #include "pthreadP.h"
 #include <atomic.h>
 #include <lowlevellock.h>
+#include <pthread_spinlock.h>
 #include <stap-probe.h>
 
 #ifndef lll_lock_elision
@@ -133,7 +134,7 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 28237b0..2ede5a0 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,7 +25,7 @@
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
-
+#include <pthread_spinlock.h>
 #include <stap-probe.h>
 
 #ifndef lll_timedlock_elision
@@ -126,7 +126,7 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 					  PTHREAD_MUTEX_PSHARED (mutex));
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (lll_trylock (mutex->__data.__lock) != 0);
 
diff --git a/nptl/pthread_spinlock.h b/nptl/pthread_spinlock.h
new file mode 100644
index 0000000..8bd7c16
--- /dev/null
+++ b/nptl/pthread_spinlock.h
@@ -0,0 +1,23 @@
+/* Functions for pthread_spinlock_t.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  atomic_spin_nop ();
+}
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
new file mode 100644
index 0000000..5ca84d1
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
@@ -0,0 +1,31 @@
+/* Functions for pthread_spinlock_t.  X86 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  int val = 0;
+  int cnt = *cnt_p;
+  do
+    {
+      atomic_spin_nop ();
+      val = atomic_load_relaxed (lock);
+    }
+  while (val != 0 && ++cnt < max_cnt);
+  *cnt_p = cnt;
+}