From patchwork Wed Apr 25 02:56:28 2018
From: Kemi Wang
To: Adhemerval Zanella, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey, Kemi Wang
Subject: [PATCH v2 3/3] Mutex: Optimize adaptive spin algorithm
Date: Wed, 25 Apr 2018 10:56:28 +0800
Message-Id: <1524624988-29141-3-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1524624988-29141-1-git-send-email-kemi.wang@intel.com>
References: <1524624988-29141-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block.  In the current implementation of
spinning, however, the spinners go straight back to LLL_MUTEX_TRYLOCK
(a cmpxchg) while the lock is contended.  That is a poor choice on many
targets, because the atomic read-modify-write forces expensive memory
synchronization among processors and penalizes other running threads.
For example, it constantly floods the system with "read for ownership"
requests, which are much more expensive to process than a plain read.
Thus, we spin using only an MO (relaxed) read until we observe that the
lock is no longer held, as suggested by Andi Kleen.

Furthermore, it is usually pointless to keep spinning on the lock after
failing to acquire it at the moment it was observed to be available:
under severe lock contention, the spinner is unlikely to win the lock
later in the same spin window.
Therefore, it is better to call into the kernel to block the thread, as
suggested by Tim Chen.  The benefit is at least threefold: a) CPU time
is saved; b) power budget is saved; c) the overhead of cache line
bouncing during spinning is reduced.

Test machine: 2-socket Skylake platform, 112 cores, 62G RAM

Test case: mutex-adaptive-thread (contended pthread adaptive spin mutex
with global update)

Usage: make bench BENCHSET=mutex-adaptive-thread

Test result:
+----------------+-----------------+-----------------+------------+
| Configuration  | Base            | Head            | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+------------------------------------------------+
|                   Critical section size: 1x                     |
+----------------+------------------------------------------------+
|1 thread        | 7.06542e+08     | 7.08998e+08     | +0.3%      |
|2 threads       | 5.73018e+07     | 7.20815e+07     | +25.6%     |
|3 threads       | 3.78511e+07     | 1.15544e+08     | +205.3%    |
|4 threads       | 2.28214e+07     | 6.57055e+07     | +187.9%    |
|28 threads      | 1.68839e+07     | 5.19314e+07     | +207.6%    |
|56 threads      | 1.84983e+07     | 5.06522e+07     | +173.8%    |
|112 threads     | 2.3568e+07      | 4.95375e+07     | +110.2%    |
+----------------+------------------------------------------------+
|                   Critical section size: 10x                    |
+----------------+------------------------------------------------+
|1 thread        | 5.40274e+08     | 5.47189e+08     | +1.3%      |
|2 threads       | 4.55684e+07     | 6.03275e+07     | +32.4%     |
|3 threads       | 3.05702e+07     | 1.04035e+08     | +240.3%    |
|4 threads       | 2.17341e+07     | 5.57264e+07     | +156.4%    |
|28 threads      | 1.39503e+07     | 4.53525e+07     | +225.1%    |
|56 threads      | 1.50154e+07     | 4.16203e+07     | +177.2%    |
|112 threads     | 1.90175e+07     | 3.88308e+07     | +104.2%    |
+----------------+------------------------------------------------+
|                   Critical section size: 100x                   |
+----------------+------------------------------------------------+
|1 thread        | 7.23372e+07     | 7.25654e+07     | +0.3%      |
|2 threads       | 2.67302e+07     | 2.40265e+07     | -10.1%     |
|3 threads       | 1.89936e+07     | 2.70759e+07     | +42.6%     |
|4 threads       | 1.62423e+07     | 2.25097e+07     | +38.6%     |
|28 threads      | 9.85977e+06     | 1.59003e+07     | +61.3%     |
|56 threads      | 8.11471e+06     | 1.6344e+07      | +101.4%    |
|112 threads     | 8.58044e+06     | 1.53827e+07     | +79.3%     |
+----------------+------------------------------------------------+
|                   Critical section size: 1000x                  |
+----------------+------------------------------------------------+
|1 thread        | 8.16913e+06     | 8.16126e+06     | -0.1%      |
|2 threads       | 5.82987e+06     | 5.92752e+06     | +1.7%      |
|3 threads       | 6.05125e+06     | 6.37068e+06     | +5.3%      |
|4 threads       | 5.91259e+06     | 6.27616e+06     | +6.1%      |
|28 threads      | 2.40584e+06     | 2.60738e+06     | +8.4%      |
|56 threads      | 2.32643e+06     | 2.3245e+06      | -0.1%      |
|112 threads     | 2.32366e+06     | 2.30271e+06     | -0.9%      |
+----------------+-----------------+-----------------+------------+

	* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.

ChangeLog:
	V1->V2: fix format issue

Suggested-by: Andi Kleen
Suggested-by: Tim Chen
Signed-off-by: Kemi Wang
---
 ChangeLog                 |  4 ++++
 nptl/pthread_mutex_lock.c | 32 ++++++++++++++++++--------------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 76d2628..4c81693 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2018-04-24  Kemi Wang

+	* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex.
+
+2018-04-24  Kemi Wang
+
 	* benchtests/bench-mutex-adaptive-thread.c: Microbenchmark for
 	adaptive spin mutex.
 	* benchmark/Makefile: Add adaptive spin mutex benchmark.
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..3442c58 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include

 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private) ({ \
@@ -124,21 +125,24 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
 	{
 	  int cnt = 0;
-	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
-			     mutex->__data.__spins * 2 + 10);
-	  do
-	    {
-	      if (cnt++ >= max_cnt)
-		{
-		  LLL_MUTEX_LOCK (mutex);
-		  break;
-		}
-	      atomic_spin_nop ();
-	    }
-	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
+	  int max_cnt = MIN (__mutex_aconf.spin_count,
+			     mutex->__data.__spins * 2 + 100);
+
+	  /* MO read while spinning */
+	  do
+	    {
+	      atomic_spin_nop ();
+	    }
+	  while (atomic_load_relaxed (&mutex->__data.__lock) != 0 &&
+		 ++cnt < max_cnt);
+	  /* Try to acquire the lock if lock is available or the spin count
+	   * is run out, call into kernel to block if fails
+	   */
+	  if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+	    LLL_MUTEX_LOCK (mutex);

-	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
-	}
+	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+	}
       assert (mutex->__data.__owner == 0);
     }
   else