From patchwork Mon Jul 2 08:11:52 2018
X-Patchwork-Submitter: kemi
X-Patchwork-Id: 937719
From: Kemi Wang
To: Adhemerval Zanella, Florian Weimer, Rical Jason, Carlos Donell, Glibc alpha
Cc: Dave Hansen, Tim Chen, Andi Kleen, Ying Huang, Aaron Lu, Lu Aubrey, Kemi Wang
Subject: [RFC 0/4] Add a new mutex type PTHREAD_MUTEX_QUEUESPINNER_NP
Date: Mon, 2 Jul 2018 16:11:52 +0800
Message-Id: <1530519116-13103-1-git-send-email-kemi.wang@intel.com>

The pthread adaptive mutex is designed around the observation that, in most
cases, a lock is released after only a few CPU cycles. This is especially
true for well-behaved application code with a short, highly contended
critical section. In that situation, spinning on the lock for a while,
instead of immediately calling into the kernel to block, helps to improve
system performance.
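For reference, the adaptive spin in current glibc looks roughly like the
following. This is a simplified sketch of the PTHREAD_MUTEX_ADAPTIVE_NP
path in nptl/pthread_mutex_lock.c, not the exact code:

/* Sketch of the existing adaptive spin (simplified).  Each trylock is a
   cmpxchg on the shared lock word, which is what generates the
   read-for-ownership traffic discussed below.  */
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
  {
    int cnt = 0;
    int max_cnt = MIN (max_adaptive_count (),
                       mutex->__data.__spins * 2 + 10);
    do
      {
        if (cnt++ >= max_cnt)
          {
            /* Spun long enough; give up and block via futex.  */
            LLL_MUTEX_LOCK (mutex);
            break;
          }
        atomic_spin_nop ();
      }
    while (LLL_MUTEX_TRYLOCK (mutex) != 0);

    /* Adapt the next spin budget to how long we actually spun.  */
    mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
  }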
But the current adaptive mutex has two main problems. The first is
fairness: multiple spinners contend for the lock simultaneously, and there
is no guarantee that a spinner will ever acquire the lock, no matter how
long it has been waiting. The other is heavy cache line bouncing: the
cache line containing the lock is shared among all of the spinners, and
when the lock is released, every spinner tries to acquire it via a cmpxchg
instruction, constantly flooding the system with read-for-ownership
requests. As a result, there is a lot of cache line bouncing on a large
system with many CPUs.

To address these problems, this series queues mutex spinners with an MCS
lock, following the mutex implementation in the Linux kernel, and
introduces a new mutex type, PTHREAD_MUTEX_QUEUESPINNER_NP (a minimal
sketch of the MCS queuing idea appears below, after the test summary).
Compared to the adaptive mutex (read-only while spinning), the test
results on a 2-socket Intel Skylake platform show a significant
performance improvement (see the first patch for details).

Although queuing spinners with an MCS lock improves adaptive mutex
performance when multiple threads contend for a lock, people will probably
want to know how the queue spinner mutex performs compared to the other
widely known locking disciplines, the pthread spin lock and the normal
pthread mutex. The details of the test results are in the first patch; in
summary:

a) With little lock contention, the spin lock performs best, the queue
spinner mutex performs similarly to the adaptive mutex, and both perform a
little better than the normal pthread mutex.

b) With severe lock contention on a large number of CPUs and a small
critical section (less than 1000ns), most lock acquisitions succeed via
spinning, and the queue spinner mutex performs much better than the spin
lock and the adaptive mutex. This is because the overhead of heavy cache
line bouncing dominates lock performance in this regime.

c) As the critical section grows, the performance advantage of the queue
spinner mutex gradually shrinks: cache line bouncing stops being the
bottleneck, and the overhead of futex_wait and futex_wake dominates
instead. By the time the critical section reaches 1ms, even the latency of
the futex syscalls is negligible compared to the total time of lock
acquisition.

As we can see above, the queue spinner mutex performs well across these
workloads, but using it carries a potential risk: the lock may be handed
to the next spinner in the queue while that spinner is not running (its
CPU has been scheduled to run another task). The remaining spinners then
have to keep waiting in the queue behind it, which can collapse lock
performance. To emulate this case, we ran two identical processes
simultaneously; each process has 28 threads, and each thread pins itself
to an individual CPU according to its thread id, so CPUs 0~27 are each
subscribed by two threads. In the worst case (s=1000ns, t=6000ns), lock
performance drops by 58.1% (2205245->924263). Therefore, the queue spinner
mutex should be reserved for applications that pursue fairness and
performance without oversubscribing CPU resources, e.g. containers in
public cloud infrastructure.
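As promised above, here is a minimal sketch of the MCS queuing idea, for
illustration only; the actual implementation in this series is
nptl/mcs_lock.c, and the names below are made up. Each spinner enqueues
its own node and spins on a flag in that node, so the spin loop touches a
private cache line instead of hammering the shared lock word, and lock
handover is FIFO, which is also what provides the fairness guarantee:

struct mcs_node
{
  struct mcs_node *next;
  int locked;
};

static void
mcs_acquire (struct mcs_node **tail, struct mcs_node *me)
{
  me->next = NULL;
  me->locked = 0;

  /* Join the queue by swapping ourselves into the tail pointer.  */
  struct mcs_node *prev = __atomic_exchange_n (tail, me, __ATOMIC_ACQ_REL);
  if (prev == NULL)
    return;			/* Queue was empty: we hold the lock.  */

  __atomic_store_n (&prev->next, me, __ATOMIC_RELEASE);
  /* Spin locally until our predecessor hands the lock over.  */
  while (__atomic_load_n (&me->locked, __ATOMIC_ACQUIRE) == 0)
    ;				/* Ideally a pause/cpu_relax here.  */
}

static void
mcs_release (struct mcs_node **tail, struct mcs_node *me)
{
  struct mcs_node *next = __atomic_load_n (&me->next, __ATOMIC_ACQUIRE);
  if (next == NULL)
    {
      /* No visible successor: if we are still the tail, empty the
	 queue and return.  */
      struct mcs_node *expected = me;
      if (__atomic_compare_exchange_n (tail, &expected, NULL, 0,
				       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
	return;
      /* A successor is mid-enqueue; wait for it to link itself.  */
      while ((next = __atomic_load_n (&me->next, __ATOMIC_ACQUIRE)) == NULL)
	;
    }
  /* Hand the lock directly to the next spinner in FIFO order.  */
  __atomic_store_n (&next->locked, 1, __ATOMIC_RELEASE);
}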
Kemi Wang (4):
  Mutex: Queue spinners to reduce cache line bouncing and ensure
    fairness
  Mutex: add unit tests for new type
  BENCHMARK: add a benchmark for testing new type of mutex
  Manual: Add manual for pthread mutex

 benchtests/Makefile                          |  4 +-
 benchtests/bench-mutex-adaptive-thread.c     |  8 +++-
 benchtests/bench-mutex-queuespinner-thread.c | 21 +++++++++
 manual/Makefile                              |  2 +-
 manual/mutex.texi                            | 68 ++++++++++++++++++++++++++++
 nptl/Makefile                                |  8 ++--
 nptl/allocatestack.c                         |  2 +-
 nptl/descr.h                                 | 26 +++++------
 nptl/mcs_lock.c                              | 68 ++++++++++++++++++++++++++++
 nptl/mcs_lock.h                              | 21 +++++++++
 nptl/nptl-init.c                             |  2 +-
 nptl/pthreadP.h                              |  2 +-
 nptl/pthread_mutex_init.c                    |  3 +-
 nptl/pthread_mutex_lock.c                    | 35 +++++++++++++-
 nptl/pthread_mutex_timedlock.c               | 35 ++++++++++++--
 nptl/pthread_mutex_trylock.c                 |  5 +-
 nptl/pthread_mutex_unlock.c                  |  7 ++-
 nptl/pthread_mutexattr_settype.c             |  2 +-
 nptl/tst-initializers1.c                     | 11 +++--
 nptl/tst-mutex5b.c                           |  2 +
 nptl/tst-mutex7b.c                           |  2 +
 sysdeps/nptl/bits/thread-shared-types.h      | 21 +++++++-
 sysdeps/nptl/pthread.h                       | 15 ++++-
 sysdeps/unix/sysv/linux/hppa/pthread.h       |  4 ++
 24 files changed, 324 insertions(+), 50 deletions(-)
 create mode 100644 benchtests/bench-mutex-queuespinner-thread.c
 create mode 100644 manual/mutex.texi
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
 create mode 100644 nptl/tst-mutex5b.c
 create mode 100644 nptl/tst-mutex7b.c
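For completeness, selecting the proposed type from application code would
look like the following sketch. It assumes a glibc with this series
applied (PTHREAD_MUTEX_QUEUESPINNER_NP does not exist in released glibc)
and is built with _GNU_SOURCE, like the other _NP mutex types:

#include <pthread.h>

static pthread_mutex_t lock;

static void
init_lock (void)
{
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  /* Ask for the proposed queue-spinner behavior instead of the
     default or adaptive type.  */
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_QUEUESPINNER_NP);
  pthread_mutex_init (&lock, &attr);
  pthread_mutexattr_destroy (&attr);
}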