From patchwork Fri Sep 23 07:31:49 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Paolo Bonzini <pbonzini@redhat.com>
X-Patchwork-Id: 673964
Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.gnu.org (lists.gnu.org [208.118.235.17])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 3sgRPr6rkCz9t0p
	for <incoming@patchwork.ozlabs.org>;
	Fri, 23 Sep 2016 18:30:24 +1000 (AEST)
Authentication-Results: ozlabs.org;
	dkim=fail reason="signature verification failed" (2048-bit key;
	unprotected) header.d=gmail.com header.i=@gmail.com header.b=AyN2Za2R;
	dkim-atps=neutral
Received: from localhost ([::1]:42111 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from
	<qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>)
	id 1bnLsE-0003pj-Ed
	for incoming@patchwork.ozlabs.org; Fri, 23 Sep 2016 04:30:22 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:36267)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1bnKz1-0005sA-Eo
	for qemu-devel@nongnu.org; Fri, 23 Sep 2016 03:33:22 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1bnKyz-0005rt-6G
	for qemu-devel@nongnu.org; Fri, 23 Sep 2016 03:33:18 -0400
Received: from mail-wm0-f65.google.com ([74.125.82.65]:34333)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1bnKyy-0005rk-TX
	for qemu-devel@nongnu.org; Fri, 23 Sep 2016 03:33:17 -0400
Received: by mail-wm0-f65.google.com with SMTP id l132so1282541wmf.1
	for <qemu-devel@nongnu.org>; Fri, 23 Sep 2016 00:33:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=sender:from:to:cc:subject:date:message-id:in-reply-to:references;
	bh=cnwKZs0spha+O8mGG4t6PZsib1k9r1HdOLSUTO4XVVA=;
	b=AyN2Za2ROty1LVzlNma+o7XLyk5KAiITUtqxx0t+mxBIDkZtrQgUQaZaDPvRWrN4u/
	EO2SBxYfkPh5EZugF1bZMWKmfFQ0/IXoU3ReksbGH6SiWCp25g2Qip/NENrcBDU51uTk
	/l6Rt6U0lB5aQRGOHkBHmah+XKrsjXOSwBezh7C1oDEXBDS+YY42Xx5SOIACFjIHDHSm
	e5ZOiPOJSuF50OIjGqfV5YeCDGPCRs+2XyhcpJgQ+r9f54lbc8iRqUr650cNHDC0N/XI
	uGCgpxrIX3KM3RNxW0MF4GzU/pEtHX6fu9O5AtRzUSlr+esDjvCCTsp2L4tJYa/GGwfx
	DvTA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20130820;
	h=x-gm-message-state:sender:from:to:cc:subject:date:message-id
	:in-reply-to:references;
	bh=cnwKZs0spha+O8mGG4t6PZsib1k9r1HdOLSUTO4XVVA=;
	b=MsMJS4FolhpTNeIXChilrriaHdgUPAtopLlm9WDHM9ErATAw2k9yF34BYffjdSbKwP
	lDKnkJwWPOOxkQYe/w9PODTE/qO/5Iny8DRSOQekVouxnsFaKpwZTDDEtzP8aEIIFgP0
	CWbe6fQyqknas+jZSQ8OghJc44Q8ilov5VeQ30GJyMnw3APlZCIevKjXtDuiH+Dro0DM
	B9rKz1J/uC4tIgJ0gZleanEC9Iu9yi4ew6bzuJYMeGy5b9xY4iAkR5dPMr9Un6GYaIRo
	UTT6hZtq0dIn4damk2lygtYeLMPtsMZ5jKxViQTDlLiWZpRQkQvKxS3DLSNwqmCLQ3nM
	FQjw==
X-Gm-Message-State: 
 AE9vXwNWtK14n26lWU1i2oIOqLTOW4Q08zq2kj1W5ekX+qZTiE7f4rT0ikzv0k7zR9hQdQ==
X-Received: by 10.194.125.44 with SMTP id mn12mr5419526wjb.147.1474615936278;
	Fri, 23 Sep 2016 00:32:16 -0700 (PDT)
Received: from donizetti.lan (94-39-176-182.adsl-ull.clienti.tiscali.it.
	[94.39.176.182]) by smtp.gmail.com with ESMTPSA id
	q10sm1691267wme.6.2016.09.23.00.32.14
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Fri, 23 Sep 2016 00:32:15 -0700 (PDT)
From: Paolo Bonzini <pbonzini@redhat.com>
To: qemu-devel@nongnu.org
Date: Fri, 23 Sep 2016 09:31:49 +0200
Message-Id: <1474615909-17069-17-git-send-email-pbonzini@redhat.com>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1474615909-17069-1-git-send-email-pbonzini@redhat.com>
References: <1474615909-17069-1-git-send-email-pbonzini@redhat.com>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
	[fuzzy]
X-Received-From: 74.125.82.65
Subject: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for
	cpu_exec_start/end
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: serge.fdrv@gmail.com, cota@braap.org, alex.bennee@linaro.org,
	sergey.fedorov@linaro.org
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: "Qemu-devel"
	<qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>

Set cpu->running without taking the cpu_list lock, only requiring it if
there is a concurrent exclusive section.  This requires adding a new
field to CPUState, which records whether a running CPU is being counted
in pending_cpus.

When an exclusive section is started concurrently with cpu_exec_start,
cpu_exec_start can use the new field to determine if it has to wait for
the end of the exclusive section.  Likewise, cpu_exec_end can use it to
see if start_exclusive is waiting for that CPU.

This a separate patch for easier bisection of issues.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 cpus-common.c              | 95 ++++++++++++++++++++++++++++++++++++++--------
 docs/tcg-exclusive.promela | 53 ++++++++++++++++++++++++--
 include/qom/cpu.h          |  5 ++-
 3 files changed, 133 insertions(+), 20 deletions(-)

diff --git a/cpus-common.c b/cpus-common.c
index 38b1d55..618eab8 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -28,6 +28,9 @@ static QemuCond exclusive_cond;
 static QemuCond exclusive_resume;
 static QemuCond qemu_work_cond;
 
+/* >= 1 if a thread is inside start_exclusive/end_exclusive.  Written
+ * under qemu_cpu_list_lock, read with atomic operations.
+ */
 static int pending_cpus;
 
 void qemu_init_cpu_list(void)
@@ -177,18 +180,26 @@ static inline void exclusive_idle(void)
 void start_exclusive(void)
 {
     CPUState *other_cpu;
+    int running_cpus;
 
     qemu_mutex_lock(&qemu_cpu_list_lock);
     exclusive_idle();
 
     /* Make all other cpus stop executing.  */
-    pending_cpus = 1;
+    atomic_set(&pending_cpus, 1);
+
+    /* Write pending_cpus before reading other_cpu->running.  */
+    smp_mb();
+    running_cpus = 0;
     CPU_FOREACH(other_cpu) {
-        if (other_cpu->running) {
-            pending_cpus++;
+        if (atomic_read(&other_cpu->running)) {
+            other_cpu->has_waiter = true;
+            running_cpus++;
             qemu_cpu_kick(other_cpu);
         }
     }
+
+    atomic_set(&pending_cpus, running_cpus + 1);
     while (pending_cpus > 1) {
         qemu_cond_wait(&exclusive_cond, &qemu_cpu_list_lock);
     }
@@ -203,7 +214,7 @@ void start_exclusive(void)
 void end_exclusive(void)
 {
     qemu_mutex_lock(&qemu_cpu_list_lock);
-    pending_cpus = 0;
+    atomic_set(&pending_cpus, 0);
     qemu_cond_broadcast(&exclusive_resume);
     qemu_mutex_unlock(&qemu_cpu_list_lock);
 }
@@ -211,24 +222,78 @@ void end_exclusive(void)
 /* Wait for exclusive ops to finish, and begin cpu execution.  */
 void cpu_exec_start(CPUState *cpu)
 {
-    qemu_mutex_lock(&qemu_cpu_list_lock);
-    exclusive_idle();
-    cpu->running = true;
-    qemu_mutex_unlock(&qemu_cpu_list_lock);
+    atomic_set(&cpu->running, true);
+
+    /* Write cpu->running before reading pending_cpus.  */
+    smp_mb();
+
+    /* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
+     * After taking the lock we'll see cpu->has_waiter == true and run---not
+     * for long because start_exclusive kicked us.  cpu_exec_end will
+     * decrement pending_cpus and signal the waiter.
+     *
+     * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
+     * This includes the case when an exclusive item is running now.
+     * Then we'll see cpu->has_waiter == false and wait for the item to
+     * complete.
+     *
+     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
+     * see cpu->running == true, and it will kick the CPU.
+     */
+    if (unlikely(atomic_read(&pending_cpus))) {
+        qemu_mutex_lock(&qemu_cpu_list_lock);
+        if (!cpu->has_waiter) {
+            /* Not counted in pending_cpus, let the exclusive item
+             * run.  Since we have the lock, just set cpu->running to true
+             * while holding it; no need to check pending_cpus again.
+             */
+            cpu->running = false;
+            exclusive_idle();
+            /* Now pending_cpus is zero.  */
+            cpu->running = true;
+        } else {
+            /* Counted in pending_cpus, go ahead and release the
+             * waiter at cpu_exec_end.
+             */
+        }
+        qemu_mutex_unlock(&qemu_cpu_list_lock);
+    }
 }
 
 /* Mark cpu as not executing, and release pending exclusive ops.  */
 void cpu_exec_end(CPUState *cpu)
 {
-    qemu_mutex_lock(&qemu_cpu_list_lock);
-    cpu->running = false;
-    if (pending_cpus > 1) {
-        pending_cpus--;
-        if (pending_cpus == 1) {
-            qemu_cond_signal(&exclusive_cond);
+    atomic_set(&cpu->running, false);
+
+    /* Write cpu->running before reading pending_cpus.  */
+    smp_mb();
+
+    /* 1. start_exclusive saw cpu->running == true.  Then it will increment
+     * pending_cpus and wait for exclusive_cond.  After taking the lock
+     * we'll see cpu->has_waiter == true.
+     *
+     * 2. start_exclusive saw cpu->running == false but here pending_cpus >= 1.
+     * This includes the case when an exclusive item started after setting
+     * cpu->running to false and before we read pending_cpus.  Then we'll see
+     * cpu->has_waiter == false and not touch pending_cpus.  The next call to
+     * cpu_exec_start will run exclusive_idle if still necessary, thus waiting
+     * for the item to complete.
+     *
+     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
+     * see cpu->running == false, and it can ignore this CPU until the
+     * next cpu_exec_start.
+     */
+    if (unlikely(atomic_read(&pending_cpus))) {
+        qemu_mutex_lock(&qemu_cpu_list_lock);
+        if (cpu->has_waiter) {
+            cpu->has_waiter = false;
+            atomic_set(&pending_cpus, pending_cpus - 1);
+            if (pending_cpus == 1) {
+                qemu_cond_signal(&exclusive_cond);
+            }
         }
+        qemu_mutex_unlock(&qemu_cpu_list_lock);
     }
-    qemu_mutex_unlock(&qemu_cpu_list_lock);
 }
 
 void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
diff --git a/docs/tcg-exclusive.promela b/docs/tcg-exclusive.promela
index a8896e5..8361cc2 100644
--- a/docs/tcg-exclusive.promela
+++ b/docs/tcg-exclusive.promela
@@ -12,7 +12,8 @@
  *     spin -a docs/event.promela
  *     ./a.out -a
  *
- * Tunable processor macros: N_CPUS, N_EXCLUSIVE, N_CYCLES, TEST_EXPENSIVE.
+ * Tunable processor macros: N_CPUS, N_EXCLUSIVE, N_CYCLES, USE_MUTEX,
+ *                           TEST_EXPENSIVE.
  */
 
 // Define the missing parameters for the model
@@ -21,8 +22,10 @@
 #warning defaulting to 2 CPU processes
 #endif
 
-// the expensive test is not so expensive for <= 3 CPUs
-#if N_CPUS <= 3
+// the expensive test is not so expensive for <= 2 CPUs
+// If the mutex is used, it's also cheap (300 MB / 4 seconds) for 3 CPUs
+// For 3 CPUs and the lock-free option it needs 1.5 GB of RAM
+#if N_CPUS <= 2 || (N_CPUS <= 3 && defined USE_MUTEX)
 #define TEST_EXPENSIVE
 #endif
 
@@ -106,6 +109,8 @@ byte has_waiter[N_CPUS];
     COND_BROADCAST(exclusive_resume);                             \
     MUTEX_UNLOCK(mutex);
 
+#ifdef USE_MUTEX
+// Simple version using mutexes
 #define cpu_exec_start(id)                                                   \
     MUTEX_LOCK(mutex);                                                       \
     exclusive_idle();                                                        \
@@ -126,6 +131,48 @@ byte has_waiter[N_CPUS];
         :: else -> skip;                                                     \
     fi;                                                                      \
     MUTEX_UNLOCK(mutex);
+#else
+// Wait-free fast path, only needs mutex when concurrent with
+// an exclusive section
+#define cpu_exec_start(id)                                                   \
+    running[id] = 1;                                                         \
+    if                                                                       \
+        :: pending_cpus -> {                                                 \
+            MUTEX_LOCK(mutex);                                               \
+            if                                                               \
+                :: !has_waiter[id] -> {                                      \
+                    running[id] = 0;                                         \
+                    exclusive_idle();                                        \
+                    running[id] = 1;                                         \
+                }                                                            \
+                :: else -> skip;                                             \
+            fi;                                                              \
+            MUTEX_UNLOCK(mutex);                                             \
+        }                                                                    \
+        :: else -> skip;                                                     \
+    fi;
+
+#define cpu_exec_end(id)                                                     \
+    running[id] = 0;                                                         \
+    if                                                                       \
+        :: pending_cpus -> {                                                 \
+            MUTEX_LOCK(mutex);                                               \
+            if                                                               \
+                :: has_waiter[id] -> {                                       \
+                    has_waiter[id] = 0;                                      \
+                    pending_cpus--;                                          \
+                    if                                                       \
+                        :: pending_cpus == 1 -> COND_BROADCAST(exclusive_cond); \
+                        :: else -> skip;                                     \
+                    fi;                                                      \
+                }                                                            \
+                :: else -> skip;                                             \
+            fi;                                                              \
+            MUTEX_UNLOCK(mutex);                                             \
+        }                                                                    \
+        :: else -> skip;                                                     \
+    fi
+#endif
 
 // Promela processes
 
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 5dfe74a..22b54d6 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -242,7 +242,8 @@ struct qemu_work_item;
  * @nr_threads: Number of threads within this CPU.
  * @numa_node: NUMA node this CPU is belonging to.
  * @host_tid: Host thread ID.
- * @running: #true if CPU is currently running;
+ * @running: #true if CPU is currently running (lockless).
+ * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end;
  * valid under cpu_list_lock.
  * @created: Indicates whether the CPU thread has been successfully created.
  * @interrupt_request: Indicates a pending interrupt request.
@@ -296,7 +297,7 @@ struct CPUState {
 #endif
     int thread_id;
     uint32_t host_tid;
-    bool running;
+    bool running, has_waiter;
     struct QemuCond *halt_cond;
     bool thread_kicked;
     bool created;