From patchwork Mon Aug 19 16:29:58 2019
X-Patchwork-Submitter: Han Zhou
X-Patchwork-Id: 1149392
From: Han Zhou
To: dev@openvswitch.org
Date: Mon, 19 Aug 2019 09:29:58 -0700
Message-Id: <1566232200-35099-3-git-send-email-hzhou8@ebay.com>
In-Reply-To: <1566232200-35099-1-git-send-email-hzhou8@ebay.com>
References: <1566232200-35099-1-git-send-email-hzhou8@ebay.com>
Subject: [ovs-dev] [PATCH v3 3/5] raft.c: Stale leader should disconnect from cluster.

As mentioned in the Raft paper, section 6.2:

    Leaders: A server might be in the leader state, but if it isn't the
    current leader, it could be needlessly delaying client requests.  For
    example, suppose a leader is partitioned from the rest of the cluster,
    but it can still communicate with a particular client.  Without
    additional mechanism, it could delay a request from that client
    forever, being unable to replicate a log entry to any other servers.
    Meanwhile, there might be another leader of a newer term that is able
    to communicate with a majority of the cluster and would be able to
    commit the client's request.  Thus, a leader in Raft steps down if an
    election timeout elapses without a successful round of heartbeats to a
    majority of its cluster; this allows clients to retry their requests
    with another server.

Reported-by: Aliasgar Ginwala
Tested-by: Aliasgar Ginwala
Signed-off-by: Han Zhou
---
 ovsdb/raft-private.h   |   3 ++
 ovsdb/raft.c           |  43 +++++++++++++++--
 tests/ovsdb-cluster.at | 123 +++++++++++++++++++++++++++++--------------------
 3 files changed, 116 insertions(+), 53 deletions(-)

diff --git a/ovsdb/raft-private.h b/ovsdb/raft-private.h
index 35a3dd7..fb7e6e3 100644
--- a/ovsdb/raft-private.h
+++ b/ovsdb/raft-private.h
@@ -80,6 +80,9 @@ struct raft_server {
     uint64_t next_index;     /* Index of next log entry to send this server. */
     uint64_t match_index;    /* Index of max log entry server known to have. */
     enum raft_server_phase phase;
+    bool replied;            /* Reply to append_request was received from this
+                              * node during current election_timeout interval.
+                              */
     /* For use in adding and removing servers: */
     struct uuid requester_sid;  /* Nonzero if requested via RPC. */
     struct unixctl_conn *requester_conn; /* Only if requested via unixctl. */

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index 1c38b3b..71c5a7e 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -250,7 +250,6 @@ struct raft {
     uint64_t last_applied;      /* Max log index applied to state machine. */
     struct uuid leader_sid;     /* Server ID of leader (zero, if unknown). */
 
-    /* Followers and candidates only. */
 #define ELECTION_BASE_MSEC 1024
 #define ELECTION_RANGE_MSEC 1024
     long long int election_base;    /* Time of last heartbeat from leader. */
@@ -1785,7 +1784,43 @@ raft_run(struct raft *raft)
     }
 
     if (!raft->joining && time_msec() >= raft->election_timeout) {
-        raft_start_election(raft, false);
+        if (raft->role == RAFT_LEADER) {
+            /* Check if majority of followers replied, then reset
+             * election_timeout and reset s->replied.  Otherwise, become
+             * follower.
+             *
+             * Raft paper section 6.2: Leaders: A server might be in the leader
+             * state, but if it isn't the current leader, it could be
+             * needlessly delaying client requests.  For example, suppose a
+             * leader is partitioned from the rest of the cluster, but it can
+             * still communicate with a particular client.  Without additional
+             * mechanism, it could delay a request from that client forever,
+             * being unable to replicate a log entry to any other servers.
+             * Meanwhile, there might be another leader of a newer term that is
+             * able to communicate with a majority of the cluster and would be
+             * able to commit the client's request.  Thus, a leader in Raft
+             * steps down if an election timeout elapses without a successful
+             * round of heartbeats to a majority of its cluster; this allows
+             * clients to retry their requests with another server.
+             */
+            int count = 0;
+            HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+                if (server->replied) {
+                    count++;
+                }
+            }
+            if (count >= hmap_count(&raft->servers) / 2) {
+                HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+                    server->replied = false;
+                }
+                raft_reset_election_timer(raft);
+            } else {
+                raft_become_follower(raft);
+                raft_start_election(raft, false);
+            }
+        } else {
+            raft_start_election(raft, false);
+        }
+
     }
 
     if (raft->leaving && time_msec() >= raft->leave_timeout) {
@@ -2447,6 +2482,7 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s)
 {
     s->next_index = raft->log_end;
     s->match_index = 0;
     s->phase = RAFT_PHASE_STABLE;
+    s->replied = false;
 }
 
 static void
@@ -2462,7 +2498,7 @@ raft_become_leader(struct raft *raft)
     ovs_assert(raft->role != RAFT_LEADER);
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
-    raft->election_timeout = LLONG_MAX;
+    raft_reset_election_timer(raft);
     raft_reset_ping_timer(raft);
 
     struct raft_server *s;
@@ -3192,6 +3228,7 @@ raft_handle_append_reply(struct raft *raft,
         }
     }
 
+    s->replied = true;
     if (rpy->result == RAFT_APPEND_OK) {
         /* Figure 3.1: "If successful, update nextIndex and matchIndex for
          * follower (section 3.5)." */

diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 02e9926..0fbb268 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -65,59 +65,82 @@ EXECUTION_EXAMPLES
 
 AT_BANNER([OVSDB - disconnect from cluster])
 
-# When a node is disconnected from the cluster, the IDL should disconnect
-# and retry even if it uses a single remote, because the remote IP can be
-# a VIP on a load-balance.
-AT_SETUP([OVSDB cluster - disconnect from cluster, single remote])
-AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+OVS_START_SHELL_HELPERS
+# ovsdb_test_cluster_disconnect LEADER_OR_FOLLOWER
+ovsdb_test_cluster_disconnect () {
+    leader_or_follower=$1
+    schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+    ordinal_schema > schema
+    AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+    cid=`ovsdb-tool db-cid s1.db`
+    schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+    for i in `seq 2 3`; do
+        AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+    done
+
+    on_exit 'kill `cat *.pid`'
+    for i in `seq 3`; do
+        AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+    done
+    for i in `seq 3`; do
+        AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+    done
+
+    AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
+      {"op": "insert",
+       "table": "simple",
+       "row": {"i": 1}}]]'], [0], [ignore], [ignore])
+
+    # When a node is disconnected from the cluster, the IDL should disconnect
+    # and retry even if it uses a single remote, because the remote IP can be
+    # a VIP on a load balancer.  So we use a single remote to test here.
+    if test $leader_or_follower == "leader"; then
+        target=1
+        shutdown="2 3"
+    else
+        target=3
+
+        # Shut down the followers before the leader so that there is no chance
+        # for s3 to become leader during the process.
+        shutdown="2 1"
+    fi
+
+    # Connect to $target.  Use "wait" to trigger a non-op transaction so
+    # that test-ovsdb will not quit.
+
+    test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s$target.ovsdb '[["idltest",
+      {"op": "wait",
+       "table": "simple",
+       "where": [["i", "==", 1]],
+       "columns": ["i"],
+       "until": "==",
+       "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
+    echo $! > test-ovsdb.pid
 
-schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
-ordinal_schema > schema
-AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
-cid=`ovsdb-tool db-cid s1.db`
-schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
-for i in `seq 2 3`; do
-    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
-done
-
-on_exit 'kill `cat *.pid`'
-for i in `seq 3`; do
-    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
-done
-for i in `seq 3`; do
-    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
-done
-
-AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
-    {"op": "insert",
-     "table": "simple",
-     "row": {"i": 1}}]]'], [0], [ignore], [ignore])
-
-# Connect to a single remote s3.  Use "wait" to trigger a non-op transaction so
-# that test-ovsdb will not quit.
-
-test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s3.ovsdb '[["idltest",
-    {"op": "wait",
-     "table": "simple",
-     "where": [["i", "==", 1]],
-     "columns": ["i"],
-     "until": "==",
-     "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
-echo $! > test-ovsdb.pid
-
-OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
-
-# Shutdown the other 2 servers so that s3 is disconnected from the cluster.
-for i in 2 1; do
-    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
-done
-
-# The test-ovsdb should detect the disconnect and retry.
-OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
-
-OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s3], [s3.pid])
+    OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
 
+    # Shut down the other servers so that $target is disconnected from the
+    # cluster.
+    for i in $shutdown; do
+        OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
+    done
+
+    # The test-ovsdb should detect the disconnect and retry.
+    OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
+
+    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$target], [s$target.pid])
+}
+OVS_END_SHELL_HELPERS
+
+AT_SETUP([OVSDB cluster - follower disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+ovsdb_test_cluster_disconnect follower
+AT_CLEANUP
+
+AT_SETUP([OVSDB cluster - leader disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+ovsdb_test_cluster_disconnect leader
 AT_CLEANUP
+
 OVS_START_SHELL_HELPERS