From patchwork Mon Aug 19 16:29:58 2019
X-Patchwork-Submitter: Han Zhou
X-Patchwork-Id: 1149392
From: Han Zhou
To: dev@openvswitch.org
Date: Mon, 19 Aug 2019 09:29:58 -0700
Message-Id: <1566232200-35099-3-git-send-email-hzhou8@ebay.com>
In-Reply-To: <1566232200-35099-1-git-send-email-hzhou8@ebay.com>
References: <1566232200-35099-1-git-send-email-hzhou8@ebay.com>
Subject: [ovs-dev] [PATCH v3 3/5] raft.c: Stale leader should disconnect from cluster.

As mentioned in the Raft paper, section 6.2:

    Leaders: A server might be in the leader state, but if it isn't the
    current leader, it could be needlessly delaying client requests.  For
    example, suppose a leader is partitioned from the rest of the cluster,
    but it can still communicate with a particular client.  Without
    additional mechanism, it could delay a request from that client
    forever, being unable to replicate a log entry to any other servers.
    Meanwhile, there might be another leader of a newer term that is able
    to communicate with a majority of the cluster and would be able to
    commit the client's request.  Thus, a leader in Raft steps down if an
    election timeout elapses without a successful round of heartbeats to a
    majority of its cluster; this allows clients to retry their requests
    with another server.

Reported-by: Aliasgar Ginwala
Tested-by: Aliasgar Ginwala
Signed-off-by: Han Zhou
---
 ovsdb/raft-private.h   |   3 ++
 ovsdb/raft.c           |  43 +++++++++++++++--
 tests/ovsdb-cluster.at | 123 +++++++++++++++++++++++++++++--------------------
 3 files changed, 116 insertions(+), 53 deletions(-)

diff --git a/ovsdb/raft-private.h b/ovsdb/raft-private.h
index 35a3dd7..fb7e6e3 100644
--- a/ovsdb/raft-private.h
+++ b/ovsdb/raft-private.h
@@ -80,6 +80,9 @@ struct raft_server {
     uint64_t next_index;     /* Index of next log entry to send this server. */
     uint64_t match_index;    /* Index of max log entry server known to have. */
     enum raft_server_phase phase;
+    bool replied;            /* Reply to append_request was received from this
+                              * node during current election_timeout interval.
+                              */
     /* For use in adding and removing servers: */
     struct uuid requester_sid;  /* Nonzero if requested via RPC. */
     struct unixctl_conn *requester_conn; /* Only if requested via unixctl. */

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index 1c38b3b..71c5a7e 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -250,7 +250,6 @@ struct raft {
     uint64_t last_applied;      /* Max log index applied to state machine. */
     struct uuid leader_sid;     /* Server ID of leader (zero, if unknown). */
 
-    /* Followers and candidates only. */
 #define ELECTION_BASE_MSEC 1024
 #define ELECTION_RANGE_MSEC 1024
     long long int election_base;    /* Time of last heartbeat from leader. */
@@ -1785,7 +1784,43 @@ raft_run(struct raft *raft)
     }
 
     if (!raft->joining && time_msec() >= raft->election_timeout) {
-        raft_start_election(raft, false);
+        if (raft->role == RAFT_LEADER) {
+            /* Check if majority of followers replied, then reset
+             * election_timeout and reset s->replied.  Otherwise, become
+             * follower.
+             *
+             * Raft paper section 6.2: Leaders: A server might be in the leader
+             * state, but if it isn't the current leader, it could be
+             * needlessly delaying client requests.  For example, suppose a
+             * leader is partitioned from the rest of the cluster, but it can
+             * still communicate with a particular client.  Without additional
+             * mechanism, it could delay a request from that client forever,
+             * being unable to replicate a log entry to any other servers.
+             * Meanwhile, there might be another leader of a newer term that is
+             * able to communicate with a majority of the cluster and would be
+             * able to commit the client's request.  Thus, a leader in Raft
+             * steps down if an election timeout elapses without a successful
+             * round of heartbeats to a majority of its cluster; this allows
+             * clients to retry their requests with another server.
+             */
+            int count = 0;
+            HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+                if (server->replied) {
+                    count++;
+                }
+            }
+            if (count >= hmap_count(&raft->servers) / 2) {
+                HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+                    server->replied = false;
+                }
+                raft_reset_election_timer(raft);
+            } else {
+                raft_become_follower(raft);
+                raft_start_election(raft, false);
+            }
+        } else {
+            raft_start_election(raft, false);
+        }
+
     }
 
     if (raft->leaving && time_msec() >= raft->leave_timeout) {
@@ -2447,6 +2482,7 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s)
 {
     s->next_index = raft->log_end;
     s->match_index = 0;
     s->phase = RAFT_PHASE_STABLE;
+    s->replied = false;
 }
 
 static void
@@ -2462,7 +2498,7 @@ raft_become_leader(struct raft *raft)
     ovs_assert(raft->role != RAFT_LEADER);
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
-    raft->election_timeout = LLONG_MAX;
+    raft_reset_election_timer(raft);
     raft_reset_ping_timer(raft);
 
     struct raft_server *s;
@@ -3192,6 +3228,7 @@ raft_handle_append_reply(struct raft *raft,
         }
     }
 
+    s->replied = true;
     if (rpy->result == RAFT_APPEND_OK) {
         /* Figure 3.1: "If successful, update nextIndex and matchIndex for
          * follower (section 3.5)." */

diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 02e9926..0fbb268 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -65,59 +65,82 @@ EXECUTION_EXAMPLES
 
 AT_BANNER([OVSDB - disconnect from cluster])
 
-# When a node is disconnected from the cluster, the IDL should disconnect
-# and retry even if it uses a single remote, because the remote IP can be
-# a VIP on a load-balance.
-AT_SETUP([OVSDB cluster - disconnect from cluster, single remote])
-AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+OVS_START_SHELL_HELPERS
+# ovsdb_test_cluster_disconnect LEADER_OR_FOLLOWER
+ovsdb_test_cluster_disconnect () {
+    leader_or_follower=$1
+    schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+    ordinal_schema > schema
+    AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+    cid=`ovsdb-tool db-cid s1.db`
+    schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+    for i in `seq 2 3`; do
+        AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+    done
+
+    on_exit 'kill `cat *.pid`'
+    for i in `seq 3`; do
+        AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+    done
+    for i in `seq 3`; do
+        AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+    done
+
+    AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
+      {"op": "insert",
+       "table": "simple",
+       "row": {"i": 1}}]]'], [0], [ignore], [ignore])
+
+    # When a node is disconnected from the cluster, the IDL should disconnect
+    # and retry even if it uses a single remote, because the remote IP can be
+    # a VIP on a load balancer.  So we use a single remote to test here.
+    if test $leader_or_follower == "leader"; then
+        target=1
+        shutdown="2 3"
+    else
+        target=3
+
+        # Shut down the followers before the leader so that there is no chance
+        # for s3 to become leader during the process.
+        shutdown="2 1"
+    fi
+
+    # Connect to $target.  Use "wait" to trigger a non-op transaction so
+    # that test-ovsdb will not quit.
+
+    test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s$target.ovsdb '[["idltest",
+      {"op": "wait",
+       "table": "simple",
+       "where": [["i", "==", 1]],
+       "columns": ["i"],
+       "until": "==",
+       "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
+    echo $! > test-ovsdb.pid
 
-schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
-ordinal_schema > schema
-AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
-cid=`ovsdb-tool db-cid s1.db`
-schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
-for i in `seq 2 3`; do
-    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
-done
-
-on_exit 'kill `cat *.pid`'
-for i in `seq 3`; do
-    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
-done
-for i in `seq 3`; do
-    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
-done
-
-AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
-    {"op": "insert",
-     "table": "simple",
-     "row": {"i": 1}}]]'], [0], [ignore], [ignore])
-
-# Connect to a single remote s3.  Use "wait" to trigger a non-op transaction so
-# that test-ovsdb will not quit.
-
-test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s3.ovsdb '[["idltest",
-    {"op": "wait",
-     "table": "simple",
-     "where": [["i", "==", 1]],
-     "columns": ["i"],
-     "until": "==",
-     "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
-echo $! > test-ovsdb.pid
-
-OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
-
-# Shutdown the other 2 servers so that s3 is disconnected from the cluster.
-for i in 2 1; do
-    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
-done
-
-# The test-ovsdb should detect the disconnect and retry.
-OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
-
-OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s3], [s3.pid])
+    OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
 
+    # Shut down the other servers so that $target is disconnected from the
+    # cluster.
+    for i in $shutdown; do
+        OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
+    done
+
+    # The test-ovsdb should detect the disconnect and retry.
+    OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
+
+    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$target], [s$target.pid])
+}
+OVS_END_SHELL_HELPERS
+
+AT_SETUP([OVSDB cluster - follower disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+ovsdb_test_cluster_disconnect follower
+AT_CLEANUP
+
+AT_SETUP([OVSDB cluster - leader disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+ovsdb_test_cluster_disconnect leader
 AT_CLEANUP
+
 OVS_START_SHELL_HELPERS