From patchwork Wed Aug 14 22:37:02 2019
X-Patchwork-Submitter: Han Zhou
X-Patchwork-Id: 1147276
From: Han Zhou
To: dev@openvswitch.org
Date: Wed, 14 Aug 2019 15:37:02 -0700
Message-Id: <1565822223-90171-3-git-send-email-hzhou8@ebay.com>
In-Reply-To: <1565822223-90171-1-git-send-email-hzhou8@ebay.com>
References: <1565822223-90171-1-git-send-email-hzhou8@ebay.com>
Subject: [ovs-dev] [PATCH v2 3/4] raft.c: Stale leader should disconnect from cluster.

As described in the Raft paper, section 6.2:

    Leaders: A server might be in the leader state, but if it isn’t the
    current leader, it could be needlessly delaying client requests. For
    example, suppose a leader is partitioned from the rest of the cluster,
    but it can still communicate with a particular client. Without
    additional mechanism, it could delay a request from that client
    forever, being unable to replicate a log entry to any other servers.
    Meanwhile, there might be another leader of a newer term that is able
    to communicate with a majority of the cluster and would be able to
    commit the client’s request. Thus, a leader in Raft steps down if an
    election timeout elapses without a successful round of heartbeats to a
    majority of its cluster; this allows clients to retry their requests
    with another server.

Reported-by: Aliasgar Ginwala
Tested-by: Aliasgar Ginwala
Signed-off-by: Han Zhou
---
 ovsdb/raft-private.h   |   3 ++
 ovsdb/raft.c           |  42 ++++++++++++++++-
 tests/ovsdb-cluster.at | 123 +++++++++++++++++++++++++++++--------------------
 3 files changed, 116 insertions(+), 52 deletions(-)

diff --git a/ovsdb/raft-private.h b/ovsdb/raft-private.h
index 35a3dd7..fb7e6e3 100644
--- a/ovsdb/raft-private.h
+++ b/ovsdb/raft-private.h
@@ -80,6 +80,9 @@ struct raft_server {
     uint64_t next_index;  /* Index of next log entry to send this server. */
     uint64_t match_index; /* Index of max log entry server known to have. */
     enum raft_server_phase phase;
+    bool replied;         /* Reply to append_request was received from this
+                             node during current election_timeout interval.
+                           */
     /* For use in adding and removing servers: */
     struct uuid requester_sid;    /* Nonzero if requested via RPC. */
     struct unixctl_conn *requester_conn; /* Only if requested via unixctl. */
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index 1c38b3b..64b8150 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1785,7 +1785,43 @@ raft_run(struct raft *raft)
     }

     if (!raft->joining && time_msec() >= raft->election_timeout) {
-        raft_start_election(raft, false);
+        if (raft->role == RAFT_LEADER) {
+            /* Check if majority of followers replied, then reset
+             * election_timeout and reset s->replied.  Otherwise, become
+             * follower.
+             *
+             * Raft paper section 6.2: Leaders: A server might be in the leader
+             * state, but if it isn’t the current leader, it could be
+             * needlessly delaying client requests. For example, suppose a
+             * leader is partitioned from the rest of the cluster, but it can
+             * still communicate with a particular client. Without additional
+             * mechanism, it could delay a request from that client forever,
+             * being unable to replicate a log entry to any other servers.
+             * Meanwhile, there might be another leader of a newer term that is
+             * able to communicate with a majority of the cluster and would be
+             * able to commit the client’s request. Thus, a leader in Raft
+             * steps down if an election timeout elapses without a successful
+             * round of heartbeats to a majority of its cluster; this allows
+             * clients to retry their requests with another server.
+             */
+            int count = 0;
+            HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+                if (server->replied) {
+                    count++;
+                }
+            }
+            if (count >= hmap_count(&raft->servers) / 2) {
+                HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
+                    server->replied = false;
+                }
+                raft_reset_election_timer(raft);
+            } else {
+                raft_become_follower(raft);
+                raft_start_election(raft, false);
+            }
+        } else {
+            raft_start_election(raft, false);
+        }
     }

     if (raft->leaving && time_msec() >= raft->leave_timeout) {
@@ -2447,6 +2483,7 @@ raft_server_init_leader(struct raft *raft, struct raft_server *s)
     s->next_index = raft->log_end;
     s->match_index = 0;
     s->phase = RAFT_PHASE_STABLE;
+    s->replied = false;
 }

 static void
@@ -2462,7 +2499,7 @@ raft_become_leader(struct raft *raft)
     ovs_assert(raft->role != RAFT_LEADER);
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
-    raft->election_timeout = LLONG_MAX;
+    raft_reset_election_timer(raft);
     raft_reset_ping_timer(raft);

     struct raft_server *s;
@@ -3192,6 +3229,7 @@ raft_handle_append_reply(struct raft *raft,
         }
     }

+    s->replied = true;
     if (rpy->result == RAFT_APPEND_OK) {
         /* Figure 3.1: "If successful, update nextIndex and matchIndex for
          * follower (section 3.5)." */
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 6a13843..7146fe6 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -65,59 +65,82 @@ EXECUTION_EXAMPLES

 AT_BANNER([OVSDB - disconnect from cluster])

-# When a node is disconnected from the cluster, the IDL should disconnect
-# and retry even if it uses a single remote, because the remote IP can be
-# a VIP on a load-balance.
-AT_SETUP([OVSDB cluster - disconnect from cluster, single remote])
-AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+OVS_START_SHELL_HELPERS
+# ovsdb_test_cluster_disconnect LEADER_OR_FOLLOWER
+ovsdb_test_cluster_disconnect () {
+    leader_or_follower=$1
+    schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+    ordinal_schema > schema
+    AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+    cid=`ovsdb-tool db-cid s1.db`
+    schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+    for i in `seq 2 3`; do
+        AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+    done
+
+    on_exit 'kill `cat *.pid`'
+    for i in `seq 3`; do
+        AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+    done
+    for i in `seq 3`; do
+        AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+    done
+
+    AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
+      {"op": "insert",
+       "table": "simple",
+       "row": {"i": 1}}]]'], [0], [ignore], [ignore])
+
+    # When a node is disconnected from the cluster, the IDL should disconnect
+    # and retry even if it uses a single remote, because the remote IP can be
+    # a VIP on a load-balancer.  So we use a single remote to test here.
+    if test "$leader_or_follower" = "leader"; then
+        target=1
+        shutdown="2 3"
+    else
+        target=3
+
+        # Shut down the follower before the leader so that there is no chance
+        # for s3 to become leader during the process.
+        shutdown="2 1"
+    fi
+
+    # Connect to $target.  Use "wait" to trigger a non-op transaction so
+    # that test-ovsdb will not quit.
+
+    on_exit 'pkill test-ovsdb'
+    test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s$target.ovsdb '[["idltest",
+      {"op": "wait",
+       "table": "simple",
+       "where": [["i", "==", 1]],
+       "columns": ["i"],
+       "until": "==",
+       "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
-schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
-ordinal_schema > schema
-AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
-cid=`ovsdb-tool db-cid s1.db`
-schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
-for i in `seq 2 3`; do
-    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
-done
-
-on_exit 'kill `cat *.pid`'
-for i in `seq 3`; do
-    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
-done
-for i in `seq 3`; do
-    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
-done
-
-AT_CHECK([ovsdb-client transact unix:s1.ovsdb '[["idltest",
-  {"op": "insert",
-   "table": "simple",
-   "row": {"i": 1}}]]'], [0], [ignore], [ignore])
-
-# Connect to a single remote s3.  Use "wait" to trigger a non-op transaction so
-# that test-ovsdb will not quit.
-
-on_exit 'pkill test-ovsdb'
-test-ovsdb '-vPATTERN:console:test-ovsdb|%c|%m' -v -t10 idl unix:s3.ovsdb '[["idltest",
-  {"op": "wait",
-   "table": "simple",
-   "where": [["i", "==", 1]],
-   "columns": ["i"],
-   "until": "==",
-   "rows": [{"i": 1}]}]]' > test-ovsdb.log 2>&1 &
-
-OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
-
-# Shutdown the other 2 servers so that s3 is disconnected from the cluster.
-for i in 2 1; do
-    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
-done
-
-# The test-ovsdb should detect the disconnect and retry.
-OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
-
-OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s3], [s3.pid])
+    OVS_WAIT_UNTIL([grep "000: i=1" test-ovsdb.log])
+
+    # Shut down the other servers so that $target is disconnected from the cluster.
+    for i in $shutdown; do
+        OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
+    done
+
+    # The test-ovsdb should detect the disconnect and retry.
+    OVS_WAIT_UNTIL([grep disconnect test-ovsdb.log])
+
+    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$target], [s$target.pid])
+}
+OVS_END_SHELL_HELPERS
+
+AT_SETUP([OVSDB cluster - follower disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+ovsdb_test_cluster_disconnect follower
+AT_CLEANUP
+
+AT_SETUP([OVSDB cluster - leader disconnect from cluster, single remote])
+AT_KEYWORDS([ovsdb server negative unix cluster disconnect])
+ovsdb_test_cluster_disconnect leader
 AT_CLEANUP
+
 OVS_START_SHELL_HELPERS
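
A note for reviewers: the majority check this patch adds to raft_run() can be sketched in isolation as below. This is an illustrative model only, not OVS code; the hypothetical `struct server` stands in for `struct raft_server`, and only the `replied` flag is modeled. The leader never sends an append reply to itself, so its own entry in the server set always has `replied == false`; `count >= n / 2` repliers plus the leader itself therefore form a strict majority of the `n` servers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct raft_server; only the per-server
 * 'replied' flag introduced by the patch is modeled. */
struct server {
    bool replied;
};

/* Mirrors the election-timeout check for leaders: return true if enough
 * servers sent append replies during the last election_timeout interval
 * for the leader to keep leadership.  The leader counts implicitly, hence
 * ">= n / 2" (integer division) rather than "> n / 2". */
static bool
leader_keeps_majority(const struct server servers[], size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (servers[i].replied) {
            count++;
        }
    }
    return count >= n / 2;
}
```

In a 3-server cluster, one replying follower is enough (1 >= 3 / 2 under integer division) for the leader to reset its election timer; with no replies it steps down, as raft_become_follower() followed by raft_start_election() does in the patch.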