From patchwork Tue Mar 26 17:27:11 2024
From: Ilya Maximets <i.maximets@ovn.org>
To: ovs-dev@openvswitch.org
Cc: Ilya Maximets, Dumitru Ceara
Date: Tue, 26 Mar 2024 18:27:11 +0100
Message-ID: <20240326172717.1454071-4-i.maximets@ovn.org>
In-Reply-To: <20240326172717.1454071-1-i.maximets@ovn.org>
References: <20240326172717.1454071-1-i.maximets@ovn.org>
X-Mailer: git-send-email 2.44.0
Subject: [ovs-dev] [PATCH v2 3/5] ovsdb: raft: Fix permanent joining state on a cluster member.
Consider the following chain of events:

 1. Have a cluster with 2 members - A and B.  A is the leader.
 2. C connects to A and sends a request to join the cluster.
 3. A catches up C, creates an update for the 'servers' list and sends
    it to B and C to apply.  This entry is not committed yet.
 4. Before B or C can reply, A loses leadership for some reason.
 5. A sends a joining failure message to C; C remains in a joining
    state.
 6. Both B and C have the new version of 'servers', so they recognize
    each other as valid cluster members.
 7. B initiates a vote, C (or A) replies, and B becomes the new leader.
 8. B has the new list of servers.  B commits it.  C becomes a
    committed cluster member.
 9. A and C receive heartbeats with the new commit index, and C is now
    a committed cluster member for all of A, B, and C.

However, at the end of this process C is still in a joining state,
since it never received a successful reply to its join request, and C
is still in the COMMITTING phase for A.  So, C skips some parts of the
RAFT life cycle, and A will refuse to transfer leadership to C if
something happens in the future.

More interestingly, B can actually transfer leadership to C and vote
for it.  A will vote for it just fine as well.  After that, C becomes
the new cluster leader while still in a joining state.  In this state
C will not commit any changes.  So, we have a seemingly stable cluster
that doesn't commit any changes!  E.g.:

  s3
  Address: unix:s3.raft
  Status: joining cluster
  Remotes for joining: unix:s3.raft unix:s2.raft unix:s1.raft
  Role: leader
  Term: 4
  Leader: self
  Vote: self

  Last Election started 30095 ms ago, reason: leadership_transfer
  Last Election won: 30093 ms ago
  Election timer: 1000
  Log: [2, 7]
  Entries not yet committed: 2
  Entries not yet applied: 6
  Connections: ->s1 ->s2 <-s1 <-s2
  Disconnections: 0
  Servers:
    s3 (60cf at unix:s3.raft) (self) next_index=7 match_index=6
    s2 (46aa at unix:s2.raft) next_index=7 match_index=6 last msg 58 ms ago
    s1 (28f7 at unix:s1.raft) next_index=7 match_index=6 last msg 59 ms ago

Fix the first scenario by examining server changes in committed log
entries.  This way server A can transition C to the STABLE phase, and
server C can find itself in the committed list of servers and move out
of the joining state.  This is similar to completing commands without
receiving an explicit reply, or after a role change from leader to
follower.

The second scenario, with a leader in a joining state, can be fixed
when the joining server becomes the leader.  The new leader's log gets
committed automatically, and all servers transition into the STABLE
phase for it, but it should also move on from the joining state, since
it now leads the cluster.

It is also possible that B transfers leadership to C before the list
of servers is marked as committed on other servers.  In this case C
will commit its own addition to the cluster configuration.

The added test usually triggers both scenarios, but it will trigger at
least one of them.
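For reference, the stuck state shown above can be inspected on a live
server with the existing 'cluster/status' unixctl command (the same one
the test below uses in its exit handler).  A minimal sketch, assuming a
server started with --unixctl=s3 in the current directory and the
'idltest' schema used by the test:

  # Dump the raft state of server s3; "Status: joining cluster"
  # combined with "Role: leader" indicates the second scenario above.
  ovs-appctl -t "$(pwd)"/s3 cluster/status idltest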
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou
Signed-off-by: Ilya Maximets
---
 ovsdb/raft.c           | 44 ++++++++++++++++++++++++++++++++++-
 tests/ovsdb-cluster.at | 53 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index ec3a0ff66..5d7e88732 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -386,6 +386,7 @@ static void raft_get_servers_from_log(struct raft *, enum vlog_level);
 static void raft_get_election_timer_from_log(struct raft *);
 
 static bool raft_handle_write_error(struct raft *, struct ovsdb_error *);
+static bool raft_has_uncommitted_configuration(const struct raft *);
 
 static void raft_run_reconfigure(struct raft *);
 
@@ -2848,6 +2849,18 @@ raft_become_leader(struct raft *raft)
     raft_reset_election_timer(raft);
     raft_reset_ping_timer(raft);
 
+    if (raft->joining) {
+        /* It is possible that the server committing this one to the list of
+         * servers lost leadership before the entry is committed but after
+         * it was already replicated to majority of servers.  In this case
+         * other servers will recognize this one as a valid cluster member
+         * and may transfer leadership to it and vote for it.  This way
+         * we're becoming a cluster leader without receiving reply for a
+         * join request and will commit addition of this server ourselves. */
+        VLOG_INFO_RL(&rl, "elected as leader while joining");
+        raft->joining = false;
+    }
+
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
         raft_server_init_leader(raft, s);
@@ -3006,12 +3019,12 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
     }
 
     while (raft->commit_index < new_commit_index) {
+        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
         uint64_t index = ++raft->commit_index;
         const struct raft_entry *e = raft_get_entry(raft, index);
 
         if (raft_entry_has_data(e)) {
             struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
-            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
 
             if (cmd) {
                 if (!cmd->index && raft->role == RAFT_LEADER) {
@@ -3055,6 +3068,35 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
              * reallocate raft->entries, which would invalidate 'e', so
              * this case must be last, after the one for 'e->data'. */
             raft_run_reconfigure(raft);
+        } else if (e->servers && !raft_has_uncommitted_configuration(raft)) {
+            struct ovsdb_error *error;
+            struct raft_server *s;
+            struct hmap servers;
+
+            error = raft_servers_from_json(e->servers, &servers);
+            ovs_assert(!error);
+            HMAP_FOR_EACH (s, hmap_node, &servers) {
+                struct raft_server *server = raft_find_server(raft, &s->sid);
+
+                if (server && server->phase == RAFT_PHASE_COMMITTING) {
+                    /* This server lost leadership while committing
+                     * server 's', but it was committed later by a
+                     * new leader. */
+                    server->phase = RAFT_PHASE_STABLE;
+                }
+
+                if (raft->joining && uuid_equals(&s->sid, &raft->sid)) {
+                    /* Leadership change happened before previous leader
+                     * could commit the change of a servers list, but it
+                     * was replicated and a new leader committed it. */
+                    VLOG_INFO_RL(&rl,
+                                 "added to configuration without reply "
+                                 "(eid: "UUID_FMT", commit index: %"PRIu64")",
+                                 UUID_ARGS(&e->eid), index);
+                    raft->joining = false;
+                }
+            }
+            raft_servers_destroy(&servers);
         }
     }
 
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 481afc08b..482e4e02d 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -473,6 +473,59 @@ done
 
 AT_CLEANUP
 
+AT_SETUP([OVSDB cluster - leadership change after replication while joining])
+AT_KEYWORDS([ovsdb server negative unix cluster join])
+
+n=5
+AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db dnl
+          $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+cid=$(ovsdb-tool db-cid s1.db)
+schema_name=$(ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema)
+for i in $(seq 2 $n); do
+    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+done
+
+on_exit 'kill $(cat *.pid)'
+on_exit "
+  for i in \$(ls $(pwd)/s[[0-$n]]); do
+    ovs-appctl --timeout 1 -t \$i cluster/status $schema_name;
+  done
+"
+
+dnl Start servers one by one, asking all existing servers to transfer
+dnl leadership after sending an append reply, forcing the joining server
+dnl to try another one that will also transfer leadership.  Since the
+dnl transfer happens after the servers update is replicated to other
+dnl servers, one of the other servers will actually commit it.  It may
+dnl be a new leader from one of the old members or the joining server.
+for i in $(seq $n); do
+    dnl Make sure that all already started servers joined the cluster.
+    for j in $(seq $((i - 1)) ); do
+        AT_CHECK([ovsdb_client_wait unix:s$j.ovsdb $schema_name connected])
+    done
+    for j in $(seq $((i - 1)) ); do
+        OVS_WAIT_UNTIL([ovs-appctl -t "$(pwd)"/s$j \
+                          cluster/failure-test \
+                            transfer-leadership-after-sending-append-request \
+                        | grep -q "engaged"])
+    done
+
+    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off \
+               --detach --no-chdir --log-file=s$i.log \
+               --pidfile=s$i.pid --unixctl=s$i \
+               --remote=punix:s$i.ovsdb s$i.db])
+done
+
+dnl Make sure that all servers joined the cluster.
+for i in $(seq $n); do
+    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+done
+
+for i in $(seq $n); do
+    OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
+done
+
+AT_CLEANUP
 OVS_START_SHELL_HELPERS
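To exercise just the added test from a built tree, something like the
following should work (a sketch; '-k' selects autotest cases by the
AT_KEYWORDS listed above):

  # Run only the testsuite cases tagged with the 'join' keyword.
  make check TESTSUITEFLAGS='-k join'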