From patchwork Tue Mar 26 17:27:11 2024
From: Ilya Maximets <i.maximets@ovn.org>
To: ovs-dev@openvswitch.org
Cc: Ilya Maximets, Dumitru Ceara
Date: Tue, 26 Mar 2024 18:27:11 +0100
Message-ID: <20240326172717.1454071-4-i.maximets@ovn.org>
In-Reply-To: <20240326172717.1454071-1-i.maximets@ovn.org>
References: <20240326172717.1454071-1-i.maximets@ovn.org>
X-Mailer: git-send-email 2.44.0
Subject: [ovs-dev] [PATCH v2 3/5] ovsdb: raft: Fix permanent joining state on a cluster member.
Consider the following chain of events:

 1. Have a cluster with 2 members - A and B.  A is the leader.
 2. C connects to A and sends a request to join the cluster.
 3. A catches up C, creates an update for the 'servers' list and sends
    it to B and C to apply.  This entry is not committed yet.
 4. Before B or C can reply, A loses leadership for some reason.
 5. A sends a joining failure message to C; C remains in a joining
    state.
 6. Both B and C have the new version of 'servers', so they recognize
    each other as valid cluster members.
 7. B initiates a vote, C (or A) replies, and B becomes the new leader.
 8. B has the new list of servers.  B commits it.  C becomes a
    committed cluster member.
 9. A and C receive heartbeats with the new commit index, and C is now
    a committed cluster member for all of A, B, and C.

However, at the end of this process C is still in a joining state,
since it never received a successful reply to its join request, and C
is still in the COMMITTING phase for A.  So, C skips some parts of the
RAFT life cycle, and A will refuse to transfer leadership to C if
something happens in the future.

More interestingly, B can actually transfer leadership to C and vote
for it.  A will vote for it just fine as well.  After that, C becomes
the new cluster leader while still in a joining state.  In this state
C will not commit any changes.  So, we have a seemingly stable cluster
that doesn't commit any changes!  E.g.:

  s3
  Address: unix:s3.raft
  Status: joining cluster
  Remotes for joining: unix:s3.raft unix:s2.raft unix:s1.raft
  Role: leader
  Term: 4
  Leader: self
  Vote: self

  Last Election started 30095 ms ago, reason: leadership_transfer
  Last Election won: 30093 ms ago
  Election timer: 1000
  Log: [2, 7]
  Entries not yet committed: 2
  Entries not yet applied: 6
  Connections: ->s1 ->s2 <-s1 <-s2
  Disconnections: 0
  Servers:
    s3 (60cf at unix:s3.raft) (self) next_index=7 match_index=6
    s2 (46aa at unix:s2.raft) next_index=7 match_index=6 last msg 58 ms ago
    s1 (28f7 at unix:s1.raft) next_index=7 match_index=6 last msg 59 ms ago

Fix the first scenario by examining server changes in committed log
entries.  This way server A can transition C to the STABLE phase, and
server C can find itself in the committed list of servers and move out
of the joining state.  This is similar to completing commands without
receiving an explicit reply, or after a role change from leader to
follower.

The second scenario, with a leader in a joining state, can be fixed
when the joining server becomes the leader.  The new leader's log gets
committed automatically, and all servers transition into the STABLE
phase for it, but it should also move on from the joining state, since
it now leads the cluster.

It is also possible that B transfers leadership to C before the list
of servers is marked as committed on other servers.  In this case C
will commit its own addition to the cluster configuration.

The added test usually triggers both scenarios, but it will trigger at
least one of them.
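For reference, the stuck state shown above can be inspected on a live
server with the existing 'cluster/status' unixctl command (the same one
the test below uses in its exit handler).  A minimal sketch, assuming a
server started with --unixctl=s3 in the current directory and the
'idltest' schema used by the test:

  # Dump the raft state of server s3; "Status: joining cluster"
  # combined with "Role: leader" indicates the second scenario above.
  ovs-appctl -t "$(pwd)"/s3 cluster/status idltest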
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou
Signed-off-by: Ilya Maximets
---
 ovsdb/raft.c           | 44 ++++++++++++++++++++++++++++++++++-
 tests/ovsdb-cluster.at | 53 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index ec3a0ff66..5d7e88732 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -386,6 +386,7 @@ static void raft_get_servers_from_log(struct raft *, enum vlog_level);
 static void raft_get_election_timer_from_log(struct raft *);
 
 static bool raft_handle_write_error(struct raft *, struct ovsdb_error *);
+static bool raft_has_uncommitted_configuration(const struct raft *);
 
 static void raft_run_reconfigure(struct raft *);
 
@@ -2848,6 +2849,18 @@ raft_become_leader(struct raft *raft)
     raft_reset_election_timer(raft);
     raft_reset_ping_timer(raft);
 
+    if (raft->joining) {
+        /* It is possible that the server committing this one to the list of
+         * servers lost leadership before the entry is committed but after
+         * it was already replicated to majority of servers.  In this case
+         * other servers will recognize this one as a valid cluster member
+         * and may transfer leadership to it and vote for it.  This way
+         * we're becoming a cluster leader without receiving reply for a
+         * join request and will commit addition of this server ourselves. */
+        VLOG_INFO_RL(&rl, "elected as leader while joining");
+        raft->joining = false;
+    }
+
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
         raft_server_init_leader(raft, s);
@@ -3006,12 +3019,12 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
     }
 
     while (raft->commit_index < new_commit_index) {
+        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
         uint64_t index = ++raft->commit_index;
         const struct raft_entry *e = raft_get_entry(raft, index);
 
         if (raft_entry_has_data(e)) {
             struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
-            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
 
             if (cmd) {
                 if (!cmd->index && raft->role == RAFT_LEADER) {
@@ -3055,6 +3068,35 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
              * reallocate raft->entries, which would invalidate 'e', so
              * this case must be last, after the one for 'e->data'. */
             raft_run_reconfigure(raft);
+        } else if (e->servers && !raft_has_uncommitted_configuration(raft)) {
+            struct ovsdb_error *error;
+            struct raft_server *s;
+            struct hmap servers;
+
+            error = raft_servers_from_json(e->servers, &servers);
+            ovs_assert(!error);
+            HMAP_FOR_EACH (s, hmap_node, &servers) {
+                struct raft_server *server = raft_find_server(raft, &s->sid);
+
+                if (server && server->phase == RAFT_PHASE_COMMITTING) {
+                    /* This server lost leadership while committing
+                     * server 's', but it was committed later by a
+                     * new leader. */
+                    server->phase = RAFT_PHASE_STABLE;
+                }
+
+                if (raft->joining && uuid_equals(&s->sid, &raft->sid)) {
+                    /* Leadership change happened before previous leader
+                     * could commit the change of a servers list, but it
+                     * was replicated and a new leader committed it. */
+                    VLOG_INFO_RL(&rl,
+                                 "added to configuration without reply "
+                                 "(eid: "UUID_FMT", commit index: %"PRIu64")",
+                                 UUID_ARGS(&e->eid), index);
+                    raft->joining = false;
+                }
+            }
+            raft_servers_destroy(&servers);
         }
     }
 
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 481afc08b..482e4e02d 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -473,6 +473,59 @@ done
 
 AT_CLEANUP
 
+AT_SETUP([OVSDB cluster - leadership change after replication while joining])
+AT_KEYWORDS([ovsdb server negative unix cluster join])
+
+n=5
+AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db dnl
+          $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+cid=$(ovsdb-tool db-cid s1.db)
+schema_name=$(ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema)
+for i in $(seq 2 $n); do
+    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+done
+
+on_exit 'kill $(cat *.pid)'
+on_exit "
+  for i in \$(ls $(pwd)/s[[0-$n]]); do
+    ovs-appctl --timeout 1 -t \$i cluster/status $schema_name;
+  done
+"
+
+dnl Start servers one by one, asking all existing servers to transfer
+dnl leadership after sending an append reply, forcing the joining server
+dnl to try another one that will also transfer leadership.  Since the
+dnl transfer happens after the servers update is replicated to other
+dnl servers, one of the other servers will actually commit it.  It may
+dnl be a new leader from one of the old members or the joining server.
+for i in $(seq $n); do
+    dnl Make sure that all already started servers joined the cluster.
+    for j in $(seq $((i - 1)) ); do
+        AT_CHECK([ovsdb_client_wait unix:s$j.ovsdb $schema_name connected])
+    done
+    for j in $(seq $((i - 1)) ); do
+        OVS_WAIT_UNTIL([ovs-appctl -t "$(pwd)"/s$j \
+                          cluster/failure-test \
+                            transfer-leadership-after-sending-append-request \
+                        | grep -q "engaged"])
+    done
+
+    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off \
+               --detach --no-chdir --log-file=s$i.log \
+               --pidfile=s$i.pid --unixctl=s$i \
+               --remote=punix:s$i.ovsdb s$i.db])
+done
+
+dnl Make sure that all servers joined the cluster.
+for i in $(seq $n); do
+    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+done
+
+for i in $(seq $n); do
+    OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
+done
+
+AT_CLEANUP
 OVS_START_SHELL_HELPERS
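To exercise just the added test from a built tree, something like the
following should work (a sketch; '-k' selects autotest cases by the
AT_KEYWORDS listed above):

  # Run only the testsuite cases tagged with the 'join' keyword.
  make check TESTSUITEFLAGS='-k join'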