Message ID | 20240315201614.236523-4-i.maximets@ovn.org
---|---
State | Superseded
Series | ovsdb: raft: Fixes for cluster joining state.
Context | Check | Description
---|---|---
ovsrobot/apply-robot | success | apply and check: success
ovsrobot/github-robot-_Build_and_Test | success | github build: passed
ovsrobot/intel-ovs-compilation | success | test: success
On Fri, Mar 15, 2024 at 1:15 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> Consider the following chain of events:
>
> 1. Have a cluster with 2 members - A and B.  A is the leader.
> 2. C connects to A, sends a request to join the cluster.
> 3. A catches up C, creates an update for the 'servers' list and sends
>    it to B and C to apply.  This entry is not committed yet.
> 4. Before B or C can reply, A loses leadership for some reason.
> 5. A sends a joining failure message to C; C remains in joining state.
> 6. Both B and C have the new version of 'servers', so they recognize
>    each other as valid cluster members.
> 7. B initiates a vote, C (or A) replies and B becomes the new leader.
> 8. B has a new list of servers.  B commits it.  C becomes a committed
>    cluster member.
> 9. A and C receive heartbeats with the new commit index, and C is now
>    a committed cluster member for all of A, B and C.
>
> However, at the end of this process, C is still in the joining state,
> as it never received a successful reply to its join request, and C is
> still in the COMMITTING phase for A.  So, C skips some parts of the
> RAFT life cycle, and A will refuse to transfer leadership to C if
> something happens in the future.
>
> More interestingly, B can actually transfer leadership to C and vote
> for it.  A will vote for it just fine as well.  After that, C becomes
> the new cluster leader while still in the joining state.  In this
> state C will not commit any changes.  So, we have a seemingly stable
> cluster that doesn't commit any changes!
> E.g.:
>
>   s3
>   Address: unix:s3.raft
>   Status: joining cluster
>   Remotes for joining: unix:s3.raft unix:s2.raft unix:s1.raft
>   Role: leader
>   Term: 4
>   Leader: self
>   Vote: self
>
>   Last Election started 30095 ms ago, reason: leadership_transfer
>   Last Election won: 30093 ms ago
>   Election timer: 1000
>   Log: [2, 7]
>   Entries not yet committed: 2
>   Entries not yet applied: 6
>   Connections: ->s1 ->s2 <-s1 <-s2
>   Disconnections: 0
>   Servers:
>       s3 (60cf at unix:s3.raft) (self) next_index=7 match_index=6
>       s2 (46aa at unix:s2.raft) next_index=7 match_index=6 last msg 58 ms ago
>       s1 (28f7 at unix:s1.raft) next_index=7 match_index=6 last msg 59 ms ago
>
> Fix the first scenario by examining server changes in committed log
> entries.  This way server A can transition C to the STABLE phase, and
> server C can find itself in the committed list of servers and move on
> from the joining state.  This is similar to completing commands
> without receiving an explicit reply or after a role change from leader
> to follower.
>
> The second scenario, with a leader in the joining state, can be fixed
> when the joining server becomes leader.  The new leader's log gets
> committed automatically, and all servers transition into the STABLE
> phase for it, but it should also move on from the joining state, since
> it leads the cluster now.
>
> It is also possible that B transfers leadership to C before the list
> of servers is marked as committed on the other servers.  In this case
> C will commit its own addition to the cluster configuration.
>
> The added test usually triggers both scenarios, but it will trigger at
> least one of them.
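The committed-entry scan described above can be illustrated with a self-contained toy model. The names below (`struct node`, `apply_committed_servers`) are invented for illustration only and are not the actual ovsdb/raft.c API; the real code walks hmaps of `struct raft_server` keyed by UUID:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins for the real structures in ovsdb/raft.c. */
enum phase { PHASE_STABLE, PHASE_COMMITTING };

struct server {
    char sid[8];        /* Server id. */
    enum phase phase;
};

struct node {
    char sid[8];            /* This node's own id. */
    bool joining;           /* Still waiting for a join reply? */
    struct server peers[4]; /* Servers this node tracks. */
    int n_peers;
};

/* Toy version of the committed-configuration scan: 'committed' is the
 * server list carried by a committed log entry.  Every tracked server
 * found in it moves from COMMITTING to STABLE, and a joining node that
 * finds its own id in it leaves the joining state. */
static void
apply_committed_servers(struct node *n, const char committed[][8],
                        int n_committed)
{
    for (int i = 0; i < n_committed; i++) {
        for (int j = 0; j < n->n_peers; j++) {
            if (!strcmp(committed[i], n->peers[j].sid)
                && n->peers[j].phase == PHASE_COMMITTING) {
                /* Lost leadership while committing this server, but a
                 * new leader committed it later. */
                n->peers[j].phase = PHASE_STABLE;
            }
        }
        if (n->joining && !strcmp(committed[i], n->sid)) {
            /* Added to the configuration without an explicit reply. */
            n->joining = false;
        }
    }
}
```

In scenario terms: server A sees the configuration entry committed by B and moves C from COMMITTING to STABLE, while server C finds its own id in that entry and leaves the joining state.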
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  ovsdb/raft.c           | 44 ++++++++++++++++++++++++++++++++++-
>  tests/ovsdb-cluster.at | 53 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 96 insertions(+), 1 deletion(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index 57e27bf73..237d7ebf5 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -385,6 +385,7 @@ static void raft_get_servers_from_log(struct raft *, enum vlog_level);
>  static void raft_get_election_timer_from_log(struct raft *);
>
>  static bool raft_handle_write_error(struct raft *, struct ovsdb_error *);
> +static bool raft_has_uncommitted_configuration(const struct raft *);
>
>  static void raft_run_reconfigure(struct raft *);
>
> @@ -2810,6 +2811,18 @@ raft_become_leader(struct raft *raft)
>      raft_reset_election_timer(raft);
>      raft_reset_ping_timer(raft);
>
> +    if (raft->joining) {
> +        /* It is possible that the server committing this one to the list of
> +         * servers lost leadership before the entry is committed but after
> +         * it was already replicated to majority of servers.  In this case
> +         * other servers will recognize this one as a valid cluster member
> +         * and may transfer leadership to it and vote for it.  This way
> +         * we're becoming a cluster leader without receiving reply for a
> +         * join request and will commit addition of this server ourselves.
> +         */
> +        VLOG_INFO_RL(&rl, "elected as leader while joining");
> +        raft->joining = false;
> +    }
> +
>      struct raft_server *s;
>      HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>          raft_server_init_leader(raft, s);
> @@ -2968,12 +2981,12 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
>      }
>
>      while (raft->commit_index < new_commit_index) {
> +        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
>          uint64_t index = ++raft->commit_index;
>          const struct raft_entry *e = raft_get_entry(raft, index);
>
>          if (raft_entry_has_data(e)) {
>              struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
> -            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
>
>              if (cmd) {
>                  if (!cmd->index && raft->role == RAFT_LEADER) {
> @@ -3017,6 +3030,35 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
>               * reallocate raft->entries, which would invalidate 'e', so
>               * this case must be last, after the one for 'e->data'. */
>              raft_run_reconfigure(raft);
> +        } else if (e->servers && !raft_has_uncommitted_configuration(raft)) {
> +            struct ovsdb_error *error;
> +            struct raft_server *s;
> +            struct hmap servers;
> +
> +            error = raft_servers_from_json(e->servers, &servers);
> +            ovs_assert(!error);
> +            HMAP_FOR_EACH (s, hmap_node, &servers) {
> +                struct raft_server *server = raft_find_server(raft, &s->sid);
> +
> +                if (server && server->phase == RAFT_PHASE_COMMITTING) {
> +                    /* This server lost leadership while committing
> +                     * seever 's', but it was committed later by a

nit: typo s/seever/server

Acked-by: Han Zhou <hzhou@ovn.org>

> +                     * new leader. */
> +                    server->phase = RAFT_PHASE_STABLE;
> +                }
> +
> +                if (raft->joining && uuid_equals(&s->sid, &raft->sid)) {
> +                    /* Leadership change happened before previous leader
> +                     * could commit the change of a servers list, but it
> +                     * was replicated and a new leader committed it.
> +                     */
> +                    VLOG_INFO_RL(&rl,
> +                                 "added to configuration without reply "
> +                                 "(eid: "UUID_FMT", commit index: %"PRIu64")",
> +                                 UUID_ARGS(&e->eid), index);
> +                    raft->joining = false;
> +                }
> +            }
> +            raft_servers_destroy(&servers);
>          }
>      }
>
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index 481afc08b..482e4e02d 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -473,6 +473,59 @@ done
>
>  AT_CLEANUP
>
> +AT_SETUP([OVSDB cluster - leadership change after replication while joining])
> +AT_KEYWORDS([ovsdb server negative unix cluster join])
> +
> +n=5
> +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db dnl
> +          $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
> +cid=$(ovsdb-tool db-cid s1.db)
> +schema_name=$(ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema)
> +for i in $(seq 2 $n); do
> +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
> +done
> +
> +on_exit 'kill $(cat *.pid)'
> +on_exit "
> +  for i in \$(ls $(pwd)/s[[0-$n]]); do
> +    ovs-appctl --timeout 1 -t \$i cluster/status $schema_name;
> +  done
> +"
> +
> +dnl Starting servers one by one asking all exisitng servers to transfer
> +dnl leadership after append reply forcing the joining server to try another
> +dnl one that will also transfer leadership.  Since transfer is happening
> +dnl after the servers update is replicated to other servers, one of the
> +dnl other servers will actually commit it.  It may be a new leader from
> +dnl one of the old members or the new joining server itself.
> +for i in $(seq $n); do
> +    dnl Make sure that all already started servers joined the cluster.
> +    for j in $(seq $((i - 1)) ); do
> +        AT_CHECK([ovsdb_client_wait unix:s$j.ovsdb $schema_name connected])
> +    done
> +    for j in $(seq $((i - 1)) ); do
> +        OVS_WAIT_UNTIL([ovs-appctl -t "$(pwd)"/s$j \
> +                            cluster/failure-test \
> +                            transfer-leadership-after-sending-append-request \
> +                        | grep -q "engaged"])
> +    done
> +
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off \
> +                  --detach --no-chdir --log-file=s$i.log \
> +                  --pidfile=s$i.pid --unixctl=s$i \
> +                  --remote=punix:s$i.ovsdb s$i.db])
> +done
> +
> +dnl Make sure that all servers joined the cluster.
> +for i in $(seq $n); do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +for i in $(seq $n); do
> +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([$(pwd)/s$i], [s$i.pid])
> +done
> +
> +AT_CLEANUP
>
>
>  OVS_START_SHELL_HELPERS
> --
> 2.43.0
>
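The second fix in the patch, for a server elected leader while still joining, reduces to a small state transition. The sketch below is a toy model with invented names (`struct raft_state`, `become_leader`), not the actual ovsdb/raft.c code: winning an election while "joining" implies the server's addition was already replicated to a majority, so it can safely leave the joining state and commit its own addition:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the relevant parts of struct raft. */
enum role { ROLE_FOLLOWER, ROLE_CANDIDATE, ROLE_LEADER };

struct raft_state {
    enum role role;
    bool joining;   /* Join request not yet acknowledged. */
};

/* Toy analogue of raft_become_leader() with the fix applied: a server
 * that wins an election while still "joining" must already be a valid
 * member in the eyes of the voters, so it stops waiting for a join
 * reply that will never come. */
static void
become_leader(struct raft_state *r)
{
    r->role = ROLE_LEADER;
    if (r->joining) {
        /* Elected as leader while joining; the new leader's log is
         * committed automatically, including this server's addition. */
        r->joining = false;
    }
}
```

Without this transition, the cluster could end up in the state shown in the commit message's `cluster/status` output: a leader that is still "joining cluster" and therefore never commits anything.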