
[ovs-dev] ovsdb: raft: Fix transaction double commit due to lost leadership.

Message ID 20220505122439.523610-1-i.maximets@ovn.org
State Changes Requested
Series [ovs-dev] ovsdb: raft: Fix transaction double commit due to lost leadership.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Ilya Maximets May 5, 2022, 12:24 p.m. UTC
While becoming a follower, the leader completes all the current
'in-flight' commands, so the higher layers can re-try the corresponding
transactions once the new leader is elected.  However, most of these
commands have already been sent to followers as append requests, hence
they will actually be committed by the majority of the cluster members,
i.e. they will be treated as committed by the new leader, unless there
is an actual network problem between the servers.  At the same time,
the old leader will decline the append replies, since it is not the
leader anymore and the commands have already been completed with the
RAFT_CMD_LOST_LEADERSHIP status.

The new leader will replicate the commit index back to the old leader,
and the old leader will then re-try the previously "failed" transaction,
because "cluster error"s are treated as temporary.

If the transaction had prerequisites that do not allow double
committing, or if there are other database constraints (like indexes)
that prevent the transaction from being committed twice, the server
will reply to the client with a false-negative transaction result.

If there are no prerequisites or additional database constraints,
the server will execute the same transaction again, but as a follower.

E.g., in the OVN case this may result in the creation of duplicated
logical switches / routers / load balancers, i.e. resources with the
same non-indexed name.  That may cause issues later, when ovn-nbctl is
no longer able to add ports to these switches / routers.
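
A hypothetical way the symptom shows up on an affected northbound
database (the switch name "sw0" is made up; the duplicates come from the
server executing the transaction twice, not from the client):

    # Two Logical_Switch rows end up sharing the same non-indexed name.
    ovn-nbctl --columns=name,_uuid list Logical_Switch
    # Name-based commands then fail, since "sw0" matches more than one row.
    ovn-nbctl lsp-add sw0 sw0-port1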

The suggested solution is to not complete the commands, but to allow
them to be completed by the commit index update from the new leader.
This is similar to what we already do to complete commands in the
reverse scenario, when a follower becomes the leader.  That scenario
was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated
transaction execution when leader failover.").

Additionally, truncate the current log down to the commit index in
order to avoid possible log inconsistencies with the new leader.
Followers should not have uncommitted entries in the log; the new
leader will send them back if they were actually committed.

The code paths for the leader and the follower inside
raft_update_commit_index() were very similar, so they were merged into
one, since we also need the ability to complete more than one command
on a follower.

A failure test is added to exercise the leadership transfer scenario.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
Reported-at: https://bugzilla.redhat.com/2046340
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 ovsdb/raft.c           | 135 +++++++++++++++++++++++++----------------
 tests/ovsdb-cluster.at |  27 +++++++--
 2 files changed, 103 insertions(+), 59 deletions(-)

Comments

Dumitru Ceara May 16, 2022, 11:02 a.m. UTC | #1
Hi Ilya,

This change looks OK to me.  However, I think it's best to wait for
Han's review too before applying this.

Acked-by: Dumitru Ceara <dceara@redhat.com>

Regards,
Dumitru
Han Zhou May 22, 2022, 11:58 p.m. UTC | #2
Thanks Ilya for addressing this problem! Please see some comments below:
On Thu, May 5, 2022 at 5:24 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> While becoming a follower, the leader completes all the current
nit: The word "completes" here may be a little confusing. We are talking
about completing with an error instead of SUCCESS, so it would be better
to say "cancels" or "aborts".

> 'in-flight' commands, so the higher layers can re-try corresponding
> transactions when the new leader is elected.  However, most of these
> commands are already sent to followers as append requests, hence they
> will actually be committed by the majority of the cluster members,
> i.e. will be treated as committed by the new leader, unless there is
> an actual network problem between servers.  However, the old leader
> will decline append replies, since it's not the leader anymore and
> commands are already completed with RAFT_CMD_LOST_LEADERSHIP status.
>
> New leader will replicate the commit index back to the old leader.
> Old leader will re-try the previously "failed" transaction, because
> "cluster error"s are temporary.
>
> If a transaction had some prerequisites that didn't allow double
> committing or there are other database constraints (like indexes) that
> will not allow a transaction to be committed twice, the server will
> reply to the client with a false-negative transaction result.
>
> If there are no prerequisites or additional database constraints,
> the server will execute the same transaction again, but as a follower.
>
> E.g. in the OVN case, this may result in creation of duplicated logical
> switches / routers / load balancers.  I.e. resources with the same
> non-indexed name.  That may cause issues later where ovn-nbctl will
> not be able to add ports to these switches / routers.
>
> Suggested solution is to not complete the commands, but allow them to
> be completed with the commit index update from a new leader.
> It the similar behavior to what we do in order to complete commands
> in a backward scenario when the follower becomes a leader.  That
> scenario was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated
> transaction execution when leader failover.").
>
> Additionally, truncating the current log down to commit index in order
> to avoid possible log inconsistencies with a new leader.  Followers
> should not have uncommitted entries in the log.  New leader will send
> them back, if they were actually committed.

This is something I am a little concerned about. It may still be correct,
but I feel it is unnecessary and may be less efficient.
- If the old leader has sent append requests before it stepped down, the
new leader and the old leader would both have the entry, so it will be
committed when the new leader finishes the first round of heartbeats and
the command would be completed. If we truncate the entry from the old
leader, it takes another round trip from the new leader to the old leader
to append the entry (when the new leader notices, from an append-reply,
that the old leader's log is behind).
- If the old leader hasn't sent append requests to any servers for the
command before stepping down, but has the entry in its raft log, then a
conflict will be detected at the old leader when it receives an
append-request from the new leader, and raft_handle_append_entries()
should resolve the conflict by discarding the old leader's own entry, so
it is not harmful to have uncommitted entries in the follower (old
leader)'s log.

Did you observe any problems during testing when not truncating the log?

Other than this, the patch looks good.

Thanks,
Han

Ilya Maximets May 23, 2022, 9:20 p.m. UTC | #3
On 5/23/22 01:58, Han Zhou wrote:
> 
> 
> Thanks Ilya for addressing this problem! Please see some comments below:
> On Thu, May 5, 2022 at 5:24 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> While becoming a follower, the leader completes all the current
> nit: The word "completes" here may be a little confusing. We are talking about complete with error instead of SUCCESS, so better to say "cancels" or "aborts".
>> Additionally, truncating the current log down to commit index in order
>> to avoid possible log inconsistencies with a new leader.  Followers
>> should not have uncommitted entries in the log.  New leader will send
>> them back, if they were actually committed.
> 
> This is something I am a little concerned about. It may still be correct but I feel it is unnecessary and may be less efficient.
> - If the old leader has sent append requests before it stepped down, the new leader and the old leader would both have the entry, so it will be committed when the new leader finishes the first round of heart beat and the command would be completed. If we truncate the entry from the old leader, it would take another round trip from the new leader to the old leader to append the entry (when the new leader notices that the old leader's log is behind from an append-reply from the old leader).
> - If the old leader hasn't sent append requests to any servers for the command before stepping down, but has the entry in it's raft log, then there will be a conflict detected at the old leader when receiving append-request from the new leader, and in raft_handle_append_entries() it should resolve the conflict by discarding its own entry - so, it is not harmful to have uncommitted entries in the follower (old leader)'s log.
> 
> Did you observe any problem during testing if not truncating the log?

No.  I didn't.  In fact, I didn't have this part in the first
version of the fix, but added it later because it seemed logical
to me not to have uncommitted entries in the log.  Safer, I guess.

But I think you're right.  And actually, thinking more about
this, it's the same condition as when a follower receives some
entries from the leader that are not yet fully committed, i.e.:

1. Leader sends uncommitted entries to a follower.
2. Follower adds them to the log.
3. Leadership change happens before entries are written by other
   followers, so not yet committed.
4. New leader without these entries commits something else.

This should be a fully legitimate scenario where uncommitted entries
are in the follower's log.  And, yes, this follower will discard
them once the update is received from the new leader.

I'll drop this part of the patch and send v2.  Thanks!

Best regards, Ilya Maximets.

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index 530c5e5a3..cefeab82c 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -78,6 +78,8 @@  enum raft_failure_test {
     FT_DELAY_ELECTION,
     FT_DONT_SEND_VOTE_REQUEST,
     FT_STOP_RAFT_RPC,
+    FT_TRANSFER_LEADERSHIP,
+    FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ,
 };
 static enum raft_failure_test failure_test;
 
@@ -1931,6 +1933,13 @@  raft_run(struct raft *raft)
         return;
     }
 
+    if (failure_test == FT_TRANSFER_LEADERSHIP) {
+        /* Using this function as it conveniently implements all we need and
+         * snapshotting is the main test scenario for leadership transfer. */
+        raft_notify_snapshot_recommended(raft);
+        failure_test = FT_NO_TEST;
+    }
+
     raft_waiters_run(raft);
 
     if (!raft->listener && time_msec() >= raft->listen_backoff) {
@@ -2066,7 +2075,14 @@  raft_run(struct raft *raft)
         HMAP_FOR_EACH_SAFE (cmd, hmap_node, &raft->commands) {
             if (cmd->timestamp
                 && now - cmd->timestamp > raft->election_timer * 2) {
-                raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
+                if (cmd->index && raft->role != RAFT_LEADER) {
+                    /* This server lost leadership and command didn't complete
+                     * in time.  Likely, it wasn't replicated to the majority
+                     * of servers before losing the leadership. */
+                    raft_command_complete(raft, cmd, RAFT_CMD_LOST_LEADERSHIP);
+                } else {
+                    raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
+                }
             }
         }
         raft_reset_ping_timer(raft);
@@ -2258,6 +2274,9 @@  raft_command_initiate(struct raft *raft,
     if (failure_test == FT_CRASH_AFTER_SEND_APPEND_REQ) {
         ovs_fatal(0, "Raft test: crash after sending append_request.");
     }
+    if (failure_test == FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ) {
+        failure_test = FT_TRANSFER_LEADERSHIP;
+    }
     raft_reset_ping_timer(raft);
 
     return cmd;
@@ -2639,7 +2658,17 @@  raft_become_follower(struct raft *raft)
         raft->remove_server = NULL;
     }
 
-    raft_complete_all_commands(raft, RAFT_CMD_LOST_LEADERSHIP);
+    /* Not completing all commands here as they can still be completed if
+     * already replicated to the majority of servers.  But since we can not
+     * be sure that these commands actually succeeded, we need to remove
+     * corresponding uncommitted entries from the log.  Doing that we will
+     * avoid possible log inconsistency with a new leader.  If entries were
+     * actually committed according to a new leader, we will receive them
+     * back with AppendEntries request and will complete commands then. */
+    if (raft_truncate(raft, raft->commit_index + 1)) {
+        /* The set of servers might have changed; check. */
+        raft_get_servers_from_log(raft, VLL_INFO);
+    }
 }
 
 static void
@@ -2888,61 +2917,56 @@  raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
         return false;
     }
 
-    if (raft->role == RAFT_LEADER) {
-        while (raft->commit_index < new_commit_index) {
-            uint64_t index = ++raft->commit_index;
-            const struct raft_entry *e = raft_get_entry(raft, index);
-            if (raft_entry_has_data(e)) {
-                struct raft_command *cmd
-                    = raft_find_command_by_eid(raft, &e->eid);
-                if (cmd) {
-                    if (!cmd->index) {
-                        VLOG_DBG("Command completed after role change from"
-                                 " follower to leader "UUID_FMT,
-                                 UUID_ARGS(&e->eid));
-                        cmd->index = index;
-                    }
-                    raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
+    while (raft->commit_index < new_commit_index) {
+        uint64_t index = ++raft->commit_index;
+        const struct raft_entry *e = raft_get_entry(raft, index);
+
+        if (raft_entry_has_data(e)) {
+            struct raft_command *cmd = raft_find_command_by_eid(raft, &e->eid);
+            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
+
+            if (cmd) {
+                if (!cmd->index && raft->role == RAFT_LEADER) {
+                    VLOG_INFO_RL(&rl,
+                        "command completed after role change from "
+                        "follower to leader (eid: "UUID_FMT", "
+                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
+                } else if (!cmd->index && raft->role != RAFT_LEADER) {
+                    /* This can happen when leader fail-over before sending
+                     * execute_command_reply. */
+                    VLOG_INFO_RL(&rl,
+                        "command completed without reply (eid: "UUID_FMT", "
+                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
+                } else if (cmd->index && raft->role != RAFT_LEADER) {
+                    /* This can happen if current server lost leadership after
+                     * sending append requests to the majority of servers, but
+                     * before receiving majority of append replies. */
+                    VLOG_INFO_RL(&rl,
+                        "command completed after role change from "
+                        "leader to follower (eid: "UUID_FMT", "
+                        "commit index: %"PRIu64")", UUID_ARGS(&e->eid), index);
+                    /* Clearing 'sid' to avoid sending cmd execution reply. */
+                    cmd->sid = UUID_ZERO;
+                } else {
+                    /* (cmd->index && raft->role == RAFT_LEADER)
+                     * Normal command completion on a leader. */
                 }
-            }
-            if (e->election_timer) {
-                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
-                          raft->election_timer, e->election_timer);
-                raft->election_timer = e->election_timer;
-                raft->election_timer_new = 0;
-                raft_update_probe_intervals(raft);
-            }
-            if (e->servers) {
-                /* raft_run_reconfigure() can write a new Raft entry, which can
-                 * reallocate raft->entries, which would invalidate 'e', so
-                 * this case must be last, after the one for 'e->data'. */
-                raft_run_reconfigure(raft);
+                cmd->index = index;
+                raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
             }
         }
-    } else {
-        while (raft->commit_index < new_commit_index) {
-            uint64_t index = ++raft->commit_index;
-            const struct raft_entry *e = raft_get_entry(raft, index);
-            if (e->election_timer) {
-                VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
-                          raft->election_timer, e->election_timer);
-                raft->election_timer = e->election_timer;
-                raft_update_probe_intervals(raft);
-            }
+        if (e->election_timer) {
+            VLOG_INFO("Election timer changed from %"PRIu64" to %"PRIu64,
+                      raft->election_timer, e->election_timer);
+            raft->election_timer = e->election_timer;
+            raft->election_timer_new = 0;
+            raft_update_probe_intervals(raft);
         }
-        /* Check if any pending command can be completed, and complete it.
-         * This can happen when leader fail-over before sending
-         * execute_command_reply. */
-        const struct uuid *eid = raft_get_eid(raft, new_commit_index);
-        struct raft_command *cmd = raft_find_command_by_eid(raft, eid);
-        if (cmd) {
-            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
-            VLOG_INFO_RL(&rl,
-                         "Command completed without reply (eid: "UUID_FMT", "
-                         "commit index: %"PRIu64")",
-                         UUID_ARGS(eid), new_commit_index);
-            cmd->index = new_commit_index;
-            raft_command_complete(raft, cmd, RAFT_CMD_SUCCESS);
+        if (e->servers && raft->role == RAFT_LEADER) {
+            /* raft_run_reconfigure() can write a new Raft entry, which can
+             * reallocate raft->entries, which would invalidate 'e', so
+             * this case must be last, after the one for 'e->data'. */
+            raft_run_reconfigure(raft);
         }
     }
 
@@ -4975,6 +4999,11 @@  raft_unixctl_failure_test(struct unixctl_conn *conn OVS_UNUSED,
         failure_test = FT_DONT_SEND_VOTE_REQUEST;
     } else if (!strcmp(test, "stop-raft-rpc")) {
         failure_test = FT_STOP_RAFT_RPC;
+    } else if (!strcmp(test,
+                       "transfer-leadership-after-sending-append-request")) {
+        failure_test = FT_TRANSFER_LEADERSHIP_AFTER_SEND_APPEND_REQ;
+    } else if (!strcmp(test, "transfer-leadership")) {
+        failure_test = FT_TRANSFER_LEADERSHIP;
     } else if (!strcmp(test, "clear")) {
         failure_test = FT_NO_TEST;
         unixctl_command_reply(conn, "test dismissed");
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index ee9c7b937..0f7076a05 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -468,6 +468,7 @@  ovsdb_cluster_failure_test () {
     if test "$crash_node" == "1"; then
         new_leader=$5
     fi
+    log_grep=$6
 
     cp $top_srcdir/vswitchd/vswitch.ovsschema schema
     schema=`ovsdb-tool schema-name schema`
@@ -488,7 +489,7 @@  ovsdb_cluster_failure_test () {
     start_server() {
         local i=$1
         printf "\ns$i: starting\n"
-        AT_CHECK([ovsdb-server -vjsonrpc -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+        AT_CHECK([ovsdb-server -vjsonrpc -vraft -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
     }
     connect_server() {
         local i=$1
@@ -514,14 +515,23 @@  ovsdb_cluster_failure_test () {
         fi
         AT_CHECK([ovs-appctl -t "`pwd`"/s$delay_election_node cluster/failure-test delay-election], [0], [ignore])
     fi
+
+    # Initializing the database separately to avoid extra 'wait' operation
+    # in later transactions.
+    AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait init], [0], [ignore], [ignore])
+
     AT_CHECK([ovs-appctl -t "`pwd`"/s$crash_node cluster/failure-test $crash_command], [0], [ignore])
     AT_CHECK([ovs-vsctl -v --db="$db" --no-leader-only --no-shuffle-remotes --no-wait create QoS type=x], [0], [ignore], [ignore])
 
-    # Make sure that the node really crashed.
-    AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
-    # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
-    if test "$remote_1" = "$crash_node"; then
-        db=unix:s$remote_2.ovsdb
+    # Make sure that the node really crashed or has specific log message.
+    if test -z "$log_grep"; then
+        AT_CHECK([ls s$crash_node.ovsdb], [2], [ignore], [ignore])
+        # XXX: Client will fail if remotes contains unix socket that doesn't exist (killed).
+        if test "$remote_1" = "$crash_node"; then
+            db=unix:s$remote_2.ovsdb
+        fi
+    else
+        OVS_WAIT_UNTIL([grep -q "$log_grep" s${crash_node}.log])
     fi
     AT_CHECK([ovs-vsctl --db="$db" --no-leader-only --no-wait --columns=type --bare list QoS], [0], [x
 ])
@@ -617,6 +627,11 @@  AT_KEYWORDS([ovsdb server negative unix cluster pending-txn])
 ovsdb_cluster_failure_test 2 2 3 crash-after-receiving-append-request-update
 AT_CLEANUP
 
+AT_SETUP([OVSDB cluster - txn on leader, leader transfers leadership after sending appendReq])
+AT_KEYWORDS([ovsdb server negative unix cluster pending-txn transfer])
+ovsdb_cluster_failure_test 1 2 1 transfer-leadership-after-sending-append-request -1 "Transferring leadership"
+AT_CLEANUP
+
 
 AT_SETUP([OVSDB cluster - competing candidates])
 AT_KEYWORDS([ovsdb server negative unix cluster competing-candidates])
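
For reference, a rough sketch of how the new failure hooks can be
exercised (keyword selection and the s1 socket naming follow the
conventions in tests/ovsdb-cluster.at; exact paths may differ):

    # Select the new cluster test by its "transfer" keyword.
    make check TESTSUITEFLAGS='-k transfer'

    # Or inject the failure manually on a running test server s1.
    ovs-appctl -t "`pwd`"/s1 cluster/failure-test \
        transfer-leadership-after-sending-append-request
    ovs-appctl -t "`pwd`"/s1 cluster/failure-test clear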