diff mbox series

[ovs-dev] ovsdb: raft: Support pre-vote mechanism to deal with disruptive server.

Message ID 20230625173533.1665933-1-hzhou@ovn.org
State Changes Requested
Headers show
Series [ovs-dev] ovsdb: raft: Support pre-vote mechanism to deal with disruptive server. | expand

Checks

Context Check Description
ovsrobot/apply-robot warning apply and check: warning
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Han Zhou June 25, 2023, 5:35 p.m. UTC
When a server becomes unstable due to overloading or intermittent
partitioning, it may miss some heartbeats and then starts election with
a new term, which would disrupt the otherwise healthy cluster formed by
the rest of the healthy nodes.  Such situation may exist for a long time
until the "flapping" server is shutdown or recovered completely, which
can severely impact the availability of the cluster. The pre-vote
mechanism introduced in the raft paper section 9.6 can prevent such
problems. This patch implements the pre-vote mechanism, in a way that is
backward compatible and safe for upgrades.

Signed-off-by: Han Zhou <hzhou@ovn.org>
---
 ovsdb/raft-rpc.c       | 19 ++++++++-
 ovsdb/raft-rpc.h       |  3 ++
 ovsdb/raft.c           | 88 ++++++++++++++++++++++++++++++------------
 tests/ovsdb-cluster.at | 42 ++++++++++++++++++++
 4 files changed, 127 insertions(+), 25 deletions(-)

Comments

0-day Robot June 25, 2023, 5:59 p.m. UTC | #1
Bleep bloop.  Greetings Han Zhou, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line lacks whitespace around operator
#125 FILE: ovsdb/raft.c:309:
                                   Pre-vote mechanism is introduced in raft

WARNING: Line lacks whitespace around operator
#127 FILE: ovsdb/raft.c:311:
                                   sub-state of candidate to minize the change

Lines checked: 405, Warnings: 2, Errors: 0


Please check this out.  If you feel there has been an error, please email aconole@redhat.com

Thanks,
0-day Robot
Ilya Maximets June 30, 2023, 11:29 p.m. UTC | #2
On 6/25/23 19:35, Han Zhou wrote:
> When a server becomes unstable due to overloading or intermittent
> partitioning, it may miss some heartbeats and then starts election with
> a new term, which would disrupt the otherwise healthy cluster formed by
> the rest of the healthy nodes.  Such situation may exist for a long time
> until the "flapping" server is shutdown or recovered completely, which
> can severely impact the availability of the cluster. The pre-vote
> mechanism introduced in the raft paper section 9.6 can prevent such
> problems. This patch implements the pre-vote mechanism, in a way that is
> backward compatible and safe for upgrades.

Thanks, Han!  This is definitely interesting.  I don't think that this
should be a solution for overloading though, overloading should be fixed
by, well, reducing the load or increasing the election timer.  But some
intermittent network issues is indeed a valid case.

I looked through the code and I don't see any significant problems right
away, but I would like to experiment with it a bit more next week, since
the voting logic is fairly complex in general.

It's nice to see you found a use case for the stop-raft-rpc failure
test. :)

See some comments inline.

Best regards, Ilya Maximets.

> 
> Signed-off-by: Han Zhou <hzhou@ovn.org>
> ---
>  ovsdb/raft-rpc.c       | 19 ++++++++-
>  ovsdb/raft-rpc.h       |  3 ++
>  ovsdb/raft.c           | 88 ++++++++++++++++++++++++++++++------------
>  tests/ovsdb-cluster.at | 42 ++++++++++++++++++++
>  4 files changed, 127 insertions(+), 25 deletions(-)
> 
> diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
> index dd14d81091fc..f750ba897046 100644
> --- a/ovsdb/raft-rpc.c
> +++ b/ovsdb/raft-rpc.c
> @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct raft_vote_request *rq,
>          json_object_put(args, "leadership_transfer",
>                          json_boolean_create(true));
>      }
> +    if (rq->is_prevote) {
> +        json_object_put(args, "is_prevote",
> +                        json_boolean_create(true));
> +    }
>  }
>  
>  static void
> @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct ovsdb_parser *p,
>      rq->last_log_term = raft_parse_required_uint64(p, "last_log_term");
>      rq->leadership_transfer
>          = raft_parse_optional_boolean(p, "leadership_transfer") == 1;
> +    rq->is_prevote
> +        = raft_parse_optional_boolean(p, "is_prevote") == 1;
>  }
>  
>  static void
> @@ -305,6 +311,9 @@ raft_format_vote_request(const struct raft_vote_request *rq, struct ds *s)
>      if (rq->leadership_transfer) {
>          ds_put_cstr(s, " leadership_transfer=true");
>      }
> +    if (rq->is_prevote) {
> +        ds_put_cstr(s, " is_prevote=true");
> +    }
>  }
>  
>  /* raft_vote_reply. */
> @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct raft_vote_reply *rpy,
>  {
>      raft_put_uint64(args, "term", rpy->term);
>      json_object_put_format(args, "vote", UUID_FMT, UUID_ARGS(&rpy->vote));
> +    if (rpy->is_prevote) {
> +        json_object_put(args, "is_prevote", json_boolean_create(true));
> +    }
>  }
>  
>  static void
> @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p,
>  {
>      rpy->term = raft_parse_required_uint64(p, "term");
>      rpy->vote = raft_parse_required_uuid(p, "vote");
> +    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") == 1;
>  }
>  
>  static void
> @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct raft_vote_reply *rpy, struct ds *s)
>  {
>      ds_put_format(s, " term=%"PRIu64, rpy->term);
>      ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
> +    if (rpy->is_prevote) {
> +        ds_put_cstr(s, " is_prevote=true");
> +    }
>  }
>  
>  /* raft_add_server_request */
> @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc)
>          return NULL;
>  
>      case RAFT_RPC_VOTE_REPLY:
> -        return &raft_vote_reply_cast(rpc)->vote;
> +        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
> +            &raft_vote_reply_cast(rpc)->vote;

It might be better to create a pointer variable here and access it
instead of casting twice.

>  
>      default:
>          OVS_NOT_REACHED();
> diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
> index 221f24d00128..10f30618e3e0 100644
> --- a/ovsdb/raft-rpc.h
> +++ b/ovsdb/raft-rpc.h
> @@ -125,12 +125,15 @@ struct raft_vote_request {
>      uint64_t last_log_index; /* Index of candidate's last log entry. */
>      uint64_t last_log_term;  /* Term of candidate's last log entry. */
>      bool leadership_transfer;  /* True to override minimum election timeout. */
> +    bool is_prevote;         /* True: pre-vote; False: real vote (default). */
>  };
>  
>  struct raft_vote_reply {
>      struct raft_rpc_common common;
>      uint64_t term;          /* Current term, for candidate to update itself. */
>      struct uuid vote;       /* Server ID of vote. */
> +    bool is_prevote;        /* Copy the is_prevote from the request, primarily
> +                               for validation. */

Each comment line should start with a '*'.

"Copy of the ..." ?

>  };
>  
>  struct raft_add_server_request {
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index b2361b1737a2..4d7a3f112cad 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -305,6 +305,11 @@ struct raft {
>  
>      /* Candidates only.  Reinitialized at start of election. */
>      int n_votes;                /* Number of votes for me. */
> +    bool prevote_passed;        /* Indicates if it passed pre-vote phase.
> +                                   Pre-vote mechanism is introduced in raft
> +                                   paper section 9.6. We implement it as a
> +                                   sub-state of candidate to minize the change
> +                                   and keep backward compatibility. */

Each comment line should start with a '*'.

And s/minize/minimize/.

>  
>      /* Followers and candidates only. */
>      bool candidate_retrying;    /* The earlier election timed-out and we are
> @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *);
>  static void raft_reset_election_timer(struct raft *);
>  static void raft_reset_ping_timer(struct raft *);
>  static void raft_send_heartbeats(struct raft *);
> -static void raft_start_election(struct raft *, bool leadership_transfer);
> +static void raft_start_election(struct raft *, bool is_prevote,
> +                                bool leadership_transfer);
>  static bool raft_truncate(struct raft *, uint64_t new_end);
>  static void raft_get_servers_from_log(struct raft *, enum vlog_level);
>  static void raft_get_election_timer_from_log(struct raft *);
> @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft **raftp)
>          /* If there's only one server, start an election right away so that the
>           * cluster bootstraps quickly. */
>          if (hmap_count(&raft->servers) == 1) {
> -            raft_start_election(raft, false);
> +            /* No pre-vote needed since we are the only one. */
> +            raft_start_election(raft, false, false);
>          }
>      } else {
>          raft->join_timeout = time_msec() + 1000;
> @@ -1360,7 +1367,7 @@ void
>  raft_take_leadership(struct raft *raft)
>  {
>      if (raft->role != RAFT_LEADER) {
> -        raft_start_election(raft, true);
> +        raft_start_election(raft, false, true);
>      }
>  }
>  
> @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t term, const struct uuid *vote)
>      return true;
>  }
>  
> -static void
> +static bool
>  raft_accept_vote(struct raft *raft, struct raft_server *s,
>                   const struct uuid *vote)
>  {
>      if (uuid_equals(&s->vote, vote)) {
> -        return;
> +        return false;
>      }
>      if (!uuid_is_zero(&s->vote)) {
>          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
> @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct raft_server *s,
>      s->vote = *vote;
>      if (uuid_equals(vote, &raft->sid)
>          && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
> -        raft_become_leader(raft);
> +        return true;
>      }
> +    return false;
>  }
>  
>  static void
> -raft_start_election(struct raft *raft, bool leadership_transfer)
> +raft_start_election(struct raft *raft, bool is_prevote,
> +                    bool leadership_transfer)
>  {
> +    /* Leadership transfer doesn't use prevote. */
> +    ovs_assert(!is_prevote || !leadership_transfer);
> +
>      if (raft->leaving) {
>          return;
>      }
> @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>          return;
>      }
>  
> -    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
> +    if (!is_prevote && !raft_set_term(raft, raft->term + 1, &raft->sid)) {
>          return;
>      }
>  
> @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>  
>      raft->leader_sid = UUID_ZERO;
>      raft->role = RAFT_CANDIDATE;
> -    /* If there was no leader elected since last election, we know we are
> -     * retrying now. */
> -    raft->candidate_retrying = !raft->had_leader;
> -    raft->had_leader = false;
> +    raft->prevote_passed = !is_prevote;
> +
> +    if (is_prevote || leadership_transfer) {
> +        /* If there was no leader elected since last election, we know we are
> +         * retrying now. */
> +        raft->candidate_retrying = !raft->had_leader;
> +        raft->had_leader = false;
> +
> +        raft->election_start = time_msec();
> +        raft->election_won = 0;
> +    }
>  
>      raft->n_votes = 0;
>  
> -    raft->election_start = time_msec();
> -    raft->election_won = 0;
>      raft->leadership_transfer = leadership_transfer;
>  
>      static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>      if (!VLOG_DROP_INFO(&rl)) {
>          long long int now = time_msec();
> +        char *comment = is_prevote ? "prevote" : "vote";

Should user-visible logs say "pre-vote" instead of "prevote" ?

>          if (now >= raft->election_timeout) {
>              VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
> -                      "starting election",
> -                      raft->term, now - raft->election_base);
> +                      "starting election (%s)",
> +                      raft->term, now - raft->election_base, comment);
>          } else {
> -            VLOG_INFO("term %"PRIu64": starting election", raft->term);
> +            VLOG_INFO("term %"PRIu64": starting election (%s)",
> +                      raft->term, comment);
>          }
>      }
>      raft_reset_election_timer(raft);
> @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>                      ? raft->entries[raft->log_end - raft->log_start - 1].term
>                      : raft->snap.term),
>                  .leadership_transfer = leadership_transfer,
> +                .is_prevote = is_prevote,
>              },
>          };
>          if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
> @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>      }
>  
>      /* Vote for ourselves. */
> -    raft_accept_vote(raft, me, &raft->sid);
> +    if (raft_accept_vote(raft, me, &raft->sid)) {
> +        /* We just started vote, so it shouldn't be accepted yet unless this is
> +         * a one-node cluster. In such case we don't do pre-vote, and become
> +         * leader immediately. */
> +        ovs_assert(!is_prevote);
> +        raft_become_leader(raft);
> +    }
>  }
>  
>  static void
> @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft)
>                  raft_reset_election_timer(raft);
>              } else {
>                  raft_become_follower(raft);
> -                raft_start_election(raft, false);
> +                raft_start_election(raft, true, false);
>              }
>          } else {
> -            raft_start_election(raft, false);
> +            raft_start_election(raft, true, false);
>          }
>  
>      }
> @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft,
>          return false;
>      }
>  
> +    if (rq->is_prevote) {
> +        return true;
> +    }
> +
>      /* Record a vote for the peer. */
>      if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
>          return false;
> @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft,
>  
>  static void
>  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
> -                     const struct uuid *vote)
> +                     const struct uuid *vote, bool is_prevote)
>  {
>      union raft_rpc rpy = {
>          .vote_reply = {
> @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
>              },
>              .term = raft->term,
>              .vote = *vote,
> +            .is_prevote = is_prevote,
>          },
>      };
>      raft_send(raft, &rpy);
> @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft,
>                           const struct raft_vote_request *rq)
>  {
>      if (raft_handle_vote_request__(raft, rq)) {
> -        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
> +        raft_send_vote_reply(raft, &rq->common.sid,
> +                             rq->is_prevote ? &rq->common.sid : &raft->vote,
> +                             rq->is_prevote);
>      }
>  }
>  
> @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft,
>  
>      struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
>      if (s) {
> -        raft_accept_vote(raft, s, &rpy->vote);
> +        if (raft_accept_vote(raft, s, &rpy->vote)) {

IIUC, the raft_accept_vote() here can't succeed for pre-vote because
everyone votes for themselves on pre-vote.  Is that correct?

What happens if we received a real vote during a pre-vote phase for
some reason?  Will this logic fail?  Is this scenario possible?

> +            if (raft->prevote_passed) {
> +                raft_become_leader(raft);
> +            } else {
> +                /* Start the real election. */
> +                raft_start_election(raft, false, false);
> +            }
> +        }
>      }
>  }
>  
> @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft,
>          VLOG_INFO("received leadership transfer from %s in term %"PRIu64,
>                    raft_get_nickname(raft, &rq->common.sid, buf, sizeof buf),
>                    rq->term);
> -        raft_start_election(raft, true);
> +        raft_start_election(raft, false, true);
>      }
>  }
>  
> diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> index 9fbf5dc897f2..8a81136999e0 100644
> --- a/tests/ovsdb-cluster.at
> +++ b/tests/ovsdb-cluster.at
> @@ -715,6 +715,48 @@ done
>  
>  AT_CLEANUP
>  
> +
> +AT_SETUP([OVSDB cluster - disruptive server])
> +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
> +
> +n=3
> +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
> +ordinal_schema > schema
> +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
> +cid=`ovsdb-tool db-cid s1.db`
> +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
> +for i in `seq 2 $n`; do
> +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
> +done
> +
> +on_exit 'kill `cat *.pid`'
> +for i in `seq $n`; do
> +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
> +done
> +for i in `seq $n`; do
> +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> +done
> +
> +# An unstable follower shouldn't disrupt the healthy cluster - shouldn't
> +# trigger term change.
> +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test stop-raft-rpc], [0], [ignore])
> +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: candidate"])
> +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0], [ignore])
> +
> +# Should step back to follower.
> +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: follower"])
> +
> +# No term change.
> +for i in `seq $n`; do
> +    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name | grep "Term: 1"], [0], [ignore])
> +done
> +
> +for i in `seq $n`; do
> +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
> +done
> +
> +AT_CLEANUP
> +
>  
>  AT_BANNER([OVSDB - cluster tests])
>
Han Zhou July 1, 2023, 2:43 a.m. UTC | #3
On Fri, Jun 30, 2023 at 4:30 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/25/23 19:35, Han Zhou wrote:
> > When a server becomes unstable due to overloading or intermittent
> > partitioning, it may miss some heartbeats and then starts election with
> > a new term, which would disrupt the otherwise healthy cluster formed by
> > the rest of the healthy nodes.  Such situation may exist for a long time
> > until the "flapping" server is shutdown or recovered completely, which
> > can severely impact the availability of the cluster. The pre-vote
> > mechanism introduced in the raft paper section 9.6 can prevent such
> > problems. This patch implements the pre-vote mechanism, in a way that is
> > backward compatible and safe for upgrades.
>
> Thanks, Han!  This is definitely interesting.  I don't think that this
> should be a solution for overloading though, overloading should be fixed
> by, well, reducing the load or increasing the election timer.  But some
> intermittent network issues is indeed a valid case.

Sorry that I wasn't clear about "overloading". I was referring to the case
when the machine is getting overloaded, not necessarily because of the
ovsdb-server process but because of other applications/VMs sharing the same
physical resource, which is likely what we have hit in the production.
You may argue it is an improper resource allocation, but from OVSDB point
of view, it is not an excuse for the whole RAFT cluster down. The pre-vote
can avoid such disaster, regardless of the factors of the problematic
server that we can't control.

>
> I looked through the code and I don't see any significant problems right
> away, but I would like to experiment with it a bit more next week, since
> the voting logic is fairly complex in general.
>
Appreciated!

> It's nice to see you found a use case for the stop-raft-rpc failure
> test. :)
>
> See some comments inline.
>
> Best regards, Ilya Maximets.
>
> >
> > Signed-off-by: Han Zhou <hzhou@ovn.org>
> > ---
> >  ovsdb/raft-rpc.c       | 19 ++++++++-
> >  ovsdb/raft-rpc.h       |  3 ++
> >  ovsdb/raft.c           | 88 ++++++++++++++++++++++++++++++------------
> >  tests/ovsdb-cluster.at | 42 ++++++++++++++++++++
> >  4 files changed, 127 insertions(+), 25 deletions(-)
> >
> > diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
> > index dd14d81091fc..f750ba897046 100644
> > --- a/ovsdb/raft-rpc.c
> > +++ b/ovsdb/raft-rpc.c
> > @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct
raft_vote_request *rq,
> >          json_object_put(args, "leadership_transfer",
> >                          json_boolean_create(true));
> >      }
> > +    if (rq->is_prevote) {
> > +        json_object_put(args, "is_prevote",
> > +                        json_boolean_create(true));
> > +    }
> >  }
> >
> >  static void
> > @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct ovsdb_parser
*p,
> >      rq->last_log_term = raft_parse_required_uint64(p, "last_log_term");
> >      rq->leadership_transfer
> >          = raft_parse_optional_boolean(p, "leadership_transfer") == 1;
> > +    rq->is_prevote
> > +        = raft_parse_optional_boolean(p, "is_prevote") == 1;
> >  }
> >
> >  static void
> > @@ -305,6 +311,9 @@ raft_format_vote_request(const struct
raft_vote_request *rq, struct ds *s)
> >      if (rq->leadership_transfer) {
> >          ds_put_cstr(s, " leadership_transfer=true");
> >      }
> > +    if (rq->is_prevote) {
> > +        ds_put_cstr(s, " is_prevote=true");
> > +    }
> >  }
> >
> >  /* raft_vote_reply. */
> > @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct
raft_vote_reply *rpy,
> >  {
> >      raft_put_uint64(args, "term", rpy->term);
> >      json_object_put_format(args, "vote", UUID_FMT,
UUID_ARGS(&rpy->vote));
> > +    if (rpy->is_prevote) {
> > +        json_object_put(args, "is_prevote", json_boolean_create(true));
> > +    }
> >  }
> >
> >  static void
> > @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p,
> >  {
> >      rpy->term = raft_parse_required_uint64(p, "term");
> >      rpy->vote = raft_parse_required_uuid(p, "vote");
> > +    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") ==
1;
> >  }
> >
> >  static void
> > @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct raft_vote_reply
*rpy, struct ds *s)
> >  {
> >      ds_put_format(s, " term=%"PRIu64, rpy->term);
> >      ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
> > +    if (rpy->is_prevote) {
> > +        ds_put_cstr(s, " is_prevote=true");
> > +    }
> >  }
> >
> >  /* raft_add_server_request */
> > @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc)
> >          return NULL;
> >
> >      case RAFT_RPC_VOTE_REPLY:
> > -        return &raft_vote_reply_cast(rpc)->vote;
> > +        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
> > +            &raft_vote_reply_cast(rpc)->vote;
>
> It might be better to create a pointer variable here and access it
> instead of casting twice.
>
That was my first attempt, but I can't define a local variable under the
'case' statement unless I add a pair of '{}' for the whole block, which
looks ugly :(
I will try to find if there are similar patterns in the code base.

> >
> >      default:
> >          OVS_NOT_REACHED();
> > diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
> > index 221f24d00128..10f30618e3e0 100644
> > --- a/ovsdb/raft-rpc.h
> > +++ b/ovsdb/raft-rpc.h
> > @@ -125,12 +125,15 @@ struct raft_vote_request {
> >      uint64_t last_log_index; /* Index of candidate's last log entry. */
> >      uint64_t last_log_term;  /* Term of candidate's last log entry. */
> >      bool leadership_transfer;  /* True to override minimum election
timeout. */
> > +    bool is_prevote;         /* True: pre-vote; False: real vote
(default). */
> >  };
> >
> >  struct raft_vote_reply {
> >      struct raft_rpc_common common;
> >      uint64_t term;          /* Current term, for candidate to update
itself. */
> >      struct uuid vote;       /* Server ID of vote. */
> > +    bool is_prevote;        /* Copy the is_prevote from the request,
primarily
> > +                               for validation. */
>
> Each comment line should start with a '*'.
>
> "Copy of the ..." ?
>

Ack.

> >  };
> >
> >  struct raft_add_server_request {
> > diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> > index b2361b1737a2..4d7a3f112cad 100644
> > --- a/ovsdb/raft.c
> > +++ b/ovsdb/raft.c
> > @@ -305,6 +305,11 @@ struct raft {
> >
> >      /* Candidates only.  Reinitialized at start of election. */
> >      int n_votes;                /* Number of votes for me. */
> > +    bool prevote_passed;        /* Indicates if it passed pre-vote
phase.
> > +                                   Pre-vote mechanism is introduced in
raft
> > +                                   paper section 9.6. We implement it
as a
> > +                                   sub-state of candidate to minize
the change
> > +                                   and keep backward compatibility. */
>
> Each comment line should start with a '*'.
>
> And s/minize/minimize/.
>
Ack.

> >
> >      /* Followers and candidates only. */
> >      bool candidate_retrying;    /* The earlier election timed-out and
we are
> > @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *);
> >  static void raft_reset_election_timer(struct raft *);
> >  static void raft_reset_ping_timer(struct raft *);
> >  static void raft_send_heartbeats(struct raft *);
> > -static void raft_start_election(struct raft *, bool
leadership_transfer);
> > +static void raft_start_election(struct raft *, bool is_prevote,
> > +                                bool leadership_transfer);
> >  static bool raft_truncate(struct raft *, uint64_t new_end);
> >  static void raft_get_servers_from_log(struct raft *, enum vlog_level);
> >  static void raft_get_election_timer_from_log(struct raft *);
> > @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft
**raftp)
> >          /* If there's only one server, start an election right away so
that the
> >           * cluster bootstraps quickly. */
> >          if (hmap_count(&raft->servers) == 1) {
> > -            raft_start_election(raft, false);
> > +            /* No pre-vote needed since we are the only one. */
> > +            raft_start_election(raft, false, false);
> >          }
> >      } else {
> >          raft->join_timeout = time_msec() + 1000;
> > @@ -1360,7 +1367,7 @@ void
> >  raft_take_leadership(struct raft *raft)
> >  {
> >      if (raft->role != RAFT_LEADER) {
> > -        raft_start_election(raft, true);
> > +        raft_start_election(raft, false, true);
> >      }
> >  }
> >
> > @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t term,
const struct uuid *vote)
> >      return true;
> >  }
> >
> > -static void
> > +static bool
> >  raft_accept_vote(struct raft *raft, struct raft_server *s,
> >                   const struct uuid *vote)
> >  {
> >      if (uuid_equals(&s->vote, vote)) {
> > -        return;
> > +        return false;
> >      }
> >      if (!uuid_is_zero(&s->vote)) {
> >          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
> > @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct
raft_server *s,
> >      s->vote = *vote;
> >      if (uuid_equals(vote, &raft->sid)
> >          && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
> > -        raft_become_leader(raft);
> > +        return true;
> >      }
> > +    return false;
> >  }
> >
> >  static void
> > -raft_start_election(struct raft *raft, bool leadership_transfer)
> > +raft_start_election(struct raft *raft, bool is_prevote,
> > +                    bool leadership_transfer)
> >  {
> > +    /* Leadership transfer doesn't use prevote. */
> > +    ovs_assert(!is_prevote || !leadership_transfer);
> > +
> >      if (raft->leaving) {
> >          return;
> >      }
> > @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >          return;
> >      }
> >
> > -    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
> > +    if (!is_prevote && !raft_set_term(raft, raft->term + 1,
&raft->sid)) {
> >          return;
> >      }
> >
> > @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >
> >      raft->leader_sid = UUID_ZERO;
> >      raft->role = RAFT_CANDIDATE;
> > -    /* If there was no leader elected since last election, we know we
are
> > -     * retrying now. */
> > -    raft->candidate_retrying = !raft->had_leader;
> > -    raft->had_leader = false;
> > +    raft->prevote_passed = !is_prevote;
> > +
> > +    if (is_prevote || leadership_transfer) {
> > +        /* If there was no leader elected since last election, we know
we are
> > +         * retrying now. */
> > +        raft->candidate_retrying = !raft->had_leader;
> > +        raft->had_leader = false;
> > +
> > +        raft->election_start = time_msec();
> > +        raft->election_won = 0;
> > +    }
> >
> >      raft->n_votes = 0;
> >
> > -    raft->election_start = time_msec();
> > -    raft->election_won = 0;
> >      raft->leadership_transfer = leadership_transfer;
> >
> >      static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
> >      if (!VLOG_DROP_INFO(&rl)) {
> >          long long int now = time_msec();
> > +        char *comment = is_prevote ? "prevote" : "vote";
>
> Should user-visible logs say "pre-vote" instead of "prevote" ?

I am ok either way. I was thinking about the situation when searching the
logs with \<vote\> in vim, but it shouldn't matter much :) I can change it
if pre-vote looks better to users.
>
> >          if (now >= raft->election_timeout) {
> >              VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
> > -                      "starting election",
> > -                      raft->term, now - raft->election_base);
> > +                      "starting election (%s)",
> > +                      raft->term, now - raft->election_base, comment);
> >          } else {
> > -            VLOG_INFO("term %"PRIu64": starting election", raft->term);
> > +            VLOG_INFO("term %"PRIu64": starting election (%s)",
> > +                      raft->term, comment);
> >          }
> >      }
> >      raft_reset_election_timer(raft);
> > @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >                      ? raft->entries[raft->log_end - raft->log_start -
1].term
> >                      : raft->snap.term),
> >                  .leadership_transfer = leadership_transfer,
> > +                .is_prevote = is_prevote,
> >              },
> >          };
> >          if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
> > @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >      }
> >
> >      /* Vote for ourselves. */
> > -    raft_accept_vote(raft, me, &raft->sid);
> > +    if (raft_accept_vote(raft, me, &raft->sid)) {
> > +        /* We just started vote, so it shouldn't be accepted yet
unless this is
> > +         * a one-node cluster. In such case we don't do pre-vote, and
become
> > +         * leader immediately. */
> > +        ovs_assert(!is_prevote);
> > +        raft_become_leader(raft);
> > +    }
> >  }
> >
> >  static void
> > @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft)
> >                  raft_reset_election_timer(raft);
> >              } else {
> >                  raft_become_follower(raft);
> > -                raft_start_election(raft, false);
> > +                raft_start_election(raft, true, false);
> >              }
> >          } else {
> > -            raft_start_election(raft, false);
> > +            raft_start_election(raft, true, false);
> >          }
> >
> >      }
> > @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft,
> >          return false;
> >      }
> >
> > +    if (rq->is_prevote) {
> > +        return true;
> > +    }
> > +
> >      /* Record a vote for the peer. */
> >      if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
> >          return false;
> > @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft,
> >
> >  static void
> >  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
> > -                     const struct uuid *vote)
> > +                     const struct uuid *vote, bool is_prevote)
> >  {
> >      union raft_rpc rpy = {
> >          .vote_reply = {
> > @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const
struct uuid *dst,
> >              },
> >              .term = raft->term,
> >              .vote = *vote,
> > +            .is_prevote = is_prevote,
> >          },
> >      };
> >      raft_send(raft, &rpy);
> > @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft,
> >                           const struct raft_vote_request *rq)
> >  {
> >      if (raft_handle_vote_request__(raft, rq)) {
> > -        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
> > +        raft_send_vote_reply(raft, &rq->common.sid,
> > +                             rq->is_prevote ? &rq->common.sid :
&raft->vote,
> > +                             rq->is_prevote);
> >      }
> >  }
> >
> > @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft,
> >
> >      struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
> >      if (s) {
> > -        raft_accept_vote(raft, s, &rpy->vote);
> > +        if (raft_accept_vote(raft, s, &rpy->vote)) {
>
> IIUC, the raft_accept_vote() here can't succeed for pre-vote because
> everyone votes for themselves on pre-vote.  Is that correct?
>
pre-vote happens the same way as the real vote, except that pre-vote
doesn't increase the term.
This function is for the candidate to handle a vote-reply from a peer for a
vote/pre-vote request sent by itself.
So, raft_accept_vote should succeed in normal cases, for both pre-vote and
vote.
If it is pre-vote and it is successful, it starts the real vote immediately.

> What happens if we received a real vote during a pre-vote phase for
> some reason?  Will this logic fail?  Is this scenario possible?

The logic of handling the vote/pre-vote reply message only depends on the
status of the server itself, regardless of the message's is_prevote field.
So we don't really need to care if it is a pre-vote reply or vote reply.
The is_prevote field in the reply was added only for validation purpose,
not useful for the RPC itself, as mentioned in the comments.

Thanks,
Han
>
> > +            if (raft->prevote_passed) {
> > +                raft_become_leader(raft);
> > +            } else {
> > +                /* Start the real election. */
> > +                raft_start_election(raft, false, false);
> > +            }
> > +        }
> >      }
> >  }
> >
> > @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft,
> >          VLOG_INFO("received leadership transfer from %s in term
%"PRIu64,
> >                    raft_get_nickname(raft, &rq->common.sid, buf, sizeof
buf),
> >                    rq->term);
> > -        raft_start_election(raft, true);
> > +        raft_start_election(raft, false, true);
> >      }
> >  }
> >
> > diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> > index 9fbf5dc897f2..8a81136999e0 100644
> > --- a/tests/ovsdb-cluster.at
> > +++ b/tests/ovsdb-cluster.at
> > @@ -715,6 +715,48 @@ done
> >
> >  AT_CLEANUP
> >
> > +
> > +AT_SETUP([OVSDB cluster - disruptive server])
> > +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
> > +
> > +n=3
> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
> > +ordinal_schema > schema
> > +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db
$abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
> > +cid=`ovsdb-tool db-cid s1.db`
> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
> > +for i in `seq 2 $n`; do
> > +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name
unix:s$i.raft unix:s1.raft])
> > +done
> > +
> > +on_exit 'kill `cat *.pid`'
> > +for i in `seq $n`; do
> > +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach
--no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i
--remote=punix:s$i.ovsdb s$i.db])
> > +done
> > +for i in `seq $n`; do
> > +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
> > +done
> > +
> > +# An unstable follower shouldn't disrupt the healthy cluster -
shouldn't
> > +# trigger term change.
> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test
stop-raft-rpc], [0], [ignore])
> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name |
grep "Role: candidate"])
> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0],
[ignore])
> > +
> > +# Should step back to follower.
> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name |
grep "Role: follower"])
> > +
> > +# No term change.
> > +for i in `seq $n`; do
> > +    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name |
grep "Term: 1"], [0], [ignore])
> > +done
> > +
> > +for i in `seq $n`; do
> > +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
> > +done
> > +
> > +AT_CLEANUP
> > +
> >
> >  AT_BANNER([OVSDB - cluster tests])
> >
>
Ilya Maximets July 3, 2023, 1:47 p.m. UTC | #4
On 7/1/23 04:43, Han Zhou wrote:
> 
> 
> On Fri, Jun 30, 2023 at 4:30 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>>
>> On 6/25/23 19:35, Han Zhou wrote:
>> > When a server becomes unstable due to overloading or intermittent
>> > partitioning, it may miss some heartbeats and then starts election with
>> > a new term, which would disrupt the otherwise healthy cluster formed by
>> > the rest of the healthy nodes.  Such situation may exist for a long time
>> > until the "flapping" server is shutdown or recovered completely, which
>> > can severely impact the availability of the cluster. The pre-vote
>> > mechanism introduced in the raft paper section 9.6 can prevent such
>> > problems. This patch implements the pre-vote mechanism, in a way that is
>> > backward compatible and safe for upgrades.
>>
>> Thanks, Han!  This is definitely interesting.  I don't think that this
>> should be a solution for overloading though, overloading should be fixed
>> by, well, reducing the load or increasing the election timer.  But some
>> intermittent network issues is indeed a valid case.
> 
> Sorry that I wasn't clear about "overloading". I was referring to the case when the machine is getting overloaded, not necessarily because of the ovsdb-server process but because of other applications/VMs sharing the same physical resource, which is likely what we have hit in the production.
> You may argue it is an improper resource allocation, but from OVSDB point of view, it is not an excuse for the whole RAFT cluster down. The pre-vote can avoid such disaster, regardless of the factors of the problematic server that we can't control.

OK.  Makes sense.

> 
>>
>> I looked through the code and I don't see any significant problems right
>> away, but I would like to experiment with it a bit more next week, since
>> the voting logic is fairly complex in general.
>>
> Appreciated!
> 
>> It's nice to see you found a use case for the stop-raft-rpc failure
>> test. :)
>>
>> See some comments inline.
>>
>> Best regards, Ilya Maximets.
>>
>> >
>> > Signed-off-by: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
>> > ---
>> >  ovsdb/raft-rpc.c       | 19 ++++++++-
>> >  ovsdb/raft-rpc.h       |  3 ++
>> >  ovsdb/raft.c           | 88 ++++++++++++++++++++++++++++++------------
>> >  tests/ovsdb-cluster.at <http://ovsdb-cluster.at> | 42 ++++++++++++++++++++
>> >  4 files changed, 127 insertions(+), 25 deletions(-)
>> >
>> > diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
>> > index dd14d81091fc..f750ba897046 100644
>> > --- a/ovsdb/raft-rpc.c
>> > +++ b/ovsdb/raft-rpc.c
>> > @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct raft_vote_request *rq,
>> >          json_object_put(args, "leadership_transfer",
>> >                          json_boolean_create(true));
>> >      }
>> > +    if (rq->is_prevote) {
>> > +        json_object_put(args, "is_prevote",
>> > +                        json_boolean_create(true));
>> > +    }

Hmm.  I don't think this is fully backward compatible.  If the other
server is the old one, the ovsdb_parser_finish() in the
raft_rpc_from_jsonrpc() will report an error, because not all the members
of the json message were consumed.  And hence the pre-vote will never
succeed in a cluster where upgraded servers do not have a quorum.

Let's say we have a cluster of 5.  The leader + 2 old servers + 2 new.
Leader goes down.  2 new servers don't have a quorum and old ones do
not reply to pre-vote requests.  I suppose, one of the old ones will
have to initiate an election then, and the new ones will reply to the
actual vote.  So, one of the old servers can become a leader, but new
servers can not.  This is probably not an issue since the cluster can
still elect a leader.  What do you think?

Maybe a clarification on that will be useful in the commit message.


>> >  }
>> >
>> >  static void
>> > @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct ovsdb_parser *p,
>> >      rq->last_log_term = raft_parse_required_uint64(p, "last_log_term");
>> >      rq->leadership_transfer
>> >          = raft_parse_optional_boolean(p, "leadership_transfer") == 1;
>> > +    rq->is_prevote
>> > +        = raft_parse_optional_boolean(p, "is_prevote") == 1;
>> >  }
>> >
>> >  static void
>> > @@ -305,6 +311,9 @@ raft_format_vote_request(const struct raft_vote_request *rq, struct ds *s)
>> >      if (rq->leadership_transfer) {
>> >          ds_put_cstr(s, " leadership_transfer=true");
>> >      }
>> > +    if (rq->is_prevote) {
>> > +        ds_put_cstr(s, " is_prevote=true");
>> > +    }
>> >  }
>> >
>> >  /* raft_vote_reply. */
>> > @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct raft_vote_reply *rpy,
>> >  {
>> >      raft_put_uint64(args, "term", rpy->term);
>> >      json_object_put_format(args, "vote", UUID_FMT, UUID_ARGS(&rpy->vote));
>> > +    if (rpy->is_prevote) {
>> > +        json_object_put(args, "is_prevote", json_boolean_create(true));
>> > +    }
>> >  }
>> >
>> >  static void
>> > @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p,
>> >  {
>> >      rpy->term = raft_parse_required_uint64(p, "term");
>> >      rpy->vote = raft_parse_required_uuid(p, "vote");
>> > +    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") == 1;
>> >  }
>> >
>> >  static void
>> > @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct raft_vote_reply *rpy, struct ds *s)
>> >  {
>> >      ds_put_format(s, " term=%"PRIu64, rpy->term);
>> >      ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
>> > +    if (rpy->is_prevote) {
>> > +        ds_put_cstr(s, " is_prevote=true");
>> > +    }
>> >  }
>> >
>> >  /* raft_add_server_request */
>> > @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc)
>> >          return NULL;
>> >
>> >      case RAFT_RPC_VOTE_REPLY:
>> > -        return &raft_vote_reply_cast(rpc)->vote;
>> > +        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
>> > +            &raft_vote_reply_cast(rpc)->vote;
>>
>> It might be better to create a pointer variable here and access it
>> instead of casting twice.
>>
> That was my first attempt, but I can't define a local variable under the 'case' statement unless I add a pair of '{}' for the whole block, which looks ugly :(
> I will try to find if there are similar patterns in the code base.

We have a few places with braces around the case body, it should be
fine.  You may also define a pointer at the top of the function
and assign it in the case.  RAFT_RPC_VOTE_REPLY is the main use-case
for this function, so having a pointer for it declared at the top
should be fine as well.

> 
>> >
>> >      default:
>> >          OVS_NOT_REACHED();
>> > diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
>> > index 221f24d00128..10f30618e3e0 100644
>> > --- a/ovsdb/raft-rpc.h
>> > +++ b/ovsdb/raft-rpc.h
>> > @@ -125,12 +125,15 @@ struct raft_vote_request {
>> >      uint64_t last_log_index; /* Index of candidate's last log entry. */
>> >      uint64_t last_log_term;  /* Term of candidate's last log entry. */
>> >      bool leadership_transfer;  /* True to override minimum election timeout. */
>> > +    bool is_prevote;         /* True: pre-vote; False: real vote (default). */
>> >  };
>> >
>> >  struct raft_vote_reply {
>> >      struct raft_rpc_common common;
>> >      uint64_t term;          /* Current term, for candidate to update itself. */
>> >      struct uuid vote;       /* Server ID of vote. */
>> > +    bool is_prevote;        /* Copy the is_prevote from the request, primarily
>> > +                               for validation. */
>>
>> Each comment line should start with a '*'.
>>
>> "Copy of the ..." ?
>>
> 
> Ack.
> 
>> >  };
>> >
>> >  struct raft_add_server_request {
>> > diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> > index b2361b1737a2..4d7a3f112cad 100644
>> > --- a/ovsdb/raft.c
>> > +++ b/ovsdb/raft.c
>> > @@ -305,6 +305,11 @@ struct raft {
>> >
>> >      /* Candidates only.  Reinitialized at start of election. */
>> >      int n_votes;                /* Number of votes for me. */
>> > +    bool prevote_passed;        /* Indicates if it passed pre-vote phase.
>> > +                                   Pre-vote mechanism is introduced in raft
>> > +                                   paper section 9.6. We implement it as a
>> > +                                   sub-state of candidate to minize the change
>> > +                                   and keep backward compatibility. */
>>
>> Each comment line should start with a '*'.
>>
>> And s/minize/minimize/.
>>
> Ack.
> 
>> >
>> >      /* Followers and candidates only. */
>> >      bool candidate_retrying;    /* The earlier election timed-out and we are
>> > @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *);
>> >  static void raft_reset_election_timer(struct raft *);
>> >  static void raft_reset_ping_timer(struct raft *);
>> >  static void raft_send_heartbeats(struct raft *);
>> > -static void raft_start_election(struct raft *, bool leadership_transfer);
>> > +static void raft_start_election(struct raft *, bool is_prevote,
>> > +                                bool leadership_transfer);
>> >  static bool raft_truncate(struct raft *, uint64_t new_end);
>> >  static void raft_get_servers_from_log(struct raft *, enum vlog_level);
>> >  static void raft_get_election_timer_from_log(struct raft *);
>> > @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft **raftp)
>> >          /* If there's only one server, start an election right away so that the
>> >           * cluster bootstraps quickly. */
>> >          if (hmap_count(&raft->servers) == 1) {
>> > -            raft_start_election(raft, false);
>> > +            /* No pre-vote needed since we are the only one. */
>> > +            raft_start_election(raft, false, false);
>> >          }
>> >      } else {
>> >          raft->join_timeout = time_msec() + 1000;
>> > @@ -1360,7 +1367,7 @@ void
>> >  raft_take_leadership(struct raft *raft)
>> >  {
>> >      if (raft->role != RAFT_LEADER) {
>> > -        raft_start_election(raft, true);
>> > +        raft_start_election(raft, false, true);
>> >      }
>> >  }
>> >
>> > @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t term, const struct uuid *vote)
>> >      return true;
>> >  }
>> >
>> > -static void
>> > +static bool
>> >  raft_accept_vote(struct raft *raft, struct raft_server *s,
>> >                   const struct uuid *vote)
>> >  {
>> >      if (uuid_equals(&s->vote, vote)) {
>> > -        return;
>> > +        return false;
>> >      }
>> >      if (!uuid_is_zero(&s->vote)) {
>> >          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>> > @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct raft_server *s,
>> >      s->vote = *vote;
>> >      if (uuid_equals(vote, &raft->sid)
>> >          && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
>> > -        raft_become_leader(raft);
>> > +        return true;
>> >      }
>> > +    return false;
>> >  }
>> >
>> >  static void
>> > -raft_start_election(struct raft *raft, bool leadership_transfer)
>> > +raft_start_election(struct raft *raft, bool is_prevote,
>> > +                    bool leadership_transfer)
>> >  {
>> > +    /* Leadership transfer doesn't use prevote. */
>> > +    ovs_assert(!is_prevote || !leadership_transfer);
>> > +
>> >      if (raft->leaving) {
>> >          return;
>> >      }
>> > @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >          return;
>> >      }
>> >
>> > -    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
>> > +    if (!is_prevote && !raft_set_term(raft, raft->term + 1, &raft->sid)) {
>> >          return;
>> >      }
>> >
>> > @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >
>> >      raft->leader_sid = UUID_ZERO;
>> >      raft->role = RAFT_CANDIDATE;
>> > -    /* If there was no leader elected since last election, we know we are
>> > -     * retrying now. */
>> > -    raft->candidate_retrying = !raft->had_leader;
>> > -    raft->had_leader = false;
>> > +    raft->prevote_passed = !is_prevote;
>> > +
>> > +    if (is_prevote || leadership_transfer) {
>> > +        /* If there was no leader elected since last election, we know we are
>> > +         * retrying now. */
>> > +        raft->candidate_retrying = !raft->had_leader;
>> > +        raft->had_leader = false;
>> > +
>> > +        raft->election_start = time_msec();
>> > +        raft->election_won = 0;
>> > +    }
>> >
>> >      raft->n_votes = 0;
>> >
>> > -    raft->election_start = time_msec();
>> > -    raft->election_won = 0;
>> >      raft->leadership_transfer = leadership_transfer;
>> >
>> >      static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>> >      if (!VLOG_DROP_INFO(&rl)) {
>> >          long long int now = time_msec();
>> > +        char *comment = is_prevote ? "prevote" : "vote";
>>
>> Should user-visible logs say "pre-vote" instead of "prevote" ?
> 
> I am ok either way. I was thinking about the situation when searching the logs with \<vote\> in vim, but it shouldn't matter much :) I can change it if pre-vote looks better to users.

It was just a bit confusing that all the comments use 'pre-vote',
but the log entry is using 'prevote'.

An alternative might be to use "starting election" for an actual
vote and "starting preliminary election" for a pre-vote.

We may also use 'preliminary': true/false, instead of "is_prevote".
Might be more user-friendly.

'prevote_passed' may be turned into 'preliminary_vote_passed'.
It's on the edge of being too long of a name, not sure what would
be a better one though.

What do you think?

>>
>> >          if (now >= raft->election_timeout) {
>> >              VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
>> > -                      "starting election",
>> > -                      raft->term, now - raft->election_base);
>> > +                      "starting election (%s)",
>> > +                      raft->term, now - raft->election_base, comment);
>> >          } else {
>> > -            VLOG_INFO("term %"PRIu64": starting election", raft->term);
>> > +            VLOG_INFO("term %"PRIu64": starting election (%s)",
>> > +                      raft->term, comment);
>> >          }
>> >      }
>> >      raft_reset_election_timer(raft);
>> > @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >                      ? raft->entries[raft->log_end - raft->log_start - 1].term
>> >                      : raft->snap.term),
>> >                  .leadership_transfer = leadership_transfer,
>> > +                .is_prevote = is_prevote,
>> >              },
>> >          };
>> >          if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
>> > @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >      }
>> >
>> >      /* Vote for ourselves. */
>> > -    raft_accept_vote(raft, me, &raft->sid);
>> > +    if (raft_accept_vote(raft, me, &raft->sid)) {
>> > +        /* We just started vote, so it shouldn't be accepted yet unless this is
>> > +         * a one-node cluster. In such case we don't do pre-vote, and become
>> > +         * leader immediately. */
>> > +        ovs_assert(!is_prevote);
>> > +        raft_become_leader(raft);
>> > +    }
>> >  }
>> >
>> >  static void
>> > @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft)
>> >                  raft_reset_election_timer(raft);
>> >              } else {
>> >                  raft_become_follower(raft);
>> > -                raft_start_election(raft, false);
>> > +                raft_start_election(raft, true, false);
>> >              }
>> >          } else {
>> > -            raft_start_election(raft, false);
>> > +            raft_start_election(raft, true, false);
>> >          }
>> >
>> >      }
>> > @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft,
>> >          return false;
>> >      }
>> >
>> > +    if (rq->is_prevote) {
>> > +        return true;
>> > +    }
>> > +
>> >      /* Record a vote for the peer. */
>> >      if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
>> >          return false;
>> > @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft,
>> >
>> >  static void
>> >  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
>> > -                     const struct uuid *vote)
>> > +                     const struct uuid *vote, bool is_prevote)
>> >  {
>> >      union raft_rpc rpy = {
>> >          .vote_reply = {
>> > @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
>> >              },
>> >              .term = raft->term,
>> >              .vote = *vote,
>> > +            .is_prevote = is_prevote,
>> >          },
>> >      };
>> >      raft_send(raft, &rpy);
>> > @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft,
>> >                           const struct raft_vote_request *rq)
>> >  {
>> >      if (raft_handle_vote_request__(raft, rq)) {
>> > -        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
>> > +        raft_send_vote_reply(raft, &rq->common.sid,
>> > +                             rq->is_prevote ? &rq->common.sid : &raft->vote,
>> > +                             rq->is_prevote);
>> >      }
>> >  }
>> >
>> > @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft,
>> >
>> >      struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
>> >      if (s) {
>> > -        raft_accept_vote(raft, s, &rpy->vote);
>> > +        if (raft_accept_vote(raft, s, &rpy->vote)) {
>>
>> IIUC, the raft_accept_vote() here can't succeed for pre-vote because
>> everyone votes for themselves on pre-vote.  Is that correct?
>>
> pre-vote happens the same way as the real vote, except that pre-vote doesn't increase the term.
> This function is for the candidate to handle a vote-reply from a peer for a vote/pre-vote request sent by itself.
> So, raft_accept_vote should succeed in normal cases, for both pre-vote and vote.
> If it is pre-vote and it is successful, it starts the real vote immediately.
> 
>> What happens if we received a real vote during a pre-vote phase for
>> some reason?  Will this logic fail?  Is this scenario possible?
> 
> The logic of handling the vote/pre-vote reply message only depends on the status of the server itself, regardless of the message's is_prevote field. So we don't really need to care if it is a pre-vote reply or vote reply. The is_prevote field in the reply was added only for validation purpose, not useful for the RPC itself, as mentioned in the comments.

Oh, thanks for explanation, I misread the raft_handle_vote_request()
the first time.  I'll take a closer look.

> 
> Thanks,
> Han
>>
>> > +            if (raft->prevote_passed) {
>> > +                raft_become_leader(raft);
>> > +            } else {
>> > +                /* Start the real election. */
>> > +                raft_start_election(raft, false, false);
>> > +            }
>> > +        }
>> >      }
>> >  }
>> >
>> > @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft,
>> >          VLOG_INFO("received leadership transfer from %s in term %"PRIu64,
>> >                    raft_get_nickname(raft, &rq->common.sid, buf, sizeof buf),
>> >                    rq->term);
>> > -        raft_start_election(raft, true);
>> > +        raft_start_election(raft, false, true);
>> >      }
>> >  }
>> >
>> > diff --git a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
>> > index 9fbf5dc897f2..8a81136999e0 100644
>> > --- a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
>> > +++ b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
>> > @@ -715,6 +715,48 @@ done
>> >
>> >  AT_CLEANUP
>> >
>> > +
>> > +AT_SETUP([OVSDB cluster - disruptive server])
>> > +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
>> > +
>> > +n=3
>> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
>> > +ordinal_schema > schema
>> > +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
>> > +cid=`ovsdb-tool db-cid s1.db`
>> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
>> > +for i in `seq 2 $n`; do
>> > +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
>> > +done
>> > +
>> > +on_exit 'kill `cat *.pid`'
>> > +for i in `seq $n`; do
>> > +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
>> > +done
>> > +for i in `seq $n`; do
>> > +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
>> > +done
>> > +
>> > +# An unstable follower shouldn't disrupt the healthy cluster - shouldn't
>> > +# trigger term change.
>> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test stop-raft-rpc], [0], [ignore])
>> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: candidate"])
>> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0], [ignore])
>> > +
>> > +# Should step back to follower.
>> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: follower"])
>> > +
>> > +# No term change.
>> > +for i in `seq $n`; do
>> > +    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name | grep "Term: 1"], [0], [ignore])
>> > +done
>> > +
>> > +for i in `seq $n`; do
>> > +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
>> > +done
>> > +
>> > +AT_CLEANUP
>> > +
>> >
>> >  AT_BANNER([OVSDB - cluster tests])
>> >
>>
Han Zhou July 7, 2023, 5:14 a.m. UTC | #5
On Mon, Jul 3, 2023 at 9:46 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/1/23 04:43, Han Zhou wrote:
> >
> >
> > On Fri, Jun 30, 2023 at 4:30 PM Ilya Maximets <i.maximets@ovn.org
<mailto:i.maximets@ovn.org>> wrote:
> >>
> >> On 6/25/23 19:35, Han Zhou wrote:
> >> > When a server becomes unstable due to overloading or intermittent
> >> > partitioning, it may miss some heartbeats and then starts election
with
> >> > a new term, which would disrupt the otherwise healthy cluster formed
by
> >> > the rest of the healthy nodes.  Such situation may exist for a long
time
> >> > until the "flapping" server is shutdown or recovered completely,
which
> >> > can severely impact the availability of the cluster. The pre-vote
> >> > mechanism introduced in the raft paper section 9.6 can prevent such
> >> > problems. This patch implements the pre-vote mechanism, in a way
that is
> >> > backward compatible and safe for upgrades.
> >>
> >> Thanks, Han!  This is definitely interesting.  I don't think that this
> >> should be a solution for overloading though, overloading should be
fixed
> >> by, well, reducing the load or increasing the election timer.  But some
> >> intermittent network issues is indeed a valid case.
> >
> > Sorry that I wasn't clear about "overloading". I was referring to the
case when the machine is getting overloaded, not necessarily because of the
ovsdb-server process but because of other applications/VMs sharing the same
physical resource, which is likely what we have hit in the production.
> > You may argue it is an improper resource allocation, but from OVSDB
point of view, it is not an excuse for the whole RAFT cluster down. The
pre-vote can avoid such disaster, regardless of the factors of the
problematic server that we can't control.
>
> OK.  Makes sense.
>
> >
> >>
> >> I looked through the code and I don't see any significant problems
right
> >> away, but I would like to experiment with it a bit more next week,
since
> >> the voting logic is fairly complex in general.
> >>
> > Appreciated!
> >
> >> It's nice to see you found a use case for the stop-raft-rpc failure
> >> test. :)
> >>
> >> See some comments inline.
> >>
> >> Best regards, Ilya Maximets.
> >>
> >> >
> >> > Signed-off-by: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
> >> > ---
> >> >  ovsdb/raft-rpc.c       | 19 ++++++++-
> >> >  ovsdb/raft-rpc.h       |  3 ++
> >> >  ovsdb/raft.c           | 88
++++++++++++++++++++++++++++++------------
> >> >  tests/ovsdb-cluster.at <http://ovsdb-cluster.at> | 42
++++++++++++++++++++
> >> >  4 files changed, 127 insertions(+), 25 deletions(-)
> >> >
> >> > diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
> >> > index dd14d81091fc..f750ba897046 100644
> >> > --- a/ovsdb/raft-rpc.c
> >> > +++ b/ovsdb/raft-rpc.c
> >> > @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct
raft_vote_request *rq,
> >> >          json_object_put(args, "leadership_transfer",
> >> >                          json_boolean_create(true));
> >> >      }
> >> > +    if (rq->is_prevote) {
> >> > +        json_object_put(args, "is_prevote",
> >> > +                        json_boolean_create(true));
> >> > +    }
>
> Hmm.  I don't think this is fully backward compatible.  If the other
> server is the old one, the ovsdb_parser_finish() in the
> raft_rpc_from_jsonrpc() will report an error, because not all the members
> of the json message were consumed.  And hence the pre-vote will never
> succeed in a cluster where upgraded servers do not have a quorum.
>
> Let's say we have a cluster of 5.  The leader + 2 old servers + 2 new.
> Leader goes down.  2 new servers don't have a quorum and old ones do
> not reply to pre-vote requests.  I suppose, one of the old ones will
> have to initiate an election then, and the new ones will reply to the
> actual vote.  So, one of the old servers can become a leader, but new
> servers can not.  This is probably not an issue since the cluster can
> still elect a leader.  What do you think?
>
> Maybe a clarification on that will be useful in the commit message.
>

Thanks for pointing this out. I just checked the logs during an upgrade
test, and it indeed had the error in the old cluster node:

2023-07-07T04:46:24.466Z|00070|raft|INFO|unix#6: syntax
"{"cluster":"a5cc9100-9d30-466a-b97c-55e72300b510","from":"7aaec107-2fe2-431e-a7f8-136f17ef9600","is_prevote":true,"last_log_index":9,"last_log_term":1,"term":1,"to":"b8ea2911-fcc3-4048-b140-f609918b7e67"}":
syntax error: Parsing raft vote_request RPC failed: Member 'is_prevote' is
present but not allowed here.

I didn't consider this regarding backward compatibility. I was mainly
trying to avoid crashing the old nodes by unknown message types, and didn't
know ovsdb rpc doesn't allow unused fields. But fortunately the node can
continue the work after this error, and just as you mentioned, the old node
then initiated the vote and the new one accepted it. So, maybe it is not a
real issue except the unpleasant error log. I will update the commit
message, and probably include this in NEWS as well.

>
> >> >  }
> >> >
> >> >  static void
> >> > @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct
ovsdb_parser *p,
> >> >      rq->last_log_term = raft_parse_required_uint64(p,
"last_log_term");
> >> >      rq->leadership_transfer
> >> >          = raft_parse_optional_boolean(p, "leadership_transfer") ==
1;
> >> > +    rq->is_prevote
> >> > +        = raft_parse_optional_boolean(p, "is_prevote") == 1;
> >> >  }
> >> >
> >> >  static void
> >> > @@ -305,6 +311,9 @@ raft_format_vote_request(const struct
raft_vote_request *rq, struct ds *s)
> >> >      if (rq->leadership_transfer) {
> >> >          ds_put_cstr(s, " leadership_transfer=true");
> >> >      }
> >> > +    if (rq->is_prevote) {
> >> > +        ds_put_cstr(s, " is_prevote=true");
> >> > +    }
> >> >  }
> >> >
> >> >  /* raft_vote_reply. */
> >> > @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct
raft_vote_reply *rpy,
> >> >  {
> >> >      raft_put_uint64(args, "term", rpy->term);
> >> >      json_object_put_format(args, "vote", UUID_FMT,
UUID_ARGS(&rpy->vote));
> >> > +    if (rpy->is_prevote) {
> >> > +        json_object_put(args, "is_prevote",
json_boolean_create(true));
> >> > +    }
> >> >  }
> >> >
> >> >  static void
> >> > @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser
*p,
> >> >  {
> >> >      rpy->term = raft_parse_required_uint64(p, "term");
> >> >      rpy->vote = raft_parse_required_uuid(p, "vote");
> >> > +    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote")
== 1;
> >> >  }
> >> >
> >> >  static void
> >> > @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct
raft_vote_reply *rpy, struct ds *s)
> >> >  {
> >> >      ds_put_format(s, " term=%"PRIu64, rpy->term);
> >> >      ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
> >> > +    if (rpy->is_prevote) {
> >> > +        ds_put_cstr(s, " is_prevote=true");
> >> > +    }
> >> >  }
> >> >
> >> >  /* raft_add_server_request */
> >> > @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc)
> >> >          return NULL;
> >> >
> >> >      case RAFT_RPC_VOTE_REPLY:
> >> > -        return &raft_vote_reply_cast(rpc)->vote;
> >> > +        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
> >> > +            &raft_vote_reply_cast(rpc)->vote;
> >>
> >> It might be better to create a pointer variable here and access it
> >> instead of casting twice.
> >>
> > That was my first attempt, but I can't define a local variable under
the 'case' statement unless I add a pair of '{}' for the whole block, which
looks ugly :(
> > I will try to find if there are similar patterns in the code base.
>
> We have a few places with braces around the case body, it should be
> fine.  You may also define a pointer at the top of the function
> and assign it in the case.  RAFT_RPC_VOTE_REPLY is the main use-case
> for this function, so having a pointer for it declared at the top
> should be fine as well.
>

Ack

> >
> >> >
> >> >      default:
> >> >          OVS_NOT_REACHED();
> >> > diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
> >> > index 221f24d00128..10f30618e3e0 100644
> >> > --- a/ovsdb/raft-rpc.h
> >> > +++ b/ovsdb/raft-rpc.h
> >> > @@ -125,12 +125,15 @@ struct raft_vote_request {
> >> >      uint64_t last_log_index; /* Index of candidate's last log
entry. */
> >> >      uint64_t last_log_term;  /* Term of candidate's last log entry.
*/
> >> >      bool leadership_transfer;  /* True to override minimum election
timeout. */
> >> > +    bool is_prevote;         /* True: pre-vote; False: real vote
(default). */
> >> >  };
> >> >
> >> >  struct raft_vote_reply {
> >> >      struct raft_rpc_common common;
> >> >      uint64_t term;          /* Current term, for candidate to
update itself. */
> >> >      struct uuid vote;       /* Server ID of vote. */
> >> > +    bool is_prevote;        /* Copy the is_prevote from the
request, primarily
> >> > +                               for validation. */
> >>
> >> Each comment line should start with a '*'.
> >>
> >> "Copy of the ..." ?
> >>
> >
> > Ack.
> >
> >> >  };
> >> >
> >> >  struct raft_add_server_request {
> >> > diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> > index b2361b1737a2..4d7a3f112cad 100644
> >> > --- a/ovsdb/raft.c
> >> > +++ b/ovsdb/raft.c
> >> > @@ -305,6 +305,11 @@ struct raft {
> >> >
> >> >      /* Candidates only.  Reinitialized at start of election. */
> >> >      int n_votes;                /* Number of votes for me. */
> >> > +    bool prevote_passed;        /* Indicates if it passed pre-vote
phase.
> >> > +                                   Pre-vote mechanism is introduced
in raft
> >> > +                                   paper section 9.6. We implement
it as a
> >> > +                                   sub-state of candidate to minize
the change
> >> > +                                   and keep backward compatibility.
*/
> >>
> >> Each comment line should start with a '*'.
> >>
> >> And s/minize/minimize/.
> >>
> > Ack.
> >
> >> >
> >> >      /* Followers and candidates only. */
> >> >      bool candidate_retrying;    /* The earlier election timed-out
and we are
> >> > @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *);
> >> >  static void raft_reset_election_timer(struct raft *);
> >> >  static void raft_reset_ping_timer(struct raft *);
> >> >  static void raft_send_heartbeats(struct raft *);
> >> > -static void raft_start_election(struct raft *, bool
leadership_transfer);
> >> > +static void raft_start_election(struct raft *, bool is_prevote,
> >> > +                                bool leadership_transfer);
> >> >  static bool raft_truncate(struct raft *, uint64_t new_end);
> >> >  static void raft_get_servers_from_log(struct raft *, enum
vlog_level);
> >> >  static void raft_get_election_timer_from_log(struct raft *);
> >> > @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft
**raftp)
> >> >          /* If there's only one server, start an election right away
so that the
> >> >           * cluster bootstraps quickly. */
> >> >          if (hmap_count(&raft->servers) == 1) {
> >> > -            raft_start_election(raft, false);
> >> > +            /* No pre-vote needed since we are the only one. */
> >> > +            raft_start_election(raft, false, false);
> >> >          }
> >> >      } else {
> >> >          raft->join_timeout = time_msec() + 1000;
> >> > @@ -1360,7 +1367,7 @@ void
> >> >  raft_take_leadership(struct raft *raft)
> >> >  {
> >> >      if (raft->role != RAFT_LEADER) {
> >> > -        raft_start_election(raft, true);
> >> > +        raft_start_election(raft, false, true);
> >> >      }
> >> >  }
> >> >
> >> > @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t
term, const struct uuid *vote)
> >> >      return true;
> >> >  }
> >> >
> >> > -static void
> >> > +static bool
> >> >  raft_accept_vote(struct raft *raft, struct raft_server *s,
> >> >                   const struct uuid *vote)
> >> >  {
> >> >      if (uuid_equals(&s->vote, vote)) {
> >> > -        return;
> >> > +        return false;
> >> >      }
> >> >      if (!uuid_is_zero(&s->vote)) {
> >> >          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1,
1);
> >> > @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct
raft_server *s,
> >> >      s->vote = *vote;
> >> >      if (uuid_equals(vote, &raft->sid)
> >> >          && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
> >> > -        raft_become_leader(raft);
> >> > +        return true;
> >> >      }
> >> > +    return false;
> >> >  }
> >> >
> >> >  static void
> >> > -raft_start_election(struct raft *raft, bool leadership_transfer)
> >> > +raft_start_election(struct raft *raft, bool is_prevote,
> >> > +                    bool leadership_transfer)
> >> >  {
> >> > +    /* Leadership transfer doesn't use prevote. */
> >> > +    ovs_assert(!is_prevote || !leadership_transfer);
> >> > +
> >> >      if (raft->leaving) {
> >> >          return;
> >> >      }
> >> > @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >> >          return;
> >> >      }
> >> >
> >> > -    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
> >> > +    if (!is_prevote && !raft_set_term(raft, raft->term + 1,
&raft->sid)) {
> >> >          return;
> >> >      }
> >> >
> >> > @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >> >
> >> >      raft->leader_sid = UUID_ZERO;
> >> >      raft->role = RAFT_CANDIDATE;
> >> > -    /* If there was no leader elected since last election, we know
we are
> >> > -     * retrying now. */
> >> > -    raft->candidate_retrying = !raft->had_leader;
> >> > -    raft->had_leader = false;
> >> > +    raft->prevote_passed = !is_prevote;
> >> > +
> >> > +    if (is_prevote || leadership_transfer) {
> >> > +        /* If there was no leader elected since last election, we
know we are
> >> > +         * retrying now. */
> >> > +        raft->candidate_retrying = !raft->had_leader;
> >> > +        raft->had_leader = false;
> >> > +
> >> > +        raft->election_start = time_msec();
> >> > +        raft->election_won = 0;
> >> > +    }
> >> >
> >> >      raft->n_votes = 0;
> >> >
> >> > -    raft->election_start = time_msec();
> >> > -    raft->election_won = 0;
> >> >      raft->leadership_transfer = leadership_transfer;
> >> >
> >> >      static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
> >> >      if (!VLOG_DROP_INFO(&rl)) {
> >> >          long long int now = time_msec();
> >> > +        char *comment = is_prevote ? "prevote" : "vote";
> >>
> >> Should user-visible logs say "pre-vote" instead of "prevote" ?
> >
> > I am ok either way. I was thinking about the situation when searching
the logs with \<vote\> in vim, but it shouldn't matter much :) I can change
it if pre-vote looks better to users.
>
> It was just a bit confusing that all the comments use 'pre-vote',
> but the log entry is using 'prevote'.
>
> An alternative might be to use "starting election" for an actual
> vote and "starting preliminary election" for a pre-vote.
>
> We may also use 'preliminary': true/false, instead of "is_prevote".
> Might be more user-friendly.
>
> 'prevote_passed' may be turned into 'preliminary_vote_passed'.
> It's on the edge of being too long of a name, not sure what would
> be a better one though.
>
> What do you think?

I'd prefer "pre-vote" since it is the terminology used in the RAFT paper. I
will update the log to "pre-vote" in v2.

>
> >>
> >> >          if (now >= raft->election_timeout) {
> >> >              VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
> >> > -                      "starting election",
> >> > -                      raft->term, now - raft->election_base);
> >> > +                      "starting election (%s)",
> >> > +                      raft->term, now - raft->election_base,
comment);
> >> >          } else {
> >> > -            VLOG_INFO("term %"PRIu64": starting election",
raft->term);
> >> > +            VLOG_INFO("term %"PRIu64": starting election (%s)",
> >> > +                      raft->term, comment);
> >> >          }
> >> >      }
> >> >      raft_reset_election_timer(raft);
> >> > @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >> >                      ? raft->entries[raft->log_end - raft->log_start
- 1].term
> >> >                      : raft->snap.term),
> >> >                  .leadership_transfer = leadership_transfer,
> >> > +                .is_prevote = is_prevote,
> >> >              },
> >> >          };
> >> >          if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
> >> > @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool
leadership_transfer)
> >> >      }
> >> >
> >> >      /* Vote for ourselves. */
> >> > -    raft_accept_vote(raft, me, &raft->sid);
> >> > +    if (raft_accept_vote(raft, me, &raft->sid)) {
> >> > +        /* We just started vote, so it shouldn't be accepted yet
unless this is
> >> > +         * a one-node cluster. In such case we don't do pre-vote,
and become
> >> > +         * leader immediately. */
> >> > +        ovs_assert(!is_prevote);
> >> > +        raft_become_leader(raft);
> >> > +    }
> >> >  }
> >> >
> >> >  static void
> >> > @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft)
> >> >                  raft_reset_election_timer(raft);
> >> >              } else {
> >> >                  raft_become_follower(raft);
> >> > -                raft_start_election(raft, false);
> >> > +                raft_start_election(raft, true, false);
> >> >              }
> >> >          } else {
> >> > -            raft_start_election(raft, false);
> >> > +            raft_start_election(raft, true, false);
> >> >          }
> >> >
> >> >      }
> >> > @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft,
> >> >          return false;
> >> >      }
> >> >
> >> > +    if (rq->is_prevote) {
> >> > +        return true;
> >> > +    }
> >> > +
> >> >      /* Record a vote for the peer. */
> >> >      if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
> >> >          return false;
> >> > @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft,
> >> >
> >> >  static void
> >> >  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
> >> > -                     const struct uuid *vote)
> >> > +                     const struct uuid *vote, bool is_prevote)
> >> >  {
> >> >      union raft_rpc rpy = {
> >> >          .vote_reply = {
> >> > @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const
struct uuid *dst,
> >> >              },
> >> >              .term = raft->term,
> >> >              .vote = *vote,
> >> > +            .is_prevote = is_prevote,
> >> >          },
> >> >      };
> >> >      raft_send(raft, &rpy);
> >> > @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft,
> >> >                           const struct raft_vote_request *rq)
> >> >  {
> >> >      if (raft_handle_vote_request__(raft, rq)) {
> >> > -        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
> >> > +        raft_send_vote_reply(raft, &rq->common.sid,
> >> > +                             rq->is_prevote ? &rq->common.sid :
&raft->vote,
> >> > +                             rq->is_prevote);
> >> >      }
> >> >  }
> >> >
> >> > @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft,
> >> >
> >> >      struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
> >> >      if (s) {
> >> > -        raft_accept_vote(raft, s, &rpy->vote);
> >> > +        if (raft_accept_vote(raft, s, &rpy->vote)) {
> >>
> >> IIUC, the raft_accept_vote() here can't succeed for pre-vote because
> >> everyone votes for themselves on pre-vote.  Is that correct?
> >>
> > pre-vote happens the same way as the real vote, except that pre-vote
doesn't increase the term.
> > This function is for the candidate to handle a vote-reply from a peer
for a vote/pre-vote request sent by itself.
> > So, raft_accept_vote should succeed in normal cases, for both pre-vote
and vote.
> > If it is pre-vote and it is successful, it starts the real vote
immediately.
> >
> >> What happens if we received a real vote during a pre-vote phase for
> >> some reason?  Will this logic fail?  Is this scenario possible?
> >
> > The logic of handling the vote/pre-vote reply message only depends on
the status of the server itself, regardless of the message's is_prevote
field. So we don't really need to care if it is a pre-vote reply or vote
reply. The is_prevote field in the reply was added only for validation
purpose, not useful for the RPC itself, as mentioned in the comments.
>
> Oh, thanks for explanation, I misread the raft_handle_vote_request()
> the first time.  I'll take a closer look.
>
No problem. Thanks for your review. Do you have other comments? If not, I
will send v2 with all the comments addressed.

Regards,
Han

> >
> > Thanks,
> > Han
> >>
> >> > +            if (raft->prevote_passed) {
> >> > +                raft_become_leader(raft);
> >> > +            } else {
> >> > +                /* Start the real election. */
> >> > +                raft_start_election(raft, false, false);
> >> > +            }
> >> > +        }
> >> >      }
> >> >  }
> >> >
> >> > @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft,
> >> >          VLOG_INFO("received leadership transfer from %s in term
%"PRIu64,
> >> >                    raft_get_nickname(raft, &rq->common.sid, buf,
sizeof buf),
> >> >                    rq->term);
> >> > -        raft_start_election(raft, true);
> >> > +        raft_start_election(raft, false, true);
> >> >      }
> >> >  }
> >> >
> >> > diff --git a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
> >> > index 9fbf5dc897f2..8a81136999e0 100644
> >> > --- a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
> >> > +++ b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at>
> >> > @@ -715,6 +715,48 @@ done
> >> >
> >> >  AT_CLEANUP
> >> >
> >> > +
> >> > +AT_SETUP([OVSDB cluster - disruptive server])
> >> > +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
> >> > +
> >> > +n=3
> >> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
> >> > +ordinal_schema > schema
> >> > +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster
s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
> >> > +cid=`ovsdb-tool db-cid s1.db`
> >> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
> >> > +for i in `seq 2 $n`; do
> >> > +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name
unix:s$i.raft unix:s1.raft])
> >> > +done
> >> > +
> >> > +on_exit 'kill `cat *.pid`'
> >> > +for i in `seq $n`; do
> >> > +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach
--no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i
--remote=punix:s$i.ovsdb s$i.db])
> >> > +done
> >> > +for i in `seq $n`; do
> >> > +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name
connected])
> >> > +done
> >> > +
> >> > +# An unstable follower shouldn't disrupt the healthy cluster -
shouldn't
> >> > +# trigger term change.
> >> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test
stop-raft-rpc], [0], [ignore])
> >> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status
$schema_name | grep "Role: candidate"])
> >> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear],
[0], [ignore])
> >> > +
> >> > +# Should step back to follower.
> >> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status
$schema_name | grep "Role: follower"])
> >> > +
> >> > +# No term change.
> >> > +for i in `seq $n`; do
> >> > +    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name
| grep "Term: 1"], [0], [ignore])
> >> > +done
> >> > +
> >> > +for i in `seq $n`; do
> >> > +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
> >> > +done
> >> > +
> >> > +AT_CLEANUP
> >> > +
> >> >
> >> >  AT_BANNER([OVSDB - cluster tests])
> >> >
> >>
>
Ilya Maximets July 12, 2023, 5:51 p.m. UTC | #6
On 7/7/23 07:14, Han Zhou wrote:
> 
> 
> On Mon, Jul 3, 2023 at 9:46 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> wrote:
>>
>> On 7/1/23 04:43, Han Zhou wrote:
>> >
>> >
>> > On Fri, Jun 30, 2023 at 4:30 PM Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org> <mailto:i.maximets@ovn.org <mailto:i.maximets@ovn.org>>> wrote:
>> >>
>> >> On 6/25/23 19:35, Han Zhou wrote:
>> >> > When a server becomes unstable due to overloading or intermittent
>> >> > partitioning, it may miss some heartbeats and then starts election with
>> >> > a new term, which would disrupt the otherwise healthy cluster formed by
>> >> > the rest of the healthy nodes.  Such situation may exist for a long time
>> >> > until the "flapping" server is shutdown or recovered completely, which
>> >> > can severely impact the availability of the cluster. The pre-vote
>> >> > mechanism introduced in the raft paper section 9.6 can prevent such
>> >> > problems. This patch implements the pre-vote mechanism, in a way that is
>> >> > backward compatible and safe for upgrades.
>> >>
>> >> Thanks, Han!  This is definitely interesting.  I don't think that this
>> >> should be a solution for overloading though, overloading should be fixed
>> >> by, well, reducing the load or increasing the election timer.  But some
>> >> intermittent network issues is indeed a valid case.
>> >
>> > Sorry that I wasn't clear about "overloading". I was referring to the case when the machine is getting overloaded, not necessarily because of the ovsdb-server process but because of other applications/VMs sharing the same physical resource, which is likely what we have hit in the production.
>> > You may argue it is an improper resource allocation, but from OVSDB point of view, it is not an excuse for the whole RAFT cluster down. The pre-vote can avoid such disaster, regardless of the factors of the problematic server that we can't control.
>>
>> OK.  Makes sense.
>>
>> >
>> >>
>> >> I looked through the code and I don't see any significant problems right
>> >> away, but I would like to experiment with it a bit more next week, since
>> >> the voting logic is fairly complex in general.
>> >>
>> > Appreciated!
>> >
>> >> It's nice to see you found a use case for the stop-raft-rpc failure
>> >> test. :)
>> >>
>> >> See some comments inline.
>> >>
>> >> Best regards, Ilya Maximets.
>> >>
>> >> >
>> >> > Signed-off-by: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org> <mailto:hzhou@ovn.org <mailto:hzhou@ovn.org>>>
>> >> > ---
>> >> >  ovsdb/raft-rpc.c       | 19 ++++++++-
>> >> >  ovsdb/raft-rpc.h       |  3 ++
>> >> >  ovsdb/raft.c           | 88 ++++++++++++++++++++++++++++++------------
>> >> >  tests/ovsdb-cluster.at <http://ovsdb-cluster.at> <http://ovsdb-cluster.at <http://ovsdb-cluster.at>> | 42 ++++++++++++++++++++
>> >> >  4 files changed, 127 insertions(+), 25 deletions(-)
>> >> >
>> >> > diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
>> >> > index dd14d81091fc..f750ba897046 100644
>> >> > --- a/ovsdb/raft-rpc.c
>> >> > +++ b/ovsdb/raft-rpc.c
>> >> > @@ -283,6 +283,10 @@ raft_vote_request_to_jsonrpc(const struct raft_vote_request *rq,
>> >> >          json_object_put(args, "leadership_transfer",
>> >> >                          json_boolean_create(true));
>> >> >      }
>> >> > +    if (rq->is_prevote) {
>> >> > +        json_object_put(args, "is_prevote",
>> >> > +                        json_boolean_create(true));
>> >> > +    }
>>
>> Hmm.  I don't think this is fully backward compatible.  If the other
>> server is the old one, the ovsdb_parser_finish() in the
>> raft_rpc_from_jsonrpc() will report an error, because not all the members
>> of the json message were consumed.  And hence the pre-vote will never
>> succeed in a cluster where upgraded servers do not have a quorum.
>>
>> Let's say we have a cluster of 5.  The leader + 2 old servers + 2 new.
>> Leader goes down.  2 new servers don't have a quorum and old ones do
>> not reply to pre-vote requests.  I suppose, one of the old ones will
>> have to initiate an election then, and the new ones will reply to the
>> actual vote.  So, one of the old servers can become a leader, but new
>> servers can not.  This is probably not an issue since the cluster can
>> still elect a leader.  What do you think?
>>
>> Maybe a clarification on that will be useful in the commit message.
>>
> 
> Thanks for pointing this out. I just checked the logs during an upgrade test, and it indeed had the error in the old cluster node:
> 
> 2023-07-07T04:46:24.466Z|00070|raft|INFO|unix#6: syntax "{"cluster":"a5cc9100-9d30-466a-b97c-55e72300b510","from":"7aaec107-2fe2-431e-a7f8-136f17ef9600","is_prevote":true,"last_log_index":9,"last_log_term":1,"term":1,"to":"b8ea2911-fcc3-4048-b140-f609918b7e67"}": syntax error: Parsing raft vote_request RPC failed: Member 'is_prevote' is present but not allowed here.
> 
> I didn't consider this regarding backward compatibility. I was mainly trying to avoid crashing the old nodes by unknown message types, and didn't know ovsdb rpc doesn't allow unused fields. But fortunately the node can continue the work after this error, and just as you mentioned, the old node then initiated the vote and the new one accepted it. So, maybe it is not a real issue except the unpleasant error log. I will update the commit message, and probably include this in NEWS as well.
> 
>>
>> >> >  }
>> >> >
>> >> >  static void
>> >> > @@ -294,6 +298,8 @@ raft_vote_request_from_jsonrpc(struct ovsdb_parser *p,
>> >> >      rq->last_log_term = raft_parse_required_uint64(p, "last_log_term");
>> >> >      rq->leadership_transfer
>> >> >          = raft_parse_optional_boolean(p, "leadership_transfer") == 1;
>> >> > +    rq->is_prevote
>> >> > +        = raft_parse_optional_boolean(p, "is_prevote") == 1;
>> >> >  }
>> >> >
>> >> >  static void
>> >> > @@ -305,6 +311,9 @@ raft_format_vote_request(const struct raft_vote_request *rq, struct ds *s)
>> >> >      if (rq->leadership_transfer) {
>> >> >          ds_put_cstr(s, " leadership_transfer=true");
>> >> >      }
>> >> > +    if (rq->is_prevote) {
>> >> > +        ds_put_cstr(s, " is_prevote=true");
>> >> > +    }
>> >> >  }
>> >> >
>> >> >  /* raft_vote_reply. */
>> >> > @@ -326,6 +335,9 @@ raft_vote_reply_to_jsonrpc(const struct raft_vote_reply *rpy,
>> >> >  {
>> >> >      raft_put_uint64(args, "term", rpy->term);
>> >> >      json_object_put_format(args, "vote", UUID_FMT, UUID_ARGS(&rpy->vote));
>> >> > +    if (rpy->is_prevote) {
>> >> > +        json_object_put(args, "is_prevote", json_boolean_create(true));
>> >> > +    }
>> >> >  }
>> >> >
>> >> >  static void
>> >> > @@ -334,6 +346,7 @@ raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p,
>> >> >  {
>> >> >      rpy->term = raft_parse_required_uint64(p, "term");
>> >> >      rpy->vote = raft_parse_required_uuid(p, "vote");
>> >> > +    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") == 1;
>> >> >  }
>> >> >
>> >> >  static void
>> >> > @@ -341,6 +354,9 @@ raft_format_vote_reply(const struct raft_vote_reply *rpy, struct ds *s)
>> >> >  {
>> >> >      ds_put_format(s, " term=%"PRIu64, rpy->term);
>> >> >      ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
>> >> > +    if (rpy->is_prevote) {
>> >> > +        ds_put_cstr(s, " is_prevote=true");
>> >> > +    }
>> >> >  }
>> >> >
>> >> >  /* raft_add_server_request */
>> >> > @@ -1008,7 +1024,8 @@ raft_rpc_get_vote(const union raft_rpc *rpc)
>> >> >          return NULL;
>> >> >
>> >> >      case RAFT_RPC_VOTE_REPLY:
>> >> > -        return &raft_vote_reply_cast(rpc)->vote;
>> >> > +        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
>> >> > +            &raft_vote_reply_cast(rpc)->vote;
>> >>
>> >> It might be better to create a pointer variable here and access it
>> >> instead of casting twice.
>> >>
>> > That was my first attempt, but I can't define a local variable under the 'case' statement unless I add a pair of '{}' for the whole block, which looks ugly :(
>> > I will try to find if there are similar patterns in the code base.
>>
>> We have a few places with braces around the case body, it should be
>> fine.  You may also define a pointer at the top of the function
>> and assign it in the case.  RAFT_RPC_VOTE_REPLY is the main use-case
>> for this function, so having a pointer for it declared at the top
>> should be fine as well.
>>
> 
> Ack
> 
>> >
>> >> >
>> >> >      default:
>> >> >          OVS_NOT_REACHED();
>> >> > diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
>> >> > index 221f24d00128..10f30618e3e0 100644
>> >> > --- a/ovsdb/raft-rpc.h
>> >> > +++ b/ovsdb/raft-rpc.h
>> >> > @@ -125,12 +125,15 @@ struct raft_vote_request {
>> >> >      uint64_t last_log_index; /* Index of candidate's last log entry. */
>> >> >      uint64_t last_log_term;  /* Term of candidate's last log entry. */
>> >> >      bool leadership_transfer;  /* True to override minimum election timeout. */
>> >> > +    bool is_prevote;         /* True: pre-vote; False: real vote (default). */
>> >> >  };
>> >> >
>> >> >  struct raft_vote_reply {
>> >> >      struct raft_rpc_common common;
>> >> >      uint64_t term;          /* Current term, for candidate to update itself. */
>> >> >      struct uuid vote;       /* Server ID of vote. */
>> >> > +    bool is_prevote;        /* Copy the is_prevote from the request, primarily
>> >> > +                               for validation. */
>> >>
>> >> Each comment line should start with a '*'.
>> >>
>> >> "Copy of the ..." ?
>> >>
>> >
>> > Ack.
>> >
>> >> >  };
>> >> >
>> >> >  struct raft_add_server_request {
>> >> > diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> >> > index b2361b1737a2..4d7a3f112cad 100644
>> >> > --- a/ovsdb/raft.c
>> >> > +++ b/ovsdb/raft.c
>> >> > @@ -305,6 +305,11 @@ struct raft {
>> >> >
>> >> >      /* Candidates only.  Reinitialized at start of election. */
>> >> >      int n_votes;                /* Number of votes for me. */
>> >> > +    bool prevote_passed;        /* Indicates if it passed pre-vote phase.
>> >> > +                                   Pre-vote mechanism is introduced in raft
>> >> > +                                   paper section 9.6. We implement it as a
>> >> > +                                   sub-state of candidate to minize the change
>> >> > +                                   and keep backward compatibility. */
>> >>
>> >> Each comment line should start with a '*'.
>> >>
>> >> And s/minize/minimize/.
>> >>
>> > Ack.
>> >
>> >> >
>> >> >      /* Followers and candidates only. */
>> >> >      bool candidate_retrying;    /* The earlier election timed-out and we are
>> >> > @@ -372,7 +377,8 @@ static void raft_become_follower(struct raft *);
>> >> >  static void raft_reset_election_timer(struct raft *);
>> >> >  static void raft_reset_ping_timer(struct raft *);
>> >> >  static void raft_send_heartbeats(struct raft *);
>> >> > -static void raft_start_election(struct raft *, bool leadership_transfer);
>> >> > +static void raft_start_election(struct raft *, bool is_prevote,
>> >> > +                                bool leadership_transfer);
>> >> >  static bool raft_truncate(struct raft *, uint64_t new_end);
>> >> >  static void raft_get_servers_from_log(struct raft *, enum vlog_level);
>> >> >  static void raft_get_election_timer_from_log(struct raft *);
>> >> > @@ -1069,7 +1075,8 @@ raft_open(struct ovsdb_log *log, struct raft **raftp)
>> >> >          /* If there's only one server, start an election right away so that the
>> >> >           * cluster bootstraps quickly. */
>> >> >          if (hmap_count(&raft->servers) == 1) {
>> >> > -            raft_start_election(raft, false);
>> >> > +            /* No pre-vote needed since we are the only one. */
>> >> > +            raft_start_election(raft, false, false);
>> >> >          }
>> >> >      } else {
>> >> >          raft->join_timeout = time_msec() + 1000;
>> >> > @@ -1360,7 +1367,7 @@ void
>> >> >  raft_take_leadership(struct raft *raft)
>> >> >  {
>> >> >      if (raft->role != RAFT_LEADER) {
>> >> > -        raft_start_election(raft, true);
>> >> > +        raft_start_election(raft, false, true);
>> >> >      }
>> >> >  }
>> >> >
>> >> > @@ -1766,12 +1773,12 @@ raft_set_term(struct raft *raft, uint64_t term, const struct uuid *vote)
>> >> >      return true;
>> >> >  }
>> >> >
>> >> > -static void
>> >> > +static bool
>> >> >  raft_accept_vote(struct raft *raft, struct raft_server *s,
>> >> >                   const struct uuid *vote)
>> >> >  {
>> >> >      if (uuid_equals(&s->vote, vote)) {
>> >> > -        return;
>> >> > +        return false;
>> >> >      }
>> >> >      if (!uuid_is_zero(&s->vote)) {
>> >> >          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>> >> > @@ -1785,13 +1792,18 @@ raft_accept_vote(struct raft *raft, struct raft_server *s,
>> >> >      s->vote = *vote;
>> >> >      if (uuid_equals(vote, &raft->sid)
>> >> >          && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
>> >> > -        raft_become_leader(raft);
>> >> > +        return true;
>> >> >      }
>> >> > +    return false;
>> >> >  }
>> >> >
>> >> >  static void
>> >> > -raft_start_election(struct raft *raft, bool leadership_transfer)
>> >> > +raft_start_election(struct raft *raft, bool is_prevote,
>> >> > +                    bool leadership_transfer)
>> >> >  {
>> >> > +    /* Leadership transfer doesn't use prevote. */
>> >> > +    ovs_assert(!is_prevote || !leadership_transfer);
>> >> > +
>> >> >      if (raft->leaving) {
>> >> >          return;
>> >> >      }
>> >> > @@ -1801,7 +1813,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >> >          return;
>> >> >      }
>> >> >
>> >> > -    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
>> >> > +    if (!is_prevote && !raft_set_term(raft, raft->term + 1, &raft->sid)) {
>> >> >          return;
>> >> >      }
>> >> >
>> >> > @@ -1809,26 +1821,33 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >> >
>> >> >      raft->leader_sid = UUID_ZERO;
>> >> >      raft->role = RAFT_CANDIDATE;
>> >> > -    /* If there was no leader elected since last election, we know we are
>> >> > -     * retrying now. */
>> >> > -    raft->candidate_retrying = !raft->had_leader;
>> >> > -    raft->had_leader = false;
>> >> > +    raft->prevote_passed = !is_prevote;
>> >> > +
>> >> > +    if (is_prevote || leadership_transfer) {
>> >> > +        /* If there was no leader elected since last election, we know we are
>> >> > +         * retrying now. */
>> >> > +        raft->candidate_retrying = !raft->had_leader;
>> >> > +        raft->had_leader = false;
>> >> > +
>> >> > +        raft->election_start = time_msec();
>> >> > +        raft->election_won = 0;
>> >> > +    }
>> >> >
>> >> >      raft->n_votes = 0;
>> >> >
>> >> > -    raft->election_start = time_msec();
>> >> > -    raft->election_won = 0;
>> >> >      raft->leadership_transfer = leadership_transfer;
>> >> >
>> >> >      static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>> >> >      if (!VLOG_DROP_INFO(&rl)) {
>> >> >          long long int now = time_msec();
>> >> > +        char *comment = is_prevote ? "prevote" : "vote";
>> >>
>> >> Should user-visible logs say "pre-vote" instead of "prevote" ?
>> >
>> > I am ok either way. I was thinking about the situation when searching the logs with \<vote\> in vim, but it shouldn't matter much :) I can change it if pre-vote looks better to users.
>>
>> It was just a bit confusing that all the comments use 'pre-vote',
>> but the log entry is using 'prevote'.
>>
>> An alternative might be to use "starting election" for an actual
>> vote and "starting preliminary election" for a pre-vote.
>>
>> We may also use 'preliminary': true/false, instead of "is_prevote".
>> Might be more user-friendly.
>>
>> 'prevote_passed' may be turned into 'preliminary_vote_passed'.
>> It's on the edge of being too long of a name, not sure what would
>> be a better one though.
>>
>> What do you think?
> 
> I'd prefer "pre-vote" since it is the terminology used in the RAFT paper. I will update the log to "pre-vote" in v2.
> 
>>
>> >>
>> >> >          if (now >= raft->election_timeout) {
>> >> >              VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
>> >> > -                      "starting election",
>> >> > -                      raft->term, now - raft->election_base);
>> >> > +                      "starting election (%s)",
>> >> > +                      raft->term, now - raft->election_base, comment);
>> >> >          } else {
>> >> > -            VLOG_INFO("term %"PRIu64": starting election", raft->term);
>> >> > +            VLOG_INFO("term %"PRIu64": starting election (%s)",
>> >> > +                      raft->term, comment);
>> >> >          }
>> >> >      }
>> >> >      raft_reset_election_timer(raft);
>> >> > @@ -1853,6 +1872,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >> >                      ? raft->entries[raft->log_end - raft->log_start - 1].term
>> >> >                      : raft->snap.term),
>> >> >                  .leadership_transfer = leadership_transfer,
>> >> > +                .is_prevote = is_prevote,
>> >> >              },
>> >> >          };
>> >> >          if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
>> >> > @@ -1861,7 +1881,13 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
>> >> >      }
>> >> >
>> >> >      /* Vote for ourselves. */
>> >> > -    raft_accept_vote(raft, me, &raft->sid);
>> >> > +    if (raft_accept_vote(raft, me, &raft->sid)) {
>> >> > +        /* We just started vote, so it shouldn't be accepted yet unless this is
>> >> > +         * a one-node cluster. In such case we don't do pre-vote, and become
>> >> > +         * leader immediately. */
>> >> > +        ovs_assert(!is_prevote);
>> >> > +        raft_become_leader(raft);
>> >> > +    }
>> >> >  }
>> >> >
>> >> >  static void
>> >> > @@ -2041,10 +2067,10 @@ raft_run(struct raft *raft)
>> >> >                  raft_reset_election_timer(raft);
>> >> >              } else {
>> >> >                  raft_become_follower(raft);
>> >> > -                raft_start_election(raft, false);
>> >> > +                raft_start_election(raft, true, false);
>> >> >              }
>> >> >          } else {
>> >> > -            raft_start_election(raft, false);
>> >> > +            raft_start_election(raft, true, false);
>> >> >          }
>> >> >
>> >> >      }
>> >> > @@ -3673,6 +3699,10 @@ raft_handle_vote_request__(struct raft *raft,
>> >> >          return false;
>> >> >      }
>> >> >
>> >> > +    if (rq->is_prevote) {
>> >> > +        return true;
>> >> > +    }
>> >> > +
>> >> >      /* Record a vote for the peer. */
>> >> >      if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
>> >> >          return false;
>> >> > @@ -3685,7 +3715,7 @@ raft_handle_vote_request__(struct raft *raft,
>> >> >
>> >> >  static void
>> >> >  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
>> >> > -                     const struct uuid *vote)
>> >> > +                     const struct uuid *vote, bool is_prevote)
>> >> >  {
>> >> >      union raft_rpc rpy = {
>> >> >          .vote_reply = {
>> >> > @@ -3695,6 +3725,7 @@ raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
>> >> >              },
>> >> >              .term = raft->term,
>> >> >              .vote = *vote,
>> >> > +            .is_prevote = is_prevote,
>> >> >          },
>> >> >      };
>> >> >      raft_send(raft, &rpy);
>> >> > @@ -3705,7 +3736,9 @@ raft_handle_vote_request(struct raft *raft,
>> >> >                           const struct raft_vote_request *rq)
>> >> >  {
>> >> >      if (raft_handle_vote_request__(raft, rq)) {
>> >> > -        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
>> >> > +        raft_send_vote_reply(raft, &rq->common.sid,
>> >> > +                             rq->is_prevote ? &rq->common.sid : &raft->vote,
>> >> > +                             rq->is_prevote);
>> >> >      }
>> >> >  }
>> >> >
>> >> > @@ -3723,7 +3756,14 @@ raft_handle_vote_reply(struct raft *raft,
>> >> >
>> >> >      struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
>> >> >      if (s) {
>> >> > -        raft_accept_vote(raft, s, &rpy->vote);
>> >> > +        if (raft_accept_vote(raft, s, &rpy->vote)) {
>> >>
>> >> IIUC, the raft_accept_vote() here can't succeed for pre-vote because
>> >> everyone votes for themselves on pre-vote.  Is that correct?
>> >>
>> > pre-vote happens the same way as the real vote, except that pre-vote doesn't increase the term.
>> > This function is for the candidate to handle a vote-reply from a peer for a vote/pre-vote request sent by itself.
>> > So, raft_accept_vote should succeed in normal cases, for both pre-vote and vote.
>> > If it is pre-vote and it is successful, it starts the real vote immediately.
>> >
>> >> What happens if we received a real vote during a pre-vote phase for
>> >> some reason?  Will this logic fail?  Is this scenario possible?
>> >
>> > The logic of handling the vote/pre-vote reply message only depends on the status of the server itself, regardless of the message's is_prevote field. So we don't really need to care if it is a pre-vote reply or vote reply. The is_prevote field in the reply was added only for validation purpose, not useful for the RPC itself, as mentioned in the comments.
>>
>> Oh, thanks for explanation, I misread the raft_handle_vote_request()
>> the first time.  I'll take a closer look.
>>
> No problem. Thanks for your review. Do you have other comments? If not, I will send v2 with all the comments addressed.

I don't have any further comments at the moment,
feel free to send a new version.

Best regards, Ilya Maximets.

> 
> Regards,
> Han
> 
>> >
>> > Thanks,
>> > Han
>> >>
>> >> > +            if (raft->prevote_passed) {
>> >> > +                raft_become_leader(raft);
>> >> > +            } else {
>> >> > +                /* Start the real election. */
>> >> > +                raft_start_election(raft, false, false);
>> >> > +            }
>> >> > +        }
>> >> >      }
>> >> >  }
>> >> >
>> >> > @@ -4357,7 +4397,7 @@ raft_handle_become_leader(struct raft *raft,
>> >> >          VLOG_INFO("received leadership transfer from %s in term %"PRIu64,
>> >> >                    raft_get_nickname(raft, &rq->common.sid, buf, sizeof buf),
>> >> >                    rq->term);
>> >> > -        raft_start_election(raft, true);
>> >> > +        raft_start_election(raft, false, true);
>> >> >      }
>> >> >  }
>> >> >
>> >> > diff --git a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> <http://ovsdb-cluster.at <http://ovsdb-cluster.at>> b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> <http://ovsdb-cluster.at <http://ovsdb-cluster.at>>
>> >> > index 9fbf5dc897f2..8a81136999e0 100644
>> >> > --- a/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> <http://ovsdb-cluster.at <http://ovsdb-cluster.at>>
>> >> > +++ b/tests/ovsdb-cluster.at <http://ovsdb-cluster.at> <http://ovsdb-cluster.at <http://ovsdb-cluster.at>>
>> >> > @@ -715,6 +715,48 @@ done
>> >> >
>> >> >  AT_CLEANUP
>> >> >
>> >> > +
>> >> > +AT_SETUP([OVSDB cluster - disruptive server])
>> >> > +AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
>> >> > +
>> >> > +n=3
>> >> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
>> >> > +ordinal_schema > schema
>> >> > +AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
>> >> > +cid=`ovsdb-tool db-cid s1.db`
>> >> > +schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
>> >> > +for i in `seq 2 $n`; do
>> >> > +    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
>> >> > +done
>> >> > +
>> >> > +on_exit 'kill `cat *.pid`'
>> >> > +for i in `seq $n`; do
>> >> > +    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
>> >> > +done
>> >> > +for i in `seq $n`; do
>> >> > +    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
>> >> > +done
>> >> > +
>> >> > +# An unstable follower shouldn't disrupt the healthy cluster - shouldn't
>> >> > +# trigger term change.
>> >> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test stop-raft-rpc], [0], [ignore])
>> >> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: candidate"])
>> >> > +AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0], [ignore])
>> >> > +
>> >> > +# Should step back to follower.
>> >> > +OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: follower"])
>> >> > +
>> >> > +# No term change.
>> >> > +for i in `seq $n`; do
>> >> > +    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name | grep "Term: 1"], [0], [ignore])
>> >> > +done
>> >> > +
>> >> > +for i in `seq $n`; do
>> >> > +    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
>> >> > +done
>> >> > +
>> >> > +AT_CLEANUP
>> >> > +
>> >> >
>> >> >  AT_BANNER([OVSDB - cluster tests])
>> >> >
>> >>
>>
diff mbox series

Patch

diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
index dd14d81091fc..f750ba897046 100644
--- a/ovsdb/raft-rpc.c
+++ b/ovsdb/raft-rpc.c
@@ -283,6 +283,10 @@  raft_vote_request_to_jsonrpc(const struct raft_vote_request *rq,
         json_object_put(args, "leadership_transfer",
                         json_boolean_create(true));
     }
+    if (rq->is_prevote) {
+        json_object_put(args, "is_prevote",
+                        json_boolean_create(true));
+    }
 }
 
 static void
@@ -294,6 +298,8 @@  raft_vote_request_from_jsonrpc(struct ovsdb_parser *p,
     rq->last_log_term = raft_parse_required_uint64(p, "last_log_term");
     rq->leadership_transfer
         = raft_parse_optional_boolean(p, "leadership_transfer") == 1;
+    rq->is_prevote
+        = raft_parse_optional_boolean(p, "is_prevote") == 1;
 }
 
 static void
@@ -305,6 +311,9 @@  raft_format_vote_request(const struct raft_vote_request *rq, struct ds *s)
     if (rq->leadership_transfer) {
         ds_put_cstr(s, " leadership_transfer=true");
     }
+    if (rq->is_prevote) {
+        ds_put_cstr(s, " is_prevote=true");
+    }
 }
 
 /* raft_vote_reply. */
@@ -326,6 +335,9 @@  raft_vote_reply_to_jsonrpc(const struct raft_vote_reply *rpy,
 {
     raft_put_uint64(args, "term", rpy->term);
     json_object_put_format(args, "vote", UUID_FMT, UUID_ARGS(&rpy->vote));
+    if (rpy->is_prevote) {
+        json_object_put(args, "is_prevote", json_boolean_create(true));
+    }
 }
 
 static void
@@ -334,6 +346,7 @@  raft_vote_reply_from_jsonrpc(struct ovsdb_parser *p,
 {
     rpy->term = raft_parse_required_uint64(p, "term");
     rpy->vote = raft_parse_required_uuid(p, "vote");
+    rpy->is_prevote = raft_parse_optional_boolean(p, "is_prevote") == 1;
 }
 
 static void
@@ -341,6 +354,9 @@  raft_format_vote_reply(const struct raft_vote_reply *rpy, struct ds *s)
 {
     ds_put_format(s, " term=%"PRIu64, rpy->term);
     ds_put_format(s, " vote="SID_FMT, SID_ARGS(&rpy->vote));
+    if (rpy->is_prevote) {
+        ds_put_cstr(s, " is_prevote=true");
+    }
 }
 
 /* raft_add_server_request */
@@ -1008,7 +1024,8 @@  raft_rpc_get_vote(const union raft_rpc *rpc)
         return NULL;
 
     case RAFT_RPC_VOTE_REPLY:
-        return &raft_vote_reply_cast(rpc)->vote;
+        return raft_vote_reply_cast(rpc)->is_prevote ? NULL :
+            &raft_vote_reply_cast(rpc)->vote;
 
     default:
         OVS_NOT_REACHED();
diff --git a/ovsdb/raft-rpc.h b/ovsdb/raft-rpc.h
index 221f24d00128..10f30618e3e0 100644
--- a/ovsdb/raft-rpc.h
+++ b/ovsdb/raft-rpc.h
@@ -125,12 +125,15 @@  struct raft_vote_request {
     uint64_t last_log_index; /* Index of candidate's last log entry. */
     uint64_t last_log_term;  /* Term of candidate's last log entry. */
     bool leadership_transfer;  /* True to override minimum election timeout. */
+    bool is_prevote;         /* True: pre-vote; False: real vote (default). */
 };
 
 struct raft_vote_reply {
     struct raft_rpc_common common;
     uint64_t term;          /* Current term, for candidate to update itself. */
     struct uuid vote;       /* Server ID of vote. */
+    bool is_prevote;        /* Copy the is_prevote from the request, primarily
+                               for validation. */
 };
 
 struct raft_add_server_request {
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index b2361b1737a2..4d7a3f112cad 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -305,6 +305,11 @@  struct raft {
 
     /* Candidates only.  Reinitialized at start of election. */
     int n_votes;                /* Number of votes for me. */
+    bool prevote_passed;        /* Indicates if it passed pre-vote phase.
+                                   Pre-vote mechanism is introduced in raft
+                                   paper section 9.6. We implement it as a
+                                   sub-state of candidate to minize the change
+                                   and keep backward compatibility. */
 
     /* Followers and candidates only. */
     bool candidate_retrying;    /* The earlier election timed-out and we are
@@ -372,7 +377,8 @@  static void raft_become_follower(struct raft *);
 static void raft_reset_election_timer(struct raft *);
 static void raft_reset_ping_timer(struct raft *);
 static void raft_send_heartbeats(struct raft *);
-static void raft_start_election(struct raft *, bool leadership_transfer);
+static void raft_start_election(struct raft *, bool is_prevote,
+                                bool leadership_transfer);
 static bool raft_truncate(struct raft *, uint64_t new_end);
 static void raft_get_servers_from_log(struct raft *, enum vlog_level);
 static void raft_get_election_timer_from_log(struct raft *);
@@ -1069,7 +1075,8 @@  raft_open(struct ovsdb_log *log, struct raft **raftp)
         /* If there's only one server, start an election right away so that the
          * cluster bootstraps quickly. */
         if (hmap_count(&raft->servers) == 1) {
-            raft_start_election(raft, false);
+            /* No pre-vote needed since we are the only one. */
+            raft_start_election(raft, false, false);
         }
     } else {
         raft->join_timeout = time_msec() + 1000;
@@ -1360,7 +1367,7 @@  void
 raft_take_leadership(struct raft *raft)
 {
     if (raft->role != RAFT_LEADER) {
-        raft_start_election(raft, true);
+        raft_start_election(raft, false, true);
     }
 }
 
@@ -1766,12 +1773,12 @@  raft_set_term(struct raft *raft, uint64_t term, const struct uuid *vote)
     return true;
 }
 
-static void
+static bool
 raft_accept_vote(struct raft *raft, struct raft_server *s,
                  const struct uuid *vote)
 {
     if (uuid_equals(&s->vote, vote)) {
-        return;
+        return false;
     }
     if (!uuid_is_zero(&s->vote)) {
         static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
@@ -1785,13 +1792,18 @@  raft_accept_vote(struct raft *raft, struct raft_server *s,
     s->vote = *vote;
     if (uuid_equals(vote, &raft->sid)
         && ++raft->n_votes > hmap_count(&raft->servers) / 2) {
-        raft_become_leader(raft);
+        return true;
     }
+    return false;
 }
 
 static void
-raft_start_election(struct raft *raft, bool leadership_transfer)
+raft_start_election(struct raft *raft, bool is_prevote,
+                    bool leadership_transfer)
 {
+    /* Leadership transfer doesn't use prevote. */
+    ovs_assert(!is_prevote || !leadership_transfer);
+
     if (raft->leaving) {
         return;
     }
@@ -1801,7 +1813,7 @@  raft_start_election(struct raft *raft, bool leadership_transfer)
         return;
     }
 
-    if (!raft_set_term(raft, raft->term + 1, &raft->sid)) {
+    if (!is_prevote && !raft_set_term(raft, raft->term + 1, &raft->sid)) {
         return;
     }
 
@@ -1809,26 +1821,33 @@  raft_start_election(struct raft *raft, bool leadership_transfer)
 
     raft->leader_sid = UUID_ZERO;
     raft->role = RAFT_CANDIDATE;
-    /* If there was no leader elected since last election, we know we are
-     * retrying now. */
-    raft->candidate_retrying = !raft->had_leader;
-    raft->had_leader = false;
+    raft->prevote_passed = !is_prevote;
+
+    if (is_prevote || leadership_transfer) {
+        /* If there was no leader elected since last election, we know we are
+         * retrying now. */
+        raft->candidate_retrying = !raft->had_leader;
+        raft->had_leader = false;
+
+        raft->election_start = time_msec();
+        raft->election_won = 0;
+    }
 
     raft->n_votes = 0;
 
-    raft->election_start = time_msec();
-    raft->election_won = 0;
     raft->leadership_transfer = leadership_transfer;
 
     static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
     if (!VLOG_DROP_INFO(&rl)) {
         long long int now = time_msec();
+        char *comment = is_prevote ? "prevote" : "vote";
         if (now >= raft->election_timeout) {
             VLOG_INFO("term %"PRIu64": %lld ms timeout expired, "
-                      "starting election",
-                      raft->term, now - raft->election_base);
+                      "starting election (%s)",
+                      raft->term, now - raft->election_base, comment);
         } else {
-            VLOG_INFO("term %"PRIu64": starting election", raft->term);
+            VLOG_INFO("term %"PRIu64": starting election (%s)",
+                      raft->term, comment);
         }
     }
     raft_reset_election_timer(raft);
@@ -1853,6 +1872,7 @@  raft_start_election(struct raft *raft, bool leadership_transfer)
                     ? raft->entries[raft->log_end - raft->log_start - 1].term
                     : raft->snap.term),
                 .leadership_transfer = leadership_transfer,
+                .is_prevote = is_prevote,
             },
         };
         if (failure_test != FT_DONT_SEND_VOTE_REQUEST) {
@@ -1861,7 +1881,13 @@  raft_start_election(struct raft *raft, bool leadership_transfer)
     }
 
     /* Vote for ourselves. */
-    raft_accept_vote(raft, me, &raft->sid);
+    if (raft_accept_vote(raft, me, &raft->sid)) {
+        /* We just started vote, so it shouldn't be accepted yet unless this is
+         * a one-node cluster. In such case we don't do pre-vote, and become
+         * leader immediately. */
+        ovs_assert(!is_prevote);
+        raft_become_leader(raft);
+    }
 }
 
 static void
@@ -2041,10 +2067,10 @@  raft_run(struct raft *raft)
                 raft_reset_election_timer(raft);
             } else {
                 raft_become_follower(raft);
-                raft_start_election(raft, false);
+                raft_start_election(raft, true, false);
             }
         } else {
-            raft_start_election(raft, false);
+            raft_start_election(raft, true, false);
         }
 
     }
@@ -3673,6 +3699,10 @@  raft_handle_vote_request__(struct raft *raft,
         return false;
     }
 
+    if (rq->is_prevote) {
+        return true;
+    }
+
     /* Record a vote for the peer. */
     if (!raft_set_term(raft, raft->term, &rq->common.sid)) {
         return false;
@@ -3685,7 +3715,7 @@  raft_handle_vote_request__(struct raft *raft,
 
 static void
 raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
-                     const struct uuid *vote)
+                     const struct uuid *vote, bool is_prevote)
 {
     union raft_rpc rpy = {
         .vote_reply = {
@@ -3695,6 +3725,7 @@  raft_send_vote_reply(struct raft *raft, const struct uuid *dst,
             },
             .term = raft->term,
             .vote = *vote,
+            .is_prevote = is_prevote,
         },
     };
     raft_send(raft, &rpy);
@@ -3705,7 +3736,9 @@  raft_handle_vote_request(struct raft *raft,
                          const struct raft_vote_request *rq)
 {
     if (raft_handle_vote_request__(raft, rq)) {
-        raft_send_vote_reply(raft, &rq->common.sid, &raft->vote);
+        raft_send_vote_reply(raft, &rq->common.sid,
+                             rq->is_prevote ? &rq->common.sid : &raft->vote,
+                             rq->is_prevote);
     }
 }
 
@@ -3723,7 +3756,14 @@  raft_handle_vote_reply(struct raft *raft,
 
     struct raft_server *s = raft_find_peer(raft, &rpy->common.sid);
     if (s) {
-        raft_accept_vote(raft, s, &rpy->vote);
+        if (raft_accept_vote(raft, s, &rpy->vote)) {
+            if (raft->prevote_passed) {
+                raft_become_leader(raft);
+            } else {
+                /* Start the real election. */
+                raft_start_election(raft, false, false);
+            }
+        }
     }
 }
 
@@ -4357,7 +4397,7 @@  raft_handle_become_leader(struct raft *raft,
         VLOG_INFO("received leadership transfer from %s in term %"PRIu64,
                   raft_get_nickname(raft, &rq->common.sid, buf, sizeof buf),
                   rq->term);
-        raft_start_election(raft, true);
+        raft_start_election(raft, false, true);
     }
 }
 
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 9fbf5dc897f2..8a81136999e0 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -715,6 +715,48 @@  done
 
 AT_CLEANUP
 
+
+AT_SETUP([OVSDB cluster - disruptive server])
+AT_KEYWORDS([ovsdb server negative unix cluster disruptive])
+
+n=3
+schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+ordinal_schema > schema
+AT_CHECK([ovsdb-tool '-vPATTERN:console:%c|%p|%m' create-cluster s1.db $abs_srcdir/idltest.ovsschema unix:s1.raft], [0], [], [stderr])
+cid=`ovsdb-tool db-cid s1.db`
+schema_name=`ovsdb-tool schema-name $abs_srcdir/idltest.ovsschema`
+for i in `seq 2 $n`; do
+    AT_CHECK([ovsdb-tool join-cluster s$i.db $schema_name unix:s$i.raft unix:s1.raft])
+done
+
+on_exit 'kill `cat *.pid`'
+for i in `seq $n`; do
+    AT_CHECK([ovsdb-server -v -vconsole:off -vsyslog:off --detach --no-chdir --log-file=s$i.log --pidfile=s$i.pid --unixctl=s$i --remote=punix:s$i.ovsdb s$i.db])
+done
+for i in `seq $n`; do
+    AT_CHECK([ovsdb_client_wait unix:s$i.ovsdb $schema_name connected])
+done
+
+# An unstable follower shouldn't disrupt the healthy cluster - shouldn't
+# trigger term change.
+AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test stop-raft-rpc], [0], [ignore])
+OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: candidate"])
+AT_CHECK([ovs-appctl -t "`pwd`"/s2 cluster/failure-test clear], [0], [ignore])
+
+# Should step back to follower.
+OVS_WAIT_UNTIL([ovs-appctl -t "`pwd`"/s2 cluster/status $schema_name | grep "Role: follower"])
+
+# No term change.
+for i in `seq $n`; do
+    AT_CHECK([ovs-appctl -t "`pwd`"/s$i cluster/status $schema_name | grep "Term: 1"], [0], [ignore])
+done
+
+for i in `seq $n`; do
+    OVS_APP_EXIT_AND_WAIT_BY_TARGET([`pwd`/s$i], [s$i.pid])
+done
+
+AT_CLEANUP
+
 
 AT_BANNER([OVSDB - cluster tests])