diff mbox series

[ovs-dev,v2,1/5] ovsdb: raft: Avoid transferring leadership to unavailable servers.

Message ID 20240326172717.1454071-2-i.maximets@ovn.org
State Accepted
Commit aab379ec21c95971fe6a05fb94793d1744a864ce
Series ovsdb: raft: Fixes for cluster joining state.

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation fail test: fail

Commit Message

Ilya Maximets March 26, 2024, 5:27 p.m. UTC
The current implementation of the leadership transfer just shoots the
leadership in the general direction of the first stable server in the
configuration.  It doesn't check whether the server was active recently
or even whether the connection is established.  This may result in
sending leadership to a disconnected or otherwise unavailable server.

Such behavior should not cause log truncation or any other correctness
issues, because the destination server would have all the append
requests queued up, or the connection will be dropped by the leader.
In the worst case, we will have a leader-less cluster until the next
election timer fires.  Other servers will notice the absence of the
leader and will trigger a new leader election normally.
However, the potential wait for the election timer is not good, as
real-world setups may have high values configured.

Fix that by trying to transfer to servers that we know have applied
the most changes, i.e., have the highest 'match_index'.  Such servers
replied to the most recent append requests, so they have the highest
chance of being healthy.  Choose a random starting point in the
list of such servers, so we don't transfer to the same server every
single time.  This slightly improves load distribution and, most
importantly, increases robustness of our test suite, making it cover
more cases.  Also check that the message was actually sent without
an immediate failure.

If we fail to transfer to any server with the highest index, try to
transfer to any other server that is not behind the majority, and then
to any other server that is connected.  We did actually send them
all the updates (if the connection is open); they just haven't replied
yet for one reason or another.  It should be better than leaving the
cluster without a leader.

Note that there is always a chance that the transfer will fail, since
we're not waiting for it to be acknowledged (and must not wait).
In that case, a normal election will be triggered after the election
timer fires.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

CC: Felix Huettner <felix.huettner@mail.schwarz>

 ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)
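For readers following along, the selection strategy described in the commit message (scan stable servers round-robin from a random start, prefer the highest 'match_index', then fall back to the commit index and finally to any connected server) can be sketched as a small standalone model.  This is a simplified illustration, not the actual OVS code: 'struct candidate', 'choose_target', and the 'connected' flag are hypothetical stand-ins for 'raft_server', the connection lookup, and a successful 'raft_send_to_conn()'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a stable, remote raft_server. */
struct candidate {
    uint64_t match_index;   /* Last log index known to be replicated. */
    bool connected;         /* Models "connection exists and send succeeded". */
};

/* Pick a transfer target: scan round-robin from 'start', prefer servers at
 * the highest match_index, then fall back to servers at or above the commit
 * index, then to any connected server.  Returns the chosen index or -1. */
static int
choose_target(const struct candidate *c, size_t n, size_t start,
              uint64_t commit_index)
{
    uint64_t threshold = 0;

    for (size_t i = 0; i < n; i++) {
        if (c[i].match_index > threshold) {
            threshold = c[i].match_index;
        }
    }

retry:
    for (size_t i = 0; i < n; i++) {
        size_t idx = (start + i) % n;

        if (c[idx].match_index >= threshold && c[idx].connected) {
            return (int) idx;
        }
    }
    if (threshold) {
        /* Same two-stage fallback as the patch: first drop to the commit
         * index (servers not behind the majority), then to zero. */
        threshold = threshold > commit_index ? commit_index : 0;
        goto retry;
    }
    return -1;
}
```

With servers at match_index {5 (disconnected), 5, 3} and commit_index 5, the first pass already picks the connected server at the highest index; with {5 (disconnected), 3} and commit_index 4, the threshold drops twice (5 → 4 → 0) before the lagging but connected server is accepted.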

Comments

Han Zhou March 26, 2024, 8:10 p.m. UTC | #1
On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maximets@ovn.org> wrote:
Thanks Ilya. It seems the retry could try the earlier failed server (e.g.
the ones for which raft_send_to_conn() returned false) one or two more times,
but it should be fine because the number of servers is very small anyway.
So this looks good to me.

Acked-by: Han Zhou <hzhou@ovn.org>

Felix Huettner March 27, 2024, 8:26 a.m. UTC | #2
On Tue, Mar 26, 2024 at 06:27:09PM +0100, Ilya Maximets wrote:

Thank you, looks good to me.
Acked-by: Felix Huettner <felix.huettner@mail.schwarz>
Ilya Maximets March 27, 2024, 9:50 p.m. UTC | #3
On 3/26/24 21:10, Han Zhou wrote:
> Thanks Ilya. It seems the retry could try the earlier failed server (e.g. the ones
> for which raft_send_to_conn() returned false) one or two more times, but it should
> be fine because the number of servers is very small anyway. So this looks good to me.

Yeah, most of the time this array will have just 2 elements
and commit_index will be equal to the highest match_index,
so we will likely only retry once.

> 
> Acked-by: Han Zhou <hzhou@ovn.org>

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index f463afcb3..b171da345 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1261,10 +1261,30 @@  raft_transfer_leadership(struct raft *raft, const char *reason)
         return;
     }
 
-    struct raft_server *s;
+    struct raft_server **servers, *s;
+    uint64_t threshold = 0;
+    size_t n = 0, start, i;
+
+    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
+
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
-        if (!uuid_equals(&raft->sid, &s->sid)
-            && s->phase == RAFT_PHASE_STABLE) {
+        if (uuid_equals(&raft->sid, &s->sid)
+            || s->phase != RAFT_PHASE_STABLE) {
+            continue;
+        }
+        if (s->match_index > threshold) {
+            threshold = s->match_index;
+        }
+        servers[n++] = s;
+    }
+
+    start = n ? random_range(n) : 0;
+
+retry:
+    for (i = 0; i < n; i++) {
+        s = servers[(start + i) % n];
+
+        if (s->match_index >= threshold) {
             struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
             if (!conn) {
                 continue;
@@ -1280,7 +1300,10 @@  raft_transfer_leadership(struct raft *raft, const char *reason)
                     .term = raft->term,
                 }
             };
-            raft_send_to_conn(raft, &rpc, conn);
+
+            if (!raft_send_to_conn(raft, &rpc, conn)) {
+                continue;
+            }
 
             raft_record_note(raft, "transfer leadership",
                              "transferring leadership to %s because %s",
@@ -1288,6 +1311,23 @@  raft_transfer_leadership(struct raft *raft, const char *reason)
             break;
         }
     }
+
+    if (n && i == n && threshold) {
+        if (threshold > raft->commit_index) {
+            /* Failed to transfer to servers with the highest 'match_index'.
+             * Try other servers that are not behind the majority. */
+            threshold = raft->commit_index;
+        } else {
+            /* Try any other server.  It is safe, because they either have all
+             * the append requests queued up for them before the leadership
+             * transfer message or their connection is broken and we will not
+             * transfer anyway. */
+            threshold = 0;
+        }
+        goto retry;
+    }
+
+    free(servers);
 }
 
 /* Send a RemoveServerRequest to the rest of the servers in the cluster.