[ovs-dev,1/5] ovsdb: raft: Randomize leadership transfer.

Message ID: 20240315201614.236523-2-i.maximets@ovn.org
State: Changes Requested
Series: ovsdb: raft: Fixes for cluster joining state.

Checks

Context                                Check    Description
ovsrobot/apply-robot                   success  apply and check: success
ovsrobot/github-robot-_Build_and_Test  success  github build: passed
ovsrobot/intel-ovs-compilation         success  test: success

Commit Message

Ilya Maximets March 15, 2024, 8:14 p.m. UTC
Each cluster member typically always transfers leadership to the same
other member, which is the first in its list of servers.  This may
result in two servers in a 3-node cluster transferring leadership to
each other and never to the third one.

Randomize the selection to distribute the load more evenly.

This also makes cluster failure tests cover more scenarios, as servers
will transfer leadership to servers they didn't transfer to before.
This is especially important for cluster joining tests.

Ideally, we would transfer to a random server with the highest apply
index, but that is not implemented for now.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 ovsdb/raft.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Felix Huettner March 18, 2024, 4:15 p.m. UTC | #1
On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> Each cluster member typically always transfers leadership to the same
> other member, which is the first in their list of servers.  This may
> result in two servers in a 3-node cluster to transfer leadership to
> each other and never to the third one.
> 
> Randomizing the selection to make the load more evenly distributed.

Hi Ilya,

just out of curiosity: since basically only one of the 3 members is
active at any point in time, is balancing the load even relevant? It
will always be on only one of the 3 members anyway.

> 
> This also makes cluster failure tests cover more scenarios as servers
> will transfer leadership to servers they didn't before.  This is
> important especially for cluster joining tests.
> 
> Ideally, we would transfer to a random server with a highest apply
> index, but not trying to implement this for now.
> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  ovsdb/raft.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index f463afcb3..25f462431 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>          return;
>      }
>  
> +    size_t n = hmap_count(&raft->servers) * 3;
>      struct raft_server *s;
> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> +
> +    while (n--) {
> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
> +                         struct raft_server, hmap_node);
>          if (!uuid_equals(&raft->sid, &s->sid)
>              && s->phase == RAFT_PHASE_STABLE) {
>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);

I think this has the risk of never selecting one server out of the list of
cluster members. Suppose you have a 3-node cluster where one of the
members is down. In this case there is a single member the leadership
can be transferred to.
This means a single iteration of the while loop has a 2/3 chance of
selecting a member that cannot be used. Over the 9 iterations this loop
performs, that gives a chance of (2/3)^9 = 512/19683, or about 2.6%, of
always choosing an inappropriate member, i.e. of never selecting the
single appropriate target member.

Could we instead start iterating the hmap at some random offset? That
should give a similar result while guaranteeing that we visit each
member once.
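
E.g. something like this untested sketch as a replacement for the loop in
the patch (it only uses the hmap helpers already present there, plus
random_range() from lib/random.h, so please treat it as an illustration of
the idea rather than a tested change):

    /* Pick a random starting offset, then walk the hmap at most once,
     * wrapping around, so that every server is visited exactly once. */
    size_t offset = random_range(hmap_count(&raft->servers));
    size_t i = 0;
    struct raft_server *s, *target = NULL;

    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        if (i++ < offset) {
            continue;               /* Skip until the random offset. */
        }
        if (!uuid_equals(&raft->sid, &s->sid)
            && s->phase == RAFT_PHASE_STABLE) {
            target = s;
            break;
        }
    }
    if (!target) {
        /* Wrap around and check the servers skipped above. */
        i = 0;
        HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
            if (i++ >= offset) {
                break;
            }
            if (!uuid_equals(&raft->sid, &s->sid)
                && s->phase == RAFT_PHASE_STABLE) {
                target = s;
                break;
            }
        }
    }
    /* If 'target' is nonnull, send it the transfer request exactly as in
     * the current loop body. */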

Thanks
Felix
Ilya Maximets March 18, 2024, 4:52 p.m. UTC | #2
On 3/18/24 17:15, Felix Huettner wrote:
> On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
>> Each cluster member typically always transfers leadership to the same
>> other member, which is the first in their list of servers.  This may
>> result in two servers in a 3-node cluster to transfer leadership to
>> each other and never to the third one.
>>
>> Randomizing the selection to make the load more evenly distributed.
> 
> Hi Ilya,

Hi, Felix.  Thanks for the comments!

> 
> just out of curiosity: since basically only one of the 3 members is
> active at any point in time, is balancing the load even relevant. It
> will always only be on one of the 3 members anyway.

It is not very important, I agree.  What I observed in practice is
that sometimes, if compactions happen at approximately the same time,
the server we transfer the leadership to may send it right back while
the first server is still busy compacting.  This is less of a problem
nowadays, since we have parallel compaction, but it may still be
annoying if that happens every time.

I'm mostly making this patch for the purpose of better testing below.

> 
>>
>> This also makes cluster failure tests cover more scenarios as servers
>> will transfer leadership to servers they didn't before.  This is
>> important especially for cluster joining tests.
>>
>> Ideally, we would transfer to a random server with a highest apply
>> index, but not trying to implement this for now.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
>>  ovsdb/raft.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index f463afcb3..25f462431 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>          return;
>>      }
>>  
>> +    size_t n = hmap_count(&raft->servers) * 3;
>>      struct raft_server *s;
>> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> +
>> +    while (n--) {
>> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
>> +                         struct raft_server, hmap_node);
>>          if (!uuid_equals(&raft->sid, &s->sid)
>>              && s->phase == RAFT_PHASE_STABLE) {
>>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> 
> i think this has the risk of never selecting one server out of the list of
> cluster members. Suppose you have a 3 node cluster where one of them
> members is down. In this case there is a single member the leadership
> can be transfered to.
> This means for a single iteration of the while loop has a chance of 2/3
> to select a member that can not be used. Over the 9 iterations this
> would do this would give a chance of (2/3)^9 to always choose an
> inappropriate member. This equals to a chance of 0.026% of never
> selecting the single appropriate target member.
> 
> Could we instead rather start iterating the hmap at some random offset?
> That should give a similar result while giving the guarantee that we
> visit each member once.

I don't think it's very important, because we're transferring leadership
without verifying whether the other side is alive, and we're not checking
whether we actually transferred it or not.  So, the retries are basically
just to keep the server from picking itself or servers that didn't join
yet.  We will transfer to the first other server that already joined,
regardless of the iteration method.

The way to mostly fix the issue with dead servers is, as I mentioned, to
transfer only to the servers with the highest apply index, i.e. the servers
that acknowledged the largest number of changes.  This will ensure that we
at least heard something from the server in the recent past.  But that's
a separate feature to implement.
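
For reference, a very rough sketch of that idea, reusing the match_index
field from struct raft_server (random tie-breaking among servers with an
equal match_index is omitted; this is just an illustration, not the actual
change):

    /* Among the stable servers other than ourselves, prefer the one with
     * the largest match_index, i.e. the one that acknowledged the most
     * log entries. */
    struct raft_server *s, *best = NULL;

    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        if (uuid_equals(&raft->sid, &s->sid)
            || s->phase != RAFT_PHASE_STABLE) {
            continue;
        }
        if (!best || s->match_index > best->match_index) {
            best = s;
        }
    }
    /* If 'best' is nonnull, it becomes the transfer target. */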

Also, the leadership transfer is just an optimization to speed up elections,
so this operation doesn't need to succeed for correctness.  If we fail to
transfer, or transfer to a dead server, the rest of the cluster will notice
the absence of a leader and initiate an election after the election timeout.

Best regards, Ilya Maximets.
Felix Huettner March 19, 2024, 7:04 a.m. UTC | #3
On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> On 3/18/24 17:15, Felix Huettner wrote:
> > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> >> Each cluster member typically always transfers leadership to the same
> >> other member, which is the first in their list of servers.  This may
> >> result in two servers in a 3-node cluster to transfer leadership to
> >> each other and never to the third one.
> >>
> >> Randomizing the selection to make the load more evenly distributed.
> > 
> > Hi Ilya,
> 
> Hi, Felix.  Thanks for the comments!
> 
> > 
> > just out of curiosity: since basically only one of the 3 members is
> > active at any point in time, is balancing the load even relevant. It
> > will always only be on one of the 3 members anyway.
> It is not very important, I agree.  What I observed in practice is
> that sometimes if, for example, compactions happen in approximately
> similar time, the server we transfer the leadership to may send it
> right back, while the first server is busy compacting.  This is
> less of a problem today as well, since we have parallel compaction,
> but it may still be annoying if that happens every time.
> 
> I'm mostly making this patch for the purpose of better testing below.
> 
> > 
> >>
> >> This also makes cluster failure tests cover more scenarios as servers
> >> will transfer leadership to servers they didn't before.  This is
> >> important especially for cluster joining tests.
> >>
> >> Ideally, we would transfer to a random server with a highest apply
> >> index, but not trying to implement this for now.
> >>
> >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >> ---
> >>  ovsdb/raft.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> index f463afcb3..25f462431 100644
> >> --- a/ovsdb/raft.c
> >> +++ b/ovsdb/raft.c
> >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
> >>          return;
> >>      }
> >>  
> >> +    size_t n = hmap_count(&raft->servers) * 3;
> >>      struct raft_server *s;
> >> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> >> +
> >> +    while (n--) {
> >> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
> >> +                         struct raft_server, hmap_node);
> >>          if (!uuid_equals(&raft->sid, &s->sid)
> >>              && s->phase == RAFT_PHASE_STABLE) {
> >>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> > 
> > i think this has the risk of never selecting one server out of the list of
> > cluster members. Suppose you have a 3 node cluster where one of them
> > members is down. In this case there is a single member the leadership
> > can be transfered to.
> > This means for a single iteration of the while loop has a chance of 2/3
> > to select a member that can not be used. Over the 9 iterations this
> > would do this would give a chance of (2/3)^9 to always choose an
> > inappropriate member. This equals to a chance of 0.026% of never
> > selecting the single appropriate target member.
> > 
> > Could we instead rather start iterating the hmap at some random offset?
> > That should give a similar result while giving the guarantee that we
> > visit each member once.
> 
> I don't think it's very important, because we're transferring leadership
> without verifying if the other side is alive and we're not checking if we
> actually transferred it or not.  So, retries are basically just for the
> server not hitting itself or servers that didn't join yet.  We will transfer
> to the first other server that already joined regardless of the iteration
> method.
> 
> The way to mostly fix the issue with dead servers is, as I mentioned, to
> transfer only to the servers with the highest apply index, i.e. the servers
> that acknowledged the most amount of changes.  This will ensure that we
> at least heard something from the server in the recent past.  But that's
> a separate feature to implement.
> 
> Also, the leadership transfer is just an optimization to speed up elections,
> so it's not necessary for correctness for this operation to be successful.
> If we fail to transfer or transfer to the dead server, the rest of the
> cluster will notice the absence of the leader and initiate election by
> the timeout.
> 
> Best regards, Ilya Maximets.

Hi Ilya,

thanks for the clarifications.
Then I guess the ~2.6% chance is not too relevant.

Thanks
Felix
Han Zhou March 25, 2024, 7:19 a.m. UTC | #4
On Tue, Mar 19, 2024 at 12:05 AM Felix Huettner via dev <
ovs-dev@openvswitch.org> wrote:
>
> On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> > On 3/18/24 17:15, Felix Huettner wrote:
> > > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> > >> Each cluster member typically always transfers leadership to the same
> > >> other member, which is the first in their list of servers.  This may
> > >> result in two servers in a 3-node cluster to transfer leadership to
> > >> each other and never to the third one.
> > >>
> > >> Randomizing the selection to make the load more evenly distributed.
> > >
> > > Hi Ilya,
> >
> > Hi, Felix.  Thanks for the comments!
> >
> > >
> > > just out of curiosity: since basically only one of the 3 members is
> > > active at any point in time, is balancing the load even relevant. It
> > > will always only be on one of the 3 members anyway.
> > It is not very important, I agree.  What I observed in practice is
> > that sometimes if, for example, compactions happen in approximately
> > similar time, the server we transfer the leadership to may send it
> > right back, while the first server is busy compacting.  This is
> > less of a problem today as well, since we have parallel compaction,
> > but it may still be annoying if that happens every time.
> >
> > I'm mostly making this patch for the purpose of better testing below.
> >
> > >
> > >>
> > >> This also makes cluster failure tests cover more scenarios as servers
> > >> will transfer leadership to servers they didn't before.  This is
> > >> important especially for cluster joining tests.
> > >>
> > >> Ideally, we would transfer to a random server with a highest apply
> > >> index, but not trying to implement this for now.
> > >>
> > >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > >> ---
> > >>  ovsdb/raft.c | 6 +++++-
> > >>  1 file changed, 5 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> > >> index f463afcb3..25f462431 100644
> > >> --- a/ovsdb/raft.c
> > >> +++ b/ovsdb/raft.c
> > >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft,
const char *reason)
> > >>          return;
> > >>      }
> > >>
> > >> +    size_t n = hmap_count(&raft->servers) * 3;
> > >>      struct raft_server *s;
> > >> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> > >> +
> > >> +    while (n--) {
> > >> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
> > >> +                         struct raft_server, hmap_node);
> > >>          if (!uuid_equals(&raft->sid, &s->sid)
> > >>              && s->phase == RAFT_PHASE_STABLE) {
> > >>              struct raft_conn *conn = raft_find_conn_by_sid(raft,
&s->sid);
> > >
> > > i think this has the risk of never selecting one server out of the
list of
> > > cluster members. Suppose you have a 3 node cluster where one of them
> > > members is down. In this case there is a single member the leadership
> > > can be transfered to.
> > > This means for a single iteration of the while loop has a chance of
2/3
> > > to select a member that can not be used. Over the 9 iterations this
> > > would do this would give a chance of (2/3)^9 to always choose an
> > > inappropriate member. This equals to a chance of 0.026% of never
> > > selecting the single appropriate target member.
> > >
> > > Could we instead rather start iterating the hmap at some random
offset?
> > > That should give a similar result while giving the guarantee that we
> > > visit each member once.
> >
> > I don't think it's very important, because we're transferring leadership
> > without verifying if the other side is alive and we're not checking if
we
> > actually transferred it or not.  So, retries are basically just for the
> > server not hitting itself or servers that didn't join yet.  We will
transfer
> > to the first other server that already joined regardless of the
iteration
> > method.
> >
> > The way to mostly fix the issue with dead servers is, as I mentioned, to
> > transfer only to the servers with the highest apply index, i.e. the
servers
> > that acknowledged the most amount of changes.  This will ensure that we
> > at least heard something from the server in the recent past.  But that's
> > a separate feature to implement.
> >
> > Also, the leadership transfer is just an optimization to speed up
elections,
> > so it's not necessary for correctness for this operation to be
successful.
> > If we fail to transfer or transfer to the dead server, the rest of the
> > cluster will notice the absence of the leader and initiate election by
> > the timeout.
> >
> > Best regards, Ilya Maximets.
>
> Hi Ilya,
>
> thanks for the clarifications.
> Then i guess the 0.026% chance is not too relevant.
>

Thanks for the patch and the discussion.
I still have a concern regarding the retry logic. Maybe it is not a
critical problem if the function fails to select any server to transfer
to, but I'd still prefer this function to be more predictable. For
example, we could store the servers of the hmap in an array, shuffle the
order, and then iterate the array once. This would make sure we don't
randomly miss the chance of a leadership transfer.
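
Something along these lines, as an untested sketch (xmalloc() and
random_range() here are the existing helpers from lib/util.h and
lib/random.h):

    /* Copy the servers into an array, shuffle it with Fisher-Yates, then
     * scan it once so every eligible server gets exactly one chance. */
    size_t n = hmap_count(&raft->servers);
    struct raft_server **servers = xmalloc(n * sizeof *servers);
    struct raft_server *s;
    size_t i = 0;

    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        servers[i++] = s;
    }
    for (i = n; i > 1; i--) {
        size_t j = random_range(i);     /* 0 <= j < i. */
        struct raft_server *tmp = servers[i - 1];

        servers[i - 1] = servers[j];
        servers[j] = tmp;
    }
    for (i = 0; i < n; i++) {
        s = servers[i];
        if (!uuid_equals(&raft->sid, &s->sid)
            && s->phase == RAFT_PHASE_STABLE) {
            /* Send the transfer request to 's' as in the current loop
             * body, then stop. */
            break;
        }
    }
    free(servers);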

However, if you believe it is totally unnecessary for the leadership
transfer to be reliable and it is not worth the effort, we should at least
add a comment to the function making it clear that there is a small chance
it may end up not sending any leadership transfer at all, or maybe just
rename it to raft_try_transfer_leadership.

Thanks,
Han

> Thanks
> Felix
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Ilya Maximets March 25, 2024, 12:43 p.m. UTC | #5
On 3/25/24 08:19, Han Zhou wrote:
> 
> 
> On Tue, Mar 19, 2024 at 12:05 AM Felix Huettner via dev <ovs-dev@openvswitch.org <mailto:ovs-dev@openvswitch.org>> wrote:
>>
>> On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
>> > On 3/18/24 17:15, Felix Huettner wrote:
>> > > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
>> > >> Each cluster member typically always transfers leadership to the same
>> > >> other member, which is the first in their list of servers.  This may
>> > >> result in two servers in a 3-node cluster to transfer leadership to
>> > >> each other and never to the third one.
>> > >>
>> > >> Randomizing the selection to make the load more evenly distributed.
>> > >
>> > > Hi Ilya,
>> >
>> > Hi, Felix.  Thanks for the comments!
>> >
>> > >
>> > > just out of curiosity: since basically only one of the 3 members is
>> > > active at any point in time, is balancing the load even relevant. It
>> > > will always only be on one of the 3 members anyway.
>> > It is not very important, I agree.  What I observed in practice is
>> > that sometimes if, for example, compactions happen in approximately
>> > similar time, the server we transfer the leadership to may send it
>> > right back, while the first server is busy compacting.  This is
>> > less of a problem today as well, since we have parallel compaction,
>> > but it may still be annoying if that happens every time.
>> >
>> > I'm mostly making this patch for the purpose of better testing below.
>> >
>> > >
>> > >>
>> > >> This also makes cluster failure tests cover more scenarios as servers
>> > >> will transfer leadership to servers they didn't before.  This is
>> > >> important especially for cluster joining tests.
>> > >>
>> > >> Ideally, we would transfer to a random server with a highest apply
>> > >> index, but not trying to implement this for now.
>> > >>
>> > >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>>
>> > >> ---
>> > >>  ovsdb/raft.c | 6 +++++-
>> > >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> > >>
>> > >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> > >> index f463afcb3..25f462431 100644
>> > >> --- a/ovsdb/raft.c
>> > >> +++ b/ovsdb/raft.c
>> > >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>> > >>          return;
>> > >>      }
>> > >>
>> > >> +    size_t n = hmap_count(&raft->servers) * 3;
>> > >>      struct raft_server *s;
>> > >> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> > >> +
>> > >> +    while (n--) {
>> > >> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
>> > >> +                         struct raft_server, hmap_node);
>> > >>          if (!uuid_equals(&raft->sid, &s->sid)
>> > >>              && s->phase == RAFT_PHASE_STABLE) {
>> > >>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>> > >
>> > > i think this has the risk of never selecting one server out of the list of
>> > > cluster members. Suppose you have a 3 node cluster where one of them
>> > > members is down. In this case there is a single member the leadership
>> > > can be transfered to.
>> > > This means for a single iteration of the while loop has a chance of 2/3
>> > > to select a member that can not be used. Over the 9 iterations this
>> > > would do this would give a chance of (2/3)^9 to always choose an
>> > > inappropriate member. This equals to a chance of 0.026% of never
>> > > selecting the single appropriate target member.
>> > >
>> > > Could we instead rather start iterating the hmap at some random offset?
>> > > That should give a similar result while giving the guarantee that we
>> > > visit each member once.
>> >
>> > I don't think it's very important, because we're transferring leadership
>> > without verifying if the other side is alive and we're not checking if we
>> > actually transferred it or not.  So, retries are basically just for the
>> > server not hitting itself or servers that didn't join yet.  We will transfer
>> > to the first other server that already joined regardless of the iteration
>> > method.
>> >
>> > The way to mostly fix the issue with dead servers is, as I mentioned, to
>> > transfer only to the servers with the highest apply index, i.e. the servers
>> > that acknowledged the most amount of changes.  This will ensure that we
>> > at least heard something from the server in the recent past.  But that's
>> > a separate feature to implement.
>> >
>> > Also, the leadership transfer is just an optimization to speed up elections,
>> > so it's not necessary for correctness for this operation to be successful.
>> > If we fail to transfer or transfer to the dead server, the rest of the
>> > cluster will notice the absence of the leader and initiate election by
>> > the timeout.
>> >
>> > Best regards, Ilya Maximets.
>>
>> Hi Ilya,
>>
>> thanks for the clarifications.
>> Then i guess the 0.026% chance is not too relevant.
>>
> 
> Thanks for the patch and the discussion.
> I still have a concern regarding the retry logic. Maybe it is not a critical
> problem if the function failed to select any server to transfer, but I'd still
> prefer this function to be more predictable. For example, we can store the servers
> of the hmap to an array, and shuffle the order, and then just iterate the array
> once. This would make sure we don't randomly miss the chance of a leader transfer.

I guess it's easier if I just implement the maximum match_index logic instead.
I'll do that in v2.

Do you have comments for the rest of the patch set?  I can wait for them
before sending a new version.

Best regards, Ilya Maximets.

> 
> However, if you believe it is totally unnecessary for the leader transfer to be
> reliable and it is not worth the effort, at least we should add some comment for
> the function and make it clear that there is a small chance the function may end
> up without sending any leader transfer at all, or maybe just rename it to
> raft_try_transfer_leadership.
> 
> Thanks,
> Han
> 
>> Thanks
>> Felix

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index f463afcb3..25f462431 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1261,8 +1261,12 @@  raft_transfer_leadership(struct raft *raft, const char *reason)
         return;
     }
 
+    size_t n = hmap_count(&raft->servers) * 3;
     struct raft_server *s;
-    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
+
+    while (n--) {
+        s = CONTAINER_OF(hmap_random_node(&raft->servers),
+                         struct raft_server, hmap_node);
         if (!uuid_equals(&raft->sid, &s->sid)
             && s->phase == RAFT_PHASE_STABLE) {
             struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);