diff mbox series

[ovs-dev] raft: Only allow followers to snapshot.

Message ID 20211213110314.13721-1-dceara@redhat.com
State Superseded
Headers show
Series [ovs-dev] raft: Only allow followers to snapshot. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed

Commit Message

Dumitru Ceara Dec. 13, 2021, 11:03 a.m. UTC
Commit 3c2d6274bcee ("raft: Transfer leadership before creating
snapshots.") made it such that raft leaders transfer leadership before
snapshotting.  However, there's still the case when the next leader to
be is in the process of snapshotting.  To avoid delays in that case too,
we now explicitly allow snapshots only on followers.  Cluster members
will have to wait until the current election is settled before
snapshotting.

Given the following logs taken from an OVN_Southbound 3-server cluster
during a scale test:

S1 (old leader):
  2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot.
  2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
  2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system)
  2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43

S2 (follower):
  2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43

S3 (new leader):
  2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42
  2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election
  2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
  2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers

We see that the leader to be (S3) receives the leadership transfer,
initiates the election and immediately after starts a snapshot that
takes ~9.5 seconds.  During this time, S2 votes for S3 electing it
as cluster leader but S3 doesn't effectively become leader until it
finishes snapshotting, essentially keeping the cluster without a
leader for up to ~9.5 seconds.

With the current change, S3 will delay compaction and snapshotting until
the election is finished.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
---
 ovsdb/raft.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Ilya Maximets Dec. 13, 2021, 4:08 p.m. UTC | #1
On 12/13/21 12:03, Dumitru Ceara wrote:
> Commit 3c2d6274bcee ("raft: Transfer leadership before creating
> snapshots.") made it such that raft leaders transfer leadership before
> snapshotting.  However, there's still the case when the next leader to
> be is in the process of snapshotting.  To avoid delays in that case too,
> we now explicitly allow snapshots only on followers.  Cluster members
> will have to wait until the current election is settled before
> snapshotting.
> 
> Given the following logs taken from an OVN_Southbound 3-server cluster
> during a scale test:
> 
> S1 (old leader):
>   2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot.
>   2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
>   2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system)
>   2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43
> 
> S2 (follower):
>   2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43
> 
> S3 (new leader):
>   2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42
>   2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election
>   2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
>   2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers
> 
> We see that the leader to be (S3) receives the leadership transfer,
> initiates the election and immediately after starts a snapshot that
> takes ~9.5 seconds.  During this time, S2 votes for S3 electing it
> as cluster leader but S3 doesn't effectively become leader until it
> finishes snapshotting, essentially keeping the cluster without a
> leader for up to ~9.5 seconds.
> 
> With the current change, S3 will delay compaction and snapshotting until
> the election is finished.
> 
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>

Thanks, Dumitru!  This change looks good to me.  I'm going to run a couple of
tests with it before applying.  Will reply once this is done.

At the same time, looking at it and also remembering how servers behave in scale
tests, it seems that all servers in many cases are snapshotting at exact same
time.  So, I sent another patch to fix that problem:
  https://patchwork.ozlabs.org/project/openvswitch/patch/20211213154333.1283297-1-i.maximets@ovn.org/
Please, take a look.

With both patches applied, I think, we should have the compaction clash much
less likely.

Best regards, Ilya Maximets.

> ---
>  ovsdb/raft.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index ce40c5bc075c..6ffcb21db1e2 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft)
>              && !raft->leaving
>              && !raft->left
>              && !raft->failed
> -            && raft->role != RAFT_LEADER
> +            && raft->role == RAFT_FOLLOWER
>              && raft->last_applied >= raft->log_start);
>  }
>  
>
Han Zhou Dec. 13, 2021, 5:44 p.m. UTC | #2
On Mon, Dec 13, 2021 at 3:03 AM Dumitru Ceara <dceara@redhat.com> wrote:
>
> Commit 3c2d6274bcee ("raft: Transfer leadership before creating
> snapshots.") made it such that raft leaders transfer leadership before
> snapshotting.  However, there's still the case when the next leader to
> be is in the process of snapshotting.  To avoid delays in that case too,
> we now explicitly allow snapshots only on followers.  Cluster members
> will have to wait until the current election is settled before
> snapshotting.
>
> Given the following logs taken from an OVN_Southbound 3-server cluster
> during a scale test:
>
> S1 (old leader):
>   2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to
write a snapshot.
>   2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database
compaction took 12601ms
>   2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms
poll interval (10632ms user, 1924ms system)
>   2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term
43
>
> S2 (follower):
>   2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term
43
>
> S3 (new leader):
>   2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer
from f5c9 in term 42
>   2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election
>   2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database
compaction took 9559ms
>   2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+
of 3 servers
>
> We see that the leader to be (S3) receives the leadership transfer,
> initiates the election and immediately after starts a snapshot that
> takes ~9.5 seconds.  During this time, S2 votes for S3 electing it
> as cluster leader but S3 doesn't effectively become leader until it
> finishes snapshotting, essentially keeping the cluster without a
> leader for up to ~9.5 seconds.
>
> With the current change, S3 will delay compaction and snapshotting until
> the election is finished.
>
> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
> ---
>  ovsdb/raft.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index ce40c5bc075c..6ffcb21db1e2 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft)
>              && !raft->leaving
>              && !raft->left
>              && !raft->failed
> -            && raft->role != RAFT_LEADER
> +            && raft->role == RAFT_FOLLOWER
>              && raft->last_applied >= raft->log_start);
>  }
>
> --
> 2.27.0
>

Thanks Dumitru! This change looks good to me.
In addition, I'd add a check for server count so that the follower role is
required only if there are more than one server. i.e.:
&& (raft->role == RAFT_FOLLOWER || hmap_count(&raft->servers) == 1)
It is not a problem for real production deployments but since 1 node
cluster is supported by ovsdb RAFT, it is better to allow snapshot for that
scenario, too. Of course this is not a problem of this patch, but existed
already before this change.

Acked-by: Han Zhou <hzhou@ovn.org>
Dumitru Ceara Dec. 13, 2021, 7:46 p.m. UTC | #3
On 12/13/21 18:44, Han Zhou wrote:
> On Mon, Dec 13, 2021 at 3:03 AM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> Commit 3c2d6274bcee ("raft: Transfer leadership before creating
>> snapshots.") made it such that raft leaders transfer leadership before
>> snapshotting.  However, there's still the case when the next leader to
>> be is in the process of snapshotting.  To avoid delays in that case too,
>> we now explicitly allow snapshots only on followers.  Cluster members
>> will have to wait until the current election is settled before
>> snapshotting.
>>
>> Given the following logs taken from an OVN_Southbound 3-server cluster
>> during a scale test:
>>
>> S1 (old leader):
>>   2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to
> write a snapshot.
>>   2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database
> compaction took 12601ms
>>   2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms
> poll interval (10632ms user, 1924ms system)
>>   2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term
> 43
>>
>> S2 (follower):
>>   2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term
> 43
>>
>> S3 (new leader):
>>   2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer
> from f5c9 in term 42
>>   2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election
>>   2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database
> compaction took 9559ms
>>   2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+
> of 3 servers
>>
>> We see that the leader to be (S3) receives the leadership transfer,
>> initiates the election and immediately after starts a snapshot that
>> takes ~9.5 seconds.  During this time, S2 votes for S3 electing it
>> as cluster leader but S3 doesn't effectively become leader until it
>> finishes snapshotting, essentially keeping the cluster without a
>> leader for up to ~9.5 seconds.
>>
>> With the current change, S3 will delay compaction and snapshotting until
>> the election is finished.
>>
>> Signed-off-by: Dumitru Ceara <dceara@redhat.com>
>> ---
>>  ovsdb/raft.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index ce40c5bc075c..6ffcb21db1e2 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft)
>>              && !raft->leaving
>>              && !raft->left
>>              && !raft->failed
>> -            && raft->role != RAFT_LEADER
>> +            && raft->role == RAFT_FOLLOWER
>>              && raft->last_applied >= raft->log_start);
>>  }
>>
>> --
>> 2.27.0
>>
> 
> Thanks Dumitru! This change looks good to me.
> In addition, I'd add a check for server count so that the follower role is
> required only if there are more than one server. i.e.:
> && (raft->role == RAFT_FOLLOWER || hmap_count(&raft->servers) == 1)

Makes sense, I added it in v2.

> It is not a problem for real production deployments but since 1 node
> cluster is supported by ovsdb RAFT, it is better to allow snapshot for that
> scenario, too. Of course this is not a problem of this patch, but existed
> already before this change.
> 
> Acked-by: Han Zhou <hzhou@ovn.org>
> 

Thanks for your review!

V2 posted:
https://patchwork.ozlabs.org/project/openvswitch/patch/20211213194603.32487-1-dceara@redhat.com/
Ilya Maximets Dec. 13, 2021, 8:01 p.m. UTC | #4
On 12/13/21 18:44, Han Zhou wrote:
> 
> 
> On Mon, Dec 13, 2021 at 3:03 AM Dumitru Ceara <dceara@redhat.com <mailto:dceara@redhat.com>> wrote:
>>
>> Commit 3c2d6274bcee ("raft: Transfer leadership before creating
>> snapshots.") made it such that raft leaders transfer leadership before
>> snapshotting.  However, there's still the case when the next leader to
>> be is in the process of snapshotting.  To avoid delays in that case too,
>> we now explicitly allow snapshots only on followers.  Cluster members
>> will have to wait until the current election is settled before
>> snapshotting.
>>
>> Given the following logs taken from an OVN_Southbound 3-server cluster
>> during a scale test:
>>
>> S1 (old leader):
>>   2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot.
>>   2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
>>   2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system)
>>   2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43
>>
>> S2 (follower):
>>   2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43
>>
>> S3 (new leader):
>>   2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42
>>   2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election
>>   2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
>>   2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers
>>
>> We see that the leader to be (S3) receives the leadership transfer,
>> initiates the election and immediately after starts a snapshot that
>> takes ~9.5 seconds.  During this time, S2 votes for S3 electing it
>> as cluster leader but S3 doesn't effectively become leader until it
>> finishes snapshotting, essentially keeping the cluster without a
>> leader for up to ~9.5 seconds.
>>
>> With the current change, S3 will delay compaction and snapshotting until
>> the election is finished.
>>
>> Signed-off-by: Dumitru Ceara <dceara@redhat.com <mailto:dceara@redhat.com>>
>> ---
>>  ovsdb/raft.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index ce40c5bc075c..6ffcb21db1e2 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft)
>>              && !raft->leaving
>>              && !raft->left
>>              && !raft->failed
>> -            && raft->role != RAFT_LEADER
>> +            && raft->role == RAFT_FOLLOWER
>>              && raft->last_applied >= raft->log_start);
>>  }
>>
>> --
>> 2.27.0
>>
> 
> Thanks Dumitru! This change looks good to me.
> In addition, I'd add a check for server count so that the follower role is
> required only if there are more than one server. i.e.:
> && (raft->role == RAFT_FOLLOWER || hmap_count(&raft->servers) == 1)
> It is not a problem for real production deployments but since 1 node cluster
> is supported by ovsdb RAFT, it is better to allow snapshot for that scenario,
> too. Of course this is not a problem of this patch, but existed already before
> this change.

I didn't test this, but I think that cluster of one server should be able
to snapshot even with the current implementation.  This is because
'raft_notify_snapshot_recommended' will "transfer" the leadership to
no-one and force-switch the role to RAFT_FOLLOWER in the 'raft_become_follower'.
And this server will stay in this state waiting for the election timer.
Once election timer expired, the server will re-elect itself as a leader.
So, there is at least 1 second time frame where conditions are sufficient
for the compaction.  Though, some non-raft wakeup needs to occur before
the election timer expired in order to trigger the compaction.  And I
agree that it's not a very clean way to do things, so it's definitely better
to avoid the transfer for the clusters of one server.  Thanks for noticing
this!

> 
> Acked-by: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
diff mbox series

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index ce40c5bc075c..6ffcb21db1e2 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -4226,7 +4226,7 @@  raft_may_snapshot(const struct raft *raft)
             && !raft->leaving
             && !raft->left
             && !raft->failed
-            && raft->role != RAFT_LEADER
+            && raft->role == RAFT_FOLLOWER
             && raft->last_applied >= raft->log_start);
 }