Message ID | 20211213110314.13721-1-dceara@redhat.com |
---|---|
State | Superseded |
Headers | show |
Series | [ovs-dev] raft: Only allow followers to snapshot. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
On 12/13/21 12:03, Dumitru Ceara wrote: > Commit 3c2d6274bcee ("raft: Transfer leadership before creating > snapshots.") made it such that raft leaders transfer leadership before > snapshotting. However, there's still the case when the next leader to > be is in the process of snapshotting. To avoid delays in that case too, > we now explicitly allow snapshots only on followers. Cluster members > will have to wait until the current election is settled before > snapshotting. > > Given the following logs taken from an OVN_Southbound 3-server cluster > during a scale test: > > S1 (old leader): > 2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot. > 2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms > 2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system) > 2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43 > > S2 (follower): > 2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43 > > S3 (new leader): > 2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42 > 2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election > 2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms > 2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers > > We see that the leader to be (S3) receives the leadership transfer, > initiates the election and immediately after starts a snapshot that > takes ~9.5 seconds. During this time, S2 votes for S3 electing it > as cluster leader but S3 doesn't effectively become leader until it > finishes snapshotting, essentially keeping the cluster without a > leader for up to ~9.5 seconds. > > With the current change, S3 will delay compaction and snapshotting until > the election is finished. > > Signed-off-by: Dumitru Ceara <dceara@redhat.com> Thanks, Dumitru! This change looks good to me. I'm going to run a couple of tests with it before applying. Will reply once this is done. At the same time, looking at it and also remembering how servers behave in scale tests, it seems that all servers in many cases are snapshotting at exact same time. So, I sent another patch to fix that problem: https://patchwork.ozlabs.org/project/openvswitch/patch/20211213154333.1283297-1-i.maximets@ovn.org/ Please, take a look. With both patches applied, I think, we should have the compaction clash much less likely. Best regards, Ilya Maximets. > --- > ovsdb/raft.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/ovsdb/raft.c b/ovsdb/raft.c > index ce40c5bc075c..6ffcb21db1e2 100644 > --- a/ovsdb/raft.c > +++ b/ovsdb/raft.c > @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft) > && !raft->leaving > && !raft->left > && !raft->failed > - && raft->role != RAFT_LEADER > + && raft->role == RAFT_FOLLOWER > && raft->last_applied >= raft->log_start); > } > >
On Mon, Dec 13, 2021 at 3:03 AM Dumitru Ceara <dceara@redhat.com> wrote: > > Commit 3c2d6274bcee ("raft: Transfer leadership before creating > snapshots.") made it such that raft leaders transfer leadership before > snapshotting. However, there's still the case when the next leader to > be is in the process of snapshotting. To avoid delays in that case too, > we now explicitly allow snapshots only on followers. Cluster members > will have to wait until the current election is settled before > snapshotting. > > Given the following logs taken from an OVN_Southbound 3-server cluster > during a scale test: > > S1 (old leader): > 2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot. > 2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms > 2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system) > 2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43 > > S2 (follower): > 2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43 > > S3 (new leader): > 2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42 > 2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election > 2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms > 2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers > > We see that the leader to be (S3) receives the leadership transfer, > initiates the election and immediately after starts a snapshot that > takes ~9.5 seconds. During this time, S2 votes for S3 electing it > as cluster leader but S3 doesn't effectively become leader until it > finishes snapshotting, essentially keeping the cluster without a > leader for up to ~9.5 seconds. > > With the current change, S3 will delay compaction and snapshotting until > the election is finished. > > Signed-off-by: Dumitru Ceara <dceara@redhat.com> > --- > ovsdb/raft.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/ovsdb/raft.c b/ovsdb/raft.c > index ce40c5bc075c..6ffcb21db1e2 100644 > --- a/ovsdb/raft.c > +++ b/ovsdb/raft.c > @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft) > && !raft->leaving > && !raft->left > && !raft->failed > - && raft->role != RAFT_LEADER > + && raft->role == RAFT_FOLLOWER > && raft->last_applied >= raft->log_start); > } > > -- > 2.27.0 > Thanks Dumitru! This change looks good to me. In addition, I'd add a check for server count so that the follower role is required only if there are more than one server. i.e.: && (raft->role == RAFT_FOLLOWER || hmap_count(&raft->servers) == 1) It is not a problem for real production deployments but since 1 node cluster is supported by ovsdb RAFT, it is better to allow snapshot for that scenario, too. Of course this is not a problem of this patch, but existed already before this change. Acked-by: Han Zhou <hzhou@ovn.org>
On 12/13/21 18:44, Han Zhou wrote: > On Mon, Dec 13, 2021 at 3:03 AM Dumitru Ceara <dceara@redhat.com> wrote: >> >> Commit 3c2d6274bcee ("raft: Transfer leadership before creating >> snapshots.") made it such that raft leaders transfer leadership before >> snapshotting. However, there's still the case when the next leader to >> be is in the process of snapshotting. To avoid delays in that case too, >> we now explicitly allow snapshots only on followers. Cluster members >> will have to wait until the current election is settled before >> snapshotting. >> >> Given the following logs taken from an OVN_Southbound 3-server cluster >> during a scale test: >> >> S1 (old leader): >> 2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to > write a snapshot. >> 2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database > compaction took 12601ms >> 2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms > poll interval (10632ms user, 1924ms system) >> 2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term > 43 >> >> S2 (follower): >> 2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term > 43 >> >> S3 (new leader): >> 2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer > from f5c9 in term 42 >> 2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election >> 2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database > compaction took 9559ms >> 2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ > of 3 servers >> >> We see that the leader to be (S3) receives the leadership transfer, >> initiates the election and immediately after starts a snapshot that >> takes ~9.5 seconds. During this time, S2 votes for S3 electing it >> as cluster leader but S3 doesn't effectively become leader until it >> finishes snapshotting, essentially keeping the cluster without a >> leader for up to ~9.5 seconds. >> >> With the current change, S3 will delay compaction and snapshotting until >> the election is finished. >> >> Signed-off-by: Dumitru Ceara <dceara@redhat.com> >> --- >> ovsdb/raft.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c >> index ce40c5bc075c..6ffcb21db1e2 100644 >> --- a/ovsdb/raft.c >> +++ b/ovsdb/raft.c >> @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft) >> && !raft->leaving >> && !raft->left >> && !raft->failed >> - && raft->role != RAFT_LEADER >> + && raft->role == RAFT_FOLLOWER >> && raft->last_applied >= raft->log_start); >> } >> >> -- >> 2.27.0 >> > > Thanks Dumitru! This change looks good to me. > In addition, I'd add a check for server count so that the follower role is > required only if there are more than one server. i.e.: > && (raft->role == RAFT_FOLLOWER || hmap_count(&raft->servers) == 1) Makes sense, I added it in v2. > It is not a problem for real production deployments but since 1 node > cluster is supported by ovsdb RAFT, it is better to allow snapshot for that > scenario, too. Of course this is not a problem of this patch, but existed > already before this change. > > Acked-by: Han Zhou <hzhou@ovn.org> > Thanks for your review! V2 posted: https://patchwork.ozlabs.org/project/openvswitch/patch/20211213194603.32487-1-dceara@redhat.com/
On 12/13/21 18:44, Han Zhou wrote: > > > On Mon, Dec 13, 2021 at 3:03 AM Dumitru Ceara <dceara@redhat.com <mailto:dceara@redhat.com>> wrote: >> >> Commit 3c2d6274bcee ("raft: Transfer leadership before creating >> snapshots.") made it such that raft leaders transfer leadership before >> snapshotting. However, there's still the case when the next leader to >> be is in the process of snapshotting. To avoid delays in that case too, >> we now explicitly allow snapshots only on followers. Cluster members >> will have to wait until the current election is settled before >> snapshotting. >> >> Given the following logs taken from an OVN_Southbound 3-server cluster >> during a scale test: >> >> S1 (old leader): >> 2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot. >> 2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms >> 2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system) >> 2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43 >> >> S2 (follower): >> 2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43 >> >> S3 (new leader): >> 2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42 >> 2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election >> 2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms >> 2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers >> >> We see that the leader to be (S3) receives the leadership transfer, >> initiates the election and immediately after starts a snapshot that >> takes ~9.5 seconds. During this time, S2 votes for S3 electing it >> as cluster leader but S3 doesn't effectively become leader until it >> finishes snapshotting, essentially keeping the cluster without a >> leader for up to ~9.5 seconds. >> >> With the current change, S3 will delay compaction and snapshotting until >> the election is finished. >> >> Signed-off-by: Dumitru Ceara <dceara@redhat.com <mailto:dceara@redhat.com>> >> --- >> ovsdb/raft.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c >> index ce40c5bc075c..6ffcb21db1e2 100644 >> --- a/ovsdb/raft.c >> +++ b/ovsdb/raft.c >> @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft) >> && !raft->leaving >> && !raft->left >> && !raft->failed >> - && raft->role != RAFT_LEADER >> + && raft->role == RAFT_FOLLOWER >> && raft->last_applied >= raft->log_start); >> } >> >> -- >> 2.27.0 >> > > Thanks Dumitru! This change looks good to me. > In addition, I'd add a check for server count so that the follower role is > required only if there are more than one server. i.e.: > && (raft->role == RAFT_FOLLOWER || hmap_count(&raft->servers) == 1) > It is not a problem for real production deployments but since 1 node cluster > is supported by ovsdb RAFT, it is better to allow snapshot for that scenario, > too. Of course this is not a problem of this patch, but existed already before > this change. I didn't test this, but I think that cluster of one server should be able to snapshot even with the current implementation. This is because 'raft_notify_snapshot_recommended' will "transfer" the leadership to no-one and force-switch the role to RAFT_FOLLOWER in the 'raft_become_follower'. And this server will stay in this state waiting for the election timer. Once election timer expired, the server will re-elect itself as a leader. So, there is at least 1 second time frame where conditions are sufficient for the compaction. Though, some non-raft wakeup needs to occur before the election timer expired in order to trigger the compaction. And I agree that it's not a very clean way to do things, so it's definitely better to avoid the transfer for the clusters of one server. Thanks for noticing this! > > Acked-by: Han Zhou <hzhou@ovn.org <mailto:hzhou@ovn.org>>
diff --git a/ovsdb/raft.c b/ovsdb/raft.c index ce40c5bc075c..6ffcb21db1e2 100644 --- a/ovsdb/raft.c +++ b/ovsdb/raft.c @@ -4226,7 +4226,7 @@ raft_may_snapshot(const struct raft *raft) && !raft->leaving && !raft->left && !raft->failed - && raft->role != RAFT_LEADER + && raft->role == RAFT_FOLLOWER && raft->last_applied >= raft->log_start); }
Commit 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.") made it such that raft leaders transfer leadership before snapshotting. However, there's still the case when the next leader to be is in the process of snapshotting. To avoid delays in that case too, we now explicitly allow snapshots only on followers. Cluster members will have to wait until the current election is settled before snapshotting. Given the following logs taken from an OVN_Southbound 3-server cluster during a scale test: S1 (old leader): 2021-12-10T19:07:51.226Z|00823|raft|INFO|Transferring leadership to write a snapshot. 2021-12-10T19:08:03.830Z|00824|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms 2021-12-10T19:08:03.833Z|00825|timeval|WARN|Unreasonably long 12604ms poll interval (10632ms user, 1924ms system) 2021-12-10T19:08:03.940Z|00838|raft|INFO|server 8b8d is leader for term 43 S2 (follower): 2021-12-10T19:08:00.870Z|00481|raft|INFO|server 8b8d is leader for term 43 S3 (new leader): 2021-12-10T19:07:51.242Z|01083|raft|INFO|received leadership transfer from f5c9 in term 42 2021-12-10T19:07:51.244Z|01084|raft|INFO|term 43: starting election 2021-12-10T19:08:00.805Z|01085|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms 2021-12-10T19:08:00.869Z|01100|raft|INFO|term 43: elected leader by 2+ of 3 servers We see that the leader to be (S3) receives the leadership transfer, initiates the election and immediately after starts a snapshot that takes ~9.5 seconds. During this time, S2 votes for S3 electing it as cluster leader but S3 doesn't effectively become leader until it finishes snapshotting, essentially keeping the cluster without a leader for up to ~9.5 seconds. With the current change, S3 will delay compaction and snapshotting until the election is finished. Signed-off-by: Dumitru Ceara <dceara@redhat.com> --- ovsdb/raft.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)