Message ID: 1553359466-40630-1-git-send-email-hzhou8@ebay.com
State: Superseded
Series: [ovs-dev,v5] ovsdb raft: Sync commit index to followers without delay.
On Sat, Mar 23, 2019 at 09:44:26AM -0700, Han Zhou wrote:
> From: Han Zhou <hzhou8@ebay.com>
>
> When an update is requested from a follower, the leader sends an
> AppendRequest to all followers and waits until an AppendReply is
> received from the majority; only then does it update the commit index,
> and the new entry is regarded as committed in the raft log. However,
> this commit is not communicated to the followers (including the one
> that initiated the request) until the next heartbeat (ping timeout),
> if there are no other pending requests. This results in long latency
> for updates made through followers, especially when a batch of updates
> is requested through the same follower.

The tests pass now, but each one of them ends up with several ovn-sbctl
and ovsdb-server processes all trying to use 100% of a CPU. If I run
the tests with 10-way parallelism, the load average goes above 100.
Surely something has to be wrong in the implementation here.

Each of the ovn-sbctl processes is trying to push through only a single
transaction; after that, it exits. If we think of ovsdb-server as
giving each of its clients one chance to execute an RPC in round-robin
order, which is approximately correct, then one of those transactions
should succeed per round. I don't understand why, if this model is
correct, the ovn-sbctls would burn so much CPU. If the model is wrong,
then we need to understand why it is wrong and how we can fix it. Maybe
the ovn-sbctl processes are retrying blindly without waiting for an
update from the server; if so, that's a bug and it should be fixed.
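[Editor's sketch: the latency problem described in the commit message can be illustrated with a toy timeline model. This is not OVSDB code; the interval value is an assumption standing in for the server's real heartbeat period.]

```python
# Toy model of commit-index propagation latency (illustrative only).
# A leader commits an entry at time t_commit; a follower only learns the
# new commit index when the leader next sends it a message.

PING_TIME_MSEC = 333  # stand-in heartbeat interval (assumed value)

def notify_delay_without_patch(t_commit, last_ping):
    """Follower learns of the commit only at the next periodic heartbeat."""
    next_ping = last_ping + PING_TIME_MSEC
    return max(next_ping - t_commit, 0)

def notify_delay_with_patch(t_commit):
    """With the patch, the leader sends heartbeats as soon as the commit
    index advances, so the only delay is message propagation (modeled
    as 0 here)."""
    return 0

# A commit lands 10 ms after the last heartbeat: without the patch the
# follower waits out most of the heartbeat interval.
print(notify_delay_without_patch(t_commit=10, last_ping=0))  # 323
print(notify_delay_with_patch(t_commit=10))                  # 0
```

For a batch of updates issued serially through the same follower, each transaction pays this per-commit wait, which is why the reported latency grows with batch size.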
On Mon, Mar 25, 2019 at 2:00 PM Ben Pfaff <blp@ovn.org> wrote:
> [...]

Hi Ben, to make the test case more effective, in v4/v5 I enlarged the
size of each transaction 50 times, as specified by the variable "n3".
This is just to create load in the test and slow down transaction
execution, to ensure that there are enough parallel requests ongoing
and that we can test server membership changes during this period.
Without this change (in the test case only), the CPU usage was low, but
that just made the torture test not torturing at all (it was either
skipped because execution was too fast, or, with a sleep added, it lost
the parallelism). The test case change also proved to be more
effective: it found an existing bug, as mentioned in the notes.

If your question is why increasing the size of the transaction causes
high CPU load, it seems expected if we look at what each client does:

- Download the initial data, which grows as more and more transactions
  execute, each adding 50 entries (10 * 5 * 50 = 2500); later clients
  take more CPU because of this.
- Send a transaction with wait + update operations, which carries twice
  the size of the original data, plus the newly added 50 entries.
- Occasionally hit a conflicting update (the wait operation fails),
  which triggers a retry. This doesn't happen a lot (judging by the
  logs), and I think it is expected given the purpose of the test: to
  increase parallelism.
- Process updates: at least once for its own update, but possibly for
  other clients' updates as well. (This step is not CPU-costly, since
  each update notification is not big.)

So I think higher CPU from larger transactions is expected given these
operations and the JSON-RPC cost. I ran the tests many times and never
observed a client occupying the CPU for a long time; "top" showed short
spikes (not 100%), but I never saw the same client there twice. If you
see any behavior that looks like continuous retrying, could you please
share that client's x-y.log file?

I agree the RPC performance of ovsdb/IDL could be improved, but that is
not related to this patch (or maybe we can say the test in this patch
exploits this slowness).

Thanks,
Han
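[Editor's sketch: the data-growth argument above can be checked with some back-of-the-envelope arithmetic. This is the editor's own model of the test, not code from the test suite.]

```python
# Rough model of the torture test's data growth (illustrative only).
n1, n2, n3 = 10, 5, 50          # values from the test case
txns = n1 * n2                  # 50 ovn-sbctl clients, one txn each
rows_per_txn = 1 + n3           # the "$i-$j" entry plus n3 "$i-$j-$k" entries

total_rows = txns * rows_per_txn
print(total_rows)               # 2550 external_ids entries at the end

# Each client first downloads the current database contents, so the
# t-th client to run downloads roughly t * rows_per_txn rows; summed
# over all clients, the download volume is quadratic in the number of
# transactions, which matches "later clients will take more CPU".
cumulative_downloaded = sum(t * rows_per_txn for t in range(txns))
print(cumulative_downloaded)    # 62475 rows downloaded in total
```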
On Mon, Mar 25, 2019 at 4:24 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]

Hi Ben, does my explanation make sense? To show that the CPU cost is
expected (and, at least, not caused by this patch), I compared against
the master branch using leader_only = true for the ovn-sbctl command in
the torture test, and ran the same updated torture test case (i.e. the
same n1 = 10, n2 = 5, and n3 = 50). When leader_only = true, there is
no transaction-execution latency even without this patch, because there
is no idle waiting for syncing from the leader to the followers; all
transactions execute at full speed, so the performance is comparable.

time make check TESTSUITEFLAGS="2529"
(2529: OVSDB 3-server torture test - kill/restart leader)

- *Master branch* with the new test case, using leader_only=true:
  real 0m15.448s
  user 0m15.384s
  sys  0m13.687s

- With this patch (leader_only=false):
  real 0m13.904s
  user 0m15.578s
  sys  0m11.817s

- With this patch but leader_only=true (exactly the same test case as
  for the *master branch*):
  real 0m14.419s
  user 0m14.972s
  sys  0m12.467s

As we can see, there is no obvious difference. During the tests I was
also watching "top" output and didn't notice any difference either.
The "leader_only" update on top of this patch is simply:

------8><------------------------------------------><8-----------
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 5550a19..94ec9c7 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -140,7 +140,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
         for k in $(seq $n3); do
             txn="$txn -- add SB_Global . external_ids $i-$j-$k=$i-$j-$k"
         done
-        run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only $txn
+        run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 $txn
         status=$?
         if test $status != 0; then
             echo "$i-$j exited with status $status" > $i-$j:$status
--------------------------------------------------------------------------

Please try it and let me know if you still have concerns.

Thanks,
Han
On Wed, Mar 27, 2019 at 4:10 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]

Hi Ben, this one is small but important, and is just waiting for your
final approval. Sorry for pushing this, but I am also working on unit
tests for different Raft failure scenarios, and this patch helps with
some of those test cases, so I hope it can be merged before I send out
the UT series.

Thanks,
Han
On Mon, Apr 8, 2019 at 4:08 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]

I embedded this one as the first patch in a new series:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=101873

The new series is mainly for the raft unit tests and one bug fix, which
is not closely related to, but depends on, this patch.
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index eee4f33..31e9e72 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -320,7 +320,8 @@ static void raft_send_append_request(struct raft *,
 
 static void raft_become_leader(struct raft *);
 static void raft_become_follower(struct raft *);
-static void raft_reset_timer(struct raft *);
+static void raft_reset_election_timer(struct raft *);
+static void raft_reset_ping_timer(struct raft *);
 static void raft_send_heartbeats(struct raft *);
 static void raft_start_election(struct raft *, bool leadership_transfer);
 static bool raft_truncate(struct raft *, uint64_t new_end);
@@ -376,8 +377,8 @@ raft_alloc(void)
     hmap_init(&raft->add_servers);
     hmap_init(&raft->commands);
 
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
-    raft_reset_timer(raft);
+    raft_reset_ping_timer(raft);
+    raft_reset_election_timer(raft);
 
     return raft;
 }
@@ -865,7 +866,7 @@ raft_read_log(struct raft *raft)
 }
 
 static void
-raft_reset_timer(struct raft *raft)
+raft_reset_election_timer(struct raft *raft)
 {
     unsigned int duration = (ELECTION_BASE_MSEC
                              + random_range(ELECTION_RANGE_MSEC));
@@ -874,6 +875,12 @@
 }
 
 static void
+raft_reset_ping_timer(struct raft *raft)
+{
+    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+}
+
+static void
 raft_add_conn(struct raft *raft, struct jsonrpc_session *js,
               const struct uuid *sid, bool incoming)
 {
@@ -1603,7 +1610,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
             VLOG_INFO("term %"PRIu64": starting election", raft->term);
         }
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     struct raft_server *peer;
     HMAP_FOR_EACH (peer, hmap_node, &raft->servers) {
@@ -1782,8 +1789,8 @@ raft_run(struct raft *raft)
                 raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
             }
         }
+        raft_reset_ping_timer(raft);
     }
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
 }
 
 /* Do this only at the end; if we did it as soon as we set raft->left or
@@ -1963,6 +1970,7 @@ raft_command_initiate(struct raft *raft,
             s->next_index++;
         }
     }
+    raft_reset_ping_timer(raft);
 
     return cmd;
 }
@@ -2313,7 +2321,7 @@ raft_become_follower(struct raft *raft)
     }
 
     raft->role = RAFT_FOLLOWER;
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* Notify clients about lost leadership.
      *
@@ -2387,6 +2395,8 @@ raft_send_heartbeats(struct raft *raft)
                                       RAFT_CMD_INCOMPLETE, 0);
         }
     }
+
+    raft_reset_ping_timer(raft);
 }
 
 /* Initializes the fields in 's' that represent the leader's view of the
@@ -2413,7 +2423,7 @@ raft_become_leader(struct raft *raft)
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
     raft->election_timeout = LLONG_MAX;
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+    raft_reset_ping_timer(raft);
 
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
@@ -2573,11 +2583,13 @@ raft_get_next_entry(struct raft *raft, struct uuid *eid)
     return data;
 }
 
-static void
+/* Updates commit index in raft log. If commit index is already up-to-date
+ * it does nothing and return false, otherwise, returns true. */
+static bool
 raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
 {
     if (new_commit_index <= raft->commit_index) {
-        return;
+        return false;
     }
 
     if (raft->role == RAFT_LEADER) {
@@ -2610,6 +2622,7 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
         .commit_index = raft->commit_index,
     };
     ignore(ovsdb_log_write_and_free(raft->log, raft_record_to_json(&r)));
+    return true;
 }
 
 /* This doesn't use rq->entries (but it does use rq->n_entries). */
@@ -2809,7 +2822,7 @@ raft_handle_append_request(struct raft *raft,
                      "usurped leadership");
         return;
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* First check for the common case, where the AppendEntries request is
      * entirely for indexes covered by 'log_start' ... 'log_end - 1', something
@@ -3045,7 +3058,9 @@ raft_consider_updating_commit_index(struct raft *raft)
             }
         }
     }
-    raft_update_commit_index(raft, new_commit_index);
+    if (raft_update_commit_index(raft, new_commit_index)) {
+        raft_send_heartbeats(raft);
+    }
 }
 
 static void
@@ -3274,7 +3289,7 @@ raft_handle_vote_request__(struct raft *raft,
         return false;
     }
 
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     return true;
 }
@@ -3697,7 +3712,7 @@ static bool
 raft_handle_install_snapshot_request__(
     struct raft *raft, const struct raft_install_snapshot_request *rq)
 {
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /*
      * Our behavior here depend on new_log_start in the snapshot compared to
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index c7f1e34..5550a19 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -131,12 +131,16 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
     done
     export OVN_SB_DB
 
-    n1=10 n2=5
+    n1=10 n2=5 n3=50
     echo "starting $n1*$n2 ovn-sbctl processes..."
     for i in $(seq 0 $(expr $n1 - 1) ); do
         (for j in $(seq $n2); do
             : > $i-$j.running
-            run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only add SB_Global . external_ids $i-$j=$i-$j
+            txn="add SB_Global . external_ids $i-$j=$i-$j"
+            for k in $(seq $n3); do
+                txn="$txn -- add SB_Global . external_ids $i-$j-$k=$i-$j-$k"
+            done
+            run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only $txn
             status=$?
             if test $status != 0; then
                 echo "$i-$j exited with status $status" > $i-$j:$status
@@ -146,13 +150,12 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
         : > $i.done)&
     done
     echo "...done"
-    sleep 2
 
     echo "waiting for ovn-sbctl processes to exit..."
     # Use file instead of var because code inside "while" runs in a subshell.
     echo 0 > phase
     i=0
-    (while :; do echo; sleep 1; done) | while read REPLY; do
+    (while :; do echo; sleep 0.1; done) | while read REPLY; do
         printf "t=%2d s:" $i
         done=0
         for j in $(seq 0 $(expr $n1 - 1)); do
@@ -168,7 +171,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
 
         case $(cat phase) in # (
         0)
-            if test $done -ge $(expr $n1 / 4); then
+            if test $done -ge $(expr $n1 / 10); then
                 if test $variant = kill; then
                     stop_server $victim
                 else
@@ -199,6 +202,9 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
     for i in $(seq 0 $(expr $n1 - 1) ); do
         for j in `seq $n2`; do
             echo "$i-$j=$i-$j"
+            for k in `seq $n3`; do
+                echo "$i-$j-$k=$i-$j-$k"
+            done
         done
     done | sort > expout
     AT_CHECK([ovn-sbctl --timeout=30 --log-file=finalize.log -vtimeval:off -vfile -vsyslog:off --bare get SB_Global . external-ids | tr ',' '\n' | sed 's/[[{}"" ]]//g' | sort], [0], [expout])