| Message ID | 1552583636-122844-1-git-send-email-hzhou8@ebay.com |
|---|---|
| State | Superseded |
| Series | [ovs-dev] ovsdb raft: Sync commit index to followers without delay. |
On Thu, Mar 14, 2019 at 10:13:56AM -0700, Han Zhou wrote:
> From: Han Zhou <hzhou8@ebay.com>
>
> When an update is requested through a follower, the leader sends an
> AppendRequest to all followers and waits until an AppendReply is
> received from a majority; only then does it update the commit index,
> at which point the new entry is regarded as committed in the raft log.
> However, this commit is not notified to followers (including the one
> that initiated the request) until the next heartbeat (ping timeout),
> if there are no other pending requests. This results in long latency
> for updates made through followers, especially when a batch of updates
> is requested through the same follower.
>
>     $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
>
>     real 0m34.154s
>     user 0m0.083s
>     sys  0m0.250s
>
> This patch solves the problem by sending a heartbeat as soon as the
> commit index is updated on the leader. It also avoids unnecessary
> heartbeats by resetting the ping timer whenever an AppendRequest is
> broadcast. With this patch, performance in the same test improves more
> than 50-fold:
>
>     $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
>
>     real 0m0.564s
>     user 0m0.080s
>     sys  0m0.199s
>
> The parameters in the torture test are also adjusted because of the
> improved performance; otherwise the tests would all be skipped.

The patch seems very reasonable and the concept makes sense, but when I
run

    make -j10 check TESTSUITEFLAGS='-k ovsdb,cluster,torture -j10'

it comes near to killing my laptop, with multiple ovsdb-servers going to
100% CPU. Without the patch, I don't see behavior like that at all.

Do you see the same thing?
On Fri, Mar 15, 2019 at 3:30 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Thu, Mar 14, 2019 at 10:13:56AM -0700, Han Zhou wrote:
> > [...]
> > The parameters in the torture test are also adjusted because of the
> > improved performance; otherwise the tests would all be skipped.
>
> The patch seems very reasonable and the concept makes sense, but when I
> run
>     make -j10 check TESTSUITEFLAGS='-k ovsdb,cluster,torture -j10'
> it comes near to killing my laptop, with multiple ovsdb-servers going to
> 100% CPU. Without the patch, I don't see behavior like that at all.
>
> Do you see the same thing?

I think this is caused by the change in the torture test case, as
mentioned in the commit message:

    The parameters in the torture test are also adjusted because of the
    improved performance; otherwise the tests would all be skipped.

(because all clients finish their tasks at phase 0)

Unlike other tests, I never run torture tests using -j, since they are
more related to timing and less stable. Now that I have increased the
clients to 20 x 40 in the test, it is likely to kill your laptop with
-j10. Do you have any suggestion for this? Maybe I can keep the number of
clients small but put some sleep between requests to slow them down.
On Fri, Mar 15, 2019 at 3:37 PM Han Zhou <zhouhan@gmail.com> wrote:
>
> On Fri, Mar 15, 2019 at 3:30 PM Ben Pfaff <blp@ovn.org> wrote:
> > [...]
> > The patch seems very reasonable and the concept makes sense, but when
> > I run
> >     make -j10 check TESTSUITEFLAGS='-k ovsdb,cluster,torture -j10'
> > it comes near to killing my laptop, with multiple ovsdb-servers going
> > to 100% CPU. Without the patch, I don't see behavior like that at all.
> >
> > Do you see the same thing?
>
> [...] Do you have any suggestion for this? Maybe I can keep the number
> of clients small but put some sleep between requests to slow them down.

Adding sleep helped. I just sent v2 that keeps the original number of
clients (10 x 5) but adds some sleep. I verified with no -j and also with
-j10; all tests passed in both cases.
On Fri, Mar 15, 2019 at 03:37:20PM -0700, Han Zhou wrote:
> [...]
> Unlike other tests, I never run torture tests using -j, since they are
> more related to timing and less stable. Now that I have increased the
> clients to 20 x 40 in the test, it is likely to kill your laptop with
> -j10. Do you have any suggestion for this? Maybe I can keep the number
> of clients small but put some sleep between requests to slow them down.

I think that the patch is actually causing a lot more CPU use. For
example, without the patch, running test 2525 (OVSDB 3-server torture
test - kill/restart leader) takes about 15 seconds and uses about 5.6
seconds of CPU time:

    $ time make -j10 check TESTSUITEFLAGS=2525
    [...]
    ## ------------------------------- ##
    ## openvswitch 2.11.90 test suite. ##
    ## ------------------------------- ##

    2525: OVSDB 3-server torture test - kill/restart leader ok
    [...]

    real    0m15.644s
    user    0m4.140s
    sys     0m1.459s

With the patch, it takes 1 3/4 minutes and uses over 178 seconds of CPU
time:

    real    1m44.701s
    user    2m15.691s
    sys     0m42.651s

That explains the problem I see with -j10 pretty well. Do you see the
same thing?
diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index eee4f33..31e9e72 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -320,7 +320,8 @@ static void raft_send_append_request(struct raft *,
 static void raft_become_leader(struct raft *);
 static void raft_become_follower(struct raft *);
-static void raft_reset_timer(struct raft *);
+static void raft_reset_election_timer(struct raft *);
+static void raft_reset_ping_timer(struct raft *);
 static void raft_send_heartbeats(struct raft *);
 static void raft_start_election(struct raft *, bool leadership_transfer);
 static bool raft_truncate(struct raft *, uint64_t new_end);
@@ -376,8 +377,8 @@ raft_alloc(void)
     hmap_init(&raft->add_servers);
     hmap_init(&raft->commands);
 
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
-    raft_reset_timer(raft);
+    raft_reset_ping_timer(raft);
+    raft_reset_election_timer(raft);
 
     return raft;
 }
@@ -865,7 +866,7 @@ raft_read_log(struct raft *raft)
 }
 
 static void
-raft_reset_timer(struct raft *raft)
+raft_reset_election_timer(struct raft *raft)
 {
     unsigned int duration = (ELECTION_BASE_MSEC
                              + random_range(ELECTION_RANGE_MSEC));
@@ -874,6 +875,12 @@
 }
 
 static void
+raft_reset_ping_timer(struct raft *raft)
+{
+    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+}
+
+static void
 raft_add_conn(struct raft *raft, struct jsonrpc_session *js,
               const struct uuid *sid, bool incoming)
 {
@@ -1603,7 +1610,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer)
             VLOG_INFO("term %"PRIu64": starting election", raft->term);
         }
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     struct raft_server *peer;
     HMAP_FOR_EACH (peer, hmap_node, &raft->servers) {
@@ -1782,8 +1789,8 @@ raft_run(struct raft *raft)
                 raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
             }
         }
+        raft_reset_ping_timer(raft);
     }
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
 }
 
     /* Do this only at the end; if we did it as soon as we set raft->left or
@@ -1963,6 +1970,7 @@ raft_command_initiate(struct raft *raft,
             s->next_index++;
         }
     }
+    raft_reset_ping_timer(raft);
     return cmd;
 }
@@ -2313,7 +2321,7 @@ raft_become_follower(struct raft *raft)
     }
 
     raft->role = RAFT_FOLLOWER;
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* Notify clients about lost leadership.
      *
@@ -2387,6 +2395,8 @@ raft_send_heartbeats(struct raft *raft)
                                       RAFT_CMD_INCOMPLETE, 0);
         }
     }
+
+    raft_reset_ping_timer(raft);
 }
 
 /* Initializes the fields in 's' that represent the leader's view of the
@@ -2413,7 +2423,7 @@ raft_become_leader(struct raft *raft)
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
     raft->election_timeout = LLONG_MAX;
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+    raft_reset_ping_timer(raft);
 
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
@@ -2573,11 +2583,13 @@ raft_get_next_entry(struct raft *raft, struct uuid *eid)
     return data;
 }
 
-static void
+/* Updates the commit index in the raft log.  If the commit index is already
+ * up-to-date, does nothing and returns false; otherwise, returns true. */
+static bool
 raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
 {
     if (new_commit_index <= raft->commit_index) {
-        return;
+        return false;
     }
 
     if (raft->role == RAFT_LEADER) {
@@ -2610,6 +2622,7 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
         .commit_index = raft->commit_index,
     };
     ignore(ovsdb_log_write_and_free(raft->log, raft_record_to_json(&r)));
+    return true;
 }
 
 /* This doesn't use rq->entries (but it does use rq->n_entries). */
@@ -2809,7 +2822,7 @@ raft_handle_append_request(struct raft *raft,
                      "usurped leadership");
         return;
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* First check for the common case, where the AppendEntries request is
      * entirely for indexes covered by 'log_start' ... 'log_end - 1', something
@@ -3045,7 +3058,9 @@ raft_consider_updating_commit_index(struct raft *raft)
             }
         }
     }
-    raft_update_commit_index(raft, new_commit_index);
+    if (raft_update_commit_index(raft, new_commit_index)) {
+        raft_send_heartbeats(raft);
+    }
 }
 
 static void
@@ -3274,7 +3289,7 @@ raft_handle_vote_request__(struct raft *raft,
         return false;
     }
 
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     return true;
 }
@@ -3697,7 +3712,7 @@ static bool
 raft_handle_install_snapshot_request__(
     struct raft *raft, const struct raft_install_snapshot_request *rq)
 {
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /*
      * Our behavior here depend on new_log_start in the snapshot compared to
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index c7f1e34..36fb926 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -131,7 +131,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
     done
     export OVN_SB_DB
 
-    n1=10 n2=5
+    n1=40 n2=20
     echo "starting $n1*$n2 ovn-sbctl processes..."
     for i in $(seq 0 $(expr $n1 - 1) ); do
       (for j in $(seq $n2); do
@@ -168,7 +168,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
       case $(cat phase) in # (
         0)
-          if test $done -ge $(expr $n1 / 4); then
+          if test $done -ge $(expr $n1 / 5); then
             if test $variant = kill; then
               stop_server $victim
             else