From patchwork Wed Mar 20 05:56:24 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Zhou X-Patchwork-Id: 1058880 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=openvswitch.org (client-ip=140.211.169.12; helo=mail.linuxfoundation.org; envelope-from=ovs-dev-bounces@openvswitch.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.b="hyRDwc6o"; dkim-atps=neutral Received: from mail.linuxfoundation.org (mail.linuxfoundation.org [140.211.169.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 44PK3F40xWz9sDn for ; Wed, 20 Mar 2019 16:58:16 +1100 (AEDT) Received: from mail.linux-foundation.org (localhost [127.0.0.1]) by mail.linuxfoundation.org (Postfix) with ESMTP id 63A903D93; Wed, 20 Mar 2019 05:58:12 +0000 (UTC) X-Original-To: dev@openvswitch.org Delivered-To: ovs-dev@mail.linuxfoundation.org Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35]) by mail.linuxfoundation.org (Postfix) with ESMTPS id 0D2CB3D93 for ; Wed, 20 Mar 2019 05:56:21 +0000 (UTC) X-Greylist: whitelisted by SQLgrey-1.7.6 Received: from mail-pg1-f196.google.com (mail-pg1-f196.google.com [209.85.215.196]) by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 480C5148 for ; Wed, 20 Mar 2019 05:56:20 +0000 (UTC) Received: by mail-pg1-f196.google.com with SMTP id h8so946685pgp.6 for ; Tue, 19 Mar 2019 22:56:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id; bh=ge6FW5adDJfwLJ9VZ0gQB5KAJba+dQoVhAGuPmf31k8=; b=hyRDwc6oA4ATIfhAbWeoBu1QImRN1/jkBcRwhVB4BRj9UnWNDaEbJ8jIvxbwCTeUe5 I0RlUZ6Ey3evuL+y73IHOntLw8TrQR2oH/zX/U04PCjZ0gCtjyF4cpeSp4EPaiXLIGX2 4j8sh9TF0wCuA9Er8pnzHFsi8RB5SXN6cP+8OYkbeBKdJO4bZuQYdW8T4HV3ws27WHJx +T2javU7EhYnvN9Ou/QzjIkzA63yCD0I4J0YvXUuNwMkWHeTZh9aWAj2w6sDbFskObLc grmQSdgnriHelzWhN83oDTOXEUccRcD7/qqTw7O+yda0oaM9rEsRRM2erbScWl2Wdm+W YmLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=ge6FW5adDJfwLJ9VZ0gQB5KAJba+dQoVhAGuPmf31k8=; b=ckQGjRPJ54FePZLuRN4nffR6UG8uRz5GK1DHPHU8XbYV534WfyqmR8LW5ECT3e3CRe dlewWE24lucntDslae5H57hgoI2Dgh9RaYv2dvGVXA47aCWsd5Z5tHzo35v1ZhLo4kLs qsDi+oYh7IOWgmYMDLzi5U5cqr14WfDXjAkvpdK2zatNoTOBBlmUS0Bm+AM1sj23dnL5 CYXCNLxPzdixp9tLe0wi1wiwF0wRZaEqF3PgTTrz/V6jvuaRgVrtUYpx5dIJC4G9yXR8 BLDSbqE4anN97oD3uOEjSwj5fYYOgeut5+1M80coHBdQL28ihVAXc6p9qPnyE83uKUoH 7Z8g== X-Gm-Message-State: APjAAAWRsYSmzhVgpxpFfkhuyiC9hvZwQHws4+yL9BChhS1vLDpD/h6H piWcyCJ9e4dfYyvsr28d6tndhOrH X-Google-Smtp-Source: APXvYqyz031/mSuCk6gG2Du0+My9hZxhRXKJ+PNqu/kbtCpO/tkIouAvthdbo5nU6+n/qY3v8rOvGg== X-Received: by 2002:a17:902:8f81:: with SMTP id z1mr6255168plo.265.1553061379359; Tue, 19 Mar 2019 22:56:19 -0700 (PDT) Received: from localhost.localdomain.localdomain (c-76-21-108-74.hsd1.ca.comcast.net. [76.21.108.74]) by smtp.gmail.com with ESMTPSA id e15sm977260pgk.30.2019.03.19.22.56.18 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 19 Mar 2019 22:56:18 -0700 (PDT) From: Han Zhou X-Google-Original-From: Han Zhou To: dev@openvswitch.org Date: Tue, 19 Mar 2019 22:56:24 -0700 Message-Id: <1553061384-116689-1-git-send-email-hzhou8@ebay.com> X-Mailer: git-send-email 2.1.0 X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on smtp1.linux-foundation.org Subject: [ovs-dev] [PATCH v4] ovsdb raft: Sync commit index to followers without delay. X-BeenThere: ovs-dev@openvswitch.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: ovs-dev-bounces@openvswitch.org Errors-To: ovs-dev-bounces@openvswitch.org From: Han Zhou When update is requested from follower, the leader sends AppendRequest to all followers and wait until AppendReply received from majority, and then it will update commit index - the new entry is regarded as committed in raft log. However, this commit will not be notified to followers (including the one initiated the request) until next heartbeat (ping timeout), if no other pending requests. This results in long latency for updates made through followers, especially when a batch of updates are requested through the same follower. $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done real 0m34.154s user 0m0.083s sys 0m0.250s This patch solves the problem by sending heartbeat as soon as the commit index is updated in leader. It also avoids unnessary heartbeat by resetting the ping timer whenever AppendRequest is broadcasted. With this patch the performance is improved more than 50 times in same test: $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done real 0m0.564s user 0m0.080s sys 0m0.199s Torture test cases are also updated because otherwise the tests will all be skipped because of the improved performance. Signed-off-by: Han Zhou --- Notes: v3->v4: Update torture tests again. Instead of sleeping, the size of transaction of each client is increased to slow down the execution so that the chance of parallel executions are not reduced. ovsdb/raft.c | 43 +++++++++++++++++++++++++++++-------------- tests/ovsdb-cluster.at | 16 +++++++++++----- 2 files changed, 40 insertions(+), 19 deletions(-) diff --git a/ovsdb/raft.c b/ovsdb/raft.c index eee4f33..31e9e72 100644 --- a/ovsdb/raft.c +++ b/ovsdb/raft.c @@ -320,7 +320,8 @@ static void raft_send_append_request(struct raft *, static void raft_become_leader(struct raft *); static void raft_become_follower(struct raft *); -static void raft_reset_timer(struct raft *); +static void raft_reset_election_timer(struct raft *); +static void raft_reset_ping_timer(struct raft *); static void raft_send_heartbeats(struct raft *); static void raft_start_election(struct raft *, bool leadership_transfer); static bool raft_truncate(struct raft *, uint64_t new_end); @@ -376,8 +377,8 @@ raft_alloc(void) hmap_init(&raft->add_servers); hmap_init(&raft->commands); - raft->ping_timeout = time_msec() + PING_TIME_MSEC; - raft_reset_timer(raft); + raft_reset_ping_timer(raft); + raft_reset_election_timer(raft); return raft; } @@ -865,7 +866,7 @@ raft_read_log(struct raft *raft) } static void -raft_reset_timer(struct raft *raft) +raft_reset_election_timer(struct raft *raft) { unsigned int duration = (ELECTION_BASE_MSEC + random_range(ELECTION_RANGE_MSEC)); @@ -874,6 +875,12 @@ raft_reset_timer(struct raft *raft) } static void +raft_reset_ping_timer(struct raft *raft) +{ + raft->ping_timeout = time_msec() + PING_TIME_MSEC; +} + +static void raft_add_conn(struct raft *raft, struct jsonrpc_session *js, const struct uuid *sid, bool incoming) { @@ -1603,7 +1610,7 @@ raft_start_election(struct raft *raft, bool leadership_transfer) VLOG_INFO("term %"PRIu64": starting election", raft->term); } } - raft_reset_timer(raft); + raft_reset_election_timer(raft); struct raft_server *peer; HMAP_FOR_EACH (peer, hmap_node, &raft->servers) { @@ -1782,8 +1789,8 @@ raft_run(struct raft *raft) raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT); } } + raft_reset_ping_timer(raft); } - raft->ping_timeout = time_msec() + PING_TIME_MSEC; } /* Do this only at the end; if we did it as soon as we set raft->left or @@ -1963,6 +1970,7 @@ raft_command_initiate(struct raft *raft, s->next_index++; } } + raft_reset_ping_timer(raft); return cmd; } @@ -2313,7 +2321,7 @@ raft_become_follower(struct raft *raft) } raft->role = RAFT_FOLLOWER; - raft_reset_timer(raft); + raft_reset_election_timer(raft); /* Notify clients about lost leadership. * @@ -2387,6 +2395,8 @@ raft_send_heartbeats(struct raft *raft) RAFT_CMD_INCOMPLETE, 0); } } + + raft_reset_ping_timer(raft); } /* Initializes the fields in 's' that represent the leader's view of the @@ -2413,7 +2423,7 @@ raft_become_leader(struct raft *raft) raft->role = RAFT_LEADER; raft->leader_sid = raft->sid; raft->election_timeout = LLONG_MAX; - raft->ping_timeout = time_msec() + PING_TIME_MSEC; + raft_reset_ping_timer(raft); struct raft_server *s; HMAP_FOR_EACH (s, hmap_node, &raft->servers) { @@ -2573,11 +2583,13 @@ raft_get_next_entry(struct raft *raft, struct uuid *eid) return data; } -static void +/* Updates commit index in raft log. If commit index is already up-to-date + * it does nothing and return false, otherwise, returns true. */ +static bool raft_update_commit_index(struct raft *raft, uint64_t new_commit_index) { if (new_commit_index <= raft->commit_index) { - return; + return false; } if (raft->role == RAFT_LEADER) { @@ -2610,6 +2622,7 @@ raft_update_commit_index(struct raft *raft, uint64_t new_commit_index) .commit_index = raft->commit_index, }; ignore(ovsdb_log_write_and_free(raft->log, raft_record_to_json(&r))); + return true; } /* This doesn't use rq->entries (but it does use rq->n_entries). */ @@ -2809,7 +2822,7 @@ raft_handle_append_request(struct raft *raft, "usurped leadership"); return; } - raft_reset_timer(raft); + raft_reset_election_timer(raft); /* First check for the common case, where the AppendEntries request is * entirely for indexes covered by 'log_start' ... 'log_end - 1', something @@ -3045,7 +3058,9 @@ raft_consider_updating_commit_index(struct raft *raft) } } } - raft_update_commit_index(raft, new_commit_index); + if (raft_update_commit_index(raft, new_commit_index)) { + raft_send_heartbeats(raft); + } } static void @@ -3274,7 +3289,7 @@ raft_handle_vote_request__(struct raft *raft, return false; } - raft_reset_timer(raft); + raft_reset_election_timer(raft); return true; } @@ -3697,7 +3712,7 @@ static bool raft_handle_install_snapshot_request__( struct raft *raft, const struct raft_install_snapshot_request *rq) { - raft_reset_timer(raft); + raft_reset_election_timer(raft); /* * Our behavior here depend on new_log_start in the snapshot compared to diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at index c7f1e34..5550a19 100644 --- a/tests/ovsdb-cluster.at +++ b/tests/ovsdb-cluster.at @@ -131,12 +131,16 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral done export OVN_SB_DB - n1=10 n2=5 + n1=10 n2=5 n3=50 echo "starting $n1*$n2 ovn-sbctl processes..." for i in $(seq 0 $(expr $n1 - 1) ); do (for j in $(seq $n2); do : > $i-$j.running - run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only add SB_Global . external_ids $i-$j=$i-$j + txn="add SB_Global . external_ids $i-$j=$i-$j" + for k in $(seq $n3); do + txn="$txn -- add SB_Global . external_ids $i-$j-$k=$i-$j-$k" + done + run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only $txn status=$? if test $status != 0; then echo "$i-$j exited with status $status" > $i-$j:$status @@ -146,13 +150,12 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral : > $i.done)& done echo "...done" - sleep 2 echo "waiting for ovn-sbctl processes to exit..." # Use file instead of var because code inside "while" runs in a subshell. echo 0 > phase i=0 - (while :; do echo; sleep 1; done) | while read REPLY; do + (while :; do echo; sleep 0.1; done) | while read REPLY; do printf "t=%2d s:" $i done=0 for j in $(seq 0 $(expr $n1 - 1)); do @@ -168,7 +171,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral case $(cat phase) in # ( 0) - if test $done -ge $(expr $n1 / 4); then + if test $done -ge $(expr $n1 / 10); then if test $variant = kill; then stop_server $victim else @@ -199,6 +202,9 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral for i in $(seq 0 $(expr $n1 - 1) ); do for j in `seq $n2`; do echo "$i-$j=$i-$j" + for k in `seq $n3`; do + echo "$i-$j-$k=$i-$j-$k" + done done done | sort > expout AT_CHECK([ovn-sbctl --timeout=30 --log-file=finalize.log -vtimeval:off -vfile -vsyslog:off --bare get SB_Global . external-ids | tr ',' '\n' | sed 's/[[{}"" ]]//g' | sort], [0], [expout])