From: Ben Pfaff
Date: Fri, 18 May 2018 13:35:45 -0700
To: aginwala
Cc: ovs dev
Subject: [ovs-dev] scale test testing requests (was: raft ovsdb clustering with scale test)

I spent some time yesterday and today stress-testing the database. So far, I can't reproduce these particular problems, but I do see various ways to improve OVS and OVN and their tests.

Here are some suggestions for further testing:

1. You mentioned that programs were segfaulting. They should not be doing that (obviously), but I wasn't able to make them crash in my own testing. Backtraces would be very helpful. Would you mind trying to get them?

2. You mentioned that "perf" shows that lots of time is being spent writing snapshots. It would be helpful to know whether this contributes to the failures. (If it does, I will work on making snapshots faster.) One way to find out would be to disable snapshots entirely for testing. That isn't acceptable in production, because without snapshots the log is never compacted and will eventually use up all the disk space, but for testing one could apply the patch below.

3. This isn't really a testing note, but I do see that the way OVSDB proxies writes from a Raft follower to the leader is needlessly inefficient, and I should rework it for better write performance.
diff --git a/ovsdb/storage.c b/ovsdb/storage.c
index 446cae0861ec..9fa9954b6d35 100644
--- a/ovsdb/storage.c
+++ b/ovsdb/storage.c
@@ -490,38 +490,8 @@ schedule_next_snapshot(struct ovsdb_storage *storage, bool quick)
 }
 
 bool
-ovsdb_storage_should_snapshot(const struct ovsdb_storage *storage)
+ovsdb_storage_should_snapshot(const struct ovsdb_storage *storage OVS_UNUSED)
 {
-    if (storage->raft || storage->log) {
-        /* If we haven't reached the minimum snapshot time, don't snapshot. */
-        long long int now = time_msec();
-        if (now < storage->next_snapshot_min) {
-            return false;
-        }
-
-        /* If we can't snapshot right now, don't. */
-        if (storage->raft && !raft_may_snapshot(storage->raft)) {
-            return false;
-        }
-
-        uint64_t log_len = (storage->raft
-                            ? raft_get_log_length(storage->raft)
-                            : storage->n_read + storage->n_written);
-        if (now < storage->next_snapshot_max) {
-            /* Maximum snapshot time not yet reached. Take a snapshot if there
-             * have been at least 100 log entries and the log file size has
-             * grown a lot. */
-            bool grew_lots = (storage->raft
-                              ? raft_grew_lots(storage->raft)
-                              : ovsdb_log_grew_lots(storage->log));
-            return log_len >= 100 && grew_lots;
-        } else {
-            /* We have reached the maximum snapshot time. Take a snapshot if
-             * there have been any log entries at all. */
-            return log_len > 0;
-        }
-    }
-
     return false;
 }
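To make explicit what the patch turns off, here is a minimal standalone C model of the removed snapshot policy. It is a sketch, not OVS code: `struct snapshot_state`, `should_snapshot()`, and `should_snapshot_patched()` are hypothetical names, and they flatten the values the real `ovsdb_storage_should_snapshot()` reads (`next_snapshot_min`, `next_snapshot_max`, `raft_may_snapshot()`, the log length, and the "grew lots" check) into plain fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state ovsdb_storage_should_snapshot()
 * consults.  Field names mirror the removed code, but this is NOT the
 * real struct ovsdb_storage. */
struct snapshot_state {
    bool has_storage;          /* storage->raft || storage->log */
    bool may_snapshot;         /* raft_may_snapshot(); true for a plain log */
    long long next_min_msec;   /* storage->next_snapshot_min */
    long long next_max_msec;   /* storage->next_snapshot_max */
    uint64_t log_len;          /* log entries since the last snapshot */
    bool grew_lots;            /* raft_grew_lots() / ovsdb_log_grew_lots() */
};

/* Model of the decision logic that the patch deletes, in the same order. */
static bool
should_snapshot(const struct snapshot_state *s, long long now)
{
    if (!s->has_storage) {
        return false;
    }
    if (now < s->next_min_msec) {
        return false;          /* Minimum snapshot time not reached. */
    }
    if (!s->may_snapshot) {
        return false;          /* Raft cannot snapshot right now. */
    }
    if (now < s->next_max_msec) {
        /* Before the maximum time: need >= 100 entries and a big log. */
        return s->log_len >= 100 && s->grew_lots;
    }
    /* Past the maximum time: any log entries at all are enough. */
    return s->log_len > 0;
}

/* With the patch applied, the decision ignores all state. */
static bool
should_snapshot_patched(const struct snapshot_state *s, long long now)
{
    (void) s;
    (void) now;
    return false;
}
```

For example, with `next_min_msec = 1000`, `next_max_msec = 5000`, and 150 entries in a log that has grown a lot, the model declines at `now = 500` (too early) and agrees at `now = 2000`. With only 10 entries it declines at `now = 2000` but agrees at `now = 6000`, once the maximum interval has passed. The patched version returns false in every case, so the log is never compacted, which is why this is only suitable for testing.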