From patchwork Tue Mar 26 15:30:22 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dmitry Safonov
X-Patchwork-Id: 1065727
X-Patchwork-Delegate: davem@davemloft.net
From: Dmitry Safonov
To: linux-kernel@vger.kernel.org
Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
 "David S. Miller", Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
 netdev@vger.kernel.org
Subject: [RFC 1/4] net/ipv4/fib: Remove run-time check in tnode_alloc()
Date: Tue, 26 Mar 2019 15:30:22 +0000
Message-Id: <20190326153026.24493-2-dima@arista.com>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190326153026.24493-1-dima@arista.com>
References: <20190326153026.24493-1-dima@arista.com>
X-Mailing-List: netdev@vger.kernel.org

TNODE_KMALLOC_MAX is not used anywhere, while TNODE_VMALLOC_MAX check in
tnode_alloc() only adds additional cmp/jmp instructions to tnode allocation.
During rebalancing of the trie the function can be called thousands of
times. The run-time check takes up a cache line and a branch predictor
entry. Furthermore, this check is always false on 64-bit platforms: IPv4 has
only a 4-byte address range, so bits is limited by KEYLENGTH (32).

Move the check under unlikely() and change the comparison to BITS_PER_LONG,
optimizing allocation of tnodes during rebalancing (and removing the check
completely on platforms with BITS_PER_LONG > KEYLENGTH).

Signed-off-by: Dmitry Safonov
---
 net/ipv4/fib_trie.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index a573e37e0615..ad7d56c421cb 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -312,11 +312,6 @@ static inline void alias_free_mem_rcu(struct fib_alias *fa)
 	call_rcu(&fa->rcu, __alias_free_mem);
 }

-#define TNODE_KMALLOC_MAX \
-	ilog2((PAGE_SIZE - TNODE_SIZE(0)) / sizeof(struct key_vector *))
-#define TNODE_VMALLOC_MAX \
-	ilog2((SIZE_MAX - TNODE_SIZE(0)) / sizeof(struct key_vector *))
-
 static void __node_free_rcu(struct rcu_head *head)
 {
 	struct tnode *n = container_of(head, struct tnode, rcu);
@@ -333,8 +328,7 @@ static struct tnode *tnode_alloc(int bits)
 {
 	size_t size;

-	/* verify bits is within bounds */
-	if (bits > TNODE_VMALLOC_MAX)
+	if ((BITS_PER_LONG <= KEYLENGTH) && unlikely(bits >= BITS_PER_LONG))
 		return NULL;

 	/* determine size and verify it is non-zero and didn't overflow */

From patchwork Tue Mar 26 15:30:23 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dmitry Safonov
X-Patchwork-Id: 1065725
X-Patchwork-Delegate: davem@davemloft.net
From: Dmitry Safonov
To: linux-kernel@vger.kernel.org
Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
 "David S. Miller", Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
 netdev@vger.kernel.org, linux-doc@vger.kernel.org, Jonathan Corbet
Subject: [RFC 2/4] net/fib: Provide fib_balance_budget sysctl
Date: Tue, 26 Mar 2019 15:30:23 +0000
Message-Id: <20190326153026.24493-3-dima@arista.com>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190326153026.24493-1-dima@arista.com>
References: <20190326153026.24493-1-dima@arista.com>
X-Mailing-List: netdev@vger.kernel.org

Unfortunately, MAX_WORK is currently broken: while trie_rebalance() climbs
upwards, resize() is called for each tnode. This makes the limit less and
less useful the bigger the trie gets: at each level there are 10 attempts to
balance a tnode and its children, resulting in
O(10^n + 10^{n-1} + ... + 10^{n-k}) complexity to balance the trie, where k
is the level at which the trie was originally changed (adding/removing an
alias/route).

This results in the following cost of removing one route from a big trie
(similar for some other single routes whose removal reallocates tnodes by
hitting the threshold limits for halve/inflate resize):

Before:
   Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
   Main:
           Aver depth:     1.99
           Max depth:      2
           Leaves:         77825
           Prefixes:       77825
           Internal nodes: 77
             9: 1  10: 76
           Pointers: 78336
   Null ptrs: 435
   Total size: 7912  kB
   Local:
   [omitted]

After:
   Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes.
   Main:
           Aver depth:     3.00
           Max depth:      3
           Leaves:         77824
           Prefixes:       77824
           Internal nodes: 20491
             1: 2048  2: 18432  6: 1  11: 10
           Pointers: 98368
   Null ptrs: 54
   Total size: 8865  kB
   Local:
   [omitted]

Provide a sysctl to control the amount of pending balancing work
(unlimited by default, as it was).

Cc: linux-doc@vger.kernel.org
Cc: Jonathan Corbet
Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down into inflate/halve")
Signed-off-by: Dmitry Safonov
---
 Documentation/networking/ip-sysctl.txt |  6 +++
 include/net/ip.h                       |  1 +
 net/ipv4/fib_trie.c                    | 60 +++++++++++++++-----------
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 4 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index acdfb5d2bcaa..fb71dacff4dd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER
 	0 - Layer 3
 	1 - Layer 4

+fib_balance_budget - UNSIGNED INTEGER
+	Limits the number of resize attempts during balancing fib trie
+	on adding/removing new routes.
+	Possible values:
+	Default: UINT_MAX (0xFFFFFFFF)
+
 ip_forward_update_priority - INTEGER
 	Whether to update SKB priority from "TOS" field in IPv4 header after it
 	is forwarded. The new SKB priority is mapped from TOS field value
diff --git a/include/net/ip.h b/include/net/ip.h
index be3cad9c2e4c..305d0e43088b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
 	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
 }

+extern unsigned int fib_balance_budget;
 struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx,
 					int fc_mx_len,
 					struct netlink_ext_ack *extack);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index ad7d56c421cb..d90cf9dfd443 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -182,8 +182,10 @@ struct trie {
 #endif
 };

-static struct key_vector *resize(struct trie *t, struct key_vector *tn);
+static struct key_vector *resize(struct trie *t, struct key_vector *tn,
+				 unsigned int *budget);
 static size_t tnode_free_size;
+unsigned int fib_balance_budget = UINT_MAX;

 /*
  * synchronize_rcu after call_rcu for that many pages; it should be especially
@@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn)

 static struct key_vector *replace(struct trie *t,
 				  struct key_vector *oldtnode,
-				  struct key_vector *tn)
+				  struct key_vector *tn,
+				  unsigned int *budget)
 {
 	struct key_vector *tp = node_parent(oldtnode);
 	unsigned long i;
@@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t,
 	tnode_free(oldtnode);

 	/* resize children now that oldtnode is freed */
-	for (i = child_length(tn); i;) {
+	for (i = child_length(tn); i && *budget;) {
 		struct key_vector *inode = get_child(tn, --i);

 		/* resize child node */
 		if (tnode_full(tn, inode))
-			tn = resize(t, inode);
+			tn = resize(t, inode, budget);
 	}

 	return tp;
 }

-static struct key_vector *inflate(struct trie *t,
-				  struct key_vector *oldtnode)
+static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode,
+				  unsigned int *budget)
 {
 	struct key_vector *tn;
 	unsigned long i;
@@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t,
 	}

 	/* setup the parent pointers into and out of this node */
-	return replace(t, oldtnode, tn);
+	return replace(t, oldtnode, tn, budget);
 nomem:
 	/* all pointers should be clean so we are done */
 	tnode_free(tn);
@@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t,
 	return NULL;
 }

-static struct key_vector *halve(struct trie *t,
-				struct key_vector *oldtnode)
+static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode,
+				unsigned int *budget)
 {
 	struct key_vector *tn;
 	unsigned long i;
@@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t,
 	}

 	/* setup the parent pointers into and out of this node */
-	return replace(t, oldtnode, tn);
+	return replace(t, oldtnode, tn, budget);
 nomem:
 	/* all pointers should be clean so we are done */
 	tnode_free(tn);
@@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn)
 	return used < 2;
 }

-#define MAX_WORK 10
-static struct key_vector *resize(struct trie *t, struct key_vector *tn)
+static struct key_vector *resize(struct trie *t, struct key_vector *tn,
+				 unsigned int *budget)
 {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 	struct trie_use_stats __percpu *stats = t->stats;
 #endif
 	struct key_vector *tp = node_parent(tn);
 	unsigned long cindex = get_index(tn->key, tp);
-	int max_work = MAX_WORK;
+	bool inflated = false;

 	pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n",
 		 tn, inflate_threshold, halve_threshold);
@@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 	/* Double as long as the resulting node has a number of
 	 * nonempty nodes that are above the threshold.
 	 */
-	while (should_inflate(tp, tn) && max_work) {
-		tp = inflate(t, tn);
+	while (should_inflate(tp, tn) && *budget) {
+		tp = inflate(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 			this_cpu_inc(stats->resize_node_skipped);
@@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 			break;
 		}

-		max_work--;
+		(*budget)--;
+		inflated = true;
 		tn = get_child(tp, cindex);
 	}

 	/* update parent in case inflate failed */
 	tp = node_parent(tn);

-	/* Return if at least one inflate is run */
-	if (max_work != MAX_WORK)
+	/* Return if at least one inflate is run:
+	 * microoptimization to not recalculate thresholds
+	 */
+	if (inflated)
 		return tp;

 	/* Halve as long as the number of empty children in this
 	 * node is above threshold.
 	 */
-	while (should_halve(tp, tn) && max_work) {
-		tp = halve(t, tn);
+	while (should_halve(tp, tn) && *budget) {
+		tp = halve(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
 			this_cpu_inc(stats->resize_node_skipped);
@@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
 			break;
 		}

-		max_work--;
+		(*budget)--;
 		tn = get_child(tp, cindex);
 	}

@@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,

 static void trie_rebalance(struct trie *t, struct key_vector *tn)
 {
-	while (!IS_TRIE(tn))
-		tn = resize(t, tn);
+	unsigned int budget = fib_balance_budget;
+
+	while (budget && !IS_TRIE(tn))
+		tn = resize(t, tn, &budget);
 }

 static int fib_insert_node(struct trie *t, struct key_vector *tp,
@@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb)
 void fib_table_flush_external(struct fib_table *tb)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	unsigned int budget = fib_balance_budget;
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb)
 				update_suffix(pn);

 			/* resize completed node */
-			pn = resize(t, pn);
+			pn = resize(t, pn, &budget);
 			cindex = get_index(pkey, pn);

 			continue;
@@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb)
 int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 {
 	struct trie *t = (struct trie *)tb->tb_data;
+	unsigned int budget = fib_balance_budget;
 	struct key_vector *pn = t->kv;
 	unsigned long cindex = 1;
 	struct hlist_node *tmp;
@@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all)
 				update_suffix(pn);

 			/* resize completed node */
-			pn = resize(t, pn);
+			pn = resize(t, pn, &budget);
 			cindex = get_index(pkey, pn);

 			continue;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ba0fc4b18465..d7274cc442af 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "fib_balance_budget",
+		.data		= &fib_balance_budget,
+		.maxlen		= sizeof(fib_balance_budget),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec,
+	},
 	{
 		.procname	= "inet_peer_threshold",
 		.data		= &inet_peer_threshold,

From patchwork Tue Mar 26 15:30:24 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dmitry Safonov
X-Patchwork-Id: 1065726
X-Patchwork-Delegate: davem@davemloft.net
From: Dmitry Safonov
To: linux-kernel@vger.kernel.org
Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
 "David S. Miller", Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
 netdev@vger.kernel.org
Subject: [RFC 3/4] net/fib: Check budget before should_{inflate,halve}()
Date: Tue, 26 Mar 2019 15:30:24 +0000
Message-Id: <20190326153026.24493-4-dima@arista.com>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190326153026.24493-1-dima@arista.com>
References: <20190326153026.24493-1-dima@arista.com>
X-Mailing-List: netdev@vger.kernel.org

Those functions are computationally expensive; if we're out of budget, it is
better to skip the additional computations.

Signed-off-by: Dmitry Safonov
---
 net/ipv4/fib_trie.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d90cf9dfd443..2ce2739e7693 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -868,7 +868,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn,
 	/* Double as long as the resulting node has a number of
 	 * nonempty nodes that are above the threshold.
 	 */
-	while (should_inflate(tp, tn) && *budget) {
+	while (*budget && should_inflate(tp, tn)) {
 		tp = inflate(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -894,7 +894,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn,
 	/* Halve as long as the number of empty children in this
 	 * node is above threshold.
 	 */
-	while (should_halve(tp, tn) && *budget) {
+	while (*budget && should_halve(tp, tn)) {
 		tp = halve(t, tn, budget);
 		if (!tp) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS

From patchwork Tue Mar 26 15:30:25 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dmitry Safonov
X-Patchwork-Id: 1065724
X-Patchwork-Delegate: davem@davemloft.net
From: Dmitry Safonov
To: linux-kernel@vger.kernel.org
Cc: Dmitry Safonov, Alexander Duyck, Alexey Kuznetsov, David Ahern,
 "David S. Miller", Eric Dumazet, Hideaki YOSHIFUJI, Ido Schimmel,
 netdev@vger.kernel.org
Subject: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
Date: Tue, 26 Mar 2019 15:30:25 +0000
Message-Id: <20190326153026.24493-5-dima@arista.com>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190326153026.24493-1-dima@arista.com>
References: <20190326153026.24493-1-dima@arista.com>
X-Mailing-List: netdev@vger.kernel.org

The fib trie has a hard-coded sync_pages limit after which
synchronize_rcu() is called. The limit is 128 pages, or 512KB (assuming the
common case of 4KB pages).

Unfortunately, at Arista we have use cases with full-view software
forwarding. At the scale of 100K routes and more, even on 2-core boxes, the
hard-coded limit starts actively shooting us in the foot: the lockup
detector notices that rtnl_lock is held for seconds.

The first reason is the previously broken MAX_WORK, which didn't limit
pending balancing work. While fixing it, I noticed that the bottleneck is
actually the number of synchronize_rcu() calls.

I tried to fix it with a patch that decrements the number of tnodes in the
RCU callback, but it didn't affect performance much.

One possible way to "fix" it is to provide another sysctl to control
sync_pages, but in my view that's nasty: it exposes another implementation
detail to user-space.

To be completely honest, I'm not sure whether calling synchronize_rcu()
from a shrinker is a sane idea: during OOM we're slowed down enough already,
and adding a synchronize there will probably slow the shrinker down
noticeably.

Anyway, I've got the following results on a very simple benchmark that adds
routes one by one and then removes them (with unlimited fib_balance_budget)
and measures the time spent to remove one route:

*Before* on 4-cores switch (AMD GX-420CA SOC):
v4 Create of 4194304 routes: 76806ms
0(2097152): 000. 32.  0.  0 3353ms
1(3145729): 000. 48.  0.  1 1311ms
2(1048577): 000. 16.  0.  1 1286ms
3(524289): 000.  8.  0.  1 865ms
4(4098744): 000. 62.138.184 858ms
5(3145728): 000. 48.  0.  0 832ms
6(1048576): 000. 16.  0.  0 794ms
7(2621441): 000. 40.  0.  1 663ms
8(2621440): 000. 40.  0.  0 525ms
9(524288): 000.  8.  0.  0 508ms
v4 Delete of 4194304 routes: 111129ms
0(1589247): 000. 24. 63.255 3033ms
1(3702783): 000. 56.127.255 2833ms
2(3686399): 000. 56. 63.255 2630ms
3(1605631): 000. 24.127.255 2574ms
4(1581055): 000. 24. 31.255 2395ms
5(3671039): 000. 56.  3.255 2289ms
6(1573887): 000. 24.  3.255 2234ms
7(3678207): 000. 56. 31.255 2143ms
8(3670527): 000. 56.  1.255 2109ms
9(1573375): 000. 24.  1.255 2070ms

*After* on 4-cores switch:
v4 Create of 4194304 routes: 65305ms
0(2097153): 000. 32.  0.  1 1871ms
1(1048577): 000. 16.  0.  1 1064ms
2(2097152): 000. 32.  0.  0 905ms
3(524289): 000.  8.  0.  1 507ms
4(1048576): 000. 16.  0.  0 451ms
5(2097154): 000. 32.  0.  2 355ms
6(262145): 000.  4.  0.  1 240ms
7(524288): 000.  8.  0.  0 230ms
8(262144): 000.  4.  0.  0 115ms
9(131073): 000.  2.  0.  1 109ms
v4 Delete of 4194304 routes: 38015ms
0(3571711): 000. 54.127.255 1616ms
1(3565567): 000. 54.103.255 1340ms
2(3670015): 000. 55.255.255 1297ms
3(3565183): 000. 54.102.127 1226ms
4(3565159): 000. 54.102.103 912ms
5(3604479): 000. 54.255.255 596ms
6(3670016): 000. 56.  0.  0 474ms
7(3565311): 000. 54.102.255 434ms
8(3567615): 000. 54.111.255 388ms
9(3565167): 000. 54.102.111 376ms

After the patch one core is completely busy with the benchmark, while
previously none of the CPUs was busy.

By tuning the balancing budget sysctl knob, one can distribute the
balancing work for adding/removing a route across neighbouring changes (at
the price of a possibly less balanced trie and slightly more expensive
lookups).

Fixes: fc86a93b46d7 ("fib_trie: Push tnode flushing down to inflate/halve")
Signed-off-by: Dmitry Safonov
---
 net/ipv4/fib_trie.c | 53 +++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2ce2739e7693..5773d479e7d2 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -184,16 +184,9 @@ struct trie {

 static struct key_vector *resize(struct trie *t, struct key_vector *tn,
 				 unsigned int *budget);
-static size_t tnode_free_size;
+static atomic_long_t objects_waiting_rcu;
 unsigned int fib_balance_budget = UINT_MAX;

-/*
- * synchronize_rcu after call_rcu for that many pages; it should be especially
- * useful before resizing the root node with PREEMPT_NONE configs; the value was
- * obtained experimentally, aiming to avoid visible slowdown.
- */
-static const int sync_pages = 128;
-
 static struct kmem_cache *fn_alias_kmem __ro_after_init;
 static struct kmem_cache *trie_leaf_kmem __ro_after_init;

@@ -306,11 +299,16 @@ static const int inflate_threshold_root = 30;
 static void __alias_free_mem(struct rcu_head *head)
 {
 	struct fib_alias *fa = container_of(head, struct fib_alias, rcu);
+
+	atomic_long_dec(&objects_waiting_rcu);
 	kmem_cache_free(fn_alias_kmem, fa);
 }

 static inline void alias_free_mem_rcu(struct fib_alias *fa)
 {
+	lockdep_rtnl_is_held();
+
+	atomic_long_inc(&objects_waiting_rcu);
 	call_rcu(&fa->rcu, __alias_free_mem);
 }

@@ -318,13 +316,40 @@ static void __node_free_rcu(struct rcu_head *head)
 {
 	struct tnode *n = container_of(head, struct tnode, rcu);

+	atomic_long_dec(&objects_waiting_rcu);
 	if (!n->tn_bits)
 		kmem_cache_free(trie_leaf_kmem, n);
 	else
 		kvfree(n);
 }

-#define node_free(n) call_rcu(&tn_info(n)->rcu, __node_free_rcu)
+static inline void node_free(struct key_vector *n)
+{
+	lockdep_rtnl_is_held();
+
+	atomic_long_inc(&objects_waiting_rcu);
+	call_rcu(&tn_info(n)->rcu, __node_free_rcu);
+}
+
+static unsigned long fib_shrink_count(struct shrinker *s,
+				      struct shrink_control *sc)
+{
+	return (unsigned long)atomic_long_read(&objects_waiting_rcu);
+}
+
+static unsigned long fib_shrink_scan(struct shrinker *s,
+				     struct shrink_control *sc)
+{
+	long ret = (unsigned long)atomic_long_read(&objects_waiting_rcu);
+
+	synchronize_rcu();
+	return (unsigned long)ret;
+}
+
+static struct shrinker fib_shrinker = {
+	.count_objects = fib_shrink_count,
+	.scan_objects = fib_shrink_scan,
+};

 static struct tnode *tnode_alloc(int bits)
 {
@@ -494,16 +519,9 @@ static void tnode_free(struct key_vector *tn)

 	while (head) {
 		head = head->next;
-		tnode_free_size += TNODE_SIZE(1ul << tn->bits);
 		node_free(tn);
-
 		tn = container_of(head, struct tnode, rcu)->kv;
 	}
-
-	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
-		tnode_free_size = 0;
-		synchronize_rcu();
-	}
 }

 static struct key_vector *replace(struct trie *t,
@@ -2118,6 +2136,9 @@ void __init fib_trie_init(void)
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
 					   0, SLAB_PANIC, NULL);
+
+	if (register_shrinker(&fib_shrinker))
+		panic("IP FIB: failed to register fib_shrinker\n");
 }

 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
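
For readers who want to look at the accounting from this last patch in
isolation, here is a minimal userspace sketch. It is not kernel code and
makes several assumptions: C11 atomics stand in for atomic_long_t, a plain
linked list stands in for the call_rcu() queue, and run_grace_period()
stands in for the end of an RCU grace period. Only the name
objects_waiting_rcu comes from the patch; defer_free(), run_grace_period(),
pending_count() and queue_objects() are hypothetical helpers.

```c
/* Userspace model only: no RCU, no shrinker, no locking. */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for an object that embeds a struct rcu_head. */
struct deferred {
	struct deferred *next;
};

/* Number of objects queued for freeing but not yet freed. */
static atomic_long objects_waiting_rcu;

/* Hypothetical stand-in for the list of pending call_rcu() callbacks. */
static struct deferred *pending_head;

/* Queue an object for deferred freeing and account for it. */
static void defer_free(struct deferred *obj)
{
	atomic_fetch_add(&objects_waiting_rcu, 1);
	obj->next = pending_head;
	pending_head = obj;
}

/* Pretend a grace period has elapsed: run every queued "callback". */
static void run_grace_period(void)
{
	while (pending_head) {
		struct deferred *obj = pending_head;

		pending_head = obj->next;
		/* The callback drops the count before freeing the object. */
		atomic_fetch_sub(&objects_waiting_rcu, 1);
		free(obj);
	}
}

/* What a count_objects-style hook would report in this model. */
static unsigned long pending_count(void)
{
	long n = atomic_load(&objects_waiting_rcu);

	assert(n >= 0);
	return (unsigned long)n;
}

/* Allocate n dummy objects and queue all of them for deferred freeing. */
static void queue_objects(int n)
{
	for (int i = 0; i < n; i++) {
		struct deferred *obj = malloc(sizeof(*obj));

		assert(obj);
		defer_free(obj);
	}
}
```

In this model, calling queue_objects(3) leaves pending_count() at 3, and it
drops back to 0 once run_grace_period() has run. The counter reflects what
is still queued at any moment, unlike the removed tnode_free_size, which was
a running byte total that was reset once it crossed the sync_pages
threshold.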