From patchwork Wed Aug 7 22:25:33 2019
X-Patchwork-Submitter: Yifeng Sun
X-Patchwork-Id: 1143737
From: Yifeng Sun
To: dev@openvswitch.org
Date: Wed, 7 Aug 2019 15:25:33 -0700
Message-Id: <1565216733-23570-1-git-send-email-pkusunyifeng@gmail.com>
X-Mailer: git-send-email 2.7.4
Subject: [ovs-dev] [PATCH v3] datapath: compat: Backport bugfixes for nf_conncount

This patch backports several critical bug fixes related to locking and
data consistency in the nf_conncount code. The backport is based on the
following upstream net-next commits:

a007232 ("netfilter: nf_conncount: fix argument order to find_next_bit")
c80f10b ("netfilter: nf_conncount: speculative garbage collection on empty lists")
2f971a8 ("netfilter: nf_conncount: move all list iterations under spinlock")
df4a902 ("netfilter: nf_conncount: merge lookup and add functions")
e8cfb37 ("netfilter: nf_conncount: restart search when nodes have been erased")
f7fcc98 ("netfilter: nf_conncount: split gc in two phases")
4cd273b ("netfilter: nf_conncount: don't skip eviction when age is negative")
c78e781 ("netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS")
d4e7df1 ("netfilter: nf_conncount: use rb_link_node_rcu() instead of rb_link_node()")
53ca0f2 ("netfilter: nf_conncount: remove wrong condition check routine")
3c5cdb1 ("netfilter: nf_conncount: fix unexpected permanent node of list.")
31568ec ("netfilter: nf_conncount: fix list_del corruption in conn_free")
fd3e71a ("netfilter: nf_conncount: use spin_lock_bh instead of spin_lock")

This patch also adds the compat code needed to build on all supported
kernel versions, and it ensures that the OVS datapath always picks the
bug-fixed nf_conncount code: if the kernel already contains these fixes,
the kernel's own nf_conncount is used; otherwise, OVS falls back to the
compat nf_conncount functions.

Travis tests are at https://travis-ci.org/yifsun/ovs-travis/builds/569056850
On the latest RHEL kernel, 'make check-kmod' runs cleanly.

VMware-BZ: #2396471
Signed-off-by: Yifeng Sun
Acked-by: Yi-Hung Wei
---
v1->v2: Add fixes to support old kernel versions, as suggested by Yi-Hung.
v2->v3: Backport more upstream patches. Thanks Yi-Hung for reviewing.
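For reviewers less familiar with the OVS compat layer, here is a minimal
sketch of the selection mechanism described above. It is illustrative only
and not part of the diff: HAVE_UPSTREAM_NF_CONNCOUNT is the real macro
probed by acinclude.m4 below, and the rpl_ ("replacement") names match the
compat functions in this patch, but the exact aliasing lines and
rpl_nf_conncount_destroy are assumptions based on the usual OVS
compat-header convention:

    #ifdef HAVE_UPSTREAM_NF_CONNCOUNT
    /* The running kernel already carries the fixed nf_conncount;
     * use its header and implementation directly. */
    #include_next <net/netfilter/nf_conntrack_count.h>
    #else
    /* Kernel lacks the fixes: route callers to the backported code
     * in datapath/linux/compat/nf_conncount.c.  (Aliases below are a
     * hypothetical sketch of the convention, not the patch itself.) */
    #define nf_conncount_init    rpl_nf_conncount_init
    #define nf_conncount_count   rpl_nf_conncount_count
    #define nf_conncount_destroy rpl_nf_conncount_destroy
    #endif /* HAVE_UPSTREAM_NF_CONNCOUNT */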
 acinclude.m4                                       |   6 +-
 datapath/linux/Modules.mk                          |   3 +-
 datapath/linux/compat/include/linux/rbtree.h       |  19 ++
 .../include/net/netfilter/nf_conntrack_count.h     |   7 -
 datapath/linux/compat/nf_conncount.c               | 290 ++++++++++-----------
 5 files changed, 161 insertions(+), 164 deletions(-)
 create mode 100644 datapath/linux/compat/include/linux/rbtree.h

diff --git a/acinclude.m4 b/acinclude.m4
index 116ffcf9096d..f0e38898b17a 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -713,7 +713,9 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
   OVS_GREP_IFELSE([$KSRC/include/net/netfilter/nf_nat.h], [nf_nat_range2])
   OVS_GREP_IFELSE([$KSRC/include/net/netfilter/nf_conntrack_seqadj.h], [nf_ct_seq_adjust])
   OVS_GREP_IFELSE([$KSRC/include/net/netfilter/nf_conntrack_count.h], [nf_conncount_gc_list],
-                  [OVS_DEFINE([HAVE_UPSTREAM_NF_CONNCOUNT])])
+                  [OVS_GREP_IFELSE([$KSRC/include/net/netfilter/nf_conntrack_count.h],
+                                   [int nf_conncount_add],
+                                   [], [OVS_DEFINE([HAVE_UPSTREAM_NF_CONNCOUNT])])])
   OVS_GREP_IFELSE([$KSRC/include/linux/random.h], [prandom_u32])
   OVS_GREP_IFELSE([$KSRC/include/linux/random.h], [prandom_u32_max])
@@ -1012,6 +1014,8 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
                   [OVS_DEFINE([HAVE_GRE_CALC_HLEN])])
   OVS_GREP_IFELSE([$KSRC/include/net/gre.h], [ip_gre_calc_hlen],
                   [OVS_DEFINE([HAVE_IP_GRE_CALC_HLEN])])
+  OVS_GREP_IFELSE([$KSRC/include/linux/rbtree.h], [rb_link_node_rcu],
+                  [OVS_DEFINE([HAVE_RBTREE_RB_LINK_NODE_RCU])])

   if cmp -s datapath/linux/kcompat.h.new \
            datapath/linux/kcompat.h >/dev/null 2>&1; then
diff --git a/datapath/linux/Modules.mk b/datapath/linux/Modules.mk
index cbb29f1c69d0..69d7faeac414 100644
--- a/datapath/linux/Modules.mk
+++ b/datapath/linux/Modules.mk
@@ -116,5 +116,6 @@ openvswitch_headers += \
 	linux/compat/include/uapi/linux/netfilter.h \
 	linux/compat/include/linux/mm.h \
 	linux/compat/include/linux/netfilter.h \
-	linux/compat/include/linux/overflow.h
+	linux/compat/include/linux/overflow.h \
+	linux/compat/include/linux/rbtree.h

 EXTRA_DIST += linux/compat/build-aux/export-check-whitelist
diff --git a/datapath/linux/compat/include/linux/rbtree.h b/datapath/linux/compat/include/linux/rbtree.h
new file mode 100644
index 000000000000..dbf20ff0e0b8
--- /dev/null
+++ b/datapath/linux/compat/include/linux/rbtree.h
@@ -0,0 +1,19 @@
+#ifndef __LINUX_RBTREE_WRAPPER_H
+#define __LINUX_RBTREE_WRAPPER_H 1
+
+#include_next <linux/rbtree.h>
+
+#ifndef HAVE_RBTREE_RB_LINK_NODE_RCU
+#include <linux/rcupdate.h>
+
+static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node *parent,
+				    struct rb_node **rb_link)
+{
+	node->__rb_parent_color = (unsigned long)parent;
+	node->rb_left = node->rb_right = NULL;
+
+	rcu_assign_pointer(*rb_link, node);
+}
+#endif
+
+#endif /* __LINUX_RBTREE_WRAPPER_H */
diff --git a/datapath/linux/compat/include/net/netfilter/nf_conntrack_count.h b/datapath/linux/compat/include/net/netfilter/nf_conntrack_count.h
index 614017309efe..2143136aa617 100644
--- a/datapath/linux/compat/include/net/netfilter/nf_conntrack_count.h
+++ b/datapath/linux/compat/include/net/netfilter/nf_conntrack_count.h
@@ -21,17 +21,10 @@ static inline void rpl_nf_conncount_modexit(void)
 #define CONFIG_NETFILTER_CONNCOUNT 1

 struct nf_conncount_data;

-enum nf_conncount_list_add {
-	NF_CONNCOUNT_ADDED,	/* list add was ok */
-	NF_CONNCOUNT_ERR,	/* -ENOMEM, must drop skb */
-	NF_CONNCOUNT_SKIP,	/* list is already reclaimed by gc */
-};
-
 struct nf_conncount_list {
 	spinlock_t list_lock;
 	struct list_head head;	/* connections with the same filtering key */
 	unsigned int count;	/* length of list */
-	bool dead;
 };

 struct nf_conncount_data
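A note on why the new rbtree.h shim above is needed: count_tree() in the
backported nf_conncount.c walks the rb-tree under RCU without taking the
tree lock, so insert_tree() must publish new nodes with
rcu_assign_pointer() (which is what rb_link_node_rcu() does) rather than
with the plain store done by rb_link_node(). A self-contained userspace
analogue of that ordering guarantee, using C11 atomics in place of the RCU
primitives (illustrative only; all names below are hypothetical):

    #include <stdatomic.h>
    #include <stdio.h>

    struct node {
            struct node *left, *right;
            int key;
    };

    static _Atomic(struct node *) root;

    /* Writer: initialize every field, then publish with a release
     * store -- the role rcu_assign_pointer() plays inside
     * rb_link_node_rcu().  A plain store could let a lockless reader
     * observe the link before the node's fields. */
    static void publish(struct node *n, int key)
    {
            n->left = n->right = NULL;
            n->key = key;
            atomic_store_explicit(&root, n, memory_order_release);
    }

    /* Reader: acquire load, standing in for rcu_dereference(). */
    static struct node *lookup(void)
    {
            return atomic_load_explicit(&root, memory_order_acquire);
    }

    int main(void)
    {
            struct node n;

            publish(&n, 42);
            printf("key=%d\n", lookup()->key);
            return 0;
    }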
diff --git a/datapath/linux/compat/nf_conncount.c b/datapath/linux/compat/nf_conncount.c
index eeae440f872d..97bdfb933b95 100644
--- a/datapath/linux/compat/nf_conncount.c
+++ b/datapath/linux/compat/nf_conncount.c
@@ -38,12 +38,6 @@

 #define CONNCOUNT_SLOTS		256U

-#ifdef CONFIG_LOCKDEP
-#define CONNCOUNT_LOCK_SLOTS	8U
-#else
-#define CONNCOUNT_LOCK_SLOTS	256U
-#endif
-
 #define CONNCOUNT_GC_MAX_NODES	8
 #define MAX_KEYLEN		5
@@ -54,7 +48,6 @@ struct nf_conncount_tuple {
 	struct nf_conntrack_zone	zone;
 	int				cpu;
 	u32				jiffies32;
-	struct rcu_head			rcu_head;
 };

 struct nf_conncount_rb {
@@ -64,7 +57,7 @@ struct nf_conncount_rb {
 	struct rcu_head rcu_head;
 };

-static spinlock_t nf_conncount_locks[CONNCOUNT_LOCK_SLOTS] __cacheline_aligned_in_smp;
+static spinlock_t nf_conncount_locks[CONNCOUNT_SLOTS] __cacheline_aligned_in_smp;

 struct nf_conncount_data {
 	unsigned int keylen;
@@ -93,74 +86,25 @@ static int key_diff(const u32 *a, const u32 *b, unsigned int klen)
 	return memcmp(a, b, klen * sizeof(u32));
 }

-static enum nf_conncount_list_add
-nf_conncount_add(struct nf_conncount_list *list,
-		 const struct nf_conntrack_tuple *tuple,
-		 const struct nf_conntrack_zone *zone)
-{
-	struct nf_conncount_tuple *conn;
-
-	if (WARN_ON_ONCE(list->count > INT_MAX))
-		return NF_CONNCOUNT_ERR;
-
-	conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
-	if (conn == NULL)
-		return NF_CONNCOUNT_ERR;
-
-	conn->tuple = *tuple;
-	conn->zone = *zone;
-	conn->cpu = raw_smp_processor_id();
-	conn->jiffies32 = (u32)jiffies;
-	spin_lock(&list->list_lock);
-	if (list->dead == true) {
-		kmem_cache_free(conncount_conn_cachep, conn);
-		spin_unlock(&list->list_lock);
-		return NF_CONNCOUNT_SKIP;
-	}
-	list_add_tail(&conn->node, &list->head);
-	list->count++;
-	spin_unlock(&list->list_lock);
-	return NF_CONNCOUNT_ADDED;
-}
-
-static void __conn_free(struct rcu_head *h)
-{
-	struct nf_conncount_tuple *conn;
-
-	conn = container_of(h, struct nf_conncount_tuple, rcu_head);
-	kmem_cache_free(conncount_conn_cachep, conn);
-}
-
-static bool conn_free(struct nf_conncount_list *list,
+static void conn_free(struct nf_conncount_list *list,
 		      struct nf_conncount_tuple *conn)
 {
-	bool free_entry = false;
-
-	spin_lock(&list->list_lock);
-
-	if (list->count == 0) {
-		spin_unlock(&list->list_lock);
-		return free_entry;
-	}
+	lockdep_assert_held(&list->list_lock);

 	list->count--;
-	list_del_rcu(&conn->node);
-	if (list->count == 0)
-		free_entry = true;
+	list_del(&conn->node);

-	spin_unlock(&list->list_lock);
-	call_rcu(&conn->rcu_head, __conn_free);
-	return free_entry;
+	kmem_cache_free(conncount_conn_cachep, conn);
 }

 static const struct nf_conntrack_tuple_hash *
 find_or_evict(struct net *net, struct nf_conncount_list *list,
-	      struct nf_conncount_tuple *conn, bool *free_entry)
+	      struct nf_conncount_tuple *conn)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	unsigned long a, b;
 	int cpu = raw_smp_processor_id();
-	__s32 age;
+	u32 age;

 	found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
 	if (found)
@@ -175,52 +119,45 @@ find_or_evict(struct net *net, struct nf_conncount_list *list,
 	 */
 	age = a - b;
 	if (conn->cpu == cpu || age >= 2) {
-		*free_entry = conn_free(list, conn);
+		conn_free(list, conn);
 		return ERR_PTR(-ENOENT);
 	}

 	return ERR_PTR(-EAGAIN);
 }

-static void nf_conncount_lookup(struct net *net,
-				struct nf_conncount_list *list,
-				const struct nf_conntrack_tuple *tuple,
-				const struct nf_conntrack_zone *zone,
-				bool *addit)
+static int __nf_conncount_add(struct net *net,
+			      struct nf_conncount_list *list,
+			      const struct nf_conntrack_tuple *tuple,
+			      const struct nf_conntrack_zone *zone)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct nf_conncount_tuple *conn, *conn_n;
 	struct nf_conn *found_ct;
 	unsigned int collect = 0;
-	bool free_entry = false;
-
-	/* best effort only */
-	*addit = tuple ? true : false;

 	/* check the saved connections */
 	list_for_each_entry_safe(conn, conn_n, &list->head, node) {
 		if (collect > CONNCOUNT_GC_MAX_NODES)
 			break;

-		found = find_or_evict(net, list, conn, &free_entry);
+		found = find_or_evict(net, list, conn);
 		if (IS_ERR(found)) {
 			/* Not found, but might be about to be confirmed */
 			if (PTR_ERR(found) == -EAGAIN) {
-				if (!tuple)
-					continue;
-
 				if (nf_ct_tuple_equal(&conn->tuple, tuple) &&
 				    nf_ct_zone_id(&conn->zone, conn->zone.dir) ==
 				    nf_ct_zone_id(zone, zone->dir))
-					*addit = false;
-			} else if (PTR_ERR(found) == -ENOENT)
+					return 0; /* already exists */
+			} else {
 				collect++;
+			}
 			continue;
 		}

 		found_ct = nf_ct_tuplehash_to_ctrack(found);

-		if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple) &&
+		if (nf_ct_tuple_equal(&conn->tuple, tuple) &&
 		    nf_ct_zone_equal(found_ct, zone, zone->dir)) {
 			/*
 			 * We should not see tuples twice unless someone hooks
@@ -228,7 +165,8 @@ static void nf_conncount_lookup(struct net *net,
 			 *
 			 * Attempt to avoid a re-add in this case.
 			 */
-			*addit = false;
+			nf_ct_put(found_ct);
+			return 0;
 		} else if (already_closed(found_ct)) {
 			/*
 			 * we do not care about connections which are
@@ -242,34 +180,64 @@ static void nf_conncount_lookup(struct net *net,
 		nf_ct_put(found_ct);
 	}
+
+	if (WARN_ON_ONCE(list->count > INT_MAX))
+		return -EOVERFLOW;
+
+	conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
+	if (conn == NULL)
+		return -ENOMEM;
+
+	conn->tuple = *tuple;
+	conn->zone = *zone;
+	conn->cpu = raw_smp_processor_id();
+	conn->jiffies32 = (u32)jiffies;
+	list_add_tail(&conn->node, &list->head);
+	list->count++;
+	return 0;
+}
+
+int nf_conncount_add(struct net *net,
+		     struct nf_conncount_list *list,
+		     const struct nf_conntrack_tuple *tuple,
+		     const struct nf_conntrack_zone *zone)
+{
+	int ret;
+
+	/* check the saved connections */
+	spin_lock_bh(&list->list_lock);
+	ret = __nf_conncount_add(net, list, tuple, zone);
+	spin_unlock_bh(&list->list_lock);
+
+	return ret;
 }

 static void nf_conncount_list_init(struct nf_conncount_list *list)
 {
 	spin_lock_init(&list->list_lock);
 	INIT_LIST_HEAD(&list->head);
-	list->count = 1;
-	list->dead = false;
+	list->count = 0;
 }

-/* Return true if the list is empty */
+/* Return true if the list is empty. Must be called with BH disabled. */
 static bool nf_conncount_gc_list(struct net *net,
-				  struct nf_conncount_list *list)
+				 struct nf_conncount_list *list)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct nf_conncount_tuple *conn, *conn_n;
 	struct nf_conn *found_ct;
 	unsigned int collected = 0;
-	bool free_entry = false;
+	bool ret = false;
+
+	/* don't bother if other cpu is already doing GC */
+	if (!spin_trylock(&list->list_lock))
+		return false;

 	list_for_each_entry_safe(conn, conn_n, &list->head, node) {
-		found = find_or_evict(net, list, conn, &free_entry);
+		found = find_or_evict(net, list, conn);
 		if (IS_ERR(found)) {
-			if (PTR_ERR(found) == -ENOENT) {
-				if (free_entry)
-					return true;
+			if (PTR_ERR(found) == -ENOENT)
 				collected++;
-			}
 			continue;
 		}
@@ -280,17 +248,21 @@ static bool nf_conncount_gc_list(struct net *net,
 			 *	closed already -> ditch it
 			 */
 			nf_ct_put(found_ct);
-			if (conn_free(list, conn))
-				return true;
+			conn_free(list, conn);
 			collected++;
 			continue;
 		}

 		nf_ct_put(found_ct);
 		if (collected > CONNCOUNT_GC_MAX_NODES)
-			return false;
+			break;
 	}
-	return false;
+
+	if (!list->count)
+		ret = true;
+	spin_unlock(&list->list_lock);
+
+	return ret;
 }

 static void __tree_nodes_free(struct rcu_head *h)
@@ -301,6 +273,7 @@ static void __tree_nodes_free(struct rcu_head *h)
 	kmem_cache_free(conncount_rb_cachep, rbconn);
 }

+/* caller must hold tree nf_conncount_locks[] lock */
 static void tree_nodes_free(struct rb_root *root,
 			    struct nf_conncount_rb *gc_nodes[],
 			    unsigned int gc_count)
@@ -310,8 +283,7 @@ static void tree_nodes_free(struct rb_root *root,
 	while (gc_count) {
 		rbconn = gc_nodes[--gc_count];
 		spin_lock(&rbconn->list.list_lock);
-		if (rbconn->list.count == 0 && rbconn->list.dead == false) {
-			rbconn->list.dead = true;
+		if (!rbconn->list.count) {
 			rb_erase(&rbconn->node, root);
 			call_rcu(&rbconn->rcu_head, __tree_nodes_free);
 		}
@@ -331,20 +303,19 @@ insert_tree(struct net *net,
 	    struct rb_root *root,
 	    unsigned int hash,
 	    const u32 *key,
-	    u8 keylen,
 	    const struct nf_conntrack_tuple *tuple,
 	    const struct nf_conntrack_zone *zone)
 {
-	enum nf_conncount_list_add ret;
 	struct nf_conncount_rb *gc_nodes[CONNCOUNT_GC_MAX_NODES];
 	struct rb_node **rbnode, *parent;
 	struct nf_conncount_rb *rbconn;
 	struct nf_conncount_tuple *conn;
 	unsigned int count = 0, gc_count = 0;
-	bool node_found = false;
-
-	spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+	u8 keylen = data->keylen;
+	bool do_gc = true;

+	spin_lock_bh(&nf_conncount_locks[hash]);
+restart:
 	parent = NULL;
 	rbnode = &(root->rb_node);
 	while (*rbnode) {
@@ -358,45 +329,32 @@ insert_tree(struct net *net,
 		} else if (diff > 0) {
 			rbnode = &((*rbnode)->rb_right);
 		} else {
-			/* unlikely: other cpu added node already */
-			node_found = true;
-			ret = nf_conncount_add(&rbconn->list, tuple, zone);
-			if (ret == NF_CONNCOUNT_ERR) {
+			int ret;
+
+			ret = nf_conncount_add(net, &rbconn->list, tuple, zone);
+			if (ret)
 				count = 0; /* hotdrop */
-			} else if (ret == NF_CONNCOUNT_ADDED) {
+			else
 				count = rbconn->list.count;
-			} else {
-				/* NF_CONNCOUNT_SKIP, rbconn is already
-				 * reclaimed by gc, insert a new tree node
-				 */
-				node_found = false;
-			}
-			break;
+			tree_nodes_free(root, gc_nodes, gc_count);
+			goto out_unlock;
 		}

 		if (gc_count >= ARRAY_SIZE(gc_nodes))
 			continue;

-		if (nf_conncount_gc_list(net, &rbconn->list))
+		if (do_gc && nf_conncount_gc_list(net, &rbconn->list))
 			gc_nodes[gc_count++] = rbconn;
 	}

 	if (gc_count) {
 		tree_nodes_free(root, gc_nodes, gc_count);
-		/* tree_node_free before new allocation permits
-		 * allocator to re-use newly free'd object.
-		 *
-		 * This is a rare event; in most cases we will find
-		 * existing node to re-use. (or gc_count is 0).
-		 */
-
-		if (gc_count >= ARRAY_SIZE(gc_nodes))
-			schedule_gc_worker(data, hash);
+		schedule_gc_worker(data, hash);
+		gc_count = 0;
+		do_gc = false;
+		goto restart;
 	}

-	if (node_found)
-		goto out_unlock;
-
 	/* expected case: match, insert new node */
 	rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
 	if (rbconn == NULL)
@@ -415,11 +373,12 @@ insert_tree(struct net *net,
 	nf_conncount_list_init(&rbconn->list);
 	list_add(&conn->node, &rbconn->list.head);
 	count = 1;
+	rbconn->list.count = count;

-	rb_link_node(&rbconn->node, parent, rbnode);
+	rb_link_node_rcu(&rbconn->node, parent, rbnode);
 	rb_insert_color(&rbconn->node, root);
 out_unlock:
-	spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+	spin_unlock_bh(&nf_conncount_locks[hash]);
 	return count;
 }
@@ -430,7 +389,6 @@ count_tree(struct net *net,
 	   const struct nf_conntrack_tuple *tuple,
 	   const struct nf_conntrack_zone *zone)
 {
-	enum nf_conncount_list_add ret;
 	struct rb_root *root;
 	struct rb_node *parent;
 	struct nf_conncount_rb *rbconn;
@@ -443,7 +401,6 @@ count_tree(struct net *net,
 	parent = rcu_dereference_raw(root->rb_node);
 	while (parent) {
 		int diff;
-		bool addit;

 		rbconn = rb_entry(parent, struct nf_conncount_rb, node);
@@ -453,31 +410,36 @@ count_tree(struct net *net,
 		} else if (diff > 0) {
 			parent = rcu_dereference_raw(parent->rb_right);
 		} else {
-			/* same source network -> be counted! */
-			nf_conncount_lookup(net, &rbconn->list, tuple, zone,
-					    &addit);
+			int ret;

-			if (!addit)
+			if (!tuple) {
+				nf_conncount_gc_list(net, &rbconn->list);
 				return rbconn->list.count;
+			}

-			ret = nf_conncount_add(&rbconn->list, tuple, zone);
-			if (ret == NF_CONNCOUNT_ERR) {
-				return 0; /* hotdrop */
-			} else if (ret == NF_CONNCOUNT_ADDED) {
-				return rbconn->list.count;
-			} else {
-				/* NF_CONNCOUNT_SKIP, rbconn is already
-				 * reclaimed by gc, insert a new tree node
-				 */
+			spin_lock_bh(&rbconn->list.list_lock);
+			/* Node might be about to be free'd.
+			 * We need to defer to insert_tree() in this case.
+			 */
+			if (rbconn->list.count == 0) {
+				spin_unlock_bh(&rbconn->list.list_lock);
 				break;
 			}
+
+			/* same source network -> be counted! */
+			ret = __nf_conncount_add(net, &rbconn->list, tuple, zone);
+			spin_unlock_bh(&rbconn->list.list_lock);
+			if (ret)
+				return 0; /* hotdrop */
+			else
+				return rbconn->list.count;
 		}
 	}

 	if (!tuple)
 		return 0;

-	return insert_tree(net, data, root, hash, key, keylen, tuple, zone);
+	return insert_tree(net, data, root, hash, key, tuple, zone);
 }

 static void tree_gc_worker(struct work_struct *work)
@@ -488,27 +450,48 @@ static void tree_gc_worker(struct work_struct *work)
 	struct rb_node *node;
 	unsigned int tree, next_tree, gc_count = 0;

-	tree = data->gc_tree % CONNCOUNT_LOCK_SLOTS;
+	tree = data->gc_tree % CONNCOUNT_SLOTS;
 	root = &data->root[tree];

+	local_bh_disable();
 	rcu_read_lock();
 	for (node = rb_first(root); node != NULL; node = rb_next(node)) {
 		rbconn = rb_entry(node, struct nf_conncount_rb, node);
 		if (nf_conncount_gc_list(data->net, &rbconn->list))
-			gc_nodes[gc_count++] = rbconn;
+			gc_count++;
 	}
 	rcu_read_unlock();
+	local_bh_enable();
+
+	cond_resched();

 	spin_lock_bh(&nf_conncount_locks[tree]);
+	if (gc_count < ARRAY_SIZE(gc_nodes))
+		goto next; /* do not bother */

-	if (gc_count) {
-		tree_nodes_free(root, gc_nodes, gc_count);
+	gc_count = 0;
+	node = rb_first(root);
+	while (node != NULL) {
+		rbconn = rb_entry(node, struct nf_conncount_rb, node);
+		node = rb_next(node);
+
+		if (rbconn->list.count > 0)
+			continue;
+
+		gc_nodes[gc_count++] = rbconn;
+		if (gc_count >= ARRAY_SIZE(gc_nodes)) {
+			tree_nodes_free(root, gc_nodes, gc_count);
+			gc_count = 0;
+		}
 	}
+
+	tree_nodes_free(root, gc_nodes, gc_count);
+next:
+	clear_bit(tree, data->pending_trees);

 	next_tree = (tree + 1) % CONNCOUNT_SLOTS;
-	next_tree = find_next_bit(data->pending_trees, next_tree, CONNCOUNT_SLOTS);
+	next_tree = find_next_bit(data->pending_trees, CONNCOUNT_SLOTS, next_tree);

 	if (next_tree < CONNCOUNT_SLOTS) {
 		data->gc_tree = next_tree;
@@ -533,7 +516,7 @@ unsigned int rpl_nf_conncount_count(struct net *net,
 EXPORT_SYMBOL_GPL(rpl_nf_conncount_count);

 struct nf_conncount_data *rpl_nf_conncount_init(struct net *net, unsigned int family,
-						unsigned int keylen)
+					    unsigned int keylen)
 {
 	struct nf_conncount_data *data;
 	int ret, i;
@@ -609,10 +592,7 @@ int rpl_nf_conncount_modinit(void)
 {
 	int i;

-	BUILD_BUG_ON(CONNCOUNT_LOCK_SLOTS > CONNCOUNT_SLOTS);
-	BUILD_BUG_ON((CONNCOUNT_SLOTS % CONNCOUNT_LOCK_SLOTS) != 0);
-
-	for (i = 0; i < CONNCOUNT_LOCK_SLOTS; ++i)
+	for (i = 0; i < CONNCOUNT_SLOTS; ++i)
 		spin_lock_init(&nf_conncount_locks[i]);

 	conncount_conn_cachep = kmem_cache_create("nf_conncount_tuple",
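
One of the subtler fixes in this series is a007232 ("fix argument order to
find_next_bit"), which the reworked tree_gc_worker() above relies on to
reschedule itself: the kernel prototype is find_next_bit(addr, size,
offset), while the old call passed (addr, offset, size). A standalone
userspace sketch of the failure mode (the helper below is a minimal
stand-in for the kernel's, and the slot numbers are made up):

    #include <stdio.h>

    #define CONNCOUNT_SLOTS 256U

    /* Minimal stand-in with the kernel helper's contract: return the
     * index of the next set bit in [offset, size), or size if none is
     * found.  (Assumes 64-bit unsigned long.) */
    static unsigned long find_next_bit(const unsigned long *addr,
                                       unsigned long size,
                                       unsigned long offset)
    {
            unsigned long i;

            for (i = offset; i < size; i++)
                    if (addr[i / 64] & (1UL << (i % 64)))
                            return i;
            return size;
    }

    int main(void)
    {
            unsigned long pending[CONNCOUNT_SLOTS / 64] = { 0 };
            unsigned long next_tree = 10;  /* slot after the one just collected */

            pending[200 / 64] |= 1UL << (200 % 64);  /* tree 200 still pending */

            /* fixed argument order: finds tree 200 */
            printf("fixed: %lu\n",
                   find_next_bit(pending, CONNCOUNT_SLOTS, next_tree));
            /* swapped order: the scan is capped at slot 10, so it reports
             * "none found" and trees above the current slot are never
             * rescheduled for gc */
            printf("buggy: %lu\n",
                   find_next_bit(pending, next_tree, CONNCOUNT_SLOTS));
            return 0;
    }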