diff mbox

[1/3] netfilter: nf_tables: fix suboptimal set selection

Message ID 1388956728-6754-2-git-send-email-pablo@netfilter.org
State Changes Requested
Headers show

Commit Message

Pablo Neira Ayuso Jan. 5, 2014, 9:18 p.m. UTC
The rb-tree is currently used for simple sets and maps with no
intervals which is suboptimal. Fix it by adding the weight field
to each existing set implementation, this value allows to select
the best candidate in case that several set types can be used.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |   13 +++++++++++++
 net/netfilter/nf_tables_api.c     |   10 +++++-----
 net/netfilter/nft_hash.c          |    1 +
 net/netfilter/nft_rbtree.c        |    1 +
 4 files changed, 20 insertions(+), 5 deletions(-)

Comments

Patrick McHardy Jan. 5, 2014, 9:28 p.m. UTC | #1
On Sun, Jan 05, 2014 at 10:18:46PM +0100, Pablo Neira Ayuso wrote:
> The rb-tree is currently used for simple sets and maps with no
> intervals which is suboptimal. Fix it by adding the weight field
> to each existing set implementation, this value allows to select
> the best candidate in case that several set types can be used.

Ohh, lets do this properly. This is one point why I was opposed to a
merge at this stage, lets not paper over this but fix this how it
was intended. I'll work on that after finishing the inet family.
Until then no harm is done, just some memory waste.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Jan. 5, 2014, 9:34 p.m. UTC | #2
On Sun, Jan 05, 2014 at 09:28:35PM +0000, Patrick McHardy wrote:
> On Sun, Jan 05, 2014 at 10:18:46PM +0100, Pablo Neira Ayuso wrote:
> > The rb-tree is currently used for simple sets and maps with no
> > intervals which is suboptimal. Fix it by adding the weight field
> > to each existing set implementation, this value allows to select
> > the best candidate in case that several set types can be used.
> 
> Ohh, lets do this properly. This is one point why I was opposed to a
> merge at this stage, lets not paper over this but fix this how it
> was intended. I'll work on that after finishing the inet family.
> Until then no harm is done, just some memory waste.

Please, elaborate your plans on this.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy Jan. 5, 2014, 9:45 p.m. UTC | #3
On Sun, Jan 05, 2014 at 10:34:03PM +0100, Pablo Neira Ayuso wrote:
> On Sun, Jan 05, 2014 at 09:28:35PM +0000, Patrick McHardy wrote:
> > On Sun, Jan 05, 2014 at 10:18:46PM +0100, Pablo Neira Ayuso wrote:
> > > The rb-tree is currently used for simple sets and maps with no
> > > intervals which is suboptimal. Fix it by adding the weight field
> > > to each existing set implementation, this value allows to select
> > > the best candidate in case that several set types can be used.
> > 
> > Ohh, lets do this properly. This is one point why I was opposed to a
> > merge at this stage, lets not paper over this but fix this how it
> > was intended. I'll work on that after finishing the inet family.
> > Until then no harm is done, just some memory waste.
> 
> Please, elaborate your plans on this.

Well, the selection should happen based on the set contents, not on
some static weight. Basically we have memory usage and performance
as measurable weights, and both depend on the contents of the set,
so we need an abstrace description of those.

I have another set type coming up, static weights will not help to
make a good decision long term. This won't be easy to get right,
but I really don't want to paper over this since this just allows
to delay to proper fix that will be needed at some point anyway.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Jan. 5, 2014, 10:11 p.m. UTC | #4
On Sun, Jan 05, 2014 at 09:45:31PM +0000, Patrick McHardy wrote:
> On Sun, Jan 05, 2014 at 10:34:03PM +0100, Pablo Neira Ayuso wrote:
> > On Sun, Jan 05, 2014 at 09:28:35PM +0000, Patrick McHardy wrote:
> > > On Sun, Jan 05, 2014 at 10:18:46PM +0100, Pablo Neira Ayuso wrote:
> > > > The rb-tree is currently used for simple sets and maps with no
> > > > intervals which is suboptimal. Fix it by adding the weight field
> > > > to each existing set implementation, this value allows to select
> > > > the best candidate in case that several set types can be used.
> > > 
> > > Ohh, lets do this properly. This is one point why I was opposed to a
> > > merge at this stage, lets not paper over this but fix this how it
> > > was intended. I'll work on that after finishing the inet family.
> > > Until then no harm is done, just some memory waste.
> > 
> > Please, elaborate your plans on this.
> 
> Well, the selection should happen based on the set contents, not on
> some static weight. Basically we have memory usage and performance
> as measurable weights, and both depend on the contents of the set,
> so we need an abstrace description of those.

I agree on that the more information we provide to the kernel, the
best decision regarding the set type and its configuration will be
made.

This simple weight thing that this patch adds is just a way to
break a tie in case two sets provides the features that are needed
(this is the only description information that we have at this moment
from userspace). I think for simple sets (with no intervals) hash
should be prefered to rb-trees.

But if bitmap sets come into place, the selection between hash and
bitmaps cannot be done by this weight this patch adds, as we need more
information on the set elements to make the right choice. That I can
think, we need the distance between the two elements that are further
from each other to calculate the memory consumption at least.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy Jan. 5, 2014, 10:21 p.m. UTC | #5
On Sun, Jan 05, 2014 at 11:11:05PM +0100, Pablo Neira Ayuso wrote:
> On Sun, Jan 05, 2014 at 09:45:31PM +0000, Patrick McHardy wrote:
> > On Sun, Jan 05, 2014 at 10:34:03PM +0100, Pablo Neira Ayuso wrote:
> > > On Sun, Jan 05, 2014 at 09:28:35PM +0000, Patrick McHardy wrote:
> > > > On Sun, Jan 05, 2014 at 10:18:46PM +0100, Pablo Neira Ayuso wrote:
> > > > > The rb-tree is currently used for simple sets and maps with no
> > > > > intervals which is suboptimal. Fix it by adding the weight field
> > > > > to each existing set implementation, this value allows to select
> > > > > the best candidate in case that several set types can be used.
> > > > 
> > > > Ohh, lets do this properly. This is one point why I was opposed to a
> > > > merge at this stage, lets not paper over this but fix this how it
> > > > was intended. I'll work on that after finishing the inet family.
> > > > Until then no harm is done, just some memory waste.
> > > 
> > > Please, elaborate your plans on this.
> > 
> > Well, the selection should happen based on the set contents, not on
> > some static weight. Basically we have memory usage and performance
> > as measurable weights, and both depend on the contents of the set,
> > so we need an abstrace description of those.
> 
> I agree on that the more information we provide to the kernel, the
> best decision regarding the set type and its configuration will be
> made.
> 
> This simple weight thing that this patch adds is just a way to
> break a tie in case two sets provides the features that are needed
> (this is the only description information that we have at this moment
> from userspace). I think for simple sets (with no intervals) hash
> should be prefered to rb-trees.

Indeed, but not based on the fact that its a hash, but based on
performance and memory usage.

> But if bitmap sets come into place, the selection between hash and
> bitmaps cannot be done by this weight this patch adds, as we need more
> information on the set elements to make the right choice. That I can
> think, we need the distance between the two elements that are further
> from each other to calculate the memory consumption at least.

Correct. I would really prefer to keep the waste to keep the incentive
to fix this properly up. Its definitely doable, lets keep the pain
up to fix this properly, nobody is using this in production anyway.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 5a91abf..82920e8 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -143,6 +143,17 @@  struct nft_set_iter {
 };
 
 /**
+ *	enum nft_set_prio - nf_tables set priority
+ *
+ *	This is used to set preference in case that all set types provide the
+ *	same features.
+ */
+enum nft_set_prio {
+	NFT_SET_PRIO_HASH	= 0,
+	NFT_SET_PRIO_RBTREE,
+};
+
+/**
  *	struct nft_set_ops - nf_tables set operations
  *
  *	@lookup: look up an element within the set
@@ -155,6 +166,7 @@  struct nft_set_iter {
  *	@list: nf_tables_set_ops list node
  *	@owner: module reference
  *	@features: features supported by the implementation
+ *	@priority: priority of this set type
  */
 struct nft_set_ops {
 	bool				(*lookup)(const struct nft_set *set,
@@ -178,6 +190,7 @@  struct nft_set_ops {
 	struct list_head		list;
 	struct module			*owner;
 	u32				features;
+	u32				priority;
 };
 
 int nft_register_set(struct nft_set_ops *ops);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0d4b42d..60efb61 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1857,7 +1857,7 @@  EXPORT_SYMBOL_GPL(nft_unregister_set);
 
 static const struct nft_set_ops *nft_select_set_ops(const struct nlattr * const nla[])
 {
-	const struct nft_set_ops *ops;
+	const struct nft_set_ops *ops, *cand = NULL;
 	u32 features;
 
 #ifdef CONFIG_MODULES
@@ -1875,14 +1875,14 @@  static const struct nft_set_ops *nft_select_set_ops(const struct nlattr * const
 		features &= NFT_SET_INTERVAL | NFT_SET_MAP;
 	}
 
-	// FIXME: implement selection properly
 	list_for_each_entry(ops, &nf_tables_set_ops, list) {
 		if ((ops->features & features) != features)
 			continue;
-		if (!try_module_get(ops->owner))
-			continue;
-		return ops;
+		if (!cand || cand->priority > ops->priority)
+			cand = ops;
 	}
+	if (cand && try_module_get(cand->owner))
+		return 0;
 
 	return ERR_PTR(-EOPNOTSUPP);
 }
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 3d3f8fc..f640c1c 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -210,6 +210,7 @@  static struct nft_set_ops nft_hash_ops __read_mostly = {
 	.lookup		= nft_hash_lookup,
 	.walk		= nft_hash_walk,
 	.features	= NFT_SET_MAP,
+	.priority	= NFT_SET_PRIO_HASH,
 	.owner		= THIS_MODULE,
 };
 
diff --git a/net/netfilter/nft_rbtree.c b/net/netfilter/nft_rbtree.c
index ca0c1b2..acddc64 100644
--- a/net/netfilter/nft_rbtree.c
+++ b/net/netfilter/nft_rbtree.c
@@ -226,6 +226,7 @@  static struct nft_set_ops nft_rbtree_ops __read_mostly = {
 	.lookup		= nft_rbtree_lookup,
 	.walk		= nft_rbtree_walk,
 	.features	= NFT_SET_INTERVAL | NFT_SET_MAP,
+	.priority	= NFT_SET_PRIO_RBTREE,
 	.owner		= THIS_MODULE,
 };