
[2/5] netfilter: nft_rbtree: no need for spinlock from set destroy path

Message ID 1411464288-18069-3-git-send-email-pablo@netfilter.org
State Accepted, archived
Delegated to: David Miller

Commit Message

Pablo Neira Ayuso Sept. 23, 2014, 9:24 a.m. UTC
The sets are released from the rcu callback, after the rule is removed
from the chain list, which implies that nfnetlink cannot update the
rbtree and no packets are walking the set anymore. Thus, we can get
rid of the spinlock in the set destroy path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Thomas Graf <tgraf@suug.ch>
---
 net/netfilter/nft_rbtree.c |    2 --
 1 file changed, 2 deletions(-)

Comments

Eric Dumazet Sept. 23, 2014, 9:52 a.m. UTC | #1
On Tue, 2014-09-23 at 11:24 +0200, Pablo Neira Ayuso wrote:
> The sets are released from the rcu callback, after the rule is removed
> from the chain list, which implies that nfnetlink cannot update the
> rbtree and no packets are walking the set anymore. Thus, we can get
> rid of the spinlock in the set destroy path.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> Reviewed-by: Thomas Graf <tgraf@suug.ch>
> ---
>  net/netfilter/nft_rbtree.c |    2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/net/netfilter/nft_rbtree.c b/net/netfilter/nft_rbtree.c
> index e1836ff..46214f2 100644
> --- a/net/netfilter/nft_rbtree.c
> +++ b/net/netfilter/nft_rbtree.c
> @@ -234,13 +234,11 @@ static void nft_rbtree_destroy(const struct nft_set *set)
>  	struct nft_rbtree_elem *rbe;
>  	struct rb_node *node;
>  
> -	spin_lock_bh(&nft_rbtree_lock);
>  	while ((node = priv->root.rb_node) != NULL) {
>  		rb_erase(node, &priv->root);
>  		rbe = rb_entry(node, struct nft_rbtree_elem, node);
>  		nft_rbtree_elem_destroy(set, rbe);
>  	}
> -	spin_unlock_bh(&nft_rbtree_lock);
>  }
>  
>  static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,

BTW, do you know if destroying an rbtree is faster this way, or using
rb_first()?

Most cases I see in the kernel use rb_first() instead of taking the
root.

Examples (it's not an exhaustive list):

net/netfilter/xt_connlimit.c:402
net/sched/sch_netem.c:380
net/sched/sch_fq.c:519
drivers/infiniband/hw/mlx4/cm.c:439
drivers/iommu/iova.c:324
drivers/md/dm-thin.c:1491
drivers/mtd/mtdswap.c:625
drivers/mtd/ubi/attach.c:636

This might be better for large trees, to get better cache locality,
but I have no experimental data.
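
Something like this, untested, just to illustrate the rb_first() variant
(assuming the same nft_set_priv()/nft_rbtree_elem_destroy() helpers used
by the patch above):

static void nft_rbtree_destroy(const struct nft_set *set)
{
	struct nft_rbtree *priv = nft_set_priv(set);
	struct nft_rbtree_elem *rbe;
	struct rb_node *node;

	/* Erase the leftmost node on each iteration instead of the
	 * root, as the examples above do. */
	while ((node = rb_first(&priv->root)) != NULL) {
		rb_erase(node, &priv->root);
		rbe = rb_entry(node, struct nft_rbtree_elem, node);
		nft_rbtree_elem_destroy(set, rbe);
	}
}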


Pablo Neira Ayuso Sept. 23, 2014, 11:01 a.m. UTC | #2
On Tue, Sep 23, 2014 at 02:52:37AM -0700, Eric Dumazet wrote:
> On Tue, 2014-09-23 at 11:24 +0200, Pablo Neira Ayuso wrote:
> > The sets are released from the rcu callback, after the rule is removed
> > from the chain list, which implies that nfnetlink cannot update the
> > rbtree and no packets are walking the set anymore. Thus, we can get
> > rid of the spinlock in the set destroy path.
> > 
> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > Reviewed-by: Thomas Graf <tgraf@suug.ch>
> > ---
> >  net/netfilter/nft_rbtree.c |    2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/net/netfilter/nft_rbtree.c b/net/netfilter/nft_rbtree.c
> > index e1836ff..46214f2 100644
> > --- a/net/netfilter/nft_rbtree.c
> > +++ b/net/netfilter/nft_rbtree.c
> > @@ -234,13 +234,11 @@ static void nft_rbtree_destroy(const struct nft_set *set)
> >  	struct nft_rbtree_elem *rbe;
> >  	struct rb_node *node;
> >  
> > -	spin_lock_bh(&nft_rbtree_lock);
> >  	while ((node = priv->root.rb_node) != NULL) {
> >  		rb_erase(node, &priv->root);
> >  		rbe = rb_entry(node, struct nft_rbtree_elem, node);
> >  		nft_rbtree_elem_destroy(set, rbe);
> >  	}
> > -	spin_unlock_bh(&nft_rbtree_lock);
> >  }
> >  
> >  static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,
> 
> BTW, do you know if destroying an rbtree is faster this way, or using
> rb_first()?
> 
> Most cases I see in the kernel use rb_first() instead of taking the
> root.
> 
> Examples (it's not an exhaustive list):
> 
> net/netfilter/xt_connlimit.c:402
> net/sched/sch_netem.c:380
> net/sched/sch_fq.c:519
> drivers/infiniband/hw/mlx4/cm.c:439
> drivers/iommu/iova.c:324
> drivers/md/dm-thin.c:1491
> drivers/mtd/mtdswap.c:625
> drivers/mtd/ubi/attach.c:636
> 
> This might be better for large trees, to get better cache locality,
> but I have no experimental data.

I'll send a follow up patch for nf-next to use rb_first() in that
patch. Thanks Eric.
Eric Dumazet Sept. 23, 2014, 11:54 a.m. UTC | #3
On Tue, 2014-09-23 at 13:01 +0200, Pablo Neira Ayuso wrote:

> I'll send a follow up patch for nf-next to use rb_first() in that
> patch. Thanks Eric.

I did a test, and it's indeed a bit faster to use rb_first(), by about 5%.

The real win is to be able to build a chain using rb_first()/rb_next()
(leaving the tree as is), then delete the items in the chain, and
simply reset rb_root.

This only needs one spare pointer per element to store the item->next
link.

This is then about 50% faster, because we do not constantly rebalance
the tree for every removed item.
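
Roughly like this, untested; the helper name and the "destroy_next"
field below are made up, since nft_rbtree_elem has no spare pointer
today:

static void nft_rbtree_destroy_chained(const struct nft_set *set)
{
	struct nft_rbtree *priv = nft_set_priv(set);
	struct nft_rbtree_elem *rbe, *head = NULL;
	struct rb_node *node;

	/* First pass: link the elements into a simple chain while
	 * leaving the tree untouched, so nothing gets rebalanced. */
	for (node = rb_first(&priv->root); node != NULL; node = rb_next(node)) {
		rbe = rb_entry(node, struct nft_rbtree_elem, node);
		rbe->destroy_next = head;	/* hypothetical field */
		head = rbe;
	}

	/* Forget the tree in one go, then free the chained elements. */
	priv->root = RB_ROOT;
	while (head != NULL) {
		rbe = head;
		head = rbe->destroy_next;
		nft_rbtree_elem_destroy(set, rbe);
	}
}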



Pablo Neira Ayuso Sept. 23, 2014, 4:10 p.m. UTC | #4
On Tue, Sep 23, 2014 at 04:54:05AM -0700, Eric Dumazet wrote:
> On Tue, 2014-09-23 at 13:01 +0200, Pablo Neira Ayuso wrote:
> 
> > I'll send a follow up patch for nf-next to use rb_first() in that
> > patch. Thanks Eric.
> 
> I did a test, and it's indeed a bit faster to use rb_first(), by about 5%.
> 
> The real win is to be able to build a chain using rb_first()/rb_next()
> (leaving the tree as is), then delete the items in the chain, and
> simply reset rb_root.
> 
> This only needs one spare pointer per element to store the item->next
> link.
> 
> This is then about 50% faster, because we do not constantly rebalance
> the tree for every removed item.

Indeed.

struct nft_rbtree_elem {
        struct rb_node          node;
        u16                     flags;
        struct nft_data         key;
        struct nft_data         data[];
};

Actually, I could add a pointer to the union area of nft_data, but I'm
not very comfortable with adding it for this specific case. At this
moment we're releasing this from the rcu callback, which is "hiding" the
deletion time from the netlink interface.
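
Just to sketch where such a pointer would live, purely hypothetical and
not the real nft_data layout:

struct nft_data {
	union {
		u32	data[4];	/* regular value storage */
		/* hypothetical: spare pointer, only meaningful once the
		 * set is being destroyed and nothing walks it anymore */
		void	*destroy_next;
	};
};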

But I'll keep this in the back of my mind in case we later have some
pointer candidate that can be reused in a nice way.

I'll send a patch to make the rb_first()/rb_next() conversion though.

Thanks for your comments!
Eric Dumazet Sept. 23, 2014, 4:29 p.m. UTC | #5
On Tue, 2014-09-23 at 18:10 +0200, Pablo Neira Ayuso wrote:

> Actually, I could add a pointer to the union area of nft_data, but I'm
> not very comfortable with adding it for this specific case. At this
> moment we're releasing this from the rcu callback, which is "hiding" the
> deletion time from the netlink interface.
> 
> But I'll keep this in the back of my mind in case we later have some
> pointer candidate that can be reused in a nice way.
> 
> I'll send a patch to make the rb_first()/rb_next() conversion though.

Ah, forget what I said, it's actually slower to build a list, by about
7%; I had an error in my test.

So the rb_first() thing is the easy and fast way.

Thanks



Patch

diff --git a/net/netfilter/nft_rbtree.c b/net/netfilter/nft_rbtree.c
index e1836ff..46214f2 100644
--- a/net/netfilter/nft_rbtree.c
+++ b/net/netfilter/nft_rbtree.c
@@ -234,13 +234,11 @@  static void nft_rbtree_destroy(const struct nft_set *set)
 	struct nft_rbtree_elem *rbe;
 	struct rb_node *node;
 
-	spin_lock_bh(&nft_rbtree_lock);
 	while ((node = priv->root.rb_node) != NULL) {
 		rb_erase(node, &priv->root);
 		rbe = rb_entry(node, struct nft_rbtree_elem, node);
 		nft_rbtree_elem_destroy(set, rbe);
 	}
-	spin_unlock_bh(&nft_rbtree_lock);
 }
 
 static bool nft_rbtree_estimate(const struct nft_set_desc *desc, u32 features,