
[nf,1/2] netfilter: nft_set_rbtree: move sync GC from insert path to set->ops->commit

Message ID 20230929164404.172081-1-pablo@netfilter.org
State RFC
Delegated to: Pablo Neira

Commit Message

Pablo Neira Ayuso Sept. 29, 2023, 4:44 p.m. UTC
According to 2ee52ae94baa ("netfilter: nft_set_rbtree: skip sync GC for
new elements in this transaction"), new elements in this transaction
might expire before the transaction ends. Skipping sync GC is needed for
such elements, otherwise the commit path might walk over an already
released object.

However, Florian found that while iterating the tree from the insert
path for sync GC, stale references could still remain to expired
elements at the less-or-equal and greater-than boundaries that are used
to narrow down the tree descent and speed up overlap detection; this
triggers bogus overlap errors.

This patch skips expired elements in the overlap detection routine,
which iterates over the reverse-ordered list of elements that represent
the intervals. Since end elements carry no expiration extension, check
the next non-end element of the interval and skip both elements in the
iteration if the interval has expired.

Moreover, move sync GC to the set->ops->commit interface to collect
expired intervals. The GC run needs to ignore the gc_interval because the
tree cannot store duplicated expired elements, otherwise bogus
mismatches might occur.

Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
Hi Florian,

I picked up on this issue; this is proposed as an alternative to:
https://patchwork.ozlabs.org/project/netfilter-devel/patch/20230928131247.3391-1-fw@strlen.de/

Note this ignores the specified gc_interval. For this to work, a copy of the
tree would need to be maintained, similar to what pipapo does.

This approach should clear the path to support for set element timeout
update/refresh.

I am still spinning on this one and running tests to see if I hit any
failure with sets/0044_interval_overlap_{0,1}.

 net/netfilter/nft_set_rbtree.c | 57 +++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 14 deletions(-)

Comments

Pablo Neira Ayuso Sept. 29, 2023, 10:25 p.m. UTC | #1
Hi,

On Fri, Sep 29, 2023 at 06:44:03PM +0200, Pablo Neira Ayuso wrote:
> According to 2ee52ae94baa ("netfilter: nft_set_rbtree: skip sync GC for
> new elements in this transaction"), new elements in this transaction
> might expire before the transaction ends. Skipping sync GC is needed for
> such elements, otherwise the commit path might walk over an already
> released object.
> 
> However, Florian found that while iterating the tree from the insert
> path for sync GC, stale references could still remain to expired
> elements at the less-or-equal and greater-than boundaries that are used
> to narrow down the tree descent and speed up overlap detection; this
> triggers bogus overlap errors.
> 
> This patch skips expired elements in the overlap detection routine,
> which iterates over the reverse-ordered list of elements that represent
> the intervals. Since end elements carry no expiration extension, check
> the next non-end element of the interval and skip both elements in the
> iteration if the interval has expired.
> 
> Moreover, move sync GC to the set->ops->commit interface to collect
> expired intervals. The GC run needs to ignore the gc_interval because the
> tree cannot store duplicated expired elements, otherwise bogus
> mismatches might occur.

I should have tagged this as RFC; it is only partially cooked:

- read spin lock is required for the sync GC to make sure this does
  not zap entries that are being used from the datapath.

- the full GC batch could be used to amortize the memory allocation
  (not only two slots as it happens now, I am recycling an existing
   function).

- ENOMEM on the GC sync commit path could be an issue. It is too late to
  fail. The tree would start collecting expired elements that might
  duplicate existing ones, triggering bogus mismatches. In this path the
  commit_mutex is held, and this set backend does not support
  lockless read, so it might be possible to simply grab the spinlock
  in write mode and release entries immediately, without requiring the
  sync GC batch infrastructure that pipapo is using.

> Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API")
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> Hi Florian,
> 
> I picked up on this issue; this is proposed as an alternative to:
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20230928131247.3391-1-fw@strlen.de/
> 
> Note this ignores the specified gc_interval. For this to work, a copy of the
> tree would need to be maintained, similar to what pipapo does.
> 
> This approach should clear the path to support for set element timeout
> update/refresh.
> 
> I am still spinning on this one and running tests to see if I hit any
> failure with sets/0044_interval_overlap_{0,1}.
> 
>  net/netfilter/nft_set_rbtree.c | 57 +++++++++++++++++++++++++---------
>  1 file changed, 43 insertions(+), 14 deletions(-)
> 
> diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
> index 487572dcd614..de6d812fc790 100644
> --- a/net/netfilter/nft_set_rbtree.c
> +++ b/net/netfilter/nft_set_rbtree.c
> @@ -235,8 +235,7 @@ static void nft_rbtree_gc_remove(struct net *net, struct nft_set *set,
>  
>  static int nft_rbtree_gc_elem(const struct nft_set *__set,
>  			      struct nft_rbtree *priv,
> -			      struct nft_rbtree_elem *rbe,
> -			      u8 genmask)
> +			      struct nft_rbtree_elem *rbe)
>  {
>  	struct nft_set *set = (struct nft_set *)__set;
>  	struct rb_node *prev = rb_prev(&rbe->node);
> @@ -254,8 +253,7 @@ static int nft_rbtree_gc_elem(const struct nft_set *__set,
>  	 */
>  	while (prev) {
>  		rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node);
> -		if (nft_rbtree_interval_end(rbe_prev) &&
> -		    nft_set_elem_active(&rbe_prev->ext, genmask))
> +		if (nft_rbtree_interval_end(rbe_prev))
>  			break;
>  
>  		prev = rb_prev(prev);
> @@ -289,6 +287,34 @@ static int nft_rbtree_gc_elem(const struct nft_set *__set,
>  	return 0;
>  }
>  
> +static void nft_rbtree_commit(const struct nft_set *set)
> +{
> +	struct nft_rbtree *priv = nft_set_priv(set);
> +	struct rb_node *node, *next, *first;
> +	struct nft_rbtree_elem *rbe;
> +
> +	if (!(set->flags & NFT_SET_TIMEOUT))
> +		return;
> +
> +	/* ignore GC interval here, unless two copies of the tree are
> +	 * maintained, it is not possible to postpone removal of expired
> +	 * elements.
> +	 */
> +
> +	first = rb_first(&priv->root);
> +
> +	for (node = first; node != NULL; node = next) {
> +		next = rb_next(node);
> +
> +		rbe = rb_entry(node, struct nft_rbtree_elem, node);
> +
> +		if (!nft_set_elem_expired(&rbe->ext))
> +			continue;
> +
> +		nft_rbtree_gc_elem(set, priv, rbe);
> +	}
> +}
> +
>  static bool nft_rbtree_update_first(const struct nft_set *set,
>  				    struct nft_rbtree_elem *rbe,
>  				    struct rb_node *first)
> @@ -312,9 +338,8 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
>  	struct nft_rbtree_elem *rbe, *rbe_le = NULL, *rbe_ge = NULL;
>  	struct rb_node *node, *next, *parent, **p, *first = NULL;
>  	struct nft_rbtree *priv = nft_set_priv(set);
> -	u8 cur_genmask = nft_genmask_cur(net);
>  	u8 genmask = nft_genmask_next(net);
> -	int d, err;
> +	int d;
>  
>  	/* Descend the tree to search for an existing element greater than the
>  	 * key value to insert that is greater than the new element. This is the
> @@ -358,16 +383,19 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
>  		if (!nft_set_elem_active(&rbe->ext, genmask))
>  			continue;
>  
> -		/* perform garbage collection to avoid bogus overlap reports
> -		 * but skip new elements in this transaction.
> +		/* skip expired intervals to avoid bogus overlap reports:
> +		 * end element has no expiration, check next start element.
>  		 */
> -		if (nft_set_elem_expired(&rbe->ext) &&
> -		    nft_set_elem_active(&rbe->ext, cur_genmask)) {
> -			err = nft_rbtree_gc_elem(set, priv, rbe, genmask);
> -			if (err < 0)
> -				return err;
> +		if (nft_rbtree_interval_end(rbe) && next) {
> +			struct nft_rbtree_elem *rbe_next;
>  
> -			continue;
> +			rbe_next = rb_entry(next, struct nft_rbtree_elem, node);
> +
> +			if (nft_set_elem_expired(&rbe_next->ext)) {
> +				/* skip expired next start element. */
> +				next = rb_next(next);
> +				continue;
> +			}
>  		}
>  
>  		d = nft_rbtree_cmp(set, rbe, new);
> @@ -755,5 +783,6 @@ const struct nft_set_type nft_set_rbtree_type = {
>  		.lookup		= nft_rbtree_lookup,
>  		.walk		= nft_rbtree_walk,
>  		.get		= nft_rbtree_get,
> +		.commit		= nft_rbtree_commit,
>  	},
>  };
> -- 
> 2.30.2
>
Florian Westphal Sept. 30, 2023, 8:10 a.m. UTC | #2
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> - read spin lock is required for the sync GC to make sure this does
>   not zap entries that are being used from the datapath.

It needs to grab the write spinlock for each rb_erase, plus
the seqcount increase to make sure that parallel lookup doesn't
miss an element in the tree.

> - the full GC batch could be used to amortize the memory allocation
>   (not only two slots as it happens now, I am recycling an existing
>    function).

Yes.

> - ENOMEM on the GC sync commit path could be an issue. It is too late to
>   fail. The tree would start collecting expired elements that might
>   duplicate existing ones, triggering bogus mismatches. In this path the
>   commit_mutex is held, and this set backend does not support
>   lockless read,

It does.  If the lockless lookup doesn't return a match, it falls back to the read lock.

>   it might be possible to simply grab the spinlock
>   in write mode and release entries immediately, without requiring the
>   sync GC batch infrastructure that pipapo is using.

Is there evidence that the on-demand GC is a problem?
It only searches in the relevant subtree, it should rarely, if ever,
encounter any expired element.
Pablo Neira Ayuso Oct. 1, 2023, 8:10 p.m. UTC | #3
Hi Florian,

On Sat, Sep 30, 2023 at 10:10:38AM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > - read spin lock is required for the sync GC to make sure this does
> >   not zap entries that are being used from the datapath.
> 
> It needs to grab the write spinlock for each rb_erase, plus
> the seqcount increase to make sure that parallel lookup doesn't
> miss an element in the tree.

Right, a read lock is not enough for sync GC, and it would also be
required by this approach.

> > - the full GC batch could be used to amortize the memory allocation
> >   (not only two slots as it happens now, I am recycling an existing
> >    function).
> 
> Yes.
> 
> > - ENOMEM on the GC sync commit path could be an issue. It is too late to
> >   fail. The tree would start collecting expired elements that might
> >   duplicate existing ones, triggering bogus mismatches. In this path the
> >   commit_mutex is held, and this set backend does not support
> >   lockless read,
> 
> It does.  If lockless doesn't return a match it falls back to readlock.

And it will fall back to the read lock in the case of an update that removes the expired element, right?

> >   it might be possible to simply grab the spinlock
> >   in write mode and release entries immediately, without requiring the
> >   sync GC batch infrastructure that pipapo is using.
> 
> Is there evidence that the on-demand GC is a problem?

After your last fix, not really, other than that we have to be careful not
to zap elements that are part of any pending transaction in this batch.

> It only searches in the relevant subtree, it should rarely, if ever,
> encounter any expired element.

Not a problem now, but the path that we are following blocks a requested
feature.

I currently do not know how to support set element timeout update
with this on-demand GC on the rbtree. This feature will require a new
update state in the transaction to update the timer for an element. We
will have to be careful because on-demand GC might zap an expired
element that got just refreshed in this transaction (unless we
reintroduce some sort of 'busy' bit for updates again which is
something I prefer we do not). With 2ee52ae94baa ("netfilter:
nft_set_rbtree: skip sync GC for new elements in this transaction")
I could fix this by looking at the generation mask to infer that this
element is 'busy' by some pending transaction, but I do not see a way
with an element update command in place.

Maybe take your patch and then follow up in nf-next with this approach
once set element timer update is introduced? Otherwise, rbtree
will block the introduction of this new feature for other sets too.
Florian Westphal Oct. 1, 2023, 9:08 p.m. UTC | #4
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Sep 30, 2023 at 10:10:38AM +0200, Florian Westphal wrote:
> > It does.  If lockless doesn't return a match it falls back to readlock.
> 
> And it will in case of the update to remove the expired element, right?

Yes, but I don't think it's a huge problem, this is short and we do not
grab the wrlock until we actually unlink.

> > >   it might be possible to simply grab the spinlock
> > >   in write mode and release entries immediately, without requiring the
> > >   sync GC batch infrastructure that pipapo is using.
> > 
> > Is there evidence that the on-demand GC is a problem?
> 
> After your last fix, not really, other than that we have to be careful not
> to zap elements that are part of any pending transaction in this batch.

Right.  I don't see a problem with zapping those (newly added ones)
from the point-of-no-return phase onwards, since we don't have to worry
about rollback then.

> > It only searches in the relevant subtree, it should rarely, if ever,
> > encounter any expired element.
> 
> Not a problem now, but this path that we follow blocks a requested
> feature.
>
> I currently do not know how to support set element timeout update
> with this on-demand GC on the rbtree. This feature will require a new
> update state in the transaction to update the timer for an element. We
> will have to be careful because on-demand GC might zap an expired
> element that got just refreshed in this transaction (unless we
> reintroduce some sort of 'busy' bit for updates again which is
> something I prefer we do not). With 2ee52ae94baa ("netfilter:
> nft_set_rbtree: skip sync GC for new elements in this transaction")
> I could fix this by looking at the generation mask to infer that this
> element is 'busy' by some pending transaction, but I do not see a way
> with an element update command in place.

Not convinced. In particular, I don't understand what this has to do
with async vs. sync gc.  To me this depends on how we want to handle
elements that have expired or expire during a transaction.

Example:

Element E1, times out in 1 hour
Element E2, times out in 1 second
Element E3, timed out (1 second ago, 3 minutes ago, doesn't matter).

Userspace batch to kernel:
Update Element E1 to time out in 2 hours.
Update Element E2 to time out in 1 hour.
Update Element E3 to time out in 1 hour.

What is the expected outcome of this request?

Ignore E3 being reaped already and refresh the timeout (resurrection?)
Ignore E3 being reaped already and ignore the request?
Fail the transaction as E3 timed out already ("does not exist")?

Now, what about E2?  If the transaction is large, it could become
like E3 *during the transaction* unless we introduce some freezing
mechanism.  What's the expected outcome?

What's the expected outcome if there is some other, unrelated
failure? I assume we have to roll back all the timeout updates, right?
If so, why not temporarily make the timeouts effective right away
and then roll back?

What if userspace asks to *shrink* timeouts? Same scenario:

Update Element E1 to time out in 1 second
Update Element E2 to time out in 1 second
Update Element E3 to time out in 1 second

We could say 'not supported' of course, which would avoid the
rollback headache, because in this case a long transaction's
gc would definitely start to pick some of those elements
up for reaping...

> Maybe take your patch and then follow up in nf-next with this approach
> once set element timer update is introduced? Otherwise, rbtree
> will block the introduction of this new feature for other sets too.

See above, I would first like to understand the expected
semantics of the timeout update facility.

I've pushed a (not very much tested) version of the gc overhaul
that bases passive lookups on expiry candidates; this removes
the need for gc sequence counters.

It's vs. nf.git but really should be re-targeted to nf-next; I'll
try to do this next week:

https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/log/?h=nft_set_gc_query_08
Pablo Neira Ayuso Oct. 2, 2023, 8:20 a.m. UTC | #5
Hi Florian,

Looking at your series, I don't think we are that far from each other, see
below.

On Sun, Oct 01, 2023 at 11:08:16PM +0200, Florian Westphal wrote:
> I've pushed a (not very much tested) version of the gc overhaul
> that bases passive lookups on expiry candidates; this removes
> the need for gc sequence counters.

This patch ("netfilter: nft_set_rbtree: prefer sync gc to async
worker")

https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/commit/?h=nft_set_gc_query_08&id=edfeb02d758d6a96a3c1c9a483b69e43e5528e87

goes in the same direction I would like to go with my incomplete patch
I posted. However:

+static void nft_rbtree_commit(struct nft_set *set)
+{
+	struct nft_rbtree *priv = nft_set_priv(set);
+
+	if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set)))
+		nft_rbtree_gc(set);
+}

I don't think this time_after_eq() to postpone element removal will
work. According to Stefano, you cannot store duplicated elements in the
rbtree. The same problem already exists for this set backend when a
transaction adds and deletes elements in the same batch, unless we
maintain two copies. I understand you don't want to maintain the two
copies, but then this time_after_eq() needs to go away.

The patch above, adding the .commit interface to rbtree, basically undoes:

https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/commit/?h=nft_set_gc_query_08&id=ee48e86518d62db058efafb6ec1b9f426c441a9d

so better to fix it by adding the .commit interface to the rbtree in
the first place?

According to what I read, it seems we agree on that; the only subtle
difference between your patch and my incomplete patch is this
time_after_eq().

> It's vs. nf.git but really should be re-targeted to nf-next; I'll
> try to do this next week:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/log/?h=nft_set_gc_query_08

Thanks. The gc sequence removal is a different topic we have been
discussing for a while. Would it be possible to incorrectly zap an entry
with the transaction semantics? I mean:

#1 transaction to remove element k in set x y
#2 flush set x y (removes dead element k)
#3 add element k to set x y expires 3 minutes
#4 gc transaction freshly added new element

In this case, no dead flag is set on this new element k, so the GC
transaction will skip it.

As for the element timeout update semantics, I will catch up in a
separated email renaming this thread, as this is a new feature and I
prefer to re-focus the conversation on your branch, it has been me
that has been mixing up different topics anyway.
Florian Westphal Oct. 2, 2023, 8:47 a.m. UTC | #6
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Looking at your series, I don't think we are that far from each other, see
> below.

Agree.

> On Sun, Oct 01, 2023 at 11:08:16PM +0200, Florian Westphal wrote:
> > I've pushed a (not very much tested) version of gc overhaul
> > to passive lookups based on expiry candidates, this removes
> > the need for gc sequence counters.
> 
> This patch ("netfilter: nft_set_rbtree: prefer sync gc to async
> worker")
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/commit/?h=nft_set_gc_query_08&id=edfeb02d758d6a96a3c1c9a483b69e43e5528e87
> 
> goes in the same direction I would like to go with my incomplete patch
> I posted. However:
> 
> +static void nft_rbtree_commit(struct nft_set *set)
> +{
> +	struct nft_rbtree *priv = nft_set_priv(set);
> +
> +	if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set)))
> +		nft_rbtree_gc(set);
> +}
> 
> I don't think this time_after_eq() to postpone element removal will
> work. According to Stefano, you cannot store duplicated elements in
> the rbtree.

Note that in this series the on-demand part is still in place,
so there will be no duplicate elements.

> The same problem already exists for this set backend
> when a transaction adds and deletes elements in the same batch,
> unless we maintain two copies. I understand you don't want to maintain
> the two copies, but then this time_after_eq() needs to go away.

I can remove it, I don't think a full traversal (without doing
anything) will be too costly.

> According to what I read it seems we agree on that, the only subtle
> difference between your patch and my incomplete patch is this
> time_after_eq().

Yes, your patch gets rid of the on-demand gc; I agree that we cannot
postpone a full run in that case.

> > It's vs. nf.git but really should be re-targeted to nf-next; I'll
> > try to do this next week:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/log/?h=nft_set_gc_query_08
> 
> Thanks. The gc sequence removal is a different topic we have been
> discussing for a while.

Yup.  I wanted to explore how much work this is, and it turns
out it gets a lot less ugly if we don't have to handle rbtree and
its end elements.

> Would it be possible to incorrectly zap an entry
> with the transaction semantics? I mean:

Nope, should not happen.

> #1 transaction to remove element k in set x y
> #2 flush set x y (removes dead element k)
> #3 add element k to set x y expires 3 minutes
> #4 gc transaction freshly added new element
>
> In this case, no dead flag is set on this new element k, so the GC
> transaction will skip it.

The GC will do a lookup, will find the element, will
see it's neither dead nor expired, so it will be skipped.

At least that's the idea: entries get zapped only
if they are expired or dead (to handle packet path deletion).

> As for the element timeout update semantics, I will catch up in a
> separated email renaming this thread

Thanks!
Pablo Neira Ayuso Oct. 2, 2023, 10:24 a.m. UTC | #7
On Mon, Oct 02, 2023 at 10:47:46AM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Looking at your series, I don't think we are that far from each other, see
> > below.
> 
> Agree.
> 
> > On Sun, Oct 01, 2023 at 11:08:16PM +0200, Florian Westphal wrote:
> > > I've pushed a (not very much tested) version of gc overhaul
> > > to passive lookups based on expiry candidates, this removes
> > > the need for gc sequence counters.
> > 
> > This patch ("netfilter: nft_set_rbtree: prefer sync gc to async
> > worker")
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/commit/?h=nft_set_gc_query_08&id=edfeb02d758d6a96a3c1c9a483b69e43e5528e87
> > 
> > goes in the same direction I would like to go with my incomplete patch
> > I posted. However:
> > 
> > +static void nft_rbtree_commit(struct nft_set *set)
> > +{
> > +	struct nft_rbtree *priv = nft_set_priv(set);
> > +
> > +	if (time_after_eq(jiffies, priv->last_gc + nft_set_gc_interval(set)))
> > +		nft_rbtree_gc(set);
> > +}
> > 
> > I don't think this time_after_eq() to postpone element removal will
> > work. According to Stefano, you cannot store duplicated elements in
> > the rbtree.
> 
> Note that in this series the on-demand part is still in place,
> there will be no duplicate elements.

Right.

> > The same problem already exists for this set backend
> > when a transaction adds and deletes elements in the same batch,
> > unless we maintain two copies. I understand you don't want to maintain
> > the two copies, but then this time_after_eq() needs to go away.
> 
> I can remove it, I don't think a full traversal (without doing
> anything) will be too costly.

OK, so what is your proposal to move on?

> > According to what I read it seems we agree on that, the only subtle
> > difference between your patch and my incomplete patch is this
> > time_after_eq().
> 
> Yes, your patch gets rid of on-demand gc, I agree that we cannot
> postpone full run in that case.

Yes.

> > > It's vs. nf.git but really should be re-targeted to nf-next; I'll
> > > try to do this next week:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/log/?h=nft_set_gc_query_08
> > 
> > Thanks. The gc sequence removal is a different topic we have been
> > discussing for a while.
> 
> Yup.  I wanted to explore how much work this is, and it turns
> out it gets a lot less ugly if we don't have to handle rbtree and
> its end elements.

OK.

> > Would it be possible to incorrectly zap an entry
> > with the transaction semantics? I mean:
> 
> Nope, should not happen.
> 
> > #1 transaction to remove element k in set x y
> > #2 flush set x y (removes dead element k)
> > #3 add element k to set x y expires 3 minutes
> > #4 gc transaction freshly added new element
> >
> > In this case, no dead flag is set on this new element k, so the GC
> > transaction will skip it.
> 
> The GC will do a lookup, will find the element, will
> see it's neither dead nor expired, so it will be skipped.
>
> At least that's the idea: entries get zapped only
> if they are expired or dead (to handle packet path deletion).

Agreed, it is an extra lookup, but it is a safer approach.

Thanks.
Pablo Neira Ayuso Oct. 2, 2023, 12:42 p.m. UTC | #8
On Sun, Oct 01, 2023 at 11:08:16PM +0200, Florian Westphal wrote:
> Not convinced. In particular, I don't understand what this has to do
> with async vs. sync gc.  To me this depends on how we want to handle
> elements that have expired or expire during a transaction.
> 
> Example:
> 
> Element E1, times out in 1 hour
> Element E2, times out in 1 second
> Element E3, timed out (1 second ago, 3 minutes ago, doesn't matter).
> 
> Userspace batch to kernel:
> Update Element E1 to time out in 2 hours.
> Update Element E2 to time out in 1 hour.
> Update Element E3 to time out in 1 hour.
> 
> What is the expected outcome of this request?
> 
> Ignore E3 being reaped already and refresh the timeout (resurrection?)

No resurrection; the element might have counters, and it has already expired.

> Ignore E3 being reaped already and ignore the request?
> Fail the transaction as E3 timed out already ("does not exist")?

Add a new E3. If NLM_F_EXCL is specified, then fail with "does not exist"

> Now, what about E2?  If the transaction is large, it could become
> like E3 *during the transaction* unless we introduce some freezing
> mechanism.  What's the expected outcome?
> 
> What's the expected outcome if there is some other, unrelated
> failure? I assume we have to roll back all the timeout updates, right?

We annotate the new timeout in the transaction object, then apply the
timeout update in the commit phase.

> If so, why not temporarily make the timeouts effective right away
> and then roll back?

You mean, from the preparation phase? Then we need to undo what has
been done in case the --check / abort path is exercised; this might
just create a bogus element listing.

> What if userspace asks to *shrink* timeouts? Same scenario:
> 
> Update Element E1 to time out in 1 second
> Update Element E2 to time out in 1 second
> Update Element E3 to time out in 1 second
> 
> We could say 'not supported' of course, which would avoid the
> rollback headache, because in this case a long transaction's
> gc would definitely start to pick some of those elements
> up for reaping...

No need for rollback if the new timeout is stored in the transaction
object; we just set the new timeout from the _commit() step in the
NEWSETELEM case, which has to deal with updates. Other objects follow
a similar approach.
Florian Westphal Oct. 2, 2023, 1:58 p.m. UTC | #9
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sun, Oct 01, 2023 at 11:08:16PM +0200, Florian Westphal wrote:
> > Element E1, times out in 1 hour
> > Element E2, times out in 1 second
> > Element E3, timed out (1 second ago, 3 minutes ago, doesn't matter).
> > 
> > Userspace batch to kernel:
> > Update Element E1 to time out in 2 hours.
> > Update Element E2 to time out in 1 hour.
> > Update Element E3 to time out in 1 hour.
> > 
> > What is the expected outcome of this request?
> > 
> > Ignore E3 being reaped already and refresh the timeout (resurrection?)
> 
> No resurrection, the element might have counters, it already expired.

OK.

> > Ignore E3 being reaped already and ignore the request?
> > Fail the transaction as E3 timed out already ("does not exist")?
> 
> Add a new E3. If NLM_F_EXCL is specified, then fail with "does not exist"

OK.

> > Now, what about E2?  If the transaction is large, it could become
> > like E3 *during the transaction* unless we introduce some freezing
> > mechanism.  What's the expected outcome?
> > 
> > What's the expected outcome if there is some other, unrelated
> > failure? I assume we have to roll back all the timeout updates, right?
> 
> We annotate the new timeout in the transaction object, then apply the
> timeout update in the commit phase.

OK, so as per the "E3 example", you're saying that if E2 expires during
the transaction, then if F_EXCL is given the transaction will fail, while
otherwise it will be re-added.

> > If so, why not temporarily make the timeouts effective right away
> > and then roll back?
> 
> You mean, from the preparation phase? Then we need to undo what has
> been done in case the --check / abort path is exercised; this might
> just create a bogus element listing.

True.  Am I correct that we can't implement the "expand" via
del+add because of stateful objects?

I fear we will need to circle back to rbtree then; I'll follow up
there (wrt. on-demand gc).

> No need for rollback if the new timeout is stored in the transaction
> object; we just set the new timeout from the _commit() step in the
> NEWSETELEM case, which has to deal with updates. Other objects follow
> a similar approach.

Got it.
Florian Westphal Oct. 2, 2023, 2:21 p.m. UTC | #10
Florian Westphal <fw@strlen.de> wrote:
> > You mean, from the preparation phase? Then we need to undo what has
> > been done in case the --check / abort path is exercised; this might
> > just create a bogus element listing.
> 
> True.  Am I correct that we can't implement the "expand" via
> del+add because of stateful objects?
> 
> I fear we will need to circle back to rbtree then, I'll followup
> there (wrt. on-demand gc).

Found another corner case; let's assume an rhash set so as not to mix it
with the rbtree issue.

Element E exists and has not timed out yet

Userspace generates:
Refresh timeout of E to <bignum>
<other unrelated stuff, time passes, E expires>
Add E again.  As the existing E has timed out already,
this passes...

How to prevent an outcome where E now exists twice?

Variant:
Refresh timeout of E to <bignum>
Delete E

Care has to be taken not to add a UAF here.

ATM deletion (unlink) would take effect before the set's commit
hooks are called, so we would refresh the timeout of an unlinked element.

AFAICS there won't be a UAF because the element memory won't be
released until after mutex unlock and synchronize_rcu, but it's
not nice either.
Florian Westphal Oct. 2, 2023, 2:23 p.m. UTC | #11
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> According to 2ee52ae94baa ("netfilter: nft_set_rbtree: skip sync GC for
> new elements in this transaction"), new elements in this transaction
> might expire before the transaction ends. Skipping sync GC is needed for
> such elements, otherwise the commit path might walk over an already
> released object.
> 
> However, Florian found that while iterating the tree from the insert
> path for sync GC, it is possible that stale references could still
> happen for elements in the less-equal and great-than boundaries to
> narrow down the tree descend to speed up overlap detection, this
> triggers bogus overlap errors.
> 
> This patch skips expired elements in the overlap detection routine which
> iterates on the reversed ordered list of elements that represent the
> intervals. Since end elements provide no expiration extension, check for
> the next non-end element in this interval, hence, skip both elements in
> the iteration if the interval has expired.

10.1.2.3 - 10.1.2.30  (expired!)

transaction wants to add:
10.1.2.2 - 10.1.2.29

AFAICS, this is now mismerged into:

10.1.2.2 - 10.1.2.30, because walking back to
next end element from expired 10.1.2.3 will
find 10.1.2.29 as first preceding end element, no?

and the "commit" operation comes after genid bump, so we can't
restrict that to "not active in next gen" or similar :-/

Can you use dead-bit instead?

Element has expired -> Mark element and the end-pair as dead,
then reap all expired and dead nodes from commit callback.

Problem is what to do after reset-interval support is added,
because the newly-marked-dead elements could have a timeout
refresh already pending, and I don't see how this can be handled.
Pablo Neira Ayuso Oct. 2, 2023, 9:10 p.m. UTC | #12
On Mon, Oct 02, 2023 at 03:58:38PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Sun, Oct 01, 2023 at 11:08:16PM +0200, Florian Westphal wrote:
> > > Element E1, times out in 1 hour
> > > Element E2, times out in 1 second
> > > Element E3, timed out (1 second ago, 3 minutes ago, doesn't matter).
> > > 
> > > Userspace batch to kernel:
> > > Update Element E1 to time out in 2 hours.
> > > Update Element E2 to time out in 1 hour.
> > > Update Element E3 to time out in 1 hour.
> > > 
> > > What is the expected outcome of this request?
> > > 
> > > Ignore E3 being reaped already and refresh the timeout (resurrection?)
> > 
> > No resurrection, the element might have counters, it already expired.
> 
> OK.
> 
> > > Ignore E3 being reaped already and ignore the request?
> > > Fail the transaction as E3 timed out already ("does not exist")?
> > 
> > Add a new E3. If NLM_F_EXCL is specified, then fail with "does not exist"
> 
> OK.

Actually, what I said is not correct in this case: NLM_F_EXCL means
create in the newsetelem() path, so adding a new E3 always succeeds if
there is a timed-out E3, regardless of the flag.

No "does not exist" error is possible.

> > > Now, what about E2?  If transaction is large, it could become
> > > like E3 *during the transaction* unless we introduce some freezing
> > > mechanism.  Whats the expected outcome?
> > > 
> > > Whats the expected outcome if there is some other, unrelated
> > > failure? I assume we have to roll back all the timeout updates, right?
> > 
> > We annotate the new timeout in transaction object, then refresh the
> > timeout update in the commit phase.
> 
> OK, so as per "E3-example", you're saying that if E2 expires during
> the transaction, then if F_EXCL is given the transaction will fail while
> otherwise it will be re-added.

Please, revisit if the semantics look correct to you after my
correction on the NLM_F_EXCL flag.

> > > If so, why not temporarily make the timeouts effective right away
> > > and then roll back?
> > 
> > You mean, from the preparation phase? Then we need to undo what has
> > been done, in case of --check / abort path is exercised, this might
> > just create a bogus element listing.
> 
> True.  Am I correct that we can't implement the "expand" via
> del+add because of stateful objects?

I think it is not a good idea to expand an element that has already
expired. There might be another possible corner case:

Refresh element E1 with timeout X -> not expired yet
Element E1 expires
Refresh element E1 with timeout Y -> already expired, ENOENT.

This looks fine to me; this handling is possible because the timeout
is not updated from the preparation phase, only later in the commit
phase.

> I fear we will need to circle back to rbtree then, I'll followup
> there (wrt. on-demand gc).
>
> > No need for rollback if new timeout is stored in the transaction
> > object, we just set the new timeout from _commit() step in the
> > NEWSETELEM case, which has to deal with updates. Other objects follow
> > a similar approach.
> 
> Got it.
Pablo Neira Ayuso Oct. 2, 2023, 9:14 p.m. UTC | #13
On Mon, Oct 02, 2023 at 11:10:42PM +0200, Pablo Neira Ayuso wrote:
> I think it is not a good idea to expand an element that has already
> expired. There might be another possible corner case:
> 
> Refresh element E1 with timeout X -> not expired yet
> Element E1 expires
> Refresh element E1 with timeout Y -> already expired, ENOENT.

Oh, I see, you mentioned this problem in another email.

This actually results in a new element being added; with my proposal,
no ENOENT is ever hit.

I get your point, this needs to be sorted out.
Pablo Neira Ayuso Oct. 2, 2023, 9:37 p.m. UTC | #14
On Mon, Oct 02, 2023 at 04:23:12PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > According to 2ee52ae94baa ("netfilter: nft_set_rbtree: skip sync GC for
> > new elements in this transaction"), new elements in this transaction
> > might expire before such transaction ends. Skip sync GC is needed for
> > such elements otherwise commit path might walk over an already released
> > object.
> > 
> > However, Florian found that while iterating the tree from the insert
> > path for sync GC, it is possible that stale references could still
> > happen for elements in the less-equal and great-than boundaries to
> > narrow down the tree descend to speed up overlap detection, this
> > triggers bogus overlap errors.
> > 
> > This patch skips expired elements in the overlap detection routine which
> > iterates on the reversed ordered list of elements that represent the
> > intervals. Since end elements provide no expiration extension, check for
> > the next non-end element in this interval, hence, skip both elements in
> > the iteration if the interval has expired.
> 
> 10.1.2.3 - 10.1.2.30  (expired!)
> 
> transaction wants to add:
> 10.1.2.2 - 10.1.2.29
> 
> AFAICS, this is now mismerged into:
> 
> 10.1.2.2 - 10.1.2.30, because walking back to
> next end element from expired 10.1.2.3 will
> find 10.1.2.29 as first preceding end element, no?
> 
> and the "commit" operation comes after genid bump, so we can't
> restrict that to "not active in next gen" or similar :-/
> 
> Can you use dead-bit instead?
> 
> Element has expired -> Mark element and the end-pair as dead,
> then reap all expired and dead nodes from commit callback.

This makes sense.

> Problem is what to do after reset-interval support is added,
> because the newly-marked-dead elements could have a timeout
> refresh already pending, and I don't see how this can be handled.

Yes, the refresh-set-element-timeout semantics depend on what the
timeout says in the preparation phase. This is the same problem that
is being discussed in the other two emails in a different thread.

We could annotate the current time at the beginning of the
transaction and use it to check if the element has expired, so set
element expiration stops being a moving target.
Pablo Neira Ayuso Oct. 2, 2023, 9:42 p.m. UTC | #15
On Mon, Oct 02, 2023 at 04:23:12PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > According to 2ee52ae94baa ("netfilter: nft_set_rbtree: skip sync GC for
> > new elements in this transaction"), new elements in this transaction
> > might expire before such transaction ends. Skip sync GC is needed for
> > such elements otherwise commit path might walk over an already released
> > object.
> > 
> > However, Florian found that while iterating the tree from the insert
> > path for sync GC, it is possible that stale references could still
> > happen for elements in the less-equal and great-than boundaries to
> > narrow down the tree descend to speed up overlap detection, this
> > triggers bogus overlap errors.
> > 
> > This patch skips expired elements in the overlap detection routine which
> > iterates on the reversed ordered list of elements that represent the
> > intervals. Since end elements provide no expiration extension, check for
> > the next non-end element in this interval, hence, skip both elements in
> > the iteration if the interval has expired.
> 
> 10.1.2.3 - 10.1.2.30  (expired!)
> 
> transaction wants to add:
> 10.1.2.2 - 10.1.2.29
> 
> AFAICS, this is now mismerged into:
> 
> 10.1.2.2 - 10.1.2.30, because walking back to
> next end element from expired 10.1.2.3 will
> find 10.1.2.29 as first preceding end element, no?

Yes, this corner case is currently possible.

> and the "commit" operation comes after genid bump, so we can't
> restrict that to "not active in next gen" or similar :-/
> 
> Can you use dead-bit instead?

Yes, on-demand GC remains in place in the insert path, and it uses the
dead bit.

> Element has expired -> Mark element and the end-pair as dead,
> then reap all expired and dead nodes from commit callback.
>
> Problem is what to do after reset-interval support is added,
> because the newly-marked-dead elements could have a timeout
> refresh already pending, and I don't see how this can be handled.

You mean:

transaction
set element E is refreshed
set element E expires
set element E is marked as dead by on-demand GC (when walking down for different element E2)
end transaction

This can probably be addressed by using a current time snapshot at the
beginning of the transaction to check for the expiration, instead of
checking against the current real time, which is a moving target.

Let's consolidate this discussion in one single email thread, all
these issues are interrelated :)
Pablo Neira Ayuso Oct. 3, 2023, 8:22 a.m. UTC | #16
Hi Florian,

I am collecting here your comments for the model we are defining for
set elements:

https://people.netfilter.org/pablo/setelems-timeout.txt

Let me know if you have more possible scenarios that you think might
not be addressed by this model:

- Annotate current time at the beginning of the transaction, use it
  in _expired() check (=> timeout is not a moving target anymore).
- Annotate element timeout update in transaction, update timeout from
  _commit() path not to refresh it.

Thanks.
Florian Westphal Oct. 3, 2023, 9:04 a.m. UTC | #17
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Florian,
> 
> I am collecting here your comments for the model we are defining for
> set elements:
> 
> https://people.netfilter.org/pablo/setelems-timeout.txt

LGTM.  I think your proposal to snapshot current time and
remove the "moving target" is key.

> Let me know if you have more possible scenarios that you think
> might not be addressed by this model:
> 
> - Annotate current time at the beginning of the transaction, use it
>   in _expired() check (=> timeout is not a moving target anymore).
> - Annotate element timeout update in transaction, update timeout from
>   _commit() path not to refresh it.

Right, I think that will work.
For rbtree, sync gc is kept in place, elements are not zapped,
they get tagged as DEAD, including the end element.

Then from commit, do full scan and remove any and all elements
that are flagged as DEAD or have expired.
Pablo Neira Ayuso Oct. 3, 2023, 9:42 a.m. UTC | #18
On Tue, Oct 03, 2023 at 11:04:10AM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Hi Florian,
> > 
> > I am collecting here your comments for the model we are defining for
> > set elements:
> > 
> > https://people.netfilter.org/pablo/setelems-timeout.txt
> 
> LGTM.  I think your proposal to snapshot current time and
> remove the "moving target" is key.

Agreed.

> > Let me know if you have more possible scenarios that you think that
> > might not be address by this model:
> > 
> > - Annotate current time at the beginning of the transaction, use it
> >   in _expired() check (=> timeout is not a moving target anymore).
> > - Annotate element timeout update in transaction, update timeout from
> >   _commit() path not to refresh it.
> 
> Right, I think that will work.
> For rbtree, sync gc is kept in place, elements are not zapped,
> they get tagged as DEAD, including the end element.
> 
> Then from commit, do full scan and remove any and all elements
> that are flagged as DEAD or have expired.

Sounds good.

Would you follow this approach to fix the existing issue with the
rbtree on-demand GC in nf.git?
Florian Westphal Oct. 3, 2023, 6:24 p.m. UTC | #19
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Right, I think that will work.
> > For rbtree, sync gc is kept in place, elements are not zapped,
> > they get tagged as DEAD, including the end element.
> > 
> > Then from commit, do full scan and remove any and all elements
> > that are flagged as DEAD or have expired.
> 
> Sounds good.
> 
> Would you follow this approach to fix the existing issue with the
> rbtree on-demand GC in nf.git?

Actually, I don't see why it's needed. With your proposal
to make the "is_expired" check during transaction consistently based on
a fixed tstamp, expiry during transaction becomes impossible.
So we can keep immediate rb_erase around.

I suggest to take my proposal to erase, signal -EAGAIN to caller,
then have caller retry.  Apply this to nf.git as a bug fix.

Then, I can take my patches that are mixed into the gc rework;
split those up, and we take the "no more async rbtree gc" for nf-next.

Do you still spot a problem if we retain the on-insert node erase?

To give some numbers (async gc disabled):

Insert 20k ranges into rbtree (takes ~4 minutes).
Wait until all have expired.
Insert a single range: takes 250ms (entire tree has to be purged).

Don't think it will be any faster with dead-bit approach,
we simply move cost to later in the transaction.

The only nf.git "advantage" is that we typically won't have
to zap the entire tree during a transaction, but that's due to
async gc and I'd rather remove it.

What do you think?
Pablo Neira Ayuso Oct. 4, 2023, 8:30 a.m. UTC | #20
On Tue, Oct 03, 2023 at 08:24:47PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Right, I think that will work.
> > > For rbtree, sync gc is kept in place, elements are not zapped,
> > > they get tagged as DEAD, including the end element.
> > > 
> > > Then from commit, do full scan and remove any and all elements
> > > that are flagged as DEAD or have expired.
> > 
> > Sounds good.
> > 
> > Would you follow this approach to fix the existing issue with the
> > rbtree on-demand GC in nf.git?
> 
> Actually, I don't see why it's needed. With your proposal
> to make the "is_expired" check during transaction consistently based on
> a fixed tstamp, expiry during transaction becomes impossible.
> So we can keep immediate rb_erase around.

Makes sense.

> I suggest to take my proposal to erase, signal -EAGAIN to caller,
> then have caller retry.  Apply this to nf.git as a bug fix.
> 
> Then, I can take my patches that are mixed into the gc rework;
> split those up, and we take the "no more async rbtree gc" for nf-next.
> 
> Do you still spot a problem if we retain the on-insert node erase?

Apart from this unbounded loop, which will sooner or later stop
hitting EAGAIN, no.

> To give some numbers (async gc disabled):
> 
> Insert 20k ranges into rbtree (takes ~4minutes).
> Wait until all have expired.
> Insert a single range: takes 250ms (entire tree has to be purged).
> 
> Don't think it will be any faster with dead-bit approach,
> we simply move cost to later in the transaction.
>
> The only nf.git "advantage" is that we typically won't have
> to zap the entire tree during a transaction, but that's due to
> async gc and I'd rather remove it.
> 
> What do you think?

I am fine with this approach.

Whatever comes out of this will be redone in nf-next.
diff mbox series

Patch

diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 487572dcd614..de6d812fc790 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -235,8 +235,7 @@  static void nft_rbtree_gc_remove(struct net *net, struct nft_set *set,
 
 static int nft_rbtree_gc_elem(const struct nft_set *__set,
 			      struct nft_rbtree *priv,
-			      struct nft_rbtree_elem *rbe,
-			      u8 genmask)
+			      struct nft_rbtree_elem *rbe)
 {
 	struct nft_set *set = (struct nft_set *)__set;
 	struct rb_node *prev = rb_prev(&rbe->node);
@@ -254,8 +253,7 @@  static int nft_rbtree_gc_elem(const struct nft_set *__set,
 	 */
 	while (prev) {
 		rbe_prev = rb_entry(prev, struct nft_rbtree_elem, node);
-		if (nft_rbtree_interval_end(rbe_prev) &&
-		    nft_set_elem_active(&rbe_prev->ext, genmask))
+		if (nft_rbtree_interval_end(rbe_prev))
 			break;
 
 		prev = rb_prev(prev);
@@ -289,6 +287,34 @@  static int nft_rbtree_gc_elem(const struct nft_set *__set,
 	return 0;
 }
 
+static void nft_rbtree_commit(const struct nft_set *set)
+{
+	struct nft_rbtree *priv = nft_set_priv(set);
+	struct rb_node *node, *next, *first;
+	struct nft_rbtree_elem *rbe;
+
+	if (!(set->flags & NFT_SET_TIMEOUT))
+		return;
+
+	/* ignore GC interval here, unless two copies of the tree are
+	 * maintained, it is not possible to postpone removal of expired
+	 * elements.
+	 */
+
+	first = rb_first(&priv->root);
+
+	for (node = first; node != NULL; node = next) {
+		next = rb_next(node);
+
+		rbe = rb_entry(node, struct nft_rbtree_elem, node);
+
+		if (!nft_set_elem_expired(&rbe->ext))
+			continue;
+
+		nft_rbtree_gc_elem(set, priv, rbe);
+	}
+}
+
 static bool nft_rbtree_update_first(const struct nft_set *set,
 				    struct nft_rbtree_elem *rbe,
 				    struct rb_node *first)
@@ -312,9 +338,8 @@  static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 	struct nft_rbtree_elem *rbe, *rbe_le = NULL, *rbe_ge = NULL;
 	struct rb_node *node, *next, *parent, **p, *first = NULL;
 	struct nft_rbtree *priv = nft_set_priv(set);
-	u8 cur_genmask = nft_genmask_cur(net);
 	u8 genmask = nft_genmask_next(net);
-	int d, err;
+	int d;
 
 	/* Descend the tree to search for an existing element greater than the
 	 * key value to insert that is greater than the new element. This is the
@@ -358,16 +383,19 @@  static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set,
 		if (!nft_set_elem_active(&rbe->ext, genmask))
 			continue;
 
-		/* perform garbage collection to avoid bogus overlap reports
-		 * but skip new elements in this transaction.
+		/* skip expired intervals to avoid bogus overlap reports:
+		 * end element has no expiration, check next start element.
 		 */
-		if (nft_set_elem_expired(&rbe->ext) &&
-		    nft_set_elem_active(&rbe->ext, cur_genmask)) {
-			err = nft_rbtree_gc_elem(set, priv, rbe, genmask);
-			if (err < 0)
-				return err;
+		if (nft_rbtree_interval_end(rbe) && next) {
+			struct nft_rbtree_elem *rbe_next;
 
-			continue;
+			rbe_next = rb_entry(next, struct nft_rbtree_elem, node);
+
+			if (nft_set_elem_expired(&rbe_next->ext)) {
+				/* skip expired next start element. */
+				next = rb_next(next);
+				continue;
+			}
 		}
 
 		d = nft_rbtree_cmp(set, rbe, new);
@@ -755,5 +783,6 @@  const struct nft_set_type nft_set_rbtree_type = {
 		.lookup		= nft_rbtree_lookup,
 		.walk		= nft_rbtree_walk,
 		.get		= nft_rbtree_get,
+		.commit		= nft_rbtree_commit,
 	},
 };