diff mbox

netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Message ID 1389090711-15843-1-git-send-email-avagin@openvz.org
State Superseded
Headers show

Commit Message

Andrei Vagin Jan. 7, 2014, 10:31 a.m. UTC
Lets look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()

I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few node.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 net/netfilter/nf_conntrack_core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Vasily Averin Jan. 7, 2014, 11:42 a.m. UTC | #1
On 01/07/2014 02:31 PM, Andrey Vagin wrote:
> Lets look at destroy_conntrack:
> 
> hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> ...
> nf_conntrack_free(ct)
> 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> 
> net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
> 
> The hash is protected by rcu, so readers look up conntracks without
> locks.
> A conntrack is removed from the hash, but in this moment a few readers
> still can use the conntrack. Then this conntrack is released and another
> thread creates conntrack with the same address and the equal tuple.
> After this a reader starts to validate the conntrack:
> * It's not dying, because a new conntrack was created
> * nf_ct_tuple_equal() returns true.
> 
> But this conntrack is not initialized yet, so it can not be used by two
> threads concurrently. In this case BUG_ON may be triggered from
> nf_nat_setup_info().
> 
> Florian Westphal suggested to check the confirm bit too. I think it's
> right.
> 
> task 1			task 2			task 3
> 			nf_conntrack_find_get
> 			 ____nf_conntrack_find
> destroy_conntrack
>  hlist_nulls_del_rcu
>  nf_conntrack_free
>  kmem_cache_free
> 						__nf_conntrack_alloc
> 						 kmem_cache_alloc
> 						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> 			 if (nf_ct_is_dying(ct))
> 			 if (!nf_ct_tuple_equal()
> 
> I'm not sure, that I have ever seen this race condition in a real life.
> Currently we are investigating a bug, which is reproduced on a few node.
> In our case one conntrack is initialized from a few tasks concurrently,
> we don't have any other explanation for this.

I would add; this strange traffic is generated by botnet clients (/etc/atdd and /boot/.IptabLes)
Several threads are send lot of identical tcp packets via RAW socket.
https://bugzilla.openvz.org/show_bug.cgi?id=2851

> <2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
> ...
> <4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
> ...
> <4>[46267.085549] Call Trace:
> <4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
> <4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
> <4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
> <4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
> <4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
> <4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
> <4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
> <4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
> <4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
> <4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
> <4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
> <4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
> <4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
> <4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
> <4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
> <4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
> <4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
> <4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
> <4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
> <4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
> <4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
> <4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
> <4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
> <4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
> <4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
> <1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590
> 
> Cc: Florian Westphal <fw@strlen.de>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> ---
>  net/netfilter/nf_conntrack_core.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 43549eb..7a34bb2 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -387,8 +387,12 @@ begin:
>  			     !atomic_inc_not_zero(&ct->ct_general.use)))
>  			h = NULL;
>  		else {
> +			/* A conntrack can be recreated with the equal tuple,
> +			 * so we need to check that the conntrack is initialized
> +			 */
>  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> -				     nf_ct_zone(ct) != zone)) {
> +				     nf_ct_zone(ct) != zone) ||
> +				     !nf_ct_is_confirmed(ct)) {
>  				nf_ct_put(ct);
>  				goto begin;
>  			}
> 

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 7, 2014, 3:08 p.m. UTC | #2
On Tue, 2014-01-07 at 14:31 +0400, Andrey Vagin wrote:
> Lets look at destroy_conntrack:
> 
> hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> ...
> nf_conntrack_free(ct)
> 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> 
> net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
> 
> The hash is protected by rcu, so readers look up conntracks without
> locks.
> A conntrack is removed from the hash, but in this moment a few readers
> still can use the conntrack. Then this conntrack is released and another
> thread creates conntrack with the same address and the equal tuple.
> After this a reader starts to validate the conntrack:
> * It's not dying, because a new conntrack was created
> * nf_ct_tuple_equal() returns true.
> 
> But this conntrack is not initialized yet, so it can not be used by two
> threads concurrently. In this case BUG_ON may be triggered from
> nf_nat_setup_info().
> 
> Florian Westphal suggested to check the confirm bit too. I think it's
> right.
> 
> task 1			task 2			task 3
> 			nf_conntrack_find_get
> 			 ____nf_conntrack_find
> destroy_conntrack
>  hlist_nulls_del_rcu
>  nf_conntrack_free
>  kmem_cache_free
> 						__nf_conntrack_alloc
> 						 kmem_cache_alloc
> 						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> 			 if (nf_ct_is_dying(ct))
> 			 if (!nf_ct_tuple_equal()
> 
> I'm not sure, that I have ever seen this race condition in a real life.
> Currently we are investigating a bug, which is reproduced on a few node.
> In our case one conntrack is initialized from a few tasks concurrently,
> we don't have any other explanation for this.

> 
> Cc: Florian Westphal <fw@strlen.de>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> ---
>  net/netfilter/nf_conntrack_core.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 43549eb..7a34bb2 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -387,8 +387,12 @@ begin:
>  			     !atomic_inc_not_zero(&ct->ct_general.use)))
>  			h = NULL;
>  		else {
> +			/* A conntrack can be recreated with the equal tuple,
> +			 * so we need to check that the conntrack is initialized
> +			 */
>  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> -				     nf_ct_zone(ct) != zone)) {
> +				     nf_ct_zone(ct) != zone) ||
> +				     !nf_ct_is_confirmed(ct)) {
>  				nf_ct_put(ct);
>  				goto begin;
>  			}

I do not think this is the right way to fix this problem (if said
problem is confirmed)

Remember the rule about SLAB_DESTROY_BY_RCU :

When a struct is freed, then reused, its important to set the its refcnt
(from 0 to 1) only when the structure is fully ready for use.

If a lookup finds a structure which is not yet setup, the
atomic_inc_not_zero() will fail.

Take a look at sk_clone_lock() and Documentation/RCU/rculist_nulls.txt





--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 7, 2014, 3:25 p.m. UTC | #3
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 43549eb..7a34bb2 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -387,8 +387,12 @@ begin:
> >  			     !atomic_inc_not_zero(&ct->ct_general.use)))
> >  			h = NULL;
> >  		else {
> > +			/* A conntrack can be recreated with the equal tuple,
> > +			 * so we need to check that the conntrack is initialized
> > +			 */
> >  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> > -				     nf_ct_zone(ct) != zone)) {
> > +				     nf_ct_zone(ct) != zone) ||
> > +				     !nf_ct_is_confirmed(ct)) {
> >  				nf_ct_put(ct);
> >  				goto begin;
> >  			}
> 
> I do not think this is the right way to fix this problem (if said
> problem is confirmed)
> 
> Remember the rule about SLAB_DESTROY_BY_RCU :
> 
> When a struct is freed, then reused, its important to set the its refcnt
> (from 0 to 1) only when the structure is fully ready for use.
> 
> If a lookup finds a structure which is not yet setup, the
> atomic_inc_not_zero() will fail.

Indeed.  But, the structure itself might be ready (or rather,
can be ready since the allocation side will set the refcount to one
after doing the initial work, such as zapping old ->status flags and
setting tuple information).

The problem is with nat extension area stored in the ct->ext area.
This extension area is preallocated but the snat/dnat action
information is only set up after the ct (or rather, the skb that grabbed
a reference to the nf_conn entry) traverses nat pre/postrouting.

This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
rule existed.

The manipulations of the skb->nfct->ext nat area are performed without
a lock.  Concurrent access is supposedly impossible as the conntrack
should not (yet) be in the hash table.

The confirmed bit is set right before we insert the conntrack into
the hash table (after we traversed rules, ct is ready to be
'published').

i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
struct when we perform the lookup, as it should still be sitting on the
'unconfirmed' list, being invisible to readers.

Does that explanation make sense to you?

Thanks for looking into this.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2014, 1:42 p.m. UTC | #4
On Tue, 2014-01-07 at 16:25 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > index 43549eb..7a34bb2 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> > > @@ -387,8 +387,12 @@ begin:
> > >  			     !atomic_inc_not_zero(&ct->ct_general.use)))
> > >  			h = NULL;
> > >  		else {
> > > +			/* A conntrack can be recreated with the equal tuple,
> > > +			 * so we need to check that the conntrack is initialized
> > > +			 */
> > >  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> > > -				     nf_ct_zone(ct) != zone)) {
> > > +				     nf_ct_zone(ct) != zone) ||
> > > +				     !nf_ct_is_confirmed(ct)) {
> > >  				nf_ct_put(ct);
> > >  				goto begin;
> > >  			}
> > 
> > I do not think this is the right way to fix this problem (if said
> > problem is confirmed)
> > 
> > Remember the rule about SLAB_DESTROY_BY_RCU :
> > 
> > When a struct is freed, then reused, its important to set the its refcnt
> > (from 0 to 1) only when the structure is fully ready for use.
> > 
> > If a lookup finds a structure which is not yet setup, the
> > atomic_inc_not_zero() will fail.
> 
> Indeed.  But, the structure itself might be ready (or rather,
> can be ready since the allocation side will set the refcount to one
> after doing the initial work, such as zapping old ->status flags and
> setting tuple information).
> 
> The problem is with nat extension area stored in the ct->ext area.
> This extension area is preallocated but the snat/dnat action
> information is only set up after the ct (or rather, the skb that grabbed
> a reference to the nf_conn entry) traverses nat pre/postrouting.
> 
> This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
> rule existed.
> 
> The manipulations of the skb->nfct->ext nat area are performed without
> a lock.  Concurrent access is supposedly impossible as the conntrack
> should not (yet) be in the hash table.
> 
> The confirmed bit is set right before we insert the conntrack into
> the hash table (after we traversed rules, ct is ready to be
> 'published').
> 
> i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
> struct when we perform the lookup, as it should still be sitting on the
> 'unconfirmed' list, being invisible to readers.
> 
> Does that explanation make sense to you?
> 
> Thanks for looking into this.

Still, this patch adds a loop. And maybe an infinite one if confirmed
bit is set from an context that was interrupted by this one.

If you need to test the confirmed bit, then you also need to test it
before taking the refcount.



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 8, 2014, 2:04 p.m. UTC | #5
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
> > rule existed.
> > 
> > The manipulations of the skb->nfct->ext nat area are performed without
> > a lock.  Concurrent access is supposedly impossible as the conntrack
> > should not (yet) be in the hash table.
> > 
> > The confirmed bit is set right before we insert the conntrack into
> > the hash table (after we traversed rules, ct is ready to be
> > 'published').
> > 
> > i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
> > struct when we perform the lookup, as it should still be sitting on the
> > 'unconfirmed' list, being invisible to readers.
> > 
> > Does that explanation make sense to you?
> > 
> > Thanks for looking into this.
> 
> Still, this patch adds a loop. And maybe an infinite one if confirmed
> bit is set from an context that was interrupted by this one.

Hmm.  There should be at most one retry.

The confirmed bit should always be set here.
If it isn't then this conntrack shouldn't be in the hash table, i.e.
when we re-try we should find the same conntrack again with the bit set.

Asuming the other cpu git interrupted after setting confirmed
bit but before inserting it into the hash table, then our re-try
should not be able find a matching entry.

Maybe I am missing something, but I don't see how we could (upon
retry) find the very same entry again with the bit still not set.

> If you need to test the confirmed bit, then you also need to test it
> before taking the refcount.

I don't think that would make sense, because it should always be
set (inserting conntrack into hash table without confirmed set is
illegal, and it is never unset again).

[ when allocating a new conntrack, ct->status is zeroed, which also
  clears the flag.  This happens just before we set the new objects
  refcount to 1 ]
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 8, 2014, 5:31 p.m. UTC | #6
On Wed, 2014-01-08 at 15:04 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
> > > rule existed.
> > > 
> > > The manipulations of the skb->nfct->ext nat area are performed without
> > > a lock.  Concurrent access is supposedly impossible as the conntrack
> > > should not (yet) be in the hash table.
> > > 
> > > The confirmed bit is set right before we insert the conntrack into
> > > the hash table (after we traversed rules, ct is ready to be
> > > 'published').
> > > 
> > > i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
> > > struct when we perform the lookup, as it should still be sitting on the
> > > 'unconfirmed' list, being invisible to readers.
> > > 
> > > Does that explanation make sense to you?
> > > 
> > > Thanks for looking into this.
> > 
> > Still, this patch adds a loop. And maybe an infinite one if confirmed
> > bit is set from an context that was interrupted by this one.
> 
> Hmm.  There should be at most one retry.
> 
> The confirmed bit should always be set here.

So why are you testing it ?

> If it isn't then this conntrack shouldn't be in the hash table, i.e.
> when we re-try we should find the same conntrack again with the bit set.

Where is this guaranteed ? The invariant is the refcnt being taken.

> 
> Asuming the other cpu git interrupted after setting confirmed
> bit but before inserting it into the hash table, then our re-try
> should not be able find a matching entry.
> 
> Maybe I am missing something, but I don't see how we could (upon
> retry) find the very same entry again with the bit still not set.
> 
> > If you need to test the confirmed bit, then you also need to test it
> > before taking the refcount.
> 
> I don't think that would make sense, because it should always be
> set (inserting conntrack into hash table without confirmed set is
> illegal, and it is never unset again).
> 
> [ when allocating a new conntrack, ct->status is zeroed, which also
>   clears the flag.  This happens just before we set the new objects
>   refcount to 1 ]


I did this RCU conversion, so I think I know what I am talking about.

The entry should not be put into hash table (or refcnt set to 1),
if its not ready. It is that simple.

We need to address this problem without adding an extra test and
possible loop for the lookups.

Whole point of RCU is to make lookups fast, by possibly making the
writers doing all the needed logic.



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 8, 2014, 8:18 p.m. UTC | #7
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > The confirmed bit should always be set here.
> 
> So why are you testing it ?

To detect ct object recycling when tuple is identical.

This is my understanding of how we can end up with two
cpus thinking they have exclusive ownership of the same ct:

A cpu0: starts lookup: find ct for tuple t
B cpu1: starts lookup: find ct for tuple t
C cpu0: finds ct c for tuple t, no refcnt taken yet
  cpu1: finds ct c for tuple t, no refcnt taken yet
   cpu2: destroys ct c, removes from hash table, calls ->destroy function
D cpu0: tries to increment refcnt; fails since its 0: lookup ends
E cpu0: allocates a new ct object since no acceptable ct was found for t
F cpu0: allocator gives us just-freed ct c
G cpu0: initialises ct, sets refcnt to 1
H cpu0: adds extensions, ct object is put on unconfirmed list and
        assigned to skb->nfct
I cpu0: skb continues through network stack
J cpu1: tries to increment refcnt, ok
K cpu1: checks if ct matches requested tuple t: it does
L cpu0: sets refcnt conntrack tuple, allocates extensions, etc.
  cpu1: sets skb->nfct to ct, skb continues through network stack

-> both cpu0 and cpu1 reference a ct object that was not in hash table

cpu0 and cpu1 will then race, for example in
net/ipv4/netfilter/iptable_nat.c:nf_nat_rule_find():

        if (!nf_nat_initialized(ct, HOOK2MANIP(hooknum)))
              ret = alloc_null_binding(ct, hooknum);

[ Its possible that I misunderstand and that there is something that
  precents this from happening.  Usually its the 'tuple equal' test
  that is performed post-atomic-inc-not-zero that detects the recycling,
  so step K above would fail ]

The idea of the 'confirmed bit test' is that when its not set then the
conntrack was recycled and should not be used before the cpu that
currently 'owns' that entry has put it into the hash table again.

> I did this RCU conversion, so I think I know what I am talking about.

Yes, I know.  If you have any suggestions on how to fix it, I'd be very
interested to hear about them.

> The entry should not be put into hash table (or refcnt set to 1),
> if its not ready. It is that simple.

I understand what you're saying, but I don't see how we can do it.

I think the assumption that we have a refcnt on skb->nfct is everywhere.

If I understand you correctly we'd have to differentiate between
'can be used fully (e.g. nf_nat_setup_info done for both directions)'
and 'a new conntrack was created (extensions may not be ready yet)'.

But currently in both cases the ct is assigned to the skb, and in both
cases a refcnt is taken.  I am out of ideas, except perhaps using
ct->lock to serialize the nat setup (but I don't like that since
I'm not sure that the nat race is the only one).

> We need to address this problem without adding an extra test and
> possible loop for the lookups.

Agreed.  I don't like the extra test either.

Many thanks for looking into this Eric!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 8, 2014, 8:23 p.m. UTC | #8
Florian Westphal <fw@strlen.de> wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > The confirmed bit should always be set here.
> > 
> > So why are you testing it ?
> 
> To detect ct object recycling when tuple is identical.
> 
> This is my understanding of how we can end up with two
> cpus thinking they have exclusive ownership of the same ct:
> 
> A cpu0: starts lookup: find ct for tuple t
> B cpu1: starts lookup: find ct for tuple t
> C cpu0: finds ct c for tuple t, no refcnt taken yet
>   cpu1: finds ct c for tuple t, no refcnt taken yet
>    cpu2: destroys ct c, removes from hash table, calls ->destroy function
> D cpu0: tries to increment refcnt; fails since its 0: lookup ends
> E cpu0: allocates a new ct object since no acceptable ct was found for t
> F cpu0: allocator gives us just-freed ct c
> G cpu0: initialises ct, sets refcnt to 1
> H cpu0: adds extensions, ct object is put on unconfirmed list and
>         assigned to skb->nfct
> I cpu0: skb continues through network stack
> J cpu1: tries to increment refcnt, ok
> K cpu1: checks if ct matches requested tuple t: it does
> L cpu0: sets refcnt conntrack tuple, allocates extensions, etc.
    ^^^^
>   cpu1: sets skb->nfct to ct, skb continues through network stack

sorry, for that brain fart  This should only say
  L cpu1: sets skb->nfct to ct, skb continues...
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrei Vagin Jan. 9, 2014, 5:24 a.m. UTC | #9
On Tue, Jan 07, 2014 at 07:08:25AM -0800, Eric Dumazet wrote:
> On Tue, 2014-01-07 at 14:31 +0400, Andrey Vagin wrote:
> > Lets look at destroy_conntrack:
> > 
> > hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > ...
> > nf_conntrack_free(ct)
> > 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
> > 
> > net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
> > 
> > The hash is protected by rcu, so readers look up conntracks without
> > locks.
> > A conntrack is removed from the hash, but in this moment a few readers
> > still can use the conntrack. Then this conntrack is released and another
> > thread creates conntrack with the same address and the equal tuple.
> > After this a reader starts to validate the conntrack:
> > * It's not dying, because a new conntrack was created
> > * nf_ct_tuple_equal() returns true.
> > 
> > But this conntrack is not initialized yet, so it can not be used by two
> > threads concurrently. In this case BUG_ON may be triggered from
> > nf_nat_setup_info().
> > 
> > Florian Westphal suggested to check the confirm bit too. I think it's
> > right.
> > 
> > task 1			task 2			task 3
> > 			nf_conntrack_find_get
> > 			 ____nf_conntrack_find
> > destroy_conntrack
> >  hlist_nulls_del_rcu
> >  nf_conntrack_free
> >  kmem_cache_free
> > 						__nf_conntrack_alloc
> > 						 kmem_cache_alloc
> > 						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> > 			 if (nf_ct_is_dying(ct))
> > 			 if (!nf_ct_tuple_equal()
> > 
> > I'm not sure, that I have ever seen this race condition in a real life.
> > Currently we are investigating a bug, which is reproduced on a few node.
> > In our case one conntrack is initialized from a few tasks concurrently,
> > we don't have any other explanation for this.
> 
> > 
> > Cc: Florian Westphal <fw@strlen.de>
> > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > Cc: Patrick McHardy <kaber@trash.net>
> > Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> > Signed-off-by: Andrey Vagin <avagin@openvz.org>
> > ---
> >  net/netfilter/nf_conntrack_core.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 43549eb..7a34bb2 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -387,8 +387,12 @@ begin:
> >  			     !atomic_inc_not_zero(&ct->ct_general.use)))
> >  			h = NULL;
> >  		else {
> > +			/* A conntrack can be recreated with the equal tuple,
> > +			 * so we need to check that the conntrack is initialized
> > +			 */
> >  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> > -				     nf_ct_zone(ct) != zone)) {
> > +				     nf_ct_zone(ct) != zone) ||
> > +				     !nf_ct_is_confirmed(ct)) {
> >  				nf_ct_put(ct);
> >  				goto begin;
> >  			}
> 
> I do not think this is the right way to fix this problem (if said
> problem is confirmed)
> 
> Remember the rule about SLAB_DESTROY_BY_RCU :
> 
> When a struct is freed, then reused, its important to set the its refcnt
> (from 0 to 1) only when the structure is fully ready for use.
> 
> If a lookup finds a structure which is not yet setup, the
> atomic_inc_not_zero() will fail.
> 
> Take a look at sk_clone_lock() and Documentation/RCU/rculist_nulls.txt
> 

I have one more question. Looks like I found another problem.

init_conntrack:
        hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
                      &net->ct.unconfirmed);

__nf_conntrack_hash_insert:
	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
		   &net->ct.hash[hash]);

We use one hlist_nulls_node to add a conntrack into two different lists.
As far as I understand, it's unacceptable in case of
SLAB_DESTROY_BY_RCU.

Lets imagine that we have two threads. The first one enumerates objects
from a first list, the second one recreates an object and insert it in a
second list.  If a first thread in this moment stays on the object, it
can read "next", when it's in the second list, so it will continue
to enumerate objects from the second list. It is not what we want, isn't
it?

cpu1				cpu2
				hlist_nulls_for_each_entry_rcu(ct)
destroy_conntrack
 kmem_cache_free

init_conntrack
 hlist_nulls_add_head_rcu
				ct = ct->next

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 9, 2014, 3:23 p.m. UTC | #10
On Thu, 2014-01-09 at 09:24 +0400, Andrew Vagin wrote:

> I have one more question. Looks like I found another problem.
> 
> init_conntrack:
>         hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
>                       &net->ct.unconfirmed);
> 
> __nf_conntrack_hash_insert:
> 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
> 		   &net->ct.hash[hash]);
> 
> We use one hlist_nulls_node to add a conntrack into two different lists.
> As far as I understand, it's unacceptable in case of
> SLAB_DESTROY_BY_RCU.

I guess you missed :

net/netfilter/nf_conntrack_core.c:1598: INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL);


> 
> Lets imagine that we have two threads. The first one enumerates objects
> from a first list, the second one recreates an object and insert it in a
> second list.  If a first thread in this moment stays on the object, it
> can read "next", when it's in the second list, so it will continue
> to enumerate objects from the second list. It is not what we want, isn't
> it?
> 
> cpu1				cpu2
> 				hlist_nulls_for_each_entry_rcu(ct)
> destroy_conntrack
>  kmem_cache_free
> 
> init_conntrack
>  hlist_nulls_add_head_rcu
> 				ct = ct->next
> 

This will be fine.

I think we even have a counter to count number of occurrence of this
rare event. (I personally never read a non null search_restart value)

 NF_CT_STAT_INC(net, search_restart); 



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Vagin Jan. 9, 2014, 8:32 p.m. UTC | #11
On Tue, Jan 07, 2014 at 04:25:20PM +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > index 43549eb..7a34bb2 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> > > @@ -387,8 +387,12 @@ begin:
> > >  			     !atomic_inc_not_zero(&ct->ct_general.use)))
> > >  			h = NULL;
> > >  		else {
> > > +			/* A conntrack can be recreated with the equal tuple,
> > > +			 * so we need to check that the conntrack is initialized
> > > +			 */
> > >  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> > > -				     nf_ct_zone(ct) != zone)) {
> > > +				     nf_ct_zone(ct) != zone) ||
> > > +				     !nf_ct_is_confirmed(ct)) {
> > >  				nf_ct_put(ct);
> > >  				goto begin;
> > >  			}
> > 
> > I do not think this is the right way to fix this problem (if said
> > problem is confirmed)
> > 
> > Remember the rule about SLAB_DESTROY_BY_RCU :
> > 
> > When a struct is freed, then reused, its important to set the its refcnt
> > (from 0 to 1) only when the structure is fully ready for use.
> > 
> > If a lookup finds a structure which is not yet setup, the
> > atomic_inc_not_zero() will fail.
> 
> Indeed.  But, the structure itself might be ready (or rather,
> can be ready since the allocation side will set the refcount to one
> after doing the initial work, such as zapping old ->status flags and
> setting tuple information).
> 
> The problem is with nat extension area stored in the ct->ext area.
> This extension area is preallocated but the snat/dnat action
> information is only set up after the ct (or rather, the skb that grabbed
> a reference to the nf_conn entry) traverses nat pre/postrouting.
> 
> This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
> rule existed.
> 
> The manipulations of the skb->nfct->ext nat area are performed without
> a lock.  Concurrent access is supposedly impossible as the conntrack
> should not (yet) be in the hash table.
> 
> The confirmed bit is set right before we insert the conntrack into
> the hash table (after we traversed rules, ct is ready to be
> 'published').

Can we allocate conntrack with zero ct_general.use and increment it at
the first time before inserting the conntrack into the hash table?
When conntrack is allocated it is attached exclusively to one skb.
It must be destroyed with skb, if it has not been confirmed, so we
don't need refcnt on this stage.

I found only one place, where a reference counter of unconfirmed
conntract can incremented. It's ctnetlink_dump_table(). Probably we
can find a way, how to fix it.

> 
> i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
> struct when we perform the lookup, as it should still be sitting on the
> 'unconfirmed' list, being invisible to readers.
> 
> Does that explanation make sense to you?
> 
> Thanks for looking into this.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 9, 2014, 8:56 p.m. UTC | #12
Andrew Vagin <avagin@parallels.com> wrote:
> Can we allocate conntrack with zero ct_general.use and increment it at
> the first time before inserting the conntrack into the hash table?
> When conntrack is allocated it is attached exclusively to one skb.
> It must be destroyed with skb, if it has not been confirmed, so we
> don't need refcnt on this stage.
> 
> I found only one place, where a reference counter of unconfirmed
> conntract can incremented. It's ctnetlink_dump_table().

What about skb_clone, etc?  They will also increment the refcnt
if a conntrack entry is attached to the skb.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Vagin Jan. 9, 2014, 9:07 p.m. UTC | #13
On Thu, Jan 09, 2014 at 09:56:22PM +0100, Florian Westphal wrote:
> Andrew Vagin <avagin@parallels.com> wrote:
> > Can we allocate conntrack with zero ct_general.use and increment it at
> > the first time before inserting the conntrack into the hash table?
> > When conntrack is allocated it is attached exclusively to one skb.
> > It must be destroyed with skb, if it has not been confirmed, so we
> > don't need refcnt on this stage.
> > 
> > I found only one place, where a reference counter of unconfirmed
> > conntract can incremented. It's ctnetlink_dump_table().
> 
> What about skb_clone, etc?  They will also increment the refcnt
> if a conntrack entry is attached to the skb.

We can not attach an unconfirmed conntrack to a few skb, because
nf_nat_setup_info can be executed concurrently for the same conntrack.

How do we avoid this race condition for cloned skb-s?
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 9, 2014, 9:26 p.m. UTC | #14
Andrew Vagin <avagin@parallels.com> wrote:
> On Thu, Jan 09, 2014 at 09:56:22PM +0100, Florian Westphal wrote:
> > Andrew Vagin <avagin@parallels.com> wrote:
> > > Can we allocate conntrack with zero ct_general.use and increment it at
> > > the first time before inserting the conntrack into the hash table?
> > > When conntrack is allocated it is attached exclusively to one skb.
> > > It must be destroyed with skb, if it has not been confirmed, so we
> > > don't need refcnt on this stage.
> > > 
> > > I found only one place, where a reference counter of unconfirmed
> > > conntract can incremented. It's ctnetlink_dump_table().
> > 
> > What about skb_clone, etc?  They will also increment the refcnt
> > if a conntrack entry is attached to the skb.
> 
> We can not attach an unconfirmed conntrack to a few skb, because

s/few/new/?

> nf_nat_setup_info can be executed concurrently for the same conntrack.
> 
> How do we avoid this race condition for cloned skb-s?

Simple, the assumption is that only one cpu owns the nfct, so it does
not matter if the skb is cloned in between, as there are no parallel
users.

The only possibility (that I know of) to violate this is to create
a bridge, enable call-iptables sysctl, add -j NFQUEUE rules and then
wait for packets that need to be forwarded to several recipients,
e.g. multicast traffic.

see
http://marc.info/?l=netfilter-devel&m=131471083501656&w=2

or search 'netfilter: nat: work around shared nfct struct in bridge
case'

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrei Vagin Jan. 9, 2014, 9:46 p.m. UTC | #15
2014/1/9 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2014-01-09 at 09:24 +0400, Andrew Vagin wrote:
>
>> I have one more question. Looks like I found another problem.
>>
>> init_conntrack:
>>         hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
>>                       &net->ct.unconfirmed);
>>
>> __nf_conntrack_hash_insert:
>>       hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
>>                  &net->ct.hash[hash]);
>>
>> We use one hlist_nulls_node to add a conntrack into two different lists.
>> As far as I understand, it's unacceptable in case of
>> SLAB_DESTROY_BY_RCU.
>
> I guess you missed :
>
> net/netfilter/nf_conntrack_core.c:1598: INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL);

but we can look up something suitable and return this value, but it
will be unconfirmed. Ok, I see. This situation is fixed by this patch
too.

I don't understand the result of your discussion with Florian.  Here
are a few states of conntracts: it can be used and it's initialized.
The sign of the first state is non-zero refcnt and the sign of the
second state is the confirmed bit.

In the first state conntrack is attached to skb and inserted in the
unconfirmed list. In this state the use count can be incremented in
ctnetlink_dump_list() and skb_clone().

In the second state conntrack may be attached to a few skb-s and
inserted to net->ct.hash.

I have read all emails again and I can't understand when this patch
doesn't work.  Maybe you could give a sequence of actions? Thanks.

>
>
>>
>> Lets imagine that we have two threads. The first one enumerates objects
>> from a first list, the second one recreates an object and insert it in a
>> second list.  If a first thread in this moment stays on the object, it
>> can read "next", when it's in the second list, so it will continue
>> to enumerate objects from the second list. It is not what we want, isn't
>> it?
>>
>> cpu1                          cpu2
>>                               hlist_nulls_for_each_entry_rcu(ct)
>> destroy_conntrack
>>  kmem_cache_free
>>
>> init_conntrack
>>  hlist_nulls_add_head_rcu
>>                               ct = ct->next
>>
>
> This will be fine.
>
> I think we even have a counter to count number of occurrence of this
> rare event. (I personally never read a non null search_restart value)
>
>  NF_CT_STAT_INC(net, search_restart);

Thank you for explanation.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 43549eb..7a34bb2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -387,8 +387,12 @@  begin:
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
+			/* A conntrack can be recreated with the equal tuple,
+			 * so we need to check that the conntrack is initialized
+			 */
 			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
-				     nf_ct_zone(ct) != zone)) {
+				     nf_ct_zone(ct) != zone) ||
+				     !nf_ct_is_confirmed(ct)) {
 				nf_ct_put(ct);
 				goto begin;
 			}