
death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack

Message ID 7353554.n89QJXU3eh@gentoovm
State Not Applicable

Commit Message

Oliver Aug. 27, 2012, 9:33 a.m. UTC
In a previous version of ctnetlink, a race condition could arise because 
ctnetlink_del_conntrack() did not set the IPS_DYING_BIT that is checked by 
death_by_timeout().

I found that on 3.4.9 I could trigger a soft lockup by packet-flooding a pair 
of systems running conntrackd with NetlinkEventsReliable On.

I found that death_by_event() does not currently check the IPS_DYING_BIT, so, 
based on the panic stack trace, I added that bit check to death_by_event() and 
have since been unable to reproduce the crash.

I hope this patch is correct/useful - I'm not intimately familiar with the 
conntrack code, so perhaps I'm breaking reliable event reporting with this 
change.

The kernel panic is as follows:

Aug 24 14:02:39 fw02-lab [ 2544.350016] BUG: soft lockup - CPU#6 stuck for 24s! [conntrackd:5119]
Aug 24 14:02:39 fw02-lab [ 2544.350536] Kernel panic - not syncing: softlockup: hung tasks
Aug 24 14:02:39 fw02-lab [ 2544.350662] Pid: 5119, comm: conntrackd Tainted: G        W    3.4.9 #2
Aug 24 14:02:39 fw02-lab [ 2544.350786] Call Trace:
Aug 24 14:02:39 fw02-lab [ 2544.350903]  <IRQ>
Aug 24 14:02:39 fw02-lab [<ffffffff81683ab2>] ? panic+0xbe/0x1cd
Aug 24 14:02:39 fw02-lab [ 2544.351078]  [<ffffffff810a8493>] ? watchdog_timer_fn+0x173/0x180
Aug 24 14:02:39 fw02-lab [ 2544.351204]  [<ffffffff8106847e>] ? __run_hrtimer.clone.33+0x4e/0x110
Aug 24 14:02:39 fw02-lab [ 2544.351330]  [<ffffffff81068d34>] ? hrtimer_interrupt+0xf4/0x250
Aug 24 14:02:39 fw02-lab [ 2544.351455]  [<ffffffff8101ee43>] ? smp_apic_timer_interrupt+0x63/0xa0
Aug 24 14:02:39 fw02-lab [ 2544.351591]  [<ffffffff816878c7>] ? apic_timer_interrupt+0x67/0x70
Aug 24 14:02:39 fw02-lab [ 2544.351715]  [<ffffffff814ed7a1>] ? __kfree_skb+0x11/0x90
Aug 24 14:02:39 fw02-lab [ 2544.351837]  [<ffffffff815271e3>] ? netlink_broadcast_filtered+0x123/0x3c0
Aug 24 14:02:39 fw02-lab [ 2544.351962]  [<ffffffff8152e22e>] ? death_by_event+0x3e/0x1f0
Aug 24 14:02:39 fw02-lab [ 2544.352085]  [<ffffffff8152e3c5>] ? death_by_event+0x1d5/0x1f0
Aug 24 14:02:39 fw02-lab [ 2544.352209]  [<ffffffff8105445f>] ? run_timer_softirq+0x11f/0x240
Aug 24 14:02:39 fw02-lab [ 2544.352333]  [<ffffffff8152e1f0>] ? nf_conntrack_hash_check_insert+0x270/0x270
Aug 24 14:02:39 fw02-lab [ 2544.352524]  [<ffffffff8104f2c8>] ? __do_softirq+0x98/0x120
Aug 24 14:02:39 fw02-lab [ 2544.352647]  [<ffffffff8168820c>] ? call_softirq+0x1c/0x30
Aug 24 14:02:39 fw02-lab [ 2544.352767]  <EOI>
Aug 24 14:02:39 fw02-lab [<ffffffff8100460d>] ? do_softirq+0x4d/0x80
Aug 24 14:02:39 fw02-lab [ 2544.352938]  [<ffffffff8104f224>] ? local_bh_enable+0x94/0xa0
Aug 24 14:02:39 fw02-lab [ 2544.353061]  [<ffffffff8152dd5d>] ? ____nf_conntrack_find+0x10d/0x120
Aug 24 14:02:39 fw02-lab [ 2544.353186]  [<ffffffff8152ddb9>] ? __nf_conntrack_find_get+0x49/0x170
Aug 24 14:02:39 fw02-lab [ 2544.353311]  [<ffffffff8153876c>] ? ctnetlink_del_conntrack+0xac/0x300
Aug 24 14:02:39 fw02-lab [ 2544.353435]  [<ffffffff81280f70>] ? nla_parse+0x80/0xd0
Aug 24 14:02:39 fw02-lab [ 2544.353558]  [<ffffffff8152c93e>] ? nfnetlink_rcv_msg+0x1ee/0x220
Aug 24 14:02:39 fw02-lab [ 2544.353682]  [<ffffffff8152c77a>] ? nfnetlink_rcv_msg+0x2a/0x220
Aug 24 14:02:39 fw02-lab [ 2544.353806]  [<ffffffff8152c750>] ? nfnl_lock+0x20/0x20
Aug 24 14:02:39 fw02-lab [ 2544.353927]  [<ffffffff81529389>] ? netlink_rcv_skb+0x99/0xc0
Aug 24 14:02:39 fw02-lab [ 2544.354050]  [<ffffffff81528d1f>] ? netlink_unicast+0x1af/0x200
Aug 24 14:02:39 fw02-lab [ 2544.354173]  [<ffffffff81528fa8>] ? netlink_sendmsg+0x238/0x350
Aug 24 14:02:39 fw02-lab [ 2544.354296]  [<ffffffff814e5084>] ? sock_sendmsg+0xe4/0x130
Aug 24 14:02:39 fw02-lab [ 2544.354418]  [<ffffffff814e4eed>] ? sock_recvmsg+0xed/0x140
Aug 24 14:02:39 fw02-lab [ 2544.354542]  [<ffffffff8111f5fd>] ? core_sys_select+0x22d/0x340
Aug 24 14:02:39 fw02-lab [ 2544.354665]  [<ffffffff814e64fb>] ? move_addr_to_kernel+0x2b/0xa0
Aug 24 14:02:39 fw02-lab [ 2544.354788]  [<ffffffff814e4272>] ? sockfd_lookup_light+0x22/0x90
Aug 24 14:02:39 fw02-lab [ 2544.354912]  [<ffffffff814e6fac>] ? sys_sendto+0x13c/0x1a0
Aug 24 14:02:39 fw02-lab [ 2544.355034]  [<ffffffff814ed7a1>] ? __kfree_skb+0x11/0x90
Aug 24 14:02:39 fw02-lab [ 2544.355157]  [<ffffffff8126c574>] ? rb_insert_color+0xa4/0x140
Aug 24 14:02:39 fw02-lab [ 2544.355279]  [<ffffffff81077e27>] ? dequeue_pushable_task+0x27/0x70
Aug 24 14:02:39 fw02-lab [ 2544.355404]  [<ffffffff81686e62>] ? system_call_fastpath+0x16/0x1b
Aug 24 14:02:39 fw02-lab [ 2544.355541] Rebooting in 5 seconds..

Thanks,
Oliver

Comments

Oliver Aug. 30, 2012, 3:09 a.m. UTC | #1
On Thursday 30 August 2012 04:50:09 you wrote:
> Not sure what you mean, you're still crashing with the patch below,
> right?
> 
> My proposal is to give a try to the ecache patch, that requires
> removing the previous patch.

Apologies for the confusion; the patch quoted is essentially the first patch 
you provided me, with my changes to make it work on 3.4.10, *plus* the removal 
of your change to nf_conntrack_ecache.h that deleted the nf_ct_is_dying() 
check (i.e. I have left that check in). With this modification, I find that 
conntrackd is well-behaved and I have so far been unable to cause a kernel 
panic.

Having tested your latest patch, I can confirm that it does not crash either, 
including when the conntrack table is exhausted.

In terms of overall stability, I would presume your latest patch is superior 
to the previous one (i.e. what I attached most recently)?

Kind Regards,
Oliver
Pablo Neira Ayuso Aug. 30, 2012, 10:34 a.m. UTC | #2
On Thu, Aug 30, 2012 at 05:09:01AM +0200, Oliver wrote:
> On Thursday 30 August 2012 04:50:09 you wrote:
> > Not sure what you mean, you're still crashing with the patch below,
> > right?
> > 
> > My proposal is to give a try to the ecache patch, that requires
> > removing the previous patch.
> 
> Apologies for the confusion;  the patch quoted is essentially the first patch 
> you provided me, with my changes to make it work in 3.4.10 *plus* the deletion 
> of the change to nf_conntrack_ecache.h where your patch deleted the 
> nf_ct_is_dying() check (i.e I have this check left in) - with this 
> modification, I find that conntrackd is well-behaved and I have thus far not 
> successfully caused a kernel panic.
> 
> Having tested your latest patch, I can also confirm that it also does not 
> crash, including at exhaustion of the conntrack table.
>
> In terms of overall stability, I would presume your latest patch is superior 
> to the previous (i.e. what I attached most recently) ?

Yes, I prefer the second patch. There are still races in the first patch I
sent you; they are harder to trigger, but still there.

There are several cleanups I'd like to recover from the first patch
though. Would you help test them?

Thanks a lot for testing.
Oliver Aug. 30, 2012, 12:28 p.m. UTC | #3
On Thursday 30 August 2012 12:34:37 you wrote:
> Yes, I prefer the second patch. There is still races in the first
> patch I sent you, harder to trigger, but still there.
> 
> There are several cleanups I'd like to recover from the first patch
> though. Would you help testing them?
> 
> Thanks a lot for testing.

Hi Pablo,

Yep, I'd be happy to test. I've also uncovered a new issue: I have two 
Active-Active machines (conntrackd running in NOTRACK mode with both the 
External and Internal caches disabled).

Under kernel 3.2 this pair works asymmetrically and issue-free. Upgrade it to 
3.4 and it immediately sees around 50% of TCP connection attempts fail for 
systems behind it - ICMP, on the other hand, is flawless, and DNS lookups are 
also OK, so I *believe* UDP may be performing well too. I've no idea where to 
even look on this one, so any insight would be much appreciated.

Kind Regards,
Oliver
Oliver Aug. 30, 2012, 12:39 p.m. UTC | #4
On Thursday 30 August 2012 14:28:20 Oliver wrote:
> Yep, I'd be happy to test. I've also uncovered a new issue: I have two
> Active- Active machines (conntrackd running NOTRACK mode with both External
> and Internal cache disabled)
> 
> In kernel 3.2 this pair works asymmetric and issue-free. Upgrade it to 3.4
> and it immediately has around 50% failure of TCP connection attempts on
> systems behind them - ICMP on the other hand is flawless, DNS lookups also
> are OK so I *believe* that UDP may also be performing well - I've no idea
> where to even look on this one so any insight would be most appreciated.
> 
> Kind Regards,
> Oliver

Another thing that just came to mind: I configured raw/PREROUTING with 
-j CT --notrack for TCP port 80 (source and destination) on the appropriate 
interfaces, and this resulted in total packet loss despite the fact that I 
had --ctstate UNTRACKED set to ACCEPT. Again, this behaviour only occurs 
under 3.4.[9|10] (probably earlier too, but I didn't test).
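
For reference, the rules were roughly of the following shape (a reconstruction 
for illustration only - the interface name, chain choice and ACCEPT placement 
are assumptions rather than my exact ruleset):

  # raw table: exempt port-80 TCP from connection tracking in both directions
  iptables -t raw -A PREROUTING -i eth1 -p tcp --dport 80 -j CT --notrack
  iptables -t raw -A PREROUTING -i eth1 -p tcp --sport 80 -j CT --notrack

  # filter table: explicitly accept the untracked traffic
  iptables -A FORWARD -p tcp -m conntrack --ctstate UNTRACKED -j ACCEPT
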
Pablo Neira Ayuso Aug. 30, 2012, 4:22 p.m. UTC | #5
Hi Oliver,

On Thu, Aug 30, 2012 at 02:28:20PM +0200, Oliver wrote:
> On Thursday 30 August 2012 12:34:37 you wrote:
> > Yes, I prefer the second patch. There is still races in the first
> > patch I sent you, harder to trigger, but still there.
> > 
> > There are several cleanups I'd like to recover from the first patch
> > though. Would you help testing them?
> > 
> > Thanks a lot for testing.
> 
> HI Pablo,
> 
> Yep, I'd be happy to test. I've also uncovered a new issue: I have two Active-
> Active machines (conntrackd running NOTRACK mode with both External and 
> Internal cache disabled)

Thanks. I'll send you patches then.

> In kernel 3.2 this pair works asymmetric and issue-free. Upgrade it to 3.4 and 
> it immediately has around 50% failure of TCP connection attempts on systems 
> behind them - ICMP on the other hand is flawless, DNS lookups also are OK so I 
> *believe* that UDP may also be performing well - I've no idea where to even 
> look on this one so any insight would be most appreciated.

Unfortunately, asymmetric active-active is a crazy setup for conntrack
(the documentation already discusses this). The state synchronization we
are doing is asynchronous, so state updates race with the TCP packets.
We don't support this, sorry.

We can support active-active with hash-based load-sharing using the
cluster match / arptables, which seems more sane to me; the theory is
described here:

http://1984.lsi.us.es/~pablo/docs/intcomp09.pdf
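
As a rough sketch of the iptables side (two nodes, eth1 facing the clients;
the seed, mark value and interface are illustrative, and the arptables rules
are omitted), node 1 would run something like:

  iptables -t mangle -A PREROUTING -i eth1 -m cluster \
      --cluster-total-nodes 2 --cluster-local-node 1 \
      --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
  iptables -t mangle -A PREROUTING -i eth1 -m mark ! --mark 0xffff -j DROP

and node 2 the same but with --cluster-local-node 2, so each node only
accepts its own share of the flows.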

However, there is no complete documentation yet on how to set this up. Last
time I looked at existing HA daemons, I didn't find that they support
active-active setups very well, so they require some changes / we need some
small new HA daemon for this.

I need to work on active-active load-sharing, to fully document and support
it. That's not at the top of my priority list at the moment, though.

Another (simpler) alternative, in case your firewalls have two IPs, is to
statically distribute the load between your firewalls by assigning different
gateway IPs to your client nodes. That should not be hard to deploy.
Oliver Aug. 30, 2012, 5:49 p.m. UTC | #6
On Thursday 30 August 2012 18:22:48 you wrote:
> Unfortunately, asymmetric active-active is a crazy setup for conntrack
> (documentation already discuss this). The state synchronization that
> we are doing is asynchronous, so state-updates race with TCP packet.
> We don't support this, sorry.

The environment does fulfil the assumptions necessary for replication to happen 
within the handshake, so under 3.2 there is no issue with handshakes completing 
over an asymmetric path.

Nonetheless, what doesn't make sense is that this works under 3.2 and not 
under 3.4 - nor does the failure with "-j CT --notrack" applied to specific 
traffic (asymmetry should not matter there, because there is no stateful 
tracking).

Kind Regards,
Oliver
Pablo Neira Ayuso Aug. 30, 2012, 6:39 p.m. UTC | #7
On Thu, Aug 30, 2012 at 07:49:24PM +0200, Oliver wrote:
> On Thursday 30 August 2012 18:22:48 you wrote:
> > Unfortunately, asymmetric active-active is a crazy setup for conntrack
> > (documentation already discuss this). The state synchronization that
> > we are doing is asynchronous, so state-updates race with TCP packet.
> > We don't support this, sorry.
> 
> The environment does fulfil the assumptions necessary for replication to happen 
> within the handshake so under 3.2 there is no issue with handshakes completing 
> under an asymmetric path.

Interesting, how are those assumptions fulfilled?

> Nonetheless, what doesn't make sense is that this operates under 3.2 and not 
> 3.4 - also is the fact that having a "-j CT --notrack" on specific traffic (i.e. 
> asymmetric should not matter because there is no stateful tracking)

Agreed. But I can't come up with any netfilter change that may result in
the problem you're reporting. You'll have to debug this and get back
to me with more information.
Oliver Aug. 31, 2012, 12:19 a.m. UTC | #8
On Thursday 30 August 2012 20:39:50 Pablo Neira Ayuso wrote:
> Interesting, how are those assumptions fulfilled?

Well, timing of course ;) - essentially, traffic paths are guaranteed to 
persist for longer than the time it takes to replicate the conntrack state.


> Agreed. But I don't come with any netfilter change that may result in
> that problem you're reporting. You'll have to debug this and get back
> to me with more information.

You can disregard this; it turned out to be due to the unfortunate fact that 
net.ipv4.netfilter.ip_conntrack_tcp_be_liberal is of course replaced by 
net.netfilter.nf_conntrack_tcp_be_liberal under 3.4.
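
For anyone else hitting this, the change is just in the sysctl name - the 
value below is shown purely for illustration:

  # old ipv4-specific compat name
  sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
  # generic name under 3.4
  sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1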

Please feel free to send me your latest rework of the patch and I will be 
happy to test it out.

Kind Regards,
Oliver
Pablo Neira Ayuso Aug. 31, 2012, 9:27 a.m. UTC | #9
On Fri, Aug 31, 2012 at 02:19:36AM +0200, Oliver wrote:
> On Thursday 30 August 2012 20:39:50 Pablo Neira Ayuso wrote:
> > Interesting, how are those assumptions fulfilled?
> 
> Well, timing of course ;) - essentially, traffic paths are ensured longer than 
> the actual time for replication of conntrack state.

I see.

> > Agreed. But I don't come with any netfilter change that may result in
> > that problem you're reporting. You'll have to debug this and get back
> > to me with more information.
> 
> You can disregard this, turned out to be due to the unfortunate fact that 
> net.ipv4.netfilter.ip_conntrack_tcp_be_liberal is of course replaced by 
> net.netfilter.nf_conntrack_tcp_be_liberal under 3.4

$ ls /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal 
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

Probably you forgot to set CONFIG_NF_CONNTRACK_PROC_COMPAT=y

We haven't removed it yet, although it shouldn't be bad to schedule this
for removal.

> Please feel free to send me your latest rework of the patch and I will be 
> happy to test it out.

Will do.

Patch

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 729f157..5c274f3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -250,7 +250,8 @@  static void death_by_event(unsigned long ul_conntrack)
 	struct nf_conn *ct = (void *)ul_conntrack;
 	struct net *net = nf_ct_net(ct);
 
-	if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
+	if (!test_bit(IPS_DYING_BIT, &ct->status) &&
+	    nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
 		/* bad luck, let's retry again */
 		ct->timeout.expires = jiffies +
 			(random32() % net->ct.sysctl_events_retry_timeout);