diff mbox

scheduling while atomic followed by oops upon conntrackd -c execution

Message ID 20120304110151.GA22404@1984
State Accepted
Headers show

Commit Message

Pablo Neira Ayuso March 4, 2012, 11:01 a.m. UTC
Hi Kerin,

On Sat, Mar 03, 2012 at 06:47:27PM +0000, Kerin Millar wrote:
> Hi,
> 
> On 03/03/2012 13:30, Pablo Neira Ayuso wrote:
> >I just posted another patch to the ML that is a relative fix to
> >Jozsef's patch. You have to apply that as well.
> 
> I've now tested 3.3-rc5 with the addition of the above mentioned
> follow-on patch. The behaviour during conntrackd -c execution is
> clearly much improved - in so far as it doesn't generate much noise
> - but the crash that follows remains. Here's a netconsole capture:-
> 
> http://paste.pocoo.org/raw/560439/

Great to know :-).

Regarding your previous email, I'm sorry, by reading your email I
thought you were using 2.6.32 which was not the case, your
configuration is perfectly reasonable.

It seems we still have problems regarding early_drop, but this time
with reliable event delivery enabled (15 seconds is the time that
is required to retry sending the destroy event).

If you can test the following patch, I'll appreciate.

Thank you.

Comments

Kerin Millar March 5, 2012, 5:19 p.m. UTC | #1
Hi Pablo,

On 04/03/2012 11:01, Pablo Neira Ayuso wrote:
> Hi Kerin,
>
> On Sat, Mar 03, 2012 at 06:47:27PM +0000, Kerin Millar wrote:
>> Hi,
>>
>> On 03/03/2012 13:30, Pablo Neira Ayuso wrote:
>>> I just posted another patch to the ML that is a relative fix to
>>> Jozsef's patch. You have to apply that as well.
>>
>> I've now tested 3.3-rc5 with the addition of the above mentioned
>> follow-on patch. The behaviour during conntrackd -c execution is
>> clearly much improved - in so far as it doesn't generate much noise
>> - but the crash that follows remains. Here's a netconsole capture:-
>>
>> http://paste.pocoo.org/raw/560439/
>
> Great to know :-).

I apologize but I think I may have led you astray on the nf_nat issue. 
At the time of submitting my original report, I now believe that the 
nf_nat module wasn't loaded prior to starting conntrackd, although it 
was definitely available. For all tests that followed, however, I am 
entirely certain the the nf_nat module was loaded in advance. The upshot 
is that my claim that things had improved may have been premature; I 
need to specifically test under both circumstances to be sure that 
things are improving. That is, both with and without the module loaded 
in advance.

Following my own advice then, I first tried going through my test case 
*without* loading nf_nat in advance. Alas, conntrackd -c triggered hard 
lockups and didn't return to prompt. Here are the results:-

http://paste.pocoo.org/raw/561350/

In case it matters, the existing ssh session continued to respond to 
input but I was no longer able to initiate any new sessions.

>
> Regarding your previous email, I'm sorry, by reading your email I
> thought you were using 2.6.32 which was not the case, your
> configuration is perfectly reasonable.
>
> It seems we still have problems regarding early_drop, but this time
> with reliable event delivery enabled (15 seconds is the time that
> is required to retry sending the destroy event).
>
> If you can test the following patch, I'll appreciate.

Gladly. I applied the patch to my 3.3-rc5 tree, which is still carrying 
the two patches discussed earlier in the thread. I then went through my 
test case under normal circumstances i.e. all firewall rules in place, 
nf_nat confirmed present before conntrackd etc. Again, conntrackd -c did 
not return to prompt. Here are the results:-

http://paste.pocoo.org/raw/561354/

Well, at least there was no oops this time. I should also add that the 
patch was present for both of the tests mentioned in this email.

---
Incidentally, I found out why the internal cache on the master was 
filling up to capacity. It was apparently due to the use of "iptables -I 
PREROUTING -t raw -j CT --ctevents assured". Perhaps I'm missing 
something but doesn't this stop events such as new and destroy from 
being propagated? An inspection with conntrack -E suggests so. Once I 
removed the above rule, I could see destroy events being propagated and 
the number of active connections in the cache no longer exceeded my 
chosen limit of 2097152 ...

# conntrack -S | head -n1; conntrackd -s | head -n2
entries                 725826
cache internal:
current active connections:          1409472

Whatever the case, I'm quite happy to go without this rule as these 
systems are coping fine with the load incurred by conntrackd.

Cheers,

--Kerin

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso March 6, 2012, 11:14 a.m. UTC | #2
Hi Kerin,

On Mon, Mar 05, 2012 at 05:19:49PM +0000, Kerin Millar wrote:
> Hi Pablo,
> 
> On 04/03/2012 11:01, Pablo Neira Ayuso wrote:
> >Hi Kerin,
> >
> >On Sat, Mar 03, 2012 at 06:47:27PM +0000, Kerin Millar wrote:
> >>Hi,
> >>
> >>On 03/03/2012 13:30, Pablo Neira Ayuso wrote:
> >>>I just posted another patch to the ML that is a relative fix to
> >>>Jozsef's patch. You have to apply that as well.
> >>
> >>I've now tested 3.3-rc5 with the addition of the above mentioned
> >>follow-on patch. The behaviour during conntrackd -c execution is
> >>clearly much improved - in so far as it doesn't generate much noise
> >>- but the crash that follows remains. Here's a netconsole capture:-
> >>
> >>http://paste.pocoo.org/raw/560439/
> >
> >Great to know :-).
> 
> I apologize but I think I may have led you astray on the nf_nat
> issue. At the time of submitting my original report, I now believe
> that the nf_nat module wasn't loaded prior to starting conntrackd,
> although it was definitely available. For all tests that followed,
> however, I am entirely certain the the nf_nat module was loaded in
> advance. The upshot is that my claim that things had improved may
> have been premature; I need to specifically test under both
> circumstances to be sure that things are improving. That is, both
> with and without the module loaded in advance.
> 
> Following my own advice then, I first tried going through my test
> case *without* loading nf_nat in advance. Alas, conntrackd -c
> triggered hard lockups and didn't return to prompt. Here are the
> results:-
> 
> http://paste.pocoo.org/raw/561350/
> 
> In case it matters, the existing ssh session continued to respond to
> input but I was no longer able to initiate any new sessions.
> 
> >
> >Regarding your previous email, I'm sorry, by reading your email I
> >thought you were using 2.6.32 which was not the case, your
> >configuration is perfectly reasonable.
> >
> >It seems we still have problems regarding early_drop, but this time
> >with reliable event delivery enabled (15 seconds is the time that
> >is required to retry sending the destroy event).
> >
> >If you can test the following patch, I'll appreciate.
> 
> Gladly. I applied the patch to my 3.3-rc5 tree, which is still
> carrying the two patches discussed earlier in the thread. I then
> went through my test case under normal circumstances i.e. all
> firewall rules in place, nf_nat confirmed present before conntrackd
> etc. Again, conntrackd -c did not return to prompt. Here are the
> results:-
> 
> http://paste.pocoo.org/raw/561354/
>
> Well, at least there was no oops this time. I should also add that
> the patch was present for both of the tests mentioned in this email.

Previous patch that I sent you was not OK, sorry. I have committed the
following to my git tree:

http://1984.lsi.us.es/git/net/commit/?id=691d47b2dc8fdb8fea5a2b59c46e70363fa66897

I've been using the following tools that you can find enclosed to this
email, they are much more simple than conntrackd but, they do the same
in essence:

* conntrack_stress.c
* conntrack_events.c

gcc -lnetfilter_conntrack conntrack_stress.c -o ct_stress
gcc -lnetfilter_conntrack conntrack_events.c -o ct_events

Then, to listen to events with reliable event delivery enabled:

# ./ct_events &

And to create loads of flow entries in ASSURED state:

# ./ct_stress 65535 # that's my ct table size in my laptop

You'll hit ENOMEM errors at some point, that's fine, but no oops or
lockups happen here.

I have pushed this tools to the qa/ directory under
libnetfilter_conntrack:

commit 94e75add9867fb6f0e05e73b23f723f139da829e
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Tue Mar 6 12:10:55 2012 +0100

    qa: add some stress tools to test conntrack via ctnetlink

(BTW, ct_stress may disrupt your network connection since the table
gets filled. You can use conntrack -F to get the ct table empty again).

> ---
> Incidentally, I found out why the internal cache on the master was
> filling up to capacity. It was apparently due to the use of
> "iptables -I PREROUTING -t raw -j CT --ctevents assured". Perhaps
> I'm missing something but doesn't this stop events such as new and
> destroy from being propagated? An inspection with conntrack -E
> suggests so. Once I removed the above rule, I could see destroy
> events being propagated and the number of active connections in the
> cache no longer exceeded my chosen limit of 2097152 ...

Yes, that line was wrong, I have fixed in the documentation, the
correct one must be:

iptables -I PREROUTING -t raw -j CT --ctevents assured,destroy

Thus, destroy events are delivered to user-space.

> # conntrack -S | head -n1; conntrackd -s | head -n2
> entries                 725826
> cache internal:
> current active connections:          1409472
> 
> Whatever the case, I'm quite happy to go without this rule as these
> systems are coping fine with the load incurred by conntrackd.

I want to get things fixed, please, don't give up on using that rule
yet :-).

Regarding the hardlockups. I'd be happy if you can re-do the tests,
both with conntrackd and the tools that I send you.

Make sure you have these three patches, note that the last one has
changed.

http://1984.lsi.us.es/git/net/commit/?id=7d367e06688dc7a2cc98c2ace04e1296e1d987e2
http://1984.lsi.us.es/git/net/commit/?id=a8f341e98a46f579061fabfe6ea50be3d0eb2c60
http://1984.lsi.us.es/git/net/commit/?id=691d47b2dc8fdb8fea5a2b59c46e70363fa66897

Thanks!
Kerin Millar March 6, 2012, 4:42 p.m. UTC | #3
Hi Pablo,

On 06/03/2012 11:14, Pablo Neira Ayuso wrote:

<snip>

>> Gladly. I applied the patch to my 3.3-rc5 tree, which is still
>> carrying the two patches discussed earlier in the thread. I then
>> went through my test case under normal circumstances i.e. all
>> firewall rules in place, nf_nat confirmed present before conntrackd
>> etc. Again, conntrackd -c did not return to prompt. Here are the
>> results:-
>>
>> http://paste.pocoo.org/raw/561354/
>>
>> Well, at least there was no oops this time. I should also add that
>> the patch was present for both of the tests mentioned in this email.
>
> Previous patch that I sent you was not OK, sorry. I have committed the
> following to my git tree:
>
> http://1984.lsi.us.es/git/net/commit/?id=691d47b2dc8fdb8fea5a2b59c46e70363fa66897

Noted.

>
> I've been using the following tools that you can find enclosed to this
> email, they are much more simple than conntrackd but, they do the same
> in essence:
>
> * conntrack_stress.c
> * conntrack_events.c
>
> gcc -lnetfilter_conntrack conntrack_stress.c -o ct_stress
> gcc -lnetfilter_conntrack conntrack_events.c -o ct_events
>
> Then, to listen to events with reliable event delivery enabled:
>
> # ./ct_events&
>
> And to create loads of flow entries in ASSURED state:
>
> # ./ct_stress 65535 # that's my ct table size in my laptop
>
> You'll hit ENOMEM errors at some point, that's fine, but no oops or
> lockups happen here.
>
> I have pushed this tools to the qa/ directory under
> libnetfilter_conntrack:
>
> commit 94e75add9867fb6f0e05e73b23f723f139da829e
> Author: Pablo Neira Ayuso<pablo@netfilter.org>
> Date:   Tue Mar 6 12:10:55 2012 +0100
>
>      qa: add some stress tools to test conntrack via ctnetlink
>
> (BTW, ct_stress may disrupt your network connection since the table
> gets filled. You can use conntrack -F to get the ct table empty again).
>

Sorry if this is a silly question but should conntrackd be running while 
I conduct this stress test? If so, is there any danger of the master 
becoming unstable? I must ask because, if the stability of the master is 
compromised, I will be in big trouble ;)

<snip>

> Yes, that line was wrong, I have fixed in the documentation, the
> correct one must be:
>
> iptables -I PREROUTING -t raw -j CT --ctevents assured,destroy
>
> Thus, destroy events are delivered to user-space.
>
>> # conntrack -S | head -n1; conntrackd -s | head -n2
>> entries                 725826
>> cache internal:
>> current active connections:          1409472
>>
>> Whatever the case, I'm quite happy to go without this rule as these
>> systems are coping fine with the load incurred by conntrackd.
>
> I want to get things fixed, please, don't give up on using that rule
> yet :-).

Sure. I've re-instated the rule as requested. With the addition of 
destroy events, cache usage remains under control.

>
> Regarding the hardlockups. I'd be happy if you can re-do the tests,
> both with conntrackd and the tools that I send you.
>
> Make sure you have these three patches, note that the last one has
> changed.
>
> http://1984.lsi.us.es/git/net/commit/?id=7d367e06688dc7a2cc98c2ace04e1296e1d987e2
> http://1984.lsi.us.es/git/net/commit/?id=a8f341e98a46f579061fabfe6ea50be3d0eb2c60
> http://1984.lsi.us.es/git/net/commit/?id=691d47b2dc8fdb8fea5a2b59c46e70363fa66897
>

Duly applied to a fresh 3.3-rc5 tree.

Cheers,

--Kerin

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso March 6, 2012, 5:23 p.m. UTC | #4
On Tue, Mar 06, 2012 at 04:42:02PM +0000, Kerin Millar wrote:
> Hi Pablo,
> 
> On 06/03/2012 11:14, Pablo Neira Ayuso wrote:
> 
> <snip>
> 
> >>Gladly. I applied the patch to my 3.3-rc5 tree, which is still
> >>carrying the two patches discussed earlier in the thread. I then
> >>went through my test case under normal circumstances i.e. all
> >>firewall rules in place, nf_nat confirmed present before conntrackd
> >>etc. Again, conntrackd -c did not return to prompt. Here are the
> >>results:-
> >>
> >>http://paste.pocoo.org/raw/561354/
> >>
> >>Well, at least there was no oops this time. I should also add that
> >>the patch was present for both of the tests mentioned in this email.
> >
> >Previous patch that I sent you was not OK, sorry. I have committed the
> >following to my git tree:
> >
> >http://1984.lsi.us.es/git/net/commit/?id=691d47b2dc8fdb8fea5a2b59c46e70363fa66897
> 
> Noted.
> 
> >
> >I've been using the following tools that you can find enclosed to this
> >email, they are much more simple than conntrackd but, they do the same
> >in essence:
> >
> >* conntrack_stress.c
> >* conntrack_events.c
> >
> >gcc -lnetfilter_conntrack conntrack_stress.c -o ct_stress
> >gcc -lnetfilter_conntrack conntrack_events.c -o ct_events
> >
> >Then, to listen to events with reliable event delivery enabled:
> >
> ># ./ct_events&
> >
> >And to create loads of flow entries in ASSURED state:
> >
> ># ./ct_stress 65535 # that's my ct table size in my laptop
> >
> >You'll hit ENOMEM errors at some point, that's fine, but no oops or
> >lockups happen here.
> >
> >I have pushed this tools to the qa/ directory under
> >libnetfilter_conntrack:
> >
> >commit 94e75add9867fb6f0e05e73b23f723f139da829e
> >Author: Pablo Neira Ayuso<pablo@netfilter.org>
> >Date:   Tue Mar 6 12:10:55 2012 +0100
> >
> >     qa: add some stress tools to test conntrack via ctnetlink
> >
> >(BTW, ct_stress may disrupt your network connection since the table
> >gets filled. You can use conntrack -F to get the ct table empty again).
> >
> 
> Sorry if this is a silly question but should conntrackd be running
> while I conduct this stress test? If so, is there any danger of the
> master becoming unstable? I must ask because, if the stability of
> the master is compromised, I will be in big trouble ;)

If you run this in the backup, conntrackd will spam the master with
lots of new flows in the external cache. That shouldn't be a problem
(just a bit of extra load invested in the replication).

But if you run this in the master, my test will fill the ct table
with lots of assured flows. Thus, packets that belong new flows will
be likely dropped in that node.

> >Yes, that line was wrong, I have fixed in the documentation, the
> >correct one must be:
> >
> >iptables -I PREROUTING -t raw -j CT --ctevents assured,destroy
> >
> >Thus, destroy events are delivered to user-space.
> >
> >># conntrack -S | head -n1; conntrackd -s | head -n2
> >>entries                 725826
> >>cache internal:
> >>current active connections:          1409472
> >>
> >>Whatever the case, I'm quite happy to go without this rule as these
> >>systems are coping fine with the load incurred by conntrackd.
> >
> >I want to get things fixed, please, don't give up on using that rule
> >yet :-).
> 
> Sure. I've re-instated the rule as requested. With the addition of
> destroy events, cache usage remains under control.
> 
> >
> >Regarding the hardlockups. I'd be happy if you can re-do the tests,
> >both with conntrackd and the tools that I send you.
> >
> >Make sure you have these three patches, note that the last one has
> >changed.
> >
> >http://1984.lsi.us.es/git/net/commit/?id=7d367e06688dc7a2cc98c2ace04e1296e1d987e2
> >http://1984.lsi.us.es/git/net/commit/?id=a8f341e98a46f579061fabfe6ea50be3d0eb2c60
> >http://1984.lsi.us.es/git/net/commit/?id=691d47b2dc8fdb8fea5a2b59c46e70363fa66897
> >
> 
> Duly applied to a fresh 3.3-rc5 tree.
> 
> Cheers,
> 
> --Kerin
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kerin Millar March 6, 2012, 10:37 p.m. UTC | #5
Hi Pablo,

On 06/03/2012 17:23, Pablo Neira Ayuso wrote:

<snip>

>>> I've been using the following tools that you can find enclosed to this
>>> email, they are much more simple than conntrackd but, they do the same
>>> in essence:
>>>
>>> * conntrack_stress.c
>>> * conntrack_events.c
>>>
>>> gcc -lnetfilter_conntrack conntrack_stress.c -o ct_stress
>>> gcc -lnetfilter_conntrack conntrack_events.c -o ct_events
>>>
>>> Then, to listen to events with reliable event delivery enabled:
>>>
>>> # ./ct_events&
>>>
>>> And to create loads of flow entries in ASSURED state:
>>>
>>> # ./ct_stress 65535 # that's my ct table size in my laptop
>>>
>>> You'll hit ENOMEM errors at some point, that's fine, but no oops or
>>> lockups happen here.
>>>
>>> I have pushed this tools to the qa/ directory under
>>> libnetfilter_conntrack:
>>>
>>> commit 94e75add9867fb6f0e05e73b23f723f139da829e
>>> Author: Pablo Neira Ayuso<pablo@netfilter.org>
>>> Date:   Tue Mar 6 12:10:55 2012 +0100
>>>
>>>      qa: add some stress tools to test conntrack via ctnetlink
>>>
>>> (BTW, ct_stress may disrupt your network connection since the table
>>> gets filled. You can use conntrack -F to get the ct table empty again).
>>>
>>
>> Sorry if this is a silly question but should conntrackd be running
>> while I conduct this stress test? If so, is there any danger of the
>> master becoming unstable? I must ask because, if the stability of
>> the master is compromised, I will be in big trouble ;)
>
> If you run this in the backup, conntrackd will spam the master with
> lots of new flows in the external cache. That shouldn't be a problem
> (just a bit of extra load invested in the replication).
>
> But if you run this in the master, my test will fill the ct table
> with lots of assured flows. Thus, packets that belong new flows will
> be likely dropped in that node.

That makes sense. So, I rebooted the backup with the latest kernel 
build, ran my iptables script then started conntrackd. I was not able to 
destabilize the system through the use of your stress tool. The sequence 
of commands used to invoke the ct_stress tool was as follows:-

1) ct_stress 2097152
2) ct_stress 2097152
3) ct_stress 1048576

There were indeed a lot of ENOMEM errors, and messages warning that the 
conntrack table was full with packets being dropped. Nothing surprising.

I then tried my test case again. The exact sequence of commands was as 
follows:-

4) conntrackd -n
5) conntrackd -c
6) conntrackd -f internal
7) conntrackd -F
8) conntrackd -n
9) conntrackd -c

It didn't crash after the 5th step (to my amazement) but it did after 
the 9th. Here's a netconsole log covering all of the above:

http://paste.pocoo.org/raw/562136/

The invalid opcode error was also present in the log that I provided 
with my first post in this thread.

For some reason, I couldn't capture stdout from your ct_events tool but 
here's as much as I was able to copy and paste before it stopped 
responding completely.

2100000 events received (2 new, 1048702 destroy)
2110000 events received (2 new, 1048706 destroy)
2120000 events received (2 new, 1048713 destroy)
2130000 events received (2 new, 1048722 destroy)
2140000 events received (2 new, 1048735 destroy)
2150000 events received (2 new, 1048748 destroy)
2160000 events received (2 new, 1048776 destroy)
2170000 events received (2 new, 1048797 destroy)
2180000 events received (2 new, 1048830 destroy)
2190000 events received (2 new, 1048872 destroy)
2200000 events received (2 new, 1048909 destroy)
2210000 events received (2 new, 1048945 destroy)
2220000 events received (2 new, 1048985 destroy)
2230000 events received (2 new, 1049039 destroy)
2240000 events received (2 new, 1049102 destroy)
2250000 events received (2 new, 1049170 destroy)
2260000 events received (2 new, 1049238 destroy)
2270000 events received (2 new, 1049292 destroy)
2280000 events received (2 new, 1049347 destroy)
2290000 events received (2 new, 1049423 destroy)
2300000 events received (2 new, 1049490 destroy)
2310000 events received (2 new, 1049563 destroy)
2320000 events received (2 new, 1049646 destroy)
2330000 events received (2 new, 1049739 destroy)
2340000 events received (2 new, 1049819 destroy)
2350000 events received (2 new, 1049932 destroy)
2360000 events received (2 new, 1050040 destroy)
2370000 events received (2 new, 1050153 destroy)
2380000 events received (2 new, 1050293 destroy)
2390000 events received (2 new, 1050405 destroy)
2400000 events received (2 new, 1050535 destroy)
2410000 events received (2 new, 1050661 destroy)
2420000 events received (2 new, 1050786 destroy)
2430000 events received (2 new, 1050937 destroy)
2440000 events received (2 new, 1051085 destroy)
2450000 events received (2 new, 1051226 destroy)
2460000 events received (2 new, 1051378 destroy)
2470000 events received (2 new, 1051542 destroy)
2480000 events received (2 new, 1051693 destroy)
2490000 events received (2 new, 1051852 destroy)
2500000 events received (2 new, 1052008 destroy)
2510000 events received (2 new, 1052185 destroy)
2520000 events received (2 new, 1052373 destroy)
2530000 events received (2 new, 1052569 destroy)
2540000 events received (2 new, 1052770 destroy)
2550000 events received (2 new, 1052978 destroy)

Cheers,

--Kerin

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kerin Millar March 7, 2012, 2:41 p.m. UTC | #6
Hi Pablo,

To follow up briefly (at the end of this message) ...

On 06/03/2012 22:37, Kerin Millar wrote:
> Hi Pablo,
>
> On 06/03/2012 17:23, Pablo Neira Ayuso wrote:
>
> <snip>
>
>>>> I've been using the following tools that you can find enclosed to this
>>>> email, they are much more simple than conntrackd but, they do the same
>>>> in essence:
>>>>
>>>> * conntrack_stress.c
>>>> * conntrack_events.c
>>>>
>>>> gcc -lnetfilter_conntrack conntrack_stress.c -o ct_stress
>>>> gcc -lnetfilter_conntrack conntrack_events.c -o ct_events
>>>>
>>>> Then, to listen to events with reliable event delivery enabled:
>>>>
>>>> # ./ct_events&
>>>>
>>>> And to create loads of flow entries in ASSURED state:
>>>>
>>>> # ./ct_stress 65535 # that's my ct table size in my laptop
>>>>
>>>> You'll hit ENOMEM errors at some point, that's fine, but no oops or
>>>> lockups happen here.
>>>>
>>>> I have pushed this tools to the qa/ directory under
>>>> libnetfilter_conntrack:
>>>>
>>>> commit 94e75add9867fb6f0e05e73b23f723f139da829e
>>>> Author: Pablo Neira Ayuso<pablo@netfilter.org>
>>>> Date: Tue Mar 6 12:10:55 2012 +0100
>>>>
>>>> qa: add some stress tools to test conntrack via ctnetlink
>>>>
>>>> (BTW, ct_stress may disrupt your network connection since the table
>>>> gets filled. You can use conntrack -F to get the ct table empty again).
>>>>
>>>
>>> Sorry if this is a silly question but should conntrackd be running
>>> while I conduct this stress test? If so, is there any danger of the
>>> master becoming unstable? I must ask because, if the stability of
>>> the master is compromised, I will be in big trouble ;)
>>
>> If you run this in the backup, conntrackd will spam the master with
>> lots of new flows in the external cache. That shouldn't be a problem
>> (just a bit of extra load invested in the replication).
>>
>> But if you run this in the master, my test will fill the ct table
>> with lots of assured flows. Thus, packets that belong new flows will
>> be likely dropped in that node.
>
> That makes sense. So, I rebooted the backup with the latest kernel
> build, ran my iptables script then started conntrackd. I was not able to
> destabilize the system through the use of your stress tool. The sequence
> of commands used to invoke the ct_stress tool was as follows:-
>
> 1) ct_stress 2097152
> 2) ct_stress 2097152
> 3) ct_stress 1048576
>
> There were indeed a lot of ENOMEM errors, and messages warning that the
> conntrack table was full with packets being dropped. Nothing surprising.
>
> I then tried my test case again. The exact sequence of commands was as
> follows:-
>
> 4) conntrackd -n
> 5) conntrackd -c
> 6) conntrackd -f internal
> 7) conntrackd -F
> 8) conntrackd -n
> 9) conntrackd -c
>
> It didn't crash after the 5th step (to my amazement) but it did after
> the 9th. Here's a netconsole log covering all of the above:
>
> http://paste.pocoo.org/raw/562136/
>
> The invalid opcode error was also present in the log that I provided
> with my first post in this thread.
>
> For some reason, I couldn't capture stdout from your ct_events tool but
> here's as much as I was able to copy and paste before it stopped
> responding completely.
>
> 2100000 events received (2 new, 1048702 destroy)
> 2110000 events received (2 new, 1048706 destroy)
> 2120000 events received (2 new, 1048713 destroy)
> 2130000 events received (2 new, 1048722 destroy)
> 2140000 events received (2 new, 1048735 destroy)
> 2150000 events received (2 new, 1048748 destroy)
> 2160000 events received (2 new, 1048776 destroy)
> 2170000 events received (2 new, 1048797 destroy)
> 2180000 events received (2 new, 1048830 destroy)
> 2190000 events received (2 new, 1048872 destroy)
> 2200000 events received (2 new, 1048909 destroy)
> 2210000 events received (2 new, 1048945 destroy)
> 2220000 events received (2 new, 1048985 destroy)
> 2230000 events received (2 new, 1049039 destroy)
> 2240000 events received (2 new, 1049102 destroy)
> 2250000 events received (2 new, 1049170 destroy)
> 2260000 events received (2 new, 1049238 destroy)
> 2270000 events received (2 new, 1049292 destroy)
> 2280000 events received (2 new, 1049347 destroy)
> 2290000 events received (2 new, 1049423 destroy)
> 2300000 events received (2 new, 1049490 destroy)
> 2310000 events received (2 new, 1049563 destroy)
> 2320000 events received (2 new, 1049646 destroy)
> 2330000 events received (2 new, 1049739 destroy)
> 2340000 events received (2 new, 1049819 destroy)
> 2350000 events received (2 new, 1049932 destroy)
> 2360000 events received (2 new, 1050040 destroy)
> 2370000 events received (2 new, 1050153 destroy)
> 2380000 events received (2 new, 1050293 destroy)
> 2390000 events received (2 new, 1050405 destroy)
> 2400000 events received (2 new, 1050535 destroy)
> 2410000 events received (2 new, 1050661 destroy)
> 2420000 events received (2 new, 1050786 destroy)
> 2430000 events received (2 new, 1050937 destroy)
> 2440000 events received (2 new, 1051085 destroy)
> 2450000 events received (2 new, 1051226 destroy)
> 2460000 events received (2 new, 1051378 destroy)
> 2470000 events received (2 new, 1051542 destroy)
> 2480000 events received (2 new, 1051693 destroy)
> 2490000 events received (2 new, 1051852 destroy)
> 2500000 events received (2 new, 1052008 destroy)
> 2510000 events received (2 new, 1052185 destroy)
> 2520000 events received (2 new, 1052373 destroy)
> 2530000 events received (2 new, 1052569 destroy)
> 2540000 events received (2 new, 1052770 destroy)
> 2550000 events received (2 new, 1052978 destroy)

Just to add that I ran a more extensive stress test on the backup, like 
so ...

for x in $(seq 1 100); do ct_stress 1048576; sleep $(( $RANDOM % 60 )); done

It remained stable throughout. I notice that there's an option to dump 
the cache in XML format. I wonder if it be useful if I were to provide 
such a dump, having synced with the master? Assuming that there's a way 
to inject the contents, perhaps you could reproduce the issue also.

Cheers,

--Kerin

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso March 8, 2012, 1:33 a.m. UTC | #7
On Wed, Mar 07, 2012 at 02:41:02PM +0000, Kerin Millar wrote:
> Hi Pablo,
> 
> To follow up briefly (at the end of this message) ...
> 
> On 06/03/2012 22:37, Kerin Millar wrote:
> >Hi Pablo,
> >
> >On 06/03/2012 17:23, Pablo Neira Ayuso wrote:
> >
> ><snip>
> >
> >>>>I've been using the following tools that you can find enclosed to this
> >>>>email, they are much more simple than conntrackd but, they do the same
> >>>>in essence:
> >>>>
> >>>>* conntrack_stress.c
> >>>>* conntrack_events.c
> >>>>
> >>>>gcc -lnetfilter_conntrack conntrack_stress.c -o ct_stress
> >>>>gcc -lnetfilter_conntrack conntrack_events.c -o ct_events
> >>>>
> >>>>Then, to listen to events with reliable event delivery enabled:
> >>>>
> >>>># ./ct_events&
> >>>>
> >>>>And to create loads of flow entries in ASSURED state:
> >>>>
> >>>># ./ct_stress 65535 # that's my ct table size in my laptop
> >>>>
> >>>>You'll hit ENOMEM errors at some point, that's fine, but no oops or
> >>>>lockups happen here.
> >>>>
> >>>>I have pushed this tools to the qa/ directory under
> >>>>libnetfilter_conntrack:
> >>>>
> >>>>commit 94e75add9867fb6f0e05e73b23f723f139da829e
> >>>>Author: Pablo Neira Ayuso<pablo@netfilter.org>
> >>>>Date: Tue Mar 6 12:10:55 2012 +0100
> >>>>
> >>>>qa: add some stress tools to test conntrack via ctnetlink
> >>>>
> >>>>(BTW, ct_stress may disrupt your network connection since the table
> >>>>gets filled. You can use conntrack -F to get the ct table empty again).
> >>>>
> >>>
> >>>Sorry if this is a silly question but should conntrackd be running
> >>>while I conduct this stress test? If so, is there any danger of the
> >>>master becoming unstable? I must ask because, if the stability of
> >>>the master is compromised, I will be in big trouble ;)
> >>
> >>If you run this in the backup, conntrackd will spam the master with
> >>lots of new flows in the external cache. That shouldn't be a problem
> >>(just a bit of extra load invested in the replication).
> >>
> >>But if you run this in the master, my test will fill the ct table
> >>with lots of assured flows. Thus, packets that belong new flows will
> >>be likely dropped in that node.
> >
> >That makes sense. So, I rebooted the backup with the latest kernel
> >build, ran my iptables script then started conntrackd. I was not able to
> >destabilize the system through the use of your stress tool. The sequence
> >of commands used to invoke the ct_stress tool was as follows:-
> >
> >1) ct_stress 2097152
> >2) ct_stress 2097152
> >3) ct_stress 1048576
> >
> >There were indeed a lot of ENOMEM errors, and messages warning that the
> >conntrack table was full with packets being dropped. Nothing surprising.
> >
> >I then tried my test case again. The exact sequence of commands was as
> >follows:-
> >
> >4) conntrackd -n
> >5) conntrackd -c
> >6) conntrackd -f internal
> >7) conntrackd -F
> >8) conntrackd -n
> >9) conntrackd -c
> >
> >It didn't crash after the 5th step (to my amazement) but it did after
> >the 9th. Here's a netconsole log covering all of the above:
> >
> >http://paste.pocoo.org/raw/562136/
> >
> >The invalid opcode error was also present in the log that I provided
> >with my first post in this thread.
> >
> >For some reason, I couldn't capture stdout from your ct_events tool but
> >here's as much as I was able to copy and paste before it stopped
> >responding completely.
> >
> >2100000 events received (2 new, 1048702 destroy)
> >2110000 events received (2 new, 1048706 destroy)
> >2120000 events received (2 new, 1048713 destroy)
> >2130000 events received (2 new, 1048722 destroy)
> >2140000 events received (2 new, 1048735 destroy)
> >2150000 events received (2 new, 1048748 destroy)
> >2160000 events received (2 new, 1048776 destroy)
> >2170000 events received (2 new, 1048797 destroy)
> >2180000 events received (2 new, 1048830 destroy)
> >2190000 events received (2 new, 1048872 destroy)
> >2200000 events received (2 new, 1048909 destroy)
> >2210000 events received (2 new, 1048945 destroy)
> >2220000 events received (2 new, 1048985 destroy)
> >2230000 events received (2 new, 1049039 destroy)
> >2240000 events received (2 new, 1049102 destroy)
> >2250000 events received (2 new, 1049170 destroy)
> >2260000 events received (2 new, 1049238 destroy)
> >2270000 events received (2 new, 1049292 destroy)
> >2280000 events received (2 new, 1049347 destroy)
> >2290000 events received (2 new, 1049423 destroy)
> >2300000 events received (2 new, 1049490 destroy)
> >2310000 events received (2 new, 1049563 destroy)
> >2320000 events received (2 new, 1049646 destroy)
> >2330000 events received (2 new, 1049739 destroy)
> >2340000 events received (2 new, 1049819 destroy)
> >2350000 events received (2 new, 1049932 destroy)
> >2360000 events received (2 new, 1050040 destroy)
> >2370000 events received (2 new, 1050153 destroy)
> >2380000 events received (2 new, 1050293 destroy)
> >2390000 events received (2 new, 1050405 destroy)
> >2400000 events received (2 new, 1050535 destroy)
> >2410000 events received (2 new, 1050661 destroy)
> >2420000 events received (2 new, 1050786 destroy)
> >2430000 events received (2 new, 1050937 destroy)
> >2440000 events received (2 new, 1051085 destroy)
> >2450000 events received (2 new, 1051226 destroy)
> >2460000 events received (2 new, 1051378 destroy)
> >2470000 events received (2 new, 1051542 destroy)
> >2480000 events received (2 new, 1051693 destroy)
> >2490000 events received (2 new, 1051852 destroy)
> >2500000 events received (2 new, 1052008 destroy)
> >2510000 events received (2 new, 1052185 destroy)
> >2520000 events received (2 new, 1052373 destroy)
> >2530000 events received (2 new, 1052569 destroy)
> >2540000 events received (2 new, 1052770 destroy)
> >2550000 events received (2 new, 1052978 destroy)
> 
> Just to add that I ran a more extensive stress test on the backup,
> like so ...
> 
> for x in $(seq 1 100); do ct_stress 1048576; sleep $(( $RANDOM % 60 )); done

I guess you're running ct_events_reliable as well. Lauching several
ct_stress at the same time is also interesting.

> It remained stable throughout. I notice that there's an option to
> dump the cache in XML format. I wonder if it be useful if I were to
> provide such a dump, having synced with the master? Assuming that
> there's a way to inject the contents, perhaps you could reproduce
> the issue also.

I've been launching the user-space stress tests but I was not abled to
reproduce the problem that you reported so far.

I'd need to know if the problem that you reported is easy to
reproduce in your setup or, it looks more like a race condition.
Moreover, I need to know if there was some traffic circulating
through the backup or no traffic at all.

Let me know.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kerin Millar March 8, 2012, 11 a.m. UTC | #8
Hi Pablo,

On 08/03/2012 01:33, Pablo Neira Ayuso wrote:
> On Wed, Mar 07, 2012 at 02:41:02PM +0000, Kerin Millar wrote:

<snip>

>>> That makes sense. So, I rebooted the backup with the latest kernel
>>> build, ran my iptables script then started conntrackd. I was not able to
>>> destabilize the system through the use of your stress tool. The sequence
>>> of commands used to invoke the ct_stress tool was as follows:-
>>>
>>> 1) ct_stress 2097152
>>> 2) ct_stress 2097152
>>> 3) ct_stress 1048576
>>>
>>> There were indeed a lot of ENOMEM errors, and messages warning that the
>>> conntrack table was full with packets being dropped. Nothing surprising.
>>>
>>> I then tried my test case again. The exact sequence of commands was as
>>> follows:-
>>>
>>> 4) conntrackd -n
>>> 5) conntrackd -c
>>> 6) conntrackd -f internal
>>> 7) conntrackd -F
>>> 8) conntrackd -n
>>> 9) conntrackd -c
>>>
>>> It didn't crash after the 5th step (to my amazement) but it did after
>>> the 9th. Here's a netconsole log covering all of the above:
>>>
>>> http://paste.pocoo.org/raw/562136/
>>>
>>> The invalid opcode error was also present in the log that I provided
>>> with my first post in this thread.
>>>
>>> For some reason, I couldn't capture stdout from your ct_events tool but
>>> here's as much as I was able to copy and paste before it stopped
>>> responding completely.
>>>
>>> 2100000 events received (2 new, 1048702 destroy)
>>> 2110000 events received (2 new, 1048706 destroy)
>>> 2120000 events received (2 new, 1048713 destroy)
>>> 2130000 events received (2 new, 1048722 destroy)
>>> 2140000 events received (2 new, 1048735 destroy)
>>> 2150000 events received (2 new, 1048748 destroy)
>>> 2160000 events received (2 new, 1048776 destroy)
>>> 2170000 events received (2 new, 1048797 destroy)
>>> 2180000 events received (2 new, 1048830 destroy)
>>> 2190000 events received (2 new, 1048872 destroy)
>>> 2200000 events received (2 new, 1048909 destroy)
>>> 2210000 events received (2 new, 1048945 destroy)
>>> 2220000 events received (2 new, 1048985 destroy)
>>> 2230000 events received (2 new, 1049039 destroy)
>>> 2240000 events received (2 new, 1049102 destroy)
>>> 2250000 events received (2 new, 1049170 destroy)
>>> 2260000 events received (2 new, 1049238 destroy)
>>> 2270000 events received (2 new, 1049292 destroy)
>>> 2280000 events received (2 new, 1049347 destroy)
>>> 2290000 events received (2 new, 1049423 destroy)
>>> 2300000 events received (2 new, 1049490 destroy)
>>> 2310000 events received (2 new, 1049563 destroy)
>>> 2320000 events received (2 new, 1049646 destroy)
>>> 2330000 events received (2 new, 1049739 destroy)
>>> 2340000 events received (2 new, 1049819 destroy)
>>> 2350000 events received (2 new, 1049932 destroy)
>>> 2360000 events received (2 new, 1050040 destroy)
>>> 2370000 events received (2 new, 1050153 destroy)
>>> 2380000 events received (2 new, 1050293 destroy)
>>> 2390000 events received (2 new, 1050405 destroy)
>>> 2400000 events received (2 new, 1050535 destroy)
>>> 2410000 events received (2 new, 1050661 destroy)
>>> 2420000 events received (2 new, 1050786 destroy)
>>> 2430000 events received (2 new, 1050937 destroy)
>>> 2440000 events received (2 new, 1051085 destroy)
>>> 2450000 events received (2 new, 1051226 destroy)
>>> 2460000 events received (2 new, 1051378 destroy)
>>> 2470000 events received (2 new, 1051542 destroy)
>>> 2480000 events received (2 new, 1051693 destroy)
>>> 2490000 events received (2 new, 1051852 destroy)
>>> 2500000 events received (2 new, 1052008 destroy)
>>> 2510000 events received (2 new, 1052185 destroy)
>>> 2520000 events received (2 new, 1052373 destroy)
>>> 2530000 events received (2 new, 1052569 destroy)
>>> 2540000 events received (2 new, 1052770 destroy)
>>> 2550000 events received (2 new, 1052978 destroy)
>>
>> Just to add that I ran a more extensive stress test on the backup,
>> like so ...
>>
>> for x in $(seq 1 100); do ct_stress 1048576; sleep $(( $RANDOM % 60 )); done
>
> I guess you're running ct_events_reliable as well. Lauching several
> ct_stress at the same time is also interesting.

The ct_events (conntrack_events.c) program was running throughout and 
"NetlinkEventsReliable On" remains defined in conntrackd.conf. I will 
try running ct_stress concurrently.

>
>> It remained stable throughout. I notice that there's an option to
>> dump the cache in XML format. I wonder if it be useful if I were to
>> provide such a dump, having synced with the master? Assuming that
>> there's a way to inject the contents, perhaps you could reproduce
>> the issue also.
>
> I've been launching the user-space stress tests but I was not abled to
> reproduce the problem that you reported so far.
>
> I'd need to know if the problem that you reported is easy to
> reproduce in your setup or, it looks more like a race condition.
> Moreover, I need to know if there was some traffic circulating
> through the backup or no traffic at all.

Indeed, I can reproduce it so easily and consistently that I've now lost 
track of the amount of times I've had to hard reboot this machine. To 
recap: I boot the slave with 3.3-rc5 (now featuring the three patches 
you asked me to apply). My iptables ruleset is loaded and conntrackd is 
started. At that point, the stats look like this:-

# conntrackd -s
cache internal:
current active connections:              109
connections created:                     175    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                    66    failed:            0

cache external:
current active connections:             3676
connections created:                    3688    failed:            0
connections updated:                       4    failed:            0
connections destroyed:                    12    failed:            0

traffic processed:
                    0 Bytes                         0 Pckts

UDP traffic (active device=eth2):
                 4360 Bytes sent               570188 Bytes recv
                   91 Pckts sent                 6681 Pckts recv
                    0 Error send                    0 Error recv

message tracking:
                    0 Malformed msgs                    0 Lost msgs

NOTE: if left alone, external cache usage grows steadily - as expected 
due to the master handling new connections to our busy load 
balance/appserver farm.

Next, rather than spend the time waiting for the backup to catch up, I 
explicitly synchronize with the master.

# conntrackd -n
# conntrackd -s
cache internal:
current active connections:                5
connections created:                     179    failed:            0
connections updated:                       0    failed:            0
connections destroyed:                   174    failed:            0

cache external:
current active connections:          1295640
connections created:                 1351838    failed:            0
connections updated:                      26    failed:            0
connections destroyed:                 56198    failed:            0

traffic processed:
                    0 Bytes                         0 Pckts

UDP traffic (active device=eth2):
               139384 Bytes sent            112306776 Bytes recv
                 1054 Pckts sent               135073 Pckts recv
                    0 Error send                    0 Error recv

message tracking:
                    0 Malformed msgs               175170 Lost msgs

The number of state entries in the external cache is now in line with 
the master. Finally, I commit.

# conntrackd -c

What happens next is one of two things ...

a) a seemingly never-ending series of hard lockups occur
b) it panics with "not syncing: Fatal exception in interrupt"

Scenario (a) seems to be more frequent but, either way, it happens 
virtually every single time. I think I've had no more than *two* 
occasions where conntrackd -c returned to prompt in dozens upon dozens 
of tests and, even then, the system didn't survive a second invocation 
of conntrackd -c. In all cases, I have to hard reboot the machine. 
Netconsole logs for both of these outcomes have been provided in 
previous posts. Ergo, it's entirely reproducible here.

There are absolutely no signs of instability with the system except for 
when state is being committed. Indeed, I first became aware of this 
situation upon simulating a genuine failover scenario. That is, I shut 
down the master, ucarp migrated the VIPs and instructed conntrackd to 
commit state. Then it died. Since that first time, I deactivated ucarp 
entirely and have easily reproduced the issue by running conntrackd -c 
manually.

Cheers,

--Kerin

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kerin Millar March 8, 2012, 11:29 a.m. UTC | #9
On 08/03/2012 01:33, Pablo Neira Ayuso wrote:
> Moreover, I need to know if there was some traffic circulating
> through the backup or no traffic at all.

Sorry, I didn't address this point in my previous email. The backup does 
indeed handle some traffic. Both systems run BIND and, as such, the 
backup is also our secondary public-facing nameserver. The load 
generated by this is not significant though. At any given moment, the 
number of state entries hovers at between 100-150 and almost all of 
these are from UDP entries on account of DNS queries.

The rest can be accounted for by ntpd (about 4 active entries for 
upstream ntp servers), ssh (1 connection only), ICMP echo-request 
handling and a few other sundries.

In my test case, I am running conntrackd -c under circumstances where 
conntrackd on the master is still pushing events across. But, I have 
also simulated a realistic failover scenario on at least two occasions 
by shutting down the master (at which point, conntrackd terminates and 
is obviously no longer pushing events to the backup). Regardless, the 
backup still crashes upon conntrackd -c.

In summary:

* Both nodes are handling DNS traffic (but it's packet forwarding which
   really generates a heavy load)

* conntrackd -c has been run under circumstances where conntrack daemon
   is and isn't continuing to receive traffic from other node. It crashes
   anyway.

Cheers,

--Kerin

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

From 1320c099d618a278fa17715127d6fecca2786a36 Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Sun, 4 Mar 2012 11:34:06 +0100
Subject: [PATCH] netfilter: nf_conntrack: fix early_drop with reliable event
 delivery

With reliable event delivery is enabled, if we fail to deliver the
destroy event in early_drop, we put out one entry that is still in
the dying list.

Reported-by: Kerin Millar <kerframil@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_core.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index ed86a3b..7d2d641 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -635,6 +635,11 @@  static noinline int early_drop(struct net *net, unsigned int hash)
 
 	if (del_timer(&ct->timeout)) {
 		death_by_timeout((unsigned long)ct);
+		/* Check if we indeed killed this entry. Reliable event
+		   delivery may insert this into the dying list. */
+		if (!test_bit(IPS_DYING_BIT, &ct->status))
+			return dropped;
+
 		dropped = 1;
 		NF_CT_STAT_INC_ATOMIC(net, early_drop);
 	}
-- 
1.7.7.3