diff mbox

Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

Message ID 20090702093031.GC4954@ff.dom.local
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jarek Poplawski July 2, 2009, 9:30 a.m. UTC
On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
...
> So I tried - and I did not catch any lockdep output before the crash.  
> Unfortunately I do not have another machine on the same local network to  
> catch any messages after the crash... So I could be missing some warning  
> (I did synchronous logging though).
> Will check with netconsole tomorrow.

Could you try if this patch changes anything?

Thanks,
Jarek P.
---

 net/sched/sch_netem.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jarek Poplawski July 2, 2009, 10:12 a.m. UTC | #1
On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
> ...
> > So I tried - and I did not catch any lockdep output before the crash.  
> > Unfortunately I do not have another machine on the same local network to  
> > catch any messages after the crash... So I could be missing some warning  
> > (I did synchronous logging though).
> > Will check with netconsole tomorrow.
> 
> Could you try if this patch changes anything?

...and maybe CONFIG_PACKET_MMAP turned off.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joao Correia July 2, 2009, 10:51 a.m. UTC | #2
Im getting this one too. On any .31-rc1-gitXX, it pops up a few
seconds after i turn traffic shaping on with tc/htb (no netem here).
Its easily reproducible on my end, turn on traffic shaping, get
traffic flowing, box freezes. No error message, nothing on logs,
although -sometimes- i get lockdep error messages on screen, but
scrolls too fast and box is otherwise unresponsive, so i can't get the
error out.

This does not happen on 2.6.30 at all.

Hope this helps.

(i tried both CONFIG_PACKET_MMAP off and the patch, still happens).

Joao Correia
Centro de Informatica
Universidade da Beira Interior
Portugal


(snip firewall script. TCBIN is just tc, OUTIF is just the outside
interface (gigabit D-Link using r8169 driver))
$TCBIN class add dev $OUTIF parent 1: classid 1:1 htb rate
${UPLINK}kbit ceil ${UPLINK}kbit
$TCBIN class add dev $OUTIF parent 1:1 classid 1:10 htb rate
$[30*$UPLINK/100]kbit ceil $[30*$UPLINK/100]kbit prio 0
$TCBIN class add dev $OUTIF parent 1:1 classid 1:11 htb rate
$[30*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 1
$TCBIN class add dev $OUTIF parent 1:1 classid 1:12 htb rate
$[8*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 2
$TCBIN class add dev $OUTIF parent 1:1 classid 1:13 htb rate
$[8*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 2
$TCBIN class add dev $OUTIF parent 1:1 classid 1:14 htb rate
$[10*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 7 quantum 50000
$TCBIN class add dev $OUTIF parent 1:1 classid 1:15 htb rate
$[13*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 8

$TCBIN qdisc add dev $OUTIF parent 1:12 handle 120: sfq perturb 10
$TCBIN qdisc add dev $OUTIF parent 1:13 handle 130: sfq perturb 10
$TCBIN qdisc add dev $OUTIF parent 1:14 handle 140: sfq perturb 10
$TCBIN qdisc add dev $OUTIF parent 1:15 handle 150: sfq perturb 10

$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 1 handle 1 fw
classid 1:10
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 2 handle 2 fw
classid 1:11
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 3 handle 3 fw
classid 1:12
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 4 handle 4 fw
classid 1:13
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 5 handle 5 fw
classid 1:14
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 6 handle 6 fw
classid 1:15
(snip firewall script)

On Thu, Jul 2, 2009 at 11:12 AM, Jarek Poplawski<jarkao2@gmail.com> wrote:
> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>> ...
>> > So I tried - and I did not catch any lockdep output before the crash.
>> > Unfortunately I do not have another machine on the same local network to
>> > catch any messages after the crash... So I could be missing some warning
>> > (I did synchronous logging though).
>> > Will check with netconsole tomorrow.
>>
>> Could you try if this patch changes anything?
>
> ...and maybe CONFIG_PACKET_MMAP turned off.
>
> Jarek P.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski July 2, 2009, 11:09 a.m. UTC | #3
On Thu, Jul 02, 2009 at 11:51:56AM +0100, Joao Correia wrote:
> Im getting this one too. On any .31-rc1-gitXX, it pops up a few
> seconds after i turn traffic shaping on with tc/htb (no netem here).
> Its easily reproducible on my end, turn on traffic shaping, get
> traffic flowing, box freezes. No error message, nothing on logs,
> although -sometimes- i get lockdep error messages on screen, but
> scrolls too fast and box is otherwise unresponsive, so i can't get the
> error out.
> 
> This does not happen on 2.6.30 at all.
> 
> Hope this helps.
> 
> (i tried both CONFIG_PACKET_MMAP off and the patch, still happens).

It helps for sure - to save on useless checking, and maybe reproduce
too.

Thank you very much Joao,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andres Freund July 2, 2009, 11:11 a.m. UTC | #4
On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>> ...
>>> So I tried - and I did not catch any lockdep output before the crash.
>>> Unfortunately I do not have another machine on the same local network to
>>> catch any messages after the crash... So I could be missing some warning
>>> (I did synchronous logging though).
>>> Will check with netconsole tomorrow.
>>
>> Could you try if this patch changes anything?
>
> ...and maybe CONFIG_PACKET_MMAP turned off.
Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly 
the same game.
I now had another computer to catch the netconsole output. Still no 
lockdep warnings.

Unfortunately the other computer was a windows machine with its strange 
terminal, so long lines are wrapped at 80cols, but that shouldn't be too 
bad.

Andres
[  215.208044] netem: version 1.2

[  350.040136] BUG: soft lockup - CPU#1 stuck for 61s! [openvpn:4248]
[  350.040136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
 pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[  350.040136] irq event stamp: 149925
[  350.040136] hardirqs last  enabled at (149924): [<ffffffff81036a10>] restore_
args+0x0/0x30
[  350.040136] hardirqs last disabled at (149925): [<ffffffff81035d3a>] save_arg
s+0x6a/0x70
[  350.040136] softirqs last  enabled at (19946): [<ffffffff815528ad>] lock_sock
_nested+0x8d/0x130
[  350.040136] softirqs last disabled at (19952): [<ffffffff815627a8>] dev_queue
_xmit+0x58/0x4b0
[  350.040136] CPU 1:
[  350.040136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
 pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[  350.040136] Pid: 4248, comm: openvpn Not tainted 2.6.31-rc1-andres-00437-gde7
327a-dirty #61 208252G
[  350.040136] RIP: 0010:[<ffffffff8103e276>]  [<ffffffff8103e276>] native_read_
tsc+0x6/0x20
[  350.040136] RSP: 0018:ffff8801271c78b8  EFLAGS: 00000206
[  350.040136] RAX: 000000000ae20bd8 RBX: ffff8801271c78b8 RCX: 000000000ae20b00

[  350.040136] RDX: 00000000000000b0 RSI: 0000000000006040 RDI: 0000000000000001

[  350.040136] RBP: ffffffff81036b6e R08: ffffffff82175180 R09: 0000000000000000

[  350.040136] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271c6000

[  350.040136] R13: 0000000000000000 R14: ffff88002efec400 R15: 0000000000000000

[  350.040136] FS:  00007f98ddc646f0(0000) GS:ffff88002efde000(0000) knlGS:00000
00000000000
[  350.040136] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  350.040136] CR2: 0000000005593008 CR3: 000000012742f000 CR4: 00000000000026e0

[  350.040136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[  350.040136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[  350.040136] Call Trace:
[  350.040136]  [<ffffffff812a96c2>] ? delay_tsc+0x22/0x80
[  350.040136]  [<ffffffff812a95da>] ? __delay+0xa/0x10
[  350.040136]  [<ffffffff812addbd>] ? _raw_spin_lock+0xfd/0x170
[  350.040136]  [<ffffffff816e92f1>] ? _spin_lock+0x51/0x70
[  350.040136]  [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[  350.040136]  [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[  350.040136]  [<ffffffff815627a3>] ? dev_queue_xmit+0x53/0x4b0
[  350.040136]  [<ffffffff81594bac>] ? ip_finish_output+0x13c/0x320
[  350.040136]  [<ffffffff81594e0b>] ? ip_output+0x7b/0xd0
[  350.040136]  [<ffffffff81593bf0>] ? ip_local_out+0x20/0x30
[  350.040136]  [<ffffffff815943c5>] ? ip_queue_xmit+0x165/0x3b0
[  350.040136]  [<ffffffff815a8d49>] ? tcp_transmit_skb+0x3e9/0x780
[  350.040136]  [<ffffffff815ab3b7>] ? tcp_write_xmit+0x1e7/0x9d0
[  350.040136]  [<ffffffff815abc0b>] ? __tcp_push_pending_frames+0x2b/0xd0
[  350.040136]  [<ffffffff8159e327>] ? tcp_sendmsg+0x887/0xb90
[  350.040136]  [<ffffffff8154fa86>] ? sock_sendmsg+0x126/0x140
[  350.040136]  [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[  350.040136]  [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[  350.040136]  [<ffffffff810ab0e7>] ? mark_held_locks+0x67/0x90
[  350.040136]  [<ffffffff816e90bb>] ? _spin_unlock_irqrestore+0x3b/0x70
[  350.040136]  [<ffffffff815509c0>] ? sys_sendto+0xf0/0x130
[  350.040136]  [<ffffffff810ab3fd>] ? trace_hardirqs_on_caller+0x14d/0x190
[  350.040136]  [<ffffffff810ab44d>] ? trace_hardirqs_on+0xd/0x10
[  350.040136]  [<ffffffff810a1c27>] ? getnstimeofday+0x57/0xe0
[  350.040136]  [<ffffffff8109bbf1>] ? ktime_get_ts+0x51/0x70
[  350.040136]  [<ffffffff81035ec2>] ? system_call_fastpath+0x16/0x1b
[  415.538136] BUG: soft lockup - CPU#1 stuck for 61s! [openvpn:4248]
[  415.538136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
 pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[  415.538136] irq event stamp: 281051
[  415.538136] hardirqs last  enabled at (281050): [<ffffffff81036a10>] restore_
args+0x0/0x30
[  415.538136] hardirqs last disabled at (281051): [<ffffffff81035d3a>] save_arg
s+0x6a/0x70
[  415.538136] softirqs last  enabled at (19946): [<ffffffff815528ad>] lock_sock
_nested+0x8d/0x130
[  415.538136] softirqs last disabled at (19952): [<ffffffff815627a8>] dev_queue
_xmit+0x58/0x4b0
[  415.538136] CPU 1:
[  415.538136] Modules linked in: sch_netem sbs sbshc pcmcia snd_hda_codec_conex
ant yenta_socket rsrc_nonstatic snd_hda_intel snd_hda_codec thinkpad_acpi iwlagn
 pcmcia_core btusb snd_hwdep ehci_hcd uhci_hcd
[  415.538136] Pid: 4248, comm: openvpn Not tainted 2.6.31-rc1-andres-00437-gde7
327a-dirty #61 208252G
[  415.538136] RIP: 0010:[<ffffffff8103e276>]  [<ffffffff8103e276>] native_read_
tsc+0x6/0x20
[  415.538136] RSP: 0018:ffff8801271c78b8  EFLAGS: 00000206
[  415.538136] RAX: 000000008cf9059c RBX: ffff8801271c78b8 RCX: 000000008cf9050c

[  415.538136] RDX: 00000000000000d4 RSI: 0000000000006040 RDI: 0000000000000001

[  415.538136] RBP: ffffffff81036b6e R08: ffffffff82175180 R09: 0000000000000000

[  415.538136] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000003fed

[  415.538136] R13: ffff88002efde000 R14: ffff8801271c6000 R15: 0000000000000000

[  415.538136] FS:  00007f98ddc646f0(0000) GS:ffff88002efde000(0000) knlGS:00000
00000000000
[  415.538136] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  415.538136] CR2: 0000000005593008 CR3: 000000012742f000 CR4: 00000000000026e0

[  415.538136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[  415.538136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[  415.538136] Call Trace:
[  415.538136]  [<ffffffff812a96ea>] ? delay_tsc+0x4a/0x80
[  415.538136]  [<ffffffff812a95da>] ? __delay+0xa/0x10
[  415.538136]  [<ffffffff812addbd>] ? _raw_spin_lock+0xfd/0x170
[  415.538136]  [<ffffffff816e92f1>] ? _spin_lock+0x51/0x70
[  415.538136]  [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[  415.538136]  [<ffffffff81562836>] ? dev_queue_xmit+0xe6/0x4b0
[  415.538136]  [<ffffffff815627a3>] ? dev_queue_xmit+0x53/0x4b0
[  415.538136]  [<ffffffff81594bac>] ? ip_finish_output+0x13c/0x320
[  415.538136]  [<ffffffff81594e0b>] ? ip_output+0x7b/0xd0
[  415.538136]  [<ffffffff81593bf0>] ? ip_local_out+0x20/0x30
[  415.538136]  [<ffffffff815943c5>] ? ip_queue_xmit+0x165/0x3b0
[  415.538136]  [<ffffffff815a8d49>] ? tcp_transmit_skb+0x3e9/0x780
[  415.538136]  [<ffffffff815ab3b7>] ? tcp_write_xmit+0x1e7/0x9d0
[  415.538136]  [<ffffffff815abc0b>] ? __tcp_push_pending_frames+0x2b/0xd0
[  415.538136]  [<ffffffff8159e327>] ? tcp_sendmsg+0x887/0xb90
[  415.538136]  [<ffffffff8154fa86>] ? sock_sendmsg+0x126/0x140
[  415.538136]  [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[  415.538136]  [<ffffffff81097b60>] ? autoremove_wake_function+0x0/0x40
[  415.538136]  [<ffffffff810ab0e7>] ? mark_held_locks+0x67/0x90
[  415.538136]  [<ffffffff816e90bb>] ? _spin_unlock_irqrestore+0x3b/0x70
[  415.538136]  [<ffffffff815509c0>] ? sys_sendto+0xf0/0x130
[  415.538136]  [<ffffffff810ab3fd>] ? trace_hardirqs_on_caller+0x14d/0x190
[  415.538136]  [<ffffffff810ab44d>] ? trace_hardirqs_on+0xd/0x10
[  415.538136]  [<ffffffff810a1c27>] ? getnstimeofday+0x57/0xe0
[  415.538136]  [<ffffffff8109bbf1>] ? ktime_get_ts+0x51/0x70
[  415.538136]  [<ffffffff81035ec2>] ? system_call_fastpath+0x16/0x1b
Jarek Poplawski July 2, 2009, 11:43 a.m. UTC | #5
On Thu, Jul 02, 2009 at 01:11:56PM +0200, Andres Freund wrote:
> On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>>> ...
>>>> So I tried - and I did not catch any lockdep output before the crash.
>>>> Unfortunately I do not have another machine on the same local network to
>>>> catch any messages after the crash... So I could be missing some warning
>>>> (I did synchronous logging though).
>>>> Will check with netconsole tomorrow.
>>>
>>> Could you try if this patch changes anything?
>>
>> ...and maybe CONFIG_PACKET_MMAP turned off.
> Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly  
> the same game.
> I now had another computer to catch the netconsole output. Still no  
> lockdep warnings.
>
> Unfortunately the other computer was a windows machine with its strange  
> terminal, so long lines are wrapped at 80cols, but that shouldn't be too  
> bad.

It's very nice ;-) Alas we can see here only one side, so there is
still some searching needed. I hope I'll be able to reproduce this
later.

Thanks for your work Andres,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andres Freund July 2, 2009, 11:43 a.m. UTC | #6
On 07/02/2009 01:11 PM, Andres Freund wrote:
> On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>>> ...
>>>> So I tried - and I did not catch any lockdep output before the crash.
>>>> Unfortunately I do not have another machine on the same local
>>>> network to
>>>> catch any messages after the crash... So I could be missing some
>>>> warning
>>>> (I did synchronous logging though).
>>>> Will check with netconsole tomorrow.
>>>
>>> Could you try if this patch changes anything?
>>
>> ...and maybe CONFIG_PACKET_MMAP turned off.
> Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly
> the same game.
> I now had another computer to catch the netconsole output. Still no
> lockdep warnings.
>
> Unfortunately the other computer was a windows machine with its strange
> terminal, so long lines are wrapped at 80cols, but that shouldn't be too
> bad.
I will start trying to place the issue by testing with existing kernels 
between 2.6.30 and now.

Andres
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski July 2, 2009, 11:54 a.m. UTC | #7
On Thu, Jul 02, 2009 at 01:43:49PM +0200, Andres Freund wrote:
...
> I will start trying to place the issue by testing with existing kernels  
> between 2.6.30 and now.

If you can afford your time of course this would be very helpful.

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andres Freund July 2, 2009, 11:59 a.m. UTC | #8
On 07/02/2009 01:54 PM, Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 01:43:49PM +0200, Andres Freund wrote: ...
>> I will start trying to place the issue by testing with existing
>> kernels between 2.6.30 and now.
> If you can afford your time of course this would be very helpful.
Well. Waiting for the issue to resolve itself would cost time as well ;-)
I wont be able to finish this today, but perhaps some reduction of the 
search space will be enough.

I placed it between 2.6.30 and 03347e2592078a90df818670fddf97a33eec70fb
(v2.6.30-5415-g03347e2) so far.


Andres
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2b88295..6001ad9 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -176,8 +176,6 @@  static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	}
 
-	skb_orphan(skb);
-
 	/*
 	 * If we need to duplicate packet, then re-insert at top of the
 	 * qdisc tree, since parent queuer expects that only one