diff mbox

kernel panic in fib_rules_lookup [2.6.27.7 vendor-patched]

Message ID 1286905245.2703.3.camel@edumazet-laptop
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 12, 2010, 5:40 p.m. UTC
Le mardi 12 octobre 2010 à 17:14 +0000, Joe Buehler a écrit :
> I am seeing a kernel panic (NULL pointer) in fib_rules_lookup.  There
> were some other reports for 2.6.32 back in March and May.  It looks to
> me as though "rules_list" is not in a good state when fib_rules_lookup
> traverses it.
> 
> My application is bringing TAP interfaces up and down and making
> changes to associated routing tables at a fairly good clip (say, a few
> times a second).  That use case seems to be similar to a previously
> reported crash case.
> 
> This is a MIPS kernel (Cavium Octeon) running two CPUs SMP.  I am
> using 2.6.27.7 patched by Cavium for hardware support reasons.  I
> cannot upgrade because the vendor patches are non-trivial to
> forward-port.
> 
> Here is one stack trace:
> 
> [<ffffffff814671ec>] fib_rules_lookup+0x11c/0x1a8
> [<ffffffff814bd144>] fib_lookup+0x2c/0x48
> [<ffffffff814788d8>] __ip_route_output_key+0x918/0xf38
> [<ffffffff81478f30>] ip_route_output_flow+0x38/0x2e8
> [<ffffffff8149fd1c>] tcp_v4_connect+0x134/0x498
> [<ffffffff814aef80>] inet_stream_connect+0xf8/0x2f0
> [<ffffffff81442680>] sys_connect+0xe0/0xf8
> [<ffffffff8114528c>] handle_sys+0x12c/0x148
> 
> Here is another:
> 
> [<ffffffff814671ec>] fib_rules_lookup+0x11c/0x1a8
> [<ffffffff814bd144>] fib_lookup+0x2c/0x48
> [<ffffffff814b6550>] fib_validate_source+0xb0/0x4c0
> [<ffffffff8147a524>] ip_route_input+0x11a4/0x13c0
> [<ffffffff8147c304>] ip_rcv_finish+0x2f4/0x4c0
> [<ffffffff81454220>] process_backlog+0xa8/0x160
> [<ffffffff81451ea8>] net_rx_action+0x190/0x2e0
> [<ffffffff81166978>] __do_softirq+0xf0/0x218
> [<ffffffff81166b18>] do_softirq+0x78/0x80
> [<ffffffff81100e30>] plat_irq_dispatch+0x130/0x1e0
> [<ffffffff81130948>] ret_from_irq+0x0/0x4
> [<ffffffff8151167c>] _cond_resched+0x34/0x50
> [<ffffffff81148b60>] fpu_emulator_cop1Handler+0x90/0x1c80
> [<ffffffff81136f4c>] do_cpu+0x24c/0x360
> [<ffffffff81130940>] ret_from_exception+0x0/0x8
> 
> *IF* my reading of the disassembled code at point of panic is correct,
>  the "pos" pointer in list_for_each_entry_rcu appears to be NULL.
> 
> Looking at the code in net/core/fib_rules.c I see some uses of the
> "rules_list" using rcu and some apparently not.  Has something simple
> been overlooked?
> 
> I need this fixed so will try adding a spinlock to protect rules_list
> if necessary.

2.6.27 is a bit old, you might try :

commit 7fa7cb7109d07c29ab28bb877bc7049a0150dbe5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Mon Sep 27 04:18:27 2010 +0000

    fib: use atomic_inc_not_zero() in fib_rules_lookup
    
    It seems we dont use appropriate refcount increment in an
    rcu_read_lock() protected section.
    
    fib_rule_get() might increment a null refcount and bad things could
    happen.
    
    While fib_nl_delrule() respects an rcu grace period before calling
    fib_rule_put(), fib_rules_cleanup_ops() calls fib_rule_put() without a
    grace period.
    
    Note : after this patch, we might avoid the synchronize_rcu() call done
    in fib_nl_delrule()
    
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Joe Buehler Oct. 13, 2010, 11:18 a.m. UTC | #1
Eric Dumazet wrote:
> 2.6.27 is a bit old, you might try :
>
> commit 7fa7cb7109d07c29ab28bb877bc7049a0150dbe5
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Mon Sep 27 04:18:27 2010 +0000
>   
Amazing -- a fix in 20 minutes -- thank you!  I need to do more load
testing but it looks like it may have solved the issue.

Joe Buehler

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 20, 2010, 5:43 p.m. UTC | #2
Eric Dumazet wrote:

> 2.6.27 is a bit old, you might try :
> 
> commit 7fa7cb7109d07c29ab28bb877bc7049a0150dbe5
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Mon Sep 27 04:18:27 2010 +0000

Alas, after more load testing I find that the panic still occurs:

CPU 1 Unable to handle kernel paging request at virtual address
0000000000000000, epc == ffffffff8146728c, ra == ffffffff81467258
Oops[#1]:
Cpu 1
$ 0   : 0000000000000000 0000000000000000 0000000000000000 0000000000000000
$ 4   : ffffffffffffffff a80000008c9d78f0 a80000009acd9880 000000000a205a7c
$ 8   : 0000000000000000 0000000000000020 a80000009a9c49d0 0000000000000000
$12   : ffffffff8155de00 0000000000000004 0000000000000001 0000000000000000
$16   : 0000000000000000 a80000008c9d78f0 0000000000000002 a80000009b94ed80
$20   : a80000009b94edf8 0000000000000000 0000000000000003 a80000008c9d78a0
$24   : 0000000000000000 ffffffff812df388
$28   : a80000008c9d4000 a80000008c9d7840 fffffffffffffff5 ffffffff81467258
Hi    : 0000000000000000
Lo    : 0000000000000000
epc   : ffffffff8146728c fib_rules_lookup+0x11c/0x260
    Not tainted
ra    : ffffffff81467258 fib_rules_lookup+0xe8/0x260
Status: 1010cce3    KX SX UX KERNEL EXL IE
Cause : 00800008
BadVA : 0000000000000000
PrId  : 000d0409 (Cavium Octeon)
Modules linked in: x_tables ip_tables iptable_filter nf_conntrack
nf_conntrack_ipv4 nf_nat iptable_nat tun xt_tcpudp xt_state ipt_REJECT
ipv6 ip6_tables ip6table_filter ip6t_ipv6header ip6t_REJECT
Process qscope7500 (pid: 1343, threadinfo=a80000008c9d4000,
task=a80000008c068ac0, tls=000000002d51e920)
Stack : ffffffffffffffff 0000000000000003 a80000008c9d78d8 a80000008c9d79d8
        a80000008c9d78f0 0000000000000000 ffffffff816c39c0 ffffffffffffffff
        0000000000000003 00000000000004a6 0000000000000000 ffffffff814bd2a4
        0000000000000000 a80000008c9d78d8 0000000000000000 ffffffffc001aa24
        a80000008c9d78d8 ffffffff81478a38 0000000000000003 0000000000000001
        0000000000000000 0000000000000000 0000000000000001 000000000a205a7c
        0a2059bf00000000 0000000000000000 0000000000000000 0000000000000000
        0000000000000000 0000000000000000 a80000008c9d79d0 0000000000000000
        a80000008c9d79d8 a8000000994ff380 0000000000000000 0000000000000000
        a80000008c9d79d0 ffffffff816c39c0 0000000000000003 00000000000004a6
        ...
Call Trace:
[<ffffffff8146728c>] fib_rules_lookup+0x11c/0x260
[<ffffffff814bd2a4>] fib_lookup+0x2c/0x48
[<ffffffff81478a38>] __ip_route_output_key+0x918/0xf38
[<ffffffff81479090>] ip_route_output_flow+0x38/0x2e8
[<ffffffff81482c44>] ip_queue_xmit+0x38c/0x3a8
[<ffffffff81497f5c>] tcp_transmit_skb+0x3f4/0x7d0
[<ffffffff8149af2c>] __tcp_push_pending_frames+0x1fc/0x9e0
[<ffffffff8148cf50>] tcp_sendmsg+0x900/0xe00
[<ffffffff81441f7c>] sock_aio_write+0x16c/0x190
[<ffffffff811d0424>] do_sync_write+0xbc/0x130
[<ffffffff811d10a8>] vfs_write+0x150/0x158
[<ffffffff811d124c>] sys_write+0x5c/0x118
[<ffffffff8114532c>] handle_sys+0x12c/0x148


Code: 0040282d  00000000  de100000 <de020000> cc400000  1614ffce
00000000  2405fffd  dfbf0058
Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
Rebooting in 1 seconds..

I'll have a go at adding a spinlock.

Joe Buehler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 20, 2010, 10:05 p.m. UTC | #3
Le mercredi 20 octobre 2010 à 13:43 -0400, Joe Buehler a écrit :
> Eric Dumazet wrote:
> 
> > 2.6.27 is a bit old, you might try :
> > 
> > commit 7fa7cb7109d07c29ab28bb877bc7049a0150dbe5
> > Author: Eric Dumazet <eric.dumazet@gmail.com>
> > Date:   Mon Sep 27 04:18:27 2010 +0000
> 
> Alas, after more load testing I find that the panic still occurs:
> 
> CPU 1 Unable to handle kernel paging request at virtual address
> 0000000000000000, epc == ffffffff8146728c, ra == ffffffff81467258
> Oops[#1]:
> Cpu 1
> $ 0   : 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> $ 4   : ffffffffffffffff a80000008c9d78f0 a80000009acd9880 000000000a205a7c
> $ 8   : 0000000000000000 0000000000000020 a80000009a9c49d0 0000000000000000
> $12   : ffffffff8155de00 0000000000000004 0000000000000001 0000000000000000
> $16   : 0000000000000000 a80000008c9d78f0 0000000000000002 a80000009b94ed80
> $20   : a80000009b94edf8 0000000000000000 0000000000000003 a80000008c9d78a0
> $24   : 0000000000000000 ffffffff812df388
> $28   : a80000008c9d4000 a80000008c9d7840 fffffffffffffff5 ffffffff81467258
> Hi    : 0000000000000000
> Lo    : 0000000000000000
> epc   : ffffffff8146728c fib_rules_lookup+0x11c/0x260
>     Not tainted
> ra    : ffffffff81467258 fib_rules_lookup+0xe8/0x260
> Status: 1010cce3    KX SX UX KERNEL EXL IE
> Cause : 00800008
> BadVA : 0000000000000000
> PrId  : 000d0409 (Cavium Octeon)
> Modules linked in: x_tables ip_tables iptable_filter nf_conntrack
> nf_conntrack_ipv4 nf_nat iptable_nat tun xt_tcpudp xt_state ipt_REJECT
> ipv6 ip6_tables ip6table_filter ip6t_ipv6header ip6t_REJECT
> Process qscope7500 (pid: 1343, threadinfo=a80000008c9d4000,
> task=a80000008c068ac0, tls=000000002d51e920)
> Stack : ffffffffffffffff 0000000000000003 a80000008c9d78d8 a80000008c9d79d8
>         a80000008c9d78f0 0000000000000000 ffffffff816c39c0 ffffffffffffffff
>         0000000000000003 00000000000004a6 0000000000000000 ffffffff814bd2a4
>         0000000000000000 a80000008c9d78d8 0000000000000000 ffffffffc001aa24
>         a80000008c9d78d8 ffffffff81478a38 0000000000000003 0000000000000001
>         0000000000000000 0000000000000000 0000000000000001 000000000a205a7c
>         0a2059bf00000000 0000000000000000 0000000000000000 0000000000000000
>         0000000000000000 0000000000000000 a80000008c9d79d0 0000000000000000
>         a80000008c9d79d8 a8000000994ff380 0000000000000000 0000000000000000
>         a80000008c9d79d0 ffffffff816c39c0 0000000000000003 00000000000004a6
>         ...
> Call Trace:
> [<ffffffff8146728c>] fib_rules_lookup+0x11c/0x260
> [<ffffffff814bd2a4>] fib_lookup+0x2c/0x48
> [<ffffffff81478a38>] __ip_route_output_key+0x918/0xf38
> [<ffffffff81479090>] ip_route_output_flow+0x38/0x2e8
> [<ffffffff81482c44>] ip_queue_xmit+0x38c/0x3a8
> [<ffffffff81497f5c>] tcp_transmit_skb+0x3f4/0x7d0
> [<ffffffff8149af2c>] __tcp_push_pending_frames+0x1fc/0x9e0
> [<ffffffff8148cf50>] tcp_sendmsg+0x900/0xe00
> [<ffffffff81441f7c>] sock_aio_write+0x16c/0x190
> [<ffffffff811d0424>] do_sync_write+0xbc/0x130
> [<ffffffff811d10a8>] vfs_write+0x150/0x158
> [<ffffffff811d124c>] sys_write+0x5c/0x118
> [<ffffffff8114532c>] handle_sys+0x12c/0x148
> 
> 
> Code: 0040282d  00000000  de100000 <de020000> cc400000  1614ffce
> 00000000  2405fffd  dfbf0058
> Fatal exception: panic in 5 seconds
> Kernel panic - not syncing: Fatal exception
> Rebooting in 1 seconds..

Thanks Joe

Could you provide a disassembly of function fib_rules_lookup ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 22, 2010, 8:30 p.m. UTC | #4
Eric Dumazet wrote:

> Could you provide a disassembly of function fib_rules_lookup ?

Try looking in http://68.100.141.95:3000/linux-crash/.  There should be
the source file I am using (not current release if you recall), the .o,
the disassembly, and a -S compile that makes deducing the line numbers a
little easier.

I have tried adding a recursive spinlock as a quick fix without much
success.  It looks like the code in net/core/fib_rules.c results in some
rather complex code paths through the kernel involving softirq handlers.
 I get lockups though I took steps to make sure local interrupts were
off and preemption disabled when taking the spinlock.

I do not know the RCU API or the network code at all but some of the
code in fib_rules.c does not look right prima facie.  For example,
dump_rules does not seem to bother about RCU when traversing rules_list.
 Its caller, fib_nl_dumprule does, but only for one of two calls.  The
fib_nl_newrule and fib_nl_delrule functions traverse rules_list without
any obvious RCU precautions.

Joe Buehler

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 23, 2010, 3:44 a.m. UTC | #5
Le vendredi 22 octobre 2010 à 16:30 -0400, Joe Buehler a écrit :
> Eric Dumazet wrote:
> 
> > Could you provide a disassembly of function fib_rules_lookup ?
> 
> Try looking in http://68.100.141.95:3000/linux-crash/.  There should be
> the source file I am using (not current release if you recall), the .o,
> the disassembly, and a -S compile that makes deducing the line numbers a
> little easier.
> 

Hmm, I'll take a look sometime in the future, thanks

> I have tried adding a recursive spinlock as a quick fix without much
> success.  It looks like the code in net/core/fib_rules.c results in some
> rather complex code paths through the kernel involving softirq handlers.
>  I get lockups though I took steps to make sure local interrupts were
> off and preemption disabled when taking the spinlock.
> 
> I do not know the RCU API or the network code at all but some of the
> code in fib_rules.c does not look right prima facie.  For example,
> dump_rules does not seem to bother about RCU when traversing rules_list.
>  Its caller, fib_nl_dumprule does, but only for one of two calls.  The
> fib_nl_newrule and fib_nl_delrule functions traverse rules_list without
> any obvious RCU precautions.

There is no obligation RCU shall be used in a slow path algorithm.

This slow path uses regular locking to prevent another process to change
the things while we read them, thats it.

Patrick made a change in the past so that RTNL is locked on dump
operations, so no dump_rules() is in fact protected by RTNL, it doesnt
need to take care of 'RCU'.

Thanks



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 23, 2010, 3:11 p.m. UTC | #6
Le samedi 23 octobre 2010 à 05:44 +0200, Eric Dumazet a écrit :
> Le vendredi 22 octobre 2010 à 16:30 -0400, Joe Buehler a écrit :
> > Eric Dumazet wrote:
> > 
> > > Could you provide a disassembly of function fib_rules_lookup ?
> > 
> > Try looking in http://68.100.141.95:3000/linux-crash/.  There should be
> > the source file I am using (not current release if you recall), the .o,
> > the disassembly, and a -S compile that makes deducing the line numbers a
> > little easier.
> > 
> 
> Hmm, I'll take a look sometime in the future, thanks

Did that... Hmm...

I am wondering if smp_rcu_assign_pointer() (or more precisely smp_wmb())
is correctly implemented on octeon platform.

Try to add in fib_nl_newrule() right after the kzalloc bloc :

	rule = kzalloc(ops->rule_size, GFP_KERNEL);
	if (rule == NULL) {
		err = -ENOMEM;
		goto errout;
	}
+       rule->list.next = LIST_POISON1;
+       rule->list.prev = LIST_POISON2;


So that we can actually see if the NULL dereference bug you hit becomes
a "LIST_POISON1" dereference bug...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 23, 2010, 3:24 p.m. UTC | #7
Le samedi 23 octobre 2010 à 17:11 +0200, Eric Dumazet a écrit :
> Le samedi 23 octobre 2010 à 05:44 +0200, Eric Dumazet a écrit :
> > Le vendredi 22 octobre 2010 à 16:30 -0400, Joe Buehler a écrit :
> > > Eric Dumazet wrote:
> > > 
> > > > Could you provide a disassembly of function fib_rules_lookup ?
> > > 
> > > Try looking in http://68.100.141.95:3000/linux-crash/.  There should be
> > > the source file I am using (not current release if you recall), the .o,
> > > the disassembly, and a -S compile that makes deducing the line numbers a
> > > little easier.
> > > 
> > 
> > Hmm, I'll take a look sometime in the future, thanks
> 
> Did that... Hmm...
> 
> I am wondering if smp_rcu_assign_pointer() (or more precisely smp_wmb())
> is correctly implemented on octeon platform.
> 
> Try to add in fib_nl_newrule() right after the kzalloc bloc :
> 
> 	rule = kzalloc(ops->rule_size, GFP_KERNEL);
> 	if (rule == NULL) {
> 		err = -ENOMEM;
> 		goto errout;
> 	}
> +       rule->list.next = LIST_POISON1;
> +       rule->list.prev = LIST_POISON2;
> 
> 
> So that we can actually see if the NULL dereference bug you hit becomes
> a "LIST_POISON1" dereference bug...
> 
> 

Reading commit 500c2e1fdbcc2b273bd is interesting...

David Daney added a nudge_writes(), actually doing a "syncw"
instruction, and this seems to be the smp_wmb() this platform should be
using in the first place, not a pure compiler barrier (barrier()) 

So Joe, you might want to change the smp_wmb() call in
rcu_assign_pointer() by the nudge_writes() call, and see what happens...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 23, 2010, 3:33 p.m. UTC | #8
Eric Dumazet wrote:
>
> Did that... Hmm...
>
> I am wondering if smp_rcu_assign_pointer() (or more precisely smp_wmb())
> is correctly implemented on octeon platform.
>
> Try to add in fib_nl_newrule() right after the kzalloc bloc :
>
> 	rule = kzalloc(ops->rule_size, GFP_KERNEL);
> 	if (rule == NULL) {
> 		err = -ENOMEM;
> 		goto errout;
> 	}
> +       rule->list.next = LIST_POISON1;
> +       rule->list.prev = LIST_POISON2;
>
>
> So that we can actually see if the NULL dereference bug you hit becomes
> a "LIST_POISON1" dereference bug...
>
>
>   

Thanks -- I'll try it when I'm back in the office Tuesday.

It is always possible that there is some issue with the Octeon memory
barrier stuff, but I would think that the system would be much more
unstable than it is -- we're really beating on a dual CPU LINUX instance
that has Java and C++ apps running and also doing some network I/O.

My strategy at this point is logging events to memory and dumping the
log to the console at the time of the panic.  I might be able to figure
out the sequence of events causing the crash.

The load test that causes the panic is using several dozen TAP
interfaces, ifconfig'd up/down every 10 seconds or so, with
source-routes, DNAT and SNAT being set up and taken down also.

Joe Buehler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 23, 2010, 3:40 p.m. UTC | #9
Eric Dumazet wrote:
>
> David Daney added a nudge_writes(), actually doing a "syncw"
> instruction, and this seems to be the smp_wmb() this platform should be
> using in the first place, not a pure compiler barrier (barrier()) 
>
> So Joe, you might want to change the smp_wmb() call in
> rcu_assign_pointer() by the nudge_writes() call, and see what happens...
>
>
>   

I think Daney is Cavium's Octeon LINUX guru from the posts I've seen so
he would definitely know the platform.  I'm not sure I quite understand
what you are saying but it sounds as though you are saying that smp_wb
is not doing a syncw and that sounds *totally* broken -- snycw is what
the low-level Cavium SDK uses for memory barriers all over the place.

Joe Buehler

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 23, 2010, 4:07 p.m. UTC | #10
Le samedi 23 octobre 2010 à 11:40 -0400, Joe Buehler a écrit :
> Eric Dumazet wrote:
> >
> > David Daney added a nudge_writes(), actually doing a "syncw"
> > instruction, and this seems to be the smp_wmb() this platform should be
> > using in the first place, not a pure compiler barrier (barrier()) 
> >
> > So Joe, you might want to change the smp_wmb() call in
> > rcu_assign_pointer() by the nudge_writes() call, and see what happens...
> >
> >
> >   
> 
> I think Daney is Cavium's Octeon LINUX guru from the posts I've seen so
> he would definitely know the platform.  I'm not sure I quite understand
> what you are saying but it sounds as though you are saying that smp_wb
> is not doing a syncw and that sounds *totally* broken -- snycw is what
> the low-level Cavium SDK uses for memory barriers all over the place.
> 

Yes, I am saying exactly this : smp_wmb() is a barrier() only, at least
on the disassembly you provided to me. It might be fine (it is the same
on x86 for example)

fib_rules.old.s

.L234:
	.loc 1 338 0
	beq	$9,$0,.L235   if (last) {
.LBB911:					// last = prev->next
.LBB912:
	.loc 12 45 0
	ld	$2,0($9)   //next = prev->next
.LBB913:
.LBB914:
	.loc 12 22 0
	sd	$9,8($18)   // part of list_add_rcu  new->prev = prev;
	.loc 12 21 0
	sd	$2,0($18)   // new->next = next;
	.loc 12 23 0
<<!>>
	sd	$18,0($9)  //rcu_assign_pointer(prev->next, new); 
	.loc 12 24 0
	sd	$18,8($2)  // next->prev = new;
.L236:
.LBE914:
.LBE913:


No syncw here at least.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 23, 2010, 4:35 p.m. UTC | #11
Le samedi 23 octobre 2010 à 11:33 -0400, Joe Buehler a écrit :

> It is always possible that there is some issue with the Octeon memory
> barrier stuff, but I would think that the system would be much more
> unstable than it is -- we're really beating on a dual CPU LINUX instance
> that has Java and C++ apps running and also doing some network I/O.
> 
> My strategy at this point is logging events to memory and dumping the
> log to the console at the time of the panic.  I might be able to figure
> out the sequence of events causing the crash.
> 
> The load test that causes the panic is using several dozen TAP
> interfaces, ifconfig'd up/down every 10 seconds or so, with
> source-routes, DNAT and SNAT being set up and taken down also.

With a normal workload, on a dual cpu machine, a missing memory barrier
can stay un-noticed for quite a long time. The race window is so small
that probability for the bug might be 0.0000001 % or something like
that :(

You could try to run a test dual threaded program to reproduce the
problem in user land, faster...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Arkhangelsky Oct. 23, 2010, 5:37 p.m. UTC | #12
23.10.2010, 20:36, "Eric Dumazet" <eric.dumazet@gmail.com>:

> With a normal workload, on a dual cpu machine, a missing memory barrier
> can stay un-noticed for quite a long time. The race window is so small
> that probability for the bug might be 0.0000001 % or something like
> that :(

Eric, I'd like to remind you that I've faced the similar problem on simple x86.

See http://kerneltrap.org/mailarchive/linux-netdev/2010/3/9/6271568

Two main differences for our case:

1) There is no userspace workload (except for bgpd), no changes in interfaces
2) We are not using multiple routing tables

This panic was pretty rare in our case  (not more that 2 times per month).

Currently we're running fine with disabled CONFIG_IP_MULTIPLE_TABLES.
Joe Buehler Oct. 26, 2010, 1:29 p.m. UTC | #13
Eric Dumazet wrote:

> With a normal workload, on a dual cpu machine, a missing memory barrier
> can stay un-noticed for quite a long time. The race window is so small
> that probability for the bug might be 0.0000001 % or something like
> that :(

I'm looking at the LINUX source at the moment and not liking what I see
in include/asm-mips/barrier.h:

#define smp_mb()	__asm__ __volatile__(__WEAK_ORDERING_MB : : :"memory")
#ifdef CONFIG_CPU_CAVIUM_OCTEON
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#else
#define smp_rmb()	__asm__ __volatile__(__WEAK_ORDERING_MB : : :"memory")
#define smp_wmb()	__asm__ __volatile__(__WEAK_ORDERING_MB : : :"memory")
#endif

The Octeon documentation explicitly says that neither loads nor stores
need execute in program order, so the definitions for smp_rmb and
smp_wmb appear to be wrong wrong wrong.

It appears that smp_wmb should be making use of SYNCW and smp_rmb should
be making use of SYNC.

Should I pursue this question on the main LINUX kernel list?

Joe Buehler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 26, 2010, 1:36 p.m. UTC | #14
Le mardi 26 octobre 2010 à 09:29 -0400, Joe Buehler a écrit :
> Eric Dumazet wrote:
> 
> > With a normal workload, on a dual cpu machine, a missing memory barrier
> > can stay un-noticed for quite a long time. The race window is so small
> > that probability for the bug might be 0.0000001 % or something like
> > that :(
> 
> I'm looking at the LINUX source at the moment and not liking what I see
> in include/asm-mips/barrier.h:
> 
> #define smp_mb()	__asm__ __volatile__(__WEAK_ORDERING_MB : : :"memory")
> #ifdef CONFIG_CPU_CAVIUM_OCTEON
> #define smp_rmb()	barrier()
> #define smp_wmb()	barrier()
> #else
> #define smp_rmb()	__asm__ __volatile__(__WEAK_ORDERING_MB : : :"memory")
> #define smp_wmb()	__asm__ __volatile__(__WEAK_ORDERING_MB : : :"memory")
> #endif
> 
> The Octeon documentation explicitly says that neither loads nor stores
> need execute in program order, so the definitions for smp_rmb and
> smp_wmb appear to be wrong wrong wrong.
> 
> It appears that smp_wmb should be making use of SYNCW and smp_rmb should
> be making use of SYNC.
> 
> Should I pursue this question on the main LINUX kernel list?

Well, it would be surprising this being wrong and crash only once in a
while in fib_rules_lookup

Did you tried my last patch ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 26, 2010, 1:53 p.m. UTC | #15
Eric Dumazet wrote:

> Well, it would be surprising this being wrong and crash only once in a
> while in fib_rules_lookup

> Did you tried my last patch ?

There was a patch to the kernel by David Daney back in January to
improve performance of Octeon memory barriers.  The patch changes the
generic MIPS barrier code to introduce optimizations for Octeon.  The
LINUX version I am using is from the Octeon SDK and appears to have an
early version of this patch.  It's broken however -- the Jan patch has
proper SYNCW instructions in smp_wmb while the SDK version does not.

I have made your changes but will also fold in this change, then start
some load testing.

The real-time scheduler is broken in the LINUX I am using -- I get
kernel crashes -- and I would be most happy if the SYNCW fix fixed that
also.

Joe Buehler

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 26, 2010, 1:58 p.m. UTC | #16
Le mardi 26 octobre 2010 à 09:53 -0400, Joe Buehler a écrit :
> Eric Dumazet wrote:
> 
> > Well, it would be surprising this being wrong and crash only once in a
> > while in fib_rules_lookup
> 
> > Did you tried my last patch ?
> 
> There was a patch to the kernel by David Daney back in January to
> improve performance of Octeon memory barriers.  The patch changes the
> generic MIPS barrier code to introduce optimizations for Octeon.  The
> LINUX version I am using is from the Octeon SDK and appears to have an
> early version of this patch.  It's broken however -- the Jan patch has
> proper SYNCW instructions in smp_wmb while the SDK version does not.
> 
> I have made your changes but will also fold in this change, then start
> some load testing.
> 
> The real-time scheduler is broken in the LINUX I am using -- I get
> kernel crashes -- and I would be most happy if the SYNCW fix fixed that
> also.
> 
> Joe Buehler
> 

Just to make sure, you do use two "syncw", not only one ?

/*
 * We actually use two syncw instructions in a row when we need a write
 * memory barrier. This is because the CN3XXX series of Octeons have
 * errata Core-401. This can cause a single syncw to not enforce
 * ordering under very rare conditions. Even if it is rare, better safe
 * than sorry.
 */
#define CVMX_SYNCW asm volatile ("syncw\n\tsyncw" : : : "memory")




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 26, 2010, 2:33 p.m. UTC | #17
Le mardi 26 octobre 2010 à 09:53 -0400, Joe Buehler a écrit :
> Eric Dumazet wrote:
> 
> > Well, it would be surprising this being wrong and crash only once in a
> > while in fib_rules_lookup
> 
> > Did you tried my last patch ?
> 
> There was a patch to the kernel by David Daney back in January to
> improve performance of Octeon memory barriers.  The patch changes the
> generic MIPS barrier code to introduce optimizations for Octeon.  The
> LINUX version I am using is from the Octeon SDK and appears to have an
> early version of this patch.  It's broken however -- the Jan patch has
> proper SYNCW instructions in smp_wmb while the SDK version does not.
> 
> I have made your changes but will also fold in this change, then start
> some load testing.
> 
> The real-time scheduler is broken in the LINUX I am using -- I get
> kernel crashes -- and I would be most happy if the SYNCW fix fixed that
> also.
> 
>

Just to make sure, are you using a single syncw, or a double one ?

/*
 * We actually use two syncw instructions in a row when we need a write
 * memory barrier. This is because the CN3XXX series of Octeons have
 * errata Core-401. This can cause a single syncw to not enforce
 * ordering under very rare conditions. Even if it is rare, better safe
 * than sorry.
 */
#define CVMX_SYNCW asm volatile ("syncw\n\tsyncw" : : : "memory")





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 26, 2010, 2:41 p.m. UTC | #18
Eric Dumazet wrote:

> Just to make sure, are you using a single syncw, or a double one ?

I'm using double.  I know about the errata.  It doesn't apply to all
Octeon versions but in their SDK they always use two without worrying
about the chip variant.

What I actually to patch this was look at the current LINUX prerelease
and copy what it does.

We'll see later today if all these changes fix my crash, it takes a
while to encounter it.

Joe Buehler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Oct. 27, 2010, 9:01 p.m. UTC | #19
Eric, tests with your fixes and with the smp barrier corrections are
looking positive.  More testing is needed before a final verdict is in
however -- the load test crashed after 8 hours due to an application bug
that we're working on fixing now.

Joe Buehler

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 27, 2010, 9:05 p.m. UTC | #20
Le mercredi 27 octobre 2010 à 17:01 -0400, Joe Buehler a écrit :
> Eric, tests with your fixes and with the smp barrier corrections are
> looking positive.  More testing is needed before a final verdict is in
> however -- the load test crashed after 8 hours due to an application bug
> that we're working on fixing now.
> 

Excellent ! Thanks for the report.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Buehler Dec. 2, 2010, 1:36 p.m. UTC | #21
We have had no further problems with crashes.  The SMP memory barrier
fixes seem to have been the final piece that was missing.

Thanks for all the help, you guys are great.

Joe Buehler

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 2, 2010, 2:04 p.m. UTC | #22
Le jeudi 02 décembre 2010 à 08:36 -0500, Joe Buehler a écrit :
> We have had no further problems with crashes.  The SMP memory barrier
> fixes seem to have been the final piece that was missing.
> 
> Thanks for all the help, you guys are great.
> 
> Joe Buehler
> 

Great, thanks for your feedback Joe !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 42e84e0..d078728 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -225,9 +225,11 @@  jumped:
 			err = ops->action(rule, fl, flags, arg);
 
 		if (err != -EAGAIN) {
-			fib_rule_get(rule);
-			arg->rule = rule;
-			goto out;
+			if (likely(atomic_inc_not_zero(&rule->refcnt))) {
+				arg->rule = rule;
+				goto out;
+			}
+			break;
 		}
 	}