
net/802/mrp: fix lockdep splat

Message ID 1368447851.13473.119.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 13, 2013, 12:24 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

On Mon, 2013-05-13 at 14:59 +0300, Denys Fedoryshchenko wrote:
> Hi
> 
> Noticed on 32-bit builds of my systems that r8169 is not working anymore
> at all.
> The problem started to occur somewhere after the 3.6 kernels, but it is
> difficult to try other kernels on this semi-embedded system, so I am
> not completely sure about that.
> Once I used another OS build, a 64-bit build with the same latest kernel,
> and it worked.
> It spits netdev watchdog timeouts and constantly shows the interface
> coming up.
> 
> Here is the maximum debug output (dmesg) I got, with some inline comments:

> 
> Here I am trying to remove the module
> 
> [   19.735147] =================================
> [   19.735235] [ INFO: inconsistent lock state ]
> [   19.735324] 3.9.2-build-0063 #4 Not tainted
> [   19.735412] ---------------------------------
> [   19.735500] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> [   19.735592] rmmod/1840 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   19.735682]  (&(&app->lock)->rlock#2){+.?...}, at: [<f862bb5b>] 
> mrp_uninit_applicant+0x69/0xba [mrp]
> [   19.735932] {IN-SOFTIRQ-W} state was registered at:
> [   19.736002]   [<c01611d8>] __lock_acquire+0x284/0xd52
> [   19.736002]   [<c0162068>] lock_acquire+0x71/0x85
> [   19.736002]   [<c03e4ad5>] _raw_spin_lock+0x33/0x40
> [   19.736002]   [<f862b5dc>] mrp_join_timer+0x11/0x3d [mrp]
> [   19.736002]   [<c0136102>] call_timer_fn+0x69/0xdc
> [   19.736002]   [<c01362e6>] run_timer_softirq+0x171/0x1ab
> [   19.736002]   [<c0131963>] __do_softirq+0x98/0x153
> [   19.736002]   [<c0131afa>] irq_exit+0x47/0x82
> [   19.736002]   [<c011c274>] smp_apic_timer_interrupt+0x64/0x71
> [   19.736002]   [<c03e5616>] apic_timer_interrupt+0x32/0x38
> [   19.736002]   [<c010902c>] cpu_idle+0x55/0x6f
> [   19.736002]   [<c03d7d19>] rest_init+0xa1/0xa7
> [   19.736002]   [<c057792f>] start_kernel+0x300/0x307
> [   19.736002]   [<c05772a3>] i386_start_kernel+0x79/0x7d
> [   19.736002] irq event stamp: 4379
> [   19.736002] hardirqs last  enabled at (4379): [<c03e50bb>] 
> _raw_spin_unlock_irqrestore+0x36/0x3c
> [   19.736002] hardirqs last disabled at (4378): [<c03e4b92>] 
> _raw_spin_lock_irqsave+0x16/0x50
> [   19.736002] softirqs last  enabled at (4342): [<c0379c22>] 
> neigh_ifdown+0x95/0xb5
> [   19.736002] softirqs last disabled at (4340): [<c03e4cc9>] 
> _raw_write_lock_bh+0xe/0x45
> [   19.736002]
> [   19.736002] other info that might help us debug this:
> [   19.736002]  Possible unsafe locking scenario:
> [   19.736002]
> [   19.736002]        CPU0
> [   19.736002]        ----
> [   19.736002]   lock(&(&app->lock)->rlock#2);
> [   19.736002]   <Interrupt>
> [   19.736002]     lock(&(&app->lock)->rlock#2);
> [   19.736002]
> [   19.736002]  *** DEADLOCK ***
> [   19.736002]
> [   19.736002] 3 locks held by rmmod/1840:
> [   19.736002]  #0:  (&__lockdep_no_validate__){......}, at: 
> [<c02c195c>] device_lock+0xd/0xf
> [   19.736002]  #1:  (&__lockdep_no_validate__){......}, at: 
> [<c02c195c>] device_lock+0xd/0xf
> [   19.736002]  #2:  (rtnl_mutex){+.+.+.}, at: [<c037d71b>] 
> rtnl_lock+0xf/0x11
> [   19.736002]
> [   19.736002] stack backtrace:
> [   19.736002] Pid: 1840, comm: rmmod Not tainted 3.9.2-build-0063 #4
> [   19.736002] Call Trace:
> [   19.736002]  [<c0160d8e>] valid_state+0x1f6/0x201
> [   19.736002]  [<c0160e6a>] mark_lock+0xd1/0x1bb
> [   19.736002]  [<c01606b1>] ? check_usage_forwards+0x77/0x77
> [   19.736002]  [<c016124c>] __lock_acquire+0x2f8/0xd52
> [   19.736002]  [<c0160d00>] ? valid_state+0x168/0x201
> [   19.736002]  [<c0160dbf>] ? mark_lock+0x26/0x1bb
> [   19.736002]  [<c016230e>] ? mark_held_locks+0x5d/0x7b
> [   19.736002]  [<c03e50bb>] ? _raw_spin_unlock_irqrestore+0x36/0x3c
> [   19.736002]  [<c0162068>] lock_acquire+0x71/0x85
> [   19.736002]  [<f862bb5b>] ? mrp_uninit_applicant+0x69/0xba [mrp]
> [   19.736002]  [<c03e4ad5>] _raw_spin_lock+0x33/0x40
> [   19.736002]  [<f862bb5b>] ? mrp_uninit_applicant+0x69/0xba [mrp]
> [   19.736002]  [<f862bb5b>] mrp_uninit_applicant+0x69/0xba [mrp]
> [   19.736002]  [<c0172e67>] ? kfree_call_rcu+0x17/0x19
> [   19.736002]  [<f8647bf6>] vlan_mvrp_uninit_applicant+0xd/0xf [8021q]
> [   19.736002]  [<f8646198>] unregister_vlan_dev+0xb8/0xe2 [8021q]
> [   19.736002]  [<c03d1ae8>] ? packet_mm_close+0x1f/0x1f
> [   19.736002]  [<f86464c9>] vlan_device_event+0x307/0x35a [8021q]
> [   19.736002]  [<c03d1ce7>] ? rcu_read_unlock+0x17/0x19
> [   19.736002]  [<c0146215>] notifier_call_chain+0x26/0x48
> [   19.736002]  [<c0146283>] raw_notifier_call_chain+0x1a/0x1c
> [   19.736002]  [<c0370310>] call_netdevice_notifiers+0x44/0x4b
> [   19.736002]  [<c0370604>] rollback_registered_many+0xe2/0x185
> [   19.736002]  [<c037070d>] rollback_registered+0x1f/0x2c
> [   19.736002]  [<c0370777>] unregister_netdevice_queue+0x5d/0x87
> [   19.736002]  [<c03e3464>] ? mutex_lock_nested+0x20/0x22
> [   19.736002]  [<c037d71b>] ? rtnl_lock+0xf/0x11
> [   19.736002]  [<c0370851>] unregister_netdev+0x18/0x1f
> [   19.736002]  [<f85cb017>] rtl_remove_one+0x7f/0xf9 [r8169]
> [   19.736002]  [<c02c9117>] ? __pm_runtime_resume+0x48/0x50
> [   19.736002]  [<c026cea9>] pci_device_remove+0x27/0x76
> [   19.736002]  [<c02c1b70>] __device_release_driver+0x66/0xaa
> [   19.736002]  [<c02c20a3>] driver_detach+0x62/0x83
> [   19.736002]  [<c02c1919>] bus_remove_driver+0x69/0x88
> [   19.736002]  [<c02c23be>] driver_unregister+0x53/0x5a
> [   19.736002]  [<c016243a>] ? trace_hardirqs_on_caller+0x10e/0x13f
> [   19.736002]  [<c026cf87>] pci_unregister_driver+0x10/0x5a
> [   19.736002]  [<f85cd8fb>] rtl8169_pci_driver_exit+0xd/0xf [r8169]
> [   19.736002]  [<c01688cf>] sys_delete_module+0x175/0x1c1
> [   19.736002]  [<c03e5367>] ? restore_all+0xf/0xf
> [   19.736002]  [<c03e5334>] syscall_call+0x7/0xb

Thanks for the report. Could you please try the following fix for the
module unload problem?

[PATCH] net/802/mrp: fix lockdep splat

commit fb745e9a037895 ("net/802/mrp: fix possible race condition when
calling mrp_pdu_queue()") introduced a lockdep splat.

[   19.735147] =================================
[   19.735235] [ INFO: inconsistent lock state ]
[   19.735324] 3.9.2-build-0063 #4 Not tainted
[   19.735412] ---------------------------------
[   19.735500] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[   19.735592] rmmod/1840 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   19.735682]  (&(&app->lock)->rlock#2){+.?...}, at: [<f862bb5b>] 
mrp_uninit_applicant+0x69/0xba [mrp]

app->lock is normally taken under softirq context, so disable BH to
avoid the splat.
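
To illustrate the rule lockdep enforces here, a simplified sketch follows
(illustrative only, not the actual net/802/mrp.c code; example_lock and the
example_* functions are made-up names): a lock that the join timer takes in
softirq context must be taken with BH disabled by every process-context
user, otherwise the timer softirq could interrupt the critical section on
the same CPU and spin on the lock forever.

/* Simplified illustration only -- not the actual net/802/mrp.c code. */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);	/* stands in for app->lock */

/* Timer handler: runs in softirq context, so plain spin_lock() is fine. */
static void example_timer_side(void)
{
	spin_lock(&example_lock);
	/* ... queue and transmit pending PDUs ... */
	spin_unlock(&example_lock);
}

/* Process context (e.g. device teardown under rtnl): use the _bh variant;
 * a plain spin_lock() here is exactly what lockdep flags above.
 */
static void example_process_side(void)
{
	spin_lock_bh(&example_lock);
	/* ... flush remaining state ... */
	spin_unlock_bh(&example_lock);
}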

Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ward <david.ward@ll.mit.edu>
Cc: Cong Wang <amwang@redhat.com>
---
 net/802/mrp.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)




Comments

Denys Fedoryshchenko May 13, 2013, 12:53 p.m. UTC | #1
Hi

Thank you for the super-fast answer :)
It has changed a bit, but not disappeared. The relevant part from the new dmesg:

[    3.936191] =================================
[    3.936279] [ INFO: inconsistent lock state ]
[    3.936367] 3.9.2-build-0064 #6 Not tainted
[    3.936453] ---------------------------------
[    3.936540] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[    3.936630] rmmod/1843 [HC0[0]:SC0[0]:HE1:SE1] takes:
[    3.936718]  (&(&app->lock)->rlock#2){+.?...}, at: [<f862bb5b>] 
mrp_uninit_applicant+0x69/0xba [mrp]
[    3.936967] {IN-SOFTIRQ-W} state was registered at:
[    3.937055]   [<c01611d8>] __lock_acquire+0x284/0xd52
[    3.937058]   [<c0162068>] lock_acquire+0x71/0x85
[    3.937058]   [<c03e4cae>] _raw_spin_lock_bh+0x38/0x45
[    3.937058]   [<f862b5dc>] mrp_join_timer+0x11/0x3d [mrp]
[    3.937058]   [<c0136102>] call_timer_fn+0x69/0xdc
[    3.937058]   [<c01362e6>] run_timer_softirq+0x171/0x1ab
[    3.937058]   [<c0131963>] __do_softirq+0x98/0x153
[    3.937058]   [<c0131afa>] irq_exit+0x47/0x82
[    3.937058]   [<c011c274>] smp_apic_timer_interrupt+0x64/0x71
[    3.937058]   [<c03e5616>] apic_timer_interrupt+0x32/0x38
[    3.937058]   [<c03e3464>] mutex_lock_nested+0x20/0x22
[    3.937058]   [<c019a272>] unlink_file_vma+0x25/0x75
[    3.937058]   [<c0195dfc>] free_pgtables+0x55/0x99
[    3.937058]   [<c019b4a2>] exit_mmap+0x83/0xd5
[    3.937058]   [<c0129fe7>] mmput+0x4c/0xc4
[    3.937058]   [<c013004d>] do_exit+0x2ac/0x730
[    3.937058]   [<c0130615>] do_group_exit+0x60/0x83
[    3.937058]   [<c013064b>] sys_exit_group+0x13/0x17
[    3.937058]   [<c03e5334>] syscall_call+0x7/0xb
[    3.937058] irq event stamp: 4683
[    3.937058] hardirqs last  enabled at (4683): [<c03e50bb>] 
_raw_spin_unlock_irqrestore+0x36/0x3c
[    3.937058] hardirqs last disabled at (4682): [<c03e4b92>] 
_raw_spin_lock_irqsave+0x16/0x50
[    3.937058] softirqs last  enabled at (4646): [<c0379c22>] 
neigh_ifdown+0x95/0xb5
[    3.937058] softirqs last disabled at (4644): [<c03e4cc9>] 
_raw_write_lock_bh+0xe/0x45
[    3.937058]
[    3.937058] other info that might help us debug this:
[    3.937058]  Possible unsafe locking scenario:
[    3.937058]
[    3.937058]        CPU0
[    3.937058]        ----
[    3.937058]   lock(&(&app->lock)->rlock#2);
[    3.937058]   <Interrupt>
[    3.937058]     lock(&(&app->lock)->rlock#2);
[    3.937058]
[    3.937058]  *** DEADLOCK ***
[    3.937058]
[    3.937058] 3 locks held by rmmod/1843:
[    3.937058]  #0:  (&__lockdep_no_validate__){......}, at: 
[<c02c195c>] device_lock+0xd/0xf
[    3.937058]  #1:  (&__lockdep_no_validate__){......}, at: 
[<c02c195c>] device_lock+0xd/0xf
[    3.937058]  #2:  (rtnl_mutex){+.+.+.}, at: [<c037d71b>] 
rtnl_lock+0xf/0x11
[    3.937058]
[    3.937058] stack backtrace:
[    3.937058] Pid: 1843, comm: rmmod Not tainted 3.9.2-build-0064 #6
[    3.937058] Call Trace:
[    3.937058]  [<c0160d8e>] valid_state+0x1f6/0x201
[    3.937058]  [<c0160e6a>] mark_lock+0xd1/0x1bb
[    3.937058]  [<c01606b1>] ? check_usage_forwards+0x77/0x77
[    3.937058]  [<c016124c>] __lock_acquire+0x2f8/0xd52
[    3.937058]  [<c0160d00>] ? valid_state+0x168/0x201
[    3.937058]  [<c0160dbf>] ? mark_lock+0x26/0x1bb
[    3.937058]  [<c016230e>] ? mark_held_locks+0x5d/0x7b
[    3.937058]  [<c03e50bb>] ? _raw_spin_unlock_irqrestore+0x36/0x3c
[    3.937058]  [<c0162068>] lock_acquire+0x71/0x85
[    3.937058]  [<f862bb5b>] ? mrp_uninit_applicant+0x69/0xba [mrp]
[    3.937058]  [<c03e4ad5>] _raw_spin_lock+0x33/0x40
[    3.937058]  [<f862bb5b>] ? mrp_uninit_applicant+0x69/0xba [mrp]
[    3.937058]  [<f862bb5b>] mrp_uninit_applicant+0x69/0xba [mrp]
[    3.937058]  [<c0172e67>] ? kfree_call_rcu+0x17/0x19
[    3.937058]  [<f8647bf6>] vlan_mvrp_uninit_applicant+0xd/0xf [8021q]
[    3.937058]  [<f8646198>] unregister_vlan_dev+0xb8/0xe2 [8021q]
[    3.937058]  [<c03d1ae8>] ? packet_mm_close+0x1f/0x1f
[    3.937058]  [<f86464c9>] vlan_device_event+0x307/0x35a [8021q]
[    3.937058]  [<c03d1ce7>] ? rcu_read_unlock+0x17/0x19
[    3.937058]  [<c0146215>] notifier_call_chain+0x26/0x48
[    3.937058]  [<c0146283>] raw_notifier_call_chain+0x1a/0x1c
[    3.937058]  [<c0370310>] call_netdevice_notifiers+0x44/0x4b
[    3.937058]  [<c0370604>] rollback_registered_many+0xe2/0x185
[    3.937058]  [<c037070d>] rollback_registered+0x1f/0x2c
[    3.937058]  [<c0370777>] unregister_netdevice_queue+0x5d/0x87
[    3.937058]  [<c03e3464>] ? mutex_lock_nested+0x20/0x22
[    3.937058]  [<c037d71b>] ? rtnl_lock+0xf/0x11
[    3.937058]  [<c0370851>] unregister_netdev+0x18/0x1f
[    3.937058]  [<f85cb017>] rtl_remove_one+0x7f/0xf9 [r8169]
[    3.937058]  [<c02c9117>] ? __pm_runtime_resume+0x48/0x50
[    3.937058]  [<c026cea9>] pci_device_remove+0x27/0x76
[    3.937058]  [<c02c1b70>] __device_release_driver+0x66/0xaa
[    3.937058]  [<c02c20a3>] driver_detach+0x62/0x83
[    3.937058]  [<c02c1919>] bus_remove_driver+0x69/0x88
[    3.937058]  [<c02c23be>] driver_unregister+0x53/0x5a
[    3.937058]  [<c016243a>] ? trace_hardirqs_on_caller+0x10e/0x13f
[    3.937058]  [<c026cf87>] pci_unregister_driver+0x10/0x5a
[    3.937058]  [<f85cd8fb>] rtl8169_pci_driver_exit+0xd/0xf [r8169]
[    3.937058]  [<c01688cf>] sys_delete_module+0x175/0x1c1
[    3.937058]  [<c03e5367>] ? restore_all+0xf/0xf
[    3.937058]  [<c03e5334>] syscall_call+0x7/0xb



On 2013-05-13 15:24, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> [quoted original report, lockdep splat, and patch description snipped]
> 
> diff --git a/net/802/mrp.c b/net/802/mrp.c
> index e085bcc..1eb05d8 100644
> --- a/net/802/mrp.c
> +++ b/net/802/mrp.c
> @@ -871,10 +871,10 @@ void mrp_uninit_applicant(struct net_device
> *dev, struct mrp_application *appl)
> 	 */
> 	del_timer_sync(&app->join_timer);
> 
> -	spin_lock(&app->lock);
> +	spin_lock_bh(&app->lock);
> 	mrp_mad_event(app, MRP_EVENT_TX);
> 	mrp_pdu_queue(app);
> -	spin_unlock(&app->lock);
> +	spin_unlock_bh(&app->lock);
> 
> 	mrp_queue_xmit(app);

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
Amerigo Wang May 14, 2013, 1:12 a.m. UTC | #2
On Mon, 2013-05-13 at 05:24 -0700, Eric Dumazet wrote:
>         del_timer_sync(&app->join_timer);
>  
> -       spin_lock(&app->lock);
> +       spin_lock_bh(&app->lock); 

But we have just removed the join timer before taking this spin lock, so
it is not possible to race with the timer; we still have a race condition
with mrp_rcv(), which is also called in BH context. However, this is not
what Denys reported, so I have no idea about this.


Eric Dumazet May 14, 2013, 1:30 a.m. UTC | #3
On Tue, 2013-05-14 at 09:12 +0800, Cong Wang wrote:
> On Mon, 2013-05-13 at 05:24 -0700, Eric Dumazet wrote:
> >         del_timer_sync(&app->join_timer);
> >  
> > -       spin_lock(&app->lock);
> > +       spin_lock_bh(&app->lock); 
> 
> But we just have removed the join timer before taking this spin lock, so
> it is not possible to race with timer, but we still have race condition
> with mrp_rcv() which is called in BH context too. However, this is not
> what Denys reported, so I have no idea on this.
> 

That's a lockdep splat.

lockdep cannot know the join timer won't fire again.



David Miller May 14, 2013, 7:07 p.m. UTC | #4
From: Denys Fedoryshchenko <denys@visp.net.lb>
Date: Mon, 13 May 2013 15:53:30 +0300

> Thank you for superfast answer :)
> It is changed a bit, but not disappeared. Relevant part from new
> dmesg:

Can you double check that you really had Eric's patch applied?
lockdep appears to be complaining about the same thing in your
log dump, as if the patch was not really applied.

It's saying that app->lock can be taken from the join timer in
softirq, but mrp_uninit_applicant() takes it without disabling
softirqs.

Eric's patch explicitly fixes this, by making sure that spin_lock_bh()
is used there.
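
As a rough conceptual sketch of why the _bh variant is sufficient
(simplified; the kernel's real spin_lock_bh() also does preemption and
lockdep bookkeeping, and the sketch_* names below are made up): it
brackets the critical section with BH disable/enable, so the join timer
softirq cannot run on this CPU while app->lock is held.

/* Conceptual sketch only -- not the kernel's real definitions. */
#include <linux/bottom_half.h>
#include <linux/spinlock.h>

static inline void sketch_spin_lock_bh(spinlock_t *lock)
{
	local_bh_disable();	/* no softirqs (hence no timer handlers) on this CPU */
	spin_lock(lock);
}

static inline void sketch_spin_unlock_bh(spinlock_t *lock)
{
	spin_unlock(lock);
	local_bh_enable();	/* re-enable softirqs, run any pending ones */
}
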
Eric Dumazet May 14, 2013, 7:21 p.m. UTC | #5
On Tue, 2013-05-14 at 12:07 -0700, David Miller wrote:

> Can you double check that you really had Eric's patch applied?
> lockdep appears to be complaining about the same thing in your
> log dump, as if the patch was not really applied.
> 
> It's saying that app->lock can be taken from the join timer in
> softirq, but mrp_uninit_applicant() takes it without disabling
> softirqs.
> 
> Eric's patch explicitly fixes this, by making sure that spin_lock_bh()
> is used there.

I am going to test this myself; it seems quite simple.
I'll add a Tested-by: tag once done.

Thanks


Eric Dumazet May 14, 2013, 7:40 p.m. UTC | #6
On Tue, 2013-05-14 at 12:21 -0700, Eric Dumazet wrote:
> On Tue, 2013-05-14 at 12:07 -0700, David Miller wrote:
> 
> > Can you double check that you really had Eric's patch applied?
> > lockdep appears to be complaining about the same thing in your
> > log dump, as if the patch was not really applied.
> > 
> > It's saying that app->lock can be taken from the join timer in
> > softirq, but mrp_uninit_applicant() takes it without disabling
> > softirqs.
> > 
> > Eric's patch explicitly fixes this, by making sure that spin_lock_bh()
> > is used there.
> 
> I am going to test this myself, it seems quite simple.
> I'll add a Tested-by: tag once done.

I definitely could trigger the bug easily without my patch:

[  296.523421] 8021q: 802.1Q VLAN Support v1.8
[  296.523478] 8021q: adding VLAN 0 to HW filter on device eth4
[  322.070382] 
[  322.071882] =================================
[  322.076234] [ INFO: inconsistent lock state ]
[  322.080588] 3.10.0-dbg-DEV #383 Not tainted
[  322.084769] ---------------------------------
[  322.089122] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  322.095124] ip/19081 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  322.099995]  (&(&app->lock)->rlock){+.?...}, at: [<ffffffffa0240804>] mrp_uninit_applicant+0x54/0x110 [mrp]
[  322.109805] {IN-SOFTIRQ-W} state was registered at:
[  322.114676]   [<ffffffff810dbcb7>] __lock_acquire+0x747/0x1c50
[  322.120531]   [<ffffffff810dd7c7>] lock_acquire+0x97/0x150
[  322.126028]   [<ffffffff815fc5d1>] _raw_spin_lock+0x41/0x50
[  322.131622]   [<ffffffffa02408de>] mrp_join_timer+0x1e/0x70 [mrp]
[  322.137727]   [<ffffffff81096c3f>] call_timer_fn+0x7f/0x1e0
[  322.143323]   [<ffffffff81097a94>] run_timer_softirq+0x214/0x2d0
[  322.149340]   [<ffffffff8108f6a5>] __do_softirq+0xe5/0x2c0
[  322.154838]   [<ffffffff8108fa1e>] irq_exit+0x9e/0xc0
[  322.159903]   [<ffffffff816073ee>] smp_apic_timer_interrupt+0x6e/0x99
[  322.166353]   [<ffffffff8160652f>] apic_timer_interrupt+0x6f/0x80
[  322.172458]   [<ffffffff81055f2e>] arch_cpu_idle+0x1e/0x30
[  322.177966]   [<ffffffff810cdb3f>] cpu_startup_entry+0x14f/0x290
[  322.183992]   [<ffffffff815e33e6>] rest_init+0xd6/0xe0
[  322.189145]   [<ffffffff81cf6e49>] start_kernel+0x3c0/0x3cd
[  322.194729]   [<ffffffff81cf65a6>] x86_64_start_reservations+0x2a/0x2c
[  322.201267]   [<ffffffff81cf668d>] x86_64_start_kernel+0xe5/0xec
[  322.207287] irq event stamp: 2539
[  322.210598] hardirqs last  enabled at (2539): [<ffffffff815fce1f>] _raw_spin_unlock_irqrestore+0x3f/0x70
[  322.220084] hardirqs last disabled at (2538): [<ffffffff815fc6dc>] _raw_spin_lock_irqsave+0x1c/0x60
[  322.229136] softirqs last  enabled at (2482): [<ffffffffa00082b7>] inet6_fill_ifla6_attrs+0x277/0x2a0 [ipv6]
[  322.238977] softirqs last disabled at (2480): [<ffffffff815fcbc6>] _raw_read_lock_bh+0x16/0x60
[  322.247599] 
[  322.247599] other info that might help us debug this:
[  322.254118]  Possible unsafe locking scenario:
[  322.254118] 
[  322.260033]        CPU0
[  322.262478]        ----
[  322.264924]   lock(&(&app->lock)->rlock);
[  322.268965]   <Interrupt>
[  322.271584]     lock(&(&app->lock)->rlock);
[  322.275799] 
[  322.275799]  *** DEADLOCK ***
[  322.275799] 
[  322.281714] 1 lock held by ip/19081:
[  322.285286]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81540787>] rtnl_lock+0x17/0x20
[  322.293107] 
[  322.293107] stack backtrace:
[  322.297461] CPU: 2 PID: 19081 Comm: ip Not tainted 3.10.0-dbg-DEV #383
[  322.311279]  ffffffff82233cc0 ffff880c126e5628 ffffffff815f618c ffff880c126e5688
[  322.318738]  ffffffff815f3c1a 0000000000000000 ffff880c00000001 ffff880c00000001
[  322.326194]  ffffffff8105a70f ffff880c126e56b8 0000000000000004 ffff880c526bacd8
[  322.333650] Call Trace:
[  322.336098]  [<ffffffff815f618c>] dump_stack+0x19/0x1b
[  322.341234]  [<ffffffff815f3c1a>] print_usage_bug+0x1fc/0x20d
[  322.346981]  [<ffffffff8105a70f>] ? save_stack_trace+0x2f/0x50
[  322.352808]  [<ffffffff810db50c>] mark_lock+0x28c/0x2f0
[  322.358026]  [<ffffffff810daa50>] ? check_usage_forwards+0x150/0x150
[  322.364375]  [<ffffffff810dbd1e>] __lock_acquire+0x7ae/0x1c50
[  322.370115]  [<ffffffff810de024>] ? mark_held_locks+0x74/0x140
[  322.375951]  [<ffffffff81096fa8>] ? lock_timer_base.isra.34+0x38/0x70
[  322.382396]  [<ffffffff810de024>] ? mark_held_locks+0x74/0x140
[  322.388231]  [<ffffffffa0240804>] ? mrp_uninit_applicant+0x54/0x110 [mrp]
[  322.395021]  [<ffffffff810dd7c7>] lock_acquire+0x97/0x150
[  322.400415]  [<ffffffffa0240804>] ? mrp_uninit_applicant+0x54/0x110 [mrp]
[  322.407206]  [<ffffffff810de2cd>] ? trace_hardirqs_on+0xd/0x10
[  322.413041]  [<ffffffff815fc5d1>] _raw_spin_lock+0x41/0x50
[  322.418523]  [<ffffffffa0240804>] ? mrp_uninit_applicant+0x54/0x110 [mrp]
[  322.425304]  [<ffffffffa0240804>] mrp_uninit_applicant+0x54/0x110 [mrp]
[  322.431914]  [<ffffffffa0254775>] vlan_mvrp_uninit_applicant+0x15/0x20 [8021q]
[  322.439134]  [<ffffffffa02521f8>] unregister_vlan_dev+0x128/0x150 [8021q]
[  322.445914]  [<ffffffff815422cc>] rtnl_dellink+0xac/0x110
[  322.451308]  [<ffffffff81543a35>] rtnetlink_rcv_msg+0x165/0x250
[  322.457221]  [<ffffffff815f922c>] ? mutex_lock_nested+0x28c/0x380
[  322.463308]  [<ffffffff81540787>] ? rtnl_lock+0x17/0x20
[  322.468529]  [<ffffffff815438d0>] ? __rtnl_unlock+0x20/0x20
[  322.474106]  [<ffffffff81561f99>] netlink_rcv_skb+0xa9/0xd0
[  322.479672]  [<ffffffff815407b5>] rtnetlink_rcv+0x25/0x40
[  322.485065]  [<ffffffff81561939>] netlink_unicast+0x169/0x1e0
[  322.490805]  [<ffffffff81561c5e>] netlink_sendmsg+0x2ae/0x360
[  322.496545]  [<ffffffff81517382>] sock_sendmsg+0xc2/0xe0
[  322.501854]  [<ffffffff8116ba23>] ? might_fault+0x53/0xb0
[  322.507247]  [<ffffffff8116ba23>] ? might_fault+0x53/0xb0
[  322.512643]  [<ffffffff81517722>] __sys_sendmsg+0x382/0x390
[  322.518210]  [<ffffffff810b36c3>] ? up_read+0x23/0x40
[  322.523256]  [<ffffffff81600e84>] ? __do_page_fault+0x254/0x4d0
[  322.529170]  [<ffffffff81175b9c>] ? do_brk+0x1fc/0x360
[  322.534305]  [<ffffffff8151aa09>] SyS_sendmsg+0x49/0x90
[  322.539535]  [<ffffffff816058c2>] system_call_fastpath+0x16/0x1b

And with the patch, there is no lockdep splat anymore.

Tested-by: Eric Dumazet <edumazet@google.com>



Denys Fedoryshchenko May 14, 2013, 8:01 p.m. UTC | #7
On 2013-05-14 22:07, David Miller wrote:
> From: Denys Fedoryshchenko <denys@visp.net.lb>
> Date: Mon, 13 May 2013 15:53:30 +0300
> 
> Thank you for superfast answer :)
> It is changed a bit, but not disappeared. Relevant part from new
> dmesg:
> 
> Can you double check that you really had Eric's patch applied?
> lockdep appears to be complaining about the same thing in your
> log dump, as if the patch was not really applied.
> 
> It's saying that app->lock can be taken from the join timer in
> softirq, but mrp_uninit_applicant() takes it without disabling
> softirqs.
> 
> Eric's patch explicitly fixes this, by making sure that spin_lock_bh()
> is used there.
I will reapply the patch on a clean kernel and try again.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.
David Miller May 14, 2013, 8:03 p.m. UTC | #8
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 14 May 2013 12:40:01 -0700

> On Tue, 2013-05-14 at 12:21 -0700, Eric Dumazet wrote:
>> On Tue, 2013-05-14 at 12:07 -0700, David Miller wrote:
>> 
>> > Can you double check that you really had Eric's patch applied?
>> > lockdep appears to be complaining about the same thing in your
>> > log dump, as if the patch was not really applied.
>> > 
>> > It's saying that app->lock can be taken from the join timer in
>> > softirq, but mrp_uninit_applicant() takes it without disabling
>> > softirqs.
>> > 
>> > Eric's patch explicitly fixes this, by making sure that spin_lock_bh()
>> > is used there.
>> 
>> I am going to test this myself, it seems quite simple.
>> I'll add a Tested-by: tag once done.
> 
> I definitely could trigger the bug easily without my patch :
 ...
> And with the patch, there is no lockdep splat anymore.
> 
> Tested-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
Denys Fedoryshchenko May 15, 2013, 2:50 p.m. UTC | #9
On 2013-05-14 22:40, Eric Dumazet wrote:
> 
> I definitely could trigger the bug easily without my patch :


> 
> And with the patch, there is no lockdep splat anymore.
> 
> Tested-by: Eric Dumazet <edumazet@google.com>

Yes, it seems that after a depclean I cannot reproduce this bug anymore. My
bad, it seems I should recompile the kernel completely next time.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

Eric Dumazet May 15, 2013, 3:22 p.m. UTC | #10
On Wed, 2013-05-15 at 17:50 +0300, Denys Fedoryshchenko wrote:
> On 2013-05-14 22:40, Eric Dumazet wrote:
> > 
> > I definitely could trigger the bug easily without my patch :
> 
> 
> > 
> > And with the patch, there is no lockdep splat anymore.
> > 
> > Tested-by: Eric Dumazet <edumazet@google.com>
> 
> Yes, it seems after depclean i cannot reproduce anymore this bug. My 
> bad, seems i should recompile kernel completely next time.
> 

Thanks for testing again, I was quite mystified by your original
results ;)



David Ward May 15, 2013, 4:34 p.m. UTC | #11
On 05/15/2013 11:22 AM, Eric Dumazet wrote:
> Re: [PATCH] net/802/mrp: fix lockdep splat
>
> On Wed, 2013-05-15 at 17:50 +0300, Denys Fedoryshchenko wrote:
> > On 2013-05-14 22:40, Eric Dumazet wrote:
> > >
> > > I definitely could trigger the bug easily without my patch :
> >
> >
> > >
> > > And with the patch, there is no lockdep splat anymore.
> > >
> > > Tested-by: Eric Dumazet <edumazet@google.com>
> >
> > Yes, it seems after depclean i cannot reproduce anymore this bug. My
> > bad, seems i should recompile kernel completely next time.
> >
>
> Thanks for testing again, I was quite mystified by your original
> results ;)
>

FYI - I believe GARP (net/802/garp.c) needs the same fix. The change
that introduced this issue was originally copied from there.

David
David Ward May 15, 2013, 4:41 p.m. UTC | #12
On 05/15/2013 12:34 PM, Ward, David - 0663 - MITLL wrote:
> On 05/15/2013 11:22 AM, Eric Dumazet wrote:
>> Re: [PATCH] net/802/mrp: fix lockdep splat
>>
>> On Wed, 2013-05-15 at 17:50 +0300, Denys Fedoryshchenko wrote:
>> > On 2013-05-14 22:40, Eric Dumazet wrote:
>> > >
>> > > I definitely could trigger the bug easily without my patch :
>> >
>> >
>> > >
>> > > And with the patch, there is no lockdep splat anymore.
>> > >
>> > > Tested-by: Eric Dumazet <edumazet@google.com>
>> >
>> > Yes, it seems after depclean i cannot reproduce anymore this bug. My
>> > bad, seems i should recompile kernel completely next time.
>> >
>>
>> Thanks for testing again, I was quite mystified by your original
>> results ;)
>>
>
> FYI - I believe GARP (net/802/garp.c) needs the same fix. The change
> that introduced this issue was originally copied from there.
>
> David
>

My bad... I copied the patch from GARP to MRP wrong... that's what 
created this problem in the first place.  Sorry.

David

Patch

diff --git a/net/802/mrp.c b/net/802/mrp.c
index e085bcc..1eb05d8 100644
--- a/net/802/mrp.c
+++ b/net/802/mrp.c
@@ -871,10 +871,10 @@  void mrp_uninit_applicant(struct net_device *dev, struct mrp_application *appl)
 	 */
 	del_timer_sync(&app->join_timer);
 
-	spin_lock(&app->lock);
+	spin_lock_bh(&app->lock);
 	mrp_mad_event(app, MRP_EVENT_TX);
 	mrp_pdu_queue(app);
-	spin_unlock(&app->lock);
+	spin_unlock_bh(&app->lock);
 
 	mrp_queue_xmit(app);