Message ID | 1499973097-14579-1-git-send-email-dsahern@gmail.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
On 7/13/17 1:11 PM, David Ahern wrote:
> Since mlxsw is not doing a get on the rule to increase the ref count, it
> should not be doing a put.

upon further review, mlxsw is doing a get on the rule

Problem remains, but this is not the right fix.

On Thu, Jul 13, 2017 at 01:46:15PM -0600, David Ahern wrote:
> On 7/13/17 1:11 PM, David Ahern wrote:
> > Since mlxsw is not doing a get on the rule to increase the ref count, it
> > should not be doing a put.
>
> upon further review, mlxsw is doing a get on the rule
>
> Problem remains, but this is not the right fix.

Remains where? It's not clear to me how you concluded mlxsw is at fault.
My setup is running net-next with the refcount patches and I didn't
observe this.

If current trace isn't enough to pinpoint the problem, can you try to
reproduce with a KASAN enabled kernel?

On 7/13/17 2:33 PM, Ido Schimmel wrote:
> Remains where? It's not clear to me how you concluded mlxsw is at fault.
> My setup is running net-next with the refcount patches and I didn't
> observe this.

Create a VRF.

see latest patch. mlxsw releasing the refcnt on the rule was the victim;
eric's patch to fix a delete was setting the refcnt to 1 after mlxsw
bumped it.

On Thu, Jul 13, 2017 at 02:39:10PM -0600, David Ahern wrote:
> On 7/13/17 2:33 PM, Ido Schimmel wrote:
> > Remains where? It's not clear to me how you concluded mlxsw is at fault.
> > My setup is running net-next with the refcount patches and I didn't
> > observe this.
>
> Create a VRF.

Yea, I wasn't running VRFs with the refcount patches. Reproduced this on
my system now. Thanks for the fix.

On Thu, Jul 13, 2017 at 02:39:10PM -0600, David Ahern wrote:
> On 7/13/17 2:33 PM, Ido Schimmel wrote:
> > Remains where? It's not clear to me how you concluded mlxsw is at fault.
> > My setup is running net-next with the refcount patches and I didn't
> > observe this.
>
> Create a VRF.

BTW, this didn't show up on my dev branch as I've patches that introduce
IPv6 support where I move the rules notifications to core, after the
refcount is set to 1 and just before the netlink notification is sent.

https://github.com/idosch/linux/commit/7b17a21b1d71fc9a1969080e5fdcb90f376b73b2

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 383fef5a8e24..b0fb8e5e83c9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2844,7 +2844,6 @@ static void mlxsw_sp_router_fib_event_work(struct work_struct *work)
 		rule = fib_work->fr_info.rule;
 		if (!fib4_rule_default(rule) && !rule->l3mdev)
 			mlxsw_sp_router_fib4_abort(mlxsw_sp);
-		fib_rule_put(rule);
 		break;
 	case FIB_EVENT_NH_ADD: /* fall through */
 	case FIB_EVENT_NH_DEL:

The recent conversion to refcount_t, 717d1e993ad8 ("net: convert
fib_rule.refcnt from atomic_t to refcount_t"), and subsequent fix by
Eric, 5361e209dd30 ("net: avoid one splat in fib_nl_delrule()"),
exposed a bug in mlxsw. The driver is doing a put on fib rules after
processing it from the notifier. This triggers a BUG on:

[ 104.444889] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 104.452821] IP: fib_rules_lookup+0x39/0x170
[ 104.457056] PGD 409395067
[ 104.457057] P4D 409395067
[ 104.459783] PUD 408c23067
[ 104.462507] PMD 0
...
[ 104.519750] CPU: 1 PID: 900 Comm: vrf Tainted: G W 4.12.0-rc7+ #51
[ 104.527133] Hardware name: Mellanox Technologies Ltd. Mellanox switch/Mellanox switch, BIOS 4.6.5 05/21/2015
[ 104.537084] task: ffff880401454380 task.stack: ffffc900007c0000
[ 104.543029] RIP: 0010:fib_rules_lookup+0x39/0x170
[ 104.547784] RSP: 0000:ffff88041dd039d8 EFLAGS: 00010207
[ 104.553053] RAX: 00000000d8e1b910 RBX: 0000000000000000 RCX: 0000000000000002
[ 104.560264] RDX: 00000000fffffff5 RSI: 0000000000000000 RDI: ffff880408d80f30
[ 104.567461] RBP: ffff88041dd03a08 R08: 000000000000001d R09: 0000000000000000
[ 104.574699] R10: 0000000000000000 R11: 0000000000000006 R12: ffff88040b160cc0
[ 104.581916] R13: ffff88041dd03a18 R14: ffff88040b160d40 R15: ffff88041dd03aa0
[ 104.589130] FS: 00007f44b0edf700(0000) GS:ffff88041dd00000(0000) knlGS:0000000000000000
[ 104.597330] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.603151] CR2: 0000000000000010 CR3: 0000000408ca8000 CR4: 00000000001406e0
[ 104.610371] Call Trace:
[ 104.612839] <IRQ>
[ 104.614872] __fib_lookup+0x54/0x90
[ 104.618406] fib_validate_source+0x31d/0x570
[ 104.622731] ? fib_rules_lookup+0x131/0x170
[ 104.626975] ? __fib_lookup+0x54/0x90
[ 104.630685] ip_route_input_rcu+0xbcf/0xd30

Since mlxsw is not doing a get on the rule to increase the ref count, it
should not be doing a put.

Fixes: 5d7bfd141924a ("ipv4: fib_rules: Dump FIB rules when registering FIB notifier")
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 1 -
 1 file changed, 1 deletion(-)