Patchwork 3.7.3+: Bad paging request in ip_rcv_finish while running NFS traffic.

login
register
mail settings
Submitter Ben Greear
Date Jan. 23, 2013, 2:32 a.m.
Message ID <50FF4BC9.1060206@candelatech.com>
Download mbox | patch
Permalink /patch/214710/
State RFC
Delegated to: David Miller
Headers show

Comments

Ben Greear - Jan. 23, 2013, 2:32 a.m.
On 01/22/2013 02:18 PM, Ben Greear wrote:
> On 01/22/2013 09:26 AM, Eric Dumazet wrote:
>> On Tue, 2013-01-22 at 09:17 -0800, Eric Dumazet wrote:
>>> On Tue, 2013-01-22 at 09:08 -0800, Ben Greear wrote:
>>>
>>>> Unfortunately, I hit it again this morning after the first restart of
>>>> my application (which bounces all 3000 interfaces).  Memory poisoning
>>>> was disabled.
>>>
>>> Is your NFS traffic using TCP or UDP ?
>>>
>>
>> Oh well, it seems macvlan.c has to skb_drop_dst(skb) before giving skb
>> to netif_rx()
>
> I just saw another crash.  It had run 2 user-space restarts and
> 2 reboots, but on the third reboot, it crashed coming up.  It seemed
> to last longer this time, but that could just be luck as it's never
> been super easy to reproduce this quickly.

I added a patch to set dst->input and dst->output to 0xdeadbeef before
freeing the memory.  (The warn-on below did NOT hit)

@@ -452,6 +452,9 @@ static inline int dst_output(struct sk_buff *skb)
  /* Input packet from network to transport.  */
  static inline int dst_input(struct sk_buff *skb)
  {
+       if (WARN_ON(((unsigned long)(skb_dst(skb))) < 4000)) {
+               printk("Bad skb_dst: %lu\n", skb->_skb_refdst);
+       }
         return skb_dst(skb)->input(skb);
  }


Looks like we do indeed access freed memory, based on this crash I saw on
the next reboot:

[root@lf1011-12060006 ~]# BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<00000000deadbeef>] 0xdeadbeee
PGD 0
Oops: 0010 [#1] PREEMPT SMP
Modules linked in: macvlan pktgen lockd sunrpc uinput iTCO_wdt iTCO_vendor_support gpio_ich coretemp hwmon kvm_intel kvm microcode pcspkr i2c_i801 lpc_ich 
e1000e i7core_edac ioatdma edac_core igb ptp pps_core dca ipv6 mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core
CPU 8
Pid: 59, comm: ksoftirqd/8 Tainted: G         C O 3.7.3+ #46 Iron Systems Inc. EE2610R/X8ST3
RIP: 0010:[<00000000deadbeef>]  [<00000000deadbeef>] 0xdeadbeee
RSP: 0018:ffff88040d7d7bc0  EFLAGS: 00010286
RAX: ffff8803d97fc900 RBX: ffff8803d4d30d00 RCX: 0000000000000028
RDX: ffffffff81aafcb0 RSI: ffffffff81a2a500 RDI: ffff8803d4d30d00
RBP: ffff88040d7d7be8 R08: ffffffff814a8812 R09: ffff88040d7d7bb0
R10: ffff8803c9dfd8fc R11: ffff88040d7d7c48 R12: ffff8803c9dfd8fc
R13: ffff8803d4d30d00 R14: ffff88040d3f8000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000deadbeef CR3: 0000000001a0b000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/8 (pid: 59, threadinfo ffff88040d7d6000, task ffff88040d7e1f50)
Stack:
  ffffffff814a8b02 ffff8803d4d30d00 ffffffff814a8812 ffff8803d4d30d00
  ffff88040d3f8000 ffff88040d7d7c18 ffffffff814a8eb5 0000000080000000
  ffffffff81472e61 ffff8803d4d30d00 ffff88040d3f8000 ffff88040d7d7c48
Call Trace:
  [<ffffffff814a8b02>] ? ip_rcv_finish+0x2f0/0x308
  [<ffffffff814a8812>] ? skb_dst+0x5a/0x5a
  [<ffffffff814a8eb5>] NF_HOOK.clone.1+0x4c/0x54
  [<ffffffff81472e61>] ? dev_seq_stop+0xb/0xb
  [<ffffffff814a9142>] ip_rcv+0x237/0x269
  [<ffffffff81473def>] __netif_receive_skb+0x487/0x530
  [<ffffffff81473f91>] process_backlog+0xf9/0x1da
  [<ffffffff8147639a>] net_rx_action+0xad/0x218
  [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
  [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
  [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
  [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810a0a6d>] kthread+0xc2/0xca
  [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
  [<ffffffff81537b7c>] ret_from_fork+0x7c/0xb0
  [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
Code:  Bad RIP value.
RIP  [<00000000deadbeef>] 0xdeadbeee
  RSP <ffff88040d7d7bc0>
CR2: 00000000deadbeef
---[ end trace eed854e70ff0a575 ]---
Kernel panic - not syncing: Fatal excepti

Thanks,
Ben
Eric Dumazet - Jan. 23, 2013, 6:11 a.m.
On Tue, 2013-01-22 at 18:32 -0800, Ben Greear wrote:

> diff --git a/net/core/dst.c b/net/core/dst.c
> index ee6153e..234b168 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -245,6 +245,7 @@ again:
>                  dst->ops->destroy(dst);
>          if (dst->dev)
>                  dev_put(dst->dev);
> +       dst->input = dst->output = 0xdeadbeef;
>          kmem_cache_free(dst->ops->kmem_cachep, dst);

Great !

You could comment the kmem_cache_free() as well to get better chances to
hit the bug, and maybe start a bisection to find the faulty commit ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 23, 2013, 7:14 a.m.
On 01/22/2013 10:11 PM, Eric Dumazet wrote:
> On Tue, 2013-01-22 at 18:32 -0800, Ben Greear wrote:
>
>> diff --git a/net/core/dst.c b/net/core/dst.c
>> index ee6153e..234b168 100644
>> --- a/net/core/dst.c
>> +++ b/net/core/dst.c
>> @@ -245,6 +245,7 @@ again:
>>                   dst->ops->destroy(dst);
>>           if (dst->dev)
>>                   dev_put(dst->dev);
>> +       dst->input = dst->output = 0xdeadbeef;
>>           kmem_cache_free(dst->ops->kmem_cachep, dst);
>
> Great !
>
> You could comment the kmem_cache_free() as well to get better chances to
> hit the bug, and maybe start a bisection to find the faulty commit ?

I suspect the bug goes back at least as far as 3.3.  And since
I need the NFS patches for this test case, bisecting will be pure hell.

I'll work on some more code instrumentation tomorrow.

One thing that came to mind while I was looking at the code today:

How are the non-ref-counted dst objects used safely?  Any chance
that tearing down the IP protocol on a device (or deleting a device)
could delete a dst that is referenced by an skb (and thus crashes as
I see)?

Thanks,
Ben
Eric Dumazet - Jan. 23, 2013, 1:35 p.m.
On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:

> How are the non-ref-counted dst objects used safely?  Any chance
> that tearing down the IP protocol on a device (or deleting a device)
> could delete a dst that is referenced by an skb (and thus crashes as
> I see)?

There is probably a bug. Normally it should be RCU protected.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Jan. 23, 2013, 2:42 p.m.
On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:

> I suspect the bug goes back at least as far as 3.3.  And since
> I need the NFS patches for this test case, bisecting will be pure hell.

Make sure the bug is still present in net-next.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 23, 2013, 6:15 p.m.
On 01/23/2013 05:35 AM, Eric Dumazet wrote:
> On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:
>
>> How are the non-ref-counted dst objects used safely?  Any chance
>> that tearing down the IP protocol on a device (or deleting a device)
>> could delete a dst that is referenced by an skb (and thus crashes as
>> I see)?
>
> There is probably a bug. Normally it should be RCU protected.

I'm building net-next, and will do some testing on it later today
if all goes well.

In the meantime, a few questions about RCU.

If the non-ref-counted dst is protected by RCU, does that mean
that from the time we acquire the pointer (ie, set it in the skb),
to the time we clear the pointer from the skb, we must be under
RCU read lock?

If so, that means that ip_rcv_finish must be called under RCU
read lock, etc?

Btw, this seems non related to NFS, so I'm dropping them
from CC after this email unless I hear a request otherwise.

Thanks,
Ben
Eric Dumazet - Jan. 23, 2013, 9:43 p.m.
On Wed, 2013-01-23 at 10:15 -0800, Ben Greear wrote:
> On 01/23/2013 05:35 AM, Eric Dumazet wrote:
> > On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:
> >
> >> How are the non-ref-counted dst objects used safely?  Any chance
> >> that tearing down the IP protocol on a device (or deleting a device)
> >> could delete a dst that is referenced by an skb (and thus crashes as
> >> I see)?
> >
> > There is probably a bug. Normally it should be RCU protected.
> 
> I'm building net-next, and will do some testing on it later today
> if all goes well.
> 
> In the meantime, a few questions about RCU.
> 
> If the non-ref-counted dst is protected by RCU, does that mean
> that from the time we acquire the pointer (ie, set it in the skb),
> to the time we clear the pointer from the skb, we must be under
> RCU read lock?
> 
> If so, that means that ip_rcv_finish must be called under RCU
> read lock, etc?

Obviously yes.  Check skb_dst_force() for the rare cases we want to
escape the RCU section.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 23, 2013, 9:53 p.m.
On 01/23/2013 06:42 AM, Eric Dumazet wrote:
> On Tue, 2013-01-22 at 23:14 -0800, Ben Greear wrote:
>
>> I suspect the bug goes back at least as far as 3.3.  And since
>> I need the NFS patches for this test case, bisecting will be pure hell.
>
> Make sure the bug is still present in net-next.
>

My test case isn't running well on net-next.  Seems most
of the file-io processes are hung trying to open a file:

[root@lf1011-12060006 lanforge]# cat /proc/10612/stack
[<ffffffff8115fadb>] do_last+0x1db/0xa4e
[<ffffffff8116090b>] path_openat+0xcb/0x363
[<ffffffff81160ca8>] do_filp_open+0x38/0x84
[<ffffffff81153bfc>] do_sys_open+0x6d/0xff
[<ffffffff81153cbb>] sys_open+0x1c/0x1e
[<ffffffff8154f629>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


I'm going back to 3.7.3 and will try to learn some more
about what it's doing there.

Thanks,
Ben
Ben Greear - Jan. 23, 2013, 11:55 p.m.
On 01/22/2013 06:32 PM, Ben Greear wrote:

So, I'm slowly making some progress.  I've verified that the skb
has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
method.  I'm trying to track it backwards and figure out which
device it belongs to, etc....takes a while to reproduce though.

One thing about this stack trace below...the dev_seq_stop() does
a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
can cause dev_seq_stop() to run, but if this stack is legit,
then maybe by the time we enter the ip_rcv_finish() code we are
running without rcu_readlock() held?

If so, that would probably explain the bug.

> Call Trace:
>   [<ffffffff814a8b02>] ? ip_rcv_finish+0x2f0/0x308
>   [<ffffffff814a8812>] ? skb_dst+0x5a/0x5a
>   [<ffffffff814a8eb5>] NF_HOOK.clone.1+0x4c/0x54
>   [<ffffffff81472e61>] ? dev_seq_stop+0xb/0xb
>   [<ffffffff814a9142>] ip_rcv+0x237/0x269
>   [<ffffffff81473def>] __netif_receive_skb+0x487/0x530
>   [<ffffffff81473f91>] process_backlog+0xf9/0x1da
>   [<ffffffff8147639a>] net_rx_action+0xad/0x218
>   [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
>   [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
>   [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
>   [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
>   [<ffffffff810a0a6d>] kthread+0xc2/0xca
>   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
>   [<ffffffff81537b7c>] ret_from_fork+0x7c/0xb0
>   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56


## This is from a slightly different kernel image...but probably this part is legit.

0xffffffff814a92b3 is in ip_rcv (/home/greearb/git/linux-3.7.dev.y/net/ipv4/ip_input.c:466).
461		/* Our transport medium may have padded the buffer out. Now we know it
462		 * is IP we can trim to the true length of the frame.
463		 * Note this now means skb->len holds ntohs(iph->tot_len).
464		 */
465		if (pskb_trim_rcsum(skb, len)) {
466			IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
467			goto drop;
468		}
469	
470		/* Remove any debris in the socket control block */
Eric Dumazet - Jan. 24, 2013, 12:01 a.m.
On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
> On 01/22/2013 06:32 PM, Ben Greear wrote:
> 
> So, I'm slowly making some progress.  I've verified that the skb
> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
> method.  I'm trying to track it backwards and figure out which
> device it belongs to, etc....takes a while to reproduce though.
> 
> One thing about this stack trace below...the dev_seq_stop() does
> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
> can cause dev_seq_stop() to run, but if this stack is legit,
> then maybe by the time we enter the ip_rcv_finish() code we are
> running without rcu_readlock() held?
> 
> If so, that would probably explain the bug.
> 

The whole thing is run under rcu_read_lock() done in
__netif_receive_skb()

My suspicion was that we called netif_rx() from macvlan leaving a
not refcounted skb dst.

But the patch I sent to you didnt solve the bug, so its something else.

You could trace at which point the dst was released. (where you set
dst->input/output to deadbeef)

> > Call Trace:
> >   [<ffffffff814a8b02>] ? ip_rcv_finish+0x2f0/0x308
> >   [<ffffffff814a8812>] ? skb_dst+0x5a/0x5a
> >   [<ffffffff814a8eb5>] NF_HOOK.clone.1+0x4c/0x54
> >   [<ffffffff81472e61>] ? dev_seq_stop+0xb/0xb
> >   [<ffffffff814a9142>] ip_rcv+0x237/0x269
> >   [<ffffffff81473def>] __netif_receive_skb+0x487/0x530
> >   [<ffffffff81473f91>] process_backlog+0xf9/0x1da
> >   [<ffffffff8147639a>] net_rx_action+0xad/0x218
> >   [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
> >   [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
> >   [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
> >   [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
> >   [<ffffffff810a0a6d>] kthread+0xc2/0xca
> >   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
> >   [<ffffffff81537b7c>] ret_from_fork+0x7c/0xb0
> >   [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
> 
> 
> ## This is from a slightly different kernel image...but probably this part is legit.
> 
> 0xffffffff814a92b3 is in ip_rcv (/home/greearb/git/linux-3.7.dev.y/net/ipv4/ip_input.c:466).
> 461		/* Our transport medium may have padded the buffer out. Now we know it
> 462		 * is IP we can trim to the true length of the frame.
> 463		 * Note this now means skb->len holds ntohs(iph->tot_len).
> 464		 */
> 465		if (pskb_trim_rcsum(skb, len)) {
> 466			IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INDISCARDS);
> 467			goto drop;
> 468		}
> 469	
> 470		/* Remove any debris in the socket control block */
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 24, 2013, 12:13 a.m.
On 01/23/2013 04:01 PM, Eric Dumazet wrote:
> On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
>> On 01/22/2013 06:32 PM, Ben Greear wrote:
>>
>> So, I'm slowly making some progress.  I've verified that the skb
>> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
>> method.  I'm trying to track it backwards and figure out which
>> device it belongs to, etc....takes a while to reproduce though.
>>
>> One thing about this stack trace below...the dev_seq_stop() does
>> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
>> can cause dev_seq_stop() to run, but if this stack is legit,
>> then maybe by the time we enter the ip_rcv_finish() code we are
>> running without rcu_readlock() held?
>>
>> If so, that would probably explain the bug.
>>
>
> The whole thing is run under rcu_read_lock() done in
> __netif_receive_skb()

I was worried that the dev_seq_stop might be called
incorrectly causing an asymetric unlock.  I have no
idea how that might happened, but several crashes
have that dev_seq_stop method listed, so it got me suspicious.

>
> My suspicion was that we called netif_rx() from macvlan leaving a
> not refcounted skb dst.
>
> But the patch I sent to you didnt solve the bug, so its something else.
>
> You could trace at which point the dst was released. (where you set
> dst->input/output to deadbeef)

My current code is in some garbage collector timer code, but I can
work on saving the call-site that first pokes the dst into the
garbage collection list...

Thanks,
Ben
Eric Dumazet - Jan. 24, 2013, 12:23 a.m.
On Wed, 2013-01-23 at 16:13 -0800, Ben Greear wrote:
> On 01/23/2013 04:01 PM, Eric Dumazet wrote:
> > On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
> >> On 01/22/2013 06:32 PM, Ben Greear wrote:
> >>
> >> So, I'm slowly making some progress.  I've verified that the skb
> >> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
> >> method.  I'm trying to track it backwards and figure out which
> >> device it belongs to, etc....takes a while to reproduce though.
> >>
> >> One thing about this stack trace below...the dev_seq_stop() does
> >> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
> >> can cause dev_seq_stop() to run, but if this stack is legit,
> >> then maybe by the time we enter the ip_rcv_finish() code we are
> >> running without rcu_readlock() held?
> >>
> >> If so, that would probably explain the bug.
> >>
> >
> > The whole thing is run under rcu_read_lock() done in
> > __netif_receive_skb()
> 
> I was worried that the dev_seq_stop might be called
> incorrectly causing an asymetric unlock.  I have no
> idea how that might happened, but several crashes
> have that dev_seq_stop method listed, so it got me suspicious.

dev_seq_stop() is some word in the kernel stack, result of a prior
system call. Stack is not cleanup.

Each function reserves an amount of stack but not always write on all
reserved space (some automatic variables might be not set)

Note the "? " before the name : linux printed the symbol but this was
not a call site for this particular call graph. Its only an extra
indication, that can be useful sometimes.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 24, 2013, 12:38 a.m.
On 01/23/2013 04:23 PM, Eric Dumazet wrote:
> On Wed, 2013-01-23 at 16:13 -0800, Ben Greear wrote:
>> On 01/23/2013 04:01 PM, Eric Dumazet wrote:

>> I was worried that the dev_seq_stop might be called
>> incorrectly causing an asymetric unlock.  I have no
>> idea how that might happened, but several crashes
>> have that dev_seq_stop method listed, so it got me suspicious.
>
> dev_seq_stop() is some word in the kernel stack, result of a prior
> system call. Stack is not cleanup.
>
> Each function reserves an amount of stack but not always write on all
> reserved space (some automatic variables might be not set)
>
> Note the "? " before the name : linux printed the symbol but this was
> not a call site for this particular call graph. Its only an extra
> indication, that can be useful sometimes.

Ahh, thanks for that info...I'd never quite pieced that together
before.

Here's another crash.  Interestingly, the dst is bad before the rcu-read-lock()
(the bug is from the first of the 'deadbeef' debugging code below)

Perhaps other useful info:  The skb->dev claims to be 'lo'.  The dst 'pointer'
in the skb has 0x1 set, so it is the 'noref' variant.


static int __netif_receive_skb(struct sk_buff *skb)
{
	struct packet_type *ptype, *pt_prev;
	rx_handler_func_t *rx_handler;
	struct net_device *orig_dev;
	struct net_device *null_or_dev;
	bool deliver_exact = false;
	int ret = NET_RX_DROP;
	__be16 type;
	unsigned long pflags = current->flags;

	net_timestamp_check(!netdev_tstamp_prequeue, skb);

	trace_netif_receive_skb(skb);

	/*
	 * PFMEMALLOC skbs are special, they should
	 * - be delivered to SOCK_MEMALLOC sockets only
	 * - stay away from userspace
	 * - have bounded memory usage
	 *
	 * Use PF_MEMALLOC as this saves us from propagating the allocation
	 * context down to all allocation sites.
	 */
	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
		current->flags |= PF_MEMALLOC;

	/* if we've gotten here through NAPI, check netpoll */
	if (netpoll_receive_skb(skb))
		goto out;

	orig_dev = skb->dev;

	skb_reset_network_header(skb);
	skb_reset_transport_header(skb);
	skb_reset_mac_len(skb);

	pt_prev = NULL;

	if (skb_dst(skb)) {
		if (skb_dst(skb)->input == 0xdeadbeef) {
			printk("bad dst: %lu, skb->dev: %s  len: %i\n",
			       skb->_skb_refdst, skb->dev->name, skb->len);
			BUG_ON(1);
		}
	}
	
	rcu_read_lock();

	if (skb_dst(skb)) {
		if (skb_dst(skb)->input == 0xdeadbeef) {
			printk("bad dst: %lu, skb->dev: %s  len: %i\n",
			       skb->_skb_refdst, skb->dev->name, skb->len);
			BUG_ON(1);
		}
	}
	
	
another_round:
	skb->skb_iif = skb->dev->ifindex;

	__this_cpu_inc(softnet_data.processed);
...



[root@lf1011-12060006 ~]# bad dst: 18446612148864241601, skb->dev: lo  len: 3232
------------[ cut here ]------------
kernel BUG at /home/greearb/git/linux-3.7.dev.y/net/core/dev.c:3266!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: macvlan pktgen lockd sunrpc uinput coretemp hwmon kvm_intel kvm microcode iTCO_wdt iTe
CPU 4
Pid: 35, comm: ksoftirqd/4 Tainted: G         C O 3.7.3+ #50 Iron Systems Inc. EE2610R/X8ST3
RIP: 0010:[<ffffffff81473d22>]  [<ffffffff81473d22>] __netif_receive_skb+0x101/0x5b8
RSP: 0018:ffff88040d711c58  EFLAGS: 00010296
RAX: 0000000000000036 RBX: ffff88041fc93e80 RCX: 000000000000a6a5
RDX: ffffffff810883a6 RSI: 00000000000005fc RDI: 0000000000000246
RBP: ffff88040d711cb8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000004 R11: 0000000000000000 R12: ffff88041fc93fd0
R13: 0000000000000040 R14: ffff88040d3f8000 R15: ffff88041fc93f80
FS:  0000000000000000(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000262c118 CR3: 00000003da651000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/4 (pid: 35, threadinfo ffff88040d710000, task ffff88040d701f50)
Stack:
  0000000000000046 0420804000000100 ffffffff81aaf0a0 ffff8803da901200
  000000000d711cb8 ffffffff81aaf0a0 ffff8803ffa90428 ffff88041fc93e80
  ffff88041fc93fd0 0000000000000040 0000000000000024 ffff88041fc93f80
Call Trace:
  [<ffffffff814742d2>] process_backlog+0xf9/0x1da
  [<ffffffff814766db>] net_rx_action+0xad/0x218
  [<ffffffff8108d50a>] __do_softirq+0x9c/0x161
  [<ffffffff8108d5f2>] run_ksoftirqd+0x23/0x42
  [<ffffffff810a7ebe>] smpboot_thread_fn+0x253/0x259
  [<ffffffff810a7c6b>] ? test_ti_thread_flag.clone.0+0x11/0x11
  [<ffffffff810a0a6d>] kthread+0xc2/0xca
  [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
  [<ffffffff81537dbc>] ret_from_fork+0x7c/0xb0
  [<ffffffff810a09ab>] ? __init_kthread_worker+0x56/0x56
Code: fc ff ff ba ef be ad de 48 39 50 50 75 21 48 8b 45 b8 48 c7 c7 50 ea 82 81 8b 48 68 48 8b 50 20 48
RIP  [<ffffffff81473d22>] __netif_receive_skb+0x101/0x5b8
  RSP <ffff88040d711c58>
---[ end trace e5f94dc78f5e5277 ]---
Kernel panic - not syncing: Fatal exception in interrupt
>
>
Eric Dumazet - Jan. 24, 2013, 12:45 a.m.
On Wed, 2013-01-23 at 16:38 -0800, Ben Greear wrote:
> On 01/23/2013 04:23 PM, Eric Dumazet wrote:
> > On Wed, 2013-01-23 at 16:13 -0800, Ben Greear wrote:
> >> On 01/23/2013 04:01 PM, Eric Dumazet wrote:
> 
> >> I was worried that the dev_seq_stop might be called
> >> incorrectly causing an asymetric unlock.  I have no
> >> idea how that might happened, but several crashes
> >> have that dev_seq_stop method listed, so it got me suspicious.
> >
> > dev_seq_stop() is some word in the kernel stack, result of a prior
> > system call. Stack is not cleanup.
> >
> > Each function reserves an amount of stack but not always write on all
> > reserved space (some automatic variables might be not set)
> >
> > Note the "? " before the name : linux printed the symbol but this was
> > not a call site for this particular call graph. Its only an extra
> > indication, that can be useful sometimes.
> 
> Ahh, thanks for that info...I'd never quite pieced that together
> before.
> 
> Here's another crash.  Interestingly, the dst is bad before the rcu-read-lock()
> (the bug is from the first of the 'deadbeef' debugging code below)
> 
> Perhaps other useful info:  The skb->dev claims to be 'lo'.  The dst 'pointer'
> in the skb has 0x1 set, so it is the 'noref' variant.
> 
> 
> static int __netif_receive_skb(struct sk_buff *skb)
> {
> 	struct packet_type *ptype, *pt_prev;
> 	rx_handler_func_t *rx_handler;
> 	struct net_device *orig_dev;
> 	struct net_device *null_or_dev;
> 	bool deliver_exact = false;
> 	int ret = NET_RX_DROP;
> 	__be16 type;
> 	unsigned long pflags = current->flags;
> 
> 	net_timestamp_check(!netdev_tstamp_prequeue, skb);
> 
> 	trace_netif_receive_skb(skb);
> 
> 	/*
> 	 * PFMEMALLOC skbs are special, they should
> 	 * - be delivered to SOCK_MEMALLOC sockets only
> 	 * - stay away from userspace
> 	 * - have bounded memory usage
> 	 *
> 	 * Use PF_MEMALLOC as this saves us from propagating the allocation
> 	 * context down to all allocation sites.
> 	 */
> 	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
> 		current->flags |= PF_MEMALLOC;
> 
> 	/* if we've gotten here through NAPI, check netpoll */
> 	if (netpoll_receive_skb(skb))
> 		goto out;
> 
> 	orig_dev = skb->dev;
> 
> 	skb_reset_network_header(skb);
> 	skb_reset_transport_header(skb);
> 	skb_reset_mac_len(skb);
> 
> 	pt_prev = NULL;
> 
> 	if (skb_dst(skb)) {
> 		if (skb_dst(skb)->input == 0xdeadbeef) {
> 			printk("bad dst: %lu, skb->dev: %s  len: %i\n",
> 			       skb->_skb_refdst, skb->dev->name, skb->len);
> 			BUG_ON(1);
> 		}
> 	}
> 	

You should add your debuging code in netif_rx() so that we know the
caller

by the way you could only add

BUG_ON(skb->_skb_refdst & SKB_DST_NOREF)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 24, 2013, 12:51 a.m.
On 01/23/2013 04:45 PM, Eric Dumazet wrote:
> On Wed, 2013-01-23 at 16:38 -0800, Ben Greear wrote:
>> On 01/23/2013 04:23 PM, Eric Dumazet wrote:
>>> On Wed, 2013-01-23 at 16:13 -0800, Ben Greear wrote:
>>>> On 01/23/2013 04:01 PM, Eric Dumazet wrote:
>>
>>>> I was worried that the dev_seq_stop might be called
>>>> incorrectly causing an asymetric unlock.  I have no
>>>> idea how that might happened, but several crashes
>>>> have that dev_seq_stop method listed, so it got me suspicious.
>>>
>>> dev_seq_stop() is some word in the kernel stack, result of a prior
>>> system call. Stack is not cleanup.
>>>
>>> Each function reserves an amount of stack but not always write on all
>>> reserved space (some automatic variables might be not set)
>>>
>>> Note the "? " before the name : linux printed the symbol but this was
>>> not a call site for this particular call graph. Its only an extra
>>> indication, that can be useful sometimes.
>>
>> Ahh, thanks for that info...I'd never quite pieced that together
>> before.
>>
>> Here's another crash.  Interestingly, the dst is bad before the rcu-read-lock()
>> (the bug is from the first of the 'deadbeef' debugging code below)
>>
>> Perhaps other useful info:  The skb->dev claims to be 'lo'.  The dst 'pointer'
>> in the skb has 0x1 set, so it is the 'noref' variant.
>>
>>
>> static int __netif_receive_skb(struct sk_buff *skb)
>> {
>> 	struct packet_type *ptype, *pt_prev;
>> 	rx_handler_func_t *rx_handler;
>> 	struct net_device *orig_dev;
>> 	struct net_device *null_or_dev;
>> 	bool deliver_exact = false;
>> 	int ret = NET_RX_DROP;
>> 	__be16 type;
>> 	unsigned long pflags = current->flags;
>>
>> 	net_timestamp_check(!netdev_tstamp_prequeue, skb);
>>
>> 	trace_netif_receive_skb(skb);
>>
>> 	/*
>> 	 * PFMEMALLOC skbs are special, they should
>> 	 * - be delivered to SOCK_MEMALLOC sockets only
>> 	 * - stay away from userspace
>> 	 * - have bounded memory usage
>> 	 *
>> 	 * Use PF_MEMALLOC as this saves us from propagating the allocation
>> 	 * context down to all allocation sites.
>> 	 */
>> 	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
>> 		current->flags |= PF_MEMALLOC;
>>
>> 	/* if we've gotten here through NAPI, check netpoll */
>> 	if (netpoll_receive_skb(skb))
>> 		goto out;
>>
>> 	orig_dev = skb->dev;
>>
>> 	skb_reset_network_header(skb);
>> 	skb_reset_transport_header(skb);
>> 	skb_reset_mac_len(skb);
>>
>> 	pt_prev = NULL;
>>
>> 	if (skb_dst(skb)) {
>> 		if (skb_dst(skb)->input == 0xdeadbeef) {
>> 			printk("bad dst: %lu, skb->dev: %s  len: %i\n",
>> 			       skb->_skb_refdst, skb->dev->name, skb->len);
>> 			BUG_ON(1);
>> 		}
>> 	}
>> 	
>
> You should add your debuging code in netif_rx() so that we know the
> caller
>
> by the way you could only add
>
> BUG_ON(skb->_skb_refdst & SKB_DST_NOREF)

Ok, will add that.

I was poking around in drivers/net/loopback.c.  Maybe it needs
to clean up the skb_dst() before calling the rx logic in the
loopback_xmit method?


Thanks,
ben

>
>
Eric Dumazet - Jan. 24, 2013, 1 a.m.
On Wed, 2013-01-23 at 16:51 -0800, Ben Greear wrote:

> 
> I was poking around in drivers/net/loopback.c.  Maybe it needs
> to clean up the skb_dst() before calling the rx logic in the
> loopback_xmit method?

Nope. Its ok there. We need a dst for loopback





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear - Jan. 24, 2013, 1:06 a.m.
On 01/23/2013 05:00 PM, Eric Dumazet wrote:
> On Wed, 2013-01-23 at 16:51 -0800, Ben Greear wrote:
>
>>
>> I was poking around in drivers/net/loopback.c.  Maybe it needs
>> to clean up the skb_dst() before calling the rx logic in the
>> loopback_xmit method?
>
> Nope. Its ok there. We need a dst for loopback

Here's the crash with the BUG_ON().  Did I understand your
suggestion properly?  This hit immediately when booting.


int netif_rx(struct sk_buff *skb)
{
	int ret;

	BUG_ON(skb->_skb_refdst & SKB_DST_NOREF);
	
	/* if netpoll wants it, pretend we never saw it */
	if (netpoll_rx(skb))
		return NET_RX_DROP;


kernel BUG at /home/greearb/git/linux-3.7.dev.y/net/core/dev.c:2982!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in: lockd sunrpc macvlan pktgen uinput coretemp hwmon kvm_intel kvm iTCO_wdt iTCO_vendor_e
CPU 0
Pid: 1554, comm: btserver Tainted: G         C O 3.7.3+ #51 Iron Systems Inc. EE2610R/X8ST3
RIP: 0010:[<ffffffff8147685a>]  [<ffffffff8147685a>] netif_rx+0x14/0x109
RSP: 0018:ffff8804030359a8  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8803fff3a2f0 RCX: ffff88040d3b4490
RDX: ffff880403035a08 RSI: ffff88040d3f8000 RDI: ffff8803fff3a2f0
RBP: ffff8804030359d8 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff81472e61 R11: ffff8803fff62cc0 R12: 0000000000016ff0
R13: 000000000000003c R14: ffff88041fc00000 R15: ffffffff81670560
FS:  00007fcbd5c01740(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000439004 CR3: 00000003ff68d000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process btserver (pid: 1554, threadinfo ffff880403034000, task ffff880407e09f50)
Stack:
  ffff8804030359d8 ffffffff814c0290 ffff880403035a08 ffff8803fff3a2f0
  ffff8803fff3a2f0 0000000000016ff0 ffff880403035a08 ffffffff813a3c4d
  ffff8803fff3a2f0 ffff88040d3f8000 00000000a01b7c08 ffff88040d0b7400
Call Trace:
  [<ffffffff814c0290>] ? tcp_wfree+0xc0/0xc8
  [<ffffffff813a3c4d>] loopback_xmit+0x64/0x83
  [<ffffffff81477364>] dev_hard_start_xmit+0x26c/0x35e
  [<ffffffff8147771a>] dev_queue_xmit+0x2c4/0x37c
  [<ffffffff81477456>] ? dev_hard_start_xmit+0x35e/0x35e
  [<ffffffff8148cfa6>] ? eth_header+0x28/0xb6
  [<ffffffff81480f09>] neigh_resolve_output+0x176/0x1a7
  [<ffffffff814ad835>] ip_finish_output2+0x297/0x30d
  [<ffffffff814ad6d5>] ? ip_finish_output2+0x137/0x30d
  [<ffffffff814ad90e>] ip_finish_output+0x63/0x68
  [<ffffffff814ae412>] ip_output+0x61/0x67
  [<ffffffff814ab904>] dst_output+0x17/0x1b
  [<ffffffff814adb6d>] ip_local_out+0x1e/0x23
  [<ffffffff814ae1c4>] ip_queue_xmit+0x315/0x353
  [<ffffffff814adeaf>] ? ip_send_unicast_reply+0x2cc/0x2cc
  [<ffffffff814c018f>] tcp_transmit_skb+0x7ca/0x80b
  [<ffffffff814c3571>] tcp_connect+0x53c/0x587
  [<ffffffff810c2f0c>] ? getnstimeofday+0x44/0x7d
  [<ffffffff810c2f56>] ? ktime_get_real+0x11/0x3e
  [<ffffffff814c6f9b>] tcp_v4_connect+0x3c2/0x431
  [<ffffffff814d6913>] __inet_stream_connect+0x84/0x287
  [<ffffffff814d6b38>] ? inet_stream_connect+0x22/0x49
  [<ffffffff8108d695>] ? _local_bh_enable_ip+0x84/0x9f
  [<ffffffff8108d6c8>] ? local_bh_enable+0xd/0x11
  [<ffffffff8146763c>] ? lock_sock_nested+0x6e/0x79
  [<ffffffff814d6b38>] ? inet_stream_connect+0x22/0x49
  [<ffffffff814d6b49>] inet_stream_connect+0x33/0x49
  [<ffffffff814632c6>] sys_connect+0x75/0x98
  [<ffffffff811551fd>] ? path_put+0x1d/0x21
  [<ffffffff810e8d06>] ? __audit_syscall_entry+0x11c/0x148
  [<ffffffff8128a509>] ? lockdep_sys_exit_thunk+0x35/0x67
  [<ffffffff81162f8b>] ? __fd_install+0x26/0x52
  [<ffffffff81537e69>] system_call_fastpath+0x16/0x1b
Code: 49 8b 5c 24 10 48 8b 43 68 48 85 c0 0f 85 5e fe ff ff e9 79 fe ff ff 55 48 89 e5 41 54 53 48 89 fb
RIP  [<ffffffff8147685a>] netif_rx+0x14/0x109
  RSP <ffff8804030359a8>
... DONE

>
>
>
>
Eric Dumazet - Jan. 24, 2013, 1:10 a.m.
On Wed, 2013-01-23 at 17:06 -0800, Ben Greear wrote:
> On 01/23/2013 05:00 PM, Eric Dumazet wrote:
> > On Wed, 2013-01-23 at 16:51 -0800, Ben Greear wrote:
> >
> >>
> >> I was poking around in drivers/net/loopback.c.  Maybe it needs
> >> to clean up the skb_dst() before calling the rx logic in the
> >> loopback_xmit method?
> >
> > Nope. Its ok there. We need a dst for loopback
> 
> Here's the crash with the BUG_ON().  Did I understand your
> suggestion properly?  This hit immediately when booting.
> 
> 
> int netif_rx(struct sk_buff *skb)
> {
> 	int ret;
> 
> 	BUG_ON(skb->_skb_refdst & SKB_DST_NOREF);
> 	
> 	/* if netpoll wants it, pretend we never saw it */
> 	if (netpoll_rx(skb))
> 		return NET_RX_DROP;
> 
> 
> kernel BUG at /home/greearb/git/linux-3.7.dev.y/net/core/dev.c:2982!
> invalid opcode: 0000 [#1] PREEMPT SMP
> Modules linked in: lockd sunrpc macvlan pktgen uinput coretemp hwmon kvm_intel kvm iTCO_wdt iTCO_vendor_e
> CPU 0
> Pid: 1554, comm: btserver Tainted: G         C O 3.7.3+ #51 Iron Systems Inc. EE2610R/X8ST3
> RIP: 0010:[<ffffffff8147685a>]  [<ffffffff8147685a>] netif_rx+0x14/0x109
> RSP: 0018:ffff8804030359a8  EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffff8803fff3a2f0 RCX: ffff88040d3b4490
> RDX: ffff880403035a08 RSI: ffff88040d3f8000 RDI: ffff8803fff3a2f0
> RBP: ffff8804030359d8 R08: 0000000000000001 R09: 0000000000000000
> R10: ffffffff81472e61 R11: ffff8803fff62cc0 R12: 0000000000016ff0
> R13: 000000000000003c R14: ffff88041fc00000 R15: ffffffff81670560
> FS:  00007fcbd5c01740(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000439004 CR3: 00000003ff68d000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process btserver (pid: 1554, threadinfo ffff880403034000, task ffff880407e09f50)
> Stack:
>   ffff8804030359d8 ffffffff814c0290 ffff880403035a08 ffff8803fff3a2f0
>   ffff8803fff3a2f0 0000000000016ff0 ffff880403035a08 ffffffff813a3c4d
>   ffff8803fff3a2f0 ffff88040d3f8000 00000000a01b7c08 ffff88040d0b7400
> Call Trace:
>   [<ffffffff814c0290>] ? tcp_wfree+0xc0/0xc8
>   [<ffffffff813a3c4d>] loopback_xmit+0x64/0x83
>   [<ffffffff81477364>] dev_hard_start_xmit+0x26c/0x35e
>   [<ffffffff8147771a>] dev_queue_xmit+0x2c4/0x37c
>   [<ffffffff81477456>] ? dev_hard_start_xmit+0x35e/0x35e
>   [<ffffffff8148cfa6>] ? eth_header+0x28/0xb6
>   [<ffffffff81480f09>] neigh_resolve_output+0x176/0x1a7
>   [<ffffffff814ad835>] ip_finish_output2+0x297/0x30d
>   [<ffffffff814ad6d5>] ? ip_finish_output2+0x137/0x30d
>   [<ffffffff814ad90e>] ip_finish_output+0x63/0x68
>   [<ffffffff814ae412>] ip_output+0x61/0x67
>   [<ffffffff814ab904>] dst_output+0x17/0x1b
>   [<ffffffff814adb6d>] ip_local_out+0x1e/0x23
>   [<ffffffff814ae1c4>] ip_queue_xmit+0x315/0x353
>   [<ffffffff814adeaf>] ? ip_send_unicast_reply+0x2cc/0x2cc
>   [<ffffffff814c018f>] tcp_transmit_skb+0x7ca/0x80b
>   [<ffffffff814c3571>] tcp_connect+0x53c/0x587
>   [<ffffffff810c2f0c>] ? getnstimeofday+0x44/0x7d
>   [<ffffffff810c2f56>] ? ktime_get_real+0x11/0x3e
>   [<ffffffff814c6f9b>] tcp_v4_connect+0x3c2/0x431
>   [<ffffffff814d6913>] __inet_stream_connect+0x84/0x287
>   [<ffffffff814d6b38>] ? inet_stream_connect+0x22/0x49
>   [<ffffffff8108d695>] ? _local_bh_enable_ip+0x84/0x9f
>   [<ffffffff8108d6c8>] ? local_bh_enable+0xd/0x11
>   [<ffffffff8146763c>] ? lock_sock_nested+0x6e/0x79
>   [<ffffffff814d6b38>] ? inet_stream_connect+0x22/0x49
>   [<ffffffff814d6b49>] inet_stream_connect+0x33/0x49
>   [<ffffffff814632c6>] sys_connect+0x75/0x98
>   [<ffffffff811551fd>] ? path_put+0x1d/0x21
>   [<ffffffff810e8d06>] ? __audit_syscall_entry+0x11c/0x148
>   [<ffffffff8128a509>] ? lockdep_sys_exit_thunk+0x35/0x67
>   [<ffffffff81162f8b>] ? __fd_install+0x26/0x52
>   [<ffffffff81537e69>] system_call_fastpath+0x16/0x1b
> Code: 49 8b 5c 24 10 48 8b 43 68 48 85 c0 0f 85 5e fe ff ff e9 79 fe ff ff 55 48 89 e5 41 54 53 48 89 fb
> RIP  [<ffffffff8147685a>] netif_rx+0x14/0x109
>   RSP <ffff8804030359a8>
> ... DONE

Excellent, thats the bug.

I'll send a fix asap, thanks !



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/core/dst.c b/net/core/dst.c
index ee6153e..234b168 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -245,6 +245,7 @@  again:
                 dst->ops->destroy(dst);
         if (dst->dev)
                 dev_put(dst->dev);
+       dst->input = dst->output = 0xdeadbeef;
         kmem_cache_free(dst->ops->kmem_cachep, dst);

         dst = child;