[net] Fix an intermittent pr_emerg warning about lo becoming free.

Submitted by Krister Johansen on June 8, 2017, 8:12 p.m.

Details

Message ID 20170608201238.GC2553@templeofstupid.com
State Accepted
Delegated to: David Miller
Headers show

Commit Message

Krister Johansen June 8, 2017, 8:12 p.m.
It looks like this:

Message from syslogd@flamingo at Apr 26 00:45:00 ...
 kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4

They seem to coincide with net namespace teardown.

The message is emitted by netdev_wait_allrefs().

Forced a kdump in netdev_run_todo, but found that the refcount on the lo
device was already 0 at the time we got to the panic.

Used bcc to check the blocking in netdev_run_todo.  The only places
where we're off cpu there are in the rcu_barrier() and msleep() calls.
That behavior is expected.  The msleep time coincides with the amount of
time we spend waiting for the refcount to reach zero; the rcu_barrier()
wait times are not excessive.

After looking through the list of callbacks that the netdevice notifiers
invoke in this path, it appears that the dst_dev_event is the most
interesting.  The dst_ifdown path places a hold on the loopback_dev as
part of releasing the dev associated with the original dst cache entry.
Most of our notifier callbacks are straight-forward, but this one a)
looks complex, and b) places a hold on the network interface in
question.

I constructed a new bcc script that watches various events in the
liftime of a dst cache entry.  Note that dst_ifdown will take a hold on
the loopback device until the invalidated dst entry gets freed.

[      __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
    __dst_free
    rcu_nocb_kthread
    kthread
    ret_from_fork
---
[     dst_gc_task] completed in          4: start: 1282115783057516 end 1282115783062462
[     dst_gc_task] completed in          5: start: 1282116447054101 end 1282116447059392
[     dst_gc_task] completed in          5: start: 1282118111030391 end 1282118111036041
[     dst_gc_task] completed in          6: start: 1282121247074938 end 1282121247081141
[     dst_gc_task] completed in          5: start: 1282126815053252 end 1282126815058751
[     dst_gc_task] completed in          6: start: 1282135007055673 end 1282135007061877
[     dst_gc_task] completed in          6: start: 1282145759021745 end 1282145759027830
[     dst_release] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282147838674257
    dst_release
    ip_queue_xmit
    tcp_transmit_skb
    tcp_write_xmit
    __tcp_push_pending_frames
    tcp_send_fin
    tcp_close
    inet_release
    sock_release
    sock_close
    __fput
    ____fput
    task_work_run
    do_exit
    do_group_exit
    SyS_exit_group
    do_syscall_64
    return_from_SYSCALL_64
[      dst_ifdown] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282148067377680
    dst_ifdown
    notifier_call_chain
    raw_notifier_call_chain
    call_netdevice_notifiers_info
    netdev_run_todo
    kretprobe_trampoline
    default_device_exit_batch
    ops_exit_list.isra.4
    cleanup_net
    process_one_work
    worker_thread
    kthread
    ret_from_fork
---
[      dst_ifdown] on DST: ffff883ccabb7900 IF lo invoked at 1282148067411269
    dst_ifdown
    notifier_call_chain
    raw_notifier_call_chain
    call_netdevice_notifiers_info
    netdev_run_todo
    kretprobe_trampoline
    default_device_exit_batch
    ops_exit_list.isra.4
    cleanup_net
    process_one_work
    worker_thread
    kthread
    ret_from_fork
---
<...>
[     dst_destroy] on DST: ffff883ccabb7900 IF lo invoked at 1282160607074022
    dst_destroy
    kretprobe_trampoline
    worker_thread
    kthread
    ret_from_fork
---
[     dst_gc_task] completed in         30: start: 1282160607047161 end 1282160607077572

The way this works is that if there's still a reference on the dst entry
at the time we try to free it, it gets placed in the gc list by
__dst_free and the dst_destroy() call is invoked by the gc task once the
refcount is 0.  If the gc task processes a 10th or less of its entries
on a single pass, it inreases the amount of time it waits between gc
intervals.

Looking at the gc_task intervals, they started at 663ms when we invoked
__dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
10751, and 14848.  The release that set the refcnt to 0 on our dst entry
occurred after the gc_task was enqueued for 14 second interval so we had
to wait longer than the warning time in wait_allrefs in order for the
dst entry to get free'd and the hold on 'lo' to be released.

A simple solution to this problem is to have dst_dev_event() reset the
gc timer, which causes us to process this list shortly after the
gc_mutex is relased when dst_dev_event() completes.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 net/core/dst.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Eric Dumazet June 8, 2017, 9:06 p.m.
On Thu, 2017-06-08 at 13:12 -0700, Krister Johansen wrote:

...

> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for 14 second interval so we had
> to wait longer than the warning time in wait_allrefs in order for the
> dst entry to get free'd and the hold on 'lo' to be released.
> 
> A simple solution to this problem is to have dst_dev_event() reset the
> gc timer, which causes us to process this list shortly after the
> gc_mutex is relased when dst_dev_event() completes.
> 
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
>  net/core/dst.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 6192f11..13ba4a0 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
>  		spin_lock_bh(&dst_garbage.lock);
>  		dst = dst_garbage.list;
>  		dst_garbage.list = NULL;
> +		/* The code in dst_ifdown places a hold on the loopback device.
> +		 * If the gc entry processing is set to expire after a lengthy
> +		 * interval, this hold can cause netdev_wait_allrefs() to hang
> +		 * out and wait for a long time -- until the the loopback
> +		 * interface is released.  If we're really unlucky, it'll emit
> +		 * pr_emerg messages to console too.  Reset the interval here,
> +		 * so dst cleanups occur in a more timely fashion.
> +		 */
> +		if (dst_garbage.timer_inc > DST_GC_INC) {
> +			dst_garbage.timer_inc = DST_GC_INC;
> +			dst_garbage.timer_expires = DST_GC_MIN;
> +			mod_delayed_work(system_wq, &dst_gc_work,
> +					 dst_garbage.timer_expires);
> +		}
>  		spin_unlock_bh(&dst_garbage.lock);
>  
>  		if (last)

Looks very nice to me !

Acked-by: Eric Dumazet <edumazet@google.com>

Wei Wan is actually working on a patch series removing all this
dst_garbage list stuff.
Wei Wang June 8, 2017, 9:41 p.m.
> Wei Wan is actually working on a patch series removing all this
> dst_garbage list stuff.

Yes. I am working on removing the dst garbage collector completely.
dst_dev_event() will be removed from the list of callbacks that the
netdevice notifiers invoke in my patch series.

On Thu, Jun 8, 2017 at 2:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-06-08 at 13:12 -0700, Krister Johansen wrote:
>
> ...
>
>> Looking at the gc_task intervals, they started at 663ms when we invoked
>> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
>> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
>> occurred after the gc_task was enqueued for 14 second interval so we had
>> to wait longer than the warning time in wait_allrefs in order for the
>> dst entry to get free'd and the hold on 'lo' to be released.
>>
>> A simple solution to this problem is to have dst_dev_event() reset the
>> gc timer, which causes us to process this list shortly after the
>> gc_mutex is relased when dst_dev_event() completes.
>>
>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
>> ---
>>  net/core/dst.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/net/core/dst.c b/net/core/dst.c
>> index 6192f11..13ba4a0 100644
>> --- a/net/core/dst.c
>> +++ b/net/core/dst.c
>> @@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
>>               spin_lock_bh(&dst_garbage.lock);
>>               dst = dst_garbage.list;
>>               dst_garbage.list = NULL;
>> +             /* The code in dst_ifdown places a hold on the loopback device.
>> +              * If the gc entry processing is set to expire after a lengthy
>> +              * interval, this hold can cause netdev_wait_allrefs() to hang
>> +              * out and wait for a long time -- until the the loopback
>> +              * interface is released.  If we're really unlucky, it'll emit
>> +              * pr_emerg messages to console too.  Reset the interval here,
>> +              * so dst cleanups occur in a more timely fashion.
>> +              */
>> +             if (dst_garbage.timer_inc > DST_GC_INC) {
>> +                     dst_garbage.timer_inc = DST_GC_INC;
>> +                     dst_garbage.timer_expires = DST_GC_MIN;
>> +                     mod_delayed_work(system_wq, &dst_gc_work,
>> +                                      dst_garbage.timer_expires);
>> +             }
>>               spin_unlock_bh(&dst_garbage.lock);
>>
>>               if (last)
>
> Looks very nice to me !
>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Wei Wan is actually working on a patch series removing all this
> dst_garbage list stuff.
>
>
>
David Miller June 9, 2017, 4:28 p.m.
From: Krister Johansen <kjlx@templeofstupid.com>
Date: Thu, 8 Jun 2017 13:12:38 -0700

> It looks like this:
> 
> Message from syslogd@flamingo at Apr 26 00:45:00 ...
>  kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
> 
> They seem to coincide with net namespace teardown.
> 
> The message is emitted by netdev_wait_allrefs().
> 
> Forced a kdump in netdev_run_todo, but found that the refcount on the lo
> device was already 0 at the time we got to the panic.
> 
> Used bcc to check the blocking in netdev_run_todo.  The only places
> where we're off cpu there are in the rcu_barrier() and msleep() calls.
> That behavior is expected.  The msleep time coincides with the amount of
> time we spend waiting for the refcount to reach zero; the rcu_barrier()
> wait times are not excessive.
> 
> After looking through the list of callbacks that the netdevice notifiers
> invoke in this path, it appears that the dst_dev_event is the most
> interesting.  The dst_ifdown path places a hold on the loopback_dev as
> part of releasing the dev associated with the original dst cache entry.
> Most of our notifier callbacks are straight-forward, but this one a)
> looks complex, and b) places a hold on the network interface in
> question.
> 
> I constructed a new bcc script that watches various events in the
> liftime of a dst cache entry.  Note that dst_ifdown will take a hold on
> the loopback device until the invalidated dst entry gets freed.
 ...
> The way this works is that if there's still a reference on the dst entry
> at the time we try to free it, it gets placed in the gc list by
> __dst_free and the dst_destroy() call is invoked by the gc task once the
> refcount is 0.  If the gc task processes a 10th or less of its entries
> on a single pass, it inreases the amount of time it waits between gc
> intervals.
> 
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for 14 second interval so we had
> to wait longer than the warning time in wait_allrefs in order for the
> dst entry to get free'd and the hold on 'lo' to be released.
> 
> A simple solution to this problem is to have dst_dev_event() reset the
> gc timer, which causes us to process this list shortly after the
> gc_mutex is relased when dst_dev_event() completes.
> 
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>

Yeah this is one of the more unsatisfying areas of dst and device handling
in the tree, thanks for working on this.

Applied and queued up for -stable, thanks again.
WANG Cong June 9, 2017, 6:18 p.m.
On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
<kjlx@templeofstupid.com> wrote:
> After looking through the list of callbacks that the netdevice notifiers
> invoke in this path, it appears that the dst_dev_event is the most
> interesting.  The dst_ifdown path places a hold on the loopback_dev as
> part of releasing the dev associated with the original dst cache entry.
> Most of our notifier callbacks are straight-forward, but this one a)
> looks complex, and b) places a hold on the network interface in
> question.
>
> I constructed a new bcc script that watches various events in the
> liftime of a dst cache entry.  Note that dst_ifdown will take a hold on
> the loopback device until the invalidated dst entry gets freed.
>

Yeah, this is what I observed when Kevin (Cc'ed) reported a similar
(if not same) bug, I thought we have a refcnt leak on dst.

...
> The way this works is that if there's still a reference on the dst entry
> at the time we try to free it, it gets placed in the gc list by
> __dst_free and the dst_destroy() call is invoked by the gc task once the
> refcount is 0.  If the gc task processes a 10th or less of its entries
> on a single pass, it inreases the amount of time it waits between gc
> intervals.
>
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for 14 second interval so we had
> to wait longer than the warning time in wait_allrefs in order for the
> dst entry to get free'd and the hold on 'lo' to be released.
>

I am glad to see you don't have a dst leak here.

But from my experience of a similar bug (refcnt wait on lo), this goes
infinitely rather than just 14sec, so it looked more like a real leak than
just a gc delay. So in your case, this annoying warning eventually
disappears, right?


Thanks.
Krister Johansen June 9, 2017, 6:43 p.m.
On Fri, Jun 09, 2017 at 11:18:44AM -0700, Cong Wang wrote:
> On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
> <kjlx@templeofstupid.com> wrote:
> > The way this works is that if there's still a reference on the dst entry
> > at the time we try to free it, it gets placed in the gc list by
> > __dst_free and the dst_destroy() call is invoked by the gc task once the
> > refcount is 0.  If the gc task processes a 10th or less of its entries
> > on a single pass, it inreases the amount of time it waits between gc
> > intervals.
> >
> > Looking at the gc_task intervals, they started at 663ms when we invoked
> > __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> > 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> > occurred after the gc_task was enqueued for 14 second interval so we had
> > to wait longer than the warning time in wait_allrefs in order for the
> > dst entry to get free'd and the hold on 'lo' to be released.
> >
> 
> I am glad to see you don't have a dst leak here.
> 
> But from my experience of a similar bug (refcnt wait on lo), this goes
> infinitely rather than just 14sec, so it looked more like a real leak than
> just a gc delay. So in your case, this annoying warning eventually
> disappears, right?

That's correct.  The problem occurs intermittently, and the warnings are
less frequent than the interval in netdev_wait_allrefs().  At least when
I observed it, it tended to conincide with our controlplane canary
issuing an API call that lead to a network namespace teardown on the
dataplane.

Sometimes, the message would look like this:

  unregister_netdevice: waiting for lo to become free. Usage count = 0

The dst entries were getting released, it's just that often our dst
cache gc interval was longer than the warning interval in wait_allrefs.

The other concern was that because the wait_allrefs happens in the
netdev_todo path, a long gc interval can cause the rtnl_lock hold times
to be much longer than necessary if this bug is encountered.

-K
WANG Cong June 9, 2017, 9:48 p.m.
On Fri, Jun 9, 2017 at 11:43 AM, Krister Johansen
<kjlx@templeofstupid.com> wrote:
> On Fri, Jun 09, 2017 at 11:18:44AM -0700, Cong Wang wrote:
>> On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
>> <kjlx@templeofstupid.com> wrote:
>> > The way this works is that if there's still a reference on the dst entry
>> > at the time we try to free it, it gets placed in the gc list by
>> > __dst_free and the dst_destroy() call is invoked by the gc task once the
>> > refcount is 0.  If the gc task processes a 10th or less of its entries
>> > on a single pass, it inreases the amount of time it waits between gc
>> > intervals.
>> >
>> > Looking at the gc_task intervals, they started at 663ms when we invoked
>> > __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
>> > 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
>> > occurred after the gc_task was enqueued for 14 second interval so we had
>> > to wait longer than the warning time in wait_allrefs in order for the
>> > dst entry to get free'd and the hold on 'lo' to be released.
>> >
>>
>> I am glad to see you don't have a dst leak here.
>>
>> But from my experience of a similar bug (refcnt wait on lo), this goes
>> infinitely rather than just 14sec, so it looked more like a real leak than
>> just a gc delay. So in your case, this annoying warning eventually
>> disappears, right?
>
> That's correct.  The problem occurs intermittently, and the warnings are
> less frequent than the interval in netdev_wait_allrefs().  At least when
> I observed it, it tended to conincide with our controlplane canary
> issuing an API call that lead to a network namespace teardown on the
> dataplane.

Great! Then the bug I saw is different from this one and it is probably
a dst leak.

Thanks.

Patch hide | download patch | download mbox

diff --git a/net/core/dst.c b/net/core/dst.c
index 6192f11..13ba4a0 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -469,6 +469,20 @@  static int dst_dev_event(struct notifier_block *this, unsigned long event,
 		spin_lock_bh(&dst_garbage.lock);
 		dst = dst_garbage.list;
 		dst_garbage.list = NULL;
+		/* The code in dst_ifdown places a hold on the loopback device.
+		 * If the gc entry processing is set to expire after a lengthy
+		 * interval, this hold can cause netdev_wait_allrefs() to hang
+		 * out and wait for a long time -- until the the loopback
+		 * interface is released.  If we're really unlucky, it'll emit
+		 * pr_emerg messages to console too.  Reset the interval here,
+		 * so dst cleanups occur in a more timely fashion.
+		 */
+		if (dst_garbage.timer_inc > DST_GC_INC) {
+			dst_garbage.timer_inc = DST_GC_INC;
+			dst_garbage.timer_expires = DST_GC_MIN;
+			mod_delayed_work(system_wq, &dst_gc_work,
+					 dst_garbage.timer_expires);
+		}
 		spin_unlock_bh(&dst_garbage.lock);
 
 		if (last)