| Message ID | 20170608201238.GC2553@templeofstupid.com |
|---|---|
| State | Accepted, archived |
| Delegated to | David Miller |
On Thu, 2017-06-08 at 13:12 -0700, Krister Johansen wrote:

...

> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for a 14 second interval, so we
> had to wait longer than the warning time in wait_allrefs in order for
> the dst entry to get freed and the hold on 'lo' to be released.
>
> A simple solution to this problem is to have dst_dev_event() reset the
> gc timer, which causes us to process this list shortly after the
> gc_mutex is released when dst_dev_event() completes.
>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
>  net/core/dst.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 6192f11..13ba4a0 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
>  	spin_lock_bh(&dst_garbage.lock);
>  	dst = dst_garbage.list;
>  	dst_garbage.list = NULL;
> +	/* The code in dst_ifdown places a hold on the loopback device.
> +	 * If the gc entry processing is set to expire after a lengthy
> +	 * interval, this hold can cause netdev_wait_allrefs() to hang
> +	 * out and wait for a long time -- until the loopback
> +	 * interface is released.  If we're really unlucky, it'll emit
> +	 * pr_emerg messages to console too.  Reset the interval here,
> +	 * so dst cleanups occur in a more timely fashion.
> +	 */
> +	if (dst_garbage.timer_inc > DST_GC_INC) {
> +		dst_garbage.timer_inc = DST_GC_INC;
> +		dst_garbage.timer_expires = DST_GC_MIN;
> +		mod_delayed_work(system_wq, &dst_gc_work,
> +				 dst_garbage.timer_expires);
> +	}
>  	spin_unlock_bh(&dst_garbage.lock);
>
>  	if (last)

Looks very nice to me !

Acked-by: Eric Dumazet <edumazet@google.com>

Wei Wang is actually working on a patch series removing all this
dst_garbage list stuff.
> Wei Wang is actually working on a patch series removing all this
> dst_garbage list stuff.

Yes. I am working on removing the dst garbage collector completely.
dst_dev_event() will be removed from the list of callbacks that the
netdevice notifiers invoke in my patch series.

On Thu, Jun 8, 2017 at 2:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-06-08 at 13:12 -0700, Krister Johansen wrote:
>
> ...
>
>> Looking at the gc_task intervals, they started at 663ms when we invoked
>> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
>> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
>> occurred after the gc_task was enqueued for a 14 second interval, so we
>> had to wait longer than the warning time in wait_allrefs in order for
>> the dst entry to get freed and the hold on 'lo' to be released.
>>
>> A simple solution to this problem is to have dst_dev_event() reset the
>> gc timer, which causes us to process this list shortly after the
>> gc_mutex is released when dst_dev_event() completes.
>>
>> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
>
> ...
>
> Looks very nice to me !
>
> Acked-by: Eric Dumazet <edumazet@google.com>
>
> Wei Wang is actually working on a patch series removing all this
> dst_garbage list stuff.
From: Krister Johansen <kjlx@templeofstupid.com>
Date: Thu, 8 Jun 2017 13:12:38 -0700

> It looks like this:
>
> Message from syslogd@flamingo at Apr 26 00:45:00 ...
> kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
>
> They seem to coincide with net namespace teardown.
>
> The message is emitted by netdev_wait_allrefs().
>
> Forced a kdump in netdev_run_todo, but found that the refcount on the lo
> device was already 0 at the time we got to the panic.
>
> Used bcc to check the blocking in netdev_run_todo.  The only places
> where we're off cpu there are in the rcu_barrier() and msleep() calls.
> That behavior is expected.  The msleep time coincides with the amount of
> time we spend waiting for the refcount to reach zero; the rcu_barrier()
> wait times are not excessive.
>
> After looking through the list of callbacks that the netdevice notifiers
> invoke in this path, it appears that dst_dev_event() is the most
> interesting.  The dst_ifdown path places a hold on the loopback_dev as
> part of releasing the dev associated with the original dst cache entry.
> Most of our notifier callbacks are straightforward, but this one a)
> looks complex, and b) places a hold on the network interface in
> question.
>
> I constructed a new bcc script that watches various events in the
> lifetime of a dst cache entry.  Note that dst_ifdown will take a hold on
> the loopback device until the invalidated dst entry gets freed.

...

> The way this works is that if there's still a reference on the dst entry
> at the time we try to free it, it gets placed on the gc list by
> __dst_free and the dst_destroy() call is invoked by the gc task once the
> refcount is 0.  If the gc task processes a 10th or less of its entries
> on a single pass, it increases the amount of time it waits between gc
> intervals.
>
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for a 14 second interval, so we
> had to wait longer than the warning time in wait_allrefs in order for
> the dst entry to get freed and the hold on 'lo' to be released.
>
> A simple solution to this problem is to have dst_dev_event() reset the
> gc timer, which causes us to process this list shortly after the
> gc_mutex is released when dst_dev_event() completes.
>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>

Yeah, this is one of the more unsatisfying areas of dst and device
handling in the tree, thanks for working on this.

Applied and queued up for -stable, thanks again.
On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen <kjlx@templeofstupid.com> wrote:
> After looking through the list of callbacks that the netdevice notifiers
> invoke in this path, it appears that dst_dev_event() is the most
> interesting.  The dst_ifdown path places a hold on the loopback_dev as
> part of releasing the dev associated with the original dst cache entry.
> Most of our notifier callbacks are straightforward, but this one a)
> looks complex, and b) places a hold on the network interface in
> question.
>
> I constructed a new bcc script that watches various events in the
> lifetime of a dst cache entry.  Note that dst_ifdown will take a hold on
> the loopback device until the invalidated dst entry gets freed.

Yeah, this is what I observed when Kevin (Cc'ed) reported a similar (if
not the same) bug; I thought we had a refcnt leak on dst.

...

> The way this works is that if there's still a reference on the dst entry
> at the time we try to free it, it gets placed on the gc list by
> __dst_free and the dst_destroy() call is invoked by the gc task once the
> refcount is 0.  If the gc task processes a 10th or less of its entries
> on a single pass, it increases the amount of time it waits between gc
> intervals.
>
> Looking at the gc_task intervals, they started at 663ms when we invoked
> __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> occurred after the gc_task was enqueued for a 14 second interval, so we
> had to wait longer than the warning time in wait_allrefs in order for
> the dst entry to get freed and the hold on 'lo' to be released.

I am glad to see you don't have a dst leak here.

But from my experience of a similar bug (refcnt wait on lo), this goes
on indefinitely rather than for just 14 seconds, so it looked more like
a real leak than just a gc delay.  So in your case, this annoying
warning eventually disappears, right?

Thanks.
On Fri, Jun 09, 2017 at 11:18:44AM -0700, Cong Wang wrote:
> On Thu, Jun 8, 2017 at 1:12 PM, Krister Johansen
> <kjlx@templeofstupid.com> wrote:
> > The way this works is that if there's still a reference on the dst entry
> > at the time we try to free it, it gets placed on the gc list by
> > __dst_free and the dst_destroy() call is invoked by the gc task once the
> > refcount is 0.  If the gc task processes a 10th or less of its entries
> > on a single pass, it increases the amount of time it waits between gc
> > intervals.
> >
> > Looking at the gc_task intervals, they started at 663ms when we invoked
> > __dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
> > 10751, and 14848.  The release that set the refcnt to 0 on our dst entry
> > occurred after the gc_task was enqueued for a 14 second interval, so we
> > had to wait longer than the warning time in wait_allrefs in order for
> > the dst entry to get freed and the hold on 'lo' to be released.
>
> I am glad to see you don't have a dst leak here.
>
> But from my experience of a similar bug (refcnt wait on lo), this goes
> on indefinitely rather than for just 14 seconds, so it looked more like
> a real leak than just a gc delay.  So in your case, this annoying
> warning eventually disappears, right?

That's correct.  The problem occurs intermittently, and the warnings are
less frequent than the interval in netdev_wait_allrefs().  At least when
I observed it, it tended to coincide with our controlplane canary
issuing an API call that led to a network namespace teardown on the
dataplane.

Sometimes, the message would look like this:

  unregister_netdevice: waiting for lo to become free. Usage count = 0

The dst entries were getting released; it's just that our dst cache gc
interval was often longer than the warning interval in wait_allrefs.

The other concern was that because the wait_allrefs happens in the
netdev_todo path, a long gc interval can cause the rtnl_lock hold times
to be much longer than necessary if this bug is encountered.

-K
On Fri, Jun 9, 2017 at 11:43 AM, Krister Johansen <kjlx@templeofstupid.com> wrote:
> On Fri, Jun 09, 2017 at 11:18:44AM -0700, Cong Wang wrote:
>>
>> ...
>>
>> But from my experience of a similar bug (refcnt wait on lo), this goes
>> on indefinitely rather than for just 14 seconds, so it looked more like
>> a real leak than just a gc delay.  So in your case, this annoying
>> warning eventually disappears, right?
>
> That's correct.  The problem occurs intermittently, and the warnings are
> less frequent than the interval in netdev_wait_allrefs().  At least when
> I observed it, it tended to coincide with our controlplane canary
> issuing an API call that led to a network namespace teardown on the
> dataplane.

Great! Then the bug I saw is different from this one, and it is
probably a dst leak.

Thanks.
diff --git a/net/core/dst.c b/net/core/dst.c
index 6192f11..13ba4a0 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
 	spin_lock_bh(&dst_garbage.lock);
 	dst = dst_garbage.list;
 	dst_garbage.list = NULL;
+	/* The code in dst_ifdown places a hold on the loopback device.
+	 * If the gc entry processing is set to expire after a lengthy
+	 * interval, this hold can cause netdev_wait_allrefs() to hang
+	 * out and wait for a long time -- until the loopback
+	 * interface is released.  If we're really unlucky, it'll emit
+	 * pr_emerg messages to console too.  Reset the interval here,
+	 * so dst cleanups occur in a more timely fashion.
+	 */
+	if (dst_garbage.timer_inc > DST_GC_INC) {
+		dst_garbage.timer_inc = DST_GC_INC;
+		dst_garbage.timer_expires = DST_GC_MIN;
+		mod_delayed_work(system_wq, &dst_gc_work,
+				 dst_garbage.timer_expires);
+	}
 	spin_unlock_bh(&dst_garbage.lock);

 	if (last)
It looks like this:

Message from syslogd@flamingo at Apr 26 00:45:00 ...
kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4

They seem to coincide with net namespace teardown.

The message is emitted by netdev_wait_allrefs().

Forced a kdump in netdev_run_todo, but found that the refcount on the lo
device was already 0 at the time we got to the panic.

Used bcc to check the blocking in netdev_run_todo.  The only places
where we're off cpu there are in the rcu_barrier() and msleep() calls.
That behavior is expected.  The msleep time coincides with the amount of
time we spend waiting for the refcount to reach zero; the rcu_barrier()
wait times are not excessive.

After looking through the list of callbacks that the netdevice notifiers
invoke in this path, it appears that dst_dev_event() is the most
interesting.  The dst_ifdown path places a hold on the loopback_dev as
part of releasing the dev associated with the original dst cache entry.
Most of our notifier callbacks are straightforward, but this one a)
looks complex, and b) places a hold on the network interface in
question.

I constructed a new bcc script that watches various events in the
lifetime of a dst cache entry.  Note that dst_ifdown will take a hold on
the loopback device until the invalidated dst entry gets freed.

[ __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
	__dst_free
	rcu_nocb_kthread
	kthread
	ret_from_fork
---
[ dst_gc_task] completed in 4: start: 1282115783057516 end 1282115783062462
[ dst_gc_task] completed in 5: start: 1282116447054101 end 1282116447059392
[ dst_gc_task] completed in 5: start: 1282118111030391 end 1282118111036041
[ dst_gc_task] completed in 6: start: 1282121247074938 end 1282121247081141
[ dst_gc_task] completed in 5: start: 1282126815053252 end 1282126815058751
[ dst_gc_task] completed in 6: start: 1282135007055673 end 1282135007061877
[ dst_gc_task] completed in 6: start: 1282145759021745 end 1282145759027830
[ dst_release] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282147838674257
	dst_release
	ip_queue_xmit
	tcp_transmit_skb
	tcp_write_xmit
	__tcp_push_pending_frames
	tcp_send_fin
	tcp_close
	inet_release
	sock_release
	sock_close
	__fput
	____fput
	task_work_run
	do_exit
	do_group_exit
	SyS_exit_group
	do_syscall_64
	return_from_SYSCALL_64
[ dst_ifdown] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282148067377680
	dst_ifdown
	notifier_call_chain
	raw_notifier_call_chain
	call_netdevice_notifiers_info
	netdev_run_todo
	kretprobe_trampoline
	default_device_exit_batch
	ops_exit_list.isra.4
	cleanup_net
	process_one_work
	worker_thread
	kthread
	ret_from_fork
---
[ dst_ifdown] on DST: ffff883ccabb7900 IF lo invoked at 1282148067411269
	dst_ifdown
	notifier_call_chain
	raw_notifier_call_chain
	call_netdevice_notifiers_info
	netdev_run_todo
	kretprobe_trampoline
	default_device_exit_batch
	ops_exit_list.isra.4
	cleanup_net
	process_one_work
	worker_thread
	kthread
	ret_from_fork
---
<...>
[ dst_destroy] on DST: ffff883ccabb7900 IF lo invoked at 1282160607074022
	dst_destroy
	kretprobe_trampoline
	worker_thread
	kthread
	ret_from_fork
---
[ dst_gc_task] completed in 30: start: 1282160607047161 end 1282160607077572

The way this works is that if there's still a reference on the dst entry
at the time we try to free it, it gets placed on the gc list by
__dst_free and the dst_destroy() call is invoked by the gc task once the
refcount is 0.  If the gc task processes a 10th or less of its entries
on a single pass, it increases the amount of time it waits between gc
intervals.

Looking at the gc_task intervals, they started at 663ms when we invoked
__dst_free().  After that, they increased to 1663, 3136, 5567, 8191,
10751, and 14848.  The release that set the refcnt to 0 on our dst entry
occurred after the gc_task was enqueued for a 14 second interval, so we
had to wait longer than the warning time in wait_allrefs in order for
the dst entry to get freed and the hold on 'lo' to be released.

A simple solution to this problem is to have dst_dev_event() reset the
gc timer, which causes us to process this list shortly after the
gc_mutex is released when dst_dev_event() completes.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 net/core/dst.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)