Message ID | 20080929175421.722037051@theryb.frec.bull.fr |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On 29-09-2008 19:54, Benjamin Thery wrote: > This patch proposes to replace the rtnl_unlock() call in > linkwatch_event() by __rtnl_unlock(). The difference between the two > routines being that __rtnl_unlock() will not call netdev_run_todo() > after it unlocks rtnl_mutex. > > This is to fix a "deadlock" we observed when unregistering a net device. > > In some circumstances, linkwatch_event() blocks the whole "events" > workqueue while blocking in rtnl_unlock(). > > Here is what happens: > > 1. Unregister a device, the following routines are called: > > -> unregister_netdev > -> rtnl_lock > -> unregister_netdevice > -> rtnl_unlock > -> netdev_run_todo > -> netdev_wait_allrefs > > 2. In netdev_wait_allrefs(), the device's refcount is greater than 0 > because there are still some routes to be garbage collected later. > > 3. Also, some link watch events are pending. netdev_wait_allrefs() > will run the linkwatch event queue, calls linkwatch_run_queue(). > > > Both the route garbage collector dst_gc_task() and the linkwatch task > linkwatch_event() are queued in the same generic workqueue: "events". > > > 4. linkwatch_event() is enqueued earlier in the queue. It will grab > rtnl_lock(), deliver the link watch events pending, and then call > rtnl_unlock(). > rtnl_unlock() will then call netdev_run_todo() and block on > mutex_lock(&net_todo_run_mutex). > > At this point, the workqueue "events" is _blocked_ until the > netdev_wait_allrefs() call above returns when the device refcount > reaches 0. > > Problem: it will never happens if dst_gc_task() was enqueued behind > linkwatch_event() in the "events" workqueue as the queue is now > blocked. ... If it's really like this, I wonder if this can happen without linkwatch too in a non-preemptive config? So maybe this should be fixed somewhere else? According to a comment above netdev_wait_allrefs() it seems references should be rather put down on an UNREGISTER event, so this dst_gc_task() scheduling shouldn't bother us, I guess. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote: > On 29-09-2008 19:54, Benjamin Thery wrote: >> This patch proposes to replace the rtnl_unlock() call in >> linkwatch_event() by __rtnl_unlock(). The difference between the two >> routines being that __rtnl_unlock() will not call netdev_run_todo() >> after it unlocks rtnl_mutex. >> >> This is to fix a "deadlock" we observed when unregistering a net device. >> >> In some circumstances, linkwatch_event() blocks the whole "events" >> workqueue while blocking in rtnl_unlock(). >> >> Here is what happens: >> >> 1. Unregister a device, the following routines are called: >> >> -> unregister_netdev >> -> rtnl_lock >> -> unregister_netdevice >> -> rtnl_unlock >> -> netdev_run_todo >> -> netdev_wait_allrefs >> >> 2. In netdev_wait_allrefs(), the device's refcount is greater than 0 >> because there are still some routes to be garbage collected later. >> >> 3. Also, some link watch events are pending. netdev_wait_allrefs() >> will run the linkwatch event queue, calls linkwatch_run_queue(). >> >> >> Both the route garbage collector dst_gc_task() and the linkwatch task >> linkwatch_event() are queued in the same generic workqueue: "events". >> >> >> 4. linkwatch_event() is enqueued earlier in the queue. It will grab >> rtnl_lock(), deliver the link watch events pending, and then call >> rtnl_unlock(). >> rtnl_unlock() will then call netdev_run_todo() and block on >> mutex_lock(&net_todo_run_mutex). >> >> At this point, the workqueue "events" is _blocked_ until the >> netdev_wait_allrefs() call above returns when the device refcount >> reaches 0. >> >> Problem: it will never happens if dst_gc_task() was enqueued behind >> linkwatch_event() in the "events" workqueue as the queue is now >> blocked. > ... > > If it's really like this, I wonder if this can happen without linkwatch > too in a non-preemptive config? Um, not sure I fully understand what you mean... do you mean with CONFIG_PREEMPT_NONE=y? > So maybe this should be fixed somewhere > else? According to a comment above netdev_wait_allrefs() it seems > references should be rather put down on an UNREGISTER event, so this > dst_gc_task() scheduling shouldn't bother us, I guess. I saw this comment too. In our case, the UNREGISTER event is sent, notifications are dispatched correctly, some routes are deleted (dst_free()) but not destroyed (dst_destroy()) and the garbage collector as to run to finish the work. dst_entry's may hold a refcount on device until dst_destroy() is run on them. Unfortunately dst_gc_task() won't have a chance to run dst_destroy() on them later in this case because it is stuck in the "events" workqueue behind linkwatch_event() who is blocking everyone else in the queue. I'm still looking at why the first dst_free() on those particular routes doesn't call dst_destroy() immediately but defers it (another refcount on the route itself). BTW, in the past (last year) dst_gc_task() was run as a timer handler and this situation (deadlock with linkwatch_event()) couldn't occur. Thanks, Benjamin > > Jarek P. > >
From: Benjamin Thery <benjamin.thery@bull.net> Date: Tue, 30 Sep 2008 13:52:31 +0200 > BTW, in the past (last year) dst_gc_task() was run as a timer > handler and this situation (deadlock with linkwatch_event()) > couldn't occur. September 12th to be exact :-) That's when Eric Dumazet's change went in: commit 86bba269d08f0c545ae76c90b56727f65d62d57f Author: Eric Dumazet <dada1@cosmosbay.com> Date: Wed Sep 12 14:29:01 2007 +0200 [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue But this change went into 2.6.24, so I wonder why it's taken so long to hit this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller wrote: > From: Benjamin Thery <benjamin.thery@bull.net> > Date: Tue, 30 Sep 2008 13:52:31 +0200 > >> BTW, in the past (last year) dst_gc_task() was run as a timer >> handler and this situation (deadlock with linkwatch_event()) >> couldn't occur. > > September 12th to be exact :-) That's when Eric Dumazet's > change went in: > > commit 86bba269d08f0c545ae76c90b56727f65d62d57f > Author: Eric Dumazet <dada1@cosmosbay.com> > Date: Wed Sep 12 14:29:01 2007 +0200 > > [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue > > > But this change went into 2.6.24, so I wonder why it's taken > so long to hit this. Maybe because we manage to reproduce it only with network namespaces so far and there isn't much netns users at the moment (and netns is still incomplete in mainline). Benjamin > >
Benjamin Thery wrote, On 09/30/2008 01:52 PM: > Jarek Poplawski wrote: >> On 29-09-2008 19:54, Benjamin Thery wrote: ... >>> Problem: it will never happens if dst_gc_task() was enqueued behind >>> linkwatch_event() in the "events" workqueue as the queue is now >>> blocked. >> ... >> >> If it's really like this, I wonder if this can happen without linkwatch >> too in a non-preemptive config? > > Um, not sure I fully understand what you mean... do you mean with > CONFIG_PREEMPT_NONE=y? Yes, but after rethinking I see this is irrelevant. Anyway, my main concern is that it seems similar dependency might happen between other than linkwatch work functions or processes waiting for each other. >> So maybe this should be fixed somewhere >> else? According to a comment above netdev_wait_allrefs() it seems >> references should be rather put down on an UNREGISTER event, so this >> dst_gc_task() scheduling shouldn't bother us, I guess. > > I saw this comment too. In our case, the UNREGISTER event is sent, > notifications are dispatched correctly, some routes are deleted > (dst_free()) but not destroyed (dst_destroy()) and the garbage collector > as to run to finish the work. > > dst_entry's may hold a refcount on device until dst_destroy() is run on > them. Unfortunately dst_gc_task() won't have a chance to run > dst_destroy() on them later in this case because it is stuck in the > "events" workqueue behind linkwatch_event() who is blocking everyone > else in the queue. > > I'm still looking at why the first dst_free() on those particular routes > doesn't call dst_destroy() immediately but defers it (another refcount > on the route itself). Yes, finding/fixing this, if possible, in this place looks like the most consistent with the way netdev_wait_allrefs() is handling this. Thanks, Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote: > Benjamin Thery wrote, On 09/30/2008 01:52 PM: ... > > I'm still looking at why the first dst_free() on those particular routes > > doesn't call dst_destroy() immediately but defers it (another refcount > > on the route itself). > > Yes, finding/fixing this, if possible, in this place looks like the > most consistent with the way netdev_wait_allrefs() is handling this. Actually, I wonder, why we can't simply run this dst_gc_task() from dst_dev_event() (after cancelling the work) when needed. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote: > On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote: >> Benjamin Thery wrote, On 09/30/2008 01:52 PM: > ... >>> I'm still looking at why the first dst_free() on those particular routes >>> doesn't call dst_destroy() immediately but defers it (another refcount >>> on the route itself). >> Yes, finding/fixing this, if possible, in this place looks like the >> most consistent with the way netdev_wait_allrefs() is handling this. > > Actually, I wonder, why we can't simply run this dst_gc_task() from > dst_dev_event() (after cancelling the work) when needed. > Um... I haven't thought about this. I'll have a look to see if it can solve our issue. Benjamin > Jarek P. > >
From: Benjamin Thery <benjamin.thery@bull.net> Date: Tue, 30 Sep 2008 17:18:25 +0200 > Jarek Poplawski wrote: > > On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote: > >> Benjamin Thery wrote, On 09/30/2008 01:52 PM: > > ... > >>> I'm still looking at why the first dst_free() on those particular routes doesn't call dst_destroy() immediately but defers it (another refcount > >>> on the route itself). > >> Yes, finding/fixing this, if possible, in this place looks like the > >> most consistent with the way netdev_wait_allrefs() is handling this. > > Actually, I wonder, why we can't simply run this dst_gc_task() from > > dst_dev_event() (after cancelling the work) when needed. > > > > Um... I haven't thought about this. I'll have a look to see if it can > solve our issue. Let me know what happens, I'd like to apply some fix soon. So just report the patch implementing the final approach you feel the most comfortable with. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller wrote: > From: Benjamin Thery <benjamin.thery@bull.net> > Date: Tue, 30 Sep 2008 17:18:25 +0200 > >> Jarek Poplawski wrote: >>> On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote: >>>> Benjamin Thery wrote, On 09/30/2008 01:52 PM: >>> ... >>>>> I'm still looking at why the first dst_free() on those particular routes doesn't call dst_destroy() immediately but defers it (another refcount >>>>> on the route itself). >>>> Yes, finding/fixing this, if possible, in this place looks like the >>>> most consistent with the way netdev_wait_allrefs() is handling this. >>> Actually, I wonder, why we can't simply run this dst_gc_task() from >>> dst_dev_event() (after cancelling the work) when needed. >>> >> Um... I haven't thought about this. I'll have a look to see if it can >> solve our issue. > > Let me know what happens, I'd like to apply some fix soon. > So just report the patch implementing the final approach you > feel the most comfortable with. We did the modification suggested by Jarek and that fix the problem :) We are playing a bit with this patch to check if we didn't missed something. We will certainly send it in the next hours. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Lezcano <dlezcano@fr.ibm.com> Date: Wed, 01 Oct 2008 12:10:10 +0200 > We are playing a bit with this patch to check if we didn't missed > something. We will certainly send it in the next hours. Nice. It is quite efficient for us to work together live like this when we are in the same time zone, and even the same country :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Benjamin Thery <benjamin.thery@bull.net> writes: > This patch proposes to replace the rtnl_unlock() call in > linkwatch_event() by __rtnl_unlock(). The difference between the two > routines being that __rtnl_unlock() will not call netdev_run_todo() > after it unlocks rtnl_mutex. > > This is to fix a "deadlock" we observed when unregistering a net device. Thanks for tracking this I just started seeing this a few days ago, and before I reach the point of tracking it down you are in the process of fixing it ;) Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: net-next-2.6/net/core/link_watch.c =================================================================== --- net-next-2.6.orig/net/core/link_watch.c +++ net-next-2.6/net/core/link_watch.c @@ -207,7 +207,7 @@ static void linkwatch_event(struct work_ { rtnl_lock(); __linkwatch_run_queue(time_after(linkwatch_nextevent, jiffies)); - rtnl_unlock(); + __rtnl_unlock(); }
This patch proposes to replace the rtnl_unlock() call in linkwatch_event() by __rtnl_unlock(). The difference between the two routines being that __rtnl_unlock() will not call netdev_run_todo() after it unlocks rtnl_mutex. This is to fix a "deadlock" we observed when unregistering a net device. In some circumstances, linkwatch_event() blocks the whole "events" workqueue while blocking in rtnl_unlock(). Here is what happens: 1. Unregister a device, the following routines are called: -> unregister_netdev -> rtnl_lock -> unregister_netdevice -> rtnl_unlock -> netdev_run_todo -> netdev_wait_allrefs 2. In netdev_wait_allrefs(), the device's refcount is greater than 0 because there are still some routes to be garbage collected later. 3. Also, some link watch events are pending. netdev_wait_allrefs() will run the linkwatch event queue, calls linkwatch_run_queue(). Both the route garbage collector dst_gc_task() and the linkwatch task linkwatch_event() are queued in the same generic workqueue: "events". 4. linkwatch_event() is enqueued earlier in the queue. It will grab rtnl_lock(), deliver the link watch events pending, and then call rtnl_unlock(). rtnl_unlock() will then call netdev_run_todo() and block on mutex_lock(&net_todo_run_mutex). At this point, the workqueue "events" is _blocked_ until the netdev_wait_allrefs() call above returns when the device refcount reaches 0. Problem: it will never happens if dst_gc_task() was enqueued behind linkwatch_event() in the "events" workqueue as the queue is now blocked. Thread foo Thread "events" ---------- --------------- free IPv6 routes | -> schedule_delayed_work(dst_gc_task) ... | unregister_netdev | -> rtln_unlock | -> netdev_run_todo | ... -> grab net_todo_run_mutex | -> netdev_wait_allrefs (loop) | -> run__workqueue() -> linkwatch_run_queue() | .... enqueue linkwatch_event | in "events" | | -> run_workqueue() "BLOCKED" | -> linkwatch_event() | -> rtnl_lock() | -> __linkwatch_run_queue | ->rtnl_unlock() | -> netdev_run_todo | -> try to grab net_todo_run_mutex BLOCKED Deadlock: # BLOCKED waiting for dst_gc_task() # BLOCKED waiting for netdev_wait_allrefs to decrement the device refcnt to release net_todo_run_mutex # dst_gst_task() waiting to be run by "events" This deadlock can be observed sometimes when the loopback is unregistered when a network namespaces exits. Our tests showed no regression so far with this modification done in linkwatch_event(), but we aren't sure we understood all the subtleties in net device unregistration yet. :) Benjamin Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> --- net/core/link_watch.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)