Patchwork net: deadlock during net device unregistration

login
register
mail settings
Submitter Benjamin Thery
Date Sept. 29, 2008, 5:54 p.m.
Message ID <20080929175421.722037051@theryb.frec.bull.fr>
Download mbox | patch
Permalink /patch/1921/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Benjamin Thery - Sept. 29, 2008, 5:54 p.m.
This patch proposes to replace the rtnl_unlock() call in 
linkwatch_event() by __rtnl_unlock(). The difference between the two
routines being that __rtnl_unlock() will not call netdev_run_todo()
after it unlocks rtnl_mutex.

This is to fix a "deadlock" we observed when unregistering a net device.

In some circumstances, linkwatch_event() blocks the whole "events" 
workqueue while blocking in rtnl_unlock().

Here is what happens:

1. Unregister a device, the following routines are called:

-> unregister_netdev
  -> rtnl_lock
  -> unregister_netdevice
  -> rtnl_unlock
    -> netdev_run_todo
      -> netdev_wait_allrefs

2. In netdev_wait_allrefs(), the device's refcount is greater than 0
  because there are still some routes to be garbage collected later.

3. Also, some link watch events are pending. netdev_wait_allrefs()
  will run the linkwatch event queue, calls linkwatch_run_queue().


Both the route garbage collector dst_gc_task() and the linkwatch task
linkwatch_event() are queued in the same generic workqueue: "events".


4. linkwatch_event() is enqueued earlier in the queue. It will grab
  rtnl_lock(), deliver the link watch events pending, and then call
  rtnl_unlock(). 
  rtnl_unlock() will then call netdev_run_todo() and block on
  mutex_lock(&net_todo_run_mutex).

  At this point, the workqueue "events" is _blocked_ until the
  netdev_wait_allrefs() call above returns when the device refcount 
  reaches 0.

  Problem: it will never happens if dst_gc_task() was enqueued behind
  linkwatch_event() in the "events" workqueue as the queue is now 
  blocked.


Thread foo                            Thread "events"
----------                            ---------------

free IPv6 routes                    |
-> schedule_delayed_work(dst_gc_task)
...                                 |
unregister_netdev                   |
 -> rtln_unlock                     | 
   -> netdev_run_todo               | ...
     -> grab net_todo_run_mutex     |
     -> netdev_wait_allrefs (loop)  | -> run__workqueue()
        ->  linkwatch_run_queue()   | ....
            enqueue linkwatch_event |
            in "events"             |
                                    | -> run_workqueue()
        "BLOCKED"                   |   -> linkwatch_event()
                                    |     -> rtnl_lock()
                                    |     -> __linkwatch_run_queue 
                                    |     ->rtnl_unlock()
                                    |       -> netdev_run_todo
                                    |          -> try to grab net_todo_run_mutex
                                               BLOCKED

Deadlock: 
  
# BLOCKED waiting for dst_gc_task()  # BLOCKED waiting for netdev_wait_allrefs
to decrement the device refcnt         to release net_todo_run_mutex

                  # dst_gst_task() waiting to be run by "events"


This deadlock can be observed sometimes when the loopback is 
unregistered when a network namespaces exits.

Our tests showed no regression so far with this modification done in 
linkwatch_event(), but we aren't sure we understood all the subtleties
in net device unregistration yet. :)

Benjamin

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
---
 net/core/link_watch.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Jarek Poplawski - Sept. 30, 2008, 6:32 a.m.
On 29-09-2008 19:54, Benjamin Thery wrote:
> This patch proposes to replace the rtnl_unlock() call in 
> linkwatch_event() by __rtnl_unlock(). The difference between the two
> routines being that __rtnl_unlock() will not call netdev_run_todo()
> after it unlocks rtnl_mutex.
> 
> This is to fix a "deadlock" we observed when unregistering a net device.
> 
> In some circumstances, linkwatch_event() blocks the whole "events" 
> workqueue while blocking in rtnl_unlock().
> 
> Here is what happens:
> 
> 1. Unregister a device, the following routines are called:
> 
> -> unregister_netdev
>   -> rtnl_lock
>   -> unregister_netdevice
>   -> rtnl_unlock
>     -> netdev_run_todo
>       -> netdev_wait_allrefs
> 
> 2. In netdev_wait_allrefs(), the device's refcount is greater than 0
>   because there are still some routes to be garbage collected later.
> 
> 3. Also, some link watch events are pending. netdev_wait_allrefs()
>   will run the linkwatch event queue, calls linkwatch_run_queue().
> 
> 
> Both the route garbage collector dst_gc_task() and the linkwatch task
> linkwatch_event() are queued in the same generic workqueue: "events".
> 
> 
> 4. linkwatch_event() is enqueued earlier in the queue. It will grab
>   rtnl_lock(), deliver the link watch events pending, and then call
>   rtnl_unlock(). 
>   rtnl_unlock() will then call netdev_run_todo() and block on
>   mutex_lock(&net_todo_run_mutex).
> 
>   At this point, the workqueue "events" is _blocked_ until the
>   netdev_wait_allrefs() call above returns when the device refcount 
>   reaches 0.
> 
>   Problem: it will never happens if dst_gc_task() was enqueued behind
>   linkwatch_event() in the "events" workqueue as the queue is now 
>   blocked.
...

If it's really like this, I wonder if this can happen without linkwatch
too in a non-preemptive config? So maybe this should be fixed somewhere
else? According to a comment above netdev_wait_allrefs() it seems
references should be rather put down on an UNREGISTER event, so this
dst_gc_task() scheduling shouldn't bother us, I guess.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Thery - Sept. 30, 2008, 11:52 a.m.
Jarek Poplawski wrote:
> On 29-09-2008 19:54, Benjamin Thery wrote:
>> This patch proposes to replace the rtnl_unlock() call in 
>> linkwatch_event() by __rtnl_unlock(). The difference between the two
>> routines being that __rtnl_unlock() will not call netdev_run_todo()
>> after it unlocks rtnl_mutex.
>>
>> This is to fix a "deadlock" we observed when unregistering a net device.
>>
>> In some circumstances, linkwatch_event() blocks the whole "events" 
>> workqueue while blocking in rtnl_unlock().
>>
>> Here is what happens:
>>
>> 1. Unregister a device, the following routines are called:
>>
>> -> unregister_netdev
>>   -> rtnl_lock
>>   -> unregister_netdevice
>>   -> rtnl_unlock
>>     -> netdev_run_todo
>>       -> netdev_wait_allrefs
>>
>> 2. In netdev_wait_allrefs(), the device's refcount is greater than 0
>>   because there are still some routes to be garbage collected later.
>>
>> 3. Also, some link watch events are pending. netdev_wait_allrefs()
>>   will run the linkwatch event queue, calls linkwatch_run_queue().
>>
>>
>> Both the route garbage collector dst_gc_task() and the linkwatch task
>> linkwatch_event() are queued in the same generic workqueue: "events".
>>
>>
>> 4. linkwatch_event() is enqueued earlier in the queue. It will grab
>>   rtnl_lock(), deliver the link watch events pending, and then call
>>   rtnl_unlock(). 
>>   rtnl_unlock() will then call netdev_run_todo() and block on
>>   mutex_lock(&net_todo_run_mutex).
>>
>>   At this point, the workqueue "events" is _blocked_ until the
>>   netdev_wait_allrefs() call above returns when the device refcount 
>>   reaches 0.
>>
>>   Problem: it will never happens if dst_gc_task() was enqueued behind
>>   linkwatch_event() in the "events" workqueue as the queue is now 
>>   blocked.
> ...
> 
> If it's really like this, I wonder if this can happen without linkwatch
> too in a non-preemptive config?

Um, not sure I fully understand what you mean... do you mean with
CONFIG_PREEMPT_NONE=y?

> So maybe this should be fixed somewhere
> else? According to a comment above netdev_wait_allrefs() it seems
> references should be rather put down on an UNREGISTER event, so this
> dst_gc_task() scheduling shouldn't bother us, I guess.

I saw this comment too. In our case, the UNREGISTER event is sent,
notifications are dispatched correctly, some routes are deleted
(dst_free()) but not destroyed (dst_destroy()) and the garbage collector
as to run to finish the work.

dst_entry's may hold a refcount on device until dst_destroy() is run on
them. Unfortunately dst_gc_task() won't have a chance to run
dst_destroy() on them later in this case because it is stuck in the
"events" workqueue behind linkwatch_event() who is blocking everyone 
else in the queue.

I'm still looking at why the first dst_free() on those particular routes 
doesn't call dst_destroy() immediately but defers it (another refcount
on the route itself).

BTW, in the past (last year) dst_gc_task() was run as a timer handler 
and this situation (deadlock with linkwatch_event()) couldn't occur.

Thanks,
Benjamin

> 
> Jarek P.
> 
>
David Miller - Sept. 30, 2008, 1:58 p.m.
From: Benjamin Thery <benjamin.thery@bull.net>
Date: Tue, 30 Sep 2008 13:52:31 +0200

> BTW, in the past (last year) dst_gc_task() was run as a timer
> handler and this situation (deadlock with linkwatch_event())
> couldn't occur.

September 12th to be exact :-)  That's when Eric Dumazet's
change went in:

commit 86bba269d08f0c545ae76c90b56727f65d62d57f
Author: Eric Dumazet <dada1@cosmosbay.com>
Date:   Wed Sep 12 14:29:01 2007 +0200

    [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue


But this change went into 2.6.24, so I wonder why it's taken
so long to hit this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Thery - Sept. 30, 2008, 2:07 p.m.
David Miller wrote:
> From: Benjamin Thery <benjamin.thery@bull.net>
> Date: Tue, 30 Sep 2008 13:52:31 +0200
> 
>> BTW, in the past (last year) dst_gc_task() was run as a timer
>> handler and this situation (deadlock with linkwatch_event())
>> couldn't occur.
> 
> September 12th to be exact :-)  That's when Eric Dumazet's
> change went in:
> 
> commit 86bba269d08f0c545ae76c90b56727f65d62d57f
> Author: Eric Dumazet <dada1@cosmosbay.com>
> Date:   Wed Sep 12 14:29:01 2007 +0200
> 
>     [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue
> 
> 
> But this change went into 2.6.24, so I wonder why it's taken
> so long to hit this.

Maybe because we manage to reproduce it only with network namespaces so
far and there isn't much netns users at the moment (and netns is still
incomplete in mainline).

Benjamin

> 
>
Jarek Poplawski - Sept. 30, 2008, 2:42 p.m.
Benjamin Thery wrote, On 09/30/2008 01:52 PM:

> Jarek Poplawski wrote:
>> On 29-09-2008 19:54, Benjamin Thery wrote:
...
>>>   Problem: it will never happens if dst_gc_task() was enqueued behind
>>>   linkwatch_event() in the "events" workqueue as the queue is now 
>>>   blocked.
>> ...
>>
>> If it's really like this, I wonder if this can happen without linkwatch
>> too in a non-preemptive config?
> 
> Um, not sure I fully understand what you mean... do you mean with
> CONFIG_PREEMPT_NONE=y?

Yes, but after rethinking I see this is irrelevant. Anyway, my main
concern is that it seems similar dependency might happen between other
than linkwatch work functions or processes waiting for each other.

>> So maybe this should be fixed somewhere
>> else? According to a comment above netdev_wait_allrefs() it seems
>> references should be rather put down on an UNREGISTER event, so this
>> dst_gc_task() scheduling shouldn't bother us, I guess.
> 
> I saw this comment too. In our case, the UNREGISTER event is sent,
> notifications are dispatched correctly, some routes are deleted
> (dst_free()) but not destroyed (dst_destroy()) and the garbage collector
> as to run to finish the work.
> 
> dst_entry's may hold a refcount on device until dst_destroy() is run on
> them. Unfortunately dst_gc_task() won't have a chance to run
> dst_destroy() on them later in this case because it is stuck in the
> "events" workqueue behind linkwatch_event() who is blocking everyone 
> else in the queue.
> 
> I'm still looking at why the first dst_free() on those particular routes 
> doesn't call dst_destroy() immediately but defers it (another refcount
> on the route itself).

Yes, finding/fixing this, if possible, in this place looks like the
most consistent with the way netdev_wait_allrefs() is handling this.

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarek Poplawski - Sept. 30, 2008, 2:57 p.m.
On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote:
> Benjamin Thery wrote, On 09/30/2008 01:52 PM:
...
> > I'm still looking at why the first dst_free() on those particular routes 
> > doesn't call dst_destroy() immediately but defers it (another refcount
> > on the route itself).
> 
> Yes, finding/fixing this, if possible, in this place looks like the
> most consistent with the way netdev_wait_allrefs() is handling this.

Actually, I wonder, why we can't simply run this dst_gc_task() from
dst_dev_event() (after cancelling the work) when needed.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Thery - Sept. 30, 2008, 3:18 p.m.
Jarek Poplawski wrote:
> On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote:
>> Benjamin Thery wrote, On 09/30/2008 01:52 PM:
> ...
>>> I'm still looking at why the first dst_free() on those particular routes 
>>> doesn't call dst_destroy() immediately but defers it (another refcount
>>> on the route itself).
>> Yes, finding/fixing this, if possible, in this place looks like the
>> most consistent with the way netdev_wait_allrefs() is handling this.
> 
> Actually, I wonder, why we can't simply run this dst_gc_task() from
> dst_dev_event() (after cancelling the work) when needed.
> 

Um... I haven't thought about this. I'll have a look to see if it can
solve our issue.


Benjamin

> Jarek P.
> 
>
David Miller - Oct. 1, 2008, 9:59 a.m.
From: Benjamin Thery <benjamin.thery@bull.net>
Date: Tue, 30 Sep 2008 17:18:25 +0200

> Jarek Poplawski wrote:
> > On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote:
> >> Benjamin Thery wrote, On 09/30/2008 01:52 PM:
> > ...
> >>> I'm still looking at why the first dst_free() on those particular routes doesn't call dst_destroy() immediately but defers it (another refcount
> >>> on the route itself).
> >> Yes, finding/fixing this, if possible, in this place looks like the
> >> most consistent with the way netdev_wait_allrefs() is handling this.
> > Actually, I wonder, why we can't simply run this dst_gc_task() from
> > dst_dev_event() (after cancelling the work) when needed.
> > 
> 
> Um... I haven't thought about this. I'll have a look to see if it can
> solve our issue.

Let me know what happens, I'd like to apply some fix soon.
So just report the patch implementing the final approach you
feel the most comfortable with.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano - Oct. 1, 2008, 10:10 a.m.
David Miller wrote:
> From: Benjamin Thery <benjamin.thery@bull.net>
> Date: Tue, 30 Sep 2008 17:18:25 +0200
> 
>> Jarek Poplawski wrote:
>>> On Tue, Sep 30, 2008 at 04:42:04PM +0200, Jarek Poplawski wrote:
>>>> Benjamin Thery wrote, On 09/30/2008 01:52 PM:
>>> ...
>>>>> I'm still looking at why the first dst_free() on those particular routes doesn't call dst_destroy() immediately but defers it (another refcount
>>>>> on the route itself).
>>>> Yes, finding/fixing this, if possible, in this place looks like the
>>>> most consistent with the way netdev_wait_allrefs() is handling this.
>>> Actually, I wonder, why we can't simply run this dst_gc_task() from
>>> dst_dev_event() (after cancelling the work) when needed.
>>>
>> Um... I haven't thought about this. I'll have a look to see if it can
>> solve our issue.
> 
> Let me know what happens, I'd like to apply some fix soon.
> So just report the patch implementing the final approach you
> feel the most comfortable with.

We did the modification suggested by Jarek and that fix the problem :)
We are playing a bit with this patch to check if we didn't missed 
something. We will certainly send it in the next hours.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Oct. 1, 2008, 10:12 a.m.
From: Daniel Lezcano <dlezcano@fr.ibm.com>
Date: Wed, 01 Oct 2008 12:10:10 +0200

> We are playing a bit with this patch to check if we didn't missed
> something. We will certainly send it in the next hours.

Nice.

It is quite efficient for us to work together live like this when we
are in the same time zone, and even the same country :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman - Oct. 3, 2008, 12:41 a.m.
Benjamin Thery <benjamin.thery@bull.net> writes:

> This patch proposes to replace the rtnl_unlock() call in 
> linkwatch_event() by __rtnl_unlock(). The difference between the two
> routines being that __rtnl_unlock() will not call netdev_run_todo()
> after it unlocks rtnl_mutex.
>
> This is to fix a "deadlock" we observed when unregistering a net device.

Thanks for tracking this I just started seeing this a few days ago, and
before I reach the point of tracking it down you are in the process
of fixing it ;)

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

Index: net-next-2.6/net/core/link_watch.c
===================================================================
--- net-next-2.6.orig/net/core/link_watch.c
+++ net-next-2.6/net/core/link_watch.c
@@ -207,7 +207,7 @@  static void linkwatch_event(struct work_
 {
 	rtnl_lock();
 	__linkwatch_run_queue(time_after(linkwatch_nextevent, jiffies));
-	rtnl_unlock();
+	__rtnl_unlock();
 }