net: deadlock during net device unregistration

Submitter Herbert Xu
Date Oct. 5, 2008, 4:26 a.m.
Message ID <E1KmLCH-0003E2-Ch@gondolin.me.apana.org.au>
Permalink /patch/2730/
State Accepted
Delegated to: David Miller

Comments

Herbert Xu - Oct. 5, 2008, 4:26 a.m.
Benjamin Thery <benjamin.thery@bull.net> wrote:
>
> 1. Unregister a device, the following routines are called:
> 
> -> unregister_netdev
>  -> rtnl_lock
>  -> unregister_netdevice
>  -> rtnl_unlock
>    -> netdev_run_todo
>      -> netdev_wait_allrefs

OK, this explains lots of dead-locks that people have been seeing.

But I think we can go a step further:

net: Fix netdev_run_todo dead-lock

Benjamin Thery tracked down a bug that explains many instances
of the error

unregister_netdevice: waiting for %s to become free. Usage count = %d

It turns out that netdev_run_todo can dead-lock with itself if
a second instance of it is run in a thread that will then free
a reference to the device waited on by the first instance.

The problem is really quite silly.  We were trying to create
parallelism where none was required.  As netdev_run_todo always
follows a RTNL section, and that todo tasks can only be added
with the RTNL held, by definition you should only need to wait
for the very ones that you've added and be done with it.

There is no need for a second mutex or spinlock.

This is exactly what the following patch does.
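
A minimal userspace sketch of that locking scheme, not the kernel code: a pthread mutex stands in for the RTNL, and fake_dev, todo_add(), unlock_and_run_todo() and demo_unregister_two() are hypothetical names made up for illustration. The todo list is only touched while the big lock is held, so the runner can snapshot it, drop the lock, and then work on its private copy without any second lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only what the sketch needs. */
struct fake_dev {
	struct fake_dev *next;	/* link on the todo list */
	int finished;		/* set once the deferred work has run */
};

/* Plays the role of the RTNL in this sketch. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* FIFO todo list; only ever touched with big_lock held. */
static struct fake_dev *todo_head;
static struct fake_dev **todo_tail = &todo_head;

/* Caller must hold big_lock, as net_set_todo() runs under the RTNL. */
static void todo_add(struct fake_dev *dev)
{
	dev->next = NULL;
	*todo_tail = dev;
	todo_tail = &dev->next;
}

/*
 * Called with big_lock held; drops it before doing the slow work.
 * Returns the number of entries processed.
 */
static int unlock_and_run_todo(void)
{
	struct fake_dev *list = todo_head;	/* snapshot under the lock */
	int n = 0;

	todo_head = NULL;
	todo_tail = &todo_head;
	pthread_mutex_unlock(&big_lock);

	/* The private list holds exactly what this caller queued. */
	while (list) {
		struct fake_dev *dev = list;

		list = dev->next;
		dev->finished = 1;	/* the real code waits for refs here */
		n++;
	}
	return n;
}

/* Queue two devices in one locked section, then run the todo work. */
int demo_unregister_two(void)
{
	struct fake_dev a = { 0 }, b = { 0 };
	int done;

	pthread_mutex_lock(&big_lock);
	todo_add(&a);
	todo_add(&b);
	done = unlock_and_run_todo();

	assert(a.finished && b.finished);
	assert(todo_head == NULL);
	return done;
}
```

Because entries can only be queued under the big lock and are taken off the shared list before that lock is released, a second runner never sees another caller's entries and has nothing to serialize against.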

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
Jarek Poplawski - Oct. 5, 2008, 6:55 a.m.
Herbert Xu wrote, On 10/05/2008 06:26 AM:
...
> The problem is really quite silly.  We were trying to create
> parallelism where none was required.  As netdev_run_todo always
> follows a RTNL section, and that todo tasks can only be added
> with the RTNL held, by definition you should only need to wait
> for the very ones that you've added and be done with it.
> 
> There is no need for a second mutex or spinlock.

Is it for sure? (Read below.)

> 
> This is exactly what the following patch does.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e8eb2b4..021f531 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3808,14 +3808,11 @@ static int dev_new_index(struct net *net)
>  }
>  
>  /* Delayed registration/unregisteration */
> -static DEFINE_SPINLOCK(net_todo_list_lock);
>  static LIST_HEAD(net_todo_list);
>  
>  static void net_set_todo(struct net_device *dev)
>  {
> -	spin_lock(&net_todo_list_lock);
>  	list_add_tail(&dev->todo_list, &net_todo_list);
> -	spin_unlock(&net_todo_list_lock);
>  }
>  
>  static void rollback_registered(struct net_device *dev)
> @@ -4142,33 +4139,24 @@ static void netdev_wait_allrefs(struct net_device *dev)
>   *	free_netdev(y1);
>   *	free_netdev(y2);
>   *
> - * We are invoked by rtnl_unlock() after it drops the semaphore.
> + * We are invoked by rtnl_unlock().
>   * This allows us to deal with problems:
>   * 1) We can delete sysfs objects which invoke hotplug
>   *    without deadlocking with linkwatch via keventd.
>   * 2) Since we run with the RTNL semaphore not held, we can sleep
>   *    safely in order to wait for the netdev refcnt to drop to zero.
> + *
> + * We must not return until all unregister events added during
> + * the interval the lock was held have been completed.
>   */
> -static DEFINE_MUTEX(net_todo_run_mutex);
>  void netdev_run_todo(void)
>  {
>  	struct list_head list;
>  
> -	/* Need to guard against multiple cpu's getting out of order. */
> -	mutex_lock(&net_todo_run_mutex);
> -
> -	/* Not safe to do outside the semaphore.  We must not return
> -	 * until all unregister events invoked by the local processor
> -	 * have been completed (either by this todo run, or one on
> -	 * another cpu).
> -	 */

I think it's about not letting others run this for devices unregistered
under later rtnl_locks before the previous tasks have completed. So it
would be nice to have a comment explaining why that's no longer necessary.

Cheers,
Jarek P.

> -	if (list_empty(&net_todo_list))
> -		goto out;
> -
>  	/* Snapshot list, allow later requests */
> -	spin_lock(&net_todo_list_lock);
>  	list_replace_init(&net_todo_list, &list);
> -	spin_unlock(&net_todo_list_lock);
> +
> +	__rtnl_unlock();
>  
>  	while (!list_empty(&list)) {
>  		struct net_device *dev
> @@ -4200,9 +4188,6 @@ void netdev_run_todo(void)
>  		/* Free network device */
>  		kobject_put(&dev->dev.kobj);
>  	}
> -
> -out:
> -	mutex_unlock(&net_todo_run_mutex);
>  }
>  
>  static struct net_device_stats *internal_stats(struct net_device *dev)
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 71edb8b..d6381c2 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -73,7 +73,7 @@ void __rtnl_unlock(void)
>  
>  void rtnl_unlock(void)
>  {
> -	mutex_unlock(&rtnl_mutex);
> +	/* This fellow will unlock it for us. */
>  	netdev_run_todo();
>  }
>  
> Cheers,





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Oct. 5, 2008, 6:56 a.m.
On Sun, Oct 05, 2008 at 08:55:10AM +0200, Jarek Poplawski wrote:
>
> > -	/* Not safe to do outside the semaphore.  We must not return
> > -	 * until all unregister events invoked by the local processor
> > -	 * have been completed (either by this todo run, or one on
> > -	 * another cpu).
> > -	 */
> 
> I think, it's about not to let others run this for devices unregistered
> within later rtnl_locks before completing previous tasks. So, it would
> be nice to have some comment why it's not necessary anymore.

Where did you get that idea?

This was there because people did (and still do) stuff like:

unregister_netdev(dev);
free_netdev(dev);
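
A toy single-threaded model of why that caller pattern forces the unregister path to finish its waiting before it returns. All names here (fake_dev, holder_step(), fake_wait_allrefs(), fake_unregister(), demo_unregister_then_free()) are hypothetical, and holder_step() merely stands in for some other context dropping its reference; in the kernel the waiter sleeps instead of calling anything like it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_device. */
struct fake_dev {
	int refcnt;	/* references still held by other users */
	int polls;	/* how many times the waiter had to re-check */
};

/* Stand-in for "another context eventually drops its reference". */
static void holder_step(struct fake_dev *dev)
{
	if (dev->refcnt > 0)
		dev->refcnt--;
}

/* Models the waiting step: do not return until refcnt reaches zero. */
static void fake_wait_allrefs(struct fake_dev *dev)
{
	while (dev->refcnt != 0) {
		dev->polls++;
		holder_step(dev);	/* the real waiter would sleep here */
	}
}

/* Models unregister_netdev(): returns only once nobody uses dev. */
static void fake_unregister(struct fake_dev *dev)
{
	fake_wait_allrefs(dev);
}

/* The caller pattern from the mail: unregister, then free at once. */
int demo_unregister_then_free(int initial_refs)
{
	struct fake_dev *dev = calloc(1, sizeof(*dev));
	int polls;

	if (!dev)
		return -1;
	dev->refcnt = initial_refs;

	fake_unregister(dev);
	assert(dev->refcnt == 0);	/* safe to free: no users remain */

	polls = dev->polls;
	free(dev);
	return polls;
}
```

If fake_unregister() returned while refcnt was still nonzero, the free() that immediately follows would pull the memory out from under the remaining holders.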

Cheers,
Jarek Poplawski - Oct. 5, 2008, 7:12 a.m.
On Sun, Oct 05, 2008 at 02:56:48PM +0800, Herbert Xu wrote:
> On Sun, Oct 05, 2008 at 08:55:10AM +0200, Jarek Poplawski wrote:
> >
> > > -	/* Not safe to do outside the semaphore.  We must not return
> > > -	 * until all unregister events invoked by the local processor
> > > -	 * have been completed (either by this todo run, or one on
> > > -	 * another cpu).
> > > -	 */
> > 
> > I think, it's about not to let others run this for devices unregistered
> > within later rtnl_locks before completing previous tasks. So, it would
> > be nice to have some comment why it's not necessary anymore.
> 
> Where did you get that idea?

Just reading this code (plus the comment). Why would anybody bother 
with something so complex like this if something like your idea is
rather straightforward? But, needed or not, my point is it would be
nice to comment that this patch changes this behavior btw.

Jarek P.
stephen hemminger - Oct. 5, 2008, 7:28 a.m.
On Sun, 5 Oct 2008 09:12:38 +0200
Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Sun, Oct 05, 2008 at 02:56:48PM +0800, Herbert Xu wrote:
> > On Sun, Oct 05, 2008 at 08:55:10AM +0200, Jarek Poplawski wrote:
> > >
> > > > -	/* Not safe to do outside the semaphore.  We must not return
> > > > -	 * until all unregister events invoked by the local processor
> > > > -	 * have been completed (either by this todo run, or one on
> > > > -	 * another cpu).
> > > > -	 */
> > > 
> > > I think, it's about not to let others run this for devices unregistered
> > > within later rtnl_locks before completing previous tasks. So, it would
> > > be nice to have some comment why it's not necessary anymore.
> > 
> > Where did you get that idea?
> 
> Just reading this code (plus the comment). Why would anybody bother 
> with something so complex like this if something like your idea is
> rather straightforward? But, needed or not, my point is it would be
> nice to comment that this patch changes this behavior btw.

I think there were issues with unregister triggering hotplug udev
events, but that may have been long ago when rtnl_lock was
a semaphore not a mutex.
Herbert Xu - Oct. 5, 2008, 7:38 a.m.
On Sun, Oct 05, 2008 at 09:28:23AM +0200, Stephen Hemminger wrote:

> I think there were issues with unregister triggering hotplug udev
> events, but that may have been long ago when rtnl_lock was
> a semaphore not a mutex.

Well as it is we have no hotplug events in netdev_run_todo at
all.  They're all in unregister_netdevice which is where they
should be and that runs under the RTNL.

In fact it would be a bug if we had anything in netdev_run_todo
tied to sysfs because once we get here we've relinquished our
ownership of the device name so someone else is free to readd
a device with the same name.

Cheers,
Herbert Xu - Oct. 5, 2008, 7:39 a.m.
On Sun, Oct 05, 2008 at 09:12:38AM +0200, Jarek Poplawski wrote:
>
> Just reading this code (plus the comment). Why would anybody bother 
> with something so complex like this if something like your idea is
> rather straightforward? But, needed or not, my point is it would be
> nice to comment that this patch changes this behavior btw.

I'm sorry but I don't have time to consider such hypotheticals.
Benjamin Thery - Oct. 6, 2008, 3:19 p.m.
Herbert Xu wrote:
> Benjamin Thery <benjamin.thery@bull.net> wrote:
>> 1. Unregister a device, the following routines are called:
>>
>> -> unregister_netdev
>>  -> rtnl_lock
>>  -> unregister_netdevice
>>  -> rtnl_unlock
>>    -> netdev_run_todo
>>      -> netdev_wait_allrefs
> 
> OK, this explains lots of dead-locks that people have been seeing.
> 
> But I think we can go a step further:
> 
> net: Fix netdev_run_todo dead-lock
> 
> Benjamin Thery tracked down a bug that explains many instances
> of the error
> 
> unregister_netdevice: waiting for %s to become free. Usage count = %d
> 
> It turns out that netdev_run_todo can dead-lock with itself if
> a second instance of it is run in a thread that will then free
> a reference to the device waited on by the first instance.
> 
> The problem is really quite silly.  We were trying to create
> parallelism where none was required.  As netdev_run_todo always
> follows a RTNL section, and that todo tasks can only be added
> with the RTNL held, by definition you should only need to wait
> for the very ones that you've added and be done with it.
> 
> There is no need for a second mutex or spinlock.
> 
> This is exactly what the following patch does.

Herbert, thank you for having looked at the issue too.

When I understood how the deadlock happened, I considered playing
with the locks in net_set_todo()/netdev_run_todo(), but as some comments
in the code in this area sound a bit too cryptic for my brain, I didn't
dare to change them myself. :)

I guess you know a lot better than me the history behind this piece of
code and why it was done this way.

I tested your patch on my testbed all afternoon.
It fixes the deadlock I can easily reproduce here with net namespaces,
and I didn't see any regressions on my setup.

Benjamin

David Miller - Oct. 7, 2008, 10:46 p.m.
From: Benjamin Thery <benjamin.thery@bull.net>
Date: Mon, 06 Oct 2008 17:19:13 +0200

> I guess you know a lot better than me the history behind this piece of
> code and why it was done this way.

Or maybe Herbert doesn't know :-)

I wrote this crap, and I can't remember why I decided to drop
the RTNL semaphore so early and go directly to that todo lock.
Maybe it was purely because of how the interfaces are arranged
and I considered it ugly to drop the RTNL semaphore in the
todo list running code.   Who knows...

I'm going to scrutinize Herbert's patch some more, but this is
really a borderline change to put into 2.6.27 so late in the
RC cycle.

I want to fix this bug, but... it's risky.
David Miller - Oct. 7, 2008, 10:50 p.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 05 Oct 2008 12:26:21 +0800

> net: Fix netdev_run_todo dead-lock
> 
> Benjamin Thery tracked down a bug that explains many instances
> of the error
> 
> unregister_netdevice: waiting for %s to become free. Usage count = %d
> 
> It turns out that netdev_run_todo can dead-lock with itself if
> a second instance of it is run in a thread that will then free
> a reference to the device waited on by the first instance.
> 
> The problem is really quite silly.  We were trying to create
> parallelism where none was required.  As netdev_run_todo always
> follows a RTNL section, and that todo tasks can only be added
> with the RTNL held, by definition you should only need to wait
> for the very ones that you've added and be done with it.
> 
> There is no need for a second mutex or spinlock.
> 
> This is exactly what the following patch does.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Ok, this looks safe.  I've applied this to net-2.6, thanks!

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index e8eb2b4..021f531 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3808,14 +3808,11 @@  static int dev_new_index(struct net *net)
 }
 
 /* Delayed registration/unregisteration */
-static DEFINE_SPINLOCK(net_todo_list_lock);
 static LIST_HEAD(net_todo_list);
 
 static void net_set_todo(struct net_device *dev)
 {
-	spin_lock(&net_todo_list_lock);
 	list_add_tail(&dev->todo_list, &net_todo_list);
-	spin_unlock(&net_todo_list_lock);
 }
 
 static void rollback_registered(struct net_device *dev)
@@ -4142,33 +4139,24 @@  static void netdev_wait_allrefs(struct net_device *dev)
  *	free_netdev(y1);
  *	free_netdev(y2);
  *
- * We are invoked by rtnl_unlock() after it drops the semaphore.
+ * We are invoked by rtnl_unlock().
  * This allows us to deal with problems:
  * 1) We can delete sysfs objects which invoke hotplug
  *    without deadlocking with linkwatch via keventd.
  * 2) Since we run with the RTNL semaphore not held, we can sleep
  *    safely in order to wait for the netdev refcnt to drop to zero.
+ *
+ * We must not return until all unregister events added during
+ * the interval the lock was held have been completed.
  */
-static DEFINE_MUTEX(net_todo_run_mutex);
 void netdev_run_todo(void)
 {
 	struct list_head list;
 
-	/* Need to guard against multiple cpu's getting out of order. */
-	mutex_lock(&net_todo_run_mutex);
-
-	/* Not safe to do outside the semaphore.  We must not return
-	 * until all unregister events invoked by the local processor
-	 * have been completed (either by this todo run, or one on
-	 * another cpu).
-	 */
-	if (list_empty(&net_todo_list))
-		goto out;
-
 	/* Snapshot list, allow later requests */
-	spin_lock(&net_todo_list_lock);
 	list_replace_init(&net_todo_list, &list);
-	spin_unlock(&net_todo_list_lock);
+
+	__rtnl_unlock();
 
 	while (!list_empty(&list)) {
 		struct net_device *dev
@@ -4200,9 +4188,6 @@  void netdev_run_todo(void)
 		/* Free network device */
 		kobject_put(&dev->dev.kobj);
 	}
-
-out:
-	mutex_unlock(&net_todo_run_mutex);
 }
 
 static struct net_device_stats *internal_stats(struct net_device *dev)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 71edb8b..d6381c2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -73,7 +73,7 @@  void __rtnl_unlock(void)
 
 void rtnl_unlock(void)
 {
-	mutex_unlock(&rtnl_mutex);
+	/* This fellow will unlock it for us. */
 	netdev_run_todo();
 }