Patchwork [1/1] net: race condition when removing virtual net_device

login
register
mail settings
Submitter Francesco Ruggeri
Date Sept. 9, 2013, 11:15 p.m.
Message ID <1378768511-27866-1-git-send-email-fruggeri@aristanetworks.com>
Download mbox | patch
Permalink /patch/273725/
State RFC
Delegated to: David Miller
Headers show

Comments

Francesco Ruggeri - Sept. 9, 2013, 11:15 p.m.
There is a race condition when removing a net_device while its net namespace
is also being removed.
This can result in a crash or other unpredictable behavior.
This is a sample scenario with veth, but the same applies to other virtual
devices such as vlan and macvlan.
veth pair v0-v1 is created, with v0 in namespace ns0 and v1 in ns1.
Process p0 deletes v0. v1 is also unregistered (in veth_dellink), so when p0
gets to netdev_run_todo both v0 and v1 are in net_todo_list, and they have both
been unlisted from their respective namespaces. If all references to v1 have not
already been dropped then netdev_run_todo/netdev_wait_allrefs will call netdev
notifiers for v1, releasing the rtnl lock between calls.
Now process p1 removes namespace ns1. v1 will not prevent this from happening
(in default_device_exit_batch) since it was already unlisted by p0.
Next time p0 invokes the notifiers for v1 any notifiers that use dev_net(v1)
will get a pointer to a namespace that has been or is being removed.
Similar scenarios apply with v1 as a vlan or macvlan interface and v0 as its
real device.
We hit this problem in 3.4 with sequence fib_netdev_event, fib_disable_ip,
rt_cache_flush, rt_cache_invalidate, inetpeer_invalidate_tree. That sequence no
longer applies in later kernels, but unless we can guarantee that no
NETDEV_UNREGISTER or NETDEV_UNREGISTER_FINAL handler accesses a net_device's
dev_net(dev) then there is a vulnerability (this happens for example with
NETDEV_UNREGISTER_FINAL in dst_dev_event/dst_ifdown).
Commit 0115e8e3 later made things better by reducing the chances of this
happening, but the underlying problem still seems to be there.
I would like to get some feedback on this patch.
The idea is to take a reference to the loopback_dev of a net_device's
namespace (which is always the last net_device to be removed when a namespace
is destroyed) when the net_device is unlisted, and release it when the
net_device is disposed of.
To avoid deadlocks in cleanup_net all loopback_devs are queued at the end
of net_todo_list.

Tested on Linux 3.4.

Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
---
 net/core/dev.c |   30 +++++++++++++++++++++++++++++-
 1 files changed, 29 insertions(+), 1 deletions(-)
stephen hemminger - Sept. 10, 2013, 12:57 a.m.
On Mon,  9 Sep 2013 16:15:11 -0700
Francesco Ruggeri <fruggeri@aristanetworks.com> wrote:

> There is a race condition when removing a net_device while its net namespace
> is also being removed.
> This can result in a crash or other unpredictable behavior.
> This is a sample scenario with veth, but the same applies to other virtual
> devices such as vlan and macvlan.
> veth pair v0-v1 is created, with v0 in namespace ns0 and v1 in ns1.
> Process p0 deletes v0. v1 is also unregistered (in veth_dellink), so when p0
> gets to netdev_run_todo both v0 and v1 are in net_todo_list, and they have both
> been unlisted from their respective namespaces. If all references to v1 have not
> already been dropped then netdev_run_todo/netdev_wait_allrefs will call netdev
> notifiers for v1, releasing the rtnl lock between calls.
> Now process p1 removes namespace ns1. v1 will not prevent this from happening
> (in default_device_exit_batch) since it was already unlisted by p0.
> Next time p0 invokes the notifiers for v1 any notifiers that use dev_net(v1)
> will get a pointer to a namespace that has been or is being removed.

Namespace's should be ref counted.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Francesco Ruggeri - Sept. 10, 2013, 2:03 a.m.
On Mon, Sep 9, 2013 at 5:57 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Mon,  9 Sep 2013 16:15:11 -0700
> Francesco Ruggeri <fruggeri@aristanetworks.com> wrote:
>
> > There is a race condition when removing a net_device while its net namespace
> > is also being removed.
> > This can result in a crash or other unpredictable behavior.
> > This is a sample scenario with veth, but the same applies to other virtual
> > devices such as vlan and macvlan.
> > veth pair v0-v1 is created, with v0 in namespace ns0 and v1 in ns1.
> > Process p0 deletes v0. v1 is also unregistered (in veth_dellink), so when p0
> > gets to netdev_run_todo both v0 and v1 are in net_todo_list, and they have both
> > been unlisted from their respective namespaces. If all references to v1 have not
> > already been dropped then netdev_run_todo/netdev_wait_allrefs will call netdev
> > notifiers for v1, releasing the rtnl lock between calls.
> > Now process p1 removes namespace ns1. v1 will not prevent this from happening
> > (in default_device_exit_batch) since it was already unlisted by p0.
> > Next time p0 invokes the notifiers for v1 any notifiers that use dev_net(v1)
> > will get a pointer to a namespace that has been or is being removed.
>
> Namespace's should be ref counted.


Namespaces are refcounted, but rollback_registered_many is also
invoked when a namespace is torn down. That is triggered in put_net
exactly by the namespace refcount dropping to 0, which then triggers
__put_net, cleanup_net, ops_exit_list, default_device_exit_batch and
unregister_netdevice_many. In that case using the namespace refcount
would not prevent the namespace from being deleted, since we already
passed the point (in put_net) where the refcount would have had that
effect. It would probably also not be straightforward to start using
again the namespace refcount in code that started under the premise
that the refcount had dropped to 0.
net_devices in a namespace do not seem to take references to the
namespace that they are in, and as far as I can tell the only thing
that prevents a namespace form being removed from under its
net_devices' feet is default_device_exit or default_device_exit_batch.
But unlist_netdevice creates a window where that logic fails.
In addition to this, in general we can have several instances of
netdev_run_todo running concurrently with any number of net_devices
from any number of net namespaces, possibly cross-referencing each
other. Some of these instances may be in the context of cleanup_net
and some may not.
Taking a refcount on the loopback_dev and making sure the looback_devs
are at the tail of net_todo_list in each of these instances was the
only way I could come up with to address all scenarios, but there may
be better ways to do it.

Thanks,
Francesco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 26755dd..da2fd78 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -225,10 +225,14 @@  static void list_netdevice(struct net_device *dev)
 }
 
 /* Device list removal
+ * Take a reference to dev_net(dev)->loopback_dev, so dev_net(dev)
+ * will not be freed until we are done with dev.
  * caller must respect a RCU grace period before freeing/reusing dev
  */
 static void unlist_netdevice(struct net_device *dev)
 {
+	struct net_device *loopback_dev = dev_net(dev)->loopback_dev;
+
 	ASSERT_RTNL();
 
 	/* Unlink dev from the device chain */
@@ -238,9 +242,23 @@  static void unlist_netdevice(struct net_device *dev)
 	hlist_del_rcu(&dev->index_hlist);
 	write_unlock_bh(&dev_base_lock);
 
+	if (dev != loopback_dev)
+		dev_hold(loopback_dev);
+
 	dev_base_seq_inc(dev_net(dev));
 }
 
+/**
+ * Called when a net_device that has been previously unlisted from a net
+ * namespace is disposed of.
+ */
+static inline void unlist_netdevice_done(struct net_device *dev)
+{
+	struct net_device *loopback_dev = dev_net(dev)->loopback_dev;
+	if (dev != loopback_dev)
+		dev_put(loopback_dev);
+}
+
 /*
  *	Our notifier list
  */
@@ -5009,10 +5027,17 @@  static int dev_new_index(struct net *net)
 
 /* Delayed registration/unregisteration */
 static LIST_HEAD(net_todo_list);
+static struct list_head *first_loopback_dev = &net_todo_list;
 
 static void net_set_todo(struct net_device *dev)
 {
-	list_add_tail(&dev->todo_list, &net_todo_list);
+	/* All loopback_devs go to end of net_todo_list. */
+	if (dev == dev_net(dev)->loopback_dev) {
+		list_add_tail(&dev->todo_list, &net_todo_list);
+		if (first_loopback_dev == &net_todo_list)
+			first_loopback_dev = &dev->todo_list;
+	} else
+		list_add_tail(&dev->todo_list, first_loopback_dev);
 }
 
 static void rollback_registered_many(struct list_head *head)
@@ -5641,6 +5666,7 @@  void netdev_run_todo(void)
 
 	/* Snapshot list, allow later requests */
 	list_replace_init(&net_todo_list, &list);
+	first_loopback_dev = &net_todo_list;
 
 	__rtnl_unlock();
 
@@ -5670,6 +5696,7 @@  void netdev_run_todo(void)
 		on_each_cpu(flush_backlog, dev, 1);
 
 		netdev_wait_allrefs(dev);
+		unlist_netdevice_done(dev);
 
 		/* paranoia */
 		BUG_ON(netdev_refcnt_read(dev));
@@ -6076,6 +6103,7 @@  int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
 
 	/* Actually switch the network namespace */
+	unlist_netdevice_done(dev);
 	dev_net_set(dev, net);
 
 	/* If there is an ifindex conflict assign a new one */