diff mbox

[7/8] net: Allow setting the network namespace by fd

Message ID m1hbhgq1v1.fsf@fess.ebiederm.org
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric W. Biederman Sept. 23, 2010, 8:51 a.m. UTC
Take advantage of the new abstraction and allow network devices
to be placed in any network namespace that we have a fd to talk
about.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/if_link.h     |    1 +
 include/net/net_namespace.h |    1 +
 net/core/net_namespace.c    |   26 ++++++++++++++++++++++++++
 net/core/rtnetlink.c        |    4 +++-
 4 files changed, 31 insertions(+), 1 deletions(-)

Comments

jamal Sept. 23, 2010, 11:22 a.m. UTC | #1
On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
> 

So ... why just netdevice? could you allow migration of other
net "items" eg a route table since they are all tagged by
netns?

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Haley Sept. 23, 2010, 2:22 p.m. UTC | #2
On 09/23/2010 04:51 AM, Eric W. Biederman wrote:
> 
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
> 
...
> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	file = NULL;

No need to initialize this.

> +	net = ERR_PTR(-EINVAL);

or this?

> +	file = proc_ns_fget(fd);
> +	if (!fd)
> +		goto out;
> +		return ERR_PTR(-EINVAL);

Shouldn't this be:

	if (!file)

And the "goto" seems wrong, especially without a {} here.  Unless you
meant to keep the "goto" and branch below?

-Brian

> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops != &netns_operations)
> +		goto out;
> +
> +	net = get_net(ei->ns);
> +out:
> +	if (file)
> +		fput(file);
> +	return net;
> +}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Lamparter Sept. 23, 2010, 2:58 p.m. UTC | #3
On Thu, Sep 23, 2010 at 07:22:06AM -0400, jamal wrote:
> On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
> > Take advantage of the new abstraction and allow network devices
> > to be placed in any network namespace that we have a fd to talk
> > about.
> 
> So ... why just netdevice? could you allow migration of other
> net "items" eg a route table since they are all tagged by
> netns?

migrating route table entries makes no sense because
a) they refer to devices and configuration that does not exist in the
   target namespace; they only make sense within their netns context
b) they are purely virtual and you get the same result from deleting and
   recreating them.

Network devices are special because they may have something attached to
them, be it hardware or some daemon.


-David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Sept. 23, 2010, 3:14 p.m. UTC | #4
jamal <hadi@cyberus.ca> writes:

> On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
>> Take advantage of the new abstraction and allow network devices
>> to be placed in any network namespace that we have a fd to talk
>> about.
>> 
>
> So ... why just netdevice? could you allow migration of other
> net "items" eg a route table since they are all tagged by
> netns?

For this patchset because we only support migrating physical
network devices between network namespaces today.

In the bigger picture migrating things between network namespaces is
race prone.  Fixing those races probably would reduce network stack
performance and increase code complexity for not particularly good
reason.  Network devices are special because they are physical hardware
and in combination with the rule that all packets coming a network
device go to a single network namespace we have to implement migration
for network devices.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Sept. 23, 2010, 4:16 p.m. UTC | #5
Brian Haley <brian.haley@hp.com> writes:

> On 09/23/2010 04:51 AM, Eric W. Biederman wrote:
>> 
>> Take advantage of the new abstraction and allow network devices
>> to be placed in any network namespace that we have a fd to talk
>> about.
>> 
> ...
>> +struct net *get_net_ns_by_fd(int fd)
>> +{
>> +	struct proc_inode *ei;
>> +	struct file *file;
>> +	struct net *net;
>> +
>> +	file = NULL;
>
> No need to initialize this.
>
>> +	net = ERR_PTR(-EINVAL);
>
> or this?
>
>> +	file = proc_ns_fget(fd);
>> +	if (!fd)
>> +		goto out;
>> +		return ERR_PTR(-EINVAL);
>
> Shouldn't this be:
>
> 	if (!file)
>
> And the "goto" seems wrong, especially without a {} here.  Unless you
> meant to keep the "goto" and branch below?

I think I changed my mind half way through writing the code and never
did anything about it.  Oops.

Thanks fixed.  It is now:

struct net *get_net_ns_by_fd(int fd)
{
	struct proc_inode *ei;
	struct file *file;
	struct net *net;

	net = ERR_PTR(-EINVAL);
	file = proc_ns_fget(fd);
	if (!file)
		goto out;

	ei = PROC_I(file->f_dentry->d_inode);
	if (ei->ns_ops != &netns_operations)
		goto out;

	net = get_net(ei->ns);
out:
	if (file)
		fput(file);
	return net;
}

Which at least makes sense.  Now to test it to double check it does what
it should do.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal Sept. 24, 2010, 11:51 a.m. UTC | #6
On Thu, 2010-09-23 at 16:58 +0200, David Lamparter wrote:

> migrating route table entries makes no sense because
> a) they refer to devices and configuration that does not exist in the
>    target namespace; they only make sense within their netns context
> b) they are purely virtual and you get the same result from deleting and
>    recreating them.
> 
> Network devices are special because they may have something attached to
> them, be it hardware or some daemon.

Routes functionally reside on top of netdevices, point to nexthop
neighbors across these netdevices etc. Underlying assumption is you take
care of that dependency when migrating.
We are talking about FIB entries here not the route cache; moving a few
pointers within the kernel is a hell lot faster than recreating a subset
of BGP entries from user space. 

Eric, I didnt follow the exposed-races arguement: Why would it involve
more than just some basic locking only while you change the struct net
pointer to the new namespace for these sub-subsystems?

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Lamparter Sept. 24, 2010, 12:57 p.m. UTC | #7
On Fri, Sep 24, 2010 at 07:51:24AM -0400, jamal wrote:
> > migrating route table entries makes no sense because
> > a) they refer to devices and configuration that does not exist in the
> >    target namespace; they only make sense within their netns context
> > b) they are purely virtual and you get the same result from deleting and
> >    recreating them.
> > 
> > Network devices are special because they may have something attached to
> > them, be it hardware or some daemon.
> 
> Routes functionally reside on top of netdevices, point to nexthop
> neighbors across these netdevices etc. Underlying assumption is you take
> care of that dependency when migrating.
> We are talking about FIB entries here not the route cache; moving a few
> pointers within the kernel is a hell lot faster than recreating a subset
> of BGP entries from user space. 

No. While you sure could associate routes with devices, they don't
*functionally* reside on top of network devices. They reside on top of
the entire IP configuration, and in case of BGP they even reside on top
of your set of peerings and their data.

Even if you could "move" routes together with a network device, the
result would be utter nonsense. The routes depend on your BGP view, and
if your set of interfaces (and peers) changes, your routes will change.
Your bgpd will, either way, need to set up new peerings and redo best
path evaluations.

(On an unrelated note, how often are you planning to move stuff between
namespaces? I don't expect to be moving stuff except on configuration
events...)


-David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal Sept. 24, 2010, 1:32 p.m. UTC | #8
On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:

> No. While you sure could associate routes with devices, they don't
> *functionally* reside on top of network devices. They reside on top of
> the entire IP configuration, 

I think i am not clearly making my point. There are data dependencies;
If you were to move routes, youd need everything that routes depend on.
IOW, if i was to draw a functional graph, routes would appear on top
of netdevs (I dont care what other functional blocks you put in between
or sideways to them).

> and in case of BGP they even reside on top
> of your set of peerings and their data.
> Even if you could "move" routes together with a network device, the
> result would be utter nonsense. 

You could argue that moving a netdevice where some of its fundamental
properties such as an ifindex change is utter nonsense. But you can
work around it.

> The routes depend on your BGP view, and
> if your set of interfaces (and peers) changes, your routes will change.
> Your bgpd will, either way, need to set up new peerings and redo best
> path evaluations.

Worst case scenario, yes. I am beginning to get a feeling we are trying 
to achieve different goals maybe? Why are you even migrating netdevs?

> (On an unrelated note, how often are you planning to move stuff between
> namespaces? I don't expect to be moving stuff except on configuration
> events...)

Triggering on config events is useful and it is likely the only
possibility if you assumed the other namespace is remote. But if could
send a single command to migrate several things in the kernel (in my
case to recover state to a different ns), then that is much simpler and
uses the least resources (memory, cpu, bandwidth). I admit it is very
hard to do in most cases where the underlying dependencies are evolving
and synchronizing via user space is the best approach. The example
of route table i pointed to is simple.
Besides that: dynamic state created in the kernel that doesnt have to be
recreated by the next arriving 100K packets helps to improve recovery.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Sept. 24, 2010, 1:46 p.m. UTC | #9
On 09/23/2010 10:51 AM, Eric W. Biederman wrote:
>
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
>
> Signed-off-by: Eric W. Biederman<ebiederm@xmission.com>
> ---

[ ... ]

> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	file = NULL;
> +	net = ERR_PTR(-EINVAL);
> +	file = proc_ns_fget(fd);
> +	if (!fd)
> +		goto out;
> +		return ERR_PTR(-EINVAL);
> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops !=&netns_operations)
> +		goto out;

Is this check necessary here ? proc_ns_fget checks "file->f_op != 
&ns_file_operations", no ?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Lamparter Sept. 24, 2010, 2:09 p.m. UTC | #10
On Fri, Sep 24, 2010 at 09:32:53AM -0400, jamal wrote:
> On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:
> > No. While you sure could associate routes with devices, they don't
> > *functionally* reside on top of network devices. They reside on top of
> > the entire IP configuration, 
> 
> I think i am not clearly making my point. There are data dependencies;
> If you were to move routes, youd need everything that routes depend on.
> IOW, if i was to draw a functional graph, routes would appear on top
> of netdevs (I dont care what other functional blocks you put in between
> or sideways to them).

I understood your point. What I'm saying is that that functional graph
you're describing is too simplistic do be a workable model. Your graph
allows for what you're trying to do, yes. But your graph is not modeling
the reality.

> > The routes depend on your BGP view, and
> > if your set of interfaces (and peers) changes, your routes will change.
> > Your bgpd will, either way, need to set up new peerings and redo best
> > path evaluations.
> 
> Worst case scenario, yes. I am beginning to get a feeling we are trying 
> to achieve different goals maybe? Why are you even migrating netdevs?

Err... I'm migrating netdevs to assign them to namespaces to allow them
to use them? Setup, basically. Either way a device move only happens as
result of some administrative action; be it creating a new namespace or
changing the physical/logical network setup.

> > (On an unrelated note, how often are you planning to move stuff between
> > namespaces? I don't expect to be moving stuff except on configuration
> > events...)
> 
> Triggering on config events is useful and it is likely the only
> possibility if you assumed the other namespace is remote.

wtf is a "remote" namespace?

>                                                           But if could
> send a single command to migrate several things in the kernel (in my
> case to recover state to a different ns), then that is much simpler and
> uses the least resources (memory, cpu, bandwidth). I admit it is very
> hard to do in most cases where the underlying dependencies are evolving
> and synchronizing via user space is the best approach. The example
> of route table i pointed to is simple.
> Besides that: dynamic state created in the kernel that doesnt have to be
> recreated by the next arriving 100K packets helps to improve recovery.

Can you please describe your application that requires moving possibly
several network devices together with "their" routes to a different
namespace?


-David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
jamal Sept. 24, 2010, 2:16 p.m. UTC | #11
On Fri, 2010-09-24 at 16:09 +0200, David Lamparter wrote:

> I understood your point. What I'm saying is that that functional graph
> you're describing is too simplistic do be a workable model. Your graph
> allows for what you're trying to do, yes. But your graph is not modeling
> the reality.

How about we put this specific point to rest by agreeing to
disagree? ;->

> Err... I'm migrating netdevs to assign them to namespaces to allow them
> to use them? Setup, basically. Either way a device move only happens as
> result of some administrative action; be it creating a new namespace or
> changing the physical/logical network setup.
> 

Ok, different need. You have a much more basic requirement than i do.

> wtf is a "remote" namespace?
> 

A namespace that is remotely located on another machine/hardware ;->

> Can you please describe your application that requires moving possibly
> several network devices together with "their" routes to a different
> namespace?

scaling and availability are the driving requirements.

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 2fc66dd..ae73d5e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -116,6 +116,7 @@  enum {
 	IFLA_STATS64,
 	IFLA_VF_PORTS,
 	IFLA_PORT_SELF,
+	IFLA_NET_NS_FD,
 	__IFLA_MAX
 };
 
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index bd10a79..68672ce 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -114,6 +114,7 @@  static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
 extern struct list_head net_namespace_list;
 
 extern struct net *get_net_ns_by_pid(pid_t pid);
+extern struct net *get_net_ns_by_fd(int pid);
 
 #ifdef CONFIG_NET_NS
 extern void __put_net(struct net *net);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 581a088..a9b54a7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -8,6 +8,8 @@ 
 #include <linux/idr.h>
 #include <linux/rculist.h>
 #include <linux/nsproxy.h>
+#include <linux/proc_fs.h>
+#include <linux/file.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -341,6 +343,30 @@  struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+struct net *get_net_ns_by_fd(int fd)
+{
+	struct proc_inode *ei;
+	struct file *file;
+	struct net *net;
+
+	file = NULL;
+	net = ERR_PTR(-EINVAL);
+	file = proc_ns_fget(fd);
+	if (!fd)
+		goto out;
+		return ERR_PTR(-EINVAL);
+
+	ei = PROC_I(file->f_dentry->d_inode);
+	if (ei->ns_ops != &netns_operations)
+		goto out;
+
+	net = get_net(ei->ns);
+out:
+	if (file)
+		fput(file);
+	return net;
+}
+
 static int __init net_ns_init(void)
 {
 	struct net_generic *ng;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f78d821..771d8be 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1003,6 +1003,8 @@  struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[])
 	 */
 	if (tb[IFLA_NET_NS_PID])
 		net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
+	else if (tb[IFLA_NET_NS_FD])
+		net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
 	else
 		net = get_net(src_net);
 	return net;
@@ -1077,7 +1079,7 @@  static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 	int send_addr_notify = 0;
 	int err;
 
-	if (tb[IFLA_NET_NS_PID]) {
+	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]) {
 		struct net *net = rtnl_link_get_net(dev_net(dev), tb);
 		if (IS_ERR(net)) {
 			err = PTR_ERR(net);