diff mbox

netns: add /proc/*/net/id symlink

Message ID 20110521093936.GA3015@p183
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Alexey Dobriyan May 21, 2011, 9:39 a.m. UTC
David Lamparter pointed some real scenarios where knowing
if two processes live in same netns is important,
like "how do I kill _all_ processes in netns to shutdown it".

Currently only kernel knows if two netns are the same.
Userspace maybe can look at different proc files to find a match
indirectly sysconf-style but result will be ugly no matter what.

Add /proc/*/net/id symlink which "points" to an integer.

	$ readlink /proc/net/id
	0

	$ readlink /proc/2941/net/id
	1

"id" is not a file because 1 syscall is faster than 3 syscalls.

The only rules and expectations for userspace are:
[as if they will comply, ha-ha]

* init_net always has id 0
* two netns do not have same id
* id is unsigned integer

Kernel code continues to use net_eq(), there is no need
to compare net->id inside kernel, because it is slower than net_eq().

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 fs/proc/generic.c           |   16 +++++++++++++
 fs/proc/proc_net.c          |   31 ++++++++++++++++++++++++-
 include/linux/proc_fs.h     |    7 +++++
 include/net/net_namespace.h |   10 ++++++++
 net/core/net_namespace.c    |   54 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 117 insertions(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric W. Biederman May 21, 2011, 3:39 p.m. UTC | #1
Alexey Dobriyan <adobriyan@gmail.com> writes:

> David Lamparter pointed some real scenarios where knowing
> if two processes live in same netns is important,
> like "how do I kill _all_ processes in netns to shutdown it".

Currently today the way I do this is md5sum /proc/<pid>/mounts.

That works because it is usually necessary to have a separate mount
namespace with a separate set of mounts to accommodate sysfs.

> Currently only kernel knows if two netns are the same.
> Userspace maybe can look at different proc files to find a match
> indirectly sysconf-style but result will be ugly no matter what.

Somewhat. 

Right now today without patches if we limit ourselves to the network
namespace there is a pretty valid way to do this.

stat /proc/<pid>/net/dev and compare the inode numbers.

Or any other file in /proc/*/net/.  The inode numbers are the
same if you are in the same network namespace.

> Add /proc/*/net/id symlink which "points" to an integer.
>
> 	$ readlink /proc/net/id
> 	0
>
> 	$ readlink /proc/2941/net/id
> 	1
>
> "id" is not a file because 1 syscall is faster than 3 syscalls.
>
> The only rules and expectations for userspace are:
> [as if they will comply, ha-ha]
>
> * init_net always has id 0
> * two netns do not have same id
> * id is unsigned integer

I don't like this patch because we already have a proc interface
that already solves this in production kernels today.

- stat is a single syscall
- two netns do not have the same id
- id is an ino_t.

Now it probably needs to be better documented that /proc/*/net/*
have the same inode number if the network namespace is the
same, as everyone including myself overlooked this very handy
existing property.




Writing this it occurs to me there is a misfeature in my pending
namespace file descriptor code.  Right now /proc/<pid>/ns/net
has a floating inode number and it would be good if I could make
that a inode number be the same for every file that refers to
the same network namespace. Ugh.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey Dobriyan May 21, 2011, 10:30 p.m. UTC | #2
On Sat, May 21, 2011 at 08:39:37AM -0700, Eric W. Biederman wrote:
> Alexey Dobriyan <adobriyan@gmail.com> writes:
> > * init_net always has id 0
> > * two netns do not have same id
> > * id is unsigned integer
> 
> I don't like this patch because we already have a proc interface
> that already solves this in production kernels today.
> 
> - stat is a single syscall
> - two netns do not have the same id
> - id is an ino_t.

Yeah, stat /proc/*/net/dev works.
If you document this, it means we can't change the way ->low_ino is set.
And we can't do other things inside irregular part of procfs.

But can we add clean interface once in a while.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 22, 2011, 12:15 a.m. UTC | #3
Adding the containers list.

Alexey Dobriyan <adobriyan@gmail.com> writes:

> On Sat, May 21, 2011 at 08:39:37AM -0700, Eric W. Biederman wrote:
>> Alexey Dobriyan <adobriyan@gmail.com> writes:
>> > * init_net always has id 0
>> > * two netns do not have same id
>> > * id is unsigned integer
>> 
>> I don't like this patch because we already have a proc interface
>> that already solves this in production kernels today.
>> 
>> - stat is a single syscall
>> - two netns do not have the same id
>> - id is an ino_t.
>
> Yeah, stat /proc/*/net/dev works.
> If you document this, it means we can't change the way ->low_ino is set.
> And we can't do other things inside irregular part of procfs.

Maybe.  Certainly there are things that would suggest we need some
fixes to this part of procfs.

> But can we add clean interface once in a while.

I am all for making a clean solution.  I don't see a proc file
in in /proc/net that provides a small integer as particularly clean.

It has the classic problem of what namespace are namespaces named in.
It only solves the problem for the network namespace.

So on that level I really like the idea of inode numbers in proc
being the place where we have a name.  People generally don't get
confused about inode numbers understanding they are an implementation
detail but they do understand that inode numbers plus filesystem
information can be used to compare files for identity.

So let's skip the fact that /proc/*/net/dev happens to work for a
moment.

For clean interfaces I am in the process of adding /proc/<pid>/ns/net,
/proc/<pid>/ns/ipc, and /proc/<pid>/ns/uts.

If we can make those files inode number be the same if the namespace is
the same like /proc/<pid>/net/dev is today.  I think we will have a
clean solution.

Additionally that solution will work for comparing network namespaces
that don't happen to have any processes in them at the moment.  Because
fstat works on file descriptors.

With the /proc/<pid>/ns/net file and bind mounts I have solved the
deeper problem of how do we get userspace policy into the naming of
namespaces.  With those files and the setns system call I have solved
the other problem of what is a good way to refer to namespaces without
assuming a global name.  So once those changes are merged I expect there
to be much less pressure to misuse any kind of identifier we can have.

And if we only make the guarantee about inode consistency for the
/proc/<pid>/ns/FILE files I don't expect it will make maintenance
of procfs any harder than it already is.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Lamparter May 23, 2011, 1:43 a.m. UTC | #4
> ... Eric W. Biederman wrote:
> Now it probably needs to be better documented that /proc/*/net/*
> have the same inode number if the network namespace is the
> same, as everyone including myself overlooked this very handy
> existing property.

Eh, so did I. But, yes, very nice.

On Sat, May 21, 2011 at 05:15:38PM -0700, Eric W. Biederman wrote:
> Additionally that solution will work for comparing network namespaces
> that don't happen to have any processes in them at the moment.  Because
> fstat works on file descriptors.

Hm. I have a peeve here. Assume I am a... rogue admin, whatever. I have
root on a router. I create a new network namespace, put a macvlan of
eth0 in it and a macvlan of eth1. I enable ip_forward.

Then I make a mount namespace, bind-mount the net namespace, bind mount
the mount namespace and terminate all processes that reference it (yes
this does work, i just checked [!]).

Now I can use it to bypass all firewall rules, IDS, whatever.

How is any normal admin, monitoring script or whatever else able to
detect this?

> With the /proc/<pid>/ns/net file and bind mounts I have solved the
> deeper problem of how do we get userspace policy into the naming of
> namespaces.  With those files and the setns system call I have solved
> the other problem of what is a good way to refer to namespaces without
> assuming a global name.  So once those changes are merged I expect there
> to be much less pressure to misuse any kind of identifier we can have.
> 
> And if we only make the guarantee about inode consistency for the
> /proc/<pid>/ns/FILE files I don't expect it will make maintenance
> of procfs any harder than it already is.


-David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Lamparter May 23, 2011, 1:47 a.m. UTC | #5
On Mon, May 23, 2011 at 03:43:03AM +0200, David Lamparter wrote:
> Then I make a mount namespace, bind-mount the net namespace, bind mount
> the mount namespace and terminate all processes that reference it (yes
> this does work, i just checked [!]).

Actually, Eric, bind-mounting a mount namespace inside itself should
probably be forbidden? No idea if you changed that (running a year-old
version of your patches here). Not only can you lose network namespaces
inside those self-referential mount namespaces but also references to
block devices, unix socket connections, etc. pp.


-David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 23, 2011, 2:02 a.m. UTC | #6
David Lamparter <equinox@diac24.net> writes:

>> ... Eric W. Biederman wrote:
>> Now it probably needs to be better documented that /proc/*/net/*
>> have the same inode number if the network namespace is the
>> same, as everyone including myself overlooked this very handy
>> existing property.
>
> Eh, so did I. But, yes, very nice.
>
> On Sat, May 21, 2011 at 05:15:38PM -0700, Eric W. Biederman wrote:
>> Additionally that solution will work for comparing network namespaces
>> that don't happen to have any processes in them at the moment.  Because
>> fstat works on file descriptors.
>
> Hm. I have a peeve here. Assume I am a... rogue admin, whatever. I have
> root on a router. I create a new network namespace, put a macvlan of
> eth0 in it and a macvlan of eth1. I enable ip_forward.
>
> Then I make a mount namespace, bind-mount the net namespace, bind mount
> the mount namespace and terminate all processes that reference it (yes
> this does work, i just checked [!]).

You must be using an older version of my patchset than what I have
queued for Linus.  Bind mounting the mount namepsace and creating
reference counting loops is a weird and ugly case.  So for the moment I
am not supporting the mount namespace, until I can think through
the consequences.

> Now I can use it to bypass all firewall rules, IDS, whatever.
>
> How is any normal admin, monitoring script or whatever else able to
> detect this?

Which is why we I proceed slowly and cautiously with adding new kernel
interfaces.  It is hard to think of everything until you can actually
put it into use, and play with it.

Other than not allowing bind mounting the mount namespace I don't
have any all encompassing really good answers at the moment.

I do have a few small answers.  For network namespaces you can look in
/proc/slabinfo and see how many you have, unless slub is lying to you.
On the switch your server is connected to you can look at the mac table
and see which mac addresses are currently in use, and notice if there
are unaccounted for mac addresses.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -660,6 +660,22 @@  struct proc_dir_entry *proc_symlink(const char *name,
 }
 EXPORT_SYMBOL(proc_symlink);
 
+struct proc_dir_entry *_proc_symlink(const char *name, struct proc_dir_entry *parent, const struct inode_operations *proc_iops)
+{
+	struct proc_dir_entry *pde;
+
+	pde = __proc_create(&parent, name, S_IFLNK | S_IRUGO|S_IWUGO|S_IXUGO, 1);
+	if (!pde)
+		return NULL;
+	pde->proc_iops = proc_iops;
+	pde->data = NULL;
+	if (proc_register(parent, pde) < 0) {
+		kfree(pde);
+		return NULL;
+	}
+	return pde;
+}
+
 struct proc_dir_entry *proc_mkdir_mode(const char *name, mode_t mode,
 		struct proc_dir_entry *parent)
 {
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -191,9 +191,30 @@  void proc_net_remove(struct net *net, const char *name)
 }
 EXPORT_SYMBOL_GPL(proc_net_remove);
 
+static int net_id_readlink(struct dentry *dentry, char __user *buf, int buflen)
+{
+	struct net *net;
+	char kbuf[42];
+	int len;
+
+	net = get_proc_net(dentry->d_inode);
+	if (!net)
+		return -ENXIO;
+	len = snprintf(kbuf, sizeof(kbuf), "%u", net->id);
+	put_net(net);
+	len = min(len, buflen);
+	if (copy_to_user(buf, kbuf, len))
+		return -EFAULT;
+	return len;
+}
+
+static const struct inode_operations net_id_proc_iops = {
+	.readlink	= net_id_readlink,
+};
+
 static __net_init int proc_net_ns_init(struct net *net)
 {
-	struct proc_dir_entry *netd, *net_statd;
+	struct proc_dir_entry *netd, *net_statd, *pde;
 	int err;
 
 	err = -ENOMEM;
@@ -214,8 +235,15 @@  static __net_init int proc_net_ns_init(struct net *net)
 
 	net->proc_net = netd;
 	net->proc_net_stat = net_statd;
+
+	pde = _proc_symlink("id", net->proc_net, &net_id_proc_iops);
+	if (!pde)
+		goto free_net_stat;
+
 	return 0;
 
+free_net_stat:
+	kfree(net_statd);
 free_net:
 	kfree(netd);
 out:
@@ -224,6 +252,7 @@  out:
 
 static __net_exit void proc_net_ns_exit(struct net *net)
 {
+	remove_proc_entry("id", net->proc_net);
 	remove_proc_entry("stat", net->proc_net);
 	kfree(net->proc_net);
 }
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -143,6 +143,7 @@  extern void proc_device_tree_update_prop(struct proc_dir_entry *pde,
 					 struct property *oldprop);
 #endif /* CONFIG_PROC_DEVICETREE */
 
+struct proc_dir_entry *_proc_symlink(const char *name, struct proc_dir_entry *parent, const struct inode_operations *proc_iops);
 extern struct proc_dir_entry *proc_symlink(const char *,
 		struct proc_dir_entry *, const char *);
 extern struct proc_dir_entry *proc_mkdir(const char *,struct proc_dir_entry *);
@@ -204,8 +205,14 @@  static inline struct proc_dir_entry *proc_create_data(const char *name,
 }
 #define remove_proc_entry(name, parent) do {} while (0)
 
+static inline struct proc_dir_entry *_proc_symlink(const char *name, struct proc_dir_entry *parent, const struct inode_operations *proc_iops)
+{
+	return NULL;
+}
+
 static inline struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent,const char *dest) {return NULL;}
+
 static inline struct proc_dir_entry *proc_mkdir(const char *name,
 	struct proc_dir_entry *parent) {return NULL;}
 static inline struct proc_dir_entry *proc_mkdir_mode(const char *name,
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -96,6 +96,16 @@  struct net {
 	struct netns_xfrm	xfrm;
 #endif
 	struct netns_ipvs	*ipvs;
+
+	/*
+	 * netns unique id solely for userspace consumption,
+	 * see /proc/net/id symlink.
+	 *
+	 * init_net has id 0.
+	 *
+	 * Write-once field.
+	 */
+	unsigned int		id;
 };
 
 
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -115,6 +115,52 @@  static void ops_free_list(const struct pernet_operations *ops,
 	}
 }
 
+#ifdef CONFIG_NET_NS
+static DEFINE_IDA(net_id_ida);
+static DEFINE_SPINLOCK(net_id_ida_lock);
+
+static int __net_init set_net_id(struct net *net)
+{
+	int id;
+
+	if (net_eq(net, &init_net)) {
+		id = 0;
+	} else {
+		int rv;
+
+		do {
+			if (ida_pre_get(&net_id_ida, GFP_KERNEL) == 0)
+				return -ENOMEM;
+			spin_lock(&net_id_ida_lock);
+			/* init_net has id 0 */
+			rv = ida_get_new_above(&net_id_ida, 1, &id);
+			spin_unlock(&net_id_ida_lock);
+		} while (rv == -EAGAIN);
+		if (rv < 0)
+			return rv;
+	}
+	net->id = id;
+	return 0;
+}
+
+static void free_net_id(struct net *net)
+{
+	spin_lock(&net_id_ida_lock);
+	ida_remove(&net_id_ida, net->id);
+	spin_unlock(&net_id_ida_lock);
+}
+#else
+static inline int set_net_id(struct net *net)
+{
+	net->id = 0;
+	return 0;
+}
+
+static inline void free_net_id(struct net *net)
+{
+}
+#endif
+
 /*
  * setup_net runs the initializers for the network namespace object.
  */
@@ -131,6 +177,10 @@  static __net_init int setup_net(struct net *net)
 	atomic_set(&net->use_count, 0);
 #endif
 
+	error = set_net_id(net);
+	if (error < 0)
+		goto out;
+
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
 		if (error < 0)
@@ -140,6 +190,8 @@  out:
 	return error;
 
 out_undo:
+	free_net_id(net);
+
 	/* Walk through the list backwards calling the exit functions
 	 * for the pernet modules whose init functions did not fail.
 	 */
@@ -204,6 +256,8 @@  static void net_free(struct net *net)
 		return;
 	}
 #endif
+
+	free_net_id(net);
 	kfree(net->gen);
 	kmem_cache_free(net_cachep, net);
 }