[2/7] ns: Introduce the setns syscall

Message ID 1304735101-1824-2-git-send-email-ebiederm@xmission.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Eric W. Biederman May 7, 2011, 2:24 a.m. UTC
With the networking stack today there is demand to handle
multiple network stacks at a time.  Not in the context
of containers but in the context of people doing interesting
things with routing.

There is also demand in the context of containers to have
an efficient way to execute some code in the container itself.
If nothing else it is very useful as a debugging technique.

Both problems can be solved by starting some form of login
daemon in the namespaces people want access to, or you
can play games by ptracing a process and getting the
traced process to do things you want it to do. However
it turns out that a login daemon or a ptrace puppet
controller are more code, they are more prone to
failure, and generally they are less efficient than
simply changing the namespace of a process to a
specified one.

Pieces of this puzzle can also be solved by coming up
with targeted system calls, perhaps socketat, that each
solve a subset of the larger problem, instead of coming up
with one general purpose system call.  Overall that appears
to be more work for less reward.

int setns(int fd, int nstype);

The fd argument is a file descriptor referring to a proc
file of the namespace you want to switch the process to.

In the setns system call the nstype argument is either 0 or
the clone flag of the namespace you intend to change.
Passing the flag prevents changing a namespace unintentionally.
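
A minimal userspace sketch of how the call described above might be used.
It assumes the namespace files appear as /proc/<pid>/ns/<name> (the path is
not stated in this message) and invokes the system call through syscall(),
since a C library wrapper may not exist. The helper names ns_path and
enter_namespace are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000	/* same value as the kernel's clone flag */
#endif

/*
 * Write "/proc/<pid>/ns/<name>" into buf.
 * Returns the length of the path, or -1 if it does not fit.
 * The /proc layout is an assumption, see the text above.
 */
static int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
	int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Open the namespace file at path and switch the calling task into it.
 * nstype is 0 (accept any namespace type) or the clone flag the file
 * is expected to match.  Returns 0 on success, -1 with errno set.
 */
static int enter_namespace(const char *path, int nstype)
{
	int fd, ret, saved_errno;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Raw syscall: a library wrapper for setns may not be available. */
	ret = (int)syscall(SYS_setns, fd, nstype);
	saved_errno = errno;
	close(fd);	/* the descriptor is only needed for the call itself */
	errno = saved_errno;
	return ret;
}
```

A caller would build the path for the target process with ns_path() and
pass CLONE_NEWNET as nstype when only a network namespace is acceptable.
With the patch as posted the call needs CAP_SYS_ADMIN.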

v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/nsproxy.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

Comments

Rémi Denis-Courmont May 7, 2011, 8:01 a.m. UTC | #1
On Saturday 7 May 2011 05:24:56 Eric W. Biederman, you wrote:
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.

socketat() is still required for multithreaded namespace-aware userspace, I 
believe.
Eric W. Biederman May 7, 2011, 1:57 p.m. UTC | #2
"Rémi Denis-Courmont" <remi@remlab.net> writes:

> On Saturday 7 May 2011 05:24:56 Eric W. Biederman, you wrote:
>> Pieces of this puzzle can also be solved by instead of
>> coming up with a general purpose system call coming up
>> with targeted system calls perhaps socketat that solve
>> a subset of the larger problem.  Overall that appears
>> to be more work for less reward.
>
> socketat() is still required for multithreaded namespace-aware userspace, I 
> believe.

The network namespace is a per task property so there are no problems
with multithreaded network namespace aware userspace applications.  The
implementation of a userspace socketat will still need to disable signal
handling around the network namespace switch to be signal safe, which
means that ultimately a kernel version of socketat may be desirable
for performance reasons, but I know of no correctness reasons to need
it.
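
A rough sketch of what such a userspace socketat might look like, under
stated assumptions: the per-thread namespace file is taken to live at
/proc/self/task/<tid>/ns/net, the function name socketat_sketch is
hypothetical, and setns is invoked through syscall(). It is an illustration
of the signal-blocking idea above, not the implementation discussed in this
thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000	/* same value as the kernel's clone flag */
#endif

/*
 * Hypothetical userspace socketat(): create a socket in the network
 * namespace referred to by nsfd, then return to the original namespace.
 * Returns the socket descriptor, or -1 with errno set.
 */
static int socketat_sketch(int nsfd, int domain, int type, int protocol)
{
	char path[64];
	sigset_t all, old;
	int oldns, sock = -1, saved_errno;

	/* The namespace is per task, so save this thread's own namespace. */
	snprintf(path, sizeof(path), "/proc/self/task/%ld/ns/net",
		 (long)syscall(SYS_gettid));
	oldns = open(path, O_RDONLY | O_CLOEXEC);
	if (oldns < 0)
		return -1;

	/*
	 * Block signals so that no handler runs while the thread is in
	 * the other namespace.  On Linux sigprocmask() changes the calling
	 * thread's mask; pthread_sigmask() is the portable spelling.
	 */
	sigfillset(&all);
	sigprocmask(SIG_SETMASK, &all, &old);

	if (syscall(SYS_setns, nsfd, CLONE_NEWNET) == 0) {
		sock = socket(domain, type, protocol);
		saved_errno = errno;
		if (syscall(SYS_setns, oldns, CLONE_NEWNET) != 0) {
			/* Could not switch back: report that as the failure. */
			saved_errno = errno;
			if (sock >= 0)
				close(sock);
			sock = -1;
		}
	} else {
		saved_errno = errno;
	}

	sigprocmask(SIG_SETMASK, &old, NULL);
	close(oldns);
	errno = saved_errno;
	return sock;
}
```

The two namespace switches and the two mask changes are four extra system
calls per socket, which is the performance cost a kernel socketat would
avoid.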

For the time being I have simply removed socketat from what I plan to
merge because it is not strictly needed, I don't yet have a test case
for socketat, and I don't have as much time to work on this as I
would like.

There is one bug a multi-threaded network namespace aware user space
application might run into, and that is /proc/net is a symlink to
/proc/self/net, which means that if you open /proc/net/foo from a task with
a different network namespace than the task whose tid equals your
tgid, /proc/net will return the wrong file.  Still you can
avoid even that silliness by opening /proc/<tid>/net.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano May 7, 2011, 10:39 p.m. UTC | #3
On 05/07/2011 04:24 AM, Eric W. Biederman wrote:
> With the networking stack today there is demand to handle
> multiple network stacks at a time.  Not in the context
> of containers but in the context of people doing interesting
> things with routing.
>
> There is also demand in the context of containers to have
> an efficient way to execute some code in the container itself.
> If nothing else it is very useful as a debugging technique.
>
> Both problems can be solved by starting some form of login
> daemon in the namespaces people want access to, or you
> can play games by ptracing a process and getting the
> traced process to do things you want it to do. However
> it turns out that a login daemon or a ptrace puppet
> controller are more code, they are more prone to
> failure, and generally they are less efficient than
> simply changing the namespace of a process to a
> specified one.
>
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.
>
> int setns(int fd, int nstype);
>
> The fd argument is a file descriptor referring to a proc
> file of the namespace you want to switch the process to.
>
> In the setns system call the nstype is 0 or specifies
> a clone flag of the namespace you intend to change
> to prevent changing a namespace unintentionally.
>
> v2: Most of the architecture support added by Daniel Lezcano<dlezcano@fr.ibm.com>
> v3: ported to v2.6.36-rc4 by: Eric W. Biederman<ebiederm@xmission.com>
> v4: Moved wiring up of the system call to another patch
>
> Signed-off-by: Eric W. Biederman<ebiederm@xmission.com>
> ---

Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Matt Helsley May 8, 2011, 3:51 a.m. UTC | #4
On Fri, May 06, 2011 at 07:24:56PM -0700, Eric W. Biederman wrote:
> With the networking stack today there is demand to handle
> multiple network stacks at a time.  Not in the context
> of containers but in the context of people doing interesting
> things with routing.
> 
> There is also demand in the context of containers to have
> an efficient way to execute some code in the container itself.
> If nothing else it is very useful as a debugging technique.
> 
> Both problems can be solved by starting some form of login
> daemon in the namespaces people want access to, or you
> can play games by ptracing a process and getting the
> traced process to do things you want it to do. However
> it turns out that a login daemon or a ptrace puppet
> controller are more code, they are more prone to
> failure, and generally they are less efficient than
> simply changing the namespace of a process to a
> specified one.
> 
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.
> 
> int setns(int fd, int nstype);
> 
> The fd argument is a file descriptor referring to a proc
> file of the namespace you want to switch the process to.
> 
> In the setns system call the nstype is 0 or specifies
> a clone flag of the namespace you intend to change
> to prevent changing a namespace unintentionally.
> 
> v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
> v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
> v4: Moved wiring up of the system call to another patch
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  kernel/nsproxy.c |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index a05d191..96059d8 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -22,6 +22,9 @@
>  #include <linux/pid_namespace.h>
>  #include <net/net_namespace.h>
>  #include <linux/ipc_namespace.h>
> +#include <linux/proc_fs.h>
> +#include <linux/file.h>
> +#include <linux/syscalls.h>
> 
>  static struct kmem_cache *nsproxy_cachep;
> 
> @@ -233,6 +236,40 @@ void exit_task_namespaces(struct task_struct *p)
>  	switch_task_namespaces(p, NULL);
>  }
> 
> +SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> +{
> +	const struct proc_ns_operations *ops;
> +	struct task_struct *tsk = current;
> +	struct nsproxy *new_nsproxy;
> +	struct proc_inode *ei;
> +	struct file *file;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	file = proc_ns_fget(fd);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	err = -EINVAL;
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	ops = ei->ns_ops;
> +	if (nstype && (ops->type != nstype))
> +		goto out;
> +
> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);

Doesn't this need some error checking like:

	if (IS_ERR(new_nsproxy)) {
		err = PTR_ERR(new_nsproxy);
		goto out;
	}
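
The check suggested above follows the kernel's error-pointer convention. As
a hedged illustration only, here is a userspace model of that control flow:
ERR_PTR, PTR_ERR and IS_ERR are re-implemented locally rather than taken
from kernel headers, and nsproxy_model, create_model and setns_model are
hypothetical stand-ins for the kernel structures and functions in the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct nsproxy_model {
	int netns_id;	/* placeholder field, not from the patch */
};

/* Stand-in for create_new_namespaces(): returns an error pointer on failure. */
static struct nsproxy_model *create_model(int fail)
{
	struct nsproxy_model *p;

	if (fail)
		return ERR_PTR(-ENOMEM);
	p = calloc(1, sizeof(*p));
	return p ? p : ERR_PTR(-ENOMEM);
}

/* Mirrors the goto-out flow of the syscall with the suggested check added. */
static int setns_model(int fail_create, int *cleanup_ran)
{
	struct nsproxy_model *new_nsproxy;
	int err;

	new_nsproxy = create_model(fail_create);
	if (IS_ERR(new_nsproxy)) {
		err = (int)PTR_ERR(new_nsproxy);
		goto out;
	}

	/* install and switch would happen here; the model just frees it */
	free(new_nsproxy);
	err = 0;
out:
	*cleanup_ran = 1;	/* corresponds to fput(file) in the patch */
	return err;
}
```

Without the IS_ERR() test the error pointer would be handed on as if it
were a valid object, which is the problem the review comment points at.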


> +	err = ops->install(new_nsproxy, ei->ns);
> +	if (err) {
> +		free_nsproxy(new_nsproxy);
> +		goto out;
> +	}
> +	switch_task_namespaces(tsk, new_nsproxy);
> +out:
> +	fput(file);
> +	return err;
> +}
> +
>  static int __init nsproxy_cache_init(void)
>  {
>  	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
> -- 
> 1.6.5.2.143.g8cc62
> 
Nathan Lynch May 11, 2011, 7:21 p.m. UTC | #5
Hi Eric,

On Fri, 2011-05-06 at 19:24 -0700, Eric W. Biederman wrote:
> With the networking stack today there is demand to handle
> multiple network stacks at a time.  Not in the context
> of containers but in the context of people doing interesting
> things with routing.
> 
> There is also demand in the context of containers to have
> an efficient way to execute some code in the container itself.
> If nothing else it is very useful as a debugging technique.
> 
> Both problems can be solved by starting some form of login
> daemon in the namespaces people want access to, or you
> can play games by ptracing a process and getting the
> traced process to do things you want it to do. However
> it turns out that a login daemon or a ptrace puppet
> controller are more code, they are more prone to
> failure, and generally they are less efficient than
> simply changing the namespace of a process to a
> specified one.
> 
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.
> 
> int setns(int fd, int nstype);
> 
> The fd argument is a file descriptor referring to a proc
> file of the namespace you want to switch the process to.
> 
> In the setns system call the nstype is 0 or specifies
> a clone flag of the namespace you intend to change
> to prevent changing a namespace unintentionally.

I don't understand exactly what the nstype argument buys us - why would
correct code ever need to specify a value other than 0?  And reusing the
CLONE_NEW* values in this interface is kind of ugly when setns is
precisely _not_ creating new namespaces.

Is there some fundamental reason it couldn't be

int setns(int fd);

or is there a use case I'm missing?

 
> +SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> +{
> +	const struct proc_ns_operations *ops;
> +	struct task_struct *tsk = current;
> +	struct nsproxy *new_nsproxy;
> +	struct proc_inode *ei;
> +	struct file *file;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	file = proc_ns_fget(fd);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	err = -EINVAL;
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	ops = ei->ns_ops;
> +	if (nstype && (ops->type != nstype))
> +		goto out;
> +
> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);

create_new_namespaces() can fail; shouldn't this be checked?


> +	err = ops->install(new_nsproxy, ei->ns);
> +	if (err) {
> +		free_nsproxy(new_nsproxy);
> +		goto out;
> +	}
> +	switch_task_namespaces(tsk, new_nsproxy);
> +out:
> +	fput(file);
> +	return err;
> +}
> +


Eric W. Biederman May 11, 2011, 8:33 p.m. UTC | #6
Nathan Lynch <ntl@pobox.com> writes:

> Hi Eric,
>
> On Fri, 2011-05-06 at 19:24 -0700, Eric W. Biederman wrote:
>> With the networking stack today there is demand to handle
>> multiple network stacks at a time.  Not in the context
>> of containers but in the context of people doing interesting
>> things with routing.
>> 
>> There is also demand in the context of containers to have
>> an efficient way to execute some code in the container itself.
>> If nothing else it is very useful as a debugging technique.
>> 
>> Both problems can be solved by starting some form of login
>> daemon in the namespaces people want access to, or you
>> can play games by ptracing a process and getting the
>> traced process to do things you want it to do. However
>> it turns out that a login daemon or a ptrace puppet
>> controller are more code, they are more prone to
>> failure, and generally they are less efficient than
>> simply changing the namespace of a process to a
>> specified one.
>> 
>> Pieces of this puzzle can also be solved by instead of
>> coming up with a general purpose system call coming up
>> with targeted system calls perhaps socketat that solve
>> a subset of the larger problem.  Overall that appears
>> to be more work for less reward.
>> 
>> int setns(int fd, int nstype);
>> 
>> The fd argument is a file descriptor referring to a proc
>> file of the namespace you want to switch the process to.
>> 
>> In the setns system call the nstype is 0 or specifies
>> a clone flag of the namespace you intend to change
>> to prevent changing a namespace unintentionally.
>
> I don't understand exactly what the nstype argument buys us - why would
> correct code ever need to specify a value other than 0?  And reusing the
> CLONE_NEW* values in this interface is kind of ugly when setns is
> precisely _not_ creating new namespaces.

No but it is setting a new namespace.  I do agree it is a bit ugly.  But
the worst case at this point is I introduce a new set of beautiful
defines with the same values.

> Is there some fundamental reason it couldn't be
>
> int setns(int fd);
>
> or is there a use case I'm missing?

When someone else opens the file descriptor and passes it to us
and we don't completely trust them.  Or equally when someone
else does the bind mount into the filesystem namespace and we
don't completely trust them.

Plus having a flags field is useful in general.
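
To make the untrusted-descriptor case concrete, here is a small userspace
model of the type check the patch performs (nstype against ops->type). The
function name check_nstype is hypothetical, and the fallback flag values
are the kernel's clone flag values as far as I know.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC 0x08000000
#endif

/*
 * fd_type: namespace type the descriptor really refers to
 *          (ops->type in the patch).
 * nstype:  type the caller expects, or 0 to accept any type.
 * Returns 0 if the switch may proceed, -EINVAL on a mismatch.
 */
static int check_nstype(int fd_type, int nstype)
{
	if (nstype && fd_type != nstype)
		return -EINVAL;
	return 0;
}
```

A caller that expects a network namespace passes CLONE_NEWNET, so a
descriptor that turns out to be an IPC namespace is rejected with -EINVAL
instead of silently changing the wrong namespace.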

>> +SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>> +{
>> +	const struct proc_ns_operations *ops;
>> +	struct task_struct *tsk = current;
>> +	struct nsproxy *new_nsproxy;
>> +	struct proc_inode *ei;
>> +	struct file *file;
>> +	int err;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	file = proc_ns_fget(fd);
>> +	if (IS_ERR(file))
>> +		return PTR_ERR(file);
>> +
>> +	err = -EINVAL;
>> +	ei = PROC_I(file->f_dentry->d_inode);
>> +	ops = ei->ns_ops;
>> +	if (nstype && (ops->type != nstype))
>> +		goto out;
>> +
>> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
>
> create_new_namespaces() can fail; shouldn't this be checked?

Yes.  This was pointed out a little earlier and has been fixed
in my tree.


>> +	err = ops->install(new_nsproxy, ei->ns);
>> +	if (err) {
>> +		free_nsproxy(new_nsproxy);
>> +		goto out;
>> +	}
>> +	switch_task_namespaces(tsk, new_nsproxy);
>> +out:
>> +	fput(file);
>> +	return err;
>> +}
>> +

Eric

Patch

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index a05d191..96059d8 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -22,6 +22,9 @@ 
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/proc_fs.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -233,6 +236,40 @@  void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+SYSCALL_DEFINE2(setns, int, fd, int, nstype)
+{
+	const struct proc_ns_operations *ops;
+	struct task_struct *tsk = current;
+	struct nsproxy *new_nsproxy;
+	struct proc_inode *ei;
+	struct file *file;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	err = -EINVAL;
+	ei = PROC_I(file->f_dentry->d_inode);
+	ops = ei->ns_ops;
+	if (nstype && (ops->type != nstype))
+		goto out;
+
+	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+	err = ops->install(new_nsproxy, ei->ns);
+	if (err) {
+		free_nsproxy(new_nsproxy);
+		goto out;
+	}
+	switch_task_namespaces(tsk, new_nsproxy);
+out:
+	fput(file);
+	return err;
+}
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);