Message ID | 1304735101-1824-2-git-send-email-ebiederm@xmission.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
On Saturday 7 May 2011 05:24:56, Eric W. Biederman wrote:
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.

socketat() is still required for multithreaded namespace-aware userspace, I
believe.

"Rémi Denis-Courmont" <remi@remlab.net> writes:

> On Saturday 7 May 2011 05:24:56, Eric W. Biederman wrote:
>> Pieces of this puzzle can also be solved by instead of
>> coming up with a general purpose system call coming up
>> with targeted system calls perhaps socketat that solve
>> a subset of the larger problem.  Overall that appears
>> to be more work for less reward.
>
> socketat() is still required for multithreaded namespace-aware userspace, I
> believe.

The network namespace is a per task property so there are no problems
with multithreaded network namespace aware userspace applications.

The implementation of a userspace socketat will still need to disable
signal handling around the network namespace switch to be signal safe.
Which means that ultimately a kernel version of socketat may be desirable
for performance reasons, but I know of no correctness reasons to need it.

For the time being I have simply removed socketat from what I plan to merge
because it is not strictly needed, I don't yet have a test case for socketat,
and I don't have as much time to work on this as I would like.

There is one bug a multi-threaded network namespace aware user space
application might run into, and that is /proc/net is a symlink to
/proc/self/net.  Which means that if you open /proc/net/foo from a task
with a different network namespace than the task whose tid equals your
tgid, /proc/net will return the wrong file.  Still you can avoid even
that silliness by opening /proc/<tid>/net.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/07/2011 04:24 AM, Eric W. Biederman wrote:
> With the networking stack today there is demand to handle
> multiple network stacks at a time.  Not in the context
> of containers but in the context of people doing interesting
> things with routing.
>
> There is also demand in the context of containers to have
> an efficient way to execute some code in the container itself.
> If nothing else it is very useful as a debugging technique.
>
> Both problems can be solved by starting some form of login
> daemon in the namespaces people want access to, or you
> can play games by ptracing a process and getting the
> traced process to do things you want it to do.  However
> it turns out that a login daemon or a ptrace puppet
> controller are more code, they are more prone to
> failure, and generally they are less efficient than
> simply changing the namespace of a process to a
> specified one.
>
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.
>
> int setns(int fd, int nstype);
>
> The fd argument is a file descriptor referring to a proc
> file of the namespace you want to switch the process to.
>
> In the setns system call the nstype is 0 or specifies
> a clone flag of the namespace you intend to change
> to prevent changing a namespace unintentionally.
>
> v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
> v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
> v4: Moved wiring up of the system call to another patch
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---

Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
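Eric's reply to Rémi earlier in the thread says that a userspace socketat has to disable signal handling around the network namespace switch. A minimal sketch of that idea follows; it is not code from this series. The name `socketat_sketch` is hypothetical, the `/proc/self/task/<tid>/ns/net` path and the libc `setns()` wrapper are assumptions taken from later mainline kernels and glibc, and `sigprocmask()` is used for brevity (a threaded program would normally call `pthread_sigmask()`).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>        /* setns(), CLONE_NEWNET */
#include <signal.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): create a socket in the
 * network namespace referred to by nsfd, then switch the calling task
 * back. Returns the socket fd, or -1 with errno set.
 */
int socketat_sketch(int nsfd, int domain, int type, int protocol)
{
	sigset_t all, saved;
	char self[64];
	int oldns, sock = -1, err = 0;

	/* Block signals so that no handler runs in the wrong namespace. */
	sigfillset(&all);
	if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)
		return -1;

	/* The namespace is per task, so look it up by tid, not by pid. */
	snprintf(self, sizeof(self), "/proc/self/task/%ld/ns/net",
		 (long)syscall(SYS_gettid));
	oldns = open(self, O_RDONLY | O_CLOEXEC);
	if (oldns < 0) {
		err = errno;
		goto restore;
	}

	if (setns(nsfd, CLONE_NEWNET) != 0) {
		err = errno;
	} else {
		sock = socket(domain, type, protocol);
		err = (sock < 0) ? errno : 0;
		if (setns(oldns, CLONE_NEWNET) != 0) {
			/* Could not switch back; give up on the socket too. */
			err = errno;
			if (sock >= 0)
				close(sock);
			sock = -1;
		}
	}
	close(oldns);
restore:
	/* Restore the caller's signal mask on every path. */
	sigprocmask(SIG_SETMASK, &saved, NULL);
	errno = err;
	return sock;
}
```

A caller would pass a descriptor for the target namespace, for example `socketat_sketch(nsfd, AF_INET, SOCK_STREAM, 0)`. The two extra `setns()` calls and the mask save/restore per socket are the overhead that, as Eric notes, could eventually make a kernel socketat desirable for performance.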
On Fri, May 06, 2011 at 07:24:56PM -0700, Eric W. Biederman wrote:
> With the networking stack today there is demand to handle
> multiple network stacks at a time.  Not in the context
> of containers but in the context of people doing interesting
> things with routing.
>
> There is also demand in the context of containers to have
> an efficient way to execute some code in the container itself.
> If nothing else it is very useful as a debugging technique.
>
> Both problems can be solved by starting some form of login
> daemon in the namespaces people want access to, or you
> can play games by ptracing a process and getting the
> traced process to do things you want it to do.  However
> it turns out that a login daemon or a ptrace puppet
> controller are more code, they are more prone to
> failure, and generally they are less efficient than
> simply changing the namespace of a process to a
> specified one.
>
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.
>
> int setns(int fd, int nstype);
>
> The fd argument is a file descriptor referring to a proc
> file of the namespace you want to switch the process to.
>
> In the setns system call the nstype is 0 or specifies
> a clone flag of the namespace you intend to change
> to prevent changing a namespace unintentionally.
>
> v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
> v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
> v4: Moved wiring up of the system call to another patch
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  kernel/nsproxy.c |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index a05d191..96059d8 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -22,6 +22,9 @@
>  #include <linux/pid_namespace.h>
>  #include <net/net_namespace.h>
>  #include <linux/ipc_namespace.h>
> +#include <linux/proc_fs.h>
> +#include <linux/file.h>
> +#include <linux/syscalls.h>
>
>  static struct kmem_cache *nsproxy_cachep;
>
> @@ -233,6 +236,40 @@ void exit_task_namespaces(struct task_struct *p)
>  	switch_task_namespaces(p, NULL);
>  }
>
> +SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> +{
> +	const struct proc_ns_operations *ops;
> +	struct task_struct *tsk = current;
> +	struct nsproxy *new_nsproxy;
> +	struct proc_inode *ei;
> +	struct file *file;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	file = proc_ns_fget(fd);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	err = -EINVAL;
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	ops = ei->ns_ops;
> +	if (nstype && (ops->type != nstype))
> +		goto out;
> +
> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);

Doesn't this need some error checking like:

	if (IS_ERR(new_nsproxy)) {
		err = PTR_ERR(new_nsproxy);
		goto out;
	}

> +	err = ops->install(new_nsproxy, ei->ns);
> +	if (err) {
> +		free_nsproxy(new_nsproxy);
> +		goto out;
> +	}
> +	switch_task_namespaces(tsk, new_nsproxy);
> +out:
> +	fput(file);
> +	return err;
> +}
> +
>  static int __init nsproxy_cache_init(void)
>  {
>  	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
> --
> 1.6.5.2.143.g8cc62
>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
Hi Eric,

On Fri, 2011-05-06 at 19:24 -0700, Eric W. Biederman wrote:
> With the networking stack today there is demand to handle
> multiple network stacks at a time.  Not in the context
> of containers but in the context of people doing interesting
> things with routing.
>
> There is also demand in the context of containers to have
> an efficient way to execute some code in the container itself.
> If nothing else it is very useful as a debugging technique.
>
> Both problems can be solved by starting some form of login
> daemon in the namespaces people want access to, or you
> can play games by ptracing a process and getting the
> traced process to do things you want it to do.  However
> it turns out that a login daemon or a ptrace puppet
> controller are more code, they are more prone to
> failure, and generally they are less efficient than
> simply changing the namespace of a process to a
> specified one.
>
> Pieces of this puzzle can also be solved by instead of
> coming up with a general purpose system call coming up
> with targeted system calls perhaps socketat that solve
> a subset of the larger problem.  Overall that appears
> to be more work for less reward.
>
> int setns(int fd, int nstype);
>
> The fd argument is a file descriptor referring to a proc
> file of the namespace you want to switch the process to.
>
> In the setns system call the nstype is 0 or specifies
> a clone flag of the namespace you intend to change
> to prevent changing a namespace unintentionally.

I don't understand exactly what the nstype argument buys us - why would
correct code ever need to specify a value other than 0?  And reusing the
CLONE_NEW* values in this interface is kind of ugly when setns is
precisely _not_ creating new namespaces.

Is there some fundamental reason it couldn't be

int setns(int fd);

or is there a use case I'm missing?

> +SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> +{
> +	const struct proc_ns_operations *ops;
> +	struct task_struct *tsk = current;
> +	struct nsproxy *new_nsproxy;
> +	struct proc_inode *ei;
> +	struct file *file;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	file = proc_ns_fget(fd);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	err = -EINVAL;
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	ops = ei->ns_ops;
> +	if (nstype && (ops->type != nstype))
> +		goto out;
> +
> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);

create_new_namespaces() can fail; shouldn't this be checked?

> +	err = ops->install(new_nsproxy, ei->ns);
> +	if (err) {
> +		free_nsproxy(new_nsproxy);
> +		goto out;
> +	}
> +	switch_task_namespaces(tsk, new_nsproxy);
> +out:
> +	fput(file);
> +	return err;
> +}
> +
Nathan Lynch <ntl@pobox.com> writes:

> Hi Eric,
>
> On Fri, 2011-05-06 at 19:24 -0700, Eric W. Biederman wrote:
>> With the networking stack today there is demand to handle
>> multiple network stacks at a time.  Not in the context
>> of containers but in the context of people doing interesting
>> things with routing.
>>
>> There is also demand in the context of containers to have
>> an efficient way to execute some code in the container itself.
>> If nothing else it is very useful as a debugging technique.
>>
>> Both problems can be solved by starting some form of login
>> daemon in the namespaces people want access to, or you
>> can play games by ptracing a process and getting the
>> traced process to do things you want it to do.  However
>> it turns out that a login daemon or a ptrace puppet
>> controller are more code, they are more prone to
>> failure, and generally they are less efficient than
>> simply changing the namespace of a process to a
>> specified one.
>>
>> Pieces of this puzzle can also be solved by instead of
>> coming up with a general purpose system call coming up
>> with targeted system calls perhaps socketat that solve
>> a subset of the larger problem.  Overall that appears
>> to be more work for less reward.
>>
>> int setns(int fd, int nstype);
>>
>> The fd argument is a file descriptor referring to a proc
>> file of the namespace you want to switch the process to.
>>
>> In the setns system call the nstype is 0 or specifies
>> a clone flag of the namespace you intend to change
>> to prevent changing a namespace unintentionally.
>
> I don't understand exactly what the nstype argument buys us - why would
> correct code ever need to specify a value other than 0?  And reusing the
> CLONE_NEW* values in this interface is kind of ugly when setns is
> precisely _not_ creating new namespaces.

No, but it is setting a new namespace.  I do agree it is a bit ugly.
But the worst case at this point is I introduce a new set of beautiful
defines with the same values.

> Is there some fundamental reason it couldn't be
>
> int setns(int fd);
>
> or is there a use case I'm missing?

When someone else opens the file descriptor and passes it to us
and we don't completely trust them.  Or equally when someone else
does the bind mount into the filesystem namespace and we don't
completely trust them.

Plus having a flags field is useful in general.

>> +SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>> +{
>> +	const struct proc_ns_operations *ops;
>> +	struct task_struct *tsk = current;
>> +	struct nsproxy *new_nsproxy;
>> +	struct proc_inode *ei;
>> +	struct file *file;
>> +	int err;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	file = proc_ns_fget(fd);
>> +	if (IS_ERR(file))
>> +		return PTR_ERR(file);
>> +
>> +	err = -EINVAL;
>> +	ei = PROC_I(file->f_dentry->d_inode);
>> +	ops = ei->ns_ops;
>> +	if (nstype && (ops->type != nstype))
>> +		goto out;
>> +
>> +	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
>
> create_new_namespaces() can fail; shouldn't this be checked?

Yes.  This was pointed out a little earlier and has been fixed
in my tree.

>> +	err = ops->install(new_nsproxy, ei->ns);
>> +	if (err) {
>> +		free_nsproxy(new_nsproxy);
>> +		goto out;
>> +	}
>> +	switch_task_namespaces(tsk, new_nsproxy);
>> +out:
>> +	fput(file);
>> +	return err;
>> +}
>> +

Eric
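The use case Eric gives for nstype (a descriptor handed over by a party that is not fully trusted) might look roughly like this from userspace. This is only an illustrative sketch: `join_netns_only` is a hypothetical name, and the libc `setns()` wrapper is assumed to be available (it was added to glibc after this thread).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>        /* setns(), CLONE_NEWNET */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: join the namespace behind fd, but only if it is
 * a network namespace. Returns 0 on success, -1 on failure.
 */
int join_netns_only(int fd)
{
	/*
	 * Passing CLONE_NEWNET instead of 0 asks the kernel to refuse a
	 * descriptor that refers to any other kind of namespace, so a
	 * sender cannot slip in, say, a mount namespace.
	 */
	if (setns(fd, CLONE_NEWNET) != 0) {
		fprintf(stderr, "setns: %s\n", strerror(errno));
		return -1;
	}
	return 0;
}
```

With nstype set to 0 the same call would accept whatever namespace type the descriptor happens to refer to, which is fine only when the caller opened the file itself.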
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index a05d191..96059d8 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -22,6 +22,9 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/proc_fs.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -233,6 +236,40 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+SYSCALL_DEFINE2(setns, int, fd, int, nstype)
+{
+	const struct proc_ns_operations *ops;
+	struct task_struct *tsk = current;
+	struct nsproxy *new_nsproxy;
+	struct proc_inode *ei;
+	struct file *file;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	err = -EINVAL;
+	ei = PROC_I(file->f_dentry->d_inode);
+	ops = ei->ns_ops;
+	if (nstype && (ops->type != nstype))
+		goto out;
+
+	new_nsproxy = create_new_namespaces(0, tsk, tsk->fs);
+	err = ops->install(new_nsproxy, ei->ns);
+	if (err) {
+		free_nsproxy(new_nsproxy);
+		goto out;
+	}
+	switch_task_namespaces(tsk, new_nsproxy);
+out:
+	fput(file);
+	return err;
+}
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
With the networking stack today there is demand to handle
multiple network stacks at a time.  Not in the context
of containers but in the context of people doing interesting
things with routing.

There is also demand in the context of containers to have
an efficient way to execute some code in the container itself.
If nothing else it is very useful as a debugging technique.

Both problems can be solved by starting some form of login
daemon in the namespaces people want access to, or you
can play games by ptracing a process and getting the
traced process to do things you want it to do.  However
it turns out that a login daemon or a ptrace puppet
controller are more code, they are more prone to
failure, and generally they are less efficient than
simply changing the namespace of a process to a
specified one.

Pieces of this puzzle can also be solved by instead of
coming up with a general purpose system call coming up
with targeted system calls perhaps socketat that solve
a subset of the larger problem.  Overall that appears
to be more work for less reward.

int setns(int fd, int nstype);

The fd argument is a file descriptor referring to a proc
file of the namespace you want to switch the process to.

In the setns system call the nstype is 0 or specifies
a clone flag of the namespace you intend to change
to prevent changing a namespace unintentionally.

v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/nsproxy.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)
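As an illustration of the "execute some code in the container" use described above, the following userspace sketch switches into another process's network namespace and then runs a command. It rests on assumptions that this message does not state: the namespace file is taken to live at `/proc/<pid>/ns/net` (the path used in mainline Linux), the libc `setns()` wrapper is assumed to exist, and `open_netns_of` and `run_in_netns` are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>        /* setns(), CLONE_NEWNET */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open the network namespace file of pid. */
int open_netns_of(pid_t pid)
{
	char path[64];

	/* Assumed path; the commit message only says "a proc file". */
	snprintf(path, sizeof(path), "/proc/%ld/ns/net", (long)pid);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Hypothetical helper: join the network namespace of pid and exec
 * cmd[0] there. Returns -1 on failure; does not return on success.
 */
int run_in_netns(pid_t pid, char *const cmd[])
{
	int saved, fd = open_netns_of(pid);

	if (fd < 0)
		return -1;

	/* nstype = CLONE_NEWNET guards against joining the wrong type. */
	if (setns(fd, CLONE_NEWNET) != 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	close(fd);

	execvp(cmd[0], cmd);
	return -1;	/* reached only if execvp() failed */
}
```

Compared with a login daemon or a ptrace-driven helper inside the container, the caller needs only one open and one setns call before the exec, and, per the capability check in the patch, CAP_SYS_ADMIN.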