Message ID | 1222858454-7843-1-git-send-email-den@openvz.org |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Denis V. Lunev wrote: > This patch opens a way to connect via Unix socket from one namespace > to another if these sockets are opened via conventional filesystem > interface. Such approach allows to share important services between > namespaces in efficient way. > > This breach is controlled by the means of shared filesystem, i.e. if > somebody really wants to isolate containers, he should start from > filesystem separation. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > --- > net/unix/af_unix.c | 3 --- > 1 files changed, 0 insertions(+), 3 deletions(-) > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index 39d2173..0e1eccd 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i) > &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) { > struct dentry *dentry = unix_sk(s)->dentry; > > - if (!net_eq(sock_net(s), net)) > - continue; > - > if(dentry && dentry->d_inode == i) > { > sock_hold(s); Hi Denis, Do you have a list of the important services this isolation forbids ? (I suppose there is syslog). Without this isolation, we can not ensure the container will not connect to an application outside of the namespace. How can we handle this case for the checkpoint-restart/migration ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote: > Denis V. Lunev wrote: > > This patch opens a way to connect via Unix socket from one namespace > > to another if these sockets are opened via conventional filesystem > > interface. Such approach allows to share important services between > > namespaces in efficient way. > > > > This breach is controlled by the means of shared filesystem, i.e. if > > somebody really wants to isolate containers, he should start from > > filesystem separation. > > > > Signed-off-by: Denis V. Lunev <den@openvz.org> > > --- > > net/unix/af_unix.c | 3 --- > > 1 files changed, 0 insertions(+), 3 deletions(-) > > > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > > index 39d2173..0e1eccd 100644 > > --- a/net/unix/af_unix.c > > +++ b/net/unix/af_unix.c > > @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i) > > &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) { > > struct dentry *dentry = unix_sk(s)->dentry; > > > > - if (!net_eq(sock_net(s), net)) > > - continue; > > - > > if(dentry && dentry->d_inode == i) > > { > > sock_hold(s); > > Hi Denis, > > Do you have a list of the important services this isolation forbids ? (I > suppose there is syslog). we have asked from our customers for a shared MySQL server The full story is here :) http://bugzilla.openvz.org/show_bug.cgi?id=985 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Denis V. Lunev wrote: > On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote: >> Denis V. Lunev wrote: >>> This patch opens a way to connect via Unix socket from one namespace >>> to another if these sockets are opened via conventional filesystem >>> interface. Such approach allows to share important services between >>> namespaces in efficient way. >>> >>> This breach is controlled by the means of shared filesystem, i.e. if >>> somebody really wants to isolate containers, he should start from >>> filesystem separation. >>> >>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>> --- >>> net/unix/af_unix.c | 3 --- >>> 1 files changed, 0 insertions(+), 3 deletions(-) >>> >>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c >>> index 39d2173..0e1eccd 100644 >>> --- a/net/unix/af_unix.c >>> +++ b/net/unix/af_unix.c >>> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i) >>> &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) { >>> struct dentry *dentry = unix_sk(s)->dentry; >>> >>> - if (!net_eq(sock_net(s), net)) >>> - continue; >>> - >>> if(dentry && dentry->d_inode == i) >>> { >>> sock_hold(s); >> Hi Denis, >> >> Do you have a list of the important services this isolation forbids ? (I >> suppose there is syslog). > > we have asked from our customers for a shared MySQL server > > The full story is here :) > http://bugzilla.openvz.org/show_bug.cgi?id=985 Ok, thanks. My question remains :) How do you handle migration in this case ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-10-01 at 13:55 +0200, Daniel Lezcano wrote: > Denis V. Lunev wrote: > > On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote: > >> Denis V. Lunev wrote: > >>> This patch opens a way to connect via Unix socket from one namespace > >>> to another if these sockets are opened via conventional filesystem > >>> interface. Such approach allows to share important services between > >>> namespaces in efficient way. > >>> > >>> This breach is controlled by the means of shared filesystem, i.e. if > >>> somebody really wants to isolate containers, he should start from > >>> filesystem separation. > >>> > >>> Signed-off-by: Denis V. Lunev <den@openvz.org> > >>> --- > >>> net/unix/af_unix.c | 3 --- > >>> 1 files changed, 0 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > >>> index 39d2173..0e1eccd 100644 > >>> --- a/net/unix/af_unix.c > >>> +++ b/net/unix/af_unix.c > >>> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i) > >>> &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) { > >>> struct dentry *dentry = unix_sk(s)->dentry; > >>> > >>> - if (!net_eq(sock_net(s), net)) > >>> - continue; > >>> - > >>> if(dentry && dentry->d_inode == i) > >>> { > >>> sock_hold(s); > >> Hi Denis, > >> > >> Do you have a list of the important services this isolation forbids ? (I > >> suppose there is syslog). > > > > we have asked from our customers for a shared MySQL server > > > > The full story is here :) > > http://bugzilla.openvz.org/show_bug.cgi?id=985 > > Ok, thanks. > > My question remains :) > > How do you handle migration in this case ? There is no problem until you really have listeners from different namespaces on both ends. This is checked after the freeze stage and migration is forbidden if such a situation is detected. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Denis V. Lunev wrote: > On Wed, 2008-10-01 at 13:55 +0200, Daniel Lezcano wrote: >> Denis V. Lunev wrote: >>> On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote: >>>> Denis V. Lunev wrote: >>>>> This patch opens a way to connect via Unix socket from one namespace >>>>> to another if these sockets are opened via conventional filesystem >>>>> interface. Such approach allows to share important services between >>>>> namespaces in efficient way. >>>>> >>>>> This breach is controlled by the means of shared filesystem, i.e. if >>>>> somebody really wants to isolate containers, he should start from >>>>> filesystem separation. >>>>> >>>>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>>>> --- >>>>> net/unix/af_unix.c | 3 --- >>>>> 1 files changed, 0 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c >>>>> index 39d2173..0e1eccd 100644 >>>>> --- a/net/unix/af_unix.c >>>>> +++ b/net/unix/af_unix.c >>>>> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i) >>>>> &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) { >>>>> struct dentry *dentry = unix_sk(s)->dentry; >>>>> >>>>> - if (!net_eq(sock_net(s), net)) >>>>> - continue; >>>>> - >>>>> if(dentry && dentry->d_inode == i) >>>>> { >>>>> sock_hold(s); >>>> Hi Denis, >>>> >>>> Do you have a list of the important services this isolation forbids ? (I >>>> suppose there is syslog). >>> we have asked from our customers for a shared MySQL server >>> >>> The full story is here :) >>> http://bugzilla.openvz.org/show_bug.cgi?id=985 >> Ok, thanks. >> >> My question remains :) >> >> How do you handle migration in this case ? > > There is no problem until you really have listeners from different > namespaces on both ends. This is checked after the freeze stage and > migration is forbidden if such a situation is detected. So there are 2 cases: * full isolation : restriction on VPS * partial isolation : no restriction but *perhaps* problem when migrating Looks like we need an option per namespace to reduce the isolation for af_unix sockets :) - on (default): current behaviour => full isolation - off : partial isolation -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> So there are 2 cases: > * full isolation : restriction on VPS > * partial isolation : no restriction but *perhaps* problem when migrating > > Looks like we need an option per namespace to reduce the isolation for > af_unix sockets :) > - on (default): current behaviour => full isolation > - off : partial isolation You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov wrote: >> So there are 2 cases: >> * full isolation : restriction on VPS >> * partial isolation : no restriction but *perhaps* problem when migrating >> >> Looks like we need an option per namespace to reduce the isolation for >> af_unix sockets :) >> - on (default): current behaviour => full isolation >> - off : partial isolation > > You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? Yes. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Daniel Lezcano wrote: > Pavel Emelyanov wrote: >>> So there are 2 cases: >>> * full isolation : restriction on VPS >>> * partial isolation : no restriction but *perhaps* problem when migrating >>> >>> Looks like we need an option per namespace to reduce the isolation for >>> af_unix sockets :) >>> - on (default): current behaviour => full isolation >>> - off : partial isolation >> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? > > Yes. OK. Den, please, do :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov wrote: > Daniel Lezcano wrote: >> Pavel Emelyanov wrote: >>>> So there are 2 cases: >>>> * full isolation : restriction on VPS >>>> * partial isolation : no restriction but *perhaps* problem when migrating >>>> >>>> Looks like we need an option per namespace to reduce the isolation for >>>> af_unix sockets :) >>>> - on (default): current behaviour => full isolation >>>> - off : partial isolation >>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? >> Yes. > > OK. Den, please, do :) hmm, would that allow sibling namespaces to connect to each other ? If so, I'm not in favor of such a solution. I understand the need. we had a similar issue with the command line tool pgsl. Could we work something out with the capabilities ? or make an exception if your ->nsproxy->net_ns == init_net ? Thanks, C. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote: > Pavel Emelyanov wrote: > >> So there are 2 cases: > >> * full isolation : restriction on VPS > >> * partial isolation : no restriction but *perhaps* problem when migrating > >> > >> Looks like we need an option per namespace to reduce the isolation for > >> af_unix sockets :) > >> - on (default): current behaviour => full isolation > >> - off : partial isolation > > > > You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? > > Yes. I do not see much sense with sysctl as: - check (cross-connected sockets) is required as we can start namespace with already opened socket - this kind of sharing is not implicit but explicit as normal isolated containers _must_ have separate filesystems. In this case this sharing requires explicit host administrator action to link socket between containers -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Denis V. Lunev wrote: > On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote: >> Pavel Emelyanov wrote: >>>> So there are 2 cases: >>>> * full isolation : restriction on VPS >>>> * partial isolation : no restriction but *perhaps* problem when migrating >>>> >>>> Looks like we need an option per namespace to reduce the isolation for >>>> af_unix sockets :) >>>> - on (default): current behaviour => full isolation >>>> - off : partial isolation >>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? >> Yes. > > I do not see much sense with sysctl as: > - check (cross-connected sockets) is required as we can start namespace > with already opened socket Check when checkpointing ? If you inherit a socket from your parent namespace, this socket belongs to your parent and you should not checkpoint it, no ? In case you allow cross-connected sockets, this check is mandatory I agree. > - this kind of sharing is not implicit but explicit as normal isolated > containers _must_ have separate filesystems. In this case this > sharing requires explicit host administrator action to link socket > between containers What are "normal isolated containers" ? Are they OpenVZ containers ? These containers belong to the system containers family. What happens with application containers, if I want to share the filesystem without breaking the isolation of the afunix sockets ? The current code provides full isolation and this is in mainline. I don't think it is reasonable to change that. What I propose is to keep the current behaviour. When you create a network namespace, you can change the behaviour inside this namespace via /proc/sys/net/unix/isolated (for example). This option allows: 1 - to connect to af_unix not belonging to the container 2 - to accept af_unix connection from outside the container (avoid a container to forbid the checkpoint of another container); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Cedric Le Goater wrote: > Pavel Emelyanov wrote: >> Daniel Lezcano wrote: >>> Pavel Emelyanov wrote: >>>>> So there are 2 cases: >>>>> * full isolation : restriction on VPS >>>>> * partial isolation : no restriction but *perhaps* problem when migrating >>>>> >>>>> Looks like we need an option per namespace to reduce the isolation for >>>>> af_unix sockets :) >>>>> - on (default): current behaviour => full isolation >>>>> - off : partial isolation >>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? >>> Yes. >> OK. Den, please, do :) > > hmm, would that allow sibling namespaces to connect to each other ? If so, > I'm not in favor of such a solution. > > I understand the need. we had a similar issue with the command line tool > pgsl. Could we work something out with the capabilities ? or make an > exception if your ->nsproxy->net_ns == init_net ? Why capabilities is better than a simple sysctl ? Making an exception for init_net will break the nested containers no ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-10-01 at 15:46 +0200, Daniel Lezcano wrote: > Denis V. Lunev wrote: > > On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote: > >> Pavel Emelyanov wrote: > >>>> So there are 2 cases: > >>>> * full isolation : restriction on VPS > >>>> * partial isolation : no restriction but *perhaps* problem when migrating > >>>> > >>>> Looks like we need an option per namespace to reduce the isolation for > >>>> af_unix sockets :) > >>>> - on (default): current behaviour => full isolation > >>>> - off : partial isolation > >>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? > >> Yes. > > > > I do not see much sense with sysctl as: > > - check (cross-connected sockets) is required as we can start namespace > > with already opened socket > > Check when checkpointing ? If you inherit a socket from your parent > namespace, this socket belongs to your parent and you should not > checkpoint it, no ? > > In case you allow cross-connected sockets, this check is mandatory I agree. > > > - this kind of sharing is not implicit but explicit as normal isolated > > containers _must_ have separate filesystems. In this case this > > sharing requires explicit host administrator action to link socket > > between containers > > What are "normal isolated containers" ? Are they OpenVZ containers ? > These containers belong to the system containers family. What happens > with application containers, if I want to share the filesystem without > breaking the isolation of the afunix sockets ? then you are doomed as you will have a FIFO opened from 2 namespaces and checking the absences of external references is still mandatory > The current code provides full isolation and this is in mainline. I > don't think it is reasonable to change that. What I propose is to keep > the current behaviour. > > When you create a network namespace, you can change the behaviour inside > this namespace via /proc/sys/net/unix/isolated (for example). > > This option allows: > 1 - to connect to af_unix not belonging to the container > 2 - to accept af_unix connection from outside the container (avoid a > container to forbid the checkpoint of another container); this should be at least per/namespace option controlled from parent container from my POW -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Daniel Lezcano wrote: > Cedric Le Goater wrote: >> Pavel Emelyanov wrote: >>> Daniel Lezcano wrote: >>>> Pavel Emelyanov wrote: >>>>>> So there are 2 cases: >>>>>> * full isolation : restriction on VPS >>>>>> * partial isolation : no restriction but *perhaps* problem when migrating >>>>>> >>>>>> Looks like we need an option per namespace to reduce the isolation for >>>>>> af_unix sockets :) >>>>>> - on (default): current behaviour => full isolation >>>>>> - off : partial isolation >>>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? >>>> Yes. >>> OK. Den, please, do :) >> hmm, would that allow sibling namespaces to connect to each other ? If so, >> I'm not in favor of such a solution. >> >> I understand the need. we had a similar issue with the command line tool >> pgsl. Could we work something out with the capabilities ? or make an >> exception if your ->nsproxy->net_ns == init_net ? > > Why capabilities is better than a simple sysctl ? because it depends on the current process privilege and not just some random process. > Making an exception for init_net will break the nested containers no ? may be. I don't know how this is implemented. if we break isolation, my feeling is that we should only do it for a parent namespace. It just feel wrong to allow sibling namespaces to connect to each other. C. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Denis V. Lunev wrote: > On Wed, 2008-10-01 at 15:46 +0200, Daniel Lezcano wrote: >> Denis V. Lunev wrote: >>> On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote: >>>> Pavel Emelyanov wrote: >>>>>> So there are 2 cases: >>>>>> * full isolation : restriction on VPS >>>>>> * partial isolation : no restriction but *perhaps* problem when migrating >>>>>> >>>>>> Looks like we need an option per namespace to reduce the isolation for >>>>>> af_unix sockets :) >>>>>> - on (default): current behaviour => full isolation >>>>>> - off : partial isolation >>>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode? >>>> Yes. >>> I do not see much sense with sysctl as: >>> - check (cross-connected sockets) is required as we can start namespace >>> with already opened socket >> Check when checkpointing ? If you inherit a socket from your parent >> namespace, this socket belongs to your parent and you should not >> checkpoint it, no ? >> >> In case you allow cross-connected sockets, this check is mandatory I agree. >> >>> - this kind of sharing is not implicit but explicit as normal isolated >>> containers _must_ have separate filesystems. In this case this >>> sharing requires explicit host administrator action to link socket >>> between containers >> What are "normal isolated containers" ? Are they OpenVZ containers ? >> These containers belong to the system containers family. What happens >> with application containers, if I want to share the filesystem without >> breaking the isolation of the afunix sockets ? > > then you are doomed as you will have a FIFO opened from 2 namespaces and > checking the absences of external references is still mandatory >> The current code provides full isolation and this is in mainline. I >> don't think it is reasonable to change that. What I propose is to keep >> the current behaviour. >> >> When you create a network namespace, you can change the behaviour inside >> this namespace via /proc/sys/net/unix/isolated (for example). >> >> This option allows: >> 1 - to connect to af_unix not belonging to the container >> 2 - to accept af_unix connection from outside the container (avoid a >> container to forbid the checkpoint of another container); > > this should be at least per/namespace option controlled from parent > container from my POW Yes per namespace, I agree. If the option is controlled by the parent and it is done by sysctl, you will have to make proc/sys per namespace like Pavel did with /proc/net, no ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Yes per namespace, I agree. > > If the option is controlled by the parent and it is done by sysctl, you > will have to make proc/sys per namespace like Pavel did with /proc/net, no ? /proc/sys is already per namespace actually ;) Or what did you mean by that? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov wrote: >> Yes per namespace, I agree. >> >> If the option is controlled by the parent and it is done by sysctl, you >> will have to make proc/sys per namespace like Pavel did with /proc/net, no ? > > /proc/sys is already per namespace actually ;) Or what did you mean by that? Effectively I was not clear :) I meant, you can not access /proc/sys from outside the namespace like /proc/net which can be followed up by /proc/<pid>/net outside the namespace. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Daniel Lezcano wrote: > Pavel Emelyanov wrote: >>> Yes per namespace, I agree. >>> >>> If the option is controlled by the parent and it is done by sysctl, you >>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ? >> /proc/sys is already per namespace actually ;) Or what did you mean by that? > > > Effectively I was not clear :) > > I meant, you can not access /proc/sys from outside the namespace like > /proc/net which can be followed up by /proc/<pid>/net outside the namespace. Ah! I've got it. Well, I think after Al Viro finishes with sysctl rework this possibility will appear, but Denis actually persuaded me in his POV - if we do want to disable shared sockets we *can* do this by putting containers in proper mount namespaces of chroot environments. > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov wrote: > Daniel Lezcano wrote: >> Pavel Emelyanov wrote: >>>> Yes per namespace, I agree. >>>> >>>> If the option is controlled by the parent and it is done by sysctl, you >>>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ? >>> /proc/sys is already per namespace actually ;) Or what did you mean by that? >> >> Effectively I was not clear :) >> >> I meant, you can not access /proc/sys from outside the namespace like >> /proc/net which can be followed up by /proc/<pid>/net outside the namespace. > > Ah! I've got it. Well, I think after Al Viro finishes with sysctl > rework this possibility will appear, but Denis actually persuaded me > in his POV - if we do want to disable shared sockets we *can* do this > by putting containers in proper mount namespaces of chroot environments. And I agree with this point. But :) 1 - the current behaviour is full isolation. Shall we/can we change that without taking into account there are perhaps some people using this today ? I don't know. 2 - I wish to launch a non chrooted application inside a namespace, sharing the file system without sharing the af_unix sockets, because I don't want the application running inside the container overlap with the socket af_unix of another container. I prefer to detect a collision with a strong isolation and handle it manually (remount some part of the fs for example). 3 - I would like to be able to reduce this isolation (your point) to share the af_unix socket for example to use /dev/klog or something else. I don't know how much we can consider the point 1, 2 pertinent, but disabling 3 lines of code via a sysctl with strong isolation as default and having a process unsharing the namespace in userspace and changing this value to less isolation is not a big challenge IMHO :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-10-01 at 18:15 +0200, Daniel Lezcano wrote: > Pavel Emelyanov wrote: > > Daniel Lezcano wrote: > >> Pavel Emelyanov wrote: > >>>> Yes per namespace, I agree. > >>>> > >>>> If the option is controlled by the parent and it is done by sysctl, you > >>>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ? > >>> /proc/sys is already per namespace actually ;) Or what did you mean by that? > >> > >> Effectively I was not clear :) > >> > >> I meant, you can not access /proc/sys from outside the namespace like > >> /proc/net which can be followed up by /proc/<pid>/net outside the namespace. > > > > Ah! I've got it. Well, I think after Al Viro finishes with sysctl > > rework this possibility will appear, but Denis actually persuaded me > > in his POV - if we do want to disable shared sockets we *can* do this > > by putting containers in proper mount namespaces of chroot environments. > > And I agree with this point. But :) > > 1 - the current behaviour is full isolation. Shall we/can we change > that without taking into account there are perhaps some people using > this today ? I don't know. We have a direct request from people using to remove this state of isolation. > 2 - I wish to launch a non chrooted application inside a namespace, > sharing the file system without sharing the af_unix sockets, because I > don't want the application running inside the container overlap with the > socket af_unix of another container. I prefer to detect a collision with > a strong isolation and handle it manually (remount some part of the fs > for example). with common filesystem you have to detect collisions at least for FIFOs. This situation is the same. Basically, if we'll treat named Unix sockets as an improved FIFO - it's better to use the same approach > 3 - I would like to be able to reduce this isolation (your point) to > share the af_unix socket for example to use /dev/klog or something else. > > I don't know how much we can consider the point 1, 2 pertinent, but > disabling 3 lines of code via a sysctl with strong isolation as default > and having a process unsharing the namespace in userspace and changing > this value to less isolation is not a big challenge IMHO :) the real questions is _who_ is responsible for this kind of staff -> node (parent container) administrator or container administrator. I strongly vote for first. Also if we are talking about such kind of staff, I dislike global kludge. This should be a property of two concrete VEs and better two concrete sockets. Unfortunately, setsockopt is not an option :( -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I do not believe we have yet met the burden of proof necessary to make the proposed semantic change. "Denis V. Lunev" <den@openvz.org> writes: > On Wed, 2008-10-01 at 18:15 +0200, Daniel Lezcano wrote: >> >> 1 - the current behaviour is full isolation. Shall we/can we change >> that without taking into account there are perhaps some people using >> this today ? I don't know. > We have a direct request from people using to remove this state of > isolation. Users of a brittle but convenient hack not application developers. Which means there is demand for this class of communication but it doesn't mean that the suggested way is the right way. >> 2 - I wish to launch a non chrooted application inside a namespace, >> sharing the file system without sharing the af_unix sockets, because I >> don't want the application running inside the container overlap with the >> socket af_unix of another container. I prefer to detect a collision with >> a strong isolation and handle it manually (remount some part of the fs >> for example). > with common filesystem you have to detect collisions at least for FIFOs. > This situation is the same. Basically, if we'll treat named Unix sockets > as an improved FIFO - it's better to use the same approach There are two aspects to this. 1) Looking up the proxy object in the filesystem. With that I agree that the file system access rules are sufficient. If we don't want the possibility of proxy objects you simply configure the system so that there are no shared files between namespaces. 2) How we interpret the proxy object. Currently the shared filesystem aka NFS precedent is that you can see the unix domain socket but not use it. Just what we have implemented now. As for FIFO you are be right that there is a potential bug there. Likely those fall in the yet to be implemented device namespace. It is certainly an area of the code we have not audited and thought about, so there is no precedent set. >> 3 - I would like to be able to reduce this isolation (your point) to >> share the af_unix socket for example to use /dev/klog or something else. >> >> I don't know how much we can consider the point 1, 2 pertinent, but >> disabling 3 lines of code via a sysctl with strong isolation as default >> and having a process unsharing the namespace in userspace and changing >> this value to less isolation is not a big challenge IMHO :) > the real questions is _who_ is responsible for this kind of staff -> > node (parent container) administrator or container administrator. I > strongly vote for first. > > Also if we are talking about such kind of staff, I dislike global > kludge. This should be a property of two concrete VEs and better two > concrete sockets. Unfortunately, setsockopt is not an option :( We support sockets from different namespaces in a single process. I agree that to keep the hack working we can not use setsockopt. I don't agree that we want to keep the hack working. - The code needs an audit to think about what it means to exchange packets between unix domain sockets in different network namespaces. There is surprising a lot of code in veth to accommodate that. If we have too many places where we need to do something strange it is going to make code maintenance difficult. - The semantics get stranger with respect to interpreting unix domain socket proxy objects. Sometimes we can connect to someplace non-local and sometimes we can't. - We can do this without changing the semantics of how socket proxy objects are interpreted, as it is possible for a process to use sockets in two different network namespaces. That requires application level changes. - It is prohibitively difficult to implement unix domain sockets that talk between different kernels (file descriptors sharing offsets and garbage collection of in flight sockets ouch!). This means that encouraging the use of unix domain sockets to transparently connect applications in different containers is probably a bad idea. This is completely different from FIFO's that are simple enough it is not hard to relay data between machines and get it right. - I don't know how much using unix domain sockets to talk between different domains transparently will confuse applications, and increase the risk of security exploits. I don't think it will be much though as we already have unix permissions checking from the proxy object, and we should be handling the namespace transitions of passed objects in the code that sends credentials and file descriptors. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 39d2173..0e1eccd 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i) &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) { struct dentry *dentry = unix_sk(s)->dentry; - if (!net_eq(sock_net(s), net)) - continue; - if(dentry && dentry->d_inode == i) { sock_hold(s);
This patch opens a way to connect via Unix socket from one namespace to another if these sockets are opened via conventional filesystem interface. Such approach allows to share important services between namespaces in efficient way. This breach is controlled by the means of shared filesystem, i.e. if somebody really wants to isolate containers, he should start from filesystem separation. Signed-off-by: Denis V. Lunev <den@openvz.org> --- net/unix/af_unix.c | 3 --- 1 files changed, 0 insertions(+), 3 deletions(-)