diff mbox

[net-next,RFC] netns: enable cross-ve Unix sockets

Message ID 1222858454-7843-1-git-send-email-den@openvz.org
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Denis V. Lunev Oct. 1, 2008, 10:54 a.m. UTC
This patch opens a way to connect via Unix socket from one namespace
to another if these sockets are opened via conventional filesystem
interface. Such approach allows to share important services between
namespaces in efficient way.

This breach is controlled by the means of shared filesystem, i.e. if
somebody really wants to isolate containers, he should start from
filesystem separation.

Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 net/unix/af_unix.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

Comments

Daniel Lezcano Oct. 1, 2008, 11:13 a.m. UTC | #1
Denis V. Lunev wrote:
> This patch opens a way to connect via Unix socket from one namespace
> to another if these sockets are opened via conventional filesystem
> interface. Such approach allows to share important services between
> namespaces in efficient way.
> 
> This breach is controlled by the means of shared filesystem, i.e. if
> somebody really wants to isolate containers, he should start from
> filesystem separation.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  net/unix/af_unix.c |    3 ---
>  1 files changed, 0 insertions(+), 3 deletions(-)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 39d2173..0e1eccd 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
>  		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
>  		struct dentry *dentry = unix_sk(s)->dentry;
> 
> -		if (!net_eq(sock_net(s), net))
> -			continue;
> -
>  		if(dentry && dentry->d_inode == i)
>  		{
>  			sock_hold(s);

Hi Denis,

Do you have a list of the important services this isolation forbids ? (I 
suppose there is syslog).

Without this isolation, we can not ensure the container will not connect 
to an application outside of the namespace. How can we handle this case 
for the checkpoint-restart/migration ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denis V. Lunev Oct. 1, 2008, 11:32 a.m. UTC | #2
On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote:
> Denis V. Lunev wrote:
> > This patch opens a way to connect via Unix socket from one namespace
> > to another if these sockets are opened via conventional filesystem
> > interface. Such approach allows to share important services between
> > namespaces in efficient way.
> > 
> > This breach is controlled by the means of shared filesystem, i.e. if
> > somebody really wants to isolate containers, he should start from
> > filesystem separation.
> > 
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > ---
> >  net/unix/af_unix.c |    3 ---
> >  1 files changed, 0 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > index 39d2173..0e1eccd 100644
> > --- a/net/unix/af_unix.c
> > +++ b/net/unix/af_unix.c
> > @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
> >  		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
> >  		struct dentry *dentry = unix_sk(s)->dentry;
> > 
> > -		if (!net_eq(sock_net(s), net))
> > -			continue;
> > -
> >  		if(dentry && dentry->d_inode == i)
> >  		{
> >  			sock_hold(s);
> 
> Hi Denis,
> 
> Do you have a list of the important services this isolation forbids ? (I 
> suppose there is syslog).

we have asked from our customers for a shared MySQL server

The full story is here :)
http://bugzilla.openvz.org/show_bug.cgi?id=985

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 11:55 a.m. UTC | #3
Denis V. Lunev wrote:
> On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote:
>> Denis V. Lunev wrote:
>>> This patch opens a way to connect via Unix socket from one namespace
>>> to another if these sockets are opened via conventional filesystem
>>> interface. Such approach allows to share important services between
>>> namespaces in efficient way.
>>>
>>> This breach is controlled by the means of shared filesystem, i.e. if
>>> somebody really wants to isolate containers, he should start from
>>> filesystem separation.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> ---
>>>  net/unix/af_unix.c |    3 ---
>>>  1 files changed, 0 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>>> index 39d2173..0e1eccd 100644
>>> --- a/net/unix/af_unix.c
>>> +++ b/net/unix/af_unix.c
>>> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
>>>  		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
>>>  		struct dentry *dentry = unix_sk(s)->dentry;
>>>
>>> -		if (!net_eq(sock_net(s), net))
>>> -			continue;
>>> -
>>>  		if(dentry && dentry->d_inode == i)
>>>  		{
>>>  			sock_hold(s);
>> Hi Denis,
>>
>> Do you have a list of the important services this isolation forbids ? (I 
>> suppose there is syslog).
> 
> we have asked from our customers for a shared MySQL server
> 
> The full story is here :)
> http://bugzilla.openvz.org/show_bug.cgi?id=985

Ok, thanks.

My question remains :)

How do you handle migration in this case ?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denis V. Lunev Oct. 1, 2008, 12:03 p.m. UTC | #4
On Wed, 2008-10-01 at 13:55 +0200, Daniel Lezcano wrote:
> Denis V. Lunev wrote:
> > On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote:
> >> Denis V. Lunev wrote:
> >>> This patch opens a way to connect via Unix socket from one namespace
> >>> to another if these sockets are opened via conventional filesystem
> >>> interface. Such approach allows to share important services between
> >>> namespaces in efficient way.
> >>>
> >>> This breach is controlled by the means of shared filesystem, i.e. if
> >>> somebody really wants to isolate containers, he should start from
> >>> filesystem separation.
> >>>
> >>> Signed-off-by: Denis V. Lunev <den@openvz.org>
> >>> ---
> >>>  net/unix/af_unix.c |    3 ---
> >>>  1 files changed, 0 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >>> index 39d2173..0e1eccd 100644
> >>> --- a/net/unix/af_unix.c
> >>> +++ b/net/unix/af_unix.c
> >>> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
> >>>  		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
> >>>  		struct dentry *dentry = unix_sk(s)->dentry;
> >>>
> >>> -		if (!net_eq(sock_net(s), net))
> >>> -			continue;
> >>> -
> >>>  		if(dentry && dentry->d_inode == i)
> >>>  		{
> >>>  			sock_hold(s);
> >> Hi Denis,
> >>
> >> Do you have a list of the important services this isolation forbids ? (I 
> >> suppose there is syslog).
> > 
> > we have asked from our customers for a shared MySQL server
> > 
> > The full story is here :)
> > http://bugzilla.openvz.org/show_bug.cgi?id=985
> 
> Ok, thanks.
> 
> My question remains :)
> 
> How do you handle migration in this case ?

There is no problem until you really have listeners from different
namespaces on both ends. This is checked after the freeze stage and
migration is forbidden if such a situation is detected.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 12:19 p.m. UTC | #5
Denis V. Lunev wrote:
> On Wed, 2008-10-01 at 13:55 +0200, Daniel Lezcano wrote:
>> Denis V. Lunev wrote:
>>> On Wed, 2008-10-01 at 13:13 +0200, Daniel Lezcano wrote:
>>>> Denis V. Lunev wrote:
>>>>> This patch opens a way to connect via Unix socket from one namespace
>>>>> to another if these sockets are opened via conventional filesystem
>>>>> interface. Such approach allows to share important services between
>>>>> namespaces in efficient way.
>>>>>
>>>>> This breach is controlled by the means of shared filesystem, i.e. if
>>>>> somebody really wants to isolate containers, he should start from
>>>>> filesystem separation.
>>>>>
>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>>> ---
>>>>>  net/unix/af_unix.c |    3 ---
>>>>>  1 files changed, 0 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>>>>> index 39d2173..0e1eccd 100644
>>>>> --- a/net/unix/af_unix.c
>>>>> +++ b/net/unix/af_unix.c
>>>>> @@ -297,9 +297,6 @@ static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
>>>>>  		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
>>>>>  		struct dentry *dentry = unix_sk(s)->dentry;
>>>>>
>>>>> -		if (!net_eq(sock_net(s), net))
>>>>> -			continue;
>>>>> -
>>>>>  		if(dentry && dentry->d_inode == i)
>>>>>  		{
>>>>>  			sock_hold(s);
>>>> Hi Denis,
>>>>
>>>> Do you have a list of the important services this isolation forbids ? (I 
>>>> suppose there is syslog).
>>> we have asked from our customers for a shared MySQL server
>>>
>>> The full story is here :)
>>> http://bugzilla.openvz.org/show_bug.cgi?id=985
>> Ok, thanks.
>>
>> My question remains :)
>>
>> How do you handle migration in this case ?
> 
> There is no problem until you really have listeners from different
> namespaces on both ends. This is checked after the freeze stage and
> migration is forbidden if such a situation is detected.

So there are 2 cases:
  * full isolation : restriction on VPS
  * partial isolation : no restriction but *perhaps* problem when migrating

Looks like we need an option per namespace to reduce the isolation for 
af_unix sockets :)
  - on (default): current behaviour => full isolation
  - off : partial isolation




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov Oct. 1, 2008, 12:24 p.m. UTC | #6
> So there are 2 cases:
>   * full isolation : restriction on VPS
>   * partial isolation : no restriction but *perhaps* problem when migrating
> 
> Looks like we need an option per namespace to reduce the isolation for 
> af_unix sockets :)
>   - on (default): current behaviour => full isolation
>   - off : partial isolation

You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 12:31 p.m. UTC | #7
Pavel Emelyanov wrote:
>> So there are 2 cases:
>>   * full isolation : restriction on VPS
>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>
>> Looks like we need an option per namespace to reduce the isolation for 
>> af_unix sockets :)
>>   - on (default): current behaviour => full isolation
>>   - off : partial isolation
> 
> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?

Yes.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov Oct. 1, 2008, 12:40 p.m. UTC | #8
Daniel Lezcano wrote:
> Pavel Emelyanov wrote:
>>> So there are 2 cases:
>>>   * full isolation : restriction on VPS
>>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>>
>>> Looks like we need an option per namespace to reduce the isolation for 
>>> af_unix sockets :)
>>>   - on (default): current behaviour => full isolation
>>>   - off : partial isolation
>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
> 
> Yes.

OK. Den, please, do :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cédric Le Goater Oct. 1, 2008, 1:08 p.m. UTC | #9
Pavel Emelyanov wrote:
> Daniel Lezcano wrote:
>> Pavel Emelyanov wrote:
>>>> So there are 2 cases:
>>>>   * full isolation : restriction on VPS
>>>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>>>
>>>> Looks like we need an option per namespace to reduce the isolation for 
>>>> af_unix sockets :)
>>>>   - on (default): current behaviour => full isolation
>>>>   - off : partial isolation
>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
>> Yes.
> 
> OK. Den, please, do :)

hmm, would that allow sibling namespaces to connect to each other ? If so, 
I'm not in favor of such a solution. 

I understand the need. we had a similar issue with the command line tool 
pgsl. Could we work something out with the capabilities ? or make an 
exception if your ->nsproxy->net_ns == init_net ? 

Thanks,

C.  
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denis V. Lunev Oct. 1, 2008, 1:11 p.m. UTC | #10
On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote:
> Pavel Emelyanov wrote:
> >> So there are 2 cases:
> >>   * full isolation : restriction on VPS
> >>   * partial isolation : no restriction but *perhaps* problem when migrating
> >>
> >> Looks like we need an option per namespace to reduce the isolation for 
> >> af_unix sockets :)
> >>   - on (default): current behaviour => full isolation
> >>   - off : partial isolation
> > 
> > You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
> 
> Yes.

I do not see much sense with sysctl as:
- check (cross-connected sockets) is required as we can start namespace
  with already opened socket
- this kind of sharing is not implicit but explicit as normal isolated
  containers _must_ have separate filesystems. In this case this
  sharing requires explicit host administrator action to link socket
  between containers

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 1:46 p.m. UTC | #11
Denis V. Lunev wrote:
> On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote:
>> Pavel Emelyanov wrote:
>>>> So there are 2 cases:
>>>>   * full isolation : restriction on VPS
>>>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>>>
>>>> Looks like we need an option per namespace to reduce the isolation for 
>>>> af_unix sockets :)
>>>>   - on (default): current behaviour => full isolation
>>>>   - off : partial isolation
>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
>> Yes.
> 
> I do not see much sense with sysctl as:
> - check (cross-connected sockets) is required as we can start namespace
>   with already opened socket

Check when checkpointing ? If you inherit a socket from your parent 
namespace, this socket belongs to your parent and you should not 
checkpoint it, no ?

In case you allow cross-connected sockets, this check is mandatory I agree.

> - this kind of sharing is not implicit but explicit as normal isolated
>   containers _must_ have separate filesystems. In this case this
>   sharing requires explicit host administrator action to link socket
>   between containers

What are "normal isolated containers" ? Are they OpenVZ containers ? 
These containers belong to the system containers family. What happens 
with application containers, if I want to share the filesystem without 
breaking the isolation of the afunix sockets ?

The current code provides full isolation and this is in mainline. I 
don't think it is reasonable to change that. What I propose is to keep 
the current behaviour.

When you create a network namespace, you can change the behaviour inside 
this namespace via /proc/sys/net/unix/isolated (for example).

This option allows:
1 - to connect to af_unix not belonging to the container
2 - to accept af_unix connection from outside the container (avoid a 
container to forbid the checkpoint of another container);





--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 1:50 p.m. UTC | #12
Cedric Le Goater wrote:
> Pavel Emelyanov wrote:
>> Daniel Lezcano wrote:
>>> Pavel Emelyanov wrote:
>>>>> So there are 2 cases:
>>>>>   * full isolation : restriction on VPS
>>>>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>>>>
>>>>> Looks like we need an option per namespace to reduce the isolation for 
>>>>> af_unix sockets :)
>>>>>   - on (default): current behaviour => full isolation
>>>>>   - off : partial isolation
>>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
>>> Yes.
>> OK. Den, please, do :)
> 
> hmm, would that allow sibling namespaces to connect to each other ? If so, 
> I'm not in favor of such a solution. 
> 
> I understand the need. we had a similar issue with the command line tool 
> pgsl. Could we work something out with the capabilities ? or make an 
> exception if your ->nsproxy->net_ns == init_net ? 

Why capabilities is better than a simple sysctl ?
Making an exception for init_net will break the nested containers no ?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denis V. Lunev Oct. 1, 2008, 2:54 p.m. UTC | #13
On Wed, 2008-10-01 at 15:46 +0200, Daniel Lezcano wrote:
> Denis V. Lunev wrote:
> > On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote:
> >> Pavel Emelyanov wrote:
> >>>> So there are 2 cases:
> >>>>   * full isolation : restriction on VPS
> >>>>   * partial isolation : no restriction but *perhaps* problem when migrating
> >>>>
> >>>> Looks like we need an option per namespace to reduce the isolation for 
> >>>> af_unix sockets :)
> >>>>   - on (default): current behaviour => full isolation
> >>>>   - off : partial isolation
> >>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
> >> Yes.
> > 
> > I do not see much sense with sysctl as:
> > - check (cross-connected sockets) is required as we can start namespace
> >   with already opened socket
> 
> Check when checkpointing ? If you inherit a socket from your parent 
> namespace, this socket belongs to your parent and you should not 
> checkpoint it, no ?
> 
> In case you allow cross-connected sockets, this check is mandatory I agree.
> 
> > - this kind of sharing is not implicit but explicit as normal isolated
> >   containers _must_ have separate filesystems. In this case this
> >   sharing requires explicit host administrator action to link socket
> >   between containers
> 
> What are "normal isolated containers" ? Are they OpenVZ containers ? 
> These containers belong to the system containers family. What happens 
> with application containers, if I want to share the filesystem without 
> breaking the isolation of the afunix sockets ?

then you are doomed as you will have a FIFO opened from 2 namespaces and
checking the absences of external references is still mandatory

> The current code provides full isolation and this is in mainline. I 
> don't think it is reasonable to change that. What I propose is to keep 
> the current behaviour.
> 
> When you create a network namespace, you can change the behaviour inside 
> this namespace via /proc/sys/net/unix/isolated (for example).
> 
> This option allows:
> 1 - to connect to af_unix not belonging to the container
> 2 - to accept af_unix connection from outside the container (avoid a 
> container to forbid the checkpoint of another container);

this should be at least per/namespace option controlled from parent
container from my POW

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cédric Le Goater Oct. 1, 2008, 3:07 p.m. UTC | #14
Daniel Lezcano wrote:
> Cedric Le Goater wrote:
>> Pavel Emelyanov wrote:
>>> Daniel Lezcano wrote:
>>>> Pavel Emelyanov wrote:
>>>>>> So there are 2 cases:
>>>>>>   * full isolation : restriction on VPS
>>>>>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>>>>>
>>>>>> Looks like we need an option per namespace to reduce the isolation for 
>>>>>> af_unix sockets :)
>>>>>>   - on (default): current behaviour => full isolation
>>>>>>   - off : partial isolation
>>>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
>>>> Yes.
>>> OK. Den, please, do :)
>> hmm, would that allow sibling namespaces to connect to each other ? If so, 
>> I'm not in favor of such a solution. 
>>
>> I understand the need. we had a similar issue with the command line tool 
>> pgsl. Could we work something out with the capabilities ? or make an 
>> exception if your ->nsproxy->net_ns == init_net ? 
> 
> Why capabilities is better than a simple sysctl ?

because it depends on the current process privilege and not just some
random process.

> Making an exception for init_net will break the nested containers no ?

may be. I don't know how this is implemented. if we break isolation, my
feeling is that we should only do it for a parent namespace. It just feel 
wrong to allow sibling namespaces to connect to each other. 

C. 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 3:18 p.m. UTC | #15
Denis V. Lunev wrote:
> On Wed, 2008-10-01 at 15:46 +0200, Daniel Lezcano wrote:
>> Denis V. Lunev wrote:
>>> On Wed, 2008-10-01 at 14:31 +0200, Daniel Lezcano wrote:
>>>> Pavel Emelyanov wrote:
>>>>>> So there are 2 cases:
>>>>>>   * full isolation : restriction on VPS
>>>>>>   * partial isolation : no restriction but *perhaps* problem when migrating
>>>>>>
>>>>>> Looks like we need an option per namespace to reduce the isolation for 
>>>>>> af_unix sockets :)
>>>>>>   - on (default): current behaviour => full isolation
>>>>>>   - off : partial isolation
>>>>> You mean some sysctl, that enables/disables this check in unix_find_socket_byinode?
>>>> Yes.
>>> I do not see much sense with sysctl as:
>>> - check (cross-connected sockets) is required as we can start namespace
>>>   with already opened socket
>> Check when checkpointing ? If you inherit a socket from your parent 
>> namespace, this socket belongs to your parent and you should not 
>> checkpoint it, no ?
>>
>> In case you allow cross-connected sockets, this check is mandatory I agree.
>>
>>> - this kind of sharing is not implicit but explicit as normal isolated
>>>   containers _must_ have separate filesystems. In this case this
>>>   sharing requires explicit host administrator action to link socket
>>>   between containers
>> What are "normal isolated containers" ? Are they OpenVZ containers ? 
>> These containers belong to the system containers family. What happens 
>> with application containers, if I want to share the filesystem without 
>> breaking the isolation of the afunix sockets ?
> 
> then you are doomed as you will have a FIFO opened from 2 namespaces and
> checking the absences of external references is still mandatory
>> The current code provides full isolation and this is in mainline. I 
>> don't think it is reasonable to change that. What I propose is to keep 
>> the current behaviour.
>>
>> When you create a network namespace, you can change the behaviour inside 
>> this namespace via /proc/sys/net/unix/isolated (for example).
>>
>> This option allows:
>> 1 - to connect to af_unix not belonging to the container
>> 2 - to accept af_unix connection from outside the container (avoid a 
>> container to forbid the checkpoint of another container);
> 
> this should be at least per/namespace option controlled from parent
> container from my POW

Yes per namespace, I agree.

If the option is controlled by the parent and it is done by sysctl, you 
will have to make proc/sys per namespace like Pavel did with /proc/net, no ?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov Oct. 1, 2008, 3:31 p.m. UTC | #16
> Yes per namespace, I agree.
> 
> If the option is controlled by the parent and it is done by sysctl, you 
> will have to make proc/sys per namespace like Pavel did with /proc/net, no ?

/proc/sys is already per namespace actually ;) Or what did you mean by that?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 3:38 p.m. UTC | #17
Pavel Emelyanov wrote:
>> Yes per namespace, I agree.
>>
>> If the option is controlled by the parent and it is done by sysctl, you 
>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ?
> 
> /proc/sys is already per namespace actually ;) Or what did you mean by that?


Effectively I was not clear :)

I meant, you can not access /proc/sys from outside the namespace like 
/proc/net which can be followed up by /proc/<pid>/net outside the namespace.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov Oct. 1, 2008, 3:42 p.m. UTC | #18
Daniel Lezcano wrote:
> Pavel Emelyanov wrote:
>>> Yes per namespace, I agree.
>>>
>>> If the option is controlled by the parent and it is done by sysctl, you 
>>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ?
>> /proc/sys is already per namespace actually ;) Or what did you mean by that?
> 
> 
> Effectively I was not clear :)
> 
> I meant, you can not access /proc/sys from outside the namespace like 
> /proc/net which can be followed up by /proc/<pid>/net outside the namespace.

Ah! I've got it. Well, I think after Al Viro finishes with sysctl
rework this possibility will appear, but Denis actually persuaded me
in his POV - if we do want to disable shared sockets we *can* do this
by putting containers in proper mount namespaces of chroot environments.

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano Oct. 1, 2008, 4:15 p.m. UTC | #19
Pavel Emelyanov wrote:
> Daniel Lezcano wrote:
>> Pavel Emelyanov wrote:
>>>> Yes per namespace, I agree.
>>>>
>>>> If the option is controlled by the parent and it is done by sysctl, you 
>>>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ?
>>> /proc/sys is already per namespace actually ;) Or what did you mean by that?
>>
>> Effectively I was not clear :)
>>
>> I meant, you can not access /proc/sys from outside the namespace like 
>> /proc/net which can be followed up by /proc/<pid>/net outside the namespace.
> 
> Ah! I've got it. Well, I think after Al Viro finishes with sysctl
> rework this possibility will appear, but Denis actually persuaded me
> in his POV - if we do want to disable shared sockets we *can* do this
> by putting containers in proper mount namespaces of chroot environments.

And I agree with this point. But :)

  1 - the current behaviour is full isolation. Shall we/can we change 
that without taking into account there are perhaps some people using 
this today ? I don't know.

  2 - I wish to launch a non chrooted application inside a namespace, 
sharing the file system without sharing the af_unix sockets, because I 
don't want the application running inside the container overlap with the 
socket af_unix of another container. I prefer to detect a collision with 
a strong isolation and handle it manually (remount some part of the fs 
for example).

  3 - I would like to be able to reduce this isolation (your point) to 
share the af_unix socket for example to use /dev/klog or something else.

I don't know how much we can consider the point 1, 2 pertinent, but 
disabling 3 lines of code via a sysctl with strong isolation as default 
and having a process unsharing the namespace in userspace and changing 
this value to less isolation is not a big challenge IMHO :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Denis V. Lunev Oct. 2, 2008, 10:21 a.m. UTC | #20
On Wed, 2008-10-01 at 18:15 +0200, Daniel Lezcano wrote:
> Pavel Emelyanov wrote:
> > Daniel Lezcano wrote:
> >> Pavel Emelyanov wrote:
> >>>> Yes per namespace, I agree.
> >>>>
> >>>> If the option is controlled by the parent and it is done by sysctl, you 
> >>>> will have to make proc/sys per namespace like Pavel did with /proc/net, no ?
> >>> /proc/sys is already per namespace actually ;) Or what did you mean by that?
> >>
> >> Effectively I was not clear :)
> >>
> >> I meant, you can not access /proc/sys from outside the namespace like 
> >> /proc/net which can be followed up by /proc/<pid>/net outside the namespace.
> > 
> > Ah! I've got it. Well, I think after Al Viro finishes with sysctl
> > rework this possibility will appear, but Denis actually persuaded me
> > in his POV - if we do want to disable shared sockets we *can* do this
> > by putting containers in proper mount namespaces of chroot environments.
> 
> And I agree with this point. But :)
> 
>   1 - the current behaviour is full isolation. Shall we/can we change 
> that without taking into account there are perhaps some people using 
> this today ? I don't know.
We have a direct request from people using to remove this state of
isolation.

>   2 - I wish to launch a non chrooted application inside a namespace, 
> sharing the file system without sharing the af_unix sockets, because I 
> don't want the application running inside the container overlap with the 
> socket af_unix of another container. I prefer to detect a collision with 
> a strong isolation and handle it manually (remount some part of the fs 
> for example).
with common filesystem you have to detect collisions at least for FIFOs.
This situation is the same. Basically, if we'll treat named Unix sockets
as an improved FIFO - it's better to use the same approach

>   3 - I would like to be able to reduce this isolation (your point) to 
> share the af_unix socket for example to use /dev/klog or something else.
> 
> I don't know how much we can consider the point 1, 2 pertinent, but 
> disabling 3 lines of code via a sysctl with strong isolation as default 
> and having a process unsharing the namespace in userspace and changing 
> this value to less isolation is not a big challenge IMHO :)
the real questions is _who_ is responsible for this kind of staff ->
node (parent container) administrator or container administrator. I
strongly vote for first.

Also if we are talking about such kind of staff, I dislike global
kludge. This should be a property of two concrete VEs and better two
concrete sockets. Unfortunately, setsockopt is not an option :(

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman Oct. 2, 2008, 8:03 p.m. UTC | #21
I do not believe we have yet met the burden of proof necessary to make the proposed
semantic change.

"Denis V. Lunev" <den@openvz.org> writes:

> On Wed, 2008-10-01 at 18:15 +0200, Daniel Lezcano wrote:
>> 
>>   1 - the current behaviour is full isolation. Shall we/can we change 
>> that without taking into account there are perhaps some people using 
>> this today ? I don't know.
> We have a direct request from people using to remove this state of
> isolation.

Users of a brittle but convenient hack not application developers.

Which means there is demand for this class of communication but it
doesn't mean that the suggested way is the right way.

>>   2 - I wish to launch a non chrooted application inside a namespace, 
>> sharing the file system without sharing the af_unix sockets, because I 
>> don't want the application running inside the container overlap with the 
>> socket af_unix of another container. I prefer to detect a collision with 
>> a strong isolation and handle it manually (remount some part of the fs 
>> for example).
> with common filesystem you have to detect collisions at least for FIFOs.
> This situation is the same. Basically, if we'll treat named Unix sockets
> as an improved FIFO - it's better to use the same approach

There are two aspects to this.

1) Looking up the proxy object in the filesystem.

   With that I agree that the file system access rules are sufficient.
   If we don't want the possibility of proxy objects you simply configure
   the system so that there are no shared files between namespaces.

2) How we interpret the proxy object.

   Currently the shared filesystem aka NFS precedent is that you can see
   the unix domain socket but not use it.  Just what we have implemented now.

As for FIFO you are be right that there is a potential bug there.  Likely those
fall in the yet to be implemented device namespace.  It is certainly an area
of the code we have not audited and thought about, so there is no precedent
set.

>>   3 - I would like to be able to reduce this isolation (your point) to 
>> share the af_unix socket for example to use /dev/klog or something else.
>> 
>> I don't know how much we can consider the point 1, 2 pertinent, but 
>> disabling 3 lines of code via a sysctl with strong isolation as default 
>> and having a process unsharing the namespace in userspace and changing 
>> this value to less isolation is not a big challenge IMHO :)
> the real questions is _who_ is responsible for this kind of staff ->
> node (parent container) administrator or container administrator. I
> strongly vote for first.
>
> Also if we are talking about such kind of staff, I dislike global
> kludge. This should be a property of two concrete VEs and better two
> concrete sockets. Unfortunately, setsockopt is not an option :(


We support sockets from different namespaces in a single process.

I agree that to keep the hack working we can not use setsockopt.
I don't agree that we want to keep the hack working.

- The code needs an audit to think about what it means to exchange packets
  between unix domain sockets in different network namespaces.  There is
  surprising a lot of code in veth to accommodate that.

  If we have too many places where we need to do something strange it is
  going to make code maintenance difficult.

- The semantics get stranger with respect to interpreting unix domain socket
  proxy objects.   Sometimes we can connect to someplace non-local and sometimes
  we can't.

- We can do this without changing the semantics of how socket proxy objects
  are interpreted, as it is possible for a process to use sockets in two different
  network namespaces.  That requires application level changes.

- It is prohibitively difficult to implement unix domain sockets that talk between
  different kernels (file descriptors sharing offsets and garbage collection of in
  flight sockets ouch!).  This means that encouraging the use of unix domain sockets
  to transparently connect applications in different containers is probably a bad
  idea.  This is completely different from FIFO's that are simple enough it is not
  hard to relay data between machines and get it right.

- I don't know how much using unix domain sockets to talk between different domains
  transparently will confuse applications, and increase the risk of security exploits.
  I don't think it will be much though as we already have unix permissions checking
  from the proxy object, and we should be handling the namespace transitions of passed
  objects in the code that sends credentials and file descriptors.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 39d2173..0e1eccd 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -297,9 +297,6 @@  static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
 		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
 		struct dentry *dentry = unix_sk(s)->dentry;
 
-		if (!net_eq(sock_net(s), net))
-			continue;
-
 		if(dentry && dentry->d_inode == i)
 		{
 			sock_hold(s);