diff mbox

[RFC] ns: Syscalls for better namespace sharing control.

Message ID m1ocj6qljj.fsf@fess.ebiederm.org
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Eric W. Biederman March 3, 2010, 12:46 a.m. UTC
Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:

> Eric W. Biederman [ebiederm@xmission.com] wrote:
> | 
> | I think replacing a struct pid for another struct pid allocated in
> | descendant pid_namespace (but has all of the same struct upid values
> | as the first struct pid) is a disastrous idea.  It destroys the
>
> True. Sorry, I did not mean we would need a new 'struct pid' for an
> existing process. I think we talked earlier of finding a way of attaching
> additional pid numbers to the same struct pid.

I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
to be that you become the idle task aka pid 0, and not the init task pid 1 the
implementation is trivial.

Eric
----

 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 fs/proc/root.c                            |    2 +-
 init/main.c                               |    9 ---------
 kernel/cgroup.c                           |    2 +-
 kernel/fork.c                             |   16 +++++++++++++---
 kernel/nsproxy.c                          |    2 +-
 kernel/perf_event.c                       |    2 +-
 kernel/pid.c                              |    8 ++++----
 kernel/signal.c                           |    9 ++++-----
 kernel/sysctl_binary.c                    |    2 +-
 11 files changed, 28 insertions(+), 28 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Serge E. Hallyn March 3, 2010, 3:38 p.m. UTC | #1
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
> 
> > Eric W. Biederman [ebiederm@xmission.com] wrote:
> > | 
> > | I think replacing a struct pid for another struct pid allocated in
> > | descendant pid_namespace (but has all of the same struct upid values
> > | as the first struct pid) is a disastrous idea.  It destroys the
> >
> > True. Sorry, I did not mean we would need a new 'struct pid' for an
> > existing process. I think we talked earlier of finding a way of attaching
> > additional pid numbers to the same struct pid.
> 
> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
> to be that you become the idle task aka pid 0, and not the init task pid 1 the
> implementation is trivial.

Heh, and then (browsing through your copy_process() patch hunks) the next
forked task becomes the child reaper for the new pidns?  <shrug>  why not
I guess.

Now if that child reaper then gets killed, will the idle task get killed too?
And if not, then idle task can just re-populating the new pidns with new
idle tasks...

If this brought us a step closer to entering an existing pidns that would
be one thing, but is there actually any advantage to being able to
unshare a new pidns?  Oh, I guess there is - PAM can then use it at
login, which might be neat.

-serge
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov March 3, 2010, 4:50 p.m. UTC | #2
> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
> to be that you become the idle task aka pid 0, and not the init task pid 1 the
> implementation is trivial.

This is not ... handy - if after enter you have pid 0 you obviously
can't perform 2 parallel enters. The way I see it:

As far as the numbers reported to the userspace are concerned:
1. task, that enters is still visible by its old parent by old pid
2. task, that enters gets some pid within the entering namespace
   and reports its parent pid to have pid 1 (init obviously doesn't
   care)
3. we _can_ try to allocate new pid equal to the old one so that
   glibc stays happy


As far as the pointers are concerned:
1. parent pointer doesn't change
2. task_pid(tsk) one (i.e. struct pid * one) _can_ change if
   a) we don't allow threads enter (de_thread problem is handeled)
   b) we don't allow leave the group/session, i.e. check, that there
      is the only one task that enters lives in its pgid/sid
   c) we wait for the quiescent state to pass by before destroying
      the old pid to handle race with sys_kill()

Thoughts/questions? ("This is a nasty problem" answer is not acceptable,
the real code problems/races please)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 3, 2010, 7:47 p.m. UTC | #3
"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
>> 
>> > Eric W. Biederman [ebiederm@xmission.com] wrote:
>> > | 
>> > | I think replacing a struct pid for another struct pid allocated in
>> > | descendant pid_namespace (but has all of the same struct upid values
>> > | as the first struct pid) is a disastrous idea.  It destroys the
>> >
>> > True. Sorry, I did not mean we would need a new 'struct pid' for an
>> > existing process. I think we talked earlier of finding a way of attaching
>> > additional pid numbers to the same struct pid.
>> 
>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>> implementation is trivial.
>
> Heh, and then (browsing through your copy_process() patch hunks) the next
> forked task becomes the child reaper for the new pidns?  <shrug>  why not
> I guess.
>
> Now if that child reaper then gets killed, will the idle task get killed too?

No.

> And if not, then idle task can just re-populating the new pidns with new
> idle tasks...

After zap_pid_namespace interesting...

> If this brought us a step closer to entering an existing pidns that would
> be one thing, but is there actually any advantage to being able to
> unshare a new pidns?  Oh, I guess there is - PAM can then use it at
> login, which might be neat.

I have to say that the semantics of my patch are unworkable for
unshare.  Unless I am mistaken for PAM to use it requires that the
current process fully change and become what it needs to be.
Requiring an extra fork to fully complete the process is a problem.

Scratch one bright idea.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 3, 2010, 8:16 p.m. UTC | #4
Pavel Emelyanov <xemul@parallels.com> writes:

>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>> implementation is trivial.
>
> This is not ... handy - if after enter you have pid 0 you obviously
> can't perform 2 parallel enters.

2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
You have pid 0 because your pid simply does not map.

There is nothing that makes to parallel enters impossible in that.
Even today we have one thread per cpu that has task->pid == &init_struct_pid
which is pid 0.

For the case of unshare where we are designed to be used with PAM I don't
think my proposed semantics work.  For a join needed an extra fork before
you are really in the pid namespace should be minor.

> The way I see it:
>
> As far as the numbers reported to the userspace are concerned:
> 1. task, that enters is still visible by its old parent by old pid
> 2. task, that enters gets some pid within the entering namespace
>    and reports its parent pid to have pid 1 (init obviously doesn't
>    care)
> 3. we _can_ try to allocate new pid equal to the old one so that
>    glibc stays happy
>
>
> As far as the pointers are concerned:
> 1. parent pointer doesn't change
> 2. task_pid(tsk) one (i.e. struct pid * one) _can_ change if
>    a) we don't allow threads enter (de_thread problem is handeled)
>    b) we don't allow leave the group/session, i.e. check, that there
>       is the only one task that enters lives in its pgid/sid
>    c) we wait for the quiescent state to pass by before destroying
>       the old pid to handle race with sys_kill()

That doesn't handle the case of cached struct pids.  A good example is
waitpid, where it waits for a specific struct pid.  Which means that
allocating a new struct pid and changing task->pid will cause
waitpid(pid) to wait forever...

To change struct pid would require the refcount on struct pid to show
no references from anywhere except the task_struct.

At the cost of a little memory we can solve that problem for unshare
if we have a an extra upid in struct pid, how we verify there is space
in struct pid I'm not certain.

I do think that at least until someone calls exec the namespace pids are
reported to the process itself should not change.  That is kill and
waitpid etc.  Which suggests an implementation the opposite of what
I proposed.  With ns_of_pid(task_pid(current)) being used as the
pid namespace of children, and current->nsproxy->pid_ns not changing
in the case of unshare.

Shrug.

Or perhaps this is a case where we use we can implement join with
an extra process but we can't implement unshare, because the effect
cannot be immediate.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 4, 2010, 9:45 p.m. UTC | #5
ebiederm@xmission.com (Eric W. Biederman) writes:

> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>
>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
>>> 
>>> > Eric W. Biederman [ebiederm@xmission.com] wrote:
>>> > | 
>>> > | I think replacing a struct pid for another struct pid allocated in
>>> > | descendant pid_namespace (but has all of the same struct upid values
>>> > | as the first struct pid) is a disastrous idea.  It destroys the
>>> >
>>> > True. Sorry, I did not mean we would need a new 'struct pid' for an
>>> > existing process. I think we talked earlier of finding a way of attaching
>>> > additional pid numbers to the same struct pid.
>>> 
>>> I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
>>> to be that you become the idle task aka pid 0, and not the init task pid 1 the
>>> implementation is trivial.
>>
>> Heh, and then (browsing through your copy_process() patch hunks) the next
>> forked task becomes the child reaper for the new pidns?  <shrug>  why not
>> I guess.
>>
>> Now if that child reaper then gets killed, will the idle task get killed too?
>
> No.
>
>> And if not, then idle task can just re-populating the new pidns with new
>> idle tasks...
>
> After zap_pid_namespace interesting...
>
>> If this brought us a step closer to entering an existing pidns that would
>> be one thing, but is there actually any advantage to being able to
>> unshare a new pidns?  Oh, I guess there is - PAM can then use it at
>> login, which might be neat.
>
> I have to say that the semantics of my patch are unworkable for
> unshare.  Unless I am mistaken for PAM to use it requires that the
> current process fully change and become what it needs to be.
> Requiring an extra fork to fully complete the process is a problem.
>
> Scratch one bright idea.

Maybe not.  I just looked and in the vast majority of cases the login
process goes like this.

{
	setup stuff include pam
	child = fork();
	if (!child) {
		setuid()
                exec /bin/bash
        }
        waitpid(child);
        
        pam and other cleanup
}

So an unshare of the pid namespace that doesn't really take effect
until we fork may actually be usable from pam, and in fact is probably
the preferred implementation.  It looks like neither openssh nor login
from util-linux-ng will cope properly with getting any pid back from
wait() except the pid of their child.  It looks like they both with
terminate.  Which means if you login in a new pid namespace (where the
unsharing process becomes pid 1) and call nohup everything will get
killed and you will be logged out.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Engelhardt March 4, 2010, 10:55 p.m. UTC | #6
On Thursday 2010-03-04 22:45, Eric W. Biederman wrote:
>
>So an unshare of the pid namespace that doesn't really take effect
>until we fork may actually be usable from pam, and in fact is probably
>the preferred implementation.  It looks like neither openssh nor login
>from util-linux-ng will cope properly with getting any pid back from
>wait() except the pid of their child.

Correct; I can tell from experience with pam_mount. GDM for example is 
very unhappy if you fork/exit processes in PAM modules and don't hide 
the fact by bending SIGCHLD from gdm_handler to mypam_handler (which 
itself is racy, suppose GDM re-set the SIGCHLD handler midway through).

(In this particular case however, I'd prefer if login programs like GDM 
just ignored any PIDs they did not spawn in the first place instead of 
moaning around.)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Emelyanov March 5, 2010, 7:18 p.m. UTC | #7
> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
> You have pid 0 because your pid simply does not map.

Oh, I see.

> There is nothing that makes to parallel enters impossible in that.
> Even today we have one thread per cpu that has task->pid == &init_struct_pid
> which is pid 0.

How about the forked processes then? Who will be their parent?

> For the case of unshare where we are designed to be used with PAM I don't
> think my proposed semantics work.  For a join needed an extra fork before
> you are really in the pid namespace should be minor.

Hm... One more proposal - can we adopt the planned new fork_with_pids system
call to fork the process right into a new pid namespace?

> That doesn't handle the case of cached struct pids.  A good example is
> waitpid, where it waits for a specific struct pid.  Which means that
> allocating a new struct pid and changing task->pid will cause
> waitpid(pid) to wait forever...

OK. Good example. Thanks.

> To change struct pid would require the refcount on struct pid to show
> no references from anywhere except the task_struct.

I think this is OK to return -EBUSY for this. And fix the waitpid
respectively not to block this common case. All the others I think
can be stayed as is.

> At the cost of a little memory we can solve that problem for unshare
> if we have a an extra upid in struct pid, how we verify there is space
> in struct pid I'm not certain.
> 
> I do think that at least until someone calls exec the namespace pids are
> reported to the process itself should not change.  That is kill and

Wait a second - in that case the wait will be blocked too! No?

> waitpid etc.  Which suggests an implementation the opposite of what
> I proposed.  With ns_of_pid(task_pid(current)) being used as the
> pid namespace of children, and current->nsproxy->pid_ns not changing
> in the case of unshare.
> 
> Shrug.
> 
> Or perhaps this is a case where we use we can implement join with
> an extra process but we can't implement unshare, because the effect
> cannot be immediate.

Well, I'm talking only about the join now.

> Eric
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 5, 2010, 8:26 p.m. UTC | #8
Pavel Emelyanov <xemul@parallels.com> writes:

>> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
>> You have pid 0 because your pid simply does not map.
>
> Oh, I see.
>
>> There is nothing that makes to parallel enters impossible in that.
>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>> which is pid 0.
>
> How about the forked processes then? Who will be their parent?

The normal rules of parentage apply.   So the child will see simply
see it's parent as ppid == 0.  If that child daemonizes it will become
a child of the pid namespaces init.

This is a lot like something that gets started from call_usermodehelper.  It's
parent process is not a descendant of init either.


The implementation of the join is to simply change current->nsproxy->pid_ns.
Then to use it you simply fork to get a child in the target pid namespace.

>> For the case of unshare where we are designed to be used with PAM I don't
>> think my proposed semantics work.  For a join needed an extra fork before
>> you are really in the pid namespace should be minor.
>
> Hm... One more proposal - can we adopt the planned new fork_with_pids system
> call to fork the process right into a new pid namespace?

In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
don't think anything I am doing fundamentally undermines it.  The use
case of doing things in fork is that there is automatic inheritance of
everything.  All of the namespaces and all of the control groups, and
possibly also the parent process.  It does have the high cost that the
process we are copying from must be stopped because there are no locks
that let us take everything.  I haven't looked at the recent proposals
to see if anyone has solved that problem cleanly.



If we can do a sys_hijack/sys_cloneat style of join, that means we can
afford a fork.  At which point the my proposed pid namespace semantics
should be fine.

aka:
setns(NSTYPE_PID);
pid = fork();
if (pid == 0) {
	getpid() == 2; /* Or whatever the first free pid is joined pid namespace */
        getppid() == 0;
} else {
	pid == 6400; /* Or whatever the first free pid is in the original pid namespace */
	waitpid(pid);
}

>> That doesn't handle the case of cached struct pids.  A good example is
>> waitpid, where it waits for a specific struct pid.  Which means that
>> allocating a new struct pid and changing task->pid will cause
>> waitpid(pid) to wait forever...
>
> OK. Good example. Thanks.
>
>> To change struct pid would require the refcount on struct pid to show
>> no references from anywhere except the task_struct.
>
> I think this is OK to return -EBUSY for this. And fix the waitpid
> respectively not to block this common case. All the others I think
> can be stayed as is.

That would probably work.  setsid() and setpgrp() have similar sorts
of restrictions.  That is both more challenging and more limiting than
the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I
would prefer to keep this sort of thing as a last resort.

>> At the cost of a little memory we can solve that problem for unshare
>> if we have a an extra upid in struct pid, how we verify there is space
>> in struct pid I'm not certain.
>> 
>> I do think that at least until someone calls exec the namespace pids are
>> reported to the process itself should not change.  That is kill and
>
> Wait a second - in that case the wait will be blocked too! No?

If all we do is populate an unused struct upid in struct pid there
isn't a chance of a problem.  

>> waitpid etc.  Which suggests an implementation the opposite of what
>> I proposed.  With ns_of_pid(task_pid(current)) being used as the
>> pid namespace of children, and current->nsproxy->pid_ns not changing
>> in the case of unshare.
>> 
>> Shrug.
>> 
>> Or perhaps this is a case where we use we can implement join with
>> an extra process but we can't implement unshare, because the effect
>> cannot be immediate.
>
> Well, I'm talking only about the join now.

Overall it sounds like the semantics I have proposed with
unshare(CLONE_NEWPID) are workable, and simple to implement.  The
extra fork is a bit surprising but it certainly does not
look like a show stopper for implementing a pid namespace join.

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano March 6, 2010, 2:47 p.m. UTC | #9
Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
>   
>>> 2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
>>> You have pid 0 because your pid simply does not map.
>>>       
>> Oh, I see.
>>
>>     
>>> There is nothing that makes to parallel enters impossible in that.
>>> Even today we have one thread per cpu that has task->pid == &init_struct_pid
>>> which is pid 0.
>>>       
>> How about the forked processes then? Who will be their parent?
>>     
>
> The normal rules of parentage apply.   So the child will see simply
> see it's parent as ppid == 0.  If that child daemonizes it will become
> a child of the pid namespaces init.
>
> This is a lot like something that gets started from call_usermodehelper.  It's
> parent process is not a descendant of init either.
>
>
> The implementation of the join is to simply change current->nsproxy->pid_ns.
> Then to use it you simply fork to get a child in the target pid namespace.
>   
If the normal rules of parentage apply, that means pid 0 has to wait 
it's child.
If we are in the scenario of pid 0, it's child pid 1234 and we kill the 
pid 1 of the pid namespace, I suppose pid 1234 will be killed too.
The pid 0 will stay in the pid namespace and will able to fork again a 
new pid 1.

I think Serge already reported that...

That sounds good :)
>>> For the case of unshare where we are designed to be used with PAM I don't
>>> think my proposed semantics work.  For a join needed an extra fork before
>>> you are really in the pid namespace should be minor.
>>>       
>> Hm... One more proposal - can we adopt the planned new fork_with_pids system
>> call to fork the process right into a new pid namespace?
>>     
>
> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
> don't think anything I am doing fundamentally undermines it.  The use
> case of doing things in fork is that there is automatic inheritance of
> everything.  All of the namespaces and all of the control groups, and
> possibly also the parent process.  
And also the rootfs for executing the command inside the container (eg. 
shutdown), the uid/gid (if there is a user namespace), the mount points, ...
But I suppose we can do the same with setns for all the namespaces and 
chrooting within the container rootfs.

What I see is a problem with the tty. For example, we cloneat the init 
process of the container which is usually /sbin/init but this one has 
its tty mapped to /dev/console, so the output of the exec'ed command 
will go to the console.
> It does have the high cost that the
> process we are copying from must be stopped because there are no locks
> that let us take everything.  I haven't looked at the recent proposals
> to see if anyone has solved that problem cleanly.
>   
Right.

> If we can do a sys_hijack/sys_cloneat style of join, that means we can
> afford a fork.  At which point the my proposed pid namespace semantics
> should be fine.
>
> aka:
> setns(NSTYPE_PID);
> pid = fork();
> if (pid == 0) {
> 	getpid() == 2; /* Or whatever the first free pid is joined pid namespace */
>         getppid() == 0;
> } else {
> 	pid == 6400; /* Or whatever the first free pid is in the original pid namespace */
> 	waitpid(pid);
> }
>
>   
>>> That doesn't handle the case of cached struct pids.  A good example is
>>> waitpid, where it waits for a specific struct pid.  Which means that
>>> allocating a new struct pid and changing task->pid will cause
>>> waitpid(pid) to wait forever...
>>>       
>> OK. Good example. Thanks.
>>
>>     
>>> To change struct pid would require the refcount on struct pid to show
>>> no references from anywhere except the task_struct.
>>>       
>> I think this is OK to return -EBUSY for this. And fix the waitpid
>> respectively not to block this common case. All the others I think
>> can be stayed as is.
>>     
>
> That would probably work.  setsid() and setpgrp() have similar sorts
> of restrictions.  That is both more challenging and more limiting than
> the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I
> would prefer to keep this sort of thing as a last resort.
>
>   
>>> At the cost of a little memory we can solve that problem for unshare
>>> if we have a an extra upid in struct pid, how we verify there is space
>>> in struct pid I'm not certain.
>>>
>>> I do think that at least until someone calls exec the namespace pids are
>>> reported to the process itself should not change.  That is kill and
>>>       
>> Wait a second - in that case the wait will be blocked too! No?
>>     
>
> If all we do is populate an unused struct upid in struct pid there
> isn't a chance of a problem.  
>
>   
>>> waitpid etc.  Which suggests an implementation the opposite of what
>>> I proposed.  With ns_of_pid(task_pid(current)) being used as the
>>> pid namespace of children, and current->nsproxy->pid_ns not changing
>>> in the case of unshare.
>>>
>>> Shrug.
>>>
>>> Or perhaps this is a case where we use we can implement join with
>>> an extra process but we can't implement unshare, because the effect
>>> cannot be immediate.
>>>       
>> Well, I'm talking only about the join now.
>>     
>
> Overall it sounds like the semantics I have proposed with
> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
> extra fork is a bit surprising but it certainly does not
> look like a show stopper for implementing a pid namespace join.
>   
I agree, it's some kind of "ghost" process.
IMO, with a bit of userspace code it would be possible to enter or exec 
a command inside a container with nsfd, setns.

+1 to test your patchset Eric :)

Just a mindless suggestion, the "nsopen" / "nsattach" syscall names 
should be more clear no ?

Jumping back, one question about the nsfd and the poll for waiting the 
end of the namespace.
If we have an openened file descriptor on a specific namespace, we grab 
a reference on this one, so the namespace won't be destroyed until we 
close the fd which is used to poll the end of the namespace, no ? Did I 
miss something ?

Thanks
  -- Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 6, 2010, 8:48 p.m. UTC | #10
Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:

> If the normal rules of parentage apply, that means pid 0 has to wait it's child.
> If we are in the scenario of pid 0, it's child pid 1234 and we kill the pid 1 of
> the pid namespace, I suppose pid 1234 will be killed too.
> The pid 0 will stay in the pid namespace and will able to fork again a new pid
> 1.
>
> I think Serge already reported that...
>
> That sounds good :)

I expect zap_pid_ns_processes should also arrange so we cannot allocate any
more processes.  We certainly need to do something explicit or pid 1 won't
be allocated.  It might make sense to resurrect a pid namespace after it's
death but it is definitely weird.

>> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
>> don't think anything I am doing fundamentally undermines it.  The use
>> case of doing things in fork is that there is automatic inheritance of
>> everything.  All of the namespaces and all of the control groups, and
>> possibly also the parent process.  
> And also the rootfs for executing the command inside the container
> (eg. shutdown), the uid/gid (if there is a user namespace), the mount points,
> ...
> But I suppose we can do the same with setns for all the namespaces and chrooting
> within the container rootfs.
>
> What I see is a problem with the tty. For example, we cloneat the init process
> of the container which is usually /sbin/init but this one has its tty mapped to
> /dev/console, so the output of the exec'ed command will go to the console.

My original thinking was that the fd's would come from the caller of sys_cloneat....

>> Overall it sounds like the semantics I have proposed with
>> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
>> extra fork is a bit surprising but it certainly does not
>> look like a show stopper for implementing a pid namespace join.
>>   
> I agree, it's some kind of "ghost" process.
> IMO, with a bit of userspace code it would be possible to enter or exec a
> command inside a container with nsfd, setns.
>
> +1 to test your patchset Eric :)

I will see about reposting sometime soon.

> Just a mindless suggestion, the "nsopen" / "nsattach" syscall names should be
> more clear no ?

Not bad suggestions.

I am going to explore a bit more.  Given that nsfd is using the same
permission checks as a proc file, I think I can just make it a proc
file.  Something like "/proc/<pid>/ns/net".  With a little luck that
won't suck too badly.

> Jumping back, one question about the nsfd and the poll for waiting the end of
> the namespace.
> If we have an openened file descriptor on a specific namespace, we grab a
> reference on this one, so the namespace won't be destroyed until we close the fd
> which is used to poll the end of the namespace, no ? Did I miss something ?

Not really.  The assumption was that there would be a very similar
file descriptor that we could use with poll.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano March 6, 2010, 9:26 p.m. UTC | #11
Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>
>   
>> If the normal rules of parentage apply, that means pid 0 has to wait it's child.
>> If we are in the scenario of pid 0, it's child pid 1234 and we kill the pid 1 of
>> the pid namespace, I suppose pid 1234 will be killed too.
>> The pid 0 will stay in the pid namespace and will able to fork again a new pid
>> 1.
>>
>> I think Serge already reported that...
>>
>> That sounds good :)
>>     
>
> I expect zap_pid_ns_processes should also arrange so we cannot allocate any
> more processes.  We certainly need to do something explicit or pid 1 won't
> be allocated.  It might make sense to resurrect a pid namespace after it's
> death but it is definitely weird.
>   
Mmh, yes. But that was just an idea, maybe a bit out of the scope you 
are aiming.

>>> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
>>> don't think anything I am doing fundamentally undermines it.  The use
>>> case of doing things in fork is that there is automatic inheritance of
>>> everything.  All of the namespaces and all of the control groups, and
>>> possibly also the parent process.  
>>>       
>> And also the rootfs for executing the command inside the container
>> (eg. shutdown), the uid/gid (if there is a user namespace), the mount points,
>> ...
>> But I suppose we can do the same with setns for all the namespaces and chrooting
>> within the container rootfs.
>>
>> What I see is a problem with the tty. For example, we cloneat the init process
>> of the container which is usually /sbin/init but this one has its tty mapped to
>> /dev/console, so the output of the exec'ed command will go to the console.
>>     
>
> My original thinking was that the fd's would come from the caller of sys_cloneat....
Oh, ok :s

>>> Overall it sounds like the semantics I have proposed with
>>> unshare(CLONE_NEWPID) are workable, and simple to implement.  The
>>> extra fork is a bit surprising but it certainly does not
>>> look like a show stopper for implementing a pid namespace join.
>>>   
>>>       
>> I agree, it's some kind of "ghost" process.
>> IMO, with a bit of userspace code it would be possible to enter or exec a
>> command inside a container with nsfd, setns.
>>
>> +1 to test your patchset Eric :)
>>     
>
> I will see about reposting sometime soon.
>   
Great ! thanks.

>> Just a mindless suggestion, the "nsopen" / "nsattach" syscall names should be
>> more clear no ?
>>     
>
> Not bad suggestions.
>
> I am going to explore a bit more.  Given that nsfd is using the same
> permission checks as a proc file, I think I can just make it a proc
> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
> won't suck too badly.
>   
Ah ! yes. Good idea.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 8, 2010, 8:32 a.m. UTC | #12
I have take an snapshot of my development tree and placed it at.


git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git


>> I am going to explore a bit more.  Given that nsfd is using the same
>> permission checks as a proc file, I think I can just make it a proc
>> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
>> won't suck too badly.
>>   
> Ah ! yes. Good idea.

It is a hair more code to use proc files but nothing worth counting.

Probably the biggest thing I am aware of right now in my development
tree is in getting uids to pass properly between unix domain sockets
I would up writing this cred_to_ucred function.

Serge can you take a look and check my logic, and do you have
any idea of where we should place something like pid_vnr but
for the uid namespace?

void cred_to_ucred(struct pid *pid, const struct cred *cred,
		   struct ucred *ucred)
{
	ucred->pid = pid_vnr(pid);
	ucred->uid = ucred->gid = -1;
	if (cred) {
		struct user_namespace *cred_ns = cred->user->user_ns;
		struct user_namespace *current_ns = current_user_ns();
		struct user_namespace *tmp;

		if (likely(cred_ns == current_ns)) {
			ucred->uid = cred->euid;
			ucred->gid = cred->egid;
		} else {
			/* Is cred in a child user namespace */
			tmp = cred_ns;
			do {
				tmp = tmp->creator->user_ns;
				if (tmp == current_ns) {
					ucred->uid = tmp->creator->uid;
					ucred->gid = overflowgid;
					return;
				}
			} while (tmp != &init_user_ns);

			/* Is cred the creator of my user namespace,
			 * or the creator of one of it's parents?
			 */
			for( tmp = current_ns; tmp != &init_user_ns;
			     tmp = tmp->creator->user_ns) {
				if (cred->user == tmp->creator) {
					ucred->uid = 0;
					ucred->gid = 0;
					return;
				}
			}

			/* No user namespace relationship so no mapping */
			ucred->uid = overflowuid;
			ucred->gid = overflowgid;
		}
	}
}

Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano March 8, 2010, 4:54 p.m. UTC | #13
Eric W. Biederman wrote:
> I have take an snapshot of my development tree and placed it at.
>
>
> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>   

Hi Eric,

thanks for the pointer.

I tried to boot the kernel under qemu and I got this oops:

Loading /lib/kbd/keymaps/i386/azerty/fr.map
Creating block device nodes.
Creating character device nodes.
Making device-mapper control node
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff812df7e7>] netlink_broadcast+0x1bd/0x384
PGD 3cfd0067 PUD 3cfc1067 PMD 0
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/class/firmware/timeout
CPU 0
Pid: 841, comm: modprobe Not tainted 2.6.33 #1 /
RIP: 0010:[<ffffffff812df7e7>]  [<ffffffff812df7e7>] 
netlink_broadcast+0x1bd/0x384
RSP: 0018:ffff88003cfc3ca8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88003ce947f0 RCX: ffff88003cf877f0
RDX: ffff88003f939dd0 RSI: ffff88003f939ef0 RDI: ffff88003f939e00
RBP: ffff88003cfc3d18 R08: ffff88003f939d98 R09: ffff88003f939e88
R10: ffff88003cf87818 R11: 0000000000000286 R12: ffff88003f939d98
R13: ffff88003f939e88 R14: ffff88003ce94818 R15: ffff88003ce94800
FS:  00007f23a90a06f0(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003cfcd000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 841, threadinfo ffff88003cfc2000, task 
ffff88003d1db058)
Stack:
 0000000000000270 00000000000000d0 ffff88003f9377f0 ffffffff8203d630
<0> ffff88003f939f5c 0000000000000000 0000000000000001 0000000000000000
<0> ffff88003cf03000 0000000000000020 0000000000000004 ffff88003f939e88
Call Trace:
 [<ffffffff8121c1bd>] kobject_uevent_env+0x414/0x59b
 [<ffffffff81117432>] ? sysfs_create_file+0x25/0x27
 [<ffffffff8105eb13>] ? module_add_modinfo_attrs+0xd6/0xfc
 [<ffffffff8121c34f>] kobject_uevent+0xb/0xd
 [<ffffffff8105eba4>] mod_sysfs_setup+0x6b/0x99
 [<ffffffff810602aa>] load_module+0x12a2/0x16f1
 [<ffffffff81060b34>] sys_init_module+0x60/0x230
 [<ffffffff81002928>] system_call_fastpath+0x16/0x1b
Code: 00 ff c8 74 2a 8b 75 98 4c 89 ef e8 22 f8 fd ff 48 85 c0 49 89 c4 
74 3c 48 8d 50 38 48 8b 40 38 48 85 c0 74 0
2 ff 00 48 8b 42 08 <ff> 00 eb 4f 48 8b 55 b0 ff 02 49 83 7d 10 00 74 08 
4c 89 ef e8
RIP  [<ffffffff812df7e7>] netlink_broadcast+0x1bd/0x384
 RSP <ffff88003cfc3ca8>
CR2: 0000000000000000
---[ end trace 979d62b87f68fab3 ]---
netlink_recvmsg: missing NETLINK_CB proto: 15 pid: 0 dsg_group: 1
destructor = (null)
nlmsghdr: len=1080321121 type=27951 flags=646f seq=795176053 pid=1769169779

I will try to investigate further later - kid duty :)

  -- Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn March 8, 2010, 5:07 p.m. UTC | #14
Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> I have take an snapshot of my development tree and placed it at.
> 
> 
> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
> 
> 
> >> I am going to explore a bit more.  Given that nsfd is using the same
> >> permission checks as a proc file, I think I can just make it a proc
> >> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
> >> won't suck too badly.
> >>   
> > Ah ! yes. Good idea.
> 
> It is a hair more code to use proc files but nothing worth counting.
> 
> Probably the biggest thing I am aware of right now in my development
> tree is in getting uids to pass properly between unix domain sockets
> I would up writing this cred_to_ucred function.
> 
> Serge can you take a look and check my logic, and do you have
> any idea of where we should place something like pid_vnr but
> for the uid namespace?

Well my first thought was user_namespace, but I'm thinking kernel/cred.c is
the best place for it.

> void cred_to_ucred(struct pid *pid, const struct cred *cred,
> 		   struct ucred *ucred)
> {
> 	ucred->pid = pid_vnr(pid);
> 	ucred->uid = ucred->gid = -1;
> 	if (cred) {
> 		struct user_namespace *cred_ns = cred->user->user_ns;
> 		struct user_namespace *current_ns = current_user_ns();
> 		struct user_namespace *tmp;
> 
> 		if (likely(cred_ns == current_ns)) {
> 			ucred->uid = cred->euid;
> 			ucred->gid = cred->egid;
> 		} else {
> 			/* Is cred in a child user namespace */
> 			tmp = cred_ns;
> 			do {
> 				tmp = tmp->creator->user_ns;
> 				if (tmp == current_ns) {

	Hmm, I think you want to catch one level up - so the creator itself
	is in current_user_ns, so

	do {
		if (tmp->creator->user_ns == current_ns) {
			ucred->uid = tmp->creator->uid;
			ucred->gid = tmp->creator_gid;
			return;
		}
		tmp = tmp->creator->user_ns;
	} while (tmp != &init_user_ns);

> 					ucred->uid = tmp->creator->uid;
> 					ucred->gid = overflowgid;

			should we start recording a user_ns->creator_gid
			instead?

> 					return;
> 				}
> 			} while (tmp != &init_user_ns);
> 
> 			/* Is cred the creator of my user namespace,
> 			 * or the creator of one of it's parents?
> 			 */
> 			for( tmp = current_ns; tmp != &init_user_ns;
> 			     tmp = tmp->creator->user_ns) {
> 				if (cred->user == tmp->creator) {
> 					ucred->uid = 0;
> 					ucred->gid = 0;
> 					return;
> 				}
> 			}

That looks right.

> 			/* No user namespace relationship so no mapping */
> 			ucred->uid = overflowuid;
> 			ucred->gid = overflowgid;
> 		}
> 	}
> }
> 
> Eric
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 8, 2010, 5:35 p.m. UTC | #15
"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> 
>> I have take an snapshot of my development tree and placed it at.
>> 
>> 
>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>> 
>> 
>> >> I am going to explore a bit more.  Given that nsfd is using the same
>> >> permission checks as a proc file, I think I can just make it a proc
>> >> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
>> >> won't suck too badly.
>> >>   
>> > Ah ! yes. Good idea.
>> 
>> It is a hair more code to use proc files but nothing worth counting.
>> 
>> Probably the biggest thing I am aware of right now in my development
>> tree is in getting uids to pass properly between unix domain sockets
>> I would up writing this cred_to_ucred function.
>> 
>> Serge can you take a look and check my logic, and do you have
>> any idea of where we should place something like pid_vnr but
>> for the uid namespace?
>
> Well my first thought was user_namespace, but I'm thinking kernel/cred.c is
> the best place for it.

Thanks.

>> void cred_to_ucred(struct pid *pid, const struct cred *cred,
>> 		   struct ucred *ucred)
>> {
>> 	ucred->pid = pid_vnr(pid);
>> 	ucred->uid = ucred->gid = -1;
>> 	if (cred) {
>> 		struct user_namespace *cred_ns = cred->user->user_ns;
>> 		struct user_namespace *current_ns = current_user_ns();
>> 		struct user_namespace *tmp;
>> 
>> 		if (likely(cred_ns == current_ns)) {
>> 			ucred->uid = cred->euid;
>> 			ucred->gid = cred->egid;
>> 		} else {
>> 			/* Is cred in a child user namespace */
>> 			tmp = cred_ns;
>> 			do {
>> 				tmp = tmp->creator->user_ns;
>> 				if (tmp == current_ns) {
>
> 	Hmm, I think you want to catch one level up - so the creator itself
> 	is in current_user_ns, so

>
> 	do {
> 		if (tmp->creator->user_ns == current_ns) {
> 			ucred->uid = tmp->creator->uid;
> 			ucred->gid = tmp->creator_gid;
> 			return;
> 		}
> 		tmp = tmp->creator->user_ns;
> 	} while (tmp != &init_user_ns);

Good catch.  

>> 					ucred->uid = tmp->creator->uid;
>> 					ucred->gid = overflowgid;
>
> 			should we start recording a user_ns->creator_gid
> 			instead?

I had a similar question.  Possibly we can just grab the creators cred.


>> 					return;
>> 				}
>> 			} while (tmp != &init_user_ns);
>> 
>> 			/* Is cred the creator of my user namespace,
>> 			 * or the creator of one of it's parents?
>> 			 */
>> 			for( tmp = current_ns; tmp != &init_user_ns;
>> 			     tmp = tmp->creator->user_ns) {
>> 				if (cred->user == tmp->creator) {
>> 					ucred->uid = 0;
>> 					ucred->gid = 0;
>> 					return;
>> 				}
>> 			}
>
> That looks right.
>
>> 			/* No user namespace relationship so no mapping */
>> 			ucred->uid = overflowuid;
>> 			ucred->gid = overflowgid;
>> 		}
>> 	}
>> }

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn March 8, 2010, 5:47 p.m. UTC | #16
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> 
> >> I have take an snapshot of my development tree and placed it at.
> >> 
> >> 
> >> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
> >> 
> >> 
> >> >> I am going to explore a bit more.  Given that nsfd is using the same
> >> >> permission checks as a proc file, I think I can just make it a proc
> >> >> file.  Something like "/proc/<pid>/ns/net".  With a little luck that
> >> >> won't suck too badly.
> >> >>   
> >> > Ah ! yes. Good idea.
> >> 
> >> It is a hair more code to use proc files but nothing worth counting.
> >> 
> >> Probably the biggest thing I am aware of right now in my development
> >> tree is in getting uids to pass properly between unix domain sockets
> >> I would up writing this cred_to_ucred function.
> >> 
> >> Serge can you take a look and check my logic, and do you have
> >> any idea of where we should place something like pid_vnr but
> >> for the uid namespace?
> >
> > Well my first thought was user_namespace, but I'm thinking kernel/cred.c is
> > the best place for it.
> 
> Thanks.
> 
> >> void cred_to_ucred(struct pid *pid, const struct cred *cred,
> >> 		   struct ucred *ucred)
> >> {
> >> 	ucred->pid = pid_vnr(pid);
> >> 	ucred->uid = ucred->gid = -1;
> >> 	if (cred) {
> >> 		struct user_namespace *cred_ns = cred->user->user_ns;
> >> 		struct user_namespace *current_ns = current_user_ns();
> >> 		struct user_namespace *tmp;
> >> 
> >> 		if (likely(cred_ns == current_ns)) {
> >> 			ucred->uid = cred->euid;
> >> 			ucred->gid = cred->egid;
> >> 		} else {
> >> 			/* Is cred in a child user namespace */
> >> 			tmp = cred_ns;
> >> 			do {
> >> 				tmp = tmp->creator->user_ns;
> >> 				if (tmp == current_ns) {
> >
> > 	Hmm, I think you want to catch one level up - so the creator itself
> > 	is in current_user_ns, so
> 
> >
> > 	do {
> > 		if (tmp->creator->user_ns == current_ns) {
> > 			ucred->uid = tmp->creator->uid;
> > 			ucred->gid = tmp->creator_gid;
> > 			return;
> > 		}
> > 		tmp = tmp->creator->user_ns;
> > 	} while (tmp != &init_user_ns);
> 
> Good catch.  
> 
> >> 					ucred->uid = tmp->creator->uid;
> >> 					ucred->gid = overflowgid;
> >
> > 			should we start recording a user_ns->creator_gid
> > 			instead?
> 
> I had a similar question.  Possibly we can just grab the creators cred.

Oh, yeah, make user_ns->creator a cred, excellent idea - then we have
the LSM and capability fields cached as well.

-serge
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4678078..b7f2026 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1094,7 +1094,7 @@  static int show_spu_loadavg(struct seq_file *s, void *private)
 		LOAD_INT(c), LOAD_FRAC(c),
 		count_active_contexts(),
 		atomic_read(&nr_spu_contexts),
-		current->nsproxy->pid_ns->last_pid);
+		task_active_pid_ns(current)->last_pid);
 	return 0;
 }
 
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index 3b3c366..4e6985e 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -125,7 +125,7 @@  void mconsole_log(struct mc_request *req)
 void mconsole_proc(struct mc_request *req)
 {
 	struct nameidata nd;
-	struct vfsmount *mnt = current->nsproxy->pid_ns->proc_mnt;
+	struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
 	struct file *file;
 	int n, err;
 	char *ptr = req->request.data, *buf;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b080b79..fbcd3f8 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -57,7 +57,7 @@  static int proc_get_sb(struct file_system_type *fs_type,
 	if (flags & MS_KERNMOUNT)
 		ns = (struct pid_namespace *)data;
 	else
-		ns = current->nsproxy->pid_ns;
+		ns = task_active_pid_ns(current);
 
 	sb = sget(fs_type, proc_test_super, proc_set_super, ns);
 	if (IS_ERR(sb))
diff --git a/init/main.c b/init/main.c
index 4cb47a1..67e40fc 100644
--- a/init/main.c
+++ b/init/main.c
@@ -851,15 +851,6 @@  static int __init kernel_init(void * unused)
 	 * init can run on any cpu.
 	 */
 	set_cpus_allowed_ptr(current, cpu_all_mask);
-	/*
-	 * Tell the world that we're going to be the grim
-	 * reaper of innocent orphaned children.
-	 *
-	 * We don't want people to have to make incorrect
-	 * assumptions about where in the task array this
-	 * can be found.
-	 */
-	init_pid_ns.child_reaper = current;
 
 	cad_pid = task_pid(current);
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index aa3bee5..737d2eb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2453,7 +2453,7 @@  static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
 {
 	struct cgroup_pidlist *l;
 	/* don't need task_nsproxy() if we're looking at ourself */
-	struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
+	struct pid_namespace *ns = get_pid_ns(task_active_pid_ns(current));
 	/*
 	 * We can't drop the pidlist_mutex before taking the l->mutex in case
 	 * the last ref-holder is trying to remove l from the list at the same
diff --git a/kernel/fork.c b/kernel/fork.c
index f88bd98..832c035 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1172,7 +1172,7 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 		if (!pid)
 			goto bad_fork_cleanup_io;
 
-		if (clone_flags & CLONE_NEWPID) {
+		if (pid->numbers[pid->level].nr == 1) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
 			if (retval < 0)
 				goto bad_fork_free_pid;
@@ -1279,7 +1279,7 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 		tracehook_finish_clone(p, clone_flags, trace);
 
 		if (thread_group_leader(p)) {
-			if (clone_flags & CLONE_NEWPID)
+			if (pid->numbers[pid->level].nr == 1)
 				p->nsproxy->pid_ns->child_reaper = p;
 
 			p->signal->leader_pid = pid;
@@ -1539,10 +1539,19 @@  static void check_unshare_flags(unsigned long *flags_ptr)
 		*flags_ptr |= CLONE_THREAD;
 
 	/*
+	 * If unsharing the pid namespace and the task was created
+	 * using CLONE_THREAD, then must unshare the thread.
+	 */
+	if ((*flags_ptr & CLONE_NEWPID) &&
+	    (atomic_read(&current->signal->count) > 1))
+		*flags_ptr |= CLONE_THREAD;
+
+	/*
 	 * If unsharing namespace, must also unshare filesystem information.
 	 */
 	if (*flags_ptr & CLONE_NEWNS)
 		*flags_ptr |= CLONE_FS;
+
 }
 
 /*
@@ -1647,7 +1656,8 @@  SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 	err = -EINVAL;
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
-				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET))
+				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
+				CLONE_NEWPID))
 		goto bad_unshare_out;
 
 	/*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e3be4ef..1d023d5 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -173,7 +173,7 @@  int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET)))
+			       CLONE_NEWNET | CLONE_NEWPID)))
 		return 0;
 
 	if (!capable(CAP_SYS_ADMIN))
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 2ae7409..74865cd 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -4436,7 +4436,7 @@  perf_event_alloc(struct perf_event_attr *attr,
 
 	event->parent		= parent_event;
 
-	event->ns		= get_pid_ns(current->nsproxy->pid_ns);
+	event->ns		= get_pid_ns(task_active_pid_ns(current));
 	event->id		= atomic64_inc_return(&perf_event_id);
 
 	event->state		= PERF_EVENT_STATE_INACTIVE;
diff --git a/kernel/pid.c b/kernel/pid.c
index 2e17c9c..6b64a82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -305,7 +305,7 @@  EXPORT_SYMBOL_GPL(find_pid_ns);
 
 struct pid *find_vpid(int nr)
 {
-	return find_pid_ns(nr, current->nsproxy->pid_ns);
+	return find_pid_ns(nr, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(find_vpid);
 
@@ -385,7 +385,7 @@  struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 
 struct task_struct *find_task_by_vpid(pid_t vnr)
 {
-	return find_task_by_pid_ns(vnr, current->nsproxy->pid_ns);
+	return find_task_by_pid_ns(vnr, task_active_pid_ns(current));
 }
 
 struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
@@ -437,7 +437,7 @@  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
 
 pid_t pid_vnr(struct pid *pid)
 {
-	return pid_nr_ns(pid, current->nsproxy->pid_ns);
+	return pid_nr_ns(pid, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(pid_vnr);
 
@@ -448,7 +448,7 @@  pid_t __task_pid_nr_ns(struct task_struct *task, enum pid_type type,
 
 	rcu_read_lock();
 	if (!ns)
-		ns = current->nsproxy->pid_ns;
+		ns = task_active_pid_ns(current);
 	if (likely(pid_alive(task))) {
 		if (type != PIDTYPE_PID)
 			task = task->group_leader;
diff --git a/kernel/signal.c b/kernel/signal.c
index 934ae5e..885b699 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1438,16 +1438,15 @@  int do_notify_parent(struct task_struct *tsk, int sig)
 	 * we are under tasklist_lock here so our parent is tied to
 	 * us and cannot exit and release its namespace.
 	 *
-	 * the only it can is to switch its nsproxy with sys_unshare,
-	 * bu uncharing pid namespaces is not allowed, so we'll always
-	 * see relevant namespace
+	 * The only it can is to switch its nsproxy with sys_unshare,
+	 * but we use the pid_namespace for task_pid which never changes.
 	 *
 	 * write_lock() currently calls preempt_disable() which is the
 	 * same as rcu_read_lock(), but according to Oleg, this is not
 	 * correct to rely on this
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
+	info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(tsk->parent));
 	info.si_uid = __task_cred(tsk)->uid;
 	rcu_read_unlock();
 
@@ -1518,7 +1517,7 @@  static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	 * see comment in do_notify_parent() abot the following 3 lines
 	 */
 	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
+	info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
 	info.si_uid = __task_cred(tsk)->uid;
 	rcu_read_unlock();
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 8f5d16e..1e4da59 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -1356,7 +1356,7 @@  static ssize_t binary_sysctl(const int *name, int nlen,
 		goto out_putname;
 	}
 
-	mnt = current->nsproxy->pid_ns->proc_mnt;
+	mnt = task_active_pid_ns(current)->proc_mnt;
 	result = vfs_path_lookup(mnt->mnt_root, mnt, pathname, 0, &nd);
 	if (result)
 		goto out_putname;