[01/14] add Documentation/namespaces/user_namespace.txt

Message ID 1311706717-7398-2-git-send-email-serge@hallyn.com
State RFC, archived
Delegated to: David Miller

Commit Message

Serge E. Hallyn July 26, 2011, 6:58 p.m. UTC
From: Serge E. Hallyn <serge.hallyn@canonical.com>

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   Jul 26: incorporate feedback from David Howells.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

Comments

Randy.Dunlap July 26, 2011, 8:22 p.m. UTC | #1
On Tue, 26 Jul 2011 18:58:24 +0000 Serge Hallyn wrote:

> From: Serge E. Hallyn <serge.hallyn@canonical.com>
> 
> This will hold some info about the design.  Currently it contains
> future todos, issues and questions.
> 
> Changelog:
>    jul 26: incorporate feed back from David Howells.
> 
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: David Howells <dhowells@redhat.com>
> ---
>  Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/namespaces/user_namespace.txt
> 
> diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
> new file mode 100644
> index 0000000..7e50517
> --- /dev/null
> +++ b/Documentation/namespaces/user_namespace.txt
> @@ -0,0 +1,107 @@
> +Description
> +===========
> +
> +Traditionally, each task is owned by a user ID (UID) and belongs to one or more
> +groups (GID).  Both are simple numeric IDs, though userspace usually translates
> +them to names.  The user namespace allows tasks to have different views of the
> +UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
> +below for more)

         for more.)

> +
> +The user namespace is a simple hierarchical one.  The system starts with all
> +tasks belonging to the initial user namespace.  A task creates a new user
> +namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
> +creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
> +but it does not need to be running as root.  The clone(2) call will result in a
> +new task which to itself appears to be running as UID and GID 0, but to its
> +creator seems to have the creator's credentials.
> +
> +Any task in or resource belonging to the initial user namespace will, to this
> +new task, appear to belong to UID and GID -1 - which is usually known as

That extra hyphen is confusing.  How about:

                              to UID and GID -1, which is

> +'nobody'.  Permission to open such files will be granted according to world
> +access permissions.  UID comparisons and group membership checks will return
> +false, and privilege will be denied.
> +
> +When a task belonging to (for example) userid 500 in the initial user namespace
> +creates a new user namespace, even though the new task will see itself as
> +belonging to UID 0, any task in the initial user namespace will see it as
> +belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
> +able to kill the new task.  Files created by the new user will (eventually) be
> +seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
> +the initial user namespace as belonging to UID 500.
> +
> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes.  In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.
> +
> +Relationship between the User namespace and other namespaces
> +============================================================
> +
> +Other namespaces, such as UTS and network, are owned by a user namespace.  When
> +such a namespace is created, it is assigned to the user namespace of the task
> +by which it was created.  Therefore, attempts to exercise privilege to
> +resources in, for instance, a particular network namespace, can be properly
> +validated by checking whether the caller has the needed privilege (i.e.
> +CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
> +This is done using the ns_capable() function.
> +
> +As an example, if a new task is cloned with a private user namespace but
> +no private network namespace, then the task's network namespace is owned
> +by the parent user namespace.  The new task has no privilege to the
> +parent user namespace, so it will not be able to create or configure
> +network devices.  If, instead, the task were cloned with both private
> +user and network namespaces, then the private network namespace is owned
> +by the private user namespace, and so root in the new user namespace
> +will have privilege targeted to the network namespace.  It will be able
> +to create and configure network devices.
> +
> +UID Mapping
> +===========
> +The current plan (see 'flexible UID mapping' at
> +https://wiki.ubuntu.com/UserNamespace) is:
> +
> +The UID/GID stored on disk will be that in the init_user_ns.  Most likely
> +UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
> +(a few years ago) leaving the details up to filesystems while providing a lib/
> +stock implementation.  See the thread around here

                                                here:

> +http://www.mail-archive.com/devel@openvz.org/msg09331.html
> +
> +
> +Working notes
> +=============

A lot of this file is working notes and will need to be updated...

> +Capability checks for actions related to syslog must be against the
> +init_user_ns until syslog is containerized.
> +
> +Same is true for reboot and power, control groups, devices, and time.
> +
> +Perf actions (kernel/event/core.c for instance) will always be constrained to
> +init_user_ns.
> +
> +Q:
> +Is accounting considered properly containerized wrt pidns?  (it appears to be).

s/wrt/with respect to/

> +If so, then we can change the capable() check in kernel/acct.c to
> +'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
> +
> +Q:
> +For things like nice and schedaffinity, we could allow root in a container to
> +control those, and leave only cgroups to constrain the container.  I'm not sure
> +whether that is right, or whether it violates admin expectations.
> +
> +I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
> +dentries, not inodes.
> +
> +For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
> +them) target the capability checks at the user_ns owning the tty.  That will
> +have to wait until we get userns owning files straightened out.
> +
> +We need to figure out how to label devices.  Should we just toss a user_ns
> +right into struct device?
> +
> +capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
> +some day LSMs were to be containerized, near zero chance.
> +
> +inode_owner_or_capable() should probably take an optional ns and cap parameter.
> +If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
> +inode.  But if ns is provided, then callers who need to derive
> +inode_userns(inode) anyway can save a few cycles.
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Howells July 26, 2011, 8:29 p.m. UTC | #2
Randy Dunlap <rdunlap@xenotime.net> wrote:

> > +Any task in or resource belonging to the initial user namespace will, to this
> > +new task, appear to belong to UID and GID -1 - which is usually known as
> 
> that extra hyphen is confusing.  how about:
> 
>                               to UID and GID -1, which is

'which are'.

David
Serge E. Hallyn July 27, 2011, 3:38 p.m. UTC | #3
Quoting Randy Dunlap (rdunlap@xenotime.net):
> On Tue, 26 Jul 2011 18:58:24 +0000 Serge Hallyn wrote:
> 
> > From: Serge E. Hallyn <serge.hallyn@canonical.com>
> > 
> > This will hold some info about the design.  Currently it contains
> > future todos, issues and questions.
> > 
> > Changelog:
> >    jul 26: incorporate feed back from David Howells.
> > 
> > Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> > Cc: Eric W. Biederman <ebiederm@xmission.com>
> > Cc: David Howells <dhowells@redhat.com>
> > ---
> >  Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
> >  1 files changed, 107 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/namespaces/user_namespace.txt
> > 
> > diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
> > new file mode 100644
> > index 0000000..7e50517
> > --- /dev/null
> > +++ b/Documentation/namespaces/user_namespace.txt
> > @@ -0,0 +1,107 @@
> > +Description
> > +===========
> > +
> > +Traditionally, each task is owned by a user ID (UID) and belongs to one or more
> > +groups (GID).  Both are simple numeric IDs, though userspace usually translates
> > +them to names.  The user namespace allows tasks to have different views of the
> > +UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
> > +below for more)
> 
>          for more.)

Thanks for reviewing, Randy.

> > +
> > +The user namespace is a simple hierarchical one.  The system starts with all
> > +tasks belonging to the initial user namespace.  A task creates a new user
> > +namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
> > +creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
> > +but it does not need to be running as root.  The clone(2) call will result in a
> > +new task which to itself appears to be running as UID and GID 0, but to its
> > +creator seems to have the creator's credentials.
> > +
> > +Any task in or resource belonging to the initial user namespace will, to this
> > +new task, appear to belong to UID and GID -1 - which is usually known as
> 
> that extra hyphen is confusing.  how about:
> 
>                               to UID and GID -1, which is
> 
> > +'nobody'.  Permission to open such files will be granted according to world

As I'd been asked to switch away from the comma, I'll restructure it, something like:

"To this new task, any resource belonging to the initial user namespace will
appear to belong to user 'nobody', which has UID and GID -1."

> > +access permissions.  UID comparisons and group membership checks will return
> > +false, and privilege will be denied.
> > +
> > +When a task belonging to (for example) userid 500 in the initial user namespace
> > +creates a new user namespace, even though the new task will see itself as
> > +belonging to UID 0, any task in the initial user namespace will see it as
> > +belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
> > +able to kill the new task.  Files created by the new user will (eventually) be
> > +seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
> > +the initial user namespace as belonging to UID 500.
> > +
> > +Note that this userid mapping for the VFS is not yet implemented, though the
> > +lkml and containers mailing list archives will show several previous
> > +prototypes.  In the end, those got hung up waiting on the concept of targeted
> > +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> > +they finally did.
> > +
> > +Relationship between the User namespace and other namespaces
> > +============================================================
> > +
> > +Other namespaces, such as UTS and network, are owned by a user namespace.  When
> > +such a namespace is created, it is assigned to the user namespace of the task
> > +by which it was created.  Therefore, attempts to exercise privilege to
> > +resources in, for instance, a particular network namespace, can be properly
> > +validated by checking whether the caller has the needed privilege (i.e.
> > +CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
> > +This is done using the ns_capable() function.
> > +
> > +As an example, if a new task is cloned with a private user namespace but
> > +no private network namespace, then the task's network namespace is owned
> > +by the parent user namespace.  The new task has no privilege to the
> > +parent user namespace, so it will not be able to create or configure
> > +network devices.  If, instead, the task were cloned with both private
> > +user and network namespaces, then the private network namespace is owned
> > +by the private user namespace, and so root in the new user namespace
> > +will have privilege targeted to the network namespace.  It will be able
> > +to create and configure network devices.
> > +
> > +UID Mapping
> > +===========
> > +The current plan (see 'flexible UID mapping' at
> > +https://wiki.ubuntu.com/UserNamespace) is:
> > +
> > +The UID/GID stored on disk will be that in the init_user_ns.  Most likely
> > +UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
> > +(a few years ago) leaving the details up to filesystems while providing a lib/
> > +stock implementation.  See the thread around here
> 
>                                                 here:
> 
> > +http://www.mail-archive.com/devel@openvz.org/msg09331.html
> > +
> > +
> > +Working notes
> > +=============
> 
> A lot of this file is working notes and will need to be updated...

Yup.  I can leave it out of this file and keep it on the wiki instead, if
that is preferred.

> > +Capability checks for actions related to syslog must be against the
> > +init_user_ns until syslog is containerized.
> > +
> > +Same is true for reboot and power, control groups, devices, and time.
> > +
> > +Perf actions (kernel/event/core.c for instance) will always be constrained to
> > +init_user_ns.
> > +
> > +Q:
> > +Is accounting considered properly containerized wrt pidns?  (it appears to be).
> 
> s/wrt/with respect to/
> 
> > +If so, then we can change the capable() check in kernel/acct.c to
> > +'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
> > +
> > +Q:
> > +For things like nice and schedaffinity, we could allow root in a container to
> > +control those, and leave only cgroups to constrain the container.  I'm not sure
> > +whether that is right, or whether it violates admin expectations.
> > +
> > +I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
> > +dentries, not inodes.
> > +
> > +For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
> > +them) target the capability checks at the user_ns owning the tty.  That will
> > +have to wait until we get userns owning files straightened out.
> > +
> > +We need to figure out how to label devices.  Should we just toss a user_ns
> > +right into struct device?
> > +
> > +capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
> > +some day LSMs were to be containerized, near zero chance.
> > +
> > +inode_owner_or_capable() should probably take an optional ns and cap parameter.
> > +If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
> > +inode.  But if ns is provided, then callers who need to derive
> > +inode_userns(inode) anyway can save a few cycles.
> > -- 
> 
> 
> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
Randy.Dunlap July 27, 2011, 4:02 p.m. UTC | #4
On Wed, 27 Jul 2011 15:38:48 +0000 Serge E. Hallyn wrote:

> > > +Working notes
> > > +=============
> > 
> > A lot of this file is working notes and will need to be updated...
> 
> Yup.  I can leave it out of this file and keep it on the wiki instead, if
> that is preferred.

Either place is OK with me, as long as you continue to update it
and don't let it go stale.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Patch

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..7e50517
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@ 
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any task in or resource belonging to the initial user
+namespace will appear to belong to user 'nobody', which has UID and GID -1.
+Permission to open such files will be granted according to world
+access permissions.  UID comparisons and group membership checks will return
+false, and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting for the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+it finally was.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (e.g.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/events/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns?  (It
+appears to be.)  If so, then we can change the capable() check in kernel/acct.c
+to 'ns_capable(current_pid_ns()->user_ns, CAP_SYS_PACCT)'.
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+LSMs are containerized some day, which has a near-zero chance of happening.
+
+inode_owner_or_capable() should probably take optional ns and cap parameters.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
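
Illustration (not part of the patch): the network-namespace example in the
"Relationship between the User namespace and other namespaces" section can be
sketched as a small userspace model.  This is only a toy under stated
assumptions, not kernel code.  The structs and the names model_ns_capable(),
can_configure_net() and child_can_configure_net() are hypothetical, and the
check is simplified to "the task holds CAP_NET_ADMIN and lives in the user
namespace that owns the network namespace"; any rules the kernel applies for
parent namespaces or namespace creators are deliberately left out.

```c
#include <stdbool.h>

/* Toy userspace model.  These structs are hypothetical, NOT the kernel's. */
struct user_ns { const char *name; };

/* A network namespace is owned by the user namespace of its creator. */
struct net_ns { const struct user_ns *owner; };

struct task {
    const struct user_ns *user_ns;  /* user namespace the task lives in */
    const struct net_ns *net_ns;    /* network namespace the task uses */
    bool cap_net_admin;             /* held within the task's own user_ns */
};

/*
 * Simplified stand-in for a targeted capability check: the capability
 * counts only when it is targeted at the task's own user namespace.
 */
static bool model_ns_capable(const struct task *t, const struct user_ns *target)
{
    return t->cap_net_admin && t->user_ns == target;
}

/* Privilege over a network namespace is checked against its owner. */
static bool can_configure_net(const struct task *t)
{
    return model_ns_capable(t, t->net_ns->owner);
}

/*
 * Model a child cloned with a private user namespace, and optionally a
 * private network namespace, as in the example in the text.
 */
static bool child_can_configure_net(bool private_net_ns)
{
    struct user_ns parent_userns = { "parent" };
    struct user_ns child_userns = { "child" };
    struct net_ns parent_netns = { &parent_userns };
    struct net_ns child_netns = { &child_userns };
    struct task child = {
        .user_ns = &child_userns,
        .net_ns = private_net_ns ? &child_netns : &parent_netns,
        .cap_net_admin = true,  /* "root" inside its own user namespace */
    };

    return can_configure_net(&child);
}
```

In this model, child_can_configure_net(false) returns false because the
inherited network namespace is still owned by the parent user namespace, while
child_can_configure_net(true) returns true because the private network
namespace is owned by the child's own user namespace.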