diff mbox series

[RFC,02/27] containers: Implement containers as kernel objects

Message ID 155024685321.21651.1504201877881622756.stgit@warthog.procyon.org.uk
State New
Headers show
Series Containers and using authenticated filesystems | expand

Commit Message

David Howells Feb. 15, 2019, 4:07 p.m. UTC
Implement a kernel container object such that it contains the following
things:

 (1) Namespaces.

 (2) A root directory.

 (3) A set of processes, including one designated as the 'init' process.

A container is created and attached to a file descriptor by:

	int cfd = container_create(const char *name, unsigned int flags);

this inherits all the namespaces of the parent container unless otherwise
the mask calls for new namespaces.

	CONTAINER_NEW_FS_NS
	CONTAINER_NEW_EMPTY_FS_NS
	CONTAINER_NEW_CGROUP_NS [root only]
	CONTAINER_NEW_UTS_NS
	CONTAINER_NEW_IPC_NS
	CONTAINER_NEW_USER_NS
	CONTAINER_NEW_PID_NS
	CONTAINER_NEW_NET_NS

Other flags include:

	CONTAINER_KILL_ON_CLOSE
	CONTAINER_CLOSE_ON_EXEC

Note that I've added a pointer to the current container to task_struct.
This doesn't make the nsproxy pointer redundant as you can still make new
namespaces with clone().

I've also added a list_head to task_struct to form a list in the container
of its member processes.  This is convenient, but redundant since the code
could iterate over all the tasks looking for ones that have a matching
task->container.

It might make sense to use fsconfig() to configure the container:

	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);

Comments

Trond Myklebust Feb. 17, 2019, 6:57 p.m. UTC | #1
Hi David,

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the
> following
> things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.
> 
>  (3) A set of processes, including one designated as the 'init'
> process.
> 
> A container is created and attached to a file descriptor by:
> 
> 	int cfd = container_create(const char *name, unsigned int
> flags);
> 
> this inherits all the namespaces of the parent container unless
> otherwise
> the mask calls for new namespaces.
> 
> 	CONTAINER_NEW_FS_NS
> 	CONTAINER_NEW_EMPTY_FS_NS
> 	CONTAINER_NEW_CGROUP_NS [root only]
> 	CONTAINER_NEW_UTS_NS
> 	CONTAINER_NEW_IPC_NS
> 	CONTAINER_NEW_USER_NS
> 	CONTAINER_NEW_PID_NS
> 	CONTAINER_NEW_NET_NS
> 
> Other flags include:
> 
> 	CONTAINER_KILL_ON_CLOSE
> 	CONTAINER_CLOSE_ON_EXEC
> 
> Note that I've added a pointer to the current container to
> task_struct.
> This doesn't make the nsproxy pointer redundant as you can still make
> new
> namespaces with clone().
> 
> I've also added a list_head to task_struct to form a list in the
> container
> of its member processes.  This is convenient, but redundant since the
> code
> could iterate over all the tasks looking for ones that have a
> matching
> task->container.
> 
> It might make sense to use fsconfig() to configure the container:
> 
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
> 	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);
> 
> 
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 
>      Once created, it should then be possible for the supervising
> process
>      to modify the new container.  Mounts can be created inside of
> the
>      container's namespaces:
> 
> 	fsfd = fsopen("ext4", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
> 	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> 	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(fsfd, 0, 0);
> 
>      and then mounted into the namespace:
> 
> 	move_mount(mfd, "", cfd, "/",
> 		   MOVE_MOUNT_F_EMPTY_PATH |
> MOVE_MOUNT_T_CONTAINER_ROOT);
> 
>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 
>      Files and devices can be created by supplying the container fd
> as the
>      dirfd argument:
> 
> 	mkdirat(int cfd, const char *path, mode_t mode);
> 	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> 	int fd = openat(int cfd, const char *path,
> 			unsigned int flags, mode_t mode);
> 
>      [*] Note that when using cfd as dirfd, the path must not contain
> a '/'
>      	 at the front.
> 
>      Sockets, such as netlink, can be opened inside of the
> container's
>      namespaces:
> 
> 	int fd = container_socket(int cfd, int domain, int type,
> 				  int protocol);
> 
>      This should allow management of the container's network
> namespace from
>      outside.
> 
>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init'
> process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree
> within
>      the container.  Before this point, the container is simply
> destroyed
>      if the container fd is closed.
> 
>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process
> therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int
> wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed
> off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are
> preemptively
>      SIGKILL'd by the kernel.
> 
>      By default, if the container is active and its fd is closed, the
>      container is left running and wil be cleaned up when its 'init'
> exits.
>      The default can be changed with the CONTAINER_KILL_ON_CLOSE
> flag.
> 
>  (4) Supervising the container.
> 
>      Given that we have an fd attached to the container, we could
> make it
>      such that the supervising process could monitor and override
> EPERM
>      returns for mount and other privileged operations within the
>      container.
> 
>  (5) Per-container keyring.
> 
>      Each container can point to a per-container keyring for the
> holding of
>      integrity keys and filesystem keys for use inside the
> container.  This
>      would be attached:
> 
> 	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
> 
>      This keyring would be searched by request_key() after it has
> searched
>      the thread, process and session keyrings.
> 
>  (6) Running different LSM policies by container.  This might
> particularly
>      make sense with something like Apparmor where different path-
> based
>      rules might be required inside a container to inside the parent.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---

Do we really need a new system call to set up containers? That would
force changes to all existing orchestration software.

Given that the main thing we want to achieve is to direct messages from
the kernel to an appropriate handler, why not focus on adding
functionality to do just that?

Is there any reason why a syscall to allow an appropriately privileged
process to add a keyring-specific message queue to its own
user_namespace and obtain a file descriptor to that message queue might
not work? That forces the container to use a daemon if it cares to
intercept keyring traffic, rather than worrying about the kernel
running request_key (in fact, it might make sense to allow a trivial
implementation of the daemon to be to just read the messages, parse
them and run request_key).

With such an implementation, the fallback mechanism could be to walk
back up the hierarchy of user_namespaces until a message queue is
found, and to invoke the existing request_key mechanism if not.
James Bottomley Feb. 17, 2019, 7:39 p.m. UTC | #2
Added containers and cgroups list, which somehow got lost since they
might have a slight interest in a complete rewrite of the container
API.

On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the
> following things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.

Doesn't this conflict with how the mount namespace works today?  It
contains the notion of unescapable root and we shouldn't have two of
those in different locations.

>  (3) A set of processes, including one designated as the 'init'
> process.

This is a violation of a fundamental tenet: I can create a "container"
as simply a set of unoccupied namespaces and bind them into the
filesystem with a mount.  This mechanism is what I use for
architectural emulation containers and how network namespaces currently
work.  For all of these cases, the container is empty of processes when
it is created and is selectively filled and emptied of processes as you
use it.

If I create a container without a PID namespace, I definitely wouldn't
want the notion of an "init" process because I'm deliberately avoiding
that.

> A container is created and attached to a file descriptor by:
> 
> 	int cfd = container_create(const char *name, unsigned int
> flags);

I thought we got agreement years ago that containers don't exist in
Linux as a single entity: they're currently a collection of cgroups and
namespaces some of which may and some of which may not be local to the
entity the orchestration system thinks of as a "container".

> this inherits all the namespaces of the parent container unless
> otherwise the mask calls for new namespaces.
> 
> 	CONTAINER_NEW_FS_NS
> 	CONTAINER_NEW_EMPTY_FS_NS
> 	CONTAINER_NEW_CGROUP_NS [root only]
> 	CONTAINER_NEW_UTS_NS
> 	CONTAINER_NEW_IPC_NS
> 	CONTAINER_NEW_USER_NS
> 	CONTAINER_NEW_PID_NS
> 	CONTAINER_NEW_NET_NS
> 
> Other flags include:
> 
> 	CONTAINER_KILL_ON_CLOSE
> 	CONTAINER_CLOSE_ON_EXEC
> 
> Note that I've added a pointer to the current container to
> task_struct. This doesn't make the nsproxy pointer redundant as you
> can still make new namespaces with clone().
> 
> I've also added a list_head to task_struct to form a list in the
> container of its member processes.  This is convenient, but redundant
> since the code could iterate over all the tasks looking for ones that
> have a matching task->container.
> 
> It might make sense to use fsconfig() to configure the container:
> 
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "user", NULL, userns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_NAMESPACE, "mnt", NULL, mntns_fd);
> 	fsconfig(cfd, FSCONFIG_SET_FD, "rootfs", NULL, root_fd);
> 	fsconfig(cfd, FSCONFIG_CMD_CREATE_CONTAINER, NULL, NULL, 0);

You're trying to introduce a new set of container APIs that don't quite
align with how containers work today.  If I look at the justification
below the whole thing seems to require the notion of a container as an
atomic entity with an exclusive process list.  You can argue that's how
you want it to work, but it looks like this notion would have
difficulty working with the standard kubernetes pod/container notion,
let alone all of the other esoteric ways we use containers today.

James

> 
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 
>      Once created, it should then be possible for the supervising
> process
>      to modify the new container.  Mounts can be created inside of
> the
>      container's namespaces:
> 
> 	fsfd = fsopen("ext4", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
> 	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
> 	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> 	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 	mfd = fsmount(fsfd, 0, 0);
> 
>      and then mounted into the namespace:
> 
> 	move_mount(mfd, "", cfd, "/",
> 		   MOVE_MOUNT_F_EMPTY_PATH |
> MOVE_MOUNT_T_CONTAINER_ROOT);
> 
>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 
>      Files and devices can be created by supplying the container fd
> as the
>      dirfd argument:
> 
> 	mkdirat(int cfd, const char *path, mode_t mode);
> 	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> 	int fd = openat(int cfd, const char *path,
> 			unsigned int flags, mode_t mode);
> 
>      [*] Note that when using cfd as dirfd, the path must not contain
> a '/'
>      	 at the front.
> 
>      Sockets, such as netlink, can be opened inside of the
> container's
>      namespaces:
> 
> 	int fd = container_socket(int cfd, int domain, int type,
> 				  int protocol);
> 
>      This should allow management of the container's network
> namespace from
>      outside.
> 
>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init'
> process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree
> within
>      the container.  Before this point, the container is simply
> destroyed
>      if the container fd is closed.
> 
>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process
> therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int
> wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed
> off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are
> preemptively
>      SIGKILL'd by the kernel.
> 
>      By default, if the container is active and its fd is closed, the
>      container is left running and wil be cleaned up when its 'init'
> exits.
>      The default can be changed with the CONTAINER_KILL_ON_CLOSE
> flag.
> 
>  (4) Supervising the container.
> 
>      Given that we have an fd attached to the container, we could
> make it
>      such that the supervising process could monitor and override
> EPERM
>      returns for mount and other privileged operations within the
>      container.
> 
>  (5) Per-container keyring.
> 
>      Each container can point to a per-container keyring for the
> holding of
>      integrity keys and filesystem keys for use inside the
> container.  This
>      would be attached:
> 
> 	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)
> 
>      This keyring would be searched by request_key() after it has
> searched
>      the thread, process and session keyrings.
> 
>  (6) Running different LSM policies by container.  This might
> particularly
>      make sense with something like Apparmor where different path-
> based
>      rules might be required inside a container to inside the parent.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  fs/namespace.c                         |    5 
>  include/linux/container.h              |   86 ++++++++
>  include/linux/init_task.h              |    1 
>  include/linux/lsm_hooks.h              |   20 ++
>  include/linux/sched.h                  |    3 
>  include/linux/security.h               |   15 +
>  include/linux/syscalls.h               |    3 
>  include/uapi/linux/container.h         |   28 +++
>  init/Kconfig                           |    7 +
>  init/init_task.c                       |    3 
>  kernel/Makefile                        |    2 
>  kernel/container.c                     |  348
> ++++++++++++++++++++++++++++++++
>  kernel/exit.c                          |    1 
>  kernel/fork.c                          |    7 +
>  kernel/namespaces.h                    |   15 +
>  kernel/nsproxy.c                       |   23 +-
>  kernel/sys_ni.c                        |    3 
>  security/security.c                    |   12 +
>  20 files changed, 571 insertions(+), 13 deletions(-)
>  create mode 100644 include/linux/container.h
>  create mode 100644 include/uapi/linux/container.h
>  create mode 100644 kernel/container.c
>  create mode 100644 kernel/namespaces.h
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index c9db9d51a7df..3564814a5d21 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -407,3 +407,4 @@
>  393	i386	fsinfo			sys_fsinfo	
> 		__ia32_sys_fsinfo
>  394	i386	mount_notify		sys_mount_notify	
> 	__ia32_sys_mount_notify
>  395	i386	sb_notify		sys_sb_notify	
> 		__ia32_sys_sb_notify
> +396	i386	container_create	sys_container_create	
> 	__ia32_sys_container_create
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 17869bf7788a..aa6cccbe5271 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -352,6 +352,7 @@
>  341	common	fsinfo			__x64_sys_fsi
> nfo
>  342	common	mount_notify		__x64_sys_mount
> _notify
>  343	common	sb_notify		__x64_sys_sb_notif
> y
> +344	common	container_create	__x64_sys_container
> _create
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache
> impact
> diff --git a/fs/namespace.c b/fs/namespace.c
> index f378cfc63043..ea005f55ec4c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -30,6 +30,7 @@
>  #include <uapi/linux/mount.h>
>  #include <linux/fs_context.h>
>  #include <linux/fsinfo.h>
> +#include <linux/container.h>
>  
>  #include "pnode.h"
>  #include "internal.h"
> @@ -3742,6 +3743,10 @@ static void __init init_mount_tree(void)
>  
>  	set_fs_pwd(current->fs, &root);
>  	set_fs_root(current->fs, &root);
> +#ifdef CONFIG_CONTAINERS
> +	path_get(&root);
> +	init_container.root = root;
> +#endif
>  }
>  
>  void __init mnt_init(void)
> diff --git a/include/linux/container.h b/include/linux/container.h
> new file mode 100644
> index 000000000000..0a8918435097
> --- /dev/null
> +++ b/include/linux/container.h
> @@ -0,0 +1,86 @@
> +/* Container objects
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_CONTAINER_H
> +#define _LINUX_CONTAINER_H
> +
> +#include <uapi/linux/container.h>
> +#include <linux/refcount.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +#include <linux/path.h>
> +#include <linux/seqlock.h>
> +
> +struct fs_struct;
> +struct nsproxy;
> +struct task_struct;
> +
> +/*
> + * The container object.
> + */
> +struct container {
> +	char			name[24];
> +	u64			id;		/* Container
> ID */
> +	refcount_t		usage;
> +	int			exit_code;	/* The exit
> code of 'init' */
> +	const struct cred	*cred;		/* Creds for
> this container, including userns */
> +	struct nsproxy		*ns;		/* This
> container's namespaces */
> +	struct path		root;		/* The root
> of the container's fs namespace */
> +	struct task_struct	*init;		/* The
> 'init' task for this container */
> +	struct container	*parent;	/* Parent of this
> container. */
> +	void			*security;	/* LSM data */
> +	struct list_head	members;	/* Member processes,
> guarded with ->lock */
> +	struct list_head	child_link;	/* Link in
> parent->children */
> +	struct list_head	children;	/* Child containers
> */
> +	wait_queue_head_t	waitq;		/* Someone
> waiting for init to exit waits here */
> +	unsigned long		flags;
> +#define CONTAINER_FLAG_INIT_STARTED	0	/* Init is
> started - certain ops now prohibited */
> +#define CONTAINER_FLAG_DEAD		1	/* Init has died
> */
> +#define CONTAINER_FLAG_KILL_ON_CLOSE	2	/* Kill init if
> container handle closed */
> +	spinlock_t		lock;
> +	seqcount_t		seq;		/* Track
> changes in ->root */
> +};
> +
> +extern struct container init_container;
> +
> +#ifdef CONFIG_CONTAINERS
> +extern const struct file_operations container_fops;
> +
> +extern int copy_container(unsigned long flags, struct task_struct
> *tsk,
> +			  struct container *container);
> +extern void exit_container(struct task_struct *tsk);
> +extern void put_container(struct container *c);
> +
> +static inline struct container *get_container(struct container *c)
> +{
> +	refcount_inc(&c->usage);
> +	return c;
> +}
> +
> +static inline bool is_container_file(struct file *file)
> +{
> +	return file->f_op == &container_fops;
> +}
> +
> +#else
> +
> +static inline int copy_container(unsigned long flags, struct
> task_struct *tsk,
> +				 struct container *container)
> +{ return 0; }
> +static inline void exit_container(struct task_struct *tsk) { }
> +static inline void put_container(struct container *c) {}
> +static inline struct container *get_container(struct container *c) {
> return NULL; }
> +static inline bool is_container_file(struct file *file) { return
> false; }
> +
> +#endif /* CONFIG_CONTAINERS */
> +
> +#endif /* _LINUX_CONTAINER_H */
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index a7083a45a26c..f016cadece24 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -10,6 +10,7 @@
>  #include <linux/ipc.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/user_namespace.h>
> +#include <linux/container.h>
>  #include <linux/securebits.h>
>  #include <linux/seqlock.h>
>  #include <linux/rbtree.h>
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 52d0f3f4c786..0f310d911815 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1460,6 +1460,16 @@
>   * @bpf_prog_free_security:
>   *	Clean up the security information stored inside bpf prog.
>   *
> + * Security hooks for containers:
> + *
> + * @container_alloc:
> + *	Permit creation of a new container and assign security
> data.
> + *	@container: The new container.
> + *
> + * @container_free:
> + *	Free security data attached to a container.
> + *	@container: The container.
> + *
>   */
>  union security_list_options {
>  	int (*binder_set_context_mgr)(struct task_struct *mgr);
> @@ -1825,6 +1835,12 @@ union security_list_options {
>  	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
>  	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
>  #endif /* CONFIG_BPF_SYSCALL */
> +
> +	/* Container management security hooks */
> +#ifdef CONFIG_CONTAINERS
> +	int (*container_alloc)(struct container *container, unsigned
> int flags);
> +	void (*container_free)(struct container *container);
> +#endif
>  };
>  
>  struct security_hook_heads {
> @@ -2069,6 +2085,10 @@ struct security_hook_heads {
>  	struct hlist_head bpf_prog_alloc_security;
>  	struct hlist_head bpf_prog_free_security;
>  #endif /* CONFIG_BPF_SYSCALL */
> +#ifdef CONFIG_CONTAINERS
> +	struct hlist_head container_alloc;
> +	struct hlist_head container_free;
> +#endif /* CONFIG_CONTAINERS */
>  } __randomize_layout;
>  
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d2f90fa92468..073a3a930514 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,6 +36,7 @@ struct backing_dev_info;
>  struct bio_list;
>  struct blk_plug;
>  struct cfs_rq;
> +struct container;
>  struct fs_struct;
>  struct futex_pi_state;
>  struct io_context;
> @@ -870,6 +871,8 @@ struct task_struct {
>  
>  	/* Namespaces: */
>  	struct nsproxy			*nsproxy;
> +	struct container		*container;
> +	struct list_head		container_link;
>  
>  	/* Signal handlers: */
>  	struct signal_struct		*signal;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index da538c06766f..acd0c14c6e95 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -70,6 +70,7 @@ struct ctl_table;
>  struct audit_krule;
>  struct user_namespace;
>  struct timezone;
> +struct container;
>  
>  enum lsm_event {
>  	LSM_POLICY_CHANGE,
> @@ -1751,6 +1752,20 @@ static inline void
> security_audit_rule_free(void *lsmrule)
>  #endif /* CONFIG_SECURITY */
>  #endif /* CONFIG_AUDIT */
>  
> +#ifdef CONFIG_CONTAINERS
> +#ifdef CONFIG_SECURITY
> +int security_container_alloc(struct container *container, unsigned
> int flags);
> +void security_container_free(struct container *container);
> +#else
> +static inline int security_container_alloc(struct container
> *container,
> +					   unsigned int flags)
> +{
> +	return 0;
> +}
> +static inline void security_container_free(struct container
> *container) {}
> +#endif
> +#endif /* CONFIG_CONTAINERS */
> +
>  #ifdef CONFIG_SECURITYFS
>  
>  extern struct dentry *securityfs_create_file(const char *name,
> umode_t mode,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 10127b1d923b..dac42098c2dd 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -943,6 +943,9 @@ asmlinkage long sys_mount_notify(int dfd, const
> char __user *path,
>  				 unsigned int at_flags, int
> watch_fd, int watch_id);
>  asmlinkage long sys_sb_notify(int dfd, const char __user *path,
>  			      unsigned int at_flags, int watch_fd,
> int watch_id);
> +asmlinkage long sys_container_create(const char __user *name,
> unsigned int flags,
> +				     unsigned long spare3, unsigned
> long spare4,
> +				     unsigned long spare5);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/container.h
> b/include/uapi/linux/container.h
> new file mode 100644
> index 000000000000..43748099b28d
> --- /dev/null
> +++ b/include/uapi/linux/container.h
> @@ -0,0 +1,28 @@
> +/* Container UAPI
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _UAPI_LINUX_CONTAINER_H
> +#define _UAPI_LINUX_CONTAINER_H
> +
> +
> +#define CONTAINER_NEW_FS_NS		0x00000001 /* Dup current
> fs namespace */
> +#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002 /* Provide new
> empty fs namespace */
> +#define CONTAINER_NEW_CGROUP_NS		0x00000004 /* Dup
> current cgroup namespace */
> +#define CONTAINER_NEW_UTS_NS		0x00000008 /* Dup
> current uts namespace */
> +#define CONTAINER_NEW_IPC_NS		0x00000010 /* Dup
> current ipc namespace */
> +#define CONTAINER_NEW_USER_NS		0x00000020 /* Dup
> current user namespace */
> +#define CONTAINER_NEW_PID_NS		0x00000040 /* Dup
> current pid namespace */
> +#define CONTAINER_NEW_NET_NS		0x00000080 /* Dup
> current net namespace */
> +#define CONTAINER_KILL_ON_CLOSE		0x00000100 /* Kill
> all member processes when fd closed */
> +#define CONTAINER_FD_CLOEXEC		0x00000200 /* Close the
> fd on exec */
> +#define CONTAINER__FLAG_MASK		0x000003ff
> +
> +#endif /* _UAPI_LINUX_CONTAINER_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 5984dd7f2156..ab37c3a55aa1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -992,6 +992,13 @@ config NET_NS
>  	  Allow user space to create what appear to be multiple
> instances
>  	  of the network stack.
>  
> +config CONTAINERS
> +	bool "Container support"
> +	default y
> +	help
> +	  Allow userspace to create and manipulate containers as
> objects that
> +	  have namespaces and hold a set of processes.
> +
>  endif # NAMESPACES
>  
>  config CHECKPOINT_RESTORE
> diff --git a/init/init_task.c b/init/init_task.c
> index 5aebe3be4d7c..90c7439a195b 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -108,6 +108,9 @@ struct task_struct init_task
>  	.signal		= &init_signals,
>  	.sighand	= &init_sighand,
>  	.nsproxy	= &init_nsproxy,
> +	.container	= &init_container,
> +	.container_link.next = &init_container.members,
> +	.container_link.prev = &init_container.members,
>  	.pending	= {
>  		.list = LIST_HEAD_INIT(init_task.pending.list),
>  		.signal = {{0}}
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 6aa7543bcdb2..98cdd18cecef 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -8,7 +8,7 @@ obj-y     = fork.o exec_domain.o panic.o \
>  	    sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
>  	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
>  	    extable.o params.o \
> -	    kthread.o sys_ni.o nsproxy.o \
> +	    kthread.o sys_ni.o nsproxy.o container.o \
>  	    notifier.o ksysfs.o cred.o reboot.o \
>  	    async.o range.o smpboot.o ucount.o
>  
> diff --git a/kernel/container.c b/kernel/container.c
> new file mode 100644
> index 000000000000..ca4012632cfa
> --- /dev/null
> +++ b/kernel/container.c
> @@ -0,0 +1,348 @@
> +/* Implement container objects.
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <linux/init_task.h>
> +#include <linux/fs.h>
> +#include <linux/fs_struct.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/container.h>
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/security.h>
> +#include "namespaces.h"
> +
> +struct container init_container = {
> +	.name		= ".init",
> +	.id		= 1,
> +	.usage		= REFCOUNT_INIT(2),
> +	.cred		= &init_cred,
> +	.ns		= &init_nsproxy,
> +	.init		= &init_task,
> +	.members.next	= &init_task.container_link,
> +	.members.prev	= &init_task.container_link,
> +	.children	= LIST_HEAD_INIT(init_container.children),
> +	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
> +	.lock		=
> __SPIN_LOCK_UNLOCKED(init_container.lock),
> +	.seq		= SEQCNT_ZERO(init_fs.seq),
> +};
> +
> +#ifdef CONFIG_CONTAINERS
> +
> +static atomic64_t container_id_counter = ATOMIC_INIT(1);
> +
> +/*
> + * Drop a ref on a container and clear it if no longer in use.
> + */
> +void put_container(struct container *c)
> +{
> +	struct container *parent;
> +
> +	while (c && refcount_dec_and_test(&c->usage)) {
> +		BUG_ON(!list_empty(&c->members));
> +		if (c->ns)
> +			put_nsproxy(c->ns);
> +		path_put(&c->root);
> +
> +		parent = c->parent;
> +		if (parent) {
> +			spin_lock(&parent->lock);
> +			list_del(&c->child_link);
> +			spin_unlock(&parent->lock);
> +		}
> +
> +		if (c->cred)
> +			put_cred(c->cred);
> +		security_container_free(c);
> +		kfree(c);
> +		c = parent;
> +	}
> +}
> +
> +/*
> + * Allow the user to poll for the container dying.
> + */
> +static unsigned int container_poll(struct file *file, poll_table
> *wait)
> +{
> +	struct container *container = file->private_data;
> +	unsigned int mask = 0;
> +
> +	poll_wait(file, &container->waitq, wait);
> +
> +	if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
> +		mask |= POLLHUP;
> +
> +	return mask;
> +}
> +
> +static int container_release(struct inode *inode, struct file *file)
> +{
> +	struct container *container = file->private_data;
> +
> +	put_container(container);
> +	return 0;
> +}
> +
> +const struct file_operations container_fops = {
> +	.poll		= container_poll,
> +	.release	= container_release,
> +};
> +
> +/*
> + * Handle fork/clone.
> + *
> + * A process inherits its parent's container.  The first process
> into the
> + * container is its 'init' process and the life of everything else
> in there is
> + * dependent upon that.
> + */
> +int copy_container(unsigned long flags, struct task_struct *tsk,
> +		   struct container *container)
> +{
> +	struct container *c = container ?: tsk->container;
> +	int ret = -ECANCELED;
> +
> +	spin_lock(&c->lock);
> +
> +	if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
> +		list_add_tail(&tsk->container_link, &c->members);
> +		get_container(c);
> +		tsk->container = c;
> +		if (!c->init) {
> +			set_bit(CONTAINER_FLAG_INIT_STARTED, &c-
> >flags);
> +			c->init = tsk;
> +		}
> +		ret = 0;
> +	}
> +
> +	spin_unlock(&c->lock);
> +	return ret;
> +}
> +
> +/*
> + * Remove a dead process from a container.
> + *
> + * If the 'init' process in a container dies, we kill off all the
> other
> + * processes in the container.
> + */
> +void exit_container(struct task_struct *tsk)
> +{
> +	struct task_struct *p;
> +	struct container *c = tsk->container;
> +	struct kernel_siginfo si = {
> +		.si_signo = SIGKILL,
> +		.si_code  = SI_KERNEL,
> +	};
> +
> +	spin_lock(&c->lock);
> +
> +	list_del(&tsk->container_link);
> +
> +	if (c->init == tsk) {
> +		c->init = NULL;
> +		c->exit_code = tsk->exit_code;
> +		smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
> +		set_bit(CONTAINER_FLAG_DEAD, &c->flags);
> +		wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
> +
> +		list_for_each_entry(p, &c->members, container_link)
> {
> +			si.si_pid = task_tgid_vnr(p);
> +			send_sig_info(SIGKILL, &si, p);
> +		}
> +	}
> +
> +	spin_unlock(&c->lock);
> +	put_container(c);
> +}
> +
> +/*
> + * Allocate a container.
> + */
> +static struct container *alloc_container(const char __user *name)
> +{
> +	struct container *c;
> +	long len;
> +	int ret;
> +
> +	c = kzalloc(sizeof(struct container), GFP_KERNEL);
> +	if (!c)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&c->members);
> +	INIT_LIST_HEAD(&c->children);
> +	init_waitqueue_head(&c->waitq);
> +	spin_lock_init(&c->lock);
> +	refcount_set(&c->usage, 1);
> +
> +	ret = -EFAULT;
> +	len = strncpy_from_user(c->name, name, sizeof(c->name));
> +	if (len < 0)
> +		goto err;
> +	ret = -ENAMETOOLONG;
> +	if (len >= sizeof(c->name))
> +		goto err;
> +	ret = -EINVAL;
> +	if (strchr(c->name, '/'))
> +		goto err;
> +
> +	c->name[len] = 0;
> +	return c;
> +
> +err:
> +	kfree(c);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create some creds for the container.  We don't want to pin things
> we don't
> + * have to, so drop all keyrings from the new cred.  The LSM gets to
> audit the
> + * cred struct when security_container_alloc() is invoked.
> + */
> +static const struct cred *create_container_creds(unsigned int flags)
> +{
> +	struct cred *new;
> +	int ret;
> +
> +	new = prepare_creds();
> +	if (!new)
> +		return ERR_PTR(-ENOMEM);
> +
> +#ifdef CONFIG_KEYS
> +	key_put(new->thread_keyring);
> +	new->thread_keyring = NULL;
> +	key_put(new->process_keyring);
> +	new->process_keyring = NULL;
> +	key_put(new->session_keyring);
> +	new->session_keyring = NULL;
> +	key_put(new->request_key_auth);
> +	new->request_key_auth = NULL;
> +#endif
> +
> +	if (flags & CONTAINER_NEW_USER_NS) {
> +		ret = create_user_ns(new);
> +		if (ret < 0)
> +			goto err;
> +		new->euid = new->user_ns->owner;
> +		new->egid = new->user_ns->group;
> +	}
> +
> +	new->fsuid = new->suid = new->uid = new->euid;
> +	new->fsgid = new->sgid = new->gid = new->egid;
> +	return new;
> +
> +err:
> +	abort_creds(new);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container.
> + */
> +static struct container *create_container(const char __user *name,
> unsigned int flags)
> +{
> +	struct container *parent, *c;
> +	struct fs_struct *fs;
> +	struct nsproxy *ns;
> +	const struct cred *cred;
> +	int ret;
> +
> +	c = alloc_container(name);
> +	if (IS_ERR(c))
> +		return c;
> +
> +	if (flags & CONTAINER_KILL_ON_CLOSE)
> +		__set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
> +
> +	cred = create_container_creds(flags);
> +	if (IS_ERR(cred)) {
> +		ret = PTR_ERR(cred);
> +		goto err_cont;
> +	}
> +	c->cred = cred;
> +
> +	ret = -ENOMEM;
> +	fs = copy_fs_struct(current->fs);
> +	if (!fs)
> +		goto err_cont;
> +
> +	ns = create_new_namespaces(
> +		(flags & CONTAINER_NEW_FS_NS	 ? CLONE_NEWNS :
> 0) |
> +		(flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP :
> 0) |
> +		(flags & CONTAINER_NEW_UTS_NS	 ? CLONE_NEWUTS
> : 0) |
> +		(flags & CONTAINER_NEW_IPC_NS	 ? CLONE_NEWIPC
> : 0) |
> +		(flags & CONTAINER_NEW_PID_NS	 ? CLONE_NEWPID
> : 0) |
> +		(flags & CONTAINER_NEW_NET_NS	 ? CLONE_NEWNET
> : 0),
> +		current->nsproxy, cred->user_ns, fs);
> +	if (IS_ERR(ns)) {
> +		ret = PTR_ERR(ns);
> +		goto err_fs;
> +	}
> +
> +	c->ns = ns;
> +	c->root = fs->root;
> +	c->seq = fs->seq;
> +	fs->root.mnt = NULL;
> +	fs->root.dentry = NULL;
> +
> +	ret = security_container_alloc(c, flags);
> +	if (ret < 0)
> +		goto err_fs;
> +
> +	parent = current->container;
> +	get_container(parent);
> +	c->parent = parent;
> +	c->id = atomic64_inc_return(&container_id_counter);
> +	spin_lock(&parent->lock);
> +	list_add_tail(&c->child_link, &parent->children);
> +	spin_unlock(&parent->lock);
> +	return c;
> +
> +err_fs:
> +	free_fs_struct(fs);
> +err_cont:
> +	put_container(c);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container object.
> + */
> +SYSCALL_DEFINE5(container_create,
> +		const char __user *, name,
> +		unsigned int, flags,
> +		unsigned long, spare3,
> +		unsigned long, spare4,
> +		unsigned long, spare5)
> +{
> +	struct container *c;
> +	int fd;
> +
> +	if (!name ||
> +	    flags & ~CONTAINER__FLAG_MASK ||
> +	    spare3 != 0 || spare4 != 0 || spare5 != 0)
> +		return -EINVAL;
> +	if ((flags & (CONTAINER_NEW_FS_NS |
> CONTAINER_NEW_EMPTY_FS_NS)) ==
> +	    (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
> +		return -EINVAL;
> +
> +	c = create_container(name, flags);
> +	if (IS_ERR(c))
> +		return PTR_ERR(c);
> +
> +	fd = anon_inode_getfd("container", &container_fops, c,
> +			      O_RDWR | (flags & CONTAINER_FD_CLOEXEC
> ? O_CLOEXEC : 0));
> +	if (fd < 0)
> +		put_container(c);
> +	return fd;
> +}
> +
> +#endif /* CONFIG_CONTAINERS */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 284f2fe9a293..78f6065ad799 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -864,6 +864,7 @@ void __noreturn do_exit(long code)
>  	if (group_dead)
>  		disassociate_ctty(1);
>  	exit_task_namespaces(tsk);
> +	exit_container(tsk);
>  	exit_task_work(tsk);
>  	exit_thread(tsk);
>  	exit_umh(tsk);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b69248e6f0e0..009cf7e63894 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1920,9 +1920,12 @@ static __latent_entropy struct task_struct
> *copy_process(
>  	retval = copy_namespaces(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_mm;
> -	retval = copy_io(clone_flags, p);
> +	retval = copy_container(clone_flags, p, NULL);
>  	if (retval)
>  		goto bad_fork_cleanup_namespaces;
> +	retval = copy_io(clone_flags, p);
> +	if (retval)
> +		goto bad_fork_cleanup_container;
>  	retval = copy_thread_tls(clone_flags, stack_start,
> stack_size, p, tls);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
> @@ -2121,6 +2124,8 @@ static __latent_entropy struct task_struct
> *copy_process(
>  bad_fork_cleanup_io:
>  	if (p->io_context)
>  		exit_io_context(p);
> +bad_fork_cleanup_container:
> +	exit_container(p);
>  bad_fork_cleanup_namespaces:
>  	exit_task_namespaces(p);
>  bad_fork_cleanup_mm:
> diff --git a/kernel/namespaces.h b/kernel/namespaces.h
> new file mode 100644
> index 000000000000..c44e3cf0e254
> --- /dev/null
> +++ b/kernel/namespaces.h
> @@ -0,0 +1,15 @@
> +/* Local namespaces defs
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +extern struct nsproxy *create_new_namespaces(unsigned long flags,
> +					     struct nsproxy
> *nsproxy,
> +					     struct user_namespace
> *user_ns,
> +					     struct fs_struct
> *new_fs);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index f6c5d330059a..4bb5184b3a80 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -27,6 +27,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/cgroup.h>
>  #include <linux/perf_event.h>
> +#include "namespaces.h"
>  
>  static struct kmem_cache *nsproxy_cachep;
>  
> @@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
>   * Return the newly created nsproxy.  Do not attach this to the
> task,
>   * leave it to the caller to do proper locking and attach it to
> task.
>   */
> -static struct nsproxy *create_new_namespaces(unsigned long flags,
> -	struct task_struct *tsk, struct user_namespace *user_ns,
> +struct nsproxy *create_new_namespaces(unsigned long flags,
> +	struct nsproxy *nsproxy, struct user_namespace *user_ns,
>  	struct fs_struct *new_fs)
>  {
>  	struct nsproxy *new_nsp;
> @@ -72,39 +73,39 @@ static struct nsproxy
> *create_new_namespaces(unsigned long flags,
>  	if (!new_nsp)
>  		return ERR_PTR(-ENOMEM);
>  
> -	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns,
> user_ns, new_fs);
> +	new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns,
> user_ns, new_fs);
>  	if (IS_ERR(new_nsp->mnt_ns)) {
>  		err = PTR_ERR(new_nsp->mnt_ns);
>  		goto out_ns;
>  	}
>  
> -	new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy-
> >uts_ns);
> +	new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy-
> >uts_ns);
>  	if (IS_ERR(new_nsp->uts_ns)) {
>  		err = PTR_ERR(new_nsp->uts_ns);
>  		goto out_uts;
>  	}
>  
> -	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy-
> >ipc_ns);
> +	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy-
> >ipc_ns);
>  	if (IS_ERR(new_nsp->ipc_ns)) {
>  		err = PTR_ERR(new_nsp->ipc_ns);
>  		goto out_ipc;
>  	}
>  
>  	new_nsp->pid_ns_for_children =
> -		copy_pid_ns(flags, user_ns, tsk->nsproxy-
> >pid_ns_for_children);
> +		copy_pid_ns(flags, user_ns, nsproxy-
> >pid_ns_for_children);
>  	if (IS_ERR(new_nsp->pid_ns_for_children)) {
>  		err = PTR_ERR(new_nsp->pid_ns_for_children);
>  		goto out_pid;
>  	}
>  
>  	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> -					    tsk->nsproxy-
> >cgroup_ns);
> +					    nsproxy->cgroup_ns);
>  	if (IS_ERR(new_nsp->cgroup_ns)) {
>  		err = PTR_ERR(new_nsp->cgroup_ns);
>  		goto out_cgroup;
>  	}
>  
> -	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy-
> >net_ns);
> +	new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy-
> >net_ns);
>  	if (IS_ERR(new_nsp->net_ns)) {
>  		err = PTR_ERR(new_nsp->net_ns);
>  		goto out_net;
> @@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct
> task_struct *tsk)
>  		(CLONE_NEWIPC | CLONE_SYSVSEM)) 
>  		return -EINVAL;
>  
> -	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk-
> >fs);
> +	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns,
> tsk->fs);
>  	if (IS_ERR(new_ns))
>  		return  PTR_ERR(new_ns);
>  
> @@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long
> unshare_flags,
>  	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> -	*new_nsp = create_new_namespaces(unshare_flags, current,
> user_ns,
> +	*new_nsp = create_new_namespaces(unshare_flags, current-
> >nsproxy, user_ns,
>  					 new_fs ? new_fs : current-
> >fs);
>  	if (IS_ERR(*new_nsp)) {
>  		err = PTR_ERR(*new_nsp);
> @@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
>  	if (nstype && (ns->ops->type != nstype))
>  		goto out;
>  
> -	new_nsproxy = create_new_namespaces(0, tsk,
> current_user_ns(), tsk->fs);
> +	new_nsproxy = create_new_namespaces(0, tsk->nsproxy,
> current_user_ns(), tsk->fs);
>  	if (IS_ERR(new_nsproxy)) {
>  		err = PTR_ERR(new_nsproxy);
>  		goto out;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a4e7131b2509..f0455cbb91cf 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -136,6 +136,9 @@ COND_SYSCALL(acct);
>  COND_SYSCALL(capget);
>  COND_SYSCALL(capset);
>  
> +/* kernel/container.c */
> +COND_SYSCALL(container_create);
> +
>  /* kernel/exec_domain.c */
>  
>  /* kernel/exit.c */
> diff --git a/security/security.c b/security/security.c
> index b49732c02e21..259be9a1746c 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1864,3 +1864,15 @@ void security_bpf_prog_free(struct
> bpf_prog_aux *aux)
>  	call_void_hook(bpf_prog_free_security, aux);
>  }
>  #endif /* CONFIG_BPF_SYSCALL */
> +
> +#ifdef CONFIG_CONTAINERS
> +int security_container_alloc(struct container *container, unsigned
> int flags)
> +{
> +	return call_int_hook(container_alloc, 0, container, flags);
> +}
> +
> +void security_container_free(struct container *container)
> +{
> +	call_void_hook(container_free, container);
> +}
> +#endif /* CONFIG_CONTAINERS */
>
Eric W. Biederman Feb. 19, 2019, 4:56 p.m. UTC | #3
David Howells <dhowells@redhat.com> writes:

The container id details are ludicrous and will break practically
every use case.  This completely unacceptable.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> diff --git a/include/linux/container.h b/include/linux/container.h
> new file mode 100644
> index 000000000000..0a8918435097
> --- /dev/null
> +++ b/include/linux/container.h
> +/*
> + * The container object.
> + */
> +struct container {
> +	u64			id;		/* Container ID */
...

No.  This is absolutely unacceptable.
As this breaks breaks nested containers and process migration.

> +};
> +
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d2f90fa92468..073a3a930514 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,6 +36,7 @@ struct backing_dev_info;
>  struct bio_list;
>  struct blk_plug;
>  struct cfs_rq;
> +struct container;
>  struct fs_struct;
>  struct futex_pi_state;
>  struct io_context;
> @@ -870,6 +871,8 @@ struct task_struct {
>  
>  	/* Namespaces: */
>  	struct nsproxy			*nsproxy;
> +	struct container		*container;
> +	struct list_head		container_link;

Why?  nsproxy would be a much cheaper location to put this.
Less space and less foobar.

>  	/* Signal handlers: */
>  	struct signal_struct		*signal;
> diff --git a/kernel/container.c b/kernel/container.c
> new file mode 100644
> index 000000000000..ca4012632cfa
> --- /dev/null
> +++ b/kernel/container.c
> @@ -0,0 +1,348 @@
[...]
> +
> +	c->id = atomic64_inc_return(&container_id_counter);

This id is not in a namespace, and it doesn't have enough bits
of entropy to be globally unique.   Not that 64bit is enough
to have a chance at being globablly unique.


Eric
David Howells Feb. 19, 2019, 11:03 p.m. UTC | #4
Trond Myklebust <trondmy@hammerspace.com> wrote:

> Do we really need a new system call to set up containers? That would
> force changes to all existing orchestration software.

No, it wouldn't.  Nothing in my patches forces existing orchestration software
to change, unless it wants to use the new facilities - then it would have to
be changed anyway, right?  I will grant, though, that the extent of the change
might vary.

> Given that the main thing we want to achieve is to direct messages from
> the kernel to an appropriate handler, why not focus on adding
> functionality to do just that?

Because it's *not* just that that is added here.  There are a number of things
this patchset (and one it depends on) provides:

 (1) The ability to intercept request_key() upcalls that happen inside a
     container, filtered by operative namespace.

 (2) The ability to provide a per-container keyring that can hold keys that
     can be used inside the container without any action on behalf of the
     denizens of the container.

 (3) The ability to grant permissions to a *container* as a subject, allowing
     it and its denizens to use, but not necessarily read, modify, link or
     invalidate a key.

 (4) The ability to create superblocks inside a container with a separate
     mount namespace from outside, such that they can use the container keys,
     thereby allowing the root of a container to be on an authenticated
     filesystem.

> Is there any reason why a syscall to allow an appropriately privileged
> process to add a keyring-specific message queue to its own
> user_namespace and obtain a file descriptor to that message queue might
> not work?

Yes.  That forces the use of a new user_namespace for every container in which
you want to use any of the above features.  The user_namespace is already way
too big and intrusive a hammer as it is.

> With such an implementation, the fallback mechanism could be to walk
> back up the hierarchy of user_namespaces until a message queue is
> found, and to invoke the existing request_key mechanism if not.

That's definitely wrong.  /sbin/request-key should *not* be spawned if the key
to be instantiated is not in all the init namespaces.

I went with a container object with namespaces for a reason: initially, it was
so that the upcall could take place inside of the container's namespaces, but
now it's do that any request that doesn't match the namespaces on the
container gets rejected at the boundary - so that some daemon up the chain
doesn't try servicing a request for which it can't access the config data or
would end up talking out of the wrong NIC.

I can drop the container object part of it for the moment.

I could instead create 1-3 new namespaces:

 (1) A namespace with an upcall-interception point.

 (2) A namespace with a container keyring.

 (3) A namespace with a subject ID for use in key ACLs.

I think I should also consider adding:

 (4) A namespace with keyring names in it.  I'm leaning towards this not being
     part of user_namespace because these probably should not be visible
     between containers.

David
David Howells Feb. 19, 2019, 11:06 p.m. UTC | #5
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> I thought we got agreement years ago that containers don't exist in
> Linux as a single entity: they're currently a collection of cgroups and
> namespaces some of which may and some of which may not be local to the
> entity the orchestration system thinks of as a "container".

I wasn't party to that agreement and don't feel particularly bound by it.

David
David Howells Feb. 19, 2019, 11:13 p.m. UTC | #6
Eric W. Biederman <ebiederm@xmission.com> wrote:

> > +	c->id = atomic64_inc_return(&container_id_counter);
> 
> This id is not in a namespace, and it doesn't have enough bits
> of entropy to be globally unique.   Not that 64bit is enough
> to have a chance at being globablly unique.

It's in a container, so it doesn't need to be in a namespace.  The intended
purpose is for annotating audit messages.  Globally unique wasn't particularly
in mind.  It could be turned into, say, a uuid, so that isn't really a problem
at this point.

You are right, though, it really should be globally unique as best possible -
even the one in init_container should be.  Ideally, it would look the same
inside the root container as any subcontainer.

David
Tycho Andersen Feb. 19, 2019, 11:55 p.m. UTC | #7
On Fri, Feb 15, 2019 at 04:07:33PM +0000, David Howells wrote:
> ==================
> FUTURE DEVELOPMENT
> ==================
> 
>  (1) Setting up the container.
> 
>      A container would be created with, say:
> 
> 	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 

...

>      Further mounts can be added by:
> 
> 	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 

...

>  (2) Starting the container.
> 
>      Once all modifications are complete, the container's 'init' process
>      can be started by:
> 
> 	fork_into_container(int cfd);
> 
>      This precludes further external modification of the mount tree within
>      the container.

Is there a technical reason for this? In particular, there are some
container runtimes that do this today via clever use of bind mounts
and MS_MOVE, for things like dynamically attaching volumes. It would
be useful to be able to mount things into the container after the
fact.

>  (3) Waiting for the container to complete.
> 
>      The container fd can then be polled to wait for init process therein
>      to complete and the exit code collected by:
> 
> 	container_wait(int container_fd, int *_wstatus, unsigned int wait,
> 		       struct rusage *rusage);
> 
>      The container and everything in it can be terminated or killed off:
> 
> 	container_kill(int container_fd, int initonly, int signal);
> 
>      If 'init' dies, all other processes in the container are preemptively
>      SIGKILL'd by the kernel.

Isn't this essentially how the pid ns works today? I'm not sure what
the container fd offers here (of course if it lands, then having the
same semantics makes sense).

>  (6) Running different LSM policies by container.  This might particularly
>      make sense with something like Apparmor where different path-based
>      rules might be required inside a container to inside the parent.

Apparmor supports this today, as long as the host is also running
Apparmor. For the more general case, Casey (and others) have been
working on LSM stacking for a long time.

Tycho
James Bottomley Feb. 20, 2019, 2:20 a.m. UTC | #8
On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > I thought we got agreement years ago that containers don't exist in
> > Linux as a single entity: they're currently a collection of cgroups
> > and namespaces some of which may and some of which may not be local
> > to the entity the orchestration system thinks of as a "container".
> 
> I wasn't party to that agreement and don't feel particularly bound by
> it.

That's not at all relevant, is it?  The point is we have widespread
uses of namespaces and cgroups that span containers today meaning that
a "container id" becomes a problematic concept.  What we finally got to
with the audit people was an unmodifiable label which the orchestration
system can set ... can't you just use that?

James
Ian Kent Feb. 20, 2019, 2:46 a.m. UTC | #9
On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> Implement a kernel container object such that it contains the following
> things:
> 
>  (1) Namespaces.
> 
>  (2) A root directory.
> 
>  (3) A set of processes, including one designated as the 'init' process.

Yeah, I think a name other than init needs to be used for this
process.

The problem being that there is no requirement for container
process 1 to behave in any way like an "init" process is
expected to behave and that leads to confusion (at least
it certainly did for me).

Admittedly I haven't yet worked through the series but in the
light of the comments from James I wanted to chime in (probably
too early to be useful not having read the series but ...).

I believe what your trying to do here is so badly needed it
would be great if the needs of James could be met to some
(as yet undefined) satisfactory extent.

Would there be any possibility of introducing a concept of
inactive and active containers where the creation is a two
(maybe more) step procedure, first the creation of (if you like
a "true") container that's essentially empty, basically a shell
(not the program "shell" of course), inert wrt. events and such
and implement the ability to make the container active by adding
various things, like processes, to it?

Clearly the concepts of inactive and active require a definition
of what they mean and I don't have that, perhaps a starting point
could be a container that has a process 1 (which should also require
a root fs and namespaces) is active otherwise it's considered inactive.

Ian
Ian Kent Feb. 20, 2019, 3:04 a.m. UTC | #10
On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > 
> > > I thought we got agreement years ago that containers don't exist in
> > > Linux as a single entity: they're currently a collection of cgroups
> > > and namespaces some of which may and some of which may not be local
> > > to the entity the orchestration system thinks of as a "container".
> > 
> > I wasn't party to that agreement and don't feel particularly bound by
> > it.
> 
> That's not at all relevant, is it?  The point is we have widespread
> uses of namespaces and cgroups that span containers today meaning that
> a "container id" becomes a problematic concept.  What we finally got to
> with the audit people was an unmodifiable label which the orchestration
> system can set ... can't you just use that?

Sorry James, I fail to see how assigning an id to a collection of objects
constitutes a problem or how that could restrict the way a container is
used.

Isn't the only problem here the current restrictions on the way objects
need to be combined as a set and the ability to be able add or subtract
from that set.

Then again the notion of active vs. inactive might not be sufficient to
allow for the needed flexibility ...

Ian
James Bottomley Feb. 20, 2019, 3:46 a.m. UTC | #11
On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote:
> On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > 
> > > > I thought we got agreement years ago that containers don't
> > > > exist in Linux as a single entity: they're currently a
> > > > collection of cgroups and namespaces some of which may and some
> > > > of which may not be local to the entity the orchestration
> > > > system thinks of as a "container".
> > > 
> > > I wasn't party to that agreement and don't feel particularly
> > > bound by it.
> > 
> > That's not at all relevant, is it?  The point is we have widespread
> > uses of namespaces and cgroups that span containers today meaning
> > that a "container id" becomes a problematic concept.  What we
> > finally got to with the audit people was an unmodifiable label
> > which the orchestration system can set ... can't you just use that?
> 
> Sorry James, I fail to see how assigning an id to a collection of
> objects constitutes a problem or how that could restrict the way a
> container is used.

Rather than rehash the whole argument again, what's the reason you
can't use the audit label?  It seems to do what you want in a way that
doesn't cause problems.  If you can just use it there's little point
arguing over what is effectively a moot issue.

James


> Isn't the only problem here the current restrictions on the way
> objects need to be combined as a set and the ability to be able add
> or subtract from that set.
> 
> Then again the notion of active vs. inactive might not be sufficient
> to allow for the needed flexibility ...
> 
> Ian
>
Ian Kent Feb. 20, 2019, 4:42 a.m. UTC | #12
On Tue, 2019-02-19 at 19:46 -0800, James Bottomley wrote:
> On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote:
> > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > > 
> > > > > I thought we got agreement years ago that containers don't
> > > > > exist in Linux as a single entity: they're currently a
> > > > > collection of cgroups and namespaces some of which may and some
> > > > > of which may not be local to the entity the orchestration
> > > > > system thinks of as a "container".
> > > > 
> > > > I wasn't party to that agreement and don't feel particularly
> > > > bound by it.
> > > 
> > > That's not at all relevant, is it?  The point is we have widespread
> > > uses of namespaces and cgroups that span containers today meaning
> > > that a "container id" becomes a problematic concept.  What we
> > > finally got to with the audit people was an unmodifiable label
> > > which the orchestration system can set ... can't you just use that?
> > 
> > Sorry James, I fail to see how assigning an id to a collection of
> > objects constitutes a problem or how that could restrict the way a
> > container is used.
> 
> Rather than rehash the whole argument again, what's the reason you
> can't use the audit label?  It seems to do what you want in a way that
> doesn't cause problems.  If you can just use it there's little point
> arguing over what is effectively a moot issue.

David might want to use the audit label for this, I don't know.
And maybe that's a good choice initially.

But going way off topic.

Because there is a need to not clutter kernel space with logging,
leaving it to user space to handle but also without providing user
space with sufficient information to do so there will need to be
some sort of globally unique (sub-system) identifiers of kernel
objects for which user space needs logging information so that
if or when that kernel to user space information flow is
implemented the consistent identifiers that will be needed will
at least exist for some kernel objects.

Yes, that's way off topic for this series but I think it's something
that needs at least some consideration for new implementation work.

Unfortunately properly implementing such an encoding scheme probably
warrants a completely separate project so, as you say moot wrt. this
series.

> 
> James
> 
> 
> > Isn't the only problem here the current restrictions on the way
> > objects need to be combined as a set and the ability to be able add
> > or subtract from that set.
> > 
> > Then again the notion of active vs. inactive might not be sufficient
> > to allow for the needed flexibility ...
> > 
> > Ian
> > 
> 
>
Paul Moore Feb. 20, 2019, 6:57 a.m. UTC | #13
On Tue, Feb 19, 2019 at 10:46 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Wed, 2019-02-20 at 11:04 +0800, Ian Kent wrote:
> > On Tue, 2019-02-19 at 18:20 -0800, James Bottomley wrote:
> > > On Tue, 2019-02-19 at 23:06 +0000, David Howells wrote:
> > > > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > >
> > > > > I thought we got agreement years ago that containers don't
> > > > > exist in Linux as a single entity: they're currently a
> > > > > collection of cgroups and namespaces some of which may and some
> > > > > of which may not be local to the entity the orchestration
> > > > > system thinks of as a "container".
> > > >
> > > > I wasn't party to that agreement and don't feel particularly
> > > > bound by it.
> > >
> > > That's not at all relevant, is it?  The point is we have widespread
> > > uses of namespaces and cgroups that span containers today meaning
> > > that a "container id" becomes a problematic concept.  What we
> > > finally got to with the audit people was an unmodifiable label
> > > which the orchestration system can set ... can't you just use that?
> >
> > Sorry James, I fail to see how assigning an id to a collection of
> > objects constitutes a problem or how that could restrict the way a
> > container is used.
>
> Rather than rehash the whole argument again, what's the reason you
> can't use the audit label?  It seems to do what you want in a way that
> doesn't cause problems.  If you can just use it there's little point
> arguing over what is effectively a moot issue.

Ignoring for a moment whether or not the audit container ID is
applicable here, one of the things I've been focused on with the audit
container ID work is trying to make it difficult for other subsystems
to use it.  I've taken this stance not because I don't think something
like a container ID would be useful outside the audit subsystem, but
rather because I'm afraid of how it might be abused by other
subsystems and that abuse might threaten the existence of the audit
container ID.

If there is a willingness to implement a general kernel container ID
that behaves similarly to how the audit container ID is envisioned,
I'd much rather do that then implement something which is audit
specific.
Christian Brauner Feb. 20, 2019, 1:26 p.m. UTC | #14
On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > Implement a kernel container object such that it contains the following
> > things:
> > 
> >  (1) Namespaces.
> > 
> >  (2) A root directory.
> > 
> >  (3) A set of processes, including one designated as the 'init' process.
> 
> Yeah, I think a name other than init needs to be used for this
> process.
> 
> The problem being that there is no requirement for container
> process 1 to behave in any way like an "init" process is
> expected to behave and that leads to confusion (at least
> it certainly did for me).

If you look at the documentation for pid namespaces(7) you can see that
the pid 1 inside a pid namespace is expected to behave like an init
process:
-  "The  first  process created in a new namespace [...] has  the PID 1,
   and is the "init" process for the namespace (see init(1))."
- "[...] child process that is orphaned within the namespace will be
  reparented to this process rather than init(1) [...]"
- "If the "init" process of a PID namespace terminates, the kernel
  terminates all of the processes in the  namespace  via a SIGKILL
  signal. This behavior reflects the fact that the "init" process is
  essential for the cor‐ rect operation of a PID namespace."
- "Only signals for which the "init" process has established a signal
  handler can be sent to the  "init" process by other members of the
  PID namespace."
- "[...] the reboot(2) system call causes a signal to be sent to the
  namespace "init" process."

This is one of the reasons why all major current container runtimes
finally after years of failing to realize this run a stub init process
that mimicks a dumb init. Sure, you get away with not having an init
that behaves like an init but this is inherently broken or at least
against the way pid namespaces were designed.
Trond Myklebust Feb. 20, 2019, 2:23 p.m. UTC | #15
On Tue, 2019-02-19 at 23:03 +0000, David Howells wrote:
> Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> > Do we really need a new system call to set up containers? That
> > would
> > force changes to all existing orchestration software.
> 
> No, it wouldn't.  Nothing in my patches forces existing orchestration
> software
> to change, unless it wants to use the new facilities - then it would
> have to
> be changed anyway, right?  I will grant, though, that the extent of
> the change
> might vary.

Right. It depends on what you want to the orchestrator to do. If you
want it to manage authenticated storage for you, then I grant that you
may need to change the existing orchestrator. However if you just want
the containerised software to be able to manage AFS/CIFS/... keys for
its own processes, then it's not obvious to me why you would need a new
orchestrator.

> > Given that the main thing we want to achieve is to direct messages
> > from
> > the kernel to an appropriate handler, why not focus on adding
> > functionality to do just that?
> 
> Because it's *not* just that that is added here.  There are a number
> of things
> this patchset (and one it depends on) provides:
> 
>  (1) The ability to intercept request_key() upcalls that happen
> inside a
>      container, filtered by operative namespace.

The requirement that you need to filter derives from the fact that the
kernel is being forced to run an untrusted executable in user space.
That may be acceptable when running in an uncontainerised environment,
where the executable can be vetted by the sysadmin, but it clearly
isn't in an environment where containers can be set up by untrusted
users.

If we replace the executable with a daemon that is started from inside
the container (presumably by the init process running there), then
there should be no requirement for the orchestrator to filter.

>  (2) The ability to provide a per-container keyring that can hold
> keys that
>      can be used inside the container without any action on behalf of
> the
>      denizens of the container.

Keyrings already define some inheritance semantics for child processes.
Why can't we tweak those semantics to do what is needed?

IOW: instead of adding a container syscall and a new keyring type, why
can't we just define the required keyring type and let it be inherited
through the existing clone() mechanism?

>  (3) The ability to grant permissions to a *container* as a subject,
> allowing
>      it and its denizens to use, but not necessarily read, modify,
> link or
>      invalidate a key.

Again, this sounds like a child process keyring inheritance issue.
Right now, the session keyring does not appear to match the semantics
that you describe, but why couldn't we set up a new keyring type that
can provide them?

>  (4) The ability to create superblocks inside a container with a
> separate
>      mount namespace from outside, such that they can use the
> container keys,
>      thereby allowing the root of a container to be on an
> authenticated
>      filesystem.
> 

I'm not sure that I understand the premise. If the orchestrator is
setting up and managing that authenticated root filesystem, then why do
the containerised processes need to be involved at all?

If, OTOH, the intention is to allow the containerised processes to
manage the filesystems without knowledge of the keyring contents, then
again isn't that really the same issue as (3)?

> > Is there any reason why a syscall to allow an appropriately
> > privileged
> > process to add a keyring-specific message queue to its own
> > user_namespace and obtain a file descriptor to that message queue
> > might
> > not work?
> 
> Yes.  That forces the use of a new user_namespace for every container
> in which
> you want to use any of the above features.  The user_namespace is
> already way
> too big and intrusive a hammer as it is.

No. I would need a user_namespace if I want to allow child processes to
handle request upcalls. Is that unreasonable?

> > With such an implementation, the fallback mechanism could be to
> > walk
> > back up the hierarchy of user_namespaces until a message queue is
> > found, and to invoke the existing request_key mechanism if not.
> 
> That's definitely wrong.  /sbin/request-key should *not* be spawned
> if the key
> to be instantiated is not in all the init namespaces.
> 
> I went with a container object with namespaces for a reason:
> initially, it was
> so that the upcall could take place inside of the container's
> namespaces, but
> now it's do that any request that doesn't match the namespaces on the
> container gets rejected at the boundary - so that some daemon up the
> chain
> doesn't try servicing a request for which it can't access the config
> data or
> would end up talking out of the wrong NIC.
> 
> I can drop the container object part of it for the moment.
> 
> I could instead create 1-3 new namespaces:
> 
>  (1) A namespace with an upcall-interception point.
> 
>  (2) A namespace with a container keyring.
> 
>  (3) A namespace with a subject ID for use in key ACLs.
> 
> I think I should also consider adding:
> 
>  (4) A namespace with keyring names in it.  I'm leaning towards this
> not being
>      part of user_namespace because these probably should not be
> visible
>      between containers.
> 
> David
Ian Kent Feb. 21, 2019, 10:39 a.m. UTC | #16
On Wed, 2019-02-20 at 14:26 +0100, Christian Brauner wrote:
> On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> > On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > > Implement a kernel container object such that it contains the following
> > > things:
> > > 
> > >  (1) Namespaces.
> > > 
> > >  (2) A root directory.
> > > 
> > >  (3) A set of processes, including one designated as the 'init' process.
> > 
> > Yeah, I think a name other than init needs to be used for this
> > process.
> > 
> > The problem being that there is no requirement for container
> > process 1 to behave in any way like an "init" process is
> > expected to behave and that leads to confusion (at least
> > it certainly did for me).
> 
> If you look at the documentation for pid namespaces(7) you can see that
> the pid 1 inside a pid namespace is expected to behave like an init
> process:
> -  "The  first  process created in a new namespace [...] has  the PID 1,
>    and is the "init" process for the namespace (see init(1))."
> - "[...] child process that is orphaned within the namespace will be
>   reparented to this process rather than init(1) [...]"
> - "If the "init" process of a PID namespace terminates, the kernel
>   terminates all of the processes in the  namespace  via a SIGKILL
>   signal. This behavior reflects the fact that the "init" process is
>   essential for the cor‐ rect operation of a PID namespace."
> - "Only signals for which the "init" process has established a signal
>   handler can be sent to the  "init" process by other members of the
>   PID namespace."
> - "[...] the reboot(2) system call causes a signal to be sent to the
>   namespace "init" process."
> 
> This is one of the reasons why all major current container runtimes
> finally after years of failing to realize this run a stub init process
> that mimicks a dumb init. Sure, you get away with not having an init
> that behaves like an init but this is inherently broken or at least
> against the way pid namespaces were designed.

TBH I wasn't sure why the signal I sent didn't arrive, AFAICS
it should have regardless of what signals the container init
process was accepting. But it could have been due to a
different problem in my kernel code (that's very likely).

In any case it wasn't worth perusing because even if I did work
it out I had already found that the request_key sub-system wasn't
playing well with others when trying to run something within a
container's namespaces, so no point in going further ...

Ian
diff mbox series

Patch

==================
FUTURE DEVELOPMENT
==================

 (1) Setting up the container.

     A container would be created with, say:

	int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);

     Once created, it should then be possible for the supervising process
     to modify the new container.  Mounts can be created inside of the
     container's namespaces:

	fsfd = fsopen("ext4", 0);
	fsconfig(fsfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);
	fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda3", 0);
	fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
	fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(fsfd, 0, 0);

     and then mounted into the namespace:

	move_mount(mfd, "", cfd, "/",
		   MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_CONTAINER_ROOT);

     Further mounts can be added by:

	move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);

     Files and devices can be created by supplying the container fd as the
     dirfd argument:

	mkdirat(int cfd, const char *path, mode_t mode);
	mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
	int fd = openat(int cfd, const char *path,
			unsigned int flags, mode_t mode);

     [*] Note that when using cfd as dirfd, the path must not contain a '/'
     	 at the front.

     Sockets, such as netlink, can be opened inside of the container's
     namespaces:

	int fd = container_socket(int cfd, int domain, int type,
				  int protocol);

     This should allow management of the container's network namespace from
     outside.

 (2) Starting the container.

     Once all modifications are complete, the container's 'init' process
     can be started by:

	fork_into_container(int cfd);

     This precludes further external modification of the mount tree within
     the container.  Before this point, the container is simply destroyed
     if the container fd is closed.

 (3) Waiting for the container to complete.

     The container fd can then be polled to wait for init process therein
     to complete and the exit code collected by:

	container_wait(int container_fd, int *_wstatus, unsigned int wait,
		       struct rusage *rusage);

     The container and everything in it can be terminated or killed off:

	container_kill(int container_fd, int initonly, int signal);

     If 'init' dies, all other processes in the container are preemptively
     SIGKILL'd by the kernel.

     By default, if the container is active and its fd is closed, the
     container is left running and wil be cleaned up when its 'init' exits.
     The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.

 (4) Supervising the container.

     Given that we have an fd attached to the container, we could make it
     such that the supervising process could monitor and override EPERM
     returns for mount and other privileged operations within the
     container.

 (5) Per-container keyring.

     Each container can point to a per-container keyring for the holding of
     integrity keys and filesystem keys for use inside the container.  This
     would be attached:

	keyctl(KEYCTL_SET_CONTAINER_KEYRING, cfd, keyring)

     This keyring would be searched by request_key() after it has searched
     the thread, process and session keyrings.

 (6) Running different LSM policies by container.  This might particularly
     make sense with something like Apparmor where different path-based
     rules might be required inside a container to inside the parent.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |    5 
 include/linux/container.h              |   86 ++++++++
 include/linux/init_task.h              |    1 
 include/linux/lsm_hooks.h              |   20 ++
 include/linux/sched.h                  |    3 
 include/linux/security.h               |   15 +
 include/linux/syscalls.h               |    3 
 include/uapi/linux/container.h         |   28 +++
 init/Kconfig                           |    7 +
 init/init_task.c                       |    3 
 kernel/Makefile                        |    2 
 kernel/container.c                     |  348 ++++++++++++++++++++++++++++++++
 kernel/exit.c                          |    1 
 kernel/fork.c                          |    7 +
 kernel/namespaces.h                    |   15 +
 kernel/nsproxy.c                       |   23 +-
 kernel/sys_ni.c                        |    3 
 security/security.c                    |   12 +
 20 files changed, 571 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/container.h
 create mode 100644 include/uapi/linux/container.h
 create mode 100644 kernel/container.c
 create mode 100644 kernel/namespaces.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c9db9d51a7df..3564814a5d21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -407,3 +407,4 @@ 
 393	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
 394	i386	mount_notify		sys_mount_notify		__ia32_sys_mount_notify
 395	i386	sb_notify		sys_sb_notify			__ia32_sys_sb_notify
+396	i386	container_create	sys_container_create		__ia32_sys_container_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 17869bf7788a..aa6cccbe5271 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -352,6 +352,7 @@ 
 341	common	fsinfo			__x64_sys_fsinfo
 342	common	mount_notify		__x64_sys_mount_notify
 343	common	sb_notify		__x64_sys_sb_notify
+344	common	container_create	__x64_sys_container_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index f378cfc63043..ea005f55ec4c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -30,6 +30,7 @@ 
 #include <uapi/linux/mount.h>
 #include <linux/fs_context.h>
 #include <linux/fsinfo.h>
+#include <linux/container.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -3742,6 +3743,10 @@  static void __init init_mount_tree(void)
 
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
+#ifdef CONFIG_CONTAINERS
+	path_get(&root);
+	init_container.root = root;
+#endif
 }
 
 void __init mnt_init(void)
diff --git a/include/linux/container.h b/include/linux/container.h
new file mode 100644
index 000000000000..0a8918435097
--- /dev/null
+++ b/include/linux/container.h
@@ -0,0 +1,86 @@ 
+/* Container objects
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_CONTAINER_H
+#define _LINUX_CONTAINER_H
+
+#include <uapi/linux/container.h>
+#include <linux/refcount.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/path.h>
+#include <linux/seqlock.h>
+
+struct fs_struct;
+struct nsproxy;
+struct task_struct;
+
+/*
+ * The container object.
+ */
+struct container {
+	char			name[24];
+	u64			id;		/* Container ID */
+	refcount_t		usage;
+	int			exit_code;	/* The exit code of 'init' */
+	const struct cred	*cred;		/* Creds for this container, including userns */
+	struct nsproxy		*ns;		/* This container's namespaces */
+	struct path		root;		/* The root of the container's fs namespace */
+	struct task_struct	*init;		/* The 'init' task for this container */
+	struct container	*parent;	/* Parent of this container. */
+	void			*security;	/* LSM data */
+	struct list_head	members;	/* Member processes, guarded with ->lock */
+	struct list_head	child_link;	/* Link in parent->children */
+	struct list_head	children;	/* Child containers */
+	wait_queue_head_t	waitq;		/* Someone waiting for init to exit waits here */
+	unsigned long		flags;
+#define CONTAINER_FLAG_INIT_STARTED	0	/* Init is started - certain ops now prohibited */
+#define CONTAINER_FLAG_DEAD		1	/* Init has died */
+#define CONTAINER_FLAG_KILL_ON_CLOSE	2	/* Kill init if container handle closed */
+	spinlock_t		lock;
+	seqcount_t		seq;		/* Track changes in ->root */
+};
+
+extern struct container init_container;
+
+#ifdef CONFIG_CONTAINERS
+extern const struct file_operations container_fops;
+
+extern int copy_container(unsigned long flags, struct task_struct *tsk,
+			  struct container *container);
+extern void exit_container(struct task_struct *tsk);
+extern void put_container(struct container *c);
+
+static inline struct container *get_container(struct container *c)
+{
+	refcount_inc(&c->usage);
+	return c;
+}
+
+static inline bool is_container_file(struct file *file)
+{
+	return file->f_op == &container_fops;
+}
+
+#else
+
+static inline int copy_container(unsigned long flags, struct task_struct *tsk,
+				 struct container *container)
+{ return 0; }
+static inline void exit_container(struct task_struct *tsk) { }
+static inline void put_container(struct container *c) {}
+static inline struct container *get_container(struct container *c) { return NULL; }
+static inline bool is_container_file(struct file *file) { return false; }
+
+#endif /* CONFIG_CONTAINERS */
+
+#endif /* _LINUX_CONTAINER_H */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a7083a45a26c..f016cadece24 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@ 
 #include <linux/ipc.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/container.h>
 #include <linux/securebits.h>
 #include <linux/seqlock.h>
 #include <linux/rbtree.h>
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 52d0f3f4c786..0f310d911815 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1460,6 +1460,16 @@ 
  * @bpf_prog_free_security:
  *	Clean up the security information stored inside bpf prog.
  *
+ * Security hooks for containers:
+ *
+ * @container_alloc:
+ *	Permit creation of a new container and assign security data.
+ *	@container: The new container.
+ *
+ * @container_free:
+ *	Free security data attached to a container.
+ *	@container: The container.
+ *
  */
 union security_list_options {
 	int (*binder_set_context_mgr)(struct task_struct *mgr);
@@ -1825,6 +1835,12 @@  union security_list_options {
 	int (*bpf_prog_alloc_security)(struct bpf_prog_aux *aux);
 	void (*bpf_prog_free_security)(struct bpf_prog_aux *aux);
 #endif /* CONFIG_BPF_SYSCALL */
+
+	/* Container management security hooks */
+#ifdef CONFIG_CONTAINERS
+	int (*container_alloc)(struct container *container, unsigned int flags);
+	void (*container_free)(struct container *container);
+#endif
 };
 
 struct security_hook_heads {
@@ -2069,6 +2085,10 @@  struct security_hook_heads {
 	struct hlist_head bpf_prog_alloc_security;
 	struct hlist_head bpf_prog_free_security;
 #endif /* CONFIG_BPF_SYSCALL */
+#ifdef CONFIG_CONTAINERS
+	struct hlist_head container_alloc;
+	struct hlist_head container_free;
+#endif /* CONFIG_CONTAINERS */
 } __randomize_layout;
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2f90fa92468..073a3a930514 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@  struct backing_dev_info;
 struct bio_list;
 struct blk_plug;
 struct cfs_rq;
+struct container;
 struct fs_struct;
 struct futex_pi_state;
 struct io_context;
@@ -870,6 +871,8 @@  struct task_struct {
 
 	/* Namespaces: */
 	struct nsproxy			*nsproxy;
+	struct container		*container;
+	struct list_head		container_link;
 
 	/* Signal handlers: */
 	struct signal_struct		*signal;
diff --git a/include/linux/security.h b/include/linux/security.h
index da538c06766f..acd0c14c6e95 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -70,6 +70,7 @@  struct ctl_table;
 struct audit_krule;
 struct user_namespace;
 struct timezone;
+struct container;
 
 enum lsm_event {
 	LSM_POLICY_CHANGE,
@@ -1751,6 +1752,20 @@  static inline void security_audit_rule_free(void *lsmrule)
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_AUDIT */
 
+#ifdef CONFIG_CONTAINERS
+#ifdef CONFIG_SECURITY
+int security_container_alloc(struct container *container, unsigned int flags);
+void security_container_free(struct container *container);
+#else
+static inline int security_container_alloc(struct container *container,
+					   unsigned int flags)
+{
+	return 0;
+}
+static inline void security_container_free(struct container *container) {}
+#endif
+#endif /* CONFIG_CONTAINERS */
+
 #ifdef CONFIG_SECURITYFS
 
 extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 10127b1d923b..dac42098c2dd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -943,6 +943,9 @@  asmlinkage long sys_mount_notify(int dfd, const char __user *path,
 				 unsigned int at_flags, int watch_fd, int watch_id);
 asmlinkage long sys_sb_notify(int dfd, const char __user *path,
 			      unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
+				     unsigned long spare3, unsigned long spare4,
+				     unsigned long spare5);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
new file mode 100644
index 000000000000..43748099b28d
--- /dev/null
+++ b/include/uapi/linux/container.h
@@ -0,0 +1,28 @@ 
+/* Container UAPI
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_CONTAINER_H
+#define _UAPI_LINUX_CONTAINER_H
+
+
+#define CONTAINER_NEW_FS_NS		0x00000001 /* Dup current fs namespace */
+#define CONTAINER_NEW_EMPTY_FS_NS	0x00000002 /* Provide new empty fs namespace */
+#define CONTAINER_NEW_CGROUP_NS		0x00000004 /* Dup current cgroup namespace */
+#define CONTAINER_NEW_UTS_NS		0x00000008 /* Dup current uts namespace */
+#define CONTAINER_NEW_IPC_NS		0x00000010 /* Dup current ipc namespace */
+#define CONTAINER_NEW_USER_NS		0x00000020 /* Dup current user namespace */
+#define CONTAINER_NEW_PID_NS		0x00000040 /* Dup current pid namespace */
+#define CONTAINER_NEW_NET_NS		0x00000080 /* Dup current net namespace */
+#define CONTAINER_KILL_ON_CLOSE		0x00000100 /* Kill all member processes when fd closed */
+#define CONTAINER_FD_CLOEXEC		0x00000200 /* Close the fd on exec */
+#define CONTAINER__FLAG_MASK		0x000003ff
+
+#endif /* _UAPI_LINUX_CONTAINER_H */
diff --git a/init/Kconfig b/init/Kconfig
index 5984dd7f2156..ab37c3a55aa1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -992,6 +992,13 @@  config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
 
+config CONTAINERS
+	bool "Container support"
+	default y
+	help
+	  Allow userspace to create and manipulate containers as objects that
+	  have namespaces and hold a set of processes.
+
 endif # NAMESPACES
 
 config CHECKPOINT_RESTORE
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..90c7439a195b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -108,6 +108,9 @@  struct task_struct init_task
 	.signal		= &init_signals,
 	.sighand	= &init_sighand,
 	.nsproxy	= &init_nsproxy,
+	.container	= &init_container,
+	.container_link.next = &init_container.members,
+	.container_link.prev = &init_container.members,
 	.pending	= {
 		.list = LIST_HEAD_INIT(init_task.pending.list),
 		.signal = {{0}}
diff --git a/kernel/Makefile b/kernel/Makefile
index 6aa7543bcdb2..98cdd18cecef 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@  obj-y     = fork.o exec_domain.o panic.o \
 	    sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
 	    extable.o params.o \
-	    kthread.o sys_ni.o nsproxy.o \
+	    kthread.o sys_ni.o nsproxy.o container.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
 	    async.o range.o smpboot.o ucount.o
 
diff --git a/kernel/container.c b/kernel/container.c
new file mode 100644
index 000000000000..ca4012632cfa
--- /dev/null
+++ b/kernel/container.c
@@ -0,0 +1,348 @@ 
+/* Implement container objects.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/init_task.h>
+#include <linux/fs.h>
+#include <linux/fs_struct.h>
+#include <linux/anon_inodes.h>
+#include <linux/container.h>
+#include <linux/syscalls.h>
+#include <linux/printk.h>
+#include <linux/security.h>
+#include "namespaces.h"
+
+struct container init_container = {
+	.name		= ".init",
+	.id		= 1,
+	.usage		= REFCOUNT_INIT(2),
+	.cred		= &init_cred,
+	.ns		= &init_nsproxy,
+	.init		= &init_task,
+	.members.next	= &init_task.container_link,
+	.members.prev	= &init_task.container_link,
+	.children	= LIST_HEAD_INIT(init_container.children),
+	.flags		= (1 << CONTAINER_FLAG_INIT_STARTED),
+	.lock		= __SPIN_LOCK_UNLOCKED(init_container.lock),
+	.seq		= SEQCNT_ZERO(init_fs.seq),
+};
+
+#ifdef CONFIG_CONTAINERS
+
+static atomic64_t container_id_counter = ATOMIC_INIT(1);
+
+/*
+ * Drop a ref on a container and clear it if no longer in use.
+ */
+void put_container(struct container *c)
+{
+	struct container *parent;
+
+	while (c && refcount_dec_and_test(&c->usage)) {
+		BUG_ON(!list_empty(&c->members));
+		if (c->ns)
+			put_nsproxy(c->ns);
+		path_put(&c->root);
+
+		parent = c->parent;
+		if (parent) {
+			spin_lock(&parent->lock);
+			list_del(&c->child_link);
+			spin_unlock(&parent->lock);
+		}
+
+		if (c->cred)
+			put_cred(c->cred);
+		security_container_free(c);
+		kfree(c);
+		c = parent;
+	}
+}
+
+/*
+ * Allow the user to poll for the container dying.
+ */
+static unsigned int container_poll(struct file *file, poll_table *wait)
+{
+	struct container *container = file->private_data;
+	unsigned int mask = 0;
+
+	poll_wait(file, &container->waitq, wait);
+
+	if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
+		mask |= POLLHUP;
+
+	return mask;
+}
+
+static int container_release(struct inode *inode, struct file *file)
+{
+	struct container *container = file->private_data;
+
+	put_container(container);
+	return 0;
+}
+
+const struct file_operations container_fops = {
+	.poll		= container_poll,
+	.release	= container_release,
+};
+
+/*
+ * Handle fork/clone.
+ *
+ * A process inherits its parent's container.  The first process into the
+ * container is its 'init' process and the life of everything else in there is
+ * dependent upon that.
+ */
+int copy_container(unsigned long flags, struct task_struct *tsk,
+		   struct container *container)
+{
+	struct container *c = container ?: tsk->container;
+	int ret = -ECANCELED;
+
+	spin_lock(&c->lock);
+
+	if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
+		list_add_tail(&tsk->container_link, &c->members);
+		get_container(c);
+		tsk->container = c;
+		if (!c->init) {
+			set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
+			c->init = tsk;
+		}
+		ret = 0;
+	}
+
+	spin_unlock(&c->lock);
+	return ret;
+}
+
+/*
+ * Remove a dead process from a container.
+ *
+ * If the 'init' process in a container dies, we kill off all the other
+ * processes in the container.
+ */
+void exit_container(struct task_struct *tsk)
+{
+	struct task_struct *p;
+	struct container *c = tsk->container;
+	struct kernel_siginfo si = {
+		.si_signo = SIGKILL,
+		.si_code  = SI_KERNEL,
+	};
+
+	spin_lock(&c->lock);
+
+	list_del(&tsk->container_link);
+
+	if (c->init == tsk) {
+		c->init = NULL;
+		c->exit_code = tsk->exit_code;
+		smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
+		set_bit(CONTAINER_FLAG_DEAD, &c->flags);
+		wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
+
+		list_for_each_entry(p, &c->members, container_link) {
+			si.si_pid = task_tgid_vnr(p);
+			send_sig_info(SIGKILL, &si, p);
+		}
+	}
+
+	spin_unlock(&c->lock);
+	put_container(c);
+}
+
+/*
+ * Allocate a container.
+ */
+static struct container *alloc_container(const char __user *name)
+{
+	struct container *c;
+	long len;
+	int ret;
+
+	c = kzalloc(sizeof(struct container), GFP_KERNEL);
+	if (!c)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&c->members);
+	INIT_LIST_HEAD(&c->children);
+	init_waitqueue_head(&c->waitq);
+	spin_lock_init(&c->lock);
+	refcount_set(&c->usage, 1);
+
+	ret = -EFAULT;
+	len = strncpy_from_user(c->name, name, sizeof(c->name));
+	if (len < 0)
+		goto err;
+	ret = -ENAMETOOLONG;
+	if (len >= sizeof(c->name))
+		goto err;
+	ret = -EINVAL;
+	if (strchr(c->name, '/'))
+		goto err;
+
+	c->name[len] = 0;
+	return c;
+
+err:
+	kfree(c);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create some creds for the container.  We don't want to pin things we don't
+ * have to, so drop all keyrings from the new cred.  The LSM gets to audit the
+ * cred struct when security_container_alloc() is invoked.
+ */
+static const struct cred *create_container_creds(unsigned int flags)
+{
+	struct cred *new;
+	int ret;
+
+	new = prepare_creds();
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+#ifdef CONFIG_KEYS
+	key_put(new->thread_keyring);
+	new->thread_keyring = NULL;
+	key_put(new->process_keyring);
+	new->process_keyring = NULL;
+	key_put(new->session_keyring);
+	new->session_keyring = NULL;
+	key_put(new->request_key_auth);
+	new->request_key_auth = NULL;
+#endif
+
+	if (flags & CONTAINER_NEW_USER_NS) {
+		ret = create_user_ns(new);
+		if (ret < 0)
+			goto err;
+		new->euid = new->user_ns->owner;
+		new->egid = new->user_ns->group;
+	}
+
+	new->fsuid = new->suid = new->uid = new->euid;
+	new->fsgid = new->sgid = new->gid = new->egid;
+	return new;
+
+err:
+	abort_creds(new);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container.
+ */
+static struct container *create_container(const char __user *name, unsigned int flags)
+{
+	struct container *parent, *c;
+	struct fs_struct *fs;
+	struct nsproxy *ns;
+	const struct cred *cred;
+	int ret;
+
+	c = alloc_container(name);
+	if (IS_ERR(c))
+		return c;
+
+	if (flags & CONTAINER_KILL_ON_CLOSE)
+		__set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
+
+	cred = create_container_creds(flags);
+	if (IS_ERR(cred)) {
+		ret = PTR_ERR(cred);
+		goto err_cont;
+	}
+	c->cred = cred;
+
+	ret = -ENOMEM;
+	fs = copy_fs_struct(current->fs);
+	if (!fs)
+		goto err_cont;
+
+	ns = create_new_namespaces(
+		(flags & CONTAINER_NEW_FS_NS	 ? CLONE_NEWNS : 0) |
+		(flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
+		(flags & CONTAINER_NEW_UTS_NS	 ? CLONE_NEWUTS : 0) |
+		(flags & CONTAINER_NEW_IPC_NS	 ? CLONE_NEWIPC : 0) |
+		(flags & CONTAINER_NEW_PID_NS	 ? CLONE_NEWPID : 0) |
+		(flags & CONTAINER_NEW_NET_NS	 ? CLONE_NEWNET : 0),
+		current->nsproxy, cred->user_ns, fs);
+	if (IS_ERR(ns)) {
+		ret = PTR_ERR(ns);
+		goto err_fs;
+	}
+
+	c->ns = ns;
+	c->root = fs->root;
+	c->seq = fs->seq;
+	fs->root.mnt = NULL;
+	fs->root.dentry = NULL;
+
+	ret = security_container_alloc(c, flags);
+	if (ret < 0)
+		goto err_fs;
+
+	parent = current->container;
+	get_container(parent);
+	c->parent = parent;
+	c->id = atomic64_inc_return(&container_id_counter);
+	spin_lock(&parent->lock);
+	list_add_tail(&c->child_link, &parent->children);
+	spin_unlock(&parent->lock);
+	return c;
+
+err_fs:
+	free_fs_struct(fs);
+err_cont:
+	put_container(c);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Create a new container object.
+ */
+SYSCALL_DEFINE5(container_create,
+		const char __user *, name,
+		unsigned int, flags,
+		unsigned long, spare3,
+		unsigned long, spare4,
+		unsigned long, spare5)
+{
+	struct container *c;
+	int fd;
+
+	if (!name ||
+	    flags & ~CONTAINER__FLAG_MASK ||
+	    spare3 != 0 || spare4 != 0 || spare5 != 0)
+		return -EINVAL;
+	if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
+	    (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
+		return -EINVAL;
+
+	c = create_container(name, flags);
+	if (IS_ERR(c))
+		return PTR_ERR(c);
+
+	fd = anon_inode_getfd("container", &container_fops, c,
+			      O_RDWR | (flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0));
+	if (fd < 0)
+		put_container(c);
+	return fd;
+}
+
+#endif /* CONFIG_CONTAINERS */
diff --git a/kernel/exit.c b/kernel/exit.c
index 284f2fe9a293..78f6065ad799 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -864,6 +864,7 @@  void __noreturn do_exit(long code)
 	if (group_dead)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
+	exit_container(tsk);
 	exit_task_work(tsk);
 	exit_thread(tsk);
 	exit_umh(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..009cf7e63894 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1920,9 +1920,12 @@  static __latent_entropy struct task_struct *copy_process(
 	retval = copy_namespaces(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_mm;
-	retval = copy_io(clone_flags, p);
+	retval = copy_container(clone_flags, p, NULL);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
+	retval = copy_io(clone_flags, p);
+	if (retval)
+		goto bad_fork_cleanup_container;
 	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
 	if (retval)
 		goto bad_fork_cleanup_io;
@@ -2121,6 +2124,8 @@  static __latent_entropy struct task_struct *copy_process(
 bad_fork_cleanup_io:
 	if (p->io_context)
 		exit_io_context(p);
+bad_fork_cleanup_container:
+	exit_container(p);
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
 bad_fork_cleanup_mm:
diff --git a/kernel/namespaces.h b/kernel/namespaces.h
new file mode 100644
index 000000000000..c44e3cf0e254
--- /dev/null
+++ b/kernel/namespaces.h
@@ -0,0 +1,15 @@ 
+/* Local namespaces defs
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+extern struct nsproxy *create_new_namespaces(unsigned long flags,
+					     struct nsproxy *nsproxy,
+					     struct user_namespace *user_ns,
+					     struct fs_struct *new_fs);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..4bb5184b3a80 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -27,6 +27,7 @@ 
 #include <linux/syscalls.h>
 #include <linux/cgroup.h>
 #include <linux/perf_event.h>
+#include "namespaces.h"
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -61,8 +62,8 @@  static inline struct nsproxy *create_nsproxy(void)
  * Return the newly created nsproxy.  Do not attach this to the task,
  * leave it to the caller to do proper locking and attach it to task.
  */
-static struct nsproxy *create_new_namespaces(unsigned long flags,
-	struct task_struct *tsk, struct user_namespace *user_ns,
+struct nsproxy *create_new_namespaces(unsigned long flags,
+	struct nsproxy *nsproxy, struct user_namespace *user_ns,
 	struct fs_struct *new_fs)
 {
 	struct nsproxy *new_nsp;
@@ -72,39 +73,39 @@  static struct nsproxy *create_new_namespaces(unsigned long flags,
 	if (!new_nsp)
 		return ERR_PTR(-ENOMEM);
 
-	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
+	new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
 	if (IS_ERR(new_nsp->mnt_ns)) {
 		err = PTR_ERR(new_nsp->mnt_ns);
 		goto out_ns;
 	}
 
-	new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
+	new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
 	if (IS_ERR(new_nsp->uts_ns)) {
 		err = PTR_ERR(new_nsp->uts_ns);
 		goto out_uts;
 	}
 
-	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
+	new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
 	if (IS_ERR(new_nsp->ipc_ns)) {
 		err = PTR_ERR(new_nsp->ipc_ns);
 		goto out_ipc;
 	}
 
 	new_nsp->pid_ns_for_children =
-		copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
+		copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
 	if (IS_ERR(new_nsp->pid_ns_for_children)) {
 		err = PTR_ERR(new_nsp->pid_ns_for_children);
 		goto out_pid;
 	}
 
 	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
-					    tsk->nsproxy->cgroup_ns);
+					    nsproxy->cgroup_ns);
 	if (IS_ERR(new_nsp->cgroup_ns)) {
 		err = PTR_ERR(new_nsp->cgroup_ns);
 		goto out_cgroup;
 	}
 
-	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
+	new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
@@ -162,7 +163,7 @@  int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 		(CLONE_NEWIPC | CLONE_SYSVSEM)) 
 		return -EINVAL;
 
-	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
+	new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
 	if (IS_ERR(new_ns))
 		return  PTR_ERR(new_ns);
 
@@ -203,7 +204,7 @@  int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
-	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
+	*new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
 					 new_fs ? new_fs : current->fs);
 	if (IS_ERR(*new_nsp)) {
 		err = PTR_ERR(*new_nsp);
@@ -251,7 +252,7 @@  SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	if (nstype && (ns->ops->type != nstype))
 		goto out;
 
-	new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+	new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
 	if (IS_ERR(new_nsproxy)) {
 		err = PTR_ERR(new_nsproxy);
 		goto out;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a4e7131b2509..f0455cbb91cf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -136,6 +136,9 @@  COND_SYSCALL(acct);
 COND_SYSCALL(capget);
 COND_SYSCALL(capset);
 
+/* kernel/container.c */
+COND_SYSCALL(container_create);
+
 /* kernel/exec_domain.c */
 
 /* kernel/exit.c */
diff --git a/security/security.c b/security/security.c
index b49732c02e21..259be9a1746c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1864,3 +1864,15 @@  void security_bpf_prog_free(struct bpf_prog_aux *aux)
 	call_void_hook(bpf_prog_free_security, aux);
 }
 #endif /* CONFIG_BPF_SYSCALL */
+
+#ifdef CONFIG_CONTAINERS
+int security_container_alloc(struct container *container, unsigned int flags)
+{
+	return call_int_hook(container_alloc, 0, container, flags);
+}
+
+void security_container_free(struct container *container)
+{
+	call_void_hook(container_free, container);
+}
+#endif /* CONFIG_CONTAINERS */