[berrange@redhat.com: [PATCH] Forbid invocation of kexec_load() outside initial PID namespace]

Submitter Serge E. Hallyn
Date Aug. 7, 2012, 3:01 p.m.
Message ID <20120807150127.GB5070@mail.hallyn.com>
Permalink /patch/175898/
State New

Comments

Serge E. Hallyn - Aug. 7, 2012, 3:01 p.m.
(Hopefully my unsubscribed account can email kernel-team)

Hi,

This patch will probably not land upstream, because the 'proper' fix is
user namespaces.  User namespaces, however, won't be ready until after
Quantal.  So I'd like this patch to be applied to Precise and Quantal
if possible.

Problem:

Containers are granted CAP_SYS_BOOT.  The reboot path in the kernel checks
whether you are in the initial pidns, and, if not, sends a signal to your
parent indicating you were 'rebooted' or 'shut down'.  So there is no
danger of a container rebooting the host.
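
As an illustration of the signal mentioned above, here is a minimal sketch
of how a container's parent might interpret the wait status of the
container's init. It assumes the behaviour documented in reboot(2): a
restart request kills the namespace init with SIGHUP, and a halt or
power-off request kills it with SIGINT. The enum and the function name
are hypothetical, not taken from any existing tool.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of how a container's init went away. */
enum container_exit {
	CONTAINER_REBOOT,	/* init killed by SIGHUP: reboot requested */
	CONTAINER_SHUTDOWN,	/* init killed by SIGINT: halt or power off */
	CONTAINER_OTHER		/* normal exit or any other signal */
};

/*
 * Interpret a status value as returned by waitpid() for the container's
 * init process. The signal mapping is an assumption based on reboot(2).
 */
static enum container_exit classify_container_exit(int status)
{
	if (WIFSIGNALED(status)) {
		if (WTERMSIG(status) == SIGHUP)
			return CONTAINER_REBOOT;
		if (WTERMSIG(status) == SIGINT)
			return CONTAINER_SHUTDOWN;
	}
	return CONTAINER_OTHER;
}
```

A container manager would call this on the status it gets back from
waitpid() and restart the container when the result is CONTAINER_REBOOT.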

However, CAP_SYS_BOOT also authorizes kexec, without the pidns check.
Therefore, containers are able to kexec a new kernel, which is obviously
a bad thing.

This patch prevents that by only allowing kexec from the initial pid
namespace.  It is nacked by Eric Biederman (but acked by me) because
he feels this should be stopped by having the container in a private
user namespace, with the kexec CAP_SYS_BOOT check targeted to the initial
user namespace.  As I said, that won't be doable in the Quantal timeframe.
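
To make the resulting check order concrete, here is a user-space model
of the permission logic at the top of kexec_load() with the patch
applied. It is only a sketch: the function name and its two boolean
parameters are hypothetical stand-ins for capable(CAP_SYS_BOOT) and for
the comparison of task_active_pid_ns(current) against &init_pid_ns, which
only exist inside the kernel.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the permission checks in kexec_load() after this patch.
 * has_cap_sys_boot stands in for capable(CAP_SYS_BOOT);
 * in_init_pid_ns stands in for
 * task_active_pid_ns(current) == &init_pid_ns.
 */
static int kexec_load_permission(bool has_cap_sys_boot, bool in_init_pid_ns)
{
	/* Existing check: the caller must hold CAP_SYS_BOOT. */
	if (!has_cap_sys_boot)
		return -EPERM;

	/* New check: refuse callers outside the initial PID namespace. */
	if (!in_init_pid_ns)
		return -EPERM;

	/* The real syscall would go on to validate flags and segments. */
	return 0;
}
```

The only combination that passes is a caller that holds CAP_SYS_BOOT and
runs in the initial PID namespace; a container that keeps CAP_SYS_BOOT
for reboot() still gets -EPERM.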

thanks,
-serge

----- Forwarded message from "Daniel P. Berrange" <berrange@redhat.com> -----

Date: Fri,  3 Aug 2012 11:53:04 +0100
From: "Daniel P. Berrange" <berrange@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: containers@lists.linux-foundation.org,
	"Daniel P. Berrange" <berrange@redhat.com>,
	Serge Hallyn <serge.hallyn@canonical.com>,
	Daniel Lezcano <daniel.lezcano@free.fr>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Tejun Heo <tj@kernel.org>, Oleg Nesterov <oleg@redhat.com>
Subject: [PATCH] Forbid invocation of kexec_load() outside initial PID namespace

From: "Daniel P. Berrange" <berrange@redhat.com>

The following commit

    commit cf3f89214ef6a33fad60856bc5ffd7bb2fc4709b
    Author: Daniel Lezcano <daniel.lezcano@free.fr>
    Date:   Wed Mar 28 14:42:51 2012 -0700

    pidns: add reboot_pid_ns() to handle the reboot syscall

introduced custom handling of the reboot() syscall when invoked
from a non-initial PID namespace. The intent was that a process
in a container can be allowed to keep CAP_SYS_BOOT and execute
reboot() to shut down or reboot just its private container, rather
than the host.

Unfortunately the kexec_load() syscall also relies on the
CAP_SYS_BOOT capability. So by allowing a container to keep
this capability to safely invoke reboot(), we unintentionally
also give it the ability to use kexec_load(). The solution is
to make kexec_load() return -EPERM if invoked from a PID
namespace that is not the initial namespace.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/kexec.c | 5 +++++
 1 file changed, 5 insertions(+)

Tim Gardner - Aug. 8, 2012, 11:59 a.m.
Applied to Precise and Quantal.

Patch

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 0668d58..b152bde 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -947,6 +947,11 @@  SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
 	if (!capable(CAP_SYS_BOOT))
 		return -EPERM;
 
+	/* Processes in containers must not be allowed to load a new
+	 * kernel, even if they have CAP_SYS_BOOT */
+	if (task_active_pid_ns(current) != &init_pid_ns)
+		return -EPERM;
+
 	/*
 	 * Verify we have a legal set of flags
 	 * This leaves us room for future extensions.