diff mbox series

virtiofsd: Use clone() and not unshare(), support non-root

Message ID 348d4774-bd5f-4832-bd7e-a21491fdac8d@www.fastmail.com
State New
Headers show
Series virtiofsd: Use clone() and not unshare(), support non-root | expand

Commit Message

Colin Walters May 1, 2020, 6:25 p.m. UTC
I'd like to make use of virtiofs as part of our tooling in
https://github.com/coreos/coreos-assembler
Most of the code runs as non-root today; qemu also runs as non-root.
We use 9p right now.

virtiofsd's builtin sandboxing effectively assumes it runs as
root.

First, change the code to use `clone()` and not `unshare()+fork()`.

Next, automatically use `CLONE_NEWUSER` if we're running as non root.

This is similar logic to that in https://github.com/containers/bubblewrap
(Which...BTW, it could make sense for virtiofs to depend on bubblewrap
 and re-exec itself rather than re-implementing the containerization
 itself)

Signed-off-by: Colin Walters <walters@verbum.org>
---
 tools/virtiofsd/passthrough_ll.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Comments

Daniel P. Berrangé May 4, 2020, 9:51 a.m. UTC | #1
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.

I'd suggest splitting these two, so that the re-factoring is separate
from introducing new functionality.

> 
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
> 
> Signed-off-by: Colin Walters <walters@verbum.org>
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>      printf("}\n");
>  }
>  
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +    return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +    return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}

What's the reason for using the raw syscall ?  Was it just to avoid having
to allocate a new stack space ?


> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>       * an empty network namespace to prevent TCP/IP and other network
>       * activity in case this process is compromised.
>       */
> -    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -        exit(1);
> +    int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +    /* If we're non root, we need a new user namespace */
> +    if (getuid() != 0) {
> +        clone_flags |= CLONE_NEWUSER;
>      }

IIUC, with CLONE_NEWUSER we need to set a UID/GID mapping, otherwise all
file accesses will be squashed to the UID -1.  Or was it intentional
that you're only trying to provide read-only access to files that are
world-accessible ?



> -    child = fork();
> +    child = raw_clone(clone_flags, NULL);
>      if (child < 0) {
> -        fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +        fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>          exit(1);
>      }
>      if (child > 0) {

Regards,
Daniel
Stefan Hajnoczi May 4, 2020, 1:49 p.m. UTC | #2
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> 
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)

Great, thanks for posting this! The topic of how and when to do the
sandboxing, as well as whether to require root, has come up several
times.

Please update the man page to explain the behavior that users should
expect when running as non-root:
1. What happens to uid/gid of files from the perspective of a user
   outside the sandbox?
2. Are there any limitations (certain operations that don't work)?

Thanks,
Stefan

> 
> Signed-off-by: Colin Walters <walters@verbum.org>
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>      printf("}\n");
>  }
>  
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +    return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +    return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}
> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>       * an empty network namespace to prevent TCP/IP and other network
>       * activity in case this process is compromised.
>       */
> -    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -        exit(1);
> +    int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +    /* If we're non root, we need a new user namespace */
> +    if (getuid() != 0) {
> +        clone_flags |= CLONE_NEWUSER;
>      }
>  
> -    child = fork();
> +    child = raw_clone(clone_flags, NULL);
>      if (child < 0) {
> -        fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +        fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>          exit(1);
>      }
>      if (child > 0) {
> -- 
> 2.24.1
> 
>
Marc-André Lureau May 4, 2020, 2:07 p.m. UTC | #3
Hi

On Fri, May 1, 2020 at 8:29 PM Colin Walters <walters@verbum.org> wrote:
>
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
>
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
>
> First, change the code to use `clone()` and not `unshare()+fork()`.
>
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.
>
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
>

Now that systemd-nspawn works without privileges, isn't that also a
solution? One that would fit both system and session level
permissions, and integration with other services?

> Signed-off-by: Colin Walters <walters@verbum.org>
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>      printf("}\n");
>  }
>
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +    return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +    return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}
> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>       * an empty network namespace to prevent TCP/IP and other network
>       * activity in case this process is compromised.
>       */
> -    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -        exit(1);
> +    int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +    /* If we're non root, we need a new user namespace */
> +    if (getuid() != 0) {
> +        clone_flags |= CLONE_NEWUSER;
>      }
>
> -    child = fork();
> +    child = raw_clone(clone_flags, NULL);
>      if (child < 0) {
> -        fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +        fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>          exit(1);
>      }
>      if (child > 0) {
> --
> 2.24.1
>
>
Colin Walters May 4, 2020, 2:20 p.m. UTC | #4
On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote:

> Now that systemd-nspawn works without privileges, isn't that also a
> solution? One that would fit both system and session level
> permissions, and integration with other services?

This is a complex topic and one I should probably write up in the bubblewrap README.md.  Today for example for CoreOS, our build and CI processes run inside OpenShift (Kubernetes) - we aren't running systemd inside our containers.

bubblewrap is a small self-contained C wrapper around the container system calls basically.  In contrast, AFAICS right now, nspawn requires systemd - which won't work for our use case.

Really the contention point here is systemd's dependency on cgroups for process tracking; in a "nested containerization" scenario you often just want the cgroups from the "outer" container to apply.  But having nested mounts/pid namespaces are still very useful.  (That said, cgroups v2 allows sane nesting, but we aren't there yet)

Also related is https://github.com/kubernetes/enhancements/issues/127 - without that one requires privileged containers to do nesting.

Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` because we fully trust the code running in these VMs.

Or to directly respond again to your proposal: systemd-nspawn as an option may work for some cases but won't for mine (I don't want virtiofsd/qemu instances to "escape" the build container or run separately).
Marc-André Lureau May 4, 2020, 3:43 p.m. UTC | #5
Hi

On Mon, May 4, 2020 at 4:27 PM Colin Walters <walters@verbum.org> wrote:
>
>
>
> On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote:
>
> > Now that systemd-nspawn works without privileges, isn't that also a
> > solution? One that would fit both system and session level
> > permissions, and integration with other services?
>
> This is a complex topic and one I should probably write up in the bubblewrap README.md.  Today for example for CoreOS, our build and CI processes run inside OpenShift (Kubernetes) - we aren't running systemd inside our containers.

Actually, I mean systemd-run (oops!)

>
> bubblewrap is a small self-contained C wrapper around the container system calls basically.  In contrast, AFAICS right now, nspawn requires systemd - which won't work for our use case.
>
> Really the contention point here is systemd's dependency on cgroups for process tracking; in a "nested containerization" scenario you often just want the cgroups from the "outer" container to apply.  But having nested mounts/pid namespaces are still very useful.  (That said, cgroups v2 allows sane nesting, but we aren't there yet)
>
> Also related is https://github.com/kubernetes/enhancements/issues/127 - without that one requires privileged containers to do nesting.
>
> Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` because we fully trust the code running in these VMs.
>
> Or to directly respond again to your proposal: systemd-nspawn as an option may work for some cases but won't for mine (I don't want virtiofsd/qemu instances to "escape" the build container or run separately).
>

You can run within your parent slice, and even more conveniently with:
https://github.com/systemd/systemd/pull/15362
Stefan Hajnoczi May 5, 2020, 3:23 p.m. UTC | #6
On Mon, May 04, 2020 at 04:07:22PM +0200, Marc-André Lureau wrote:
> Hi
> 
> On Fri, May 1, 2020 at 8:29 PM Colin Walters <walters@verbum.org> wrote:
> >
> > I'd like to make use of virtiofs as part of our tooling in
> > https://github.com/coreos/coreos-assembler
> > Most of the code runs as non-root today; qemu also runs as non-root.
> > We use 9p right now.
> >
> > virtiofsd's builtin sandboxing effectively assumes it runs as
> > root.
> >
> > First, change the code to use `clone()` and not `unshare()+fork()`.
> >
> > Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> >
> > This is similar logic to that in https://github.com/containers/bubblewrap
> > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
> >  and re-exec itself rather than re-implementing the containerization
> >  itself)
> >
> 
> Now that systemd-nspawn works without privileges, isn't that also a
> solution? One that would fit both system and session level
> permissions, and integration with other services?

Does systemd-nspawn work inside containers?

I think virtiofsd will need to run inside containers in the future and
remember systemd being difficult to use in containers.

Stefan
Daniel P. Berrangé May 5, 2020, 3:32 p.m. UTC | #7
On Tue, May 05, 2020 at 04:23:59PM +0100, Stefan Hajnoczi wrote:
> On Mon, May 04, 2020 at 04:07:22PM +0200, Marc-André Lureau wrote:
> > Hi
> > 
> > On Fri, May 1, 2020 at 8:29 PM Colin Walters <walters@verbum.org> wrote:
> > >
> > > I'd like to make use of virtiofs as part of our tooling in
> > > https://github.com/coreos/coreos-assembler
> > > Most of the code runs as non-root today; qemu also runs as non-root.
> > > We use 9p right now.
> > >
> > > virtiofsd's builtin sandboxing effectively assumes it runs as
> > > root.
> > >
> > > First, change the code to use `clone()` and not `unshare()+fork()`.
> > >
> > > Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> > >
> > > This is similar logic to that in https://github.com/containers/bubblewrap
> > > (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
> > >  and re-exec itself rather than re-implementing the containerization
> > >  itself)
> > >
> > 
> > Now that systemd-nspawn works without privileges, isn't that also a
> > solution? One that would fit both system and session level
> > permissions, and integration with other services?
> 
> Does systemd-nspawn work inside containers?
> 
> I think virtiofsd will need to run inside containers in the future and
> remember systemd being difficult to use in containers.

It can be made to work, but my gut tells me people won't be happy if
system were a mandatory requirement for virtiofsd usage. Also there
are current Linux distros which don't even use systemd.

Regards,
Daniel
Dr. David Alan Gilbert May 6, 2020, 7:16 p.m. UTC | #8
* Colin Walters (walters@verbum.org) wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.

Is it ever useful for root to run the code in a new user namespace?

Dave

> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
> 
> Signed-off-by: Colin Walters <walters@verbum.org>
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 4c35c95b25..468617f6d6 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2530,6 +2530,21 @@ static void print_capabilities(void)
>      printf("}\n");
>  }
>  
> +/* Copied from bubblewrap */
> +static int
> +raw_clone(unsigned long flags, void *child_stack)
> +{
> +#if defined(__s390__) || defined(__CRIS__)
> +  /*
> +   * On s390 and cris the order of the first and second arguments
> +   * of the raw clone() system call is reversed.
> +   */
> +    return (int) syscall(__NR_clone, child_stack, flags);
> +#else
> +    return (int) syscall(__NR_clone, flags, child_stack);
> +#endif
> +}
> +
>  /*
>   * Move to a new mount, net, and pid namespaces to isolate this process.
>   */
> @@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>       * an empty network namespace to prevent TCP/IP and other network
>       * activity in case this process is compromised.
>       */
> -    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> -        exit(1);
> +    int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +    /* If we're non root, we need a new user namespace */
> +    if (getuid() != 0) {
> +        clone_flags |= CLONE_NEWUSER;
>      }
>  
> -    child = fork();
> +    child = raw_clone(clone_flags, NULL);
>      if (child < 0) {
> -        fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +        fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
>          exit(1);
>      }
>      if (child > 0) {
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Daniel P. Berrangé May 7, 2020, 9:28 a.m. UTC | #9
On Wed, May 06, 2020 at 08:16:14PM +0100, Dr. David Alan Gilbert wrote:
> * Colin Walters (walters@verbum.org) wrote:
> > I'd like to make use of virtiofs as part of our tooling in
> > https://github.com/coreos/coreos-assembler
> > Most of the code runs as non-root today; qemu also runs as non-root.
> > We use 9p right now.
> > 
> > virtiofsd's builtin sandboxing effectively assumes it runs as
> > root.
> > 
> > First, change the code to use `clone()` and not `unshare()+fork()`.
> > 
> > Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> 
> Is it ever useful for root to run the code in a new user namespace?

Yes, user namespace is useful to both root and non-root alike. Roughly
speaking, for root, it offers security benefits, for non-root it offers
functionality benefits.

The longer answer...

With a new user namespaces, users inside the container get remapped
to different set of users outside the host, through defined UID & GID
mappings.  For any UID/GID which doesn't have a mapping, access will
get performed as (uid_t)-1 / (gid_t)-1.

eg consider you have a range of host IDs 100,000->165,536 available.
With user namespaces, you can now ssetuop a mapping of container
IDs 0 -> 65536.

Thus any time  UID 0 inside the container does something, from the
host POV they are acting as UID 100,000.  If UID 30,000 inside the
container does something, this is UID 130,000 in the host POV. If
UID 80,000 in the container does something, this is uid -1 from
the host POV.

If the person in the host launching virtiofsd is non-root, then
user namespaces mean they can offer the guest the full range of
POSIX APIs wrt access control & file ownership, since they're
no longer restricted to their single host UID when inside the
container.  They also get important things like CAP_DAC_OVERRIDE.
IOW, for non-root, user namespaces unlock the full functionality
of virtiofsd. Without it, we're limited to read-only access to
files not owned by the current non-root user.

If the person in the host launching virtiofsd is root, then user
namespaces mean we can reduce the effective privileges of virtiofsd.
Currently when inside the container, uid==0 is still the same as
uid==0 outside. So if there are any resources visible inside the
container (either accidentally or intentionally), then virtiofsd
shouldn't have write access to, we're lacking protection. By
adding usernamespace + a mapping, we strictly isolate virtiofsd
from any host resources.

The main pain point with user namespaces is that all the files
in the directory you are exporting need to be shifted to match
the UID/GID mapping user for the user namespaces. Traditionally
this has needed a recursive chown of the tree to remap the file
ownership. There has been talk of a filesystem overlay todo the
remapping transparently, but I've lost track of whether that's
a thing yet.


Regards,
Daniel
Stefan Hajnoczi May 21, 2020, 10:19 a.m. UTC | #10
On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote:
> If the person in the host launching virtiofsd is non-root, then
> user namespaces mean they can offer the guest the full range of
> POSIX APIs wrt access control & file ownership, since they're
> no longer restricted to their single host UID when inside the
> container.

What installs the uid_map/gid_map for virtiofsd?

My machine has /etc/subuid and /etc/subgid, but how would this come into
play with these patches applied?

What happens when an unprivileged user who is not listed in /etc/subuid
runs virtiofsd?

Stefan
Daniel P. Berrangé May 21, 2020, 10:43 a.m. UTC | #11
On Thu, May 21, 2020 at 11:19:23AM +0100, Stefan Hajnoczi wrote:
> On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote:
> > If the person in the host launching virtiofsd is non-root, then
> > user namespaces mean they can offer the guest the full range of
> > POSIX APIs wrt access control & file ownership, since they're
> > no longer restricted to their single host UID when inside the
> > container.
> 
> What installs the uid_map/gid_map for virtiofsd?
> 
> My machine has /etc/subuid and /etc/subgid, but how would this come into
> play with these patches applied?

AFAICT, nothing is setting up the mapping, hence my question in the first
review of this patch, that this patch seems incomplete.

The subuid/subgid files are essentially just reserving a range of IDs
for each user. Something then needs to read & honour those IDs.

The rules on setting up a mapping are complex though, to avoid a user
from creating a new user namespace, and defining a mapping that lets
them elevate privileges to read files in the host they wouldn't otherwise
be permitted to access.

IIUC, applying the range of IDs from subuid/subgid will require the
process defining the mapping to have CAP_SETUID *outside* the container.
As an unprivileged host user, you won't have that, so I think the only
thing you can do is setup a mapping for your own UID/GID, which would
allow the container to read/write files owned by the host user that
started it. That should be ok for virtiofsd's needs at least for simple
file sharing. If virtiofsd needs to expose its full range of features
though, something privileged in the host needs to setup a full mapping
based on subuid/subgid

BTW, "man user_namespaces" gives many more details on UID mapping
rules.

> What happens when an unprivileged user who is not listed in /etc/subuid
> runs virtiofsd?

The UID map inside the container will be empty, which should result in
all files being remapped to (uid_t)-1 or whatever is shown in the
/proc/sys/kernel/overflow{u,g}id files.

So virtiofsd would not have any write access, and only have read access
to files which have world-read bit set.  


Regards,
Daniel
Stefan Hajnoczi May 27, 2020, 11:16 a.m. UTC | #12
On Thu, May 21, 2020 at 11:43:44AM +0100, Daniel P. Berrangé wrote:
> On Thu, May 21, 2020 at 11:19:23AM +0100, Stefan Hajnoczi wrote:
> > On Thu, May 07, 2020 at 10:28:32AM +0100, Daniel P. Berrangé wrote:
> > > If the person in the host launching virtiofsd is non-root, then
> > > user namespaces mean they can offer the guest the full range of
> > > POSIX APIs wrt access control & file ownership, since they're
> > > no longer restricted to their single host UID when inside the
> > > container.
> > 
> > What installs the uid_map/gid_map for virtiofsd?
> > 
> > My machine has /etc/subuid and /etc/subgid, but how would this come into
> > play with these patches applied?
> 
> AFAICT, nothing is setting up the mapping, hence my question in the first
> review of this patch, that this patch seems incomplete.
> 
> The subuid/subgid files are essentially just reserving a range of IDs
> for each user. Something then needs to read & honour those IDs.
> 
> The rules on setting up a mapping are complex though, to avoid a user
> from creating a new user namespace, and defining a mapping that lets
> them elevate privileges to read files in the host they wouldn't otherwise
> be permitted to access.
> 
> IIUC, applying the range of IDs from subuid/subgid will require the
> process defining the mapping to have CAP_SETUID *outside* the container.
> As an unprivileged host user, you won't have that, so I think the only
> thing you can do is setup a mapping for your own UID/GID, which would
> allow the container to read/write files owned by the host user that
> started it. That should be ok for virtiofsd's needs at least for simple
> file sharing. If virtiofsd needs to expose its full range of features
> though, something privileged in the host needs to setup a full mapping
> based on subuid/subgid
> 
> BTW, "man user_namespaces" gives many more details on UID mapping
> rules.
> 
> > What happens when an unprivileged user who is not listed in /etc/subuid
> > runs virtiofsd?
> 
> The UID map inside the container will be empty, which should result in
> all files being remapped to (uid_t)-1 or whatever is shown in the
> /proc/sys/kernel/overflow{u,g}id files.
> 
> So virtiofsd would not have any write access, and only have read access
> to files which have world-read bit set.  

Okay. Enabling user namespaces for virtiofsd is valuable. I think the
behavior should be documented though so users know what needs to be
configured. That is missing from this patch series.

Stefan
Stefan Hajnoczi June 2, 2020, 9:55 a.m. UTC | #13
On Fri, May 01, 2020 at 02:25:48PM -0400, Colin Walters wrote:
> I'd like to make use of virtiofs as part of our tooling in
> https://github.com/coreos/coreos-assembler
> Most of the code runs as non-root today; qemu also runs as non-root.
> We use 9p right now.
> 
> virtiofsd's builtin sandboxing effectively assumes it runs as
> root.
> 
> First, change the code to use `clone()` and not `unshare()+fork()`.
> 
> Next, automatically use `CLONE_NEWUSER` if we're running as non root.
> 
> This is similar logic to that in https://github.com/containers/bubblewrap
> (Which...BTW, it could make sense for virtiofs to depend on bubblewrap
>  and re-exec itself rather than re-implementing the containerization
>  itself)
> 
> Signed-off-by: Colin Walters <walters@verbum.org>
> ---
>  tools/virtiofsd/passthrough_ll.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)

Ping Colin. It would be great if you have time to share your thoughts on
this discussion and explain how you are using this patch.

To summarize: I'm unclear what behavior a user can expect since I'm not
aware of anything that applies /etc/subuid for the user namespace. Does
this mean the expected behavior is that virtiofsd will map all uids/gids
to -1 when invoked non-root?

Could you document the behavior and consider supporting both -1 and
/etc/subuid operation? Both seem like useful behaviors for different use
cases.

Thanks,
Stefan
diff mbox series

Patch

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4c35c95b25..468617f6d6 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2530,6 +2530,21 @@  static void print_capabilities(void)
     printf("}\n");
 }
 
+/* Copied from bubblewrap */
+static int
+raw_clone(unsigned long flags, void *child_stack)
+{
+#if defined(__s390__) || defined(__CRIS__)
+  /*
+   * On s390 and cris the order of the first and second arguments
+   * of the raw clone() system call is reversed.
+   */
+    return (int) syscall(__NR_clone, child_stack, flags);
+#else
+    return (int) syscall(__NR_clone, flags, child_stack);
+#endif
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
@@ -2547,14 +2562,15 @@  static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
      * an empty network namespace to prevent TCP/IP and other network
      * activity in case this process is compromised.
      */
-    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
-        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
-        exit(1);
+    int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
+    /* If we're non root, we need a new user namespace */
+    if (getuid() != 0) {
+        clone_flags |= CLONE_NEWUSER;
     }
 
-    child = fork();
+    child = raw_clone(clone_flags, NULL);
     if (child < 0) {
-        fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
+        fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
         exit(1);
     }
     if (child > 0) {