Patchwork Add support for fd: protocol

login
register
mail settings
Submitter Corey Bryant
Date May 20, 2011, 6:48 p.m.
Message ID <4DD6B777.9020800@us.ibm.com>
Download mbox | patch
Permalink /patch/96659/
State New
Headers show

Comments

Corey Bryant - May 20, 2011, 6:48 p.m.
sVirt provides SELinux MAC isolation for Qemu guest processes and their
corresponding resources (image files). sVirt provides this support
by labeling guests and resources with security labels that are stored
in file system extended attributes. Some file systems, such as NFS, do
not support the extended attribute security namespace, which is needed
for image file isolation when using the sVirt SELinux security driver
in libvirt.

The proposed solution entails a combination of Qemu, libvirt, and
SELinux patches that work together to isolate multiple guests' images
when they're stored in the same NFS mount. This results in an
environment where sVirt isolation and NFS image file isolation can both
be provided.

Currently, Qemu opens an image file in addition to performing the
necessary read and write operations. The proposed solution will move
the open out of Qemu and into libvirt. Once libvirt opens an image
file for the guest, it will pass the file descriptor to Qemu via a
new fd: protocol.

If the image file resides in an NFS mount, the following SELinux policy
changes will provide image isolation:

  - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
    allow Qemu (svirt_t) to only have SELinux read and write
    permissions on nfs_t files

  - Qemu (svirt_t) also gets SELinux use permissions on libvirt
    (virtd_t) file descriptors

Following is a sample invocation of Qemu using the fd: protocol:

    qemu -drive file=fd:4,format=qcow2

This patch contains the Qemu code to support this solution. I would
like to solicit input from the libvirt community prior to starting
the libvirt patch.

This patch was tested with the following formats: raw, cow, qcow,
qcow2, vmdk, using the fd: protocol as well as existing file name
support. Non-valid file descriptors were also tested.

Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
---
 block/raw-posix.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++-------
 qemu-doc.texi     |   12 +++++++++
 qemu-options.hx   |    8 ++++--
 3 files changed, 78 insertions(+), 12 deletions(-)

-- 1.7.1
Anthony Liguori - May 20, 2011, 7:05 p.m.
On 05/20/2011 01:48 PM, Corey Bryant wrote:
> sVirt provides SELinux MAC isolation for Qemu guest processes and their
> corresponding resources (image files). sVirt provides this support
> by labeling guests and resources with security labels that are stored
> in file system extended attributes. Some file systems, such as NFS, do
> not support the extended attribute security namespace, which is needed
> for image file isolation when using the sVirt SELinux security driver
> in libvirt.
>
> The proposed solution entails a combination of Qemu, libvirt, and
> SELinux patches that work together to isolate multiple guests' images
> when they're stored in the same NFS mount. This results in an
> environment where sVirt isolation and NFS image file isolation can both
> be provided.
>
> Currently, Qemu opens an image file in addition to performing the
> necessary read and write operations. The proposed solution will move
> the open out of Qemu and into libvirt. Once libvirt opens an image
> file for the guest, it will pass the file descriptor to Qemu via a
> new fd: protocol.
>
> If the image file resides in an NFS mount, the following SELinux policy
> changes will provide image isolation:
>
>    - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>      allow Qemu (svirt_t) to only have SELinux read and write
>      permissions on nfs_t files
>
>    - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>      (virtd_t) file descriptors
>
> Following is a sample invocation of Qemu using the fd: protocol:
>
>      qemu -drive file=fd:4,format=qcow2
>
> This patch contains the Qemu code to support this solution. I would
> like to solicit input from the libvirt community prior to starting
> the libvirt patch.
>
> This patch was tested with the following formats: raw, cow, qcow,
> qcow2, vmdk, using the fd: protocol as well as existing file name
> support. Non-valid file descriptors were also tested.
>
> Signed-off-by: Corey Bryant<coreyb@linux.vnet.ibm.com>
> ---
>   block/raw-posix.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++-------
>   qemu-doc.texi     |   12 +++++++++
>   qemu-options.hx   |    8 ++++--
>   3 files changed, 78 insertions(+), 12 deletions(-)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index a95c8d4..6554b06 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -142,7 +142,8 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
>                              int bdrv_flags, int open_flags)
>   {
>       BDRVRawState *s = bs->opaque;
> -    int fd, ret;
> +    int fd = -1;
> +    int ret;
>
>       s->open_flags = open_flags | O_BINARY;
>       s->open_flags&= ~O_ACCMODE;
> @@ -159,15 +160,16 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
>       else if (!(bdrv_flags&  BDRV_O_CACHE_WB))
>           s->open_flags |= O_DSYNC;
>
> -    s->fd = -1;
> -    fd = qemu_open(filename, s->open_flags, 0644);
> -    if (fd<  0) {
> -        ret = -errno;
> -        if (ret == -EROFS)
> -            ret = -EACCES;
> -        return ret;
> +    if (s->fd == -1) {
> +        fd = qemu_open(filename, s->open_flags, 0644);
> +        if (fd<  0) {
> +            ret = -errno;
> +            if (ret == -EROFS)
> +                ret = -EACCES;
> +            return ret;
> +        }
> +        s->fd = fd;
>       }
> -    s->fd = fd;
>       s->aligned_buf = NULL;
>
>       if ((bdrv_flags&  BDRV_O_NOCACHE)) {
> @@ -224,6 +226,7 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
>   {
>       BDRVRawState *s = bs->opaque;
>
> +    s->fd = -1;
>       s->type = FTYPE_FILE;
>       return raw_open_common(bs, filename, flags, 0);
>   }
> @@ -819,6 +822,50 @@ static BlockDriver bdrv_file = {
>       .create_options = raw_create_options,
>   };
>
> +static int raw_open_fd(BlockDriverState *bs, const char *filename, int flags)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    const char *fd_str;
> +    int i;
> +
> +    /* extract the file descriptor - fail if it's not fd: */
> +    if (!strstart(filename, "fd:",&fd_str)) {
> +        return -EINVAL;
> +    }
> +
> +    for (i = 0; fd_str[i] != '\0'; i++) {
> +        if (!qemu_isdigit(fd_str[i]))
> +            return -EBADF;

This is off CODING_STYLE wise (braces around if), but a better way to do 
this is to use strtol(), and check for the error condition from it.

If it fails, you should call monitor_get_fd(cur_mon, fd_str) as this 
will allow named fds to be used.

Regards,

Anthony Liguori
Blue Swirl - May 20, 2011, 7:25 p.m.
On Fri, May 20, 2011 at 9:48 PM, Corey Bryant <bryntcor@us.ibm.com> wrote:
> sVirt provides SELinux MAC isolation for Qemu guest processes and their
> corresponding resources (image files). sVirt provides this support
> by labeling guests and resources with security labels that are stored
> in file system extended attributes. Some file systems, such as NFS, do
> not support the extended attribute security namespace, which is needed
> for image file isolation when using the sVirt SELinux security driver
> in libvirt.
>
> The proposed solution entails a combination of Qemu, libvirt, and
> SELinux patches that work together to isolate multiple guests' images
> when they're stored in the same NFS mount. This results in an
> environment where sVirt isolation and NFS image file isolation can both
> be provided.

Very nice. QEMU should use this to support privilege separation. We
already have chroot and runas switches, a new switch should convert
all file references to fd references internally for that process. If
this can be made transparent, this should even be the default way of
operation.
Anthony Liguori - May 20, 2011, 7:42 p.m.
On 05/20/2011 02:25 PM, Blue Swirl wrote:
> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>  wrote:
>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>> corresponding resources (image files). sVirt provides this support
>> by labeling guests and resources with security labels that are stored
>> in file system extended attributes. Some file systems, such as NFS, do
>> not support the extended attribute security namespace, which is needed
>> for image file isolation when using the sVirt SELinux security driver
>> in libvirt.
>>
>> The proposed solution entails a combination of Qemu, libvirt, and
>> SELinux patches that work together to isolate multiple guests' images
>> when they're stored in the same NFS mount. This results in an
>> environment where sVirt isolation and NFS image file isolation can both
>> be provided.
>
> Very nice. QEMU should use this to support privilege separation. We
> already have chroot and runas switches, a new switch should convert
> all file references to fd references internally for that process. If
> this can be made transparent, this should even be the default way of
> operation.

You mean, QEMU starts up, opens all disk images, reinvokes itself in a 
confined context, and then passes fds to the child?

Interesting idea.

Regards,

Anthony Liguori

>
Blue Swirl - May 20, 2011, 7:53 p.m.
On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/20/2011 02:25 PM, Blue Swirl wrote:
>>
>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>  wrote:
>>>
>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>> corresponding resources (image files). sVirt provides this support
>>> by labeling guests and resources with security labels that are stored
>>> in file system extended attributes. Some file systems, such as NFS, do
>>> not support the extended attribute security namespace, which is needed
>>> for image file isolation when using the sVirt SELinux security driver
>>> in libvirt.
>>>
>>> The proposed solution entails a combination of Qemu, libvirt, and
>>> SELinux patches that work together to isolate multiple guests' images
>>> when they're stored in the same NFS mount. This results in an
>>> environment where sVirt isolation and NFS image file isolation can both
>>> be provided.
>>
>> Very nice. QEMU should use this to support privilege separation. We
>> already have chroot and runas switches, a new switch should convert
>> all file references to fd references internally for that process. If
>> this can be made transparent, this should even be the default way of
>> operation.
>
> You mean, QEMU starts up, opens all disk images, reinvokes itself in a
> confined context, and then passes fds to the child?

And exit after that, or do the same without forking.

This wouldn't work now for the native CDROM devices which need to
reopen the device. For that, an explicit reopen method could be added.
The method could even chat with the privileged process to get that to
do the reopening, but I'd leave that to libvirt and fail without it
for plain QEMU.
Daniel P. Berrange - May 23, 2011, 9:45 a.m.
On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
> sVirt provides SELinux MAC isolation for Qemu guest processes and their
> corresponding resources (image files). sVirt provides this support
> by labeling guests and resources with security labels that are stored
> in file system extended attributes. Some file systems, such as NFS, do
> not support the extended attribute security namespace, which is needed
> for image file isolation when using the sVirt SELinux security driver
> in libvirt.
> 
> The proposed solution entails a combination of Qemu, libvirt, and
> SELinux patches that work together to isolate multiple guests' images
> when they're stored in the same NFS mount. This results in an
> environment where sVirt isolation and NFS image file isolation can both
> be provided.
> 
> Currently, Qemu opens an image file in addition to performing the
> necessary read and write operations. The proposed solution will move
> the open out of Qemu and into libvirt. Once libvirt opens an image
> file for the guest, it will pass the file descriptor to Qemu via a
> new fd: protocol.
> 
> If the image file resides in an NFS mount, the following SELinux policy
> changes will provide image isolation:
> 
>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>     allow Qemu (svirt_t) to only have SELinux read and write
>     permissions on nfs_t files
> 
>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>     (virtd_t) file descriptors
> 
> Following is a sample invocation of Qemu using the fd: protocol:
> 
>     qemu -drive file=fd:4,format=qcow2
> 
> This patch contains the Qemu code to support this solution. I would
> like to solicit input from the libvirt community prior to starting
> the libvirt patch.
> 
> This patch was tested with the following formats: raw, cow, qcow,
> qcow2, vmdk, using the fd: protocol as well as existing file name
> support. Non-valid file descriptors were also tested.

How can backing files work ?  The '-drive' syntax doesn't provide
any way to set properties against the backing files (which may be
nested to arbitrary depth).  AFAICT, we need to the often discussed
-blockdev advanced syntax to be able to set a 'fd' property against
nested backing files.  I'm not sure it is worth supporting fd: if
we only have -drive syntax, since backing files are an important
feature for most mgmt apps.

Also, there are a few places in QEMU, where it re-opens the existing
block driver on the fly. What is the plan for supporting this, without
having QEMU block on waiting for libvirt to pass it a new FD ?

Regards,
Daniel
Daniel P. Berrange - May 23, 2011, 9:48 a.m.
On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
> sVirt provides SELinux MAC isolation for Qemu guest processes and their
> corresponding resources (image files). sVirt provides this support
> by labeling guests and resources with security labels that are stored
> in file system extended attributes. Some file systems, such as NFS, do
> not support the extended attribute security namespace, which is needed
> for image file isolation when using the sVirt SELinux security driver
> in libvirt.

This will also allow libvirt to run QEMU confined by the Linux
container functionality. In particular it lets us use CLONE_NEWNS
flag to isolate its root filesystem, without having to worry about
setting up passthrough mounts for each disk image it needs to
access, which is a real pain when it comes to hotplug.

Daniel
Stefan Hajnoczi - May 23, 2011, 10:19 a.m.
On Mon, May 23, 2011 at 10:45 AM, Daniel P. Berrange
<berrange@redhat.com> wrote:
> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>> corresponding resources (image files). sVirt provides this support
>> by labeling guests and resources with security labels that are stored
>> in file system extended attributes. Some file systems, such as NFS, do
>> not support the extended attribute security namespace, which is needed
>> for image file isolation when using the sVirt SELinux security driver
>> in libvirt.
>>
>> The proposed solution entails a combination of Qemu, libvirt, and
>> SELinux patches that work together to isolate multiple guests' images
>> when they're stored in the same NFS mount. This results in an
>> environment where sVirt isolation and NFS image file isolation can both
>> be provided.
>>
>> Currently, Qemu opens an image file in addition to performing the
>> necessary read and write operations. The proposed solution will move
>> the open out of Qemu and into libvirt. Once libvirt opens an image
>> file for the guest, it will pass the file descriptor to Qemu via a
>> new fd: protocol.
>>
>> If the image file resides in an NFS mount, the following SELinux policy
>> changes will provide image isolation:
>>
>>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>>     allow Qemu (svirt_t) to only have SELinux read and write
>>     permissions on nfs_t files
>>
>>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>>     (virtd_t) file descriptors
>>
>> Following is a sample invocation of Qemu using the fd: protocol:
>>
>>     qemu -drive file=fd:4,format=qcow2
>>
>> This patch contains the Qemu code to support this solution. I would
>> like to solicit input from the libvirt community prior to starting
>> the libvirt patch.
>>
>> This patch was tested with the following formats: raw, cow, qcow,
>> qcow2, vmdk, using the fd: protocol as well as existing file name
>> support. Non-valid file descriptors were also tested.
>
> How can backing files work ?  The '-drive' syntax doesn't provide
> any way to set properties against the backing files (which may be
> nested to arbitrary depth).  AFAICT, we need to the often discussed
> -blockdev advanced syntax to be able to set a 'fd' property against
> nested backing files.  I'm not sure it is worth supporting fd: if
> we only have -drive syntax, since backing files are an important
> feature for most mgmt apps.
>
> Also, there are a few places in QEMU, where it re-opens the existing
> block driver on the fly. What is the plan for supporting this, without
> having QEMU block on waiting for libvirt to pass it a new FD ?

QEMU could ask for a file over QMP.  So you bootstrap it using fd: but
when a reopen or backing file is needed, QEMU raises a
QEVENT_BLOCK_REQUEST_FILE event.  Of course waiting around for the
reopen to complete it not ideal as it may temporarily pause the guest.

Stefan
Daniel P. Berrange - May 23, 2011, 10:30 a.m.
On Mon, May 23, 2011 at 11:19:15AM +0100, Stefan Hajnoczi wrote:
> On Mon, May 23, 2011 at 10:45 AM, Daniel P. Berrange
> <berrange@redhat.com> wrote:
> > On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
> >> sVirt provides SELinux MAC isolation for Qemu guest processes and their
> >> corresponding resources (image files). sVirt provides this support
> >> by labeling guests and resources with security labels that are stored
> >> in file system extended attributes. Some file systems, such as NFS, do
> >> not support the extended attribute security namespace, which is needed
> >> for image file isolation when using the sVirt SELinux security driver
> >> in libvirt.
> >>
> >> The proposed solution entails a combination of Qemu, libvirt, and
> >> SELinux patches that work together to isolate multiple guests' images
> >> when they're stored in the same NFS mount. This results in an
> >> environment where sVirt isolation and NFS image file isolation can both
> >> be provided.
> >>
> >> Currently, Qemu opens an image file in addition to performing the
> >> necessary read and write operations. The proposed solution will move
> >> the open out of Qemu and into libvirt. Once libvirt opens an image
> >> file for the guest, it will pass the file descriptor to Qemu via a
> >> new fd: protocol.
> >>
> >> If the image file resides in an NFS mount, the following SELinux policy
> >> changes will provide image isolation:
> >>
> >>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
> >>     allow Qemu (svirt_t) to only have SELinux read and write
> >>     permissions on nfs_t files
> >>
> >>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
> >>     (virtd_t) file descriptors
> >>
> >> Following is a sample invocation of Qemu using the fd: protocol:
> >>
> >>     qemu -drive file=fd:4,format=qcow2
> >>
> >> This patch contains the Qemu code to support this solution. I would
> >> like to solicit input from the libvirt community prior to starting
> >> the libvirt patch.
> >>
> >> This patch was tested with the following formats: raw, cow, qcow,
> >> qcow2, vmdk, using the fd: protocol as well as existing file name
> >> support. Non-valid file descriptors were also tested.
> >
> > How can backing files work ?  The '-drive' syntax doesn't provide
> > any way to set properties against the backing files (which may be
> > nested to arbitrary depth).  AFAICT, we need to the often discussed
> > -blockdev advanced syntax to be able to set a 'fd' property against
> > nested backing files.  I'm not sure it is worth supporting fd: if
> > we only have -drive syntax, since backing files are an important
> > feature for most mgmt apps.
> >
> > Also, there are a few places in QEMU, where it re-opens the existing
> > block driver on the fly. What is the plan for supporting this, without
> > having QEMU block on waiting for libvirt to pass it a new FD ?
> 
> QEMU could ask for a file over QMP.  So you bootstrap it using fd: but
> when a reopen or backing file is needed, QEMU raises a
> QEVENT_BLOCK_REQUEST_FILE event.  Of course waiting around for the
> reopen to complete it not ideal as it may temporarily pause the guest.

It feels to me that turning the current block driver code which just does
open(2) on files, into something which issues events & asynchronously
waits for a file would potentially be quite complex.

You also need to be much more careful from a security POV if the mgmt
app is accepting requests to open arbitrary files from QEMU, to ensure
the filenames are correctly/strictly validated before opening them and
giving them back to QEMU. An architecture where the mgmt app decides
what FDs to supply upfront, has less potential for security errors.

To me the ideal would thus be that we can supply FDs for the backing
store with -blockdev syntax, and that places where QEMU re-opens files
would be enhanced to avoid that need. If there are things we can't do
without closing & re-opening the same file, then perhaps we need some
new ioctl()/fcntl() calls to change those file attributes on the fly.

Regards,
Daniel
Anthony Liguori - May 23, 2011, 12:50 p.m.
On 05/23/2011 04:45 AM, Daniel P. Berrange wrote:
> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>> corresponding resources (image files). sVirt provides this support
>> by labeling guests and resources with security labels that are stored
>> in file system extended attributes. Some file systems, such as NFS, do
>> not support the extended attribute security namespace, which is needed
>> for image file isolation when using the sVirt SELinux security driver
>> in libvirt.
>>
>> The proposed solution entails a combination of Qemu, libvirt, and
>> SELinux patches that work together to isolate multiple guests' images
>> when they're stored in the same NFS mount. This results in an
>> environment where sVirt isolation and NFS image file isolation can both
>> be provided.
>>
>> Currently, Qemu opens an image file in addition to performing the
>> necessary read and write operations. The proposed solution will move
>> the open out of Qemu and into libvirt. Once libvirt opens an image
>> file for the guest, it will pass the file descriptor to Qemu via a
>> new fd: protocol.
>>
>> If the image file resides in an NFS mount, the following SELinux policy
>> changes will provide image isolation:
>>
>>    - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>>      allow Qemu (svirt_t) to only have SELinux read and write
>>      permissions on nfs_t files
>>
>>    - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>>      (virtd_t) file descriptors
>>
>> Following is a sample invocation of Qemu using the fd: protocol:
>>
>>      qemu -drive file=fd:4,format=qcow2
>>
>> This patch contains the Qemu code to support this solution. I would
>> like to solicit input from the libvirt community prior to starting
>> the libvirt patch.
>>
>> This patch was tested with the following formats: raw, cow, qcow,
>> qcow2, vmdk, using the fd: protocol as well as existing file name
>> support. Non-valid file descriptors were also tested.
>
> How can backing files work ?  The '-drive' syntax doesn't provide
> any way to set properties against the backing files (which may be
> nested to arbitrary depth).

This is orthogonal to having an fd: protocol.

> Also, there are a few places in QEMU, where it re-opens the existing
> block driver on the fly. What is the plan for supporting this, without
> having QEMU block on waiting for libvirt to pass it a new FD ?

That's only host CDROM AFAICT.

Regards,

Anthony Liguori

>
> Regards,
> Daniel
Anthony Liguori - May 23, 2011, 12:59 p.m.
On 05/23/2011 05:30 AM, Daniel P. Berrange wrote:
> It feels to me that turning the current block driver code which just does
> open(2) on files, into something which issues events&  asynchronously
> waits for a file would potentially be quite complex.
>
> You also need to be much more careful from a security POV if the mgmt
> app is accepting requests to open arbitrary files from QEMU, to ensure
> the filenames are correctly/strictly validated before opening them and
> giving them back to QEMU. An architecture where the mgmt app decides
> what FDs to supply upfront, has less potential for security errors.
>
> To me the ideal would thus be that we can supply FDs for the backing
> store with -blockdev syntax, and that places where QEMU re-opens files
> would be enhanced to avoid that need. If there are things we can't do
> without closing&  re-opening the same file, then perhaps we need some
> new ioctl()/fcntl() calls to change those file attributes on the fly.

I agree.  But my view of blockdev is that you wouldn't set an fd 
attribute but rather the backing file name and use the fd protocol.  For 
instance:

-blockdev id=foo-base,path=fd:4,format=raw
-blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo



> Regards,
> Daniel
Daniel P. Berrange - May 23, 2011, 1:06 p.m.
On Mon, May 23, 2011 at 07:50:12AM -0500, Anthony Liguori wrote:
> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote:
> >On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
> >>sVirt provides SELinux MAC isolation for Qemu guest processes and their
> >>corresponding resources (image files). sVirt provides this support
> >>by labeling guests and resources with security labels that are stored
> >>in file system extended attributes. Some file systems, such as NFS, do
> >>not support the extended attribute security namespace, which is needed
> >>for image file isolation when using the sVirt SELinux security driver
> >>in libvirt.
> >>
> >>The proposed solution entails a combination of Qemu, libvirt, and
> >>SELinux patches that work together to isolate multiple guests' images
> >>when they're stored in the same NFS mount. This results in an
> >>environment where sVirt isolation and NFS image file isolation can both
> >>be provided.
> >>
> >>Currently, Qemu opens an image file in addition to performing the
> >>necessary read and write operations. The proposed solution will move
> >>the open out of Qemu and into libvirt. Once libvirt opens an image
> >>file for the guest, it will pass the file descriptor to Qemu via a
> >>new fd: protocol.
> >>
> >>If the image file resides in an NFS mount, the following SELinux policy
> >>changes will provide image isolation:
> >>
> >>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
> >>     allow Qemu (svirt_t) to only have SELinux read and write
> >>     permissions on nfs_t files
> >>
> >>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
> >>     (virtd_t) file descriptors
> >>
> >>Following is a sample invocation of Qemu using the fd: protocol:
> >>
> >>     qemu -drive file=fd:4,format=qcow2
> >>
> >>This patch contains the Qemu code to support this solution. I would
> >>like to solicit input from the libvirt community prior to starting
> >>the libvirt patch.
> >>
> >>This patch was tested with the following formats: raw, cow, qcow,
> >>qcow2, vmdk, using the fd: protocol as well as existing file name
> >>support. Non-valid file descriptors were also tested.
> >
> >How can backing files work ?  The '-drive' syntax doesn't provide
> >any way to set properties against the backing files (which may be
> >nested to arbitrary depth).
> 
> This is orthogonal to having an fd: protocol.
> 
> >Also, there are a few places in QEMU, where it re-opens the existing
> >block driver on the fly. What is the plan for supporting this, without
> >having QEMU block on waiting for libvirt to pass it a new FD ?
> 
> That's only host CDROM AFAICT.

IIRC, Christoph posted patches which required re-opening files for
any disks, after a guest initiated change in the drive cache mode.

Regards,
Daniel
Stefan Hajnoczi - May 23, 2011, 1:09 p.m.
On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote:
>>
>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
>>>
>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>> corresponding resources (image files). sVirt provides this support
>>> by labeling guests and resources with security labels that are stored
>>> in file system extended attributes. Some file systems, such as NFS, do
>>> not support the extended attribute security namespace, which is needed
>>> for image file isolation when using the sVirt SELinux security driver
>>> in libvirt.
>>>
>>> The proposed solution entails a combination of Qemu, libvirt, and
>>> SELinux patches that work together to isolate multiple guests' images
>>> when they're stored in the same NFS mount. This results in an
>>> environment where sVirt isolation and NFS image file isolation can both
>>> be provided.
>>>
>>> Currently, Qemu opens an image file in addition to performing the
>>> necessary read and write operations. The proposed solution will move
>>> the open out of Qemu and into libvirt. Once libvirt opens an image
>>> file for the guest, it will pass the file descriptor to Qemu via a
>>> new fd: protocol.
>>>
>>> If the image file resides in an NFS mount, the following SELinux policy
>>> changes will provide image isolation:
>>>
>>>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>>>     allow Qemu (svirt_t) to only have SELinux read and write
>>>     permissions on nfs_t files
>>>
>>>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>>>     (virtd_t) file descriptors
>>>
>>> Following is a sample invocation of Qemu using the fd: protocol:
>>>
>>>     qemu -drive file=fd:4,format=qcow2
>>>
>>> This patch contains the Qemu code to support this solution. I would
>>> like to solicit input from the libvirt community prior to starting
>>> the libvirt patch.
>>>
>>> This patch was tested with the following formats: raw, cow, qcow,
>>> qcow2, vmdk, using the fd: protocol as well as existing file name
>>> support. Non-valid file descriptors were also tested.
>>
>> How can backing files work ?  The '-drive' syntax doesn't provide
>> any way to set properties against the backing files (which may be
>> nested to arbitrary depth).
>
> This is orthogonal to having an fd: protocol.
>
>> Also, there are a few places in QEMU, where it re-opens the existing
>> block driver on the fly. What is the plan for supporting this, without
>> having QEMU block on waiting for libvirt to pass it a new FD ?
>
> That's only host CDROM AFAICT.

The host page cache on|off option also uses it because fcntl(2) cannot
change O_DIRECT.  Actually for Linux it may be doable with
open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux
does not exist.

On the other hand fcntl(2) should really support changing O_DIRECT -
only that there is not much incentive to fix Linux if other OSes don't
support it either (and older Linux kernels also require workarounds).

Stefan
Anthony Liguori - May 23, 2011, 1:21 p.m.
On 05/23/2011 08:09 AM, Stefan Hajnoczi wrote:
> On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori<aliguori@us.ibm.com>  wrote:
>> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote:
>>>
>>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
>>>>
>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>> corresponding resources (image files). sVirt provides this support
>>>> by labeling guests and resources with security labels that are stored
>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>> not support the extended attribute security namespace, which is needed
>>>> for image file isolation when using the sVirt SELinux security driver
>>>> in libvirt.
>>>>
>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>> SELinux patches that work together to isolate multiple guests' images
>>>> when they're stored in the same NFS mount. This results in an
>>>> environment where sVirt isolation and NFS image file isolation can both
>>>> be provided.
>>>>
>>>> Currently, Qemu opens an image file in addition to performing the
>>>> necessary read and write operations. The proposed solution will move
>>>> the open out of Qemu and into libvirt. Once libvirt opens an image
>>>> file for the guest, it will pass the file descriptor to Qemu via a
>>>> new fd: protocol.
>>>>
>>>> If the image file resides in an NFS mount, the following SELinux policy
>>>> changes will provide image isolation:
>>>>
>>>>    - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>>>>      allow Qemu (svirt_t) to only have SELinux read and write
>>>>      permissions on nfs_t files
>>>>
>>>>    - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>>>>      (virtd_t) file descriptors
>>>>
>>>> Following is a sample invocation of Qemu using the fd: protocol:
>>>>
>>>>      qemu -drive file=fd:4,format=qcow2
>>>>
>>>> This patch contains the Qemu code to support this solution. I would
>>>> like to solicit input from the libvirt community prior to starting
>>>> the libvirt patch.
>>>>
>>>> This patch was tested with the following formats: raw, cow, qcow,
>>>> qcow2, vmdk, using the fd: protocol as well as existing file name
>>>> support. Non-valid file descriptors were also tested.
>>>
>>> How can backing files work ?  The '-drive' syntax doesn't provide
>>> any way to set properties against the backing files (which may be
>>> nested to arbitrary depth).
>>
>> This is orthogonal to having an fd: protocol.
>>
>>> Also, there are a few places in QEMU, where it re-opens the existing
>>> block driver on the fly. What is the plan for supporting this, without
>>> having QEMU block on waiting for libvirt to pass it a new FD ?
>>
>> That's only host CDROM AFAICT.
>
> The host page cache on|off option also uses it because fcntl(2) cannot
> change O_DIRECT.  Actually for Linux it may be doable with
> open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux
> does not exist.

QEMU doesn't actually know which caching mode the fd is in if it's being 
passed to it so this command doesn't make much sense.

Regards,

Anthony Liguori

>
> On the other hand fcntl(2) should really support changing O_DIRECT -
> only that there is not much incentive to fix Linux if other OSes don't
> support it either (and older Linux kernels also require workarounds).
>
> Stefan
Stefan Hajnoczi - May 23, 2011, 1:26 p.m.
On Mon, May 23, 2011 at 2:21 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> On 05/23/2011 08:09 AM, Stefan Hajnoczi wrote:
>>
>> On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori<aliguori@us.ibm.com>
>>  wrote:
>>>
>>> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote:
>>>>
>>>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
>>>>>
>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>>> corresponding resources (image files). sVirt provides this support
>>>>> by labeling guests and resources with security labels that are stored
>>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>>> not support the extended attribute security namespace, which is needed
>>>>> for image file isolation when using the sVirt SELinux security driver
>>>>> in libvirt.
>>>>>
>>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>>> SELinux patches that work together to isolate multiple guests' images
>>>>> when they're stored in the same NFS mount. This results in an
>>>>> environment where sVirt isolation and NFS image file isolation can both
>>>>> be provided.
>>>>>
>>>>> Currently, Qemu opens an image file in addition to performing the
>>>>> necessary read and write operations. The proposed solution will move
>>>>> the open out of Qemu and into libvirt. Once libvirt opens an image
>>>>> file for the guest, it will pass the file descriptor to Qemu via a
>>>>> new fd: protocol.
>>>>>
>>>>> If the image file resides in an NFS mount, the following SELinux policy
>>>>> changes will provide image isolation:
>>>>>
>>>>>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
>>>>>     allow Qemu (svirt_t) to only have SELinux read and write
>>>>>     permissions on nfs_t files
>>>>>
>>>>>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
>>>>>     (virtd_t) file descriptors
>>>>>
>>>>> Following is a sample invocation of Qemu using the fd: protocol:
>>>>>
>>>>>     qemu -drive file=fd:4,format=qcow2
>>>>>
>>>>> This patch contains the Qemu code to support this solution. I would
>>>>> like to solicit input from the libvirt community prior to starting
>>>>> the libvirt patch.
>>>>>
>>>>> This patch was tested with the following formats: raw, cow, qcow,
>>>>> qcow2, vmdk, using the fd: protocol as well as existing file name
>>>>> support. Non-valid file descriptors were also tested.
>>>>
>>>> How can backing files work ?  The '-drive' syntax doesn't provide
>>>> any way to set properties against the backing files (which may be
>>>> nested to arbitrary depth).
>>>
>>> This is orthogonal to having an fd: protocol.
>>>
>>>> Also, there are a few places in QEMU, where it re-opens the existing
>>>> block driver on the fly. What is the plan for supporting this, without
>>>> having QEMU block on waiting for libvirt to pass it a new FD ?
>>>
>>> That's only host CDROM AFAICT.
>>
>> The host page cache on|off option also uses it because fcntl(2) cannot
>> change O_DIRECT.  Actually for Linux it may be doable with
>> open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux
>> does not exist.
>
> QEMU doesn't actually know which caching mode the fd is in if it's being
> passed to it so this command doesn't make much sense.
>

fcntl(2) will report the flags.

Also, we need to make sure that the O_SYNC flag and write caching are
in agreement, although I guess it is libvirt's responsibility to set
that up correctly.

Stefan
Daniel P. Berrange - May 23, 2011, 1:42 p.m.
On Mon, May 23, 2011 at 02:26:05PM +0100, Stefan Hajnoczi wrote:
> On Mon, May 23, 2011 at 2:21 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> > On 05/23/2011 08:09 AM, Stefan Hajnoczi wrote:
> >>
> >> On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori<aliguori@us.ibm.com>
> >>  wrote:
> >>>
> >>> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote:
> >>>>
> >>>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote:
> >>>>>
> >>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
> >>>>> corresponding resources (image files). sVirt provides this support
> >>>>> by labeling guests and resources with security labels that are stored
> >>>>> in file system extended attributes. Some file systems, such as NFS, do
> >>>>> not support the extended attribute security namespace, which is needed
> >>>>> for image file isolation when using the sVirt SELinux security driver
> >>>>> in libvirt.
> >>>>>
> >>>>> The proposed solution entails a combination of Qemu, libvirt, and
> >>>>> SELinux patches that work together to isolate multiple guests' images
> >>>>> when they're stored in the same NFS mount. This results in an
> >>>>> environment where sVirt isolation and NFS image file isolation can both
> >>>>> be provided.
> >>>>>
> >>>>> Currently, Qemu opens an image file in addition to performing the
> >>>>> necessary read and write operations. The proposed solution will move
> >>>>> the open out of Qemu and into libvirt. Once libvirt opens an image
> >>>>> file for the guest, it will pass the file descriptor to Qemu via a
> >>>>> new fd: protocol.
> >>>>>
> >>>>> If the image file resides in an NFS mount, the following SELinux policy
> >>>>> changes will provide image isolation:
> >>>>>
> >>>>>   - A new SELinux boolean is created (e.g. virt_read_write_nfs) to
> >>>>>     allow Qemu (svirt_t) to only have SELinux read and write
> >>>>>     permissions on nfs_t files
> >>>>>
> >>>>>   - Qemu (svirt_t) also gets SELinux use permissions on libvirt
> >>>>>     (virtd_t) file descriptors
> >>>>>
> >>>>> Following is a sample invocation of Qemu using the fd: protocol:
> >>>>>
> >>>>>     qemu -drive file=fd:4,format=qcow2
> >>>>>
> >>>>> This patch contains the Qemu code to support this solution. I would
> >>>>> like to solicit input from the libvirt community prior to starting
> >>>>> the libvirt patch.
> >>>>>
> >>>>> This patch was tested with the following formats: raw, cow, qcow,
> >>>>> qcow2, vmdk, using the fd: protocol as well as existing file name
> >>>>> support. Non-valid file descriptors were also tested.
> >>>>
> >>>> How can backing files work ?  The '-drive' syntax doesn't provide
> >>>> any way to set properties against the backing files (which may be
> >>>> nested to arbitrary depth).
> >>>
> >>> This is orthogonal to having an fd: protocol.
> >>>
> >>>> Also, there are a few places in QEMU, where it re-opens the existing
> >>>> block driver on the fly. What is the plan for supporting this, without
> >>>> having QEMU block on waiting for libvirt to pass it a new FD ?
> >>>
> >>> That's only host CDROM AFAICT.
> >>
> >> The host page cache on|off option also uses it because fcntl(2) cannot
> >> change O_DIRECT.  Actually for Linux it may be doable with
> >> open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux
> >> does not exist.
> >
> > QEMU doesn't actually know which caching mode the fd is in if it's being
> > passed to it so this command doesn't make much sense.
> >
> 
> fcntl(2) will report the flags.
> 
> Also, we need to make sure that the O_SYNC flag and write caching are
> in agreement, although I guess it is libvirt's responsibility to set
> that up correctly.

This is where fcntl() support for setting/clearing O_DIRECT etc would
be useful. It avoids the need for libvirt or other mgmt apps to second
guess what flags QEMU expects for a particular cache mode.

Regards,
Daniel
Kevin Wolf - May 23, 2011, 2:28 p.m.
Am 20.05.2011 21:53, schrieb Blue Swirl:
> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>> On 05/20/2011 02:25 PM, Blue Swirl wrote:
>>>
>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>  wrote:
>>>>
>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>> corresponding resources (image files). sVirt provides this support
>>>> by labeling guests and resources with security labels that are stored
>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>> not support the extended attribute security namespace, which is needed
>>>> for image file isolation when using the sVirt SELinux security driver
>>>> in libvirt.
>>>>
>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>> SELinux patches that work together to isolate multiple guests' images
>>>> when they're stored in the same NFS mount. This results in an
>>>> environment where sVirt isolation and NFS image file isolation can both
>>>> be provided.
>>>
>>> Very nice. QEMU should use this to support privilege separation. We
>>> already have chroot and runas switches, a new switch should convert
>>> all file references to fd references internally for that process. If
>>> this can be made transparent, this should even be the default way of
>>> operation.
>>
>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a
>> confined context, and then passes fds to the child?
> 
> And exit after that, or do the same without forking.
> 
> This wouldn't work now for the native CDROM devices which need to
> reopen the device. For that, an explicit reopen method could be added.
> The method could even chat with the privileged process to get that to
> do the reopening, but I'd leave that to libvirt and fail without it
> for plain QEMU.

There are more cases where we reopen the image file. One example is the
'commit' monitor command which temporarily reopens the backing file r/w.
Or Christoph's patch that allows guests to toggle the write-cache
enabled bit. Same for live snapshots. So we'll need a solution for them
before doing anything like this.

And breaking qemu without libvirt isn't really an option for me.

Kevin
Markus Armbruster - May 23, 2011, 2:35 p.m.
Anthony Liguori <anthony@codemonkey.ws> writes:

> On 05/23/2011 05:30 AM, Daniel P. Berrange wrote:
>> It feels to me that turning the current block driver code which just does
>> open(2) on files, into something which issues events&  asynchronously
>> waits for a file would potentially be quite complex.
>>
>> You also need to be much more careful from a security POV if the mgmt
>> app is accepting requests to open arbitrary files from QEMU, to ensure
>> the filenames are correctly/strictly validated before opening them and
>> giving them back to QEMU. An architecture where the mgmt app decides
>> what FDs to supply upfront, has less potential for security errors.
>>
>> To me the ideal would thus be that we can supply FDs for the backing
>> store with -blockdev syntax, and that places where QEMU re-opens files
>> would be enhanced to avoid that need. If there are things we can't do
>> without closing&  re-opening the same file, then perhaps we need some
>> new ioctl()/fcntl() calls to change those file attributes on the fly.
>
> I agree.  But my view of blockdev is that you wouldn't set an fd
> attribute but rather the backing file name and use the fd protocol.
> For instance:
>
> -blockdev id=foo-base,path=fd:4,format=raw
> -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo

I guess you mean backing_file=foo-base.

If none is specified, use the backing file specification stored in the
image.

Matches my current thinking.
Markus Armbruster - May 23, 2011, 3:24 p.m.
Kevin Wolf <kwolf@redhat.com> writes:

> Am 20.05.2011 21:53, schrieb Blue Swirl:
>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>>> On 05/20/2011 02:25 PM, Blue Swirl wrote:
>>>>
>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>  wrote:
>>>>>
>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>>> corresponding resources (image files). sVirt provides this support
>>>>> by labeling guests and resources with security labels that are stored
>>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>>> not support the extended attribute security namespace, which is needed
>>>>> for image file isolation when using the sVirt SELinux security driver
>>>>> in libvirt.
>>>>>
>>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>>> SELinux patches that work together to isolate multiple guests' images
>>>>> when they're stored in the same NFS mount. This results in an
>>>>> environment where sVirt isolation and NFS image file isolation can both
>>>>> be provided.
>>>>
>>>> Very nice. QEMU should use this to support privilege separation. We
>>>> already have chroot and runas switches, a new switch should convert
>>>> all file references to fd references internally for that process. If
>>>> this can be made transparent, this should even be the default way of
>>>> operation.
>>>
>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a
>>> confined context, and then passes fds to the child?
>> 
>> And exit after that, or do the same without forking.
>> 
>> This wouldn't work now for the native CDROM devices which need to
>> reopen the device. For that, an explicit reopen method could be added.
>> The method could even chat with the privileged process to get that to
>> do the reopening, but I'd leave that to libvirt and fail without it
>> for plain QEMU.
>
> There are more cases where we reopen the image file. One example is the
> 'commit' monitor command which temporarily reopens the backing file r/w.
> Or Christoph's patch that allows guests to toggle the write-cache
> enabled bit. Same for live snapshots. So we'll need a solution for them
> before doing anything like this.
>
> And breaking qemu without libvirt isn't really an option for me.

Reopening files is evil.  Sometimes flaws in the system call API make it
the only option.  You can mitigate via /dev/fd/%d, but only on some
systems.  The less we reopen, the better.

An fd: protocol can't easily support reopen.  So fail it.  This doesn't
break any existing usage.  It's just a restriction on the new protocol.
Restrictions can render the new protocol useless in practice, but we're
not "breaking qemu without libvirt" there.

Perhaps we can make relax the restriction on some system by avoiding the
reopen in a system-dependent way.
Kevin Wolf - May 23, 2011, 3:56 p.m.
Am 23.05.2011 17:24, schrieb Markus Armbruster:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
>> Am 20.05.2011 21:53, schrieb Blue Swirl:
>>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>>>> On 05/20/2011 02:25 PM, Blue Swirl wrote:
>>>>>
>>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>  wrote:
>>>>>>
>>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>>>> corresponding resources (image files). sVirt provides this support
>>>>>> by labeling guests and resources with security labels that are stored
>>>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>>>> not support the extended attribute security namespace, which is needed
>>>>>> for image file isolation when using the sVirt SELinux security driver
>>>>>> in libvirt.
>>>>>>
>>>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>>>> SELinux patches that work together to isolate multiple guests' images
>>>>>> when they're stored in the same NFS mount. This results in an
>>>>>> environment where sVirt isolation and NFS image file isolation can both
>>>>>> be provided.
>>>>>
>>>>> Very nice. QEMU should use this to support privilege separation. We
>>>>> already have chroot and runas switches, a new switch should convert
>>>>> all file references to fd references internally for that process. If
>>>>> this can be made transparent, this should even be the default way of
>>>>> operation.
>>>>
>>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a
>>>> confined context, and then passes fds to the child?
>>>
>>> And exit after that, or do the same without forking.
>>>
>>> This wouldn't work now for the native CDROM devices which need to
>>> reopen the device. For that, an explicit reopen method could be added.
>>> The method could even chat with the privileged process to get that to
>>> do the reopening, but I'd leave that to libvirt and fail without it
>>> for plain QEMU.
>>
>> There are more cases where we reopen the image file. One example is the
>> 'commit' monitor command which temporarily reopens the backing file r/w.
>> Or Christoph's patch that allows guests to toggle the write-cache
>> enabled bit. Same for live snapshots. So we'll need a solution for them
>> before doing anything like this.
>>
>> And breaking qemu without libvirt isn't really an option for me.
> 
> Reopening files is evil.  Sometimes flaws in the system call API make it
> the only option.  You can mitigate via /dev/fd/%d, but only on some
> systems.  The less we reopen, the better.
> 
> An fd: protocol can't easily support reopen.  So fail it.  This doesn't
> break any existing usage.  It's just a restriction on the new protocol.
> Restrictions can render the new protocol useless in practice, but we're
> not "breaking qemu without libvirt" there.
> 
> Perhaps we can make relax the restriction on some system by avoiding the
> reopen in a system-dependent way.

Right, you only get the regression once libvirt starts using it (or even
worse, qemu itself, like Blue Swirl suggested). Doesn't make it much better.

Kevin
Corey Bryant - May 23, 2011, 6:20 p.m.
On 05/23/2011 11:24 AM, Markus Armbruster wrote:
> Kevin Wolf<kwolf@redhat.com>  writes:
>
>> Am 20.05.2011 21:53, schrieb Blue Swirl:
>>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori<anthony@codemonkey.ws>  wrote:
>>>> On 05/20/2011 02:25 PM, Blue Swirl wrote:
>>>>>
>>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>   wrote:
>>>>>>
>>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>>>> corresponding resources (image files). sVirt provides this support
>>>>>> by labeling guests and resources with security labels that are stored
>>>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>>>> not support the extended attribute security namespace, which is needed
>>>>>> for image file isolation when using the sVirt SELinux security driver
>>>>>> in libvirt.
>>>>>>
>>>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>>>> SELinux patches that work together to isolate multiple guests' images
>>>>>> when they're stored in the same NFS mount. This results in an
>>>>>> environment where sVirt isolation and NFS image file isolation can both
>>>>>> be provided.
>>>>>
>>>>> Very nice. QEMU should use this to support privilege separation. We
>>>>> already have chroot and runas switches, a new switch should convert
>>>>> all file references to fd references internally for that process. If
>>>>> this can be made transparent, this should even be the default way of
>>>>> operation.
>>>>
>>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a
>>>> confined context, and then passes fds to the child?
>>>
>>> And exit after that, or do the same without forking.
>>>
>>> This wouldn't work now for the native CDROM devices which need to
>>> reopen the device. For that, an explicit reopen method could be added.
>>> The method could even chat with the privileged process to get that to
>>> do the reopening, but I'd leave that to libvirt and fail without it
>>> for plain QEMU.
>>
>> There are more cases where we reopen the image file. One example is the
>> 'commit' monitor command which temporarily reopens the backing file r/w.
>> Or Christoph's patch that allows guests to toggle the write-cache
>> enabled bit. Same for live snapshots. So we'll need a solution for them
>> before doing anything like this.
>>
>> And breaking qemu without libvirt isn't really an option for me.
>
> Reopening files is evil.  Sometimes flaws in the system call API make it
> the only option.  You can mitigate via /dev/fd/%d, but only on some
> systems.  The less we reopen, the better.
>
> An fd: protocol can't easily support reopen.  So fail it.  This doesn't
> break any existing usage.  It's just a restriction on the new protocol.
> Restrictions can render the new protocol useless in practice, but we're
> not "breaking qemu without libvirt" there.
>
> Perhaps we can make relax the restriction on some system by avoiding the
> reopen in a system-dependent way.

A lot of great points here.  Thanks everyone.

I'd like to see if we can go forward with the suggestion of restricting 
the fd: protocol, at least for the initial patch.  Perhaps this first 
pass at the protocol can limit it to no reopen and no backing file 
support.  Even with this limited support, the fd: protocol can still 
provide added security for NFS users.

Corey
Blue Swirl - May 23, 2011, 7:50 p.m.
On Mon, May 23, 2011 at 6:56 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 23.05.2011 17:24, schrieb Markus Armbruster:
>> Kevin Wolf <kwolf@redhat.com> writes:
>>
>>> Am 20.05.2011 21:53, schrieb Blue Swirl:
>>>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>>>>> On 05/20/2011 02:25 PM, Blue Swirl wrote:
>>>>>>
>>>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com>  wrote:
>>>>>>>
>>>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their
>>>>>>> corresponding resources (image files). sVirt provides this support
>>>>>>> by labeling guests and resources with security labels that are stored
>>>>>>> in file system extended attributes. Some file systems, such as NFS, do
>>>>>>> not support the extended attribute security namespace, which is needed
>>>>>>> for image file isolation when using the sVirt SELinux security driver
>>>>>>> in libvirt.
>>>>>>>
>>>>>>> The proposed solution entails a combination of Qemu, libvirt, and
>>>>>>> SELinux patches that work together to isolate multiple guests' images
>>>>>>> when they're stored in the same NFS mount. This results in an
>>>>>>> environment where sVirt isolation and NFS image file isolation can both
>>>>>>> be provided.
>>>>>>
>>>>>> Very nice. QEMU should use this to support privilege separation. We
>>>>>> already have chroot and runas switches, a new switch should convert
>>>>>> all file references to fd references internally for that process. If
>>>>>> this can be made transparent, this should even be the default way of
>>>>>> operation.
>>>>>
>>>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a
>>>>> confined context, and then passes fds to the child?
>>>>
>>>> And exit after that, or do the same without forking.
>>>>
>>>> This wouldn't work now for the native CDROM devices which need to
>>>> reopen the device. For that, an explicit reopen method could be added.
>>>> The method could even chat with the privileged process to get that to
>>>> do the reopening, but I'd leave that to libvirt and fail without it
>>>> for plain QEMU.
>>>
>>> There are more cases where we reopen the image file. One example is the
>>> 'commit' monitor command which temporarily reopens the backing file r/w.
>>> Or Christoph's patch that allows guests to toggle the write-cache
>>> enabled bit. Same for live snapshots. So we'll need a solution for them
>>> before doing anything like this.
>>>
>>> And breaking qemu without libvirt isn't really an option for me.
>>
>> Reopening files is evil.  Sometimes flaws in the system call API make it
>> the only option.  You can mitigate via /dev/fd/%d, but only on some
>> systems.  The less we reopen, the better.
>>
>> An fd: protocol can't easily support reopen.  So fail it.  This doesn't
>> break any existing usage.  It's just a restriction on the new protocol.
>> Restrictions can render the new protocol useless in practice, but we're
>> not "breaking qemu without libvirt" there.
>>
>> Perhaps we can make relax the restriction on some system by avoiding the
>> reopen in a system-dependent way.
>
> Right, you only get the regression once libvirt starts using it (or even
> worse, qemu itself, like Blue Swirl suggested). Doesn't make it much better.

One solution could be to add new commands which supply fresh fds while
performing the operation which needs reopening:
commit_fd ide0-hd0 #33
change ide1-cd0 fd:34
I don't remember the syntax for fd passing, so that part may be weird.
Anthony Liguori - May 23, 2011, 9:55 p.m.
On 05/23/2011 10:56 AM, Kevin Wolf wrote:
> Am 23.05.2011 17:24, schrieb Markus Armbruster:
>> Kevin Wolf<kwolf@redhat.com>  writes:
>>
>> An fd: protocol can't easily support reopen.  So fail it.  This doesn't
>> break any existing usage.  It's just a restriction on the new protocol.
>> Restrictions can render the new protocol useless in practice, but we're
>> not "breaking qemu without libvirt" there.
>>
>> Perhaps we can make relax the restriction on some system by avoiding the
>> reopen in a system-dependent way.
>
> Right, you only get the regression once libvirt starts using it (or even
> worse, qemu itself, like Blue Swirl suggested). Doesn't make it much better.

libvirt already has the problem you describe.  QEMU is designed assuming 
that we can access whatever resources the invoking user can.  That's how 
our UI is constructed.

But libvirt chooses to invoke QEMU in such a way that the actual QEMU 
process does not have the same rights as the invoking user.  In fact, 
the context is entirely different.

That means that to get isolation, libvirt has to understand what QEMU 
does for each operation and then do some magic behind the scenes to make 
it all work.

This is not different really.  'commit' could require magic in libvirt 
anyway if libvirt was smart enough to mark the backing file with 
read-only rights.

libvirt already has to parse image file formats to figure out backing 
files so it can set the permissions right.

Regards,

Anthony Liguori


> Kevin
>
Jamie Lokier - May 23, 2011, 10:49 p.m.
Markus Armbruster wrote:
> Anthony Liguori <anthony@codemonkey.ws> writes:
> 
> > On 05/23/2011 05:30 AM, Daniel P. Berrange wrote:
> >> It feels to me that turning the current block driver code which just does
> >> open(2) on files, into something which issues events&  asynchronously
> >> waits for a file would potentially be quite complex.
> >>
> >> You also need to be much more careful from a security POV if the mgmt
> >> app is accepting requests to open arbitrary files from QEMU, to ensure
> >> the filenames are correctly/strictly validated before opening them and
> >> giving them back to QEMU. An architecture where the mgmt app decides
> >> what FDs to supply upfront, has less potential for security errors.
> >>
> >> To me the ideal would thus be that we can supply FDs for the backing
> >> store with -blockdev syntax, and that places where QEMU re-opens files
> >> would be enhanced to avoid that need. If there are things we can't do
> >> without closing&  re-opening the same file, then perhaps we need some
> >> new ioctl()/fcntl() calls to change those file attributes on the fly.
> >
> > I agree.  But my view of blockdev is that you wouldn't set an fd
> > attribute but rather the backing file name and use the fd protocol.
> > For instance:
> >
> > -blockdev id=foo-base,path=fd:4,format=raw
> > -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo
> 
> I guess you mean backing_file=foo-base.
> 
> If none is specified, use the backing file specification stored in the
> image.
> 
> Matches my current thinking.

Being able to override the backing file path would be useful anyway.

I've already had problems when moving established qcow2 files between
systems, that for historical reasons contain either an absolute path
inside for the backing file, or some messy "../../whatever", or
"foo/bar/whatever", or "backing.img" (lots of different ones with the
same name), all of which are a pain to work around.

(Imho, it would also make sense if qcow2 files contained a UUID for
their backing file to verify you've given the correct backing file,
and maybe help find it (much like Linux finds real disk devices and
filesystems when mounting these days).)

-- Jamie
Stefan Hajnoczi - May 24, 2011, 8:39 a.m.
On Mon, May 23, 2011 at 11:49 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Markus Armbruster wrote:
>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>
>> > On 05/23/2011 05:30 AM, Daniel P. Berrange wrote:
>> >> It feels to me that turning the current block driver code which just does
>> >> open(2) on files, into something which issues events&  asynchronously
>> >> waits for a file would potentially be quite complex.
>> >>
>> >> You also need to be much more careful from a security POV if the mgmt
>> >> app is accepting requests to open arbitrary files from QEMU, to ensure
>> >> the filenames are correctly/strictly validated before opening them and
>> >> giving them back to QEMU. An architecture where the mgmt app decides
>> >> what FDs to supply upfront, has less potential for security errors.
>> >>
>> >> To me the ideal would thus be that we can supply FDs for the backing
>> >> store with -blockdev syntax, and that places where QEMU re-opens files
>> >> would be enhanced to avoid that need. If there are things we can't do
>> >> without closing&  re-opening the same file, then perhaps we need some
>> >> new ioctl()/fcntl() calls to change those file attributes on the fly.
>> >
>> > I agree.  But my view of blockdev is that you wouldn't set an fd
>> > attribute but rather the backing file name and use the fd protocol.
>> > For instance:
>> >
>> > -blockdev id=foo-base,path=fd:4,format=raw
>> > -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo
>>
>> I guess you mean backing_file=foo-base.
>>
>> If none is specified, use the backing file specification stored in the
>> image.
>>
>> Matches my current thinking.
>
> Being able to override the backing file path would be useful anyway.
>
> I've already had problems when moving established qcow2 files between
> systems, that for historical reasons contain either an absolute path
> inside for the backing file, or some messy "../../whatever", or
> "foo/bar/whatever", or "backing.img" (lots of different ones with the
> same name), all of which are a pain to work around.
>
> (Imho, it would also make sense if qcow2 files contained a UUID for
> their backing file to verify you've given the correct backing file,
> and maybe help find it (much like Linux finds real disk devices and
> filesystems when mounting these days).)

Try the qemu-img rebase -f command:

    qemu-img uses the unsafe mode if "-u" is specified. In this mode, only the
    backing file name and format of filename is changed without any
checks on the
    file contents. The user must take care of specifying the correct new backing
    file, or the guest-visible content of the image will be corrupted.

    This mode is useful for renaming or moving the backing file to somewhere
    else.  It can be used without an accessible old backing file, i.e. you can
    use it to fix an image whose backing file has already been moved/renamed.

Stefan
Jamie Lokier - May 24, 2011, 3:31 p.m.
Stefan Hajnoczi wrote:
> On Mon, May 23, 2011 at 11:49 PM, Jamie Lokier <jamie@shareable.org> wrote:
> > Being able to override the backing file path would be useful anyway.
> >
> > I've already had problems when moving established qcow2 files between
> > systems, that for historical reasons contain either an absolute path
> > inside for the backing file, or some messy "../../whatever", or
> > "foo/bar/whatever", or "backing.img" (lots of different ones with the
> > same name), all of which are a pain to work around.
...
> Try the qemu-img rebase -f command:
> 
>     qemu-img uses the unsafe mode if "-u" is specified. In this mode, only the
>     backing file name and format of filename is changed without any
>     checks on the
>     file contents. The user must take care of specifying the correct new backing
>     file, or the guest-visible content of the image will be corrupted.
> 
>     This mode is useful for renaming or moving the backing file to somewhere
>     else.  It can be used without an accessible old backing file, i.e. you can
>     use it to fix an image whose backing file has already been moved/renamed.

Yes indeed.  That feature was added after the last time I dealt with this problem.

However, I have wanted to open *precious*, *read-only* qcow2 images,
for example with -snapshot or the explicit equivalent, and for those
precious images I am loathe to let any tool write a single byte to
them.  The files are kept read-only, and often with the "immutable"
attribute on Linux, backed up and checksummed just to be sure.

I'd rather just override the value on the command line, so if that
feature may turn up for fd: related reasons, it'll be handy for the
read-only moved qcow2 file reason too.

-- Jamie

Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index a95c8d4..6554b06 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -142,7 +142,8 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
                            int bdrv_flags, int open_flags)
 {
     BDRVRawState *s = bs->opaque;
-    int fd, ret;
+    int fd = -1;
+    int ret;
 
     s->open_flags = open_flags | O_BINARY;
     s->open_flags &= ~O_ACCMODE;
@@ -159,15 +160,16 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
     else if (!(bdrv_flags & BDRV_O_CACHE_WB))
         s->open_flags |= O_DSYNC;
 
-    s->fd = -1;
-    fd = qemu_open(filename, s->open_flags, 0644);
-    if (fd < 0) {
-        ret = -errno;
-        if (ret == -EROFS)
-            ret = -EACCES;
-        return ret;
+    if (s->fd == -1) {
+        fd = qemu_open(filename, s->open_flags, 0644);
+        if (fd < 0) {
+            ret = -errno;
+            if (ret == -EROFS)
+                ret = -EACCES;
+            return ret;
+        }
+        s->fd = fd;
     }
-    s->fd = fd;
     s->aligned_buf = NULL;
 
     if ((bdrv_flags & BDRV_O_NOCACHE)) {
@@ -224,6 +226,7 @@  static int raw_open(BlockDriverState *bs, const char *filename, int flags)
 {
     BDRVRawState *s = bs->opaque;
 
+    s->fd = -1;
     s->type = FTYPE_FILE;
     return raw_open_common(bs, filename, flags, 0);
 }
@@ -819,6 +822,50 @@  static BlockDriver bdrv_file = {
     .create_options = raw_create_options,
 };
 
+static int raw_open_fd(BlockDriverState *bs, const char *filename, int flags)
+{
+    BDRVRawState *s = bs->opaque;
+    const char *fd_str;
+    int i;
+
+    /* extract the file descriptor - fail if it's not fd: */
+    if (!strstart(filename, "fd:", &fd_str)) {
+        return -EINVAL;
+    }
+
+    for (i = 0; fd_str[i] != '\0'; i++) {
+        if (!qemu_isdigit(fd_str[i]))
+            return -EBADF;
+    }
+
+    s->fd = atoi(fd_str);
+    s->type = FTYPE_FILE;
+
+    return raw_open_common(bs, filename, flags, 0);
+}
+
+static BlockDriver bdrv_file_fd = {
+    .format_name = "file",
+    .protocol_name = "fd",
+    .instance_size = sizeof(BDRVRawState),
+    .bdrv_probe = NULL, /* no probe for protocols */
+    .bdrv_file_open = raw_open_fd,
+    .bdrv_read = raw_read,
+    .bdrv_write = raw_write,
+    .bdrv_close = raw_close,
+    .bdrv_flush = raw_flush,
+    .bdrv_discard = raw_discard,
+
+    .bdrv_aio_readv = raw_aio_readv,
+    .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_aio_flush = raw_aio_flush,
+
+    .bdrv_truncate = raw_truncate,
+    .bdrv_getlength = raw_getlength,
+
+    .create_options = raw_create_options,
+};
+
 /***********************************************/
 /* host device */
 
@@ -927,6 +974,7 @@  static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
     }
 #endif
 
+    s->fd = -1;
     s->type = FTYPE_FILE;
 #if defined(__linux__)
     {
@@ -1097,6 +1145,7 @@  static int floppy_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int ret;
 
+    s->fd = -1;
     s->type = FTYPE_FD;
 
     /* open will not fail even if no floppy is inserted, so add O_NONBLOCK */
@@ -1209,6 +1258,7 @@  static int cdrom_open(BlockDriverState *bs, const char *filename, int flags)
 {
     BDRVRawState *s = bs->opaque;
 
+    s->fd = -1;
     s->type = FTYPE_CD;
 
     /* open will not fail even if no CD is inserted, so add O_NONBLOCK */
@@ -1313,6 +1363,7 @@  static int cdrom_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int ret;
 
+    s->fd = -1;
     s->type = FTYPE_CD;
 
     ret = raw_open_common(bs, filename, flags, 0);
@@ -1432,6 +1483,7 @@  static void bdrv_file_init(void)
      * Register all the drivers.  Note that order is important, the driver
      * registered last will get probed first.
      */
+    bdrv_register(&bdrv_file_fd);
     bdrv_register(&bdrv_file);
     bdrv_register(&bdrv_host_device);
 #ifdef __linux__
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 47e1991..fea8882 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -412,6 +412,7 @@  snapshots.
 * disk_images_fat_images::    Virtual FAT disk images
 * disk_images_nbd::           NBD access
 * disk_images_sheepdog::      Sheepdog disk images
+* disk_images_fd::            File descriptor access
 @end menu
 
 @node disk_images_quickstart
@@ -686,6 +687,17 @@  qemu-img create sheepdog:@var{hostname}:@var{port}:@var{image} @var{size}
 qemu sheepdog:@var{hostname}:@var{port}:@var{image}
 @end example
 
+@node disk_images_fd
+@subsection File descriptor access
+
+QEMU can access an image file that was opened outside of the QEMU
+process. The format option is required when passing a file descriptor
+to QEMU.
+
+@example
+qemu -drive file=fd:4,format=qcow2
+@end example
+
 @node pcsys_network
 @section Network emulation
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 5876fc5..7cf5358 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -116,7 +116,7 @@  using @file{/dev/cdrom} as filename (@pxref{host_drives}).
 ETEXI
 
 DEF("drive", HAS_ARG, QEMU_OPTION_drive,
-    "-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
+    "-drive [file=[fd:]file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n"
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none|unsafe][,format=f]\n"
     "       [,serial=s][,addr=A][,id=name][,aio=threads|native]\n"
@@ -129,10 +129,12 @@  STEXI
 Define a new drive. Valid options are:
 
 @table @option
-@item file=@var{file}
+@item file=[fd:]@var{file}
 This option defines which disk image (@pxref{disk_images}) to use with
 this drive. If the filename contains comma, you must double it
-(for instance, "file=my,,file" to use file "my,file").
+(for instance, "file=my,,file" to use file "my,file"). @option{fd:}@var{file}
+specifies the file descriptor of an already open disk
+image.  @option{format=}@var{format} is required by @option{fd:}@var{file}.
 @item if=@var{interface}
 This option defines on which type on interface the drive is connected.
 Available types are: ide, scsi, sd, mtd, floppy, pflash, virtio.