Message ID | 4DD6B777.9020800@us.ibm.com |
---|---|
State | New |
Headers | show |
On 05/20/2011 01:48 PM, Corey Bryant wrote: > sVirt provides SELinux MAC isolation for Qemu guest processes and their > corresponding resources (image files). sVirt provides this support > by labeling guests and resources with security labels that are stored > in file system extended attributes. Some file systems, such as NFS, do > not support the extended attribute security namespace, which is needed > for image file isolation when using the sVirt SELinux security driver > in libvirt. > > The proposed solution entails a combination of Qemu, libvirt, and > SELinux patches that work together to isolate multiple guests' images > when they're stored in the same NFS mount. This results in an > environment where sVirt isolation and NFS image file isolation can both > be provided. > > Currently, Qemu opens an image file in addition to performing the > necessary read and write operations. The proposed solution will move > the open out of Qemu and into libvirt. Once libvirt opens an image > file for the guest, it will pass the file descriptor to Qemu via a > new fd: protocol. > > If the image file resides in an NFS mount, the following SELinux policy > changes will provide image isolation: > > - A new SELinux boolean is created (e.g. virt_read_write_nfs) to > allow Qemu (svirt_t) to only have SELinux read and write > permissions on nfs_t files > > - Qemu (svirt_t) also gets SELinux use permissions on libvirt > (virtd_t) file descriptors > > Following is a sample invocation of Qemu using the fd: protocol: > > qemu -drive file=fd:4,format=qcow2 > > This patch contains the Qemu code to support this solution. I would > like to solicit input from the libvirt community prior to starting > the libvirt patch. > > This patch was tested with the following formats: raw, cow, qcow, > qcow2, vmdk, using the fd: protocol as well as existing file name > support. Non-valid file descriptors were also tested. > > Signed-off-by: Corey Bryant<coreyb@linux.vnet.ibm.com> > --- > block/raw-posix.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++------- > qemu-doc.texi | 12 +++++++++ > qemu-options.hx | 8 ++++-- > 3 files changed, 78 insertions(+), 12 deletions(-) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index a95c8d4..6554b06 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -142,7 +142,8 @@ static int raw_open_common(BlockDriverState *bs, const char *filename, > int bdrv_flags, int open_flags) > { > BDRVRawState *s = bs->opaque; > - int fd, ret; > + int fd = -1; > + int ret; > > s->open_flags = open_flags | O_BINARY; > s->open_flags&= ~O_ACCMODE; > @@ -159,15 +160,16 @@ static int raw_open_common(BlockDriverState *bs, const char *filename, > else if (!(bdrv_flags& BDRV_O_CACHE_WB)) > s->open_flags |= O_DSYNC; > > - s->fd = -1; > - fd = qemu_open(filename, s->open_flags, 0644); > - if (fd< 0) { > - ret = -errno; > - if (ret == -EROFS) > - ret = -EACCES; > - return ret; > + if (s->fd == -1) { > + fd = qemu_open(filename, s->open_flags, 0644); > + if (fd< 0) { > + ret = -errno; > + if (ret == -EROFS) > + ret = -EACCES; > + return ret; > + } > + s->fd = fd; > } > - s->fd = fd; > s->aligned_buf = NULL; > > if ((bdrv_flags& BDRV_O_NOCACHE)) { > @@ -224,6 +226,7 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags) > { > BDRVRawState *s = bs->opaque; > > + s->fd = -1; > s->type = FTYPE_FILE; > return raw_open_common(bs, filename, flags, 0); > } > @@ -819,6 +822,50 @@ static BlockDriver bdrv_file = { > .create_options = raw_create_options, > }; > > +static int raw_open_fd(BlockDriverState *bs, const char *filename, int flags) > +{ > + BDRVRawState *s = bs->opaque; > + const char *fd_str; > + int i; > + > + /* extract the file descriptor - fail if it's not fd: */ > + if (!strstart(filename, "fd:",&fd_str)) { > + return -EINVAL; > + } > + > + for (i = 0; fd_str[i] != '\0'; i++) { > + if (!qemu_isdigit(fd_str[i])) > + return -EBADF; This is off CODING_STYLE wise (braces around if), but a better way to do this is to use strtol(), and check for the error condition from it. If it fails, you should call monitor_get_fd(cur_mon, fd_str) as this will allow named fds to be used. Regards, Anthony Liguori
On Fri, May 20, 2011 at 9:48 PM, Corey Bryant <bryntcor@us.ibm.com> wrote: > sVirt provides SELinux MAC isolation for Qemu guest processes and their > corresponding resources (image files). sVirt provides this support > by labeling guests and resources with security labels that are stored > in file system extended attributes. Some file systems, such as NFS, do > not support the extended attribute security namespace, which is needed > for image file isolation when using the sVirt SELinux security driver > in libvirt. > > The proposed solution entails a combination of Qemu, libvirt, and > SELinux patches that work together to isolate multiple guests' images > when they're stored in the same NFS mount. This results in an > environment where sVirt isolation and NFS image file isolation can both > be provided. Very nice. QEMU should use this to support privilege separation. We already have chroot and runas switches, a new switch should convert all file references to fd references internally for that process. If this can be made transparent, this should even be the default way of operation.
On 05/20/2011 02:25 PM, Blue Swirl wrote: > On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >> sVirt provides SELinux MAC isolation for Qemu guest processes and their >> corresponding resources (image files). sVirt provides this support >> by labeling guests and resources with security labels that are stored >> in file system extended attributes. Some file systems, such as NFS, do >> not support the extended attribute security namespace, which is needed >> for image file isolation when using the sVirt SELinux security driver >> in libvirt. >> >> The proposed solution entails a combination of Qemu, libvirt, and >> SELinux patches that work together to isolate multiple guests' images >> when they're stored in the same NFS mount. This results in an >> environment where sVirt isolation and NFS image file isolation can both >> be provided. > > Very nice. QEMU should use this to support privilege separation. We > already have chroot and runas switches, a new switch should convert > all file references to fd references internally for that process. If > this can be made transparent, this should even be the default way of > operation. You mean, QEMU starts up, opens all disk images, reinvokes itself in a confined context, and then passes fds to the child? Interesting idea. Regards, Anthony Liguori >
On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote: > On 05/20/2011 02:25 PM, Blue Swirl wrote: >> >> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >>> >>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>> corresponding resources (image files). sVirt provides this support >>> by labeling guests and resources with security labels that are stored >>> in file system extended attributes. Some file systems, such as NFS, do >>> not support the extended attribute security namespace, which is needed >>> for image file isolation when using the sVirt SELinux security driver >>> in libvirt. >>> >>> The proposed solution entails a combination of Qemu, libvirt, and >>> SELinux patches that work together to isolate multiple guests' images >>> when they're stored in the same NFS mount. This results in an >>> environment where sVirt isolation and NFS image file isolation can both >>> be provided. >> >> Very nice. QEMU should use this to support privilege separation. We >> already have chroot and runas switches, a new switch should convert >> all file references to fd references internally for that process. If >> this can be made transparent, this should even be the default way of >> operation. > > You mean, QEMU starts up, opens all disk images, reinvokes itself in a > confined context, and then passes fds to the child? And exit after that, or do the same without forking. This wouldn't work now for the native CDROM devices which need to reopen the device. For that, an explicit reopen method could be added. The method could even chat with the privileged process to get that to do the reopening, but I'd leave that to libvirt and fail without it for plain QEMU.
On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: > sVirt provides SELinux MAC isolation for Qemu guest processes and their > corresponding resources (image files). sVirt provides this support > by labeling guests and resources with security labels that are stored > in file system extended attributes. Some file systems, such as NFS, do > not support the extended attribute security namespace, which is needed > for image file isolation when using the sVirt SELinux security driver > in libvirt. > > The proposed solution entails a combination of Qemu, libvirt, and > SELinux patches that work together to isolate multiple guests' images > when they're stored in the same NFS mount. This results in an > environment where sVirt isolation and NFS image file isolation can both > be provided. > > Currently, Qemu opens an image file in addition to performing the > necessary read and write operations. The proposed solution will move > the open out of Qemu and into libvirt. Once libvirt opens an image > file for the guest, it will pass the file descriptor to Qemu via a > new fd: protocol. > > If the image file resides in an NFS mount, the following SELinux policy > changes will provide image isolation: > > - A new SELinux boolean is created (e.g. virt_read_write_nfs) to > allow Qemu (svirt_t) to only have SELinux read and write > permissions on nfs_t files > > - Qemu (svirt_t) also gets SELinux use permissions on libvirt > (virtd_t) file descriptors > > Following is a sample invocation of Qemu using the fd: protocol: > > qemu -drive file=fd:4,format=qcow2 > > This patch contains the Qemu code to support this solution. I would > like to solicit input from the libvirt community prior to starting > the libvirt patch. > > This patch was tested with the following formats: raw, cow, qcow, > qcow2, vmdk, using the fd: protocol as well as existing file name > support. Non-valid file descriptors were also tested. How can backing files work ? The '-drive' syntax doesn't provide any way to set properties against the backing files (which may be nested to arbitrary depth). AFAICT, we need to the often discussed -blockdev advanced syntax to be able to set a 'fd' property against nested backing files. I'm not sure it is worth supporting fd: if we only have -drive syntax, since backing files are an important feature for most mgmt apps. Also, there are a few places in QEMU, where it re-opens the existing block driver on the fly. What is the plan for supporting this, without having QEMU block on waiting for libvirt to pass it a new FD ? Regards, Daniel
On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: > sVirt provides SELinux MAC isolation for Qemu guest processes and their > corresponding resources (image files). sVirt provides this support > by labeling guests and resources with security labels that are stored > in file system extended attributes. Some file systems, such as NFS, do > not support the extended attribute security namespace, which is needed > for image file isolation when using the sVirt SELinux security driver > in libvirt. This will also allow libvirt to run QEMU confined by the Linux container functionality. In particular it lets us use CLONE_NEWNS flag to isolate its root filesystem, without having to worry about setting up passthrough mounts for each disk image it needs to access, which is a real pain when it comes to hotplug. Daniel
On Mon, May 23, 2011 at 10:45 AM, Daniel P. Berrange <berrange@redhat.com> wrote: > On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: >> sVirt provides SELinux MAC isolation for Qemu guest processes and their >> corresponding resources (image files). sVirt provides this support >> by labeling guests and resources with security labels that are stored >> in file system extended attributes. Some file systems, such as NFS, do >> not support the extended attribute security namespace, which is needed >> for image file isolation when using the sVirt SELinux security driver >> in libvirt. >> >> The proposed solution entails a combination of Qemu, libvirt, and >> SELinux patches that work together to isolate multiple guests' images >> when they're stored in the same NFS mount. This results in an >> environment where sVirt isolation and NFS image file isolation can both >> be provided. >> >> Currently, Qemu opens an image file in addition to performing the >> necessary read and write operations. The proposed solution will move >> the open out of Qemu and into libvirt. Once libvirt opens an image >> file for the guest, it will pass the file descriptor to Qemu via a >> new fd: protocol. >> >> If the image file resides in an NFS mount, the following SELinux policy >> changes will provide image isolation: >> >> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to >> allow Qemu (svirt_t) to only have SELinux read and write >> permissions on nfs_t files >> >> - Qemu (svirt_t) also gets SELinux use permissions on libvirt >> (virtd_t) file descriptors >> >> Following is a sample invocation of Qemu using the fd: protocol: >> >> qemu -drive file=fd:4,format=qcow2 >> >> This patch contains the Qemu code to support this solution. I would >> like to solicit input from the libvirt community prior to starting >> the libvirt patch. >> >> This patch was tested with the following formats: raw, cow, qcow, >> qcow2, vmdk, using the fd: protocol as well as existing file name >> support. Non-valid file descriptors were also tested. > > How can backing files work ? The '-drive' syntax doesn't provide > any way to set properties against the backing files (which may be > nested to arbitrary depth). AFAICT, we need to the often discussed > -blockdev advanced syntax to be able to set a 'fd' property against > nested backing files. I'm not sure it is worth supporting fd: if > we only have -drive syntax, since backing files are an important > feature for most mgmt apps. > > Also, there are a few places in QEMU, where it re-opens the existing > block driver on the fly. What is the plan for supporting this, without > having QEMU block on waiting for libvirt to pass it a new FD ? QEMU could ask for a file over QMP. So you bootstrap it using fd: but when a reopen or backing file is needed, QEMU raises a QEVENT_BLOCK_REQUEST_FILE event. Of course waiting around for the reopen to complete it not ideal as it may temporarily pause the guest. Stefan
On Mon, May 23, 2011 at 11:19:15AM +0100, Stefan Hajnoczi wrote: > On Mon, May 23, 2011 at 10:45 AM, Daniel P. Berrange > <berrange@redhat.com> wrote: > > On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: > >> sVirt provides SELinux MAC isolation for Qemu guest processes and their > >> corresponding resources (image files). sVirt provides this support > >> by labeling guests and resources with security labels that are stored > >> in file system extended attributes. Some file systems, such as NFS, do > >> not support the extended attribute security namespace, which is needed > >> for image file isolation when using the sVirt SELinux security driver > >> in libvirt. > >> > >> The proposed solution entails a combination of Qemu, libvirt, and > >> SELinux patches that work together to isolate multiple guests' images > >> when they're stored in the same NFS mount. This results in an > >> environment where sVirt isolation and NFS image file isolation can both > >> be provided. > >> > >> Currently, Qemu opens an image file in addition to performing the > >> necessary read and write operations. The proposed solution will move > >> the open out of Qemu and into libvirt. Once libvirt opens an image > >> file for the guest, it will pass the file descriptor to Qemu via a > >> new fd: protocol. > >> > >> If the image file resides in an NFS mount, the following SELinux policy > >> changes will provide image isolation: > >> > >> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to > >> allow Qemu (svirt_t) to only have SELinux read and write > >> permissions on nfs_t files > >> > >> - Qemu (svirt_t) also gets SELinux use permissions on libvirt > >> (virtd_t) file descriptors > >> > >> Following is a sample invocation of Qemu using the fd: protocol: > >> > >> qemu -drive file=fd:4,format=qcow2 > >> > >> This patch contains the Qemu code to support this solution. I would > >> like to solicit input from the libvirt community prior to starting > >> the libvirt patch. > >> > >> This patch was tested with the following formats: raw, cow, qcow, > >> qcow2, vmdk, using the fd: protocol as well as existing file name > >> support. Non-valid file descriptors were also tested. > > > > How can backing files work ? The '-drive' syntax doesn't provide > > any way to set properties against the backing files (which may be > > nested to arbitrary depth). AFAICT, we need to the often discussed > > -blockdev advanced syntax to be able to set a 'fd' property against > > nested backing files. I'm not sure it is worth supporting fd: if > > we only have -drive syntax, since backing files are an important > > feature for most mgmt apps. > > > > Also, there are a few places in QEMU, where it re-opens the existing > > block driver on the fly. What is the plan for supporting this, without > > having QEMU block on waiting for libvirt to pass it a new FD ? > > QEMU could ask for a file over QMP. So you bootstrap it using fd: but > when a reopen or backing file is needed, QEMU raises a > QEVENT_BLOCK_REQUEST_FILE event. Of course waiting around for the > reopen to complete it not ideal as it may temporarily pause the guest. It feels to me that turning the current block driver code which just does open(2) on files, into something which issues events & asynchronously waits for a file would potentially be quite complex. You also need to be much more careful from a security POV if the mgmt app is accepting requests to open arbitrary files from QEMU, to ensure the filenames are correctly/strictly validated before opening them and giving them back to QEMU. An architecture where the mgmt app decides what FDs to supply upfront, has less potential for security errors. To me the ideal would thus be that we can supply FDs for the backing store with -blockdev syntax, and that places where QEMU re-opens files would be enhanced to avoid that need. If there are things we can't do without closing & re-opening the same file, then perhaps we need some new ioctl()/fcntl() calls to change those file attributes on the fly. Regards, Daniel
On 05/23/2011 04:45 AM, Daniel P. Berrange wrote: > On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: >> sVirt provides SELinux MAC isolation for Qemu guest processes and their >> corresponding resources (image files). sVirt provides this support >> by labeling guests and resources with security labels that are stored >> in file system extended attributes. Some file systems, such as NFS, do >> not support the extended attribute security namespace, which is needed >> for image file isolation when using the sVirt SELinux security driver >> in libvirt. >> >> The proposed solution entails a combination of Qemu, libvirt, and >> SELinux patches that work together to isolate multiple guests' images >> when they're stored in the same NFS mount. This results in an >> environment where sVirt isolation and NFS image file isolation can both >> be provided. >> >> Currently, Qemu opens an image file in addition to performing the >> necessary read and write operations. The proposed solution will move >> the open out of Qemu and into libvirt. Once libvirt opens an image >> file for the guest, it will pass the file descriptor to Qemu via a >> new fd: protocol. >> >> If the image file resides in an NFS mount, the following SELinux policy >> changes will provide image isolation: >> >> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to >> allow Qemu (svirt_t) to only have SELinux read and write >> permissions on nfs_t files >> >> - Qemu (svirt_t) also gets SELinux use permissions on libvirt >> (virtd_t) file descriptors >> >> Following is a sample invocation of Qemu using the fd: protocol: >> >> qemu -drive file=fd:4,format=qcow2 >> >> This patch contains the Qemu code to support this solution. I would >> like to solicit input from the libvirt community prior to starting >> the libvirt patch. >> >> This patch was tested with the following formats: raw, cow, qcow, >> qcow2, vmdk, using the fd: protocol as well as existing file name >> support. Non-valid file descriptors were also tested. > > How can backing files work ? The '-drive' syntax doesn't provide > any way to set properties against the backing files (which may be > nested to arbitrary depth). This is orthogonal to having an fd: protocol. > Also, there are a few places in QEMU, where it re-opens the existing > block driver on the fly. What is the plan for supporting this, without > having QEMU block on waiting for libvirt to pass it a new FD ? That's only host CDROM AFAICT. Regards, Anthony Liguori > > Regards, > Daniel
On 05/23/2011 05:30 AM, Daniel P. Berrange wrote: > It feels to me that turning the current block driver code which just does > open(2) on files, into something which issues events& asynchronously > waits for a file would potentially be quite complex. > > You also need to be much more careful from a security POV if the mgmt > app is accepting requests to open arbitrary files from QEMU, to ensure > the filenames are correctly/strictly validated before opening them and > giving them back to QEMU. An architecture where the mgmt app decides > what FDs to supply upfront, has less potential for security errors. > > To me the ideal would thus be that we can supply FDs for the backing > store with -blockdev syntax, and that places where QEMU re-opens files > would be enhanced to avoid that need. If there are things we can't do > without closing& re-opening the same file, then perhaps we need some > new ioctl()/fcntl() calls to change those file attributes on the fly. I agree. But my view of blockdev is that you wouldn't set an fd attribute but rather the backing file name and use the fd protocol. For instance: -blockdev id=foo-base,path=fd:4,format=raw -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo > Regards, > Daniel
On Mon, May 23, 2011 at 07:50:12AM -0500, Anthony Liguori wrote: > On 05/23/2011 04:45 AM, Daniel P. Berrange wrote: > >On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: > >>sVirt provides SELinux MAC isolation for Qemu guest processes and their > >>corresponding resources (image files). sVirt provides this support > >>by labeling guests and resources with security labels that are stored > >>in file system extended attributes. Some file systems, such as NFS, do > >>not support the extended attribute security namespace, which is needed > >>for image file isolation when using the sVirt SELinux security driver > >>in libvirt. > >> > >>The proposed solution entails a combination of Qemu, libvirt, and > >>SELinux patches that work together to isolate multiple guests' images > >>when they're stored in the same NFS mount. This results in an > >>environment where sVirt isolation and NFS image file isolation can both > >>be provided. > >> > >>Currently, Qemu opens an image file in addition to performing the > >>necessary read and write operations. The proposed solution will move > >>the open out of Qemu and into libvirt. Once libvirt opens an image > >>file for the guest, it will pass the file descriptor to Qemu via a > >>new fd: protocol. > >> > >>If the image file resides in an NFS mount, the following SELinux policy > >>changes will provide image isolation: > >> > >> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to > >> allow Qemu (svirt_t) to only have SELinux read and write > >> permissions on nfs_t files > >> > >> - Qemu (svirt_t) also gets SELinux use permissions on libvirt > >> (virtd_t) file descriptors > >> > >>Following is a sample invocation of Qemu using the fd: protocol: > >> > >> qemu -drive file=fd:4,format=qcow2 > >> > >>This patch contains the Qemu code to support this solution. I would > >>like to solicit input from the libvirt community prior to starting > >>the libvirt patch. > >> > >>This patch was tested with the following formats: raw, cow, qcow, > >>qcow2, vmdk, using the fd: protocol as well as existing file name > >>support. Non-valid file descriptors were also tested. > > > >How can backing files work ? The '-drive' syntax doesn't provide > >any way to set properties against the backing files (which may be > >nested to arbitrary depth). > > This is orthogonal to having an fd: protocol. > > >Also, there are a few places in QEMU, where it re-opens the existing > >block driver on the fly. What is the plan for supporting this, without > >having QEMU block on waiting for libvirt to pass it a new FD ? > > That's only host CDROM AFAICT. IIRC, Christoph posted patches which required re-opening files for any disks, after a guest initiated change in the drive cache mode. Regards, Daniel
On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori <aliguori@us.ibm.com> wrote: > On 05/23/2011 04:45 AM, Daniel P. Berrange wrote: >> >> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: >>> >>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>> corresponding resources (image files). sVirt provides this support >>> by labeling guests and resources with security labels that are stored >>> in file system extended attributes. Some file systems, such as NFS, do >>> not support the extended attribute security namespace, which is needed >>> for image file isolation when using the sVirt SELinux security driver >>> in libvirt. >>> >>> The proposed solution entails a combination of Qemu, libvirt, and >>> SELinux patches that work together to isolate multiple guests' images >>> when they're stored in the same NFS mount. This results in an >>> environment where sVirt isolation and NFS image file isolation can both >>> be provided. >>> >>> Currently, Qemu opens an image file in addition to performing the >>> necessary read and write operations. The proposed solution will move >>> the open out of Qemu and into libvirt. Once libvirt opens an image >>> file for the guest, it will pass the file descriptor to Qemu via a >>> new fd: protocol. >>> >>> If the image file resides in an NFS mount, the following SELinux policy >>> changes will provide image isolation: >>> >>> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to >>> allow Qemu (svirt_t) to only have SELinux read and write >>> permissions on nfs_t files >>> >>> - Qemu (svirt_t) also gets SELinux use permissions on libvirt >>> (virtd_t) file descriptors >>> >>> Following is a sample invocation of Qemu using the fd: protocol: >>> >>> qemu -drive file=fd:4,format=qcow2 >>> >>> This patch contains the Qemu code to support this solution. I would >>> like to solicit input from the libvirt community prior to starting >>> the libvirt patch. >>> >>> This patch was tested with the following formats: raw, cow, qcow, >>> qcow2, vmdk, using the fd: protocol as well as existing file name >>> support. Non-valid file descriptors were also tested. >> >> How can backing files work ? The '-drive' syntax doesn't provide >> any way to set properties against the backing files (which may be >> nested to arbitrary depth). > > This is orthogonal to having an fd: protocol. > >> Also, there are a few places in QEMU, where it re-opens the existing >> block driver on the fly. What is the plan for supporting this, without >> having QEMU block on waiting for libvirt to pass it a new FD ? > > That's only host CDROM AFAICT. The host page cache on|off option also uses it because fcntl(2) cannot change O_DIRECT. Actually for Linux it may be doable with open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux does not exist. On the other hand fcntl(2) should really support changing O_DIRECT - only that there is not much incentive to fix Linux if other OSes don't support it either (and older Linux kernels also require workarounds). Stefan
On 05/23/2011 08:09 AM, Stefan Hajnoczi wrote: > On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori<aliguori@us.ibm.com> wrote: >> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote: >>> >>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: >>>> >>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>> corresponding resources (image files). sVirt provides this support >>>> by labeling guests and resources with security labels that are stored >>>> in file system extended attributes. Some file systems, such as NFS, do >>>> not support the extended attribute security namespace, which is needed >>>> for image file isolation when using the sVirt SELinux security driver >>>> in libvirt. >>>> >>>> The proposed solution entails a combination of Qemu, libvirt, and >>>> SELinux patches that work together to isolate multiple guests' images >>>> when they're stored in the same NFS mount. This results in an >>>> environment where sVirt isolation and NFS image file isolation can both >>>> be provided. >>>> >>>> Currently, Qemu opens an image file in addition to performing the >>>> necessary read and write operations. The proposed solution will move >>>> the open out of Qemu and into libvirt. Once libvirt opens an image >>>> file for the guest, it will pass the file descriptor to Qemu via a >>>> new fd: protocol. >>>> >>>> If the image file resides in an NFS mount, the following SELinux policy >>>> changes will provide image isolation: >>>> >>>> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to >>>> allow Qemu (svirt_t) to only have SELinux read and write >>>> permissions on nfs_t files >>>> >>>> - Qemu (svirt_t) also gets SELinux use permissions on libvirt >>>> (virtd_t) file descriptors >>>> >>>> Following is a sample invocation of Qemu using the fd: protocol: >>>> >>>> qemu -drive file=fd:4,format=qcow2 >>>> >>>> This patch contains the Qemu code to support this solution. I would >>>> like to solicit input from the libvirt community prior to starting >>>> the libvirt patch. >>>> >>>> This patch was tested with the following formats: raw, cow, qcow, >>>> qcow2, vmdk, using the fd: protocol as well as existing file name >>>> support. Non-valid file descriptors were also tested. >>> >>> How can backing files work ? The '-drive' syntax doesn't provide >>> any way to set properties against the backing files (which may be >>> nested to arbitrary depth). >> >> This is orthogonal to having an fd: protocol. >> >>> Also, there are a few places in QEMU, where it re-opens the existing >>> block driver on the fly. What is the plan for supporting this, without >>> having QEMU block on waiting for libvirt to pass it a new FD ? >> >> That's only host CDROM AFAICT. > > The host page cache on|off option also uses it because fcntl(2) cannot > change O_DIRECT. Actually for Linux it may be doable with > open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux > does not exist. QEMU doesn't actually know which caching mode the fd is in if it's being passed to it so this command doesn't make much sense. Regards, Anthony Liguori > > On the other hand fcntl(2) should really support changing O_DIRECT - > only that there is not much incentive to fix Linux if other OSes don't > support it either (and older Linux kernels also require workarounds). > > Stefan
On Mon, May 23, 2011 at 2:21 PM, Anthony Liguori <aliguori@us.ibm.com> wrote: > On 05/23/2011 08:09 AM, Stefan Hajnoczi wrote: >> >> On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori<aliguori@us.ibm.com> >> wrote: >>> >>> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote: >>>> >>>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: >>>>> >>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>>> corresponding resources (image files). sVirt provides this support >>>>> by labeling guests and resources with security labels that are stored >>>>> in file system extended attributes. Some file systems, such as NFS, do >>>>> not support the extended attribute security namespace, which is needed >>>>> for image file isolation when using the sVirt SELinux security driver >>>>> in libvirt. >>>>> >>>>> The proposed solution entails a combination of Qemu, libvirt, and >>>>> SELinux patches that work together to isolate multiple guests' images >>>>> when they're stored in the same NFS mount. This results in an >>>>> environment where sVirt isolation and NFS image file isolation can both >>>>> be provided. >>>>> >>>>> Currently, Qemu opens an image file in addition to performing the >>>>> necessary read and write operations. The proposed solution will move >>>>> the open out of Qemu and into libvirt. Once libvirt opens an image >>>>> file for the guest, it will pass the file descriptor to Qemu via a >>>>> new fd: protocol. >>>>> >>>>> If the image file resides in an NFS mount, the following SELinux policy >>>>> changes will provide image isolation: >>>>> >>>>> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to >>>>> allow Qemu (svirt_t) to only have SELinux read and write >>>>> permissions on nfs_t files >>>>> >>>>> - Qemu (svirt_t) also gets SELinux use permissions on libvirt >>>>> (virtd_t) file descriptors >>>>> >>>>> Following is a sample invocation of Qemu using the fd: protocol: >>>>> >>>>> qemu -drive file=fd:4,format=qcow2 >>>>> >>>>> This patch contains the Qemu code to support this solution. I would >>>>> like to solicit input from the libvirt community prior to starting >>>>> the libvirt patch. >>>>> >>>>> This patch was tested with the following formats: raw, cow, qcow, >>>>> qcow2, vmdk, using the fd: protocol as well as existing file name >>>>> support. Non-valid file descriptors were also tested. >>>> >>>> How can backing files work ? The '-drive' syntax doesn't provide >>>> any way to set properties against the backing files (which may be >>>> nested to arbitrary depth). >>> >>> This is orthogonal to having an fd: protocol. >>> >>>> Also, there are a few places in QEMU, where it re-opens the existing >>>> block driver on the fly. What is the plan for supporting this, without >>>> having QEMU block on waiting for libvirt to pass it a new FD ? >>> >>> That's only host CDROM AFAICT. >> >> The host page cache on|off option also uses it because fcntl(2) cannot >> change O_DIRECT. Actually for Linux it may be doable with >> open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux >> does not exist. > > QEMU doesn't actually know which caching mode the fd is in if it's being > passed to it so this command doesn't make much sense. > fcntl(2) will report the flags. Also, we need to make sure that the O_SYNC flag and write caching are in agreement, although I guess it is libvirt's responsibility to set that up correctly. Stefan
On Mon, May 23, 2011 at 02:26:05PM +0100, Stefan Hajnoczi wrote: > On Mon, May 23, 2011 at 2:21 PM, Anthony Liguori <aliguori@us.ibm.com> wrote: > > On 05/23/2011 08:09 AM, Stefan Hajnoczi wrote: > >> > >> On Mon, May 23, 2011 at 1:50 PM, Anthony Liguori<aliguori@us.ibm.com> > >> wrote: > >>> > >>> On 05/23/2011 04:45 AM, Daniel P. Berrange wrote: > >>>> > >>>> On Fri, May 20, 2011 at 02:48:23PM -0400, Corey Bryant wrote: > >>>>> > >>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their > >>>>> corresponding resources (image files). sVirt provides this support > >>>>> by labeling guests and resources with security labels that are stored > >>>>> in file system extended attributes. Some file systems, such as NFS, do > >>>>> not support the extended attribute security namespace, which is needed > >>>>> for image file isolation when using the sVirt SELinux security driver > >>>>> in libvirt. > >>>>> > >>>>> The proposed solution entails a combination of Qemu, libvirt, and > >>>>> SELinux patches that work together to isolate multiple guests' images > >>>>> when they're stored in the same NFS mount. This results in an > >>>>> environment where sVirt isolation and NFS image file isolation can both > >>>>> be provided. > >>>>> > >>>>> Currently, Qemu opens an image file in addition to performing the > >>>>> necessary read and write operations. The proposed solution will move > >>>>> the open out of Qemu and into libvirt. Once libvirt opens an image > >>>>> file for the guest, it will pass the file descriptor to Qemu via a > >>>>> new fd: protocol. > >>>>> > >>>>> If the image file resides in an NFS mount, the following SELinux policy > >>>>> changes will provide image isolation: > >>>>> > >>>>> - A new SELinux boolean is created (e.g. virt_read_write_nfs) to > >>>>> allow Qemu (svirt_t) to only have SELinux read and write > >>>>> permissions on nfs_t files > >>>>> > >>>>> - Qemu (svirt_t) also gets SELinux use permissions on libvirt > >>>>> (virtd_t) file descriptors > >>>>> > >>>>> Following is a sample invocation of Qemu using the fd: protocol: > >>>>> > >>>>> qemu -drive file=fd:4,format=qcow2 > >>>>> > >>>>> This patch contains the Qemu code to support this solution. I would > >>>>> like to solicit input from the libvirt community prior to starting > >>>>> the libvirt patch. > >>>>> > >>>>> This patch was tested with the following formats: raw, cow, qcow, > >>>>> qcow2, vmdk, using the fd: protocol as well as existing file name > >>>>> support. Non-valid file descriptors were also tested. > >>>> > >>>> How can backing files work ? The '-drive' syntax doesn't provide > >>>> any way to set properties against the backing files (which may be > >>>> nested to arbitrary depth). > >>> > >>> This is orthogonal to having an fd: protocol. > >>> > >>>> Also, there are a few places in QEMU, where it re-opens the existing > >>>> block driver on the fly. What is the plan for supporting this, without > >>>> having QEMU block on waiting for libvirt to pass it a new FD ? > >>> > >>> That's only host CDROM AFAICT. > >> > >> The host page cache on|off option also uses it because fcntl(2) cannot > >> change O_DIRECT. Actually for Linux it may be doable with > >> open('/proc/fd/<old>', flags ^ O_DIRECT) and on other hosts SELinux > >> does not exist. > > > > QEMU doesn't actually know which caching mode the fd is in if it's being > > passed to it so this command doesn't make much sense. > > > > fcntl(2) will report the flags. > > Also, we need to make sure that the O_SYNC flag and write caching are > in agreement, although I guess it is libvirt's responsibility to set > that up correctly. This is where fcntl() support for setting/clearing O_DIRECT etc would be useful. It avoids the need for libvirt or other mgmt apps to second guess what flags QEMU expects for a particular cache mode. Regards, Daniel
Am 20.05.2011 21:53, schrieb Blue Swirl: > On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote: >> On 05/20/2011 02:25 PM, Blue Swirl wrote: >>> >>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >>>> >>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>> corresponding resources (image files). sVirt provides this support >>>> by labeling guests and resources with security labels that are stored >>>> in file system extended attributes. Some file systems, such as NFS, do >>>> not support the extended attribute security namespace, which is needed >>>> for image file isolation when using the sVirt SELinux security driver >>>> in libvirt. >>>> >>>> The proposed solution entails a combination of Qemu, libvirt, and >>>> SELinux patches that work together to isolate multiple guests' images >>>> when they're stored in the same NFS mount. This results in an >>>> environment where sVirt isolation and NFS image file isolation can both >>>> be provided. >>> >>> Very nice. QEMU should use this to support privilege separation. We >>> already have chroot and runas switches, a new switch should convert >>> all file references to fd references internally for that process. If >>> this can be made transparent, this should even be the default way of >>> operation. >> >> You mean, QEMU starts up, opens all disk images, reinvokes itself in a >> confined context, and then passes fds to the child? > > And exit after that, or do the same without forking. > > This wouldn't work now for the native CDROM devices which need to > reopen the device. For that, an explicit reopen method could be added. > The method could even chat with the privileged process to get that to > do the reopening, but I'd leave that to libvirt and fail without it > for plain QEMU. There are more cases where we reopen the image file. One example is the 'commit' monitor command which temporarily reopens the backing file r/w. Or Christoph's patch that allows guests to toggle the write-cache enabled bit. Same for live snapshots. So we'll need a solution for them before doing anything like this. And breaking qemu without libvirt isn't really an option for me. Kevin
Anthony Liguori <anthony@codemonkey.ws> writes: > On 05/23/2011 05:30 AM, Daniel P. Berrange wrote: >> It feels to me that turning the current block driver code which just does >> open(2) on files, into something which issues events& asynchronously >> waits for a file would potentially be quite complex. >> >> You also need to be much more careful from a security POV if the mgmt >> app is accepting requests to open arbitrary files from QEMU, to ensure >> the filenames are correctly/strictly validated before opening them and >> giving them back to QEMU. An architecture where the mgmt app decides >> what FDs to supply upfront, has less potential for security errors. >> >> To me the ideal would thus be that we can supply FDs for the backing >> store with -blockdev syntax, and that places where QEMU re-opens files >> would be enhanced to avoid that need. If there are things we can't do >> without closing& re-opening the same file, then perhaps we need some >> new ioctl()/fcntl() calls to change those file attributes on the fly. > > I agree. But my view of blockdev is that you wouldn't set an fd > attribute but rather the backing file name and use the fd protocol. > For instance: > > -blockdev id=foo-base,path=fd:4,format=raw > -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo I guess you mean backing_file=foo-base. If none is specified, use the backing file specification stored in the image. Matches my current thinking.
Kevin Wolf <kwolf@redhat.com> writes: > Am 20.05.2011 21:53, schrieb Blue Swirl: >> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote: >>> On 05/20/2011 02:25 PM, Blue Swirl wrote: >>>> >>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >>>>> >>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>>> corresponding resources (image files). sVirt provides this support >>>>> by labeling guests and resources with security labels that are stored >>>>> in file system extended attributes. Some file systems, such as NFS, do >>>>> not support the extended attribute security namespace, which is needed >>>>> for image file isolation when using the sVirt SELinux security driver >>>>> in libvirt. >>>>> >>>>> The proposed solution entails a combination of Qemu, libvirt, and >>>>> SELinux patches that work together to isolate multiple guests' images >>>>> when they're stored in the same NFS mount. This results in an >>>>> environment where sVirt isolation and NFS image file isolation can both >>>>> be provided. >>>> >>>> Very nice. QEMU should use this to support privilege separation. We >>>> already have chroot and runas switches, a new switch should convert >>>> all file references to fd references internally for that process. If >>>> this can be made transparent, this should even be the default way of >>>> operation. >>> >>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a >>> confined context, and then passes fds to the child? >> >> And exit after that, or do the same without forking. >> >> This wouldn't work now for the native CDROM devices which need to >> reopen the device. For that, an explicit reopen method could be added. >> The method could even chat with the privileged process to get that to >> do the reopening, but I'd leave that to libvirt and fail without it >> for plain QEMU. > > There are more cases where we reopen the image file. One example is the > 'commit' monitor command which temporarily reopens the backing file r/w. > Or Christoph's patch that allows guests to toggle the write-cache > enabled bit. Same for live snapshots. So we'll need a solution for them > before doing anything like this. > > And breaking qemu without libvirt isn't really an option for me. Reopening files is evil. Sometimes flaws in the system call API make it the only option. You can mitigate via /dev/fd/%d, but only on some systems. The less we reopen, the better. An fd: protocol can't easily support reopen. So fail it. This doesn't break any existing usage. It's just a restriction on the new protocol. Restrictions can render the new protocol useless in practice, but we're not "breaking qemu without libvirt" there. Perhaps we can make relax the restriction on some system by avoiding the reopen in a system-dependent way.
Am 23.05.2011 17:24, schrieb Markus Armbruster: > Kevin Wolf <kwolf@redhat.com> writes: > >> Am 20.05.2011 21:53, schrieb Blue Swirl: >>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote: >>>> On 05/20/2011 02:25 PM, Blue Swirl wrote: >>>>> >>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >>>>>> >>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>>>> corresponding resources (image files). sVirt provides this support >>>>>> by labeling guests and resources with security labels that are stored >>>>>> in file system extended attributes. Some file systems, such as NFS, do >>>>>> not support the extended attribute security namespace, which is needed >>>>>> for image file isolation when using the sVirt SELinux security driver >>>>>> in libvirt. >>>>>> >>>>>> The proposed solution entails a combination of Qemu, libvirt, and >>>>>> SELinux patches that work together to isolate multiple guests' images >>>>>> when they're stored in the same NFS mount. This results in an >>>>>> environment where sVirt isolation and NFS image file isolation can both >>>>>> be provided. >>>>> >>>>> Very nice. QEMU should use this to support privilege separation. We >>>>> already have chroot and runas switches, a new switch should convert >>>>> all file references to fd references internally for that process. If >>>>> this can be made transparent, this should even be the default way of >>>>> operation. >>>> >>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a >>>> confined context, and then passes fds to the child? >>> >>> And exit after that, or do the same without forking. >>> >>> This wouldn't work now for the native CDROM devices which need to >>> reopen the device. For that, an explicit reopen method could be added. >>> The method could even chat with the privileged process to get that to >>> do the reopening, but I'd leave that to libvirt and fail without it >>> for plain QEMU. >> >> There are more cases where we reopen the image file. One example is the >> 'commit' monitor command which temporarily reopens the backing file r/w. >> Or Christoph's patch that allows guests to toggle the write-cache >> enabled bit. Same for live snapshots. So we'll need a solution for them >> before doing anything like this. >> >> And breaking qemu without libvirt isn't really an option for me. > > Reopening files is evil. Sometimes flaws in the system call API make it > the only option. You can mitigate via /dev/fd/%d, but only on some > systems. The less we reopen, the better. > > An fd: protocol can't easily support reopen. So fail it. This doesn't > break any existing usage. It's just a restriction on the new protocol. > Restrictions can render the new protocol useless in practice, but we're > not "breaking qemu without libvirt" there. > > Perhaps we can make relax the restriction on some system by avoiding the > reopen in a system-dependent way. Right, you only get the regression once libvirt starts using it (or even worse, qemu itself, like Blue Swirl suggested). Doesn't make it much better. Kevin
On 05/23/2011 11:24 AM, Markus Armbruster wrote: > Kevin Wolf<kwolf@redhat.com> writes: > >> Am 20.05.2011 21:53, schrieb Blue Swirl: >>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori<anthony@codemonkey.ws> wrote: >>>> On 05/20/2011 02:25 PM, Blue Swirl wrote: >>>>> >>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >>>>>> >>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>>>> corresponding resources (image files). sVirt provides this support >>>>>> by labeling guests and resources with security labels that are stored >>>>>> in file system extended attributes. Some file systems, such as NFS, do >>>>>> not support the extended attribute security namespace, which is needed >>>>>> for image file isolation when using the sVirt SELinux security driver >>>>>> in libvirt. >>>>>> >>>>>> The proposed solution entails a combination of Qemu, libvirt, and >>>>>> SELinux patches that work together to isolate multiple guests' images >>>>>> when they're stored in the same NFS mount. This results in an >>>>>> environment where sVirt isolation and NFS image file isolation can both >>>>>> be provided. >>>>> >>>>> Very nice. QEMU should use this to support privilege separation. We >>>>> already have chroot and runas switches, a new switch should convert >>>>> all file references to fd references internally for that process. If >>>>> this can be made transparent, this should even be the default way of >>>>> operation. >>>> >>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a >>>> confined context, and then passes fds to the child? >>> >>> And exit after that, or do the same without forking. >>> >>> This wouldn't work now for the native CDROM devices which need to >>> reopen the device. For that, an explicit reopen method could be added. >>> The method could even chat with the privileged process to get that to >>> do the reopening, but I'd leave that to libvirt and fail without it >>> for plain QEMU. >> >> There are more cases where we reopen the image file. One example is the >> 'commit' monitor command which temporarily reopens the backing file r/w. >> Or Christoph's patch that allows guests to toggle the write-cache >> enabled bit. Same for live snapshots. So we'll need a solution for them >> before doing anything like this. >> >> And breaking qemu without libvirt isn't really an option for me. > > Reopening files is evil. Sometimes flaws in the system call API make it > the only option. You can mitigate via /dev/fd/%d, but only on some > systems. The less we reopen, the better. > > An fd: protocol can't easily support reopen. So fail it. This doesn't > break any existing usage. It's just a restriction on the new protocol. > Restrictions can render the new protocol useless in practice, but we're > not "breaking qemu without libvirt" there. > > Perhaps we can make relax the restriction on some system by avoiding the > reopen in a system-dependent way. A lot of great points here. Thanks everyone. I'd like to see if we can go forward with the suggestion of restricting the fd: protocol, at least for the initial patch. Perhaps this first pass at the protocol can limit it to no reopen and no backing file support. Even with this limited support, the fd: protocol can still provide added security for NFS users. Corey
On Mon, May 23, 2011 at 6:56 PM, Kevin Wolf <kwolf@redhat.com> wrote: > Am 23.05.2011 17:24, schrieb Markus Armbruster: >> Kevin Wolf <kwolf@redhat.com> writes: >> >>> Am 20.05.2011 21:53, schrieb Blue Swirl: >>>> On Fri, May 20, 2011 at 10:42 PM, Anthony Liguori <anthony@codemonkey.ws> wrote: >>>>> On 05/20/2011 02:25 PM, Blue Swirl wrote: >>>>>> >>>>>> On Fri, May 20, 2011 at 9:48 PM, Corey Bryant<bryntcor@us.ibm.com> wrote: >>>>>>> >>>>>>> sVirt provides SELinux MAC isolation for Qemu guest processes and their >>>>>>> corresponding resources (image files). sVirt provides this support >>>>>>> by labeling guests and resources with security labels that are stored >>>>>>> in file system extended attributes. Some file systems, such as NFS, do >>>>>>> not support the extended attribute security namespace, which is needed >>>>>>> for image file isolation when using the sVirt SELinux security driver >>>>>>> in libvirt. >>>>>>> >>>>>>> The proposed solution entails a combination of Qemu, libvirt, and >>>>>>> SELinux patches that work together to isolate multiple guests' images >>>>>>> when they're stored in the same NFS mount. This results in an >>>>>>> environment where sVirt isolation and NFS image file isolation can both >>>>>>> be provided. >>>>>> >>>>>> Very nice. QEMU should use this to support privilege separation. We >>>>>> already have chroot and runas switches, a new switch should convert >>>>>> all file references to fd references internally for that process. If >>>>>> this can be made transparent, this should even be the default way of >>>>>> operation. >>>>> >>>>> You mean, QEMU starts up, opens all disk images, reinvokes itself in a >>>>> confined context, and then passes fds to the child? >>>> >>>> And exit after that, or do the same without forking. >>>> >>>> This wouldn't work now for the native CDROM devices which need to >>>> reopen the device. For that, an explicit reopen method could be added. >>>> The method could even chat with the privileged process to get that to >>>> do the reopening, but I'd leave that to libvirt and fail without it >>>> for plain QEMU. >>> >>> There are more cases where we reopen the image file. One example is the >>> 'commit' monitor command which temporarily reopens the backing file r/w. >>> Or Christoph's patch that allows guests to toggle the write-cache >>> enabled bit. Same for live snapshots. So we'll need a solution for them >>> before doing anything like this. >>> >>> And breaking qemu without libvirt isn't really an option for me. >> >> Reopening files is evil. Sometimes flaws in the system call API make it >> the only option. You can mitigate via /dev/fd/%d, but only on some >> systems. The less we reopen, the better. >> >> An fd: protocol can't easily support reopen. So fail it. This doesn't >> break any existing usage. It's just a restriction on the new protocol. >> Restrictions can render the new protocol useless in practice, but we're >> not "breaking qemu without libvirt" there. >> >> Perhaps we can make relax the restriction on some system by avoiding the >> reopen in a system-dependent way. > > Right, you only get the regression once libvirt starts using it (or even > worse, qemu itself, like Blue Swirl suggested). Doesn't make it much better. One solution could be to add new commands which supply fresh fds while performing the operation which needs reopening: commit_fd ide0-hd0 #33 change ide1-cd0 fd:34 I don't remember the syntax for fd passing, so that part may be weird.
On 05/23/2011 10:56 AM, Kevin Wolf wrote: > Am 23.05.2011 17:24, schrieb Markus Armbruster: >> Kevin Wolf<kwolf@redhat.com> writes: >> >> An fd: protocol can't easily support reopen. So fail it. This doesn't >> break any existing usage. It's just a restriction on the new protocol. >> Restrictions can render the new protocol useless in practice, but we're >> not "breaking qemu without libvirt" there. >> >> Perhaps we can make relax the restriction on some system by avoiding the >> reopen in a system-dependent way. > > Right, you only get the regression once libvirt starts using it (or even > worse, qemu itself, like Blue Swirl suggested). Doesn't make it much better. libvirt already has the problem you describe. QEMU is designed assuming that we can access whatever resources the invoking user can. That's how our UI is constructed. But libvirt chooses to invoke QEMU in such a way that the actual QEMU process does not have the same rights as the invoking user. In fact, the context is entirely different. That means that to get isolation, libvirt has to understand what QEMU does for each operation and then do some magic behind the scenes to make it all work. This is not different really. 'commit' could require magic in libvirt anyway if libvirt was smart enough to mark the backing file with read-only rights. libvirt already has to parse image file formats to figure out backing files so it can set the permissions right. Regards, Anthony Liguori > Kevin >
Markus Armbruster wrote: > Anthony Liguori <anthony@codemonkey.ws> writes: > > > On 05/23/2011 05:30 AM, Daniel P. Berrange wrote: > >> It feels to me that turning the current block driver code which just does > >> open(2) on files, into something which issues events& asynchronously > >> waits for a file would potentially be quite complex. > >> > >> You also need to be much more careful from a security POV if the mgmt > >> app is accepting requests to open arbitrary files from QEMU, to ensure > >> the filenames are correctly/strictly validated before opening them and > >> giving them back to QEMU. An architecture where the mgmt app decides > >> what FDs to supply upfront, has less potential for security errors. > >> > >> To me the ideal would thus be that we can supply FDs for the backing > >> store with -blockdev syntax, and that places where QEMU re-opens files > >> would be enhanced to avoid that need. If there are things we can't do > >> without closing& re-opening the same file, then perhaps we need some > >> new ioctl()/fcntl() calls to change those file attributes on the fly. > > > > I agree. But my view of blockdev is that you wouldn't set an fd > > attribute but rather the backing file name and use the fd protocol. > > For instance: > > > > -blockdev id=foo-base,path=fd:4,format=raw > > -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo > > I guess you mean backing_file=foo-base. > > If none is specified, use the backing file specification stored in the > image. > > Matches my current thinking. Being able to override the backing file path would be useful anyway. I've already had problems when moving established qcow2 files between systems, that for historical reasons contain either an absolute path inside for the backing file, or some messy "../../whatever", or "foo/bar/whatever", or "backing.img" (lots of different ones with the same name), all of which are a pain to work around. (Imho, it would also make sense if qcow2 files contained a UUID for their backing file to verify you've given the correct backing file, and maybe help find it (much like Linux finds real disk devices and filesystems when mounting these days).) -- Jamie
On Mon, May 23, 2011 at 11:49 PM, Jamie Lokier <jamie@shareable.org> wrote: > Markus Armbruster wrote: >> Anthony Liguori <anthony@codemonkey.ws> writes: >> >> > On 05/23/2011 05:30 AM, Daniel P. Berrange wrote: >> >> It feels to me that turning the current block driver code which just does >> >> open(2) on files, into something which issues events& asynchronously >> >> waits for a file would potentially be quite complex. >> >> >> >> You also need to be much more careful from a security POV if the mgmt >> >> app is accepting requests to open arbitrary files from QEMU, to ensure >> >> the filenames are correctly/strictly validated before opening them and >> >> giving them back to QEMU. An architecture where the mgmt app decides >> >> what FDs to supply upfront, has less potential for security errors. >> >> >> >> To me the ideal would thus be that we can supply FDs for the backing >> >> store with -blockdev syntax, and that places where QEMU re-opens files >> >> would be enhanced to avoid that need. If there are things we can't do >> >> without closing& re-opening the same file, then perhaps we need some >> >> new ioctl()/fcntl() calls to change those file attributes on the fly. >> > >> > I agree. But my view of blockdev is that you wouldn't set an fd >> > attribute but rather the backing file name and use the fd protocol. >> > For instance: >> > >> > -blockdev id=foo-base,path=fd:4,format=raw >> > -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo >> >> I guess you mean backing_file=foo-base. >> >> If none is specified, use the backing file specification stored in the >> image. >> >> Matches my current thinking. > > Being able to override the backing file path would be useful anyway. > > I've already had problems when moving established qcow2 files between > systems, that for historical reasons contain either an absolute path > inside for the backing file, or some messy "../../whatever", or > "foo/bar/whatever", or "backing.img" (lots of different ones with the > same name), all of which are a pain to work around. > > (Imho, it would also make sense if qcow2 files contained a UUID for > their backing file to verify you've given the correct backing file, > and maybe help find it (much like Linux finds real disk devices and > filesystems when mounting these days).) Try the qemu-img rebase -f command: qemu-img uses the unsafe mode if "-u" is specified. In this mode, only the backing file name and format of filename is changed without any checks on the file contents. The user must take care of specifying the correct new backing file, or the guest-visible content of the image will be corrupted. This mode is useful for renaming or moving the backing file to somewhere else. It can be used without an accessible old backing file, i.e. you can use it to fix an image whose backing file has already been moved/renamed. Stefan
Stefan Hajnoczi wrote: > On Mon, May 23, 2011 at 11:49 PM, Jamie Lokier <jamie@shareable.org> wrote: > > Being able to override the backing file path would be useful anyway. > > > > I've already had problems when moving established qcow2 files between > > systems, that for historical reasons contain either an absolute path > > inside for the backing file, or some messy "../../whatever", or > > "foo/bar/whatever", or "backing.img" (lots of different ones with the > > same name), all of which are a pain to work around. ... > Try the qemu-img rebase -f command: > > qemu-img uses the unsafe mode if "-u" is specified. In this mode, only the > backing file name and format of filename is changed without any > checks on the > file contents. The user must take care of specifying the correct new backing > file, or the guest-visible content of the image will be corrupted. > > This mode is useful for renaming or moving the backing file to somewhere > else. It can be used without an accessible old backing file, i.e. you can > use it to fix an image whose backing file has already been moved/renamed. Yes indeed. That feature was added after the last time I dealt with this problem. However, I have wanted to open *precious*, *read-only* qcow2 images, for example with -snapshot or the explicit equivalent, and for those precious images I am loathe to let any tool write a single byte to them. The files are kept read-only, and often with the "immutable" attribute on Linux, backed up and checksummed just to be sure. I'd rather just override the value on the command line, so if that feature may turn up for fd: related reasons, it'll be handy for the read-only moved qcow2 file reason too. -- Jamie
diff --git a/block/raw-posix.c b/block/raw-posix.c index a95c8d4..6554b06 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -142,7 +142,8 @@ static int raw_open_common(BlockDriverState *bs, const char *filename, int bdrv_flags, int open_flags) { BDRVRawState *s = bs->opaque; - int fd, ret; + int fd = -1; + int ret; s->open_flags = open_flags | O_BINARY; s->open_flags &= ~O_ACCMODE; @@ -159,15 +160,16 @@ static int raw_open_common(BlockDriverState *bs, const char *filename, else if (!(bdrv_flags & BDRV_O_CACHE_WB)) s->open_flags |= O_DSYNC; - s->fd = -1; - fd = qemu_open(filename, s->open_flags, 0644); - if (fd < 0) { - ret = -errno; - if (ret == -EROFS) - ret = -EACCES; - return ret; + if (s->fd == -1) { + fd = qemu_open(filename, s->open_flags, 0644); + if (fd < 0) { + ret = -errno; + if (ret == -EROFS) + ret = -EACCES; + return ret; + } + s->fd = fd; } - s->fd = fd; s->aligned_buf = NULL; if ((bdrv_flags & BDRV_O_NOCACHE)) { @@ -224,6 +226,7 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags) { BDRVRawState *s = bs->opaque; + s->fd = -1; s->type = FTYPE_FILE; return raw_open_common(bs, filename, flags, 0); } @@ -819,6 +822,50 @@ static BlockDriver bdrv_file = { .create_options = raw_create_options, }; +static int raw_open_fd(BlockDriverState *bs, const char *filename, int flags) +{ + BDRVRawState *s = bs->opaque; + const char *fd_str; + int i; + + /* extract the file descriptor - fail if it's not fd: */ + if (!strstart(filename, "fd:", &fd_str)) { + return -EINVAL; + } + + for (i = 0; fd_str[i] != '\0'; i++) { + if (!qemu_isdigit(fd_str[i])) + return -EBADF; + } + + s->fd = atoi(fd_str); + s->type = FTYPE_FILE; + + return raw_open_common(bs, filename, flags, 0); +} + +static BlockDriver bdrv_file_fd = { + .format_name = "file", + .protocol_name = "fd", + .instance_size = sizeof(BDRVRawState), + .bdrv_probe = NULL, /* no probe for protocols */ + .bdrv_file_open = raw_open_fd, + .bdrv_read = raw_read, + .bdrv_write = raw_write, + .bdrv_close = raw_close, + .bdrv_flush = raw_flush, + .bdrv_discard = raw_discard, + + .bdrv_aio_readv = raw_aio_readv, + .bdrv_aio_writev = raw_aio_writev, + .bdrv_aio_flush = raw_aio_flush, + + .bdrv_truncate = raw_truncate, + .bdrv_getlength = raw_getlength, + + .create_options = raw_create_options, +}; + /***********************************************/ /* host device */ @@ -927,6 +974,7 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags) } #endif + s->fd = -1; s->type = FTYPE_FILE; #if defined(__linux__) { @@ -1097,6 +1145,7 @@ static int floppy_open(BlockDriverState *bs, const char *filename, int flags) BDRVRawState *s = bs->opaque; int ret; + s->fd = -1; s->type = FTYPE_FD; /* open will not fail even if no floppy is inserted, so add O_NONBLOCK */ @@ -1209,6 +1258,7 @@ static int cdrom_open(BlockDriverState *bs, const char *filename, int flags) { BDRVRawState *s = bs->opaque; + s->fd = -1; s->type = FTYPE_CD; /* open will not fail even if no CD is inserted, so add O_NONBLOCK */ @@ -1313,6 +1363,7 @@ static int cdrom_open(BlockDriverState *bs, const char *filename, int flags) BDRVRawState *s = bs->opaque; int ret; + s->fd = -1; s->type = FTYPE_CD; ret = raw_open_common(bs, filename, flags, 0); @@ -1432,6 +1483,7 @@ static void bdrv_file_init(void) * Register all the drivers. Note that order is important, the driver * registered last will get probed first. */ + bdrv_register(&bdrv_file_fd); bdrv_register(&bdrv_file); bdrv_register(&bdrv_host_device); #ifdef __linux__ diff --git a/qemu-doc.texi b/qemu-doc.texi index 47e1991..fea8882 100644 --- a/qemu-doc.texi +++ b/qemu-doc.texi @@ -412,6 +412,7 @@ snapshots. * disk_images_fat_images:: Virtual FAT disk images * disk_images_nbd:: NBD access * disk_images_sheepdog:: Sheepdog disk images +* disk_images_fd:: File descriptor access @end menu @node disk_images_quickstart @@ -686,6 +687,17 @@ qemu-img create sheepdog:@var{hostname}:@var{port}:@var{image} @var{size} qemu sheepdog:@var{hostname}:@var{port}:@var{image} @end example +@node disk_images_fd +@subsection File descriptor access + +QEMU can access an image file that was opened outside of the QEMU +process. The format option is required when passing a file descriptor +to QEMU. + +@example +qemu -drive file=fd:4,format=qcow2 +@end example + @node pcsys_network @section Network emulation diff --git a/qemu-options.hx b/qemu-options.hx index 5876fc5..7cf5358 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -116,7 +116,7 @@ using @file{/dev/cdrom} as filename (@pxref{host_drives}). ETEXI DEF("drive", HAS_ARG, QEMU_OPTION_drive, - "-drive [file=file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n" + "-drive [file=[fd:]file][,if=type][,bus=n][,unit=m][,media=d][,index=i]\n" " [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n" " [,cache=writethrough|writeback|none|unsafe][,format=f]\n" " [,serial=s][,addr=A][,id=name][,aio=threads|native]\n" @@ -129,10 +129,12 @@ STEXI Define a new drive. Valid options are: @table @option -@item file=@var{file} +@item file=[fd:]@var{file} This option defines which disk image (@pxref{disk_images}) to use with this drive. If the filename contains comma, you must double it -(for instance, "file=my,,file" to use file "my,file"). +(for instance, "file=my,,file" to use file "my,file"). @option{fd:}@var{file} +specifies the file descriptor of an already open disk +image. @option{format=}@var{format} is required by @option{fd:}@var{file}. @item if=@var{interface} This option defines on which type on interface the drive is connected. Available types are: ide, scsi, sd, mtd, floppy, pflash, virtio.
sVirt provides SELinux MAC isolation for Qemu guest processes and their corresponding resources (image files). sVirt provides this support by labeling guests and resources with security labels that are stored in file system extended attributes. Some file systems, such as NFS, do not support the extended attribute security namespace, which is needed for image file isolation when using the sVirt SELinux security driver in libvirt. The proposed solution entails a combination of Qemu, libvirt, and SELinux patches that work together to isolate multiple guests' images when they're stored in the same NFS mount. This results in an environment where sVirt isolation and NFS image file isolation can both be provided. Currently, Qemu opens an image file in addition to performing the necessary read and write operations. The proposed solution will move the open out of Qemu and into libvirt. Once libvirt opens an image file for the guest, it will pass the file descriptor to Qemu via a new fd: protocol. If the image file resides in an NFS mount, the following SELinux policy changes will provide image isolation: - A new SELinux boolean is created (e.g. virt_read_write_nfs) to allow Qemu (svirt_t) to only have SELinux read and write permissions on nfs_t files - Qemu (svirt_t) also gets SELinux use permissions on libvirt (virtd_t) file descriptors Following is a sample invocation of Qemu using the fd: protocol: qemu -drive file=fd:4,format=qcow2 This patch contains the Qemu code to support this solution. I would like to solicit input from the libvirt community prior to starting the libvirt patch. This patch was tested with the following formats: raw, cow, qcow, qcow2, vmdk, using the fd: protocol as well as existing file name support. Non-valid file descriptors were also tested. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- block/raw-posix.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++------- qemu-doc.texi | 12 +++++++++ qemu-options.hx | 8 ++++-- 3 files changed, 78 insertions(+), 12 deletions(-) -- 1.7.1