Patchwork Make default invocation of block drivers safer

login
register
mail settings
Submitter Anthony Liguori
Date July 14, 2010, 4:12 p.m.
Message ID <1279123952-1576-1-git-send-email-aliguori@us.ibm.com>
Download mbox | patch
Permalink /patch/58914/
State New
Headers show

Comments

Anthony Liguori - July 14, 2010, 4:12 p.m.
CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
trick the block probing code into accessing arbitrary files in a guest.  To
mitigate this, we added an explicit format parameter to -drive which disabling
block probing.

Fast forward to today, and the vast majority of users do not use this parameter.
libvirt does not use this by default nor does virt-manager.

Most users want block probing so we should try to make it safer.

This patch adds some logic to the raw device which attempts to detect a write
operation to the beginning of a raw device.  If the first 4 bytes happen to
match an image file that has a backing file that we support, it scrubs the
signature to all zeros.  If a user specifies an explicit format parameter, this
behavior is disabled.

I contend that while a legitimate guest could write such a signature to the
header, we would behave incorrectly anyway upon the next invocation of QEMU.
This simply changes the incorrect behavior to not involve a security
vulnerability.

I've tested this pretty extensively both in the positive and negative case.  I'm
not 100% confident in the block layer's ability to deal with zero sized writes
particularly with respect to the aio functions so some additional eyes would be
appreciated.

Even in the case of a single sector write, we have to make sure to invoked the
completion from a bottom half so just removing the zero sized write is not an
option.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
 block.c     |    4 ++
 block/raw.c |  115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block_int.h |    1 +
 3 files changed, 120 insertions(+), 0 deletions(-)
Kevin Wolf - July 14, 2010, 4:42 p.m.
Am 14.07.2010 18:12, schrieb Anthony Liguori:
> CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
> trick the block probing code into accessing arbitrary files in a guest.  To
> mitigate this, we added an explicit format parameter to -drive which disabling
> block probing.
> 
> Fast forward to today, and the vast majority of users do not use this parameter.
> libvirt does not use this by default nor does virt-manager.
> 
> Most users want block probing so we should try to make it safer.
> 
> This patch adds some logic to the raw device which attempts to detect a write
> operation to the beginning of a raw device.  If the first 4 bytes happen to
> match an image file that has a backing file that we support, it scrubs the
> signature to all zeros.  If a user specifies an explicit format parameter, this
> behavior is disabled.
> 
> I contend that while a legitimate guest could write such a signature to the
> header, we would behave incorrectly anyway upon the next invocation of QEMU.
> This simply changes the incorrect behavior to not involve a security
> vulnerability.
> 
> I've tested this pretty extensively both in the positive and negative case.  I'm
> not 100% confident in the block layer's ability to deal with zero sized writes
> particularly with respect to the aio functions so some additional eyes would be
> appreciated.
> 
> Even in the case of a single sector write, we have to make sure to invoked the
> completion from a bottom half so just removing the zero sized write is not an
> option.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

I guess something like this makes sense, and the approach looks okay in
general. With the check that we have really probed the format, we still
allow legitimate use cases (whatever they might be).

However, I wonder why you even bother with adjusting buffers and
requests and stuff instead of just returning a straight -EIO. Doing so
would have the additional advantage that the expectation of the guest OS
matches what is really on the disk (garbage) instead of silently
corrupting things.

>  static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs,
>      int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
>      BlockDriverCompletionFunc *cb, void *opaque)
>  {
> +    if (check_write_unsafe(bs, sector_num, qiov->iov[0].iov_base, nb_sectors)) {

Have you checked that the bad value is always in iov[0]? Could a guest
construct a zero-length iov[0] and do the bad access in iov[1]? Or use
two two-byte buffers to write the magic number?

I'm not saying that any of these work, I honestly don't know, but did
you consider them?

Kevin
Anthony Liguori - July 14, 2010, 5:40 p.m.
On 07/14/2010 11:42 AM, Kevin Wolf wrote:
> Am 14.07.2010 18:12, schrieb Anthony Liguori:
>    
>> CVE-2008-2004 described a vulnerability in QEMU whereas a malicious user could
>> trick the block probing code into accessing arbitrary files in a guest.  To
>> mitigate this, we added an explicit format parameter to -drive which disabling
>> block probing.
>>
>> Fast forward to today, and the vast majority of users do not use this parameter.
>> libvirt does not use this by default nor does virt-manager.
>>
>> Most users want block probing so we should try to make it safer.
>>
>> This patch adds some logic to the raw device which attempts to detect a write
>> operation to the beginning of a raw device.  If the first 4 bytes happen to
>> match an image file that has a backing file that we support, it scrubs the
>> signature to all zeros.  If a user specifies an explicit format parameter, this
>> behavior is disabled.
>>
>> I contend that while a legitimate guest could write such a signature to the
>> header, we would behave incorrectly anyway upon the next invocation of QEMU.
>> This simply changes the incorrect behavior to not involve a security
>> vulnerability.
>>
>> I've tested this pretty extensively both in the positive and negative case.  I'm
>> not 100% confident in the block layer's ability to deal with zero sized writes
>> particularly with respect to the aio functions so some additional eyes would be
>> appreciated.
>>
>> Even in the case of a single sector write, we have to make sure to invoked the
>> completion from a bottom half so just removing the zero sized write is not an
>> option.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>      
> I guess something like this makes sense, and the approach looks okay in
> general. With the check that we have really probed the format, we still
> allow legitimate use cases (whatever they might be).
>
> However, I wonder why you even bother with adjusting buffers and
> requests and stuff instead of just returning a straight -EIO. Doing so
> would have the additional advantage that the expectation of the guest OS
> matches what is really on the disk (garbage) instead of silently
> corrupting things.
>    

I started with that approach.  My concern is that it would trigger the 
stop-on-error behavior and the result would be far too difficult for a 
management tool/person to deal with.

Scrubbing seemed like a easier-to-use solution.

>>   static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs,
>>       int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
>>       BlockDriverCompletionFunc *cb, void *opaque)
>>   {
>> +    if (check_write_unsafe(bs, sector_num, qiov->iov[0].iov_base, nb_sectors)) {
>>      
> Have you checked that the bad value is always in iov[0]? Could a guest
> construct a zero-length iov[0] and do the bad access in iov[1]? Or use
> two two-byte buffers to write the magic number?
>
> I'm not saying that any of these work, I honestly don't know, but did
> you consider them?
>    

No, and it's certainly worth being a bit more paranoid.

Regards,

Anthony Liguori

> Kevin
>
>
Christoph Hellwig - July 14, 2010, 6:43 p.m.
Err, strong NACK.  Please don't start messing with the contents of the
data plane, we're getting into real trouble there.  It's perfectly
valid for a guest to create an image inside an image, and with hardware
support for nested virtualization I guess this use case will become
rather common, just as it already is on S/390 with VM.
Anthony Liguori - July 14, 2010, 6:50 p.m.
On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
> Err, strong NACK.  Please don't start messing with the contents of the
> data plane, we're getting into real trouble there.  It's perfectly
> valid for a guest to create an image inside an image, and with hardware
> support for nested virtualization I guess this use case will become
> rather common, just as it already is on S/390 with VM.
>    

Then we have to remove block format probing.

The two things are fundamentally incompatible.

Regards,

Anthony Liguori
Anthony Liguori - July 14, 2010, 6:53 p.m.
On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
> Err, strong NACK.  Please don't start messing with the contents of the
> data plane, we're getting into real trouble there.  It's perfectly
> valid for a guest to create an image inside an image, and with hardware
> support for nested virtualization I guess this use case will become
> rather common, just as it already is on S/390 with VM.
>    

Note, this is limited to raw.

If you use qcow2 in L0, the guest can do whatever it wants to do.

But if we autoprobe a raw image, we simply cannot let a guest turn the 
raw image into something else.

A user can still use raw images and let them contain anything, but they 
have to explicitly specify that it's a raw image.

Regards,

Anthony Liguori
Aurelien Jarno - July 14, 2010, 6:54 p.m.
On Wed, Jul 14, 2010 at 08:43:11PM +0200, Christoph Hellwig wrote:
> Err, strong NACK.  Please don't start messing with the contents of the
> data plane, we're getting into real trouble there.  It's perfectly
> valid for a guest to create an image inside an image, and with hardware
> support for nested virtualization I guess this use case will become
> rather common, just as it already is on S/390 with VM.
> 

Maybe it should only be done on the hard drive used to boot?
Anthony Liguori - July 14, 2010, 7:04 p.m.
On 07/14/2010 01:54 PM, Aurelien Jarno wrote:
> On Wed, Jul 14, 2010 at 08:43:11PM +0200, Christoph Hellwig wrote:
>    
>> Err, strong NACK.  Please don't start messing with the contents of the
>> data plane, we're getting into real trouble there.  It's perfectly
>> valid for a guest to create an image inside an image, and with hardware
>> support for nested virtualization I guess this use case will become
>> rather common, just as it already is on S/390 with VM.
>>
>>      
> Maybe it should only be done on the hard drive used to boot?
>    

It's just as dangerous on any other disk.  My use of "bootsector" in the 
code is probably misleading.

Regards,

Anthony Liguori
Kevin Wolf - July 15, 2010, 8 a.m.
Am 14.07.2010 19:40, schrieb Anthony Liguori:
> On 07/14/2010 11:42 AM, Kevin Wolf wrote:
>> However, I wonder why you even bother with adjusting buffers and
>> requests and stuff instead of just returning a straight -EIO. Doing so
>> would have the additional advantage that the expectation of the guest OS
>> matches what is really on the disk (garbage) instead of silently
>> corrupting things.
>>    
> 
> I started with that approach.  My concern is that it would trigger the 
> stop-on-error behavior and the result would be far too difficult for a 
> management tool/person to deal with.
> 
> Scrubbing seemed like a easier-to-use solution.

Right, I didn't consider that. Makes sense then.

Kevin
Kevin Wolf - July 15, 2010, 8:09 a.m.
Am 14.07.2010 20:43, schrieb Christoph Hellwig:
> Err, strong NACK.  Please don't start messing with the contents of the
> data plane, we're getting into real trouble there.  It's perfectly
> valid for a guest to create an image inside an image, and with hardware
> support for nested virtualization I guess this use case will become
> rather common, just as it already is on S/390 with VM.

But you'll hardly ever find a legitimate or even common use case for
putting qcow2 on a raw hard disk. I mean, we have all learned that qcow2
can be used in LVs (which can grow at least), but directly on a hard
disk just doesn't make any sense to me. That's the first point.

The second is that you should always specify the format explicitly,
which turns this magic off. Specify format=raw and you're free to format
as many hard disks as qcow2 as you want.

Third, if you don't specify format=raw and the guest has written a qcow2
header, next time it would get the wrong content anyway because qcow2
will be interpreted by the host and not the level 1 guest (this one is
the security problem).

Kevin
Stefan Hajnoczi - July 15, 2010, 9:10 a.m.
On Thu, Jul 15, 2010 at 9:09 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 14.07.2010 20:43, schrieb Christoph Hellwig:
>> Err, strong NACK.  Please don't start messing with the contents of the
>> data plane, we're getting into real trouble there.  It's perfectly
>> valid for a guest to create an image inside an image, and with hardware
>> support for nested virtualization I guess this use case will become
>> rather common, just as it already is on S/390 with VM.
>
> But you'll hardly ever find a legitimate or even common use case for
> putting qcow2 on a raw hard disk. I mean, we have all learned that qcow2
> can be used in LVs (which can grow at least), but directly on a hard
> disk just doesn't make any sense to me. That's the first point.
>
> The second is that you should always specify the format explicitly,
> which turns this magic off. Specify format=raw and you're free to format
> as many hard disks as qcow2 as you want.
>
> Third, if you don't specify format=raw and the guest has written a qcow2
> header, next time it would get the wrong content anyway because qcow2
> will be interpreted by the host and not the level 1 guest (this one is
> the security problem).

I have mixed feelings about this approach.  It has good usability
because legitimate users are unaffected, but adding a check into the
I/O path is unfortunate from a clean code perspective.

Management stacks that don't explicitly set format= today are in
trouble.  In an environment where the VM owner is untrusted, the VM
owner could provide/upload a malicious disk image and cold boot it.
This patch only prevents dodgy images created inside a running VM.
Luckily this scenario is increasingly unlikely since management stacks
specifying explicit format= and SELinux/sVirt will eventually make
this go away.

I think there are actually two issues here:

1. Confusing QEMU so it sees an image with a different format than expected.

This is important because it's unexpected behavior for a user who puts
a QCOW2 image onto a raw disk to find the disk itself turn into a
QCOW2 disk on next reboot.

I also worry about this bug because it means that in a scenario where
format= is not explicitly given, the VM can change its disk image
format.  This is a problem because the host administrator might have
used raw files and be unhappy to find that the user is able to exploit
a (hypothetical) security issue in the vmdk code despite having
created the VM with a raw image.

2. Image formats that support backing files are inherently insecure.

The final scenario that doesn't go away is the casual user who tries
out a foreign disk image.  The expectation is that the disk image
could boot up and do completely silly things but it could not affect
the host.  In reality it can read the contents of any file owned by
the uid running QEMU and send them over the internet if the guest has
networking.

You really need to run qemu-img info to check that there is no
unwanted backing file.  So I suspect we'll never be 100% safe unless
backing files are disabled by default with an error message asking you
to add allow_backing_file=on.

Stefan
Daniel P. Berrange - July 15, 2010, 9:20 a.m.
On Wed, Jul 14, 2010 at 01:50:02PM -0500, Anthony Liguori wrote:
> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
> >Err, strong NACK.  Please don't start messing with the contents of the
> >data plane, we're getting into real trouble there.  It's perfectly
> >valid for a guest to create an image inside an image, and with hardware
> >support for nested virtualization I guess this use case will become
> >rather common, just as it already is on S/390 with VM.
> >   
> 
> Then we have to remove block format probing.
> 
> The two things are fundamentally incompatible.

FWIW, the latest libvirt code will now always set a fmt=XXX arg even
for raw, so probing will never be performed. We also always set a backing
store format when creating qcow2 files.

The main remaining unsolved issues are backing stores with non-qcow2 
files for which QEMU will always probe, and pre-existing qcow2 files
which have been created without backing store formats.  It would be 
nice to have an explicit arg to disable all backing store probing, 
forcing backing format to either be raw, or match the parent image 
format.

Regards,
Daniel
Anthony Liguori - July 15, 2010, 12:35 p.m.
On 07/15/2010 04:20 AM, Daniel P. Berrange wrote:
> On Wed, Jul 14, 2010 at 01:50:02PM -0500, Anthony Liguori wrote:
>    
>> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>>      
>>> Err, strong NACK.  Please don't start messing with the contents of the
>>> data plane, we're getting into real trouble there.  It's perfectly
>>> valid for a guest to create an image inside an image, and with hardware
>>> support for nested virtualization I guess this use case will become
>>> rather common, just as it already is on S/390 with VM.
>>>
>>>        
>> Then we have to remove block format probing.
>>
>> The two things are fundamentally incompatible.
>>      
> FWIW, the latest libvirt code will now always set a fmt=XXX arg even
> for raw, so probing will never be performed. We also always set a backing
> store format when creating qcow2 files.
>    

I assume libvirt probes once, then updates the guest's XML?

This is definitely the most user friendly option of all except for the 
fact that we don't have a writeable config file in QEMU today.

Regards,

Anthony Liguori

> The main remaining unsolved issues are backing stores with non-qcow2
> files for which QEMU will always probe, and pre-existing qcow2 files
> which have been created without backing store formats.  It would be
> nice to have an explicit arg to disable all backing store probing,
> forcing backing format to either be raw, or match the parent image
> format.
>
> Regards,
> Daniel
>
Anthony Liguori - July 15, 2010, 12:57 p.m.
On 07/15/2010 04:10 AM, Stefan Hajnoczi wrote:
> I have mixed feelings about this approach.  It has good usability
> because legitimate users are unaffected, but adding a check into the
> I/O path is unfortunate from a clean code perspective.
>
> Management stacks that don't explicitly set format= today are in
> trouble.  In an environment where the VM owner is untrusted, the VM
> owner could provide/upload a malicious disk image and cold boot it.
>    

More specifically, management stacks are in trouble if they use raw 
images and don't use format=.  Anyone using qcow2 exclusively is fine.

I'm less concerned about uploaded images.  Format spoofing is really the 
least of the concerns for uploaded images.  The bigger concern IMHO is 
attempting to exploit vulnerabilities in qcow2 or any of the other image 
formats.

> This patch only prevents dodgy images created inside a running VM.
> Luckily this scenario is increasingly unlikely since management stacks
> specifying explicit format= and SELinux/sVirt will eventually make
> this go away.
>
> I think there are actually two issues here:
>
> 1. Confusing QEMU so it sees an image with a different format than expected.
>
> This is important because it's unexpected behavior for a user who puts
> a QCOW2 image onto a raw disk to find the disk itself turn into a
> QCOW2 disk on next reboot.
>
> I also worry about this bug because it means that in a scenario where
> format= is not explicitly given, the VM can change its disk image
> format.  This is a problem because the host administrator might have
> used raw files and be unhappy to find that the user is able to exploit
> a (hypothetical) security issue in the vmdk code despite having
> created the VM with a raw image.
>    

One of the nasty things in QEMU right now is that we have absolutely no 
way to persist information about the guest and we have no persistent 
definition of the guest.

All of our VMs are basically stateless across invocations and that 
really makes things like this difficult.

> 2. Image formats that support backing files are inherently insecure.
>
> The final scenario that doesn't go away is the casual user who tries
> out a foreign disk image.  The expectation is that the disk image
> could boot up and do completely silly things but it could not affect
> the host.  In reality it can read the contents of any file owned by
> the uid running QEMU and send them over the internet if the guest has
> networking.
>
> You really need to run qemu-img info to check that there is no
> unwanted backing file.  So I suspect we'll never be 100% safe unless
> backing files are disabled by default with an error message asking you
> to add allow_backing_file=on.
>    

I'm not sure I'd classify it as insecure.  They are only insecure *if* 
the guest can modify the backing file.  With the proposed patch, they can't.

Regards,

Anthony Liguori

> Stefan
>
Kevin Wolf - July 15, 2010, 1:16 p.m.
Am 15.07.2010 14:57, schrieb Anthony Liguori:
> On 07/15/2010 04:10 AM, Stefan Hajnoczi wrote:
>> I think there are actually two issues here:
>>
>> 1. Confusing QEMU so it sees an image with a different format than expected.
>>
>> This is important because it's unexpected behavior for a user who puts
>> a QCOW2 image onto a raw disk to find the disk itself turn into a
>> QCOW2 disk on next reboot.
>>
>> I also worry about this bug because it means that in a scenario where
>> format= is not explicitly given, the VM can change its disk image
>> format.  This is a problem because the host administrator might have
>> used raw files and be unhappy to find that the user is able to exploit
>> a (hypothetical) security issue in the vmdk code despite having
>> created the VM with a raw image.
>>    
> 
> One of the nasty things in QEMU right now is that we have absolutely no 
> way to persist information about the guest and we have no persistent 
> definition of the guest.
> 
> All of our VMs are basically stateless across invocations and that 
> really makes things like this difficult.

On the one hand, yes, it can be nasty in some situations. But on the
other hand, when I first used qemu back in 0.6.0 times or so, what
really impressed me was how easy it was to use. No long config files to
create or anything, "qemu -hda my_image" (and maybe one or two other
options) and it just worked. We should try not to lose too much of this
ease of use.

Kevin
Stefan Hajnoczi - July 15, 2010, 1:20 p.m.
On Thu, Jul 15, 2010 at 1:57 PM, Anthony Liguori
<aliguori@linux.vnet.ibm.com> wrote:
> On 07/15/2010 04:10 AM, Stefan Hajnoczi wrote:
>>
>> I have mixed feelings about this approach.  It has good usability
>> because legitimate users are unaffected, but adding a check into the
>> I/O path is unfortunate from a clean code perspective.
>>
>> Management stacks that don't explicitly set format= today are in
>> trouble.  In an environment where the VM owner is untrusted, the VM
>> owner could provide/upload a malicious disk image and cold boot it.
>>
>
> More specifically, management stacks are in trouble if they use raw images
> and don't use format=.  Anyone using qcow2 exclusively is fine.
>
> I'm less concerned about uploaded images.  Format spoofing is really the
> least of the concerns for uploaded images.  The bigger concern IMHO is
> attempting to exploit vulnerabilities in qcow2 or any of the other image
> formats.

Attempting to exploit vulnerabilities in image formats is exactly what
you can do with image spoofing, even though your VM was created with a
raw image.

>
>> This patch only prevents dodgy images created inside a running VM.
>> Luckily this scenario is increasingly unlikely since management stacks
>> specifying explicit format= and SELinux/sVirt will eventually make
>> this go away.
>>
>> I think there are actually two issues here:
>>
>> 1. Confusing QEMU so it sees an image with a different format than
>> expected.
>>
>> This is important because it's unexpected behavior for a user who puts
>> a QCOW2 image onto a raw disk to find the disk itself turn into a
>> QCOW2 disk on next reboot.
>>
>> I also worry about this bug because it means that in a scenario where
>> format= is not explicitly given, the VM can change its disk image
>> format.  This is a problem because the host administrator might have
>> used raw files and be unhappy to find that the user is able to exploit
>> a (hypothetical) security issue in the vmdk code despite having
>> created the VM with a raw image.
>>
>
> One of the nasty things in QEMU right now is that we have absolutely no way
> to persist information about the guest and we have no persistent definition
> of the guest.
>
> All of our VMs are basically stateless across invocations and that really
> makes things like this difficult.
>
>> 2. Image formats that support backing files are inherently insecure.
>>
>> The final scenario that doesn't go away is the casual user who tries
>> out a foreign disk image.  The expectation is that the disk image
>> could boot up and do completely silly things but it could not affect
>> the host.  In reality it can read the contents of any file owned by
>> the uid running QEMU and send them over the internet if the guest has
>> networking.
>>
>> You really need to run qemu-img info to check that there is no
>> unwanted backing file.  So I suspect we'll never be 100% safe unless
>> backing files are disabled by default with an error message asking you
>> to add allow_backing_file=on.
>>
>
> I'm not sure I'd classify it as insecure.  They are only insecure *if* the
> guest can modify the backing file.  With the proposed patch, they can't.

You cannot trust a QCOW2 file I give you, it might read your mailbox
and send me the contents.  This is the offline attack vector again,
not a running VM modifying its disk image and rebooting.

I think it is reasonable to expect a published disk image to be unable
to access local files on the host.

Stefan
Markus Armbruster - July 15, 2010, 3:19 p.m.
Anthony Liguori <anthony@codemonkey.ws> writes:

> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>> Err, strong NACK.  Please don't start messing with the contents of the
>> data plane, we're getting into real trouble there.  It's perfectly
>> valid for a guest to create an image inside an image, and with hardware
>> support for nested virtualization I guess this use case will become
>> rather common, just as it already is on S/390 with VM.
>>    
>
> Then we have to remove block format probing.
>
> The two things are fundamentally incompatible.

I agree with Christoph: changing guest writes is a big no-no, and
changing them silently is even worse.

I could perhaps accept EIO.  Elsewhere in this thread you wrote that you
rejected that approach because "it would trigger the stop-on-error
behavior and the result would be far too difficult for a management
tool/person to deal with."  I think that would be *far* superior in
fact: it fails spectacularly, immediately and safely instead of silently
corrupting disk contents.

The real problem in need of fixing is the unsafe default.  You wrote
that "most users want block probing".  I disagree.  Users want to set up
drives with as little hassle as possible.  If format is optional, and
appears to work, why bother specifying it?  That they get an unsafe
default that way is a big surprise to them.  And I can't blame them!
Users can reasonably expect programs not to trap them.

If we want to let users define drives without having to specify the
format, we can guess the format from the file name.
Anthony Liguori - July 15, 2010, 4:20 p.m.
On 07/15/2010 10:19 AM, Markus Armbruster wrote:
> Anthony Liguori<anthony@codemonkey.ws>  writes:
>
>    
>> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>>      
>>> Err, strong NACK.  Please don't start messing with the contents of the
>>> data plane, we're getting into real trouble there.  It's perfectly
>>> valid for a guest to create an image inside an image, and with hardware
>>> support for nested virtualization I guess this use case will become
>>> rather common, just as it already is on S/390 with VM.
>>>
>>>        
>> Then we have to remove block format probing.
>>
>> The two things are fundamentally incompatible.
>>      
> I agree with Christoph: changing guest writes is a big no-no, and
> changing them silently is even worse.
>    

I do sympathize.  The problem is we're already doing this.  This patch 
simply changes the behavior to not be a security problem.  I've 
committed it to attempt to resolve that security problem.  However, we 
still have a problem and I don't consider the issue closed.

> I could perhaps accept EIO.  Elsewhere in this thread you wrote that you
> rejected that approach because "it would trigger the stop-on-error
> behavior and the result would be far too difficult for a management
> tool/person to deal with."  I think that would be *far* superior in
> fact: it fails spectacularly, immediately and safely instead of silently
> corrupting disk contents.
>    

There's really nothing wrong with this type of write, so EIO doesn't 
solve the problem.  While we can argue whether writing zeros or EIO is a 
"better bad" solution, let's try to figure out a good solution.

> The real problem in need of fixing is the unsafe default.  You wrote
> that "most users want block probing".  I disagree.  Users want to set up
> drives with as little hassle as possible.  If format is optional, and
> appears to work, why bother specifying it?

I really think specifying the format is a burden that is nice to avoid.

I have another idea that I hope will solve the problem in a more 
complete way.  The fundamental issue is that it's impossible to probe 
raw images reliably.  We can probe qcow2, vmdk, etc but not raw.

So, let's do the following: have raw_probe() always fail.  Probing 
shouldn't be a heuristic, it should be an absolute.  We can't prove it's 
a raw image, so we should always fail.

To accomodate current use-cases with raw, let's introduce a new format 
called "probed_raw".  probed_raw's semantics will be the following:

The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD', 
'OOOM', ...}.  If the signature is 'QRAW', then instead of reading the 
first sector at offset 0, we read the first sector at offset LENGTH.  If 
the signature is 'QRAW', LENGTH is computed by calculating FILE_SIZE - 512.

For probed_raw, write requests to sector 0 are checked.  If the first 
four bytes is an invalid probed_raw signature or QRAW, we write a QRAW 
signature to file offset 0 and copy the first sector to the end of the 
file redirecting reads and writes to the end of file.

An approach like this has the following properties:

1) We can make the bdrv_probe check 100% reliable and return a boolean.
2) In the cases where we known format=raw, none of this code is ever 
invoked.
3) probed_raw images usually look exactly like raw images in most cases
4) In the degenerate cases, probe_raw images are still mountable in the 
normal way.
5) Even after the QRAW signature is applied, if the guest writes a valid 
signature, we can truncate the file and make it appear as a normal raw 
image.

Christoph/Markus/Stefan, does this seem like a more reasonable approach?

Regards,

Anthony Liguori

>    That they get an unsafe
> default that way is a big surprise to them.  And I can't blame them!
> Users can reasonably expect programs not to trap them.
>
> If we want to let users define drives without having to specify the
> format, we can guess the format from the file name.
>
Kevin Wolf - July 15, 2010, 5:10 p.m.
Am 15.07.2010 18:20, schrieb Anthony Liguori:
> I have another idea that I hope will solve the problem in a more 
> complete way.  The fundamental issue is that it's impossible to probe 
> raw images reliably.  We can probe qcow2, vmdk, etc but not raw.
> 
> So, let's do the following: have raw_probe() always fail.  Probing 
> shouldn't be a heuristic, it should be an absolute.  We can't prove it's 
> a raw image, so we should always fail.
> 
> To accomodate current use-cases with raw, let's introduce a new format 
> called "probed_raw".  probed_raw's semantics will be the following:
> 
> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD', 
> 'OOOM', ...}.  If the signature is 'QRAW', then instead of reading the 
> first sector at offset 0, we read the first sector at offset LENGTH.  If 
> the signature is 'QRAW', LENGTH is computed by calculating FILE_SIZE - 512.
> 
> For probed_raw, write requests to sector 0 are checked.  If the first 
> four bytes is an invalid probed_raw signature or QRAW, we write a QRAW 
> signature to file offset 0 and copy the first sector to the end of the 
> file redirecting reads and writes to the end of file.
> 
> An approach like this has the following properties:
> 
> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
> 2) In the cases where we known format=raw, none of this code is ever 
> invoked.
> 3) probed_raw images usually look exactly like raw images in most cases
> 4) In the degenerate cases, probe_raw images are still mountable in the 
> normal way.
> 5) Even after the QRAW signature is applied, if the guest writes a valid 
> signature, we can truncate the file and make it appear as a normal raw 
> image.
> 
> Christoph/Markus/Stefan, does this seem like a more reasonable approach?

I hope I may answer even though you didn't ask me. ;-)

This qraw format you introduce is a weird thing, mainly because
depending on a signature the first sectors moves to somewhere completely
else.

If we want to introduce something, that's _almost_ raw and that we can
use by default, what about doing something similar to what statically
sized VHD images look like: take a raw image and attach a footer? With
this approach you should still be able to mount it in the host etc. We
could automatically append this footer when opening a probed qraw image r/w.

Kevin
Anthony Liguori - July 15, 2010, 5:51 p.m.
On 07/15/2010 12:10 PM, Kevin Wolf wrote:
> Am 15.07.2010 18:20, schrieb Anthony Liguori:
>    
>> I have another idea that I hope will solve the problem in a more
>> complete way.  The fundamental issue is that it's impossible to probe
>> raw images reliably.  We can probe qcow2, vmdk, etc but not raw.
>>
>> So, let's do the following: have raw_probe() always fail.  Probing
>> shouldn't be a heuristic, it should be an absolute.  We can't prove it's
>> a raw image, so we should always fail.
>>
>> To accomodate current use-cases with raw, let's introduce a new format
>> called "probed_raw".  probed_raw's semantics will be the following:
>>
>> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD',
>> 'OOOM', ...}.  If the signature is 'QRAW', then instead of reading the
>> first sector at offset 0, we read the first sector at offset LENGTH.  If
>> the signature is 'QRAW', LENGTH is computed by calculating FILE_SIZE - 512.
>>
>> For probed_raw, write requests to sector 0 are checked.  If the first
>> four bytes is an invalid probed_raw signature or QRAW, we write a QRAW
>> signature to file offset 0 and copy the first sector to the end of the
>> file redirecting reads and writes to the end of file.
>>
>> An approach like this has the following properties:
>>
>> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
>> 2) In the cases where we known format=raw, none of this code is ever
>> invoked.
>> 3) probed_raw images usually look exactly like raw images in most cases
>> 4) In the degenerate cases, probe_raw images are still mountable in the
>> normal way.
>> 5) Even after the QRAW signature is applied, if the guest writes a valid
>> signature, we can truncate the file and make it appear as a normal raw
>> image.
>>
>> Christoph/Markus/Stefan, does this seem like a more reasonable approach?
>>      
> I hope I may answer even though you didn't ask me. ;-)
>    

Heh, you just seemed more agreeable to the overall concept but obviously 
I value your input :-)

> This qraw format you introduce is a weird thing, mainly because
> depending on a signature the first sectors moves to somewhere completely
> else.
>
> If we want to introduce something, that's _almost_ raw and that we can
> use by default, what about doing something similar to what statically
> sized VHD images look like: take a raw image and attach a footer?

I guess this could work provided that the image size was not a multiple 
of 512 and that all other image formats guaranteed that they were a 
multiple of 512 or they also had a footer.

Makes me a little nervous though because it's easy to confuse an image 
with a magic id and data at the end from an image with a special footer.

Regards,

Anthony Liguori

>   With
> this approach you should still be able to mount it in the host etc. We
> could automatically append this footer when opening a probed qraw image r/w.
>    

D

> Kevin
>
Kevin Wolf - July 16, 2010, 7:30 a.m.
Am 15.07.2010 19:51, schrieb Anthony Liguori:
> On 07/15/2010 12:10 PM, Kevin Wolf wrote:
>> Am 15.07.2010 18:20, schrieb Anthony Liguori:
>>    
>>> I have another idea that I hope will solve the problem in a more
>>> complete way.  The fundamental issue is that it's impossible to probe
>>> raw images reliably.  We can probe qcow2, vmdk, etc but not raw.
>>>
>>> So, let's do the following: have raw_probe() always fail.  Probing
>>> shouldn't be a heuristic, it should be an absolute.  We can't prove it's
>>> a raw image, so we should always fail.
>>>
>>> To accomodate current use-cases with raw, let's introduce a new format
>>> called "probed_raw".  probed_raw's semantics will be the following:
>>>
>>> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD',
>>> 'OOOM', ...}.  If the signature is 'QRAW', then instead of reading the
>>> first sector at offset 0, we read the first sector at offset LENGTH.  If
>>> the signature is 'QRAW', LENGTH is computed by calculating FILE_SIZE - 512.
>>>
>>> For probed_raw, write requests to sector 0 are checked.  If the first
>>> four bytes is an invalid probed_raw signature or QRAW, we write a QRAW
>>> signature to file offset 0 and copy the first sector to the end of the
>>> file redirecting reads and writes to the end of file.
>>>
>>> An approach like this has the following properties:
>>>
>>> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
>>> 2) In the cases where we known format=raw, none of this code is ever
>>> invoked.
>>> 3) probed_raw images usually look exactly like raw images in most cases
>>> 4) In the degenerate cases, probe_raw images are still mountable in the
>>> normal way.
>>> 5) Even after the QRAW signature is applied, if the guest writes a valid
>>> signature, we can truncate the file and make it appear as a normal raw
>>> image.
>>>
>>> Christoph/Markus/Stefan, does this seem like a more reasonable approach?
>>>      
>> I hope I may answer even though you didn't ask me. ;-)
>>    
> 
> Heh, you just seemed more agreeable to the overall concept but obviously 
> I value your input :-)
> 
>> This qraw format you introduce is a weird thing, mainly because
>> depending on a signature the first sectors moves to somewhere completely
>> else.
>>
>> If we want to introduce something, that's _almost_ raw and that we can
>> use by default, what about doing something similar to what statically
>> sized VHD images look like: take a raw image and attach a footer?
> 
> I guess this could work provided that the image size was not a multiple 
> of 512 and that all other image formats guaranteed that they were a 
> multiple of 512 or they also had a footer.
> 
> Makes me a little nervous though because it's easy to confuse an image 
> with a magic id and data at the end from an image with a special footer.

Yeah, I guess you're right. It would be hard to tell apart a qcow2 image
with a qraw footer in its data and a qraw image with a qcow2 header in
it's data.

Was just a thought...

Kevin
Stefan Hajnoczi - July 16, 2010, 12:55 p.m.
On Thu, Jul 15, 2010 at 5:20 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 07/15/2010 10:19 AM, Markus Armbruster wrote:
>>
>> Anthony Liguori<anthony@codemonkey.ws>  writes:
>>
>>
>>>
>>> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>>>
>>>>
>>>> Err, strong NACK.  Please don't start messing with the contents of the
>>>> data plane, we're getting into real trouble there.  It's perfectly
>>>> valid for a guest to create an image inside an image, and with hardware
>>>> support for nested virtualization I guess this use case will become
>>>> rather common, just as it already is on S/390 with VM.
>>>>
>>>>
>>>
>>> Then we have to remove block format probing.
>>>
>>> The two things are fundamentally incompatible.
>>>
>>
>> I agree with Christoph: changing guest writes is a big no-no, and
>> changing them silently is even worse.
>>
>
> I do sympathize.  The problem is we're already doing this.  This patch
> simply changes the behavior to not be a security problem.  I've committed it
> to attempt to resolve that security problem.  However, we still have a
> problem and I don't consider the issue closed.
>
>> I could perhaps accept EIO.  Elsewhere in this thread you wrote that you
>> rejected that approach because "it would trigger the stop-on-error
>> behavior and the result would be far too difficult for a management
>> tool/person to deal with."  I think that would be *far* superior in
>> fact: it fails spectacularly, immediately and safely instead of silently
>> corrupting disk contents.
>>
>
> There's really nothing wrong with this type of write, so EIO doesn't solve
> the problem.  While we can argue whether writing zeros or EIO is a "better
> bad" solution, let's try to figure out a good solution.
>
>> The real problem in need of fixing is the unsafe default.  You wrote
>> that "most users want block probing".  I disagree.  Users want to set up
>> drives with as little hassle as possible.  If format is optional, and
>> appears to work, why bother specifying it?
>
> I really think specifying the format is a burden that is nice to avoid.
>
> I have another idea that I hope will solve the problem in a more complete
> way.  The fundamental issue is that it's impossible to probe raw images
> reliably.  We can probe qcow2, vmdk, etc but not raw.
>
> So, let's do the following: have raw_probe() always fail.  Probing shouldn't
> be a heuristic, it should be an absolute.  We can't prove it's a raw image,
> so we should always fail.
>
> To accomodate current use-cases with raw, let's introduce a new format
> called "probed_raw".  probed_raw's semantics will be the following:
>
> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD', 'OOOM',
> ...}.  If the signature is 'QRAW', then instead of reading the first sector
> at offset 0, we read the first sector at offset LENGTH.  If the signature is
> 'QRAW', LENGTH is computed by calculating FILE_SIZE - 512.
>
> For probed_raw, write requests to sector 0 are checked.  If the first four
> bytes is an invalid probed_raw signature or QRAW, we write a QRAW signature
> to file offset 0 and copy the first sector to the end of the file
> redirecting reads and writes to the end of file.
>
> An approach like this has the following properties:
>
> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
> 2) In the cases where we known format=raw, none of this code is ever
> invoked.
> 3) probed_raw images usually look exactly like raw images in most cases
> 4) In the degenerate cases, probe_raw images are still mountable in the
> normal way.
> 5) Even after the QRAW signature is applied, if the guest writes a valid
> signature, we can truncate the file and make it appear as a normal raw
> image.
>
> Christoph/Markus/Stefan, does this seem like a more reasonable approach?

It took me a little while to figure out how your scheme works.  I like
that the check for sector 0 writes is moved out of the generic I/O
code path and into its own module.  The probed_raw format could be
easily dropped later if the decision is made to stop probing
altogether.

It would be simpler to avoid the QRAW signature and sector 0
redirection by simply failing dodgy writes to sector 0 with EIO.  You
said that would be confusing to users, but if we have no good way to
present errors to the user, then that is a different problem that
needs to be addressed anyway.

What we're talking about here is similar to "boot sector virus
protection" that BIOSes implement(ed?).

Stefan

>
> Regards,
>
> Anthony Liguori
>
>>   That they get an unsafe
>> default that way is a big surprise to them.  And I can't blame them!
>> Users can reasonably expect programs not to trap them.
>>
>> If we want to let users define drives without having to specify the
>> format, we can guess the format from the file name.
>>
>
>
>
Stefan Hajnoczi - July 16, 2010, 1 p.m.
On Fri, Jul 16, 2010 at 1:55 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Jul 15, 2010 at 5:20 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>> On 07/15/2010 10:19 AM, Markus Armbruster wrote:
>>>
>>> Anthony Liguori<anthony@codemonkey.ws>  writes:
>>>
>>>
>>>>
>>>> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>>>>
>>>>>
>>>>> Err, strong NACK.  Please don't start messing with the contents of the
>>>>> data plane, we're getting into real trouble there.  It's perfectly
>>>>> valid for a guest to create an image inside an image, and with hardware
>>>>> support for nested virtualization I guess this use case will become
>>>>> rather common, just as it already is on S/390 with VM.
>>>>>
>>>>>
>>>>
>>>> Then we have to remove block format probing.
>>>>
>>>> The two things are fundamentally incompatible.
>>>>
>>>
>>> I agree with Christoph: changing guest writes is a big no-no, and
>>> changing them silently is even worse.
>>>
>>
>> I do sympathize.  The problem is we're already doing this.  This patch
>> simply changes the behavior to not be a security problem.  I've committed it
>> to attempt to resolve that security problem.  However, we still have a
>> problem and I don't consider the issue closed.
>>
>>> I could perhaps accept EIO.  Elsewhere in this thread you wrote that you
>>> rejected that approach because "it would trigger the stop-on-error
>>> behavior and the result would be far too difficult for a management
>>> tool/person to deal with."  I think that would be *far* superior in
>>> fact: it fails spectacularly, immediately and safely instead of silently
>>> corrupting disk contents.
>>>
>>
>> There's really nothing wrong with this type of write, so EIO doesn't solve
>> the problem.  While we can argue whether writing zeros or EIO is a "better
>> bad" solution, let's try to figure out a good solution.
>>
>>> The real problem in need of fixing is the unsafe default.  You wrote
>>> that "most users want block probing".  I disagree.  Users want to set up
>>> drives with as little hassle as possible.  If format is optional, and
>>> appears to work, why bother specifying it?
>>
>> I really think specifying the format is a burden that is nice to avoid.
>>
>> I have another idea that I hope will solve the problem in a more complete
>> way.  The fundamental issue is that it's impossible to probe raw images
>> reliably.  We can probe qcow2, vmdk, etc but not raw.
>>
>> So, let's do the following: have raw_probe() always fail.  Probing shouldn't
>> be a heuristic, it should be an absolute.  We can't prove it's a raw image,
>> so we should always fail.
>>
>> To accomodate current use-cases with raw, let's introduce a new format
>> called "probed_raw".  probed_raw's semantics will be the following:
>>
>> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD', 'OOOM',
>> ...}.  If the signature is 'QRAW', then instead of reading the first sector
>> at offset 0, we read the first sector at offset LENGTH.  If the signature is
>> 'QRAW', LENGTH is computed by calculating FILE_SIZE - 512.
>>
>> For probed_raw, write requests to sector 0 are checked.  If the first four
>> bytes is an invalid probed_raw signature or QRAW, we write a QRAW signature
>> to file offset 0 and copy the first sector to the end of the file
>> redirecting reads and writes to the end of file.
>>
>> An approach like this has the following properties:
>>
>> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
>> 2) In the cases where we known format=raw, none of this code is ever
>> invoked.
>> 3) probed_raw images usually look exactly like raw images in most cases
>> 4) In the degenerate cases, probe_raw images are still mountable in the
>> normal way.
>> 5) Even after the QRAW signature is applied, if the guest writes a valid
>> signature, we can truncate the file and make it appear as a normal raw
>> image.
>>
>> Christoph/Markus/Stefan, does this seem like a more reasonable approach?
>
> It took me a little while to figure out how your scheme works.  I like
> that the check for sector 0 writes is moved out of the generic I/O
> code path and into its own module.  The probed_raw format could be
> easily dropped later if the decision is made to stop probing
> altogether.
>
> It would be simpler to avoid the QRAW signature and sector 0
> redirection by simply failing dodgy writes to sector 0 with EIO.  You
> said that would be confusing to users, but if we have no good way to
> present errors to the user, then that is a different problem that
> needs to be addressed anyway.

On second thought, we can't do EIO if there is a reasonable case where
the user may wish to write any of the blacklisted signatures to disk
:(.  Perhaps the sector 0 redirection is really necessary.

I like this new approach better than the first patch.

Stefan

>
> What we're talking about here is similar to "boot sector virus
> protection" that BIOSes implement(ed?).
>
> Stefan
>
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>>   That they get an unsafe
>>> default that way is a big surprise to them.  And I can't blame them!
>>> Users can reasonably expect programs not to trap them.
>>>
>>> If we want to let users define drives without having to specify the
>>> format, we can guess the format from the file name.
>>>
>>
>>
>>
>
Markus Armbruster - July 16, 2010, 4:06 p.m.
Anthony Liguori <anthony@codemonkey.ws> writes:

> On 07/15/2010 10:19 AM, Markus Armbruster wrote:
>> Anthony Liguori<anthony@codemonkey.ws>  writes:
>>
>>    
>>> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>>>      
>>>> Err, strong NACK.  Please don't start messing with the contents of the
>>>> data plane, we're getting into real trouble there.  It's perfectly
>>>> valid for a guest to create an image inside an image, and with hardware
>>>> support for nested virtualization I guess this use case will become
>>>> rather common, just as it already is on S/390 with VM.
>>>>
>>>>        
>>> Then we have to remove block format probing.
>>>
>>> The two things are fundamentally incompatible.
>>>      
>> I agree with Christoph: changing guest writes is a big no-no, and
>> changing them silently is even worse.
>>    
>
> I do sympathize.  The problem is we're already doing this.  This patch
> simply changes the behavior to not be a security problem.  I've
> committed it to attempt to resolve that security problem.  However, we
> still have a problem and I don't consider the issue closed.
>
>> I could perhaps accept EIO.  Elsewhere in this thread you wrote that you
>> rejected that approach because "it would trigger the stop-on-error
>> behavior and the result would be far too difficult for a management
>> tool/person to deal with."  I think that would be *far* superior in
>> fact: it fails spectacularly, immediately and safely instead of silently
>> corrupting disk contents.
>>    
>
> There's really nothing wrong with this type of write, so EIO doesn't
> solve the problem.  While we can argue whether writing zeros or EIO is
> a "better bad" solution, let's try to figure out a good solution.
>
>> The real problem in need of fixing is the unsafe default.  You wrote
>> that "most users want block probing".  I disagree.  Users want to set up
>> drives with as little hassle as possible.  If format is optional, and
>> appears to work, why bother specifying it?
>
> I really think specifying the format is a burden that is nice to avoid.

Yes, users don't like having to specify the "obvious".

> I have another idea that I hope will solve the problem in a more
> complete way.  The fundamental issue is that it's impossible to probe
> raw images reliably.  We can probe qcow2, vmdk, etc but not raw.
>
> So, let's do the following: have raw_probe() always fail.  Probing
> shouldn't be a heuristic, it should be an absolute.  We can't prove
> it's a raw image, so we should always fail.

Note: if we stop right here, the security hole is patched, but use of
raw images requires explicit specification of format.

> To accomodate current use-cases with raw, let's introduce a new format
> called "probed_raw".  probed_raw's semantics will be the following:
>
> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD',
> OOOM', ...}.  If the signature is 'QRAW', then instead of reading the
> first sector at offset 0, we read the first sector at offset LENGTH.
> If the signature is 'QRAW', LENGTH is computed by calculating
> FILE_SIZE - 512.
>
> For probed_raw, write requests to sector 0 are checked.  If the first
> four bytes is an invalid probed_raw signature or QRAW, we write a QRAW
> signature to file offset 0 and copy the first sector to the end of the
> file redirecting reads and writes to the end of file.

Doesn't this require an image that can grow?  What about host block
devices?

> An approach like this has the following properties:
>
> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
> 2) In the cases where we known format=raw, none of this code is ever
> invoked.
> 3) probed_raw images usually look exactly like raw images in most cases
> 4) In the degenerate cases, probe_raw images are still mountable in
> the normal way.
> 5) Even after the QRAW signature is applied, if the guest writes a
> valid signature, we can truncate the file and make it appear as a
> normal raw image.
>
> Christoph/Markus/Stefan, does this seem like a more reasonable approach?

I'm not convinced it's a good idea.  It's clearly a less bad idea,
though :)

It avoids guest-visible lossage, and that's good.

There's still host-visible lossage: as soon as we redirect sector 0, the
image isn't raw anymore, and accessing it with non-qemu tools (say
losetup + kpartx) no longer works.  You need to know what QEMU did to
your no-longer-raw image to work around the lossage (say losetup -o
512).

>>    That they get an unsafe
>> default that way is a big surprise to them.  And I can't blame them!
>> Users can reasonably expect programs not to trap them.
>>
>> If we want to let users define drives without having to specify the
>> format, we can guess the format from the file name.

I still think guessing the format from the file name is a better
way to spare users from having to specify formats.
Anthony Liguori - July 16, 2010, 4:16 p.m.
On 07/16/2010 11:06 AM, Markus Armbruster wrote:
> Anthony Liguori<anthony@codemonkey.ws>  writes:
>
>    
>> On 07/15/2010 10:19 AM, Markus Armbruster wrote:
>>      
>>> Anthony Liguori<anthony@codemonkey.ws>   writes:
>>>
>>>
>>>        
>>>> On 07/14/2010 01:43 PM, Christoph Hellwig wrote:
>>>>
>>>>          
>>>>> Err, strong NACK.  Please don't start messing with the contents of the
>>>>> data plane, we're getting into real trouble there.  It's perfectly
>>>>> valid for a guest to create an image inside an image, and with hardware
>>>>> support for nested virtualization I guess this use case will become
>>>>> rather common, just as it already is on S/390 with VM.
>>>>>
>>>>>
>>>>>            
>>>> Then we have to remove block format probing.
>>>>
>>>> The two things are fundamentally incompatible.
>>>>
>>>>          
>>> I agree with Christoph: changing guest writes is a big no-no, and
>>> changing them silently is even worse.
>>>
>>>        
>> I do sympathize.  The problem is we're already doing this.  This patch
>> simply changes the behavior to not be a security problem.  I've
>> committed it to attempt to resolve that security problem.  However, we
>> still have a problem and I don't consider the issue closed.
>>
>>      
>>> I could perhaps accept EIO.  Elsewhere in this thread you wrote that you
>>> rejected that approach because "it would trigger the stop-on-error
>>> behavior and the result would be far too difficult for a management
>>> tool/person to deal with."  I think that would be *far* superior in
>>> fact: it fails spectacularly, immediately and safely instead of silently
>>> corrupting disk contents.
>>>
>>>        
>> There's really nothing wrong with this type of write, so EIO doesn't
>> solve the problem.  While we can argue whether writing zeros or EIO is
>> a "better bad" solution, let's try to figure out a good solution.
>>
>>      
>>> The real problem in need of fixing is the unsafe default.  You wrote
>>> that "most users want block probing".  I disagree.  Users want to set up
>>> drives with as little hassle as possible.  If format is optional, and
>>> appears to work, why bother specifying it?
>>>        
>> I really think specifying the format is a burden that is nice to avoid.
>>      
> Yes, users don't like having to specify the "obvious".
>
>    
>> I have another idea that I hope will solve the problem in a more
>> complete way.  The fundamental issue is that it's impossible to probe
>> raw images reliably.  We can probe qcow2, vmdk, etc but not raw.
>>
>> So, let's do the following: have raw_probe() always fail.  Probing
>> shouldn't be a heuristic, it should be an absolute.  We can't prove
>> it's a raw image, so we should always fail.
>>      
> Note: if we stop right here, the security hole is patched, but use of
> raw images requires explicit specification of format.
>
>    
>> To accomodate current use-cases with raw, let's introduce a new format
>> called "probed_raw".  probed_raw's semantics will be the following:
>>
>> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD',
>> OOOM', ...}.  If the signature is 'QRAW', then instead of reading the
>> first sector at offset 0, we read the first sector at offset LENGTH.
>> If the signature is 'QRAW', LENGTH is computed by calculating
>> FILE_SIZE - 512.
>>
>> For probed_raw, write requests to sector 0 are checked.  If the first
>> four bytes is an invalid probed_raw signature or QRAW, we write a QRAW
>> signature to file offset 0 and copy the first sector to the end of the
>> file redirecting reads and writes to the end of file.
>>      
> Doesn't this require an image that can grow?  What about host block
> devices?
>    

I don't believe we probe host block devices.  We assume they're raw 
which means they would never be probed_raw.

>> An approach like this has the following properties:
>>
>> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
>> 2) In the cases where we known format=raw, none of this code is ever
>> invoked.
>> 3) probed_raw images usually look exactly like raw images in most cases
>> 4) In the degenerate cases, probe_raw images are still mountable in
>> the normal way.
>> 5) Even after the QRAW signature is applied, if the guest writes a
>> valid signature, we can truncate the file and make it appear as a
>> normal raw image.
>>
>> Christoph/Markus/Stefan, does this seem like a more reasonable approach?
>>      
> I'm not convinced it's a good idea.  It's clearly a less bad idea,
> though :)
>
> It avoids guest-visible lossage, and that's good.
>
> There's still host-visible lossage: as soon as we redirect sector 0, the
> image isn't raw anymore, and accessing it with non-qemu tools (say
> losetup + kpartx) no longer works.  You need to know what QEMU did to
> your no-longer-raw image to work around the lossage (say losetup -o
> 512).
>    

Yeah, but as previously discussed, we can't probe raw.  So probed_raw 
ends up being a compromise.

>>>     That they get an unsafe
>>> default that way is a big surprise to them.  And I can't blame them!
>>> Users can reasonably expect programs not to trap them.
>>>
>>> If we want to let users define drives without having to specify the
>>> format, we can guess the format from the file name.
>>>        
> I still think guessing the format from the file name is a better
> way to spare users from having to specify formats.
>    
I think that would be true if we did it from day 1 but it would be a 
huge impact to users if we did it today.

Regards,

Anthony Liguori
Kevin Wolf - July 16, 2010, 4:24 p.m.
Am 16.07.2010 18:16, schrieb Anthony Liguori:
> On 07/16/2010 11:06 AM, Markus Armbruster wrote:
>> Anthony Liguori<anthony@codemonkey.ws>  writes:
>>> To accomodate current use-cases with raw, let's introduce a new format
>>> called "probed_raw".  probed_raw's semantics will be the following:
>>>
>>> The signature of a probed_raw will be ~{'QFI\xfb', 'VMDK', 'COWD',
>>> OOOM', ...}.  If the signature is 'QRAW', then instead of reading the
>>> first sector at offset 0, we read the first sector at offset LENGTH.
>>> If the signature is 'QRAW', LENGTH is computed by calculating
>>> FILE_SIZE - 512.
>>>
>>> For probed_raw, write requests to sector 0 are checked.  If the first
>>> four bytes is an invalid probed_raw signature or QRAW, we write a QRAW
>>> signature to file offset 0 and copy the first sector to the end of the
>>> file redirecting reads and writes to the end of file.
>>>      
>> Doesn't this require an image that can grow?  What about host block
>> devices?
>>    
> 
> I don't believe we probe host block devices.  We assume they're raw 
> which means they would never be probed_raw.

We do probe them. And yes, I know you love qcow2 on block devices. ;-)

>>> An approach like this has the following properties:
>>>
>>> 1) We can make the bdrv_probe check 100% reliable and return a boolean.
>>> 2) In the cases where we known format=raw, none of this code is ever
>>> invoked.
>>> 3) probed_raw images usually look exactly like raw images in most cases
>>> 4) In the degenerate cases, probe_raw images are still mountable in
>>> the normal way.
>>> 5) Even after the QRAW signature is applied, if the guest writes a
>>> valid signature, we can truncate the file and make it appear as a
>>> normal raw image.
>>>
>>> Christoph/Markus/Stefan, does this seem like a more reasonable approach?
>>>      
>> I'm not convinced it's a good idea.  It's clearly a less bad idea,
>> though :)
>>
>> It avoids guest-visible lossage, and that's good.
>>
>> There's still host-visible lossage: as soon as we redirect sector 0, the
>> image isn't raw anymore, and accessing it with non-qemu tools (say
>> losetup + kpartx) no longer works.  You need to know what QEMU did to
>> your no-longer-raw image to work around the lossage (say losetup -o
>> 512).
>>    
> 
> Yeah, but as previously discussed, we can't probe raw.  So probed_raw 
> ends up being a compromise.
> 
>>>>     That they get an unsafe
>>>> default that way is a big surprise to them.  And I can't blame them!
>>>> Users can reasonably expect programs not to trap them.
>>>>
>>>> If we want to let users define drives without having to specify the
>>>> format, we can guess the format from the file name.
>>>>        
>> I still think guessing the format from the file name is a better
>> way to spare users from having to specify formats.
>>    
> I think that would be true if we did it from day 1 but it would be a 
> huge impact to users if we did it today.

I think I agree.

Kevin

Patch

diff --git a/block.c b/block.c
index 65cf4dc..f837876 100644
--- a/block.c
+++ b/block.c
@@ -511,6 +511,7 @@  int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
               BlockDriver *drv)
 {
     int ret;
+    int probed = 0;
 
     if (flags & BDRV_O_SNAPSHOT) {
         BlockDriverState *bs1;
@@ -571,6 +572,7 @@  int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
     /* Find the right image format driver */
     if (!drv) {
         drv = find_image_format(filename);
+        probed = 1;
     }
 
     if (!drv) {
@@ -584,6 +586,8 @@  int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
         goto unlink_and_fail;
     }
 
+    bs->probed = probed;
+
     /* If there is a backing file, use it */
     if ((flags & BDRV_O_NO_BACKING) == 0 && bs->backing_file[0] != '\0') {
         char backing_filename[PATH_MAX];
diff --git a/block/raw.c b/block/raw.c
index 4406b8c..92e41d8 100644
--- a/block/raw.c
+++ b/block/raw.c
@@ -9,15 +9,82 @@  static int raw_open(BlockDriverState *bs, int flags)
     return 0;
 }
 
+/* check for the user attempting to write something that looks like a
+   block format header to the beginning of the image and fail out.
+*/
+static int check_for_block_signature(BlockDriverState *bs, const uint8_t *buf)
+{
+    static const uint8_t signatures[][4] = {
+        { 'Q', 'F', 'I', 0xfb }, /* qcow/qcow2 */
+        { 'C', 'O', 'W', 'D' }, /* VMDK3 */
+        { 'V', 'M', 'D', 'K' }, /* VMDK4 */
+        { 'O', 'O', 'O', 'M' }, /* UML COW */
+        {}
+    };
+    int i;
+
+    for (i = 0; signatures[i][0] != 0; i++) {
+        if (memcmp(buf, signatures[i], 4) == 0) {
+            return 1;
+        }
+    }
+
+    return 0;
+}
+
+static int check_write_unsafe(BlockDriverState *bs, int64_t sector_num,
+                              const uint8_t *buf, int nb_sectors)
+{
+    /* assume that if the user specifies the format explicitly, then assume
+       that they will continue to do so and provide no safety net */
+    if (!bs->probed) {
+        return 0;
+    }
+
+    if (sector_num == 0 && nb_sectors > 0) {
+        return check_for_block_signature(bs, buf);
+    }
+
+    return 0;
+}
+
 static int raw_read(BlockDriverState *bs, int64_t sector_num,
                     uint8_t *buf, int nb_sectors)
 {
     return bdrv_read(bs->file, sector_num, buf, nb_sectors);
 }
 
+static int raw_write_scrubbed_bootsect(BlockDriverState *bs,
+                                       const uint8_t *buf)
+{
+    uint8_t bootsect[512];
+
+    /* scrub the dangerous signature */
+    memcpy(bootsect, buf, 512);
+    memset(bootsect, 0, 4);
+
+    return bdrv_write(bs->file, 0, bootsect, 1);
+}
+
 static int raw_write(BlockDriverState *bs, int64_t sector_num,
                      const uint8_t *buf, int nb_sectors)
 {
+    if (check_write_unsafe(bs, sector_num, buf, nb_sectors)) {
+        int ret;
+
+        ret = raw_write_scrubbed_bootsect(bs, buf);
+        if (ret < 0) {
+            return ret;
+        }
+
+        ret = bdrv_write(bs->file, 1, buf + 512, nb_sectors - 1);
+        if (ret < 0) {
+            return ret;
+        }
+
+        return ret + 512;
+    }
+
     return bdrv_write(bs->file, sector_num, buf, nb_sectors);
 }
 
@@ -28,10 +95,58 @@  static BlockDriverAIOCB *raw_aio_readv(BlockDriverState *bs,
     return bdrv_aio_readv(bs->file, sector_num, qiov, nb_sectors, cb, opaque);
 }
 
+typedef struct RawScrubberBounce
+{
+    BlockDriverCompletionFunc *cb;
+    void *opaque;
+    QEMUIOVector qiov;
+} RawScrubberBounce;
+
+static void raw_aio_writev_scrubbed(void *opaque, int ret)
+{
+    RawScrubberBounce *b = opaque;
+
+    if (ret < 0) {
+        b->cb(b->opaque, ret);
+    } else {
+        b->cb(b->opaque, ret + 512);
+    }
+
+    qemu_iovec_destroy(&b->qiov);
+    qemu_free(b);
+}
+
 static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs,
     int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
     BlockDriverCompletionFunc *cb, void *opaque)
 {
+    if (check_write_unsafe(bs, sector_num, qiov->iov[0].iov_base, nb_sectors)) {
+        RawScrubberBounce *b;
+        int ret;
+
+        /* write the first sector using sync I/O */
+        ret = raw_write_scrubbed_bootsect(bs, qiov->iov[0].iov_base);
+        if (ret < 0) {
+            return NULL;
+        }
+
+        /* adjust request to be everything but first sector */
+
+        b = qemu_malloc(sizeof(*b));
+        b->cb = cb;
+        b->opaque = opaque;
+
+        qemu_iovec_init(&b->qiov, qiov->nalloc);
+        qemu_iovec_concat(&b->qiov, qiov, qiov->size);
+
+        b->qiov.size -= 512;
+        b->qiov.iov[0].iov_base += 512;
+        b->qiov.iov[0].iov_len -= 512;
+
+        return bdrv_aio_writev(bs->file, sector_num + 1, &b->qiov,
+                               nb_sectors - 1, raw_aio_writev_scrubbed, b);
+    }
+
     return bdrv_aio_writev(bs->file, sector_num, qiov, nb_sectors, cb, opaque);
 }
 
diff --git a/block_int.h b/block_int.h
index 877e1e5..96ff4cf 100644
--- a/block_int.h
+++ b/block_int.h
@@ -144,6 +144,7 @@  struct BlockDriverState {
     int encrypted; /* if true, the media is encrypted */
     int valid_key; /* if true, a valid encryption key has been set */
     int sg;        /* if true, the device is a /dev/sg* */
+    int probed;    /* if true, format was probed automatically */
     /* event callback when inserting/removing */
     void (*change_cb)(void *opaque);
     void *change_opaque;