diff mbox series

block: Formats don't need CONSISTENT_READ with NO_IO

Message ID 20171130164401.29221-1-kwolf@redhat.com
State New
Headers show
Series block: Formats don't need CONSISTENT_READ with NO_IO | expand

Commit Message

Kevin Wolf Nov. 30, 2017, 4:44 p.m. UTC
Commit 1f4ad7d fixed 'qemu-img info' for raw images that are currently
in use as a mirror target. It is not enough for image formats, though,
as these still unconditionally request BLK_PERM_CONSISTENT_READ.

As this permission is meaningless unless you do actual I/O on the image,
drop the requirement and allow 'qemu-img info' even for image formats
under conditions where BLK_PERM_CONSISTENT_READ can't be granted.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Eric Blake Nov. 30, 2017, 5:09 p.m. UTC | #1
On 11/30/2017 10:44 AM, Kevin Wolf wrote:
> Commit 1f4ad7d fixed 'qemu-img info' for raw images that are currently
> in use as a mirror target. It is not enough for image formats, though,
> as these still unconditionally request BLK_PERM_CONSISTENT_READ.
> 
> As this permission is meaningless unless you do actual I/O on the image,
> drop the requirement and allow 'qemu-img info' even for image formats
> under conditions where BLK_PERM_CONSISTENT_READ can't be granted.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 

> @@ -1936,7 +1938,9 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
>   
>           /* bs->file always needs to be consistent because of the metadata. We
>            * can never allow other users to resize or write to it. */
> -        perm |= BLK_PERM_CONSISTENT_READ;
> +        if (!(flags & BDRV_O_NO_IO)) {
> +            perm |= BLK_PERM_CONSISTENT_READ;

I thought BDRV_O_NO_IO only means we aren't doing I/O on guest-visible 
data, but doesn't stop us from reading the metadata.  The comment is 
telling: if we can read metadata, then we depend on CONSISTENT_READ for 
the metadata to be stable (even if we don't care about guest data 
consistency).
Kevin Wolf Nov. 30, 2017, 6:27 p.m. UTC | #2
Am 30.11.2017 um 18:09 hat Eric Blake geschrieben:
> On 11/30/2017 10:44 AM, Kevin Wolf wrote:
> > Commit 1f4ad7d fixed 'qemu-img info' for raw images that are currently
> > in use as a mirror target. It is not enough for image formats, though,
> > as these still unconditionally request BLK_PERM_CONSISTENT_READ.
> > 
> > As this permission is meaningless unless you do actual I/O on the image,
> > drop the requirement and allow 'qemu-img info' even for image formats
> > under conditions where BLK_PERM_CONSISTENT_READ can't be granted.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block.c | 6 +++++-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> 
> > @@ -1936,7 +1938,9 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
> >           /* bs->file always needs to be consistent because of the metadata. We
> >            * can never allow other users to resize or write to it. */
> > -        perm |= BLK_PERM_CONSISTENT_READ;
> > +        if (!(flags & BDRV_O_NO_IO)) {
> > +            perm |= BLK_PERM_CONSISTENT_READ;
> 
> I thought BDRV_O_NO_IO only means we aren't doing I/O on guest-visible data,
> but doesn't stop us from reading the metadata.  The comment is telling: if
> we can read metadata, then we depend on CONSISTENT_READ for the metadata to
> be stable (even if we don't care about guest data consistency).

Yes and no. The trouble is that at the file system level we have only a
single bit to describe the consistency of the whole image throughout the
whole block driver tree.

We forbid shared BLK_PERM_CONSISTENT_READ for mirror targets (which
aren't fully populated yet) and intermediate nodes for commit (which
expose corrupted data). Both scenarios are really about the data exposed
at the format layer, the metadata stays completely consistent.

The question is what we do with this as we propagate permissions down to
the protocol layer. Strictly speaking, the file at the protocol layer is
perfectly consistent, so we might not forbid BLK_PERM_CONSISTENT_READ
there. But I think it's more useful to do it anyway so that image
locking can prevent the typical case of another process that uses qcow2
over file-posix again, where the file-posix node could in theory be
considered consistent, but the qcow2 one wouldn't.

In the end, this is just a pragmatic way to let 'qemu-img info' work
while the image is a mirror target or intermediate node for commit, but
to forbid cases where corrupted data is used.

Or would you argue that either 'qemu-img info' shouldn't be working or
reading corrupted data should be allowed in other processes?

Kevin
Eric Blake Nov. 30, 2017, 7:05 p.m. UTC | #3
On 11/30/2017 12:27 PM, Kevin Wolf wrote:
>>> @@ -1936,7 +1938,9 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
>>>            /* bs->file always needs to be consistent because of the metadata. We
>>>             * can never allow other users to resize or write to it. */
>>> -        perm |= BLK_PERM_CONSISTENT_READ;
>>> +        if (!(flags & BDRV_O_NO_IO)) {
>>> +            perm |= BLK_PERM_CONSISTENT_READ;
>>
>> I thought BDRV_O_NO_IO only means we aren't doing I/O on guest-visible data,
>> but doesn't stop us from reading the metadata.  The comment is telling: if
>> we can read metadata, then we depend on CONSISTENT_READ for the metadata to
>> be stable (even if we don't care about guest data consistency).
> 
> Yes and no. The trouble is that at the file system level we have only a
> single bit to describe the consistency of the whole image throughout the
> whole block driver tree.
> 
> We forbid shared BLK_PERM_CONSISTENT_READ for mirror targets (which
> aren't fully populated yet) and intermediate nodes for commit (which
> expose corrupted data). Both scenarios are really about the data exposed
> at the format layer, the metadata stays completely consistent.

I guess it is the block of writes/resizes that prevents metadata from 
getting inconsistent; CONSISTENT_READ does indeed make more sense if 
interpreted solely in light of will the guest read consistent data (and 
not will the format layer see consistent contents from the protocol layer).

> 
> The question is what we do with this as we propagate permissions down to
> the protocol layer. Strictly speaking, the file at the protocol layer is
> perfectly consistent, so we might not forbid BLK_PERM_CONSISTENT_READ
> there. But I think it's more useful to do it anyway so that image
> locking can prevent the typical case of another process that uses qcow2
> over file-posix again, where the file-posix node could in theory be
> considered consistent, but the qcow2 one wouldn't.

Does that mean we need two separate permission bits, one for whether the 
guest-visible data is consistent, and one for whether the metadata is 
consistent? But as I mentioned above, blocking writes should mean that 
no one else is messing with metadata.

> 
> In the end, this is just a pragmatic way to let 'qemu-img info' work
> while the image is a mirror target or intermediate node for commit, but
> to forbid cases where corrupted data is used.
> 
> Or would you argue that either 'qemu-img info' shouldn't be working or
> reading corrupted data should be allowed in other processes?

Having qemu-img info work on a mirror target makes sense; as you pointed 
out, the metadata flushed to disk is consistent (may have leaked 
clusters, but not broken image) at any given point by the writer (it has 
to be, since the writer has to be able to recover the image if there is 
an abrupt restart).  And a second reader can still manage to see 
inconsistent data when reading two separate clusters if the timing is 
just right, regardless of whether the first writer is always able to see 
consistent data.  So to some extent, it's a measure of how risky is the 
action? qemu-img info is read-only, and most of the time will just work; 
it is very hard to get to the rare race condition where two consecutive 
reads raced with a parallel writer combine to result in incorrect 
interpretation of the data by the reader.

I'm not opposed to your patch, but am trying to make sure that I'm not 
overlooking any problem before giving R-b.  Maybe it's just that the 
comment needs updating in v2.
Kevin Wolf Dec. 1, 2017, 12:41 p.m. UTC | #4
Am 30.11.2017 um 20:05 hat Eric Blake geschrieben:
> On 11/30/2017 12:27 PM, Kevin Wolf wrote:
> > > > @@ -1936,7 +1938,9 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
> > > >            /* bs->file always needs to be consistent because of the metadata. We
> > > >             * can never allow other users to resize or write to it. */
> > > > -        perm |= BLK_PERM_CONSISTENT_READ;
> > > > +        if (!(flags & BDRV_O_NO_IO)) {
> > > > +            perm |= BLK_PERM_CONSISTENT_READ;
> > > 
> > > I thought BDRV_O_NO_IO only means we aren't doing I/O on guest-visible data,
> > > but doesn't stop us from reading the metadata.  The comment is telling: if
> > > we can read metadata, then we depend on CONSISTENT_READ for the metadata to
> > > be stable (even if we don't care about guest data consistency).
> > 
> > Yes and no. The trouble is that at the file system level we have only a
> > single bit to describe the consistency of the whole image throughout the
> > whole block driver tree.
> > 
> > We forbid shared BLK_PERM_CONSISTENT_READ for mirror targets (which
> > aren't fully populated yet) and intermediate nodes for commit (which
> > expose corrupted data). Both scenarios are really about the data exposed
> > at the format layer, the metadata stays completely consistent.
> 
> I guess it is the block of writes/resizes that prevents metadata from
> getting inconsistent; CONSISTENT_READ does indeed make more sense if
> interpreted solely in light of will the guest read consistent data
> (and not will the format layer see consistent contents from the
> protocol layer).

Yes, that's what the write/resize locks are meant for.

Consistent read is more about "this image may not contain the useful
data you're expecting".

> > The question is what we do with this as we propagate permissions down to
> > the protocol layer. Strictly speaking, the file at the protocol layer is
> > perfectly consistent, so we might not forbid BLK_PERM_CONSISTENT_READ
> > there. But I think it's more useful to do it anyway so that image
> > locking can prevent the typical case of another process that uses qcow2
> > over file-posix again, where the file-posix node could in theory be
> > considered consistent, but the qcow2 one wouldn't.
> 
> Does that mean we need two separate permission bits, one for whether
> the guest-visible data is consistent, and one for whether the metadata
> is consistent? But as I mentioned above, blocking writes should mean
> that no one else is messing with metadata.

Even this works only in the simple case of format over protocol.

I think this becomes pretty clear when you think of the (admittedly
unlikely) case of nested qcow2. Then you need at least a bit per layer,
one for the guest-visible data, one for the top-level qcow2 metadata and
another one for the second qcow2 metadata.

> > In the end, this is just a pragmatic way to let 'qemu-img info' work
> > while the image is a mirror target or intermediate node for commit, but
> > to forbid cases where corrupted data is used.
> > 
> > Or would you argue that either 'qemu-img info' shouldn't be working or
> > reading corrupted data should be allowed in other processes?
> 
> Having qemu-img info work on a mirror target makes sense; as you pointed
> out, the metadata flushed to disk is consistent (may have leaked clusters,
> but not broken image) at any given point by the writer (it has to be, since
> the writer has to be able to recover the image if there is an abrupt
> restart).

Exactly, that's the reasoning.

> And a second reader can still manage to see inconsistent data
> when reading two separate clusters if the timing is just right, regardless
> of whether the first writer is always able to see consistent data.  So to
> some extent, it's a measure of how risky is the action? qemu-img info is
> read-only, and most of the time will just work; it is very hard to get to
> the rare race condition where two consecutive reads raced with a parallel
> writer combine to result in incorrect interpretation of the data by the
> reader.

This again is more about the write permission. You still need 'qemu-img
info -U' because there is a concurrent writer, so you have to
acknowledge the risk you're mentioning.

> I'm not opposed to your patch, but am trying to make sure that I'm not
> overlooking any problem before giving R-b.  Maybe it's just that the
> comment needs updating in v2.

Do you have a suggestion for the comment?

Kevin
Eric Blake Dec. 1, 2017, 1:42 p.m. UTC | #5
On 12/01/2017 06:41 AM, Kevin Wolf wrote:
>>
>> I guess it is the block of writes/resizes that prevents metadata from
>> getting inconsistent; CONSISTENT_READ does indeed make more sense if
>> interpreted solely in light of will the guest read consistent data
>> (and not will the format layer see consistent contents from the
>> protocol layer).
> 
> Yes, that's what the write/resize locks are meant for.
> 
> Consistent read is more about "this image may not contain the useful
> data you're expecting".
> 

> 
>> I'm not opposed to your patch, but am trying to make sure that I'm not
>> overlooking any problem before giving R-b.  Maybe it's just that the
>> comment needs updating in v2.
> 
> Do you have a suggestion for the comment?

Perhaps:

Commit 1f4ad7d fixed 'qemu-img info' for raw images that are currently
in use as a mirror target. It is not enough for image formats, though,
as these still unconditionally request BLK_PERM_CONSISTENT_READ.

As this permission is geared towards whether the guest-visible data is 
consistent, and has no impact on whether the metadata is sane, and 
'qemu-img info' does not read guest-visible data (except for the raw 
format), it makes sense to not require BLK_PERM_CONSISTENT_READ if there 
is not going to be any guest I/O performed, regardless of image format.
diff mbox series

Patch

diff --git a/block.c b/block.c
index 9a1a0d1e73..ddb6836c52 100644
--- a/block.c
+++ b/block.c
@@ -1924,6 +1924,8 @@  void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
     assert(role == &child_backing || role == &child_file);
 
     if (!backing) {
+        int flags = bdrv_reopen_get_flags(reopen_queue, bs);
+
         /* Apart from the modifications below, the same permissions are
          * forwarded and left alone as for filters */
         bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
@@ -1936,7 +1938,9 @@  void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
 
         /* bs->file always needs to be consistent because of the metadata. We
          * can never allow other users to resize or write to it. */
-        perm |= BLK_PERM_CONSISTENT_READ;
+        if (!(flags & BDRV_O_NO_IO)) {
+            perm |= BLK_PERM_CONSISTENT_READ;
+        }
         shared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
     } else {
         /* We want consistent read from backing files if the parent needs it.