diff mbox series

[RFC] curl: Allow reading after EOF

Message ID 20210317151734.41656-1-kwolf@redhat.com
State New
Headers show
Series [RFC] curl: Allow reading after EOF | expand

Commit Message

Kevin Wolf March 17, 2021, 3:17 p.m. UTC
This makes the curl driver more consistent with file-posix in that it
doesn't return errors any more for reading after the end of the remote
file. Instead, zeros are returned for these areas.

This inconsistency was reported in:
https://bugzilla.redhat.com/show_bug.cgi?id=1935061

Note that the image used in this bug report has a corrupted snapshot
table, which means that the qcow2 driver tries to do a zero-length read
after EOF on its image file.

The old behaviour of the curl driver can hardly be called a bug, but the
inconsistency turned out to be confusing.

Reported-by: Alice Frosi <afrosi@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---

It is not entirely clear to me if this is something we want to do. If we
do care about consistency between protocol drivers, something like this
should probably be done in block/io.c eventually - but that would
require converting bs->total_sectors to byte granularity first.

Any opinions on what the most desirable semantics would be and whether
we should patch individual drivers until we can have a generic solution?

 block/curl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Eric Blake March 17, 2021, 3:32 p.m. UTC | #1
On 3/17/21 10:17 AM, Kevin Wolf wrote:
> This makes the curl driver more consistent with file-posix in that it
> doesn't return errors any more for reading after the end of the remote
> file. Instead, zeros are returned for these areas.
> 
> This inconsistency was reported in:
> https://bugzilla.redhat.com/show_bug.cgi?id=1935061
> 
> Note that the image used in this bug report has a corrupted snapshot
> table, which means that the qcow2 driver tries to do a zero-length read
> after EOF on its image file.
> 
> The old behaviour of the curl driver can hardly be called a bug, but the
> inconsistency turned out to be confusing.
> 
> Reported-by: Alice Frosi <afrosi@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> 
> It is not entirely clear to me if this is something we want to do. If we
> do care about consistency between protocol drivers, something like this
> should probably be done in block/io.c eventually - but that would
> require converting bs->total_sectors to byte granularity first.

Something that's been (low priority) on my todo list for a while.  NBD
has the same problem.

> 
> Any opinions on what the most desirable semantics would be and whether
> we should patch individual drivers until we can have a generic solution?

In nbdkit, we took the following approach in the 'truncate' driver:

If presented with an image that is not a multiple of the desired block
size, we round the image size up (corner cases for images with sizes
near 2^63 where rounding would wrap to negative; and since qemu enforces
a max image size at 2^63-2^32 to avoid 32-bit operations ever
overflowing).  Reads of the virtual tail come back as zero, writes to
the virtual tail are allowed if they would write zero into the tail, and
fail with ENOSPC otherwise.

Doing that in the block layer makes more sense than doing it per-driver.

Thus, I'm not sure if I'm a fan of this patch.

> 
>  block/curl.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 50e741a0d7..a8d87a1813 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -898,6 +898,7 @@ out:
>  static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>          uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
>  {
> +    BDRVCURLState *s = bs->opaque;
>      CURLAIOCB acb = {
>          .co = qemu_coroutine_self(),
>          .ret = -EINPROGRESS,
> @@ -906,6 +907,15 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>          .bytes = bytes
>      };
>  
> +    if (offset > s->len || bytes > s->len - offset) {
> +        uint64_t req_bytes = offset > s->len ? 0 : s->len - offset;
> +        qemu_iovec_memset(qiov, req_bytes, 0, bytes - req_bytes);
> +        bytes = req_bytes;
> +    }
> +    if (bytes == 0) {
> +        return 0;
> +    }
> +
>      curl_setup_preadv(bs, &acb);
>      while (acb.ret == -EINPROGRESS) {
>          qemu_coroutine_yield();
>
Eric Blake March 17, 2021, 3:46 p.m. UTC | #2
On 3/17/21 10:32 AM, Eric Blake wrote:
> On 3/17/21 10:17 AM, Kevin Wolf wrote:
>> This makes the curl driver more consistent with file-posix in that it
>> doesn't return errors any more for reading after the end of the remote
>> file. Instead, zeros are returned for these areas.
>>
>> This inconsistency was reported in:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1935061
>>
>> Note that the image used in this bug report has a corrupted snapshot
>> table, which means that the qcow2 driver tries to do a zero-length read
>> after EOF on its image file.
>>
>> The old behaviour of the curl driver can hardly be called a bug, but the
>> inconsistency turned out to be confusing.
>>
>> Reported-by: Alice Frosi <afrosi@redhat.com>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>>
>> It is not entirely clear to me if this is something we want to do. If we
>> do care about consistency between protocol drivers, something like this
>> should probably be done in block/io.c eventually - but that would
>> require converting bs->total_sectors to byte granularity first.
> 
> Something that's been (low priority) on my todo list for a while.  NBD
> has the same problem.

Actually, NBD has already been patched to fuzz around the lack of
byte-accurateness in the block layer; see commit 9cf638508.  So doing
something similar in the curl driver as a workaround until the block
layer does it for everyone is tolerable, but does not scale.

> 
>>
>> Any opinions on what the most desirable semantics would be and whether
>> we should patch individual drivers until we can have a generic solution?
> 
> In nbdkit, we took the following approach in the 'truncate' driver:
> 
> If presented with an image that is not a multiple of the desired block
> size, we round the image size up (corner cases for images with sizes
> near 2^63 where rounding would wrap to negative; and since qemu enforces
> a max image size at 2^63-2^32 to avoid 32-bit operations ever
> overflowing).  Reads of the virtual tail come back as zero, writes to
> the virtual tail are allowed if they would write zero into the tail, and
> fail with ENOSPC otherwise.

The current code in block/nbd.c does this for reads, but fails on EIO
without regards to the content of what is being attempted to write into
that tail.  I like the nbdkit behavior better.

> 
> Doing that in the block layer makes more sense than doing it per-driver.
> 
> Thus, I'm not sure if I'm a fan of this patch.
> 
>>
>>  block/curl.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/block/curl.c b/block/curl.c
>> index 50e741a0d7..a8d87a1813 100644
>> --- a/block/curl.c
>> +++ b/block/curl.c
>> @@ -898,6 +898,7 @@ out:
>>  static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>>          uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
>>  {
>> +    BDRVCURLState *s = bs->opaque;
>>      CURLAIOCB acb = {
>>          .co = qemu_coroutine_self(),
>>          .ret = -EINPROGRESS,
>> @@ -906,6 +907,15 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>>          .bytes = bytes
>>      };
>>  
>> +    if (offset > s->len || bytes > s->len - offset) {
>> +        uint64_t req_bytes = offset > s->len ? 0 : s->len - offset;
>> +        qemu_iovec_memset(qiov, req_bytes, 0, bytes - req_bytes);
>> +        bytes = req_bytes;

In nbd.c, I also have:
   if (offset >= client->info.size) {
        assert(bytes < BDRV_SECTOR_SIZE);

    if (offset + bytes > client->info.size) {
        assert(slop < BDRV_SECTOR_SIZE);

With those assertions added, I can give it

Reviewed-by: Eric Blake <eblake@redhat.com>

>> +    }
>> +    if (bytes == 0) {
>> +        return 0;
>> +    }
>> +
>>      curl_setup_preadv(bs, &acb);
>>      while (acb.ret == -EINPROGRESS) {
>>          qemu_coroutine_yield();
>>
>
Daniel P. Berrangé March 17, 2021, 4:12 p.m. UTC | #3
On Wed, Mar 17, 2021 at 04:17:34PM +0100, Kevin Wolf wrote:
> This makes the curl driver more consistent with file-posix in that it
> doesn't return errors any more for reading after the end of the remote
> file. Instead, zeros are returned for these areas.
> 
> This inconsistency was reported in:
> https://bugzilla.redhat.com/show_bug.cgi?id=1935061
> 
> Note that the image used in this bug report has a corrupted snapshot
> table, which means that the qcow2 driver tries to do a zero-length read
> after EOF on its image file.
> 
> The old behaviour of the curl driver can hardly be called a bug, but the
> inconsistency turned out to be confusing.
> 
> Reported-by: Alice Frosi <afrosi@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> 
> It is not entirely clear to me if this is something we want to do. If we
> do care about consistency between protocol drivers, something like this
> should probably be done in block/io.c eventually - but that would
> require converting bs->total_sectors to byte granularity first.
> 
> Any opinions on what the most desirable semantics would be and whether
> we should patch individual drivers until we can have a generic solution?

What valid scenarios are there for wanting to read beyond the bounds
of the protocol driver storage ? Why was file-posix allowing this
so far ?

If I've given file-posix a 10 GB plain file or device and something
requests a read from the 11 GB offset, IMHO, that is a sign of serious
error somewhere and possible impending doom.

For writable storage, I would think that read + write should be
symmetric, by which I mean if a read() at a particular offset
succeeds, then I would also expect a write() at the same offset to
succeed, and have its data later returned by a read().

We generally can't write at an offset beyond the storage (unless we
are intending to auto-enlarge a plain file), so I think we shouldn't
allow reads either.

> 
> diff --git a/block/curl.c b/block/curl.c
> index 50e741a0d7..a8d87a1813 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -898,6 +898,7 @@ out:
>  static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>          uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
>  {
> +    BDRVCURLState *s = bs->opaque;
>      CURLAIOCB acb = {
>          .co = qemu_coroutine_self(),
>          .ret = -EINPROGRESS,
> @@ -906,6 +907,15 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
>          .bytes = bytes
>      };
>  
> +    if (offset > s->len || bytes > s->len - offset) {
> +        uint64_t req_bytes = offset > s->len ? 0 : s->len - offset;
> +        qemu_iovec_memset(qiov, req_bytes, 0, bytes - req_bytes);
> +        bytes = req_bytes;
> +    }
> +    if (bytes == 0) {
> +        return 0;
> +    }
> +
>      curl_setup_preadv(bs, &acb);
>      while (acb.ret == -EINPROGRESS) {
>          qemu_coroutine_yield();
> -- 
> 2.30.2
> 
> 

Regards,
Daniel
Kevin Wolf March 17, 2021, 4:38 p.m. UTC | #4
Am 17.03.2021 um 16:46 hat Eric Blake geschrieben:
> On 3/17/21 10:32 AM, Eric Blake wrote:
> > On 3/17/21 10:17 AM, Kevin Wolf wrote:
> >> This makes the curl driver more consistent with file-posix in that it
> >> doesn't return errors any more for reading after the end of the remote
> >> file. Instead, zeros are returned for these areas.
> >>
> >> This inconsistency was reported in:
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1935061
> >>
> >> Note that the image used in this bug report has a corrupted snapshot
> >> table, which means that the qcow2 driver tries to do a zero-length read
> >> after EOF on its image file.
> >>
> >> The old behaviour of the curl driver can hardly be called a bug, but the
> >> inconsistency turned out to be confusing.
> >>
> >> Reported-by: Alice Frosi <afrosi@redhat.com>
> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >> ---
> >>
> >> It is not entirely clear to me if this is something we want to do. If we
> >> do care about consistency between protocol drivers, something like this
> >> should probably be done in block/io.c eventually - but that would
> >> require converting bs->total_sectors to byte granularity first.
> > 
> > Something that's been (low priority) on my todo list for a while.  NBD
> > has the same problem.
> 
> Actually, NBD has already been patched to fuzz around the lack of
> byte-accurateness in the block layer; see commit 9cf638508.  So doing
> something similar in the curl driver as a workaround until the block
> layer does it for everyone is tolerable, but does not scale.

Right, duplicating the same code in every protocol driver wouldn't be
great.

> >> Any opinions on what the most desirable semantics would be and whether
> >> we should patch individual drivers until we can have a generic solution?
> > 
> > In nbdkit, we took the following approach in the 'truncate' driver:
> > 
> > If presented with an image that is not a multiple of the desired block
> > size, we round the image size up (corner cases for images with sizes
> > near 2^63 where rounding would wrap to negative; and since qemu enforces
> > a max image size at 2^63-2^32 to avoid 32-bit operations ever
> > overflowing).  Reads of the virtual tail come back as zero, writes to
> > the virtual tail are allowed if they would write zero into the tail, and
> > fail with ENOSPC otherwise.
> 
> The current code in block/nbd.c does this for reads, but fails on EIO
> without regards to the content of what is being attempted to write into
> that tail.  I like the nbdkit behavior better.

I don't, honestly. Making the success of a write depend on the content
written is just too surprising. Is there a real use case behind this
that would justify it?

I think the file-posix behaviour (which is really just the kernel
behaviour) is the most sane: On regular files, which can be extended,
extend the image file on writes. On block devices, return -ENOSPC.

> > Doing that in the block layer makes more sense than doing it per-driver.
> > 
> > Thus, I'm not sure if I'm a fan of this patch.

What we need to decide is the semantics that we want to have.

If the behaviour implemented by this patch (or a similar patch) is what
we want, then I would say we should commit it, even if we're planning to
refactor things in the long term so that it's implemented in the generic
block layer.

> >>
> >>  block/curl.c | 10 ++++++++++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/block/curl.c b/block/curl.c
> >> index 50e741a0d7..a8d87a1813 100644
> >> --- a/block/curl.c
> >> +++ b/block/curl.c
> >> @@ -898,6 +898,7 @@ out:
> >>  static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> >>          uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
> >>  {
> >> +    BDRVCURLState *s = bs->opaque;
> >>      CURLAIOCB acb = {
> >>          .co = qemu_coroutine_self(),
> >>          .ret = -EINPROGRESS,
> >> @@ -906,6 +907,15 @@ static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
> >>          .bytes = bytes
> >>      };
> >>  
> >> +    if (offset > s->len || bytes > s->len - offset) {
> >> +        uint64_t req_bytes = offset > s->len ? 0 : s->len - offset;
> >> +        qemu_iovec_memset(qiov, req_bytes, 0, bytes - req_bytes);
> >> +        bytes = req_bytes;
> 
> In nbd.c, I also have:
>    if (offset >= client->info.size) {
>         assert(bytes < BDRV_SECTOR_SIZE);
> 
>     if (offset + bytes > client->info.size) {
>         assert(slop < BDRV_SECTOR_SIZE);
> 
> With those assertions added, I can give it

Hm, that requires splitting it into two blocks. Maybe I should just copy
the whole thing from the NBD driver. Makes it a bit longer, but then
it's more obviously doing the same.

Anyway, these assertions show that we have already decided on the
desired behaviour for reads because we already read zeros in
bdrv_aligned_preadv() as long as things are sector aligned. Changing
bs->total_sectors to bytes would result in the same behaviour as this
patch.

What we can try is how hard this conversion is really. If we can just do
it right away, it's probably preferable.

Kevin
Kevin Wolf March 17, 2021, 4:43 p.m. UTC | #5
Am 17.03.2021 um 17:12 hat Daniel P. Berrangé geschrieben:
> On Wed, Mar 17, 2021 at 04:17:34PM +0100, Kevin Wolf wrote:
> > This makes the curl driver more consistent with file-posix in that it
> > doesn't return errors any more for reading after the end of the remote
> > file. Instead, zeros are returned for these areas.
> > 
> > This inconsistency was reported in:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1935061
> > 
> > Note that the image used in this bug report has a corrupted snapshot
> > table, which means that the qcow2 driver tries to do a zero-length read
> > after EOF on its image file.
> > 
> > The old behaviour of the curl driver can hardly be called a bug, but the
> > inconsistency turned out to be confusing.
> > 
> > Reported-by: Alice Frosi <afrosi@redhat.com>
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> > 
> > It is not entirely clear to me if this is something we want to do. If we
> > do care about consistency between protocol drivers, something like this
> > should probably be done in block/io.c eventually - but that would
> > require converting bs->total_sectors to byte granularity first.
> > 
> > Any opinions on what the most desirable semantics would be and whether
> > we should patch individual drivers until we can have a generic solution?
> 
> What valid scenarios are there for wanting to read beyond the bounds
> of the protocol driver storage ? Why was file-posix allowing this
> so far ?
> 
> If I've given file-posix a 10 GB plain file or device and something
> requests a read from the 11 GB offset, IMHO, that is a sign of serious
> error somewhere and possible impending doom.
> 
> For writable storage, I would think that read + write should be
> symmetric, by which I mean if a read() at a particular offset
> succeeds, then I would also expect a write() at the same offset to
> succeed, and have its data later returned by a read().
> 
> We generally can't write at an offset beyond the storage (unless we
> are intending to auto-enlarge a plain file), so I think we shouldn't
> allow reads either.

It is definitely related to format drivers that grow their image files.
I think the reason for allowing this may have been that with O_DIRECT,
you need aligned requests and when format drivers write just a few
bytes, we actually do a RMW - and you don't want to get an error during
the read part just because the image file will only be resized by the
write.

Since curl is a read-only protocol driver (at the moment, I actually
have an experimental branch that adds write support so we can run
iotests for http), this reason doesn't really apply. At the moment, it
would be just for consistency.

Kevin
Eric Blake March 17, 2021, 5:29 p.m. UTC | #6
On 3/17/21 11:43 AM, Kevin Wolf wrote:
>>> It is not entirely clear to me if this is something we want to do. If we
>>> do care about consistency between protocol drivers, something like this
>>> should probably be done in block/io.c eventually - but that would
>>> require converting bs->total_sectors to byte granularity first.
>>>
>>> Any opinions on what the most desirable semantics would be and whether
>>> we should patch individual drivers until we can have a generic solution?
>>
>> What valid scenarios are there for wanting to read beyond the bounds
>> of the protocol driver storage ? Why was file-posix allowing this
>> so far ?
>>

Our block driver already filters all reads larger than the image size
rounded to the nearest sector; so this discussion is ONLY about the 511
bytes possible in an unaligned file at the protocol layer and its
rounded-up size at the block layer.

>> If I've given file-posix a 10 GB plain file or device and something
>> requests a read from the 11 GB offset, IMHO, that is a sign of serious
>> error somewhere and possible impending doom.

The block layer won't permit that; it's too far beyond the 511 bytes of
rounding up a sector-unaligned image.

>>
>> For writable storage, I would think that read + write should be
>> symmetric, by which I mean if a read() at a particular offset
>> succeeds, then I would also expect a write() at the same offset to
>> succeed, and have its data later returned by a read().
>>
>> We generally can't write at an offset beyond the storage (unless we
>> are intending to auto-enlarge a plain file), so I think we shouldn't
>> allow reads either.
> 
> It is definitely related to format drivers that grow their image files.
> I think the reason for allowing this may have been that with O_DIRECT,
> you need aligned requests and when format drivers write just a few
> bytes, we actually do a RMW - and you don't want to get an error during
> the read part just because the image file will only be resized by the
> write.

I like the nbdkit behavior for symmetry: since we can read the tail as
zero, allowing write as zero does not change the size but also avoids an
ENOSPC, while allowing the guest full control over the bytes prior to
the unaligned tail.  But I can also live with the symmetry of reads from
the final sector see zero, but writes to the final sector fail
(basically, the final sector becomes read-only, even if the rest of the
image is writable).

> 
> Since curl is a read-only protocol driver (at the moment, I actually
> have an experimental branch that adds write support so we can run
> iotests for http), this reason doesn't really apply. At the moment, it
> would be just for consistency.
diff mbox series

Patch

diff --git a/block/curl.c b/block/curl.c
index 50e741a0d7..a8d87a1813 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -898,6 +898,7 @@  out:
 static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
         uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
+    BDRVCURLState *s = bs->opaque;
     CURLAIOCB acb = {
         .co = qemu_coroutine_self(),
         .ret = -EINPROGRESS,
@@ -906,6 +907,15 @@  static int coroutine_fn curl_co_preadv(BlockDriverState *bs,
         .bytes = bytes
     };
 
+    if (offset > s->len || bytes > s->len - offset) {
+        uint64_t req_bytes = offset > s->len ? 0 : s->len - offset;
+        qemu_iovec_memset(qiov, req_bytes, 0, bytes - req_bytes);
+        bytes = req_bytes;
+    }
+    if (bytes == 0) {
+        return 0;
+    }
+
     curl_setup_preadv(bs, &acb);
     while (acb.ret == -EINPROGRESS) {
         qemu_coroutine_yield();