diff mbox series

[5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment

Message ID 20190403030526.12258-6-eblake@redhat.com
State New
Headers show
Series Final round of NBD alignment fixes | expand

Commit Message

Eric Blake April 3, 2019, 3:05 a.m. UTC
Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server; the bug is also present with the raw
format driver when probing occurs. Basically, if we specify a
particular alignment > 1, but defer the actual block status to the
underlying file, and the underlying file has a smaller alignment, then
the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
with status split at an alignment unacceptable to our layer.  Many
callers don't care, but NBD does - it is a violation of the NBD
protocol to send status or read results split on an unaligned boundary
(we've taught our client to be tolerant of such violations because of
qemu 3.1; but we still need to fix our server for the sake of other
stricter clients).

This patch lays the groundwork - it adjusts bdrv_block_status to
ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
the deferral is either truncated down to an aligned boundary, or
multiple sub-aligned requests are coalesced into a single
representative answer (using an implicit hole beyond EOF as
needed). Iotest 241 exposes the effect (when format probing occurred,
we don't want block status to subdivide the initial sector, and thus
not any other sector either). Similarly, iotest 221 is a good
candidate to amend to specifically track the effects; a change to a
hole at EOF is not visible if an upper layer does not want alignment
smaller than 512. However, this patch alone is not a complete fix - it
is still possible to have an active layer with large alignment
constraints backed by another layer with smaller constraints; so the
next patch will complete the task.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c                 | 108 +++++++++++++++++++++++++++++++++++--
 tests/qemu-iotests/221     |  10 ++++
 tests/qemu-iotests/221.out |   6 +++
 tests/qemu-iotests/241.out |   3 +-
 4 files changed, 122 insertions(+), 5 deletions(-)

Comments

Kevin Wolf April 3, 2019, 1:03 p.m. UTC | #1
Am 03.04.2019 um 05:05 hat Eric Blake geschrieben:
> Previous patches mentioned how the blkdebug filter driver demonstrates
> a bug present in our NBD server; the bug is also present with the raw
> format driver when probing occurs. Basically, if we specify a
> particular alignment > 1, but defer the actual block status to the
> underlying file, and the underlying file has a smaller alignment, then
> the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
> with status split at an alignment unacceptable to our layer.  Many
> callers don't care, but NBD does - it is a violation of the NBD
> protocol to send status or read results split on an unaligned boundary
> (we've taught our client to be tolerant of such violations because of
> qemu 3.1; but we still need to fix our server for the sake of other
> stricter clients).
> 
> This patch lays the groundwork - it adjusts bdrv_block_status to
> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
> the deferral is either truncated down to an aligned boundary, or
> multiple sub-aligned requests are coalesced into a single
> representative answer (using an implicit hole beyond EOF as
> needed). Iotest 241 exposes the effect (when format probing occurred,
> we don't want block status to subdivide the initial sector, and thus
> not any other sector either). Similarly, iotest 221 is a good
> candidate to amend to specifically track the effects; a change to a
> hole at EOF is not visible if an upper layer does not want alignment
> smaller than 512. However, this patch alone is not a complete fix - it
> is still possible to have an active layer with large alignment
> constraints backed by another layer with smaller constraints; so the
> next patch will complete the task.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/io.c                 | 108 +++++++++++++++++++++++++++++++++++--
>  tests/qemu-iotests/221     |  10 ++++
>  tests/qemu-iotests/221.out |   6 +++
>  tests/qemu-iotests/241.out |   3 +-
>  4 files changed, 122 insertions(+), 5 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index dfc153b8d82..936877d3745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2021,6 +2021,97 @@ int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
>      return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
>  }
> 
> +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero,
> +                                             int64_t offset, int64_t bytes,
> +                                             int64_t *pnum, int64_t *map,
> +                                             BlockDriverState **file);
> +
> +/*
> + * Returns an aligned allocation status of the specified sectors.
> + * Wrapper around bdrv_co_block_status() which requires the initial
> + * offset and count to be aligned, and guarantees the result will also
> + * be aligned (even if it requires piecing together multiple
> + * sub-aligned queries into an appropriate coalesced answer).
> + */

I think this comment should specify the result of the operation when the
aligned region consists of multiple subregions with different status.
Probably something like this:

- BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
- BDRV_BLOCK_ZERO is set if the flag is set for all subregions
- BDRV_BLOCK_OFFSET_VALID is set if the flag is set for all subregions,
  the provided offsets are contiguous and file is the same for all
  subregions.
- BDRV_BLOCK_ALLOCATED is never set here (the caller sets it)
- BDRV_BLOCK_EOF is set if the last subregion sets it; assert that it's
  not set for any other subregion
- BDRV_BLOCK_RAW is never set

- *map is only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
  the offset of the first subregion then.

- *file is also only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
  the *file of the subregions, which must be the same for all of them
  (otherwise, we wouldn't have set BDRV_BLOCK_OFFSET_VALID).

- *pnum: The sum of all subregions

> +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
> +                                                     uint32_t align,
> +                                                     bool want_zero,
> +                                                     int64_t offset,
> +                                                     int64_t bytes,
> +                                                     int64_t *pnum,
> +                                                     int64_t *map,
> +                                                     BlockDriverState **file)
> +{
> +    int ret;
> +
> +    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
> +    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    if (!*pnum) {
> +        assert(!bytes || ret & BDRV_BLOCK_EOF);
> +        return ret;
> +    }
> +    if (*pnum >= align) {
> +        if (!QEMU_IS_ALIGNED(*pnum, align)) {
> +            ret &= ~BDRV_BLOCK_EOF;
> +            *pnum = QEMU_ALIGN_DOWN(*pnum, align);
> +        }
> +        return ret;
> +    }

So we decide to just shorten the function if we can return at least some
*pnum > 0. This means that is most cases, this function will have to be
called twice, once for the aligned part, and again for the misaligned
rest.

We do save a little if the caller isn't actually interested in mapping
the whole image, but just in a specific range before the misaligned
part.

So it depends on the use case whether this is an optimisation or a
pessimisation.

> +    /* Merge in status for the rest of align */
> +    while (*pnum < align) {

Okay, in any case, I can see it's a simplification. :-)

> +        int ret2;
> +        int64_t pnum2;
> +        int64_t map2;
> +        BlockDriverState *file2;
> +
> +        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
> +                                    align - *pnum, &pnum2,
> +                                    map ? &map2 : NULL, file ? &file2 : NULL);
> +        if (ret2 < 0) {
> +            return ret2;
> +        }
> +        if (ret2 & BDRV_BLOCK_EOF) {
> +            ret |= BDRV_BLOCK_EOF;
> +            if (!pnum2) {
> +                pnum2 = align - *pnum;
> +                ret2 |= BDRV_BLOCK_ZERO;
> +            }
> +        } else {
> +            assert(pnum2);
> +        }
> +        if (ret2 & BDRV_BLOCK_DATA) {
> +            ret |= BDRV_BLOCK_DATA;
> +        }
> +        if (!(ret2 & BDRV_BLOCK_ZERO)) {
> +            ret &= ~BDRV_BLOCK_ZERO;
> +        }
> +        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
> +            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
> +             (map && *map + *pnum != map2) || (file && *file != file2))) {
> +            ret &= ~BDRV_BLOCK_OFFSET_VALID;
> +            if (map) {
> +                *map = 0;
> +            }
> +            if (file) {
> +                *file = NULL;
> +            }
> +        }
> +        if (ret2 & BDRV_BLOCK_ALLOCATED) {
> +            ret |= BDRV_BLOCK_ALLOCATED;
> +        }

Hm, so if we have a region that consists of two subregion, one is
unallocated and the other one is a zero cluster. The result will be
DATA == false, ZERO == false, ALLOCATED == true. This looks a bit odd.

Did you check this case and come to the conclusion that it's okay? If
so, I think a comment would be good.

> +        if (ret2 & BDRV_BLOCK_RAW) {
> +            assert(ret & BDRV_BLOCK_RAW);
> +            assert(ret & BDRV_BLOCK_OFFSET_VALID);
> +        }

Doesn't RAW mean that we need to recurse rather than returning the flag?
Or actually, bdrv_co_block_status() shouldn't ever return the flag, so
can't we assert that it's clear?

> +        *pnum += pnum2;
> +    }
> +    return ret;
> +}

Kevin
Eric Blake April 3, 2019, 2:02 p.m. UTC | #2
On 4/3/19 8:03 AM, Kevin Wolf wrote:
> Am 03.04.2019 um 05:05 hat Eric Blake geschrieben:
>> Previous patches mentioned how the blkdebug filter driver demonstrates
>> a bug present in our NBD server; the bug is also present with the raw
>> format driver when probing occurs. Basically, if we specify a
>> particular alignment > 1, but defer the actual block status to the
>> underlying file, and the underlying file has a smaller alignment, then
>> the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
>> with status split at an alignment unacceptable to our layer.  Many
>> callers don't care, but NBD does - it is a violation of the NBD
>> protocol to send status or read results split on an unaligned boundary
>> (we've taught our client to be tolerant of such violations because of
>> qemu 3.1; but we still need to fix our server for the sake of other
>> stricter clients).
>>
>> This patch lays the groundwork - it adjusts bdrv_block_status to
>> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
>> the deferral is either truncated down to an aligned boundary, or
>> multiple sub-aligned requests are coalesced into a single
>> representative answer (using an implicit hole beyond EOF as
>> needed). Iotest 241 exposes the effect (when format probing occurred,
>> we don't want block status to subdivide the initial sector, and thus
>> not any other sector either). Similarly, iotest 221 is a good
>> candidate to amend to specifically track the effects; a change to a
>> hole at EOF is not visible if an upper layer does not want alignment
>> smaller than 512. However, this patch alone is not a complete fix - it
>> is still possible to have an active layer with large alignment
>> constraints backed by another layer with smaller constraints; so the
>> next patch will complete the task.

I should probably update this text to call out that the next patch
introduces some additional mutual recursion (that is, this patch in
isolation may appear to have dead paths, as the caller
bdrv_co_block_status always passes non-NULL 'map' and 'file'; but the
next patch adjusts bdrv_co_block_status_above to call this instead of
directly into bdrv_co_block_status, hence exposing those paths).

>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  block/io.c                 | 108 +++++++++++++++++++++++++++++++++++--
>>  tests/qemu-iotests/221     |  10 ++++
>>  tests/qemu-iotests/221.out |   6 +++
>>  tests/qemu-iotests/241.out |   3 +-
>>  4 files changed, 122 insertions(+), 5 deletions(-)
>>

>> +/*
>> + * Returns an aligned allocation status of the specified sectors.
>> + * Wrapper around bdrv_co_block_status() which requires the initial
>> + * offset and count to be aligned, and guarantees the result will also
>> + * be aligned (even if it requires piecing together multiple
>> + * sub-aligned queries into an appropriate coalesced answer).
>> + */
> 
> I think this comment should specify the result of the operation when the
> aligned region consists of multiple subregions with different status.

Good idea.

> Probably something like this:
> 
> - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
> - BDRV_BLOCK_ZERO is set if the flag is set for all subregions
> - BDRV_BLOCK_OFFSET_VALID is set if the flag is set for all subregions,
>   the provided offsets are contiguous and file is the same for all
>   subregions.

Correct.

> - BDRV_BLOCK_ALLOCATED is never set here (the caller sets it)

Not true, bdrv_co_block_status can set BDRV_BLOCK_ALLOCATED.

> - BDRV_BLOCK_EOF is set if the last subregion sets it; assert that it's
>   not set for any other subregion

With the additional caveat that status beyond BDRV_BLOCK_EOF up to the
alignment boundary is treated as an implicit hole.

> - BDRV_BLOCK_RAW is never set

That should be true.

> 
> - *map is only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
>   the offset of the first subregion then.

Correct.  Since the subregions had to be contiguous, it is the correct
offset for the entire aligned region.

> 
> - *file is also only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
>   the *file of the subregions, which must be the same for all of them
>   (otherwise, we wouldn't have set BDRV_BLOCK_OFFSET_VALID).
> 
> - *pnum: The sum of all subregions

And is guaranteed to be aligned, as well as being non-zero unless
'bytes' was 0 on input or if the entire status request is beyond EOF.

> 
>> +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
>> +                                                     uint32_t align,
>> +                                                     bool want_zero,
>> +                                                     int64_t offset,
>> +                                                     int64_t bytes,
>> +                                                     int64_t *pnum,
>> +                                                     int64_t *map,
>> +                                                     BlockDriverState **file)
>> +{
>> +    int ret;
>> +
>> +    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
>> +    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +    if (!*pnum) {
>> +        assert(!bytes || ret & BDRV_BLOCK_EOF);
>> +        return ret;
>> +    }
>> +    if (*pnum >= align) {
>> +        if (!QEMU_IS_ALIGNED(*pnum, align)) {
>> +            ret &= ~BDRV_BLOCK_EOF;
>> +            *pnum = QEMU_ALIGN_DOWN(*pnum, align);
>> +        }
>> +        return ret;
>> +    }
> 
> So we decide to just shorten the function if we can return at least some
> *pnum > 0. This means that is most cases, this function will have to be
> called twice, once for the aligned part, and again for the misaligned
> rest.

Yes, that was my choice for ease of implementation this time around.

> 
> We do save a little if the caller isn't actually interested in mapping
> the whole image, but just in a specific range before the misaligned
> part.
> 
> So it depends on the use case whether this is an optimisation or a
> pessimisation.

Yeah, it looks like I should try harder for the BDRV_BLOCK_EOF case -
there, we KNOW that there is only one more subregion, and it will be a
hole; we also know that coalescing a hole with a hole is still a hole,
and coalescing a hole with data is data.  So special-casing
BDRV_BLOCK_EOF to widen out to alignment is probably better than splitting.

For cases where BDRV_BLOCK_EOF is not set, we have no easy guarantees
about how many subregions the next BDS will provide, nor if those
subsequent subregions will always coalesce into the type of the first
subregion, so splitting early is still probably the smartest approach
for keeping that code sane.

Sounds like I should also try harder to capture some of these nuances in
iotests (I did capture the mid-sector hole due to EOF, but capturing a
blkdebug at 4k alignment wrapping a qcow2 with 512 alignment will cover
more of the interesting coalescing code).


> 
>> +    /* Merge in status for the rest of align */
>> +    while (*pnum < align) {
> 
> Okay, in any case, I can see it's a simplification. :-)
> 
>> +        int ret2;
>> +        int64_t pnum2;
>> +        int64_t map2;
>> +        BlockDriverState *file2;
>> +
>> +        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
>> +                                    align - *pnum, &pnum2,
>> +                                    map ? &map2 : NULL, file ? &file2 : NULL);
>> +        if (ret2 < 0) {
>> +            return ret2;
>> +        }
>> +        if (ret2 & BDRV_BLOCK_EOF) {
>> +            ret |= BDRV_BLOCK_EOF;
>> +            if (!pnum2) {
>> +                pnum2 = align - *pnum;
>> +                ret2 |= BDRV_BLOCK_ZERO;
>> +            }
>> +        } else {
>> +            assert(pnum2);
>> +        }
>> +        if (ret2 & BDRV_BLOCK_DATA) {
>> +            ret |= BDRV_BLOCK_DATA;
>> +        }
>> +        if (!(ret2 & BDRV_BLOCK_ZERO)) {
>> +            ret &= ~BDRV_BLOCK_ZERO;
>> +        }
>> +        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
>> +            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
>> +             (map && *map + *pnum != map2) || (file && *file != file2))) {
>> +            ret &= ~BDRV_BLOCK_OFFSET_VALID;
>> +            if (map) {
>> +                *map = 0;
>> +            }
>> +            if (file) {
>> +                *file = NULL;
>> +            }
>> +        }
>> +        if (ret2 & BDRV_BLOCK_ALLOCATED) {
>> +            ret |= BDRV_BLOCK_ALLOCATED;
>> +        }
> 
> Hm, so if we have a region that consists of two subregion, one is
> unallocated and the other one is a zero cluster. The result will be
> DATA == false, ZERO == false, ALLOCATED == true. This looks a bit odd.

In the common case of 'DATA|ALLOCATED|EOF, ZERO' for the hole past EOF,
I think we want to prefer keeping ALLOCATED set. But you are asking
about '0, ZERO|ALLOCATED', where we cannot guarantee that the entire
region reads as zeroes, and we also know that the entire region is not
allocated but some of it is. Maybe this should pay attention to the
'want_zero' flag? We already state elsewhere that:

 * If 'want_zero' is true, the caller is querying for mapping
 * purposes, with a focus on valid BDRV_BLOCK_OFFSET_VALID, _DATA, and
 * _ZERO where possible; otherwise, the result favors larger 'pnum',
 * with a focus on accurate BDRV_BLOCK_ALLOCATED.

so that would mean that if want_zero is true, we treat the entire region
as unallocated if any subregion prior to EOF is unallocated (since we
can't report ZERO, and the caller didn't care about ALLOCATED); if
want_zero is false, we treat the entire region as allocated if any
subregion is allocated (since we know the caller cared about ALLOCATED,
and less about ZERO)?

Or does it mean we just have to audit all callers to see if any of them
would misbehave in one way or another depending on which way we set the
flag?

> Did you check this case and come to the conclusion that it's okay? If
> so, I think a comment would be good.
> 
>> +        if (ret2 & BDRV_BLOCK_RAW) {
>> +            assert(ret & BDRV_BLOCK_RAW);
>> +            assert(ret & BDRV_BLOCK_OFFSET_VALID);
>> +        }
> 
> Doesn't RAW mean that we need to recurse rather than returning the flag?
> Or actually, bdrv_co_block_status() shouldn't ever return the flag, so
> can't we assert that it's clear?

Yes, I think you are right that we can assert RAW is clear by this point.

> 
>> +        *pnum += pnum2;
>> +    }
>> +    return ret;
>> +}
> 
> Kevin
> 

So since I have to do a v2, this is definitely slipping into 4.1.
diff mbox series

Patch

diff --git a/block/io.c b/block/io.c
index dfc153b8d82..936877d3745 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2021,6 +2021,97 @@  int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
     return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
+                                             BlockDriverState **file);
+
+/*
+ * Returns an aligned allocation status of the specified sectors.
+ * Wrapper around bdrv_co_block_status() which requires the initial
+ * offset and count to be aligned, and guarantees the result will also
+ * be aligned (even if it requires piecing together multiple
+ * sub-aligned queries into an appropriate coalesced answer).
+ */
+static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
+                                                     uint32_t align,
+                                                     bool want_zero,
+                                                     int64_t offset,
+                                                     int64_t bytes,
+                                                     int64_t *pnum,
+                                                     int64_t *map,
+                                                     BlockDriverState **file)
+{
+    int ret;
+
+    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
+    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
+    if (ret < 0) {
+        return ret;
+    }
+    if (!*pnum) {
+        assert(!bytes || ret & BDRV_BLOCK_EOF);
+        return ret;
+    }
+    if (*pnum >= align) {
+        if (!QEMU_IS_ALIGNED(*pnum, align)) {
+            ret &= ~BDRV_BLOCK_EOF;
+            *pnum = QEMU_ALIGN_DOWN(*pnum, align);
+        }
+        return ret;
+    }
+    /* Merge in status for the rest of align */
+    while (*pnum < align) {
+        int ret2;
+        int64_t pnum2;
+        int64_t map2;
+        BlockDriverState *file2;
+
+        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
+                                    align - *pnum, &pnum2,
+                                    map ? &map2 : NULL, file ? &file2 : NULL);
+        if (ret2 < 0) {
+            return ret2;
+        }
+        if (ret2 & BDRV_BLOCK_EOF) {
+            ret |= BDRV_BLOCK_EOF;
+            if (!pnum2) {
+                pnum2 = align - *pnum;
+                ret2 |= BDRV_BLOCK_ZERO;
+            }
+        } else {
+            assert(pnum2);
+        }
+        if (ret2 & BDRV_BLOCK_DATA) {
+            ret |= BDRV_BLOCK_DATA;
+        }
+        if (!(ret2 & BDRV_BLOCK_ZERO)) {
+            ret &= ~BDRV_BLOCK_ZERO;
+        }
+        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
+            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
+             (map && *map + *pnum != map2) || (file && *file != file2))) {
+            ret &= ~BDRV_BLOCK_OFFSET_VALID;
+            if (map) {
+                *map = 0;
+            }
+            if (file) {
+                *file = NULL;
+            }
+        }
+        if (ret2 & BDRV_BLOCK_ALLOCATED) {
+            ret |= BDRV_BLOCK_ALLOCATED;
+        }
+        if (ret2 & BDRV_BLOCK_RAW) {
+            assert(ret & BDRV_BLOCK_RAW);
+            assert(ret & BDRV_BLOCK_OFFSET_VALID);
+        }
+        *pnum += pnum2;
+    }
+    return ret;
+}
+
 /*
  * Returns the allocation status of the specified sectors.
  * Drivers not implementing the functionality are assumed to not support
@@ -2121,6 +2212,19 @@  static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
      */
     assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
            align > offset - aligned_offset);
+
+    if (ret & BDRV_BLOCK_RAW) {
+        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
+        ret = bdrv_co_block_status_aligned(local_file, align, want_zero,
+                                           local_map, *pnum, pnum, &local_map,
+                                           &local_file);
+        if (ret < 0) {
+            goto out;
+        }
+        assert(!(ret & BDRV_BLOCK_RAW));
+        ret |= BDRV_BLOCK_RAW;
+    }
+
     *pnum -= offset - aligned_offset;
     if (*pnum > bytes) {
         *pnum = bytes;
@@ -2130,9 +2234,7 @@  static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     }

     if (ret & BDRV_BLOCK_RAW) {
-        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-        ret = bdrv_co_block_status(local_file, want_zero, local_map,
-                                   *pnum, pnum, &local_map, &local_file);
+        ret &= ~BDRV_BLOCK_RAW;
         goto out;
     }

diff --git a/tests/qemu-iotests/221 b/tests/qemu-iotests/221
index 808cd9a289c..3f60fd0fdc8 100755
--- a/tests/qemu-iotests/221
+++ b/tests/qemu-iotests/221
@@ -41,17 +41,27 @@  echo
 echo "=== Check mapping of unaligned raw image ==="
 echo

+# Note that when we enable format probing by omitting -f, the raw
+# layer forces 512-byte alignment and the bytes past EOF take on the
+# same status as the rest of the sector; otherwise, we can see the
+# implicit hole visible past EOF thanks to the block layer rounding
+# sizes up.
+
 _make_test_img 43009 # qemu-img create rounds size up
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map

 truncate --size=43009 "$TEST_IMG" # so we resize it and check again
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map

 $QEMU_IO -c 'w 43008 1' "$TEST_IMG" | _filter_qemu_io # writing also rounds up
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map

 truncate --size=43009 "$TEST_IMG" # so we resize it and check again
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map

 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
index a9c0190aadc..723e94037fe 100644
--- a/tests/qemu-iotests/221.out
+++ b/tests/qemu-iotests/221.out
@@ -5,12 +5,18 @@  QA output created by 221
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=43009
 [{ "start": 0, "length": 43520, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 [{ "start": 0, "length": 43520, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 43520, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 43520, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 1/1 bytes at offset 43008
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 40960, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+{ "start": 40960, "length": 2560, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 40960, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 40960, "length": 2049, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 43009, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 [{ "start": 0, "length": 40960, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+{ "start": 40960, "length": 2560, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 40960, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 40960, "length": 2049, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 43009, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 *** done
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index ef7de1205d2..fd856bb0c8d 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -22,8 +22,7 @@  WARNING: Image format was not specified for '/home/eblake/qemu/tests/qemu-iotest

   size:  1024
   min block: 1
-[{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}]
+[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)

 === Encrypted qcow2 file backed by unaligned raw image  ===