[v8,09/21] null: Switch to .bdrv_co_block_status()

Message ID 20180213202701.15858-10-eblake@redhat.com
State New
Headers show
Series
  • add byte-based block_status driver callbacks
Related show

Commit Message

Eric Blake Feb. 13, 2018, 8:26 p.m.
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the null driver accordingly.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>

---
v6-v7: no change
v5: minor fix to type of 'ret'
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping parameter
---
 block/null.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

Comments

Kevin Wolf Feb. 14, 2018, 12:05 p.m. | #1
Am 13.02.2018 um 21:26 hat Eric Blake geschrieben:
> We are gradually moving away from sector-based interfaces, towards
> byte-based.  Update the null driver accordingly.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Reviewed-by: Fam Zheng <famz@redhat.com>
> 
> ---
> v6-v7: no change
> v5: minor fix to type of 'ret'
> v4: rebase to interface tweak
> v3: no change
> v2: rebase to mapping parameter
> ---
>  block/null.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/block/null.c b/block/null.c
> index 214d394fff4..806a8631e4d 100644
> --- a/block/null.c
> +++ b/block/null.c
> @@ -223,22 +223,23 @@ static int null_reopen_prepare(BDRVReopenState *reopen_state,
>      return 0;
>  }
> 
> -static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
> -                                                     int64_t sector_num,
> -                                                     int nb_sectors, int *pnum,
> -                                                     BlockDriverState **file)
> +static int coroutine_fn null_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero, int64_t offset,
> +                                             int64_t bytes, int64_t *pnum,
> +                                             int64_t *map,
> +                                             BlockDriverState **file)
>  {
>      BDRVNullState *s = bs->opaque;
> -    off_t start = sector_num * BDRV_SECTOR_SIZE;
> +    int ret = BDRV_BLOCK_OFFSET_VALID;
> 
> -    *pnum = nb_sectors;
> +    *pnum = bytes;
> +    *map = offset;
>      *file = bs;
> 
>      if (s->read_zeroes) {
> -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
> -    } else {
> -        return BDRV_BLOCK_OFFSET_VALID | start;
> +        ret |= BDRV_BLOCK_ZERO;
>      }
> +    return ret;
>  }

Preexisting, but I think this return value is wrong. OFFSET_VALID
without DATA is to documented to have the following semantics:

 * DATA ZERO OFFSET_VALID
 *  f    t        t       sectors preallocated, read as zero, returned file not
 *                        necessarily zero at offset
 *  f    f        t       sectors preallocated but read from backing_hd,
 *                        returned file contains garbage at offset

I'm not sure what OFFSET_VALID is even supposed to mean for null.

Or in fact, what it is supposed to mean for any protocol driver, because
normally it just means I can use this offset for accessing bs->file. But
protocol drivers don't have a bs->file, so it's interesting to see that
they still all set this flag.

OFFSET_VALID | DATA might be excusable because I can see that it's
convenient that a protocol driver refers to itself as *file instead of
returning NULL there and then the offset is valid (though it would be
pointless to actually follow the file pointer), but OFFSET_VALID without
DATA probably isn't.

Kevin
Eric Blake Feb. 14, 2018, 2:44 p.m. | #2
On 02/14/2018 06:05 AM, Kevin Wolf wrote:
> Am 13.02.2018 um 21:26 hat Eric Blake geschrieben:
>> We are gradually moving away from sector-based interfaces, towards
>> byte-based.  Update the null driver accordingly.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Reviewed-by: Fam Zheng <famz@redhat.com>
>>

>>       if (s->read_zeroes) {
>> -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
>> -    } else {
>> -        return BDRV_BLOCK_OFFSET_VALID | start;
>> +        ret |= BDRV_BLOCK_ZERO;
>>       }
>> +    return ret;
>>   }
> 
> Preexisting, but I think this return value is wrong. OFFSET_VALID
> without DATA is to documented to have the following semantics:
> 
>   * DATA ZERO OFFSET_VALID
>   *  f    t        t       sectors preallocated, read as zero, returned file not
>   *                        necessarily zero at offset
>   *  f    f        t       sectors preallocated but read from backing_hd,
>   *                        returned file contains garbage at offset
> 
> I'm not sure what OFFSET_VALID is even supposed to mean for null.

Yeah, and I was even thinking about that a bit yesterday when figuring 
out what to do with nvme.  It does highlight the fact that you get 
garbage when reading from the null driver (unless the zero option was 
enabled, then ZERO is set and you know you read zeros instead) - but 
there no pointer that is preallocated (whether it contains garbage or 
otherwise) that you can actually dereference to read what the guest 
would see.

> 
> Or in fact, what it is supposed to mean for any protocol driver, because
> normally it just means I can use this offset for accessing bs->file. But
> protocol drivers don't have a bs->file, so it's interesting to see that
> they still all set this flag.
> 
> OFFSET_VALID | DATA might be excusable because I can see that it's
> convenient that a protocol driver refers to itself as *file instead of
> returning NULL there and then the offset is valid (though it would be
> pointless to actually follow the file pointer), but OFFSET_VALID without
> DATA probably isn't.

Hmm, you're probably right.  Maybe that means I should tweak the 
documentation to be more explicit: for a format driver, OFFSET_VALID can 
always be used (and *file will be set to the underlying protocol 
driver); but for a protocol driver, OFFSET_VALID only makes sense if 
*file is the BDS itself and there is an actual buffer to read (that is, 
the protocol driver must also be returning DATA and/or ZERO).  Or maybe 
we can indeed state that protocol drivers always set *file to NULL 
(there is no further backing file to reference), and thus never need to 
return OFFSET_VALID (but I'm not sure whether that will accidentally 
propagate back up the call stack and negatively affect status queries of 
format drivers).

Since it is pre-existing, should I respin to address the issue in a 
separate patch, or should that be a followup after this series?
Kevin Wolf Feb. 14, 2018, 2:55 p.m. | #3
Am 14.02.2018 um 15:44 hat Eric Blake geschrieben:
> On 02/14/2018 06:05 AM, Kevin Wolf wrote:
> > Am 13.02.2018 um 21:26 hat Eric Blake geschrieben:
> > > We are gradually moving away from sector-based interfaces, towards
> > > byte-based.  Update the null driver accordingly.
> > > 
> > > Signed-off-by: Eric Blake <eblake@redhat.com>
> > > Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > Reviewed-by: Fam Zheng <famz@redhat.com>
> > > 
> 
> > >       if (s->read_zeroes) {
> > > -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
> > > -    } else {
> > > -        return BDRV_BLOCK_OFFSET_VALID | start;
> > > +        ret |= BDRV_BLOCK_ZERO;
> > >       }
> > > +    return ret;
> > >   }
> > 
> > Preexisting, but I think this return value is wrong. OFFSET_VALID
> > without DATA is to documented to have the following semantics:
> > 
> >   * DATA ZERO OFFSET_VALID
> >   *  f    t        t       sectors preallocated, read as zero, returned file not
> >   *                        necessarily zero at offset
> >   *  f    f        t       sectors preallocated but read from backing_hd,
> >   *                        returned file contains garbage at offset
> > 
> > I'm not sure what OFFSET_VALID is even supposed to mean for null.
> 
> Yeah, and I was even thinking about that a bit yesterday when figuring out
> what to do with nvme.  It does highlight the fact that you get garbage when
> reading from the null driver (unless the zero option was enabled, then ZERO
> is set and you know you read zeros instead) - but there no pointer that is
> preallocated (whether it contains garbage or otherwise) that you can
> actually dereference to read what the guest would see.
> 
> > 
> > Or in fact, what it is supposed to mean for any protocol driver, because
> > normally it just means I can use this offset for accessing bs->file. But
> > protocol drivers don't have a bs->file, so it's interesting to see that
> > they still all set this flag.
> > 
> > OFFSET_VALID | DATA might be excusable because I can see that it's
> > convenient that a protocol driver refers to itself as *file instead of
> > returning NULL there and then the offset is valid (though it would be
> > pointless to actually follow the file pointer), but OFFSET_VALID without
> > DATA probably isn't.
> 
> Hmm, you're probably right.  Maybe that means I should tweak the
> documentation to be more explicit: for a format driver, OFFSET_VALID can
> always be used (and *file will be set to the underlying protocol driver);
> but for a protocol driver, OFFSET_VALID only makes sense if *file is the BDS
> itself and there is an actual buffer to read (that is, the protocol driver
> must also be returning DATA and/or ZERO).  Or maybe we can indeed state that
> protocol drivers always set *file to NULL (there is no further backing file
> to reference), and thus never need to return OFFSET_VALID (but I'm not sure
> whether that will accidentally propagate back up the call stack and
> negatively affect status queries of format drivers).
> 
> Since it is pre-existing, should I respin to address the issue in a separate
> patch, or should that be a followup after this series?

It's a more fundamental question that shouldn't hold up this series. I
just wanted to raise it while I was looking at it. So yes, a followup is
fine.

Kevin
Eric Blake Feb. 23, 2018, 4:43 p.m. | #4
On 02/14/2018 06:05 AM, Kevin Wolf wrote:

>> +static int coroutine_fn null_co_block_status(BlockDriverState *bs,

>>       if (s->read_zeroes) {
>> -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
>> -    } else {
>> -        return BDRV_BLOCK_OFFSET_VALID | start;
>> +        ret |= BDRV_BLOCK_ZERO;
>>       }
>> +    return ret;
>>   }
> 
> Preexisting, but I think this return value is wrong. OFFSET_VALID
> without DATA is to documented to have the following semantics:
> 
>   * DATA ZERO OFFSET_VALID
>   *  f    t        t       sectors preallocated, read as zero, returned file not
>   *                        necessarily zero at offset
>   *  f    f        t       sectors preallocated but read from backing_hd,
>   *                        returned file contains garbage at offset
> 
> I'm not sure what OFFSET_VALID is even supposed to mean for null.

I'm finally getting around to playing with this.

> 
> Or in fact, what it is supposed to mean for any protocol driver, because
> normally it just means I can use this offset for accessing bs->file. But > protocol drivers don't have a bs->file, so it's interesting to see that
> they still all set this flag.

More precisely, it means "I can use this offset for accessing the 
returned *file".  Format and filter drivers set *file = bs->file (ie. 
their protocol layer), but protocol drivers set *file = bs (ie. 
themselves).  As long as you read it as "the offset is valid in the 
returned *file", and are careful as to _which_ BDS gets returned in 
*file*, it can still make sense.

So next I tried playing with a patch, to see how much returning 
OFFSET_VALID with DATA matters; and it turns out is is easily observable 
anywhere that the underlying protocol bleeds through to the format layer 
(particularly the raw format driver):

$ echo abc > tmp
$ truncate --size=10M tmp

pre-patch:
$ ./qemu-img map --output=json tmp
[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, 
"offset": 0},
{ "start": 4096, "length": 10481664, "depth": 0, "zero": true, "data": 
false, "offset": 4096}]

turn off OFFSET_VALID at the protocol layer:
diff --git i/block/file-posix.c w/block/file-posix.c
index f1591c38490..c05992c1121 100644
--- i/block/file-posix.c
+++ w/block/file-posix.c
@@ -2158,9 +2158,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,

      if (!want_zero) {
          *pnum = bytes;
-        *map = offset;
-        *file = bs;
-        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+        return BDRV_BLOCK_DATA;
      }

      ret = find_allocation(bs, offset, &data, &hole);
@@ -2183,9 +2181,7 @@ static int coroutine_fn 
raw_co_block_status(BlockDriverState *bs,
          *pnum = MIN(bytes, data - offset);
          ret = BDRV_BLOCK_ZERO;
      }
-    *map = offset;
-    *file = bs;
-    return ret | BDRV_BLOCK_OFFSET_VALID;
+    return ret;
  }

  static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs,


post-patch:
$ ./qemu-img map --output=json tmp
[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true},
{ "start": 4096, "length": 10481664, "depth": 0, "zero": true, "data": 
false}]


> 
> OFFSET_VALID | DATA might be excusable because I can see that it's
> convenient that a protocol driver refers to itself as *file instead of
> returning NULL there and then the offset is valid (though it would be
> pointless to actually follow the file pointer), but OFFSET_VALID without
> DATA probably isn't.

So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but 
necessary to avoid breaking qemu-img map output.  But you are also right 
that OFFSET_VALID without data makes little sense at a protocol layer. 
So with that in mind, I'm auditing all of the protocol layers to make 
sure OFFSET_VALID ends up as something sane.
Kevin Wolf Feb. 23, 2018, 5:05 p.m. | #5
Am 23.02.2018 um 17:43 hat Eric Blake geschrieben:
> > OFFSET_VALID | DATA might be excusable because I can see that it's
> > convenient that a protocol driver refers to itself as *file instead of
> > returning NULL there and then the offset is valid (though it would be
> > pointless to actually follow the file pointer), but OFFSET_VALID without
> > DATA probably isn't.
> 
> So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
> necessary to avoid breaking qemu-img map output.  But you are also right
> that OFFSET_VALID without data makes little sense at a protocol layer. So
> with that in mind, I'm auditing all of the protocol layers to make sure
> OFFSET_VALID ends up as something sane.

That's one way to look at it.

The other way is that qemu-img map shouldn't ask the protocol layer for
its offset because it already knows the offset (it is what it passes as
a parameter to bdrv_co_block_status).

Anyway, it's probably not worth changing the interface, we should just
make sure that the return values of the individual drivers are
consistent.

Kevin
Eric Blake Feb. 23, 2018, 11:38 p.m. | #6
On 02/23/2018 11:05 AM, Kevin Wolf wrote:
> Am 23.02.2018 um 17:43 hat Eric Blake geschrieben:
>>> OFFSET_VALID | DATA might be excusable because I can see that it's
>>> convenient that a protocol driver refers to itself as *file instead of
>>> returning NULL there and then the offset is valid (though it would be
>>> pointless to actually follow the file pointer), but OFFSET_VALID without
>>> DATA probably isn't.
>>
>> So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
>> necessary to avoid breaking qemu-img map output.  But you are also right
>> that OFFSET_VALID without data makes little sense at a protocol layer. So
>> with that in mind, I'm auditing all of the protocol layers to make sure
>> OFFSET_VALID ends up as something sane.
> 
> That's one way to look at it.
> 
> The other way is that qemu-img map shouldn't ask the protocol layer for
> its offset because it already knows the offset (it is what it passes as
> a parameter to bdrv_co_block_status).
> 
> Anyway, it's probably not worth changing the interface, we should just
> make sure that the return values of the individual drivers are
> consistent.

Yet another inconsistency, and it's making me scratch my head today.

By the way, in my byte-based stuff that is now pending on your tree, I 
tried hard to NOT change semantics or the set of flags returned by a 
given driver, and we agreed that's why you'd accept the series as-is and 
make me do this followup exercise.  But it's looking like my followups 
may end up touching a lot of the same drivers again, now that I'm 
looking at what the semantics SHOULD be (and whatever I do end up 
tweaking, I will at least make sure that iotests is still happy with it).

First, let's read what states the NBD spec is proposing:

> It defines the following flags for the flags field:
> 
>     NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and future writes to that area may cause fragmentation or encounter an ENOSPC error); if clear, the block is allocated or the server could not otherwise determine its status. Note that the use of NBD_CMD_TRIM is related to this status, but that the server MAY report a hole even where NBD_CMD_TRIM has not been requested, and also that a server MAY report that the block is allocated even where NBD_CMD_TRIM has been requested.
>     NBD_STATE_ZERO (bit 1): if set, the block contents read as all zeroes; if clear, the block contents are not known. Note that the use of NBD_CMD_WRITE_ZEROES is related to this status, but that the server MAY report zeroes even where NBD_CMD_WRITE_ZEROES has not been requested, and also that a server MAY report unknown content even where NBD_CMD_WRITE_ZEROES has been requested.
> 
> It is not an error for a server to report that a region of the export has both NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of such an area are undefined, and a client reading such an area should make no assumption as to its contents or stability.

So here's how Vladimir proposed implementing it in his series (written 
before my byte-based block status stuff went in to your tree):
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04038.html

Server side (3/9):

+        int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes, 
&num,
+                                          NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+
+        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
+                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);

Client side (6/9):

+    *pnum = extent.length >> BDRV_SECTOR_BITS;
+    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
+           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);

Does anything there strike you as odd?  In isolation, they seemed fine 
to me, but side-by-side, I'm scratching my head: the server queries the 
block layer, and turns BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the 
client side then takes the NBD protocol and tries to turn it back into 
information to feed the block layer, where !NBD_STATE_HOLE now feeds 
BDRV_BLOCK_DATA.  Why the different choice of bits?

Part of the story is that right now, we document that ONLY the block 
layer sets _ALLOCATED, in io.c, as a result of the driver layer 
returning HOLE || ZERO (there are cases where the block layer can return 
ZERO but not ALLOCATED, because the driver layer returned 0 but the 
block layer still knows that area reads as zero).  So Victor's patch 
matches the fact that the driver shouldn't set ALLOCATED.  Still, if we 
are tying ALLOCATED to whether there is a hole, then that seems like 
information we should be getting from the driver, not something 
synthesized after we've left the driver!

Then there's the question of file-posix.c: what should it return for a 
hole, ZERO|OFFSET_VALID or DATA|ZERO|OFFSET_VALID?  The wording in 
block.h implies that if DATA is not set, then the area reads as zero to 
the guest, but may have indeterminate value on the underlying file - but 
we KNOW that a hole in a POSIX file reads as 0 rather than having 
indeterminate value, and returning DATA fits the current documentation 
(but doing so bleeds through to at least 'qemu-img map --output=json' 
for the raw format).  I think we're overloading too many things into 
DATA (which layer of the chain feeds what the guest sees, and do we have 
a hole or is storage allocated for the data).  The only uses of 
BDRV_BLOCK_ALLOCATED are in the computation of bdrv_is_allocated(), in 
qcow2 measure, and in qemu-img compare, which all really do care about 
the semantics of "does THIS layer provide the guest image, or do I defer 
to a backing layer".  But the question NBD wants answered is "do I know 
whether there is a hole in the storage"  There are also relatively few 
clients of BDRV_BLOCK_DATA (mirror.c, qemu-img, 
bdrv_co_block_status_above), and I wonder if some of them are more 
worried about BDRV_BLOCK_ALLOCATED instead.

I'm thinking of revamping things to still keep four bits, but with new 
names and semantics as follows:

BDRV_BLOCK_LOCAL - the guest gets this portion of the file from this 
BDS, rather than the backing chain - makes sense for format drivers, 
pointless for protocol drivers
BDRV_BLOCK_ZERO - this portion of the file reads as zeroes
BDRV_BLOCK_ALLOC - this portion of the file has reserved disk space
BDRV_BLOCK_OFFSET_VALID - offset for accessing raw data

For format drivers:
L Z A O   read as zero, returned file is zero at offset
L - A O   read as valid from file at offset
L Z - O   read as zero, but returned file has hole at offset
L - - O   preallocated at offset but reads as garbage - bug?
L Z A -   read as zero, but from unknown offset with storage
L - A -   read as valid, but from unknown offset (including compressed, 
encrypted)
L Z - -   read as zero, but from unknown offset with hole
L - - -   preallocated but no offset known - bug?
- Z A O   read defers to backing layer, but protocol layer contains 
allocated zeros at offset
- - A O   read defers to backing layer, but preallocated at offset
- Z - O   bug
- - - O   bug
- Z A -   bug
- - A -   bug
- Z - -   bug
- - - -   read defers to backing layer

For protocol drivers:
- Z A O   read as zero, offset is allocated
- - A O   read as data, offset is allocated
- Z - O   read as zero, offset is hole
- - - O   bug?
- Z A -   read as zero, but from unknown offset with storage
- - A -   read as valid, but from unknown offset
- Z - -   read as zero, but from unknown offset with hole
- - - -   can't access this portion of file

With the new bit definitions, any driver that returns RAW (necessarily 
with OFFSET_VALID) will have the block layer set LOCAL in addition to 
whatever the next layer returns (turning the protocol driver's response 
into the correct format layer response).  Protocol drivers can omit the 
callback and get the sane default of '- - A O' mapped in place (or would 
that be better as '- - A -'?). file-posix.c would return either '- - A 
O' (after SEEK_DATA) or '- Z - O' (after SEEK_HOLE).  NBD would map ZERO 
to NBD_STATE_ZERO, and ALLOC to !NBD_STATE_HOLE, in both server 
(block-layer-to-NBD-protocol) and client (NBD-protocol-to-block-layer). 
Format drivers would set LOCAL themselves (rather than the block layer 
synthesizing it).

bdrv_is_allocated will still let clients learn which layers are local 
without grabbing full mapping information, but is tied to the 
BDRV_BLOCK_LOCAL bit.  Optimizations made during mirror based on whether 
and qemu-img compare previously based on BDRV_BLOCK_ALLOCATED are now 
based on BDRV_BLOCK_LOCAL, those based on BDRV_BLOCK_DATA are now based 
on BDRV_BLOCK_ALLOC.

Thoughts?
Kevin Wolf Feb. 26, 2018, 2:05 p.m. | #7
Am 24.02.2018 um 00:38 hat Eric Blake geschrieben:
> On 02/23/2018 11:05 AM, Kevin Wolf wrote:
> > Am 23.02.2018 um 17:43 hat Eric Blake geschrieben:
> > > > OFFSET_VALID | DATA might be excusable because I can see that it's
> > > > convenient that a protocol driver refers to itself as *file instead of
> > > > returning NULL there and then the offset is valid (though it would be
> > > > pointless to actually follow the file pointer), but OFFSET_VALID without
> > > > DATA probably isn't.
> > > 
> > > So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
> > > necessary to avoid breaking qemu-img map output.  But you are also right
> > > that OFFSET_VALID without data makes little sense at a protocol layer. So
> > > with that in mind, I'm auditing all of the protocol layers to make sure
> > > OFFSET_VALID ends up as something sane.
> > 
> > That's one way to look at it.
> > 
> > The other way is that qemu-img map shouldn't ask the protocol layer for
> > its offset because it already knows the offset (it is what it passes as
> > a parameter to bdrv_co_block_status).
> > 
> > Anyway, it's probably not worth changing the interface, we should just
> > make sure that the return values of the individual drivers are
> > consistent.
> 
> Yet another inconsistency, and it's making me scratch my head today.
> 
> By the way, in my byte-based stuff that is now pending on your tree, I tried
> hard to NOT change semantics or the set of flags returned by a given driver,
> and we agreed that's why you'd accept the series as-is and make me do this
> followup exercise.  But it's looking like my followups may end up touching a
> lot of the same drivers again, now that I'm looking at what the semantics
> SHOULD be (and whatever I do end up tweaking, I will at least make sure that
> iotests is still happy with it).

Hm, that's unfortunate, but I don't think we should hold up your first
series just so we can touch the drivers only once.

> First, let's read what states the NBD spec is proposing:
> 
> > It defines the following flags for the flags field:
> > 
> >     NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and future writes to that area may cause fragmentation or encounter an ENOSPC error); if clear, the block is allocated or the server could not otherwise determine its status. Note that the use of NBD_CMD_TRIM is related to this status, but that the server MAY report a hole even where NBD_CMD_TRIM has not been requested, and also that a server MAY report that the block is allocated even where NBD_CMD_TRIM has been requested.
> >     NBD_STATE_ZERO (bit 1): if set, the block contents read as all zeroes; if clear, the block contents are not known. Note that the use of NBD_CMD_WRITE_ZEROES is related to this status, but that the server MAY report zeroes even where NBD_CMD_WRITE_ZEROES has not been requested, and also that a server MAY report unknown content even where NBD_CMD_WRITE_ZEROES has been requested.
> > 
> > It is not an error for a server to report that a region of the export has both NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of such an area are undefined, and a client reading such an area should make no assumption as to its contents or stability.
> 
> So here's how Vladimir proposed implementing it in his series (written
> before my byte-based block status stuff went in to your tree):
> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04038.html
> 
> Server side (3/9):
> 
> +        int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes,
> &num,
> +                                          NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
> +                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);
> 
> Client side (6/9):
> 
> +    *pnum = extent.length >> BDRV_SECTOR_BITS;
> +    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
> +           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);
> 
> Does anything there strike you as odd?

Two things I noticed while reading the above:

1. NBD doesn't consider backing files, so the definition of holes
   becomes ambiguous. Is a hole any block that isn't allocated in the
   top layer (may cause fragmentation or encounter an ENOSPC error) or
   is it any block that isn't allocated anywhere in the whole backing
   chain (may read as non-zero)?

   Considering that there is a separate NBD_STATE_ZERO and nothing
   forbids a state of NBD_STATE_HOLE without NBD_STATE_ZERO, maybe the
   former is more useful. The code you quote implements the latter.

   Maybe if we go with the former, we should add a note to the NBD spec
   that explictly says that NBD_STATE_HOLE doesn't imply any specific
   content that is returned on reads.

2. Using BDRV_BLOCK_ALLOCATED to determine NBD_STATE_HOLE seems wrong. A
   (not preallocated) zero cluster in qcow2 returns BDRV_BLOCK_ALLOCATED
   (because we don't fall through to the backing file) even though I
   think it's a hole. BDRV_BLOCK_DATA should be used there (which makes
   it consistent with the other direction).

> In isolation, they seemed fine to
> me, but side-by-side, I'm scratching my head: the server queries the block
> layer, and turns BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the client side
> then takes the NBD protocol and tries to turn it back into information to
> feed the block layer, where !NBD_STATE_HOLE now feeds BDRV_BLOCK_DATA.  Why
> the different choice of bits?

Which is actually consistent in the end, becaue BDRV_BLOCK_DATA implies
BDRV_BLOCK_ALLOCATED.

Essentially, assuming a simple backing chain 'base <- overlay', we got
these combinations to represent in NBD (with my suggestion of the flags
to use):

1. Cluster allocated in overlay
   a. non-zero data                         0
   b. explicit zeroes                       0 or ZERO
2. Cluster marked zero in overlay           HOLE | ZERO
3. Cluster preallocated/zero in overlay     ZERO
4. Cluster unallocated in overlay
   a. Cluster allocated in base (non-zero)  HOLE
   b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
   c. Cluster marked zero in base           HOLE | ZERO
   d. Cluster preallocated/zero in base     HOLE | ZERO
   e. Cluster unallocated in base           HOLE | ZERO

Instead of 'base' you can read 'anywhere in the backing chain' and the
flags should stay the same.

So !BDRV_BLOCK_ALLOCATED (i.e. falling through to the backing file) does
indeed imply NBD_STATE_HOLE, but so does case 2, which is just !DATA.

> Part of the story is that right now, we document that ONLY the block layer
> sets _ALLOCATED, in io.c, as a result of the driver layer returning HOLE ||
> ZERO (there are cases where the block layer can return ZERO but not
> ALLOCATED, because the driver layer returned 0 but the block layer still
> knows that area reads as zero).  So Victor's patch matches the fact that the
> driver shouldn't set ALLOCATED.  Still, if we are tying ALLOCATED to whether
> there is a hole, then that seems like information we should be getting from
> the driver, not something synthesized after we've left the driver!

Yes, I'm getting this impression, too. If your documentation says
something like "not allocated or unknown offset" (for !OFFSET_VALID),
you should probably be using one bit more to distinguish these cases.

> Then there's the question of file-posix.c: what should it return for a hole,
> ZERO|OFFSET_VALID or DATA|ZERO|OFFSET_VALID?  The wording in block.h implies
> that if DATA is not set, then the area reads as zero to the guest, but may
> have indeterminate value on the underlying file - but we KNOW that a hole in
> a POSIX file reads as 0 rather than having indeterminate value, and
> returning DATA fits the current documentation (but doing so bleeds through
> to at least 'qemu-img map --output=json' for the raw format).

The "underlying file" for the file-posix layer (i.e. the filesystem) is
a block device. A hole in a file is defined by not mapping to anywhere
on the block device, so DATA should not be set.

DATA | ZERO would mean that the block is actually allocated on the block
device, but it still reads as zero.

The thing that is inconsistent here is OFFSET_VALID and the offset
returned because the protocol layer refers to itself there instead of
referring to the "underlying file". If done consistently, it would have
to return the offset on the block device (which is useless information
in QEMU, so I suggested not to set OFFSET_VALID there at all - but we
decided that that's too much hassle for no practical benefit).

> I think we're overloading too many things into DATA (which layer of
> the chain feeds what the guest sees, and do we have a hole or is
> storage allocated for the data).

As I understand it, DATA should only be about holes (in the sense of not
being mapped to anywhere in bs->file or any other child apart from the
backing file).

Documentation does define OFFSET_VALID without DATA, though, as
preallocation. Maybe preallocation would better be expressed as DATA,
but without ALLOCATED.

> The only uses of BDRV_BLOCK_ALLOCATED are in the computation of
> bdrv_is_allocated(), in qcow2 measure, and in qemu-img compare, which all
> really do care about the semantics of "does THIS layer provide the guest
> image, or do I defer to a backing layer".  But the question NBD wants
> answered is "do I know whether there is a hole in the storage"  There are
> also relatively few clients of BDRV_BLOCK_DATA (mirror.c, qemu-img,
> bdrv_co_block_status_above), and I wonder if some of them are more worried
> about BDRV_BLOCK_ALLOCATED instead.
> 
> I'm thinking of revamping things to still keep four bits, but with new names
> and semantics as follows:
> 
> BDRV_BLOCK_LOCAL - the guest gets this portion of the file from this BDS,
> rather than the backing chain - makes sense for format drivers, pointless
> for protocol drivers

This is the old BDRV_BLOCK_ALLOCATED.

Data almost never comes from the qcow2 layer, so what this really means
is that data doesn't come from bs->backing.

> BDRV_BLOCK_ZERO - this portion of the file reads as zeroes

Same as before.

> BDRV_BLOCK_ALLOC - this portion of the file has reserved disk space

I think this is essentially what I believe BDRV_BLOCK_DATA should have
been. "Disk space" isn't clearly defined, but "there is a mapping to a
child node (except bs->backing)" seems to be close enough to what you
have in mind.

> BDRV_BLOCK_OFFSET_VALID - offset for accessing raw data

Same as before.

As I understand it, you're just renaming the existing flags. I'm not
sure if this is a good idea, especially with ALLOC(ATED), which changes
the meaning. This is pretty confusing. I suggest MAPPED as an
alternative.

> For format drivers:
> L Z A O   read as zero, returned file is zero at offset

This is not what your definition above said. ZERO is about reading from
the node itself rather than reading from bs->file.

I interpret this as: Read as zero, space is preallocated in the image
file, content in the image file is undefined.

> L - A O   read as valid from file at offset
> L Z - O   read as zero, but returned file has hole at offset

Read as zero, no mapping into bs->file, but the offset in bs->file is
valid. Doesn't make sense, OFFSET_VALID (O) should always imply
MAPPED (A).

> L - - O   preallocated at offset but reads as garbage - bug?

Again O without A - yes, a bug.

> L Z A -   read as zero, but from unknown offset with storage

And the space is preallocated in a non-backing child, though not
necessarily zeroed.

> L - A -   read as valid, but from unknown offset (including compressed,
> encrypted)
> L Z - -   read as zero, but from unknown offset with hole
> L - - -   preallocated but no offset known - bug?

No, not preallocated, because MAPPED isn't set.

This is a block that isn't mapped to a block in any child node and
doesn't read as zero. It might be the appropriate response for the null
driver with read-zeroes=off.

> - Z A O   read defers to backing layer, but protocol layer contains
> allocated zeros at offset

No. Space is preallocated for this block, but read defers to the backing
layer and we know that the backing layer provides zeros.

This is not something that a driver should return, but
it's a valid return value from bdrv_co_block_status(). One example for
this is reading from an offset that is higher than the length of the
backing file.

> - - A O   read defers to backing layer, but preallocated at offset

Yes, this one is preallocation, finally.

> - Z - O   bug
> - - - O   bug
> - Z A -   bug

Same as '- Z A O' except that the offset can't be directly accessed in
the child node (e.g. because this is an encrypted image).

> - - A -   bug

Preallocated, but reads from backing file. Offset can't be directly
accessed in the child node.

> - Z - -   bug

Read defers to backing layer and we know it will read zeros. Like for
'- Z A O', this isn't something that a driver should return, but makes
sense as a return value of bdrv_co_block_status().

> - - - -   read defers to backing layer
> 
> For protocol drivers:
> - Z A O   read as zero, offset is allocated
> - - A O   read as data, offset is allocated
> - Z - O   read as zero, offset is hole
> - - - O   bug?
> - Z A -   read as zero, but from unknown offset with storage
> - - A -   read as valid, but from unknown offset
> - Z - -   read as zero, but from unknown offset with hole
> - - - -   can't access this portion of file

Why don't you set LOCAL? It's usually true for protocol drivers that
they don't get their data from a backing file (though in theory you
could imagine a protocol driver with backing file support).

As discussed before, OFFSET_VALID doesn't really make sense here because
we don't return offsets of the image file on the block device, but we
only decided to keep it because of convenience. But if we change
everything, then this should be changed, too.

While the offset still refers to the same node for protocol drivers,
"unknown offset" doesn't make any sense. There is no mapping involved
that could be unknown. We really only have four cases for protocols.

ZERO and MAPPED make sense this way.

Not sure about 0 (or only OFFSET_VALID), could this ever be valid? You
say "can't access this portion of file", but where would this happen?

Kevin
Vladimir Sementsov-Ogievskiy March 1, 2018, 7:25 a.m. | #8
26.02.2018 17:05, Kevin Wolf wrote:
> Am 24.02.2018 um 00:38 hat Eric Blake geschrieben:
>> On 02/23/2018 11:05 AM, Kevin Wolf wrote:
>>> Am 23.02.2018 um 17:43 hat Eric Blake geschrieben:
>>>>> OFFSET_VALID | DATA might be excusable because I can see that it's
>>>>> convenient that a protocol driver refers to itself as *file instead of
>>>>> returning NULL there and then the offset is valid (though it would be
>>>>> pointless to actually follow the file pointer), but OFFSET_VALID without
>>>>> DATA probably isn't.
>>>> So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
>>>> necessary to avoid breaking qemu-img map output.  But you are also right
>>>> that OFFSET_VALID without data makes little sense at a protocol layer. So
>>>> with that in mind, I'm auditing all of the protocol layers to make sure
>>>> OFFSET_VALID ends up as something sane.
>>> That's one way to look at it.
>>>
>>> The other way is that qemu-img map shouldn't ask the protocol layer for
>>> its offset because it already knows the offset (it is what it passes as
>>> a parameter to bdrv_co_block_status).
>>>
>>> Anyway, it's probably not worth changing the interface, we should just
>>> make sure that the return values of the individual drivers are
>>> consistent.
>> Yet another inconsistency, and it's making me scratch my head today.
>>
>> By the way, in my byte-based stuff that is now pending on your tree, I tried
>> hard to NOT change semantics or the set of flags returned by a given driver,
>> and we agreed that's why you'd accept the series as-is and make me do this
>> followup exercise.  But it's looking like my followups may end up touching a
>> lot of the same drivers again, now that I'm looking at what the semantics
>> SHOULD be (and whatever I do end up tweaking, I will at least make sure that
>> iotests is still happy with it).
> Hm, that's unfortunate, but I don't think we should hold up your first
> series just so we can touch the drivers only once.
>
>> First, let's read what states the NBD spec is proposing:
>>
>>> It defines the following flags for the flags field:
>>>
>>>      NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and future writes to that area may cause fragmentation or encounter an ENOSPC error); if clear, the block is allocated or the server could not otherwise determine its status. Note that the use of NBD_CMD_TRIM is related to this status, but that the server MAY report a hole even where NBD_CMD_TRIM has not been requested, and also that a server MAY report that the block is allocated even where NBD_CMD_TRIM has been requested.
>>>      NBD_STATE_ZERO (bit 1): if set, the block contents read as all zeroes; if clear, the block contents are not known. Note that the use of NBD_CMD_WRITE_ZEROES is related to this status, but that the server MAY report zeroes even where NBD_CMD_WRITE_ZEROES has not been requested, and also that a server MAY report unknown content even where NBD_CMD_WRITE_ZEROES has been requested.
>>>
>>> It is not an error for a server to report that a region of the export has both NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of such an area are undefined, and a client reading such an area should make no assumption as to its contents or stability.
>> So here's how Vladimir proposed implementing it in his series (written
>> before my byte-based block status stuff went in to your tree):
>> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04038.html
>>
>> Server side (3/9):
>>
>> +        int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes,
>> &num,
>> +                                          NULL, NULL);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
>> +                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);
>>
>> Client side (6/9):
>>
>> +    *pnum = extent.length >> BDRV_SECTOR_BITS;
>> +    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
>> +           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);
>>
>> Does anything there strike you as odd?
> Two things I noticed while reading the above:
>
> 1. NBD doesn't consider backing files, so the definition of holes
>     becomes ambiguous. Is a hole any block that isn't allocated in the
>     top layer (may cause fragmentation or encounter an ENOSPC error) or
>     is it any block that isn't allocated anywhere in the whole backing
>     chain (may read as non-zero)?
>
>     Considering that there is a separate NBD_STATE_ZERO and nothing
>     forbids a state of NBD_STATE_HOLE without NBD_STATE_ZERO, maybe the
>     former is more useful. The code you quote implements the latter.
>
>     Maybe if we go with the former, we should add a note to the NBD spec
>     that explictly says that NBD_STATE_HOLE doesn't imply any specific
>     content that is returned on reads.
>
> 2. Using BDRV_BLOCK_ALLOCATED to determine NBD_STATE_HOLE seems wrong. A
>     (not preallocated) zero cluster in qcow2 returns BDRV_BLOCK_ALLOCATED
>     (because we don't fall through to the backing file) even though I
>     think it's a hole. BDRV_BLOCK_DATA should be used there (which makes
>     it consistent with the other direction).
>
>> In isolation, they seemed fine to
>> me, but side-by-side, I'm scratching my head: the server queries the block
>> layer, and turns BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the client side
>> then takes the NBD protocol and tries to turn it back into information to
>> feed the block layer, where !NBD_STATE_HOLE now feeds BDRV_BLOCK_DATA.  Why
>> the different choice of bits?
> Which is actually consistent in the end, becaue BDRV_BLOCK_DATA implies
> BDRV_BLOCK_ALLOCATED.
>
> Essentially, assuming a simple backing chain 'base <- overlay', we got
> these combinations to represent in NBD (with my suggestion of the flags
> to use):
>
> 1. Cluster allocated in overlay
>     a. non-zero data                         0
>     b. explicit zeroes                       0 or ZERO
> 2. Cluster marked zero in overlay           HOLE | ZERO
> 3. Cluster preallocated/zero in overlay     ZERO
> 4. Cluster unallocated in overlay
>     a. Cluster allocated in base (non-zero)  HOLE
>     b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
>     c. Cluster marked zero in base           HOLE | ZERO
>     d. Cluster preallocated/zero in base     HOLE | ZERO
>     e. Cluster unallocated in base           HOLE | ZERO
>
> Instead of 'base' you can read 'anywhere in the backing chain' and the
> flags should stay the same.

I think only "anywhere in the backing chain" is valid here. Otherwise, 
semantics of bdrv_is_allocated would differ for NBD and for not-NBD. I 
think, if bdrv_is_allocated returns false, it means that we can skip 
this region in copying process, am I right?


>
> So !BDRV_BLOCK_ALLOCATED (i.e. falling through to the backing file) does
> indeed imply NBD_STATE_HOLE, but so does case 2, which is just !DATA.
>
>
Kevin Wolf March 1, 2018, 9:48 a.m. | #9
Am 01.03.2018 um 08:25 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 26.02.2018 17:05, Kevin Wolf wrote:
> > Essentially, assuming a simple backing chain 'base <- overlay', we got
> > these combinations to represent in NBD (with my suggestion of the flags
> > to use):
> > 
> > 1. Cluster allocated in overlay
> >     a. non-zero data                         0
> >     b. explicit zeroes                       0 or ZERO
> > 2. Cluster marked zero in overlay           HOLE | ZERO
> > 3. Cluster preallocated/zero in overlay     ZERO
> > 4. Cluster unallocated in overlay
> >     a. Cluster allocated in base (non-zero)  HOLE
> >     b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
> >     c. Cluster marked zero in base           HOLE | ZERO
> >     d. Cluster preallocated/zero in base     HOLE | ZERO
> >     e. Cluster unallocated in base           HOLE | ZERO
> > 
> > Instead of 'base' you can read 'anywhere in the backing chain' and the
> > flags should stay the same.
> 
> I think only "anywhere in the backing chain" is valid here. Otherwise,
> semantics of bdrv_is_allocated would differ for NBD and for not-NBD.

This was meant as a mapping from cases to flags, not the other way
round, so really doesn't say anything about the cases where the block is
allocated further down the chain.

But yes, it shouldn't make a difference where in the backing chain a
block is allocated, so these cases are the same as 4.

> I think, if bdrv_is_allocated returns false, it means that we can skip
> this region in copying process, am I right?

-ENOCONTEXT? Which copying process?

There are cases where you want to copy such regions, and other cases
where you want to skip them. It depends on the use case. For example,
'qemu-img convert' skips them with -B (because the backing file is
reused), but not without -B (which creates a full copy).

Kevin
Vladimir Sementsov-Ogievskiy March 1, 2018, 9:57 a.m. | #10
01.03.2018 12:48, Kevin Wolf wrote:
> Am 01.03.2018 um 08:25 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 26.02.2018 17:05, Kevin Wolf wrote:
>>> Essentially, assuming a simple backing chain 'base <- overlay', we got
>>> these combinations to represent in NBD (with my suggestion of the flags
>>> to use):
>>>
>>> 1. Cluster allocated in overlay
>>>      a. non-zero data                         0
>>>      b. explicit zeroes                       0 or ZERO
>>> 2. Cluster marked zero in overlay           HOLE | ZERO
>>> 3. Cluster preallocated/zero in overlay     ZERO
>>> 4. Cluster unallocated in overlay
>>>      a. Cluster allocated in base (non-zero)  HOLE
>>>      b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
>>>      c. Cluster marked zero in base           HOLE | ZERO
>>>      d. Cluster preallocated/zero in base     HOLE | ZERO
>>>      e. Cluster unallocated in base           HOLE | ZERO
>>>
>>> Instead of 'base' you can read 'anywhere in the backing chain' and the
>>> flags should stay the same.
>> I think only "anywhere in the backing chain" is valid here. Otherwise,
>> semantics of bdrv_is_allocated would differ for NBD and for not-NBD.
> This was meant as a mapping from cases to flags, not the other way
> round, so really doesn't say anything about the cases where the block is
> allocated further down the chain.
>
> But yes, it shouldn't make a difference where in the backing chain a
> block is allocated, so these cases are the same as 4.
>
>> I think, if bdrv_is_allocated returns false, it means that we can skip
>> this region in copying process, am I right?
> -ENOCONTEXT? Which copying process?
>
> There are cases where you want to copy such regions, and other cases
> where you want to skip them. It depends on the use case. For example,
> 'qemu-img convert' skips them with -B (because the backing file is
> reused), but not without -B (which creates a full copy).
>
> Kevin

Hm, I thought that bdrv_is_allocated loops through backings, but it 
doesn't, sorry.
Kevin Wolf March 1, 2018, 10:13 a.m. | #11
Am 01.03.2018 um 10:57 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 01.03.2018 12:48, Kevin Wolf wrote:
> > Am 01.03.2018 um 08:25 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 26.02.2018 17:05, Kevin Wolf wrote:
> > > > Essentially, assuming a simple backing chain 'base <- overlay', we got
> > > > these combinations to represent in NBD (with my suggestion of the flags
> > > > to use):
> > > > 
> > > > 1. Cluster allocated in overlay
> > > >      a. non-zero data                         0
> > > >      b. explicit zeroes                       0 or ZERO
> > > > 2. Cluster marked zero in overlay           HOLE | ZERO
> > > > 3. Cluster preallocated/zero in overlay     ZERO
> > > > 4. Cluster unallocated in overlay
> > > >      a. Cluster allocated in base (non-zero)  HOLE
> > > >      b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
> > > >      c. Cluster marked zero in base           HOLE | ZERO
> > > >      d. Cluster preallocated/zero in base     HOLE | ZERO
> > > >      e. Cluster unallocated in base           HOLE | ZERO
> > > > 
> > > > Instead of 'base' you can read 'anywhere in the backing chain' and the
> > > > flags should stay the same.
> > > I think only "anywhere in the backing chain" is valid here. Otherwise,
> > > semantics of bdrv_is_allocated would differ for NBD and for not-NBD.
> > This was meant as a mapping from cases to flags, not the other way
> > round, so really doesn't say anything about the cases where the block is
> > allocated further down the chain.
> > 
> > But yes, it shouldn't make a difference where in the backing chain a
> > block is allocated, so these cases are the same as 4.
> > 
> > > I think, if bdrv_is_allocated returns false, it means that we can skip
> > > this region in copying process, am I right?
> > -ENOCONTEXT? Which copying process?
> > 
> > There are cases where you want to copy such regions, and other cases
> > where you want to skip them. It depends on the use case. For example,
> > 'qemu-img convert' skips them with -B (because the backing file is
> > reused), but not without -B (which creates a full copy).
> > 
> > Kevin
> 
> Hm, I thought that bdrv_is_allocated loops through backings, but it doesn't,
> sorry.

That would be bdrv_is_allocated_above() with a NULL base.

Kevin

Patch

diff --git a/block/null.c b/block/null.c
index 214d394fff4..806a8631e4d 100644
--- a/block/null.c
+++ b/block/null.c
@@ -223,22 +223,23 @@  static int null_reopen_prepare(BDRVReopenState *reopen_state,
     return 0;
 }

-static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
-                                                     int64_t sector_num,
-                                                     int nb_sectors, int *pnum,
-                                                     BlockDriverState **file)
+static int coroutine_fn null_co_block_status(BlockDriverState *bs,
+                                             bool want_zero, int64_t offset,
+                                             int64_t bytes, int64_t *pnum,
+                                             int64_t *map,
+                                             BlockDriverState **file)
 {
     BDRVNullState *s = bs->opaque;
-    off_t start = sector_num * BDRV_SECTOR_SIZE;
+    int ret = BDRV_BLOCK_OFFSET_VALID;

-    *pnum = nb_sectors;
+    *pnum = bytes;
+    *map = offset;
     *file = bs;

     if (s->read_zeroes) {
-        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
-    } else {
-        return BDRV_BLOCK_OFFSET_VALID | start;
+        ret |= BDRV_BLOCK_ZERO;
     }
+    return ret;
 }

 static void null_refresh_filename(BlockDriverState *bs, QDict *opts)
@@ -270,7 +271,7 @@  static BlockDriver bdrv_null_co = {
     .bdrv_co_flush_to_disk  = null_co_flush,
     .bdrv_reopen_prepare    = null_reopen_prepare,

-    .bdrv_co_get_block_status   = null_co_get_block_status,
+    .bdrv_co_block_status   = null_co_block_status,

     .bdrv_refresh_filename  = null_refresh_filename,
 };
@@ -290,7 +291,7 @@  static BlockDriver bdrv_null_aio = {
     .bdrv_aio_flush         = null_aio_flush,
     .bdrv_reopen_prepare    = null_reopen_prepare,

-    .bdrv_co_get_block_status   = null_co_get_block_status,
+    .bdrv_co_block_status   = null_co_block_status,

     .bdrv_refresh_filename  = null_refresh_filename,
 };