Message ID | 20180213202701.15858-10-eblake@redhat.com |
---|---|
State | New |
Series | add byte-based block_status driver callbacks |
On 13.02.2018 at 21:26, Eric Blake wrote:
> We are gradually moving away from sector-based interfaces, towards
> byte-based. Update the null driver accordingly.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Reviewed-by: Fam Zheng <famz@redhat.com>
>
> ---
> v6-v7: no change
> v5: minor fix to type of 'ret'
> v4: rebase to interface tweak
> v3: no change
> v2: rebase to mapping parameter
> ---
>  block/null.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/block/null.c b/block/null.c
> index 214d394fff4..806a8631e4d 100644
> --- a/block/null.c
> +++ b/block/null.c
> @@ -223,22 +223,23 @@ static int null_reopen_prepare(BDRVReopenState *reopen_state,
>      return 0;
>  }
>
> -static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
> -                                                     int64_t sector_num,
> -                                                     int nb_sectors, int *pnum,
> -                                                     BlockDriverState **file)
> +static int coroutine_fn null_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero, int64_t offset,
> +                                             int64_t bytes, int64_t *pnum,
> +                                             int64_t *map,
> +                                             BlockDriverState **file)
>  {
>      BDRVNullState *s = bs->opaque;
> -    off_t start = sector_num * BDRV_SECTOR_SIZE;
> +    int ret = BDRV_BLOCK_OFFSET_VALID;
>
> -    *pnum = nb_sectors;
> +    *pnum = bytes;
> +    *map = offset;
>      *file = bs;
>
>      if (s->read_zeroes) {
> -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
> -    } else {
> -        return BDRV_BLOCK_OFFSET_VALID | start;
> +        ret |= BDRV_BLOCK_ZERO;
>      }
> +    return ret;
>  }

Preexisting, but I think this return value is wrong. OFFSET_VALID
without DATA is documented to have the following semantics:

 * DATA ZERO OFFSET_VALID
 *  f    t        t       sectors preallocated, read as zero, returned file not
 *                        necessarily zero at offset
 *  f    f        t       sectors preallocated but read from backing_hd,
 *                        returned file contains garbage at offset

I'm not sure what OFFSET_VALID is even supposed to mean for null.
Or in fact, what it is supposed to mean for any protocol driver, because
normally it just means I can use this offset for accessing bs->file. But
protocol drivers don't have a bs->file, so it's interesting to see that
they still all set this flag.

OFFSET_VALID | DATA might be excusable because I can see that it's
convenient that a protocol driver refers to itself as *file instead of
returning NULL there and then the offset is valid (though it would be
pointless to actually follow the file pointer), but OFFSET_VALID without
DATA probably isn't.

Kevin
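For readers without the QEMU tree at hand, the flag logic of the patched function can be sketched in isolation. This is only an illustration: the `SKETCH_*` flag values and the `NullState` struct are stand-ins invented here (the real definitions live in QEMU's block headers), and the `*file` out-parameter is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag values for illustration only; not QEMU's real definitions. */
#define SKETCH_BLOCK_DATA         0x01
#define SKETCH_BLOCK_ZERO         0x02
#define SKETCH_BLOCK_OFFSET_VALID 0x04

/* Hypothetical reduction of BDRVNullState to the one field used here. */
typedef struct {
    bool read_zeroes;
} NullState;

/*
 * Mirrors the shape of the patched null_co_block_status(): the whole
 * request is reported as a single extent that maps 1:1 onto itself.
 */
static int null_block_status_sketch(const NullState *s, int64_t offset,
                                    int64_t bytes, int64_t *pnum,
                                    int64_t *map)
{
    int ret = SKETCH_BLOCK_OFFSET_VALID;

    *pnum = bytes;
    *map = offset;
    if (s->read_zeroes) {
        ret |= SKETCH_BLOCK_ZERO;
    }
    return ret;
}
```

Neither branch ever sets the DATA stand-in, which is exactly the "OFFSET_VALID without DATA" combination questioned above.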
On 02/14/2018 06:05 AM, Kevin Wolf wrote:
> On 13.02.2018 at 21:26, Eric Blake wrote:
>> We are gradually moving away from sector-based interfaces, towards
>> byte-based. Update the null driver accordingly.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Reviewed-by: Fam Zheng <famz@redhat.com>
>>
>>      if (s->read_zeroes) {
>> -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
>> -    } else {
>> -        return BDRV_BLOCK_OFFSET_VALID | start;
>> +        ret |= BDRV_BLOCK_ZERO;
>>      }
>> +    return ret;
>>  }
>
> Preexisting, but I think this return value is wrong. OFFSET_VALID
> without DATA is documented to have the following semantics:
>
>  * DATA ZERO OFFSET_VALID
>  *  f    t        t       sectors preallocated, read as zero, returned file not
>  *                        necessarily zero at offset
>  *  f    f        t       sectors preallocated but read from backing_hd,
>  *                        returned file contains garbage at offset
>
> I'm not sure what OFFSET_VALID is even supposed to mean for null.

Yeah, and I was even thinking about that a bit yesterday when figuring
out what to do with nvme. It does highlight the fact that you get
garbage when reading from the null driver (unless the zero option was
enabled, then ZERO is set and you know you read zeros instead) - but
there is no pointer that is preallocated (whether it contains garbage or
otherwise) that you can actually dereference to read what the guest
would see.

>
> Or in fact, what it is supposed to mean for any protocol driver, because
> normally it just means I can use this offset for accessing bs->file. But
> protocol drivers don't have a bs->file, so it's interesting to see that
> they still all set this flag.
>
> OFFSET_VALID | DATA might be excusable because I can see that it's
> convenient that a protocol driver refers to itself as *file instead of
> returning NULL there and then the offset is valid (though it would be
> pointless to actually follow the file pointer), but OFFSET_VALID without
> DATA probably isn't.

Hmm, you're probably right. Maybe that means I should tweak the
documentation to be more explicit: for a format driver, OFFSET_VALID can
always be used (and *file will be set to the underlying protocol
driver); but for a protocol driver, OFFSET_VALID only makes sense if
*file is the BDS itself and there is an actual buffer to read (that is,
the protocol driver must also be returning DATA and/or ZERO). Or maybe
we can indeed state that protocol drivers always set *file to NULL
(there is no further backing file to reference), and thus never need to
return OFFSET_VALID (but I'm not sure whether that will accidentally
propagate back up the call stack and negatively affect status queries of
format drivers).

Since it is pre-existing, should I respin to address the issue in a
separate patch, or should that be a followup after this series?
On 14.02.2018 at 15:44, Eric Blake wrote:
> On 02/14/2018 06:05 AM, Kevin Wolf wrote:
> > On 13.02.2018 at 21:26, Eric Blake wrote:
> > > We are gradually moving away from sector-based interfaces, towards
> > > byte-based. Update the null driver accordingly.
> > >
> > > Signed-off-by: Eric Blake <eblake@redhat.com>
> > > Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > Reviewed-by: Fam Zheng <famz@redhat.com>
> > >
> > >      if (s->read_zeroes) {
> > > -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
> > > -    } else {
> > > -        return BDRV_BLOCK_OFFSET_VALID | start;
> > > +        ret |= BDRV_BLOCK_ZERO;
> > >      }
> > > +    return ret;
> > >  }
> >
> > Preexisting, but I think this return value is wrong. OFFSET_VALID
> > without DATA is documented to have the following semantics:
> >
> >  * DATA ZERO OFFSET_VALID
> >  *  f    t        t       sectors preallocated, read as zero, returned file not
> >  *                        necessarily zero at offset
> >  *  f    f        t       sectors preallocated but read from backing_hd,
> >  *                        returned file contains garbage at offset
> >
> > I'm not sure what OFFSET_VALID is even supposed to mean for null.
>
> Yeah, and I was even thinking about that a bit yesterday when figuring out
> what to do with nvme. It does highlight the fact that you get garbage when
> reading from the null driver (unless the zero option was enabled, then ZERO
> is set and you know you read zeros instead) - but there is no pointer that
> is preallocated (whether it contains garbage or otherwise) that you can
> actually dereference to read what the guest would see.
>
> >
> > Or in fact, what it is supposed to mean for any protocol driver, because
> > normally it just means I can use this offset for accessing bs->file. But
> > protocol drivers don't have a bs->file, so it's interesting to see that
> > they still all set this flag.
> >
> > OFFSET_VALID | DATA might be excusable because I can see that it's
> > convenient that a protocol driver refers to itself as *file instead of
> > returning NULL there and then the offset is valid (though it would be
> > pointless to actually follow the file pointer), but OFFSET_VALID without
> > DATA probably isn't.
>
> Hmm, you're probably right. Maybe that means I should tweak the
> documentation to be more explicit: for a format driver, OFFSET_VALID can
> always be used (and *file will be set to the underlying protocol driver);
> but for a protocol driver, OFFSET_VALID only makes sense if *file is the BDS
> itself and there is an actual buffer to read (that is, the protocol driver
> must also be returning DATA and/or ZERO). Or maybe we can indeed state that
> protocol drivers always set *file to NULL (there is no further backing file
> to reference), and thus never need to return OFFSET_VALID (but I'm not sure
> whether that will accidentally propagate back up the call stack and
> negatively affect status queries of format drivers).
>
> Since it is pre-existing, should I respin to address the issue in a separate
> patch, or should that be a followup after this series?

It's a more fundamental question that shouldn't hold up this series. I
just wanted to raise it while I was looking at it. So yes, a followup is
fine.

Kevin
On 02/14/2018 06:05 AM, Kevin Wolf wrote:

>> +static int coroutine_fn null_co_block_status(BlockDriverState *bs,

>>      if (s->read_zeroes) {
>> -        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
>> -    } else {
>> -        return BDRV_BLOCK_OFFSET_VALID | start;
>> +        ret |= BDRV_BLOCK_ZERO;
>>      }
>> +    return ret;
>>  }
>
> Preexisting, but I think this return value is wrong. OFFSET_VALID
> without DATA is documented to have the following semantics:
>
>  * DATA ZERO OFFSET_VALID
>  *  f    t        t       sectors preallocated, read as zero, returned file not
>  *                        necessarily zero at offset
>  *  f    f        t       sectors preallocated but read from backing_hd,
>  *                        returned file contains garbage at offset
>
> I'm not sure what OFFSET_VALID is even supposed to mean for null.

I'm finally getting around to playing with this.

>
> Or in fact, what it is supposed to mean for any protocol driver, because
> normally it just means I can use this offset for accessing bs->file. But
> protocol drivers don't have a bs->file, so it's interesting to see that
> they still all set this flag.

More precisely, it means "I can use this offset for accessing the
returned *file". Format and filter drivers set *file = bs->file (i.e.
their protocol layer), but protocol drivers set *file = bs (i.e.
themselves). As long as you read it as "the offset is valid in the
returned *file", and are careful as to _which_ BDS gets returned in
*file, it can still make sense.
So next I tried playing with a patch, to see how much returning
OFFSET_VALID with DATA matters; and it turns out it is easily observable
anywhere that the underlying protocol bleeds through to the format layer
(particularly the raw format driver):

$ echo abc > tmp
$ truncate --size=10M tmp

pre-patch:

$ ./qemu-img map --output=json tmp
[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 4096, "length": 10481664, "depth": 0, "zero": true, "data": false, "offset": 4096}]

turn off OFFSET_VALID at the protocol layer:

diff --git i/block/file-posix.c w/block/file-posix.c
index f1591c38490..c05992c1121 100644
--- i/block/file-posix.c
+++ w/block/file-posix.c
@@ -2158,9 +2158,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,

     if (!want_zero) {
         *pnum = bytes;
-        *map = offset;
-        *file = bs;
-        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+        return BDRV_BLOCK_DATA;
     }

     ret = find_allocation(bs, offset, &data, &hole);
@@ -2183,9 +2181,7 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
         *pnum = MIN(bytes, data - offset);
         ret = BDRV_BLOCK_ZERO;
     }
-    *map = offset;
-    *file = bs;
-    return ret | BDRV_BLOCK_OFFSET_VALID;
+    return ret;
 }

 static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs,

post-patch:

$ ./qemu-img map --output=json tmp
[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true},
{ "start": 4096, "length": 10481664, "depth": 0, "zero": true, "data": false}]

>
> OFFSET_VALID | DATA might be excusable because I can see that it's
> convenient that a protocol driver refers to itself as *file instead of
> returning NULL there and then the offset is valid (though it would be
> pointless to actually follow the file pointer), but OFFSET_VALID without
> DATA probably isn't.

So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
necessary to avoid breaking qemu-img map output.
But you are also right that OFFSET_VALID without data makes little sense at a protocol layer. So with that in mind, I'm auditing all of the protocol layers to make sure OFFSET_VALID ends up as something sane.
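The pre-patch and post-patch outputs in this mail differ only in the "offset" key. A small sketch of that dependency follows; it is not qemu-img's actual code. `format_map_entry` and the `SKETCH_*` values are hypothetical names made up for this note, and the depth is hard-coded to 0 to match the quoted output.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in flag values for illustration only; not QEMU's real definitions. */
#define SKETCH_BLOCK_DATA         0x01
#define SKETCH_BLOCK_ZERO         0x02
#define SKETCH_BLOCK_OFFSET_VALID 0x04

/*
 * Format one extent in the style of the JSON lines quoted above.
 * Returns the string length, or -1 if buf is too small.
 */
static int format_map_entry(char *buf, size_t len, int64_t start,
                            int64_t length, int flags, int64_t map)
{
    char offset_part[48] = "";

    /* The "offset" key appears only when the offset is reported valid. */
    if (flags & SKETCH_BLOCK_OFFSET_VALID) {
        snprintf(offset_part, sizeof(offset_part),
                 ", \"offset\": %" PRId64, map);
    }

    int n = snprintf(buf, len,
                     "{ \"start\": %" PRId64 ", \"length\": %" PRId64
                     ", \"depth\": 0, \"zero\": %s, \"data\": %s%s}",
                     start, length,
                     (flags & SKETCH_BLOCK_ZERO) ? "true" : "false",
                     (flags & SKETCH_BLOCK_DATA) ? "true" : "false",
                     offset_part);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Calling it once with the OFFSET_VALID stand-in set and once without reproduces the shape of the two outputs shown in the mail.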
On 23.02.2018 at 17:43, Eric Blake wrote:
> > OFFSET_VALID | DATA might be excusable because I can see that it's
> > convenient that a protocol driver refers to itself as *file instead of
> > returning NULL there and then the offset is valid (though it would be
> > pointless to actually follow the file pointer), but OFFSET_VALID without
> > DATA probably isn't.
>
> So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
> necessary to avoid breaking qemu-img map output. But you are also right
> that OFFSET_VALID without data makes little sense at a protocol layer. So
> with that in mind, I'm auditing all of the protocol layers to make sure
> OFFSET_VALID ends up as something sane.

That's one way to look at it.

The other way is that qemu-img map shouldn't ask the protocol layer for
its offset because it already knows the offset (it is what it passes as
a parameter to bdrv_co_block_status).

Anyway, it's probably not worth changing the interface, we should just
make sure that the return values of the individual drivers are
consistent.

Kevin
On 02/23/2018 11:05 AM, Kevin Wolf wrote:
> On 23.02.2018 at 17:43, Eric Blake wrote:
>>> OFFSET_VALID | DATA might be excusable because I can see that it's
>>> convenient that a protocol driver refers to itself as *file instead of
>>> returning NULL there and then the offset is valid (though it would be
>>> pointless to actually follow the file pointer), but OFFSET_VALID without
>>> DATA probably isn't.
>>
>> So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
>> necessary to avoid breaking qemu-img map output. But you are also right
>> that OFFSET_VALID without data makes little sense at a protocol layer. So
>> with that in mind, I'm auditing all of the protocol layers to make sure
>> OFFSET_VALID ends up as something sane.
>
> That's one way to look at it.
>
> The other way is that qemu-img map shouldn't ask the protocol layer for
> its offset because it already knows the offset (it is what it passes as
> a parameter to bdrv_co_block_status).
>
> Anyway, it's probably not worth changing the interface, we should just
> make sure that the return values of the individual drivers are
> consistent.

Yet another inconsistency, and it's making me scratch my head today.

By the way, in my byte-based stuff that is now pending on your tree, I
tried hard to NOT change semantics or the set of flags returned by a
given driver, and we agreed that's why you'd accept the series as-is and
make me do this followup exercise. But it's looking like my followups
may end up touching a lot of the same drivers again, now that I'm
looking at what the semantics SHOULD be (and whatever I do end up
tweaking, I will at least make sure that iotests is still happy with
it).
First, let's read what states the NBD spec is proposing:

> It defines the following flags for the flags field:
>
>     NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
>     future writes to that area may cause fragmentation or encounter an
>     ENOSPC error); if clear, the block is allocated or the server could
>     not otherwise determine its status. Note that the use of
>     NBD_CMD_TRIM is related to this status, but that the server MAY
>     report a hole even where NBD_CMD_TRIM has not been requested, and
>     also that a server MAY report that the block is allocated even
>     where NBD_CMD_TRIM has been requested.
>     NBD_STATE_ZERO (bit 1): if set, the block contents read as all
>     zeroes; if clear, the block contents are not known. Note that the
>     use of NBD_CMD_WRITE_ZEROES is related to this status, but that the
>     server MAY report zeroes even where NBD_CMD_WRITE_ZEROES has not
>     been requested, and also that a server MAY report unknown content
>     even where NBD_CMD_WRITE_ZEROES has been requested.
>
> It is not an error for a server to report that a region of the export
> has both NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of
> such an area are undefined, and a client reading such an area should
> make no assumption as to its contents or stability.

So here's how Vladimir proposed implementing it in his series (written
before my byte-based block status stuff went in to your tree):
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04038.html

Server side (3/9):

+        int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes, &num,
+                                          NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+
+        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
+                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);

Client side (6/9):

+    *pnum = extent.length >> BDRV_SECTOR_BITS;
+    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
+           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);

Does anything there strike you as odd?
In isolation, they seemed fine to me, but side-by-side, I'm scratching
my head: the server queries the block layer, and turns
BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the client side then takes
the NBD protocol and tries to turn it back into information to feed the
block layer, where !NBD_STATE_HOLE now feeds BDRV_BLOCK_DATA. Why the
different choice of bits?

Part of the story is that right now, we document that ONLY the block
layer sets _ALLOCATED, in io.c, as a result of the driver layer
returning DATA || ZERO (there are cases where the block layer can return
ZERO but not ALLOCATED, because the driver layer returned 0 but the
block layer still knows that area reads as zero). So Vladimir's patch
matches the fact that the driver shouldn't set ALLOCATED. Still, if we
are tying ALLOCATED to whether there is a hole, then that seems like
information we should be getting from the driver, not something
synthesized after we've left the driver!

Then there's the question of file-posix.c: what should it return for a
hole, ZERO|OFFSET_VALID or DATA|ZERO|OFFSET_VALID? The wording in
block.h implies that if DATA is not set, then the area reads as zero to
the guest, but may have indeterminate value on the underlying file - but
we KNOW that a hole in a POSIX file reads as 0 rather than having
indeterminate value, and returning DATA fits the current documentation
(but doing so bleeds through to at least 'qemu-img map --output=json'
for the raw format).

I think we're overloading too many things into DATA (which layer of the
chain feeds what the guest sees, and do we have a hole or is storage
allocated for the data). The only uses of BDRV_BLOCK_ALLOCATED are in
the computation of bdrv_is_allocated(), in qcow2 measure, and in
qemu-img compare, which all really do care about the semantics of "does
THIS layer provide the guest image, or do I defer to a backing layer".
But the question NBD wants answered is "do I know whether there is a
hole in the storage?" There are also relatively few clients of
BDRV_BLOCK_DATA (mirror.c, qemu-img, bdrv_co_block_status_above), and I
wonder if some of them are more worried about BDRV_BLOCK_ALLOCATED
instead.

I'm thinking of revamping things to still keep four bits, but with new
names and semantics as follows:

BDRV_BLOCK_LOCAL - the guest gets this portion of the file from this
BDS, rather than the backing chain - makes sense for format drivers,
pointless for protocol drivers
BDRV_BLOCK_ZERO - this portion of the file reads as zeroes
BDRV_BLOCK_ALLOC - this portion of the file has reserved disk space
BDRV_BLOCK_OFFSET_VALID - offset for accessing raw data

For format drivers:
L Z A O   read as zero, returned file is zero at offset
L - A O   read as valid from file at offset
L Z - O   read as zero, but returned file has hole at offset
L - - O   preallocated at offset but reads as garbage - bug?
L Z A -   read as zero, but from unknown offset with storage
L - A -   read as valid, but from unknown offset (including compressed,
          encrypted)
L Z - -   read as zero, but from unknown offset with hole
L - - -   preallocated but no offset known - bug?
- Z A O   read defers to backing layer, but protocol layer contains
          allocated zeros at offset
- - A O   read defers to backing layer, but preallocated at offset
- Z - O   bug
- - - O   bug
- Z A -   bug
- - A -   bug
- Z - -   bug
- - - -   read defers to backing layer

For protocol drivers:
- Z A O   read as zero, offset is allocated
- - A O   read as data, offset is allocated
- Z - O   read as zero, offset is hole
- - - O   bug?
- Z A -   read as zero, but from unknown offset with storage
- - A -   read as valid, but from unknown offset
- Z - -   read as zero, but from unknown offset with hole
- - - -   can't access this portion of file

With the new bit definitions, any driver that returns RAW (necessarily
with OFFSET_VALID) will have the block layer set LOCAL in addition to
whatever the next layer returns (turning the protocol driver's response
into the correct format layer response). Protocol drivers can omit the
callback and get the sane default of '- - A O' mapped in place (or would
that be better as '- - A -'?). file-posix.c would return either
'- - A O' (after SEEK_DATA) or '- Z - O' (after SEEK_HOLE). NBD would
map ZERO to NBD_STATE_ZERO, and ALLOC to !NBD_STATE_HOLE, in both server
(block-layer-to-NBD-protocol) and client (NBD-protocol-to-block-layer).
Format drivers would set LOCAL themselves (rather than the block layer
synthesizing it). bdrv_is_allocated will still let clients learn which
layers are local without grabbing full mapping information, but is tied
to the BDRV_BLOCK_LOCAL bit. Optimizations in mirror and qemu-img
compare that were previously based on BDRV_BLOCK_ALLOCATED are now based
on BDRV_BLOCK_LOCAL; those based on BDRV_BLOCK_DATA are now based on
BDRV_BLOCK_ALLOC.

Thoughts?
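The symmetric NBD mapping proposed in this mail (ZERO to NBD_STATE_ZERO, ALLOC to !NBD_STATE_HOLE, in both directions) can be sketched as below. The NBD bit positions come from the spec text quoted earlier; the `PROPOSED_*` values are hypothetical, since the mail names the bits without assigning numbers, and both function names are made up for this note.

```c
#include <stdint.h>

/* NBD flag bits as given in the spec text quoted earlier in the thread. */
#define NBD_STATE_HOLE (1u << 0)
#define NBD_STATE_ZERO (1u << 1)

/* Hypothetical values; only the names come from the proposal. */
#define PROPOSED_BLOCK_LOCAL        0x01
#define PROPOSED_BLOCK_ZERO         0x02
#define PROPOSED_BLOCK_ALLOC        0x04
#define PROPOSED_BLOCK_OFFSET_VALID 0x08

/* Server direction: block layer status to NBD extent flags. */
static uint32_t block_status_to_nbd(int status)
{
    return ((status & PROPOSED_BLOCK_ALLOC) ? 0 : NBD_STATE_HOLE) |
           ((status & PROPOSED_BLOCK_ZERO) ? NBD_STATE_ZERO : 0);
}

/* Client direction: NBD extent flags back to block layer status. */
static int nbd_to_block_status(uint32_t flags)
{
    return ((flags & NBD_STATE_HOLE) ? 0 : PROPOSED_BLOCK_ALLOC) |
           ((flags & NBD_STATE_ZERO) ? PROPOSED_BLOCK_ZERO : 0);
}
```

Because both directions key off the same bit, a round trip preserves ALLOC and ZERO; LOCAL and OFFSET_VALID are not carried over the wire in this sketch.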
On 24.02.2018 at 00:38, Eric Blake wrote:
> On 02/23/2018 11:05 AM, Kevin Wolf wrote:
> > On 23.02.2018 at 17:43, Eric Blake wrote:
> > > > OFFSET_VALID | DATA might be excusable because I can see that it's
> > > > convenient that a protocol driver refers to itself as *file instead of
> > > > returning NULL there and then the offset is valid (though it would be
> > > > pointless to actually follow the file pointer), but OFFSET_VALID without
> > > > DATA probably isn't.
> > >
> > > So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
> > > necessary to avoid breaking qemu-img map output. But you are also right
> > > that OFFSET_VALID without data makes little sense at a protocol layer. So
> > > with that in mind, I'm auditing all of the protocol layers to make sure
> > > OFFSET_VALID ends up as something sane.
> >
> > That's one way to look at it.
> >
> > The other way is that qemu-img map shouldn't ask the protocol layer for
> > its offset because it already knows the offset (it is what it passes as
> > a parameter to bdrv_co_block_status).
> >
> > Anyway, it's probably not worth changing the interface, we should just
> > make sure that the return values of the individual drivers are
> > consistent.
>
> Yet another inconsistency, and it's making me scratch my head today.
>
> By the way, in my byte-based stuff that is now pending on your tree, I tried
> hard to NOT change semantics or the set of flags returned by a given driver,
> and we agreed that's why you'd accept the series as-is and make me do this
> followup exercise. But it's looking like my followups may end up touching a
> lot of the same drivers again, now that I'm looking at what the semantics
> SHOULD be (and whatever I do end up tweaking, I will at least make sure that
> iotests is still happy with it).

Hm, that's unfortunate, but I don't think we should hold up your first
series just so we can touch the drivers only once.

> First, let's read what states the NBD spec is proposing:
>
> > It defines the following flags for the flags field:
> >
> >     NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and
> >     future writes to that area may cause fragmentation or encounter an
> >     ENOSPC error); if clear, the block is allocated or the server could
> >     not otherwise determine its status. Note that the use of
> >     NBD_CMD_TRIM is related to this status, but that the server MAY
> >     report a hole even where NBD_CMD_TRIM has not been requested, and
> >     also that a server MAY report that the block is allocated even
> >     where NBD_CMD_TRIM has been requested.
> >     NBD_STATE_ZERO (bit 1): if set, the block contents read as all
> >     zeroes; if clear, the block contents are not known. Note that the
> >     use of NBD_CMD_WRITE_ZEROES is related to this status, but that the
> >     server MAY report zeroes even where NBD_CMD_WRITE_ZEROES has not
> >     been requested, and also that a server MAY report unknown content
> >     even where NBD_CMD_WRITE_ZEROES has been requested.
> >
> > It is not an error for a server to report that a region of the export
> > has both NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of
> > such an area are undefined, and a client reading such an area should
> > make no assumption as to its contents or stability.
>
> So here's how Vladimir proposed implementing it in his series (written
> before my byte-based block status stuff went in to your tree):
> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04038.html
>
> Server side (3/9):
>
> +        int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes, &num,
> +                                          NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
> +                (ret & BDRV_BLOCK_ZERO      ? NBD_STATE_ZERO : 0);
>
> Client side (6/9):
>
> +    *pnum = extent.length >> BDRV_SECTOR_BITS;
> +    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
> +           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);
>
> Does anything there strike you as odd?
Two things I noticed while reading the above:

1. NBD doesn't consider backing files, so the definition of holes
   becomes ambiguous. Is a hole any block that isn't allocated in the
   top layer (may cause fragmentation or encounter an ENOSPC error) or
   is it any block that isn't allocated anywhere in the whole backing
   chain (may read as non-zero)?

   Considering that there is a separate NBD_STATE_ZERO and nothing
   forbids a state of NBD_STATE_HOLE without NBD_STATE_ZERO, maybe the
   former is more useful. The code you quote implements the latter.

   Maybe if we go with the former, we should add a note to the NBD spec
   that explicitly says that NBD_STATE_HOLE doesn't imply any specific
   content that is returned on reads.

2. Using BDRV_BLOCK_ALLOCATED to determine NBD_STATE_HOLE seems wrong. A
   (not preallocated) zero cluster in qcow2 returns BDRV_BLOCK_ALLOCATED
   (because we don't fall through to the backing file) even though I
   think it's a hole. BDRV_BLOCK_DATA should be used there (which makes
   it consistent with the other direction).

> In isolation, they seemed fine to
> me, but side-by-side, I'm scratching my head: the server queries the block
> layer, and turns BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the client side
> then takes the NBD protocol and tries to turn it back into information to
> feed the block layer, where !NBD_STATE_HOLE now feeds BDRV_BLOCK_DATA. Why
> the different choice of bits?

Which is actually consistent in the end, because BDRV_BLOCK_DATA implies
BDRV_BLOCK_ALLOCATED.

Essentially, assuming a simple backing chain 'base <- overlay', we got
these combinations to represent in NBD (with my suggestion of the flags
to use):

1. Cluster allocated in overlay
   a. non-zero data                         0
   b. explicit zeroes                       0 or ZERO
2. Cluster marked zero in overlay           HOLE | ZERO
3. Cluster preallocated/zero in overlay     ZERO
4. Cluster unallocated in overlay
   a. Cluster allocated in base (non-zero)  HOLE
   b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
   c. Cluster marked zero in base           HOLE | ZERO
   d. Cluster preallocated/zero in base     HOLE | ZERO
   e. Cluster unallocated in base           HOLE | ZERO

Instead of 'base' you can read 'anywhere in the backing chain' and the
flags should stay the same.

So !BDRV_BLOCK_ALLOCATED (i.e. falling through to the backing file) does
indeed imply NBD_STATE_HOLE, but so does case 2, which is just !DATA.

> Part of the story is that right now, we document that ONLY the block layer
> sets _ALLOCATED, in io.c, as a result of the driver layer returning DATA ||
> ZERO (there are cases where the block layer can return ZERO but not
> ALLOCATED, because the driver layer returned 0 but the block layer still
> knows that area reads as zero). So Vladimir's patch matches the fact that
> the driver shouldn't set ALLOCATED. Still, if we are tying ALLOCATED to
> whether there is a hole, then that seems like information we should be
> getting from the driver, not something synthesized after we've left the
> driver!

Yes, I'm getting this impression, too. If your documentation says
something like "not allocated or unknown offset" (for !OFFSET_VALID),
you should probably be using one bit more to distinguish these cases.

> Then there's the question of file-posix.c: what should it return for a hole,
> ZERO|OFFSET_VALID or DATA|ZERO|OFFSET_VALID? The wording in block.h implies
> that if DATA is not set, then the area reads as zero to the guest, but may
> have indeterminate value on the underlying file - but we KNOW that a hole in
> a POSIX file reads as 0 rather than having indeterminate value, and
> returning DATA fits the current documentation (but doing so bleeds through
> to at least 'qemu-img map --output=json' for the raw format).

The "underlying file" for the file-posix layer (i.e. the filesystem) is
a block device. A hole in a file is defined by not mapping to anywhere
on the block device, so DATA should not be set.
DATA | ZERO would mean that the block is actually allocated on the block
device, but it still reads as zero.

The thing that is inconsistent here is OFFSET_VALID and the offset
returned because the protocol layer refers to itself there instead of
referring to the "underlying file". If done consistently, it would have
to return the offset on the block device (which is useless information
in QEMU, so I suggested not to set OFFSET_VALID there at all - but we
decided that that's too much hassle for no practical benefit).

> I think we're overloading too many things into DATA (which layer of
> the chain feeds what the guest sees, and do we have a hole or is
> storage allocated for the data).

As I understand it, DATA should only be about holes (in the sense of not
being mapped to anywhere in bs->file or any other child apart from the
backing file).

Documentation does define OFFSET_VALID without DATA, though, as
preallocation. Maybe preallocation would better be expressed as DATA,
but without ALLOCATED.

> The only uses of BDRV_BLOCK_ALLOCATED are in the computation of
> bdrv_is_allocated(), in qcow2 measure, and in qemu-img compare, which all
> really do care about the semantics of "does THIS layer provide the guest
> image, or do I defer to a backing layer". But the question NBD wants
> answered is "do I know whether there is a hole in the storage" There are
> also relatively few clients of BDRV_BLOCK_DATA (mirror.c, qemu-img,
> bdrv_co_block_status_above), and I wonder if some of them are more worried
> about BDRV_BLOCK_ALLOCATED instead.
>
> I'm thinking of revamping things to still keep four bits, but with new names
> and semantics as follows:
>
> BDRV_BLOCK_LOCAL - the guest gets this portion of the file from this BDS,
> rather than the backing chain - makes sense for format drivers, pointless
> for protocol drivers

This is the old BDRV_BLOCK_ALLOCATED. Data almost never comes from the
qcow2 layer, so what this really means is that data doesn't come from
bs->backing.

> BDRV_BLOCK_ZERO - this portion of the file reads as zeroes

Same as before.

> BDRV_BLOCK_ALLOC - this portion of the file has reserved disk space

I think this is essentially what I believe BDRV_BLOCK_DATA should have
been. "Disk space" isn't clearly defined, but "there is a mapping to a
child node (except bs->backing)" seems to be close enough to what you
have in mind.

> BDRV_BLOCK_OFFSET_VALID - offset for accessing raw data

Same as before.

As I understand it, you're just renaming the existing flags. I'm not
sure if this is a good idea, especially with ALLOC(ATED), which changes
the meaning. This is pretty confusing. I suggest MAPPED as an
alternative.

> For format drivers:
> L Z A O  read as zero, returned file is zero at offset

This is not what your definition above said. ZERO is about reading from
the node itself rather than reading from bs->file. I interpret this as:

Read as zero, space is preallocated in the image file, content in the
image file is undefined.

> L - A O  read as valid from file at offset
> L Z - O  read as zero, but returned file has hole at offset

Read as zero, no mapping into bs->file, but the offset in bs->file is
valid. Doesn't make sense, OFFSET_VALID (O) should always imply MAPPED
(A).

> L - - O  preallocated at offset but reads as garbage - bug?

Again O without A - yes, a bug.

> L Z A -  read as zero, but from unknown offset with storage

And the space is preallocated in a non-backing child, though not
necessarily zeroed.

> L - A -  read as valid, but from unknown offset (including compressed,
> encrypted)
> L Z - -  read as zero, but from unknown offset with hole
> L - - -  preallocated but no offset known - bug?

No, not preallocated, because MAPPED isn't set. This is a block that
isn't mapped to a block in any child node and doesn't read as zero. It
might be the appropriate response for the null driver with
read-zeroes=off.

> - Z A O  read defers to backing layer, but protocol layer contains
> allocated zeros at offset

No. Space is preallocated for this block, but read defers to the backing
layer and we know that the backing layer provides zeros.

This is not something that a driver should return, but it's a valid
return value from bdrv_co_block_status(). One example for this is
reading from an offset that is higher than the length of the backing
file.

> - - A O  read defers to backing layer, but preallocated at offset

Yes, this one is preallocation, finally.

> - Z - O  bug
> - - - O  bug
> - Z A -  bug

Same as '- Z A O' except that the offset can't be directly accessed in
the child node (e.g. because this is an encrypted image).

> - - A -  bug

Preallocated, but reads from backing file. Offset can't be directly
accessed in the child node.

> - Z - -  bug

Read defers to backing layer and we know it will read zeros. Like for
'- Z A O', this isn't something that a driver should return, but makes
sense as a return value of bdrv_co_block_status().

> - - - -  read defers to backing layer
>
> For protocol drivers:
> - Z A O  read as zero, offset is allocated
> - - A O  read as data, offset is allocated
> - Z - O  read as zero, offset is hole
> - - - O  bug?
> - Z A -  read as zero, but from unknown offset with storage
> - - A -  read as valid, but from unknown offset
> - Z - -  read as zero, but from unknown offset with hole
> - - - -  can't access this portion of file

Why don't you set LOCAL? It's usually true for protocol drivers that
they don't get their data from a backing file (though in theory you
could imagine a protocol driver with backing file support).

As discussed before, OFFSET_VALID doesn't really make sense here because
we don't return offsets of the image file on the block device, but we
only decided to keep it because of convenience. But if we change
everything, then this should be changed, too.

While the offset still refers to the same node for protocol drivers,
"unknown offset" doesn't make any sense. There is no mapping involved
that could be unknown.

We really only have four cases for protocols. ZERO and MAPPED make sense
this way. Not sure about 0 (or only OFFSET_VALID), could this ever be
valid? You say "can't access this portion of file", but where would this
happen?

Kevin
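The rule stated in this reply, that OFFSET_VALID should always imply
MAPPED, can be sketched as a small consistency check. This is only an
illustration under assumptions: the SKETCH_* names and bit values are made
up for this example and are not QEMU's BDRV_BLOCK_* constants, and LOCAL
and MAPPED are merely the names proposed in this thread, not existing
flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for the four flags discussed in the thread. */
enum {
    SKETCH_LOCAL        = 1 << 0, /* data does not come from bs->backing */
    SKETCH_ZERO         = 1 << 1, /* reads as zeroes */
    SKETCH_MAPPED       = 1 << 2, /* mapped to a non-backing child node */
    SKETCH_OFFSET_VALID = 1 << 3, /* returned offset is usable in the child */
};

/*
 * Returns false for combinations that the reply calls a bug:
 * an offset reported as valid although nothing is mapped.
 */
bool sketch_flags_consistent(unsigned flags)
{
    if ((flags & SKETCH_OFFSET_VALID) && !(flags & SKETCH_MAPPED)) {
        return false;
    }
    return true;
}

/* Small self-check covering rows from the format driver table. */
void sketch_self_check(void)
{
    /* 'L Z A O' is fine. */
    assert(sketch_flags_consistent(SKETCH_LOCAL | SKETCH_ZERO |
                                   SKETCH_MAPPED | SKETCH_OFFSET_VALID));
    /* 'L - - O' is O without A, which the reply calls a bug. */
    assert(!sketch_flags_consistent(SKETCH_LOCAL | SKETCH_OFFSET_VALID));
    /* 'L - - -' is allowed (e.g. null driver with read-zeroes=off). */
    assert(sketch_flags_consistent(SKETCH_LOCAL));
}
```

The check deliberately says nothing about LOCAL or ZERO, since the reply
treats most of those combinations as meaningful return values.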
26.02.2018 17:05, Kevin Wolf wrote:
> On 24.02.2018 at 00:38, Eric Blake wrote:
>> On 02/23/2018 11:05 AM, Kevin Wolf wrote:
>>> On 23.02.2018 at 17:43, Eric Blake wrote:
>>>>> OFFSET_VALID | DATA might be excusable because I can see that it's
>>>>> convenient that a protocol driver refers to itself as *file instead of
>>>>> returning NULL there and then the offset is valid (though it would be
>>>>> pointless to actually follow the file pointer), but OFFSET_VALID without
>>>>> DATA probably isn't.
>>>> So OFFSET_VALID | DATA for a protocol BDS is not just convenient, but
>>>> necessary to avoid breaking qemu-img map output. But you are also right
>>>> that OFFSET_VALID without data makes little sense at a protocol layer. So
>>>> with that in mind, I'm auditing all of the protocol layers to make sure
>>>> OFFSET_VALID ends up as something sane.
>>> That's one way to look at it.
>>>
>>> The other way is that qemu-img map shouldn't ask the protocol layer for
>>> its offset because it already knows the offset (it is what it passes as
>>> a parameter to bdrv_co_block_status).
>>>
>>> Anyway, it's probably not worth changing the interface, we should just
>>> make sure that the return values of the individual drivers are
>>> consistent.
>> Yet another inconsistency, and it's making me scratch my head today.
>>
>> By the way, in my byte-based stuff that is now pending on your tree, I tried
>> hard to NOT change semantics or the set of flags returned by a given driver,
>> and we agreed that's why you'd accept the series as-is and make me do this
>> followup exercise. But it's looking like my followups may end up touching a
>> lot of the same drivers again, now that I'm looking at what the semantics
>> SHOULD be (and whatever I do end up tweaking, I will at least make sure that
>> iotests is still happy with it).
> Hm, that's unfortunate, but I don't think we should hold up your first
> series just so we can touch the drivers only once.
>
>> First, let's read what states the NBD spec is proposing:
>>
>>> It defines the following flags for the flags field:
>>>
>>> NBD_STATE_HOLE (bit 0): if set, the block represents a hole (and future writes to that area may cause fragmentation or encounter an ENOSPC error); if clear, the block is allocated or the server could not otherwise determine its status. Note that the use of NBD_CMD_TRIM is related to this status, but that the server MAY report a hole even where NBD_CMD_TRIM has not been requested, and also that a server MAY report that the block is allocated even where NBD_CMD_TRIM has been requested.
>>> NBD_STATE_ZERO (bit 1): if set, the block contents read as all zeroes; if clear, the block contents are not known. Note that the use of NBD_CMD_WRITE_ZEROES is related to this status, but that the server MAY report zeroes even where NBD_CMD_WRITE_ZEROES has not been requested, and also that a server MAY report unknown content even where NBD_CMD_WRITE_ZEROES has been requested.
>>>
>>> It is not an error for a server to report that a region of the export has both NBD_STATE_HOLE set and NBD_STATE_ZERO clear. The contents of such an area are undefined, and a client reading such an area should make no assumption as to its contents or stability.
>> So here's how Vladimir proposed implementing it in his series (written
>> before my byte-based block status stuff went in to your tree):
>> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg04038.html
>>
>> Server side (3/9):
>>
>> +    int ret = bdrv_block_status_above(bs, NULL, offset, tail_bytes, &num,
>> +                                      NULL, NULL);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    flags = (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE) |
>> +            (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
>>
>> Client side (6/9):
>>
>> +    *pnum = extent.length >> BDRV_SECTOR_BITS;
>> +    return (extent.flags & NBD_STATE_HOLE ? 0 : BDRV_BLOCK_DATA) |
>> +           (extent.flags & NBD_STATE_ZERO ? BDRV_BLOCK_ZERO : 0);
>>
>> Does anything there strike you as odd?
> Two things I noticed while reading the above:
>
> 1. NBD doesn't consider backing files, so the definition of holes
>    becomes ambiguous. Is a hole any block that isn't allocated in the
>    top layer (may cause fragmentation or encounter an ENOSPC error) or
>    is it any block that isn't allocated anywhere in the whole backing
>    chain (may read as non-zero)?
>
>    Considering that there is a separate NBD_STATE_ZERO and nothing
>    forbids a state of NBD_STATE_HOLE without NBD_STATE_ZERO, maybe the
>    former is more useful. The code you quote implements the latter.
>
>    Maybe if we go with the former, we should add a note to the NBD spec
>    that explicitly says that NBD_STATE_HOLE doesn't imply any specific
>    content that is returned on reads.
>
> 2. Using BDRV_BLOCK_ALLOCATED to determine NBD_STATE_HOLE seems wrong. A
>    (not preallocated) zero cluster in qcow2 returns BDRV_BLOCK_ALLOCATED
>    (because we don't fall through to the backing file) even though I
>    think it's a hole. BDRV_BLOCK_DATA should be used there (which makes
>    it consistent with the other direction).
>
>> In isolation, they seemed fine to
>> me, but side-by-side, I'm scratching my head: the server queries the block
>> layer, and turns BDRV_BLOCK_ALLOCATED into !NBD_STATE_HOLE; the client side
>> then takes the NBD protocol and tries to turn it back into information to
>> feed the block layer, where !NBD_STATE_HOLE now feeds BDRV_BLOCK_DATA. Why
>> the different choice of bits?
> Which is actually consistent in the end, because BDRV_BLOCK_DATA implies
> BDRV_BLOCK_ALLOCATED.
>
> Essentially, assuming a simple backing chain 'base <- overlay', we got
> these combinations to represent in NBD (with my suggestion of the flags
> to use):
>
> 1. Cluster allocated in overlay
>    a. non-zero data                         0
>    b. explicit zeroes                       0 or ZERO
> 2. Cluster marked zero in overlay           HOLE | ZERO
> 3. Cluster preallocated/zero in overlay     ZERO
> 4. Cluster unallocated in overlay
>    a. Cluster allocated in base (non-zero)  HOLE
>    b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
>    c. Cluster marked zero in base           HOLE | ZERO
>    d. Cluster preallocated/zero in base     HOLE | ZERO
>    e. Cluster unallocated in base           HOLE | ZERO
>
> Instead of 'base' you can read 'anywhere in the backing chain' and the
> flags should stay the same.

I think only "anywhere in the backing chain" is valid here. Otherwise,
semantics of bdrv_is_allocated would differ for NBD and for not-NBD.

I think, if bdrv_is_allocated returns false, it means that we can skip
this region in copying process, am I right?

>
> So !BDRV_BLOCK_ALLOCATED (i.e. falling through to the backing file) does
> indeed imply NBD_STATE_HOLE, but so does case 2, which is just !DATA.
>
>
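To make Kevin's second point concrete, here is a minimal sketch of the
server-side and client-side conversions when both key on DATA rather
than ALLOCATED. It rests on assumptions: the SK_BLOCK_* values are
invented for this example and are not QEMU's real constants; only the
NBD bit positions (HOLE is bit 0, ZERO is bit 1) come from the spec text
quoted above. It does not model how a given driver reports preallocated
zero clusters (case 3), so it is not claimed to reproduce the whole
table.

```c
#include <assert.h>

/* Hypothetical bit values; QEMU's real BDRV_BLOCK_* constants differ. */
enum {
    SK_BLOCK_DATA      = 1 << 0,
    SK_BLOCK_ZERO      = 1 << 1,
    SK_BLOCK_ALLOCATED = 1 << 2,
};

/* Bit positions as given in the NBD spec text quoted in this thread. */
enum {
    SK_NBD_STATE_HOLE = 1 << 0,
    SK_NBD_STATE_ZERO = 1 << 1,
};

/* Server side: block status to NBD extent flags, keyed on DATA. */
unsigned sk_status_to_nbd(unsigned status)
{
    return ((status & SK_BLOCK_DATA) ? 0 : SK_NBD_STATE_HOLE) |
           ((status & SK_BLOCK_ZERO) ? SK_NBD_STATE_ZERO : 0);
}

/* Client side: NBD extent flags back to block status, as in the quoted 6/9. */
unsigned sk_nbd_to_status(unsigned flags)
{
    return ((flags & SK_NBD_STATE_HOLE) ? 0 : SK_BLOCK_DATA) |
           ((flags & SK_NBD_STATE_ZERO) ? SK_BLOCK_ZERO : 0);
}

/* Self-check for the two cases the thread spells out. */
void sk_nbd_self_check(void)
{
    /* Case 1a: allocated, non-zero data in the overlay gives 0. */
    assert(sk_status_to_nbd(SK_BLOCK_DATA | SK_BLOCK_ALLOCATED) == 0);
    /* Case 2: zero cluster, ALLOCATED but not DATA, gives HOLE | ZERO. */
    assert(sk_status_to_nbd(SK_BLOCK_ALLOCATED | SK_BLOCK_ZERO) ==
           (SK_NBD_STATE_HOLE | SK_NBD_STATE_ZERO));
}
```

With the ALLOCATED-based server code quoted above, case 2 would instead
come out as ZERO without HOLE, which is the mismatch being pointed out.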
On 01.03.2018 at 08:25, Vladimir Sementsov-Ogievskiy wrote:
> 26.02.2018 17:05, Kevin Wolf wrote:
> > Essentially, assuming a simple backing chain 'base <- overlay', we got
> > these combinations to represent in NBD (with my suggestion of the flags
> > to use):
> >
> > 1. Cluster allocated in overlay
> >    a. non-zero data                         0
> >    b. explicit zeroes                       0 or ZERO
> > 2. Cluster marked zero in overlay           HOLE | ZERO
> > 3. Cluster preallocated/zero in overlay     ZERO
> > 4. Cluster unallocated in overlay
> >    a. Cluster allocated in base (non-zero)  HOLE
> >    b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
> >    c. Cluster marked zero in base           HOLE | ZERO
> >    d. Cluster preallocated/zero in base     HOLE | ZERO
> >    e. Cluster unallocated in base           HOLE | ZERO
> >
> > Instead of 'base' you can read 'anywhere in the backing chain' and the
> > flags should stay the same.
>
> I think only "anywhere in the backing chain" is valid here. Otherwise,
> semantics of bdrv_is_allocated would differ for NBD and for not-NBD.

This was meant as a mapping from cases to flags, not the other way
round, so really doesn't say anything about the cases where the block is
allocated further down the chain.

But yes, it shouldn't make a difference where in the backing chain a
block is allocated, so these cases are the same as 4.

> I think, if bdrv_is_allocated returns false, it means that we can skip
> this region in copying process, am I right?

-ENOCONTEXT? Which copying process?

There are cases where you want to copy such regions, and other cases
where you want to skip them. It depends on the use case. For example,
'qemu-img convert' skips them with -B (because the backing file is
reused), but not without -B (which creates a full copy).

Kevin
01.03.2018 12:48, Kevin Wolf wrote:
> On 01.03.2018 at 08:25, Vladimir Sementsov-Ogievskiy wrote:
>> 26.02.2018 17:05, Kevin Wolf wrote:
>>> Essentially, assuming a simple backing chain 'base <- overlay', we got
>>> these combinations to represent in NBD (with my suggestion of the flags
>>> to use):
>>>
>>> 1. Cluster allocated in overlay
>>>    a. non-zero data                         0
>>>    b. explicit zeroes                       0 or ZERO
>>> 2. Cluster marked zero in overlay           HOLE | ZERO
>>> 3. Cluster preallocated/zero in overlay     ZERO
>>> 4. Cluster unallocated in overlay
>>>    a. Cluster allocated in base (non-zero)  HOLE
>>>    b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
>>>    c. Cluster marked zero in base           HOLE | ZERO
>>>    d. Cluster preallocated/zero in base     HOLE | ZERO
>>>    e. Cluster unallocated in base           HOLE | ZERO
>>>
>>> Instead of 'base' you can read 'anywhere in the backing chain' and the
>>> flags should stay the same.
>> I think only "anywhere in the backing chain" is valid here. Otherwise,
>> semantics of bdrv_is_allocated would differ for NBD and for not-NBD.
> This was meant as a mapping from cases to flags, not the other way
> round, so really doesn't say anything about the cases where the block is
> allocated further down the chain.
>
> But yes, it shouldn't make a difference where in the backing chain a
> block is allocated, so these cases are the same as 4.
>
>> I think, if bdrv_is_allocated returns false, it means that we can skip
>> this region in copying process, am I right?
> -ENOCONTEXT? Which copying process?
>
> There are cases where you want to copy such regions, and other cases
> where you want to skip them. It depends on the use case. For example,
> 'qemu-img convert' skips them with -B (because the backing file is
> reused), but not without -B (which creates a full copy).
>
> Kevin

Hm, I thought that bdrv_is_allocated loops through backings, but it
doesn't, sorry.
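The distinction being cleared up here, a query on the top layer only
versus a walk down the backing chain, can be shown with a toy model. This
is a sketch under assumptions: struct sk_node and both functions are
invented for illustration, with one "allocated" bit per layer instead of
the offset, byte count and pnum arguments that QEMU's
bdrv_is_allocated() and bdrv_is_allocated_above() actually take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one layer of a backing chain (hypothetical type). */
struct sk_node {
    bool allocated;           /* is the region allocated in this layer? */
    struct sk_node *backing;  /* next layer down, or NULL at the bottom */
};

/* Looks at the given layer only; never follows the backing pointer. */
bool sk_is_allocated(const struct sk_node *bs)
{
    assert(bs != NULL);
    return bs->allocated;
}

/*
 * Walks from top down towards base and reports whether any layer on
 * the way allocates the region. base itself is excluded; passing a
 * NULL base walks the whole chain.
 */
bool sk_is_allocated_above(const struct sk_node *top,
                           const struct sk_node *base)
{
    for (const struct sk_node *n = top; n != NULL && n != base;
         n = n->backing) {
        if (n->allocated) {
            return true;
        }
    }
    return false;
}
```

For a chain 'base <- overlay' where only base allocates the region,
sk_is_allocated(&overlay) is false while
sk_is_allocated_above(&overlay, NULL) is true, which matches case 4a in
the table quoted above.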
On 01.03.2018 at 10:57, Vladimir Sementsov-Ogievskiy wrote:
> 01.03.2018 12:48, Kevin Wolf wrote:
> > On 01.03.2018 at 08:25, Vladimir Sementsov-Ogievskiy wrote:
> > > 26.02.2018 17:05, Kevin Wolf wrote:
> > > > Essentially, assuming a simple backing chain 'base <- overlay', we got
> > > > these combinations to represent in NBD (with my suggestion of the flags
> > > > to use):
> > > >
> > > > 1. Cluster allocated in overlay
> > > >    a. non-zero data                         0
> > > >    b. explicit zeroes                       0 or ZERO
> > > > 2. Cluster marked zero in overlay           HOLE | ZERO
> > > > 3. Cluster preallocated/zero in overlay     ZERO
> > > > 4. Cluster unallocated in overlay
> > > >    a. Cluster allocated in base (non-zero)  HOLE
> > > >    b. Cluster allocated in base (zero)      HOLE or HOLE | ZERO
> > > >    c. Cluster marked zero in base           HOLE | ZERO
> > > >    d. Cluster preallocated/zero in base     HOLE | ZERO
> > > >    e. Cluster unallocated in base           HOLE | ZERO
> > > >
> > > > Instead of 'base' you can read 'anywhere in the backing chain' and the
> > > > flags should stay the same.
> > > I think only "anywhere in the backing chain" is valid here. Otherwise,
> > > semantics of bdrv_is_allocated would differ for NBD and for not-NBD.
> > This was meant as a mapping from cases to flags, not the other way
> > round, so really doesn't say anything about the cases where the block is
> > allocated further down the chain.
> >
> > But yes, it shouldn't make a difference where in the backing chain a
> > block is allocated, so these cases are the same as 4.
> >
> > > I think, if bdrv_is_allocated returns false, it means that we can skip
> > > this region in copying process, am I right?
> > -ENOCONTEXT? Which copying process?
> >
> > There are cases where you want to copy such regions, and other cases
> > where you want to skip them. It depends on the use case. For example,
> > 'qemu-img convert' skips them with -B (because the backing file is
> > reused), but not without -B (which creates a full copy).
> >
> > Kevin
>
> Hm, I thought that bdrv_is_allocated loops through backings, but it doesn't,
> sorry.

That would be bdrv_is_allocated_above() with a NULL base.

Kevin
diff --git a/block/null.c b/block/null.c
index 214d394fff4..806a8631e4d 100644
--- a/block/null.c
+++ b/block/null.c
@@ -223,22 +223,23 @@ static int null_reopen_prepare(BDRVReopenState *reopen_state,
     return 0;
 }
 
-static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
-                                                     int64_t sector_num,
-                                                     int nb_sectors, int *pnum,
-                                                     BlockDriverState **file)
+static int coroutine_fn null_co_block_status(BlockDriverState *bs,
+                                             bool want_zero, int64_t offset,
+                                             int64_t bytes, int64_t *pnum,
+                                             int64_t *map,
+                                             BlockDriverState **file)
 {
     BDRVNullState *s = bs->opaque;
-    off_t start = sector_num * BDRV_SECTOR_SIZE;
+    int ret = BDRV_BLOCK_OFFSET_VALID;
 
-    *pnum = nb_sectors;
+    *pnum = bytes;
+    *map = offset;
     *file = bs;
 
     if (s->read_zeroes) {
-        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
-    } else {
-        return BDRV_BLOCK_OFFSET_VALID | start;
+        ret |= BDRV_BLOCK_ZERO;
     }
+    return ret;
 }
 
 static void null_refresh_filename(BlockDriverState *bs, QDict *opts)
@@ -270,7 +271,7 @@ static BlockDriver bdrv_null_co = {
     .bdrv_co_flush_to_disk = null_co_flush,
     .bdrv_reopen_prepare = null_reopen_prepare,
 
-    .bdrv_co_get_block_status = null_co_get_block_status,
+    .bdrv_co_block_status = null_co_block_status,
 
     .bdrv_refresh_filename = null_refresh_filename,
 };
@@ -290,7 +291,7 @@ static BlockDriver bdrv_null_aio = {
     .bdrv_aio_flush = null_aio_flush,
     .bdrv_reopen_prepare = null_reopen_prepare,
 
-    .bdrv_co_get_block_status = null_co_get_block_status,
+    .bdrv_co_block_status = null_co_block_status,
 
     .bdrv_refresh_filename = null_refresh_filename,
 };