Message ID: 1480073296-6931-1-git-send-email-vsementsov@virtuozzo.com
State: New
On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed
> to query the status of a particular range and read only those blocks of
> data that are actually present on the block device.
>
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
>
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via the NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors). Future NBD extensions may add commands
> to control dirtiness through NBD.
>
> Since the NBD protocol has no notion of block size, and to mimic the SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response of the NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
>
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>
> v3:
>
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of the status descriptor lengths must
> equal the requested length, i.e. let's permit the server to reply with less
> data than requested if it wants.
>
> Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.
>
> And, finally, I've rebased this onto the current state of the
> extension-structured-reply branch (which itself should be rebased on
> master, IMHO).
>
> By this resend I just want to continue the discussion started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
>
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
>    A: This is all for read-only disks, so the data is static and unchangeable.
>
> 2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: the server replies with status descriptors of any size; granularity
>          is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities do not intersect
>
> 3. Q: Selecting the dirty bitmap to export
>    A: several variants:
>       1: the id of the bitmap is in the flags field of the request
>          pros: - simple
>          cons: - it's a hack; the flags field is for other uses
>                - we'll have to map bitmap names to these "ids"
>       2: introduce extended NBD requests with variable length and exploit this
>          feature for the BLOCK_STATUS command, specifying a bitmap identifier
>          pros: - looks like the right way
>          cons: - we have to create an additional extension
>                - possibly we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: an external tool should select which bitmap to export; so, in the
>          case of QEMU, it should be something like a QMP command
>          block-export-dirty-bitmap
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - an additional QMP command to implement (possibly the
>                  lesser evil)
>          note: Hmm, the external tool can choose between allocated/dirty data
>                too, so we could remove the 'NBD_FLAG_STATUS_DIRTY' flag
>                entirely.
>
> 4. Q: Should get_{allocated,dirty} not be separate commands?
>    cons: Two commands with almost the same semantics and similar means?
>    pros: However, here is a good point in favour of separating the clearly
>          defined and, for block devices, native GET_BLOCK_STATUS from the
>          user-driven and actually undefined data called 'dirtiness'.
>
> 5. The number of status descriptors sent by the server should be restricted.
>    variants:
>    1: just allow the server to restrict this as it wants (which was done
>       in v3)
>    2: (not excluding 1) the client somehow specifies the maximum number
>       of descriptors
>       2.1: add a command flag which will request only one descriptor
>            (otherwise, no restrictions from the client)
>       2.2: again, introduce extended NBD requests, and add a field to
>            specify this maximum
>
> 6. A: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (The server should
>    return EINVAL if it finds unknown bits; the client should consider a
>    reply with unknown bits an error.)
>
> ======
>
> Also, an idea on 2-4:
>
> As we say that dirtiness is unknown to NBD, and an external tool
> should specify, manage and understand which data is actually
> transmitted, why not just call it user_data and leave the status field
> of the reply chunk unspecified in this case?
>
> So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> Eric's 'Block provisioning status' paragraph. If it is set, we just
> leave the status field for some external... protocol? Who knows what
> this user data is.
>
> Note: I'm not sure that I like this (my) proposal. It's just an
> idea; maybe someone will like it. And, I think, it represents what we
> are trying to do more honestly.
>
> Note2: the next step of generalization will be NBD_CMD_USER, with
> variable request size, structured reply and no definition :)
>
>
> Another idea, about backups themselves:
>
> Why do we need allocated/zero status for backup? IMHO we don't.
>
> Full backup: just do a structured read - it will show us which chunks
> may be treated as zeroes.
>
> Incremental backup: get the dirty bitmap (somehow, for example through
> the user-defined part of the proposed command), then, for dirty blocks,
> read them through structured read, so information about zero/unallocated
> areas is there.
>
> For me all the variants above are OK. Let's finally choose something.
>
> v2:
> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
>
> Since then, we've added the STRUCTURED_REPLY extension, which
> necessitates a rather large rebase; I've also changed things
> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
> modes to be determined by boolean flags (rather than by fixed
> values of the 16-bit flags field), changed the reply status fields
> to be bitwise-or values (with a default of 0 always being sane),
> and changed the descriptor layout to drop an offset but to include
> a 32-bit status so that the descriptor is nicely 8-byte aligned
> without padding.
>
> doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 154 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
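The v2 notes above describe the reply descriptor as a 32-bit extent length plus a 32-bit status, naturally 8-byte aligned with no padding. A minimal sketch of that layout, assuming network byte order as NBD uses elsewhere (the helper names and the idea that a reply chunk payload is just a concatenation of such descriptors are my own illustration, not the normative extension text):

```python
import struct

# One block status descriptor, as described in the v2 notes:
# 32-bit extent length + 32-bit status bitmap = 8 bytes, no padding.
# ">" selects big-endian, i.e. network byte order.
DESCRIPTOR = struct.Struct(">II")

def pack_extent(length, status):
    """Encode a single (length, status) extent descriptor."""
    return DESCRIPTOR.pack(length, status)

def unpack_extents(payload):
    """Split a reply chunk payload into (length, status) extents."""
    return [DESCRIPTOR.unpack_from(payload, off)
            for off in range(0, len(payload), DESCRIPTOR.size)]

# Two extents: 64 KiB with status 0, then 4 KiB with bits 0 and 1 set.
payload = pack_extent(65536, 0) + pack_extent(4096, 3)
assert DESCRIPTOR.size == 8
assert unpack_extents(payload) == [(65536, 0), (4096, 3)]
```

The 8-byte size means a server can stream any number of descriptors in one structured reply chunk and the client can split them without a length prefix per extent.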
Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:

- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right.

- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed
> to query the status of a particular range and read only those blocks of
> data that are actually present on the block device.
>
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
>
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via the NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors). Future NBD extensions may add commands
> to control dirtiness through NBD.
> > Since NBD protocol has no notion of block size, and to mimic SCSI > "GET LBA STATUS" command more closely, it has been chosen to return > a list of extents in the response of NBD_CMD_BLOCK_STATUS command, > instead of a bitmap. > > CC: Pavel Borzenkov <pborzenkov@virtuozzo.com> > CC: Denis V. Lunev <den@openvz.org> > CC: Wouter Verhelst <w@uter.be> > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Eric Blake <eblake@redhat.com> > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> > --- > > v3: > > Hi all. This is almost a resend of v2 (by Eric Blake), The only change is > removing the restriction, that sum of status descriptor lengths must be equal > to requested length. I.e., let's permit the server to replay with less data > than required if it wants. Reasonable, yes. The length that the client requests should be a maximum (i.e. "I'm interested in this range"), not an exact request. > Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now > NBD_FLAG_CAN_MULTI_CONN in master branch. Right. > And, finally, I've rebased this onto current state of > extension-structured-reply branch (which itself should be rebased on > master IMHO). Probably a good idea, given the above. > By this resend I just want to continue the diqussion, started about half > a year ago. Here is a summary of some questions and ideas from v2 > diqussion: > > 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? > A: This all is for read-only disks, so the data is static and unchangeable. I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever. 
We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state. This is true for NBD in general (i.e., concurrent read or write commands from other clients may confuse file systems on top of NBD), so it doesn't change expectations in any way. > 2. Q: different granularities of dirty/allocated bitmaps. Any problems? > A: 1: server replies with status descriptors of any size, granularity > is hidden from the client > 2: dirty/allocated requests are separate and unrelated to each > other, so their granularities are not intersecting Not entirely sure anymore what this is about? > 3. Q: selecting of dirty bitmap to export > A: several variants: > 1: id of bitmap is in flags field of request > pros: - simple > cons: - it's a hack. flags field is for other uses. > - we'll have to map bitmap names to these "ids" > 2: introduce extended nbd requests with variable length and exploit this > feature for BLOCK_STATUS command, specifying bitmap identifier. > pros: - looks like a true way > cons: - we have to create additional extension > - possible we have to create a map, > {<QEMU bitmap name> <=> <NBD bitmap id>} > 3: exteranl tool should select which bitmap to export. So, in case of Qemu > it should be something like qmp command block-export-dirty-bitmap. > pros: - simple > - we can extend it to behave like (2) later > cons: - additional qmp command to implement (possibly, the lesser evil) > note: Hmm, external tool can make chose between allocated/dirty data too, > so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. Downside of 3, though, is that it moves the definition of what the different states mean outside of the NBD protocol (i.e., the protocol messages are not entirely defined anymore, and their meaning depends on the clients and servers in use). 
To avoid this, we should have a clear definition of what the reply means *by default*, but then we can add a note that clients and servers can possibly define other meanings out of band if they want to. > 4. Q: Should not get_{allocated,dirty} be separate commands? > cons: Two commands with almost same semantic and similar means? > pros: However here is a good point of separating clearly defined and native > for block devices GET_BLOCK_STATUS from user-driven and actually > undefined data, called 'dirtyness'. Yeah, having them separate commands might be a bad idea indeed. > 5. Number of status descriptors, sent by server, should be restricted > variants: > 1: just allow server to restrict this as it wants (which was done in v3) > 2: (not excluding 1). Client specifies somehow the maximum for number > of descriptors. > 2.1: add command flag, which will request only one descriptor > (otherwise, no restrictions from the client) > 2.2: again, introduce extended nbd requests, and add field to > specify this maximum I think having a flag which requests just one descriptor can be useful, but I'm hesitant to add it unless it's actually going to be used; so in other words, I'll leave the decision on that bit to you. > 6. A: What to do with unspecified flags (in request/reply)? > I think the normal variant is to make them reserved. (Server should > return EINVAL if found unknown bits, client should consider replay > with unknown bits as an error) Right, probably best to do that, yes. > ====== > > Also, an idea on 2-4: > > As we say, that dirtiness is unknown for NBD, and external tool > should specify, manage and understand, which data is actually > transmitted, why not just call it user_data and leave status field > of reply chunk unspecified in this case? > > So, I propose one flag for NBD_CMD_BLOCK_STATUS: > NBD_FLAG_STATUS_USER. If it is clear, than behaviour is defined by > Eric's 'Block provisioning status' paragraph. 
If it is set, we just > leave status field for some external... protocol? Who knows, what is > this user data. Yes, this sounds like a reasonable approach. > Note: I'm not sure, that I like this (my) proposal. It's just an > idea, may be someone like it. And, I think, it represents what we > are trying to do more honestly. Indeed. > Note2: the next step of generalization will be NBD_CMD_USER, with > variable request size, structured reply and no definition :) Well, er, no please, if we can avoid it :-) > Another idea, about backups themselves: > > Why do we need allocated/zero status for backup? IMHO we don't. Well, I've been thinking so all along, but then I don't really know what it is, in detail, that you want to do :-) I can understand a "has this changed since time X" request, which the "dirty" thing seems to want to be. Whether something is allocated is just a special case of that. Actually, come to think of that. What is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the block status thingies, right? If so, how about: - NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of snapshots. Not required, optional, includes a machine-readable form, not defined by NBD, which explains what the snapshot is about (e.g., a qemu json file). The "base" version of that is just "allocation status", and is implied (i.e., you don't need to run NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the allocation status). - NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors which tell you what the status of a block of data is for each of the relevant snapshots that we know about. 
Perhaps this is somewhat overengineered, but it does bring most of the definition of what a snapshot is back into the NBD protocol, without having to say "this could be anything", and without requiring connectivity over two ports for this to be useful (e.g., you could store the machine-readable form of the snapshot description into your backup program and match what they mean with what you're interested in at restore time, etc). This wouldn't work if you're interested in new snapshots that get created once we've already moved into transmission, but hey. Thoughts? > Full backup: just do structured read - it will show us, which chunks > may be treaded as zeroes. Right. > Incremental backup: get dirty bitmap (somehow, for example through > user-defined part of proposed command), than, for dirty blocks, read > them through structured read, so information about zero/unallocated > areas are here. > > For me all the variants above are OK. Let's finally choose something. > > v2: > v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html > > Since then, we've added the STRUCTURED_REPLY extension, which > necessitates a rather larger rebase; I've also changed things > to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request > modes to be determined by boolean flags (rather than by fixed > values of the 16-bit flags field), changed the reply status fields > to be bitwise-or values (with a default of 0 always being sane), > and changed the descriptor layout to drop an offset but to include > a 32-bit status so that the descriptor is nicely 8-byte aligned > without padding. > > doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 154 insertions(+), 1 deletion(-) [...] 
I'll commit this in a minute into a separate branch called
"extension-blockstatus", on the understanding that changes are still
required, as per the above (i.e., don't assume that just because there's a
branch, I'm happy with the current result ;-)

Regards
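On the granularity question above (the server replies with status descriptors of any size, so granularity is hidden from the client): a server tracking state at some internal block size can always answer with run-length extents, whatever that block size is. A toy sketch of that coalescing, with illustrative names only, not anything from the spec:

```python
def bitmap_to_extents(bitmap, granularity):
    """Coalesce a per-block status bitmap into (length, status) extents.

    `bitmap` is one status value per internal block of `granularity`
    bytes. Adjacent blocks with equal status merge into one extent, so
    the client only ever sees run lengths and never learns the server's
    internal block size.
    """
    extents = []
    for status in bitmap:
        if extents and extents[-1][1] == status:
            extents[-1][0] += granularity  # extend the current run
        else:
            extents.append([granularity, status])
    return [tuple(e) for e in extents]

# Six 512-byte blocks, three of them dirty, become three extents.
assert bitmap_to_extents([0, 0, 1, 1, 1, 0], 512) == \
    [(1024, 0), (1536, 1), (512, 0)]
```

This also illustrates the false-positive point raised later in the thread: if the tracking granularity is coarse, a dirty extent can only ever mean "possibly modified somewhere within these bytes".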
On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right.
>
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
>
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.

The out-of-band operations you are referring to are for dirty bitmap
management. (The goal is to read out blocks that changed since the last
backup.)

The client does not access the live disk; instead it accesses a
read-only snapshot and the dirty information (so that it can copy out
only blocks that were written). The client is allowed to read blocks
that are not dirty, too.

If you want to implement the whole incremental backup workflow in NBD,
then the client would first have to connect to the live disk, set up
dirty tracking, create a snapshot export, and then connect to that
snapshot.

That sounds like a big feature set, and I'd argue it's for the control
plane (storage API) and not the data plane (NBD). There were
discussions about transferring the dirty information via the control
plane, but it seems more appropriate to put it in the data plane since
it is block-level information.

I'm arguing that the NBD protocol doesn't need to support the
incremental backup workflow since it's a complex control plane concept.

Being able to read dirty information via NBD is useful for other block
backup applications, not just QEMU.
It could be used for syncing LVM volumes across machines, for example,
if someone implements an NBD+LVM server.

Another issue with adding control plane operations is that you need to
begin considering privilege separation. Should all NBD clients be able
to initiate snapshots, dirty tracking, etc., or is some kind of access
control required to limit certain commands? Not all clients require the
same privileges, and so they shouldn't have access to the same set of
operations.

Stefan
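The workflow Stefan describes — the client walks the dirty information of a read-only snapshot and copies only changed blocks — might look like this on the client side. `get_dirty_extents` and `read` stand in for NBD_CMD_BLOCK_STATUS (with the dirty flag) and NBD_CMD_READ; they are hypothetical callables for illustration, not a real NBD client API:

```python
def incremental_backup(get_dirty_extents, read, length):
    """Copy only dirty ranges of a read-only export.

    Re-queries status from the current offset each pass, so it also
    copes with a server that returns fewer extents than the requested
    range covers (the relaxed rule from v3 of the proposal).
    """
    backup = {}
    offset = 0
    while offset < length:
        for ext_len, dirty in get_dirty_extents(offset, length - offset):
            if dirty:
                backup[offset] = read(offset, ext_len)
            offset += ext_len
    return backup

# Toy in-memory "server": 1024-byte disk, bytes 256..511 dirty, and a
# status handler that (deliberately) returns only one extent per call.
disk = bytes(range(16)) * 64
dirty_map = [(256, False), (256, True), (512, False)]

def get_dirty_extents(offset, max_len):
    pos = 0
    for ext_len, dirty in dirty_map:
        if offset < pos + ext_len:
            return [(pos + ext_len - offset, dirty)]
        pos += ext_len
    return []

def read(offset, ext_len):
    return disk[offset:offset + ext_len]

result = incremental_backup(get_dirty_extents, read, len(disk))
assert result == {256: disk[256:512]}
```

Nothing here needs the control plane: once something (out of band) has decided which snapshot and which dirty information the export presents, the data plane alone is enough to drive the copy.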
Hi Stefan, On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote: > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote: > > Quickly: the reason I haven't merged this yes is twofold: > > - I wasn't thrilled with the proposal at the time. It felt a bit > > hackish, and bolted onto NBD so you could use it, but without defining > > everything in the NBD protocol. "We're reading some data, but it's not > > about you". That didn't feel right > > > > - There were a number of questions still unanswered (you're answering a > > few below, so that's good). > > > > For clarity, I have no objection whatsoever to adding more commands if > > they're useful, but I would prefer that they're also useful with NBD on > > its own, i.e., without requiring an initiation or correlation of some > > state through another protocol or network connection or whatever. If > > that's needed, that feels like I didn't do my job properly, if you get > > my point. > > The out-of-band operations you are referring to are for dirty bitmap > management. (The goal is to read out blocks that changed since the last > backup.) > > The client does not access the live disk, instead it accesses a > read-only snapshot and the dirty information (so that it can copy out > only blocks that were written). The client is allowed to read blocks > that are not dirty too. I understood as much, yes. > If you want to implement the whole incremental backup workflow in NBD > then the client would first have to connect to the live disk, set up > dirty tracking, create a snapshot export, and then connect to that > snapshot. > > That sounds like a big feature set and I'd argue it's for the control > plane (storage API) and not the data plane (NBD). There were > discussions about transferring the dirty information via the control > plane but it seems more appropriate to it in the data plane since it is > block-level information. I agree that creating and managing snapshots is out of scope for NBD. 
The protocol is not set up for that. However, I'm arguing that if we're going to provide information about snapshots, we should be able to properly refer to these snapshots from within an NBD context. My previous mail suggested adding a negotiation message that would essentially ask the server "tell me about the snapshots you know about", giving them an NBD identifier in the process (accompanied by a "foreign" identifier that is decidedly *not* an NBD identifier and that could be used to match the NBD identifier to something implementation-defined). This would be read-only information; the client cannot ask the server to create new snapshots. We can then later in the protocol refer to these snapshots by way of that NBD identifier. My proposal also makes it impossible to get updates of newly created snapshots without disconnecting and reconnecting (due to the fact that you can't go from transmission back to negotiation), but I'm not sure that's a problem. Doing so has two advantages: - If a client is accidentally (due to misconfiguration or implementation bugs or whatnot) connecting to the wrong server after having created a snapshot through a management protocol, we have an opportunity to detect this error, due to the fact that the "foreign" identifiers passed to the client during negotiation will not match with what the client was expecting. - A future version of the protocol could possibly include an extended version of the read command, allowing a client to read information from multiple storage snapshots without requiring a reconnect, and allowing current clients information about allocation status across various snapshots (although a first implementation could very well limit itself to only having one snapshot). [...] > I'm arguing that the NBD protocol doesn't need to support the > incremental backup workflow since it's a complex control plane concept. 
> > Being able to read dirty information via NBD is useful for other block > backup applications, not just QEMU. It could be used for syncing LVM > volumes across machines, for example, if someone implements an NBD+LVM > server. Indeed, and I was considering adding a basic implementation to go with the copy-on-write support in stock nbd-server, too. > Another issue with adding control plane operations is that you need to > begin considering privilege separation. Should all NBD clients be able > to initiate snapshots, dirty tracking, etc or is some kind of access > control required to limit certain commands? Not all clients require the > same privileges and so they shouldn't have access to the same set of > operations. Sure, which is why I wasn't suggesting anything of the sorts :-)
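Wouter's misconfiguration-detection point can be sketched as a client-side check. Note that NBD_OPT_GET_SNAPSHOTS and the (NBD id, foreign id) reply shape are only a proposal from this thread, not an existing option, and the identifier format is invented for the example:

```python
def check_snapshots(negotiated, expected_foreign_ids):
    """Match a server's advertised snapshots against client expectations.

    `negotiated` is a list of (nbd_id, foreign_id) pairs as a
    hypothetical NBD_OPT_GET_SNAPSHOTS might return them; the foreign
    ids are opaque to NBD and come from the management protocol that
    created the snapshots. A mismatch means we are probably talking to
    the wrong server. Returns the NBD ids in the expected order.
    """
    by_foreign = {foreign: nbd_id for nbd_id, foreign in negotiated}
    missing = [f for f in expected_foreign_ids if f not in by_foreign]
    if missing:
        raise RuntimeError("server does not know snapshots: %r" % missing)
    return [by_foreign[f] for f in expected_foreign_ids]

# "qemu:bitmap1" is an invented foreign identifier for illustration.
assert check_snapshots([(1, "qemu:bitmap0"), (2, "qemu:bitmap1")],
                       ["qemu:bitmap1"]) == [2]
```

The NBD ids recovered this way are what later transmission-phase commands (e.g. an extended read or block status request) would reference, keeping the snapshot *semantics* out of band but the snapshot *identity* verifiable in band.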
Hi Wouter, Some of this mess may be partially my fault, but I have not been following the NBD extension proposals up until this point. Are you familiar with the genesis behind this idea and what we are trying to accomplish in general? We had the thought to propose an extension roughly similar to SCSI's 'get lba status', but the existing status bits there did not correlate semantically to what we are hoping to convey. There was some debate over if it would be an abuse of protocol to attempt to use them as such. For a quick recap, get lba status appears to offer three canonical statuses: 0h: "LBA extent is mapped, [...] or has an unknown state" 1h: "[...] LBA extent is deallocated" 2h: "[...] LBA extent is anchored" My interpretation of mapped was simply that it was physically allocated, and 'deallocated' was simply unallocated. (I uh, am not actually clear on what anchored means exactly.) Either way, we felt at the time that it would be wrong to propose an analogue command for NBD and then immediately abuse the existing semantics, hence a new command like -- but not identical to -- the SCSI one. On 11/27/2016 02:17 PM, Wouter Verhelst wrote: > Hi Vladimir, > > Quickly: the reason I haven't merged this yes is twofold: > - I wasn't thrilled with the proposal at the time. It felt a bit > hackish, and bolted onto NBD so you could use it, but without defining > everything in the NBD protocol. "We're reading some data, but it's not > about you". That didn't feel right > - There were a number of questions still unanswered (you're answering a > few below, so that's good). > > For clarity, I have no objection whatsoever to adding more commands if > they're useful, but I would prefer that they're also useful with NBD on > its own, i.e., without requiring an initiation or correlation of some > state through another protocol or network connection or whatever. If > that's needed, that feels like I didn't do my job properly, if you get > my point. 
> > On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote: >> With the availability of sparse storage formats, it is often needed >> to query status of a particular range and read only those blocks of >> data that are actually present on the block device. >> >> To provide such information, the patch adds the BLOCK_STATUS >> extension with one new NBD_CMD_BLOCK_STATUS command, a new >> structured reply chunk format, and a new transmission flag. >> >> There exists a concept of data dirtiness, which is required >> during, for example, incremental block device backup. To express >> this concept via NBD protocol, this patch also adds a flag to >> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than >> provisioning information; however, with the current proposal, data >> dirtiness is only useful with additional coordination outside of >> the NBD protocol (such as a way to start and stop the server from >> tracking dirty sectors). Future NBD extensions may add commands >> to control dirtiness through NBD. >> >> Since NBD protocol has no notion of block size, and to mimic SCSI >> "GET LBA STATUS" command more closely, it has been chosen to return >> a list of extents in the response of NBD_CMD_BLOCK_STATUS command, >> instead of a bitmap. >> >> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com> >> CC: Denis V. Lunev <den@openvz.org> >> CC: Wouter Verhelst <w@uter.be> >> CC: Paolo Bonzini <pbonzini@redhat.com> >> CC: Kevin Wolf <kwolf@redhat.com> >> CC: Stefan Hajnoczi <stefanha@redhat.com> >> Signed-off-by: Eric Blake <eblake@redhat.com> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> >> --- >> >> v3: >> >> Hi all. This is almost a resend of v2 (by Eric Blake), The only change is >> removing the restriction, that sum of status descriptor lengths must be equal >> to requested length. I.e., let's permit the server to replay with less data >> than required if it wants. > > Reasonable, yes. 
The length that the client requests should be a maximum (i.e. > "I'm interested in this range"), not an exact request. > >> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now >> NBD_FLAG_CAN_MULTI_CONN in master branch. > > Right. > >> And, finally, I've rebased this onto current state of >> extension-structured-reply branch (which itself should be rebased on >> master IMHO). > > Probably a good idea, given the above. > >> By this resend I just want to continue the diqussion, started about half >> a year ago. Here is a summary of some questions and ideas from v2 >> diqussion: >> >> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? >> A: This all is for read-only disks, so the data is static and unchangeable. > > I think we should declare that it's up to the client to ensure no other > writes happen without its knowledge. This may be because the client and > server communicate out of band about state changes, or because the > client somehow knows that it's the only writer, or whatever. > > We can easily do that by declaring that the result of that command only > talks about *current* state, and that concurrent writes by different > clients may invalidate the state. This is true for NBD in general (i.e., > concurrent read or write commands from other clients may confuse file > systems on top of NBD), so it doesn't change expectations in any way. > Agree with you here. "This was correct when I sent it, but it's up to you to ensure that nothing would have invalidated that in the meantime" is fine semantically to me. Of course in our implementation, we intend to only export essentially read-only snapshots of data, so we don't personally expect to run into any of these kind of semantic problems. >> 2. Q: different granularities of dirty/allocated bitmaps. Any problems? 
>> A: 1: server replies with status descriptors of any size, granularity >> is hidden from the client >> 2: dirty/allocated requests are separate and unrelated to each >> other, so their granularities are not intersecting > > Not entirely sure anymore what this is about? > We have a concept of granularity for dirty tracking; consider it a block size. I don't think it's necessarily relevant to NBD, except for the case of false positives. Consider the case of a 512 byte block that is assumed dirty simply because it's adjacent to actually dirty data. That may have some meaning for how we write the NBD spec; i.e. the meaning of the status bit changes from "This is definitely modified" to "This was possibly modified," due to limitations in the management of this information by the server. If we expose the granularity information via NBD, it at least makes it clear how fuzzy the results presented may be. Otherwise, it's not really required. [Again, as seen below, a user-defined bit would be plenty sufficient...!] >> 3. Q: selecting of dirty bitmap to export >> A: several variants: >> 1: id of bitmap is in flags field of request >> pros: - simple >> cons: - it's a hack. flags field is for other uses. >> - we'll have to map bitmap names to these "ids" >> 2: introduce extended nbd requests with variable length and exploit this >> feature for BLOCK_STATUS command, specifying bitmap identifier. >> pros: - looks like a true way >> cons: - we have to create additional extension >> - possible we have to create a map, >> {<QEMU bitmap name> <=> <NBD bitmap id>} >> 3: exteranl tool should select which bitmap to export. So, in case of Qemu >> it should be something like qmp command block-export-dirty-bitmap. >> pros: - simple >> - we can extend it to behave like (2) later >> cons: - additional qmp command to implement (possibly, the lesser evil) >> note: Hmm, external tool can make chose between allocated/dirty data too, >> so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. 
> > Downside of 3, though, is that it moves the definition of what the > different states mean outside of the NBD protocol (i.e., the protocol > messages are not entirely defined anymore, and their meaning depends on > the clients and servers in use). > > To avoid this, we should have a clear definition of what the reply means > *by default*, but then we can add a note that clients and servers can > possibly define other meanings out of band if they want to. > I had personally only ever considered #3, where a command would be issued to QEMU to begin offering NBD data for some point-in-time as associated with a particular reference/snapshot/backup/etc. This leaves it up to the out-of-band client to order up the right data. That does appear to make choosing a meaningful name for the status bits a bit more difficult... [...but reading ahead, a 'user defined' bit would fit the bill just fine.] When Vladimir authored a "persistence" feature for bitmaps to be stored alongside QCOW2 files, we had difficulty describing exactly what a dirty bit meant for the data -- ultimately it is reliant on external information. The meaning is simply but ambiguously, "The data associated with this bit has been changed since the last time the bit was reset." We don't record or stipulate the conditions for a reset. (For our purposes, a reset would occur during a full or incremental backup.) >> 4. Q: Should not get_{allocated,dirty} be separate commands? >> cons: Two commands with almost same semantic and similar means? >> pros: However here is a good point of separating clearly defined and native >> for block devices GET_BLOCK_STATUS from user-driven and actually >> undefined data, called 'dirtyness'. > > Yeah, having them separate commands might be a bad idea indeed. > >> 5. Number of status descriptors, sent by server, should be restricted >> variants: >> 1: just allow server to restrict this as it wants (which was done in v3) >> 2: (not excluding 1).
Client specifies somehow the maximum for number >> of descriptors. >> 2.1: add command flag, which will request only one descriptor >> (otherwise, no restrictions from the client) >> 2.2: again, introduce extended nbd requests, and add field to >> specify this maximum > > I think having a flag which requests just one descriptor can be useful, > but I'm hesitant to add it unless it's actually going to be used; so in > other words, I'll leave the decision on that bit to you. > >> 6. A: What to do with unspecified flags (in request/reply)? >> I think the normal variant is to make them reserved. (Server should >> return EINVAL if it finds unknown bits, client should consider a reply >> with unknown bits as an error) > > Right, probably best to do that, yes. > >> ====== >> >> Also, an idea on 2-4: >> >> As we say, dirtiness is unknown for NBD, and an external tool >> should specify, manage and understand which data is actually >> transmitted, so why not just call it user_data and leave the status field >> of the reply chunk unspecified in this case? >> >> So, I propose one flag for NBD_CMD_BLOCK_STATUS: >> NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by >> Eric's 'Block provisioning status' paragraph. If it is set, we just >> leave the status field for some external... protocol? Who knows what >> this user data is. > > Yes, this sounds like a reasonable approach. > I'd be pretty happy (personally) with some user defined bits. Leaves a lot of the ambiguity of exactly what we're trying to convey away from the NBD spec, which is nice. >> Note: I'm not sure that I like this (my) proposal. It's just an >> idea, maybe someone likes it. And, I think, it represents what we >> are trying to do more honestly. > > Indeed.
> >> Note2: the next step of generalization will be NBD_CMD_USER, with >> variable request size, structured reply and no definition :) > > Well, er, no please, if we can avoid it :-) > >> Another idea, about backups themselves: >> >> Why do we need allocated/zero status for backup? IMHO we don't. > > Well, I've been thinking so all along, but then I don't really know what > it is, in detail, that you want to do :-) > > I can understand a "has this changed since time X" request, which the > "dirty" thing seems to want to be. Whether something is allocated is > just a special case of that. > > Actually, come to think of that. What is the exact use case for this > thing? I understand you're trying to create incremental backups of > things, which would imply you don't write from the client that is > getting the block status thingies, right? If so, how about: > Essentially you can create a bitmap object in-memory in QEMU and then associate it with a drive. It records changes to the drive, and QEMU can be instructed to write out any changes since the last backup to disk, while clearing the bits of the bitmap. It doesn't record a specific point in time, it's just implicit to the last time you cleared the bitmap -- usually the last incremental or full backup you've made. So it does describe "blocks changed since time X," it's just that we don't really know exactly when time X was. This is all well and dandy, except there is some desire from third parties to be able to ask QEMU about this dirty block information -- to be able to see this bitmap, more or less. We already use NBD for exporting data, and instead of inventing a new side-band, we decided that we wanted to use NBD to let external users get this information. What exactly they do with that info is beyond QEMU's scope. > - NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of > snapshots. 
Not required, optional, includes a machine-readable form, > not defined by NBD, which explains what the snapshot is about (e.g., a > qemu json file). The "base" version of that is just "allocation > status", and is implied (i.e., you don't need to run > NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the > allocation status). We don't necessarily have /snapshots/ per se, but we do have some block device and one or more bitmaps describing deltas to one or more backups that we do not necessarily have access to. e.g. a given bitmap may describe a delta to an off-site backup. We know the delta, but do not maintain any meaningful handle to the given off-site backup, including name, URI, or date. QEMU leaves this association up to upper management, and provides only IDs to facilitate any additional correlation. We could offer up descriptions of these bitmaps in response to such a command, but they're not ... quite snapshots. It is more the case that we use them to create snapshots. > - NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors > which tell you what the status of a block of data is for each of the > relevant snapshots that we know about. > > Perhaps this is somewhat overengineered, but it does bring most of the > definition of what a snapshot is back into the NBD protocol, without > having to say "this could be anything", and without requiring > connectivity over two ports for this to be useful (e.g., you could store > the machine-readable form of the snapshot description into your backup > program and match what they mean with what you're interested in at > restore time, etc). > > This wouldn't work if you're interested in new snapshots that get > created once we've already moved into transmission, but hey. > > Thoughts? > >> Full backup: just do structured read - it will show us which chunks >> may be treated as zeroes. > > Right.
> >> Incremental backup: get dirty bitmap (somehow, for example through >> user-defined part of proposed command), then, for dirty blocks, read >> them through structured read, so information about zero/unallocated >> areas is available there. >> >> For me all the variants above are OK. Let's finally choose something. >> >> v2: >> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html >> >> Since then, we've added the STRUCTURED_REPLY extension, which >> necessitates a rather larger rebase; I've also changed things >> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request >> modes to be determined by boolean flags (rather than by fixed >> values of the 16-bit flags field), changed the reply status fields >> to be bitwise-or values (with a default of 0 always being sane), >> and changed the descriptor layout to drop an offset but to include >> a 32-bit status so that the descriptor is nicely 8-byte aligned >> without padding. >> >> doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 154 insertions(+), 1 deletion(-) > > [...] > > I'll commit this in a minute into a separate branch called > "extension-blockstatus", under the understanding that changes are still > required, as per above (i.e., don't assume that just because there's a > branch I'm happy with the current result ;-) > > Regards > Err, I hope I haven't confused everything to heck and back now.... Thanks, --js
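The tracking scheme described in this exchange (an in-memory bitmap tied to a drive, with bits set as writes land and everything cleared when a backup copies the delta out) can be sketched in a few lines of Python. The class name, granularity, and extent merging below are illustrative assumptions, not QEMU's actual implementation:

```python
class DirtyBitmap:
    """Toy model of dirty tracking as discussed in this thread:
    one bit per fixed-size chunk, set on write, cleared when a
    backup copies the delta out."""

    def __init__(self, size, granularity=65536):
        self.granularity = granularity
        self.nbits = -(-size // granularity)  # ceiling division
        self.bits = [False] * self.nbits

    def mark_dirty(self, offset, length):
        # A write touching any part of a chunk dirties the whole chunk;
        # this is the source of the "fuzzy" false positives mentioned
        # earlier in the thread.
        first = offset // self.granularity
        last = (offset + length - 1) // self.granularity
        for i in range(first, last + 1):
            self.bits[i] = True

    def dirty_extents(self):
        """Yield merged (offset, length) runs of dirty chunks."""
        start = None
        for i, bit in enumerate(self.bits + [False]):  # sentinel flushes last run
            if bit and start is None:
                start = i
            elif not bit and start is not None:
                yield (start * self.granularity,
                       (i - start) * self.granularity)
                start = None

    def clear(self):
        # The "reset" referred to above: done at full/incremental backup time.
        self.bits = [False] * self.nbits
```

Note how `dirty_extents()` naturally produces the extent-list shape (offset plus length runs) that the proposed NBD_CMD_BLOCK_STATUS reply would carry.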
On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote: > Hi Stefan, > > On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote: > > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote: > > > Quickly: the reason I haven't merged this yet is twofold: > > > - I wasn't thrilled with the proposal at the time. It felt a bit > > > hackish, and bolted onto NBD so you could use it, but without defining > > > everything in the NBD protocol. "We're reading some data, but it's not > > > about you". That didn't feel right > > > > > > - There were a number of questions still unanswered (you're answering a > > > few below, so that's good). > > > > > > For clarity, I have no objection whatsoever to adding more commands if > > > they're useful, but I would prefer that they're also useful with NBD on > > > its own, i.e., without requiring an initiation or correlation of some > > > state through another protocol or network connection or whatever. If > > > that's needed, that feels like I didn't do my job properly, if you get > > > my point. > > > > The out-of-band operations you are referring to are for dirty bitmap > > management. (The goal is to read out blocks that changed since the last > > backup.) > > > > The client does not access the live disk, instead it accesses a > > read-only snapshot and the dirty information (so that it can copy out > > only blocks that were written). The client is allowed to read blocks > > that are not dirty too. > > I understood as much, yes. > > > If you want to implement the whole incremental backup workflow in NBD > > then the client would first have to connect to the live disk, set up > > dirty tracking, create a snapshot export, and then connect to that > > snapshot. > > > > That sounds like a big feature set and I'd argue it's for the control > > plane (storage API) and not the data plane (NBD).
There were > > discussions about transferring the dirty information via the control > > plane but it seems more appropriate to put it in the data plane since it is > > block-level information. > > I agree that creating and managing snapshots is out of scope for NBD. The > protocol is not set up for that. > > However, I'm arguing that if we're going to provide information about > snapshots, we should be able to properly refer to these snapshots from > within an NBD context. My previous mail suggested adding a negotiation > message that would essentially ask the server "tell me about the > snapshots you know about", giving them an NBD identifier in the process > (accompanied by a "foreign" identifier that is decidedly *not* an NBD > identifier and that could be used to match the NBD identifier to > something implementation-defined). This would be read-only information; > the client cannot ask the server to create new snapshots. We can then > later in the protocol refer to these snapshots by way of that NBD > identifier. > > My proposal also makes it impossible to get updates of newly created > snapshots without disconnecting and reconnecting (due to the fact that > you can't go from transmission back to negotiation), but I'm not sure > that's a problem. > > Doing so has two advantages: > - If a client is accidentally (due to misconfiguration or implementation > bugs or whatnot) connecting to the wrong server after having created a > snapshot through a management protocol, we have an opportunity to > detect this error, due to the fact that the "foreign" identifiers > passed to the client during negotiation will not match with what the > client was expecting.
> - A future version of the protocol could possibly include an extended > version of the read command, allowing a client to read information > from multiple storage snapshots without requiring a reconnect, and > allowing current clients information about allocation status across > various snapshots (although a first implementation could very well > limit itself to only having one snapshot). Sorry, I misunderstood you. Snapshots are not very different from NBD exports. Especially if the storage system supports writeable-snapshot (aka cloning). Should we just use named exports? Stefan
On 27.11.2016 at 20:17, Wouter Verhelst wrote: > > 3. Q: selecting of dirty bitmap to export > > A: several variants: > > 1: id of bitmap is in flags field of request > > pros: - simple > > cons: - it's a hack. flags field is for other uses. > > - we'll have to map bitmap names to these "ids" > > 2: introduce extended nbd requests with variable length and exploit this > > feature for BLOCK_STATUS command, specifying bitmap identifier. > > pros: - looks like a true way > > cons: - we have to create additional extension > > - possibly we have to create a map, > > {<QEMU bitmap name> <=> <NBD bitmap id>} > > 3: external tool should select which bitmap to export. So, in case of Qemu > > it should be something like qmp command block-export-dirty-bitmap. > > pros: - simple > > - we can extend it to behave like (2) later > > cons: - additional qmp command to implement (possibly, the lesser evil) > > note: Hmm, external tool can choose between allocated/dirty data too, > > so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. > > Downside of 3, though, is that it moves the definition of what the > different states mean outside of the NBD protocol (i.e., the protocol > messages are not entirely defined anymore, and their meaning depends on > the clients and servers in use). Another point to consider is that option 3 doesn't allow you to access two (or more) different bitmaps from the client without using the side channel all the time to switch back and forth between them and having to drain the request queue each time to avoid races. In general, if we have something that "looks like the true way", I'd advocate to choose this option. Experience tells that we'd regret anything simpler in a year or two, and then we'll have to do the real thing anyway, but still need to support the quick hack for compatibility. > > 5.
Number of status descriptors, sent by server, should be restricted > > variants: > > 1: just allow server to restrict this as it wants (which was done in v3) > > 2: (not excluding 1). Client specifies somehow the maximum for number > > of descriptors. > > 2.1: add command flag, which will request only one descriptor > > (otherwise, no restrictions from the client) > > 2.2: again, introduce extended nbd requests, and add field to > > specify this maximum > > I think having a flag which requests just one descriptor can be useful, > but I'm hesitant to add it unless it's actually going to be used; so in > other words, I'll leave the decision on that bit to you. The native qemu block layer interface returns the status of only one > contiguous chunk, so the easiest way to implement the NBD block driver > in qemu would always use that bit. > > On the other hand, it would be possible for the NBD block driver in qemu > to cache the rest of the response internally and answer the next request > from the cache instead of sending a new request over the network. Maybe > that's what it should be doing anyway for good performance. > > > Also, an idea on 2-4: > > > > As we say, dirtiness is unknown for NBD, and an external tool > > should specify, manage and understand which data is actually > > transmitted, so why not just call it user_data and leave the status field > > of the reply chunk unspecified in this case? > > > > So, I propose one flag for NBD_CMD_BLOCK_STATUS: > > NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by > > Eric's 'Block provisioning status' paragraph. If it is set, we just > > leave the status field for some external... protocol? Who knows what > > this user data is. > > Yes, this sounds like a reasonable approach. Makes sense to me, too. > > However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also > have use for a second one.
If we go with one of the options where the bitmap is selected per command, we're fine because you can simply move the second bit to a different bitmap and do two requests. If we have only a single active bitmap at a time, though, this feels like an actual problem. > > Another idea, about backups themselves: > > > > Why do we need allocated/zero status for backup? IMHO we don't. > > Well, I've been thinking so all along, but then I don't really know what > it is, in detail, that you want to do :-) I think we do. A block can be dirtied by discarding/write_zero-ing it. Then we want the dirty bit to know that we need to include this block in the incremental backup, but we also want to know that we don't actually have to transfer the data in it. Kevin
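Kevin's caching suggestion (keep the whole descriptor list from one reply and serve subsequent single-extent queries locally instead of issuing a new network request each time) might look roughly like this on the client side. All names here are hypothetical, and `query_server` stands in for the real NBD round trip:

```python
class BlockStatusCache:
    """Sketch of a client-side extent cache for block-status replies:
    a miss fetches a whole window from the server and keeps every
    descriptor, so later lookups in that window cost nothing."""

    def __init__(self, query_server):
        # query_server(offset, length) -> list of (length, status)
        # descriptors, mirroring the proposed reply chunk contents.
        self.query_server = query_server
        self.extents = []  # cached (offset, length, status) triples

    def _lookup(self, offset):
        for ext_off, ext_len, status in self.extents:
            if ext_off <= offset < ext_off + ext_len:
                return status
        return None

    def status_at(self, offset, window=1 << 20):
        status = self._lookup(offset)
        if status is not None:
            return status  # served from cache, no network round trip
        # Cache miss: request a larger window and keep all descriptors,
        # not just the first one, for future queries.
        self.extents = []
        pos = offset
        for d_len, d_status in self.query_server(offset, window):
            self.extents.append((pos, d_len, d_status))
            pos += d_len
        return self._lookup(offset)
```

A qemu driver that only needs one contiguous chunk per call could sit on top of this and still avoid one request per chunk, which is the performance point Kevin raises.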
29.11.2016 13:18, Kevin Wolf wrote: > Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben: >>> 3. Q: selecting of dirty bitmap to export >>> A: several variants: >>> 1: id of bitmap is in flags field of request >>> pros: - simple >>> cons: - it's a hack. flags field is for other uses. >>> - we'll have to map bitmap names to these "ids" >>> 2: introduce extended nbd requests with variable length and exploit this >>> feature for BLOCK_STATUS command, specifying bitmap identifier. >>> pros: - looks like a true way >>> cons: - we have to create additional extension >>> - possible we have to create a map, >>> {<QEMU bitmap name> <=> <NBD bitmap id>} >>> 3: exteranl tool should select which bitmap to export. So, in case of Qemu >>> it should be something like qmp command block-export-dirty-bitmap. >>> pros: - simple >>> - we can extend it to behave like (2) later >>> cons: - additional qmp command to implement (possibly, the lesser evil) >>> note: Hmm, external tool can make chose between allocated/dirty data too, >>> so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. >> Downside of 3, though, is that it moves the definition of what the >> different states mean outside of the NBD protocol (i.e., the protocol >> messages are not entirely defined anymore, and their meaning depends on >> the clients and servers in use). > Another point to consider is that option 3 doesn't allow you to access > two (or more) different bitmaps from the client without using the side > channel all the time to switch back and forth between them and having to > drain the request queue each time to avoid races. > > In general, if we have something that "looks like the true way", I'd > advocate to choose this option. Experience tells that we'd regret > anything simpler in a year or two, and then we'll have to do the real > thing anyway, but still need to support the quick hack for > compatibility. > >>> 5. 
Number of status descriptors, sent by server, should be restricted >>> variants: >>> 1: just allow server to restrict this as it wants (which was done in v3) >>> 2: (not excluding 1). Client specifies somehow the maximum for number >>> of descriptors. >>> 2.1: add command flag, which will request only one descriptor >>> (otherwise, no restrictions from the client) >>> 2.2: again, introduce extended nbd requests, and add field to >>> specify this maximum >> I think having a flag which requests just one descriptor can be useful, >> but I'm hesitant to add it unless it's actually going to be used; so in >> other words, I'll leave the decision on that bit to you. > The native qemu block layer interface returns the status of only one > contiguous chunks, so the easiest way to implement the NBD block driver > in qemu would always use that bit. > > On the other hand, it would be possible for the NBD block driver in qemu > to cache the rest of the response internally and answer the next request > from the cache instead of sending a new request over the network. Maybe > that's what it should be doing anyway for good performance. > >>> Also, an idea on 2-4: >>> >>> As we say, that dirtiness is unknown for NBD, and external tool >>> should specify, manage and understand, which data is actually >>> transmitted, why not just call it user_data and leave status field >>> of reply chunk unspecified in this case? >>> >>> So, I propose one flag for NBD_CMD_BLOCK_STATUS: >>> NBD_FLAG_STATUS_USER. If it is clear, than behaviour is defined by >>> Eric's 'Block provisioning status' paragraph. If it is set, we just >>> leave status field for some external... protocol? Who knows, what is >>> this user data. >> Yes, this sounds like a reasonable approach. > Makes sense to me, too. > > However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also > have use for a second one. 
If we go with one of the options where the > bitmap is selected per command, we're fine because you can simply move > the second bit to a different bitmap and do two requests. If we have > only a single active bitmap at a time, though, this feels like an actual > problem. > >>> Another idea, about backups themselves: >>> >>> Why do we need allocated/zero status for backup? IMHO we don't. >> Well, I've been thinking so all along, but then I don't really know what >> it is, in detail, that you want to do :-) > I think we do. A block can be dirtied by discarding/write_zero-ing it. > Then we want the dirty bit to know that we need to include this block in > the incremental backup, but we also want to know that we don't actually > have to transfer the data in it. And we will know that automatically by using the structured read command, so a separate call to get_block_status is not needed. > > Kevin
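Vladimir's point can be illustrated with a small sketch: in a structured read reply, hole/zero chunks carry no payload, so only the data chunks of each dirty extent cross the wire, and no separate allocation query is needed. `structured_read` and `write_out` below are assumed callbacks, not real APIs:

```python
def incremental_backup(dirty_extents, structured_read, write_out):
    """Back up only dirty extents; structured read distinguishes
    data from holes for us. structured_read(offset, length) is an
    assumed callback yielding ('data', offset, payload_bytes) or
    ('hole', offset, hole_length) chunks."""
    transferred = 0
    for offset, length in dirty_extents:
        for kind, chunk_off, payload in structured_read(offset, length):
            if kind == 'data':
                write_out(chunk_off, payload)
                transferred += len(payload)
            # 'hole' chunks carry no payload: the backup target just
            # records the range as zero, nothing crosses the wire.
    return transferred
```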
Vladimir, I went back to April to reread the previous train of conversation then found you had helpfully summarised some of it. Comments below. Rather than comment on many of the points individually, the root of my confusion and to some extent uncomfortableness about this proposal is 'who owns the meaning of the bitmaps'. Some of this is my own confusion (sorry) about the use to which this is being put, which is I think at root a documentation issue. To illustrate this, you write in the FAQ section that this is for read-only disks, but the text talks about: +Some storage formats and operations over such formats express a +concept of data dirtiness. Whether the operation is block device +mirroring, incremental block device backup or any other operation with +a concept of data dirtiness, they all share a need to provide a list +of ranges that this particular operation treats as dirty. How can data be 'dirty' if it is static and unchangeable? (I thought) I now think what you are talking about is backing up a *snapshot* of a disk that's running, where the disk itself was not connected using NBD? IE it's not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively an opaque state represented in a bitmap, which is binary metadata at some particular level of granularity. It might as well be 'happiness' or 'is coloured blue'. The NBD server would (normally) have no way of manipulating this bitmap. In previous comments, I said 'how come we can set the dirty bit through writes but can't clear it?'. This (my statement) is now I think wrong, as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The state of the bitmap comes from whatever sets the bitmap which is outside the scope of this protocol to transmit it. However, we have the uncomfortable (to me) situation where the protocol describes a flag 'dirty', with implications as to what it does, but no actual strict definition of how it's set.
So any 'other' user has no real idea how to use the information, or how to implement a server that provides a 'dirty' bit, because the semantics of that aren't within the protocol. This does not sit happily with me. So I'm wondering whether we should simplify and generalise this spec. You say that for the dirty flag, there's no specification of when it is set and cleared - that's implementation defined. Would it not be better then to say 'that whole thing is private to Qemu - even the name'. Rather you could read the list of bitmaps a server has, with a textual name, each having an index (and perhaps a granularity). You could then ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec, and some (e.g. ones beginning with 'X-') could be reserved for user usage. So you could use 'X-QEMU-DIRTY'. If you then change what your dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not need a protocol change to support it. IE rather than looking at 'a way of reading the dirty bit', we could have this as a generic way of reading opaque bitmaps. Only one (allocation) might be given meaning to start off with, and it wouldn't be necessary for all servers to support that - i.e. you could support bitmap reading without having an ALLOCATION bitmap available. This spec would then be limited to the transmission of the bitmaps (remove the word 'dirty' entirely, except perhaps as an illustrative use case), and include only the definition of the allocation bitmap. Some more nits: > Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now > NBD_FLAG_CAN_MULTI_CONN in master branch. > > And, finally, I've rebased this onto current state of > extension-structured-reply branch (which itself should be rebased on > master IMHO). Each documentation branch should normally be branched off master unless it depends on another extension (in which case it will be branched from that).
I haven't been rebasing them frequently as it can disrupt those working on the branches. There's only really an issue around rebasing where you depend on another branch. > 2. Q: different granularities of dirty/allocated bitmaps. Any problems? > A: 1: server replies with status descriptors of any size, granularity > is hidden from the client > 2: dirty/allocated requests are separate and unrelated to each > other, so their granularities are not intersecting I'm OK with this, but note that you do actually mention a granularity of sorts in the spec (512 bytes) - I think you should replace that with the minimum block size. > 3. Q: selecting of dirty bitmap to export > A: several variants: > 1: id of bitmap is in flags field of request > pros: - simple > cons: - it's a hack. flags field is for other uses. > - we'll have to map bitmap names to these "ids" > 2: introduce extended nbd requests with variable length and exploit this > feature for BLOCK_STATUS command, specifying bitmap identifier. > pros: - looks like a true way > cons: - we have to create additional extension > - possibly we have to create a map, > {<QEMU bitmap name> <=> <NBD bitmap id>} > 3: external tool should select which bitmap to export. So, in case of Qemu > it should be something like qmp command block-export-dirty-bitmap. > pros: - simple > - we can extend it to behave like (2) later > cons: - additional qmp command to implement (possibly, the lesser evil) > note: Hmm, external tool can choose between allocated/dirty data too, > so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all. Yes, this is all pretty horrible. I suspect we want to do something like (2), and permit extra data across (in my proposal, it would only be one byte to select the index). I suppose one could ask for a list of bitmaps. > 4. Q: Should not get_{allocated,dirty} be separate commands? > cons: Two commands with almost same semantic and similar means?
> pros: However here is a good point of separating clearly defined and native > for block devices GET_BLOCK_STATUS from user-driven and actually > undefined data, called 'dirtyness'. I'm suggesting one generic 'read bitmap' command like you. > 5. Number of status descriptors, sent by server, should be restricted > variants: > 1: just allow server to restrict this as it wants (which was done in v3) > 2: (not excluding 1). Client specifies somehow the maximum for number > of descriptors. > 2.1: add command flag, which will request only one descriptor > (otherwise, no restrictions from the client) > 2.2: again, introduce extended nbd requests, and add field to > specify this maximum I think some form of extended request is the way to go, but out of interest, what's the issue with as many descriptors being sent as it takes to encode the reply? The client can just consume the remainder (without buffering) and reissue the request at a later point for the areas it discarded. > > 6. A: What to do with unspecified flags (in request/reply)? > I think the normal variant is to make them reserved. (Server should > return EINVAL if found unknown bits, client should consider replay > with unknown bits as an error) Yeah. > > + > +* `NBD_CMD_BLOCK_STATUS` > + > + A block status query request. Length and offset define the range > + of interest. Clients SHOULD NOT use this request unless the server MUST NOT is what we say elsewhere I believe. > + set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which > + in turn requires the client to first negotiate structured replies. > + For a successful return, the server MUST use a structured reply, > + containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`. Nit: are you saying that non-structured error replies are permissible? 
You're always/often going to get a non-structured (simple) error reply if the server doesn't support the command, but I think it would be fair to say the server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if it supports the command. This is effectively what we say re NBD_CMD_READ. > + > + The list of block status descriptors within the > + `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions > + of the file starting from specified *offset*, and the sum of the > + *length* fields of each descriptor MUST not be greater than the > + overall *length* of the request. This means that the server MAY > + return less data than required. However the server MUST return at > + least one status descriptor I'm not sure I understand why that's useful. What should the client infer from the server refusing to provide information? We don't permit short reads etc. > . The server SHOULD use different > + *status* values between consecutive descriptors, and SHOULD use > + descriptor lengths that are an integer multiple of 512 bytes where > + possible (the first and last descriptor of an unaligned query being > + the most obvious places for an exception). Surely better would be an integer multiple of the minimum block size. Being able to offer bitmap support at finer granularity than the absolute minimum block size helps no one, and if it were possible to support a 256 byte block size (I think some floppy disks had that) I see no reason not to support that as a granularity.
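For concreteness, the descriptor layout under discussion (per the v2 changelog quoted earlier: each descriptor is a 32-bit length plus a 32-bit status, 8-byte aligned, with extents running consecutively from the requested offset) could be decoded like this; big-endian is assumed, as for all NBD wire values:

```python
import struct

def parse_block_status_descriptors(payload, request_offset):
    """Decode the descriptor list of an NBD_REPLY_TYPE_BLOCK_STATUS
    chunk: consecutive (32-bit length, 32-bit status) pairs, big-endian,
    starting at the requested offset. Returns (offset, length, status)
    extents."""
    if len(payload) % 8 != 0:
        raise ValueError("descriptor list must be a multiple of 8 bytes")
    extents = []
    pos = request_offset
    for i in range(0, len(payload), 8):
        length, status = struct.unpack_from(">II", payload, i)
        extents.append((pos, length, status))
        pos += length
    return extents
```

Note that dropping the per-descriptor offset, as the patch does, is exactly what forces the "consecutive portions" rule debated above: the client can only reconstruct extent offsets by running a cursor from the requested offset.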
29.11.2016 15:57, Alex Bligh wrote: > Vladimir, > > I went back to April to reread the previous train of conversation > then found you had helpfully summarised some if it. Comments > below. > > Rather than comment on many of the points individual, the root > of my confusion and to some extent uncomfortableness about this > proposal is 'who owns the meaning the the bitmaps'. > > Some of this is my own confusion (sorry) about the use to which > this is being put, which is I think at root a documentation issue. > To illustrate this, you write in the FAQ section that this is for > read only disks, but the text talks about: > > +Some storage formats and operations over such formats express a > +concept of data dirtiness. Whether the operation is block device > +mirroring, incremental block device backup or any other operation with > +a concept of data dirtiness, they all share a need to provide a list > +of ranges that this particular operation treats as dirty. > > How can data be 'dirty' if it is static and unchangeable? (I thought) > > I now think what you are talking about backing up a *snapshot* of a disk > that's running, where the disk itself was not connected using NBD? IE it's > not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively > an opaque state represented in a bitmap, which is binary metadata > at some particular level of granularity. It might as well be 'happiness' > or 'is coloured blue'. The NBD server would (normally) have no way of > manipulating this bitmap. Yes, something like this. 
Note: in Qemu it may not be a snapshot (actually, I didn't see a way in Qemu
to export snapshots without switching to them, except opening an external
snapshot as a separate block dev), but just any read-only drive, or
temporary drive, used in the image fleecing scheme:

driveA is an online normal drive
driveB is an empty nbd export

- start backup driveA->driveB with sync=none, which means that the only
  copying done is of old data from A to B before every write to A
- set driveA as backing for driveB (on a read from B, if the data is not
  allocated it will be read from A)

After that, driveB is something like a snapshot for backup through NBD.
It's all just to say: calling the backup point-in-time state just a
'snapshot' confuses me, and I hope not to see this word in the spec (in
this context).

> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The state of
> the bitmap comes from whatever sets the bitmap, which is outside the
> scope of this protocol to transmit.
>
> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but no
> actual strict definition of how it's set. So any 'other' user has no
> real idea how to use the information, or how to implement a server that
> provides a 'dirty' bit, because the semantics of that aren't within the
> protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec.
> You say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
>
> Rather, you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity).
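The copy-before-write behaviour of the fleecing scheme described above can be illustrated with a toy model (pure illustration, not qemu code; block-indexed dicts stand in for the real drives):

```python
class Fleecing:
    """Toy model of image fleecing: driveB starts empty with driveA as
    backing; old data is copied from A to B before every write to A, so
    reads from B always see A's state as of export time."""

    def __init__(self, drive_a):
        self.a = dict(drive_a)  # live drive: block number -> data
        self.b = {}             # sparse point-in-time overlay (the export)

    def write_a(self, block, data):
        if block not in self.b:                  # copy-before-write
            self.b[block] = self.a.get(block, b"\x00")
        self.a[block] = data

    def read_b(self, block):
        # unallocated blocks in B fall through to the backing drive A
        if block in self.b:
            return self.b[block]
        return self.a.get(block, b"\x00")
```

Reads from B keep returning the old contents no matter how the live drive changes, which is exactly why driveB behaves "something like a snapshot" without being one.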
You could then ask on NBD_CMD_BLOCK_STATUS for the appropriate index,
> and get back that bitmap value. Some names (e.g. 'ALLOCATED') could be
> defined in the spec, and some (e.g. ones beginning with 'X-') could be
> reserved for user usage (so you could use 'X-QEMU-DIRTY'). If you then
> change what your dirty bit means, you could use 'X-QEMU-DIRTY2' or
> similar, and not need a protocol change to support it.
>
> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one
> (allocation) might be given meaning to start off with, and it wouldn't
> be necessary for all servers to support that - i.e. you could support
> bitmap reading without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.

Good point. For Qcow2 we finally came to just "bitmaps", not "dirty
bitmaps", to make it more general. There is a problem with allocation if we
want to make it a subcase of bitmap: allocation natively has two bits per
block: zero and allocated. We can of course separate this into two bitmaps,
but this will not be similar to the classic get_block_status.

> Some more nits:
>
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
>
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from
> that). I haven't been rebasing them frequently as it can disrupt those
> working on the branches. There's only really an issue around rebasing
> where you depend on another branch.
>
>> 2.
>> Q: different granularities of dirty/allocated bitmaps. Any problems?
>> A: 1: server replies with status descriptors of any size, granularity
>>       is hidden from the client
>>    2: dirty/allocated requests are separate and unrelated to each
>>       other, so their granularities are not intersecting
>
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
>
>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>          pros: - simple
>>          cons: - it's a hack. flags field is for other uses.
>>                - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and
>>          exploit this feature for BLOCK_STATUS command, specifying
>>          bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possible we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: external tool should select which bitmap to export. So, in
>>          case of Qemu it should be something like qmp command
>>          block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the
>>                  lesser evil)
>>          note: Hmm, external tool can make the choice between
>>                allocated/dirty data too, so we can remove
>>                'NBD_FLAG_STATUS_DIRTY' flag at all.
>
> Yes, this is all pretty horrible. I suspect we want to do something
> like (2), and permit extra data across (in my proposal, it would only
> be one byte to select the index). I suppose one could ask for a list
> of bitmaps.
>
>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined
>>          and native for block devices GET_BLOCK_STATUS from
>>          user-driven and actually undefined data, called 'dirtyness'.
> I'm suggesting one generic 'read bitmap' command like you.

To support get_block_status in this general read_bitmap, we will need to
define something like 'multibitmap', which allows several bits per chunk,
as allocation data has two: zero and allocated.

>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done
>>       in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
>
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.

the issue is: too many descriptors possible. So, (1) solves it. (2) is
optional, just to simplify/optimize client side.

>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if found unknown bits, client should consider a reply
>>    with unknown bits as an error)
>
> Yeah.
>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> + A block status query request. Length and offset define the range
>> + of interest. Clients SHOULD NOT use this request unless the server
>
> MUST NOT is what we say elsewhere I believe.
>
>> + set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> + in turn requires the client to first negotiate structured replies.
>> + For a successful return, the server MUST use a structured reply,
>> + containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
>
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured (simple) error reply
> if the server doesn't support the command, but I think it would be fair
> to say the server MUST use a structured reply to NBD_CMD_BLOCK_STATUS
> if it supports the command. This is effectively what we say re
> NBD_CMD_READ.

I agree.

>
>> +
>> + The list of block status descriptors within the
>> + `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> + of the file starting from specified *offset*, and the sum of the
>> + *length* fields of each descriptor MUST not be greater than the
>> + overall *length* of the request. This means that the server MAY
>> + return less data than required. However the server MUST return at
>> + least one status descriptor
>
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.

if the bitmap is 010101010101 we will have too many descriptors. For
example, 16tb disk, 64k granularity -> 2G of descriptors payload.

>
>> . The server SHOULD use different
>> + *status* values between consecutive descriptors, and SHOULD use
>> + descriptor lengths that are an integer multiple of 512 bytes where
>> + possible (the first and last descriptor of an unaligned query being
>> + the most obvious places for an exception).
>
> Surely better would be an integer multiple of the minimum block size.
> Being able to offer bitmap support at finer granularity than the
> absolute minimum block size helps no one, and if it were possible to
> support a 256 byte block size (I think some floppy disks had that) I
> see no reason not to support that as a granularity.

Agree. Anyway, it is just a strong recommendation, not a requirement.
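The arithmetic behind that 2G worst case, using the 8-byte descriptor size from this proposal (32-bit length plus 32-bit status):

```python
DISK_SIZE = 16 * 2**40       # 16 TiB disk
GRANULARITY = 64 * 2**10     # 64 KiB per bitmap bit
DESCRIPTOR_SIZE = 8          # 32-bit length + 32-bit status

blocks = DISK_SIZE // GRANULARITY            # 268,435,456 blocks
# A 0101... bitmap needs one descriptor per block, so the worst-case
# reply payload for a whole-disk query is:
worst_case_payload = blocks * DESCRIPTOR_SIZE
print(worst_case_payload // 2**30, "GiB")    # -> 2 GiB
```

So even with the most compact plausible descriptor encoding, an alternating bitmap makes a single whole-disk query impractical, which is what motivates capping the reply.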
Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>    cons: Two commands with almost same semantic and similar means?
>>>    pros: However here is a good point of separating clearly defined
>>>          and native for block devices GET_BLOCK_STATUS from
>>>          user-driven and actually undefined data, called 'dirtyness'.
>> I'm suggesting one generic 'read bitmap' command like you.
>
> To support get_block_status in this general read_bitmap, we will need
> to define something like 'multibitmap', which allows several bits per
> chunk, as allocation data has two: zero and allocated.

I think you are saying that for arbitrary 'bitmap' there might be more than
one state. For instance, one might (in an allocation 'bitmap') have a hole,
a non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as
extents. Each extent has a status. For the bitmap to be useful, at least
two statuses need to be possible, but the above would have three. This
could be internally implemented by the server as (a) a bitmap (with two
bits per entry), (b) two bitmaps (possibly with different granularity),
(c) something else (e.g. reading file extents, then if the data is
allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning
extents (as you suggested) is a good idea, and there need not be an actual
bitmap.

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>    variants:
>>>    1: just allow server to restrict this as it wants (which was done
>>>       in v3)
>>>    2: (not excluding 1). Client specifies somehow the maximum number
>>>       of descriptors.
>>>    2.1: add command flag, which will request only one descriptor
>>>         (otherwise, no restrictions from the client)
>>>    2.2: again, introduce extended nbd requests, and add field to
>>>         specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
>
> the issue is: too many descriptors possible. So, (1) solves it. (2) is
> optional, just to simplify/optimize client side.

I think I'd prefer the server to return what it was asked for, and the
client to deal with it. So either the client should be able to specify a
maximum number of extents (and if we are extending the command structure,
that's possible) or we deal with the client consuming and retrying unwanted
extents. The reason for this is that it's unlikely the server can know a
priori the number of extents which is the appropriate maximum for the
client.

>>> + The list of block status descriptors within the
>>> + `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> + of the file starting from specified *offset*, and the sum of the
>>> + *length* fields of each descriptor MUST not be greater than the
>>> + overall *length* of the request. This means that the server MAY
>>> + return less data than required. However the server MUST return at
>>> + least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
>
> if the bitmap is 010101010101 we will have too many descriptors. For
> example, 16tb disk, 64k granularity -> 2G of descriptors payload.

Yep. And the cost of consuming and retrying is quite high.
One option would be for the client to realise this is a possibility, and
not request the entire extent map for a 16TB disk, as it might be very
large! Even if the client worked at e.g. a 64MB level (where they'd get a
maximum of 1024 extents per reply), this isn't going to noticeably increase
the round trip timing. One issue here is that to determine a 'reasonable'
size, the client needs to know the minimum length of any extent.

I think the answer is probably a 'maximum number of extents' in the request
packet.

Of course with statuses in extents, the final extent could be represented
as an 'I don't know, break this bit into a separate request' status.
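One way to read that suggestion: instead of one whole-disk request, the client walks the disk in fixed windows. In the sketch below, `query` stands in for issuing NBD_CMD_BLOCK_STATUS over the wire (the names and window size are illustrative; with a 64 KiB server granularity, a 64 MiB window yields at most 1024 extents per reply):

```python
def block_status_all(query, disk_size, window=64 * 2**20):
    """Collect (length, status) extents for the whole disk by issuing
    one bounded-size query per 64 MiB window.

    query(offset, length) models NBD_CMD_BLOCK_STATUS and returns a
    list of (length, status) tuples covering the requested range.
    """
    extents = []
    offset = 0
    while offset < disk_size:
        length = min(window, disk_size - offset)
        extents.extend(query(offset, length))
        offset += length
    return extents
```

The trade-off discussed in the thread is visible here: for a mostly-uniform disk this loop issues many requests that each return a single extent, which is the overhead Vladimir points out in his reply.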
29.11.2016 17:52, Alex Bligh wrote:
> Vladimir,
>
>>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>>    cons: Two commands with almost same semantic and similar means?
>>>>    pros: However here is a good point of separating clearly defined
>>>>          and native for block devices GET_BLOCK_STATUS from
>>>>          user-driven and actually undefined data, called 'dirtyness'.
>>> I'm suggesting one generic 'read bitmap' command like you.
>> To support get_block_status in this general read_bitmap, we will need
>> to define something like 'multibitmap', which allows several bits per
>> chunk, as allocation data has two: zero and allocated.
> I think you are saying that for arbitrary 'bitmap' there might be more
> than one state. For instance, one might (in an allocation 'bitmap')
> have a hole, a non-hole-zero, or a non-hole-non-zero.
>
> In the spec I'd suggest, for one 'bitmap', we represent the output as
> extents. Each extent has a status. For the bitmap to be useful, at
> least two statuses need to be possible, but the above would have three.
> This could be internally implemented by the server as (a) a bitmap
> (with two bits per entry), (b) two bitmaps (possibly with different
> granularity), (c) something else (e.g. reading file extents, then if
> the data is allocated manually comparing it against zero).
>
> I should have put 'bitmap' in quotes in what I wrote because returning
> extents (as you suggested) is a good idea, and there need not be an
> actual bitmap.
>
>>>> 5. Number of status descriptors, sent by server, should be restricted
>>>>    variants:
>>>>    1: just allow server to restrict this as it wants (which was done
>>>>       in v3)
>>>>    2: (not excluding 1). Client specifies somehow the maximum number
>>>>       of descriptors.
>>>>    2.1: add command flag, which will request only one descriptor
>>>>         (otherwise, no restrictions from the client)
>>>>    2.2: again, introduce extended nbd requests, and add field to
>>>>         specify this maximum
>>> I think some form of extended request is the way to go, but out of
>>> interest, what's the issue with as many descriptors being sent as it
>>> takes to encode the reply? The client can just consume the remainder
>>> (without buffering) and reissue the request at a later point for
>>> the areas it discarded.
>> the issue is: too many descriptors possible. So, (1) solves it. (2) is
>> optional, just to simplify/optimize client side.
> I think I'd prefer the server to return what it was asked for, and the
> client to deal with it. So either the client should be able to specify
> a maximum number of extents (and if we are extending the command
> structure, that's possible) or we deal with the client consuming and
> retrying unwanted extents. The reason for this is that it's unlikely
> the server can know a priori the number of extents which is the
> appropriate maximum for the client.
>
>>>> + The list of block status descriptors within the
>>>> + `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>>> + of the file starting from specified *offset*, and the sum of the
>>>> + *length* fields of each descriptor MUST not be greater than the
>>>> + overall *length* of the request. This means that the server MAY
>>>> + return less data than required. However the server MUST return at
>>>> + least one status descriptor
>>> I'm not sure I understand why that's useful. What should the client
>>> infer from the server refusing to provide information? We don't
>>> permit short reads etc.
>> if the bitmap is 010101010101 we will have too many descriptors. For
>> example, 16tb disk, 64k granularity -> 2G of descriptors payload.
> Yep. And the cost of consuming and retrying is quite high.
> One option would be for the client to realise this is a possibility,
> and not request the entire extent map for a 16TB disk, as it might be
> very large! Even if the client worked at e.g. a 64MB level (where
> they'd get a maximum of 1024 extents per reply), this isn't going to
> noticeably increase the round trip timing. One issue here is that to
> determine a 'reasonable' size, the client needs to know the minimum
> length of any extent.

And with this approach we will in turn have the overhead of too many
requests for 00000000 or 11111111 bitmaps.

>
> I think the answer is probably a 'maximum number of extents' in the
> request packet.
>
> Of course with statuses in extents, the final extent could be
> represented as an 'I don't know, break this bit into a separate
> request' status.
>

With such a predefined status, we can postpone creating extended requests,
have the number of extents restricted by the server, and have the sum of
extent lengths be equal to the request length.
On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 29.11.2016 17:52, Alex Bligh wrote:
> > Vladimir,
> >> if the bitmap is 010101010101 we will have too many descriptors.
> >> For example, 16tb disk, 64k granularity -> 2G of descriptors
> >> payload.
> > Yep. And the cost of consuming and retrying is quite high. One
> > option would be for the client to realise this is a possibility, and
> > not request the entire extent map for a 16TB disk, as it might be
> > very large! Even if the client worked at e.g. a 64MB level (where
> > they'd get a maximum of 1024 extents per reply), this isn't going to
> > noticeably increase the round trip timing. One issue here is that to
> > determine a 'reasonable' size, the client needs to know the minimum
> > length of any extent.
>
> and with this approach we will in turn have overhead of too many
> requests for 00000000 or 11111111 bitmaps.

This is why my proposal suggests the server may abort the sent extents
after X extents (sixteen, but that number can certainly change) have been
sent. After all, the server will have a better view on what's going to be
costly in terms of "number of extents".

> > I think the answer is probably a 'maximum number of extents' in the
> > request packet.
> >
> > Of course with statuses in extents, the final extent could be
> > represented as an 'I don't know, break this bit into a separate
> > request' status.
>
> With such predefined status, we can postpone creating extended requests,
> have number of extents restricted by server and have sum of extents
> lengths be equal to request length.
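A sketch of that server-side capping idea: truncate after a fixed number of extents and fold the remainder into one trailing extent carrying a hypothetical "unknown, ask again" status, so the lengths still sum to the full requested length. The status value and function names are illustrative; no such value is defined by this proposal:

```python
STATUS_UNKNOWN = 0xFFFFFFFF  # hypothetical 'break this into a separate request'

def cap_extents(extents, max_extents=16):
    """Return at most max_extents (length, status) descriptors whose
    lengths still sum to the requested length; everything past the cap
    is folded into one extent marked with the hypothetical
    STATUS_UNKNOWN value, which the client should re-query."""
    if len(extents) <= max_extents:
        return extents
    head = extents[:max_extents - 1]
    tail_length = sum(length for length, _status in extents[max_extents - 1:])
    return head + [(tail_length, STATUS_UNKNOWN)]
```

This combines both points of the thread: the server bounds its reply size on its own, and the client still gets a reply that fully covers the requested range.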
Hi, Wouter!

> Actually, come to think of that. What is the exact use case for this
> thing? I understand you're trying to create incremental backups of
> things, which would imply you don't write from the client that is
> getting the block status thingies, right?

Overall, the most desired use case for this NBD extension is to allow
third-party software to make incremental backups. Acronis (a vendor of
backup solutions) would support qemu backup if block status is provided.

-----Original Message-----
From: Wouter Verhelst [mailto:w@uter.be]
Sent: Sunday, November 27, 2016 22:17
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: nbd-general@lists.sourceforge.net; kwolf@redhat.com;
qemu-devel@nongnu.org; pborzenkov@virtuozzo.com; stefanha@redhat.com;
pbonzini@redhat.com; mpa@pengutronix.de; den@openvz.org
Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:

- I wasn't thrilled with the proposal at the time. It felt a bit hackish,
  bolted onto NBD so you could use it, but without defining everything in
  the NBD protocol. "We're reading some data, but it's not about you."
  That didn't feel right.
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on its
own, i.e., without requiring an initiation or correlation of some state
through another protocol or network connection or whatever. If that's
needed, it feels like I didn't do my job properly, if you get my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed to
> query status of a particular range and read only those blocks of data
> that are actually present on the block device.
>
> To provide such information, the patch adds the BLOCK_STATUS extension
> with one new NBD_CMD_BLOCK_STATUS command, a new structured reply
> chunk format, and a new transmission flag.
>
> There exists a concept of data dirtiness, which is required during,
> for example, incremental block device backup. To express this concept
> via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS
> to request dirtiness information rather than provisioning information;
> however, with the current proposal, data dirtiness is only useful with
> additional coordination outside of the NBD protocol (such as a way to
> start and stop the server from tracking dirty sectors). Future NBD
> extensions may add commands to control dirtiness through NBD.
>
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list
> of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of
> a bitmap.
>
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>
> v3:
>
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change
> is removing the restriction that the sum of status descriptor lengths
> must be equal to the requested length. I.e., let's permit the server
> to reply with less data than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum
(i.e. "I'm interested in this range"), not an exact request.

> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the discussion, started about
> half a year ago. Here is a summary of some questions and ideas from
> the v2 discussion:
>
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
>    A: This all is for read-only disks, so the data is static and
>       unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the client
somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: server replies with status descriptors of any size, granularity
>          is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities are not intersecting

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>    A: several variants:
>       1: id of bitmap is in flags field of request
>          pros: - simple
>          cons: - it's a hack. flags field is for other uses.
>                - we'll have to map bitmap names to these "ids"
>       2: introduce extended nbd requests with variable length and
>          exploit this feature for BLOCK_STATUS command, specifying
>          bitmap identifier.
>          pros: - looks like a true way
>          cons: - we have to create additional extension
>                - possible we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: external tool should select which bitmap to export. So, in
>          case of Qemu it should be something like qmp command
>          block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - additional qmp command to implement (possibly, the
>                  lesser evil)
>          note: Hmm, external tool can make the choice between
>                allocated/dirty data too, so we can remove
>                'NBD_FLAG_STATUS_DIRTY' flag at all.

Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on the
clients and servers in use). To avoid this, we should have a clear
definition of what the reply means *by default*, but then we can add a note
that clients and servers can possibly define other meanings out of band if
they want to.

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>    cons: Two commands with almost same semantic and similar means?
>    pros: However here is a good point of separating clearly defined
>          and native for block devices GET_BLOCK_STATUS from
>          user-driven and actually undefined data, called 'dirtyness'.

Yeah, having them as separate commands might be a bad idea indeed.

> 5. Number of status descriptors, sent by server, should be restricted
>    variants:
>    1: just allow server to restrict this as it wants (which was done
>       in v3)
>    2: (not excluding 1). Client specifies somehow the maximum number
>       of descriptors.
>    2.1: add command flag, which will request only one descriptor
>         (otherwise, no restrictions from the client)
>    2.2: again, introduce extended nbd requests, and add field to
>         specify this maximum

I think having a flag which requests just one descriptor can be useful, but
I'm hesitant to add it unless it's actually going to be used; so in other
words, I'll leave the decision on that bit to you.

> 6. A: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (Server should
>    return EINVAL if found unknown bits, client should consider a reply
>    with unknown bits as an error)

Right, probably best to do that, yes.

> ======
>
> Also, an idea on 2-4:
>
> As we say that dirtiness is unknown to NBD, and an external tool
> should specify, manage and understand which data is actually
> transmitted, why not just call it user_data and leave the status
> field of the reply chunk unspecified in this case?
>
> So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> Eric's 'Block provisioning status' paragraph. If it is set, we just
> leave the status field for some external... protocol? Who knows what
> this user data is.

Yes, this sounds like a reasonable approach.

> Note: I'm not sure that I like this (my) proposal. It's just an idea;
> maybe someone will like it. And, I think, it represents what we are
> trying to do more honestly.

Indeed.

> Note2: the next step of generalization will be NBD_CMD_USER, with
> variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
>
> Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what it
is, in detail, that you want to do :-) I can understand a "has this changed
since time X" request, which the "dirty" thing seems to want to be.
Whether something is allocated is just a special case of that.

Actually, come to think of that. What is the exact use case for this thing?
I understand you're trying to create incremental backups of things, which
would imply you don't write from the client that is getting the block
status thingies, right?

If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of snapshots.
  Not required, optional, includes a machine-readable form, not defined by
  NBD, which explains what the snapshot is about (e.g., a qemu json file).
  The "base" version of that is just "allocation status", and is implied
  (i.e., you don't need to run NBD_OPT_GET_SNAPSHOTS if you're not
  interested in anything but the allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission): returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the
definition of what a snapshot is back into the NBD protocol, without having
to say "this could be anything", and without requiring connectivity over
two ports for this to be useful (e.g., you could store the machine-readable
form of the snapshot description in your backup program and match what they
mean with what you're interested in at restore time, etc). This wouldn't
work if you're interested in new snapshots that get created once we've
already moved into transmission, but hey.

Thoughts?

> Full backup: just do a structured read - it will show us which chunks
> may be treated as zeroes.

Right.

> Incremental backup: get the dirty bitmap (somehow, for example through
> the user-defined part of the proposed command), then, for dirty blocks,
> read them through structured read, so information about zero/unallocated
> areas is there.
>
> For me all the variants above are OK. Let's finally choose something.
>
> v2:
> v1 was:
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
>
> Since then, we've added the STRUCTURED_REPLY extension, which
> necessitates a rather larger rebase; I've also changed things to
> rename the command 'NBD_CMD_BLOCK_STATUS', changed the request modes
> to be determined by boolean flags (rather than by fixed values of the
> 16-bit flags field), changed the reply status fields to be bitwise-or
> values (with a default of 0 always being sane), and changed the
> descriptor layout to drop an offset but to include a 32-bit status so
> that the descriptor is nicely 8-byte aligned without padding.
>
> doc/proto.md | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 154 insertions(+), 1 deletion(-)

[...]

I'll commit this in a minute into a separate branch called
"extension-blockstatus", under the understanding that changes are still
required, as per above (i.e., don't assume that just because there's a
branch I'm happy with the current result ;-)

Regards

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a
       dozen people in the world who think they really understand all of
       its rules, and pretty much all of them are just lying to
       themselves too.
 -- #debian-devel, OFTC, 2016-02-12
Hi Alex, let me try my hand at clarifying some points...

On 11/29/2016 07:57 AM, Alex Bligh wrote:
> Vladimir,
>
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some of it. Comments below.
>
> Rather than comment on many of the points individually, the root of my
> confusion and to some extent uncomfortableness about this proposal is
> 'who owns the meaning of the bitmaps'.
>
> Some of this is my own confusion (sorry) about the use to which this
> is being put, which is I think at root a documentation issue. To
> illustrate this, you write in the FAQ section that this is for read
> only disks, but the text talks about:
>

I sent an earlier email in response to Wouter over our exact "goal" with this spec extension that is a little more of an overview, but I will try to address your questions specifically.

> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>

In a simple case, live IO goes to e.g. hda.qcow2. These writes come from the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What happens in this scenario would be that a snapshot of the data at the time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is RO. While a hypothetical backup client is busy copying data out of this temporary view, new writes are coming in to the drive, but are not being exposed through the NBD export.
(This goes into QEMU-specifics, but those new writes are dirtying a version of the bitmap not intended to be exposed via the NBD channel. NBD gets effectively a snapshot of both the bitmap AND the data.) > I now think what you are talking about backing up a *snapshot* of a disk > that's running, where the disk itself was not connected using NBD? IE it's > not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively > an opaque state represented in a bitmap, which is binary metadata > at some particular level of granularity. It might as well be 'happiness' > or 'is coloured blue'. The NBD server would (normally) have no way of > manipulating this bitmap. > > In previous comments, I said 'how come we can set the dirty bit through > writes but can't clear it?'. This (my statement) is now I think wrong, > as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The > state of the bitmap comes from whatever sets the bitmap which is outside > the scope of this protocol to transmit it. > You know, this is a fair point. We have not (to my knowledge) yet carefully considered the exact bitmap management scenario when NBD is involved in retrieving dirty blocks. Humor me for a moment while I talk about a (completely hypothetical, not yet fully discussed) workflow for how I envision this feature. (1) User sets up a drive in QEMU, a bitmap is initialized, an initial backup is made, etc. (2) As writes come in, QEMU's bitmap is dirtied. (3) The user decides they want to root around to see what data has changed and would like to use NBD to do so, in contrast to QEMU's own facilities for dumping dirty blocks. (4) A command is issued that creates a temporary, lightweight snapshot ('fleecing') and exports this snapshot over NBD. The bitmap is associated with the NBD export at this point at NBD server startup. 
(For the sake of QEMU discussion, maybe this command is "blockdev-fleece") (5) At this moment, the snapshot is static and represents the data at the time the NBD server was started. The bitmap is also forked and represents only this snapshot. The live data and bitmap continue to change. (6) Dirty blocks are queried and copied out via NBD. (7) The user closes the NBD instance upon completion of their task, whatever it was. (Making a new incremental backup? Just taking a peek at some changed data? who knows.) The point that's interesting here is what do we do with the two bitmaps at this point? The data delta can be discarded (this was after all just a lightweight read-only point-in-time snapshot) but the bitmap data needs to be dealt with. (A) In the case of "User made a new incremental backup," the bitmap that got forked off to serve the NBD read should be discarded. (B) In the case of "User just wanted to look around," the bitmap should be merged back into the bitmap it was forked from. I don't advise a hybrid where "User copied some data, but not all" where we need to partially clear *and* merge, but conceivably this could happen, because the things we don't want to happen always will. At this point maybe it's becoming obvious that actually it would be very prudent to allow the NBD client itself to inform QEMU via the NBD protocol which extents/blocks/(etc) that it is "done" with. Maybe it *would* actually be useful if, in NBD allowing us to add a "dirty" bit to the specification, we allow users to clear those bits. Then, whether the user was trying to do (A) or (B) or the unspeakable amalgamation of both things, it's up to the user to clear the bits desired and QEMU can do the simple task of simply always merging the bitmap fork upon the conclusion of the NBD fleecing exercise. 
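A toy model of that fork/merge lifecycle, with Python sets of dirty block indices standing in for real QEMU bitmaps (the class and method names here are illustrative only, not QEMU's actual API):

```python
# Toy model of the fork/merge bitmap lifecycle described in steps (4)-(7)
# above. Dirty bitmaps are modelled as sets of dirty block indices; the
# real QEMU structures and operation names are of course different.

class DirtyBitmap:
    def __init__(self, dirty=()):
        self.dirty = set(dirty)

    def mark(self, block):
        # A guest write dirties the live bitmap.
        self.dirty.add(block)

    def fork(self):
        # Freeze the current state for the NBD export; the live bitmap
        # starts over empty and keeps tracking new writes.
        frozen = DirtyBitmap(self.dirty)
        self.dirty = set()
        return frozen

    def merge(self, other):
        # Case (B): nothing was backed up, fold the frozen bits back in.
        self.dirty |= other.dirty

live = DirtyBitmap({2, 5, 7})
snapshot_view = live.fork()   # the static view the NBD server exports
live.mark(9)                  # new guest writes keep dirtying the live bitmap

# Case (A): backup succeeded -> simply discard snapshot_view.
# Case (B): backup abandoned -> merge it back:
live.merge(snapshot_view)
print(sorted(live.dirty))     # -> [2, 5, 7, 9]
```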
Maybe this would allow the dirty bit to have a bit more concrete meaning for the NBD spec: "The bit stays dirty until the user clears it, and is set when the matching block/extent/etc is written to."

With an exception that external management may cause the bits to clear. (I.e., someone fiddles with the backing store in a way opaque to NBD, e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'?
>
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
>

If we can't work out a more precise, semantically meaningful spec extension, then I am very happy with simply the ability to have a user bit, or X-QEMU bits, etc etc etc.

> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps.
> Only one (allocation) might be given meaning to start off with, and it
> wouldn't be necessary for all servers to support that - i.e. you could
> support bitmap reading without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.
>
> Some more nits:
>
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
>
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from
> that). I haven't been rebasing them frequently as it can disrupt those
> working on the branches. There's only really an issue around rebasing
> where you depend on another branch.
>
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: server replies with status descriptors of any size,
>>          granularity is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities are not intersecting
>
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
>
>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>          pros: - simple
>>          cons: - it's a hack. flags field is for other uses.
>>                - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and
>>          exploit this feature for BLOCK_STATUS command, specifying
>>          bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possibly we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: external tool should select which bitmap to export. So, in
>>          case of Qemu it should be something like qmp command
>>          block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the
>>                  lesser evil)
>>          note: Hmm, external tool can choose between allocated/dirty
>>                data too, so we can remove the 'NBD_FLAG_STATUS_DIRTY'
>>                flag altogether.
>
> Yes, this is all pretty horrible. I suspect we want to do something
> like (2), and permit extra data across (in my proposal, it would only
> be one byte to select the index). I suppose one could ask for a list
> of bitmaps.
>

Having missed most of the discussion on v1/v2, is it a given that we want in-band identification of bitmaps?

I guess this might depend very heavily on the nature of the definition of the "dirty bit" in the NBD spec.

>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost the same semantics and similar means?
>>    pros: However here is a good point of separating clearly defined
>>          and native for block devices GET_BLOCK_STATUS from
>>          user-driven and actually undefined data, called 'dirtiness'.
>
> I'm suggesting one generic 'read bitmap' command like you.
>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done
>>       in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum number
>>       of descriptors.
>>    2.1: add command flag, which will request only one descriptor
>>         (otherwise, no restrictions from the client)
>>    2.2: again, introduce extended nbd requests, and add field to
>>         specify this maximum
>
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.
>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if it finds unknown bits, client should consider a
>>    reply with unknown bits as an error)
>
> Yeah.
>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> + A block status query request. Length and offset define the range
>> + of interest. Clients SHOULD NOT use this request unless the server
>
> MUST NOT is what we say elsewhere I believe.
>
>> + set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> + in turn requires the client to first negotiate structured replies.
>> + For a successful return, the server MUST use a structured reply,
>> + containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
>
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured (simple) error reply
> if the server doesn't support the command, but I think it would be fair
> to say the server MUST use a structured reply to NBD_CMD_BLOCK_STATUS
> if it supports the command. This is effectively what we say re
> NBD_CMD_READ.
>
>> +
>> + The list of block status descriptors within the
>> + `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> + of the file starting from specified *offset*, and the sum of the
>> + *length* fields of each descriptor MUST not be greater than the
>> + overall *length* of the request.
>> + This means that the server MAY
>> + return less data than required. However the server MUST return at
>> + least one status descriptor
>
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.
>
>> . The server SHOULD use different
>> + *status* values between consecutive descriptors, and SHOULD use
>> + descriptor lengths that are an integer multiple of 512 bytes where
>> + possible (the first and last descriptor of an unaligned query being
>> + the most obvious places for an exception).
>
> Surely better would be an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
>

Anyway, I hope I am being useful and just not more confounding. It seems to me that we're having difficulty conveying precisely what it is we're trying to accomplish, so I hope that I am making a good effort in elaborating on our goals/requirements.

If not, well. You've got my address for hatemail :)

Thanks,
--John
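For reference, the descriptor layout under discussion (per the v2 changelog: a 32-bit extent length plus a 32-bit status field, 8 bytes per descriptor, no offset field) can be sketched as a small parser. The validity checks below mirror the draft text quoted above; the function name is made up for illustration:

```python
import struct

# Descriptor layout from the v2 proposal: a 32-bit extent length followed
# by a 32-bit status bitfield, so each descriptor is exactly 8 bytes. The
# checks mirror the quoted draft: lengths must not sum past the requested
# length, and the server must return at least one descriptor.

def parse_block_status(payload, req_length):
    if len(payload) < 8 or len(payload) % 8:
        raise ValueError("payload must be a non-empty run of 8-byte descriptors")
    extents = [struct.unpack_from(">II", payload, off)
               for off in range(0, len(payload), 8)]
    if sum(length for length, _ in extents) > req_length:
        raise ValueError("descriptor lengths exceed requested length")
    return extents  # list of (length, status) pairs

# Two extents: 64 KiB with status bit 0 set, then 4 KiB with status 0.
reply = struct.pack(">IIII", 65536, 0x1, 4096, 0x0)
print(parse_block_status(reply, 131072))  # -> [(65536, 1), (4096, 0)]
```

Note that, per the draft, the sum may be *less* than the requested length; a client has to be prepared to reissue the request for the uncovered tail.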
02.12.2016 02:42, John Snow wrote: > (B) In the case of "User just wanted to look around," the bitmap should > be merged back into the bitmap it was forked from. currently existing example: "failed incremental backup"
John, >> +Some storage formats and operations over such formats express a >> +concept of data dirtiness. Whether the operation is block device >> +mirroring, incremental block device backup or any other operation with >> +a concept of data dirtiness, they all share a need to provide a list >> +of ranges that this particular operation treats as dirty. >> >> How can data be 'dirty' if it is static and unchangeable? (I thought) >> > > In a simple case, live IO goes to e.g. hda.qcow2. These writes come from > the VM and cause the bitmap that QEMU manages to become dirty. > > We intend to expose the ability to fleece dirty blocks via NBD. What > happens in this scenario would be that a snapshot of the data at the > time of the request is exported over NBD in a read-only manner. > > In this way, the drive itself is R/W, but the "view" of it from NBD is > RO. While a hypothetical backup client is busy copying data out of this > temporary view, new writes are coming in to the drive, but are not being > exposed through the NBD export. > > (This goes into QEMU-specifics, but those new writes are dirtying a > version of the bitmap not intended to be exposed via the NBD channel. > NBD gets effectively a snapshot of both the bitmap AND the data.) Thanks. That makes sense - or enough sense for me to carry on commenting! >> I now think what you are talking about backing up a *snapshot* of a disk >> that's running, where the disk itself was not connected using NBD? IE it's >> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively >> an opaque state represented in a bitmap, which is binary metadata >> at some particular level of granularity. It might as well be 'happiness' >> or 'is coloured blue'. The NBD server would (normally) have no way of >> manipulating this bitmap. >> >> In previous comments, I said 'how come we can set the dirty bit through >> writes but can't clear it?'. This (my statement) is now I think wrong, >> as NBD_CMD_WRITE etc. 
is not defined to set the dirty bit. The >> state of the bitmap comes from whatever sets the bitmap which is outside >> the scope of this protocol to transmit it. >> > > You know, this is a fair point. We have not (to my knowledge) yet > carefully considered the exact bitmap management scenario when NBD is > involved in retrieving dirty blocks. > > Humor me for a moment while I talk about a (completely hypothetical, not > yet fully discussed) workflow for how I envision this feature. > > (1) User sets up a drive in QEMU, a bitmap is initialized, an initial > backup is made, etc. > > (2) As writes come in, QEMU's bitmap is dirtied. > > (3) The user decides they want to root around to see what data has > changed and would like to use NBD to do so, in contrast to QEMU's own > facilities for dumping dirty blocks. > > (4) A command is issued that creates a temporary, lightweight snapshot > ('fleecing') and exports this snapshot over NBD. The bitmap is > associated with the NBD export at this point at NBD server startup. (For > the sake of QEMU discussion, maybe this command is "blockdev-fleece") > > (5) At this moment, the snapshot is static and represents the data at > the time the NBD server was started. The bitmap is also forked and > represents only this snapshot. The live data and bitmap continue to change. > > (6) Dirty blocks are queried and copied out via NBD. > > (7) The user closes the NBD instance upon completion of their task, > whatever it was. (Making a new incremental backup? Just taking a peek at > some changed data? who knows.) > > The point that's interesting here is what do we do with the two bitmaps > at this point? The data delta can be discarded (this was after all just > a lightweight read-only point-in-time snapshot) but the bitmap data > needs to be dealt with. > > (A) In the case of "User made a new incremental backup," the bitmap that > got forked off to serve the NBD read should be discarded. 
> > (B) In the case of "User just wanted to look around," the bitmap should > be merged back into the bitmap it was forked from. > > I don't advise a hybrid where "User copied some data, but not all" where > we need to partially clear *and* merge, but conceivably this could > happen, because the things we don't want to happen always will. > > At this point maybe it's becoming obvious that actually it would be very > prudent to allow the NBD client itself to inform QEMU via the NBD > protocol which extents/blocks/(etc) that it is "done" with. > > Maybe it *would* actually be useful if, in NBD allowing us to add a > "dirty" bit to the specification, we allow users to clear those bits. > > Then, whether the user was trying to do (A) or (B) or the unspeakable > amalgamation of both things, it's up to the user to clear the bits > desired and QEMU can do the simple task of simply always merging the > bitmap fork upon the conclusion of the NBD fleecing exercise. > > Maybe this would allow the dirty bit to have a bit more concrete meaning > for the NBD spec: "The bit stays dirty until the user clears it, and is > set when the matching block/extent/etc is written to." > > With an exception that external management may cause the bits to clear. > (I.e., someone fiddles with the backing store in a way opaque to NBD, > e.g. someone clears the bitmap directly through QEMU instead of via NBD.) There is currently one possible "I've done with the entire bitmap" signal, which is closing the connection. This has two obvious problems. Firstly if used, it discards the entire bitmap (not bits). Secondly, it makes recovery from a broken TCP session difficult (as either you treat a dirty close as meaning the bitmap needs to hang around, in which case you have a garbage collection issue, or you treat it as needing to drop the bitmap, in which case you can't recover). I think in your plan the block status doesn't change once the bitmap is forked. 
In that case, adding some command (optional) to change the status of the bitmap (or simply to set a given extent to status X) would be reasonable. Of course whether it's supported could be dependent on the bitmap. > Having missed most of the discussion on v1/v2, is it a given that we > want in-band identification of bitmaps? > > I guess this might depend very heavily on the nature of the definition > of the "dirty bit" in the NBD spec. I don't think it's a given. I think Wouter & I came up with it at the same time as a way to abstract the bitmap/extent concept and remove the need to specify a dirty bit at all (well, that's my excuse anyway). > Anyway, I hope I am being useful and just not more confounding. It seems > to me that we're having difficulty conveying precisely what it is we're > trying to accomplish, so I hope that I am making a good effort in > elaborating on our goals/requirements. Yes absolutely. I think part of the challenge is that you are quite reasonably coming at it from the point of view of qemu's particular need, and I'm coming at it from 'what should the nbd protocol look like in general' position, having done lots of work on the protocol docs (though I'm an occasional qemu contributor). So there's necessarily a gap of approach to be bridged. I'm overdue on a review of Wouter's latest patch (partly because I need to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in), but I think it's a bridge worth building.
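The optional "set a given extent to status X" command floated above could look something like this on the wire. The command name, its number, and the idea of carrying the new status in the 16-bit command flags field are all invented here for illustration; only the 28-byte request header layout and its magic come from the existing NBD protocol:

```python
import struct

# Sketch of a hypothetical NBD_CMD_SET_BLOCK_STATUS request. The command
# number and the (ab)use of the flags field for the status value are
# invented; the header layout and magic are the real NBD request header.

NBD_REQUEST_MAGIC = 0x25609513          # real, per the NBD protocol doc
HYPOTHETICAL_CMD_SET_BLOCK_STATUS = 8   # invented command number

def pack_request(flags, cmd, handle, offset, length):
    # Standard NBD request header: magic, command flags, command type,
    # handle, offset, length -- 28 bytes in total.
    return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, flags, cmd,
                       handle, offset, length)

# "Mark extent [1 MiB, 1 MiB + 64 KiB) as clean" (status 0 in this
# hypothetical encoding):
req = pack_request(flags=0, cmd=HYPOTHETICAL_CMD_SET_BLOCK_STATUS,
                   handle=0xdeadbeef, offset=1 << 20, length=64 << 10)
print(len(req))  # -> 28
```

Whether such a command is supported could then be advertised per bitmap, as suggested above.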
On 12/02/2016 01:45 PM, Alex Bligh wrote: > John, > >>> +Some storage formats and operations over such formats express a >>> +concept of data dirtiness. Whether the operation is block device >>> +mirroring, incremental block device backup or any other operation with >>> +a concept of data dirtiness, they all share a need to provide a list >>> +of ranges that this particular operation treats as dirty. >>> >>> How can data be 'dirty' if it is static and unchangeable? (I thought) >>> >> >> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from >> the VM and cause the bitmap that QEMU manages to become dirty. >> >> We intend to expose the ability to fleece dirty blocks via NBD. What >> happens in this scenario would be that a snapshot of the data at the >> time of the request is exported over NBD in a read-only manner. >> >> In this way, the drive itself is R/W, but the "view" of it from NBD is >> RO. While a hypothetical backup client is busy copying data out of this >> temporary view, new writes are coming in to the drive, but are not being >> exposed through the NBD export. >> >> (This goes into QEMU-specifics, but those new writes are dirtying a >> version of the bitmap not intended to be exposed via the NBD channel. >> NBD gets effectively a snapshot of both the bitmap AND the data.) > > Thanks. That makes sense - or enough sense for me to carry on commenting! > Whew! I'm glad. >>> I now think what you are talking about backing up a *snapshot* of a disk >>> that's running, where the disk itself was not connected using NBD? IE it's >>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively >>> an opaque state represented in a bitmap, which is binary metadata >>> at some particular level of granularity. It might as well be 'happiness' >>> or 'is coloured blue'. The NBD server would (normally) have no way of >>> manipulating this bitmap. 
>>> >>> In previous comments, I said 'how come we can set the dirty bit through >>> writes but can't clear it?'. This (my statement) is now I think wrong, >>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The >>> state of the bitmap comes from whatever sets the bitmap which is outside >>> the scope of this protocol to transmit it. >>> >> >> You know, this is a fair point. We have not (to my knowledge) yet >> carefully considered the exact bitmap management scenario when NBD is >> involved in retrieving dirty blocks. >> >> Humor me for a moment while I talk about a (completely hypothetical, not >> yet fully discussed) workflow for how I envision this feature. >> >> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial >> backup is made, etc. >> >> (2) As writes come in, QEMU's bitmap is dirtied. >> >> (3) The user decides they want to root around to see what data has >> changed and would like to use NBD to do so, in contrast to QEMU's own >> facilities for dumping dirty blocks. >> >> (4) A command is issued that creates a temporary, lightweight snapshot >> ('fleecing') and exports this snapshot over NBD. The bitmap is >> associated with the NBD export at this point at NBD server startup. (For >> the sake of QEMU discussion, maybe this command is "blockdev-fleece") >> >> (5) At this moment, the snapshot is static and represents the data at >> the time the NBD server was started. The bitmap is also forked and >> represents only this snapshot. The live data and bitmap continue to change. >> >> (6) Dirty blocks are queried and copied out via NBD. >> >> (7) The user closes the NBD instance upon completion of their task, >> whatever it was. (Making a new incremental backup? Just taking a peek at >> some changed data? who knows.) >> >> The point that's interesting here is what do we do with the two bitmaps >> at this point? 
The data delta can be discarded (this was after all just >> a lightweight read-only point-in-time snapshot) but the bitmap data >> needs to be dealt with. >> >> (A) In the case of "User made a new incremental backup," the bitmap that >> got forked off to serve the NBD read should be discarded. >> >> (B) In the case of "User just wanted to look around," the bitmap should >> be merged back into the bitmap it was forked from. >> >> I don't advise a hybrid where "User copied some data, but not all" where >> we need to partially clear *and* merge, but conceivably this could >> happen, because the things we don't want to happen always will. >> >> At this point maybe it's becoming obvious that actually it would be very >> prudent to allow the NBD client itself to inform QEMU via the NBD >> protocol which extents/blocks/(etc) that it is "done" with. >> >> Maybe it *would* actually be useful if, in NBD allowing us to add a >> "dirty" bit to the specification, we allow users to clear those bits. >> >> Then, whether the user was trying to do (A) or (B) or the unspeakable >> amalgamation of both things, it's up to the user to clear the bits >> desired and QEMU can do the simple task of simply always merging the >> bitmap fork upon the conclusion of the NBD fleecing exercise. >> >> Maybe this would allow the dirty bit to have a bit more concrete meaning >> for the NBD spec: "The bit stays dirty until the user clears it, and is >> set when the matching block/extent/etc is written to." >> >> With an exception that external management may cause the bits to clear. >> (I.e., someone fiddles with the backing store in a way opaque to NBD, >> e.g. someone clears the bitmap directly through QEMU instead of via NBD.) > > There is currently one possible "I've done with the entire bitmap" > signal, which is closing the connection. This has two obvious > problems. Firstly if used, it discards the entire bitmap (not bits). 
> Secondly, it makes recovery from a broken TCP session difficult > (as either you treat a dirty close as meaning the bitmap needs > to hang around, in which case you have a garbage collection issue, > or you treat it as needing to drop the bitmap, in which case you > can't recover). > In my mind, I wasn't treating closing the connection as the end of the point-in-time snapshot; that would be stopping the export. I wouldn't advocate for a control channel (QEMU, here) clearing the bitmap just because a client disappeared. Either: (A) QEMU clears the bitmap because the NBD export was *stopped*, or (B) QEMU, acting as the NBD server, clears the bitmap as instructed by the NBD client, if we admit a provision to clear bits from the NBD protocol itself. I don't think there's room for the NBD server (QEMU) deciding to clear bits based on connection status. It has to be an explicit decision -- either via NBD or QMP. > I think in your plan the block status doesn't change once the bitmap > is forked. In that case, adding some command (optional) to change > the status of the bitmap (or simply to set a given extent to status X) > would be reasonable. Of course whether it's supported could be dependent > on the bitmap. > What I describe as "forking" was kind of a bad description. What really happens when we have a divergence is that the bitmap with data is split into two bitmaps that are related: - A new bitmap that is created takes over for the old bitmap. This new bitmap is empty. It records writes on the live version of the data. - The old bitmap as it existed remains in a read-only state, and describes some point-in-time snapshot view of the data. In the case of an incremental backup, once we've made a backup of the data, that read-only bitmap can actually be discarded without further thought. 
In the case of a failed incremental backup, or in the case of "I just wanted to look and see what has changed, but wasn't prepared to reset the counter yet," this bitmap gets merged back with the live bitmap as if nothing ever happened. ANYWAY, allowing the NBD client to request bits be cleared has an obvious use case even for QEMU, IMO -- which is, the NBD client itself gains the ability to, without relying on a control plane to the server, decide for itself if it is going to "make a backup" or "just look around." The client gains the ability to leave the bitmap alone (QEMU will re-merge it later once the snapshot is closed) or the ability to clear it ("I made my backup, we're done with this.") That usefulness would allow us to have an explicit dirty bit mechanism directly in NBD, IMO, because: (1) A RW NBD server has enough information to mark a bit dirty (2) Since there exists an in-spec mechanism to reset the bitmap, the dirty bit is meaningful to the server and the client >> Having missed most of the discussion on v1/v2, is it a given that we >> want in-band identification of bitmaps? >> >> I guess this might depend very heavily on the nature of the definition >> of the "dirty bit" in the NBD spec. > > I don't think it's a given. I think Wouter & I came up with it at > the same time as a way to abstract the bitmap/extent concept and > remove the need to specify a dirty bit at all (well, that's my excuse > anyway). > OK. We do certainly support multiple bitmaps being active at a time in QEMU, but I had personally always envisioned that you'd associate them one-at-a-time when starting the NBD export of a particular device. I don't have a use case in my head where two distinct bitmaps being exposed simultaneously offer any particular benefit, but maybe there is something. I'm sure there is. I will leave this aspect of it more to you NBD folks. I think QEMU could cope with either. (Vladimir, am I wrong? Do you have thoughts on this in particular? 
I haven't thought through this aspect of it very much.) >> Anyway, I hope I am being useful and just not more confounding. It seems >> to me that we're having difficulty conveying precisely what it is we're >> trying to accomplish, so I hope that I am making a good effort in >> elaborating on our goals/requirements. > > Yes absolutely. I think part of the challenge is that you are quite > reasonably coming at it from the point of view of qemu's particular > need, and I'm coming at it from 'what should the nbd protocol look > like in general' position, having done lots of work on the protocol > docs (though I'm an occasional qemu contributor). So there's necessarily > a gap of approach to be bridged. > Yeah, I understand quite well that we need to make sure the NBD spec is sane and useful in a QEMU-agnostic way, so my goal here is just to help elucidate our needs to enable you to reach a good consensus. > I'm overdue on a review of Wouter's latest patch (partly because I need > to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in), > but I think it's a bridge worth building. > Same. Thank you for your patience! Cheers, --js
> On 2 Dec 2016, at 20:39, John Snow <jsnow@redhat.com> wrote:
>
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
>
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The obvious case is an allocation bitmap plus a dirty bitmap. It's also possible one might want more than one dirty bitmap at once: perhaps two sorts of backup, or live migration of storage backed by NBD, or inspecting the *state* of live migration via NBD and a bitmap, or determining which extents in a QCOW image are in that image itself (as opposed to the image on which it is based).

I tried to pick some QEMU-like ones, but I am sure there are examples that would work outside of QEMU.
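Whichever bitmaps end up being exposed, the cover letter's design choice is to return each one as a list of extents (mimicking SCSI "GET LBA STATUS") rather than as a raw bitmap. A minimal sketch of that conversion; the helper name and the 512-byte block size are illustrative assumptions, not from the spec:

```python
def bitmap_to_extents(bits, block_size=512):
    """Collapse a per-block bitmap into (length_in_bytes, status) extent
    descriptors -- the run-length shape an NBD_CMD_BLOCK_STATUS reply
    would carry instead of the bitmap itself.  Name and block size are
    illustrative, not defined by the proposal."""
    extents = []
    for bit in bits:
        status = int(bit)
        if extents and extents[-1][1] == status:
            # Same status as the previous block: extend the current run.
            extents[-1][0] += block_size
        else:
            # Status changed: start a new extent descriptor.
            extents.append([block_size, status])
    return [tuple(e) for e in extents]
```

For sparse or mostly-clean devices this run-length form is far more compact than shipping the bitmap, which is presumably why the extent representation was chosen.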
02.12.2016 23:39, John Snow wrote: > > On 12/02/2016 01:45 PM, Alex Bligh wrote: >> John, >> >>>> +Some storage formats and operations over such formats express a >>>> +concept of data dirtiness. Whether the operation is block device >>>> +mirroring, incremental block device backup or any other operation with >>>> +a concept of data dirtiness, they all share a need to provide a list >>>> +of ranges that this particular operation treats as dirty. >>>> >>>> How can data be 'dirty' if it is static and unchangeable? (I thought) >>>> >>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from >>> the VM and cause the bitmap that QEMU manages to become dirty. >>> >>> We intend to expose the ability to fleece dirty blocks via NBD. What >>> happens in this scenario would be that a snapshot of the data at the >>> time of the request is exported over NBD in a read-only manner. >>> >>> In this way, the drive itself is R/W, but the "view" of it from NBD is >>> RO. While a hypothetical backup client is busy copying data out of this >>> temporary view, new writes are coming in to the drive, but are not being >>> exposed through the NBD export. >>> >>> (This goes into QEMU-specifics, but those new writes are dirtying a >>> version of the bitmap not intended to be exposed via the NBD channel. >>> NBD gets effectively a snapshot of both the bitmap AND the data.) >> Thanks. That makes sense - or enough sense for me to carry on commenting! >> > Whew! I'm glad. > >>>> I now think what you are talking about backing up a *snapshot* of a disk >>>> that's running, where the disk itself was not connected using NBD? IE it's >>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively >>>> an opaque state represented in a bitmap, which is binary metadata >>>> at some particular level of granularity. It might as well be 'happiness' >>>> or 'is coloured blue'. The NBD server would (normally) have no way of >>>> manipulating this bitmap. 
>>>> >>>> In previous comments, I said 'how come we can set the dirty bit through >>>> writes but can't clear it?'. This (my statement) is now I think wrong, >>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The >>>> state of the bitmap comes from whatever sets the bitmap which is outside >>>> the scope of this protocol to transmit it. >>>> >>> You know, this is a fair point. We have not (to my knowledge) yet >>> carefully considered the exact bitmap management scenario when NBD is >>> involved in retrieving dirty blocks. >>> >>> Humor me for a moment while I talk about a (completely hypothetical, not >>> yet fully discussed) workflow for how I envision this feature. >>> >>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial >>> backup is made, etc. >>> >>> (2) As writes come in, QEMU's bitmap is dirtied. >>> >>> (3) The user decides they want to root around to see what data has >>> changed and would like to use NBD to do so, in contrast to QEMU's own >>> facilities for dumping dirty blocks. >>> >>> (4) A command is issued that creates a temporary, lightweight snapshot >>> ('fleecing') and exports this snapshot over NBD. The bitmap is >>> associated with the NBD export at this point at NBD server startup. (For >>> the sake of QEMU discussion, maybe this command is "blockdev-fleece") >>> >>> (5) At this moment, the snapshot is static and represents the data at >>> the time the NBD server was started. The bitmap is also forked and >>> represents only this snapshot. The live data and bitmap continue to change. >>> >>> (6) Dirty blocks are queried and copied out via NBD. >>> >>> (7) The user closes the NBD instance upon completion of their task, >>> whatever it was. (Making a new incremental backup? Just taking a peek at >>> some changed data? who knows.) >>> >>> The point that's interesting here is what do we do with the two bitmaps >>> at this point? 
The data delta can be discarded (this was after all just >>> a lightweight read-only point-in-time snapshot) but the bitmap data >>> needs to be dealt with. >>> >>> (A) In the case of "User made a new incremental backup," the bitmap that >>> got forked off to serve the NBD read should be discarded. >>> >>> (B) In the case of "User just wanted to look around," the bitmap should >>> be merged back into the bitmap it was forked from. >>> >>> I don't advise a hybrid where "User copied some data, but not all" where >>> we need to partially clear *and* merge, but conceivably this could >>> happen, because the things we don't want to happen always will. >>> >>> At this point maybe it's becoming obvious that actually it would be very >>> prudent to allow the NBD client itself to inform QEMU via the NBD >>> protocol which extents/blocks/(etc) that it is "done" with. >>> >>> Maybe it *would* actually be useful if, in NBD allowing us to add a >>> "dirty" bit to the specification, we allow users to clear those bits. >>> >>> Then, whether the user was trying to do (A) or (B) or the unspeakable >>> amalgamation of both things, it's up to the user to clear the bits >>> desired and QEMU can do the simple task of simply always merging the >>> bitmap fork upon the conclusion of the NBD fleecing exercise. >>> >>> Maybe this would allow the dirty bit to have a bit more concrete meaning >>> for the NBD spec: "The bit stays dirty until the user clears it, and is >>> set when the matching block/extent/etc is written to." >>> >>> With an exception that external management may cause the bits to clear. >>> (I.e., someone fiddles with the backing store in a way opaque to NBD, >>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.) >> There is currently one possible "I've done with the entire bitmap" >> signal, which is closing the connection. This has two obvious >> problems. Firstly if used, it discards the entire bitmap (not bits). 
>> Secondly, it makes recovery from a broken TCP session difficult >> (as either you treat a dirty close as meaning the bitmap needs >> to hang around, in which case you have a garbage collection issue, >> or you treat it as needing to drop the bitmap, in which case you >> can't recover). >> > In my mind, I wasn't treating closing the connection as the end of the > point-in-time snapshot; that would be stopping the export. > > I wouldn't advocate for a control channel (QEMU, here) clearing the > bitmap just because a client disappeared. > > Either: > > (A) QEMU clears the bitmap because the NBD export was *stopped*, or > (B) QEMU, acting as the NBD server, clears the bitmap as instructed by > the NBD client, if we admit a provision to clear bits from the NBD > protocol itself. > > I don't think there's room for the NBD server (QEMU) deciding to clear > bits based on connection status. It has to be an explicit decision -- > either via NBD or QMP. > >> I think in your plan the block status doesn't change once the bitmap >> is forked. In that case, adding some command (optional) to change >> the status of the bitmap (or simply to set a given extent to status X) >> would be reasonable. Of course whether it's supported could be dependent >> on the bitmap. >> > What I describe as "forking" was kind of a bad description. What really > happens when we have a divergence is that the bitmap with data is split > into two bitmaps that are related: > > - A new bitmap that is created takes over for the old bitmap. This new > bitmap is empty. It records writes on the live version of the data. > - The old bitmap as it existed remains in a read-only state, and > describes some point-in-time snapshot view of the data. > > In the case of an incremental backup, once we've made a backup of the > data, that read-only bitmap can actually be discarded without further > thought. 
> > In the case of a failed incremental backup, or in the case of "I just > wanted to look and see what has changed, but wasn't prepared to reset > the counter yet," this bitmap gets merged back with the live bitmap as > if nothing ever happened. > > ANYWAY, allowing the NBD client to request bits be cleared has an > obvious use case even for QEMU, IMO -- which is, the NBD client itself > gains the ability to, without relying on a control plane to the server, > decide for itself if it is going to "make a backup" or "just look around." > > The client gains the ability to leave the bitmap alone (QEMU will > re-merge it later once the snapshot is closed) or the ability to clear > it ("I made my backup, we're done with this.") > > That usefulness would allow us to have an explicit dirty bit mechanism > directly in NBD, IMO, because: > > (1) A RW NBD server has enough information to mark a bit dirty > (2) Since there exists an in-spec mechanism to reset the bitmap, the > dirty bit is meaningful to the server and the client > >>> Having missed most of the discussion on v1/v2, is it a given that we >>> want in-band identification of bitmaps? >>> >>> I guess this might depend very heavily on the nature of the definition >>> of the "dirty bit" in the NBD spec. >> I don't think it's a given. I think Wouter & I came up with it at >> the same time as a way to abstract the bitmap/extent concept and >> remove the need to specify a dirty bit at all (well, that's my excuse >> anyway). >> > OK. We do certainly support multiple bitmaps being active at a time in > QEMU, but I had personally always envisioned that you'd associate them > one-at-a-time when starting the NBD export of a particular device. > > I don't have a use case in my head where two distinct bitmaps being > exposed simultaneously offer any particular benefit, but maybe there is > something. I'm sure there is. > > I will leave this aspect of it more to you NBD folks. I think QEMU could > cope with either. 
> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
> haven't thought through this aspect of it very much.)

I'm OK with either too. Yes, with online external backup (fleecing), the bitmap is already selected in QEMU, and that is the most interesting case anyway. For offline external backup, when we have a read-only disk, it may have several bitmaps (for example, for different backup frequencies), and it might not be bad for the client to have the ability to choose. But again, we can select the exported bitmap through QMP (even the same fleecing scheme would be OK, with the overhead of creating an empty delta).

>>> Anyway, I hope I am being useful and just not more confounding. It seems
>>> to me that we're having difficulty conveying precisely what it is we're
>>> trying to accomplish, so I hope that I am making a good effort in
>>> elaborating on our goals/requirements.
>>
>> Yes absolutely. I think part of the challenge is that you are quite
>> reasonably coming at it from the point of view of qemu's particular
>> need, and I'm coming at it from 'what should the nbd protocol look
>> like in general' position, having done lots of work on the protocol
>> docs (though I'm an occasional qemu contributor). So there's necessarily
>> a gap of approach to be bridged.
>
> Yeah, I understand quite well that we need to make sure the NBD spec is
> sane and useful in a QEMU-agnostic way, so my goal here is just to help
> elucidate our needs to enable you to reach a good consensus.
>
>> I'm overdue on a review of Wouter's latest patch (partly because I need
>> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
>> but I think it's a bridge worth building.
>
> Same. Thank you for your patience!
>
> Cheers,
> --js
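The out-of-band selection John and Vladimir converge on amounts to binding exactly one bitmap to an export when the export is created, so the NBD client never has to identify a bitmap in-band. A toy sketch of that association; all names here are invented for illustration and are not a QMP or QEMU API:

```python
# Toy sketch of out-of-band bitmap selection: the control plane binds
# one dirty bitmap to an export at export-creation time.  Function and
# field names are invented, not taken from QEMU or the NBD spec.

exports = {}

def start_export(export_name, device, bitmap_name):
    """Create an NBD export serving `device`, answering dirtiness
    queries from the single bitmap chosen here."""
    if export_name in exports:
        raise ValueError("export already exists: %s" % export_name)
    exports[export_name] = {"device": device, "bitmap": bitmap_name}

def bitmap_for_export(export_name):
    """All dirtiness queries on this export consult this one bitmap."""
    return exports[export_name]["bitmap"]
```

With this model, a read-only disk carrying several bitmaps (Vladimir's offline case) is still handled: the control plane just starts one export per bitmap of interest.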
Hi John Sorry for the late reply; weekend was busy, and so was monday. On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote: > On 12/02/2016 01:45 PM, Alex Bligh wrote: > > John, > > > >>> +Some storage formats and operations over such formats express a > >>> +concept of data dirtiness. Whether the operation is block device > >>> +mirroring, incremental block device backup or any other operation with > >>> +a concept of data dirtiness, they all share a need to provide a list > >>> +of ranges that this particular operation treats as dirty. > >>> > >>> How can data be 'dirty' if it is static and unchangeable? (I thought) > >>> > >> > >> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from > >> the VM and cause the bitmap that QEMU manages to become dirty. > >> > >> We intend to expose the ability to fleece dirty blocks via NBD. What > >> happens in this scenario would be that a snapshot of the data at the > >> time of the request is exported over NBD in a read-only manner. > >> > >> In this way, the drive itself is R/W, but the "view" of it from NBD is > >> RO. While a hypothetical backup client is busy copying data out of this > >> temporary view, new writes are coming in to the drive, but are not being > >> exposed through the NBD export. > >> > >> (This goes into QEMU-specifics, but those new writes are dirtying a > >> version of the bitmap not intended to be exposed via the NBD channel. > >> NBD gets effectively a snapshot of both the bitmap AND the data.) > > > > Thanks. That makes sense - or enough sense for me to carry on commenting! > > > > Whew! I'm glad. > > >>> I now think what you are talking about backing up a *snapshot* of a disk > >>> that's running, where the disk itself was not connected using NBD? IE it's > >>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively > >>> an opaque state represented in a bitmap, which is binary metadata > >>> at some particular level of granularity. 
It might as well be 'happiness' > >>> or 'is coloured blue'. The NBD server would (normally) have no way of > >>> manipulating this bitmap. > >>> > >>> In previous comments, I said 'how come we can set the dirty bit through > >>> writes but can't clear it?'. This (my statement) is now I think wrong, > >>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The > >>> state of the bitmap comes from whatever sets the bitmap which is outside > >>> the scope of this protocol to transmit it. > >>> > >> > >> You know, this is a fair point. We have not (to my knowledge) yet > >> carefully considered the exact bitmap management scenario when NBD is > >> involved in retrieving dirty blocks. > >> > >> Humor me for a moment while I talk about a (completely hypothetical, not > >> yet fully discussed) workflow for how I envision this feature. > >> > >> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial > >> backup is made, etc. > >> > >> (2) As writes come in, QEMU's bitmap is dirtied. > >> > >> (3) The user decides they want to root around to see what data has > >> changed and would like to use NBD to do so, in contrast to QEMU's own > >> facilities for dumping dirty blocks. > >> > >> (4) A command is issued that creates a temporary, lightweight snapshot > >> ('fleecing') and exports this snapshot over NBD. The bitmap is > >> associated with the NBD export at this point at NBD server startup. (For > >> the sake of QEMU discussion, maybe this command is "blockdev-fleece") > >> > >> (5) At this moment, the snapshot is static and represents the data at > >> the time the NBD server was started. The bitmap is also forked and > >> represents only this snapshot. The live data and bitmap continue to change. > >> > >> (6) Dirty blocks are queried and copied out via NBD. > >> > >> (7) The user closes the NBD instance upon completion of their task, > >> whatever it was. (Making a new incremental backup? Just taking a peek at > >> some changed data? who knows.) 
> >> > >> The point that's interesting here is what do we do with the two bitmaps > >> at this point? The data delta can be discarded (this was after all just > >> a lightweight read-only point-in-time snapshot) but the bitmap data > >> needs to be dealt with. > >> > >> (A) In the case of "User made a new incremental backup," the bitmap that > >> got forked off to serve the NBD read should be discarded. > >> > >> (B) In the case of "User just wanted to look around," the bitmap should > >> be merged back into the bitmap it was forked from. > >> > >> I don't advise a hybrid where "User copied some data, but not all" where > >> we need to partially clear *and* merge, but conceivably this could > >> happen, because the things we don't want to happen always will. > >> > >> At this point maybe it's becoming obvious that actually it would be very > >> prudent to allow the NBD client itself to inform QEMU via the NBD > >> protocol which extents/blocks/(etc) that it is "done" with. > >> > >> Maybe it *would* actually be useful if, in NBD allowing us to add a > >> "dirty" bit to the specification, we allow users to clear those bits. > >> > >> Then, whether the user was trying to do (A) or (B) or the unspeakable > >> amalgamation of both things, it's up to the user to clear the bits > >> desired and QEMU can do the simple task of simply always merging the > >> bitmap fork upon the conclusion of the NBD fleecing exercise. > >> > >> Maybe this would allow the dirty bit to have a bit more concrete meaning > >> for the NBD spec: "The bit stays dirty until the user clears it, and is > >> set when the matching block/extent/etc is written to." > >> > >> With an exception that external management may cause the bits to clear. > >> (I.e., someone fiddles with the backing store in a way opaque to NBD, > >> e.g. someone clears the bitmap directly through QEMU instead of via NBD.) 
> > > > There is currently one possible "I've done with the entire bitmap" > > signal, which is closing the connection. This has two obvious > > problems. Firstly if used, it discards the entire bitmap (not bits). > > Secondly, it makes recovery from a broken TCP session difficult > > (as either you treat a dirty close as meaning the bitmap needs > > to hang around, in which case you have a garbage collection issue, > > or you treat it as needing to drop the bitmap, in which case you > > can't recover). > > > > In my mind, I wasn't treating closing the connection as the end of the > point-in-time snapshot; that would be stopping the export. > > I wouldn't advocate for a control channel (QEMU, here) clearing the > bitmap just because a client disappeared. > > Either: > > (A) QEMU clears the bitmap because the NBD export was *stopped*, or > (B) QEMU, acting as the NBD server, clears the bitmap as instructed by > the NBD client, if we admit a provision to clear bits from the NBD > protocol itself. > > I don't think there's room for the NBD server (QEMU) deciding to clear > bits based on connection status. It has to be an explicit decision -- > either via NBD or QMP. > > > I think in your plan the block status doesn't change once the bitmap > > is forked. In that case, adding some command (optional) to change > > the status of the bitmap (or simply to set a given extent to status X) > > would be reasonable. Of course whether it's supported could be dependent > > on the bitmap. > > > > What I describe as "forking" was kind of a bad description. What really > happens when we have a divergence is that the bitmap with data is split > into two bitmaps that are related: > > - A new bitmap that is created takes over for the old bitmap. This new > bitmap is empty. It records writes on the live version of the data. > - The old bitmap as it existed remains in a read-only state, and > describes some point-in-time snapshot view of the data. 
>
> In the case of an incremental backup, once we've made a backup of the
> data, that read-only bitmap can actually be discarded without further
> thought.
>
> In the case of a failed incremental backup, or in the case of "I just
> wanted to look and see what has changed, but wasn't prepared to reset
> the counter yet," this bitmap gets merged back with the live bitmap as
> if nothing ever happened.
>
> ANYWAY, allowing the NBD client to request bits be cleared has an
> obvious use case even for QEMU, IMO -- which is, the NBD client itself
> gains the ability to, without relying on a control plane to the server,
> decide for itself if it is going to "make a backup" or "just look around."
>
> The client gains the ability to leave the bitmap alone (QEMU will
> re-merge it later once the snapshot is closed) or the ability to clear
> it ("I made my backup, we're done with this.")
>
> That usefulness would allow us to have an explicit dirty bit mechanism
> directly in NBD, IMO, because:
>
> (1) A RW NBD server has enough information to mark a bit dirty
> (2) Since there exists an in-spec mechanism to reset the bitmap, the
> dirty bit is meaningful to the server and the client

While I can see that the ability to manipulate metadata might have advantages for certain use cases, I don't think that the ability to *inspect* metadata should require the ability to manipulate it in any way.

So I'd like to finish the block_status extension before moving on to manipulation :)

> >> Having missed most of the discussion on v1/v2, is it a given that we
> >> want in-band identification of bitmaps?
> >>
> >> I guess this might depend very heavily on the nature of the definition
> >> of the "dirty bit" in the NBD spec.
> >
> > I don't think it's a given. I think Wouter & I came up with it at
> > the same time as a way to abstract the bitmap/extent concept and
> > remove the need to specify a dirty bit at all (well, that's my excuse
> > anyway).
> >
> > OK.
We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
>
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The ability to do something does not in any way imply the requirement to do the same :-)

The idea is that the client negotiates one or more forms of metadata information from the server that it might be interested in, and then asks the server that information for a given extent where it has interest. The protocol spec does not define what that metadata is (beyond the "is allocated" one that we define in the spec currently, and possibly something else in the future). So if qemu only cares about just one type of metadata, there's no reason why it should *have* to export more than one type.

> I will leave this aspect of it more to you NBD folks. I think QEMU could
> cope with either.
>
> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
> haven't thought through this aspect of it very much.)
>
> >> Anyway, I hope I am being useful and just not more confounding. It seems
> >> to me that we're having difficulty conveying precisely what it is we're
> >> trying to accomplish, so I hope that I am making a good effort in
> >> elaborating on our goals/requirements.
> >
> > Yes absolutely. I think part of the challenge is that you are quite
> > reasonably coming at it from the point of view of qemu's particular
> > need, and I'm coming at it from 'what should the nbd protocol look
> > like in general' position, having done lots of work on the protocol
> > docs (though I'm an occasional qemu contributor). So there's necessarily
> > a gap of approach to be bridged.
>
> Yeah, I understand quite well that we need to make sure the NBD spec is
> sane and useful in a QEMU-agnostic way, so my goal here is just to help
> elucidate our needs to enable you to reach a good consensus.

Right, that's why I was reluctant to merge the original spec as it stood.

> > I'm overdue on a review of Wouter's latest patch (partly because I need
> > to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> > but I think it's a bridge worth building.
>
> Same. Thank you for your patience!

I can do some updates given a few of the suggestions that were made on this list (no guarantee when that will happen), but if people are interested in reviewing things in the mean time, be my guest...
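The negotiate-then-query model Wouter describes implies that, per negotiated metadata context, the client ends up parsing a reply payload describing extents. The layout assumed below (big-endian 32-bit metadata context identifier followed by repeated 32-bit length / 32-bit status-flag pairs) follows the structured-reply proposal being discussed in this thread; treat it as an assumption rather than final wire format:

```python
import struct

def parse_block_status_payload(payload):
    """Parse an assumed block-status structured-reply payload into a
    metadata context ID plus a list of (length, status_flags) extents.
    Assumed layout (all big-endian, per the proposal under discussion,
    not necessarily the final spec):
        32-bit context ID, then repeated 32-bit length + 32-bit flags."""
    (context_id,) = struct.unpack_from(">I", payload, 0)
    extents = []
    for offset in range(4, len(payload), 8):
        length, flags = struct.unpack_from(">II", payload, offset)
        extents.append((length, flags))
    return context_id, extents
```

Note how the context ID keeps multiple negotiated metadata contexts (say, allocation and dirtiness) unambiguous on the wire even if the server interleaves replies for them.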
On 12/06/2016 08:32 AM, Wouter Verhelst wrote: > Hi John > > Sorry for the late reply; weekend was busy, and so was monday. > No problems. > On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote: >> On 12/02/2016 01:45 PM, Alex Bligh wrote: >>> John, >>> >>>>> +Some storage formats and operations over such formats express a >>>>> +concept of data dirtiness. Whether the operation is block device >>>>> +mirroring, incremental block device backup or any other operation with >>>>> +a concept of data dirtiness, they all share a need to provide a list >>>>> +of ranges that this particular operation treats as dirty. >>>>> >>>>> How can data be 'dirty' if it is static and unchangeable? (I thought) >>>>> >>>> >>>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from >>>> the VM and cause the bitmap that QEMU manages to become dirty. >>>> >>>> We intend to expose the ability to fleece dirty blocks via NBD. What >>>> happens in this scenario would be that a snapshot of the data at the >>>> time of the request is exported over NBD in a read-only manner. >>>> >>>> In this way, the drive itself is R/W, but the "view" of it from NBD is >>>> RO. While a hypothetical backup client is busy copying data out of this >>>> temporary view, new writes are coming in to the drive, but are not being >>>> exposed through the NBD export. >>>> >>>> (This goes into QEMU-specifics, but those new writes are dirtying a >>>> version of the bitmap not intended to be exposed via the NBD channel. >>>> NBD gets effectively a snapshot of both the bitmap AND the data.) >>> >>> Thanks. That makes sense - or enough sense for me to carry on commenting! >>> >> >> Whew! I'm glad. >> >>>>> I now think what you are talking about backing up a *snapshot* of a disk >>>>> that's running, where the disk itself was not connected using NBD? IE it's >>>>> not being 'made dirty' by NBD_CMD_WRITE etc. 
Rather 'dirtiness' is effectively >>>>> an opaque state represented in a bitmap, which is binary metadata >>>>> at some particular level of granularity. It might as well be 'happiness' >>>>> or 'is coloured blue'. The NBD server would (normally) have no way of >>>>> manipulating this bitmap. >>>>> >>>>> In previous comments, I said 'how come we can set the dirty bit through >>>>> writes but can't clear it?'. This (my statement) is now I think wrong, >>>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The >>>>> state of the bitmap comes from whatever sets the bitmap which is outside >>>>> the scope of this protocol to transmit it. >>>>> >>>> >>>> You know, this is a fair point. We have not (to my knowledge) yet >>>> carefully considered the exact bitmap management scenario when NBD is >>>> involved in retrieving dirty blocks. >>>> >>>> Humor me for a moment while I talk about a (completely hypothetical, not >>>> yet fully discussed) workflow for how I envision this feature. >>>> >>>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial >>>> backup is made, etc. >>>> >>>> (2) As writes come in, QEMU's bitmap is dirtied. >>>> >>>> (3) The user decides they want to root around to see what data has >>>> changed and would like to use NBD to do so, in contrast to QEMU's own >>>> facilities for dumping dirty blocks. >>>> >>>> (4) A command is issued that creates a temporary, lightweight snapshot >>>> ('fleecing') and exports this snapshot over NBD. The bitmap is >>>> associated with the NBD export at this point at NBD server startup. (For >>>> the sake of QEMU discussion, maybe this command is "blockdev-fleece") >>>> >>>> (5) At this moment, the snapshot is static and represents the data at >>>> the time the NBD server was started. The bitmap is also forked and >>>> represents only this snapshot. The live data and bitmap continue to change. >>>> >>>> (6) Dirty blocks are queried and copied out via NBD. 
>>>> >>>> (7) The user closes the NBD instance upon completion of their task, >>>> whatever it was. (Making a new incremental backup? Just taking a peek at >>>> some changed data? who knows.) >>>> >>>> The point that's interesting here is what do we do with the two bitmaps >>>> at this point? The data delta can be discarded (this was after all just >>>> a lightweight read-only point-in-time snapshot) but the bitmap data >>>> needs to be dealt with. >>>> >>>> (A) In the case of "User made a new incremental backup," the bitmap that >>>> got forked off to serve the NBD read should be discarded. >>>> >>>> (B) In the case of "User just wanted to look around," the bitmap should >>>> be merged back into the bitmap it was forked from. >>>> >>>> I don't advise a hybrid where "User copied some data, but not all" where >>>> we need to partially clear *and* merge, but conceivably this could >>>> happen, because the things we don't want to happen always will. >>>> >>>> At this point maybe it's becoming obvious that actually it would be very >>>> prudent to allow the NBD client itself to inform QEMU via the NBD >>>> protocol which extents/blocks/(etc) that it is "done" with. >>>> >>>> Maybe it *would* actually be useful if, in NBD allowing us to add a >>>> "dirty" bit to the specification, we allow users to clear those bits. >>>> >>>> Then, whether the user was trying to do (A) or (B) or the unspeakable >>>> amalgamation of both things, it's up to the user to clear the bits >>>> desired and QEMU can do the simple task of simply always merging the >>>> bitmap fork upon the conclusion of the NBD fleecing exercise. >>>> >>>> Maybe this would allow the dirty bit to have a bit more concrete meaning >>>> for the NBD spec: "The bit stays dirty until the user clears it, and is >>>> set when the matching block/extent/etc is written to." >>>> >>>> With an exception that external management may cause the bits to clear. 
>>>> (I.e., someone fiddles with the backing store in a way opaque to NBD, >>>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.) >>> >>> There is currently one possible "I've done with the entire bitmap" >>> signal, which is closing the connection. This has two obvious >>> problems. Firstly if used, it discards the entire bitmap (not bits). >>> Secondly, it makes recovery from a broken TCP session difficult >>> (as either you treat a dirty close as meaning the bitmap needs >>> to hang around, in which case you have a garbage collection issue, >>> or you treat it as needing to drop the bitmap, in which case you >>> can't recover). >>> >> >> In my mind, I wasn't treating closing the connection as the end of the >> point-in-time snapshot; that would be stopping the export. >> >> I wouldn't advocate for a control channel (QEMU, here) clearing the >> bitmap just because a client disappeared. >> >> Either: >> >> (A) QEMU clears the bitmap because the NBD export was *stopped*, or >> (B) QEMU, acting as the NBD server, clears the bitmap as instructed by >> the NBD client, if we admit a provision to clear bits from the NBD >> protocol itself. >> >> I don't think there's room for the NBD server (QEMU) deciding to clear >> bits based on connection status. It has to be an explicit decision -- >> either via NBD or QMP. >> >>> I think in your plan the block status doesn't change once the bitmap >>> is forked. In that case, adding some command (optional) to change >>> the status of the bitmap (or simply to set a given extent to status X) >>> would be reasonable. Of course whether it's supported could be dependent >>> on the bitmap. >>> >> >> What I describe as "forking" was kind of a bad description. What really >> happens when we have a divergence is that the bitmap with data is split >> into two bitmaps that are related: >> >> - A new bitmap that is created takes over for the old bitmap. This new >> bitmap is empty. 
>> It records writes on the live version of the data.
>> - The old bitmap as it existed remains in a read-only state, and
>> describes some point-in-time snapshot view of the data.
>>
>> In the case of an incremental backup, once we've made a backup of the
>> data, that read-only bitmap can actually be discarded without further
>> thought.
>>
>> In the case of a failed incremental backup, or in the case of "I just
>> wanted to look and see what has changed, but wasn't prepared to reset
>> the counter yet," this bitmap gets merged back with the live bitmap as
>> if nothing ever happened.
>>
>> ANYWAY, allowing the NBD client to request bits be cleared has an
>> obvious use case even for QEMU, IMO -- which is, the NBD client itself
>> gains the ability to, without relying on a control plane to the server,
>> decide for itself if it is going to "make a backup" or "just look around."
>>
>> The client gains the ability to leave the bitmap alone (QEMU will
>> re-merge it later once the snapshot is closed) or the ability to clear
>> it ("I made my backup, we're done with this.")
>>
>> That usefulness would allow us to have an explicit dirty bit mechanism
>> directly in NBD, IMO, because:
>>
>> (1) A RW NBD server has enough information to mark a bit dirty
>> (2) Since there exists an in-spec mechanism to reset the bitmap, the
>> dirty bit is meaningful to the server and the client
>
> While I can see that the ability to manipulate metadata might have
> advantages for certain use cases, I don't think that the ability to
> *inspect* metadata should require the ability to manipulate it in any
> way.
>
> So I'd like to finish the block_status extension before moving on to
> manipulation :)
>

Understood. The problem I was trying to correct by admitting that
manipulation of bits may have a purpose in NBD was to attempt to
clarify the exact meaning of the dirty bit.
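To illustrate, the fork/merge lifecycle described above can be modeled in a few lines of Python. This is a toy sketch with invented names, not QEMU's actual dirty-bitmap code:

```python
# Illustrative model of the bitmap fork/merge lifecycle discussed above.
# A bitmap is modeled as a set of dirty block indices; all names here
# are invented for this sketch and do not correspond to QEMU internals.

class DirtyBitmap:
    def __init__(self):
        self.bits = set()      # indices of blocks written since tracking began
        self.frozen = False    # read-only once it describes a snapshot

    def mark_dirty(self, block):
        assert not self.frozen, "snapshot bitmaps are read-only"
        self.bits.add(block)

def fork(bitmap):
    """Freeze 'bitmap' as the read-only point-in-time snapshot view and
    return (snapshot, successor); the empty successor records writes to
    the live data from now on."""
    bitmap.frozen = True
    return bitmap, DirtyBitmap()

def conclude(snapshot, live, backup_succeeded):
    """On a successful backup the snapshot bitmap is simply discarded;
    otherwise its bits are merged back into the live bitmap, as if the
    fork never happened."""
    if not backup_succeeded:
        live.bits |= snapshot.bits
    return live
```

In case (A) above the frozen snapshot is dropped; in case (B) it is merged back, which is exactly the `conclude(..., backup_succeeded=False)` path.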
It had seemed to me that not specifying (or disallowing) the
manipulation of those bits from within NBD necessarily meant that
their meaning existed entirely out-of-spec for NBD, which could be a
show-stopper. So I was attempting to show that by allowing their
manipulation in NBD, they'd have full in-spec meaning.

It all depends on what exactly we name those bits and how you'd like
to define their meaning. There are many ways we can expose this
information in a useful manner, so this was just another option.

>>>> Having missed most of the discussion on v1/v2, is it a given that we
>>>> want in-band identification of bitmaps?
>>>>
>>>> I guess this might depend very heavily on the nature of the definition
>>>> of the "dirty bit" in the NBD spec.
>>>
>>> I don't think it's a given. I think Wouter & I came up with it at
>>> the same time as a way to abstract the bitmap/extent concept and
>>> remove the need to specify a dirty bit at all (well, that's my excuse
>>> anyway).
>>>
>>
>> OK. We do certainly support multiple bitmaps being active at a time in
>> QEMU, but I had personally always envisioned that you'd associate them
>> one-at-a-time when starting the NBD export of a particular device.
>>
>> I don't have a use case in my head where two distinct bitmaps being
>> exposed simultaneously offer any particular benefit, but maybe there is
>> something. I'm sure there is.
>
> The ability to do something does not in any way imply the requirement to
> do the same :-)
>

Hence the ask. It's not something QEMU currently needs, but I am a bad
psychic.

> The idea is that the client negotiates one or more forms of metadata
> information from the server that it might be interested in, and then
> asks the server that information for a given extent where it has
> interest.
>

Via the NBD protocol, you mean?
> The protocol spec does not define what that metadata is (beyond the "is
> allocated" one that we define in the spec currently, and possibly
> something else in the future). So if qemu only cares about just one type
> of metadata, there's no reason why it should *have* to export more than
> one type.
>
>> I will leave this aspect of it more to you NBD folks. I think QEMU could
>> cope with either.
>>
>> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
>> haven't thought through this aspect of it very much.)
>>
>>>> Anyway, I hope I am being useful and just not more confounding. It seems
>>>> to me that we're having difficulty conveying precisely what it is we're
>>>> trying to accomplish, so I hope that I am making a good effort in
>>>> elaborating on our goals/requirements.
>>>
>>> Yes absolutely. I think part of the challenge is that you are quite
>>> reasonably coming at it from the point of view of qemu's particular
>>> need, and I'm coming at it from 'what should the nbd protocol look
>>> like in general' position, having done lots of work on the protocol
>>> docs (though I'm an occasional qemu contributor). So there's necessarily
>>> a gap of approach to be bridged.
>>>
>>
>> Yeah, I understand quite well that we need to make sure the NBD spec is
>> sane and useful in a QEMU-agnostic way, so my goal here is just to help
>> elucidate our needs to enable you to reach a good consensus.
>
> Right, that's why I was reluctant to merge the original spec as it
> stood.
>
>>> I'm overdue on a review of Wouter's latest patch (partly because I need
>>> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
>>> but I think it's a bridge worth building.
>>>
>>
>> Same. Thank you for your patience!
>
> I can do some updates given a few of the suggestions that were made on
> this list (no guarantee when that will happen), but if people are
> interested in reviewing things in the mean time, be my guest...
>

I'll take a look at your revision(s), thanks.

--js
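For what it's worth, the fleecing workflow debated in this thread (query dirty status, copy only the dirty extents, then tell the server which bits the client is done with) can be sketched roughly as below. This is purely hypothetical: the `client` object and all of its methods (`block_status`, `read`, `clear_dirty`) are invented stand-ins for whatever NBD client library would be used, and `clear_dirty` in particular corresponds to the bit-clearing operation that is only being discussed here, not anything in the proposed extension:

```python
def incremental_backup(client, export_size, out, clear_bits=True):
    """Copy only dirty extents from an NBD export into 'out' (a sketch).

    'client' is a hypothetical NBD client handle; block_status() is
    assumed to return consecutive (length, dirty) descriptors for the
    queried range, i.e. a dirtiness query as proposed for
    NBD_CMD_BLOCK_STATUS.
    """
    offset = 0
    while offset < export_size:
        # Query a bounded range at a time (bound chosen arbitrarily here).
        length = min(2 ** 25, export_size - offset)
        for extent_len, dirty in client.block_status(offset, length, dirty=True):
            if dirty:
                out.seek(offset)
                out.write(client.read(offset, extent_len))
                if clear_bits:
                    # The debated operation: the client, not the control
                    # plane, decides it is "done" with these bits.
                    client.clear_dirty(offset, extent_len)
            offset += extent_len
```

Leaving `clear_bits` off models the "just looking around" case, where the server later merges the snapshot bitmap back; turning it on models the "I made my backup, we're done" case.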
======

Also, an idea on 2-4: since we say that dirtiness is unknown to NBD,
and an external tool should specify, manage and understand which data
is actually transmitted, why not just call it user_data and leave the
status field of the reply chunk unspecified in this case?

So, I propose one flag for NBD_CMD_BLOCK_STATUS: NBD_FLAG_STATUS_USER.
If it is clear, then the behaviour is defined by Eric's 'Block
provisioning status' paragraph. If it is set, we just leave the status
field to some external... protocol? Who knows what this user data is.

Note: I'm not sure that I like this (my) proposal. It's just an idea;
maybe someone will like it. And, I think, it represents what we are
trying to do more honestly.

Note2: the next step of generalization would be NBD_CMD_USER, with
variable request size, structured reply and no definition :)

Another idea, about backups themselves: why do we need allocated/zero
status for backup? IMHO we don't.

Full backup: just do a structured read - it will show us which chunks
may be treated as zeroes.

Incremental backup: get the dirty bitmap (somehow, for example through
the user-defined part of the proposed command), then, for the dirty
blocks, read them through structured read, so information about
zero/unallocated areas is already there.

For me, all the variants above are OK. Let's finally choose something.

v2:

v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html

Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather larger rebase; I've also changed things to
rename the command 'NBD_CMD_BLOCK_STATUS', changed the request modes
to be determined by boolean flags (rather than by fixed values of the
16-bit flags field), changed the reply status fields to be bitwise-or
values (with a default of 0 always being sane), and changed the
descriptor layout to drop an offset but to include a 32-bit status so
that the descriptor is nicely 8-byte aligned without padding.
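As a sanity check of the descriptor layout summarized in the changelog above (a 32-bit length followed by a 32-bit status, so each descriptor is 8 bytes with no padding), here is a small illustrative encoder/decoder. The helper names are invented, and big-endian byte order is assumed, as everywhere else in the NBD protocol:

```python
import struct

# One NBD_REPLY_TYPE_BLOCK_STATUS descriptor per the proposal:
# 32 bits of extent length, then 32 bits of status flags; network
# (big-endian) byte order, naturally 8-byte aligned with no padding.
DESCRIPTOR = struct.Struct(">II")

def parse_block_status_payload(payload):
    """Return a list of (length, status) tuples from a chunk payload."""
    if len(payload) == 0 or len(payload) % DESCRIPTOR.size != 0:
        raise ValueError("chunk length must be a positive multiple of 8")
    return [DESCRIPTOR.unpack_from(payload, off)
            for off in range(0, len(payload), DESCRIPTOR.size)]

def build_block_status_payload(descriptors):
    """Inverse helper, as a server might use to emit the chunk payload."""
    return b"".join(DESCRIPTOR.pack(length, status)
                    for length, status in descriptors)
```

The length check mirrors the "*length* MUST be a positive integer multiple of 8" requirement in the proposed `NBD_REPLY_TYPE_BLOCK_STATUS` text below.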
 doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index 1c2fa5b..253b6f1 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -755,6 +755,8 @@ The field has the following format:
   MUST leave this flag clear if structured replies have not been
   negotiated. Clients MUST NOT set the `NBD_CMD_FLAG_DF` request
   flag unless this transmission flag is set.
+- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
+  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -1036,6 +1038,10 @@ interpret the "length" bytes of payload.
 
   64 bits: offset (unsigned)
   32 bits: hole size (unsigned, MUST be nonzero)
 
+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+  Defined by the experimental extension `BLOCK_STATUS`; see below.
+
 All error chunk types have bit 15 set, and begin with the same
 *error*, *message length*, and optional *message* fields as
 `NBD_REPLY_TYPE_ERROR`. If non-zero, *message length* indicates
@@ -1070,7 +1076,7 @@ remaining structured fields at the end.
   were sent earlier in the structured reply, the server SHOULD NOT
   send multiple distinct offsets that lie within the bounds of a
   single content chunk. Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
 
   The payload is structured as:
 
@@ -1247,6 +1253,11 @@ The following request types exist:
   Defined by the experimental `WRITE_ZEROES`
   [extension](https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md).
 
+* `NBD_CMD_BLOCK_STATUS` (7)
+
+  Defined by the experimental `BLOCK_STATUS` extension; see below.
+
+
 * Other requests
 
   Some third-party implementations may require additional protocol
@@ -1331,6 +1342,148 @@ written as branches which can be merged into master if and when those
 extensions are promoted to the normative version of the document in
 the master branch.
+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often necessary
+to query the status of a particular range and read only those blocks
+of data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.
+
+To provide this class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states. This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.
+
+* `NBD_FLAG_SEND_BLOCK_STATUS`
+
+    The server SHOULD set this transmission flag to 1 if structured
+    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
+    request is supported.
+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be a positive integer multiple of 8. This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST NOT be greater than the
+    length of the original request. This chunk type MUST appear at
+    most once in a structured reply. Valid as a reply to
+    `NBD_CMD_BLOCK_STATUS`.
+
+    The payload is structured as a list of one or more descriptors,
+    each with this layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    The definition of the status flags is determined based on the
+    flags present in the original request.
+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range
+    of interest. Clients SHOULD NOT use this request unless the server
+    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
+    in turn requires the client to first negotiate structured replies.
+    For a successful return, the server MUST use a structured reply,
+    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represents consecutive portions
+    of the file starting from specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST NOT be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than requested. However, the server MUST return
+    at least one status descriptor. The server SHOULD use different
+    *status* values between consecutive descriptors, and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception). The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk. However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.
+
+    The type of information requested by the client is determined by
+    the request flags, as follows:
+
+    1. Block provisioning status
+
+       Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
+       `NBD_CMD_FLAG_STATUS_DIRTY` clear, the server MUST return the
+       provisioning status of the device, where the status field of
+       each descriptor is determined by the following bits (all four
+       combinations of these two bits are possible):
+
+       - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+         (and future writes to that area may cause fragmentation or
+         encounter an `ENOSPC` error); if clear, the block is allocated
+         or the server could not otherwise determine its status. Note
+         that the use of `NBD_CMD_TRIM` is related to this status, but
+         that the server MAY report a hole even where trim has not been
+         requested, and also that a server MAY report allocation even
+         where a trim has been requested.
+       - `NBD_STATE_ZERO` (bit 1); if set, the block contents read as
+         all zeroes; if clear, the block contents are not known. Note
+         that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+         status, but that the server MAY report zeroes even where write
+         zeroes has not been requested, and also that a server MAY
+         report unknown content even where write zeroes has been
+         requested.
+
+       The client SHOULD NOT read from an area that has both
+       `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
+
+    2. Block dirtiness status
+
+       This command is meant to operate in tandem with other (non-NBD)
+       channels to the server. Generally, a "dirty" block is a block
+       that has been written to by someone, but the exact meaning of
+       "has been written" is left to the implementation. For example,
+       a virtual machine monitor could provide a (non-NBD) command to
+       start tracking blocks written by the virtual machine. A backup
+       client can then connect to an NBD server provided by the
+       virtual machine monitor and use `NBD_CMD_BLOCK_STATUS` with the
+       `NBD_CMD_FLAG_STATUS_DIRTY` bit set in order to read only the
+       dirty blocks that the virtual machine has changed.
+
+       An implementation that doesn't track the "dirtiness" state of
+       blocks MUST either fail this command with `EINVAL`, or mark all
+       blocks as dirty in the descriptor that it returns. Upon
+       receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
+       `NBD_CMD_FLAG_STATUS_DIRTY` set, the server MUST return the
+       dirtiness status of the device, where the status field of each
+       descriptor is determined by the following bit:
+
+       - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+         portion of the file that is still clean because it has not
+         been written; if clear, the block represents a portion of the
+         file that is dirty, or where the server could not otherwise
+         determine its status.
+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunk (such as descriptor lengths in an
+`NBD_REPLY_TYPE_BLOCK_STATUS` chunk summing to more than the requested
+length). The server SHOULD return `EINVAL` if it receives a
+`BLOCK_STATUS` request including one or more sectors beyond the size
+of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request dirtiness status
+  rather than provisioning status.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently