[v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Message ID 1480073296-6931-1-git-send-email-vsementsov@virtuozzo.com
State New

Commit Message

Vladimir Sementsov-Ogievskiy Nov. 25, 2016, 11:28 a.m. UTC
With the availability of sparse storage formats, it is often needed
to query status of a particular range and read only those blocks of
data that are actually present on the block device.

To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.

There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors).  Future NBD extensions may add commands
to control dirtiness through NBD.

Since NBD protocol has no notion of block size, and to mimic SCSI
"GET LBA STATUS" command more closely, it has been chosen to return
a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
instead of a bitmap.

CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

v3:

Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
removing the restriction that the sum of status descriptor lengths must equal
the requested length; i.e., the server is permitted to reply with less data
than requested if it wants.

Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as bit 8 is now
NBD_FLAG_CAN_MULTI_CONN in the master branch.

And, finally, I've rebased this onto the current state of the
extension-structured-reply branch (which itself should be rebased on
master, IMHO).

With this resend I just want to continue the discussion started about half
a year ago. Here is a summary of some questions and ideas from the v2
discussion:

1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
   A: All of this is for read-only disks, so the data is static and unchangeable.

2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
   A: 1: the server replies with status descriptors of any size, so the
         granularity is hidden from the client
      2: dirty and allocated requests are separate and unrelated to each
         other, so their granularities do not intersect

3. Q: Selecting the dirty bitmap to export.
   A: several variants:
      1: the id of the bitmap is in the flags field of the request
          pros: - simple
          cons: - it's a hack; the flags field is for other uses
                - we'll have to map bitmap names to these "ids"
      2: introduce extended NBD requests with variable length and exploit this
         feature for the BLOCK_STATUS command, specifying a bitmap identifier.
         pros: - looks like the proper way
         cons: - we have to create an additional extension
               - possibly we have to create a map,
                 {<QEMU bitmap name> <=> <NBD bitmap id>}
      3: an external tool should select which bitmap to export. So, in the case
         of QEMU it should be something like a QMP command
         block-export-dirty-bitmap.
         pros: - simple
               - we can extend it to behave like (2) later
         cons: - an additional QMP command to implement (possibly the lesser evil)
         note: Hmm, the external tool can choose between allocated/dirty data
               too, so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag entirely.

4. Q: Shouldn't get_{allocated,dirty} be separate commands?
   cons: two commands with almost the same semantics and similar mechanics?
   pros: however, there is a good case for separating the clearly defined,
         native-to-block-devices GET_BLOCK_STATUS from the user-driven and
         effectively undefined data called 'dirtiness'.

5. The number of status descriptors sent by the server should be restricted.
   variants:
   1: just allow the server to restrict this as it wants (which was done in v3)
   2: (not excluding 1) the client somehow specifies the maximum number
      of descriptors.
      2.1: add a command flag which requests only one descriptor
           (otherwise, no restriction from the client)
      2.2: again, introduce extended NBD requests, and add a field to
           specify this maximum

6. Q: What to do with unspecified flags (in request/reply)?
   A: I think the normal variant is to make them reserved. (The server should
      return EINVAL if it finds unknown bits; the client should consider a
      reply with unknown bits an error.)

Comments

Stefan Hajnoczi Nov. 25, 2016, 2:02 p.m. UTC | #1
On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> [...]

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Wouter Verhelst Nov. 27, 2016, 7:17 p.m. UTC | #2
Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right.
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> [...]
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of status descriptor lengths must equal
> the requested length; i.e., the server is permitted to reply with less data
> than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as bit 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.

Right.

> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Probably a good idea, given the above.

> With this resend I just want to continue the discussion started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
>    A: All of this is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: Different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: the server replies with status descriptors of any size, so the
>          granularity is hidden from the client
>       2: dirty and allocated requests are separate and unrelated to each
>          other, so their granularities do not intersect

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>    A: several variants:
>       1: id of bitmap is in flags field of request
>           pros: - simple
>           cons: - it's a hack. flags field is for other uses.
>                 - we'll have to map bitmap names to these "ids"
>       2: introduce extended nbd requests with variable length and exploit this
>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>          pros: - looks like a true way
>          cons: - we have to create additional extension
>                - possible we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: an external tool should select which bitmap to export. So, in the case
>          of QEMU it should be something like a QMP command
>          block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - an additional QMP command to implement (possibly the lesser evil)
>          note: Hmm, the external tool can choose between allocated/dirty data too,
>                so we could drop the 'NBD_FLAG_STATUS_DIRTY' flag entirely.

Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on
the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means
*by default*, but then we can add a note that clients and servers can
possibly define other meanings out of band if they want to.

> 4. Q: Shouldn't get_{allocated,dirty} be separate commands?
>    cons: two commands with almost the same semantics and similar mechanics?
>    pros: however, there is a good case for separating the clearly defined,
>          native-to-block-devices GET_BLOCK_STATUS from the user-driven and
>          effectively undefined data called 'dirtiness'.

Yeah, having them separate commands might be a bad idea indeed.

> 5. Number of status descriptors, sent by server, should be restricted
>    variants:
>    1: just allow server to restrict this as it wants (which was done in v3)
>    2: (not excluding 1). Client specifies somehow the maximum for number
>       of descriptors.
>       2.1: add command flag, which will request only one descriptor
>            (otherwise, no restrictions from the client)
>       2.2: again, introduce extended nbd requests, and add field to
>            specify this maximum

I think having a flag which requests just one descriptor can be useful,
but I'm hesitant to add it unless it's actually going to be used; so in
other words, I'll leave the decision on that bit to you.

> 6. Q: What to do with unspecified flags (in request/reply)?
>    A: I think the normal variant is to make them reserved. (The server should
>       return EINVAL if it finds unknown bits; the client should consider a
>       reply with unknown bits an error.)

Right, probably best to do that, yes.

> ======
> 
> Also, an idea on 2-4:
> 
>     As we say, that dirtiness is unknown for NBD, and external tool
>     should specify, manage and understand, which data is actually
>     transmitted, why not just call it user_data and leave status field
>     of reply chunk unspecified in this case?
> 
>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>     NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined by
>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>     leave status field for some external... protocol? Who knows, what is
>     this user data.

Yes, this sounds like a reasonable approach.

>     Note: I'm not sure that I like this (my) proposal. It's just an
>     idea; maybe someone likes it. And, I think, it represents what we
>     are trying to do more honestly.

Indeed.

>     Note2: the next step of generalization will be NBD_CMD_USER, with
>     variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
> 
>     Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what
it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the
"dirty" thing seems to want to be. Whether something is allocated is
just a special case of that.

Actually, come to think of that. What is the exact use case for this
thing? I understand you're trying to create incremental backups of
things, which would imply you don't write from the client that is
getting the block status thingies, right? If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the
definition of what a snapshot is back into the NBD protocol, without
having to say "this could be anything", and without requiring
connectivity over two ports for this to be useful (e.g., you could store
the machine-readable form of the snapshot description into your backup
program and match what they mean with what you're interested in at
restore time, etc).

This wouldn't work if you're interested in new snapshots that get
created once we've already moved into transmission, but hey.

Thoughts?

>     Full backup: just do a structured read - it will show us which chunks
>     may be treated as zeroes.

Right.

>     Incremental backup: get the dirty bitmap (somehow, for example through
>     the user-defined part of the proposed command), then, for dirty blocks,
>     read them through structured read, so information about zero/unallocated
>     areas is included.
> 
> For me all the variants above are OK. Let's finally choose something.
> 
> v2:
> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
> 
> Since then, we've added the STRUCTURED_REPLY extension, which
> necessitates a rather larger rebase; I've also changed things
> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
> modes to be determined by boolean flags (rather than by fixed
> values of the 16-bit flags field), changed the reply status fields
> to be bitwise-or values (with a default of 0 always being sane),
> and changed the descriptor layout to drop an offset but to include
> a 32-bit status so that the descriptor is nicely 8-byte aligned
> without padding.
> 
>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 154 insertions(+), 1 deletion(-)

[...]

I'll commit this in a minute into a separate branch called
"extension-blockstatus", on the understanding that changes are still
required, as per above (i.e., don't assume that just because there's a
branch I'm happy with the current result ;-)

Regards
Stefan Hajnoczi Nov. 28, 2016, 11:19 a.m. UTC | #3
On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right
>
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
> 
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.

The out-of-band operations you are referring to are for dirty bitmap
management.  (The goal is to read out blocks that changed since the last
backup.)

The client does not access the live disk; instead, it accesses a
read-only snapshot and the dirty information (so that it can copy out
only the blocks that were written).  The client is allowed to read blocks
that are not dirty, too.

If you want to implement the whole incremental backup workflow in NBD
then the client would first have to connect to the live disk, set up
dirty tracking, create a snapshot export, and then connect to that
snapshot.

That sounds like a big feature set, and I'd argue it's for the control
plane (storage API) and not the data plane (NBD).  There were
discussions about transferring the dirty information via the control
plane, but it seems more appropriate to put it in the data plane since
it is block-level information.

I'm arguing that the NBD protocol doesn't need to support the
incremental backup workflow since it's a complex control plane concept.

Being able to read dirty information via NBD is useful for other block
backup applications, not just QEMU.  It could be used for syncing LVM
volumes across machines, for example, if someone implements an NBD+LVM
server.

Another issue with adding control plane operations is that you need to
begin considering privilege separation.  Should all NBD clients be able
to initiate snapshots, dirty tracking, etc or is some kind of access
control required to limit certain commands?  Not all clients require the
same privileges and so they shouldn't have access to the same set of
operations.

Stefan
Wouter Verhelst Nov. 28, 2016, 5:33 p.m. UTC | #4
Hi Stefan,

On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote:
> On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > [...]
> 
> The out-of-band operations you are referring to are for dirty bitmap
> management.  (The goal is to read out blocks that changed since the last
> backup.)
> 
> The client does not access the live disk; instead, it accesses a
> read-only snapshot and the dirty information (so that it can copy out
> only blocks that were written).  The client is allowed to read blocks
> that are not dirty too.

I understood as much, yes.

> If you want to implement the whole incremental backup workflow in NBD
> then the client would first have to connect to the live disk, set up
> dirty tracking, create a snapshot export, and then connect to that
> snapshot.
> 
> That sounds like a big feature set and I'd argue it's for the control
> plane (storage API) and not the data plane (NBD).  There were
> discussions about transferring the dirty information via the control
> plane but it seems more appropriate to put it in the data plane since it is
> block-level information.

I agree that creating and managing snapshots is out of scope for NBD. The
protocol is not set up for that.

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

My proposal also makes it impossible to get updates of newly created
snapshots without disconnecting and reconnecting (due to the fact that
you can't go from transmission back to negotiation), but I'm not sure
that's a problem.

Doing so has two advantages:
- If a client is accidentally (due to misconfiguration or implementation
  bugs or whatnot) connecting to the wrong server after having created a
  snapshot through a management protocol, we have an opportunity to
  detect this error, due to the fact that the "foreign" identifiers
  passed to the client during negotiation will not match with what the
  client was expecting.
- A future version of the protocol could possibly include an extended
  version of the read command, allowing a client to read information
  from multiple storage snapshots without requiring a reconnect, and
  allowing current clients information about allocation status across
  various snapshots (although a first implementation could very well
  limit itself to only having one snapshot).

[...]
> I'm arguing that the NBD protocol doesn't need to support the
> incremental backup workflow since it's a complex control plane concept.
> 
> Being able to read dirty information via NBD is useful for other block
> backup applications, not just QEMU.  It could be used for syncing LVM
> volumes across machines, for example, if someone implements an NBD+LVM
> server.

Indeed, and I was considering adding a basic implementation to go with
the copy-on-write support in stock nbd-server, too.

> Another issue with adding control plane operations is that you need to
> begin considering privilege separation.  Should all NBD clients be able
> to initiate snapshots, dirty tracking, etc or is some kind of access
> control required to limit certain commands?  Not all clients require the
> same privileges and so they shouldn't have access to the same set of
> operations.

Sure, which is why I wasn't suggesting anything of the sorts :-)
John Snow Nov. 28, 2016, 11:15 p.m. UTC | #5
Hi Wouter,

Some of this mess may be partially my fault, but I have not been 
following the NBD extension proposals up until this point.

Are you familiar with the genesis behind this idea and what we are 
trying to accomplish in general?

We had the thought to propose an extension roughly similar to SCSI's 
'get lba status', but the existing status bits there did not correlate 
semantically to what we are hoping to convey. There was some debate over 
whether it would be an abuse of the protocol to attempt to use them as such.

For a quick recap, get lba status appears to offer three canonical statuses:

0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"

My interpretation of mapped was simply that it was physically allocated, 
and 'deallocated' was simply unallocated.

(I uh, am not actually clear on what anchored means exactly.)

Either way, we felt at the time that it would be wrong to propose an 
analogue command for NBD and then immediately abuse the existing 
semantics, hence a new command like -- but not identical to -- the SCSI one.


On 11/27/2016 02:17 PM, Wouter Verhelst wrote:
> Hi Vladimir,
>
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
>
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.
>
> On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> With the availability of sparse storage formats, it is often needed
>> to query status of a particular range and read only those blocks of
>> data that are actually present on the block device.
>>
>> To provide such information, the patch adds the BLOCK_STATUS
>> extension with one new NBD_CMD_BLOCK_STATUS command, a new
>> structured reply chunk format, and a new transmission flag.
>>
>> There exists a concept of data dirtiness, which is required
>> during, for example, incremental block device backup. To express
>> this concept via NBD protocol, this patch also adds a flag to
>> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
>> provisioning information; however, with the current proposal, data
>> dirtiness is only useful with additional coordination outside of
>> the NBD protocol (such as a way to start and stop the server from
>> tracking dirty sectors).  Future NBD extensions may add commands
>> to control dirtiness through NBD.
>>
>> Since NBD protocol has no notion of block size, and to mimic SCSI
>> "GET LBA STATUS" command more closely, it has been chosen to return
>> a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
>> instead of a bitmap.
>>
>> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>> CC: Denis V. Lunev <den@openvz.org>
>> CC: Wouter Verhelst <w@uter.be>
>> CC: Paolo Bonzini <pbonzini@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
>> v3:
>>
>> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
>> removing the restriction that the sum of status descriptor lengths must be
>> equal to the requested length. I.e., let's permit the server to reply with
>> less data than requested if it wants.
>
> Reasonable, yes. The length that the client requests should be a maximum (i.e.
> "I'm interested in this range"), not an exact request.
>
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>>  NBD_FLAG_CAN_MULTI_CONN in master branch.
>
> Right.
>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
>
> Probably a good idea, given the above.
>
>> By this resend I just want to continue the discussion, started about half
>> a year ago. Here is a summary of some questions and ideas from the v2
>> discussion:
>>
>> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
>>    A: This all is for read-only disks, so the data is static and unchangeable.
>
> I think we should declare that it's up to the client to ensure no other
> writes happen without its knowledge. This may be because the client and
> server communicate out of band about state changes, or because the
> client somehow knows that it's the only writer, or whatever.
>
> We can easily do that by declaring that the result of that command only
> talks about *current* state, and that concurrent writes by different
> clients may invalidate the state. This is true for NBD in general (i.e.,
> concurrent read or write commands from other clients may confuse file
> systems on top of NBD), so it doesn't change expectations in any way.
>

Agree with you here. "This was correct when I sent it, but it's up to 
you to ensure that nothing would have invalidated that in the meantime" 
is fine semantically to me.

Of course in our implementation, we intend to only export essentially 
read-only snapshots of data, so we don't personally expect to run into 
any of these kinds of semantic problems.

>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: server replies with status descriptors of any size, granularity
>>          is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities are not intersecting
>
> Not entirely sure anymore what this is about?
>

We have a concept of granularity for dirty tracking; consider it a block 
size.

I don't think it's necessarily relevant to NBD, except for the case of 
false positives. Consider the case of a 512 byte block that is assumed 
dirty simply because it's adjacent to actually dirty data.

That may have some meaning for how we write the NBD spec; i.e. the 
meaning of the status bit changes from "This is definitely modified" to 
"This was possibly modified," due to limitations in the management of 
this information by the server.

If we expose the granularity information via NBD, it at least makes it 
clear how fuzzy the results presented may be. Otherwise, it's not really 
required.

[Again, as seen below, a user-defined bit would be plenty sufficient...!]
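To make the false-positive effect concrete, here is a small sketch (purely illustrative; none of these names come from the NBD spec or QEMU) of why a server tracking dirtiness at a fixed granularity must round any reported range outward, turning adjacent clean bytes into "possibly modified":

```python
GRANULARITY = 64 * 1024  # hypothetical tracking block size

def reported_extent(offset, length, granularity=GRANULARITY):
    """Round a truly-dirty byte range outward to tracking granularity."""
    start = (offset // granularity) * granularity
    end = -(-(offset + length) // granularity) * granularity  # ceiling
    return (start, end - start)

# A 512-byte write near the start of the second 64 KiB block...
print(reported_extent(65536 + 100, 512))  # (65536, 65536)
# ...is reported as the whole 64 KiB block: the adjacent clean bytes
# become "possibly modified" rather than "definitely modified".
```

Exposing the granularity would tell the client exactly how fuzzy such replies can be.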

>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>           pros: - simple
>>           cons: - it's a hack. flags field is for other uses.
>>                 - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and exploit this
>>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possible we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: external tool should select which bitmap to export. So, in case of Qemu
>>          it should be something like qmp command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the lesser evil)
>>          note: Hmm, external tool can choose between allocated/dirty data too,
>>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
>
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).
>
> To avoid this, we should have a clear definition of what the reply means
> *by default*, but then we can add a note that clients and servers can
> possibly define other meanings out of band if they want to.
>

I had personally only ever considered #3; where a command would be 
issued to QEMU to begin offering NBD data for some point-in-time as 
associated with a particular reference/snapshot/backup/etc. This leaves 
it up to the out-of-band client to order up the right data.

That does appear to make choosing a meaningful name for the status bits 
a bit more difficult...

[...but reading ahead, a 'user defined' bit would fit the bill just fine.]

When Vladimir authored a "persistence" feature for bitmaps to be stored 
alongside QCOW2 files, we had difficulty describing exactly what a dirty 
bit meant for the data -- ultimately it is reliant on external 
information. The meaning is simply but ambiguously, "The data associated 
with this bit has been changed since the last time the bit was reset."

We don't record or stipulate the conditions for a reset.

(for our purposes, a reset would occur during a full or incremental backup.)
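The semantics described above can be modeled in a few lines; this is a toy sketch, not QEMU code, and all names are invented for illustration:

```python
class DirtyBitmap:
    """Toy model of the dirty-bit semantics described above."""

    def __init__(self, size, granularity):
        self.granularity = granularity
        self.bits = [False] * (-(-size // granularity))  # ceil(size/gran)

    def mark_dirty(self, offset, length):
        # Set on every write to the tracked device.
        first = offset // self.granularity
        last = (offset + length - 1) // self.granularity
        for i in range(first, last + 1):
            self.bits[i] = True

    def reset(self):
        # The conditions for a reset are deliberately unspecified; in the
        # backup use case it follows a full or incremental backup.
        self.bits = [False] * len(self.bits)

bm = DirtyBitmap(size=1 << 20, granularity=64 * 1024)
bm.mark_dirty(70000, 100)   # falls entirely inside the second 64 KiB block
```

The bit itself only means "changed since the last reset"; what a reset means is left to the out-of-band management layer.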

>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost the same semantics and similar mechanics?
>>    pros: However, there is a good point in separating the clearly defined,
>>          native-to-block-devices GET_BLOCK_STATUS from user-driven and
>>          actually undefined data, called 'dirtiness'.
>
> Yeah, having them separate commands might be a bad idea indeed.
>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
>
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.
>
>> 6. Q: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if found unknown bits, client should consider a reply
>>    with unknown bits as an error)
>
> Right, probably best to do that, yes.
>
>> ======
>>
>> Also, an idea on 2-4:
>>
>>     As we say, that dirtiness is unknown for NBD, and external tool
>>     should specify, manage and understand, which data is actually
>>     transmitted, why not just call it user_data and leave status field
>>     of reply chunk unspecified in this case?
>>
>>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>>     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>>     leave status field for some external... protocol? Who knows, what is
>>     this user data.
>
> Yes, this sounds like a reasonable approach.
>

I'd be pretty happy (personally) with some user-defined bits. It keeps 
much of the ambiguity about exactly what we're trying to convey out of 
the NBD spec, which is nice.

>>     Note: I'm not sure that I like this (my) proposal. It's just an
>>     idea; maybe someone will like it.  And, I think, it represents what we
>>     are trying to do more honestly.
>
> Indeed.
>
>>     Note2: the next step of generalization will be NBD_CMD_USER, with
>>     variable request size, structured reply and no definition :)
>
> Well, er, no please, if we can avoid it :-)
>
>> Another idea, about backups themselves:
>>
>>     Why do we need allocated/zero status for backup? IMHO we don't.
>
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)
>
> I can understand a "has this changed since time X" request, which the
> "dirty" thing seems to want to be. Whether something is allocated is
> just a special case of that.
>
> Actually, come to think of that. What is the exact use case for this
> thing? I understand you're trying to create incremental backups of
> things, which would imply you don't write from the client that is
> getting the block status thingies, right? If so, how about:
>

Essentially you can create a bitmap object in-memory in QEMU and then 
associate it with a drive. It records changes to the drive, and QEMU can 
be instructed to write out any changes since the last backup to disk, 
while clearing the bits of the bitmap.

It doesn't record a specific point in time, it's just implicit to the 
last time you cleared the bitmap -- usually the last incremental or full 
backup you've made.

So it does describe "blocks changed since time X," it's just that we 
don't really know exactly when time X was.

This is all well and dandy, except there is some desire from third 
parties to be able to ask QEMU about this dirty block information -- to 
be able to see this bitmap, more or less. We already use NBD for 
exporting data, and instead of inventing a new side-band, we decided 
that we wanted to use NBD to let external users get this information.

What exactly they do with that info is beyond QEMU's scope.
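The workflow sketched above can be shown end to end as a self-contained toy. `FakeExport` stands in for a real NBD connection, and `block_status` is an invented method name, not actual NBD client API; the point is just that a client can ask which extents are dirty and copy only those:

```python
class FakeExport:
    """Stands in for an NBD export with a dirty-status query (toy)."""

    def __init__(self, data, dirty):
        self.data = bytes(data)
        self.dirty = dirty            # [(offset, length)] dirty ranges
        self.size = len(data)

    def block_status(self, offset, length):
        """Return [(length, is_dirty)] descriptors covering the range."""
        out, pos, end = [], offset, offset + length
        while pos < end:
            hit = next(((o, l) for o, l in self.dirty
                        if o <= pos < o + l), None)
            if hit is not None:
                o, l = hit
                nxt = min(o + l, end)
                out.append((nxt - pos, True))
            else:
                nxt = min([o for o, _ in self.dirty if o > pos] + [end])
                out.append((nxt - pos, False))
            pos = nxt
        return out

def incremental_backup(export, target):
    """Copy only the extents the server reports as dirty."""
    offset = 0
    for length, is_dirty in export.block_status(0, export.size):
        if is_dirty:
            target[offset:offset + length] = \
                export.data[offset:offset + length]
        offset += length
    return target

disk = FakeExport(b"A" * 16, dirty=[(4, 4), (12, 4)])
backup = incremental_backup(disk, bytearray(16))
```

What populates the dirty ranges, and what the backup application does with the result, stays outside the data plane, exactly as described.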

> - NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
>   snapshots. Not required, optional, includes a machine-readable form,
>   not defined by NBD, which explains what the snapshot is about (e.g., a
>   qemu json file). The "base" version of that is just "allocation
>   status", and is implied (i.e., you don't need to run
>   NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
>   allocation status).

We don't necessarily have /snapshots/ per se, but we do have some block 
device and one or more bitmaps describing deltas to one or more backups 
that we do not necessarily have access to.

e.g. a given bitmap may describe a delta to an off-site backup. We know 
the delta, but do not maintain any meaningful handle to the given 
off-site backup, including name, URI, or date.

QEMU leaves this association up to upper management, and provides only 
IDs to facilitate any additional correlation.

We could offer up descriptions of these bitmaps in response to such a 
command, but they're not ... quite snapshots. It is more the case that 
we use them to create snapshots.

> - NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
>   which tell you what the status of a block of data is for each of the
>   relevant snapshots that we know about.
>
> Perhaps this is somewhat overengineered, but it does bring most of the
> definition of what a snapshot is back into the NBD protocol, without
> having to say "this could be anything", and without requiring
> connectivity over two ports for this to be useful (e.g., you could store
> the machine-readable form of the snapshot description into your backup
> program and match what they mean with what you're interested in at
> restore time, etc).
>
> This wouldn't work if you're interested in new snapshots that get
> created once we've already moved into transmission, but hey.
>
> Thoughts?
>
>>     Full backup: just do a structured read - it will show us which chunks
>>     may be treated as zeroes.
>
> Right.
>
>>     Incremental backup: get dirty bitmap (somehow, for example through
>>     user-defined part of the proposed command), then, for dirty blocks, read
>>     them through structured read, so information about zero/unallocated
>>     areas are here.
>>
>> For me all the variants above are OK. Let's finally choose something.
>>
>> v2:
>> v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
>>
>> Since then, we've added the STRUCTURED_REPLY extension, which
>> necessitates a rather larger rebase; I've also changed things
>> to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
>> modes to be determined by boolean flags (rather than by fixed
>> values of the 16-bit flags field), changed the reply status fields
>> to be bitwise-or values (with a default of 0 always being sane),
>> and changed the descriptor layout to drop an offset but to include
>> a 32-bit status so that the descriptor is nicely 8-byte aligned
>> without padding.
>>
>>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 154 insertions(+), 1 deletion(-)
>
> [...]
>
> I'll commit this in a minute into a separate branch called
> "extension-blockstatus", under the understanding that changes are still
> required, as per above (i.e., don't assume that just because there's a
> branch I'm happy with the current result ;-)
>
> Regards
>

Err, I hope I haven't confused everything to heck and back now....

Thanks,
--js
Stefan Hajnoczi Nov. 29, 2016, 9:17 a.m. UTC | #6
On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> Hi Stefan,
> 
> On Mon, Nov 28, 2016 at 11:19:44AM +0000, Stefan Hajnoczi wrote:
> > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > > Quickly: the reason I haven't merged this yet is twofold:
> > > - I wasn't thrilled with the proposal at the time. It felt a bit
> > >   hackish, and bolted onto NBD so you could use it, but without defining
> > >   everything in the NBD protocol. "We're reading some data, but it's not
> > >   about you". That didn't feel right
> > >
> > > - There were a number of questions still unanswered (you're answering a
> > >   few below, so that's good).
> > > 
> > > For clarity, I have no objection whatsoever to adding more commands if
> > > they're useful, but I would prefer that they're also useful with NBD on
> > > its own, i.e., without requiring an initiation or correlation of some
> > > state through another protocol or network connection or whatever. If
> > > that's needed, that feels like I didn't do my job properly, if you get
> > > my point.
> > 
> > The out-of-band operations you are referring to are for dirty bitmap
> > management.  (The goal is to read out blocks that changed since the last
> > backup.)
> > 
> > The client does not access the live disk, instead it accesses a
> > read-only snapshot and the dirty information (so that it can copy out
> > only blocks that were written).  The client is allowed to read blocks
> > that are not dirty too.
> 
> I understood as much, yes.
> 
> > If you want to implement the whole incremental backup workflow in NBD
> > then the client would first have to connect to the live disk, set up
> > dirty tracking, create a snapshot export, and then connect to that
> > snapshot.
> > 
> > That sounds like a big feature set and I'd argue it's for the control
> > plane (storage API) and not the data plane (NBD).  There were
> > discussions about transferring the dirty information via the control
> > plane, but it seems more appropriate to do it in the data plane since it is
> > block-level information.
> 
> I agree that creating and managing snapshots is out of scope for NBD. The
> protocol is not set up for that.
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> My proposal also makes it impossible to get updates of newly created
> snapshots without disconnecting and reconnecting (due to the fact that
> you can't go from transmission back to negotiation), but I'm not sure
> that's a problem.
> 
> Doing so has two advantages:
> - If a client is accidentally (due to misconfiguration or implementation
>   bugs or whatnot) connecting to the wrong server after having created a
>   snapshot through a management protocol, we have an opportunity to
>   detect this error, due to the fact that the "foreign" identifiers
>   passed to the client during negotiation will not match with what the
>   client was expecting.
> - A future version of the protocol could possibly include an extended
>   version of the read command, allowing a client to read information
>   from multiple storage snapshots without requiring a reconnect, and
>   allowing current clients information about allocation status across
>   various snapshots (although a first implementation could very well
>   limit itself to only having one snapshot).

Sorry, I misunderstood you.

Snapshots are not very different from NBD exports.  Especially if the
storage system supports writeable-snapshot (aka cloning).  Should we
just use named exports?

Stefan
Kevin Wolf Nov. 29, 2016, 10:18 a.m. UTC | #7
Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
> > 3. Q: selecting of dirty bitmap to export
> >    A: several variants:
> >       1: id of bitmap is in flags field of request
> >           pros: - simple
> >           cons: - it's a hack. flags field is for other uses.
> >                 - we'll have to map bitmap names to these "ids"
> >       2: introduce extended nbd requests with variable length and exploit this
> >          feature for BLOCK_STATUS command, specifying bitmap identifier.
> >          pros: - looks like a true way
> >          cons: - we have to create additional extension
> >                - possible we have to create a map,
> >                  {<QEMU bitmap name> <=> <NBD bitmap id>}
> >       3: external tool should select which bitmap to export. So, in case of Qemu
> >          it should be something like qmp command block-export-dirty-bitmap.
> >          pros: - simple
> >                - we can extend it to behave like (2) later
> >          cons: - additional qmp command to implement (possibly, the lesser evil)
> >          note: Hmm, external tool can choose between allocated/dirty data too,
> >                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> 
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).

Another point to consider is that option 3 doesn't allow you to access
two (or more) different bitmaps from the client without using the side
channel all the time to switch back and forth between them and having to
drain the request queue each time to avoid races.

In general, if we have something that "looks like the true way", I'd 
advocate choosing that option. Experience tells us that we'd regret 
anything simpler in a year or two, and then we'll have to do the real 
thing anyway, but still need to support the quick hack for 
compatibility.

> > 5. Number of status descriptors, sent by server, should be restricted
> >    variants:
> >    1: just allow server to restrict this as it wants (which was done in v3)
> >    2: (not excluding 1). Client specifies somehow the maximum for number
> >       of descriptors.
> >       2.1: add command flag, which will request only one descriptor
> >            (otherwise, no restrictions from the client)
> >       2.2: again, introduce extended nbd requests, and add field to
> >            specify this maximum
> 
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.

The native qemu block layer interface returns the status of only one
contiguous chunk, so the easiest way to implement the NBD block driver
in qemu would always use that bit.

On the other hand, it would be possible for the NBD block driver in qemu
to cache the rest of the response internally and answer the next request
from the cache instead of sending a new request over the network. Maybe
that's what it should be doing anyway for good performance.
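The caching idea Kevin mentions can be sketched briefly. This is illustrative only: `query` is a placeholder for the actual network request, and the sketch assumes the server's reply always starts at the requested offset:

```python
class ExtentCache:
    """Cache surplus status descriptors from one reply (illustrative)."""

    def __init__(self, query):
        self.query = query            # fn(offset, length) -> [(length, status)]
        self.cache = []               # [(offset, length, status)]

    def status_at(self, offset, length):
        for o, l, s in self.cache:
            if o <= offset < o + l:   # cache hit: no network traffic
                return (min(length, o + l - offset), s)
        # Cache miss: one round trip, remember every descriptor returned.
        self.cache, pos = [], offset
        for l, s in self.query(offset, length):
            self.cache.append((pos, l, s))
            pos += l
        return self.status_at(offset, length)

calls = []
def query(offset, length):            # fake server: two descriptors per reply
    calls.append((offset, length))
    return [(4096, 0), (4096, 1)]

cache = ExtentCache(query)
first = cache.status_at(0, 8192)      # triggers the single network request
second = cache.status_at(4096, 4096)  # answered from the cache
```

With such a cache, a one-extent-at-a-time driver interface still benefits from servers that send many descriptors per reply.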

> > Also, an idea on 2-4:
> > 
> >     As we say, that dirtiness is unknown for NBD, and external tool
> >     should specify, manage and understand, which data is actually
> >     transmitted, why not just call it user_data and leave status field
> >     of reply chunk unspecified in this case?
> > 
> >     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> >     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> >     Eric's 'Block provisioning status' paragraph.  If it is set, we just
> >     leave status field for some external... protocol? Who knows, what is
> >     this user data.
> 
> Yes, this sounds like a reasonable approach.

Makes sense to me, too.

However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
have use for a second one. If we go with one of the options where the
bitmap is selected per command, we're fine because you can simply move
the second bit to a different bitmap and do two requests. If we have
only a single active bitmap at a time, though, this feels like an actual
problem.

> > Another idea, about backups themselves:
> > 
> >     Why do we need allocated/zero status for backup? IMHO we don't.
> 
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)

I think we do. A block can be dirtied by discarding/write_zero-ing it.
Then we want the dirty bit to know that we need to include this block in
the incremental backup, but we also want to know that we don't actually
have to transfer the data in it.
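Kevin's point can be illustrated by combining the two kinds of status when planning a transfer; the tuple layout and names here are invented for the sketch:

```python
def plan_transfer(extents):
    """extents: [(length, dirty, reads_as_zero)] -- dirty status merged
    with zero/allocation information (names are illustrative)."""
    ops, offset = [], 0
    for length, dirty, reads_as_zero in extents:
        if dirty:
            # A dirty-but-zero extent still belongs in the backup, but as
            # a cheap "write zeroes" with no data crossing the network.
            ops.append(("zero" if reads_as_zero else "copy", offset, length))
        offset += length
    return ops

plan = plan_transfer([
    (4096, True, False),   # modified data: copy it
    (4096, True, True),    # discarded/zeroed since last backup: record zeroes
    (4096, False, False),  # clean: skip entirely
])
```

The discarded extent is recorded in the backup without transferring any payload, which is exactly the saving being argued for.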

Kevin
Vladimir Sementsov-Ogievskiy Nov. 29, 2016, 11:34 a.m. UTC | #8
29.11.2016 13:18, Kevin Wolf wrote:
> Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
>>> 3. Q: selecting of dirty bitmap to export
>>>     A: several variants:
>>>        1: id of bitmap is in flags field of request
>>>            pros: - simple
>>>            cons: - it's a hack. flags field is for other uses.
>>>                  - we'll have to map bitmap names to these "ids"
>>>        2: introduce extended nbd requests with variable length and exploit this
>>>           feature for BLOCK_STATUS command, specifying bitmap identifier.
>>>           pros: - looks like a true way
>>>           cons: - we have to create additional extension
>>>                 - possible we have to create a map,
>>>                   {<QEMU bitmap name> <=> <NBD bitmap id>}
>>>        3: external tool should select which bitmap to export. So, in case of Qemu
>>>           it should be something like qmp command block-export-dirty-bitmap.
>>>           pros: - simple
>>>                 - we can extend it to behave like (2) later
>>>           cons: - additional qmp command to implement (possibly, the lesser evil)
>>>           note: Hmm, external tool can choose between allocated/dirty data too,
>>>                 so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
>> Downside of 3, though, is that it moves the definition of what the
>> different states mean outside of the NBD protocol (i.e., the protocol
>> messages are not entirely defined anymore, and their meaning depends on
>> the clients and servers in use).
> Another point to consider is that option 3 doesn't allow you to access
> two (or more) different bitmaps from the client without using the side
> channel all the time to switch back and forth between them and having to
> drain the request queue each time to avoid races.
>
> In general, if we have something that "looks like the true way", I'd
> advocate to choose this option. Experience tells that we'd regret
> anything simpler in a year or two, and then we'll have to do the real
> thing anyway, but still need to support the quick hack for
> compatibility.
>
>>> 5. Number of status descriptors, sent by server, should be restricted
>>>     variants:
>>>     1: just allow server to restrict this as it wants (which was done in v3)
>>>     2: (not excluding 1). Client specifies somehow the maximum for number
>>>        of descriptors.
>>>        2.1: add command flag, which will request only one descriptor
>>>             (otherwise, no restrictions from the client)
>>>        2.2: again, introduce extended nbd requests, and add field to
>>>             specify this maximum
>> I think having a flag which requests just one descriptor can be useful,
>> but I'm hesitant to add it unless it's actually going to be used; so in
>> other words, I'll leave the decision on that bit to you.
> The native qemu block layer interface returns the status of only one
> contiguous chunk, so the easiest way to implement the NBD block driver
> in qemu would always use that bit.
>
> On the other hand, it would be possible for the NBD block driver in qemu
> to cache the rest of the response internally and answer the next request
> from the cache instead of sending a new request over the network. Maybe
> that's what it should be doing anyway for good performance.
>
>>> Also, an idea on 2-4:
>>>
>>>      As we say, that dirtiness is unknown for NBD, and external tool
>>>      should specify, manage and understand, which data is actually
>>>      transmitted, why not just call it user_data and leave status field
>>>      of reply chunk unspecified in this case?
>>>
>>>      So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>>>      NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>>>      Eric's 'Block provisioning status' paragraph.  If it is set, we just
>>>      leave status field for some external... protocol? Who knows, what is
>>>      this user data.
>> Yes, this sounds like a reasonable approach.
> Makes sense to me, too.
>
> However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
> have use for a second one. If we go with one of the options where the
> bitmap is selected per command, we're fine because you can simply move
> the second bit to a different bitmap and do two requests. If we have
> only a single active bitmap at a time, though, this feels like an actual
> problem.
>
>>> Another idea, about backups themselves:
>>>
>>>      Why do we need allocated/zero status for backup? IMHO we don't.
>> Well, I've been thinking so all along, but then I don't really know what
>> it is, in detail, that you want to do :-)
> I think we do. A block can be dirtied by discarding/write_zero-ing it.
> Then we want the dirty bit to know that we need to include this block in
> the incremental backup, but we also want to know that we don't actually
> have to transfer the data in it.

And we will know that automatically by using the structured read command, 
so a separate call to get_block_status is not needed.

>
> Kevin
Alex Bligh Nov. 29, 2016, 12:57 p.m. UTC | #9
Vladimir,

I went back to April to reread the previous train of conversation,
then found you had helpfully summarised some of it. Comments
below.

Rather than comment on many of the points individually, the root
of my confusion, and to some extent my discomfort with this
proposal, is 'who owns the meaning of the bitmaps'.

Some of this is my own confusion (sorry) about the use to which
this is being put, which is I think at root a documentation issue.
To illustrate this, you write in the FAQ section that this is for
read only disks, but the text talks about:

+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)

I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.

In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets it, and how it is set is
outside the scope of this protocol; the protocol merely transmits that state.

However, we have the uncomfortable (to me) situation where the protocol
describes a flag 'dirty', with implications as to what it does, but
no actual strict definition of how it's set. So any 'other' user has
no real idea how to use the information, or how to implement a server
that provides a 'dirty' bit, because the semantics of that aren't within
the protocol. This does not sit happily with me.

So I'm wondering whether we should simplify and generalise this spec. You
say that for the dirty flag, there's no specification of when it is
set and cleared - that's implementation-defined. Would it not be better
then to say 'that whole thing is private to Qemu - even the name'?

Rather you could read the list of bitmaps a server has, with a textual
name, each having an index (and perhaps a granularity). You could then
ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
and some (e.g. ones beginning with 'X-') could be reserved for user
usage, so you could use 'X-QEMU-DIRTY'. If you then change what your
dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
need a protocol change to support it.

IE rather than looking at 'a way of reading the dirty bit', we could
have this as a generic way of reading opaque bitmaps. Only one (allocation)
might be given meaning to start off with, and it wouldn't be necessary
for all servers to support that - i.e. you could support bitmap reading
without having an ALLOCATION bitmap available.

This spec would then be limited to the transmission of the bitmaps
(remove the word 'dirty' entirely, except perhaps as an illustrative
use case), and include only the definition of the allocation bitmap.
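As a rough sketch of that generic scheme (the names, indexes, granularities, and extent representation here are all illustrative assumptions, not part of the current proposal):

```python
# Hypothetical model of the 'named bitmaps' idea: the server advertises
# a list of (index, name, granularity) tuples; the client queries block
# status against a chosen index and gets back (length, status) extents.

BITMAPS = [
    (0, "ALLOCATED", 4096),      # a name the spec itself could define
    (1, "X-QEMU-DIRTY", 65536),  # 'X-' prefix reserved for user-defined bitmaps
]

# Per-bitmap state, as runs of (length, status); purely illustrative data.
_STATE = {
    0: [(65536, 1), (65536, 0)],
    1: [(131072, 1)],
}

def list_bitmaps():
    """What a hypothetical 'list bitmaps' reply could carry."""
    return BITMAPS

def block_status(index, offset, length):
    """Return (length, status) extents covering [offset, offset+length)."""
    extents, pos = [], 0
    for ext_len, status in _STATE[index]:
        start, end = pos, pos + ext_len
        pos = end
        if end <= offset or start >= offset + length:
            continue  # run is entirely outside the queried range
        lo, hi = max(start, offset), min(end, offset + length)
        extents.append((hi - lo, status))
    return extents
```

Under this model a server could support bitmap reading without having an ALLOCATED bitmap at all: it simply would not list index 0.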

Some more nits:

> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in master branch.
> 
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Each documentation branch should normally be branched off master unless
it depends on another extension (in which case it will be branched from that).
I haven't been rebasing them frequently as it can disrupt those working
on the branches. There's only really an issue around rebasing where you
depend on another branch.


> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>   A: 1: server replies with status descriptors of any size, granularity
>         is hidden from the client
>      2: dirty/allocated requests are separate and unrelated to each
>         other, so their granularities are not intersecting

I'm OK with this, but note that you do actually mention a granularity
of sorts in the spec (512 bytes) - I think you should replace that
with the minimum block size.

> 3. Q: selecting of dirty bitmap to export
>   A: several variants:
>      1: id of bitmap is in flags field of request
>          pros: - simple
>          cons: - it's a hack. flags field is for other uses.
>                - we'll have to map bitmap names to these "ids"
>      2: introduce extended nbd requests with variable length and exploit this
>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>         pros: - looks like a true way
>         cons: - we have to create additional extension
>               - possible we have to create a map,
>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>      3: exteranl tool should select which bitmap to export. So, in case of Qemu
>         it should be something like qmp command block-export-dirty-bitmap.
>         pros: - simple
>               - we can extend it to behave like (2) later
>         cons: - additional qmp command to implement (possibly, the lesser evil)
>         note: Hmm, external tool can make chose between allocated/dirty data too,
>               so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Yes, this is all pretty horrible. I suspect we want to do something like (2),
and permit extra data across the wire (in my proposal, it would only be one
byte to select the index). I suppose one could ask for a list of bitmaps.

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>   cons: Two commands with almost same semantic and similar means?
>   pros: However here is a good point of separating clearly defined and native
>         for block devices GET_BLOCK_STATUS from user-driven and actually
>         undefined data, called 'dirtyness'.

I'm suggesting one generic 'read bitmap' command like you.

> 5. Number of status descriptors, sent by server, should be restricted
>   variants:
>   1: just allow server to restrict this as it wants (which was done in v3)
>   2: (not excluding 1). Client specifies somehow the maximum for number
>      of descriptors.
>      2.1: add command flag, which will request only one descriptor
>           (otherwise, no restrictions from the client)
>      2.2: again, introduce extended nbd requests, and add field to
>           specify this maximum

I think some form of extended request is the way to go, but out of
interest, what's the issue with as many descriptors being sent as it
takes to encode the reply? The client can just consume the remainder
(without buffering) and reissue the request at a later point for
the areas it discarded.

> 
> 6. A: What to do with unspecified flags (in request/reply)?
>   I think the normal variant is to make them reserved. (Server should
>   return EINVAL if found unknown bits, client should consider replay
>   with unknown bits as an error)

Yeah.

> 
> +
> +* `NBD_CMD_BLOCK_STATUS`
> +
> +    A block status query request. Length and offset define the range
> +    of interest. Clients SHOULD NOT use this request unless the server

MUST NOT is what we say elsewhere I believe.

> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
> +    in turn requires the client to first negotiate structured replies.
> +    For a successful return, the server MUST use a structured reply,
> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.

Nit: are you saying that non-structured error replies are permissible?
You're always/often going to get a non-structured (simple) error reply
if the server doesn't support the command, but I think it would be fair to say the
server MUST use a structured reply to NBD_CMD_BLOCK_STATUS if
it supports the command. This is effectively what we say re NBD_CMD_READ.

> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file starting from specified *offset*, and the sum of the
> +    *length* fields of each descriptor MUST not be greater than the
> +    overall *length* of the request. This means that the server MAY
> +    return less data than required. However the server MUST return at
> +    least one status descriptor

I'm not sure I understand why that's useful. What should the client
infer from the server refusing to provide information? We don't
permit short reads etc.

> .  The server SHOULD use different
> +    *status* values between consecutive descriptors, and SHOULD use
> +    descriptor lengths that are an integer multiple of 512 bytes where
> +    possible (the first and last descriptor of an unaligned query being
> +    the most obvious places for an exception).

Surely better would be an integer multiple of the minimum block
size. Being able to offer bitmap support at finer granularity than
the absolute minimum block size helps no one, and if it were possible
to support a 256 byte block size (I think some floppy disks had that)
I see no reason not to support that as a granularity.
Vladimir Sementsov-Ogievskiy Nov. 29, 2016, 2:36 p.m. UTC | #10
29.11.2016 15:57, Alex Bligh wrote:
> Vladimir,
>
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
>
> Rather than comment on many of the points individual, the root
> of my confusion and to some extent uncomfortableness about this
> proposal is 'who owns the meaning the the bitmaps'.
>
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read only disks, but the text talks about:
>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>
> I now think what you are talking about backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.

Yes, something like this.

Note: in Qemu it may not be a snapshot (actually, I haven't seen a way in 
Qemu to export snapshots without switching to them, except opening an external 
snapshot as a separate block device), but just any read-only drive, or a 
temporary drive, used in the image fleecing scheme:

driveA is online normal drive
driveB is empty nbd export

- start a backup driveA->driveB with sync=none, which means that the only 
copying done is of old data from A to B, just before every write to A
- set driveA as the backing file for driveB (on a read from B, if the data is 
not allocated there, it will be read from A)

after that, driveB is something like a snapshot for backup through NBD.

It's all just to say: calling the backup point-in-time state simply a 
'snapshot' confuses me, and I hope not to see this word in the spec (in 
this context).
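As a very rough sketch, the fleecing setup above maps to a QMP command sequence along these lines (device/node names, the image filename, and the NBD listen address are illustrative assumptions, not a tested recipe):

```python
# Illustrative QMP command sequence for the fleecing scheme described
# above; names and addresses are assumptions for the sake of the example.
fleecing_setup = [
    # driveB: an empty temporary image with driveA as its backing file,
    # so reads of unallocated areas fall through to driveA
    {"execute": "blockdev-add",
     "arguments": {"node-name": "driveB",
                   "driver": "qcow2",
                   "file": {"driver": "file", "filename": "fleece.qcow2"},
                   "backing": "driveA"}},
    # sync=none: the only copying is of old data from A to B, done just
    # before each write to A overwrites it
    {"execute": "drive-backup",
     "arguments": {"device": "driveA", "target": "fleece.qcow2",
                   "sync": "none"}},
    # export driveB over NBD for the external backup tool to read
    {"execute": "nbd-server-start",
     "arguments": {"addr": {"type": "inet",
                            "data": {"host": "0.0.0.0", "port": "10809"}}}},
    {"execute": "nbd-server-add",
     "arguments": {"device": "driveB"}},
]
```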

>
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
>
> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
>
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'). If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
>
> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.

Good point. For Qcow2 we finally came to just "bitmaps", not "dirty 
bitmaps", to make it more general. There is a problem with allocation if 
we want to make it a subcase of bitmap: allocation natively has two 
bits per block: zero and allocated. We can of course separate this into 
two bitmaps, but this will not be similar to the classic get_block_status.

> Some more nits:
>
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
>
>
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>    A: 1: server replies with status descriptors of any size, granularity
>>          is hidden from the client
>>       2: dirty/allocated requests are separate and unrelated to each
>>          other, so their granularities are not intersecting
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 byes) - I think you should replace that
> with the minimum block size.
>
>> 3. Q: selecting of dirty bitmap to export
>>    A: several variants:
>>       1: id of bitmap is in flags field of request
>>           pros: - simple
>>           cons: - it's a hack. flags field is for other uses.
>>                 - we'll have to map bitmap names to these "ids"
>>       2: introduce extended nbd requests with variable length and exploit this
>>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>>          pros: - looks like a true way
>>          cons: - we have to create additional extension
>>                - possible we have to create a map,
>>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>>       3: exteranl tool should select which bitmap to export. So, in case of Qemu
>>          it should be something like qmp command block-export-dirty-bitmap.
>>          pros: - simple
>>                - we can extend it to behave like (2) later
>>          cons: - additional qmp command to implement (possibly, the lesser evil)
>>          note: Hmm, external tool can make chose between allocated/dirty data too,
>>                so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
>
>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>    cons: Two commands with almost same semantic and similar means?
>>    pros: However here is a good point of separating clearly defined and native
>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>          undefined data, called 'dirtyness'.
> I'm suggesting one generic 'read bitmap' command like you.

To support get_block_status in this general read_bitmap, we will need to 
define something like a 'multibitmap', which allows several bits per 
chunk, as allocation data has two: zero and allocated.

>
>> 5. Number of status descriptors, sent by server, should be restricted
>>    variants:
>>    1: just allow server to restrict this as it wants (which was done in v3)
>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>       of descriptors.
>>       2.1: add command flag, which will request only one descriptor
>>            (otherwise, no restrictions from the client)
>>       2.2: again, introduce extended nbd requests, and add field to
>>            specify this maximum
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.

The issue is that too many descriptors are possible. So, (1) solves it; (2) is 
optional, just to simplify/optimize the client side.

>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>    I think the normal variant is to make them reserved. (Server should
>>    return EINVAL if found unknown bits, client should consider replay
>>    with unknown bits as an error)
> Yeah.
>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> MUST NOT is what we say elsewhere I believe.
>
>> +    set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured  (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.

I agree.

>
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST not be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than required. However the server MUST return at
>> +    least one status descriptor
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.

If the bitmap is 010101010101... we will have too many descriptors. For 
example, a 16 TB disk with 64 KB granularity -> 2 GB of descriptor payload.
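The worst-case arithmetic can be checked directly (assuming, for illustration, 8-byte status descriptors - a 32-bit length plus a 32-bit status):

```python
# Worst case: alternating 0/1 bits, so one descriptor per granularity unit.
DISK_SIZE = 16 * 2**40      # 16 TB
GRANULARITY = 64 * 2**10    # 64 KB
DESCRIPTOR_BYTES = 8        # assumed: 32-bit length + 32-bit status

descriptors = DISK_SIZE // GRANULARITY
payload = descriptors * DESCRIPTOR_BYTES

print(descriptors)          # 268435456 (256M descriptors)
print(payload // 2**30)     # 2 (i.e. 2 GB of descriptor payload)
```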

>
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> Surely better would be an an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
>

Agree. Anyway, it is just a strong recommendation, not a requirement.
Alex Bligh Nov. 29, 2016, 2:52 p.m. UTC | #11
Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>   cons: Two commands with almost same semantic and similar means?
>>>   pros: However here is a good point of separating clearly defined and native
>>>         for block devices GET_BLOCK_STATUS from user-driven and actually
>>>         undefined data, called 'dirtyness'.
>> I'm suggesting one generic 'read bitmap' command like you.
> 
> To support get_block_status in this general read_bitmap, we will need to define something like 'multibitmap', which allows several bits per chunk, as allocation data has two: zero and allocated.

I think you are saying that for an arbitrary 'bitmap' there might be more than two states. For instance, one might (in an allocation 'bitmap') have a hole, a non-hole zero, or a non-hole non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. Each extent has a status. For the bitmap to be useful, at least two statuses need to be possible, but the above would have three. This could be internally implemented by the server as (a) a bitmap (with two bits per entry), (b) two bitmaps (possibly with different granularity), (c) something else (e.g. reading file extents, then, if the data is allocated, manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents (as you suggested) is a good idea, and there need not be an actual bitmap.
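A minimal sketch of deriving such extents from per-block state, however the server stores it internally (the three status codes here are made up for illustration):

```python
# Made-up status codes for the three-state allocation example above.
HOLE, ZERO, DATA = 0, 1, 2

def states_to_extents(states, block_size):
    """Coalesce consecutive equal per-block states into (length, status)
    extents, so no actual bitmap needs to cross the wire."""
    extents = []
    for status in states:
        if extents and extents[-1][1] == status:
            # extend the current run
            extents[-1] = (extents[-1][0] + block_size, status)
        else:
            # start a new extent
            extents.append((block_size, status))
    return extents
```

For example, `states_to_extents([HOLE, HOLE, DATA, ZERO, ZERO], 512)` yields three extents: 1024 bytes of hole, 512 bytes of data, 1024 bytes of zeroes.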

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>   variants:
>>>   1: just allow server to restrict this as it wants (which was done in v3)
>>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>>      of descriptors.
>>>      2.1: add command flag, which will request only one descriptor
>>>           (otherwise, no restrictions from the client)
>>>      2.2: again, introduce extended nbd requests, and add field to
>>>           specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
> 
> the issue is: too many descriptors possible. So, (1) solves it. (2) is optional, just to simplify/optimize client side.

I think I'd prefer the server to return what it was asked for, and the client to deal with it. So either the client should be able to specify a maximum number of extents (and if we are extending the command structure, that's possible) or we deal with the client consuming and retrying unwanted extents. The reason for this is that it's unlikely the server can know a priori the number of extents which is the appropriate maximum for the client.

>>> +    The list of block status descriptors within the
>>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> +    of the file starting from specified *offset*, and the sum of the
>>> +    *length* fields of each descriptor MUST not be greater than the
>>> +    overall *length* of the request. This means that the server MAY
>>> +    return less data than required. However the server MUST return at
>>> +    least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
> 
> if the bitmap is 010101010101 we will have too many descriptors. For example, 16tb disk, 64k granularity -> 2G of descriptors payload.

Yep. And the cost of consuming and retrying is quite high. One option would be for the client to realise this is a possibility, and not request the entire extent map for a 16TB disk, as it might be very large! Even if the client worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per reply), this isn't going to noticeably increase the round trip timing. One issue here is that to determine a 'reasonable' size, the client needs to know the minimum length of any extent.

I think the answer is probably a 'maximum number of extents' in the request packet.

Of course with statuses in extent, the final extent could be represented as 'I don't know, break this bit into a separate request' status.
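On the client side, that 'break this bit into a separate request' idea could look something like the following (the sentinel value and the query interface are assumptions for illustration; the sketch also assumes the server's reply either covers the whole remaining range or ends with the sentinel):

```python
STATUS_UNKNOWN = 0xFFFF  # assumed sentinel status: "re-request this range"

def full_block_status(query, offset, length):
    """Collect extents for [offset, offset+length), reissuing the request
    whenever the server ends its (possibly truncated) reply with an
    'unknown' extent."""
    extents = []
    while length > 0:
        reply = query(offset, length)  # server may truncate its reply
        for ext_len, status in reply:
            if status == STATUS_UNKNOWN:
                break                  # re-request starting from here
            extents.append((ext_len, status))
            offset += ext_len
            length -= ext_len
        else:
            break                      # reply covered the whole range
    return extents
```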
Vladimir Sementsov-Ogievskiy Nov. 29, 2016, 3:07 p.m. UTC | #12
29.11.2016 17:52, Alex Bligh wrote:
> Vladimir,
>
>>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>>    cons: Two commands with almost same semantic and similar means?
>>>>    pros: However here is a good point of separating clearly defined and native
>>>>          for block devices GET_BLOCK_STATUS from user-driven and actually
>>>>          undefined data, called 'dirtyness'.
>>> I'm suggesting one generic 'read bitmap' command like you.
>> To support get_block_status in this general read_bitmap, we will need to define something like 'multibitmap', which allows several bits per chunk, as allocation data has two: zero and allocated.
> I think you are saying that for arbitrary 'bitmap' there might be more than one state. For instance, one might (in an allocation 'bitmap') have a hole, a non-hole-zero, or a non-hole-non-zero.
>
> In the spec I'd suggest, for one 'bitmap', we represent the output as extents. Each extent has a status. For the bitmap to be useful, at least two status need to be possible, but the above would have three. This could be internally implemented by the server as (a) a bitmap (with two bits per entry), (b) two bitmaps (possibly with different granularity), (c) something else (e.g. reading file extents, then if the data is allocated manually comparing it against zero).
>
> I should have put 'bitmap' in quotes in what I wrote because returning extents (as you suggested) is a good idea, and there need not be an actual bitmap.
>
>>>> 5. Number of status descriptors, sent by server, should be restricted
>>>>    variants:
>>>>    1: just allow server to restrict this as it wants (which was done in v3)
>>>>    2: (not excluding 1). Client specifies somehow the maximum for number
>>>>       of descriptors.
>>>>       2.1: add command flag, which will request only one descriptor
>>>>            (otherwise, no restrictions from the client)
>>>>       2.2: again, introduce extended nbd requests, and add field to
>>>>            specify this maximum
>>> I think some form of extended request is the way to go, but out of
>>> interest, what's the issue with as many descriptors being sent as it
>>> takes to encode the reply? The client can just consume the remainder
>>> (without buffering) and reissue the request at a later point for
>>> the areas it discarded.
>> the issue is: too many descriptors possible. So, (1) solves it. (2) is optional, just to simplify/optimize client side.
> I think I'd prefer the server to return what it was asked for, and the client to deal with it. So either the client should be able to specify a maximum number of extents (and if we are extending the command structure, that's possible) or we deal with the client consuming and retrying unwanted extents. The reason for this is that it's unlikely the server can know a priori the number of extents which is the appropriate maximum for the client.
>
>>>> +    The list of block status descriptors within the
>>>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>>> +    of the file starting from specified *offset*, and the sum of the
>>>> +    *length* fields of each descriptor MUST not be greater than the
>>>> +    overall *length* of the request. This means that the server MAY
>>>> +    return less data than required. However the server MUST return at
>>>> +    least one status descriptor
>>> I'm not sure I understand why that's useful. What should the client
>>> infer from the server refusing to provide information? We don't
>>> permit short reads etc.
>> if the bitmap is 010101010101 we will have too many descriptors. For example, 16tb disk, 64k granularity -> 2G of descriptors payload.
> Yep. And the cost of consuming and retrying is quite high. One option would be for the client to realise this is a possibility, and not request the entire extent map for a 16TB disk, as it might be very large! Even if the client worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per reply), this isn't going to noticeably increase the round trip timing. One issue here is that to determine a 'reasonable' size, the client needs to know the minimum length of any extent.

And with this approach we will in turn have the overhead of too many 
requests for 00000000 or 11111111 bitmaps.

>
> I think the answer is probably a 'maximum number of extents' in the request packet.
>
> Of course with statuses in extent, the final extent could be represented as 'I don't know, break this bit into a separate request' status.
>

With such a predefined status, we can postpone creating extended requests, 
have the number of extents restricted by the server, and have the sum of 
extent lengths be equal to the request length.
Wouter Verhelst Nov. 29, 2016, 3:17 p.m. UTC | #13
On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 29.11.2016 17:52, Alex Bligh wrote:
> > Vladimir,
> >> if the bitmap is 010101010101 we will have too many descriptors.
> >> For example, 16tb disk, 64k granularity -> 2G of descriptors
> >> payload.
> > Yep. And the cost of consuming and retrying is quite high. One
> > option would be for the client to realise this is a possibility, and
> > not request the entire extent map for a 16TB disk, as it might be
> > very large! Even if the client worked at e.g. a 64MB level (where
> > they'd get a maximum of 1024 extents per reply), this isn't going to
> > noticeably increase the round trip timing. One issue here is that to
> > determine a 'reasonable' size, the client needs to know the minimum
> > length of any extent.
> 
> and with this approach we will in turn have overhead of too many 
> requests for 00000000 or 11111111 bitmaps.

This is why my proposal suggests the server may stop sending extents
after X extents (sixteen, but that number can certainly change) have
been sent. After all, the server has a better view of what's going
to be costly in terms of "number of extents".

> > I think the answer is probably a 'maximum number of extents' in the request packet.
> >
> > Of course with statuses in extent, the final extent could be
> > represented as 'I don't know, break this bit into a separate
> > request' status.
> >
> 
> With such predefined status, we can postpone creating extended requests, 
> have number of extents restricted by server and have sum of extents 
> lengths be equal to request length.
Sergey Talantov Nov. 30, 2016, 10:41 a.m. UTC | #14
Hi, Wouter!

> Actually, come to think of that. What is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the ?
> block status thingies, right?

Overall, the most desired use case for this NBD extension is to allow third-party software to make incremental backups.
Acronis (a vendor of backup solutions) would support qemu backup if block status information were provided.


-----Original Message-----
From: Wouter Verhelst [mailto:w@uter.be] 
Sent: Sunday, November 27, 2016 22:17
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: nbd-general@lists.sourceforge.net; kwolf@redhat.com; qemu-devel@nongnu.org; pborzenkov@virtuozzo.com; stefanha@redhat.com; pbonzini@redhat.com; mpa@pengutronix.de; den@openvz.org
Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right.
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if they're useful, but I would prefer that they're also useful with NBD on its own, i.e., without requiring an initiation or correlation of some state through another protocol or network connection or whatever. If that's needed, that feels like I didn't do my job properly, if you get my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed to 
> query status of a particular range and read only those blocks of data 
> that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS extension 
> with one new NBD_CMD_BLOCK_STATUS command, a new structured reply 
> chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required during, 
> for example, incremental block device backup. To express this concept 
> via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS 
> to request dirtiness information rather than provisioning information; 
> however, with the current proposal, data dirtiness is only useful with 
> additional coordination outside of the NBD protocol (such as a way to 
> start and stop the server from tracking dirty sectors).  Future NBD 
> extensions may add commands to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET 
> LBA STATUS" command more closely, it has been chosen to return a list 
> of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of 
> a bitmap.
> 
> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> CC: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change 
> is removing the restriction that the sum of status descriptor lengths 
> must equal the requested length. I.e., let's permit the server to 
> reply with less data than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now  
> NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.

> And, finally, I've rebased this onto current state of 
> extension-structured-reply branch (which itself should be rebased on 
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the discussion, started about 
> half a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>    A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other writes happen without its knowledge. This may be because the client and server communicate out of band about state changes, or because the client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only talks about *current* state, and that concurrent writes by different clients may invalidate the state. This is true for NBD in general (i.e., concurrent read or write commands from other clients may confuse file systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>    A: 1: server replies with status descriptors of any size, granularity
>          is hidden from the client
>       2: dirty/allocated requests are separate and unrelated to each
>          other, so their granularities are not intersecting

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>    A: several variants:
>       1: id of bitmap is in flags field of request
>           pros: - simple
>           cons: - it's a hack. flags field is for other uses.
>                 - we'll have to map bitmap names to these "ids"
>       2: introduce extended nbd requests with variable length and exploit this
>          feature for BLOCK_STATUS command, specifying bitmap identifier.
>          pros: - looks like a true way
>          cons: - we have to create additional extension
>                - possible we have to create a map,
>                  {<QEMU bitmap name> <=> <NBD bitmap id>}
>       3: external tool should select which bitmap to export. So, in case of Qemu
>          it should be something like qmp command block-export-dirty-bitmap.
>          pros: - simple
>                - we can extend it to behave like (2) later
>          cons: - additional qmp command to implement (possibly, the lesser evil)
>          note: Hmm, the external tool can make a choice between allocated/dirty data too,
>                so, we can remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.

Downside of 3, though, is that it moves the definition of what the different states mean outside of the NBD protocol (i.e., the protocol messages are not entirely defined anymore, and their meaning depends on the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means *by default*, but then we can add a note that clients and servers can possibly define other meanings out of band if they want to.

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>    cons: Two commands with almost the same semantics and similar means?
>    pros: However here is a good point of separating clearly defined and native
>          for block devices GET_BLOCK_STATUS from user-driven and actually
>          undefined data, called 'dirtyness'.

Yeah, having them separate commands might be a bad idea indeed.

> 5. Number of status descriptors, sent by server, should be restricted
>    variants:
>    1: just allow server to restrict this as it wants (which was done in v3)
>    2: (not excluding 1). Client specifies somehow the maximum for number
>       of descriptors.
>       2.1: add command flag, which will request only one descriptor
>            (otherwise, no restrictions from the client)
>       2.2: again, introduce extended nbd requests, and add field to
>            specify this maximum

I think having a flag which requests just one descriptor can be useful, but I'm hesitant to add it unless it's actually going to be used; so in other words, I'll leave the decision on that bit to you.

> 6. A: What to do with unspecified flags (in request/reply)?
>    I think the normal variant is to make them reserved. (Server should
>    return EINVAL if it finds unknown bits; client should consider a reply
>    with unknown bits as an error)

Right, probably best to do that, yes.

> ======
> 
> Also, an idea on 2-4:
> 
>     As we say, that dirtiness is unknown for NBD, and external tool
>     should specify, manage and understand, which data is actually
>     transmitted, why not just call it user_data and leave status field
>     of reply chunk unspecified in this case?
> 
>     So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>     NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>     Eric's 'Block provisioning status' paragraph.  If it is set, we just
>     leave status field for some external... protocol? Who knows, what is
>     this user data.

Yes, this sounds like a reasonable approach.

>     Note: I'm not sure that I like this (my) proposal. It's just an
>     idea; maybe someone will like it.  And, I think, it represents what we
>     are trying to do more honestly.

Indeed.

>     Note2: the next step of generalization will be NBD_CMD_USER, with
>     variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

> Another idea, about backups themselves:
> 
>     Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the "dirty" thing seems to want to be. Whether something is allocated is just a special case of that.

Actually, come to think of that. What is the exact use case for this thing? I understand you're trying to create incremental backups of things, which would imply you don't write from the client that is getting the block status thingies, right? If so, how about:

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).
- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the definition of what a snapshot is back into the NBD protocol, without having to say "this could be anything", and without requiring connectivity over two ports for this to be useful (e.g., you could store the machine-readable form of the snapshot description into your backup program and match what they mean with what you're interested in at restore time, etc).

This wouldn't work if you're interested in new snapshots that get created once we've already moved into transmission, but hey.
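A toy model of that two-step flow may make it clearer. Everything here is hypothetical: the option and command names come from the proposal above, while the IDs, descriptions, and extent data are invented for illustration; the snapshot description format (e.g. a qemu json file) is deliberately left outside the NBD spec.

```python
# Hypothetical snapshot table; ID 0 is the implied "allocation status"
# base snapshot, others carry a server-defined description.
SNAPSHOTS = {
    0: "allocation status (implied base snapshot)",
    1: "dirty-since-backup0 (server-defined description)",
}

# Per-snapshot extent maps as (offset, length, status) tuples.
EXTENTS = {
    0: [(0, 4096, 1), (4096, 4096, 0)],   # allocated / hole
    1: [(0, 8192, 1)],                    # everything dirty
}

def opt_get_snapshots():
    """Negotiation phase (NBD_OPT_GET_SNAPSHOTS): list snapshot IDs."""
    return sorted(SNAPSHOTS)

def cmd_block_status(snapshot_id, offset, length):
    """Transmission phase (NBD_CMD_BLOCK_STATUS): clip the chosen
    snapshot's extents to the queried range."""
    out = []
    for off, ln, st in EXTENTS[snapshot_id]:
        lo, hi = max(off, offset), min(off + ln, offset + length)
        if lo < hi:
            out.append((lo, hi - lo, st))
    return out
```

The point of the model is only that the snapshot list is fetched once during negotiation, and every later status query names a snapshot ID.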

Thoughts?

>     Full backup: just do a structured read - it will show us which chunks
>     may be treated as zeroes.

Right.

>     Incremental backup: get the dirty bitmap (somehow, for example through
>     the user-defined part of the proposed command), then, for dirty blocks, read
>     them through structured read, so information about zero/unallocated
>     areas is available there.
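Right - and that incremental flow is just a small loop. In this sketch all functions are hypothetical stand-ins for real NBD client calls; a structured read is modeled as yielding (offset, data-or-None) chunks, with None marking a hole/zero chunk.

```python
def incremental_backup(dirty_extents, structured_read, write_backup):
    """Copy only the dirty ranges; skip hole/zero chunks reported by
    the structured read, so unallocated areas are never transferred."""
    for offset, length in dirty_extents:
        for chunk_off, data in structured_read(offset, length):
            if data is not None:   # None models a hole/zero chunk
                write_backup(chunk_off, data)
```

So no separate allocated/zero query is needed for the backup itself: the structured read already carries that information.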
> 
> For me all the variants above are OK. Let's finally choose something.
> 
> v2:
> v1 was: 
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html
> 
> Since then, we've added the STRUCTURED_REPLY extension, which 
> necessitates a rather larger rebase; I've also changed things to 
> rename the command 'NBD_CMD_BLOCK_STATUS', changed the request modes 
> to be determined by boolean flags (rather than by fixed values of the 
> 16-bit flags field), changed the reply status fields to be bitwise-or 
> values (with a default of 0 always being sane), and changed the 
> descriptor layout to drop an offset but to include a 32-bit status so 
> that the descriptor is nicely 8-byte aligned without padding.
> 
>  doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 154 insertions(+), 1 deletion(-)

[...]
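As a side note, the descriptor layout described in the changelog (a length plus a 32-bit status, with the offset dropped) packs into exactly eight bytes. A quick sketch, assuming network byte order as NBD uses elsewhere (the helper names are mine, not from the patch):

```python
import struct

def pack_block_status_descriptor(length, status):
    # 32-bit big-endian length followed by a 32-bit status bitmask:
    # an 8-byte descriptor that needs no padding.
    return struct.pack(">II", length, status)

def unpack_block_status_descriptor(raw):
    # Inverse of the above; returns (length, status).
    return struct.unpack(">II", raw)
```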

I'll commit this in a minute into a separate branch called "extension-blockstatus", under the understanding that changes are still required, as per above (i.e., don't assume that just because there's a branch I'm happy with the current result ;-)

Regards

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

John Snow Dec. 1, 2016, 11:42 p.m. UTC | #15
Hi Alex, let me try my hand at clarifying some points...

On 11/29/2016 07:57 AM, Alex Bligh wrote:
> Vladimir,
> 
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
> 
> Rather than comment on many of the points individually, the root
> of my confusion, and to some extent discomfort, about this
> proposal is 'who owns the meaning of the bitmaps'.
> 
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read-only disks, but the text talks about:
> 

I sent an earlier email in response to Wouter over our exact "goal" with
this spec extension that is a little more of an overview, but I will try
to address your questions specifically.

> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> 
> How can data be 'dirty' if it is static and unchangeable? (I thought)
> 

In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What
happens in this scenario would be that a snapshot of the data at the
time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is
RO. While a hypothetical backup client is busy copying data out of this
temporary view, new writes are coming in to the drive, but are not being
exposed through the NBD export.

(This goes into QEMU-specifics, but those new writes are dirtying a
version of the bitmap not intended to be exposed via the NBD channel.
NBD gets effectively a snapshot of both the bitmap AND the data.)

> I now think what you are talking about backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.
> 
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
> 

You know, this is a fair point. We have not (to my knowledge) yet
carefully considered the exact bitmap management scenario when NBD is
involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not
yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial
backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has
changed and would like to use NBD to do so, in contrast to QEMU's own
facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot
('fleecing') and exports this snapshot over NBD. The bitmap is
associated with the NBD export at this point at NBD server startup. (For
the sake of QEMU discussion, maybe this command is "blockdev-fleece")

(5) At this moment, the snapshot is static and represents the data at
the time the NBD server was started. The bitmap is also forked and
represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task,
whatever it was. (Making a new incremental backup? Just taking a peek at
some changed data? who knows.)

The point that's interesting here is what do we do with the two bitmaps
at this point? The data delta can be discarded (this was after all just
a lightweight read-only point-in-time snapshot) but the bitmap data
needs to be dealt with.

(A) In the case of "User made a new incremental backup," the bitmap that
got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

I don't advise a hybrid where "User copied some data, but not all" and
we need to partially clear *and* merge, but conceivably this could
happen, because the things we don't want to happen always will.

At this point maybe it's becoming obvious that it would actually be very
prudent to allow the NBD client itself to inform QEMU via the NBD
protocol which extents/blocks/etc. it is "done" with.

Maybe it *would* actually be useful if, in allowing us to add a
"dirty" bit to the NBD specification, we also allow users to clear those bits.

Then, whether the user was trying to do (A) or (B) or the unspeakable
amalgamation of both things, it's up to the user to clear the bits
desired, and QEMU can simply always merge the
bitmap fork upon the conclusion of the NBD fleecing exercise.

Maybe this would allow the dirty bit to have a bit more concrete meaning
for the NBD spec: "The bit stays dirty until the user clears it, and is
set when the matching block/extent/etc is written to."

With an exception that external management may cause the bits to clear.
(I.e., someone fiddles with the backing store in a way opaque to NBD,
e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
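That fork/clear/merge lifecycle can be sketched with bitmaps modeled as sets of dirty block indices. This is a toy model of the workflow described above, not QEMU code, and any NBD-level "clear" command remains hypothetical:

```python
class DirtyBitmap:
    """Toy dirty bitmap: a set of dirty block indices."""

    def __init__(self, dirty=()):
        self.dirty = set(dirty)

    def mark(self, block):
        # A write dirties the live bitmap.
        self.dirty.add(block)

    def fork(self):
        # Freeze the current state for the NBD export (step 4/5 above);
        # the live bitmap starts tracking new writes from scratch.
        frozen = DirtyBitmap(self.dirty)
        self.dirty = set()
        return frozen

    def clear(self, blocks):
        # The client says it is "done" with these blocks.
        self.dirty -= set(blocks)

    def merge(self, other):
        # Fleecing ends: merge whatever remains of the fork back in.
        self.dirty |= other.dirty
```

With this model, cases (A) and (B) collapse into one rule: the client clears what it copied, and QEMU always merges the fork back at the end.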

> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
> 
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
> 
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
> 

If we can't work out a more precise, semantically meaningful spec
extension then I am very happy with simply the ability to have a user
bit, or X-QEMU bits, etc etc etc.

> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
> 
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.
> 
> Some more nits:
> 
>> Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in master branch.
>>
>> And, finally, I've rebased this onto current state of
>> extension-structured-reply branch (which itself should be rebased on
>> master IMHO).
> 
> Each documentation branch should normally be branched off master unless
> it depends on another extension (in which case it will be branched from that).
> I haven't been rebasing them frequently as it can disrupt those working
> on the branches. There's only really an issue around rebasing where you
> depend on another branch.
> 
> 
>> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>>   A: 1: server replies with status descriptors of any size, granularity
>>         is hidden from the client
>>      2: dirty/allocated requests are separate and unrelated to each
>>         other, so their granularities are not intersecting
> 
> I'm OK with this, but note that you do actually mention a granularity
> of sorts in the spec (512 bytes) - I think you should replace that
> with the minimum block size.
> 
>> 3. Q: selecting of dirty bitmap to export
>>   A: several variants:
>>      1: id of bitmap is in flags field of request
>>          pros: - simple
>>          cons: - it's a hack. flags field is for other uses.
>>                - we'll have to map bitmap names to these "ids"
>>      2: introduce extended nbd requests with variable length and exploit this
>>         feature for BLOCK_STATUS command, specifying bitmap identifier.
>>         pros: - looks like a true way
>>         cons: - we have to create additional extension
>>               - possible we have to create a map,
>>                 {<QEMU bitmap name> <=> <NBD bitmap id>}
>>      3: external tool should select which bitmap to export. So, in case of Qemu
>>         it should be something like qmp command block-export-dirty-bitmap.
>>         pros: - simple
>>               - we can extend it to behave like (2) later
>>         cons: - additional qmp command to implement (possibly, the lesser evil)
>>         note: Hmm, the external tool can make a choice between allocated/dirty data too,
>>               so, we can remove the 'NBD_FLAG_STATUS_DIRTY' flag entirely.
> 
> Yes, this is all pretty horrible. I suspect we want to do something like (2),
> and permit extra data across (in my proposal, it would only be one byte to select
> the index). I suppose one could ask for a list of bitmaps.
> 

Having missed most of the discussion on v1/v2, is it a given that we
want in-band identification of bitmaps?

I guess this might depend very heavily on the nature of the definition
of the "dirty bit" in the NBD spec.

>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>   cons: Two commands with almost the same semantics and similar means?
>>   pros: However here is a good point of separating clearly defined and native
>>         for block devices GET_BLOCK_STATUS from user-driven and actually
>>         undefined data, called 'dirtyness'.
> 
> I'm suggesting one generic 'read bitmap' command like you.
> 
>> 5. Number of status descriptors, sent by server, should be restricted
>>   variants:
>>   1: just allow server to restrict this as it wants (which was done in v3)
>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>      of descriptors.
>>      2.1: add command flag, which will request only one descriptor
>>           (otherwise, no restrictions from the client)
>>      2.2: again, introduce extended nbd requests, and add field to
>>           specify this maximum
> 
> I think some form of extended request is the way to go, but out of
> interest, what's the issue with as many descriptors being sent as it
> takes to encode the reply? The client can just consume the remainder
> (without buffering) and reissue the request at a later point for
> the areas it discarded.
> 
>>
>> 6. A: What to do with unspecified flags (in request/reply)?
>>   I think the normal variant is to make them reserved. (Server should
>>   return EINVAL if it finds unknown bits; client should consider a reply
>>   with unknown bits as an error)
> 
> Yeah.
> 
>>
>> +
>> +* `NBD_CMD_BLOCK_STATUS`
>> +
>> +    A block status query request. Length and offset define the range
>> +    of interest. Clients SHOULD NOT use this request unless the server
> 
> MUST NOT is what we say elsewhere I believe.
> 
>> +    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
>> +    in turn requires the client to first negotiate structured replies.
>> +    For a successful return, the server MUST use a structured reply,
>> +    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> 
> Nit: are you saying that non-structured error replies are permissible?
> You're always/often going to get a non-structured (simple) error reply
> if the server doesn't support the command, but I think it would be fair to say the
> server MUST use a structured reply to NBD_CMD_BLOCK_STATUS if
> it supports the command. This is effectively what we say re NBD_CMD_READ.
> 
>> +
>> +    The list of block status descriptors within the
>> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>> +    of the file starting from specified *offset*, and the sum of the
>> +    *length* fields of each descriptor MUST not be greater than the
>> +    overall *length* of the request. This means that the server MAY
>> +    return less data than required. However the server MUST return at
>> +    least one status descriptor
> 
> I'm not sure I understand why that's useful. What should the client
> infer from the server refusing to provide information? We don't
> permit short reads etc.
> 
>> .  The server SHOULD use different
>> +    *status* values between consecutive descriptors, and SHOULD use
>> +    descriptor lengths that are an integer multiple of 512 bytes where
>> +    possible (the first and last descriptor of an unaligned query being
>> +    the most obvious places for an exception).
> 
> Surely better would be an integer multiple of the minimum block
> size. Being able to offer bitmap support at finer granularity than
> the absolute minimum block size helps no one, and if it were possible
> to support a 256 byte block size (I think some floppy disks had that)
> I see no reason not to support that as a granularity.
> 

Anyway, I hope I am being useful and not just more confounding. It seems
to me that we're having difficulty conveying precisely what it is we're
trying to accomplish, so I hope that I am making a good effort in
elaborating on our goals/requirements.

If not, well. You've got my address for hatemail :)

Thanks,
--John
Vladimir Sementsov-Ogievskiy Dec. 2, 2016, 9:16 a.m. UTC | #16
02.12.2016 02:42, John Snow wrote:
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
A currently existing example: a failed incremental backup.
Alex Bligh Dec. 2, 2016, 6:45 p.m. UTC | #17
John,

>> +Some storage formats and operations over such formats express a
>> +concept of data dirtiness. Whether the operation is block device
>> +mirroring, incremental block device backup or any other operation with
>> +a concept of data dirtiness, they all share a need to provide a list
>> +of ranges that this particular operation treats as dirty.
>> 
>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>> 
> 
> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
> the VM and cause the bitmap that QEMU manages to become dirty.
> 
> We intend to expose the ability to fleece dirty blocks via NBD. What
> happens in this scenario would be that a snapshot of the data at the
> time of the request is exported over NBD in a read-only manner.
> 
> In this way, the drive itself is R/W, but the "view" of it from NBD is
> RO. While a hypothetical backup client is busy copying data out of this
> temporary view, new writes are coming in to the drive, but are not being
> exposed through the NBD export.
> 
> (This goes into QEMU-specifics, but those new writes are dirtying a
> version of the bitmap not intended to be exposed via the NBD channel.
> NBD gets effectively a snapshot of both the bitmap AND the data.)

Thanks. That makes sense - or enough sense for me to carry on commenting!

>> I now think what you are talking about backing up a *snapshot* of a disk
>> that's running, where the disk itself was not connected using NBD? IE it's
>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>> an opaque state represented in a bitmap, which is binary metadata
>> at some particular level of granularity. It might as well be 'happiness'
>> or 'is coloured blue'. The NBD server would (normally) have no way of
>> manipulating this bitmap.
>> 
>> In previous comments, I said 'how come we can set the dirty bit through
>> writes but can't clear it?'. This (my statement) is now I think wrong,
>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>> state of the bitmap comes from whatever sets the bitmap which is outside
>> the scope of this protocol to transmit it.
>> 
> 
> You know, this is a fair point. We have not (to my knowledge) yet
> carefully considered the exact bitmap management scenario when NBD is
> involved in retrieving dirty blocks.
> 
> Humor me for a moment while I talk about a (completely hypothetical, not
> yet fully discussed) workflow for how I envision this feature.
> 
> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> backup is made, etc.
> 
> (2) As writes come in, QEMU's bitmap is dirtied.
> 
> (3) The user decides they want to root around to see what data has
> changed and would like to use NBD to do so, in contrast to QEMU's own
> facilities for dumping dirty blocks.
> 
> (4) A command is issued that creates a temporary, lightweight snapshot
> ('fleecing') and exports this snapshot over NBD. The bitmap is
> associated with the NBD export at this point at NBD server startup. (For
> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> 
> (5) At this moment, the snapshot is static and represents the data at
> the time the NBD server was started. The bitmap is also forked and
> represents only this snapshot. The live data and bitmap continue to change.
> 
> (6) Dirty blocks are queried and copied out via NBD.
> 
> (7) The user closes the NBD instance upon completion of their task,
> whatever it was. (Making a new incremental backup? Just taking a peek at
> some changed data? who knows.)
> 
> The point that's interesting here is what do we do with the two bitmaps
> at this point? The data delta can be discarded (this was after all just
> a lightweight read-only point-in-time snapshot) but the bitmap data
> needs to be dealt with.
> 
> (A) In the case of "User made a new incremental backup," the bitmap that
> got forked off to serve the NBD read should be discarded.
> 
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
> 
> I don't advise a hybrid where "User copied some data, but not all" and
> we need to partially clear *and* merge, but conceivably this could
> happen, because the things we don't want to happen always will.
> 
> At this point maybe it's becoming obvious that actually it would be very
> prudent to allow the NBD client itself to inform QEMU via the NBD
> protocol which extents/blocks/etc. it is "done" with.
> 
> Maybe it *would* actually be useful if, in allowing us to add a
> "dirty" bit to the NBD specification, we also allow users to clear those bits.
> 
> Then, whether the user was trying to do (A) or (B) or the unspeakable
> amalgamation of both things, it's up to the user to clear the bits
> desired, and QEMU can simply always merge the
> bitmap fork upon the conclusion of the NBD fleecing exercise.
> 
> Maybe this would allow the dirty bit to have a bit more concrete meaning
> for the NBD spec: "The bit stays dirty until the user clears it, and is
> set when the matching block/extent/etc is written to."
> 
> With an exception that external management may cause the bits to clear.
> (I.e., someone fiddles with the backing store in a way opaque to NBD,
> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

There is currently one possible "I've done with the entire bitmap"
signal, which is closing the connection. This has two obvious
problems. Firstly if used, it discards the entire bitmap (not bits).
Secondly, it makes recovery from a broken TCP session difficult
(as either you treat a dirty close as meaning the bitmap needs
to hang around, in which case you have a garbage collection issue,
or you treat it as needing to drop the bitmap, in which case you
can't recover).

I think in your plan the block status doesn't change once the bitmap
is forked. In that case, adding some command (optional) to change
the status of the bitmap (or simply to set a given extent to status X)
would be reasonable. Of course whether it's supported could be dependent
on the bitmap.
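
Such an optional command could reuse the ordinary NBD request header as-is, with the extent to modify carried in the offset/length fields. A minimal sketch in Python follows; the command name NBD_CMD_SET_BLOCK_STATUS and its number are invented here purely for illustration and are not part of the proposal (only NBD_CMD_BLOCK_STATUS is).

```python
import struct

NBD_REQUEST_MAGIC = 0x25609513            # standard NBD request magic
NBD_CMD_SET_BLOCK_STATUS = 42             # hypothetical command number

def build_set_status_request(handle, offset, length, flags=0):
    # Standard NBD transmission-phase request header:
    # 32-bit magic, 16-bit command flags, 16-bit type,
    # 64-bit handle, 64-bit offset, 32-bit length (big-endian).
    return struct.pack('>IHHQQI', NBD_REQUEST_MAGIC, flags,
                       NBD_CMD_SET_BLOCK_STATUS, handle, offset, length)

req = build_set_status_request(handle=7, offset=4096, length=512)
assert len(req) == 28                     # fixed-size request header
```

A real definition would also have to specify which status bits the command flags select and what the server's reply looks like.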

> Having missed most of the discussion on v1/v2, is it a given that we
> want in-band identification of bitmaps?
> 
> I guess this might depend very heavily on the nature of the definition
> of the "dirty bit" in the NBD spec.

I don't think it's a given. I think Wouter & I came up with it at
the same time as a way to abstract the bitmap/extent concept and
remove the need to specify a dirty bit at all (well, that's my excuse
anyway).

> Anyway, I hope I am being useful and just not more confounding. It seems
> to me that we're having difficulty conveying precisely what it is we're
> trying to accomplish, so I hope that I am making a good effort in
> elaborating on our goals/requirements.

Yes absolutely. I think part of the challenge is that you are quite
reasonably coming at it from the point of view of qemu's particular
need, and I'm coming at it from 'what should the nbd protocol look
like in general' position, having done lots of work on the protocol
docs (though I'm an occasional qemu contributor). So there's necessarily
a gap of approach to be bridged.

I'm overdue on a review of Wouter's latest patch (partly because I need
to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
but I think it's a bridge worth building.
John Snow Dec. 2, 2016, 8:39 p.m. UTC | #18
On 12/02/2016 01:45 PM, Alex Bligh wrote:
> John,
> 
>>> +Some storage formats and operations over such formats express a
>>> +concept of data dirtiness. Whether the operation is block device
>>> +mirroring, incremental block device backup or any other operation with
>>> +a concept of data dirtiness, they all share a need to provide a list
>>> +of ranges that this particular operation treats as dirty.
>>>
>>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>>>
>>
>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
>> the VM and cause the bitmap that QEMU manages to become dirty.
>>
>> We intend to expose the ability to fleece dirty blocks via NBD. What
>> happens in this scenario would be that a snapshot of the data at the
>> time of the request is exported over NBD in a read-only manner.
>>
>> In this way, the drive itself is R/W, but the "view" of it from NBD is
>> RO. While a hypothetical backup client is busy copying data out of this
>> temporary view, new writes are coming in to the drive, but are not being
>> exposed through the NBD export.
>>
>> (This goes into QEMU-specifics, but those new writes are dirtying a
>> version of the bitmap not intended to be exposed via the NBD channel.
>> NBD gets effectively a snapshot of both the bitmap AND the data.)
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!
> 

Whew! I'm glad.

>>> I now think what you are talking about backing up a *snapshot* of a disk
>>> that's running, where the disk itself was not connected using NBD? IE it's
>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>>> an opaque state represented in a bitmap, which is binary metadata
>>> at some particular level of granularity. It might as well be 'happiness'
>>> or 'is coloured blue'. The NBD server would (normally) have no way of
>>> manipulating this bitmap.
>>>
>>> In previous comments, I said 'how come we can set the dirty bit through
>>> writes but can't clear it?'. This (my statement) is now I think wrong,
>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>>> state of the bitmap comes from whatever sets the bitmap which is outside
>>> the scope of this protocol to transmit it.
>>>
>>
>> You know, this is a fair point. We have not (to my knowledge) yet
>> carefully considered the exact bitmap management scenario when NBD is
>> involved in retrieving dirty blocks.
>>
>> Humor me for a moment while I talk about a (completely hypothetical, not
>> yet fully discussed) workflow for how I envision this feature.
>>
>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
>> backup is made, etc.
>>
>> (2) As writes come in, QEMU's bitmap is dirtied.
>>
>> (3) The user decides they want to root around to see what data has
>> changed and would like to use NBD to do so, in contrast to QEMU's own
>> facilities for dumping dirty blocks.
>>
>> (4) A command is issued that creates a temporary, lightweight snapshot
>> ('fleecing') and exports this snapshot over NBD. The bitmap is
>> associated with the NBD export at this point at NBD server startup. (For
>> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
>>
>> (5) At this moment, the snapshot is static and represents the data at
>> the time the NBD server was started. The bitmap is also forked and
>> represents only this snapshot. The live data and bitmap continue to change.
>>
>> (6) Dirty blocks are queried and copied out via NBD.
>>
>> (7) The user closes the NBD instance upon completion of their task,
>> whatever it was. (Making a new incremental backup? Just taking a peek at
>> some changed data? who knows.)
>>
>> The point that's interesting here is what do we do with the two bitmaps
>> at this point? The data delta can be discarded (this was after all just
>> a lightweight read-only point-in-time snapshot) but the bitmap data
>> needs to be dealt with.
>>
>> (A) In the case of "User made a new incremental backup," the bitmap that
>> got forked off to serve the NBD read should be discarded.
>>
>> (B) In the case of "User just wanted to look around," the bitmap should
>> be merged back into the bitmap it was forked from.
>>
>> I don't advise a hybrid where "User copied some data, but not all" where
>> we need to partially clear *and* merge, but conceivably this could
>> happen, because the things we don't want to happen always will.
>>
>> At this point maybe it's becoming obvious that actually it would be very
>> prudent to allow the NBD client itself to inform QEMU via the NBD
>> protocol which extents/blocks/(etc) that it is "done" with.
>>
>> Maybe it *would* actually be useful if, in NBD allowing us to add a
>> "dirty" bit to the specification, we allow users to clear those bits.
>>
>> Then, whether the user was trying to do (A) or (B) or the unspeakable
>> amalgamation of both things, it's up to the user to clear the bits
>> desired and QEMU can do the simple task of simply always merging the
>> bitmap fork upon the conclusion of the NBD fleecing exercise.
>>
>> Maybe this would allow the dirty bit to have a bit more concrete meaning
>> for the NBD spec: "The bit stays dirty until the user clears it, and is
>> set when the matching block/extent/etc is written to."
>>
>> With an exception that external management may cause the bits to clear.
>> (I.e., someone fiddles with the backing store in a way opaque to NBD,
>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
> 
> There is currently one possible "I've done with the entire bitmap"
> signal, which is closing the connection. This has two obvious
> problems. Firstly if used, it discards the entire bitmap (not bits).
> Secondly, it makes recovery from a broken TCP session difficult
> (as either you treat a dirty close as meaning the bitmap needs
> to hang around, in which case you have a garbage collection issue,
> or you treat it as needing to drop the bitmap, in which case you
> can't recover).
> 

In my mind, I wasn't treating closing the connection as the end of the
point-in-time snapshot; that would be stopping the export.

I wouldn't advocate for a control channel (QEMU, here) clearing the
bitmap just because a client disappeared.

Either:

(A) QEMU clears the bitmap because the NBD export was *stopped*, or
(B) QEMU, acting as the NBD server, clears the bitmap as instructed by
the NBD client, if we admit a provision to clear bits from the NBD
protocol itself.

I don't think there's room for the NBD server (QEMU) deciding to clear
bits based on connection status. It has to be an explicit decision --
either via NBD or QMP.

> I think in your plan the block status doesn't change once the bitmap
> is forked. In that case, adding some command (optional) to change
> the status of the bitmap (or simply to set a given extent to status X)
> would be reasonable. Of course whether it's supported could be dependent
> on the bitmap.
> 

What I describe as "forking" was kind of a bad description. What really
happens when we have a divergence is that the bitmap with data is split
into two bitmaps that are related:

- A new bitmap that is created takes over for the old bitmap. This new
bitmap is empty. It records writes on the live version of the data.
- The old bitmap as it existed remains in a read-only state, and
describes some point-in-time snapshot view of the data.

In the case of an incremental backup, once we've made a backup of the
data, that read-only bitmap can actually be discarded without further
thought.

In the case of a failed incremental backup, or in the case of "I just
wanted to look and see what has changed, but wasn't prepared to reset
the counter yet," this bitmap gets merged back with the live bitmap as
if nothing ever happened.
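
The split-and-merge lifecycle above can be modelled with a toy sketch, using a Python set of dirty block indices as a stand-in for QEMU's real bitmap code (all names here are hypothetical):

```python
class DirtyBitmap:
    """Toy model of the lifecycle above: a set of dirty block indices
    standing in for QEMU's real bitmap code (names are hypothetical)."""

    def __init__(self, dirty=()):
        self.dirty = set(dirty)

    def fork(self):
        """At snapshot time: freeze the current contents as the
        read-only point-in-time record, and start the live bitmap
        over again, empty, to record new writes."""
        frozen = DirtyBitmap(self.dirty)
        self.dirty.clear()
        return frozen

    def merge(self, frozen):
        """Case (B): fold the frozen bitmap back into the live one,
        as if nothing ever happened. Case (A) is simply: discard it."""
        self.dirty |= frozen.dirty


live = DirtyBitmap({1, 5, 9})
snap = live.fork()           # frozen view: {1, 5, 9}; live is now empty
live.dirty.add(7)            # a new write dirties the live bitmap
live.merge(snap)             # case (B): live.dirty == {1, 5, 7, 9}
```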

ANYWAY, allowing the NBD client to request bits be cleared has an
obvious use case even for QEMU, IMO -- which is, the NBD client itself
gains the ability to, without relying on a control plane to the server,
decide for itself if it is going to "make a backup" or "just look around."

The client gains the ability to leave the bitmap alone (QEMU will
re-merge it later once the snapshot is closed) or the ability to clear
it ("I made my backup, we're done with this.")

That usefulness would allow us to have an explicit dirty bit mechanism
directly in NBD, IMO, because:

(1) A RW NBD server has enough information to mark a bit dirty
(2) Since there exists an in-spec mechanism to reset the bitmap, the
dirty bit is meaningful to the server and the client
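
To make points (1) and (2) concrete, here is a hedged sketch of the client loop this would enable, against a purely in-memory stand-in for the export. block_status() mimics the spirit of the proposed NBD_CMD_BLOCK_STATUS extent reply; clear_status() is the hypothetical reset command discussed in this thread, not anything in the current proposal:

```python
class FakeExport:
    """In-memory stand-in for an NBD export with a dirty bitmap.
    Everything here is hypothetical scaffolding for the sketch."""

    def __init__(self, data, dirty_ranges):
        self.data = bytearray(data)
        self.dirty = set()                       # dirty byte indices
        for off, ln in dirty_ranges:
            self.dirty.update(range(off, off + ln))

    def block_status(self, offset, length):
        """Return (offset, length, is_dirty) extents covering the range,
        in the spirit of the proposed NBD_CMD_BLOCK_STATUS reply."""
        limit = min(offset + length, len(self.data))
        extents, start = [], offset
        for pos in range(offset, limit + 1):
            boundary = (pos == limit or
                        (pos in self.dirty) != (start in self.dirty))
            if boundary and pos > start:
                extents.append((start, pos - start, start in self.dirty))
                start = pos
        return extents

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def clear_status(self, offset, length):
        # Point (2): the in-spec reset mechanism the client invokes.
        self.dirty.difference_update(range(offset, offset + length))


def backup_dirty(export, size, chunk=8):
    """Copy out only the dirty extents, clearing each bit once copied
    ("I made my backup, we're done with this")."""
    copied, offset = {}, 0
    while offset < size:                 # size must not exceed the data
        for off, ln, dirty in export.block_status(offset, chunk):
            if dirty:
                copied[off] = export.read(off, ln)
                export.clear_status(off, ln)
            offset = off + ln
    return copied
```

A client that wanted to "just look around" instead would run the same loop without the clear_status() call, leaving the server free to merge the bitmap back afterwards.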

>> Having missed most of the discussion on v1/v2, is it a given that we
>> want in-band identification of bitmaps?
>>
>> I guess this might depend very heavily on the nature of the definition
>> of the "dirty bit" in the NBD spec.
> 
> I don't think it's a given. I think Wouter & I came up with it at
> the same time as a way to abstract the bitmap/extent concept and
> remove the need to specify a dirty bit at all (well, that's my excuse
> anyway).
> 

OK. We do certainly support multiple bitmaps being active at a time in
QEMU, but I had personally always envisioned that you'd associate them
one-at-a-time when starting the NBD export of a particular device.

I don't have a use case in my head where two distinct bitmaps being
exposed simultaneously offer any particular benefit, but maybe there is
something. I'm sure there is.

I will leave this aspect of it more to you NBD folks. I think QEMU could
cope with either.

(Vladimir, am I wrong? Do you have thoughts on this in particular? I
haven't thought through this aspect of it very much.)

>> Anyway, I hope I am being useful and just not more confounding. It seems
>> to me that we're having difficulty conveying precisely what it is we're
>> trying to accomplish, so I hope that I am making a good effort in
>> elaborating on our goals/requirements.
> 
> Yes absolutely. I think part of the challenge is that you are quite
> reasonably coming at it from the point of view of qemu's particular
> need, and I'm coming at it from 'what should the nbd protocol look
> like in general' position, having done lots of work on the protocol
> docs (though I'm an occasional qemu contributor). So there's necessarily
> a gap of approach to be bridged.
> 

Yeah, I understand quite well that we need to make sure the NBD spec is
sane and useful in a QEMU-agnostic way, so my goal here is just to help
elucidate our needs to enable you to reach a good consensus.

> I'm overdue on a review of Wouter's latest patch (partly because I need
> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> but I think it's a bridge worth building.
> 

Same. Thank you for your patience!

Cheers,
--js
Alex Bligh Dec. 3, 2016, 11:08 a.m. UTC | #19
> On 2 Dec 2016, at 20:39, John Snow <jsnow@redhat.com> wrote:
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The obvious case is an allocation bitmap and a dirty bitmap.

It's possible one might want more than one dirty bitmap at once.
Perhaps two sorts of backup, or perhaps live migration of storage
backed by NBD, or perhaps inspecting the *state* of live migration
via NBD and a bitmap, or perhaps determining which extents in
a QCOW image are present in that image itself (as opposed to the
image on which it is based).

I tried to pick some QEMU-like ones, but I am sure there are
examples that would work outside of QEMU.
Vladimir Sementsov-Ogievskiy Dec. 5, 2016, 8:36 a.m. UTC | #20
02.12.2016 23:39, John Snow wrote:
>
> On 12/02/2016 01:45 PM, Alex Bligh wrote:
>> John,
>>
>>>> +Some storage formats and operations over such formats express a
>>>> +concept of data dirtiness. Whether the operation is block device
>>>> +mirroring, incremental block device backup or any other operation with
>>>> +a concept of data dirtiness, they all share a need to provide a list
>>>> +of ranges that this particular operation treats as dirty.
>>>>
>>>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>>>>
>>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
>>> the VM and cause the bitmap that QEMU manages to become dirty.
>>>
>>> We intend to expose the ability to fleece dirty blocks via NBD. What
>>> happens in this scenario would be that a snapshot of the data at the
>>> time of the request is exported over NBD in a read-only manner.
>>>
>>> In this way, the drive itself is R/W, but the "view" of it from NBD is
>>> RO. While a hypothetical backup client is busy copying data out of this
>>> temporary view, new writes are coming in to the drive, but are not being
>>> exposed through the NBD export.
>>>
>>> (This goes into QEMU-specifics, but those new writes are dirtying a
>>> version of the bitmap not intended to be exposed via the NBD channel.
>>> NBD gets effectively a snapshot of both the bitmap AND the data.)
>> Thanks. That makes sense - or enough sense for me to carry on commenting!
>>
> Whew! I'm glad.
>
>>>> I now think what you are talking about backing up a *snapshot* of a disk
>>>> that's running, where the disk itself was not connected using NBD? IE it's
>>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>>>> an opaque state represented in a bitmap, which is binary metadata
>>>> at some particular level of granularity. It might as well be 'happiness'
>>>> or 'is coloured blue'. The NBD server would (normally) have no way of
>>>> manipulating this bitmap.
>>>>
>>>> In previous comments, I said 'how come we can set the dirty bit through
>>>> writes but can't clear it?'. This (my statement) is now I think wrong,
>>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>>>> state of the bitmap comes from whatever sets the bitmap which is outside
>>>> the scope of this protocol to transmit it.
>>>>
>>> You know, this is a fair point. We have not (to my knowledge) yet
>>> carefully considered the exact bitmap management scenario when NBD is
>>> involved in retrieving dirty blocks.
>>>
>>> Humor me for a moment while I talk about a (completely hypothetical, not
>>> yet fully discussed) workflow for how I envision this feature.
>>>
>>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
>>> backup is made, etc.
>>>
>>> (2) As writes come in, QEMU's bitmap is dirtied.
>>>
>>> (3) The user decides they want to root around to see what data has
>>> changed and would like to use NBD to do so, in contrast to QEMU's own
>>> facilities for dumping dirty blocks.
>>>
>>> (4) A command is issued that creates a temporary, lightweight snapshot
>>> ('fleecing') and exports this snapshot over NBD. The bitmap is
>>> associated with the NBD export at this point at NBD server startup. (For
>>> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
>>>
>>> (5) At this moment, the snapshot is static and represents the data at
>>> the time the NBD server was started. The bitmap is also forked and
>>> represents only this snapshot. The live data and bitmap continue to change.
>>>
>>> (6) Dirty blocks are queried and copied out via NBD.
>>>
>>> (7) The user closes the NBD instance upon completion of their task,
>>> whatever it was. (Making a new incremental backup? Just taking a peek at
>>> some changed data? who knows.)
>>>
>>> The point that's interesting here is what do we do with the two bitmaps
>>> at this point? The data delta can be discarded (this was after all just
>>> a lightweight read-only point-in-time snapshot) but the bitmap data
>>> needs to be dealt with.
>>>
>>> (A) In the case of "User made a new incremental backup," the bitmap that
>>> got forked off to serve the NBD read should be discarded.
>>>
>>> (B) In the case of "User just wanted to look around," the bitmap should
>>> be merged back into the bitmap it was forked from.
>>>
>>> I don't advise a hybrid where "User copied some data, but not all" where
>>> we need to partially clear *and* merge, but conceivably this could
>>> happen, because the things we don't want to happen always will.
>>>
>>> At this point maybe it's becoming obvious that actually it would be very
>>> prudent to allow the NBD client itself to inform QEMU via the NBD
>>> protocol which extents/blocks/(etc) that it is "done" with.
>>>
>>> Maybe it *would* actually be useful if, in NBD allowing us to add a
>>> "dirty" bit to the specification, we allow users to clear those bits.
>>>
>>> Then, whether the user was trying to do (A) or (B) or the unspeakable
>>> amalgamation of both things, it's up to the user to clear the bits
>>> desired and QEMU can do the simple task of simply always merging the
>>> bitmap fork upon the conclusion of the NBD fleecing exercise.
>>>
>>> Maybe this would allow the dirty bit to have a bit more concrete meaning
>>> for the NBD spec: "The bit stays dirty until the user clears it, and is
>>> set when the matching block/extent/etc is written to."
>>>
>>> With an exception that external management may cause the bits to clear.
>>> (I.e., someone fiddles with the backing store in a way opaque to NBD,
>>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
>> There is currently one possible "I've done with the entire bitmap"
>> signal, which is closing the connection. This has two obvious
>> problems. Firstly if used, it discards the entire bitmap (not bits).
>> Secondly, it makes recovery from a broken TCP session difficult
>> (as either you treat a dirty close as meaning the bitmap needs
>> to hang around, in which case you have a garbage collection issue,
>> or you treat it as needing to drop the bitmap, in which case you
>> can't recover).
>>
> In my mind, I wasn't treating closing the connection as the end of the
> point-in-time snapshot; that would be stopping the export.
>
> I wouldn't advocate for a control channel (QEMU, here) clearing the
> bitmap just because a client disappeared.
>
> Either:
>
> (A) QEMU clears the bitmap because the NBD export was *stopped*, or
> (B) QEMU, acting as the NBD server, clears the bitmap as instructed by
> the NBD client, if we admit a provision to clear bits from the NBD
> protocol itself.
>
> I don't think there's room for the NBD server (QEMU) deciding to clear
> bits based on connection status. It has to be an explicit decision --
> either via NBD or QMP.
>
>> I think in your plan the block status doesn't change once the bitmap
>> is forked. In that case, adding some command (optional) to change
>> the status of the bitmap (or simply to set a given extent to status X)
>> would be reasonable. Of course whether it's supported could be dependent
>> on the bitmap.
>>
> What I describe as "forking" was kind of a bad description. What really
> happens when we have a divergence is that the bitmap with data is split
> into two bitmaps that are related:
>
> - A new bitmap that is created takes over for the old bitmap. This new
> bitmap is empty. It records writes on the live version of the data.
> - The old bitmap as it existed remains in a read-only state, and
> describes some point-in-time snapshot view of the data.
>
> In the case of an incremental backup, once we've made a backup of the
> data, that read-only bitmap can actually be discarded without further
> thought.
>
> In the case of a failed incremental backup, or in the case of "I just
> wanted to look and see what has changed, but wasn't prepared to reset
> the counter yet," this bitmap gets merged back with the live bitmap as
> if nothing ever happened.
>
> ANYWAY, allowing the NBD client to request bits be cleared has an
> obvious use case even for QEMU, IMO -- which is, the NBD client itself
> gains the ability to, without relying on a control plane to the server,
> decide for itself if it is going to "make a backup" or "just look around."
>
> The client gains the ability to leave the bitmap alone (QEMU will
> re-merge it later once the snapshot is closed) or the ability to clear
> it ("I made my backup, we're done with this.")
>
> That usefulness would allow us to have an explicit dirty bit mechanism
> directly in NBD, IMO, because:
>
> (1) A RW NBD server has enough information to mark a bit dirty
> (2) Since there exists an in-spec mechanism to reset the bitmap, the
> dirty bit is meaningful to the server and the client
>
>>> Having missed most of the discussion on v1/v2, is it a given that we
>>> want in-band identification of bitmaps?
>>>
>>> I guess this might depend very heavily on the nature of the definition
>>> of the "dirty bit" in the NBD spec.
>> I don't think it's a given. I think Wouter & I came up with it at
>> the same time as a way to abstract the bitmap/extent concept and
>> remove the need to specify a dirty bit at all (well, that's my excuse
>> anyway).
>>
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
>
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.
>
> I will leave this aspect of it more to you NBD folks. I think QEMU could
> cope with either.
>
> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
> haven't thought through this aspect of it very much.)

I'm ok with either too.

Yes, with online external backup (fleecing), the bitmap is already
selected in QEMU. And that is the most interesting case, anyway.

For offline external backup, when we have a read-only disk, it may have
several bitmaps, for example for different backup frequencies, and it
would not be bad for the client to have the ability to choose. But
again, we can select the exported bitmap through QMP (even the same
fleecing scheme would work, with the overhead of creating an empty
delta).

>
>>> Anyway, I hope I am being useful and just not more confounding. It seems
>>> to me that we're having difficulty conveying precisely what it is we're
>>> trying to accomplish, so I hope that I am making a good effort in
>>> elaborating on our goals/requirements.
>> Yes absolutely. I think part of the challenge is that you are quite
>> reasonably coming at it from the point of view of qemu's particular
>> need, and I'm coming at it from 'what should the nbd protocol look
>> like in general' position, having done lots of work on the protocol
>> docs (though I'm an occasional qemu contributor). So there's necessarily
>> a gap of approach to be bridged.
>>
> Yeah, I understand quite well that we need to make sure the NBD spec is
> sane and useful in a QEMU-agnostic way, so my goal here is just to help
> elucidate our needs to enable you to reach a good consensus.
>
>> I'm overdue on a review of Wouter's latest patch (partly because I need
>> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
>> but I think it's a bridge worth building.
>>
> Same. Thank you for your patience!
>
> Cheers,
> --js
Wouter Verhelst Dec. 6, 2016, 1:32 p.m. UTC | #21
Hi John

Sorry for the late reply; the weekend was busy, and so was Monday.

On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
> On 12/02/2016 01:45 PM, Alex Bligh wrote:
> > John,
> > 
> >>> +Some storage formats and operations over such formats express a
> >>> +concept of data dirtiness. Whether the operation is block device
> >>> +mirroring, incremental block device backup or any other operation with
> >>> +a concept of data dirtiness, they all share a need to provide a list
> >>> +of ranges that this particular operation treats as dirty.
> >>>
> >>> How can data be 'dirty' if it is static and unchangeable? (I thought)
> >>>
> >>
> >> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
> >> the VM and cause the bitmap that QEMU manages to become dirty.
> >>
> >> We intend to expose the ability to fleece dirty blocks via NBD. What
> >> happens in this scenario would be that a snapshot of the data at the
> >> time of the request is exported over NBD in a read-only manner.
> >>
> >> In this way, the drive itself is R/W, but the "view" of it from NBD is
> >> RO. While a hypothetical backup client is busy copying data out of this
> >> temporary view, new writes are coming in to the drive, but are not being
> >> exposed through the NBD export.
> >>
> >> (This goes into QEMU-specifics, but those new writes are dirtying a
> >> version of the bitmap not intended to be exposed via the NBD channel.
> >> NBD gets effectively a snapshot of both the bitmap AND the data.)
> > 
> > Thanks. That makes sense - or enough sense for me to carry on commenting!
> > 
> 
> Whew! I'm glad.
> 
> >>> I now think what you are talking about backing up a *snapshot* of a disk
> >>> that's running, where the disk itself was not connected using NBD? IE it's
> >>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> >>> an opaque state represented in a bitmap, which is binary metadata
> >>> at some particular level of granularity. It might as well be 'happiness'
> >>> or 'is coloured blue'. The NBD server would (normally) have no way of
> >>> manipulating this bitmap.
> >>>
> >>> In previous comments, I said 'how come we can set the dirty bit through
> >>> writes but can't clear it?'. This (my statement) is now I think wrong,
> >>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> >>> state of the bitmap comes from whatever sets the bitmap which is outside
> >>> the scope of this protocol to transmit it.
> >>>
> >>
> >> You know, this is a fair point. We have not (to my knowledge) yet
> >> carefully considered the exact bitmap management scenario when NBD is
> >> involved in retrieving dirty blocks.
> >>
> >> Humor me for a moment while I talk about a (completely hypothetical, not
> >> yet fully discussed) workflow for how I envision this feature.
> >>
> >> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> >> backup is made, etc.
> >>
> >> (2) As writes come in, QEMU's bitmap is dirtied.
> >>
> >> (3) The user decides they want to root around to see what data has
> >> changed and would like to use NBD to do so, in contrast to QEMU's own
> >> facilities for dumping dirty blocks.
> >>
> >> (4) A command is issued that creates a temporary, lightweight snapshot
> >> ('fleecing') and exports this snapshot over NBD. The bitmap is
> >> associated with the NBD export at this point at NBD server startup. (For
> >> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> >>
> >> (5) At this moment, the snapshot is static and represents the data at
> >> the time the NBD server was started. The bitmap is also forked and
> >> represents only this snapshot. The live data and bitmap continue to change.
> >>
> >> (6) Dirty blocks are queried and copied out via NBD.
> >>
> >> (7) The user closes the NBD instance upon completion of their task,
> >> whatever it was. (Making a new incremental backup? Just taking a peek at
> >> some changed data? who knows.)
> >>
> >> The point that's interesting here is what do we do with the two bitmaps
> >> at this point? The data delta can be discarded (this was after all just
> >> a lightweight read-only point-in-time snapshot) but the bitmap data
> >> needs to be dealt with.
> >>
> >> (A) In the case of "User made a new incremental backup," the bitmap that
> >> got forked off to serve the NBD read should be discarded.
> >>
> >> (B) In the case of "User just wanted to look around," the bitmap should
> >> be merged back into the bitmap it was forked from.
> >>
> >> I don't advise a hybrid where "User copied some data, but not all" where
> >> we need to partially clear *and* merge, but conceivably this could
> >> happen, because the things we don't want to happen always will.
> >>
> >> At this point maybe it's becoming obvious that actually it would be very
> >> prudent to allow the NBD client itself to inform QEMU via the NBD
> >> protocol which extents/blocks/(etc) that it is "done" with.
> >>
> >> Maybe it *would* actually be useful if, in NBD allowing us to add a
> >> "dirty" bit to the specification, we allow users to clear those bits.
> >>
> >> Then, whether the user was trying to do (A) or (B) or the unspeakable
> >> amalgamation of both things, it's up to the user to clear the bits
> >> desired and QEMU can do the simple task of simply always merging the
> >> bitmap fork upon the conclusion of the NBD fleecing exercise.
> >>
> >> Maybe this would allow the dirty bit to have a bit more concrete meaning
> >> for the NBD spec: "The bit stays dirty until the user clears it, and is
> >> set when the matching block/extent/etc is written to."
> >>
> >> With an exception that external management may cause the bits to clear.
> >> (I.e., someone fiddles with the backing store in a way opaque to NBD,
> >> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
> > 
> > There is currently one possible "I've done with the entire bitmap"
> > signal, which is closing the connection. This has two obvious
> > problems. Firstly if used, it discards the entire bitmap (not bits).
> > Secondly, it makes recovery from a broken TCP session difficult
> > (as either you treat a dirty close as meaning the bitmap needs
> > to hang around, in which case you have a garbage collection issue,
> > or you treat it as needing to drop the bitmap, in which case you
> > can't recover).
> > 
> 
> In my mind, I wasn't treating closing the connection as the end of the
> point-in-time snapshot; that would be stopping the export.
> 
> I wouldn't advocate for a control channel (QEMU, here) clearing the
> bitmap just because a client disappeared.
> 
> Either:
> 
> (A) QEMU clears the bitmap because the NBD export was *stopped*, or
> (B) QEMU, acting as the NBD server, clears the bitmap as instructed by
> the NBD client, if we admit a provision to clear bits from the NBD
> protocol itself.
> 
> I don't think there's room for the NBD server (QEMU) deciding to clear
> bits based on connection status. It has to be an explicit decision --
> either via NBD or QMP.
> 
> > I think in your plan the block status doesn't change once the bitmap
> > is forked. In that case, adding some command (optional) to change
> > the status of the bitmap (or simply to set a given extent to status X)
> > would be reasonable. Of course whether it's supported could be dependent
> > on the bitmap.
> > 
> 
> What I describe as "forking" was kind of a bad description. What really
> happens when we have a divergence is that the bitmap with data is split
> into two bitmaps that are related:
> 
> - A new bitmap that is created takes over for the old bitmap. This new
> bitmap is empty. It records writes on the live version of the data.
> - The old bitmap as it existed remains in a read-only state, and
> describes some point-in-time snapshot view of the data.
> 
> In the case of an incremental backup, once we've made a backup of the
> data, that read-only bitmap can actually be discarded without further
> thought.
> 
> In the case of a failed incremental backup, or in the case of "I just
> wanted to look and see what has changed, but wasn't prepared to reset
> the counter yet," this bitmap gets merged back with the live bitmap as
> if nothing ever happened.
> 
> ANYWAY, allowing the NBD client to request bits be cleared has an
> obvious use case even for QEMU, IMO -- which is, the NBD client itself
> gains the ability to, without relying on a control plane to the server,
> decide for itself if it is going to "make a backup" or "just look around."
> 
> The client gains the ability to leave the bitmap alone (QEMU will
> re-merge it later once the snapshot is closed) or the ability to clear
> it ("I made my backup, we're done with this.")
> 
> That usefulness would allow us to have an explicit dirty bit mechanism
> directly in NBD, IMO, because:
> 
> (1) A RW NBD server has enough information to mark a bit dirty
> (2) Since there exists an in-spec mechanism to reset the bitmap, the
> dirty bit is meaningful to the server and the client

While I can see that the ability to manipulate metadata might have
advantages for certain use cases, I don't think that the ability to
*inspect* metadata should require the ability to manipulate it in any
way.

So I'd like to finish the block_status extension before moving on to
manipulation :)

> >> Having missed most of the discussion on v1/v2, is it a given that we
> >> want in-band identification of bitmaps?
> >>
> >> I guess this might depend very heavily on the nature of the definition
> >> of the "dirty bit" in the NBD spec.
> > 
> > I don't think it's a given. I think Wouter & I came up with it at
> > the same time as a way to abstract the bitmap/extent concept and
> > remove the need to specify a dirty bit at all (well, that's my excuse
> > anyway).
> > 
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The ability to do something does not in any way imply the requirement to
do the same :-)

The idea is that the client negotiates one or more forms of metadata
information from the server that it might be interested in, and then
asks the server for that information for a given extent in which it
has an interest.

The protocol spec does not define what that metadata is (beyond the "is
allocated" one that we define in the spec currently, and possibly
something else in the future). So if qemu only cares about just one type
of metadata, there's no reason why it should *have* to export more than
one type.

> I will leave this aspect of it more to you NBD folks. I think QEMU could
> cope with either.
> 
> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
> haven't thought through this aspect of it very much.)
> 
> >> Anyway, I hope I am being useful and just not more confounding. It seems
> >> to me that we're having difficulty conveying precisely what it is we're
> >> trying to accomplish, so I hope that I am making a good effort in
> >> elaborating on our goals/requirements.
> > 
> > Yes absolutely. I think part of the challenge is that you are quite
> > reasonably coming at it from the point of view of qemu's particular
> > need, and I'm coming at it from 'what should the nbd protocol look
> > like in general' position, having done lots of work on the protocol
> > docs (though I'm an occasional qemu contributor). So there's necessarily
> > a gap of approach to be bridged.
> > 
> 
> Yeah, I understand quite well that we need to make sure the NBD spec is
> sane and useful in a QEMU-agnostic way, so my goal here is just to help
> elucidate our needs to enable you to reach a good consensus.

Right, that's why I was reluctant to merge the original spec as it
stood.

> > I'm overdue on a review of Wouter's latest patch (partly because I need
> > to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
> > but I think it's a bridge worth building.
> > 
> 
> Same. Thank you for your patience!

I can do some updates given a few of the suggestions that were made on
this list (no guarantee when that will happen), but if people are
interested in reviewing things in the mean time, be my guest...
John Snow Dec. 6, 2016, 4:39 p.m. UTC | #22
On 12/06/2016 08:32 AM, Wouter Verhelst wrote:
> Hi John
> 
> Sorry for the late reply; weekend was busy, and so was monday.
> 

No problems.

> On Fri, Dec 02, 2016 at 03:39:08PM -0500, John Snow wrote:
>> On 12/02/2016 01:45 PM, Alex Bligh wrote:
>>> John,
>>>
>>>>> +Some storage formats and operations over such formats express a
>>>>> +concept of data dirtiness. Whether the operation is block device
>>>>> +mirroring, incremental block device backup or any other operation with
>>>>> +a concept of data dirtiness, they all share a need to provide a list
>>>>> +of ranges that this particular operation treats as dirty.
>>>>>
>>>>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>>>>>
>>>>
>>>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
>>>> the VM and cause the bitmap that QEMU manages to become dirty.
>>>>
>>>> We intend to expose the ability to fleece dirty blocks via NBD. What
>>>> happens in this scenario would be that a snapshot of the data at the
>>>> time of the request is exported over NBD in a read-only manner.
>>>>
>>>> In this way, the drive itself is R/W, but the "view" of it from NBD is
>>>> RO. While a hypothetical backup client is busy copying data out of this
>>>> temporary view, new writes are coming in to the drive, but are not being
>>>> exposed through the NBD export.
>>>>
>>>> (This goes into QEMU-specifics, but those new writes are dirtying a
>>>> version of the bitmap not intended to be exposed via the NBD channel.
>>>> NBD gets effectively a snapshot of both the bitmap AND the data.)
>>>
>>> Thanks. That makes sense - or enough sense for me to carry on commenting!
>>>
>>
>> Whew! I'm glad.
>>
>>>>> I now think what you are talking about backing up a *snapshot* of a disk
>>>>> that's running, where the disk itself was not connected using NBD? IE it's
>>>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
>>>>> an opaque state represented in a bitmap, which is binary metadata
>>>>> at some particular level of granularity. It might as well be 'happiness'
>>>>> or 'is coloured blue'. The NBD server would (normally) have no way of
>>>>> manipulating this bitmap.
>>>>>
>>>>> In previous comments, I said 'how come we can set the dirty bit through
>>>>> writes but can't clear it?'. This (my statement) is now I think wrong,
>>>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>>>>> state of the bitmap comes from whatever sets the bitmap which is outside
>>>>> the scope of this protocol to transmit it.
>>>>>
>>>>
>>>> You know, this is a fair point. We have not (to my knowledge) yet
>>>> carefully considered the exact bitmap management scenario when NBD is
>>>> involved in retrieving dirty blocks.
>>>>
>>>> Humor me for a moment while I talk about a (completely hypothetical, not
>>>> yet fully discussed) workflow for how I envision this feature.
>>>>
>>>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
>>>> backup is made, etc.
>>>>
>>>> (2) As writes come in, QEMU's bitmap is dirtied.
>>>>
>>>> (3) The user decides they want to root around to see what data has
>>>> changed and would like to use NBD to do so, in contrast to QEMU's own
>>>> facilities for dumping dirty blocks.
>>>>
>>>> (4) A command is issued that creates a temporary, lightweight snapshot
>>>> ('fleecing') and exports this snapshot over NBD. The bitmap is
>>>> associated with the NBD export at this point at NBD server startup. (For
>>>> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
>>>>
>>>> (5) At this moment, the snapshot is static and represents the data at
>>>> the time the NBD server was started. The bitmap is also forked and
>>>> represents only this snapshot. The live data and bitmap continue to change.
>>>>
>>>> (6) Dirty blocks are queried and copied out via NBD.
>>>>
>>>> (7) The user closes the NBD instance upon completion of their task,
>>>> whatever it was. (Making a new incremental backup? Just taking a peek at
>>>> some changed data? who knows.)
>>>>
>>>> The point that's interesting here is what do we do with the two bitmaps
>>>> at this point? The data delta can be discarded (this was after all just
>>>> a lightweight read-only point-in-time snapshot) but the bitmap data
>>>> needs to be dealt with.
>>>>
>>>> (A) In the case of "User made a new incremental backup," the bitmap that
>>>> got forked off to serve the NBD read should be discarded.
>>>>
>>>> (B) In the case of "User just wanted to look around," the bitmap should
>>>> be merged back into the bitmap it was forked from.
>>>>
>>>> I don't advise a hybrid where "User copied some data, but not all" where
>>>> we need to partially clear *and* merge, but conceivably this could
>>>> happen, because the things we don't want to happen always will.
>>>>
>>>> At this point maybe it's becoming obvious that actually it would be very
>>>> prudent to allow the NBD client itself to inform QEMU via the NBD
>>>> protocol which extents/blocks/(etc) that it is "done" with.
>>>>
>>>> Maybe it *would* actually be useful if, in NBD allowing us to add a
>>>> "dirty" bit to the specification, we allow users to clear those bits.
>>>>
>>>> Then, whether the user was trying to do (A) or (B) or the unspeakable
>>>> amalgamation of both things, it's up to the user to clear the bits
>>>> desired and QEMU can do the simple task of simply always merging the
>>>> bitmap fork upon the conclusion of the NBD fleecing exercise.
>>>>
>>>> Maybe this would allow the dirty bit to have a bit more concrete meaning
>>>> for the NBD spec: "The bit stays dirty until the user clears it, and is
>>>> set when the matching block/extent/etc is written to."
>>>>
>>>> With an exception that external management may cause the bits to clear.
>>>> (I.e., someone fiddles with the backing store in a way opaque to NBD,
>>>> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)
>>>
>>> There is currently one possible "I've done with the entire bitmap"
>>> signal, which is closing the connection. This has two obvious
>>> problems. Firstly if used, it discards the entire bitmap (not bits).
>>> Secondly, it makes recovery from a broken TCP session difficult
>>> (as either you treat a dirty close as meaning the bitmap needs
>>> to hang around, in which case you have a garbage collection issue,
>>> or you treat it as needing to drop the bitmap, in which case you
>>> can't recover).
>>>
>>
>> In my mind, I wasn't treating closing the connection as the end of the
>> point-in-time snapshot; that would be stopping the export.
>>
>> I wouldn't advocate for a control channel (QEMU, here) clearing the
>> bitmap just because a client disappeared.
>>
>> Either:
>>
>> (A) QEMU clears the bitmap because the NBD export was *stopped*, or
>> (B) QEMU, acting as the NBD server, clears the bitmap as instructed by
>> the NBD client, if we admit a provision to clear bits from the NBD
>> protocol itself.
>>
>> I don't think there's room for the NBD server (QEMU) deciding to clear
>> bits based on connection status. It has to be an explicit decision --
>> either via NBD or QMP.
>>
>>> I think in your plan the block status doesn't change once the bitmap
>>> is forked. In that case, adding some command (optional) to change
>>> the status of the bitmap (or simply to set a given extent to status X)
>>> would be reasonable. Of course whether it's supported could be dependent
>>> on the bitmap.
>>>
>>
>> What I describe as "forking" was kind of a bad description. What really
>> happens when we have a divergence is that the bitmap with data is split
>> into two bitmaps that are related:
>>
>> - A new bitmap that is created takes over for the old bitmap. This new
>> bitmap is empty. It records writes on the live version of the data.
>> - The old bitmap as it existed remains in a read-only state, and
>> describes some point-in-time snapshot view of the data.
>>
>> In the case of an incremental backup, once we've made a backup of the
>> data, that read-only bitmap can actually be discarded without further
>> thought.
>>
>> In the case of a failed incremental backup, or in the case of "I just
>> wanted to look and see what has changed, but wasn't prepared to reset
>> the counter yet," this bitmap gets merged back with the live bitmap as
>> if nothing ever happened.
>>
>> ANYWAY, allowing the NBD client to request bits be cleared has an
>> obvious use case even for QEMU, IMO -- which is, the NBD client itself
>> gains the ability to, without relying on a control plane to the server,
>> decide for itself if it is going to "make a backup" or "just look around."
>>
>> The client gains the ability to leave the bitmap alone (QEMU will
>> re-merge it later once the snapshot is closed) or the ability to clear
>> it ("I made my backup, we're done with this.")
>>
>> That usefulness would allow us to have an explicit dirty bit mechanism
>> directly in NBD, IMO, because:
>>
>> (1) A RW NBD server has enough information to mark a bit dirty
>> (2) Since there exists an in-spec mechanism to reset the bitmap, the
>> dirty bit is meaningful to the server and the client
> 
> While I can see that the ability to manipulate metadata might have
> advantages for certain use cases, I don't think that the ability to
> *inspect* metadata should require the ability to manipulate it in any
> way.
> 
> So I'd like to finish the block_status extension before moving on to
> manipulation :)
> 

Understood. By admitting that manipulation of bits may have a purpose
in NBD, I was trying to clarify the exact meaning of the dirty bit.

It had seemed to me that leaving the manipulation of those bits
unspecified (or disallowed) within NBD necessarily meant that their
meaning existed entirely out-of-spec for NBD, which could be a show-stopper.

So I was attempting to show that by allowing their manipulation in NBD,
they'd have full in-spec meaning. It all depends on what exactly we name
those bits and how you'd like to define their meaning. There are many
ways we can expose this information in a useful manner, so this was just
another option.

>>>> Having missed most of the discussion on v1/v2, is it a given that we
>>>> want in-band identification of bitmaps?
>>>>
>>>> I guess this might depend very heavily on the nature of the definition
>>>> of the "dirty bit" in the NBD spec.
>>>
>>> I don't think it's a given. I think Wouter & I came up with it at
>>> the same time as a way to abstract the bitmap/extent concept and
>>> remove the need to specify a dirty bit at all (well, that's my excuse
>>> anyway).
>>>
>>
>> OK. We do certainly support multiple bitmaps being active at a time in
>> QEMU, but I had personally always envisioned that you'd associate them
>> one-at-a-time when starting the NBD export of a particular device.
>>
>> I don't have a use case in my head where two distinct bitmaps being
>> exposed simultaneously offer any particular benefit, but maybe there is
>> something. I'm sure there is.
> 
> The ability to do something does not in any way imply the requirement to
> do the same :-)
> 

Hence the ask. It's not something QEMU currently needs, but I am a bad
psychic.

> The idea is that the client negotiates one or more forms of metadata
> information from the server that it might be interested in, and then
> asks the server for that information for a given extent in which it
> has an interest.
> 

via the NBD protocol, you mean?

> The protocol spec does not define what that metadata is (beyond the "is
> allocated" one that we define in the spec currently, and possibly
> something else in the future). So if qemu only cares about just one type
> of metadata, there's no reason why it should *have* to export more than
> one type.
> 
>> I will leave this aspect of it more to you NBD folks. I think QEMU could
>> cope with either.
>>
>> (Vladimir, am I wrong? Do you have thoughts on this in particular? I
>> haven't thought through this aspect of it very much.)
>>
>>>> Anyway, I hope I am being useful and just not more confounding. It seems
>>>> to me that we're having difficulty conveying precisely what it is we're
>>>> trying to accomplish, so I hope that I am making a good effort in
>>>> elaborating on our goals/requirements.
>>>
>>> Yes absolutely. I think part of the challenge is that you are quite
>>> reasonably coming at it from the point of view of qemu's particular
>>> need, and I'm coming at it from 'what should the nbd protocol look
>>> like in general' position, having done lots of work on the protocol
>>> docs (though I'm an occasional qemu contributor). So there's necessarily
>>> a gap of approach to be bridged.
>>>
>>
>> Yeah, I understand quite well that we need to make sure the NBD spec is
>> sane and useful in a QEMU-agnostic way, so my goal here is just to help
>> elucidate our needs to enable you to reach a good consensus.
> 
> Right, that's why I was reluctant to merge the original spec as it
> stood.
> 
>>> I'm overdue on a review of Wouter's latest patch (partly because I need
>>> to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
>>> but I think it's a bridge worth building.
>>>
>>
>> Same. Thank you for your patience!
> 
> I can do some updates given a few of the suggestions that were made on
> this list (no guarantee when that will happen), but if people are
> interested in reviewing things in the mean time, be my guest...
> 

I'll take a look at your revision(s), thanks.

--js

Patch

======

Also, an idea on 2-4:

    Since we say that dirtiness is opaque to NBD, and an external tool
    should specify, manage, and understand which data is actually
    transmitted, why not just call it user_data and leave the status
    field of the reply chunk unspecified in this case?

    So, I propose one flag for NBD_CMD_BLOCK_STATUS:
    NBD_FLAG_STATUS_USER. If it is clear, then the behaviour is defined
    by Eric's 'Block provisioning status' paragraph.  If it is set, we
    just leave the status field to some external... protocol? Who knows
    what this user data is.

    Note: I'm not sure that I like this proposal of mine. It's just an
    idea; maybe someone will like it.  And I think it represents what we
    are trying to do more honestly.

    Note2: the next step of generalization will be NBD_CMD_USER, with
    variable request size, structured reply and no definition :)


Another idea, about backups themselves:

    Why do we need allocated/zero status for backup? IMHO we don't.

    Full backup: just do a structured read - it will show us which
    chunks may be treated as zeroes.

    Incremental backup: get the dirty bitmap (somehow, for example
    through the user-defined part of the proposed command), then read
    the dirty blocks through structured read, so information about
    zero/unallocated areas is included.

For me all the variants above are OK. Let's finally choose something.
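The incremental-backup variant above can be sketched from the client
side. This is an illustrative Python sketch, not part of the proposal:
it assumes the `NBD_STATE_CLEAN` bit (bit 2) as defined in the patch,
and turns the (length, status) descriptors of a dirtiness query into
(offset, length) ranges that a backup client would then read:

```python
NBD_STATE_CLEAN = 1 << 2  # bit 2: block has NOT been written since tracking began

def dirty_ranges(offset, descriptors):
    """Convert (length, status) descriptors from an NBD_CMD_BLOCK_STATUS
    dirtiness query into (offset, length) ranges to read for a backup.

    Descriptors describe consecutive extents starting at `offset`.
    A clear NBD_STATE_CLEAN bit means dirty-or-unknown, so we read it.
    """
    ranges = []
    for length, status in descriptors:
        if not (status & NBD_STATE_CLEAN):
            ranges.append((offset, length))
        offset += length
    return ranges

# Example: 4 KiB clean, 8 KiB dirty, 4 KiB clean -> read only the middle extent
print(dirty_ranges(0, [(4096, NBD_STATE_CLEAN),
                       (8192, 0),
                       (4096, NBD_STATE_CLEAN)]))
# [(4096, 8192)]
```

Note that because the reply may cover less than the requested length,
a real client would loop, re-issuing the query from the end of the
last descriptor until the range of interest is covered.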

v2:
v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html

Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather larger rebase; I've also changed things
to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
modes to be determined by boolean flags (rather than by fixed
values of the 16-bit flags field), changed the reply status fields
to be bitwise-or values (with a default of 0 always being sane),
and changed the descriptor layout to drop an offset but to include
a 32-bit status so that the descriptor is nicely 8-byte aligned
without padding.
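As an aside, the 8-byte descriptor layout described above (a 32-bit
length plus a 32-bit status, network byte order as elsewhere in NBD)
can be illustrated with a small Python sketch; the helper name and
example values here are hypothetical, not part of the patch:

```python
import struct

# One block status descriptor: 32-bit extent length + 32-bit status flags,
# big-endian, 8 bytes total so the payload is naturally 8-byte aligned.
DESCRIPTOR = struct.Struct(">II")

def parse_descriptors(payload):
    """Split an NBD_REPLY_TYPE_BLOCK_STATUS payload into (length, status) pairs."""
    if len(payload) == 0 or len(payload) % DESCRIPTOR.size:
        raise ValueError("payload must be a positive multiple of 8 bytes")
    return [DESCRIPTOR.unpack_from(payload, off)
            for off in range(0, len(payload), DESCRIPTOR.size)]

# Example: 64 KiB of allocated data followed by a 64 KiB zero hole
NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1
payload = (DESCRIPTOR.pack(65536, 0) +
           DESCRIPTOR.pack(65536, NBD_STATE_HOLE | NBD_STATE_ZERO))
print(parse_descriptors(payload))
# [(65536, 0), (65536, 3)]
```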

 doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index 1c2fa5b..253b6f1 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -755,6 +755,8 @@  The field has the following format:
   MUST leave this flag clear if structured replies have not been
   negotiated. Clients MUST NOT set the `NBD_CMD_FLAG_DF` request
   flag unless this transmission flag is set.
+- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
+  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -1036,6 +1038,10 @@  interpret the "length" bytes of payload.
   64 bits: offset (unsigned)  
   32 bits: hole size (unsigned, MUST be nonzero)  
 
+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+  Defined by the experimental extension `BLOCK_STATUS`; see below.
+
 All error chunk types have bit 15 set, and begin with the same
 *error*, *message length*, and optional *message* fields as
 `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
@@ -1070,7 +1076,7 @@  remaining structured fields at the end.
   were sent earlier in the structured reply, the server SHOULD NOT
   send multiple distinct offsets that lie within the bounds of a
   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
 
   The payload is structured as:
 
@@ -1247,6 +1253,11 @@  The following request types exist:
 
     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md).
 
+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    Defined by the experimental `BLOCK_STATUS` extension; see below.
+
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -1331,6 +1342,148 @@  written as branches which can be merged into master if and
 when those extensions are promoted to the normative version
 of the document in the master branch.
 
+### `BLOCK_STATUS` extension
+
+With the availability of sparse storage formats, it is often necessary to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.
+
+To provide such a class of information, the `BLOCK_STATUS` extension
+adds a new `NBD_CMD_BLOCK_STATUS` command which returns a list of
+ranges with their respective states.  This extension is not available
+unless the client also negotiates the `STRUCTURED_REPLY` extension.
+
+* `NBD_FLAG_SEND_BLOCK_STATUS`
+
+    The server SHOULD set this transmission flag to 1 if structured
+    replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
+    request is supported.
+
+* `NBD_REPLY_TYPE_BLOCK_STATUS`
+
+    *length* MUST be a positive integer multiple of 8.  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST NOT be greater than the
+    length of the original request.  This chunk type MUST appear at most
+    once in a structured reply. Valid as a reply to
+    `NBD_CMD_BLOCK_STATUS`.
+
+    The payload is structured as a list of one or more descriptors,
+    each with this layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    The definition of the status flags is determined based on the
+    flags present in the original request.
+
+* `NBD_CMD_BLOCK_STATUS`
+
+    A block status query request. Length and offset define the range
+    of interest. Clients SHOULD NOT use this request unless the server
+    set `NBD_FLAG_SEND_BLOCK_STATUS` in the transmission flags, which
+    in turn requires the client to first negotiate structured replies.
+    For a successful return, the server MUST use a structured reply,
+    containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
+    of the file starting from specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST NOT be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than requested. However, the server MUST return at
+    least one status descriptor.  The server SHOULD use different
+    *status* values between consecutive descriptors, and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception). The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of either a simple reply or an error
+    chunk.  However, if the error does not involve invalid usage (such
+    as a request beyond the bounds of the file), a server MAY reply
+    with a single block status descriptor with *length* matching the
+    requested length, and *status* of 0 rather than reporting the
+    error.
+
+    The type of information requested by the client is determined by
+    the request flags, as follows:
+
+    1. Block provisioning status
+
+    Upon receiving an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_CMD_FLAG_STATUS_DIRTY` clear, the server MUST return the
+    provisioning status of the device, where the status field of each
+    descriptor is determined by the following bits (all four
+    combinations of these two bits are possible):
+
+      - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
+        (and future writes to that area may cause fragmentation or
+        encounter an `ENOSPC` error); if clear, the block is allocated
+        or the server could not otherwise determine its status.  Note
+        that the use of `NBD_CMD_TRIM` is related to this status, but
+        that the server MAY report a hole even where trim has not been
+        requested, and also that a server MAY report allocation even
+        where a trim has been requested.
+      - `NBD_STATE_ZERO` (bit 1); if set, the block contents read as
+        all zeroes; if clear, the block contents are not known.  Note
+        that the use of `NBD_CMD_WRITE_ZEROES` is related to this
+        status, but that the server MAY report zeroes even where write
+        zeroes has not been requested, and also that a server MAY
+        report unknown content even where write zeroes has been
+        requested.
+
+    The client SHOULD NOT read from an area that has both
+    `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
+
+    2. Block dirtiness status
+
+    This command is meant to operate in tandem with other (non-NBD)
+    channels to the server.  Generally, a "dirty" block is a block
+    that has been written to by someone, but the exact meaning of "has
+    been written" is left to the implementation.  For example, a
+    virtual machine monitor could provide a (non-NBD) command to start
+    tracking blocks written by the virtual machine.  A backup client
+    can then connect to an NBD server provided by the virtual machine
+    monitor and use `NBD_CMD_BLOCK_STATUS` with the
+    `NBD_CMD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
+    blocks that the virtual machine has changed.
+
+    An implementation that doesn't track the "dirtiness" state of
+    blocks MUST either fail this command with `EINVAL`, or mark all
+    blocks as dirty in the descriptor that it returns.  Upon receiving
+    an `NBD_CMD_BLOCK_STATUS` command with the flag
+    `NBD_CMD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
+    status of the device, where the status field of each descriptor is
+    determined by the following bit:
+
+      - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
+        portion of the file that is still clean because it has not
+        been written; if clear, the block represents a portion of the
+        file that is dirty, or where the server could not otherwise
+        determine its status.
+
+A client MAY close the connection if it detects that the server has
+sent an invalid chunk (such as lengths in the
+`NBD_REPLY_TYPE_BLOCK_STATUS` chunk summing to more than the requested
+length).
+The server SHOULD return `EINVAL` if it receives a `BLOCK_STATUS`
+request including one or more sectors beyond the size of the device.
+
+The extension adds the following new command flag:
+
+- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
+  SHOULD be set to 1 if the client wants to request dirtiness status
+  rather than provisioning status.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently