diff mbox

[Nbd,v5] doc: Add NBD_CMD_BLOCK_STATUS extension

Message ID 20161212204313.anhjire2gkwd6gzi@grep.be
State New
Headers show

Commit Message

Wouter Verhelst Dec. 12, 2016, 8:43 p.m. UTC
v4 -> v5:

- Editorial improvements as suggested by Eric (with the exception of
  absence/absense, which would really need to be done on master if valid)
- Rework so that the NBD_STATE_* constants are defined in the "Metadata
  contexts" section *only*, where the `base:allocation` context is
  defined. The other sections now define how to send and receive
  metadata in general, without going into any detail on what that
  metadata might be (which is not their concern).

Comments

Stefan Hajnoczi Dec. 13, 2016, 10:37 a.m. UTC | #1
On Mon, Dec 12, 2016 at 09:43:13PM +0100, Wouter Verhelst wrote:
> +- `NBD_OPT_LIST_META_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +    commands during the transmission phase.
> +
> +    If the query string is syntactically invalid, the server SHOULD send
> +    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> +    but finds no metadata contexts, the server MUST send a single
> +    reply of type `NBD_REP_ACK`.
> +
> +    This option MUST NOT be requested unless structured replies have
> +    been negotiated first. If a client attempts to do so, a server
> +    SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +    Data:
> +    - 32 bits, length of export name
> +    - String, name of export for which we wish to list or select metadata
> +      contexts.
> +    - 32 bits, length of query
> +    - String, query to select a subset of the available metadata
> +      contexts. If this is not specified (i.e., the "length of query"
> +      field is 0 and no query is sent), then the server MUST send all
> +      the metadata contexts it knows about. If specified, this query
> +      string MUST start with a name that uniquely identifies a server
> +      implementation; e.g., the reference implementation that
> +      accompanies this document would support query strings starting
> +      with 'nbd-server:'

What about querying "base:" if the NBD spec adds more standard metadata
contexts in the future?  "base:" does not "uniquely identify a server
implementation" but it should still be possible to query it so that
additional contexts can be added to this spec in the future.

Is the query matching algorithm defined anywhere?  I guess it is
strncmp(query, context, strlen(query)) but I'm not sure from this text.

Another common syntax would use an asterisk wildcard ('*') so that it's
possible to differentiate between full matches and (wildcard) partial
matches.

> +
> +    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
> +    followed by `NBD_REP_ACK`. The metadata context ID in these replies
> +    is reserved and SHOULD be set to zero; clients SHOULD disregard it.
> +
> +- `NBD_OPT_SET_META_CONTEXT` (11)
> +
> +    Change the set of active metadata contexts. Issuing this command
> +    replaces all previously-set metadata contexts; clients must ensure
> +    that all metadata contexts they're interested in are selected with
> +    the final query that they sent.
> +
> +    Data:
> +    - 32 bits, length of query
> +    - String, query to select metadata contexts. The syntax of this
> +      query is implementation-defined, except that it MUST start with a
> +      namespace. This namespace may be one of the following:
> +        - `base:`, for metadata contexts defined by this document;
> +        - `nbd-server:`, for metadata contexts defined by the
> +          implementation that accompanies this document (none
> +          currently);
> +        - `x-*:`, where `*` can be replaced by any random string not
> +          containing colons, for local experiments. This SHOULD NOT be
> +          used by metadata contexts that are expected to e widely used.

s/ e / be /

> +        - third-party implementations can register additional
> +          namespaces by simple request to the mailinglist.

Does this mean it is possible to activate multiple contexts but only if
their namespace is identical?  That seems like an arbitrary limitation.

In other words, the spec suggests you can activate nbd-server:a and
nbd-server:b but you cannot activate base:a and nbd-server:a.

I'd prefer a programming model where the contexts don't need to be set
for the lifetime of the connection.  Instead the client passes a bitmap
(64-bits?) of contexts along with each BLOCK_STATUS command.  That way
the client can select contexts on a per-command basis.  It's unlikely
that more than 64 contexts need to be available at once but I admit it's
an arbitrary limitation...

I guess you've considered this model and decided it's better to
negotiate upfront, it's wasteful to enable multiple contexts and then
discard the response data on the client side because only a subset is
needed for this command invocation.

> +
> +    The server MUST reply with a number of `NBD_REP_META_CONTEXT`
> +    replies, one for each selected metadata context, each with a unique
> +    metadata context ID. It is not an error if a
> +    `NBD_OPT_SET_META_CONTEXT` option does not select any metadata
> +    context, provided the client then does not attempt to issue
> +    `NBD_CMD_BLOCK_STATUS` commands.
> +
>  #### Option reply types
>  
>  These values are used in the "reply type" field, sent by the server
> @@ -882,7 +989,7 @@ during option haggling in the fixed newstyle negotiation.
>      information is available, or when sending data related to the option
>      (in the case of `NBD_OPT_LIST`) has finished. No data.
>  
> -* `NBD_REP_SERVER` (2)
> +- `NBD_REP_SERVER` (2)
>  
>      A description of an export. Data:
>  
> @@ -897,10 +1004,18 @@ during option haggling in the fixed newstyle negotiation.
>        particular client request, this field is defined to be a string
>        suitable for direct display to a human being.
>  
> -* `NBD_REP_INFO` (3)
> +- `NBD_REP_INFO` (3)
>  
>      Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>  
> +- `NBD_REP_META_CONTEXT` (4)
> +
> +    A description of a metadata context. Data:
> +
> +    - 32 bits, NBD metadata context ID.
> +    - String, name of the metadata context. This is not required to be
> +      a human-readable string, but it MUST be valid UTF-8 data.
> +
>  There are a number of error reply types, all of which are denoted by
>  having bit 31 set. All error replies MAY have some data set, in which
>  case that data is an error message string suitable for display to the user.
> @@ -938,15 +1053,62 @@ case that data is an error message string suitable for display to the user.
>  
>      Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>  
> -* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
> +* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
>  
>      The server is unwilling to continue negotiation as it is in the
>      process of being shut down.
>  
> -* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
> +* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)

Are these separate fixes that slipped into the patch?  Not directly
relevant to BLOCK_STATUS.
Alex Bligh Dec. 13, 2016, 2:09 p.m. UTC | #2
Wouter,

Some comments below:

> On 12 Dec 2016, at 20:43, Wouter Verhelst <w@uter.be> wrote:
...
> +## Metadata querying
> +
> +With the availability of sparse storage formats, it is often needed to
> +query the status of a particular range and read only those blocks of
> +data that are actually present on the block device.
> +
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.

I'm not sure this necessarily has anything to do with sparse storage
formats. For instance the 'dirty' concept would work perfectly well
on a non-sparse format. I propose the following to replace the two
paragraphs above:

"It is often helpful for the client to be able to query the status
of a range of blocks. The nature of the status that can be
queried is in part implementation dependent. For instance,
the status might represent:

* in a sparse storage format, whether the relevant blocks are
  actually present on the backing device for the export; or

* whether the relevant blocks are 'dirty'; some storage formats
  and operations over such formats express a concept of data dirtiness.
  Whether the operation is block device mirroring, incremental block
  device backup or any other operation with a concept of data dirtiness,
  they all share a need to provide a list of ranges that this
  particular operation treats as dirty.
"

> +
> +To provide such classes of information, the NBD protocol has a generic
> +framework for querying metadata; however, its use must first be
> +negotiated, and one or more metadata contexts must be selected.
> +
> +The procedure works as follows:
> +
> +- First, during negotiation,

insert: "if the client wishes to query metadata,"

> +  the client MUST select one or more metadata
> +  contexts with the `NBD_OPT_SET_META_CONTEXT` command. If needed, the client
> +  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts.

"to list contexts that the server supports"

> +- During transmission, a client can then indicate interest in metadata
> +  for a given region by way of the `NBD_CMD_BLOCK_STATUS` command, where
> +  *offset* and *length* indicate the area of interest. The server MUST
> +  then respond with the requested information, for all contexts which
> +  were selected during negotiation. For every metadata context, the
> +  server sends one set of extent chunks, where the sizes of the
> +  extents MUST be less than or equal to the length as specified in the
> +  request. Each extent comes with a *flags* field, the semantics of
> +  which are defined by the metadata context.
> +- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured reply
> +  of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +
> +A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
> +nonzero

"non-zero"

> number of metadata contexts during negotiation. Servers SHOULD
> +reply to clients doing so anyway with `EINVAL`.

I had difficulty parsing "so". I think you mean "Servers SHOULD
reply to clients sending `NBD_CMD_BLOCK_STATUS without
selecting metadata contexts with"

But actually, why do we need to be so mean? Why can't we assume
that if NBD_OPT_SET_META_CONTEXT is not sent, then all the metadata
contexts should be selected?

Of course arguably this means why might need to restore
NBD_FLAG_SEND_BLOCK_STATUS.

> +
> +The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent by a

"be sent as a"

> +structured reply; this implies that in order to use metadata querying,
> +structured replies MUST be negotiated first.
> +
> +This standard defines exactly one metadata context; it is called
> +`base:allocation`, and it provides information on the basic allocation
> +status of extents (that is, whether they are allocated at all in a
> +sparse file context).
> +
> ## Values
> 
> This section describes the value and meaning of constants (other than
> @@ -768,8 +814,6 @@ The field has the following format:
>   to that command to the client. In the absense of this flag, clients
>   SHOULD NOT multiplex their commands over more than one connection to
>   the export.
> -- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
> -  `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> 
> Clients SHOULD ignore unknown flags.
> 
> @@ -871,6 +915,69 @@ of the newstyle negotiation.
> 
>     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 
> +- `NBD_OPT_LIST_META_CONTEXT` (10)
> +
> +    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
> +    followed by an `NBD_REP_ACK`. If a server replies to such a request
> +    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS

backticks: `NBD_CMD_BLOCK_STATUS`

> +    commands during the transmission phase.

"otherwise, the client MUST NOT send NBD_CMD_BLOCK_STATUS
messages."

But actually isn't the telling question whether NBD_OPT_SET_META_CONTEXT
works?

Actually, as this could be sent more than once, I think this whole
thing would be better phrased as:

"A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
at least once, and the final time it was sent, the server
responded without an error."

obviously this would be better under _SET_ than _LIST_,
but the sentence can go entirely from here.

Equally obviously, if we are making _SET_ optional (as
lack of _SET_ means that all the contexts are selected)
we just gate this on `NBD_FLAG_SEND_BLOCK_STATUS`

> +
> +    If the query string is syntactically invalid, the server SHOULD send
> +    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> +    but finds no metadata contexts, the server MUST send a single
> +    reply of type `NBD_REP_ACK`.
> +
> +    This option MUST NOT be requested unless structured replies have
> +    been negotiated first. If a client attempts to do so, a server
> +    SHOULD send `NBD_REP_ERR_INVALID`.
> +
> +    Data:
> +    - 32 bits, length of export name
> +    - String, name of export for which we wish to list or select metadata
> +      contexts.

Surely as this is _LIST_ this should read "list" not "list or select"?

> +    - 32 bits, length of query
> +    - String, query to select a subset of the available metadata
> +      contexts. If this is not specified (i.e., the "length of query"
> +      field is 0 and no query is sent), then the server MUST send all
> +      the metadata contexts it knows about.


> If specified, this query
> +      string MUST start with a name that uniquely identifies a server
> +      implementation; e.g., the reference implementation that
> +      accompanies this document would support query strings starting
> +      with 'nbd-server:'

I'm not sure this is correct (particularly now you've adopted my "X-"
suggestion), as for instance "base:" does not started with a server
implementation.

Better:

"If specified, the server MUST return the zero or more contexts
whose names (including the namespace) consist of or start with
the query string. For instance a query string of 'nbd-server:'
would return all contexts within the 'nbd-server' namespace,
and a string 'base:a' would return all context within the
'base' namespace that began with 'a'"

> +
> +    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
> +    followed by `NBD_REP_ACK`. The metadata context ID in these replies
> +    is reserved and SHOULD be set to zero; clients SHOULD disregard it.

Why is the context ID set to zero? Surely it would be really helpful
for this to be filled in with the ID?

> +
> +- `NBD_OPT_SET_META_CONTEXT` (11)
> +
> +    Change the set of active metadata contexts. Issuing this command
> +    replaces all previously-set metadata contexts; clients must ensure
> +    that all metadata contexts they're interested in are selected with
> +    the final query that they sent.
> +
> +    Data:
> +    - 32 bits, length of query
> +    - String, query to select metadata contexts.


Again as above, with s/return/select/:

"If not specified, the server MUST select all metadata contexts.
If specified, the server MUST select the zero or more contexts
whose names (including the namespace) consist of or start with
the query string. For instance a query string of 'nbd-server:'
would select all contexts within the 'nbd-server' namespace,
and a string 'base:a' would select all context within the
'base' namespace that began with 'a'"


> The syntax of this
> +      query is implementation-defined,

Right, but I'd make this "The syntax of the metadata context
name is implementation defined" - after all it's not only
the 'query' that has the namespace, it's also the names
themselves. Also perhaps move this elsewhere as it doesn't
only apply to _SET_ but also to _LIST_.

> except that it MUST start with a
> +      namespace. This namespace may be one of the following:
> +        - `base:`, for metadata contexts defined by this document;
> +        - `nbd-server:`, for metadata contexts defined by the
> +          implementation that accompanies this document (none
> +          currently);
> +        - `x-*:`, where `*` can be replaced by any random string not
> +          containing colons, for local experiments.

Let's be a bit more restrictive:

where `*` can be replaced by an arbitrary string of non-whitespace
printable UTF-8 characters, such that the total length of the name
including namespace does not exceed 255 bytes.

> This SHOULD NOT be
> +          used by metadata contexts that are expected to e widely used.

"expected to be widely"

> +        - third-party implementations can register additional
> +          namespaces by simple request to the mailinglist.
> +
> +    The server MUST reply with a number of `NBD_REP_META_CONTEXT`
> +    replies, one for each selected metadata context, each with a unique
> +    metadata context ID

"followed by NBD_REP_ACK"

> . It is not an error if a
> +    `NBD_OPT_SET_META_CONTEXT` option does not select any metadata
> +    context, provided the client then does not attempt to issue
> +    `NBD_CMD_BLOCK_STATUS` commands.
> +
> #### Option reply types
> 
> These values are used in the "reply type" field, sent by the server
> @@ -882,7 +989,7 @@ during option haggling in the fixed newstyle negotiation.
>     information is available, or when sending data related to the option
>     (in the case of `NBD_OPT_LIST`) has finished. No data.
> 
> -* `NBD_REP_SERVER` (2)
> +- `NBD_REP_SERVER` (2)
> 
>     A description of an export. Data:
> 
> @@ -897,10 +1004,18 @@ during option haggling in the fixed newstyle negotiation.
>       particular client request, this field is defined to be a string
>       suitable for direct display to a human being.
> 
> -* `NBD_REP_INFO` (3)
> +- `NBD_REP_INFO` (3)

These two probably belong in a different patch

> 
>     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 
> +- `NBD_REP_META_CONTEXT` (4)
> +
> +    A description of a metadata context. Data:
> +
> +    - 32 bits, NBD metadata context ID.
> +    - String, name of the metadata context. This is not required to be
> +      a human-readable string, but it MUST be valid UTF-8 data.

"consisting of printable non-whitespace UTF-8 characters not greater
than 255 bytes in length"

Did we not say that we wanted to insert the length of the string to allow
for expansion?

> +
> There are a number of error reply types, all of which are denoted by
> having bit 31 set. All error replies MAY have some data set, in which
> case that data is an error message string suitable for display to the user.
> @@ -938,15 +1053,62 @@ case that data is an error message string suitable for display to the user.
> 
>     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 
> -* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
> +* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
> 
>     The server is unwilling to continue negotiation as it is in the
>     process of being shut down.
> 
> -* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
> +* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)
> 
>     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> 

Probably should be in a different patch

> +##### Metadata contexts
> +
> +The `base:allocation` metadata context is the basic "allocated at all"
> +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> +context, this means that the given extent is not allocated in the
> +backend storage, and that writing to the extent MAY result in the ENOSPC
> +error. This supports sparse file semantics on the server side. If a
> +server has only one metadata context (the default), then writing to an
> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.

Again I'm still confused by this. I *think* you mean "If a server
supports the `base:allocation` metadata context, then writing
to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.`

I say that because as currently phrased:

* If a server has one metadata context only, but it is not
  `base:allocation`, then you implying something about writing
  to an extent with a state that won't even exist.

* If a server has `base:allocation` AND another metadata context
  (for instance `qemu:dirty`) then the rule you set out will not
  apply.

> +It defines the following flags for the flags field:
> +
> +- `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole (and
> +  future writes to that area may cause fragmentation or encounter an
> +  `ENOSPC` error); if clear, the block is allocated or the server could
> +  not otherwise determine its status. Note that the use of
> +  `NBD_CMD_TRIM` is related to this status, but that the server MAY
> +  report a hole even where trim has not been requested, and also that a
> +  server MAY report metadata even where a trim has been requested.

'may report that the block is allocated even where'

> +- `NBD_STATE_ZERO` (bit 1): if set, the block contents read as all
> +  zeroes; if clear, the block contents are not known. Note that the use
> +  of `NBD_CMD_WRITE_ZEROES` is related to this status, but that the
> +  server MAY report zeroes even where write zeroes has not been

'even where `NBD_CMD_WRITE_ZEROES` has not been'

> +  requested, and also that a server MAY report unknown content even
> +  where write zeroes has been requested.

'where `NBD_CMD_WRITE_ZEROES` has been requested'

> +
> +It is not an error for a server to report that a region of the
> +export has both `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear. The
> +contents of such an area is undefined, and may not be stable;
> +clients who are aware of the existence of such a region SHOULD NOT
> +read it.

As per previous comment, there is no restriction on a client reading it.
For instance, if it was a block within a long read, it might be
sensible to read it.

How about: "the contents of such an area are undefined, and a client
reading such an area should make no assumption as to its contents
or stability."

> +
> +For the `base:allocation` context, the remainder of the flags field is
> +reserved. Servers SHOULD set it to all-zero;

Surely if we want to reserve them for extension, we need "Servers
MUST set it to all-zero"

> clients MUST ignore unknown
> +flags.
> +
> +For all other cases, this specification requires no specific semantics of
> +metadata contexts, except that all the information they provide MUST be
> +representable within the flags field as defined for
> +`NBD_REPLY_TYPE_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting metadata contexts in a document like this
> +one.
> +
> ### Transmission phase
> 
> #### Flag fields
> @@ -983,6 +1145,11 @@ valid may depend on negotiation during the handshake phase.
>    content chunk in reply.  MUST NOT be set unless the transmission
>    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
>    `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per metadata
> +  context. If this flag is present, the server SHOULD NOT send metadata

'MUST NOT'?

> +  on more than one extent in the reply. Clients SHOULD NOT use this flag
> +  on multiple requests for successive regions in the export.

Perhaps better to simply note:

"Client implementors should note that using this flag on multiple
contiguous requests is likely to be inefficient."

After all, it's no worse than multiple one block reads, which we do
not preclude or have as a 'SHOULD NOT'.

> 
> ##### Structured reply flags
> 
> @@ -1051,6 +1218,34 @@ interpret the "length" bytes of payload.
>   64 bits: offset (unsigned)
>   32 bits: hole size (unsigned, MUST be nonzero)
> 
> +- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> +
> +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
> +    represents a series of consecutive block descriptors where the sum
> +    of the lengths of the descriptors

I think 'length fields within the descriptors' would be better, else
you might mean the length of each descriptor itself, which is 8.

> MUST not be greater than the
> +    length of the original request. This chunk type MUST appear exactly
> +    once per metadata ID in a structured reply.
> +
> +    The payload starts with:
> +
> +        * 32 bits, metadata context ID
> +
> +    and is followed by a list of one or more descriptors, each with this
> +    layout:
> +
> +        * 32 bits, length (unsigned, MUST NOT be zero)

Perhaps add: "the length of the extent to which which the status
below applies".

> +        * 32 bits, status flags
> +
> +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
> +    then every reply chunk MUST NOT contain more than one descriptor.
> +
> +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> +    its request, the server MAY return less descriptors in the reply

s/less/fewer/

> +    than would be required to fully specify the whole range of requested
> +    information to the client, if the number of descriptors would be
> +    over 16 otherwise

"otherwise be over 16"

> and looking up the information would be too
> +    resource-intensive for the server.
> +
> All error chunk types have bit 15 set, and begin with the same
> *error*, *message length*, and optional *message* fields as
> `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
> @@ -1085,7 +1280,7 @@ remaining structured fields at the end.
>   were sent earlier in the structured reply, the server SHOULD NOT
>   send multiple distinct offsets that lie within the bounds of a
>   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> -  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
> +  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
> 
>   The payload is structured as:
> 
> @@ -1259,6 +1454,44 @@ The following request types exist:
> 
>     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).
> 
> +* `NBD_CMD_BLOCK_STATUS` (7)
> +
> +    A block status query request. Length and offset define the range of
> +    interest. Clients MUST NOT use this request unless metadata
> +    contexts have been negotiated,

I think you mean "have been selected" but see my comment re perhaps
making no contexts being selected meaning all contexts are selected.

> which in turn requires the client to
> +    first negotiate structured replies. For a successful return, the
> +    server MUST use a structured reply, containing at least one chunk of
> +    type `NBD_REPLY_TYPE_BLOCK_STATUS`, where the status field of each
> +    descriptor is determined by the flags field as defined by the
> +    metadata context.
> +
> +    The list of block status descriptors within the
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +    of the file starting from specified *offset*, and the sum of the
> +    *length* fields of each descriptor MUST not be greater than the
> +    overall *length* of the request. This means that the server MAY
> +    return less data than required. However the server MUST return at
> +    least one status descriptor.  The server SHOULD use different
> +    *status* values between consecutive descriptors, and SHOULD use
> +    descriptor lengths that are an integer multiple of 512 bytes where
> +    possible (the first and last descriptor of an unaligned query being
> +    the most obvious places for an exception). The status flags are
> +    intentionally defined so that a server MAY always safely report a
> +    status of 0 for any block, although the server SHOULD return
> +    additional status values when they can be easily detected.
> +
> +    If an error occurs, the server SHOULD set the appropriate error
> +    code in the error field of an error chunk. However, if the error
> +    does not involve invalid usage (such as a request beyond the bounds
> +    of the file), a server MAY reply with a single block status
> +    descriptor with *length* matching the requested length, and *status*
> +    of 0 rather than reporting the error.
> +
> +    A client MAY initiate a hard disconnect if it detects that the
> +    server has sent an invalid chunk. The server SHOULD return `EINVAL`
> +    if it receives a `NBD_CMD_BLOCK_STATUS` request including one or
> +    more sectors beyond the size of the device.
> +
> * Other requests
> 
>     Some third-party implementations may require additional protocol
> 
> --
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>       people in the world who think they really understand all of its rules,
>       and pretty much all of them are just lying to themselves too.
> -- #debian-devel, OFTC, 2016-02-12
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot_______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

--
Alex Bligh
Alex Bligh Dec. 13, 2016, 2:12 p.m. UTC | #3
> On 12 Dec 2016, at 20:43, Wouter Verhelst <w@uter.be> wrote:
> 
> +- `NBD_OPT_SET_META_CONTEXT` (11)
> +
> +    Change the set of active metadata contexts. Issuing this command
> +    replaces all previously-set metadata contexts; clients must ensure
> +    that all metadata contexts they're interested in are selected with
> +    the final query that they sent.
> +
> +    Data:
> +    - 32 bits, length of query
> +    - String, query to select metadata contexts. The syntax of this
> +      query is implementation-defined, except that it MUST start with a
> +      namespace. This namespace may be one of the following:
> +        - `base:`, for metadata contexts defined by this document;
> +        - `nbd-server:`, for metadata contexts defined by the
> +          implementation that accompanies this document (none
> +          currently);
> +        - `x-*:`, where `*` can be replaced by any random string not
> +          containing colons, for local experiments. This SHOULD NOT be
> +          used by metadata contexts that are expected to e widely used.
> +        - third-party implementations can register additional
> +          namespaces by simple request to the mailinglist.

Surely also we need to specify multiple queries?


--
Alex Bligh
Alex Bligh Dec. 13, 2016, 2:15 p.m. UTC | #4
Stefan,

> What about querying "base:" if the NBD spec adds more standard metadata
> contexts in the future?  "base:" does not "uniquely identify a server
> implementation" but it should still be possible to query it so that
> additional contexts can be added to this spec in the future.
> 
> Is the query matching algorithm defined anywhere?  I guess it is
> strncmp(query, context, strlen(query)) but I'm not sure from this text.
> 
> Another common syntax would use an asterisk wildcard ('*') so that it's
> possible to differentiate between full matches and (wildcard) partial
> matches.

I agree. I have suggested a simple string-based left match, which
also fixes ...

>> +        - third-party implementations can register additional
>> +          namespaces by simple request to the mailinglist.
> 
> Does this mean it is possible to activate multiple contexts but only if
> their namespace is identical?  That seems like an arbitrary limitation.
> 
> In other words, the spec suggests you can activate nbd-server:a and
> nbd-server:b but you cannot activate base:a and nbd-server:a.

... this, provided that NBD_SET_META_CONTEXT actually takes a list of
query strings.

> I'd prefer a programming model where the contexts don't need to be set
> for the lifetime of the connection.  Instead the client passes a bitmap
> (64-bits?) of contexts along with each BLOCK_STATUS command.  That way
> the client can select contexts on a per-command basis.  It's unlikely
> that more than 64 contexts need to be available at once but I admit it's
> an arbitrary limitation...
> 
> I guess you've considered this model and decided it's better to
> negotiate upfront, it's wasteful to enable multiple contexts and then
> discard the response data on the client side because only a subset is
> needed for this command invocation.

The issue is that there's nowhere within the current command structure
to put it.

I think the easy way to do this would be to have context IDs as a payload
to the command, like they are with NBD_CMD_WRITE, but the payload
is currently always defined to be of the length specified within the
length section.

The question really is whether we should fix this silly protocol limitation.
I don't think it's as bad as the structured reply fix, and could conceivably
go in there.

--
Alex Bligh
Wouter Verhelst Dec. 13, 2016, 2:33 p.m. UTC | #5
On Tue, Dec 13, 2016 at 02:12:48PM +0000, Alex Bligh wrote:
> 
> > On 12 Dec 2016, at 20:43, Wouter Verhelst <w@uter.be> wrote:
> > 
> > +- `NBD_OPT_SET_META_CONTEXT` (11)
> > +
> > +    Change the set of active metadata contexts. Issuing this command
> > +    replaces all previously-set metadata contexts; clients must ensure
> > +    that all metadata contexts they're interested in are selected with
> > +    the final query that they sent.
> > +
> > +    Data:
> > +    - 32 bits, length of query
> > +    - String, query to select metadata contexts. The syntax of this
> > +      query is implementation-defined, except that it MUST start with a
> > +      namespace. This namespace may be one of the following:
> > +        - `base:`, for metadata contexts defined by this document;
> > +        - `nbd-server:`, for metadata contexts defined by the
> > +          implementation that accompanies this document (none
> > +          currently);
> > +        - `x-*:`, where `*` can be replaced by any random string not
> > +          containing colons, for local experiments. This SHOULD NOT be
> > +          used by metadata contexts that are expected to e widely used.
> > +        - third-party implementations can register additional
> > +          namespaces by simple request to the mailinglist.
> 
> Surely also we need to specify multiple queries?

Yes. See my reply to Stefan's mail ;-)
Wouter Verhelst Dec. 13, 2016, 3:03 p.m. UTC | #6
Alex,

(leaving out obvious grammar/language fixes that I have no problem with)

On Tue, Dec 13, 2016 at 02:09:19PM +0000, Alex Bligh wrote:
> Wouter,
> 
> Some comments below:
> 
> > On 12 Dec 2016, at 20:43, Wouter Verhelst <w@uter.be> wrote:
> ....
> > +## Metadata querying
> > +
> > +With the availability of sparse storage formats, it is often needed to
> > +query the status of a particular range and read only those blocks of
> > +data that are actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a
> > +concept of data dirtiness. Whether the operation is block device
> > +mirroring, incremental block device backup or any other operation with
> > +a concept of data dirtiness, they all share a need to provide a list
> > +of ranges that this particular operation treats as dirty.
> 
> I'm not sure this necessarily has anything to do with sparse storage
> formats. For instance the 'dirty' concept would work perfectly well
> on a non-sparse format. I propose the following to replace the two
> paragraphs above:
> 
> "It is often helpful for the client to be able to query the status
> of a range of blocks. The nature of the status that can be
> queried is in part implementation dependent. For instance,
> the status might represent:
> 
> * in a sparse storage format, whether the relevant blocks are
>   actually present on the backing device for the export; or
> 
> * whether the relevant blocks are 'dirty'; some storage formats
>   and operations over such formats express a concept of data dirtiness.
>   Whether the operation is block device mirroring, incremental block
>   device backup or any other operation with a concept of data dirtiness,
>   they all share a need to provide a list of ranges that this
>   particular operation treats as dirty.
> "

I suppose that works, yes.

[..]
> > +- During transmission, a client can then indicate interest in metadata
> > +  for a given region by way of the `NBD_CMD_BLOCK_STATUS` command, where
> > +  *offset* and *length* indicate the area of interest. The server MUST
> > +  then respond with the requested information, for all contexts which
> > +  were selected during negotiation. For every metadata context, the
> > +  server sends one set of extent chunks, where the sizes of the
> > +  extents MUST be less than or equal to the length as specified in the
> > +  request. Each extent comes with a *flags* field, the semantics of
> > +  which are defined by the metadata context.
> > +- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured reply
> > +  of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> > +
> > +A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
> > +nonzero
> 
> "non-zero"
> 
> > number of metadata contexts during negotiation. Servers SHOULD
> > +reply to clients doing so anyway with `EINVAL`.
> 
> I had difficulty parsing "so". I think you mean "Servers SHOULD
> reply to clients sending `NBD_CMD_BLOCK_STATUS without
> selecting metadata contexts with"

Yes.

> But actually, why do we need to be so mean? Why can't we assume
> that if NBD_OPT_SET_META_CONTEXT is not sent, then all the metadata
> contexts should be selected?

No. If no metadata contexts have been negotiated, then no metadata
context IDs were assigned to contexts, so there is no mapping to
external names. As such, the data being sent in response to
NBD_CMD_BLOCK_STATUS would have no meaning to the client.

[...]
> > +    commands during the transmission phase.
> 
> "otherwise, the client MUST NOT send NBD_CMD_BLOCK_STATUS
> messages."
> 
> But actually isn't the telling question whether NBD_OPT_SET_META_CONTEXT
> works?
> 
> Actually, as this could be sent more than once, I think this whole
> thing would be better phrased as:
> 
> "A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
> within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
> at least once, and the final time it was sent, the server
> responded without an error."

... and returned at least one metadata context (since we state elsewhere
that sending a SET command which selects nothing is not an error).

> obviously this would be better under _SET_ than _LIST_,
> but the sentence can go entirely from here.
> 
> Equally obviously, if we are making _SET_ optional (as
> lack of _SET_ means that all the contexts are selected)
> we just gate this on `NBD_FLAG_SEND_BLOCK_STATUS`

You can't, as per above.

> > +    If the query string is syntactically invalid, the server SHOULD send
> > +    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> > +    but finds no metadata contexts, the server MUST send a single
> > +    reply of type `NBD_REP_ACK`.
> > +
> > +    This option MUST NOT be requested unless structured replies have
> > +    been negotiated first. If a client attempts to do so, a server
> > +    SHOULD send `NBD_REP_ERR_INVALID`.
> > +
> > +    Data:
> > +    - 32 bits, length of export name
> > +    - String, name of export for which we wish to list or select metadata
> > +      contexts.
> 
> Surely as this is _LIST_ this should read "list" not "list or select"?

Yes, I suppose. The data format for SET and LIST is the same, though, so
it could alternatively be reworded to point that out more clearly.

> > +    - 32 bits, length of query
> > +    - String, query to select a subset of the available metadata
> > +      contexts. If this is not specified (i.e., the "length of query"
> > +      field is 0 and no query is sent), then the server MUST send all
> > +      the metadata contexts it knows about.
> 
> 
> > If specified, this query
> > +      string MUST start with a name that uniquely identifies a server
> > +      implementation; e.g., the reference implementation that
> > +      accompanies this document would support query strings starting
> > +      with 'nbd-server:'
> 
> I'm not sure this is correct (particularly now you've adopted my "X-"
> suggestion), as for instance "base:" does not started with a server
> implementation.

Point.

> Better:
> 
> "If specified, the server MUST return the zero or more contexts
> whose names (including the namespace) consist of or start with
> the query string. For instance a query string of 'nbd-server:'
> would return all contexts within the 'nbd-server' namespace,
> and a string 'base:a' would return all context within the
> 'base' namespace that began with 'a'"

NAK. This imposes a particular syntax for parsing query strings upon
namespaces, which I explicitly do not wish to do.

The spec should only specify how to select a particular namespace, and
then leave all parsing rules of query strings up to the namespace
specification.

> > +    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
> > +    followed by `NBD_REP_ACK`. The metadata context ID in these replies
> > +    is reserved and SHOULD be set to zero; clients SHOULD disregard it.
> 
> Why is the context ID set to zero? Surely it would be really helpful
> for this to be filled in with the ID?

The ID could change depending on which contexts are actually selected:

First, I could imagine a context namespace which allows the client to
dynamically create a context (e.g., "create a snapshot now and use
that"), but where that would only be actually done when the client
actually chooses the metadata context.

Second, a namespace could create symbolic context names (e.g., "the
latest snapshot, whatever that may be"), whose ID could change between
the LIST and SET options (presumably that's a very small race window,
but a race is a race).

Third, a valid (and simple) way to implement mapping could be to assign
context IDs as array indices, where the array is dynamically created at
SET time, and where the server side does a for loop over the array, and
returns the metadata context ID as htonl() of its loop variable. This
would mean that there must be no gaps between the IDs at run time.

Perhaps another way to deal with that would be to specify that an
implementation is not required to assign the same context IDs to the
same metadata contexts from one call to SET (or LIST) to the next; but I
thought it would be clearer if we were to explicitly make LIST return
"invalid" context IDs. Then again, maybe not.

> > +- `NBD_OPT_SET_META_CONTEXT` (11)
> > +
> > +    Change the set of active metadata contexts. Issuing this command
> > +    replaces all previously-set metadata contexts; clients must ensure
> > +    that all metadata contexts they're interested in are selected with
> > +    the final query that they sent.
> > +
> > +    Data:
> > +    - 32 bits, length of query
> > +    - String, query to select metadata contexts.
> 
> Again as above, with s/return/select/:
> 
> "If not specified, the server MUST select all metadata contexts.
> If specified, the server MUST select the zero or more contexts
> whose names (including the namespace) consist of or start with
> the query string. For instance a query string of 'nbd-server:'
> would select all contexts within the 'nbd-server' namespace,
> and a string 'base:a' would select all context within the
> 'base' namespace that began with 'a'"

Again, NAK.

> > The syntax of this
> > +      query is implementation-defined,
> 
> Right, but I'd make this "The syntax of the metadata context
> name is implementation defined" - after all it's not only
> the 'query' that has the namespace, it's also the names
> themselves. Also perhaps move this elsewhere as it doesn't
> only apply to _SET_ but also to _LIST_.

Point.

> > except that it MUST start with a
> > +      namespace. This namespace may be one of the following:
> > +        - `base:`, for metadata contexts defined by this document;
> > +        - `nbd-server:`, for metadata contexts defined by the
> > +          implementation that accompanies this document (none
> > +          currently);
> > +        - `x-*:`, where `*` can be replaced by any random string not
> > +          containing colons, for local experiments.
> 
> Let's be a bit more restrictive:
> 
> where `*` can be replaced by an arbitrary string of non-whitespace
> printable UTF-8 characters, such that the total length of the name
> including namespace does not exceed 255 bytes.

That works, certainly.

> > This SHOULD NOT be
> > +          used by metadata contexts that are expected to e widely used.
> 
> "expected to be widely"

Has already been fixed :)

> > +        - third-party implementations can register additional
> > +          namespaces by simple request to the mailinglist.
> > +
> > +    The server MUST reply with a number of `NBD_REP_META_CONTEXT`
> > +    replies, one for each selected metadata context, each with a unique
> > +    metadata context ID
> 
> "followed by NBD_REP_ACK"

Right.

> > . It is not an error if a
> > +    `NBD_OPT_SET_META_CONTEXT` option does not select any metadata
> > +    context, provided the client then does not attempt to issue
> > +    `NBD_CMD_BLOCK_STATUS` commands.
> > +
> > #### Option reply types
> > 
> > These values are used in the "reply type" field, sent by the server
> > @@ -882,7 +989,7 @@ during option haggling in the fixed newstyle negotiation.
> >     information is available, or when sending data related to the option
> >     (in the case of `NBD_OPT_LIST`) has finished. No data.
> > 
> > -* `NBD_REP_SERVER` (2)
> > +- `NBD_REP_SERVER` (2)
> > 
> >     A description of an export. Data:
> > 
> > @@ -897,10 +1004,18 @@ during option haggling in the fixed newstyle negotiation.
> >       particular client request, this field is defined to be a string
> >       suitable for direct display to a human being.
> > 
> > -* `NBD_REP_INFO` (3)
> > +- `NBD_REP_INFO` (3)
> 
> These two probably belong in a different patch

3

(number of people telling me that. Yes, I know :-)

> >     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
> > 
> > +- `NBD_REP_META_CONTEXT` (4)
> > +
> > +    A description of a metadata context. Data:
> > +
> > +    - 32 bits, NBD metadata context ID.
> > +    - String, name of the metadata context. This is not required to be
> > +      a human-readable string, but it MUST be valid UTF-8 data.
> 
> "consisting of printable non-whitespace UTF-8 characters not greater
> than 255 bytes in length"
> 
> Did we not say that we wanted to insert the length of the string to allow
> for expansion?

Ah, yes, forgot about that bit.

[...]
> > +##### Metadata contexts
> > +
> > +The `base:allocation` metadata context is the basic "allocated at all"
> > +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> > +context, this means that the given extent is not allocated in the
> > +backend storage, and that writing to the extent MAY result in the ENOSPC
> > +error. This supports sparse file semantics on the server side. If a
> > +server has only one metadata context (the default), then writing to an
> > +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> Again I'm still confused by this. I *think* you mean "If a server
> supports the `base:allocation` metadata context, then writing
> to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.`
> 
> I say that because as currently phrased:
> 
> * If a server has one metadata context only, but it is not
>   `base:allocation`, then you implying something about writing
>   to an extent with a state that won't even exist.

Yes, that's wrong indeed. We should replace the "the default" bit by
"the base:allocation context".

> * If a server has `base:allocation` AND another metadata context
>   (for instance `qemu:dirty`) then the rule you set out will not
>   apply.

Yes, and this is intentional, as I've explained before. The other
context's semantics should clarify whether that rule still applies or
not. Implementations that do not know of the other context should
however assume that it doesn't.

[...]
> > +It is not an error for a server to report that a region of the
> > +export has both `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear. The
> > +contents of such an area is undefined, and may not be stable;
> > +clients who are aware of the existence of such a region SHOULD NOT
> > +read it.
> 
> As per previous comment, there is no restriction on a client reading it.
> For instance, if it was a block within a long read, it might be
> sensible to read it.
> 
> How about: "the contents of such an area are undefined, and a client
> reading such an area should make no assumption as to its contents
> or stability."

That works.

> > +For the `base:allocation` context, the remainder of the flags field is
> > +reserved. Servers SHOULD set it to all-zero;
> 
> Surely if we want to reserve them for extension, we need "Servers
> MUST set it to all-zero"

No, SHOULD, otherwise a future extension which adds meaning to those
bits will suddenly become incompatible with this spec. Think about it
;-)

> > clients MUST ignore unknown
> > +flags.
> > +
> > +For all other cases, this specification requires no specific semantics of
> > +metadata contexts, except that all the information they provide MUST be
> > +representable within the flags field as defined for
> > +`NBD_REPLY_TYPE_BLOCK_STATUS`.
> > +
> > +Likewise, the syntax of query strings is not specified by this document.
> > +
> > +Server implementations SHOULD document their syntax for query strings
> > +and semantics for resulting metadata contexts in a document like this
> > +one.
> > +
> > ### Transmission phase
> > 
> > #### Flag fields
> > @@ -983,6 +1145,11 @@ valid may depend on negotiation during the handshake phase.
> >    content chunk in reply.  MUST NOT be set unless the transmission
> >    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
> >    `EOVERFLOW` error chunk, if the request length is too large.
> > +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> > +  set, the client is interested in only one extent per metadata
> > +  context. If this flag is present, the server SHOULD NOT send metadata
> 
> 'MUST NOT'?

I think it's not necessarily a problem for a server to ignore that,
especially in reply to clients which are doing the below.

> > +  on more than one extent in the reply. Clients SHOULD NOT use this flag
> > +  on multiple requests for successive regions in the export.
> 
> Perhaps better to simply note:
> 
> "Client implementors should note that using this flag on multiple
> contiguous requests is likely to be inefficient."
> 
> After all, it's no worse than multiple one block reads, which we do
> not preclude or have as a 'SHOULD NOT'.

That works too.

> > ##### Structured reply flags
> > 
> > @@ -1051,6 +1218,34 @@ interpret the "length" bytes of payload.
> >   64 bits: offset (unsigned)
> >   32 bits: hole size (unsigned, MUST be nonzero)
> > 
> > +- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> > +
> > +    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
> > +    represents a series of consecutive block descriptors where the sum
> > +    of the lengths of the descriptors
> 
> I think 'length fields within the descriptors' would be better, else
> you might mean the length of each descriptor itself, which is 8.

Fair enough.

> > MUST not be greater than the
> > +    length of the original request. This chunk type MUST appear exactly
> > +    once per metadata ID in a structured reply.
> > +
> > +    The payload starts with:
> > +
> > +        * 32 bits, metadata context ID
> > +
> > +    and is followed by a list of one or more descriptors, each with this
> > +    layout:
> > +
> > +        * 32 bits, length (unsigned, MUST NOT be zero)
> 
> Perhaps add: "the length of the extent to which which the status
> below applies".

Sounds good, yes.

> > +        * 32 bits, status flags
> > +
> > +    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
> > +    then every reply chunk MUST NOT contain more than one descriptor.
> > +
> > +    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
> > +    its request, the server MAY return less descriptors in the reply
> 
> s/less/fewer/
> 
> > +    than would be required to fully specify the whole range of requested
> > +    information to the client, if the number of descriptors would be
> > +    over 16 otherwise
> 
> "otherwise be over 16"

That requirement was dropped already (with
<20161212204913.oaatvexhnndyhhwa@grep.be>)

[...]
> > +* `NBD_CMD_BLOCK_STATUS` (7)
> > +
> > +    A block status query request. Length and offset define the range of
> > +    interest. Clients MUST NOT use this request unless metadata
> > +    contexts have been negotiated,
> 
> I think you mean "have been selected"

Yes.

(feel free to update the branch with those suggestions I've not NAK'd,
as I think they make sense...)
Alex Bligh Dec. 13, 2016, 3:24 p.m. UTC | #7
Wouter,

>> But actually, why do we need to be so mean? Why can't we assume
>> that if NBD_OPT_SET_META_CONTEXT is not sent, then all the metadata
>> contexts should be selected?
> 
> No. If no metadata contexts have been negotiated, then no metadata
> context IDs were assigned to contexts, so there is no mapping to
> external names. As such, the data being sent in response to
> NBD_CMD_BLOCK_STATUS would have no meaning to the client.

Good point.

>>> +    commands during the transmission phase.
>> 
>> "otherwise, the client MUST NOT send NBD_CMD_BLOCK_STATUS
>> messages."
>> 
>> But actually isn't the telling question whether NBD_OPT_SET_META_CONTEXT
>> works?
>> 
>> Actually, as this could be sent more than once, I think this whole
>> thing would be better phrased as:
>> 
>> "A client MUST NOT send `NBD_CMD_BLOCK_STATUS` unless
>> within the negotiation phase it sent `NBD_OPT_SET_META_CONTEXT`
>> at least once, and the final time it was sent, the server
>> responded without an error."
> 
> ... and returned at least one metadata context (since we state elsewhere
> that sending a SET command which selects nothing is not an error).
> 
>> obviously this would be better under _SET_ than _LIST_,
>> but the sentence can go entirely from here.

Yes

>> Equally obviously, if we are making _SET_ optional (as
>> lack of _SET_ means that all the contexts are selected)
>> we just gate this on `NBD_FLAG_SEND_BLOCK_STATUS`
> 
> You can't, as per above.

Indeed.

>> "If specified, the server MUST return the zero or more contexts
>> whose names (including the namespace) consist of or start with
>> the query string. For instance a query string of 'nbd-server:'
>> would return all contexts within the 'nbd-server' namespace,
>> and a string 'base:a' would return all context within the
>> 'base' namespace that began with 'a'"
> 
> NAK. This imposes a particular syntax for parsing query strings upon
> namespaces, which I explicitly do not wish to do.
> 
> The spec should only specify how to select a particular namespace, and
> then leave all parsing rules of query strings up to the namespace
> specification.

I'm not quite sure why you want that, but if that's the case, I'd
suggest the easiest thing to do would be to pass (separately) a
namespace string, and a name-query, because you are prohibiting
queries that cover more than one namespace (though per our
separate conversation, you can select context from more than
one namespace with multiple queries). It does seem a bit odd
thought that if context 'foo:bar' is returned from _LIST_,
there is no guarantee that selecting it as 'foo:bar' will
work (as the matching is namespace dependent).

>>> +    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
>>> +    followed by `NBD_REP_ACK`. The metadata context ID in these replies
>>> +    is reserved and SHOULD be set to zero; clients SHOULD disregard it.
>> 
>> Why is the context ID set to zero? Surely it would be really helpful
>> for this to be filled in with the ID?
> 
> The ID could change depending on which contexts are actually selected:
> 
> First, I could imagine a context namespace which allows the client to
> dynamically create a context (e.g., "create a snapshot now and use
> that"), but where that would only be actually done when the client
> actually chooses the metadata context.
> 
> Second, a namespace could create symbolic context names (e.g., "the
> latest snapshot, whatever that may be"), whose ID could change between
> the LIST and SET options (presumably that's a very small race window,
> but a race is a race).
> 
> Third, a valid (and simple) way to implement mapping could be to assign
> context IDs as array indices, where the array is dynamically created at
> SET time, and where the server side does a for loop over the array, and
> returns the metadata context ID as htonl() of its loop variable. This
> would mean that there must be no gaps between the IDs at run time.
> 
> Perhaps another way to deal with that would be to specify that an
> implementation is not required to assign the same context IDs to the
> same metadata contexts from one call to SET (or LIST) to the next; but I
> thought it would be clearer if we were to explicitly make LIST return
> "invalid" context IDs. Then again, maybe not.

You're completely correct here and you should ignore my suggestion.

Perhaps you should be even clearer and explicitly state that
context IDs may change between sessions and even between successive
use of the _SET_ command.

>> These two probably belong in a different patch
> 
> 3
> 
> (number of people telling me that. Yes, I know :-)

Sorry, behind on mail :-)

>>> +##### Metadata contexts
>>> +
>>> +The `base:allocation` metadata context is the basic "allocated at all"
>>> +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
>>> +context, this means that the given extent is not allocated in the
>>> +backend storage, and that writing to the extent MAY result in the ENOSPC
>>> +error. This supports sparse file semantics on the server side. If a
>>> +server has only one metadata context (the default), then writing to an
>>> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
>> 
>> Again I'm still confused by this. I *think* you mean "If a server
>> supports the `base:allocation` metadata context, then writing
>> to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.`
>> 
>> I say that because as currently phrased:
>> 
>> * If a server has one metadata context only, but it is not
>>  `base:allocation`, then you implying something about writing
>>  to an extent with a state that won't even exist.
> 
> Yes, that's wrong indeed. We should replace the "the default" bit by
> "the base:allocation context".
> 
>> * If a server has `base:allocation` AND another metadata context
>>  (for instance `qemu:dirty`) then the rule you set out will not
>>  apply.
> 
> Yes, and this is intentional, as I've explained before. The other
> context's semantics should clarify whether that rule still applies or
> not. Implementations that do not know of the other context should
> however assume that it doesn't.

I think I must be being very stupid or am at least at cross-purposes.

Let's say the second context is (e.g.) 'colour:records_information_about_blue'
(i.e. nothing at all to do with space or holes). Are you saying that the
presence of that context might affect the interpretation of
NBD_STATE_HOLE? Or just other contexts within the 'base:' space?
Or that another context (e.g. perhaps compression) could be an independent
cause of ENOSPC? If so should it be the responsibility of the other context
to document that it does not affect that?

How about:

"If a server supports the `base:allocation` metadata context, then writing
to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
unless for reasons specified in the definition of another context."

>>> +For the `base:allocation` context, the remainder of the flags field is
>>> +reserved. Servers SHOULD set it to all-zero;
>> 
>> Surely if we want to reserve them for extension, we need "Servers
>> MUST set it to all-zero"
> 
> No, SHOULD, otherwise a future extension which adds meaning to those
> bits will suddenly become incompatible with this spec. Think about it
> ;-)

I did! If there is a future extension, it will change the spec to
incorporate those bits, so they won't be included within 'the
remainder' any more.

> (feel free to update the branch with those suggestions I've not NAK'd,
> as I think they make sense...)

OK. May take a look later.
Wouter Verhelst Dec. 13, 2016, 4:06 p.m. UTC | #8
On Tue, Dec 13, 2016 at 03:24:25PM +0000, Alex Bligh wrote:
> Wouter,
> > The spec should only specify how to select a particular namespace, and
> > then leave all parsing rules of query strings up to the namespace
> > specification.
> 
> I'm not quite sure why you want that,

Let's say you want to implement incremental backups (which is what we're
talking about, really). Your backup program would presumably know when
the last point in time is that we took a backup. In that context, you
could want to have a syntax to say "list all contexts that contain data
more recent than <previous backup point>" (or "select all contexts..."),
so as to actually select the backup.

> but if that's the case, I'd
> suggest the easiest thing to do would be to pass (separately) a
> namespace string, and a name-query, because you are prohibiting
> queries that cover more than one namespace

I think that would needlessly complicate matters, because there is
also...

> (though per our
> separate conversation, you can select context from more than
> one namespace with multiple queries)

... this.

>. It does seem a bit odd thought that if context 'foo:bar' is returned
>from _LIST_, there is no guarantee that selecting it as 'foo:bar' will
>work (as the matching is namespace dependent).

Yes. We should probably also make it clear that implementations SHOULD
allow the selection of a (subset of) metadata context(s) which are
returned by LIST by simply naming them individually.

[...]
> > Perhaps another way to deal with that would be to specify that an
> > implementation is not required to assign the same context IDs to the
> > same metadata contexts from one call to SET (or LIST) to the next; but I
> > thought it would be clearer if we were to explicitly make LIST return
> > "invalid" context IDs. Then again, maybe not.
> 
> You're completely correct here and you should ignore my suggestion.
> 
> Perhaps you should be even clearer and explicitly state that
> context IDs may change between sessions and even between successive
> use of the _SET_ command.

Makes sense, yes.

[...]
> >> * If a server has `base:allocation` AND another metadata context
> >>  (for instance `qemu:dirty`) then the rule you set out will not
> >>  apply.
> > 
> > Yes, and this is intentional, as I've explained before. The other
> > context's semantics should clarify whether that rule still applies or
> > not. Implementations that do not know of the other context should
> > however assume that it doesn't.
> 
> I think I must be being very stupid or am at least at cross-purposes.
> 
> Let's say the second context is (e.g.) 'colour:records_information_about_blue'
> (i.e. nothing at all to do with space or holes). Are you saying that the
> presence of that context might affect the interpretation of
> NBD_STATE_HOLE? Or just other contexts within the 'base:' space?
> Or that another context (e.g. perhaps compression) could be an independent
> cause of ENOSPC? If so should it be the responsibility of the other context
> to document that it does not affect that?
> 
> How about:
> 
> "If a server supports the `base:allocation` metadata context, then writing
> to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
> unless for reasons specified in the definition of another context."

That is essentially what I mean, yes. The point though, is that I also
think a client should assume the worst.

(additionally, there may be reasons for ENOSPC that the server isn't
aware of; so even if NBD_STATE_HOLE is clear, it should still not be an
error for the server to return ENOSPC)

> >>> +For the `base:allocation` context, the remainder of the flags field is
> >>> +reserved. Servers SHOULD set it to all-zero;
> >> 
> >> Surely if we want to reserve them for extension, we need "Servers
> >> MUST set it to all-zero"
> > 
> > No, SHOULD, otherwise a future extension which adds meaning to those
> > bits will suddenly become incompatible with this spec. Think about it
> > ;-)
> 
> I did! If there is a future extension, it will change the spec to
> incorporate those bits, so they won't be included within 'the
> remainder' any more.

If we say MUST, a client would be within its rights to reject a future
version of the "base:allocation" data with more flags set as invalid,
and would therefore not be future-compliant. This is why it should be
SHOULD, not MUST.
Alex Bligh Dec. 13, 2016, 4:15 p.m. UTC | #9
> On 13 Dec 2016, at 16:06, Wouter Verhelst <w@uter.be> wrote:
> 
>> 
>> "If a server supports the `base:allocation` metadata context, then writing
>> to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
>> unless for reasons specified in the definition of another context."
> 
> That is essentially what I mean, yes. The point though, is that I also
> think a client should assume the worst.
> 
> (additionally, there may be reasons for ENOSPC that the server isn't
> aware of; so even if NBD_STATE_HOLE is clear, it should still not be an
> error for the server to return ENOSPC)

All of this suggests 'SHOULD NOT' would be more appropriate than
'MUST NOT'.
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index c443494..526f71a 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -681,6 +681,52 @@  This functionality has not yet been implemented by the reference
 implementation, but was implemented by qemu and subsequently
 by other users, so has been moved out of the "experimental" section.
 
+## Metadata querying
+
+With the availability of sparse storage formats, it is often needed to
+query the status of a particular range and read only those blocks of
+data that are actually present on the block device.
+
+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.
+
+To provide such classes of information, the NBD protocol has a generic
+framework for querying metadata; however, its use must first be
+negotiated, and one or more metadata contexts must be selected.
+
+The procedure works as follows:
+
+- First, during negotiation, the client MUST select one or more metadata
+  contexts with the `NBD_OPT_SET_META_CONTEXT` command. If needed, the client
+  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts.
+- During transmission, a client can then indicate interest in metadata
+  for a given region by way of the `NBD_CMD_BLOCK_STATUS` command, where
+  *offset* and *length* indicate the area of interest. The server MUST
+  then respond with the requested information, for all contexts which
+  were selected during negotiation. For every metadata context, the
+  server sends one set of extent chunks, where the sizes of the
+  extents MUST be less than or equal to the length as specified in the
+  request. Each extent comes with a *flags* field, the semantics of
+  which are defined by the metadata context.
+- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured reply
+  of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
+nonzero number of metadata contexts during negotiation. Servers SHOULD
+reply to clients doing so anyway with `EINVAL`.
+
+The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent by a
+structured reply; this implies that in order to use metadata querying,
+structured replies MUST be negotiated first.
+
+This standard defines exactly one metadata context; it is called
+`base:allocation`, and it provides information on the basic allocation
+status of extents (that is, whether they are allocated at all in a
+sparse file context).
+
 ## Values
 
 This section describes the value and meaning of constants (other than
@@ -768,8 +814,6 @@  The field has the following format:
   to that command to the client. In the absense of this flag, clients
   SHOULD NOT multiplex their commands over more than one connection to
   the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
-  `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
 
 Clients SHOULD ignore unknown flags.
 
@@ -871,6 +915,69 @@  of the newstyle negotiation.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_OPT_LIST_META_CONTEXT` (10)
+
+    Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
+    followed by an `NBD_REP_ACK`. If a server replies to such a request
+    with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+    commands during the transmission phase.
+
+    If the query string is syntactically invalid, the server SHOULD send
+    `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+    but finds no metadata contexts, the server MUST send a single
+    reply of type `NBD_REP_ACK`.
+
+    This option MUST NOT be requested unless structured replies have
+    been negotiated first. If a client attempts to do so, a server
+    SHOULD send `NBD_REP_ERR_INVALID`.
+
+    Data:
+    - 32 bits, length of export name
+    - String, name of export for which we wish to list or select metadata
+      contexts.
+    - 32 bits, length of query
+    - String, query to select a subset of the available metadata
+      contexts. If this is not specified (i.e., the "length of query"
+      field is 0 and no query is sent), then the server MUST send all
+      the metadata contexts it knows about. If specified, this query
+      string MUST start with a name that uniquely identifies a server
+      implementation; e.g., the reference implementation that
+      accompanies this document would support query strings starting
+      with 'nbd-server:'
+
+    The server MUST reply with a list of `NBD_REP_META_CONTEXT` replies,
+    followed by `NBD_REP_ACK`. The metadata context ID in these replies
+    is reserved and SHOULD be set to zero; clients SHOULD disregard it.
+
+- `NBD_OPT_SET_META_CONTEXT` (11)
+
+    Change the set of active metadata contexts. Issuing this command
+    replaces all previously-set metadata contexts; clients must ensure
+    that all metadata contexts they're interested in are selected with
+    the final query that they sent.
+
+    Data:
+    - 32 bits, length of query
+    - String, query to select metadata contexts. The syntax of this
+      query is implementation-defined, except that it MUST start with a
+      namespace. This namespace may be one of the following:
+        - `base:`, for metadata contexts defined by this document;
+        - `nbd-server:`, for metadata contexts defined by the
+          implementation that accompanies this document (none
+          currently);
+        - `x-*:`, where `*` can be replaced by any random string not
+          containing colons, for local experiments. This SHOULD NOT be
+          used by metadata contexts that are expected to e widely used.
+        - third-party implementations can register additional
+          namespaces by simple request to the mailinglist.
+
+    The server MUST reply with a number of `NBD_REP_META_CONTEXT`
+    replies, one for each selected metadata context, each with a unique
+    metadata context ID. It is not an error if a
+    `NBD_OPT_SET_META_CONTEXT` option does not select any metadata
+    context, provided the client then does not attempt to issue
+    `NBD_CMD_BLOCK_STATUS` commands.
+
 #### Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -882,7 +989,7 @@  during option haggling in the fixed newstyle negotiation.
     information is available, or when sending data related to the option
     (in the case of `NBD_OPT_LIST`) has finished. No data.
 
-* `NBD_REP_SERVER` (2)
+- `NBD_REP_SERVER` (2)
 
     A description of an export. Data:
 
@@ -897,10 +1004,18 @@  during option haggling in the fixed newstyle negotiation.
       particular client request, this field is defined to be a string
       suitable for direct display to a human being.
 
-* `NBD_REP_INFO` (3)
+- `NBD_REP_INFO` (3)
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_REP_META_CONTEXT` (4)
+
+    A description of a metadata context. Data:
+
+    - 32 bits, NBD metadata context ID.
+    - String, name of the metadata context. This is not required to be
+      a human-readable string, but it MUST be valid UTF-8 data.
+
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
 case that data is an error message string suitable for display to the user.
@@ -938,15 +1053,62 @@  case that data is an error message string suitable for display to the user.
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-* `NBD_REP_ERR_SHUTDOWN` (2^32 + 7)
+* `NBD_REP_ERR_SHUTDOWN` (2^31 + 7)
 
     The server is unwilling to continue negotiation as it is in the
     process of being shut down.
 
-* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^32 + 8)
+* `NBD_REP_ERR_BLOCK_SIZE_REQD` (2^31 + 8)
 
     Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+##### Metadata contexts
+
+The `base:allocation` metadata context is the basic "allocated at all"
+metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
+context, this means that the given extent is not allocated in the
+backend storage, and that writing to the extent MAY result in the ENOSPC
+error. This supports sparse file semantics on the server side. If a
+server has only one metadata context (the default), then writing to an
+extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
+
+It defines the following flags for the flags field:
+
+- `NBD_STATE_HOLE` (bit 0): if set, the block represents a hole (and
+  future writes to that area may cause fragmentation or encounter an
+  `ENOSPC` error); if clear, the block is allocated or the server could
+  not otherwise determine its status. Note that the use of
+  `NBD_CMD_TRIM` is related to this status, but that the server MAY
+  report a hole even where trim has not been requested, and also that a
+  server MAY report metadata even where a trim has been requested.
+- `NBD_STATE_ZERO` (bit 1): if set, the block contents read as all
+  zeroes; if clear, the block contents are not known. Note that the use
+  of `NBD_CMD_WRITE_ZEROES` is related to this status, but that the
+  server MAY report zeroes even where write zeroes has not been
+  requested, and also that a server MAY report unknown content even
+  where write zeroes has been requested.
+
+It is not an error for a server to report that a region of the
+export has both `NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear. The
+contents of such an area is undefined, and may not be stable;
+clients who are aware of the existence of such a region SHOULD NOT
+read it.
+
+For the `base:allocation` context, the remainder of the flags field is
+reserved. Servers SHOULD set it to all-zero; clients MUST ignore unknown
+flags.
+
+For all other cases, this specification requires no specific semantics of
+metadata contexts, except that all the information they provide MUST be
+representable within the flags field as defined for
+`NBD_REPLY_TYPE_BLOCK_STATUS`.
+
+Likewise, the syntax of query strings is not specified by this document.
+
+Server implementations SHOULD document their syntax for query strings
+and semantics for resulting metadata contexts in a document like this
+one.
+
 ### Transmission phase
 
 #### Flag fields
@@ -983,6 +1145,11 @@  valid may depend on negotiation during the handshake phase.
    content chunk in reply.  MUST NOT be set unless the transmission
    flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
    `EOVERFLOW` error chunk, if the request length is too large.
+- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
+  set, the client is interested in only one extent per metadata
+  context. If this flag is present, the server SHOULD NOT send metadata
+  on more than one extent in the reply. Clients SHOULD NOT use this flag
+  on multiple requests for successive regions in the export.
 
 ##### Structured reply flags
 
@@ -1051,6 +1218,34 @@  interpret the "length" bytes of payload.
   64 bits: offset (unsigned)  
   32 bits: hole size (unsigned, MUST be nonzero)  
 
+- `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
+
+    *length* MUST be 4 + (a positive integer multiple of 8).  This reply
+    represents a series of consecutive block descriptors where the sum
+    of the lengths of the descriptors MUST not be greater than the
+    length of the original request. This chunk type MUST appear exactly
+    once per metadata ID in a structured reply.
+
+    The payload starts with:
+
+        * 32 bits, metadata context ID
+
+    and is followed by a list of one or more descriptors, each with this
+    layout:
+
+        * 32 bits, length (unsigned, MUST NOT be zero)
+        * 32 bits, status flags
+
+    If the client used the `NBD_CMD_FLAG_REQ_ONE` flag in the request,
+    then every reply chunk MUST NOT contain more than one descriptor.
+
+    Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
+    its request, the server MAY return less descriptors in the reply
+    than would be required to fully specify the whole range of requested
+    information to the client, if the number of descriptors would be
+    over 16 otherwise and looking up the information would be too
+    resource-intensive for the server.
+
 All error chunk types have bit 15 set, and begin with the same
 *error*, *message length*, and optional *message* fields as
 `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
@@ -1085,7 +1280,7 @@  remaining structured fields at the end.
   were sent earlier in the structured reply, the server SHOULD NOT
   send multiple distinct offsets that lie within the bounds of a
   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
-  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
+  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
 
   The payload is structured as:
 
@@ -1259,6 +1454,44 @@  The following request types exist:
 
     Defined by the experimental `WRITE_ZEROES` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).
 
+* `NBD_CMD_BLOCK_STATUS` (7)
+
+    A block status query request. Length and offset define the range of
+    interest. Clients MUST NOT use this request unless metadata
+    contexts have been negotiated, which in turn requires the client to
+    first negotiate structured replies. For a successful return, the
+    server MUST use a structured reply, containing at least one chunk of
+    type `NBD_REPLY_TYPE_BLOCK_STATUS`, where the status field of each
+    descriptor is determined by the flags field as defined by the
+    metadata context.
+
+    The list of block status descriptors within the
+    `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
+    of the file starting from specified *offset*, and the sum of the
+    *length* fields of each descriptor MUST not be greater than the
+    overall *length* of the request. This means that the server MAY
+    return less data than required. However the server MUST return at
+    least one status descriptor.  The server SHOULD use different
+    *status* values between consecutive descriptors, and SHOULD use
+    descriptor lengths that are an integer multiple of 512 bytes where
+    possible (the first and last descriptor of an unaligned query being
+    the most obvious places for an exception). The status flags are
+    intentionally defined so that a server MAY always safely report a
+    status of 0 for any block, although the server SHOULD return
+    additional status values when they can be easily detected.
+
+    If an error occurs, the server SHOULD set the appropriate error
+    code in the error field of an error chunk. However, if the error
+    does not involve invalid usage (such as a request beyond the bounds
+    of the file), a server MAY reply with a single block status
+    descriptor with *length* matching the requested length, and *status*
+    of 0 rather than reporting the error.
+
+    A client MAY initiate a hard disconnect if it detects that the
+    server has sent an invalid chunk. The server SHOULD return `EINVAL`
+    if it receives a `NBD_CMD_BLOCK_STATUS` request including one or
+    more sectors beyond the size of the device.
+
 * Other requests
 
     Some third-party implementations may require additional protocol