diff mbox series

spec: Add NBD_OPT_EXTENDED_HEADERS

Message ID 20211203231434.3900824-1-eblake@redhat.com
State New
Headers show
Series spec: Add NBD_OPT_EXTENDED_HEADERS | expand

Commit Message

Eric Blake Dec. 3, 2021, 11:14 p.m. UTC
Add a new negotiation feature where the client and server agree to use
larger packet headers on every packet sent during transmission phase.
This has two purposes: first, it makes it possible to perform
operations like trim, write zeroes, and block status on more than 2^32
bytes in a single command; this in turn requires that some structured
replies from the server also be extended to match.  The wording chosen
here is careful to permit a server to use either flavor in its reply
(that is, a request less than 32-bits can trigger an extended reply,
and conversely a request larger than 32-bits can trigger a compact
reply).

Second, when structured replies are active, clients have to deal with
the difference between 16- and 20-byte headers of simple
vs. structured replies, which impacts performance if the client must
perform multiple syscalls to first read the magic before knowing how
many additional bytes to read.  In extended header mode, all headers
are the same width, so the client can read a full header before
deciding whether the header describes a simple or structured reply.
Similarly, by having extended mode use a power-of-2 sizing, it becomes
easier to manipulate headers within a single cache line, even if it
requires padding bytes sent over the wire.  However, note that this
change only affects the headers; as data payloads can still be
unaligned (for example, a client performing 1-byte reads or writes),
we would need to negotiate yet another extension if we wanted to
ensure that all NBD transmission packets started on an 8-byte boundary
after option haggling has completed.

This spec addition was done in parallel with a proof of concept
implementation in qemu (server and client) and libnbd (client), and I
also have plans to implement it in nbdkit (server).

Signed-off-by: Eric Blake <eblake@redhat.com>
---

Available at https://repo.or.cz/nbd/ericb.git/shortlog/refs/tags/exthdr-v1

 doc/proto.md | 218 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 177 insertions(+), 41 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy Dec. 6, 2021, 11:40 a.m. UTC | #1
04.12.2021 02:14, Eric Blake wrote:
> Add a new negotiation feature where the client and server agree to use
> larger packet headers on every packet sent during transmission phase.
> This has two purposes: first, it makes it possible to perform
> operations like trim, write zeroes, and block status on more than 2^32
> bytes in a single command; this in turn requires that some structured
> replies from the server also be extended to match.  The wording chosen
> here is careful to permit a server to use either flavor in its reply
> (that is, a request less than 32-bits can trigger an extended reply,
> and conversely a request larger than 32-bits can trigger a compact
> reply).
> 
> Second, when structured replies are active, clients have to deal with
> the difference between 16- and 20-byte headers of simple
> vs. structured replies, which impacts performance if the client must
> perform multiple syscalls to first read the magic before knowing how
> many additional bytes to read.  In extended header mode, all headers
> are the same width, so the client can read a full header before
> deciding whether the header describes a simple or structured reply.
> Similarly, by having extended mode use a power-of-2 sizing, it becomes
> easier to manipulate headers within a single cache line, even if it
> requires padding bytes sent over the wire.  However, note that this
> change only affects the headers; as data payloads can still be
> unaligned (for example, a client performing 1-byte reads or writes),
> we would need to negotiate yet another extension if we wanted to
> ensure that all NBD transmission packets started on an 8-byte boundary
> after option haggling has completed.
> 
> This spec addition was done in parallel with a proof of concept
> implementation in qemu (server and client) and libnbd (client), and I
> also have plans to implement it in nbdkit (server).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 
> Available at https://repo.or.cz/nbd/ericb.git/shortlog/refs/tags/exthdr-v1
> 
>   doc/proto.md | 218 +++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 177 insertions(+), 41 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 3a877a9..46560b6 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -295,6 +295,21 @@ reply is also problematic for error handling of the `NBD_CMD_READ`
>   request.  Therefore, structured replies can be used to create a
>   a context-free server stream; see below.
> 
> +The results of client negotiation also determine whether the client
> +and server will utilize only compact requests and replies, or whether
> +both sides will use only extended packets.  Compact messages are the
> +default, but inherently limit single transactions to a 32-bit window
> +starting at a 64-bit offset.  Extended messages make it possible to
> +perform 64-bit transactions (although typically only for commands that
> +do not include a data payload).  Furthermore, when structured replies
> +have been negotiated, compact messages require the client to perform
> +partial reads to determine which reply packet style (simple or
> +structured) is on the wire before knowing the length of the rest of
> +the reply, which can reduce client performance.  With extended
> +messages, all packet headers have a fixed length of 32 bytes, and
> +although this results in more traffic over the network due to padding,
> +the resulting layout is friendlier for performance.
> +
>   Replies need not be sent in the same order as requests (i.e., requests
>   may be handled by the server asynchronously), and structured reply
>   chunks from one request may be interleaved with reply messages from
> @@ -343,7 +358,9 @@ may be useful.
> 
>   #### Request message
> 
> -The request message, sent by the client, looks as follows:
> +The compact request message, sent by the client when extended
> +transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +looks as follows:
> 
>   C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)
>   C: 16 bits, command flags
> @@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
>   C: 32 bits, length (unsigned)
>   C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)
> 
> +If negotiation agreed on extended transactions with
> +`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
> +
> +C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
> +C: 16 bits, command flags
> +C: 16 bits, type
> +C: 64 bits, handle
> +C: 64 bits, offset (unsigned)
> +C: 64 bits, length (unsigned)
> +C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)
> +
>   #### Simple reply message
> 
>   The simple reply message MUST be sent by the server in response to all
>   requests if structured replies have not been negotiated using
> -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> -but only if the reply has no data payload.  The message looks as
> -follows:
> +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> +negotiated, a simple reply MAY be used as a reply to any request other
> +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks as follows:
> 
>   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
>      `NBD_REPLY_MAGIC`)
> @@ -369,6 +398,16 @@ S: 64 bits, handle
>   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
>       *error* is zero)
> 
> +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks like:
> +
> +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> +S: 32 bits, error (MAY be zero)
> +S: 64 bits, handle
> +S: 128 bits, padding (MUST be zero)
> +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> +    *error* is zero)
> +

If we go this way, let's put payload length into padding: it will help to make the protocol context-independent and less error-prone.

Or, the otherway, may be just forbid the payload for simple-64bit ? What's the reason to allow 64bit requests without structured reply negotiation?

>   #### Structured reply chunk message
> 
>   Some of the major downsides of the default simple reply to
> @@ -410,7 +449,9 @@ considered successful only if it did not contain any error chunks,
>   although the client MAY be able to determine partial success based
>   on the chunks received.
> 
> -A structured reply chunk message looks as follows:
> +If extended headers were not negotiated using
> +`NBD_OPT_EXTENDED_HEADERS`, a structured reply chunk message looks as
> +follows:
> 
>   S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
>   S: 16 bits, flags
> @@ -423,6 +464,17 @@ The use of *length* in the reply allows context-free division of
>   the overall server traffic into individual reply messages; the
>   *type* field describes how to further interpret the payload.
> 
> +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks like:
> +
> +S: 32 bits, 0x6e8a278c, magic (`NBD_STRUCTURED_REPLY_EXT_MAGIC`)
> +S: 16 bits, flags
> +S: 16 bits, type
> +S: 64 bits, handle
> +S: 64 bits, length of payload (unsigned)

Maybe, 64bits is too much for payload. But who knows. And it's good that it's symmetric to 64bit length in request.

> +S: 64 bits, padding (MUST be zero)

Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 24 bytes?

> +S: *length* bytes of payload data (if *length* is nonzero)

Hmm2: we probably may move "handle" to the start of payload. This way we can keep 16bytes header for simple reply and 16bytes header for structured. So structured are read in two shots: 1. the header, 2. handle + payload.. But that means deeper restructuring of the client code.. So seems not worth it.


> +
>   #### Terminating the transmission phase
> 
>   There are two methods of terminating the transmission phase:
> @@ -870,15 +922,19 @@ The procedure works as follows:
>     server supports.
>   - During transmission, a client can then indicate interest in metadata
>     for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
> -  where *offset* and *length* indicate the area of interest. The
> -  server MUST then respond with the requested information, for all
> +  where *offset* and *length* indicate the area of interest.
> +- The server MUST then respond with the requested information, for all
>     contexts which were selected during negotiation. For every metadata
> -  context, the server sends one set of extent chunks, where the sizes
> -  of the extents MUST be less than or equal to the length as specified
> -  in the request.

I'm not sure we can simply drop this requirement.. It seems like an incompatible change, isn't it? May be, we should allow any size of extent only for 64bit mode?

> Each extent comes with a *flags* field, the
> -  semantics of which are defined by the metadata context.
> -- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured
> -  reply of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
> +  context, the server sends one set of extent chunks, using
> +  `NBD_REPLY_TYPE_BLOCK_STATUS` or `NBD_REPLY_TYPE_BLOCK_STATUS_EXT`
> +  (the latter is only possible if the client also negotiated
> +  `NBD_OPT_EXTENDED_HEADERS`).  Each extent comes with a *flags*
> +  field, the semantics of which are defined by the metadata context.
> +
> +The client's requested *size* is only a hint to the server, so the
> +summed size of extents in the server's reply may be shorter, or in
> +some cases longer, than the original request, and may even differ
> +between contexts when multiple metadata contexts were negotiated.
> 
>   A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
>   nonzero number of metadata contexts during negotiation, and used the
> @@ -1179,10 +1235,10 @@ of the newstyle negotiation.
> 
>       When this command succeeds, the server MUST NOT preserve any
>       negotiation state (such as a request for
> -    `NBD_OPT_STRUCTURED_REPLY`, or metadata contexts from
> -    `NBD_OPT_SET_META_CONTEXT`) issued before this command.  A client
> -    SHOULD defer all stateful option requests until after it
> -    determines whether encryption is available.
> +    `NBD_OPT_STRUCTURED_REPLY` or `NBD_OPT_EXTENDED_HEADERS`, or
> +    metadata contexts from `NBD_OPT_SET_META_CONTEXT`) issued before
> +    this command.  A client SHOULD defer all stateful option requests
> +    until after it determines whether encryption is available.
> 
>       See the section on TLS above for further details.
> 
> @@ -1460,6 +1516,26 @@ of the newstyle negotiation.
>       option does not select any metadata context, provided the client
>       then does not attempt to issue `NBD_CMD_BLOCK_STATUS` commands.
> 
> +* `NBD_OPT_EXTENDED_HEADERS` (11)
> +
> +    The client wishes to use extended headers during the transmission
> +    phase.  The client MUST NOT send any additional data with the
> +    option, and the server SHOULD reject a request that includes data
> +    with `NBD_REP_ERR_INVALID`.
> +
> +    The server replies with the following, or with an error permitted
> +    elsewhere in this document:
> +
> +    - `NBD_REP_ACK`: Extended headers have been negotiated; the client
> +      MUST use the 32-byte extended request header, and the server
> +      MUST use the 32-byte extended reply header.
> +    - For backwards compatibility, clients SHOULD be prepared to also
> +      handle `NBD_REP_ERR_UNSUP`; in this case, only the compact
> +      transmission headers will be used.
> +
> +    If the client requests `NBD_OPT_STARTTLS` after this option, it
> +    MUST renegotiate extended headers.
> +
>   #### Option reply types
> 
>   These values are used in the "reply type" field, sent by the server
> @@ -1713,12 +1789,12 @@ unrecognized flags.
> 
>   #### Structured reply types
> 
> -These values are used in the "type" field of a structured reply.
> -Some chunk types can additionally be categorized by role, such as
> -*error chunks* or *content chunks*.  Each type determines how to
> -interpret the "length" bytes of payload.  If the client receives
> -an unknown or unexpected type, other than an *error chunk*, it
> -MUST initiate a hard disconnect.
> +These values are used in the "type" field of a structured reply.  Some
> +chunk types can additionally be categorized by role, such as *error
> +chunks*, *content chunks*, or *status chunks*.  Each type determines
> +how to interpret the "length" bytes of payload.  If the client
> +receives an unknown or unexpected type, other than an *error chunk*,
> +it MUST initiate a hard disconnect.

Just add "status chunks" to the list. Seems unrelated, better be in a separate patch.

> 
>   * `NBD_REPLY_TYPE_NONE` (0)
> 
> @@ -1761,13 +1837,34 @@ MUST initiate a hard disconnect.
>     64 bits: offset (unsigned)
>     32 bits: hole size (unsigned, MUST be nonzero)
> 
> +* `NBD_REPLY_TYPE_OFFSET_HOLE_EXT` (3)
> +
> +  This chunk type is in the content chunk category.  *length* MUST be
> +  exactly 16.  The semantics of this chunk mirror those of
> +  `NBD_REPLY_TYPE_OFFSET_HOLE`, other than the use of a larger *hole
> +  size* field.  This chunk type MUST NOT be used unless extended
> +  headers were negotiated with `NBD_OPT_EXTENDED_HEADERS`.

Why do you call all such things _EXT, not _64 ? _64 seems more informative.

> +
> +  The payload is structured as:
> +
> +  64 bits: offset (unsigned)
> +  64 bits: hole size (unsigned, MUST be nonzero)
> +
> +  Note that even when extended headers are in use, a server may
> +  enforce a maximum block size that is smaller than 32 bits, in which
> +  case no valid `NBD_CMD_READ` will have a *length* large enough to
s/nc/no/ ? But hard to read any way, as sounds very similar to "not valid", which breaks the meaning.

may be just "in which case valid NBD_CMD_READ will not have"

> +  require the use of this chunk type.  However, a client using
> +  extended headers MUST be prepared for the server to use either the
> +  compact or extended chunk type.
> +
>   * `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> 
> -  *length* MUST be 4 + (a positive integer multiple of 8).  This reply
> -  represents a series of consecutive block descriptors where the sum
> -  of the length fields within the descriptors is subject to further
> -  constraints documented below. This chunk type MUST appear
> -  exactly once per metadata ID in a structured reply.
> +  This chunk type is in the status chunk category.  *length* MUST be
> +  4 + (a positive integer multiple of 8).  This reply represents a
> +  series of consecutive block descriptors where the sum of the length
> +  fields within the descriptors is subject to further constraints
> +  documented below.  Each negotiated metadata ID must have exactly one
> +  status chunk in the overall structured reply.

just rewording, no semantic changes, yes?

> 
>     The payload starts with:
> 
> @@ -1796,9 +1893,36 @@ MUST initiate a hard disconnect.
>     information to the client, if looking up the information would be
>     too resource-intensive for the server, so long as at least one
>     extent is returned. Servers should however be aware that most
> -  clients implementations will then simply ask for the next extent
> +  client implementations will then simply ask for the next extent
>     instead.

So you keep all restrictions about NBD_CMD_FLAG_REQ_ONE and about sum of lenghts of extents as is here..

> 
> +* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
> +
> +  This chunk type is in the status chunk category.  *length* MUST be
> +  4 + (a positive multiple of 16).  The semantics of this chunk mirror
> +  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
> +  larger *extent length* field, as well as added padding to ease
> +  alignment.

But what about restrictions on chunk lengths and cumulative chunk length?

> +  This chunk type MUST NOT be used unless extended headers
> +  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
> +
> +  The payload starts with:
> +
> +  32 bits, metadata context ID
> +
> +  and is followed by a list of one or more descriptors, each with this
> +  layout:
> +
> +  64 bits, length of the extent to which the status below
> +     applies (unsigned, MUST be nonzero)
> +  32 bits, status flags
> +  32 bits, padding (MUST be zero)
> +
> +  Note that even when extended headers are in use, the client MUST be
> +  prepared for the server to use either the compact or extended chunk
> +  type, regardless of whether the client's hinted length was more or
> +  less than 32 bits, but the server MUST use exactly one of the two
> +  chunk types per negotiated metacontext ID.

But we have anyway one chunk per ID in a reply.. Or you mean that the type of reply for the ID should be selected once for the whole session?

> +
>   All error chunk types have bit 15 set, and begin with the same
>   *error*, *message length*, and optional *message* fields as
>   `NBD_REPLY_TYPE_ERROR`.  If nonzero, *message length* indicates
> @@ -1812,7 +1936,10 @@ remaining structured fields at the end.
>     be at least 6.  This chunk represents that an error occurred,
>     and the client MAY NOT make any assumptions about partial
>     success. This type SHOULD NOT be used more than once in a
> -  structured reply.  Valid as a reply to any request.
> +  structured reply.  Valid as a reply to any request.  Note that
> +  *message length* MUST NOT exceed the 4096 bytes string length limit,
> +  and therefore there is no need for a counterpart extended-length
> +  error chunk type.
> 
>     The payload is structured as:
> 
> @@ -1867,7 +1994,8 @@ The following request types exist:
> 
>       If structured replies were not negotiated, then a read request
>       MUST always be answered by a simple reply, as documented above
> -    (using magic 0x67446698 `NBD_SIMPLE_REPLY_MAGIC`, and containing
> +    (using `NBD_SIMPLE_REPLY_MAGIC` or `NBD_SIMPLE_REPLY_EXT_MAGIC`
> +    according to whether extended headers are in use, and containing
>       length bytes of data according to the client's request).
> 
>       If an error occurs, the server SHOULD set the appropriate error code
> @@ -1883,7 +2011,8 @@ The following request types exist:
> 
>       If structured replies are negotiated, then a read request MUST
>       result in a structured reply with one or more chunks (each using
> -    magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), where the final
> +    `NBD_STRUCTURED_REPLY_MAGIC` or `NBD_STRUCTURED_REPLY_EXT_MAGIC`
> +    according to whether extended headers are in use), where the final
>       chunk has the flag `NBD_REPLY_FLAG_DONE`, and with the following
>       additional constraints.
> 
> @@ -1897,13 +2026,14 @@ The following request types exist:
>       chunks that describe data outside the offset and length of the
>       request, but MAY send the content chunks in any order (the client
>       MUST reassemble content chunks into the correct order), and MAY
> -    send additional content chunks even after reporting an error chunk.
> -    Note that a request for more than 2^32 - 8 bytes MUST be split
> -    into at least two chunks, so as not to overflow the length field
> -    of a reply while still allowing space for the offset of each
> -    chunk.  When no error is detected, the server MUST send enough
> -    data chunks to cover the entire region described by the offset and
> -    length of the client's request.
> +    send additional content chunks even after reporting an error
> +    chunk.  Note that if extended headers are not in use, a request
> +    for more than 2^32 - 8 bytes MUST be split into at least two
> +    chunks, so as not to overflow the length field of a reply while
> +    still allowing space for the offset of each chunk.  When no error
> +    is detected, the server MUST send enough data chunks to cover the
> +    entire region described by the offset and length of the client's
> +    request.
> 
>       To minimize traffic, the server MAY use a content or error chunk
>       as the final chunk by setting the `NBD_REPLY_FLAG_DONE` flag, but
> @@ -2136,13 +2266,19 @@ The following request types exist:
>       server returned at least one metadata context without an error.
>       This in turn requires the client to first negotiate structured
>       replies. For a successful return, the server MUST use a structured
> -    reply, containing exactly one chunk of type
> +    reply, containing exactly one status chunk of type
>       `NBD_REPLY_TYPE_BLOCK_STATUS` per selected context id, where the
>       status field of each descriptor is determined by the flags field
>       as defined by the metadata context.  The server MAY send chunks in
>       a different order than the context ids were assigned in reply to
>       `NBD_OPT_SET_META_CONTEXT`.
> 
> +    If extended headers were negotiated via
> +    `NBD_OPT_EXTENDED_HEADERS`, the server may use
> +    `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` instead of
> +    `NBD_REPLY_TYPE_BLOCK_STATUS` as the reply chunk for a metacontext
> +    id.
> +
>       The list of block status descriptors within the
>       `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>       of the file starting from specified *offset*.  If the client used
> 

Overall, seems good to me.

1. Could we move some fixes / rewordings to a preaparation patch?

2. I see you want also to overcome unpleasant restrictions we had around lengths / cumulative lengths of BLOCK_STATUS replies. I like the idea. But I think, it should be clarified that without 64bit extension negotiated all stay as is. And with 64bit extension negotiated, BLOCK_STATUS works in a slighter new way, so it may return what server wants, and original "length" is simply a hint. Or, at least that new behavior is only about NBD_REPLY_TYPE_BLOCK_STATUS_EXT.. Also, some clarifications may need around NBD_CMD_FLAG_REQ_ONE flag, what changes for it? You don't mention it at all in new version of BLOCK_STATUS reply.
Eric Blake Dec. 6, 2021, 11 p.m. UTC | #2
On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >   #### Simple reply message
> > 
> >   The simple reply message MUST be sent by the server in response to all
> >   requests if structured replies have not been negotiated using
> > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > -but only if the reply has no data payload.  The message looks as
> > -follows:
> > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > +negotiated, a simple reply MAY be used as a reply to any request other
> > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks as follows:
> > 
> >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> >      `NBD_REPLY_MAGIC`)
> > @@ -369,6 +398,16 @@ S: 64 bits, handle
> >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> >       *error* is zero)
> > 
> > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks like:
> > +
> > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > +S: 32 bits, error (MAY be zero)
> > +S: 64 bits, handle
> > +S: 128 bits, padding (MUST be zero)
> > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > +    *error* is zero)
> > +
> 
> If we go this way, let's put payload length into padding: it will help to make the protocol context-independent and less error-prone.

Easy enough to do (the payload length will be 0 except for
NBD_CMD_READ).

> 
> Or, the otherway, may be just forbid the payload for simple-64bit ? What's the reason to allow 64bit requests without structured reply negotiation?

The two happened to be orthogonal enough in my implementation.  It was
easy to demonstrate either one without the other, and it IS easier to
write a client using non-structured replies (structured reads ARE
tougher than simple reads, even if it is less efficient when it comes
to reading zeros).  But you are also right that we could require
structured reads prior to allowing 64-bit operations, and then have
only one supported reply type on the wire when negotiated.  Wouter,
which way do you prefer?

> 
> >   #### Structured reply chunk message
> > 
> >   Some of the major downsides of the default simple reply to
> > @@ -410,7 +449,9 @@ considered successful only if it did not contain any error chunks,
> >   although the client MAY be able to determine partial success based
> >   on the chunks received.
> > 
> > -A structured reply chunk message looks as follows:
> > +If extended headers were not negotiated using
> > +`NBD_OPT_EXTENDED_HEADERS`, a structured reply chunk message looks as
> > +follows:
> > 
> >   S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
> >   S: 16 bits, flags
> > @@ -423,6 +464,17 @@ The use of *length* in the reply allows context-free division of
> >   the overall server traffic into individual reply messages; the
> >   *type* field describes how to further interpret the payload.
> > 
> > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks like:
> > +
> > +S: 32 bits, 0x6e8a278c, magic (`NBD_STRUCTURED_REPLY_EXT_MAGIC`)
> > +S: 16 bits, flags
> > +S: 16 bits, type
> > +S: 64 bits, handle
> > +S: 64 bits, length of payload (unsigned)
> 
> Maybe, 64bits is too much for payload. But who knows. And it's good that it's symmetric to 64bit length in request.

Indeed, both qemu and libnbd implementations explicitly kill the
connection to any server that replies with more than the max buffer
used for NBD_CMD_READ/WRITE (32M for qemu, 64M for libnbd).  And if
the spec is not already clear on the topic, I should add an
independent patch to NBD_CMD_BLOCK_STATUS to make it obvious that a
server cannot reply with too many extents because of such clients.

So none of my proof-of-concept code ever used the full 64-bits of the
reply header length.  On the other hand, there is indeed the symmetry
argument - if someone writes a server willing to accept a 4G
NBD_CMD_WRITE, then it should also support a 4G NBD_CMD_READ, even if
no known existing server or client allows buffers that large..

> 
> > +S: 64 bits, padding (MUST be zero)
> 
> Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 24 bytes?

Consider:
struct header[100];

if sizeof(header[0]) is a power of 2 <= the cache line size (and the
compiler prefers to start arrays aligned to the cache line) then we
are guaranteed that all array members each reside in a single cache
line.  But if it is not a power of 2, some of the array members
straddle two cache lines.

Will there be code that wants to create an array of headers?  Perhaps
so, because that is a logical way (along with scatter/gather to
combine the header with variable-sized payloads) of tracking the
headers for multiple commands issued in parallel.

Do I have actual performance numbers?  No. But there's plenty of
google hits for why sizing structs to a power of 2 is a good idea.

> 
> > +S: *length* bytes of payload data (if *length* is nonzero)
> 
> Hmm2: we probably may move "handle" to the start of payload. This way we can keep 16bytes header for simple reply and 16bytes header for structured. So structured are read in two shots: 1. the header, 2. handle + payload.. But that means deeper restructuring of the client code.. So seems not worth it.

Right now, the handle is in the same offset for both simple and
structured replies, and for both normal and extended headers.  My
proof-of-concept for qemu always reads just the magic number and
handle, then decides how many more bytes to read (if any) (1 syscall
for simple compact headers, 2 syscalls for compact structured and for
both extended styles); while my proof-of-concept for libnbd actually
decides up front to only do a 32-byte read if extended headers are in
use for fewer syscalls.  I don't know if one way is better than the
other, but the differences in styles fell out naturally from the rest
of those code bases, and certainly anything that can be done with
fewer syscalls per transaction is going to show a modest improvement.

But you are right that repositioning the handle to live at some other
offset (including forcing it to live in the payload with a 16-byte
header, instead of having a 32-byte header) would be more invasive.
Doable?  Maybe.  That's why this is an RFC.  But unless there is a
compelling reason to try, I'd rather not go to that effort.

> 
> 
> > +
> >   #### Terminating the transmission phase
> > 
> >   There are two methods of terminating the transmission phase:
> > @@ -870,15 +922,19 @@ The procedure works as follows:
> >     server supports.
> >   - During transmission, a client can then indicate interest in metadata
> >     for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
> > -  where *offset* and *length* indicate the area of interest. The
> > -  server MUST then respond with the requested information, for all
> > +  where *offset* and *length* indicate the area of interest.
> > +- The server MUST then respond with the requested information, for all
> >     contexts which were selected during negotiation. For every metadata
> > -  context, the server sends one set of extent chunks, where the sizes
> > -  of the extents MUST be less than or equal to the length as specified
> > -  in the request.
> 
> I'm not sure we can simply drop this requirement.. It seems like an incompatible change, isn't it? May be, we should allow any size of extent only for 64bit mode?

I'm not dropping the requirement; what was listed here is redundant
with what appears elsewhere under NBD_REPLY_TYPE_BLOCK_STATUS, where
the addition of NBD_REPLY_TYPE_BLOCK_STATUS_EXT made it too wordy to
keep the redundancy here.  But yes, I can try and separate the patch
into minor cleanups separate from new additions.

...
> >   #### Structured reply types
> > 
> > -These values are used in the "type" field of a structured reply.
> > -Some chunk types can additionally be categorized by role, such as
> > -*error chunks* or *content chunks*.  Each type determines how to
> > -interpret the "length" bytes of payload.  If the client receives
> > -an unknown or unexpected type, other than an *error chunk*, it
> > -MUST initiate a hard disconnect.
> > +These values are used in the "type" field of a structured reply.  Some
> > +chunk types can additionally be categorized by role, such as *error
> > +chunks*, *content chunks*, or *status chunks*.  Each type determines
> > +how to interpret the "length" bytes of payload.  If the client
> > +receives an unknown or unexpected type, other than an *error chunk*,
> > +it MUST initiate a hard disconnect.
> 
> Just add "status chunks" to the list. Seems unrelated, better be in a separate patch.

Previously, only NBD_REPLY_TYPE_BLOCK_STATUS counts as a status chunk,
now we have two reply types with that qualification.  But I can indeed
split up the terminology addition from the addition of the second type
of status chunk.

> 
> > 
> >   * `NBD_REPLY_TYPE_NONE` (0)
> > 
> > @@ -1761,13 +1837,34 @@ MUST initiate a hard disconnect.
> >     64 bits: offset (unsigned)
> >     32 bits: hole size (unsigned, MUST be nonzero)
> > 
> > +* `NBD_REPLY_TYPE_OFFSET_HOLE_EXT` (3)
> > +
> > +  This chunk type is in the content chunk category.  *length* MUST be
> > +  exactly 16.  The semantics of this chunk mirror those of
> > +  `NBD_REPLY_TYPE_OFFSET_HOLE`, other than the use of a larger *hole
> > +  size* field.  This chunk type MUST NOT be used unless extended
> > +  headers were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
> 
> Why do you call all such things _EXT, not _64 ? _64 seems more informative.

_64 would be fine with me.  As this is an RFC, the naming is not
locked in stone.

> 
> > +
> > +  The payload is structured as:
> > +
> > +  64 bits: offset (unsigned)
> > +  64 bits: hole size (unsigned, MUST be nonzero)
> > +
> > +  Note that even when extended headers are in use, a server may
> > +  enforce a maximum block size that is smaller than 32 bits, in which
> > +  case no valid `NBD_CMD_READ` will have a *length* large enough to
> s/nc/no/ ? But hard to read any way, as sounds very similar to "not valid", which breaks the meaning.
> 
> may be just "in which case valid NBD_CMD_READ will not have"

I like that.

> 
> > +  require the use of this chunk type.  However, a client using
> > +  extended headers MUST be prepared for the server to use either the
> > +  compact or extended chunk type.
> > +
> >   * `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
> > 
> > -  *length* MUST be 4 + (a positive integer multiple of 8).  This reply
> > -  represents a series of consecutive block descriptors where the sum
> > -  of the length fields within the descriptors is subject to further
> > -  constraints documented below. This chunk type MUST appear
> > -  exactly once per metadata ID in a structured reply.
> > +  This chunk type is in the status chunk category.  *length* MUST be
> > +  4 + (a positive integer multiple of 8).  This reply represents a
> > +  series of consecutive block descriptors where the sum of the length
> > +  fields within the descriptors is subject to further constraints
> > +  documented below.  Each negotiated metadata ID must have exactly one
> > +  status chunk in the overall structured reply.
> 
> just rewording, no semantic changes, yes?

The change is that it is no longer to have exactly one of these per
reply (you can have a BLOCK_STATUS_EXT instead).  True, not much of a
change, but it is because of the new type.  Again, adding the notion
of exactly one status chunk per metadata id (even with only one
possible status chunk) in one patch, then adding the second status
chunk with extended headers, may be easier to review, so I'll try that
for v2.

> 
> > 
> >     The payload starts with:
> > 
> > @@ -1796,9 +1893,36 @@ MUST initiate a hard disconnect.
> >     information to the client, if looking up the information would be
> >     too resource-intensive for the server, so long as at least one
> >     extent is returned. Servers should however be aware that most
> > -  clients implementations will then simply ask for the next extent
> > +  client implementations will then simply ask for the next extent
> >     instead.
> 
> So you keep all restrictions about NBD_CMD_FLAG_REQ_ONE and about sum of lenghts of extents as is here..

Yes.

> 
> > 
> > +* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
> > +
> > +  This chunk type is in the status chunk category.  *length* MUST be
> > +  4 + (a positive multiple of 16).  The semantics of this chunk mirror
> > +  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
> > +  larger *extent length* field, as well as added padding to ease
> > +  alignment.
> 
> But what about restrictions on chunk lengths and cumulative chunk length?

That is supposed to still be in effect.  If I deleted that
restriction, it was unintentional.  That is, the cumulative length
(and thus each individual extent length, since no extent can be larger
than the cumulative length) is not allowed to exceed the client's
length request except in the case of the last extent, and even then
only when REQ_ONE was not in use.

> 
> > +  This chunk type MUST NOT be used unless extended headers
> > +  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
> > +
> > +  The payload starts with:
> > +
> > +  32 bits, metadata context ID
> > +
> > +  and is followed by a list of one or more descriptors, each with this
> > +  layout:
> > +
> > +  64 bits, length of the extent to which the status below
> > +     applies (unsigned, MUST be nonzero)
> > +  32 bits, status flags
> > +  32 bits, padding (MUST be zero)
> > +
> > +  Note that even when extended headers are in use, the client MUST be
> > +  prepared for the server to use either the compact or extended chunk
> > +  type, regardless of whether the client's hinted length was more or
> > +  less than 32 bits, but the server MUST use exactly one of the two
> > +  chunk types per negotiated metacontext ID.
> 
> But we have anyway one chunk per ID in a reply.. Or you mean that the type of reply for the ID should be selected once for the whole session?

I envision the following as valid:

OPT_SET_META_CONTEXT("base:allocation", "my:extension")
=> id 1: "base:allocation", id 2: "my:extension"
OPT_GO
...

CMD_BLOCK_STATUS(offset=0, length=3G)
=> REPLY_TYPE_BLOCK_STATUS(id=1, extent[2] { length=2G flags=0, length=2G flags=1 })
=> REPLY_TYPE_BLOCK_STATUS_EXT(id=2, extent[1] { length=3G flags=0 })
CMD_BLOCK_STATUS(offset=3G, length=6G)
=> REPLY_TYPE_BLOCK_STATUS_EXT(id 1, extent[1] { length=5G flags=0 })
=> REPLY_TYPE_BLOCK_STATUS(id 2, extent[2] { length=3.5G flags=0, length=3.5G flags=1 })

Note that the first id=1 responded with a cumulative length larger
than the client's request, and the cumulative length is > 4G, but the
response itself gets away with only 32-bit extents.  The first id=2
response is < 3G, but the server chose to use a 64-bit extent anyway.
The second id=1 response has to use the 64-bit response (because even
though 5G is shorter than the client's request for 6G, it is larger
than the 4G maximum of a 32-bit response). The second id=2 is similar
to the first id=1 in that it uses a 32-bit response even though the
cumulative length is >4G.  There is no requirement that the cumulative
lengths of the two ids be identical.  And since REQ_ONE is not in
effect, the last extent of a given extent array can cause the
cumulative value to exceed the client's request.

What is invalid:
CMD_BLOCK_STATUS(offset=0, length=3G)
=> REPLY_TYPE_BLOCK_STATUS(id=1, extent[2] { length=2G flags=0, length=2G flags=1 })
=> REPLY_TYPE_BLOCK_STATUS_EXT(id=1, extent[1] { length=3G flags=0 })

because it used two status chunks both for context id=1.

Maybe I need to add the phrase "within a given NBD_CMD_BLOCK_STATUS
response", to make it clear that exactly one status chunk per id is
chosen, but whether the server chooses a 32- or 64-bit status chunk is
dependent solely on the server's whims, and the client must be
prepared for either, regardless of the length the client used
originally.

...
> 
> Overall, seems good to me.

Glad to hear it!

> 
> 1. Could we move some fixes / rewordings to a preaparation patch?

Yes, I'll do that for v2.

> 
> 2. I see you want also to overcome unpleasant restrictions we had around lengths / cumulative lengths of BLOCK_STATUS replies. I like the idea. But I think, it should be clarified that without 64bit extension negotiated all stay as is. And with 64bit extension negotiated, BLOCK_STATUS works in a slighter new way, so it may return what server wants, and original "length" is simply a hint. Or, at least that new behavior is only about NBD_REPLY_TYPE_BLOCK_STATUS_EXT.. Also, some clarifications may need around NBD_CMD_FLAG_REQ_ONE flag, what changes for it? You don't mention it at all in new version of BLOCK_STATUS reply.

That's not new behavior.  The client's length has always been a mere
hint to the server, where the only constraints are that the server
must make progress on success, and that if REQ_ONE is in use, the
server may not report more than then length the client asked about.

Or are you proposing that we relax REQ_ONE, and allow a server to
report additional length in 64-bit mode even when REQ_ONE is in use?
The 32-bit limitation on not sending back too much length with compact
structured replies was because qemu as client at one point would abort
if the cumulative length was too long (and qemu has always used
REQ_ONE).  But if extended headers are in use, qemu no longer aborts
on oversize answers, and no other client is new enough to have
extended headers.
Vladimir Sementsov-Ogievskiy Dec. 7, 2021, 9:08 a.m. UTC | #3
07.12.2021 02:00, Eric Blake wrote:
> On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>>    #### Simple reply message
>>>
>>>    The simple reply message MUST be sent by the server in response to all
>>>    requests if structured replies have not been negotiated using
>>> -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
>>> -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
>>> -but only if the reply has no data payload.  The message looks as
>>> -follows:
>>> +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
>>> +negotiated, a simple reply MAY be used as a reply to any request other
>>> +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
>>> +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
>>> +the message looks as follows:
>>>
>>>    S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
>>>       `NBD_REPLY_MAGIC`)
>>> @@ -369,6 +398,16 @@ S: 64 bits, handle
>>>    S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
>>>        *error* is zero)
>>>
>>> +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
>>> +the message looks like:
>>> +
>>> +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
>>> +S: 32 bits, error (MAY be zero)
>>> +S: 64 bits, handle
>>> +S: 128 bits, padding (MUST be zero)
>>> +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
>>> +    *error* is zero)
>>> +
>>
>> If we go this way, let's put payload length into padding: it will help to make the protocol context-independent and less error-prone.
> 
> Easy enough to do (the payload length will be 0 except for
> NBD_CMD_READ).
> 
>>
>> Or, the otherway, may be just forbid the payload for simple-64bit ? What's the reason to allow 64bit requests without structured reply negotiation?
> 
> The two happened to be orthogonal enough in my implementation.  It was
> easy to demonstrate either one without the other, and it IS easier to
> write a client using non-structured replies (structured reads ARE
> tougher than simple reads, even if it is less efficient when it comes
> to reading zeros).  But you are also right that we could require
> structured reads prior to allowing 64-bit operations, and then have
> only one supported reply type on the wire when negotiated.  Wouter,
> which way do you prefer?
> 
>>
>>>    #### Structured reply chunk message
>>>
>>>    Some of the major downsides of the default simple reply to
>>> @@ -410,7 +449,9 @@ considered successful only if it did not contain any error chunks,
>>>    although the client MAY be able to determine partial success based
>>>    on the chunks received.
>>>
>>> -A structured reply chunk message looks as follows:
>>> +If extended headers were not negotiated using
>>> +`NBD_OPT_EXTENDED_HEADERS`, a structured reply chunk message looks as
>>> +follows:
>>>
>>>    S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
>>>    S: 16 bits, flags
>>> @@ -423,6 +464,17 @@ The use of *length* in the reply allows context-free division of
>>>    the overall server traffic into individual reply messages; the
>>>    *type* field describes how to further interpret the payload.
>>>
>>> +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
>>> +the message looks like:
>>> +
>>> +S: 32 bits, 0x6e8a278c, magic (`NBD_STRUCTURED_REPLY_EXT_MAGIC`)
>>> +S: 16 bits, flags
>>> +S: 16 bits, type
>>> +S: 64 bits, handle
>>> +S: 64 bits, length of payload (unsigned)
>>
>> Maybe, 64bits is too much for payload. But who knows. And it's good that it's symmetric to 64bit length in request.
> 
> Indeed, both qemu and libnbd implementations explicitly kill the
> connection to any server that replies with more than the max buffer
> used for NBD_CMD_READ/WRITE (32M for qemu, 64M for libnbd).  And if
> the spec is not already clear on the topic, I should add an
> independent patch to NBD_CMD_BLOCK_STATUS to make it obvious that a
> server cannot reply with too many extents because of such clients.
> 
> So none of my proof-of-concept code ever used the full 64-bits of the
> reply header length.  On the other hand, there is indeed the symmetry
> argument - if someone writes a server willing to accept a 4G
> NBD_CMD_WRITE, then it should also support a 4G NBD_CMD_READ, even if
> no known existing server or client allows buffers that large..
> 
>>
>>> +S: 64 bits, padding (MUST be zero)
>>
>> Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 24 bytes?
> 
> Consider:
> struct header[100];
> 
> if sizeof(header[0]) is a power of 2 <= the cache line size (and the
> compiler prefers to start arrays aligned to the cache line) then we
> are guaranteed that all array members each reside in a single cache
> line.  But if it is not a power of 2, some of the array members
> straddle two cache lines.
> 
> Will there be code that wants to create an array of headers?  Perhaps
> so, because that is a logical way (along with scatter/gather to
> combine the header with variable-sized payloads) of tracking the
> headers for multiple commands issued in parallel.
> 
> Do I have actual performance numbers?  No. But there's plenty of
> google hits for why sizing structs to a power of 2 is a good idea.
> 
>>
>>> +S: *length* bytes of payload data (if *length* is nonzero)
>>
>> Hmm2: we probably may move "handle" to the start of payload. This way we can keep 16bytes header for simple reply and 16bytes header for structured. So structured are read in two shots: 1. the header, 2. handle + payload.. But that means deeper restructuring of the client code.. So seems not worth it.
> 
> Right now, the handle is in the same offset for both simple and
> structured replies, and for both normal and extended headers.  My
> proof-of-concept for qemu always reads just the magic number and
> handle, then decides how many more bytes to read (if any) (1 syscall
> for simple compact headers, 2 syscalls for compact structured and for
> both extended styles); while my proof-of-concept for libnbd actually
> decides up front to only do a 32-byte read if extended headers are in
> use for fewer syscalls.  I don't know if one way is better than the
> other, but the differences in styles fell out naturally from the rest
> of those code bases, and certainly anything that can be done with
> fewer syscalls per transaction is going to show a modest improvement.
> 
> But you are right that repositioning the handle to live at some other
> offset (including forcing it to live in the payload with a 16-byte
> header, instead of having a 32-byte header) would be more invasive.
> Doable?  Maybe.  That's why this is an RFC.  But unless there is a
> compelling reason to try, I'd rather not go to that effort.
> 
>>
>>
>>> +
>>>    #### Terminating the transmission phase
>>>
>>>    There are two methods of terminating the transmission phase:
>>> @@ -870,15 +922,19 @@ The procedure works as follows:
>>>      server supports.
>>>    - During transmission, a client can then indicate interest in metadata
>>>      for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
>>> -  where *offset* and *length* indicate the area of interest. The
>>> -  server MUST then respond with the requested information, for all
>>> +  where *offset* and *length* indicate the area of interest.
>>> +- The server MUST then respond with the requested information, for all
>>>      contexts which were selected during negotiation. For every metadata
>>> -  context, the server sends one set of extent chunks, where the sizes
>>> -  of the extents MUST be less than or equal to the length as specified
>>> -  in the request.
>>
>> I'm not sure we can simply drop this requirement.. It seems like an incompatible change, isn't it? May be, we should allow any size of extent only for 64bit mode?
> 
> I'm not dropping the requirement; 

Hmm.

First, the sentence restrict all extents to be less than length. But actually, I think we don't want to restrict the last extent..  So, this is just a mistake in spec. Can we just drop it, not caring about possible implementations that checks all extents be less than length including the last one? It's an important thing if consider the case when server can reply with only one extent, that covers more than requested length and REQ_ONE is not set.

Second, you add a global statement that "size is only a hint". But formally that's not true: we do have some restrictions..

> what was listed here is redundant
> with what appears elsewhere under NBD_REPLY_TYPE_BLOCK_STATUS, where
> the addition of NBD_REPLY_TYPE_BLOCK_STATUS_EXT made it too wordy to
> keep the redundancy here.  But yes, I can try and separate the patch
> into minor cleanups separate from new additions.
> 
> ...
>>>    #### Structured reply types
>>>
>>> -These values are used in the "type" field of a structured reply.
>>> -Some chunk types can additionally be categorized by role, such as
>>> -*error chunks* or *content chunks*.  Each type determines how to
>>> -interpret the "length" bytes of payload.  If the client receives
>>> -an unknown or unexpected type, other than an *error chunk*, it
>>> -MUST initiate a hard disconnect.
>>> +These values are used in the "type" field of a structured reply.  Some
>>> +chunk types can additionally be categorized by role, such as *error
>>> +chunks*, *content chunks*, or *status chunks*.  Each type determines
>>> +how to interpret the "length" bytes of payload.  If the client
>>> +receives an unknown or unexpected type, other than an *error chunk*,
>>> +it MUST initiate a hard disconnect.
>>
>> Just add "status chunks" to the list. Seems unrelated, better be in a separate patch.
> 
> Previously, only NBD_REPLY_TYPE_BLOCK_STATUS counts as a status chunk,
> now we have two reply types with that qualification.  But I can indeed
> split up the terminology addition from the addition of the second type
> of status chunk.
> 
>>
>>>
>>>    * `NBD_REPLY_TYPE_NONE` (0)
>>>
>>> @@ -1761,13 +1837,34 @@ MUST initiate a hard disconnect.
>>>      64 bits: offset (unsigned)
>>>      32 bits: hole size (unsigned, MUST be nonzero)
>>>
>>> +* `NBD_REPLY_TYPE_OFFSET_HOLE_EXT` (3)
>>> +
>>> +  This chunk type is in the content chunk category.  *length* MUST be
>>> +  exactly 16.  The semantics of this chunk mirror those of
>>> +  `NBD_REPLY_TYPE_OFFSET_HOLE`, other than the use of a larger *hole
>>> +  size* field.  This chunk type MUST NOT be used unless extended
>>> +  headers were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
>>
>> Why do you call all such things _EXT, not _64 ? _64 seems more informative.
> 
> _64 would be fine with me.  As this is an RFC, the naming is not
> locked in stone.
> 
>>
>>> +
>>> +  The payload is structured as:
>>> +
>>> +  64 bits: offset (unsigned)
>>> +  64 bits: hole size (unsigned, MUST be nonzero)
>>> +
>>> +  Note that even when extended headers are in use, a server may
>>> +  enforce a maximum block size that is smaller than 32 bits, in which
>>> +  case no valid `NBD_CMD_READ` will have a *length* large enough to
>> s/nc/no/ ? But hard to read any way, as sounds very similar to "not valid", which breaks the meaning.
>>
>> may be just "in which case valid NBD_CMD_READ will not have"
> 
> I like that.
> 
>>
>>> +  require the use of this chunk type.  However, a client using
>>> +  extended headers MUST be prepared for the server to use either the
>>> +  compact or extended chunk type.
>>> +
>>>    * `NBD_REPLY_TYPE_BLOCK_STATUS` (5)
>>>
>>> -  *length* MUST be 4 + (a positive integer multiple of 8).  This reply
>>> -  represents a series of consecutive block descriptors where the sum
>>> -  of the length fields within the descriptors is subject to further
>>> -  constraints documented below. This chunk type MUST appear
>>> -  exactly once per metadata ID in a structured reply.
>>> +  This chunk type is in the status chunk category.  *length* MUST be
>>> +  4 + (a positive integer multiple of 8).  This reply represents a
>>> +  series of consecutive block descriptors where the sum of the length
>>> +  fields within the descriptors is subject to further constraints
>>> +  documented below.  Each negotiated metadata ID must have exactly one
>>> +  status chunk in the overall structured reply.
>>
>> just rewording, no semantic changes, yes?
> 
> The change is that it is no longer to have exactly one of these per
> reply (you can have a BLOCK_STATUS_EXT instead).  True, not much of a
> change, but it is because of the new type.  Again, adding the notion
> of exactly one status chunk per metadata id (even with only one
> possible status chunk) in one patch, then adding the second status
> chunk with extended headers, may be easier to review, so I'll try that
> for v2.
> 
>>
>>>
>>>      The payload starts with:
>>>
>>> @@ -1796,9 +1893,36 @@ MUST initiate a hard disconnect.
>>>      information to the client, if looking up the information would be
>>>      too resource-intensive for the server, so long as at least one
>>>      extent is returned. Servers should however be aware that most
>>> -  clients implementations will then simply ask for the next extent
>>> +  client implementations will then simply ask for the next extent
>>>      instead.
>>
>> So you keep all restrictions about NBD_CMD_FLAG_REQ_ONE and about sum of lenghts of extents as is here..
> 
> Yes.
> 
>>
>>>
>>> +* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
>>> +
>>> +  This chunk type is in the status chunk category.  *length* MUST be
>>> +  4 + (a positive multiple of 16).  The semantics of this chunk mirror
>>> +  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
>>> +  larger *extent length* field, as well as added padding to ease
>>> +  alignment.
>>
>> But what about restrictions on chunk lengths and cumulative chunk length?
> 
> That is supposed to still be in effect.  If I deleted that
> restriction, it was unintentional.  That is, the cumulative length
> (and thus each individual extent length, since no extent can be larger
> than the cumulative length) is not allowed to exceed the client's
> length request except in the case of the last extent, and even then
> only when REQ_ONE was not in use.
> 
>>
>>> +  This chunk type MUST NOT be used unless extended headers
>>> +  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
>>> +
>>> +  The payload starts with:
>>> +
>>> +  32 bits, metadata context ID
>>> +
>>> +  and is followed by a list of one or more descriptors, each with this
>>> +  layout:
>>> +
>>> +  64 bits, length of the extent to which the status below
>>> +     applies (unsigned, MUST be nonzero)
>>> +  32 bits, status flags
>>> +  32 bits, padding (MUST be zero)
>>> +
>>> +  Note that even when extended headers are in use, the client MUST be
>>> +  prepared for the server to use either the compact or extended chunk
>>> +  type, regardless of whether the client's hinted length was more or
>>> +  less than 32 bits, but the server MUST use exactly one of the two
>>> +  chunk types per negotiated metacontext ID.
>>
>> But we have anyway one chunk per ID in a reply.. Or you mean that the type of reply for the ID should be selected once for the whole session?
> 
> I envision the following as valid:
> 
> OPT_SET_META_CONTEXT("base:allocation", "my:extension")
> => id 1: "base:allocation", id 2: "my:extension"
> OPT_GO
> ...
> 
> CMD_BLOCK_STATUS(offset=0, length=3G)
> => REPLY_TYPE_BLOCK_STATUS(id=1, extent[2] { length=2G flags=0, length=2G flags=1 })
> => REPLY_TYPE_BLOCK_STATUS_EXT(id=2, extent[1] { length=3G flags=0 })
> CMD_BLOCK_STATUS(offset=3G, length=6G)
> => REPLY_TYPE_BLOCK_STATUS_EXT(id 1, extent[1] { length=5G flags=0 })
> => REPLY_TYPE_BLOCK_STATUS(id 2, extent[2] { length=3.5G flags=0, length=3.5G flags=1 })
> 
> Note that the first id=1 responded with a cumulative length larger
> than the client's request, and the cumulative length is > 4G, but the
> response itself gets away with only 32-bit extents.  The first id=2
> response is < 3G, but the server chose to use a 64-bit extent anyway.
> The second id=1 response has to use the 64-bit response (because even
> though 5G is shorter than the client's request for 6G, it is larger
> than the 4G maximum of a 32-bit response). The second id=2 is similar
> to the first id=1 in that it uses a 32-bit response even though the
> cumulative length is >4G.  There is no requirement that the cumulative
> lengths of the two ids be identical.  And since REQ_ONE is not in
> effect, the last extent of a given extent array can cause the
> cumulative value to exceed the client's request.
> 
> What is invalid:
> CMD_BLOCK_STATUS(offset=0, length=3G)
> => REPLY_TYPE_BLOCK_STATUS(id=1, extent[2] { length=2G flags=0, length=2G flags=1 })
> => REPLY_TYPE_BLOCK_STATUS_EXT(id=1, extent[1] { length=3G flags=0 })
> 
> because it used two status chunks both for context id=1.
> 
> Maybe I need to add the phrase "within a given NBD_CMD_BLOCK_STATUS
> response", to make it clear that exactly one status chunk per id is
> chosen, but whether the server chooses a 32- or 64-bit status chunk is
> dependent solely on the server's whims, and the client must be
> prepared for either, regardless of the length the client used
> originally.


Ah, I understand, thanks. Somehow, I just thought about REPLY_TYPE_BLOCK_STATUS and REPLY_TYPE_BLOCK_STATUS_EXT like about one object, and somehow I thought that old "This chunk type MUST appear exactly once per metadata ID in a structured reply" of NBD_REPLY_TYPE_BLOCK_STATUS is enough. But yes, REPLY_TYPE_BLOCK_STATUS_EXT is a new chunk type and we need a clarification. Yes, "within a given NBD_CMD_BLOCK_STATUS  response" helps.

> 
> ...
>>
>> Overall, seems good to me.
> 
> Glad to hear it!
> 
>>
>> 1. Could we move some fixes / rewordings to a preaparation patch?
> 
> Yes, I'll do that for v2.
> 
>>
>> 2. I see you want also to overcome unpleasant restrictions we had around lengths / cumulative lengths of BLOCK_STATUS replies. I like the idea. But I think, it should be clarified that without 64bit extension negotiated all stay as is. And with 64bit extension negotiated, BLOCK_STATUS works in a slighter new way, so it may return what server wants, and original "length" is simply a hint. Or, at least that new behavior is only about NBD_REPLY_TYPE_BLOCK_STATUS_EXT.. Also, some clarifications may need around NBD_CMD_FLAG_REQ_ONE flag, what changes for it? You don't mention it at all in new version of BLOCK_STATUS reply.
> 
> That's not new behavior.  The client's length has always been a mere
> hint to the server, where the only constraints are that the server
> must make progress on success, and that if REQ_ONE is in use, the
> server may not report more than then length the client asked about.
> 
> Or are you proposing that we relax REQ_ONE, and allow a server to
> report additional length in 64-bit mode even when REQ_ONE is in use?
> The 32-bit limitation on not sending back too much length with compact
> structured replies was because qemu as client at one point would abort
> if the cumulative length was too long (and qemu has always used
> REQ_ONE).  But if extended headers are in use, qemu no longer aborts
> on oversize answers, and no other client is new enough to have
> extended headers.
> 

I mistakenly thought that you want do it) OK, then we can consider it as my proposal: we are going to alter block-status behavior in incompatible way with new extension, why not to overcome some inconvenient limitations?

So what restrictions we have:

1. The LAST extent must be <= than requested length. It's from sentence "For every metadata context, the server sends one set of extent chunks, where the sizes of the extents MUST be less than or equal to the length"

In this patch you silently remove this mistaken limitation.. I'm not sure that we could do it. But if we sure that known clients are not affected, probably we could.
If we can't simply remove it, we still can fix it when extended headers are negotiated.

2. Only last extent can cross the boundary of the request.

Actually, why we need it? What if in some implementation, server has some more extents for free? We can drop this restriction when extended header are negotiated.

3. With REQ_ONE, the returned extent can't cross the boundary.

Recently we had patches and discussion in list to overcome this restriction by requesting more than we actually need, I argued against this logic in client.
If we allow last extent to cross the boundary when new extension is negotiated - great, we can use this feature in qemu immediately (when full support for multi-extent block-status requires a lot more work).
Wouter Verhelst Dec. 7, 2021, 4:14 p.m. UTC | #4
On Mon, Dec 06, 2021 at 05:00:47PM -0600, Eric Blake wrote:
> On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > >   #### Simple reply message
> > > 
> > >   The simple reply message MUST be sent by the server in response to all
> > >   requests if structured replies have not been negotiated using
> > > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> > > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > > -but only if the reply has no data payload.  The message looks as
> > > -follows:
> > > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > > +negotiated, a simple reply MAY be used as a reply to any request other
> > > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > +the message looks as follows:
> > > 
> > >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> > >      `NBD_REPLY_MAGIC`)
> > > @@ -369,6 +398,16 @@ S: 64 bits, handle
> > >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > >       *error* is zero)
> > > 
> > > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > +the message looks like:
> > > +
> > > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > > +S: 32 bits, error (MAY be zero)
> > > +S: 64 bits, handle
> > > +S: 128 bits, padding (MUST be zero)
> > > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > > +    *error* is zero)
> > > +
> > 
> > If we go this way, let's put payload length into padding: it will help to make the protocol context-independent and less error-prone.

Agreed.

> Easy enough to do (the payload length will be 0 except for
> NBD_CMD_READ).

Indeed.

> > Or, the otherway, may be just forbid the payload for simple-64bit ? What's the reason to allow 64bit requests without structured reply negotiation?
> 
> The two happened to be orthogonal enough in my implementation.  It was
> easy to demonstrate either one without the other, and it IS easier to
> write a client using non-structured replies (structured reads ARE
> tougher than simple reads, even if it is less efficient when it comes
> to reading zeros).  But you are also right that we could require
> structured reads prior to allowing 64-bit operations, and then have
> only one supported reply type on the wire when negotiated.  Wouter,
> which way do you prefer?

Given that I *still* haven't gotten around to implementing structured
replies for nbd-server, I'd prefer not to require it, but that's not
really a decent argument IMO :-)

[... I haven't read this in much detail yet, intend to do that later...]
Vladimir Sementsov-Ogievskiy Dec. 10, 2021, 6:05 p.m. UTC | #5
07.12.2021 12:08, Vladimir Sementsov-Ogievskiy wrote:
> 07.12.2021 02:00, Eric Blake wrote:
>> On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:

[..]

>>>
>>>> +S: 64 bits, padding (MUST be zero)
>>>
>>> Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 24 bytes?
>>
>> Consider:
>> struct header[100];
>>
>> if sizeof(header[0]) is a power of 2 <= the cache line size (and the
>> compiler prefers to start arrays aligned to the cache line) then we
>> are guaranteed that all array members each reside in a single cache
>> line.  But if it is not a power of 2, some of the array members
>> straddle two cache lines.
>>
>> Will there be code that wants to create an array of headers?  Perhaps
>> so, because that is a logical way (along with scatter/gather to
>> combine the header with variable-sized payloads) of tracking the
>> headers for multiple commands issued in parallel.
>>
>> Do I have actual performance numbers?  No. But there's plenty of
>> google hits for why sizing structs to a power of 2 is a good idea.

I have a thought:

If client stores headers in separate, nothing prevents make this padding in RAM. So you can define header struct with padding. But what a reason to make the padding in the stream? You can have and array of good-aligned structures, but fill only part of header structure reading from the socket. Note, that you can read only one header in one read() call anyway, as you have to analyze, does it have payload or not.

So, if we want to improve performance by padding the structures in RAM, it's not a reason for padding them in the wire, keeping in mind that we can't read more then one structure at once.
Vladimir Sementsov-Ogievskiy Dec. 10, 2021, 6:16 p.m. UTC | #6
04.12.2021 02:14, Eric Blake wrote:
> Add a new negotiation feature where the client and server agree to use
> larger packet headers on every packet sent during transmission phase.
> This has two purposes: first, it makes it possible to perform
> operations like trim, write zeroes, and block status on more than 2^32
> bytes in a single command; this in turn requires that some structured
> replies from the server also be extended to match.  The wording chosen
> here is careful to permit a server to use either flavor in its reply
> (that is, a request less than 32-bits can trigger an extended reply,
> and conversely a request larger than 32-bits can trigger a compact
> reply).


About this.. Isn't it too permissive?

I think that actually having to very similar ways to do the same thing is usually a bad design. I think we don't want someone implement the logic, which tries to send 32bit commands/replies for small requests and 64bit command/replies for larger ones? Moreover, you don't allow doing it for commands. So, for symmetry, it may be good to be strict with replies too: in 64bit mode only 64bit replies.

Now we of course have to support old 32bit commands and new 64bit commands. But, may be, we'll want to deprecate 32bit commands at some moment? I'm not sure we can deprecate them in protocol, but we can deprecate them in Qemu at least. And several years later we'll drop old code, keeping only support for 64bit commands. Less code paths, less similar structures, simpler code, I think it worth it.
Eric Blake March 22, 2022, 3:10 p.m. UTC | #7
On Tue, Dec 07, 2021 at 06:14:23PM +0200, Wouter Verhelst wrote:
> On Mon, Dec 06, 2021 at 05:00:47PM -0600, Eric Blake wrote:
> > On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > >   #### Simple reply message
> > > > 
> > > >   The simple reply message MUST be sent by the server in response to all
> > > >   requests if structured replies have not been negotiated using
> > > > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> > > > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > > > -but only if the reply has no data payload.  The message looks as
> > > > -follows:
> > > > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > > > +negotiated, a simple reply MAY be used as a reply to any request other
> > > > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > > > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > > +the message looks as follows:
> > > > 
> > > >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> > > >      `NBD_REPLY_MAGIC`)
> > > > @@ -369,6 +398,16 @@ S: 64 bits, handle
> > > >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > > >       *error* is zero)
> > > > 
> > > > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > > +the message looks like:
> > > > +
> > > > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > > > +S: 32 bits, error (MAY be zero)
> > > > +S: 64 bits, handle
> > > > +S: 128 bits, padding (MUST be zero)
> > > > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > > > +    *error* is zero)
> > > > +
> > > 
> > > If we go this way, let's put payload length into padding: it will help to make the protocol context-independent and less error-prone.
> 
> Agreed.
> 
> > Easy enough to do (the payload length will be 0 except for
> > NBD_CMD_READ).
> 
> Indeed.
> 
> > > Or, the otherway, may be just forbid the payload for simple-64bit ? What's the reason to allow 64bit requests without structured reply negotiation?
> > 
> > The two happened to be orthogonal enough in my implementation.  It was
> > easy to demonstrate either one without the other, and it IS easier to
> > write a client using non-structured replies (structured reads ARE
> > tougher than simple reads, even if it is less efficient when it comes
> > to reading zeros).  But you are also right that we could require
> > structured reads prior to allowing 64-bit operations, and then have
> > only one supported reply type on the wire when negotiated.  Wouter,
> > which way do you prefer?
> 
> Given that I *still* haven't gotten around to implementing structured
> replies for nbd-server, I'd prefer not to require it, but that's not
> really a decent argument IMO :-)
> 
> [... I haven't read this in much detail yet, intend to do that later...]

Ping - any other responses on this thread, before I start working on
version 2 of the cross-project patches?

And repeating a comment from my original cover letter:

> with 64-bit commands, we may want to also make it easier to let
> servers advertise an actual maximum size they are willing to accept
> for the commands in question (for example, a server may be happy with
> a full 64-bit block status, but still want to limit non-fast zero and
> cache to a smaller limit to avoid denial of service).

Is it worth enhancing NBD_OPT_INFO in my v2?
Wouter Verhelst March 24, 2022, 5:31 p.m. UTC | #8
Hi Eric,

Thanks for the ping; it had slipped my mind.

On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
>  #### Request message
> 
> -The request message, sent by the client, looks as follows:
> +The compact request message, sent by the client when extended
> +transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +looks as follows:
> 
>  C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
>  C: 16 bits, command flags  
> @@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
>  C: 32 bits, length (unsigned)  
>  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> 
> +If negotiation agreed on extended transactions with
> +`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
> +
> +C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
> +C: 16 bits, command flags  
> +C: 16 bits, type  
> +C: 64 bits, handle  
> +C: 64 bits, offset (unsigned)  
> +C: 64 bits, length (unsigned)  
> +C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> +

Perhaps we should decouple the ideas of "effect length" and "payload
length"? As in,

C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags
C: 16 bits, type
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, effect length
C: 64 bits, payload length
C: (*payload length* bytes of data)

This makes the protocol more context free. With the current set of
commands, only NBD_CMD_WRITE would have payload length be nonzero, but
that doesn't have to remain the case forever; e.g., we could have a
command that extends NBD_CMD_BLOCK_STATUS to only query a subset of the
metadata contexts that we declared (if that is wanted, of course).

Of course, that does have the annoying side effect of no longer fitting
in 32 bytes, requiring a 40-byte header instead. I think it would be
worth it though.

(This is obviously not relevant for reply messages, only for request
messages)

>  #### Simple reply message
> 
>  The simple reply message MUST be sent by the server in response to all
>  requests if structured replies have not been negotiated using
> -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> -but only if the reply has no data payload.  The message looks as
> -follows:
> +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> +negotiated, a simple reply MAY be used as a reply to any request other
> +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks as follows:
> 
>  S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
>     `NBD_REPLY_MAGIC`)  
> @@ -369,6 +398,16 @@ S: 64 bits, handle
>  S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
>      *error* is zero)  
> 
> +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks like:
> +
> +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)  
> +S: 32 bits, error (MAY be zero)  
> +S: 64 bits, handle  
> +S: 128 bits, padding (MUST be zero)  

Should all these requirements about padding not be a SHOULD rather than
a MUST?

[...]
> +* `NBD_OPT_EXTENDED_HEADERS` (11)
> +
> +    The client wishes to use extended headers during the transmission
> +    phase.  The client MUST NOT send any additional data with the
> +    option, and the server SHOULD reject a request that includes data
> +    with `NBD_REP_ERR_INVALID`.
> +
> +    The server replies with the following, or with an error permitted
> +    elsewhere in this document:
> +
> +    - `NBD_REP_ACK`: Extended headers have been negotiated; the client
> +      MUST use the 32-byte extended request header, and the server
> +      MUST use the 32-byte extended reply header.
> +    - For backwards compatibility, clients SHOULD be prepared to also
> +      handle `NBD_REP_ERR_UNSUP`; in this case, only the compact
> +      transmission headers will be used.
> +
> +    If the client requests `NBD_OPT_STARTTLS` after this option, it
> +    MUST renegotiate extended headers.
> +

Two thoughts here:

- We should probably allow NBD_REP_ERR_BLOCK_SIZE_REQD as a reply to
  this message; I could imagine a server might not want to talk 64-bit
  lengths if it doesn't know that block sizes are going to be
  reasonable.
- In the same vein, should we perhaps also add an error message for when
  extended headers are negotiated without structured replies? Perhaps a
  server implementation might not want to implement the "extended
  headers but no structured replies" message format.

On that note, while I know I had said earlier that I would prefer not
making this new extension depend on structured replies, in hindsight
perhaps it *is* a good idea to add that dependency; otherwise we create
an extra message format that is really a degenerate case of "we want to
be modern in one way but not in another", and that screams out to me
"I'm not going to be used much, look at me for security issues!"

Which perhaps is not a very good idea.

[...]
Eric Blake March 25, 2022, midnight UTC | #9
[Updating Vladimir's new preferred address in cc list]

On Thu, Mar 24, 2022 at 07:31:48PM +0200, Wouter Verhelst wrote:
> Hi Eric,
> 
> Thanks for the ping; it had slipped my mind.
> 
> On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
> >  #### Request message
> > 
> > -The request message, sent by the client, looks as follows:
> > +The compact request message, sent by the client when extended
> > +transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +looks as follows:
> > 
> >  C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
> >  C: 16 bits, command flags  
> > @@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
> >  C: 32 bits, length (unsigned)  
> >  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> > 
> > +If negotiation agreed on extended transactions with
> > +`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
> > +
> > +C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
> > +C: 16 bits, command flags  
> > +C: 16 bits, type  
> > +C: 64 bits, handle  
> > +C: 64 bits, offset (unsigned)  
> > +C: 64 bits, length (unsigned)  
> > +C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> > +
> 
> Perhaps we should decouple the ideas of "effect length" and "payload
> length"? As in,
> 
> C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
> C: 16 bits, command flags
> C: 16 bits, type
> C: 64 bits, handle
> C: 64 bits, offset
> C: 64 bits, effect length
> C: 64 bits, payload length
> C: (*payload length* bytes of data)
> 
> This makes the protocol more context free. With the current set of
> commands, only NBD_CMD_WRITE would have payload length be nonzero, but
> that doesn't have to remain the case forever; e.g., we could have a
> command that extends NBD_CMD_BLOCK_STATUS to only query a subset of the
> metadata contexts that we declared (if that is wanted, of course).
> 
> Of course, that does have the annoying side effect of no longer fitting
> in 32 bytes, requiring a 40-byte header instead. I think it would be
> worth it though.

Could we still keep a 32-byte header, by having a new command (or new
command flag to the existing NBD_CMD_BLOCK_STATUS), such that the
payload itself contains the needed extra bytes?

Hmm - right now, the only command with a payload is NBD_CMD_WRITE, and
all other commands use the length field as an effect length.  So maybe
what we do is have a single command flag that says whether the length
field is serving as payload length or as effect length.  NBD_CMD_WRITE
would always set the new flag (if extended headers were negotiated),
and most other NBD_CMD_* would leave the flag unset, but in the case
of BLOCK_STATUS wanting only a subset of id status reported, we could
then have:

HEADER:
C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags, NBD_CMD_FLAG_PAYLOAD
C: 16 bits, type, NBD_CMD_BLOCK_STATUS
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, payload length = 12 + 4*n
PAYLOAD:
C: 64 bits, effect length (hint on desired range)
C: 32 bits, number of ids = n
C: 32 bits, id[0]
...
C: 32 bits, id[n-1]

vs.

HEADER:
C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags, 0
C: 16 bits, type, NBD_CMD_BLOCK_STATUS
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, effect length (hint on desired range)

HEADER:
C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags, NBD_CMD_FLAG_PAYLOAD
C: 16 bits, type, NBD_CMD_WRITE
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, payload length = n
PAYLOAD:
C: n*8 bits data


> 
> (This is obviously not relevant for reply messages, only for request
> messages)

Staying at a power of 2 may still be worth the expense of a new cmd
flag which must always be set for writes when extended headers are in
use.

> 
> >  #### Simple reply message
> > 
> >  The simple reply message MUST be sent by the server in response to all
> >  requests if structured replies have not been negotiated using
> > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > -but only if the reply has no data payload.  The message looks as
> > -follows:
> > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > +negotiated, a simple reply MAY be used as a reply to any request other
> > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks as follows:
> > 
> >  S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> >     `NBD_REPLY_MAGIC`)  
> > @@ -369,6 +398,16 @@ S: 64 bits, handle
> >  S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> >      *error* is zero)  
> > 
> > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks like:
> > +
> > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)  
> > +S: 32 bits, error (MAY be zero)  
> > +S: 64 bits, handle  
> > +S: 128 bits, padding (MUST be zero)  
> 
> Should all these requirements about padding not be a SHOULD rather than
> a MUST?

Elsewhere in the thread, we talked about having
NBD_SIMPLE_REPLY_EXT_MAGIC have 64 bits length (only non-zero when
replying to NBD_CMD_READ) and 64 bits pad, instead of 128 bits pad.

For future extensibility, it's probably safest to require the server
to send 0 bits in the pad now, so that we can use them later.  Should
clients then ignore unknown padding bits, or is there a risk that a
future definition of non-zero values in what is now padding bits may
confuse an existing client that merely ignores those bits?

If we don't think extensibility is needed, then using SHOULD instead
of MUST means a non-careful server can leak data through the padding.
But it is certainly less restrictive to use SHOULD instead of MUST
(well-written servers won't leak, sloppy servers might, clients must
ignore the padding, and extension is not possible because of sloppy
servers).

> 
> [...]
> > +* `NBD_OPT_EXTENDED_HEADERS` (11)
> > +
> > +    The client wishes to use extended headers during the transmission
> > +    phase.  The client MUST NOT send any additional data with the
> > +    option, and the server SHOULD reject a request that includes data
> > +    with `NBD_REP_ERR_INVALID`.
> > +
> > +    The server replies with the following, or with an error permitted
> > +    elsewhere in this document:
> > +
> > +    - `NBD_REP_ACK`: Extended headers have been negotiated; the client
> > +      MUST use the 32-byte extended request header, and the server
> > +      MUST use the 32-byte extended reply header.
> > +    - For backwards compatibility, clients SHOULD be prepared to also
> > +      handle `NBD_REP_ERR_UNSUP`; in this case, only the compact
> > +      transmission headers will be used.
> > +
> > +    If the client requests `NBD_OPT_STARTTLS` after this option, it
> > +    MUST renegotiate extended headers.
> > +
> 
> Two thoughts here:
> 
> - We should probably allow NBD_REP_ERR_BLOCK_SIZE_REQD as a reply to
>   this message; I could imagine a server might not want to talk 64-bit
>   lengths if it doesn't know that block sizes are going to be
>   reasonable.

Good addition.  I'll include it in v2.

> - In the same vein, should we perhaps also add an error message for when
>   extended headers are negotiated without structured replies? Perhaps a
>   server implementation might not want to implement the "extended
>   headers but no structured replies" message format.

Seems reasonable.

> 
> On that note, while I know I had said earlier that I would prefer not
> making this new extension depend on structured replies, in hindsight
> perhaps it *is* a good idea to add that dependency; otherwise we create
> an extra message format that is really a degenerate case of "we want to
> be modern in one way but not in another", and that screams out to me
> "I'm not going to be used much, look at me for security issues!"
> 
> Which perhaps is not a very good idea.

Yeah, the more I read back over Vladimir's message, the more I am
agreeing that just because we CAN be orthogonal doesn't mean we WANT
to be orthogonal.  Every degree of orthogonality increases the testing
burden.  I'm happy to rework v2 along those lines (structured replies
mandatory, and only one extended reply header, so that only compact
style has two header types).
Eric Blake Oct. 4, 2022, 9:21 p.m. UTC | #10
On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
> Add a new negotiation feature where the client and server agree to use
> larger packet headers on every packet sent during transmission phase.
> This has two purposes: first, it makes it possible to perform
> operations like trim, write zeroes, and block status on more than 2^32
> bytes in a single command; this in turn requires that some structured
> replies from the server also be extended to match.  The wording chosen
> here is careful to permit a server to use either flavor in its reply
> (that is, a request less than 32-bits can trigger an extended reply,
> and conversely a request larger than 32-bits can trigger a compact
> reply).

Following up on this original proposal with something that came out of
KVM Forum this year.

> +* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
> +
> +  This chunk type is in the status chunk category.  *length* MUST be
> +  4 + (a positive multiple of 16).  The semantics of this chunk mirror
> +  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
> +  larger *extent length* field, as well as added padding to ease
> +  alignment.  This chunk type MUST NOT be used unless extended headers
> +  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
> +
> +  The payload starts with:
> +
> +  32 bits, metadata context ID  
> +
> +  and is followed by a list of one or more descriptors, each with this
> +  layout:
> +
> +  64 bits, length of the extent to which the status below
> +     applies (unsigned, MUST be nonzero)  
> +  32 bits, status flags  
> +  32 bits, padding (MUST be zero)

During KVM Forum, I had several conversations about Zoned Block
Devices (https://zonedstorage.io/docs/linux/zbd-api), and what it
would take to expose ZBD information over NBD.  In particular,
NBD_CMD_BLOCK_STATUS sounds like a great way for advertising
information about zones (by adding several metadata contexts that can
be negotiated during NBD_OPT_SET_META_CONTEXT), except for the fact
that a zone might be larger than 32 bits in size.  So Rich Jones asked
me the question of whether my work on 64-bit extensions to the NBD
protocol could also allow for a server to advertise a metadata context
only to clients that support 64-bit extensions, at which point it can
report 64-bit offsets or lengths as needed, rather than being limited
to 32-bit status flags.

The idea definitely has merit, so I'm working on incorporating that
into my next revision for 64-bit extensions in NBD.
diff mbox series

Patch

diff --git a/doc/proto.md b/doc/proto.md
index 3a877a9..46560b6 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -295,6 +295,21 @@  reply is also problematic for error handling of the `NBD_CMD_READ`
 request.  Therefore, structured replies can be used to create a
 a context-free server stream; see below.

+The results of client negotiation also determine whether the client
+and server will utilize only compact requests and replies, or whether
+both sides will use only extended packets.  Compact messages are the
+default, but inherently limit single transactions to a 32-bit window
+starting at a 64-bit offset.  Extended messages make it possible to
+perform 64-bit transactions (although typically only for commands that
+do not include a data payload).  Furthermore, when structured replies
+have been negotiated, compact messages require the client to perform
+partial reads to determine which reply packet style (simple or
+structured) is on the wire before knowing the length of the rest of
+the reply, which can reduce client performance.  With extended
+messages, all packet headers have a fixed length of 32 bytes, and
+although this results in more traffic over the network due to padding,
+the resulting layout is friendlier for performance.
+
 Replies need not be sent in the same order as requests (i.e., requests
 may be handled by the server asynchronously), and structured reply
 chunks from one request may be interleaved with reply messages from
@@ -343,7 +358,9 @@  may be useful.

 #### Request message

-The request message, sent by the client, looks as follows:
+The compact request message, sent by the client when extended
+transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+looks as follows:

 C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
 C: 16 bits, command flags  
@@ -353,14 +370,26 @@  C: 64 bits, offset (unsigned)
 C: 32 bits, length (unsigned)  
 C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  

+If negotiation agreed on extended transactions with
+`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
+
+C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
+C: 16 bits, command flags  
+C: 16 bits, type  
+C: 64 bits, handle  
+C: 64 bits, offset (unsigned)  
+C: 64 bits, length (unsigned)  
+C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
+
 #### Simple reply message

 The simple reply message MUST be sent by the server in response to all
 requests if structured replies have not been negotiated using
-`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
-reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
-but only if the reply has no data payload.  The message looks as
-follows:
+`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
+negotiated, a simple reply MAY be used as a reply to any request other
+than `NBD_CMD_READ`, but only if the reply has no data payload.  If
+extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+the message looks as follows:

 S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
    `NBD_REPLY_MAGIC`)  
@@ -369,6 +398,16 @@  S: 64 bits, handle
 S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
     *error* is zero)  

+If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+the message looks like:
+
+S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)  
+S: 32 bits, error (MAY be zero)  
+S: 64 bits, handle  
+S: 128 bits, padding (MUST be zero)  
+S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
+    *error* is zero)  
+
 #### Structured reply chunk message

 Some of the major downsides of the default simple reply to
@@ -410,7 +449,9 @@  considered successful only if it did not contain any error chunks,
 although the client MAY be able to determine partial success based
 on the chunks received.

-A structured reply chunk message looks as follows:
+If extended headers were not negotiated using
+`NBD_OPT_EXTENDED_HEADERS`, a structured reply chunk message looks as
+follows:

 S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
 S: 16 bits, flags  
@@ -423,6 +464,17 @@  The use of *length* in the reply allows context-free division of
 the overall server traffic into individual reply messages; the
 *type* field describes how to further interpret the payload.

+If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+the message looks like:
+
+S: 32 bits, 0x6e8a278c, magic (`NBD_STRUCTURED_REPLY_EXT_MAGIC`)  
+S: 16 bits, flags  
+S: 16 bits, type  
+S: 64 bits, handle  
+S: 64 bits, length of payload (unsigned)  
+S: 64 bits, padding (MUST be zero)  
+S: *length* bytes of payload data (if *length* is nonzero)  
+
 #### Terminating the transmission phase

 There are two methods of terminating the transmission phase:
@@ -870,15 +922,19 @@  The procedure works as follows:
   server supports.
 - During transmission, a client can then indicate interest in metadata
   for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
-  where *offset* and *length* indicate the area of interest. The
-  server MUST then respond with the requested information, for all
+  where *offset* and *length* indicate the area of interest.
+- The server MUST then respond with the requested information, for all
   contexts which were selected during negotiation. For every metadata
-  context, the server sends one set of extent chunks, where the sizes
-  of the extents MUST be less than or equal to the length as specified
-  in the request. Each extent comes with a *flags* field, the
-  semantics of which are defined by the metadata context.
-- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured
-  reply of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+  context, the server sends one set of extent chunks, using
+  `NBD_REPLY_TYPE_BLOCK_STATUS` or `NBD_REPLY_TYPE_BLOCK_STATUS_EXT`
+  (the latter is only possible if the client also negotiated
+  `NBD_OPT_EXTENDED_HEADERS`).  Each extent comes with a *flags*
+  field, the semantics of which are defined by the metadata context.
+
+The client's requested *size* is only a hint to the server, so the
+summed size of extents in the server's reply may be shorter, or in
+some cases longer, than the original request, and may even differ
+between contexts when multiple metadata contexts were negotiated.

 A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
 nonzero number of metadata contexts during negotiation, and used the
@@ -1179,10 +1235,10 @@  of the newstyle negotiation.

     When this command succeeds, the server MUST NOT preserve any
     negotiation state (such as a request for
-    `NBD_OPT_STRUCTURED_REPLY`, or metadata contexts from
-    `NBD_OPT_SET_META_CONTEXT`) issued before this command.  A client
-    SHOULD defer all stateful option requests until after it
-    determines whether encryption is available.
+    `NBD_OPT_STRUCTURED_REPLY` or `NBD_OPT_EXTENDED_HEADERS`, or
+    metadata contexts from `NBD_OPT_SET_META_CONTEXT`) issued before
+    this command.  A client SHOULD defer all stateful option requests
+    until after it determines whether encryption is available.

     See the section on TLS above for further details.

@@ -1460,6 +1516,26 @@  of the newstyle negotiation.
     option does not select any metadata context, provided the client
     then does not attempt to issue `NBD_CMD_BLOCK_STATUS` commands.

+* `NBD_OPT_EXTENDED_HEADERS` (11)
+
+    The client wishes to use extended headers during the transmission
+    phase.  The client MUST NOT send any additional data with the
+    option, and the server SHOULD reject a request that includes data
+    with `NBD_REP_ERR_INVALID`.
+
+    The server replies with the following, or with an error permitted
+    elsewhere in this document:
+
+    - `NBD_REP_ACK`: Extended headers have been negotiated; the client
+      MUST use the 32-byte extended request header, and the server
+      MUST use the 32-byte extended reply header.
+    - For backwards compatibility, clients SHOULD be prepared to also
+      handle `NBD_REP_ERR_UNSUP`; in this case, only the compact
+      transmission headers will be used.
+
+    If the client requests `NBD_OPT_STARTTLS` after this option, it
+    MUST renegotiate extended headers.
+
 #### Option reply types

 These values are used in the "reply type" field, sent by the server
@@ -1713,12 +1789,12 @@  unrecognized flags.

 #### Structured reply types

-These values are used in the "type" field of a structured reply.
-Some chunk types can additionally be categorized by role, such as
-*error chunks* or *content chunks*.  Each type determines how to
-interpret the "length" bytes of payload.  If the client receives
-an unknown or unexpected type, other than an *error chunk*, it
-MUST initiate a hard disconnect.
+These values are used in the "type" field of a structured reply.  Some
+chunk types can additionally be categorized by role, such as *error
+chunks*, *content chunks*, or *status chunks*.  Each type determines
+how to interpret the "length" bytes of payload.  If the client
+receives an unknown or unexpected type, other than an *error chunk*,
+it MUST initiate a hard disconnect.

 * `NBD_REPLY_TYPE_NONE` (0)

@@ -1761,13 +1837,34 @@  MUST initiate a hard disconnect.
   64 bits: offset (unsigned)  
   32 bits: hole size (unsigned, MUST be nonzero)  

+* `NBD_REPLY_TYPE_OFFSET_HOLE_EXT` (3)
+
+  This chunk type is in the content chunk category.  *length* MUST be
+  exactly 16.  The semantics of this chunk mirror those of
+  `NBD_REPLY_TYPE_OFFSET_HOLE`, other than the use of a larger *hole
+  size* field.  This chunk type MUST NOT be used unless extended
+  headers were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
+
+  The payload is structured as:
+
+  64 bits: offset (unsigned)  
+  64 bits: hole size (unsigned, MUST be nonzero)  
+
+  Note that even when extended headers are in use, a server may
+  enforce a maximum block size that is smaller than 32 bits, in which
+  case no valid `NBD_CMD_READ` will have a *length* large enough to
+  require the use of this chunk type.  However, a client using
+  extended headers MUST be prepared for the server to use either the
+  compact or extended chunk type.
+
 * `NBD_REPLY_TYPE_BLOCK_STATUS` (5)

-  *length* MUST be 4 + (a positive integer multiple of 8).  This reply
-  represents a series of consecutive block descriptors where the sum
-  of the length fields within the descriptors is subject to further
-  constraints documented below. This chunk type MUST appear
-  exactly once per metadata ID in a structured reply.
+  This chunk type is in the status chunk category.  *length* MUST be
+  4 + (a positive integer multiple of 8).  This reply represents a
+  series of consecutive block descriptors where the sum of the length
+  fields within the descriptors is subject to further constraints
+  documented below.  Each negotiated metadata ID must have exactly one
+  status chunk in the overall structured reply.

   The payload starts with:

@@ -1796,9 +1893,36 @@  MUST initiate a hard disconnect.
   information to the client, if looking up the information would be
   too resource-intensive for the server, so long as at least one
   extent is returned. Servers should however be aware that most
-  clients implementations will then simply ask for the next extent
+  client implementations will then simply ask for the next extent
   instead.

+* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
+
+  This chunk type is in the status chunk category.  *length* MUST be
+  4 + (a positive multiple of 16).  The semantics of this chunk mirror
+  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
+  larger *extent length* field, as well as added padding to ease
+  alignment.  This chunk type MUST NOT be used unless extended headers
+  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
+
+  The payload starts with:
+
+  32 bits, metadata context ID  
+
+  and is followed by a list of one or more descriptors, each with this
+  layout:
+
+  64 bits, length of the extent to which the status below
+     applies (unsigned, MUST be nonzero)  
+  32 bits, status flags  
+  32 bits, padding (MUST be zero)  
+
+  Note that even when extended headers are in use, the client MUST be
+  prepared for the server to use either the compact or extended chunk
+  type, regardless of whether the client's hinted length was more or
+  less than 32 bits, but the server MUST use exactly one of the two
+  chunk types per negotiated metacontext ID.
+
 All error chunk types have bit 15 set, and begin with the same
 *error*, *message length*, and optional *message* fields as
 `NBD_REPLY_TYPE_ERROR`.  If nonzero, *message length* indicates
@@ -1812,7 +1936,10 @@  remaining structured fields at the end.
   be at least 6.  This chunk represents that an error occurred,
   and the client MAY NOT make any assumptions about partial
   success. This type SHOULD NOT be used more than once in a
-  structured reply.  Valid as a reply to any request.
+  structured reply.  Valid as a reply to any request.  Note that
+  *message length* MUST NOT exceed the 4096 bytes string length limit,
+  and therefore there is no need for a counterpart extended-length
+  error chunk type.

   The payload is structured as:

@@ -1867,7 +1994,8 @@  The following request types exist:

     If structured replies were not negotiated, then a read request
     MUST always be answered by a simple reply, as documented above
-    (using magic 0x67446698 `NBD_SIMPLE_REPLY_MAGIC`, and containing
+    (using `NBD_SIMPLE_REPLY_MAGIC` or `NBD_SIMPLE_REPLY_EXT_MAGIC`
+    according to whether extended headers are in use, and containing
     length bytes of data according to the client's request).

     If an error occurs, the server SHOULD set the appropriate error code
@@ -1883,7 +2011,8 @@  The following request types exist:

     If structured replies are negotiated, then a read request MUST
     result in a structured reply with one or more chunks (each using
-    magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), where the final
+    `NBD_STRUCTURED_REPLY_MAGIC` or `NBD_STRUCTURED_REPLY_EXT_MAGIC`
+    according to whether extended headers are in use), where the final
     chunk has the flag `NBD_REPLY_FLAG_DONE`, and with the following
     additional constraints.

@@ -1897,13 +2026,14 @@  The following request types exist:
     chunks that describe data outside the offset and length of the
     request, but MAY send the content chunks in any order (the client
     MUST reassemble content chunks into the correct order), and MAY
-    send additional content chunks even after reporting an error chunk.
-    Note that a request for more than 2^32 - 8 bytes MUST be split
-    into at least two chunks, so as not to overflow the length field
-    of a reply while still allowing space for the offset of each
-    chunk.  When no error is detected, the server MUST send enough
-    data chunks to cover the entire region described by the offset and
-    length of the client's request.
+    send additional content chunks even after reporting an error
+    chunk.  Note that if extended headers are not in use, a request
+    for more than 2^32 - 8 bytes MUST be split into at least two
+    chunks, so as not to overflow the length field of a reply while
+    still allowing space for the offset of each chunk.  When no error
+    is detected, the server MUST send enough data chunks to cover the
+    entire region described by the offset and length of the client's
+    request.

     To minimize traffic, the server MAY use a content or error chunk
     as the final chunk by setting the `NBD_REPLY_FLAG_DONE` flag, but
@@ -2136,13 +2266,19 @@  The following request types exist:
     server returned at least one metadata context without an error.
     This in turn requires the client to first negotiate structured
     replies. For a successful return, the server MUST use a structured
-    reply, containing exactly one chunk of type
+    reply, containing exactly one status chunk of type
     `NBD_REPLY_TYPE_BLOCK_STATUS` per selected context id, where the
     status field of each descriptor is determined by the flags field
     as defined by the metadata context.  The server MAY send chunks in
     a different order than the context ids were assigned in reply to
     `NBD_OPT_SET_META_CONTEXT`.

+    If extended headers were negotiated via
+    `NBD_OPT_EXTENDED_HEADERS`, the server may use
+    `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` instead of
+    `NBD_REPLY_TYPE_BLOCK_STATUS` as the reply chunk for a metacontext
+    id.
+
     The list of block status descriptors within the
     `NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
     of the file starting from specified *offset*.  If the client used