diff mbox

[v2,3/3] doc: Propose Structured Read extension

Message ID 1459292460-6875-4-git-send-email-eblake@redhat.com
State New
Headers show

Commit Message

Eric Blake March 29, 2016, 11:01 p.m. UTC
The existing transmission phase protocol is difficult to sniff,
because correct interpretation of the server stream requires
context from the client stream (or risks false positives if
data payloads happen to contain the protocol magic numbers).  It
also prohibits the ability to do efficient sparse reads, or to
return a short read where an error is reported without also
sending length bytes of (bogus) data.

Remedy this by adding a new option request negotiation, which
affects the response of the NBD_CMD_READ command, and sets
forth rules for how future command responses must behave when
they carry a data payload.

This proposal does NOT permit structured replies to anything
other than NBD_CMD_READ, although a future proposal may wish
to make that valid (so that a server could be written that
only returns structured replies).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 doc/proto.md | 260 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 256 insertions(+), 4 deletions(-)

Comments

Eric Blake March 29, 2016, 11:29 p.m. UTC | #1
On 03/29/2016 05:01 PM, Eric Blake wrote:
> The existing transmission phase protocol is difficult to sniff,
> because correct interpretation of the server stream requires
> context from the client stream (or risks false positives if
> data payloads happen to contain the protocol magic numbers).  It
> also prohibits the ability to do efficient sparse reads, or to
> return a short read where an error is reported without also
> sending length bytes of (bogus) data.
> 

> +* `NBD_CMD_READ`
> +
> +    If `NBD_OPT_STRUCTURED_READ` was not negotiated, then a read
> +    request MUST always be answered by a single non-structured
> +    response, as documented above (using magic 0x67446698
> +    `NBD_REPLY_MAGIC`, and containing length bytes of data according
> +    to the client's request, although those bytes MAY be invalid if an
> +    error is returned, and the connection MUST if an error occurs

MUST be closed

(I hate it when I send patches before 'git commit --amend' on the final
unsaved changes in my editor...)

> +    after a header claiming no error).
> +
Alex Bligh March 30, 2016, 6:50 a.m. UTC | #2
Eric,

There's a lot in common between our two proposals now (unsurprisingly).
You've highlighted the differences in the other mail and I'll
comment on them there. You may want to steal some of my wording as
I think there are bits I've got that you haven't (as well as vice versa).
But I'm inclined to use yours as a base unless you particularly
like mine.

Comments inline below.

Alex

On 30 Mar 2016, at 00:01, Eric Blake <eblake@redhat.com> wrote:
...

> +    While the server is permitted to send at most one normal reply (or
> +    else close the connection), a command that uses structured replies
> +    may document that the server is permitted to send mutiple replies,
> +    all sharing the same handle,

The thought is fine, but the language is confusing. I think this is
a single reply, made up of multiple parts (I called them chunks). You've
called them multiple replies, which I think makes things less clear.
Also below you've started using my 'chunk' language anyway!

> by using the `NBD_REPLY_FLAG_DONE`
> +    (bit 0) to delineate the final reply. The server MAY interleave
> +    intermediate replies to one structured command with replies
> +    relating to a different handle.

Neat.

The argument against this route is that now there are essentially
two ways to end a chain of chunks (with and without a NONE chunk)
which is necessary for the reasons you set out. On balance I like it though.

> +
> +    A server MUST NOT send a data payload in a normal reply if
> +    Structured Reads are negotiated.  It is envisioned that all future
> +    extension commands that require a data payload in the response
> +    will require independent option negotiation, and therefore, the
> +    `NBD_CMD_READ` command is the only command that is allowed to use
> +    the data payload of a normal reply, and only when Structured Reads
> +    were not negotiated.

See other email.

>  However, for ease of implementation, a
> +    server MAY close the connection rather than entering transmission
> +    phase if, at the end of option haggling, the client has negotiated
> +    another command that requires a structured reply but did not also
> +    negotiate Structured Reads.

That's pretty yucky given a reconnect will achieve the same result
and you'll end up in an infinite retry loop.

Wouldn't a better route be simply to say that implementing certain
commands (server or client sides) requires support of structured
replies?

> +    - `NBD_REPLY_TYPE_NONE` (0)
> +
> +      *length* MUST be 0 (and the payload field omitted).  This type
> +      SHOULD be used only as the final reply (that is, when
> +      `NBD_REPLY_FLAG_DONE` is set), and implies that the overall
> +      client request was successfully completed.

I think this would be clearer as 'SHOULD NOT be used other than as the
final reply'. Because you are also saying (I think) that you need not
have it as the final reply - it's just as good in a non-errored
reply to have NBD_REPLY_FLAG_DONE set on the last data packet (provided
you know it's not going to error before starting to send it).

...

> +    The server MAY split the reply into any number of data chunks,
> +    using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or
> +    `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least
> +    one byte, although to minimize overhead, the server SHOULD use
> +    chunks no smaller than 512 bytes where possible (the first and

This is a good idea, but rather than 'no smaller than 512 bytes', as
it's a 'SHOULD', could we have 'the server SHOULD use chunks each
an integer multiple of 512 bytes where possible' (you already have
a carve out for the first and last).

...

> +    If no error is detected, then the server MUST send enough chunks
> +    to cover the bytes requested.  The server MAY set the
> +    `NBD_REPLY_FLAG_DONE` on the final data chunk,

In which case it MUST NOT send any further non-data chunks
(e.g. an error chunk or a NONE chunk)

> to minimize
> +    traffic, but MUST NOT do so if it would still be possible to
> +    detect an error while transmitting the chunk.  If the last data
> +    chunk is not the final reply, the server MUST use
> +    `NBD_REPLY_TYPE_NONE` as the final reply to indicate success.

or an error chunk to indicate an error, and these final chunk MUST have
NBD_REPLY_FLAG_DONE set on it.

> +    If an error is detected, the server MUST send padding bytes to
> +    complete the current chunk (if any), MUST report the error with a
> +    reply type of either `NBD_REPLY_TYPE_ERROR` or
> +    `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies
> +    without sending the total number of bytes requested.  If one or
> +    more offset errors are reported, the client MAY assume that all
> +    data in chunks not including the offset,

"the offset(s)"

> and all data within the
> +    affected chunk

"within each affected chunk"

> but prior to the offset,

"prior to the relevant offset"

> is valid; the client MAY
> +    NOT assume anything about data validity if no offset is provided.

These multiple error chunks are neat. However, I suspect lazy implementors
may just send an error without an offset.

> +    The server MAY send additional chunks or offset error replies, if
> +    `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply
> +    also reports an error (that is, the final reply MUST NOT use
> +    `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier
> +    in constructing the final reply.

I'm not sure I get that bit. Why don't you make an errorred reply simply
one that contains one or more error chunks. An errorred reply need not contain
all the data requested (though each chunk must be complete). A reply that
isn't errorred needs not contain all the data requested. Why do you
need anything stronger than that? So if you have a parallelised server which
is simply sending several reads in parallel (think Ceph) it sends the
result from each thread, possibly followed by an error packet, and some
other thread notices when all of these have completed and sends a
NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the
handle. This seems perfectly natural and no harder for the client to deal
with, but you are prohibiting it.

>  A server SHOULD NOT mix
> +    `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies
> +    to the same request.
> +
> +    A client MAY close the connection if it detects that the server
> +    has sent invalid chunks (such as overlapping data, or not enough
> +    data before claiming success).
> +
> +    In order to avoid the burden of reassembly, the client MAY set the
> +    `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not
> +    fragment the reply.  If this flag is set, the server MUST send at
> +    most one `NBD_REPLY_TYPE_OFFSET_DATA` or
> +    `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than
> +    reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`).  If

"the flag is set and"

> +    the client's length request is larger than 65,536 bytes (or if a
> +    later extension adds a way to negotiate a larger maximum fragment
> +    size), the server MAY reject the command with `EOVERFLOW`.  The
> +    `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag
> +    was not set, or if the requested length is no larger than 65,536.
> +
> ## About this file
> 
> This file tries to document the NBD protocol as it is currently
> -- 
> 2.5.5
> 
>
Eric Blake March 30, 2016, 5:45 p.m. UTC | #3
On 03/30/2016 12:50 AM, Alex Bligh wrote:
> Eric,
> 
> There's a lot in common between our two proposals now (unsurprisingly).
> You've highlighted the differences in the other mail and I'll
> comment on them there. You may want to steal some of my wording as
> I think there are bits I've got that you haven't (as well as vice versa).
> But I'm inclined to use yours as a base unless you particularly
> like mine.
> 
> Comments inline below.
> 
> Alex
> 
> On 30 Mar 2016, at 00:01, Eric Blake <eblake@redhat.com> wrote:
> ...
> 
>> +    While the server is permitted to send at most one normal reply (or
>> +    else close the connection), a command that uses structured replies
>> +    may document that the server is permitted to send mutiple replies,

And just now noticing my typo on 'multiple' :)

>> +    all sharing the same handle,
> 
> The thought is fine, but the language is confusing. I think this is
> a single reply, made up of multiple parts (I called them chunks). You've
> called them multiple replies, which I think makes things less clear.
> Also below you've started using my 'chunk' language anyway!

All right, for v3, I will try to go with the wording that every request
has a single reply; but the reply will either be a 'simple reply'
(single message), or a 'structured reply' (which may occupy multiple
messages, where each message is called a 'chunk').

> 
>> by using the `NBD_REPLY_FLAG_DONE`
>> +    (bit 0) to delineate the final reply. The server MAY interleave
>> +    intermediate replies to one structured command with replies
>> +    relating to a different handle.
> 
> Neat.
> 
> The argument against this route is that now there are essentially
> two ways to end a chain of chunks (with and without a NONE chunk)
> which is necessary for the reasons you set out. On balance I like it though.

Yeah; I didn't see any way around that.  Always requiring a NONE chunk
is more network overhead if the server can guarantee that the last data
chunk is error-free; but at the same time, we shouldn't force servers to
guarantee the last data chunk will be error-free.

I don't think it is too much of a burden for a client to receive chunks
in a loop until the FLAG_DONE bit is set, without regards to what chunk
type came last.  And for CMD_READ, we still have a nice delineation: if
the last chunk is NONE, OFFSET_DATA, or OFFSET_HOLE, the command
succeeded; if the last chunk is ERROR or ERROR_OFFSET, the command failed.

> 
>>  However, for ease of implementation, a
>> +    server MAY close the connection rather than entering transmission
>> +    phase if, at the end of option haggling, the client has negotiated
>> +    another command that requires a structured reply but did not also
>> +    negotiate Structured Reads.
> 
> That's pretty yucky given a reconnect will achieve the same result
> and you'll end up in an infinite retry loop.
> 
> Wouldn't a better route be simply to say that implementing certain
> commands (server or client sides) requires support of structured
> replies?

I think we're in agreement that the only command that (for back-compat
reasons) can ever send data on a simple reply is CMD_READ.  Therefore,
if you negotiate any other command that can send data, that command will
use a structured reply; the mere fact that you negotiated the command
means that both client and server know about structured replies.

I guess what I was trying to get at is that if you are using any
structured replies, then it is a disservice to wire-sniffers if you did
not also enable structured replies for CMD_READ.  Technically, it would
be feasible to use simple replies for reads while using structured
replies for the other negotiated commands, but practically, a client
that does that seems undesirable, which is why I said that a server
could disconnect rather than talking to such a client.

So taking your sentence at face value, yes there is another solution
possible: require that any NBD_OPT_* haggling used to negotiate the use
of any other command with a structured reply MUST fail if the client has
not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection
open so the client can continue option haggling.  That way, the only way
to end option haggling with the new command enabled is to also enable
structured reads.  The burden is then on the client to haggle in the
correct order, and on the server to track haggling state when deciding
how to answer the option requests for the new commands.

I'm a bit reluctant to put ordering requirements on option haggling
(that option A must be turned on before option B), but then again, the
SELECT extension requires NBD_OPT_SELECT to be haggled prior to
NBD_OPT_GO, so there's precedent.  I also am thinking of proposing an
extension for option haggling to communicate minimum/preferred
alignments and maximum don't-fragment sizing, and that option would have
to be enabled before OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it
would change the NBD_REP_SERVER layout in response to those option
requests), which would be another case where option A affects the
behavior of option B.

> 
>> +    - `NBD_REPLY_TYPE_NONE` (0)
>> +
>> +      *length* MUST be 0 (and the payload field omitted).  This type
>> +      SHOULD be used only as the final reply (that is, when
>> +      `NBD_REPLY_FLAG_DONE` is set), and implies that the overall
>> +      client request was successfully completed.
> 
> I think this would be clearer as 'SHOULD NOT be used other than as the
> final reply'. Because you are also saying (I think) that you need not
> have it as the final reply - it's just as good in a non-errored
> reply to have NBD_REPLY_FLAG_DONE set on the last data packet (provided
> you know it's not going to error before starting to send it).

Yes, that sounds slightly better.  (Technically, there's no reason that
it can't be used as an intermediate chunk, but there it is only a no-op
filler that wastes bandwidth - hence the SHOULD and not MUST).

> 
> ...
> 
>> +    The server MAY split the reply into any number of data chunks,
>> +    using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or
>> +    `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least
>> +    one byte, although to minimize overhead, the server SHOULD use
>> +    chunks no smaller than 512 bytes where possible (the first and
> 
> This is a good idea, but rather than 'no smaller than 512 bytes', as
> it's a 'SHOULD', could we have 'the server SHOULD use chunks each
> an integer multiple of 512 bytes where possible' (you already have
> a carve out for the first and last).
> 
> ...
> 
>> +    If no error is detected, then the server MUST send enough chunks
>> +    to cover the bytes requested.  The server MAY set the
>> +    `NBD_REPLY_FLAG_DONE` on the final data chunk,
> 
> In which case it MUST NOT send any further non-data chunks
> (e.g. an error chunk or a NONE chunk)

Well, yeah - once the FLAG_DONE is sent, no further chunks of any type
are allowed; the reply is complete.

> 
>> to minimize
>> +    traffic, but MUST NOT do so if it would still be possible to
>> +    detect an error while transmitting the chunk.  If the last data
>> +    chunk is not the final reply, the server MUST use
>> +    `NBD_REPLY_TYPE_NONE` as the final reply to indicate success.
> 
> or an error chunk to indicate an error, and these final chunk MUST have
> NBD_REPLY_FLAG_DONE set on it.
> 
>> +    If an error is detected, the server MUST send padding bytes to
>> +    complete the current chunk (if any), MUST report the error with a
>> +    reply type of either `NBD_REPLY_TYPE_ERROR` or
>> +    `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies
>> +    without sending the total number of bytes requested.  If one or
>> +    more offset errors are reported, the client MAY assume that all
>> +    data in chunks not including the offset,
> 
> "the offset(s)"
> 
>> and all data within the
>> +    affected chunk
> 
> "within each affected chunk"
> 
>> but prior to the offset,
> 
> "prior to the relevant offset"
> 
>> is valid; the client MAY
>> +    NOT assume anything about data validity if no offset is provided.
> 

Yes, your tweaks help the flow.

> These multiple error chunks are neat. However, I suspect lazy implementors
> may just send an error without an offset.

Any ideas are appreciated on how to word it to suggest that servers
SHOULD try to give full details; but I think we want the fallback to the
no-offset case because some situations will not have an offset.

> 
>> +    The server MAY send additional chunks or offset error replies, if
>> +    `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply
>> +    also reports an error (that is, the final reply MUST NOT use
>> +    `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier
>> +    in constructing the final reply.
> 
> I'm not sure I get that bit. Why don't you make an errorred reply simply
> one that contains one or more error chunks. An errorred reply need not contain
> all the data requested (though each chunk must be complete). A reply that
> isn't errorred needs not contain all the data requested. Why do you
> need anything stronger than that? So if you have a parallelised server which
> is simply sending several reads in parallel (think Ceph) it sends the
> result from each thread, possibly followed by an error packet, and some
> other thread notices when all of these have completed and sends a
> NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the
> handle. This seems perfectly natural and no harder for the client to deal
> with, but you are prohibiting it.

I was thinking that it's easier on the client if the final chunk always
serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE,
NONE) or error (ERROR, ERROR_OFFSET).  But if you think that always
allowing a concluding NONE, even on an errored reply due to an earlier
chunk reporting the error, is reasonable, I can relax things along those
lines.  Or maybe we can state that the concluding chunk on an errored
reply may be ERROR_OFFSET with 'error' and 'offset' fields set to 0,
rather than forcing the server to replay one of the earlier offsets.


> 
>>  A server SHOULD NOT mix
>> +    `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies
>> +    to the same request.
>> +
>> +    A client MAY close the connection if it detects that the server
>> +    has sent invalid chunks (such as overlapping data, or not enough
>> +    data before claiming success).
>> +
>> +    In order to avoid the burden of reassembly, the client MAY set the
>> +    `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not
>> +    fragment the reply.  If this flag is set, the server MUST send at
>> +    most one `NBD_REPLY_TYPE_OFFSET_DATA` or
>> +    `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than
>> +    reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`).  If
> 
> "the flag is set and"
> 
>> +    the client's length request is larger than 65,536 bytes (or if a
>> +    later extension adds a way to negotiate a larger maximum fragment
>> +    size), the server MAY reject the command with `EOVERFLOW`.  The
>> +    `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag
>> +    was not set, or if the requested length is no larger than 65,536.

I'm also wondering if the wording should be switched along the lines of:

If the flag is set and the server deems the request to be too large, the
server MAY reject the command with `EOVERFLOW`; however, the server MUST
NOT reject a request that is no larger than 65,536 bytes in this manner.
Wouter Verhelst March 30, 2016, 7:51 p.m. UTC | #4
Hi all,

(side note: this seems to be mostly an NBD discussion at this point.
Does qemu-devel need to be in the loop before we've finished that? I
don't care either way, but then I don't want to bore or annoy people...)

On Wed, Mar 30, 2016 at 11:45:04AM -0600, Eric Blake wrote:
> On 03/30/2016 12:50 AM, Alex Bligh wrote:
[...]
> So taking your sentence at face value, yes there is another solution
> possible: require that any NBD_OPT_* haggling used to negotiate the use
> of any other command with a structured reply MUST fail if the client has
> not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection
> open so the client can continue option haggling.  That way, the only way
> to end option haggling with the new command enabled is to also enable
> structured reads.  The burden is then on the client to haggle in the
> correct order, and on the server to track haggling state when deciding
> how to answer the option requests for the new commands.
> 
> I'm a bit reluctant to put ordering requirements on option haggling
> (that option A must be turned on before option B),

Yes, me too.

> but then again, the
> SELECT extension requires NBD_OPT_SELECT to be haggled prior to
> NBD_OPT_GO, so there's precedent.

Yeah, but then, that's only because GO is meant to transition from
negotiation to transmission, so it needs to be after *any* other
negotiation; anything else would defeat its purpose.

Requiring structured read after negotiating other structured commands
could easily be done by saying that it's an error to leave the
negotiation phase with "some other" structured command negotiated, but
not structured read.

On the other hand, I feel compelled to point out that we only find
ourselves in this hole because we've coupled the structured reply with
the read command. That may have looked like a good idea at the time, but
obviously it isn't. If we just have it be "negotiate the structured
reply format" rather than "the structured read reply", and state that
once negotiated, the structured reply format is required for any command
with a payload, we're done.

Since only the read reply has a payload at this point in time, you don't
really need to special-case it, anyway.

> I also am thinking of proposing an extension for option haggling to
> communicate minimum/preferred alignments and maximum don't-fragment
> sizing, and that option would have to be enabled before
> OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the
> NBD_REP_SERVER layout in response to those option requests), which
> would be another case where option A affects the behavior of option B.

I reused the NBD_REP_SERVER command in reply to the NBD_OPT_SELECT
command since its purpose seems fairly similar ("send metadata about an
export to the client"), but that may have been a mistake. It certainly
wasn't meant that if you say NBD_OPT_SELECT first and then go
NBD_OPT_LIST, that the NBD_REP_SERVER reply to NBD_OPT_LIST should be
the one as specified in the SELECT extension.

[...]
> > These multiple error chunks are neat. However, I suspect lazy implementors
> > may just send an error without an offset.
> 
> Any ideas are appreciated on how to word it to suggest that servers
> SHOULD try to give full details; but I think we want the fallback to the
> no-offset case because some situations will not have an offset.

I'm wary of making the spec too complicated. Adding such language might
make it unclear. Since as you say it can't be a hard-and-fast rule, I'd
prefer that we just trust implementor's judgement on this.

Not sending an offset (even if it would be possible) can certainly be
the better choice in some cases -- a protocol description can never know
all the trade-offs and choices a particular implementor may want or need
to make.

> >> +    The server MAY send additional chunks or offset error replies, if
> >> +    `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply
> >> +    also reports an error (that is, the final reply MUST NOT use
> >> +    `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier
> >> +    in constructing the final reply.
> > 
> > I'm not sure I get that bit. Why don't you make an errorred reply simply
> > one that contains one or more error chunks. An errorred reply need not contain
> > all the data requested (though each chunk must be complete). A reply that
> > isn't errorred needs not contain all the data requested. Why do you
> > need anything stronger than that? So if you have a parallelised server which
> > is simply sending several reads in parallel (think Ceph) it sends the
> > result from each thread, possibly followed by an error packet, and some
> > other thread notices when all of these have completed and sends a
> > NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the
> > handle. This seems perfectly natural and no harder for the client to deal
> > with, but you are prohibiting it.
> 
> I was thinking that it's easier on the client if the final chunk always
> serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE,
> NONE) or error (ERROR, ERROR_OFFSET).

The client will already need to keep state to reassemble a fragmented
and out-of-order read reply, anyway. If that's already the case, adding
in the requirement to *also* keep track of error state (which could in
the simplest case be a single bit for a client which doesn't care about
offsets or error count) isn't that much more of an issue.

I'm with Alex on this one.

[...]
> >> +    the client's length request is larger than 65,536 bytes (or if a
> >> +    later extension adds a way to negotiate a larger maximum fragment
> >> +    size), the server MAY reject the command with `EOVERFLOW`.  The
> >> +    `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag
> >> +    was not set, or if the requested length is no larger than 65,536.
> 
> I'm also wondering if the wording should be switched along the lines of:
> 
> If the flag is set and the server deems the request to be too large, the
> server MAY reject the command with `EOVERFLOW`; however, the server MUST
> NOT reject a request that is no larger than 65,536 bytes in this manner.

Yes, that seems better.
Wouter Verhelst March 30, 2016, 8:44 p.m. UTC | #5
So,

On Tue, Mar 29, 2016 at 05:01:00PM -0600, Eric Blake wrote:
[...]
> +- `NBD_OPT_STRUCTURED_READ` (8)
> +
> +    Defined by the experimental `Structured Read` extension; see below.

(detail: that makes the "structured read" extension be typographically
different from everything else. Either make it all caps, or not
monocase.)

[...]
>      A write request. Length and offset define the location and amount of
> @@ -536,13 +556,16 @@ The following error values are defined:
>  * `ENOMEM` (12), Cannot allocate memory.
>  * `EINVAL` (22), Invalid argument.
>  * `ENOSPC` (28), No space left on device.
> +* `EOVERFLOW` (75), Value too large.
>
>  The server SHOULD return `ENOSPC` if it receives a write request
>  including one or more sectors beyond the size of the device.  It SHOULD
>  return `EINVAL` if it receives a read or trim request including one or
>  more sectors beyond the size of the device.  It also SHOULD map the
> -`EDQUOT` and `EFBIG` errors to `ENOSPC`.  Finally, it SHOULD return
> -`EPERM` if it receives a write or trim request on a read-only export.
> +`EDQUOT` and `EFBIG` errors to `ENOSPC`.  It SHOULD return `EOVERFLOW`
> +on a request to send structured read data without fragmentation but
> +where the length is too large.  Finally, it SHOULD return `EPERM` if
> +it receives a write or trim request on a read-only export.

I'd like some more explicit language here that makes it clear EOVERFLOW
can *only* be used on structured replies. We reduced the set of valid
error numbers a while back for good reason; it would be bad if we
accidentally increase it for existing replies now.

>  The server SHOULD return `EINVAL` if it receives an unknown command.
> 
> @@ -579,7 +602,7 @@ To remedy this, a `SELECT` extension is envisioned. This extension adds
>  two option requests and one error reply type, and extends one existing
>  option reply type.
> 
> -* `NBD_OPT_SELECT`
> +* `NBD_OPT_SELECT` (6)

NAK. The spec is currently consistent in associating numbers to
constants in only *one* place. This is no accident, and I'd like to keep
it that way.

(at least I think it is; if not, that's a bug)

>      The client wishes to select the export with the given name for use
>      in the transmission phase, but does not yet want to move to the
> @@ -613,7 +636,7 @@ option reply type.
>        handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to
>        using `NBD_OPT_EXPORT_NAME`.
> 
> -* `NBD_OPT_GO`
> +* `NBD_OPT_GO` (7)

same.

>      The client wishes to terminate the negotiation phase and progress to
>      the transmission phase. Possible replies from the server include:
> @@ -635,6 +658,235 @@ option reply type.
>        message if they do not also send it as a reply to the
>        `NBD_OPT_SELECT` message.
> 
> +### `Structured Read` extension
> +
> +Some of the major downsides of the default reply to `NBD_CMD_READ`
> +(without structured replies) are as follows.  First, it is not
> +possible to support partial reads (the command must succeed or fail as
> +a whole, either len bytes of data must be sent or the connection must
> +be closed).  There is no way to efficiently skip over portions of a
> +sparse file that are known to contain all zeroes.  Finally, it is not
> +possible to reliably decode the server traffic without also having
> +context of what pending read requests were sent by the client.
> +
> +To remedy this, a `Structured Read` extension is envisioned. This
> +extension adds a new option request, a new reply type during the
> +transmission phase, and a new command flag, and alters the set of
> +valid replies to an existing command.
> +
> +* `NBD_OPT_STRUCTURED_READ` (8)

same

> +    The client wishes to use structured reads during the transmission
> +    phase.  The option request has no additional data.
> +
> +    The server replies with one of the following:
> +
> +    - `NBD_REP_ACK`: Structured reads have been negotiated; the server
> +      MUST use structured replies to `NBD_CMD_READ`
> +    - `NBD_REP_UNSUP`: Structured reads are not available; the transmission
                  ^ ERR_

(however, see below ;-)

> +      phase MUST remain the same as if the client had not attempted
> +      `NBD_OPT_STRUCTURED_READ`

This makes it seem as though NBD_REP_ERR_UNSUP has a different meaning
here than it usually does. I think the spec should just say that the
server should reply with NBD_REP_ACK, and then mention that for
backwards compatibility the client should be prepared to handle
NBD_REP_UNSUP too (as is done elsewhere).

That is, if the server implements structured replies, it should allow it
(it makes no sense for the server to disallow structured reads if it
knows about them)

[...]
> +    A server MUST NOT send a data payload in a normal reply if
> +    Structured Reads are negotiated.  It is envisioned that all future
> +    extension commands that require a data payload in the response
> +    will require independent option negotiation, and therefore, the
> +    `NBD_CMD_READ` command is the only command that is allowed to use
> +    the data payload of a normal reply, and only when Structured Reads
> +    were not negotiated.  However, for ease of implementation, a
> +    server MAY close the connection rather than entering transmission
> +    phase if, at the end of option haggling, the client has negotiated
> +    another command that requires a structured reply but did not also
> +    negotiate Structured Reads.

(see my comments on this downthread)

[...]
> +* `NBD_CMD_FLAG_DF` (bit 1)
> +
> +    Valid during `NBD_CMD_READ`.  SHOULD be set to 1 if the client
> +    requires the server to send at most one data chunk in reply.  MUST
> +    NOT be set unless the client negotiated Structured Reads with the
> +    server.

I realize I'm flip-flopping on whether or not to use a flag bit for
this, but bear with me on this one for a moment.

There is an ioctl NBD_SET_FLAGS which just expects the per-export flags
to be passed to the kernel. By reusing that, there is no need for an
extra kernel call to specify the options the kernel-side client can use.
That still leaves the ability for userspace to detect whether the client
can interpret structured replies, but this could easily be signalled
using a sysfs flag (or similar).

If the client can do something along the lines of:

check sysfs (or whatever)
send NBD_OPT_STRUCTURED_READ to server
get flags from server
call NBD_SET_FLAGS ioctl

and that signals *everything* to both the kernel and the server, then we
don't need any extra uapi calls to be defined. This is probably a good
thing.

However, in order for that to be possible, the per-export flags field
needs to have a bit set to signal the server's understanding of, and
client's permission to use, NBD_CMD_FLAG_DF. As such, I think we need an
extra flag in the per-export flags field of the protocol, even though
we've already negotiated structured reads and I expressed the preference
that this shouldn't be decoupled.

Yes, that's slightly ugly.

Thinking about this, I suppose it makes sense to rename the "global"
flags field as the "negotiation flags field" which signals incompatible
changes in the negotiation phase, and the "per-export" flags field as
the "transmission flags field" which signals features the client can use
during transmission, or something along those lines. Thoughts?

[...]

no further comments (other than what's already been said)
Eric Blake March 30, 2016, 8:54 p.m. UTC | #6
On 03/30/2016 01:51 PM, Wouter Verhelst wrote:

> (side note: this seems to be mostly an NBD discussion at this point.
> Does qemu-devel need to be in the loop before we've finished that? I
> don't care either way, but then I don't want to bore or annoy people...)

Well, it stemmed out of qemu's desires to implement more efficient ways
to both push and pull sparse files across NBD; and qemu will ultimately
want to implement what we discuss.  But I'm okay doing most of the churn
off of the qemu list, and only adding qemu back in the loop when it is
actually time to implement the final design, unless someone else speaks
up asking to remain in on the conversation.

>> I'm a bit reluctant to put ordering requirements on option haggling
>> (that option A must be turned on before option B),
> 
> Yes, me too.
> 
>> but then again, the
>> SELECT extension requires NBD_OPT_SELECT to be haggled prior to
>> NBD_OPT_GO, so there's precedent.
> 
> Yeah, but then, that's only because GO is meant to transition from
> negotiation to transmission, so it needs to be after *any* other
> negotiation; anything else would defeat its purpose.
> 
> Requiring structured read after negotiating other structured commands
> could easily be done by saying that it's an error to leave the
> negotiation phase with "some other" structured command negotiated, but
> not structured read.
> 
> On the other hand, I feel compelled to point out that we only find
> ourselves in this hole because we've coupled the structured reply with
> the read command. That may have looked like a good idea at the time, but
> obviously it isn't. If we just have it be "negotiate the structured
> reply format" rather than "the structured read reply", and state that
> once negotiated, the structured reply format is required for any command
> with a payload, we're done.

Well, I'm worried about the opposite - if the client does not negotiate
structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has
a variable-length payload to reply), then the server has three choices:
0) refuse the client (we already said this is a bit undesirable, as it
may lead to the client retrying in an infloop - having a way to return
an error message is better than dropping the connection); 1) the server
must find a way to shoehorn the same data that would be sent in a
structured reply to fit in the old-style unstructured reply (one nice
thing about the structured reply is that we get a length for free; that
length would have to be shoehorned into the old-style reply, different
from CMD_READ where a length was implied from the request), 2) the
server must fail the message

Actually, having typed that, maybe choice 2 is not all that bad.  It's
fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS with
EINVAL if structured replies were not negotiated, and to document that a
client that negotiates GET_LBA_STATUS MUST also negotiate structured
replies for the command to be useful.  For that matter, we just
documented that servers SHOULD use EINVAL for an unrecognized (or
out-of-context) command.  The client can enable the two options in
either order.  And we'd have the following table:

enabled       enabled
structured    GET_LBA            result of:
replies                     read         GET_LBA       write
----------------------------------------------------------------
no            no            old reply    EINVAL        old reply
yes           no            new reply    EINVAL        [*]
no            yes           old reply    EINVAL        old reply
yes           yes           new reply    new reply     [*]

[*] Here, we're still debating whether it makes sense to allow/require a
server to send new replies everywhere (and clients must handle both
styles if they negotiate structured replies), or to require a server to
send old replies (so that read is the only command where clients have to
handle two styles, but where the results of negotiating pinpoint which
style).

> 
> Since only the read reply has a payload at this point in time, you don't
> really need to special-case it, anyway.

Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was
specific to STRUCTURED_READ; it sounds like we are leaning back towards
STRUCTURED_REPLY and just a caveat that any new command that sends
payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated.  I
guess that also makes it easier to argue for a server that uses a
structured reply for ALL messages (REPLY_TYPE_NONE for success changes
16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes into 24).

> 
>> I also am thinking of proposing an extension for option haggling to
>> communicate minimum/preferred alignments and maximum don't-fragment
>> sizing, and that option would have to be enabled before
>> OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the
>> NBD_REP_SERVER layout in response to those option requests), which
>> would be another case where option A affects the behavior of option B.
> 
> I reused the NBD_REP_SERVER command in reply to the NBD_OPT_SELECT
> command since its purpose seems fairly similar ("send metadata about an
> export to the client"), but that may have been a mistake. It certainly
> wasn't meant that if you say NBD_OPT_SELECT first and then go
> NBD_OPT_LIST, that the NBD_REP_SERVER reply to NBD_OPT_LIST should be
> the one as specified in the SELECT extension.

Ah, well, then it's evidence that improved wording will help.

>> I was thinking that it's easier on the client if the final chunk always
>> serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE,
>> NONE) or error (ERROR, ERROR_OFFSET).
> 
> The client will already need to keep state to reassemble a fragmented
> and out-of-order read reply, anyway. If that's already the case, adding
> in the requirement to *also* keep track of error state (which could in
> the simplest case be a single bit for a client which doesn't care about
> offsets or error count) isn't that much more of an issue.

For a client that is copying NBD_CMD_READ into a local file, dumping
directly via pwrite() as each chunk comes in doesn't require the client
track any state (the client can just assume that by the end of the
command, all the bytes will have been covered); while a client using
pwritev() will have to construct an iovec that visits the chunks in the
correct order (not necessarily in the order received).  Clients that
really don't want to do much work have the DF flag to forbid
fragmentation.  But I think you've swayed me - I will make sure v3
allows an error at any point in the chain of chunks, and that the
wording on the final TYPE_NONE chunk does NOT imply success unless no
earlier error chunks were sent.
Wouter Verhelst March 30, 2016, 9:26 p.m. UTC | #7
On Wed, Mar 30, 2016 at 02:54:41PM -0600, Eric Blake wrote:
> On 03/30/2016 01:51 PM, Wouter Verhelst wrote:
> 
> > (side note: this seems to be mostly an NBD discussion at this point.
> > Does qemu-devel need to be in the loop before we've finished that? I
> > don't care either way, but then I don't want to bore or annoy people...)
> 
> Well, it stemmed out of qemu's desires to implement more efficient ways
> to both push and pull sparse files across NBD; and qemu will ultimately
> want to implement what we discuss.  But I'm okay doing most of the churn
> off of the qemu list, and only adding qemu back in the loop when it is
> actually time to implement the final design, unless someone else speaks
> up asking to remain in on the conversation.

Sure. I was just asking the question...

[...]
> Well, I'm worried about the opposite - if the client does not negotiate
> structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has
> a variable-length payload to reply), then the server has three choices:
> 0) refuse the client (we already said this is a bit undesirable, as it
> may lead to the client retrying in an infloop - having a way to return
> an error message is better than dropping the connection); 1) the server
> must find a way to shoehorn the same data that would be sent in a
> structured reply to fit in the old-style unstructured reply (one nice
> thing about the structured reply is that we get a length for free; that
> length would have to be shoehorned into the old-style reply, different
> from CMD_READ where a length was implied from the request), 2) the
> server must fail the message
> 
> Actually, having typed that, maybe choice 2 is not all that bad.

It isn't.

> It's fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS
> with EINVAL if structured replies were not negotiated, and to document
> that a client that negotiates GET_LBA_STATUS MUST also negotiate
> structured replies for the command to be useful.  For that matter, we
> just documented that servers SHOULD use EINVAL for an unrecognized (or
> out-of-context) command.  The client can enable the two options in
> either order.  And we'd have the following table:
> 
> enabled       enabled
> structured    GET_LBA            result of:
> replies                     read         GET_LBA       write
> ----------------------------------------------------------------
> no            no            old reply    EINVAL        old reply
> yes           no            new reply    EINVAL        [*]
> no            yes           old reply    EINVAL        old reply
> yes           yes           new reply    new reply     [*]

Right.

It would also be reasonable to say that if you don't negotiate an option
but do end up using it, you end up squarely in undefined behaviour-land.
The server could send EINVAL, it could honour your request in ways that
you may not expect (including structured replies when you didn't ask for
them), or it could cause nasal demons.

> [*] Here, we're still debating whether it makes sense to allow/require a
> server to send new replies everywhere (and clients must handle both
> styles if they negotiate structured replies), or to require a server to
> send old replies (so that read is the only command where clients have to
> handle two styles, but where the results of negotiating pinpoint which
> style).

I'm still in favour of having it be "old reply" for the whole of that
"write" column. I'm just not convinced there's a downside to that, while
there is an upside.

> > Since only the read reply has a payload at this point in time, you don't
> > really need to special-case it, anyway.
> 
> Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was
> specific to STRUCTURED_READ; it sounds like we are leaning back towards
> STRUCTURED_REPLY and just a caveat that any new command that sends
> payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated.

You could formulate it that way. Alternatively, you could formulate it
so that any command that sends payload may fail if STRUCTURED_REPLY was
not also negotiated, with caveat that there is this backwards
compatibility thing for READ.

(i.e., make READ be the exception rather than the rule)

> I guess that also makes it easier to argue for a server that uses a
> structured reply for ALL messages (REPLY_TYPE_NONE for success changes
> 16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes
> into 24).

Perhaps.

[...]
> > The client will already need to keep state to reassemble a fragmented
> > and out-of-order read reply, anyway. If that's already the case, adding
> > in the requirement to *also* keep track of error state (which could in
> > the simplest case be a single bit for a client which doesn't care about
> > offsets or error count) isn't that much more of an issue.
> 
> For a client that is copying NBD_CMD_READ into a local file, dumping
> directly via pwrite() as each chunk comes in doesn't require the client
> track any state (the client can just assume that by the end of the
> command, all the bytes will have been covered); while a client using
> pwritev() will have to construct an iovec that visits the chunks in the
> correct order (not necessarily in the order received).

Ah yes, good point.

> Clients that really don't want to do much work have the DF flag to
> forbid fragmentation.  But I think you've swayed me - I will make sure
> v3 allows an error at any point in the chain of chunks, and that the
> wording on the final TYPE_NONE chunk does NOT imply success unless no
> earlier error chunks were sent.

Okay, thanks.
Alex Bligh March 30, 2016, 10:48 p.m. UTC | #8
Eric,

>>> However, for ease of implementation, a
>>> +    server MAY close the connection rather than entering transmission
>>> +    phase if, at the end of option haggling, the client has negotiated
>>> +    another command that requires a structured reply but did not also
>>> +    negotiate Structured Reads.
>> 
>> That's pretty yucky given a reconnect will achieve the same result
>> and you'll end up in an infinite retry loop.
>> 
>> Wouldn't a better route be simply to say that implementing certain
>> commands (server or client sides) requires support of structured
>> replies?
> 
> I think we're in agreement that the only command that (for back-compat
> reasons) can ever send data on a simple reply is CMD_READ.  Therefore,
> if you negotiate any other command that can send data, that command will
> use a structured reply; the mere fact that you negotiated the command
> means that both client and server know about structured replies.
> 
> I guess what I was trying to get at is that if you are using any
> structured replies, then it is a disservice to wire-sniffers if you did
> not also enable structured replies for CMD_READ.

Agree

> Technically, it would
> be feasible to use simple replies for reads while using structured
> replies for the other negotiated commands,

Agree

> but practically, a client
> that does that seems undesirable, which is why I said that a server
> could disconnect rather than talking to such a client.

I would prefer to make that a protocol breach, i.e. the client
MUST NOT do that.

> So taking your sentence at face value, yes there is another solution
> possible: require that any NBD_OPT_* haggling used to negotiate the use
> of any other command with a structured reply MUST fail if the client has
> not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection
> open so the client can continue option haggling.  That way, the only way
> to end option haggling with the new command enabled is to also enable
> structured reads.  The burden is then on the client to haggle in the
> correct order, and on the server to track haggling state when deciding
> how to answer the option requests for the new commands.

I think I'm saying something simpler (unless I'm missing something)
which is:

The client MUST NOT propose NBD_OPT_FOO unless it has previously
proposed and the server accepted NBD_OPT_BAR. In this case FOO
is e.g. the thing to find holes, BAR is NBD_OPT_STRUCTURED_READ.

So it's not a question of leaving the connection open or closing it,
it's simply that the client can't propose X unless it's already
negotiated Y. If it does, all bets are off, so of course it can
close the connection. But this makes it explicit that the client
if proposing X should first have successfully negotiated Y.

> I'm a bit reluctant to put ordering requirements on option haggling
> (that option A must be turned on before option B), but then again, the
> SELECT extension requires NBD_OPT_SELECT to be haggled prior to
> NBD_OPT_GO, so there's precedent.

Resisting such a change would resist the possibility in the future
that option X requires option Y. Even if you can get around that
this time, I'll bet my bottom dollar it will come up again.

> I also am thinking of proposing an
> extension for option haggling to communicate minimum/preferred
> alignments and maximum don't-fragment sizing, and that option would have
> to be enabled before OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it
> would change the NBD_REP_SERVER layout in response to those option
> requests), which would be another case where option A affects the
> behavior of option B.

... and as if by magic!

Yeah I was going to suggest a similar option, including a blocksize
alignment, a maximum size for a read/write, and a maximum DF size.

I'm busy writing an 'example' nbd server in golang and this is the
first thing I ran into. The purpose of this BTW is to serve as
an example implementation for structured replies.

>> These multiple error chunks are neat. However, I suspect lazy implementors
>> may just send an error without an offset.
> 
> Any ideas are appreciated on how to word it to suggest that servers
> SHOULD try to give full details; but I think we want the fallback to the
> no-offset case because some situations will not have an offset.

Indeed. Perhaps we can promise them Oreos.

> I was thinking that it's easier on the client if the final chunk always
> serves as a reliable indicator of success (OFFSET_DATA, OFFSET_HOLE,
> NONE) or error (ERROR, ERROR_OFFSET).  But if you think that always
> allowing a concluding NONE, even on an errored reply due to an earlier
> chunk reporting the error, is reasonable, I can relax things along those
> lines.  Or maybe we can state that the concluding chunk on an errored
> reply may be ERROR_OFFSET with 'error' and 'offset' fields set to 0,
> rather than forcing the server to replay one of the earlier offsets.

I think I'd prefer just that the last chunk has the relevant bit set,
and comes before (or is) the error chunk(s). That's lots easier for the
server and no more difficult for the client.

>>> +    the client's length request is larger than 65,536 bytes (or if a
>>> +    later extension adds a way to negotiate a larger maximum fragment
>>> +    size), the server MAY reject the command with `EOVERFLOW`.  The
>>> +    `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag
>>> +    was not set, or if the requested length is no larger than 65,536.
> 
> I'm also wondering if the wording should be switched along the lines of:
> 
> If the flag is set and the server deems the request to be too large, the
> server MAY reject the command with `EOVERFLOW`; however, the server MUST
> NOT reject a request that is no larger than 65,536 bytes in this manner.

That's clearer.

--
Alex Bligh
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index 3f9ee23..75b1534 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -211,6 +211,14 @@  handle as was sent by the client in the corresponding request.  In
 this way, the client can correlate which request is receiving a
 response.

+By default, there is exactly one reply message for each request
+(unless the connection is closed due to an error).  Note that it is
+impossible to tell by reading just the server traffic whether a data
+field will be present.  The experimental `Structured Read` extension
+adds an additional reply type, documents when there will be multiple
+replies to a single request, and creates a context-free server stream;
+see below.
+
 ## Values

 This section describes the value and meaning of constants (other than
@@ -255,6 +263,8 @@  immediately after the global flags field in oldstyle negotiation:
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands

+Clients SHOULD ignore unknown flags.
+
 ##### Client flags

 This field of 32 bits is sent after initial connection and after
@@ -338,6 +348,10 @@  of the newstyle negotiation.

     Defined by the experimental `SELECT` extension; see below.

+- `NBD_OPT_STRUCTURED_READ` (8)
+
+    Defined by the experimental `Structured Read` extension; see below.
+
 #### Option reply types

 These values are used in the "reply type" field, sent by the server
@@ -430,6 +444,8 @@  valid may depend on negotiation during the handshake phase.
   set to 1 if the client requires "Force Unit Access" mode of
   operation.  MUST NOT be set unless export flags included
   `NBD_FLAG_SEND_FUA`.
+- bit 1, `NBD_CMD_FLAG_DF`; defined by the experimental `Structured
+  Read` extension; see below

 #### Request types

@@ -451,6 +467,10 @@  The following request types exist:
     signalling no error), the server MUST immediately close the
     connection; it MUST NOT send any further data to the client.

+    The experimental `Structured Read` extension changes the set of
+    valid replies, in part to allow recovery after a partial read and
+    more efficient reads of sparse files; see below.
+
 * `NBD_CMD_WRITE` (1)

     A write request. Length and offset define the location and amount of
@@ -536,13 +556,16 @@  The following error values are defined:
 * `ENOMEM` (12), Cannot allocate memory.
 * `EINVAL` (22), Invalid argument.
 * `ENOSPC` (28), No space left on device.
+* `EOVERFLOW` (75), Value too large.

 The server SHOULD return `ENOSPC` if it receives a write request
 including one or more sectors beyond the size of the device.  It SHOULD
 return `EINVAL` if it receives a read or trim request including one or
 more sectors beyond the size of the device.  It also SHOULD map the
-`EDQUOT` and `EFBIG` errors to `ENOSPC`.  Finally, it SHOULD return
-`EPERM` if it receives a write or trim request on a read-only export.
+`EDQUOT` and `EFBIG` errors to `ENOSPC`.  It SHOULD return `EOVERFLOW`
+on a request to send structured read data without fragmentation but
+where the length is too large.  Finally, it SHOULD return `EPERM` if
+it receives a write or trim request on a read-only export.

 The server SHOULD return `EINVAL` if it receives an unknown command.

@@ -579,7 +602,7 @@  To remedy this, a `SELECT` extension is envisioned. This extension adds
 two option requests and one error reply type, and extends one existing
 option reply type.

-* `NBD_OPT_SELECT`
+* `NBD_OPT_SELECT` (6)

     The client wishes to select the export with the given name for use
     in the transmission phase, but does not yet want to move to the
@@ -613,7 +636,7 @@  option reply type.
       handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to
       using `NBD_OPT_EXPORT_NAME`.

-* `NBD_OPT_GO`
+* `NBD_OPT_GO` (7)

     The client wishes to terminate the negotiation phase and progress to
     the transmission phase. Possible replies from the server include:
@@ -635,6 +658,235 @@  option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.

+### `Structured Read` extension
+
+Some of the major downsides of the default reply to `NBD_CMD_READ`
+(without structured replies) are as follows.  First, it is not
+possible to support partial reads (the command must succeed or fail as
+a whole, either len bytes of data must be sent or the connection must
+be closed).  There is no way to efficiently skip over portions of a
+sparse file that are known to contain all zeroes.  Finally, it is not
+possible to reliably decode the server traffic without also having
+context of what pending read requests were sent by the client.
+
+To remedy this, a `Structured Read` extension is envisioned. This
+extension adds a new option request, a new reply type during the
+transmission phase, and a new command flag, and alters the set of
+valid replies to an existing command.
+
+* `NBD_OPT_STRUCTURED_READ` (8)
+
+    The client wishes to use structured reads during the transmission
+    phase.  The option request has no additional data.
+
+    The server replies with one of the following:
+
+    - `NBD_REP_ACK`: Structured reads have been negotiated; the server
+      MUST use structured replies to `NBD_CMD_READ`
+    - `NBD_REP_UNSUP`: Structured reads are not available; the transmission
+      phase MUST remain the same as if the client had not attempted
+      `NBD_OPT_STRUCTURED_READ`
+
+* Transmission phase
+
+    The transmission phase includes a third message type: the
+    structured reply, to be used for commands where the response must
+    include a data payload.  The server MUST NOT send this reply type
+    unless the client has successfully negotiated an extension that
+    requires the use of a structured reply; this includes the
+    negotiation of Structured Reads via `NBD_OPT_STRUCTURED_READ`.
+
+    A structured reply looks as follows:
+
+    S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
+    S: 16 bits, flags  
+    S: 16 bits, type  
+    S: 64 bits, handle  
+    S: 32 bits, length of payload (unsigned)  
+    S: *length* bytes of payload data
+
+    The use of *length* in the reply allows context-free division of
+    the overall server traffic into individual reply messages; the
+    *type* field describes how to further interpret the payload.
+
+    While the server is permitted to send at most one normal reply (or
+    else close the connection), a command that uses structured replies
+    may document that the server is permitted to send mutiple replies,
+    all sharing the same handle, by using the `NBD_REPLY_FLAG_DONE`
+    (bit 0) to delineate the final reply.  The server MAY interleave
+    intermediate replies to one structured command with replies
+    relating to a different handle.
+
+    A server MUST NOT send a data payload in a normal reply if
+    Structured Reads are negotiated.  It is envisioned that all future
+    extension commands that require a data payload in the response
+    will require independent option negotiation, and therefore, the
+    `NBD_CMD_READ` command is the only command that is allowed to use
+    the data payload of a normal reply, and only when Structured Reads
+    were not negotiated.  However, for ease of implementation, a
+    server MAY close the connection rather than entering transmission
+    phase if, at the end of option haggling, the client has negotiated
+    another command that requires a structured reply but did not also
+    negotiate Structured Reads.
+
+  * Structured Reply flags
+
+    This field of 16 bits is sent by the server as part of every
+    structured reply.
+
+    - bit 0, `NBD_REPLY_FLAG_DONE`; the server MUST clear this bit if
+      more structured replies will be sent for the same client
+      request, and MUST set this bit if this is the final reply.
+      Commands which are documented as using structured replies, but
+      not documented as sending multiple replies, MUST always set this
+      bit.
+
+    The server MUST NOT set any other flags without first negotiating
+    the extension with the client.  Clients that receive an
+    unrecognized flag SHOULD close the connection.
+
+  * Structured Reply types
+
+    These values are used in the "type" field of a structured reply.
+    Each type determines how to interpret the "length" bytes of data.
+    If the client receives an unknown or unexpected type, it SHOULD
+    close the connection.
+
+    - `NBD_REPLY_TYPE_NONE` (0)
+
+      *length* MUST be 0 (and the payload field omitted).  This type
+      SHOULD be used only as the final reply (that is, when
+      `NBD_REPLY_FLAG_DONE` is set), and implies that the overall
+      client request was successfully completed.  Valid as a reply to
+      `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_OFFSET_DATA` (1)
+
+      *length* MUST be at least 9.  The payload is structured as:
+
+      64 bits: offset (unsigned)  
+      *length - 8* bytes: data
+
+      This reply represents the contents of *length - 8* bytes of the
+      file, starting at *offset*.  The data MUST lie within the
+      bounds of the original offset and length of the client's
+      request.  Valid as a reply to `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_OFFSET_HOLE` (2)
+
+      *length* MUST be exactly 12.  The payload is structured as:
+
+      64 bits: offset (unsigned)  
+      32 bits: hole size (unsigned)
+
+      This reply represents that *hole size* bytes of the file (which
+      MUST be non-zero), starting at *offset*, read as all zeroes.
+      The hole MUST lie within the bounds of the original offset and
+      length of the client's request.  Valid as a reply to
+      `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_ERROR` (3)
+
+      *length* MUST be exactly 4.  The payload is structured as:
+
+      32 bits: error
+
+      This reply represents that an error occurred, with no further
+      details as to the offset where the error occurred; and SHOULD be
+      used only as the final reply (that is, when
+      `NBD_REPLY_FLAG_DONE` is set).  Valid as a reply to
+      `NBD_CMD_READ`.
+
+    - `NBD_REPLY_TYPE_ERROR_OFFSET` (4)
+
+      *length* MUST be exactly 12.  The payload is structured as:
+
+      32 bits: error  
+      64 bits: offset (unsigned)
+
+      This reply represents that an error occurred while handling the
+      given offset.  *error* MUST be nonzero, and *offset* must lie
+      within the bounds of the original offset and length of the
+      client's request.  Valid as a reply to `NBD_CMD_READ`.
+
+* `NBD_CMD_FLAG_DF` (bit 1)
+
+    Valid during `NBD_CMD_READ`.  SHOULD be set to 1 if the client
+    requires the server to send at most one data chunk in reply.  MUST
+    NOT be set unless the client negotiated Structured Reads with the
+    server.
+
+* `NBD_CMD_READ`
+
+    If `NBD_OPT_STRUCTURED_READ` was not negotiated, then a read
+    request MUST always be answered by a single non-structured
+    response, as documented above (using magic 0x67446698
+    `NBD_REPLY_MAGIC`, and containing length bytes of data according
+    to the client's request, although those bytes MAY be invalid if an
+    error is returned, and the connection MUST if an error occurs
+    after a header claiming no error).
+
+    If `NBD_OPT_STRUCTURED_READ` is negotiated, then a read request
+    MUST result in one or more structured replies (each using magic
+    0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), with the following
+    additional constraints.
+
+    The server MAY split the reply into any number of data chunks,
+    using reply types of `NBD_REPLY_TYPE_OFFSET_DATA` or
+    `NBD_REPLY_TYPE_OFFSET_HOLE`; each chunk MUST describe at least
+    one byte, although to minimize overhead, the server SHOULD use
+    chunks no smaller than 512 bytes where possible (the first and
+    last chunk of an unaligned read being the most obvious place for
+    an exception).  The server MUST NOT send chunks that overlap, and
+    MUST NOT send chunks that describe data outside the offset and
+    length of the request, but MAY send the chunks in any order (the
+    client is responsible for reassembling chunks into the correct
+    order).  Note that a request for more than 2^32 - 8 bytes MUST be
+    split into at least two chunks, so as not to overflow the length
+    field of a reply while still allowing space for the offset of each
+    chunk.
+
+    If no error is detected, then the server MUST send enough chunks
+    to cover the bytes requested.  The server MAY set the
+    `NBD_REPLY_FLAG_DONE` on the final data chunk, to minimize
+    traffic, but MUST NOT do so if it would still be possible to
+    detect an error while transmitting the chunk.  If the last data
+    chunk is not the final reply, the server MUST use
+    `NBD_REPLY_TYPE_NONE` as the final reply to indicate success.
+
+    If an error is detected, the server MUST send padding bytes to
+    complete the current chunk (if any), MUST report the error with a
+    reply type of either `NBD_REPLY_TYPE_ERROR` or
+    `NBD_REPLY_TYPE_ERROR_OFFSET`, and MAY end the sequence of replies
+    without sending the total number of bytes requested.  If one or
+    more offset errors are reported, the client MAY assume that all
+    data in chunks not including the offset, and all data within the
+    affected chunk but prior to the offset, is valid; the client MAY
+    NOT assume anything about data validity if no offset is provided.
+    The server MAY send additional chunks or offset error replies, if
+    `NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply
+    also reports an error (that is, the final reply MUST NOT use
+    `NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier
+    in constructing the final reply.  A server SHOULD NOT mix
+    `NBD_REPLY_TYPE_ERROR` and `NBD_REPLY_TYPE_ERROR_OFFSET` replies
+    to the same request.
+
+    A client MAY close the connection if it detects that the server
+    has sent invalid chunks (such as overlapping data, or not enough
+    data before claiming success).
+
+    In order to avoid the burden of reassembly, the client MAY set the
+    `NBD_CMD_FLAG_DF` flag (bit 1), which instructs the server to not
+    fragment the reply.  If this flag is set, the server MUST send at
+    most one `NBD_REPLY_TYPE_OFFSET_DATA` or
+    `NBD_REPLY_TYPE_OFFSET_HOLE`, although it MAY still send more than
+    reply (for error reporting, or a final `NBD_REPLY_TYPE_NONE`).  If
+    the client's length request is larger than 65,536 bytes (or if a
+    later extension adds a way to negotiate a larger maximum fragment
+    size), the server MAY reject the command with `EOVERFLOW`.  The
+    `EOVERFLOW` error MUST NOT be used if the `NBD_CMD_FLAG_DF` flag
+    was not set, or if the requested length is no larger than 65,536.
+
 ## About this file

 This file tries to document the NBD protocol as it is currently