diff mbox

[3/1] doc: Propose Structured Replies extension

Message ID 1459223796-28474-2-git-send-email-eblake@redhat.com
State New
Headers show

Commit Message

Eric Blake March 29, 2016, 3:56 a.m. UTC
The existing transmission phase protocol is difficult to sniff,
because correct interpretation of the server stream requires
context from the client stream (or risks false positives if
data payloads happen to contain the protocol magic numbers).  It
also prohibits the ability to do short reads, or to return a
read error without also sending length bytes of data.

Remedy this by adding a new global/client flag negotiation,
which affects the response of the NBD_CMD_READ command, and sets
forth rules for how future command responses must behave when
they carry a data payload.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 doc/proto.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)

Comments

Wouter Verhelst March 29, 2016, 7:33 a.m. UTC | #1
Hi Eric,

After applying some of the other outstanding patches, this one doesn't apply
anymore. Can you rebase?

On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote:
> The existing transmission phase protocol is difficult to sniff,
> because correct interpretation of the server stream requires
> context from the client stream (or risks false positives if
> data payloads happen to contain the protocol magic numbers).  It
> also prohibits the ability to do short reads, or to return a
> read error without also sending length bytes of data.
> 
> Remedy this by adding a new global/client flag negotiation,
> which affects the response of the NBD_CMD_READ command, and sets
> forth rules for how future command responses must behave when
> they carry a data payload.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  doc/proto.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 123 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 44579fc..f687e3e 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -209,6 +209,10 @@ same value for handle as was sent by the client for each request that
>  the server is replying to, so that the client may correlate which
>  request is receiving a response.
> 
> +Note that it is impossible to tell by reading just the server traffic
> +whether a data field will be present.  To remedy this, the experimental
> +`Structured Reply` extension has been introduced; see below.
> +
>  ## Values
> 
>  This section describes the value and meaning of constants (other than
> @@ -231,6 +235,8 @@ the first magic number.
>  - bit 1, `NBD_FLAG_NO_ZEROES`; if set, and if the client replies with
>    `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT
>    send the 124 bytes of zero at the end of the negotiation.
> +- bit 2, `NBD_FLAG_STRUCTURED_REPLY`; defined by the experimental
> +  `Structured Reply` extension; see below.
> 
>  The server MUST NOT set any other flags, and SHOULD NOT change behaviour
>  unless the client responds with a corresponding flag.  The server MUST
> @@ -265,6 +271,8 @@ receiving the global flags from the server.
>  - bit 1, `NBD_FLAG_C_NO_ZEROES`; MUST NOT be set if the server did not
>    set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124
>    bytes of zeroes at the end of the negotiation.
> +- bit 2, `NBD_FLAG_C_STRUCTURED_REPLY`; defined by the experimental
> +  `Structured Reply` extension; see below.
> 
>  Clients SHOULD NOT set any other flags; the server MUST drop the
>  connection if the client sets an unknown flag, or a flag that does
> @@ -435,6 +443,10 @@ The following request types exist:
>      signalling no error), the server MUST immediately close the
>      connection; it MUST NOT send any further data to the client.
> 
> +    The experimental `Structured Reply` extension changes the set of
> +    valid replies, in part to allow recovery after a partial read; see
> +    below.
> +
>  * `NBD_CMD_WRITE` (1)
> 
>      A write request. Length and offset define the location and amount of
> @@ -609,6 +621,117 @@ option reply type.
>        message if they do not also send it as a reply to the
>        `NBD_OPT_SELECT` message.
> 
> +### `Structured Reply` extension
> +
> +Some major downsides of the default reply to `NBD_CMD_READ` is that it
> +is not possible to support partial reads (the command must succeed or
> +fail as a whole, and len bytes of data must be sent even on an error,
> +unless the connection is closed); nor is it possible to decode the
> +server traffic without also knowing what pending read requests were
> +sent by the client.
> +
> +To remedy this, a `Structured Reply` extension is envisioned. This
> +extension adds a global flag, a client flag, a new reply type during
> +the transmission phase, and alters the set of valid replies to an
> +existing command.
> +
> +* `NBD_FLAG_STRUCTURED_REPLY` (bit 2)
> +
> +  This is a global flag; if set, and if the client replies with
> +  `NBD_FLAG_C_STRUCTURED_REPLY` in the client flags field, the server
> +  MUST use structured replies to the `NBD_CMD_READ` command.
> +
> +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2)
> +
> +  This is a client flag; MUST NOT be set if the server did not set
> +  `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured
> +  replies to the `NBD_CMD_READ` command.
> +
> +* Transmission phase
> +
> +  The transmission phase includes a third message type: the structured
> +  reply.  If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the
> +  normal server reply will never contain a data payload, and all
> +  server replies that send data will be in the following form:
> +
> +  S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
> +  S: 32 bits, error  
> +  S: 64 bits, handle  
> +  S: 32 bits, length of payload (unsigned)  
> +  S: *length* bytes of payload data
> +
> +  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then the normal
> +  server reply with a data payload will be used for `NBD_CMD_READ`;
> +  but any other replies with a data payload will still use a
> +  structured reply (that is, only `NBD_CMD_READ` is allowed to send
> +  data in the non-structured form, and negotiating
> +  `NBD_FLAG_C_STRUCTURED_REPLY` only affects replies to
> +  `NBD_CMD_READ`).  This implies that any new commands that require
> +  data in the reply should be gated by their own new global and client
> +  flag.  A server MAY refuse to allow a client that negotiates a
> +  command that requires a structured reply, but does not also
> +  negotiate `NBD_FLAG_C_STRUCTURED_REPLY`.
> +
> +  In the generic form, the length field of a structured response MAY
> +  be zero if there is no data payload, and the error field may be set
> +  regardless of the length field (although where possible, the server
> +  SHOULD use a length of zero when setting the error field).  However,
> +  particular commands may document additional restrictions regarding
> +  what forms a valid response (for example, a structured response to
> +  `NBD_CMD_READ` MUST NOT set the error field, and MUST have a
> +  non-zero length of at least 9).
> +
> +* `NBD_CMD_READ`
> +
> +  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read
> +  request MUST always be answered by a single response (with magic
> +  0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length
> +  bytes of data according to the client's request, although those
> +  bytes MAY be invalid if an error is returned, and the connection
> +  MUST if an error occurs after a header claiming no error.
> +
> +  Conversely, if the `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, the
> +  response MUST be a sequence of zero or more structured replies (with
> +  magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), followed by a
> +  concluding normal response (0x67446698 `NBD_REPLY_MAGIC`), where the
> +  final response MUST NOT have a data payload; all responses in the
> +  sequence use the same handle from the client.  The payload of each
> +  intermediate structured reply, called a read chunk, MUST be of the
> +  following form:
> +
> +  S: 64 bits, offset (unsigned)  
> +  S: (*length* - 8) bytes of data
> +
> +  Note that the amount of data returned in a read chunk is determined
> +  by the length field of the structured reply, and not the original
> +  length of the client's request.  If the server sends a single read
> +  chunk for a successful read, the server's length will be the
> +  client's length plus 8, because the server must account for the
> +  offset field in its reply.  Similarly, a successful client request
> +  for a read of 2^32-8 or more bytes MUST be split into at least two
> +  read chunks, so that the length field does not suffer from overflow.
> +
> +  The server MUST ensure that each read chunk lies within the original
> +  offset and length of the original client request, MUST NOT send read
> +  chunks that would cover the same offset more than once, and MUST
> +  send at least one byte of data in addition to the offset field of
> +  each read chunk.  The server MAY send read chunks out of order, and
> +  may interleave other responses between read replies.  The server
> +  MUST NOT set the error field of a read chunk; if an error occurs, it
> +  MAY immediately end the sequence of structured response messages,
> +  MUST send the error in the concluding normal response, and SHOULD
> +  keep the connection open.  The final non-structured response MUST
> +  set an error unless the sum of data sent by all read chunks totals
> +  the original client length request.
> +
> +  The client SHOULD immediately close the connection if it detects
> +  that the server has sent an offset more than once (whether or not
> +  the overlapping data claimed to have the same contents), or if
> +  receives the concluding normal reply without an error set but
> +  without all bytes covered by read chunk(s).  A future extension may
> +  add a command flag that would allow the server to skip read chunks
> +  for portions of the file that read as all zeroes.
> +
>  ## About this file
> 
>  This file tries to document the NBD protocol as it is currently
> -- 
> 2.5.5
> 
> 
> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785471&iu=/4140
> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general
>
Alex Bligh March 29, 2016, 8:24 a.m. UTC | #2
On 29 Mar 2016, at 04:56, Eric Blake <eblake@redhat.com> wrote:

> The existing transmission phase protocol is difficult to sniff,
> because correct interpretation of the server stream requires
> context from the client stream (or risks false positives if
> data payloads happen to contain the protocol magic numbers).  It
> also prohibits the ability to do short reads, or to return a
> read error without also sending length bytes of data.
> 
> Remedy this by adding a new global/client flag negotiation,
> which affects the response of the NBD_CMD_READ command, and sets
> forth rules for how future command responses must behave when
> they carry a data payload.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> doc/proto.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 123 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 44579fc..f687e3e 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -209,6 +209,10 @@ same value for handle as was sent by the client for each request that
> the server is replying to, so that the client may correlate which
> request is receiving a response.
> 
> +Note that it is impossible to tell by reading just the server traffic
> +whether a data field will be present.  To remedy this, the experimental
> +`Structured Reply` extension has been introduced; see below.
> +
> ## Values
> 
> This section describes the value and meaning of constants (other than
> @@ -231,6 +235,8 @@ the first magic number.
> - bit 1, `NBD_FLAG_NO_ZEROES`; if set, and if the client replies with
>   `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT
>   send the 124 bytes of zero at the end of the negotiation.
> +- bit 2, `NBD_FLAG_STRUCTURED_REPLY`; defined by the experimental
> +  `Structured Reply` extension; see below.
> 
> The server MUST NOT set any other flags, and SHOULD NOT change behaviour
> unless the client responds with a corresponding flag.  The server MUST
> @@ -265,6 +271,8 @@ receiving the global flags from the server.
> - bit 1, `NBD_FLAG_C_NO_ZEROES`; MUST NOT be set if the server did not
>   set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124
>   bytes of zeroes at the end of the negotiation.
> +- bit 2, `NBD_FLAG_C_STRUCTURED_REPLY`; defined by the experimental
> +  `Structured Reply` extension; see below.
> 
> Clients SHOULD NOT set any other flags; the server MUST drop the
> connection if the client sets an unknown flag, or a flag that does
> @@ -435,6 +443,10 @@ The following request types exist:
>     signalling no error), the server MUST immediately close the
>     connection; it MUST NOT send any further data to the client.
> 
> +    The experimental `Structured Reply` extension changes the set of
> +    valid replies, in part to allow recovery after a partial read; see
> +    below.
> +
> * `NBD_CMD_WRITE` (1)
> 
>     A write request. Length and offset define the location and amount of
> @@ -609,6 +621,117 @@ option reply type.
>       message if they do not also send it as a reply to the
>       `NBD_OPT_SELECT` message.
> 
> +### `Structured Reply` extension
> +
> +Some major downsides of the default reply to `NBD_CMD_READ` is that it
> +is not possible to support partial reads (the command must succeed or
> +fail as a whole, and len bytes of data must be sent even on an error,
> +unless the connection is closed); nor is it possible to decode the
> +server traffic without also knowing what pending read requests were
> +sent by the client.


"Some major downsides is" does not agree grammatically, and this
sentence is pretty convoluted.

How about:

Two of the major downsides of the default reply to `NBD_CMD_READ` (without
structured replies) are as follows. Firstly, it is not possible to support
partial reads (the command must succeed or fail as a whole, and len bytes
of data must be sent even on an error unless the connection is closed).
Secondly, it is not possible to decode the server traffic without also
knowing what pending read requests were sent by the client.

> +To remedy this, a `Structured Reply` extension is envisioned. This

if it's actually part of the standard (i.e. if this diff is accepted)
then its more than 'envisioned', it's 'specified', or 'available'
or similar.

> +extension adds a global flag, a client flag, a new reply type during
> +the transmission phase, and alters the set of valid replies to an
> +existing command.
> +
> +* `NBD_FLAG_STRUCTURED_REPLY` (bit 2)
> +
> +  This is a global flag; if set, and if the client replies with
> +  `NBD_FLAG_C_STRUCTURED_REPLY` in the client flags field, the server
> +  MUST use structured replies to the `NBD_CMD_READ` command.
> +
> +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2)
> +
> +  This is a client flag; MUST NOT be set if the server did not set

*it* MUST not be set ...

> +  `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured
> +  replies to the `NBD_CMD_READ` command.
> +
> +* Transmission phase
> +
> +  The transmission phase includes a third message type: the structured
> +  reply.  If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the
> +  normal server reply will never contain a data payload, and all
> +  server replies that send data will be in the following form:
> +
> +  S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
> +  S: 32 bits, error  
> +  S: 64 bits, handle  
> +  S: 32 bits, length of payload (unsigned)  
> +  S: *length* bytes of payload data

Unless I'm missing something this doesn't entirely solve the problem.
Imagine you are implementing NBD_CMD_READ (with structured reply)
and are asked to read 4G of data. 1G in you find an error. You can't
set the error ahead of time as you don't know (yet) there is an
error. By the time you discover, you've already streamed 1G of data.
What do you do now?

And this section seems at odds with the section below (starting
"Conversely" which describes an offset/length scheme not detailed
above.

I think you are saying that there can be one or more of these
structured replies in reference to any request, in which
case it should be made clearer, and not wait until NBD_CMD_READ.

> +  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then the normal
> +  server reply with a data payload will be used for `NBD_CMD_READ`;
> +  but any other replies with a data payload will still use a
> +  structured reply (that is, only `NBD_CMD_READ` is allowed to send
> +  data in the non-structured form, and negotiating
> +  `NBD_FLAG_C_STRUCTURED_REPLY` only affects replies to
> +  `NBD_CMD_READ`).  This implies that any new commands that require
> +  data in the reply should be gated by their own new global and client
> +  flag.  A server MAY refuse to allow a client that negotiates a
> +  command that requires a structured reply, but does not also
> +  negotiate `NBD_FLAG_C_STRUCTURED_REPLY`.
> +
> +  In the generic form, the length field of a structured response MAY
> +  be zero if there is no data payload, and the error field may be set
> +  regardless of the length field (although where possible, the server
> +  SHOULD use a length of zero when setting the error field).  However,
> +  particular commands may document additional restrictions regarding
> +  what forms a valid response (for example, a structured response to
> +  `NBD_CMD_READ` MUST NOT set the error field, and MUST have a
> +  non-zero length of at least 9).
> +
> +* `NBD_CMD_READ`
> +
> +  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read
> +  request MUST always be answered by a single response (with magic
> +  0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length
> +  bytes of data according to the client's request, although those
> +  bytes MAY be invalid if an error is returned, and the connection
> +  MUST if an error occurs after a header claiming no error.

Word(s) apparently missing after "MUST" on last line. "MUST be
closed by the server" I think.

> +
> +  Conversely, if the `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, the
> +  response MUST be a sequence of zero or more structured replies (with
> +  magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), followed by a
> +  concluding normal response (0x67446698 `NBD_REPLY_MAGIC`), where the
> +  final response MUST NOT have a data payload; all responses in the
> +  sequence use the same handle from the client.  The payload of each
> +  intermediate structured reply, called a read chunk, MUST be of the
> +  following form:
> +
> +  S: 64 bits, offset (unsigned)  
> +  S: (*length* - 8) bytes of data
> +
> +  Note that the amount of data returned in a read chunk is determined
> +  by the length field of the structured reply, and not the original
> +  length of the client's request.  If the server sends a single read
> +  chunk for a successful read, the server's length will be the
> +  client's length plus 8, because the server must account for the
> +  offset field in its reply.  Similarly, a successful client request
> +  for a read of 2^32-8 or more bytes MUST be split into at least two
> +  read chunks, so that the length field does not suffer from overflow.

OK, so the error is not sent with the chunk (fine), but at the end.

> +
> +  The server MUST ensure that each read chunk lies within the original
> +  offset and length of the original client request, MUST NOT send read
> +  chunks that would cover the same offset more than once, and MUST
> +  send at least one byte of data in addition to the offset field of
> +  each read chunk.  The server MAY send read chunks out of order, and
> +  may interleave other responses between read replies.

I can see why it's attractive from the server's point of view to
support out of order replies - for instance an NBD server with a back
end like Ceph could use this to launch requests simultaneously to
multiple backend stores.

However, theoretically a server can now send single one byte packets
back, which the client would have to reconstruct.

Also, given new commands aren't available unless you support structured
replies, you now have to support reassembly of replies (if you want
to use new features) even if all your reads are (e.g.) 1k.

Can I suggest that the client should be able to specify a minimum
'power of 2' chunk boundary, e.g. if the client says '1k', and its
requests do not cross a 1k boundary, the server will guarantee to
return them as a single chunk?

>  The server
> +  MUST NOT set the error field of a read chunk; if an error occurs, it
> +  MAY immediately end the sequence of structured response messages,
> +  MUST send the error in the concluding normal response, and SHOULD
> +  keep the connection open.  The final non-structured response MUST
> +  set an error unless the sum of data sent by all read chunks totals
> +  the original client length request.

add "and data for the entire range requested has been supplied." (I
know this is technically implied by the fact data cannot be duplicated).

> +
> +  The client SHOULD immediately close the connection if it detects
> +  that the server has sent an offset more than once (whether or not
> +  the overlapping data claimed to have the same contents), or if
> +  receives the concluding normal reply without an error set but
> +  without all bytes covered by read chunk(s).  A future extension may
> +  add a command flag that would allow the server to skip read chunks
> +  for portions of the file that read as all zeroes.
> +
> ## About this file
> 
> This file tries to document the NBD protocol as it is currently
> -- 
> 2.5.5
> 
> 
> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785471&iu=/4140
> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general
>
Eric Blake March 29, 2016, 2:21 p.m. UTC | #3
On 03/29/2016 02:24 AM, Alex Bligh wrote:
> 
> On 29 Mar 2016, at 04:56, Eric Blake <eblake@redhat.com> wrote:
> 
>> The existing transmission phase protocol is difficult to sniff,
>> because correct interpretation of the server stream requires
>> context from the client stream (or risks false positives if
>> data payloads happen to contain the protocol magic numbers).  It
>> also prohibits the ability to do short reads, or to return a
>> read error without also sending length bytes of data.
>>

>> +* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2)
>> +
>> +  This is a client flag; MUST NOT be set if the server did not set
> 
> *it* MUST not be set ...

Copy-and-paste from above; I can correct the place I copied from.

> 
>> +  `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured
>> +  replies to the `NBD_CMD_READ` command.
>> +
>> +* Transmission phase
>> +
>> +  The transmission phase includes a third message type: the structured
>> +  reply.  If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the
>> +  normal server reply will never contain a data payload, and all
>> +  server replies that send data will be in the following form:
>> +
>> +  S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
>> +  S: 32 bits, error  
>> +  S: 64 bits, handle  
>> +  S: 32 bits, length of payload (unsigned)  
>> +  S: *length* bytes of payload data
> 
> Unless I'm missing something this doesn't entirely solve the problem.
> Imagine you are implementing NBD_CMD_READ (with structured reply)
> and are asked to read 4G of data. 1G in you find an error. You can't
> set the error ahead of time as you don't know (yet) there is an
> error. By the time you discover, you've already streamed 1G of data.
> What do you do now?

As the server, you can now either send 3G of (bogus) data followed by
the concluding normal response with error set, or you can immediately
send the normal response with error set and skip sending the remaining
3G of data.

I guess what I need to add is that in transmission phase, most commands
have exactly one response per request; but commands may document
scenarios where there will be multiple responses to a single request.
NBD_CMD_READ uses the multiple responses to make partial read and error
handling possible.

> 
> And this section seems at odds with the section below (starting
> "Conversely" which describes an offset/length scheme not detailed
> above.
> 
> I think you are saying that there can be one or more of these
> structured replies in reference to any request, in which
> case it should be made clearer, and not wait until NBD_CMD_READ.

Not for ANY request, but only for commands that document it.  I envision
that our proposal for NBD_CMD_GET_LBA_STATUS (or whatever we name it)
will always use a single structured reply (if error is set, the payload
is empty; otherwise, the payload is variable-sized but the length is
part of the structured reply header).  So far, only NBD_CMD_READ has a
reason for multiple replies.


>> +* `NBD_CMD_READ`
>> +
>> +  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read
>> +  request MUST always be answered by a single response (with magic
>> +  0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length
>> +  bytes of data according to the client's request, although those
>> +  bytes MAY be invalid if an error is returned, and the connection
>> +  MUST if an error occurs after a header claiming no error.
> 
> Word(s) apparently missing after "MUST" on last line. "MUST be
> closed by the server" I think.

Yep. Will fix in v2.


>> +
>> +  The server MUST ensure that each read chunk lies within the original
>> +  offset and length of the original client request, MUST NOT send read
>> +  chunks that would cover the same offset more than once, and MUST
>> +  send at least one byte of data in addition to the offset field of
>> +  each read chunk.  The server MAY send read chunks out of order, and
>> +  may interleave other responses between read replies.
> 
> I can see why it's attractive from the server's point of view to
> support out of order replies - for instance an NBD server with a back
> end like Ceph could use this to launch requests simultaneously to
> multiple backend stores.
> 
> However, theoretically a server can now send single one byte packets
> back, which the client would have to reconstruct.

Yeah, but the reconstruction is easy; naively:

while response_magic == structured:
    copy len-8 bytes of data from response to given offset
response_magic == normal, read is complete

Detecting overlap or incomplete reads would requires more complexity in
the client, but I don't know that a client has to care (the protocol is
specifically written that a client MAY detect bad servers, but not MUST;
a client that assumes the server is well-behaved is still compliant).

However, you DO have a point that the server SHOULD send data in
reasonable-size chunks; and maybe I should propose a parallel extension
where, when negotiated between client and server, the server will
advertise minimum and preferred I/O sizes in response to the export name
request (for example, a server backed by a real block device may have a
minimum of 512 bytes or 4096 bytes, and a preferred size of 64k; while a
server backed by a normal file system may have a minimum of 1 byte);
then put in restrictions that a server SHOULD reject read/write requests
where offset and length are not multiples of the minimum, and that the
server SHOULD send read chunks aligned to the preferred size (with
exceptions for the head and tail of a larger buffer that meets minimum
alignment but not preferred alignment).

> 
> Also, given new commands aren't available unless you support structured
> replies, you now have to support reassembly of replies (if you want
> to use new features) even if all your reads are (e.g.) 1k.

Are you arguing that there should be a flag that controls whether reads
must be in-order vs. reassembled?  Reassembly must happen either way,
the question is whether having a way to allow out-of-order for
efficiency, vs. defaulting to in-order for easier computation, is worth it.

> 
> Can I suggest that the client should be able to specify a minimum
> 'power of 2' chunk boundary, e.g. if the client says '1k', and its
> requests do not cross a 1k boundary, the server will guarantee to
> return them as a single chunk?

If we want to negotiate minimum and preferred transaction sizes, it
should probably be done in a separate proposal.  For this proposal, I
think the best we can do is merely suggest that the server SHOULD keep
read chunks suitably blocked (larger blocks lead to less overhead).

> 
>>  The server
>> +  MUST NOT set the error field of a read chunk; if an error occurs, it
>> +  MAY immediately end the sequence of structured response messages,
>> +  MUST send the error in the concluding normal response, and SHOULD
>> +  keep the connection open.  The final non-structured response MUST
>> +  set an error unless the sum of data sent by all read chunks totals
>> +  the original client length request.
> 
> add "and data for the entire range requested has been supplied." (I
> know this is technically implied by the fact data cannot be duplicated).

Sure.  But keep in mind that if (when?) we add a flag for allowing the
server to skip read chunks on holes, we'll have to tweak the wording to
allow the server to send fewer chunks than the client's length, where
the client must then assume zeroes for all chunks not received.
Alex Bligh March 29, 2016, 2:37 p.m. UTC | #4
Eric,

> I guess what I need to add is that in transmission phase, most commands
> have exactly one response per request; but commands may document
> scenarios where there will be multiple responses to a single request.
> NBD_CMD_READ uses the multiple responses to make partial read and error
> handling possible

Yes, this.

> Yeah, but the reconstruction is easy; naively:
> 
> while response_magic == structured:
>    copy len-8 bytes of data from response to given offset
> response_magic == normal, read is complete

It's easy if the result is written to memory. It's not easy if the
purpose was (e.g.) to send it to a socket in a sendfile type way.
It now requires the entire response be held in memory, which wasn't
a requirement before.

> Detecting overlap or incomplete reads would requires more complexity in
> the client, but I don't know that a client has to care (the protocol is
> specifically written that a client MAY detect bad servers, but not MUST;
> a client that assumes the server is well-behaved is still compliant).

Yep

> However, you DO have a point that the server SHOULD send data in
> reasonable-size chunks; and maybe I should propose a parallel extension
> where, when negotiated between client and server, the server will
> advertise minimum and preferred I/O sizes in response to the export name
> request (for example, a server backed by a real block device may have a
> minimum of 512 bytes or 4096 bytes, and a preferred size of 64k; while a
> server backed by a normal file system may have a minimum of 1 byte);
> then put in restrictions that a server SHOULD reject read/write requests
> where offset and length are not multiples of the minimum, and that the
> server SHOULD send read chunks aligned to the preferred size (with
> exceptions for the head and tail of a larger buffer that meets minimum
> alignment but not preferred alignment).

What I'm really after is something that enables me to read 'nicely'
in a manner where I won't get fragments.

>> Also, given new commands aren't available unless you support structured
>> replies, you now have to support reassembly of replies (if you want
>> to use new features) even if all your reads are (e.g.) 1k.
> 
> Are you arguing that there should be a flag that controls whether reads
> must be in-order vs. reassembled?  Reassembly must happen either way,
> the question is whether having a way to allow out-of-order for
> efficiency, vs. defaulting to in-order for easier computation, is worth it.

No, that sounds overengineered.

More a way of guaranteeing avoiding a fragmentation on 'simple' reads.
Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it
can always error the command.

>>> The server
>>> +  MUST NOT set the error field of a read chunk; if an error occurs, it
>>> +  MAY immediately end the sequence of structured response messages,
>>> +  MUST send the error in the concluding normal response, and SHOULD
>>> +  keep the connection open.  The final non-structured response MUST
>>> +  set an error unless the sum of data sent by all read chunks totals
>>> +  the original client length request.
>> 
>> add "and data for the entire range requested has been supplied." (I
>> know this is technically implied by the fact data cannot be duplicated).
> 
> Sure.  But keep in mind that if (when?) we add a flag for allowing the
> server to skip read chunks on holes, we'll have to tweak the wording to
> allow the server to send fewer chunks than the client's length, where
> the client must then assume zeroes for all chunks not received.

Or alternatively a chunk representing a hole. I wonder whether you
might be better to extend the chunk structure by 4 bytes to allow for
future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means
the chunk is full of zeroes and has no payload).

--
Alex Bligh
Eric Blake March 29, 2016, 3:12 p.m. UTC | #5
On 03/29/2016 08:37 AM, Alex Bligh wrote:
> Eric,
> 
>> I guess what I need to add is that in transmission phase, most commands
>> have exactly one response per request; but commands may document
>> scenarios where there will be multiple responses to a single request.
>> NBD_CMD_READ uses the multiple responses to make partial read and error
>> handling possible
> 
> Yes, this.
> 

>> Are you arguing that there should be a flag that controls whether reads
>> must be in-order vs. reassembled?  Reassembly must happen either way,
>> the question is whether having a way to allow out-of-order for
>> efficiency, vs. defaulting to in-order for easier computation, is worth it.
> 
> No, that sounds overengineered.
> 
> More a way of guaranteeing avoiding a fragmentation on 'simple' reads.
> Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it
> can always error the command.

Okay, that makes sense.  Does reusing NBD_CMD_FLAG_FUA sound reasonable
for this purpose, or should it be a new flag?  I guess from the
standpoint of client/server negotiation, we want to support this
don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in
export flags, so it sounds like a new flag is best.  Next, should it be
independently negotiated, or implied by negotiating
NBD_FLAG_C_STRUCTURED_REPLIES?  I'm leaning towards implied - it's
all-or-none for use of the improved read structures.  I'm also leaning
towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm
documenting that negotiating this particular global flag affects only
the replies to NBD_CMD_READ (other commands may use structured replies,
but those commands will be independently negotiated).


>>
>> Sure.  But keep in mind that if (when?) we add a flag for allowing the
>> server to skip read chunks on holes, we'll have to tweak the wording to
>> allow the server to send fewer chunks than the client's length, where
>> the client must then assume zeroes for all chunks not received.
> 
> Or alternatively a chunk representing a hole. I wonder whether you
> might be better to extend the chunk structure by 4 bytes to allow for
> future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means
> the chunk is full of zeroes and has no payload).

Seems reasonable (then I need to reword everything from len-8 to instead
be len-12 when dealing with data size within the len bytes of payload).
 Network traffic-wise, I think it's better to always send the chunk
flags, than it would be to try and make the sending of the chunk flags
dependent on the client's choice of command flags (that is, we already
argued that wire format should never be changed based merely on command
flags, as it makes the server stream harder to decipher in isolation).

Thanks for the good feedback, by the way; it will make v2 better.
Wouter Verhelst March 29, 2016, 4:37 p.m. UTC | #6
On Tue, Mar 29, 2016 at 09:12:02AM -0600, Eric Blake wrote:
> On 03/29/2016 08:37 AM, Alex Bligh wrote:
> > Eric,
> > 
> >> I guess what I need to add is that in transmission phase, most commands
> >> have exactly one response per request; but commands may document
> >> scenarios where there will be multiple responses to a single request.
> >> NBD_CMD_READ uses the multiple responses to make partial read and error
> >> handling possible
> > 
> > Yes, this.
> > 
> 
> >> Are you arguing that there should be a flag that controls whether reads
> >> must be in-order vs. reassembled?  Reassembly must happen either way,
> >> the question is whether having a way to allow out-of-order for
> >> efficiency, vs. defaulting to in-order for easier computation, is worth it.
> > 
> > No, that sounds overengineered.
> > 
> > More a way of guaranteeing avoiding a fragmentation on 'simple' reads.
> > Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it
> > can always error the command.
> 
> Okay, that makes sense.  Does reusing NBD_CMD_FLAG_FUA sound reasonable
> for this purpose, or should it be a new flag?  I guess from the
> standpoint of client/server negotiation, we want to support this
> don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in
> export flags, so it sounds like a new flag is best.  Next, should it be
> independently negotiated, or implied by negotiating
> NBD_FLAG_C_STRUCTURED_REPLIES?  I'm leaning towards implied - it's

I think it should be implied, yes.

Having said that,

There's only a limited number of flag bits available. We can obviously
always add more flags by adding in a second flags field, but that
introduces more backwards compatibility issues (would require another
global flag to say "we support extended flags", which the client then
has to ack, too, etc). As such, not using flag bits when we don't
strictly need them is a feature.

I'm not sure if this really needs to be negotiated using a flag bit. The
NO_ZEROES thing was negotiated using a flag because NBD_OPT_EXPORT_NAME
can't be upgraded, but that isn't the case here. This could easily be
negotiated using some NBD_OPT thing:

client->kernel: check whether structured replies are supported
(if yes:)
client->server: NBD_OPT_STRUCTURED_REPLIES
server->client: NBD_REP_ACK (if supported, or NBD_REP_UNSUP if not)

At this point, the server can send structured replies at leisure. We
could also set a "support don't fragment" flag bit in the per-export
flags field (which is a larger flags field than the global one), so that
the client kernel knows it can request non-fragmented replies without
requiring an extra kernel API (since per-export flags are passed to the
kernel by way of the NBD_SET_FLAGS ioctl).

(the spec can then possibly also say that the kernel MAY send structured
replies without sending the "support don't fragment" bit, but that it
then MUST NOT send fragmented replies -- although I'm not too sure
whether that's a good idea)

This would also get it more in line with the way the CMD_TRIM and
CMD_FLUSH commands are negotiated (by way of a per-export flag).

> all-or-none for use of the improved read structures.  I'm also leaning
> towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm

That's probably a good idea too, yes (with obvious s/FLAG_C/OPT/
change as per above).

> documenting that negotiating this particular global flag affects only
> the replies to NBD_CMD_READ (other commands may use structured replies,
> but those commands will be independently negotiated).

Right.

> >> Sure.  But keep in mind that if (when?) we add a flag for allowing the
> >> server to skip read chunks on holes, we'll have to tweak the wording to
> >> allow the server to send fewer chunks than the client's length, where
> >> the client must then assume zeroes for all chunks not received.
> > 
> > Or alternatively a chunk representing a hole. I wonder whether you
> > might be better to extend the chunk structure by 4 bytes to allow for
> > future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means
> > the chunk is full of zeroes and has no payload).
> 
> Seems reasonable (then I need to reword everything from len-8 to instead
> be len-12 when dealing with data size within the len bytes of payload).
> Network traffic-wise, I think it's better to always send the chunk
> flags, than it would be to try and make the sending of the chunk flags
> dependent on the client's choice of command flags (that is, we already
> argued that wire format should never be changed based merely on command
> flags, as it makes the server stream harder to decipher in isolation).
> 
> Thanks for the good feedback, by the way; it will make v2 better.

Regards,
Alex Bligh March 29, 2016, 5:34 p.m. UTC | #7
On 29 Mar 2016, at 16:12, Eric Blake <eblake@redhat.com> wrote:
>> 
>> More a way of guaranteeing avoiding a fragmentation on 'simple' reads.
>> Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it
>> can always error the command.
> 
> Okay, that makes sense.  Does reusing NBD_CMD_FLAG_FUA sound reasonable
> for this purpose, or should it be a new flag?  I guess from the
> standpoint of client/server negotiation, we want to support this
> don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in
> export flags, so it sounds like a new flag is best.

I think it should be separate from FUA, as there are possibly
servers out there that support FUA but not this, but ...

> Next, should it be
> independently negotiated, or implied by negotiating
> NBD_FLAG_C_STRUCTURED_REPLIES?  I'm leaning towards implied - it's
> all-or-none for use of the improved read structures.

I would agree. I think if it supports the structured reply semantics,
it should also support 'DF'. So if you know the server supports
structured replies, you know you can set DF on them without any
further requirements.

> I'm also leaning
> towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm
> documenting that negotiating this particular global flag affects only
> the replies to NBD_CMD_READ (other commands may use structured replies,
> but those commands will be independently negotiated).

I suspect the name is the thing that makes the least difference, and
hence do not feel strongly at all, but:

a) Why '_C_'?

b) Throughout the current documentation you've called them 'structured
   replies', not 'structured reads' and said that in the future multiple
   commands might support them. So you should arguably call the flag
   '*_STRUCTURED_REPLY' or change the text.

>>> Sure.  But keep in mind that if (when?) we add a flag for allowing the
>>> server to skip read chunks on holes, we'll have to tweak the wording to
>>> allow the server to send fewer chunks than the client's length, where
>>> the client must then assume zeroes for all chunks not received.
>> 
>> Or alternatively a chunk representing a hole. I wonder whether you
>> might be better to extend the chunk structure by 4 bytes to allow for
>> future modifications like this (e.g. NBD_CHUNK_FLAG_HOLE means
>> the chunk is full of zeroes and has no payload).
> 
> Seems reasonable (then I need to reword everything from len-8 to instead
> be len-12 when dealing with data size within the len bytes of payload).
> Network traffic-wise, I think it's better to always send the chunk
> flags, than it would be to try and make the sending of the chunk flags
> dependent on the client's choice of command flags (that is, we already
> argued that wire format should never be changed based merely on command
> flags, as it makes the server stream harder to decipher in isolation).

Absolutely. And that way if we have to add anything to the chunk (e.g.
we run out of flags!), we can use one or more bits to indicate this.

> Thanks for the good feedback, by the way; it will make v2 better.

No problem. Some time ago I rewrote chunks of the nbd test suite and
wrote the bit that tested parallel outstanding commands. At the back
of my mind is whether I should extend the test suite to test this
and how we could persuade a server to 'often fragment' so we can
test reassembly (some form of debug setting on the server like
'max fragment size' or similar I suspect).

--
Alex Bligh
Eric Blake March 29, 2016, 5:45 p.m. UTC | #8
On 03/29/2016 11:34 AM, Alex Bligh wrote:
> 
> On 29 Mar 2016, at 16:12, Eric Blake <eblake@redhat.com> wrote:
>>>
>>> More a way of guaranteeing avoiding a fragmentation on 'simple' reads.
>>> Perhaps a 'DF' bit (don't fragment)! If the server doesn't like it, it
>>> can always error the command.
>>
>> Okay, that makes sense.  Does reusing NBD_CMD_FLAG_FUA sound reasonable
>> for this purpose, or should it be a new flag?  I guess from the
>> standpoint of client/server negotiation, we want to support this
>> don't-fragment request even if NBD_FLAG_SEND_FUA was not listed in
>> export flags, so it sounds like a new flag is best.
> 
> I think it should be separate from FUA, as there are possibly
> servers out there that support FUA but not this, but ...
> 
>> Next, should it be
>> independently negotiated, or implied by negotiating
>> NBD_FLAG_C_STRUCTURED_REPLIES?  I'm leaning towards implied - it's
>> all-or-none for use of the improved read structures.
> 
> I would agree. I think if it supports the structured reply semantics,
> it should also support 'DF'. So if you know the server supports
> structured replies, you know you can set DF on them without any
> further requirements.

Supporting DF merely transfers the burden of collection between server
and client.  I suspect that there are cases where the server does NOT
want to support DF (because it would require the server to allocate
memory to collect the data before sending a single structured read
reply), just as you have stated that there are cases where the client
has an additional burden if DF is not supported.  So for v2, I'm going
to explicitly document a DF export flag, and recommend (but not require)
that the server support it.

> 
>> I'm also leaning
>> towards the name NBD_FLAG_C_STRUCTURED_READ, since elsewhere I'm
>> documenting that negotiating this particular global flag affects only
>> the replies to NBD_CMD_READ (other commands may use structured replies,
>> but those commands will be independently negotiated).
> 
> I suspect the name is the thing that makes the least difference, and
> hence do not feel strongly at all, but:
> 
> a) Why '_C_'?

Matches existing convention on client flags; but Wouter's idea of using
NBD_OPT_ instead of global/client flags is better, so the _C_ disappears
in v2.

> 
> b) Throughout the current documentation you've called them 'structured
>    replies', not 'structured reads' and said that in the future multiple
>    commands might support them. So you should arguably call the flag
>    '*_STRUCTURED_REPLY' or change the text.

I'm changing the text, and favoring the name STRUCTURED_READ except in
the description of the transmission phase, where Structured Reply is the
header used for ANY form of reply with data (to make it more obvious
that structured read is a subset of structured replies), while at the
same time emphasizing that NBD_CMD_READ is the only command that can get
away with data in a non-structured reply, and only when structured read
was not negotiated.
Wouter Verhelst March 29, 2016, 5:53 p.m. UTC | #9
Hi Eric,

Having read this in more detail now:

On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote:
> +  The server MUST ensure that each read chunk lies within the original
> +  offset and length of the original client request, MUST NOT send read
> +  chunks that would cover the same offset more than once, and MUST
> +  send at least one byte of data in addition to the offset field of
> +  each read chunk.  The server MAY send read chunks out of order, and
> +  may interleave other responses between read replies.  The server
> +  MUST NOT set the error field of a read chunk; if an error occurs, it
> +  MAY immediately end the sequence of structured response messages,
> +  MUST send the error in the concluding normal response, and SHOULD
> +  keep the connection open.  The final non-structured response MUST
> +  set an error unless the sum of data sent by all read chunks totals
> +  the original client length request.

I'm thinking it would probably be a good idea to have the concluding
response (if the error field is nonzero) have an offset too; the server
could use that to specify where, exactly, the error occurred (so that a
client which sent a very large read request doesn't have to go through a
binary search or some such to figure out where the read error happened)

i.e.,

C: read X bytes at offset Y
S: (X bytes)
S: (error, offset Z)

client now has Z-1 bytes of valid data (with the rest being garbage,
plus a read error)

The alternative (in the above) would be that the client has 0 bytes of
valid data, and would have to issue another read request to figure out
which parts of the data are valid.

> +  The client SHOULD immediately close the connection if it detects
> +  that the server has sent an offset more than once (whether or not
> +  the overlapping data claimed to have the same contents), or if
> +  receives the concluding normal reply without an error set but
> +  without all bytes covered by read chunk(s). A future extension may

I would reword this to...

The client MAY immediately close the connection if it detects that
[...]. The server MUST NOT send an offset more than once.

> +  add a command flag that would allow the server to skip read chunks
> +  for portions of the file that read as all zeroes.

Not sure if that part is necessary or helpful, really.
Wouter Verhelst March 29, 2016, 6:03 p.m. UTC | #10
On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote:
> On 03/29/2016 11:34 AM, Alex Bligh wrote:
> > I would agree. I think if it supports the structured reply semantics,
> > it should also support 'DF'. So if you know the server supports
> > structured replies, you know you can set DF on them without any
> > further requirements.
> 
> Supporting DF merely transfers the burden of collection between server
> and client.  I suspect that there are cases where the server does NOT
> want to support DF (because it would require the server to allocate
> memory to collect the data before sending a single structured read
> reply),

There are other ways to handle that; e.g., the server could have a
"request too large for non-fragmented read" error message. The spec
should give a minimum size that the server MUST support (which should be
reasonably large), and should state that a server MAY reply to any
request with DF set for a block larger than that minimum, with that
error.

Otherwise the client could conceivably send out heaps of requests for
(UINT32_MAX - 8) bytes with DF set and basically DoS the server.

> just as you have stated that there are cases where the client
> has an additional burden if DF is not supported.  So for v2, I'm going
> to explicitly document a DF export flag, and recommend (but not require)
> that the server support it.

I'd prefer not to see that.
Eric Blake March 29, 2016, 6:07 p.m. UTC | #11
On 03/29/2016 12:03 PM, Wouter Verhelst wrote:
> On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote:
>> On 03/29/2016 11:34 AM, Alex Bligh wrote:
>>> I would agree. I think if it supports the structured reply semantics,
>>> it should also support 'DF'. So if you know the server supports
>>> structured replies, you know you can set DF on them without any
>>> further requirements.
>>
>> Supporting DF merely transfers the burden of collection between server
>> and client.  I suspect that there are cases where the server does NOT
>> want to support DF (because it would require the server to allocate
>> memory to collect the data before sending a single structured read
>> reply),
> 
> There are other ways to handle that; e.g., the server could have a
> "request too large for non-fragmented read" error message. The spec
> should give a minimum size that the server MUST support (which should be
> reasonably large), and should state that a server MAY reply to any
> request with DF set for a block larger than that minimum, with that
> error.

How does 64k sound?

> 
> Otherwise the client could conceivably send out heaps of requests for
> (UINT32_MAX - 8) bytes with DF set and basically DoS the server.

Point taken.

> 
>> just as you have stated that there are cases where the client
>> has an additional burden if DF is not supported.  So for v2, I'm going
>> to explicitly document a DF export flag, and recommend (but not require)
>> that the server support it.
> 
> I'd prefer not to see that.

Okay, good thing I haven't sent v2 yet; I'm making more edits to drop
the export flag, and require unconditional support for the command flag
once structured reads are negotiated.
Alex Bligh March 29, 2016, 6:09 p.m. UTC | #12
On 29 Mar 2016, at 19:03, Wouter Verhelst <w@uter.be> wrote:

> There are other ways to handle that; e.g., the server could have a
> "request too large for non-fragmented read" error message. The spec
> should give a minimum size that the server MUST support (which should be
> reasonably large), and should state that a server MAY reply to any
> request with DF set for a block larger than that minimum, with that
> error.

Yeah something like that. Or the server could simply publish this
as part of the option negotiation.
Wouter Verhelst March 29, 2016, 6:19 p.m. UTC | #13
On Tue, Mar 29, 2016 at 12:07:59PM -0600, Eric Blake wrote:
> On 03/29/2016 12:03 PM, Wouter Verhelst wrote:
> > On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote:
> >> Supporting DF merely transfers the burden of collection between server
> >> and client.  I suspect that there are cases where the server does NOT
> >> want to support DF (because it would require the server to allocate
> >> memory to collect the data before sending a single structured read
> >> reply),
> > 
> > There are other ways to handle that; e.g., the server could have a
> > "request too large for non-fragmented read" error message. The spec
> > should give a minimum size that the server MUST support (which should be
> > reasonably large), and should state that a server MAY reply to any
> > request with DF set for a block larger than that minimum, with that
> > error.
> 
> How does 64k sound?

Dunno. It might make sense for this number to be based upon some
"standard" minimum request size in things like ATA or SCSI if such a
number exists there, but I don't know enough about either standard to
answer that question myself.

If such a number doesn't exist (or nobody who knows speaks up soon
enough), 64k is certainly good enough, I suppose.
Eric Blake March 29, 2016, 6:23 p.m. UTC | #14
On 03/29/2016 11:53 AM, Wouter Verhelst wrote:
> Hi Eric,
> 
> Having read this in more detail now:
> 
> On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote:
>> +  The server MUST ensure that each read chunk lies within the original
>> +  offset and length of the original client request, MUST NOT send read
>> +  chunks that would cover the same offset more than once, and MUST
>> +  send at least one byte of data in addition to the offset field of
>> +  each read chunk.  The server MAY send read chunks out of order, and
>> +  may interleave other responses between read replies.  The server
>> +  MUST NOT set the error field of a read chunk; if an error occurs, it
>> +  MAY immediately end the sequence of structured response messages,
>> +  MUST send the error in the concluding normal response, and SHOULD
>> +  keep the connection open.  The final non-structured response MUST
>> +  set an error unless the sum of data sent by all read chunks totals
>> +  the original client length request.
> 
> I'm thinking it would probably be a good idea to have the concluding
> response (if the error field is nonzero) have an offset too; the server
> could use that to specify where, exactly, the error occurred (so that a
> client which sent a very large read request doesn't have to go through a
> binary search or some such to figure out where the read error happened)
> 
> i.e.,
> 
> C: read X bytes at offset Y
> S: (X bytes)
> S: (error, offset Z)

Here, I'm assuming that you mean X > Z.

Unfortunately, I chose the design of 0 or more structured replies
followed by a normal reply, so that the normal reply is a reliable
indicator that the read is complete (whether successful or not); and the
whole goal of the extension is to avoid sending any data payload on a
normal reply.  I'm not sure how to send the offset in the normal reply
without violating the premise that a normal reply has no payload.

But what we could do is allow for the server to send a structured reply
data chunk of zero bytes, with the offset in question, as the offset
where an error occurred, prior to then sending the normal reply with the
final error indicator.  I guess that also means that if we don't have
the DF command flag set, the server could then report multiple failed
reads interspersed among larger successful clusters, when trying to
recover as much of the failing disk as possible, if each failure is
reported via a separate structured read of zero bytes.  Hmm, that also
means that we have to be careful on the wording - if we allow a
structured reply with 0 data bytes to report an error, after already
sending a larger reply with partially valid bytes, then that means that
a client will receive more than one read chunk visiting the same offset,
so we'd have to make the wording permit that.

> client now has Z-1 bytes of valid data (with the rest being garbage,
> plus a read error)
> 
> The alternative (in the above) would be that the client has 0 bytes of
> valid data, and would have to issue another read request to figure out
> which parts of the data are valid.

So if I'm understanding you, you are trying to state that the server may
report the header for X bytes, then fail partway through those X bytes;
it must still send X bytes, but can then report how many are valid (that
is, a client must assume that 0 of the X bytes received are valid
_unless_ the server also reported where it failed).  But I was
envisioning the opposite: the server must NOT send X bytes unless it
knows they are valid; if it encounters a read error at Z, then it sends
a structured read of Z-1 bytes before the final normal message that
reports overall failure.  The client then assumes that all X bytes
received are valid.

But I also documented that the client MAY, but not MUST, abort the read
at the first error; so the idea of being able to report multiple errors
and/or send headers prior to learning whether there are read errors
means that your interpretation is probably safer than mine.

I guess it will help to have actual v2 wording in front of us to further
fine-tune the wording.

> 
>> +  The client SHOULD immediately close the connection if it detects
>> +  that the server has sent an offset more than once (whether or not
>> +  the overlapping data claimed to have the same contents), or if
>> +  receives the concluding normal reply without an error set but
>> +  without all bytes covered by read chunk(s). A future extension may
> 
> I would reword this to...
> 
> The client MAY immediately close the connection if it detects that
> [...]. The server MUST NOT send an offset more than once.
> 
>> +  add a command flag that would allow the server to skip read chunks
>> +  for portions of the file that read as all zeroes.
> 
> Not sure if that part is necessary or helpful, really.

I envision such an extension in parallel to (or as part of) the proposed
NBD_CMD_GET_LBA_STATUS (or whatever we name it) - it is slightly more
efficient to skip reads of holes with a single read command flag than it
is to first read status to determine where holes are and only then issue
reads for the non-hole regions.  But I can also buy your argument that
such language belongs in the extension for sparse reads, and doesn't
need to be present in the extension for structured reads.
Eric Blake March 29, 2016, 6:25 p.m. UTC | #15
On 03/29/2016 12:19 PM, Wouter Verhelst wrote:
> On Tue, Mar 29, 2016 at 12:07:59PM -0600, Eric Blake wrote:
>> On 03/29/2016 12:03 PM, Wouter Verhelst wrote:
>>> On Tue, Mar 29, 2016 at 11:45:45AM -0600, Eric Blake wrote:
>>>> Supporting DF merely transfers the burden of collection between server
>>>> and client.  I suspect that there are cases where the server does NOT
>>>> want to support DF (because it would require the server to allocate
>>>> memory to collect the data before sending a single structured read
>>>> reply),
>>>
>>> There are other ways to handle that; e.g., the server could have a
>>> "request too large for non-fragmented read" error message. The spec
>>> should give a minimum size that the server MUST support (which should be
>>> reasonably large), and should state that a server MAY reply to any
>>> request with DF set for a block larger than that minimum, with that
>>> error.
>>
>> How does 64k sound?
> 
> Dunno. It might make sense for this number to be based upon some
> "standard" minimum request size in things like ATA or SCSI if such a
> number exists there, but I don't know enough about either standard to
> answer that question myself.
> 
> If such a number doesn't exist (or nobody who knows speaks up soon
> enough), 64k is certainly good enough, I suppose.

And as mentioned in another email, we may want to propose an independent
extension that allows NBD_OPT_LIST and friends to start advertising the
minimum and preferred sizes of operations on a given export, where the
server can give hard errors if the client requests a read or write not
aligned to the minimum, and where the server must not fail a DF set for
anything smaller than preferred size.
Wouter Verhelst March 29, 2016, 6:51 p.m. UTC | #16
On Tue, Mar 29, 2016 at 12:23:31PM -0600, Eric Blake wrote:
> On 03/29/2016 11:53 AM, Wouter Verhelst wrote:
> > Hi Eric,
> > 
> > Having read this in more detail now:
> > 
> > On Mon, Mar 28, 2016 at 09:56:36PM -0600, Eric Blake wrote:
> >> +  The server MUST ensure that each read chunk lies within the original
> >> +  offset and length of the original client request, MUST NOT send read
> >> +  chunks that would cover the same offset more than once, and MUST
> >> +  send at least one byte of data in addition to the offset field of
> >> +  each read chunk.  The server MAY send read chunks out of order, and
> >> +  may interleave other responses between read replies.  The server
> >> +  MUST NOT set the error field of a read chunk; if an error occurs, it
> >> +  MAY immediately end the sequence of structured response messages,
> >> +  MUST send the error in the concluding normal response, and SHOULD
> >> +  keep the connection open.  The final non-structured response MUST
> >> +  set an error unless the sum of data sent by all read chunks totals
> >> +  the original client length request.
> > 
> > I'm thinking it would probably be a good idea to have the concluding
> > response (if the error field is nonzero) have an offset too; the server
> > could use that to specify where, exactly, the error occurred (so that a
> > client which sent a very large read request doesn't have to go through a
> > binary search or some such to figure out where the read error happened)
> > 
> > i.e.,
> > 
> > C: read X bytes at offset Y
> > S: (X bytes)
> > S: (error, offset Z)
> 
> Here, I'm assuming that you mean X > Z.

Yes, obviously.

> Unfortunately, I chose the design of 0 or more structured replies
> followed by a normal reply, so that the normal reply is a reliable
> indicator that the read is complete (whether successful or not); and the
> whole goal of the extension is to avoid sending any data payload on a
> normal reply.  I'm not sure how to send the offset in the normal reply
> without violating the premise that a normal reply has no payload.

Oh. I thought you meant for the concluding message to also be a
structured reply with the length field be zero, but you mean for it to
be a non-structured reply message? If so, you should clarify that a bit
more (this wasn't clear to me)...

[...]
> But what we could do is allow for the server to send a structured reply
> data chunk of zero bytes, with the offset in question, as the offset
> where an error occurred, prior to then sending the normal reply with the
> final error indicator.  I guess that also means that if we don't have
> the DF command flag set, the server could then report multiple failed
> reads interspersed among larger successful clusters, when trying to
> recover as much of the failing disk as possible, if each failure is
> reported via a separate structured read of zero bytes.  Hmm, that also
> means that we have to be careful on the wording - if we allow a
> structured reply with 0 data bytes to report an error, after already
> sending a larger reply with partially valid bytes, then that means that
> a client will receive more than one read chunk visiting the same offset,
> so we'd have to make the wording permit that.
> 
> > client now has Z-1 bytes of valid data (with the rest being garbage,
> > plus a read error)
> > 
> > The alternative (in the above) would be that the client has 0 bytes of
> > valid data, and would have to issue another read request to figure out
> > which parts of the data are valid.
> 
> So if I'm understanding you, you are trying to state that the server may
> report the header for X bytes, then fail partway through those X bytes;
> it must still send X bytes, but can then report how many are valid (that
> is, a client must assume that 0 of the X bytes received are valid
> _unless_ the server also reported where it failed).

Yes.

> But I was envisioning the opposite: the server must NOT send X bytes
> unless it knows they are valid; if it encounters a read error at Z,
> then it sends a structured read of Z-1 bytes before the final normal
> message that reports overall failure.  The client then assumes that
> all X bytes received are valid.

The problem with that approach is that it makes it impossible for a
server to use a sendfile()-like system call, where you don't know that
there's a read error until start sending out data to the client (which
implies that you must've already sent out the header).

> But I also documented that the client MAY, but not MUST, abort the read
> at the first error; so the idea of being able to report multiple errors
> and/or send headers prior to learning whether there are read errors
> means that your interpretation is probably safer than mine.

I didn't mean to imply that. I do think that aborting the read at the
first error is probably a good idea. If the error occurs because the
disk is dying, having the server go ahead and try to read more data
anyway is probably not a very good idea (unless instructed to do so by
the user, i.e., client).

> I guess it will help to have actual v2 wording in front of us to further
> fine-tune the wording.

Certainly :-)

> >> +  The client SHOULD immediately close the connection if it detects
> >> +  that the server has sent an offset more than once (whether or not
> >> +  the overlapping data claimed to have the same contents), or if
> >> +  receives the concluding normal reply without an error set but
> >> +  without all bytes covered by read chunk(s). A future extension may
> > 
> > I would reword this to...
> > 
> > The client MAY immediately close the connection if it detects that
> > [...]. The server MUST NOT send an offset more than once.
> > 
> >> +  add a command flag that would allow the server to skip read chunks
> >> +  for portions of the file that read as all zeroes.
> > 
> > Not sure if that part is necessary or helpful, really.
> 
> I envision such an extension in parallel to (or as part of) the proposed
> NBD_CMD_GET_LBA_STATUS (or whatever we name it) - it is slightly more
> efficient to skip reads of holes with a single read command flag than it
> is to first read status to determine where holes are and only then issue
> reads for the non-hole regions.

Sure.

> But I can also buy your argument that such language belongs in the
> extension for sparse reads, and doesn't need to be present in the
> extension for structured reads.

Right, that was my point, mainly.
Wouter Verhelst March 29, 2016, 7:06 p.m. UTC | #17
On Tue, Mar 29, 2016 at 08:51:57PM +0200, Wouter Verhelst wrote:
> On Tue, Mar 29, 2016 at 12:23:31PM -0600, Eric Blake wrote:
> > Unfortunately, I chose the design of 0 or more structured replies
> > followed by a normal reply, so that the normal reply is a reliable
> > indicator that the read is complete (whether successful or not); and the
> > whole goal of the extension is to avoid sending any data payload on a
> > normal reply.  I'm not sure how to send the offset in the normal reply
> > without violating the premise that a normal reply has no payload.
> 
> Oh. I thought you meant for the concluding message to also be a
> structured reply with the length field be zero, but you mean for it to
> be a non-structured reply message? If so, you should clarify that a bit
> more (this wasn't clear to me)...

Also, I'm not convinced that's a very good approach, since it also
requires analyzers to have more context than just requiring a single
final "empty" structured reply message.
Alex Bligh March 29, 2016, 7:39 p.m. UTC | #18
On 29 Mar 2016, at 19:51, Wouter Verhelst <w@uter.be> wrote:

>> 
>> But I was envisioning the opposite: the server must NOT send X bytes
>> unless it knows they are valid; if it encounters a read error at Z,
>> then it sends a structured read of Z-1 bytes before the final normal
>> message that reports overall failure.  The client then assumes that
>> all X bytes received are valid.
> 
> The problem with that approach is that it makes it impossible for a
> server to use a sendfile()-like system call, where you don't know that
> there's a read error until start sending out data to the client (which
> implies that you must've already sent out the header).

I don't think sendfile semantics are ever compatible with reporting
read errors *unless* you pad after the read.

IIRC the way sendfile works is that you specify a pointer to an offset,
and sendfile sends as much as it can read (up to the length specified)
and updates the offset for the length read.

Naturally at the start of the read section, you don't know when the
error is going to occur, so you must say that the length of the data
read is going to be the length of the actual chunk. sendfile then
does its stuff, and fills up either the whole thing, or part of it.
In the case that part of the data (only) is available, you can't
report the error there and then, because the client is expecting
chunk data, so you must either close the connection (potentially
disruptive) or pad the data, and report the error at the end.

Using Eric's current scheme, you have no way of knowing where the error
occurred. Remember the chunks could be out of order, e.g. you get
chunks 1,3,5,7,9 in, and then an error, so you have no idea where
the error was. It could be in chunks 1,3,5,7,9 (and the server might
have padded the rest of the chunk) or in an unread chunk (2,4,6,8,10).
This seems undesirable.

I think we are paying too much attention to trying to keep NBD_RESPONSE
intact. The justification for this was (I think) that it made it easier
for existing protocol analysers. It doesn't, really, as all the data
is going to come BEFORE the NBD_RESPONSE (unlike in NBD_CMD_READ in
other situations).

I think we should therefore look at this the other way around. Here's
a straw man proposal as an alternative for the reply bits. For
a structured reply ALL we get is the chunks. The final chunk
(possibly the only chunk) is marked specially. Each chunk looks something
like:

offset+
0000    32 bit   NBD_STRUCTURED_REPLY_MAGIC
0004    64 bit   handle
000C    32 bit   Flags
0010    32 bit   Payload length


We have a couple of flags defined:

NBD_CHUNK_IS_DATA: the chunk is data, and the payload is a 64 bit offset
plus the data read

NBD_CHUNK_IS_HOLE: the chunk is zeroes, and the payload is a 64 bit offset
(only)

NBD_CHUNK_IS_END: (must be the final chunk). The payload is a 64 bit offset
plus a 32 bit error code, or zero. If no error, the offset must be set to
the total amount read. If there is an error, the offset MAY indicate the
position of the error. If an error occurs, no more chunks should be sent.


The advantages of this scheme are:

1. Only one packet type in the reply (chunks)

2. It's no more difficult to implement wireshark decoding of this (in
   addition to the normal NBD protocol) than the current proposal. I'd
   suggest in fact they could be easier.

3. Chunks that error part way through (sendfile type) must still be
   padded but now can indicate error location.

4. It would be possible to allow EVERY server reply to be a structured
   reply that simply set NBD_CHUNK_IS_END. That gives us a convenient
   route to servers which only implement structured replies. With DF,
   this would be little harder than implementing the current
   protocol.
Eric Blake March 29, 2016, 8 p.m. UTC | #19
On 03/29/2016 01:39 PM, Alex Bligh wrote:
> I think we are paying too much attention to trying to keep NBD_RESPONSE
> intact. The justification for this was (I think) that it made it easier
> for existing protocol analysers. It doesn't, really, as all the data
> is going to come BEFORE the NBD_RESPONSE (unlike in NBD_CMD_READ in
> other situations).
> 
> I think we should therefore look at this the other way around. Here's
> a straw man proposal as an alternative for the reply bits. For
> a structured reply ALL we get is the chunks. The final chunk
> (possibly the only chunk) is marked specially. Each chunk looks something
> like:
> 
> offset+
> 0000    32 bit   NBD_STRUCTURED_REPLY_MAGIC
> 0004    64 bit   handle
> 000C    32 bit   Flags
> 0010    32 bit   Payload length
> 
> 
> We have a couple of flags defined:
> 
> NBD_CHUNK_IS_DATA: the chunk is data, and the payload is a 64 bit offset
> plus the data read
> 
> NBD_CHUNK_IS_HOLE: the chunk is zeroes, and the payload is a 64 bit offset
> (only)
> 
> NBD_CHUNK_IS_END: (must be the final chunk). The payload is a 64 bit offset
> plus a 32 bit error code, or zero. If no error, the offset must be set to
> the total amount read. If there is an error, the offset MAY indicate the
> position of the error. If an error occurs, no more chunks should be sent.

I'm liking it - then we aren't sending a mandatory 0 error field on read
chunks.  Although I might use '32 bit Type' rather than '32 bit Flags',
since you don't really want to allow multiple reply types at once;
rather each reply type id is documented on its specific payload layout.

Another argument in favor of this over my original proposal: my proposal
is context-free for determining reply boundaries, but still requires
context in deciphering between a reply to NBD_CMD_READ vs. a reply to
NBD_CMD_GET_LBA_STATUS (that is, there was nothing in the reply that
said what the payload represents, only how many bytes to skip).
However, with a '32 bit Type', the wireshark detector can be taught
every type of payload, and as long as every command with a structured
reply uses 1 or more distinct types, the dissector can display more
details about the payload when it recognizes the type, and still skip
the payload on extensions it does not recognize.

> 
> 4. It would be possible to allow EVERY server reply to be a structured
>    reply that simply set NBD_CHUNK_IS_END. That gives us a convenient
>    route to servers which only implement structured replies. With DF,
>    this would be little harder than implementing the current
>    protocol.

For all remaining existing commands, that is just more overhead on the
wire.  The existing non-structured replies do not send any data; they
are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one
reply).  But your proposal inflates that to a minimum of 20 bytes (if
length is 0) or longer (if an error is set).  I'm still strongly in
favor of keeping the existing non-structured replies to commands that
don't have to return data.

I'm okay if a new command sometimes returns data and sometimes does not;
in which case, that command can always return a single structured reply
where the id of the reply says whether the payload will be length 0 or
not, but only new commands should get that treatment.
Alex Bligh March 29, 2016, 8:18 p.m. UTC | #20
On 29 Mar 2016, at 21:00, Eric Blake <eblake@redhat.com> wrote:

> I'm liking it - then we aren't sending a mandatory 0 error field on read
> chunks.

I'm writing it up as a strawman. I'll comment in a sec in further detail.

--
Alex Bligh
Alex Bligh March 29, 2016, 8:44 p.m. UTC | #21
Eric,

> I'm liking it - then we aren't sending a mandatory 0 error field on read
> chunks.

Straw man patch sent through. Alternatively at:

https://github.com/abligh/nbd/commit/3c40272704904ac74040ceb099fee0b44e355e1e

and in markdown format at:

https://github.com/abligh/nbd/blob/strawman-structured-reply/doc/proto.md

Very much off the top of my head.

>  Although I might use '32 bit Type' rather than '32 bit Flags',
> since you don't really want to allow multiple reply types at once;
> rather each reply type id is documented on its specific payload layout.

I used the bottom byte of the flags for this, so we can keep the flags.

>> 4. It would be possible to allow EVERY server reply to be a structured
>>   reply that simply set NBD_CHUNK_IS_END. That gives us a convenient
>>   route to servers which only implement structured replies. With DF,
>>   this would be little harder than implementing the current
>>   protocol.
> 
> For all remaining existing commands, that is just more overhead on the
> wire.  The existing non-structured replies do not send any data; they
> are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one
> reply).  But your proposal inflates that to a minimum of 20 bytes (if
> length is 0) or longer (if an error is set).  I'm still strongly in
> favor of keeping the existing non-structured replies to commands that
> don't have to return data.

I was saying that should be up to the server. If the server wants to
write something easily decodable (and easier to maintain) at the expense
of a few more bytes on the wire, then let it. If it wants to use
unstructured replies occasionally, that's fine.

> I'm okay if a new command sometimes returns data and sometimes does not;
> in which case, that command can always return a single structured reply
> where the id of the reply says whether the payload will be length 0 or
> not, but only new commands should get that treatment.

I think we are at cross purposes. I am saying that if you've negotiated
structured replies, you should be be able to return either for any
command, as the client can disambiguate using the magic number.

Clearly some future commands might REQUIRE structured replies as there
would be no way to represent them in an unstructured reply.

--
Alex Bligh
Wouter Verhelst March 29, 2016, 9:05 p.m. UTC | #22
Hi Alex,

On Tue, Mar 29, 2016 at 09:44:39PM +0100, Alex Bligh wrote:
> Eric,
> > For all remaining existing commands, that is just more overhead on the
> > wire.  The existing non-structured replies do not send any data; they
> > are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one
> > reply).  But your proposal inflates that to a minimum of 20 bytes (if
> > length is 0) or longer (if an error is set).  I'm still strongly in
> > favor of keeping the existing non-structured replies to commands that
> > don't have to return data.
> 
> I was saying that should be up to the server. If the server wants to
> write something easily decodable (and easier to maintain) at the expense
> of a few more bytes on the wire, then let it. If it wants to use
> unstructured replies occasionally, that's fine.

In adding that flexibility, you're adding more code paths on the client
(that need to be tested, etc), for (IMO) little benefit.

I would instead prefer to specify per command whether the reply is going
to be structured or not, and only have the read command be a special
case were both are possible, for backwards compatibility only. That way,
it can eventually be deprecated, too.
Alex Bligh March 29, 2016, 10:05 p.m. UTC | #23
On 29 Mar 2016, at 22:05, Wouter Verhelst <w@uter.be> wrote:

>>> For all remaining existing commands, that is just more overhead on the
>>> wire.  The existing non-structured replies do not send any data; they
>>> are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one
>>> reply).  But your proposal inflates that to a minimum of 20 bytes (if
>>> length is 0) or longer (if an error is set).  I'm still strongly in
>>> favor of keeping the existing non-structured replies to commands that
>>> don't have to return data.
>> 
>> I was saying that should be up to the server. If the server wants to
>> write something easily decodable (and easier to maintain) at the expense
>> of a few more bytes on the wire, then let it. If it wants to use
>> unstructured replies occasionally, that's fine.
> 
> In adding that flexibility, you're adding more code paths on the client
> (that need to be tested, etc), for (IMO) little benefit.
> 
> I would instead prefer to specify per command whether the reply is going
> to be structured or not, and only have the read command be a special
> case were both are possible, for backwards compatibility only. That way,
> it can eventually be deprecated, too.

I guess this is what comes of doing more NBD server work than client
work :-) I'd look at it the other way around and say that only one
code path is being exercised on the server, and that having multiple
types of reply depending on command builds fragility into the protocol.

If you want no choice in response type for the server for any given
session (i.e. code path minimisation on the client) my preference would
be what Eric didn't like, i.e. always send structured replies for
all commands if you negotiate structured replies, else always send
unstructured replies. We're talking an overhead of 8 bytes here
(flags & error offset); somehow I suspect the time to transmit
8 bytes is going to be negligible compared to disk time or the
rest of the network tx/rx time.
Wouter Verhelst March 29, 2016, 10:45 p.m. UTC | #24
On Tue, Mar 29, 2016 at 11:05:38PM +0100, Alex Bligh wrote:
> 
> On 29 Mar 2016, at 22:05, Wouter Verhelst <w@uter.be> wrote:
> 
> >>> For all remaining existing commands, that is just more overhead on the
> >>> wire.  The existing non-structured replies do not send any data; they
> >>> are 16 bytes each (only NBD_CMD_READ sends more than 16 bytes in one
> >>> reply).  But your proposal inflates that to a minimum of 20 bytes (if
> >>> length is 0) or longer (if an error is set).  I'm still strongly in
> >>> favor of keeping the existing non-structured replies to commands that
> >>> don't have to return data.
> >> 
> >> I was saying that should be up to the server. If the server wants to
> >> write something easily decodable (and easier to maintain) at the expense
> >> of a few more bytes on the wire, then let it. If it wants to use
> >> unstructured replies occasionally, that's fine.
> > 
> > In adding that flexibility, you're adding more code paths on the client
> > (that need to be tested, etc), for (IMO) little benefit.
> > 
> > I would instead prefer to specify per command whether the reply is going
> > to be structured or not, and only have the read command be a special
> > case were both are possible, for backwards compatibility only. That way,
> > it can eventually be deprecated, too.
> 
> I guess this is what comes of doing more NBD server work than client
> work :-) I'd look at it the other way around and say that only one
> code path is being exercised on the server,

Yes, but both code paths need to _exist_, which isn't the case with
having only one legal reply type per request type. The server just needs
to send header X for replies A, B, C, and header Y for replies D, E, F.
Forming the header is part of producing the reply type, and will be the
same for every conversation -- except for read replies, where it could
possibly be either (but that can't be avoided).

> and that having multiple types of reply depending on command builds
> fragility into the protocol.

I'd think that having the legal reply type depend on context is actually
more fragile.

> If you want no choice in response type for the server for any given
> session (i.e. code path minimisation on the client) my preference would
> be what Eric didn't like, i.e. always send structured replies for
> all commands if you negotiate structured replies, else always send
> unstructured replies.

Doing that requires performing a lookup to negotiated state (and a code
branch) for every response type that can possibly be structured or
nonstructured, and introduces exactly the two code paths that I think
should be avoided.

With what I'm suggesting, this will still be required for read requests,
but only while we retain backwards compatibility.

> We're talking an overhead of 8 bytes here (flags & error offset);
> somehow I suspect the time to transmit 8 bytes is going to be
> negligible compared to disk time or the rest of the network tx/rx
> time.

Sure, but I'm not worried about that.
Alex Bligh March 29, 2016, 10:53 p.m. UTC | #25
On 29 Mar 2016, at 23:45, Wouter Verhelst <w@uter.be> wrote:

> Doing that requires performing a lookup to negotiated state (and a code
> branch) for every response type that can possibly be structured or
> nonstructured, and introduces exactly the two code paths that I think
> should be avoided.
> 
> With what I'm suggesting, this will still be required for read requests,
> but only while we retain backwards compatibility.

OK, I *think* I've encapsulated all 3 options in the revised version I
just sent.

From the other email:

>> As a third option then:
>> 
>> Each chunk consists of the following:
>> 
>> S: 32 bits, 0x668e33ef, magic (NBD_STRUCTURED_REPLY_MAGIC)
>> S: 8 bits: type
>> S: 8 bits: reserved (must be zero)
>> S: 16 bits, flags
>> S: 64 bits, handle
>> S: 32 bits, payload length S: (length bytes of payload data)
>> 
>> The flags have the following meanings:
>> 
>> • bits 0-15: reserved (server MUST set these to zero)
> 
> That seems better in that context, yes. The reserved byte could later on
> be assigned as extra flags if need be.

Or for compatibility with NBD_CMD, it could be 2 16 bit quantities. Missed
this on v2.

> The reason why I suggested zero is that it doesn't require special-case
> code. If an error offset implies that everything beyond that offset is
> invalid, then having an offset of zero implies that the whole read is
> invalid -- which is correct if the server encountered an error, but
> doesn't know or doesn't want to say (for whatever reason) where.

I think that's wrong. As reads can be disordered, then if you get
a chunk at offset X, then a chunk at offset 0, then an end chunk
specifying an error, you (at least in theory) need to disambiguate
the two so you know whether the chunk at offset X was OK. That's
why I'm using 0xffffffff (now) to say "don't know where the error
is".
diff mbox

Patch

diff --git a/doc/proto.md b/doc/proto.md
index 44579fc..f687e3e 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -209,6 +209,10 @@  same value for handle as was sent by the client for each request that
 the server is replying to, so that the client may correlate which
 request is receiving a response.

+Note that it is impossible to tell by reading just the server traffic
+whether a data field will be present.  To remedy this, the experimental
+`Structured Reply` extension has been introduced; see below.
+
 ## Values

 This section describes the value and meaning of constants (other than
@@ -231,6 +235,8 @@  the first magic number.
 - bit 1, `NBD_FLAG_NO_ZEROES`; if set, and if the client replies with
   `NBD_FLAG_C_NO_ZEROES` in the client flags field, the server MUST NOT
   send the 124 bytes of zero at the end of the negotiation.
+- bit 2, `NBD_FLAG_STRUCTURED_REPLY`; defined by the experimental
+  `Structured Reply` extension; see below.

 The server MUST NOT set any other flags, and SHOULD NOT change behaviour
 unless the client responds with a corresponding flag.  The server MUST
@@ -265,6 +271,8 @@  receiving the global flags from the server.
 - bit 1, `NBD_FLAG_C_NO_ZEROES`; MUST NOT be set if the server did not
   set `NBD_FLAG_NO_ZEROES`. If set, the server MUST NOT send the 124
   bytes of zeroes at the end of the negotiation.
+- bit 2, `NBD_FLAG_C_STRUCTURED_REPLY`; defined by the experimental
+  `Structured Reply` extension; see below.

 Clients SHOULD NOT set any other flags; the server MUST drop the
 connection if the client sets an unknown flag, or a flag that does
@@ -435,6 +443,10 @@  The following request types exist:
     signalling no error), the server MUST immediately close the
     connection; it MUST NOT send any further data to the client.

+    The experimental `Structured Reply` extension changes the set of
+    valid replies, in part to allow recovery after a partial read; see
+    below.
+
 * `NBD_CMD_WRITE` (1)

     A write request. Length and offset define the location and amount of
@@ -609,6 +621,117 @@  option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.

+### `Structured Reply` extension
+
+Some major downsides of the default reply to `NBD_CMD_READ` is that it
+is not possible to support partial reads (the command must succeed or
+fail as a whole, and len bytes of data must be sent even on an error,
+unless the connection is closed); nor is it possible to decode the
+server traffic without also knowing what pending read requests were
+sent by the client.
+
+To remedy this, a `Structured Reply` extension is envisioned. This
+extension adds a global flag, a client flag, a new reply type during
+the transmission phase, and alters the set of valid replies to an
+existing command.
+
+* `NBD_FLAG_STRUCTURED_REPLY` (bit 2)
+
+  This is a global flag; if set, and if the client replies with
+  `NBD_FLAG_C_STRUCTURED_REPLY` in the client flags field, the server
+  MUST use structured replies to the `NBD_CMD_READ` command.
+
+* `NBD_FLAG_C_STRUCTURED_REPLY` (bit 2)
+
+  This is a client flag; MUST NOT be set if the server did not set
+  `NBD_FLAG_STRUCTURED_REPLY`. If set, the server must use structured
+  replies to the `NBD_CMD_READ` command.
+
+* Transmission phase
+
+  The transmission phase includes a third message type: the structured
+  reply.  If `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, then the
+  normal server reply will never contain a data payload, and all
+  server replies that send data will be in the following form:
+
+  S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)  
+  S: 32 bits, error  
+  S: 64 bits, handle  
+  S: 32 bits, length of payload (unsigned)  
+  S: *length* bytes of payload data
+
+  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then the normal
+  server reply with a data payload will be used for `NBD_CMD_READ`;
+  but any other replies with a data payload will still use a
+  structured reply (that is, only `NBD_CMD_READ` is allowed to send
+  data in the non-structured form, and negotiating
+  `NBD_FLAG_C_STRUCTURED_REPLY` only affects replies to
+  `NBD_CMD_READ`).  This implies that any new commands that require
+  data in the reply should be gated by their own new global and client
+  flag.  A server MAY refuse to allow a client that negotiates a
+  command that requires a structured reply, but does not also
+  negotiate `NBD_FLAG_C_STRUCTURED_REPLY`.
+
+  In the generic form, the length field of a structured response MAY
+  be zero if there is no data payload, and the error field may be set
+  regardless of the length field (although where possible, the server
+  SHOULD use a length of zero when setting the error field).  However,
+  particular commands may document additional restrictions regarding
+  what forms a valid response (for example, a structured response to
+  `NBD_CMD_READ` MUST NOT set the error field, and MUST have a
+  non-zero length of at least 9).
+
+* `NBD_CMD_READ`
+
+  If `NBD_FLAG_C_STRUCTURED_REPLY` was not negotiated, then a read
+  request MUST always be answered by a single response (with magic
+  0x67446698 `NBD_REPLY_MAGIC`); the response MUST include length
+  bytes of data according to the client's request, although those
+  bytes MAY be invalid if an error is returned, and the connection
+  MUST if an error occurs after a header claiming no error.
+
+  Conversely, if the `NBD_FLAG_C_STRUCTURED_REPLY` was negotiated, the
+  response MUST be a sequence of zero or more structured replies (with
+  magic 0x668e33ef `NBD_STRUCTURED_REPLY_MAGIC`), followed by a
+  concluding normal response (0x67446698 `NBD_REPLY_MAGIC`), where the
+  final response MUST NOT have a data payload; all responses in the
+  sequence use the same handle from the client.  The payload of each
+  intermediate structured reply, called a read chunk, MUST be of the
+  following form:
+
+  S: 64 bits, offset (unsigned)  
+  S: (*length* - 8) bytes of data
+
+  Note that the amount of data returned in a read chunk is determined
+  by the length field of the structured reply, and not the original
+  length of the client's request.  If the server sends a single read
+  chunk for a successful read, the server's length will be the
+  client's length plus 8, because the server must account for the
+  offset field in its reply.  Similarly, a successful client request
+  for a read of 2^32-8 or more bytes MUST be split into at least two
+  read chunks, so that the length field does not suffer from overflow.
+
+  The server MUST ensure that each read chunk lies within the original
+  offset and length of the original client request, MUST NOT send read
+  chunks that would cover the same offset more than once, and MUST
+  send at least one byte of data in addition to the offset field of
+  each read chunk.  The server MAY send read chunks out of order, and
+  may interleave other responses between read replies.  The server
+  MUST NOT set the error field of a read chunk; if an error occurs, it
+  MAY immediately end the sequence of structured response messages,
+  MUST send the error in the concluding normal response, and SHOULD
+  keep the connection open.  The final non-structured response MUST
+  set an error unless the sum of data sent by all read chunks totals
+  the original client length request.
+
+  The client SHOULD immediately close the connection if it detects
+  that the server has sent an offset more than once (whether or not
+  the overlapping data claimed to have the same contents), or if
+  receives the concluding normal reply without an error set but
+  without all bytes covered by read chunk(s).  A future extension may
+  add a command flag that would allow the server to skip read chunks
+  for portions of the file that read as all zeroes.
+
 ## About this file

 This file tries to document the NBD protocol as it is currently