
[v2,1/1] NBD proto: add WRITE_ZEROES extension

Message ID 1459429325-16350-1-git-send-email-den@openvz.org
State New

Commit Message

Denis V. Lunev March 31, 2016, 1:02 p.m. UTC
From: Pavel Borzenkov <pborzenkov@virtuozzo.com>

There exist some cases when a client knows that the data it is going to
write is all zeroes. Such cases include mirroring or backing up a device
implemented by a sparse file.

With current NBD command set, the client has to issue NBD_CMD_WRITE
command with zeroed payload and transfer these zero bytes through the
wire. The server has to write the data onto disk, effectively denying
the sparseness.

To remedy this, the patch adds WRITE_ZEROES extension with one new
NBD_CMD_WRITE_ZEROES command.

Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Wouter Verhelst <w@uter.be>
CC: Alex Bligh <alex@alex.org.uk>
CC: Eric Blake <eblake@redhat.com>
---
v2:
  - rebased on master
  - explicitly state that the client must not set NBD_CMD_WRITE_ZEROES if
    support for it wasn't negotiated with the server;
  - add new command flag's description in format suitable for moving to
    "Command flags" section.


 doc/proto.md | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 4 deletions(-)

Comments

Alex Bligh March 31, 2016, 1:53 p.m. UTC | #1
On 31 Mar 2016, at 14:02, Denis V. Lunev <den@openvz.org> wrote:

> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.

Useful.

> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
> -  set to 1 if the client requires "Force Unit Access" mode of
> -  operation.  MUST NOT be set unless transmission flags included
> -  `NBD_FLAG_SEND_FUA`.
> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
> +  `NBD_CMD_WRITE_ZEROES` commands.  SHOULD be set to 1 if the client requires
> +  "Force Unit Access" mode of operation.  MUST NOT be set unless transmission
> +  flags included `NBD_FLAG_SEND_FUA`.

Not your fault, but this should actually say "unless export flags
included". Transmission flags would be the flags with the command.

> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
> +  extension; see below.

For consistency, probably useful to say here:

MUST NOT be set unless the export flags include NBD_FLAG_SEND_WRITE_ZEROES.

> 
> #### Request types
> 
> @@ -523,6 +528,10 @@ The following request types exist:
>     A client MUST NOT send a trim request unless `NBD_FLAG_SEND_TRIM`
>     was set in the transmission flags field.
> 
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> +
> * Other requests
> 
>     Some third-party implementations may require additional protocol
> @@ -654,6 +663,53 @@ option reply type.
>       message if they do not also send it as a reply to the
>       `NBD_OPT_SELECT` message.
> 
> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +losing the sparseness.
> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> +one new command and one new command flag.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    A client MUST NOT send a write zeroes request unless
> +    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag was set in the transmission flags field,
> +    the client MAY set the flag `NBD_CMD_FLAG_FUA` in the command flags field.
> +    If this flag was set, the server MUST NOT send the reply until it has
> +    ensured that the newly-zeroed data has reached permanent storage.
> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
> +    flags field, the server MAY use trimming to zero out the area, but it
> +    MUST ensure that the data reads back as zero.
> +

Can you give an example of a situation where the client would not set this
and it would be undesirable for the server to create a 'hole' using
'trim' type technology, even when the client doesn't specify it?
I suspect there are already some backends (e.g. ceph on qemu-nbd) which
will effectively do a 'trim' if you write 4k of zeroes even under
current circumstances.

IE why not always permit trimming PROVIDED the data always reads back
as zero? This would be far simpler.
Paolo Bonzini March 31, 2016, 1:55 p.m. UTC | #2
On 31/03/2016 15:53, Alex Bligh wrote:
>> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>> > +    flags field, the server MAY use trimming to zero out the area, but it
>> > +    MUST ensure that the data reads back as zero.
>> > +
> Can you give an example of a situation where the client would not set this
> and it would be undesirable for the server to create a 'hole' using
> 'trim' type technology, even when the client doesn't specify it?
> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
> will effectively do a 'trim' if you write 4k of zeroes even under
> current circumstances.
> 
> IE why not always permit trimming PROVIDED the data always reads back
> as zero? This would be far simpler.

Because trimming can make future operations more expensive and cause
fragmentation (which may not be as bad as it used to be at the media
level, but it is still somewhat bad at the filesystem level).

So if you want a fully-provisioned file, the simplest way to do so is to
write zeroes to it, and trimming is undesirable.

Paolo
Eric Blake March 31, 2016, 2:08 p.m. UTC | #3
On 03/31/2016 07:53 AM, Alex Bligh wrote:
> 
> On 31 Mar 2016, at 14:02, Denis V. Lunev <den@openvz.org> wrote:
> 
>> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>>
>> There exist some cases when a client knows that the data it is going to
>> write is all zeroes. Such cases include mirroring or backing up a device
>> implemented by a sparse file.
> 
> Useful.
> 
>> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
>> -  set to 1 if the client requires "Force Unit Access" mode of
>> -  operation.  MUST NOT be set unless transmission flags included
>> -  `NBD_FLAG_SEND_FUA`.
>> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
>> +  `NBD_CMD_WRITE_ZEROES` commands.  SHOULD be set to 1 if the client requires
>> +  "Force Unit Access" mode of operation.  MUST NOT be set unless transmission
>> +  flags included `NBD_FLAG_SEND_FUA`.
> 
> Not your fault, but this should actually say "unless export flags
> included". Transmission flags would be the flags with the command.

No, we just barely renamed 'export flags' to 'transmission flags', to
represent the 16 bits sent by the server at the end of handshake phase;
these are named 'NBD_FLAG_*'.  We still use the term 'command flags'
(although maybe 'request flags' is better) for the 16 bits sent with
each request; these are named 'NBD_CMD_FLAG_*'.

So Pavel's text is correct as-is.
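
The two flag namespaces described above can be sketched in Python as follows. This is a minimal illustration, not code from any NBD implementation. Bit 6 for `NBD_FLAG_SEND_WRITE_ZEROES` and bit 1 for the may-trim command flag are the values proposed in this patch; bit 3 for `NBD_FLAG_SEND_FUA` comes from the existing protocol document; the helper name `check_command_flags` is hypothetical.

```python
# Transmission flags: 16 bits sent by the server at the end of the handshake.
NBD_FLAG_SEND_FUA = 1 << 3
NBD_FLAG_SEND_TRIM = 1 << 5
NBD_FLAG_SEND_WRITE_ZEROES = 1 << 6  # bit 6, as proposed in this patch

# Command flags: 16 bits sent by the client with each request.
NBD_CMD_FLAG_FUA = 1 << 0
NBD_CMD_FLAG_MAY_TRIM = 1 << 1  # bit 1, as proposed in this patch


def check_command_flags(transmission_flags: int, command_flags: int) -> None:
    """Raise ValueError if a write-zeroes request would violate the patch text.

    Hypothetical client-side helper; assumes the request type is
    NBD_CMD_WRITE_ZEROES.
    """
    if not (transmission_flags & NBD_FLAG_SEND_WRITE_ZEROES):
        raise ValueError("server did not advertise NBD_FLAG_SEND_WRITE_ZEROES")
    if (command_flags & NBD_CMD_FLAG_FUA) and not (
        transmission_flags & NBD_FLAG_SEND_FUA
    ):
        raise ValueError("NBD_CMD_FLAG_FUA requires NBD_FLAG_SEND_FUA")
    # The patch allows the may-trim flag even without NBD_FLAG_SEND_TRIM,
    # so no check against NBD_FLAG_SEND_TRIM is made here.
    unknown = command_flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_MAY_TRIM)
    if unknown:
        raise ValueError(f"undocumented command flag bits: {unknown:#x}")
```

The point of the sketch is only that the `NBD_FLAG_*` values gate which `NBD_CMD_FLAG_*` values a client may send; the two sets of bits never share a field.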

> 
>> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
>> +  extension; see below.
> 
> For consistency, probably useful to say here:
> 
> MUST NOT be set unless the export flags include NBD_FLAG_SEND_WRITE_ZEROES.

Elsewhere, when defining an experimental extension, the forward
reference has been as sparse as possible; so this sentence (about the
transmission flags including NBD_FLAG_SEND_WRITE_ZEROES) should appear
only in the experimental section, if it is not already there.


>>
>> +### `WRITE_ZEROES` extension
>> +
>> +There exist some cases when a client knows that the data it is going to write
>> +is all zeroes. Such cases include mirroring or backing up a device implemented
>> +by a sparse file. With current NBD command set, the client has to issue
>> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
>> +through the wire. The server has to write the data onto disk, effectively
>> +losing the sparseness.
>> +
>> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
>> +one new command and one new command flag.
>> +
>> +* `NBD_CMD_WRITE_ZEROES` (6)

Wouter recently pointed out that we explicitly do NOT want to repeat
constants in more than one location; define the value to (6) above where
you make the forward reference in the normative section, then keep the
experimental section referring to the command by name only.  Especially
useful if we end up renumbering things because we have multiple
extension proposals in flight at the moment.


>> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>> +    flags field, the server MAY use trimming to zero out the area, but it
>> +    MUST ensure that the data reads back as zero.
>> +
> 
> Can you give an example of a situation where the client would not set this
> and it would be undesirable for the server to create a 'hole' using
> 'trim' type technology, even when the client doesn't specify it?

Yes, I can see situations where the client REQUIRES that the server
write actual zeroes, rather than trimming.  The biggest reason is that
in an environment where storage can be oversubscribed (multiple sparse
files that in name occupy more data than the underlying storage
contains), explicitly writing zeroes without punching a hole guarantees
that YOUR file has storage allocated to it (whereas if YOUR file is
trimmed, some other file can then use enough allocation to prevent you
from actually writing data in place of the hole).  Of course, the client
can still achieve this by sticking with NBD_CMD_WRITE, but that requires
more network traffic.

However, having written that, I'm thinking we have the wrong sense for
the flag.  I think it makes more sense to allow trim/hole-punching by
default (but ONLY when the server can guarantee that reads will still be
zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
the cases where the server MUST NOT trim but allocate and write actual
zeroes.  I suspect that explicit allocation requests are less common,
and also less efficient; so it makes sense to gear the default state of
the flag towards efficiency (both in the sense that punching holes can be
faster than writing zeroes, and that most people LIKE the storage savings
of sparse files).
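
The difference between the two senses of the flag can be sketched as a server-side decision. This is only an illustration under assumptions: `NBD_CMD_FLAG_NO_TRIM` is the name proposed in this reply, reusing bit 1 for it is a guess, and both function names and the `reads_back_zero` parameter are hypothetical.

```python
NBD_CMD_FLAG_MAY_TRIM = 1 << 1  # sense used by the patch as posted
NBD_CMD_FLAG_NO_TRIM = 1 << 1   # proposed sense; bit position is an assumption


def may_punch_hole_as_posted(command_flags: int, reads_back_zero: bool) -> bool:
    """Patch as posted: trimming is allowed only when the client opts in."""
    return reads_back_zero and bool(command_flags & NBD_CMD_FLAG_MAY_TRIM)


def may_punch_hole_as_proposed(command_flags: int, reads_back_zero: bool) -> bool:
    """Proposed sense: trimming is the default unless the client forbids it."""
    return reads_back_zero and not (command_flags & NBD_CMD_FLAG_NO_TRIM)
```

In both variants the server may trim only if it can guarantee that the range reads back as zeroes; the only thing that changes is what a request with no flags set means.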

> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
> will effectively do a 'trim' if you write 4k of zeroes even under
> current circumstances.
> 
> IE why not always permit trimming PROVIDED the data always reads back
> as zero? This would be far simpler.
>
Alex Bligh March 31, 2016, 2:27 p.m. UTC | #4
On 31 Mar 2016, at 14:55, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 31/03/2016 15:53, Alex Bligh wrote:
>>>> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>>>> +    flags field, the server MAY use trimming to zero out the area, but it
>>>> +    MUST ensure that the data reads back as zero.
>>>> +
>> Can you give an example of a situation where the client would not set this
>> and it would be undesirable for the server to create a 'hole' using
>> 'trim' type technology, even when the client doesn't specify it?
>> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
>> will effectively do a 'trim' if you write 4k of zeroes even under
>> current circumstances.
>> 
>> IE why not always permit trimming PROVIDED the data always reads back
>> as zero? This would be far simpler.
> 
> Because trimming can make future operations more expensive and cause
> fragmentation (which may not be as bad as it used to be at the media
> level, but it is still somewhat bad at the filesystem level).
> 
> So if you want a fully-provisioned file, the simplest way to do so is to
> write zeroes to it, and trimming is undesirable.

But isn't the server in a better position to know this than the
client? EG if the server has a back end implementation (as I suspect
Ceph on qemu-nbd does) which never actually stores all zero blocks,
it won't make a difference, and conceivably you're generating a whole
pile of I/O to avoid sparseness when sparseness might be faster. Take
for example a persistent memory interface, where fragmentation is
irrelevant, and writing piles of zeroes to memory is a waste of time.

and on the same subject

On 31 Mar 2016, at 15:08, Eric Blake <eblake@redhat.com> wrote:
> Yes, I can see situations where the client REQUIRES that the server
> write actual zeroes, rather than trimming.  The biggest reason is that
> in an environment where storage can be oversubscribed (multiple sparse
> files that in name occupy more data than the underlying storage
> contains), explicitly writing zeroes without punching a hole guarantees
> that YOUR file has storage allocated to it (whereas if YOUR file is
> trimmed, some other file can then use enough allocation to prevent you
> from actually writing data in place of the hole).  Of course, the client
> can still achieve this by sticking with NBD_CMD_WRITE, but that requires
> more network traffic.

Ditto, the server is surely in a better position to know this. Perhaps
the server KNOWS it doesn't oversubscribe.

On the other hand, a third reason I suppose could be security.

Whatever, the implication that a server may never use a trim type
operation unless NBD_CMD_FLAG_MAY_TRIM is specified seems to me
pretty draconian. I'd prefer this as NBD_CMD_FLAG_NO_TRIM
(as Eric sets out below), and to make it a 'hint', saying the
data SHOULD actually be written out as zeroes for security and
to maintain allocation and lack of sparseness.

A good example of why this can only be a 'SHOULD' would be
a file system that itself is CoW (or perhaps journals
data). Either way, you aren't going to get your space back, you
aren't going to get secure overwriting, and sparseness doesn't
much mean anything.

> However, having written that, I'm thinking we have the wrong sense for
> the flag.  I think it makes more sense to allow trim/hole-punching by
> default (but ONLY when the server can guarantee that reads will still be
> zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
> the cases where the server MUST NOT trim but allocate and write actual
> zeroes.  I suspect that explicit allocation requests are less common,
> and also less efficient; so it makes sense to gear the default state of
> the flag towards efficiency (both in the sense that punching holes can be
> faster than writing zeroes, and that most people LIKE the storage savings
> of sparse files).

I agree with the sense reversal, but I think it should be a SHOULD NOT
(for the reasons set out above), and explaining why would be helpful.
Paolo Bonzini March 31, 2016, 2:40 p.m. UTC | #5
On 31/03/2016 16:27, Alex Bligh wrote:
> > > IE why not always permit trimming PROVIDED the data always reads back
> > > as zero? This would be far simpler.
> > 
> > Because trimming can make future operations more expensive and cause
> > fragmentation (which may not be as bad as it used to be at the media
> > level, but it is still somewhat bad at the filesystem level).
> > 
> > So if you want a fully-provisioned file, the simplest way to do so is to
> > write zeroes to it, and trimming is undesirable.
> But isn't the server in a better position to know this than the
> client?

There are at least three possible states for a sector:

- hole (thin-provisioned)

- allocated as data (disk contains actual zeroes)

- allocated as unwritten (blocks reserved on backing storage, reads as
zeroes but the disk may not contain actual zeroes)

It's always okay for the backend to convert a zero block to an unwritten
extent; it's generally not okay for a backend to take a request to
create an unwritten extent and instead create a hole.

It's all an "as if" situation. The server must provide the semantics
requested by the client.  For example, writing to a hole could cause
ENOSPC, whereas writing to an unwritten extent could not.  The server might know
better, because it certainly is in a better position to know how to
fulfill the client's request.

But even if it's just a hint, it makes sense for NBD to provide it.
It's not a coincidence that this hint exists at all levels: SCSI has an
UNMAP bit that can be set in the WRITE SAME command (and it has UNMAP
which matches NBD's TRIM); the fallocate system call has
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE (plus Linux has the
BLKDISCARD ioctl which again matches NBD's TRIM for block devices).
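
The three sector states and the ENOSPC difference between them can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any real backend: `ToyBackend`, its methods, and the shared `free_blocks` pool are all hypothetical, and the docstrings only draw a loose analogy to `FALLOC_FL_ZERO_RANGE` and `FALLOC_FL_PUNCH_HOLE`.

```python
import errno
from enum import Enum


class SectorState(Enum):
    HOLE = "hole"            # thin-provisioned, no backing block reserved
    DATA = "data"            # allocated, disk contains the written bytes
    UNWRITTEN = "unwritten"  # block reserved, reads as zeroes


class ToyBackend:
    """Toy thin-provisioned store; `free_blocks` models shared backing capacity."""

    def __init__(self, sectors: int, free_blocks: int) -> None:
        self.state = [SectorState.HOLE] * sectors
        self.free_blocks = free_blocks

    def _reserve(self, sector: int) -> None:
        # Only a hole needs a new block from the shared pool.
        if self.state[sector] is SectorState.HOLE:
            if self.free_blocks == 0:
                raise OSError(errno.ENOSPC, "backing storage exhausted")
            self.free_blocks -= 1

    def zero_range(self, sector: int) -> None:
        """Loosely like FALLOC_FL_ZERO_RANGE: keep (or reserve) the block."""
        self._reserve(sector)
        self.state[sector] = SectorState.UNWRITTEN

    def punch_hole(self, sector: int) -> None:
        """Loosely like FALLOC_FL_PUNCH_HOLE: return the block to the pool."""
        if self.state[sector] is not SectorState.HOLE:
            self.free_blocks += 1
        self.state[sector] = SectorState.HOLE

    def write(self, sector: int) -> None:
        """Write real data; can fail with ENOSPC only if the sector is a hole."""
        self._reserve(sector)
        self.state[sector] = SectorState.DATA
```

In this model a sector zeroed with `zero_range` keeps its reservation, so a later `write` to it cannot fail for lack of space, while a sector released with `punch_hole` competes for blocks again with every other user of the pool.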

> EG if the server has a back end implementation (as I suspect
> Ceph on qemu-nbd does)

Ceph doesn't, but gluster does.

> which never actually stores all zero blocks,
> it won't make a difference, and conceivably you're generating a whole
> pile of I/O to avoid sparseness when sparseness might be faster. Take
> for example a persistent memory interface, where fragmentation is
> irrelevant, and writing piles of zeroes to memory is a waste of time.

It certainly isn't a waste of time if your intention is to scrub data
belonging to a previous tenant, before giving access to someone else!
If you have a metadata layer above then you can handle the command there
(that's why we're adding it); if you haven't you do have to write the
zeroes.

Paolo
Eric Blake March 31, 2016, 11:46 p.m. UTC | #6
On 03/31/2016 07:02 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 

> +++ b/doc/proto.md
> @@ -261,6 +261,8 @@ immediately after the handshake flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands

Hmm, we've picked overlapping bits between your proposal and mine for
`NBD_FLAG_SEND_DF`. Obviously, whoever goes in second gets bit 7.
Wouter Verhelst April 1, 2016, 8:37 a.m. UTC | #7
Hi,

Thanks, applied.

On Thu, Mar 31, 2016 at 04:02:05PM +0300, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
> 
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
> 
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Wouter Verhelst <w@uter.be>
> CC: Alex Bligh <alex@alex.org.uk>
> CC: Eric Blake <eblake@redhat.com>
> ---
> v2:
>   - rebased on master
>   - explicitly state that the client must not set NBD_CMD_WRITE_ZEROES if
>     support for it wasn't negotiated with the server;
>   - add new command flag's description in format suitable for moving to
>     "Command flags" section.
> 
> 
>  doc/proto.md | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 60 insertions(+), 4 deletions(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index c1e05c5..a574563 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -261,6 +261,8 @@ immediately after the handshake flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands
>  
>  Clients SHOULD ignore unknown flags.
>  
> @@ -444,10 +446,13 @@ affects a particular command.  Clients MUST NOT set a command flag bit
>  that is not documented for the particular command; and whether a flag is
>  valid may depend on negotiation during the handshake phase.
>  
> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
> -  set to 1 if the client requires "Force Unit Access" mode of
> -  operation.  MUST NOT be set unless transmission flags included
> -  `NBD_FLAG_SEND_FUA`.
> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
> +  `NBD_CMD_WRITE_ZEROES` commands.  SHOULD be set to 1 if the client requires
> +  "Force Unit Access" mode of operation.  MUST NOT be set unless transmission
> +  flags included `NBD_FLAG_SEND_FUA`.
> +
> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
> +  extension; see below.
>  
>  #### Request types
>  
> @@ -523,6 +528,10 @@ The following request types exist:
>      A client MUST NOT send a trim request unless `NBD_FLAG_SEND_TRIM`
>      was set in the transmission flags field.
>  
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> +
>  * Other requests
>  
>      Some third-party implementations may require additional protocol
> @@ -654,6 +663,53 @@ option reply type.
>        message if they do not also send it as a reply to the
>        `NBD_OPT_SELECT` message.
>  
> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +losing the sparseness.
> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> +one new command and one new command flag.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    A client MUST NOT send a write zeroes request unless
> +    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag was set in the transmission flags field,
> +    the client MAY set the flag `NBD_CMD_FLAG_FUA` in the command flags field.
> +    If this flag was set, the server MUST NOT send the reply until it has
> +    ensured that the newly-zeroed data has reached permanent storage.
> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
> +    flags field, the server MAY use trimming to zero out the area, but it
> +    MUST ensure that the data reads back as zero.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MAY then close the connection.
> +
> +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> +including one or more sectors beyond the size of the device. It SHOULD
> +return `EPERM` if it receives a write zeroes request on a read-only export.
> +
> +The extension adds the following new command flag:
> +
> +- bit 1, `NBD_CMD_FLAG_MAY_TRIM`; valid during `NBD_CMD_WRITE_ZEROES`.
> +  SHOULD be set to 1 if the client allows the server to use trim to perform
> +  the requested operation. The client MAY send `NBD_CMD_FLAG_MAY_TRIM` even
> +  if `NBD_FLAG_SEND_TRIM` was not set in the transmission flags field.
> +
>  ## About this file
>  
>  This file tries to document the NBD protocol as it is currently
> -- 
> 2.1.4
Eric Blake April 1, 2016, 8:26 p.m. UTC | #8
On 04/01/2016 02:37 AM, Wouter Verhelst wrote:
> Hi,
> 
> Thanks, applied.
> 
> On Thu, Mar 31, 2016 at 04:02:05PM +0300, Denis V. Lunev wrote:
>> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>>
>> There exist some cases when a client knows that the data it is going to
>> write is all zeroes. Such cases include mirroring or backing up a device
>> implemented by a sparse file.
>>
>> With current NBD command set, the client has to issue NBD_CMD_WRITE
>> command with zeroed payload and transfer these zero bytes through the
>> wire. The server has to write the data onto disk, effectively denying
>> the sparseness.
>>

>> +
>> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
>> +  extension; see below.

Hmm, we had an unfinished conversation about whether the default sense
of this bit should be reversed. I'll propose a followup patch, now that
the original has been merged.

Patch

diff --git a/doc/proto.md b/doc/proto.md
index c1e05c5..a574563 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -261,6 +261,8 @@  immediately after the handshake flags field in oldstyle negotiation:
   schedule I/O accesses as for a rotational medium
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands
+- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
+  supports `NBD_CMD_WRITE_ZEROES` commands
 
 Clients SHOULD ignore unknown flags.
 
@@ -444,10 +446,13 @@  affects a particular command.  Clients MUST NOT set a command flag bit
 that is not documented for the particular command; and whether a flag is
 valid may depend on negotiation during the handshake phase.
 
-- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`.  SHOULD be
-  set to 1 if the client requires "Force Unit Access" mode of
-  operation.  MUST NOT be set unless transmission flags included
-  `NBD_FLAG_SEND_FUA`.
+- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
+  `NBD_CMD_WRITE_ZEROES` commands.  SHOULD be set to 1 if the client requires
+  "Force Unit Access" mode of operation.  MUST NOT be set unless transmission
+  flags included `NBD_FLAG_SEND_FUA`.
+
+- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
+  extension; see below.
 
 #### Request types
 
@@ -523,6 +528,10 @@  The following request types exist:
     A client MUST NOT send a trim request unless `NBD_FLAG_SEND_TRIM`
     was set in the transmission flags field.
 
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    Defined by the experimental `WRITE_ZEROES` extension; see below.
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -654,6 +663,53 @@  option reply type.
       message if they do not also send it as a reply to the
       `NBD_OPT_SELECT` message.
 
+### `WRITE_ZEROES` extension
+
+There exist some cases when a client knows that the data it is going to write
+is all zeroes. Such cases include mirroring or backing up a device implemented
+by a sparse file. With current NBD command set, the client has to issue
+`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
+through the wire. The server has to write the data onto disk, effectively
+losing the sparseness.
+
+To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
+one new command and one new command flag.
+
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    A write request with no payload. Length and offset define the location
+    and amount of data to be zeroed.
+
+    The server MUST zero out the data on disk, and then send the reply
+    message. The server MAY send the reply message before the data has
+    reached permanent storage.
+
+    A client MUST NOT send a write zeroes request unless
+    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
+
+    If the `NBD_FLAG_SEND_FUA` flag was set in the transmission flags field,
+    the client MAY set the flag `NBD_CMD_FLAG_FUA` in the command flags field.
+    If this flag was set, the server MUST NOT send the reply until it has
+    ensured that the newly-zeroed data has reached permanent storage.
+
+    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
+    flags field, the server MAY use trimming to zero out the area, but it
+    MUST ensure that the data reads back as zero.
+
+    If an error occurs, the server SHOULD set the appropriate error code
+    in the error field. The server MAY then close the connection.
+
+The server SHOULD return `ENOSPC` if it receives a write zeroes request
+including one or more sectors beyond the size of the device. It SHOULD
+return `EPERM` if it receives a write zeroes request on a read-only export.
+
+The extension adds the following new command flag:
+
+- bit 1, `NBD_CMD_FLAG_MAY_TRIM`; valid during `NBD_CMD_WRITE_ZEROES`.
+  SHOULD be set to 1 if the client allows the server to use trim to perform
+  the requested operation. The client MAY send `NBD_CMD_FLAG_MAY_TRIM` even
+  if `NBD_FLAG_SEND_TRIM` was not set in the transmission flags field.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently
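
The request that the patch text describes can be sketched as follows. This is a minimal illustration, not a reference implementation. The command number 6 and the two command flag bits come from the patch; the 28-byte request layout and the magic `0x25609513` come from the existing NBD protocol document rather than from this thread; the function names are hypothetical, and the order of the two server-side checks is an assumption, since the patch does not specify one.

```python
import errno
import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_WRITE_ZEROES = 6
NBD_CMD_FLAG_FUA = 1 << 0
NBD_CMD_FLAG_MAY_TRIM = 1 << 1

# Big-endian: magic (32 bits), command flags (16), type (16),
# handle (64), offset (64), length (32). No payload follows.
REQUEST_HEADER = struct.Struct(">IHHQQI")


def pack_write_zeroes(handle: int, offset: int, length: int,
                      *, fua: bool = False, may_trim: bool = False) -> bytes:
    """Build the request header for a write-zeroes command (client side)."""
    flags = 0
    if fua:
        flags |= NBD_CMD_FLAG_FUA
    if may_trim:
        flags |= NBD_CMD_FLAG_MAY_TRIM
    return REQUEST_HEADER.pack(NBD_REQUEST_MAGIC, flags, NBD_CMD_WRITE_ZEROES,
                               handle, offset, length)


def check_write_zeroes(offset: int, length: int,
                       export_size: int, read_only: bool) -> int:
    """Return 0, or the errno value the patch says the server SHOULD return."""
    if read_only:
        return errno.EPERM   # write zeroes request on a read-only export
    if offset + length > export_size:
        return errno.ENOSPC  # request extends beyond the size of the device
    return 0
```

For example, `pack_write_zeroes(1, 4096, 8192, may_trim=True)` produces a 28-byte header and nothing else on the wire, which is the saving over sending 8192 zero bytes with `NBD_CMD_WRITE`.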