Message ID | 1459429325-16350-1-git-send-email-den@openvz.org |
---|---|
State | New |
On 31 Mar 2016, at 14:02, Denis V. Lunev <den@openvz.org> wrote:

> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.

Useful.

> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`. SHOULD be
> -  set to 1 if the client requires "Force Unit Access" mode of
> -  operation. MUST NOT be set unless transmission flags included
> -  `NBD_FLAG_SEND_FUA`.
> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
> +  `NBD_CMD_WRITE_ZEROES` commands. SHOULD be set to 1 if the client requires
> +  "Force Unit Access" mode of operation. MUST NOT be set unless transmission
> +  flags included `NBD_FLAG_SEND_FUA`.

Not your fault, but this should actually say "unless export flags
included". Transmission flags would be the flags with the command.

> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
> +  extension; see below.

For consistency, probably useful to say here:

  MUST NOT be set unless the export flags include NBD_FLAG_SEND_WRITE_ZEROES.

>
> #### Request types
>
> @@ -523,6 +528,10 @@ The following request types exist:
>      A client MUST NOT send a trim request unless `NBD_FLAG_SEND_TRIM`
>      was set in the transmission flags field.
>
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> +
>  * Other requests
>
>      Some third-party implementations may require additional protocol
> @@ -654,6 +663,53 @@ option reply type.
>      message if they do not also send it as a reply to the
>      `NBD_OPT_SELECT` message.
>
> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +losing the sparseness.
> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> +one new command and one new command flag.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    A client MUST NOT send a write zeroes request unless
> +    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag was set in the transmission flags field,
> +    the client MAY set the flag `NBD_CMD_FLAG_FUA` in the command flags field.
> +    If this flag was set, the server MUST NOT send the reply until it has
> +    ensured that the newly-zeroed data has reached permanent storage.
> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
> +    flags field, the server MAY use trimming to zero out the area, but it
> +    MUST ensure that the data reads back as zero.
> +

Can you give an example of a situation where the client would not set this
and it would be undesirable for the server to create a 'hole' using
'trim' type technology, even when the client doesn't specify it?
I suspect there are already some backends (e.g. ceph on qemu-nbd) which
will effectively do a 'trim' if you write 4k of zeroes even under
current circumstances.

IE why not always permit trimming PROVIDED the data always reads back
as zero? This would be far simpler.
On 31/03/2016 15:53, Alex Bligh wrote:
>> > +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>> > +    flags field, the server MAY use trimming to zero out the area, but it
>> > +    MUST ensure that the data reads back as zero.
>> > +
> Can you give an example of a situation where the client would not set this
> and it would be undesirable for the server to create a 'hole' using
> 'trim' type technology, even when the client doesn't specify it?
> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
> will effectively do a 'trim' if you write 4k of zeroes even under
> current circumstances.
>
> IE why not always permit trimming PROVIDED the data always reads back
> as zero? This would be far simpler.

Because trimming can make future operations more expensive and cause
fragmentation (which may not be as bad as it used to be at the media
level, but it is still somewhat bad at the filesystem level).

So if you want a fully-provisioned file, the simplest way to do so is to
write zeroes to it, and trimming is undesirable.

Paolo
On 03/31/2016 07:53 AM, Alex Bligh wrote:
>
> On 31 Mar 2016, at 14:02, Denis V. Lunev <den@openvz.org> wrote:
>
>> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>>
>> There exist some cases when a client knows that the data it is going to
>> write is all zeroes. Such cases include mirroring or backing up a device
>> implemented by a sparse file.
>
> Useful.
>
>> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`. SHOULD be
>> -  set to 1 if the client requires "Force Unit Access" mode of
>> -  operation. MUST NOT be set unless transmission flags included
>> -  `NBD_FLAG_SEND_FUA`.
>> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
>> +  `NBD_CMD_WRITE_ZEROES` commands. SHOULD be set to 1 if the client requires
>> +  "Force Unit Access" mode of operation. MUST NOT be set unless transmission
>> +  flags included `NBD_FLAG_SEND_FUA`.
>
> Not your fault, but this should actually say "unless export flags
> included". Transmission flags would be the flags with the command.

No, we just barely renamed 'export flags' to 'transmission flags', to
represent the 16 bits sent by the server at the end of handshake phase;
these are named 'NBD_FLAG_*'. We still use the term 'command flags'
(although maybe 'request flags' is better) for the 16 bits sent with
each request; these are named 'NBD_CMD_FLAG_*'. So Pavel's text is
correct as-is.

>
>> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
>> +  extension; see below.
>
> For consistency, probably useful to say here:
>
> MUST NOT be set unless the export flags include NBD_FLAG_SEND_WRITE_ZEROES.

Elsewhere, when defining an experimental extension, the forward
reference has been as sparse as possible; so this sentence (about the
transmission flags including NBD_FLAG_SEND_WRITE_ZEROES) should appear
only in the experimental section, if it is not already there.

>>
>> +### `WRITE_ZEROES` extension
>> +
>> +There exist some cases when a client knows that the data it is going to write
>> +is all zeroes. Such cases include mirroring or backing up a device implemented
>> +by a sparse file. With current NBD command set, the client has to issue
>> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
>> +through the wire. The server has to write the data onto disk, effectively
>> +losing the sparseness.
>> +
>> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
>> +one new command and one new command flag.
>> +
>> +* `NBD_CMD_WRITE_ZEROES` (6)

Wouter recently pointed out that we explicitly do NOT want to repeat
constants in more than one location; define the value to (6) above where
you make the forward reference in the normative section, then keep the
experimental section referring to the command by name only. Especially
useful if we end up renumbering things because we have multiple
extension proposals in flight at the moment.

>> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>> +    flags field, the server MAY use trimming to zero out the area, but it
>> +    MUST ensure that the data reads back as zero.
>> +
>
> Can you give an example of a situation where the client would not set this
> and it would be undesirable for the server to create a 'hole' using
> 'trim' type technology, even when the client doesn't specify it?

Yes, I can see situations where the client REQUIRES that the server
write actual zeroes, rather than trimming. The biggest reason is that
in an environment where storage can be oversubscribed (multiple sparse
files that in name occupy more data than the underlying storage
contains), explicitly writing zeroes without punching a hole guarantees
that YOUR file has storage allocated to it (whereas if YOUR file is
trimmed, some other file can then use enough allocation to prevent you
from actually writing data in place of the hole). Of course, the client
can still achieve this by sticking with NBD_CMD_WRITE, but that requires
more network traffic.

However, having written that, I'm thinking we have the wrong sense for
the flag. I think it makes more sense to allow trim/hole-punching by
default (but ONLY when the server can guarantee that reads will still be
zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
the cases where the server MUST NOT trim but allocate and write actual
zeroes. I suspect that explicit allocation requests are less common,
and also less efficient; so having the default state of the flag geared
towards efficiency (both in the sense that punching holes can be faster
than writing zeroes, and that most people LIKE the storage savings of
sparse files).

> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
> will effectively do a 'trim' if you write 4k of zeroes even under
> current circumstances.
>
> IE why not always permit trimming PROVIDED the data always reads back
> as zero? This would be far simpler.
>
On 31 Mar 2016, at 14:55, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 31/03/2016 15:53, Alex Bligh wrote:
>>>> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
>>>> +    flags field, the server MAY use trimming to zero out the area, but it
>>>> +    MUST ensure that the data reads back as zero.
>>>> +
>> Can you give an example of a situation where the client would not set this
>> and it would be undesirable for the server to create a 'hole' using
>> 'trim' type technology, even when the client doesn't specify it?
>> I suspect there are already some backends (e.g. ceph on qemu-nbd) which
>> will effectively do a 'trim' if you write 4k of zeroes even under
>> current circumstances.
>>
>> IE why not always permit trimming PROVIDED the data always reads back
>> as zero? This would be far simpler.
>
> Because trimming can make future operations more expensive and cause
> fragmentation (which may not be as bad as it used to be at the media
> level, but it is still somewhat bad at the filesystem level).
>
> So if you want a fully-provisioned file, the simplest way to do so is to
> write zeroes to it, and trimming is undesirable.

But isn't the server in a better position to know this than the
client? EG if the server has a back end implementation (as I suspect
Ceph on qemu-nbd does) which never actually stores all zero blocks,
it won't make a difference, and conceivably you're generating a whole
pile of I/O to avoid sparseness when sparseness might be faster. Take
for example a persistent memory interface, where fragmentation is
irrelevant, and writing piles of zeroes to memory is a waste of time.

and on the same subject

On 31 Mar 2016, at 15:08, Eric Blake <eblake@redhat.com> wrote:

> Yes, I can see situations where the client REQUIRES that the server
> write actual zeroes, rather than trimming. The biggest reason is that
> in an environment where storage can be oversubscribed (multiple sparse
> files that in name occupy more data than the underlying storage
> contains), explicitly writing zeroes without punching a hole guarantees
> that YOUR file has storage allocated to it (whereas if YOUR file is
> trimmed, some other file can then use enough allocation to prevent you
> from actually writing data in place of the hole). Of course, the client
> can still achieve this by sticking with NBD_CMD_WRITE, but that requires
> more network traffic.

Ditto, the server is surely in a better position to know this. Perhaps
the server KNOWS it doesn't oversubscribe.

On the other hand, a third reason I suppose could be security.

Whatever, the implication that a server may never use a trim type
operation unless NBD_CMD_FLAG_MAY_TRIM is specified seems to me pretty
draconian. I'd prefer this as NBD_CMD_FLAG_NO_TRIM (as Eric sets out
below), and to make it a 'hint', saying the data SHOULD actually be
written out as zeroes for security and to maintain allocation and lack
of sparseness.

A good example of why this can only be a 'SHOULD' would be a file system
that itself is CoW (or perhaps journals data). Either way, you aren't
going to get your space back, you aren't going to get secure
overwriting, and sparseness doesn't much mean anything.

> However, having written that, I'm thinking we have the wrong sense for
> the flag. I think it makes more sense to allow trim/hole-punching by
> default (but ONLY when the server can guarantee that reads will still be
> zeroes), and make the flag NBD_CMD_FLAG_NO_TRIM to explicitly specify
> the cases where the server MUST NOT trim but allocate and write actual
> zeroes. I suspect that explicit allocation requests are less common,
> and also less efficient; so having the default state of the flag geared
> towards efficiency (both in the sense that punching holes can be faster
> than writing zeroes, and that most people LIKE the storage savings of
> sparse files).

I agree with the sense reversal, but I think it should be a SHOULD NOT
(for the reasons set out above), and explaining why would be helpful.
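To make the two flag senses in this subthread concrete, here is a minimal sketch of the server-side decision under each proposal. It is illustrative only: the function names are hypothetical, `NBD_CMD_FLAG_NO_TRIM` is only a suggestion at this point in the thread, and reusing bit 1 for it is an assumption. The `trim_reads_zero` argument stands for the server's own knowledge that a trimmed range reads back as zero on its backend. The reversed variant models the MUST NOT reading; under the SHOULD NOT reading the flag would only be a hint.

```python
# Command flag bit 1 as defined by the patch under review.
NBD_CMD_FLAG_MAY_TRIM = 1 << 1
# Hypothetical: the reversed-sense flag suggested in the thread; the bit
# position is an assumption made for this sketch only.
NBD_CMD_FLAG_NO_TRIM = 1 << 1


def server_may_trim_patch(cmd_flags, trim_reads_zero):
    """Patch as posted: trimming is allowed only if the client opted in."""
    return bool(cmd_flags & NBD_CMD_FLAG_MAY_TRIM) and bool(trim_reads_zero)


def server_may_trim_reversed(cmd_flags, trim_reads_zero):
    """Reversed sense: trimming is allowed unless the client opted out."""
    return not (cmd_flags & NBD_CMD_FLAG_NO_TRIM) and bool(trim_reads_zero)


if __name__ == "__main__":
    # A client that sets no command flags gets different behaviour.
    print(server_may_trim_patch(0, True))     # False
    print(server_may_trim_reversed(0, True))  # True
```

The only behavioural difference is the default: with no flag set, the patch as posted forces real zero writes, while the reversed sense lets the server keep the file sparse.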
On 31/03/2016 16:27, Alex Bligh wrote:
> > > IE why not always permit trimming PROVIDED the data always reads back
> > > as zero? This would be far simpler.
> >
> > Because trimming can make future operations more expensive and cause
> > fragmentation (which may not be as bad as it used to be at the media
> > level, but it is still somewhat bad at the filesystem level).
> >
> > So if you want a fully-provisioned file, the simplest way to do so is to
> > write zeroes to it, and trimming is undesirable.
> But isn't the server in a better position to know this than the
> client?

There are at least three possible states for a sector:

- hole (thin-provisioned)

- allocated as data (disk contains actual zeroes)

- allocated as unwritten (blocks reserved on backing storage, reads as
zeroes but the disk may not contain actual zeroes)

It's always okay for the backend to convert a zero block to an unwritten
extent; it's generally not okay for a backend to take a request to
create an unwritten extent and instead create a hole.

It's all an "as if" situation. The server must provide the semantics
requested by the client. For example, writing to a hole could cause
ENOSPC, writing to an unwritten extent could not.

The server might know better, because it certainly is in a better
position to know how to fulfill the client's request. But even if it's
just a hint, it makes sense for NBD to provide it.

It's not a coincidence that this hint exists at all levels: SCSI has an
UNMAP bit that can be set in the WRITE SAME command (and it has UNMAP
which matches NBD's TRIM); the fallocate system call has
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE (plus Linux has the
BLKDISCARD ioctl which again matches NBD's TRIM for block devices).

> EG if the server has a back end implementation (as I suspect
> Ceph on qemu-nbd does)

Ceph doesn't, but gluster does.

> which never actually stores all zero blocks,
> it won't make a difference, and conceivably you're generating a whole
> pile of I/O to avoid sparseness when sparseness might be faster. Take
> for example a persistent memory interface, where fragmentation is
> irrelevant, and writing piles of zeroes to memory is a waste of time.

It certainly isn't a waste of time if your intention is to scrub data
belonging to a previous tenant, before giving access to someone else!
If you have a metadata layer above then you can handle the command there
(that's why we're adding it); if you haven't you do have to write the
zeroes.

Paolo
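The three sector states described above can be written down as a small model. This is a sketch of the reasoning in this message, not text from the patch; the names `SectorState`, `allowed_results` and `later_write_can_hit_enospc` are hypothetical.

```python
from enum import Enum


class SectorState(Enum):
    """The three sector states listed in the message above."""
    HOLE = "hole (thin-provisioned)"
    DATA = "allocated as data"
    UNWRITTEN = "allocated as unwritten"


def allowed_results(may_trim):
    """States a backend may leave a zeroed sector in.

    All three states read back as zero. Without permission to trim, the
    backend may pick either allocated state but must not create a hole.
    """
    allowed = {SectorState.DATA, SectorState.UNWRITTEN}
    if may_trim:
        allowed.add(SectorState.HOLE)
    return allowed


def later_write_can_hit_enospc(state):
    """Only a hole has no reserved blocks, so only it risks ENOSPC later."""
    return state is SectorState.HOLE


if __name__ == "__main__":
    for state in SectorState:
        print(state.name, later_write_can_hit_enospc(state))
```

The model shows why the hint matters even when every state reads as zero: the states differ in what a later write can observe.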
On 03/31/2016 07:02 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
>
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
>
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
>

> +++ b/doc/proto.md
> @@ -261,6 +261,8 @@ immediately after the handshake flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands

Hmm, we've picked overlapping bits between your proposal and mine for
`NBD_FLAG_SEND_DF`. Obviously, whoever goes in second gets bit 7.
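For readers following the bit allocation, a short sketch of how a client might decode the 16-bit transmission flags field. Bits 5 and 6 are taken from the quoted hunk; the names for bits 0 to 4 are assumed from the NBD protocol document and are not restated in this thread. The helper name `decode_transmission_flags` is hypothetical.

```python
# Bits 5 and 6 come from the quoted patch hunk. Bits 0-4 are assumptions
# taken from the NBD protocol document, not from this thread.
TRANSMISSION_FLAGS = {
    0: "NBD_FLAG_HAS_FLAGS",
    1: "NBD_FLAG_READ_ONLY",
    2: "NBD_FLAG_SEND_FLUSH",
    3: "NBD_FLAG_SEND_FUA",
    4: "NBD_FLAG_ROTATIONAL",
    5: "NBD_FLAG_SEND_TRIM",
    6: "NBD_FLAG_SEND_WRITE_ZEROES",
}


def decode_transmission_flags(value):
    """Split a 16-bit flags value into known flag names and unknown bits."""
    known = [name for bit, name in sorted(TRANSMISSION_FLAGS.items())
             if value & (1 << bit)]
    unknown = [bit for bit in range(16)
               if value & (1 << bit) and bit not in TRANSMISSION_FLAGS]
    return known, unknown


if __name__ == "__main__":
    # Bits 0, 5 and 6 set, plus bit 7, which this table does not know.
    print(decode_transmission_flags(0b11100001))
```

Unknown bits are reported rather than rejected, in line with the "Clients SHOULD ignore unknown flags" sentence in the hunk context. A second extension landing on bit 7 would simply be added to the table.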
Hi,

Thanks, applied.

On Thu, Mar 31, 2016 at 04:02:05PM +0300, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>
> There exist some cases when a client knows that the data it is going to
> write is all zeroes. Such cases include mirroring or backing up a device
> implemented by a sparse file.
>
> With current NBD command set, the client has to issue NBD_CMD_WRITE
> command with zeroed payload and transfer these zero bytes through the
> wire. The server has to write the data onto disk, effectively denying
> the sparseness.
>
> To remedy this, the patch adds WRITE_ZEROES extension with one new
> NBD_CMD_WRITE_ZEROES command.
>
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Wouter Verhelst <w@uter.be>
> CC: Alex Bligh <alex@alex.org.uk>
> CC: Eric Blake <eblake@redhat.com>
> ---
> v2:
> - rebased on master
> - explicitly state that the client must not set NBD_CMD_WRITE_ZEROES if
>   support for it wasn't negotiated with the server;
> - add new command flag's description in format suitable for moving to
>   "Command flags" section.
>
>
>  doc/proto.md | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index c1e05c5..a574563 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -261,6 +261,8 @@ immediately after the handshake flags field in oldstyle negotiation:
>    schedule I/O accesses as for a rotational medium
>  - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
>    `NBD_CMD_TRIM` commands
> +- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> +  supports `NBD_CMD_WRITE_ZEROES` commands
>
>  Clients SHOULD ignore unknown flags.
>
> @@ -444,10 +446,13 @@ affects a particular command. Clients MUST NOT set a command flag bit
>  that is not documented for the particular command; and whether a flag is
>  valid may depend on negotiation during the handshake phase.
>
> -- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`. SHOULD be
> -  set to 1 if the client requires "Force Unit Access" mode of
> -  operation. MUST NOT be set unless transmission flags included
> -  `NBD_FLAG_SEND_FUA`.
> +- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
> +  `NBD_CMD_WRITE_ZEROES` commands. SHOULD be set to 1 if the client requires
> +  "Force Unit Access" mode of operation. MUST NOT be set unless transmission
> +  flags included `NBD_FLAG_SEND_FUA`.
> +
> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
> +  extension; see below.
>
>  #### Request types
>
> @@ -523,6 +528,10 @@ The following request types exist:
>      A client MUST NOT send a trim request unless `NBD_FLAG_SEND_TRIM`
>      was set in the transmission flags field.
>
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    Defined by the experimental `WRITE_ZEROES` extension; see below.
> +
>  * Other requests
>
>      Some third-party implementations may require additional protocol
> @@ -654,6 +663,53 @@ option reply type.
>      message if they do not also send it as a reply to the
>      `NBD_OPT_SELECT` message.
>
> +### `WRITE_ZEROES` extension
> +
> +There exist some cases when a client knows that the data it is going to write
> +is all zeroes. Such cases include mirroring or backing up a device implemented
> +by a sparse file. With current NBD command set, the client has to issue
> +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> +through the wire. The server has to write the data onto disk, effectively
> +losing the sparseness.
> +
> +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
> +one new command and one new command flag.
> +
> +* `NBD_CMD_WRITE_ZEROES` (6)
> +
> +    A write request with no payload. Length and offset define the location
> +    and amount of data to be zeroed.
> +
> +    The server MUST zero out the data on disk, and then send the reply
> +    message. The server MAY send the reply message before the data has
> +    reached permanent storage.
> +
> +    A client MUST NOT send a write zeroes request unless
> +    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
> +
> +    If the `NBD_FLAG_SEND_FUA` flag was set in the transmission flags field,
> +    the client MAY set the flag `NBD_CMD_FLAG_FUA` in the command flags field.
> +    If this flag was set, the server MUST NOT send the reply until it has
> +    ensured that the newly-zeroed data has reached permanent storage.
> +
> +    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
> +    flags field, the server MAY use trimming to zero out the area, but it
> +    MUST ensure that the data reads back as zero.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MAY then close the connection.
> +
> +The server SHOULD return `ENOSPC` if it receives a write zeroes request
> +including one or more sectors beyond the size of the device. It SHOULD
> +return `EPERM` if it receives a write zeroes request on a read-only export.
> +
> +The extension adds the following new command flag:
> +
> +- bit 1, `NBD_CMD_FLAG_MAY_TRIM`; valid during `NBD_CMD_WRITE_ZEROES`.
> +  SHOULD be set to 1 if the client allows the server to use trim to perform
> +  the requested operation. The client MAY send `NBD_CMD_FLAG_MAY_TRIM` even
> +  if `NBD_FLAG_SEND_TRIM` was not set in the transmission flags field.
> +
>  ## About this file
>
>  This file tries to document the NBD protocol as it is currently
> --
> 2.1.4
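The two error rules in the quoted extension text (`ENOSPC` past the end of the device, `EPERM` on a read-only export) could be checked on the server roughly as follows. This is a sketch under assumptions: the function name `check_write_zeroes` is hypothetical, Python's `errno` constants stand in for the NBD error values, and the order of the two checks is a choice made here, since the patch does not say which error wins when both apply.

```python
import errno


def check_write_zeroes(offset, length, export_size, read_only):
    """Return 0 if the request may proceed, else the errno value to reply with.

    Assumption: read-only is checked first; the patch text does not
    specify a precedence between the two errors.
    """
    if read_only:
        return errno.EPERM
    if offset + length > export_size:
        # One or more sectors lie beyond the size of the device.
        return errno.ENOSPC
    return 0


if __name__ == "__main__":
    size = 1 << 20  # a hypothetical 1 MiB export
    print(check_write_zeroes(0, 4096, size, read_only=False))
    print(check_write_zeroes(size - 512, 4096, size, read_only=False))
```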
On 04/01/2016 02:37 AM, Wouter Verhelst wrote:
> Hi,
>
> Thanks, applied.
>
> On Thu, Mar 31, 2016 at 04:02:05PM +0300, Denis V. Lunev wrote:
>> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>>
>> There exist some cases when a client knows that the data it is going to
>> write is all zeroes. Such cases include mirroring or backing up a device
>> implemented by a sparse file.
>>
>> With current NBD command set, the client has to issue NBD_CMD_WRITE
>> command with zeroed payload and transfer these zero bytes through the
>> wire. The server has to write the data onto disk, effectively denying
>> the sparseness.
>>

>> +
>> +- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
>> +  extension; see below.

Hmm, we had an unfinished conversation about whether the default sense
of this bit should be reversed. I'll propose a followup patch, now that
the original has been merged.
diff --git a/doc/proto.md b/doc/proto.md
index c1e05c5..a574563 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -261,6 +261,8 @@ immediately after the handshake flags field in oldstyle negotiation:
   schedule I/O accesses as for a rotational medium
 - bit 5, `NBD_FLAG_SEND_TRIM`; should be set to 1 if the server supports
   `NBD_CMD_TRIM` commands
+- bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
+  supports `NBD_CMD_WRITE_ZEROES` commands
 
 Clients SHOULD ignore unknown flags.
 
@@ -444,10 +446,13 @@ affects a particular command. Clients MUST NOT set a command flag bit
 that is not documented for the particular command; and whether a flag is
 valid may depend on negotiation during the handshake phase.
 
-- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE`. SHOULD be
-  set to 1 if the client requires "Force Unit Access" mode of
-  operation. MUST NOT be set unless transmission flags included
-  `NBD_FLAG_SEND_FUA`.
+- bit 0, `NBD_CMD_FLAG_FUA`; valid during `NBD_CMD_WRITE` and
+  `NBD_CMD_WRITE_ZEROES` commands. SHOULD be set to 1 if the client requires
+  "Force Unit Access" mode of operation. MUST NOT be set unless transmission
+  flags included `NBD_FLAG_SEND_FUA`.
+
+- bit 1, `NBD_CMD_MAY_TRIM`; defined by the experimental `WRITE_ZEROES`
+  extension; see below.
 
 #### Request types
 
@@ -523,6 +528,10 @@ The following request types exist:
     A client MUST NOT send a trim request unless `NBD_FLAG_SEND_TRIM`
     was set in the transmission flags field.
 
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    Defined by the experimental `WRITE_ZEROES` extension; see below.
+
 * Other requests
 
     Some third-party implementations may require additional protocol
@@ -654,6 +663,53 @@ option reply type.
     message if they do not also send it as a reply to the
     `NBD_OPT_SELECT` message.
 
+### `WRITE_ZEROES` extension
+
+There exist some cases when a client knows that the data it is going to write
+is all zeroes. Such cases include mirroring or backing up a device implemented
+by a sparse file. With current NBD command set, the client has to issue
+`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
+through the wire. The server has to write the data onto disk, effectively
+losing the sparseness.
+
+To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension adds
+one new command and one new command flag.
+
+* `NBD_CMD_WRITE_ZEROES` (6)
+
+    A write request with no payload. Length and offset define the location
+    and amount of data to be zeroed.
+
+    The server MUST zero out the data on disk, and then send the reply
+    message. The server MAY send the reply message before the data has
+    reached permanent storage.
+
+    A client MUST NOT send a write zeroes request unless
+    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
+
+    If the `NBD_FLAG_SEND_FUA` flag was set in the transmission flags field,
+    the client MAY set the flag `NBD_CMD_FLAG_FUA` in the command flags field.
+    If this flag was set, the server MUST NOT send the reply until it has
+    ensured that the newly-zeroed data has reached permanent storage.
+
+    If the flag `NBD_CMD_FLAG_MAY_TRIM` was set by the client in the command
+    flags field, the server MAY use trimming to zero out the area, but it
+    MUST ensure that the data reads back as zero.
+
+    If an error occurs, the server SHOULD set the appropriate error code
+    in the error field. The server MAY then close the connection.
+
+The server SHOULD return `ENOSPC` if it receives a write zeroes request
+including one or more sectors beyond the size of the device. It SHOULD
+return `EPERM` if it receives a write zeroes request on a read-only export.
+
+The extension adds the following new command flag:
+
+- bit 1, `NBD_CMD_FLAG_MAY_TRIM`; valid during `NBD_CMD_WRITE_ZEROES`.
+  SHOULD be set to 1 if the client allows the server to use trim to perform
+  the requested operation. The client MAY send `NBD_CMD_FLAG_MAY_TRIM` even
+  if `NBD_FLAG_SEND_TRIM` was not set in the transmission flags field.
+
 ## About this file
 
 This file tries to document the NBD protocol as it is currently
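As an illustration of the command the patch defines, here is a minimal client-side sketch that builds an `NBD_CMD_WRITE_ZEROES` request. The request type (6), the command flag bits and transmission flag bit 6 come from the patch. The request magic, the 28-byte field layout and the bit position of `NBD_FLAG_SEND_FUA` are assumptions taken from the NBD protocol document rather than from this diff, and the helper name `build_write_zeroes` is hypothetical.

```python
import struct

# From the patch: request type and flag bits.
NBD_CMD_WRITE_ZEROES = 6
NBD_CMD_FLAG_FUA = 1 << 0
NBD_CMD_FLAG_MAY_TRIM = 1 << 1
NBD_FLAG_SEND_WRITE_ZEROES = 1 << 6
# Assumptions from the NBD protocol document, not shown in this diff.
NBD_REQUEST_MAGIC = 0x25609513
NBD_FLAG_SEND_FUA = 1 << 3
# magic(32) | command flags(16) | type(16) | handle(64) | offset(64) | length(32)
REQUEST_FORMAT = ">IHHQQI"


def build_write_zeroes(transmission_flags, handle, offset, length,
                       fua=False, may_trim=False):
    """Return the request bytes; no payload follows a write zeroes request."""
    if not transmission_flags & NBD_FLAG_SEND_WRITE_ZEROES:
        raise ValueError("server did not set NBD_FLAG_SEND_WRITE_ZEROES")
    cmd_flags = 0
    if fua:
        if not transmission_flags & NBD_FLAG_SEND_FUA:
            raise ValueError("server did not set NBD_FLAG_SEND_FUA")
        cmd_flags |= NBD_CMD_FLAG_FUA
    if may_trim:
        # Per the patch, this does not depend on NBD_FLAG_SEND_TRIM.
        cmd_flags |= NBD_CMD_FLAG_MAY_TRIM
    return struct.pack(REQUEST_FORMAT, NBD_REQUEST_MAGIC, cmd_flags,
                       NBD_CMD_WRITE_ZEROES, handle, offset, length)


if __name__ == "__main__":
    request = build_write_zeroes(NBD_FLAG_SEND_WRITE_ZEROES, handle=1,
                                 offset=0, length=1 << 20, may_trim=True)
    # Zeroing 1 MiB costs only the request header on the wire, whereas
    # NBD_CMD_WRITE would also carry 1 MiB of zero bytes.
    print(len(request))
```

The sketch enforces the two client-side MUST NOT rules from the patch (write zeroes only after negotiation, FUA only when advertised) and leaves everything else to the server.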