
[1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO

Message ID 20190823143426.26838-2-eblake@redhat.com
State New
Series NBD protocol change to add fast zero support

Commit Message

Eric Blake Aug. 23, 2019, 2:34 p.m. UTC
While it may be counterintuitive at first, the introduction of
NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance
regression in qemu [1], when copying a sparse file. When the
destination file must contain the same contents as the source, but it
is not known in advance whether the destination started life with all
zero content, then there are cases where it is faster to request a
bulk zero of the entire device followed by writing only the portions
of the device that are to contain data, as that results in fewer I/O
transactions overall. In fact, there are even situations where
trimming the entire device prior to writing zeroes may be faster than
a bare write zero request [2]. However, if a bulk zero request ever
falls back to the same speed as a normal write, a bulk pre-zeroing
algorithm is actually a pessimization, as it ends up writing portions
of the disk twice.

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html
[2] https://github.com/libguestfs/nbdkit/commit/407f8dde

Hence, it is desirable to have a way for clients to specify that a
particular write zero request is being attempted for a fast wipe, and
get an immediate failure if the zero request would otherwise take the
same time as a write.  Conversely, if the client is not performing a
pre-initialization pass, it is still more efficient in terms of
networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
server implements the fallback to the slower write, than it is for the
client to have to perform the fallback to send NBD_CMD_WRITE with a
zeroed buffer.

Add a protocol flag and corresponding transmission advertisement flag
to make it easier for clients to inform the server of their intent. If
the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
things: to perform a fallback to write when the client does not
request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
lower network overhead); and to fail quickly with ENOTSUP, preferably
without modifying the export, if the client requested the flag but the
server cannot write zeroes more efficiently than a normal write (so
that the client is not penalized with the time of writing data areas
of the disk twice).

Note that the semantics are chosen so that servers should advertise
the new flag whether or not they have fast zeroing (that is, this is
NOT the server advertising that it has fast zeroes, but rather
advertising that the client can get fast feedback as needed on whether
zeroing is fast).  It is also intentional that the new advertisement
includes a new errno value, ENOTSUP, with rules that this error should
not be returned for any pre-existing behaviors, must not happen when
the client does not request a fast zero, and must be returned quickly
if the client requested fast zero but anything other than the error
would not be fast; while leaving it possible for clients to
distinguish other errors like EINVAL if alignment constraints are not
met.  Clients should not send the flag unless the server advertised
support, but well-behaved servers should already be reporting EINVAL
for unrecognized flags. If the server does not advertise the new
feature, clients can safely fall back to assuming that writing zeroes
is no faster than normal writes (whether or not the assumption
actually holds).
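
To make the client side of this contract concrete, here is a minimal C
sketch of the probe-then-decide pattern. The `struct nbd_handle` type and
`nbd_*` helpers are hypothetical stand-ins rather than any real client
library; only the flag values and semantics come from this proposal:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client-side API; only the NBD flag values are real. */
struct nbd_handle;
extern uint16_t nbd_transmission_flags(struct nbd_handle *h);
extern int nbd_write_zeroes(struct nbd_handle *h, uint64_t offset,
                            uint64_t length, uint16_t flags);

#define NBD_FLAG_SEND_FAST_ZERO (1 << 11)  /* transmission flag, bit 11 */
#define NBD_CMD_FLAG_FAST_ZERO  (1 << 4)   /* command flag, bit 4 */

/* Returns true if a bulk pre-zeroing pass is worth attempting. */
static bool fast_zero_probe(struct nbd_handle *h, uint64_t offset,
                            uint64_t length)
{
    /* Never send the flag unless the server advertised it. */
    if (!(nbd_transmission_flags(h) & NBD_FLAG_SEND_FAST_ZERO))
        return false;  /* assume zeroing is no faster than writing */

    int r = nbd_write_zeroes(h, offset, length, NBD_CMD_FLAG_FAST_ZERO);
    if (r == -ENOTSUP)
        return false;  /* server says this zero is as slow as a write */
    return r == 0;     /* fast zero landed; keep pre-zeroing */
}
```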

Note that the Linux fallocate(2) interface may or may not be powerful
enough to easily determine if zeroing will be efficient - in
particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
insight; likewise, for block devices, it is known that
ioctl(BLKZEROOUT) does NOT have a way for userspace to probe if it is
efficient or slow.  But with enough demand, the kernel may add another
FALLOC_FL_ flag to use with FALLOC_FL_ZERO_RANGE, and/or appropriate
ioctls with guaranteed ENOTSUP failures if a fast path cannot be
taken.  If a server cannot easily determine if write zeroes will be
efficient, the server should either fail all NBD_CMD_FLAG_FAST_ZERO
with ENOTSUP, or else choose to not advertise NBD_FLAG_SEND_FAST_ZERO.
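
By way of illustration, a server following that advice might be structured
like the sketch below; `struct export` and the `backend_*` callbacks are
invented for this example, and only the flag and error semantics are
normative:

```c
#include <stdint.h>

#define NBD_CMD_FLAG_FAST_ZERO (1 << 4)
#define NBD_ENOTSUP 95

/* Hypothetical storage callbacks; not part of any real server. */
struct export;
extern int backend_zero_known_fast(struct export *exp);
extern int backend_zero(struct export *exp, uint64_t off, uint32_t len);
extern int backend_write_zero_buffers(struct export *exp, uint64_t off,
                                      uint32_t len);

static int handle_write_zeroes(struct export *exp, uint64_t off,
                               uint32_t len, uint16_t flags)
{
    if (flags & NBD_CMD_FLAG_FAST_ZERO) {
        /* No reliable kernel probe is available: fail fast, and do so
         * before touching the export, rather than risk a slow path. */
        if (!backend_zero_known_fast(exp))
            return NBD_ENOTSUP;
        return backend_zero(exp, off, len);
    }
    /* Without the flag, falling back to a slow write is expected. */
    if (backend_zero(exp, off, len) == 0)
        return 0;
    return backend_write_zero_buffers(exp, off, len);
}
```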

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

Comments

Wouter Verhelst Aug. 23, 2019, 6:48 p.m. UTC | #1
On Fri, Aug 23, 2019 at 09:34:26AM -0500, Eric Blake wrote:
> +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
> +  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
> +  write zeroes any faster than it would for an equivalent
> +  `NBD_CMD_WRITE`,

One way of fulfilling the letter of this requirement but not its spirit
could be to have background writes; that is, the server makes a note
that the zeroed region should contain zeroes, makes an error-free reply
to the client, and then starts updating things in the background (with
proper layering so that an NBD_CMD_READ would see zeroes).

This could negatively impact performance after that command to the
effect that syncing the device would be slower rather than faster, if
not done right.

Do we want to keep that in consideration?
Eric Blake Aug. 23, 2019, 6:58 p.m. UTC | #2
On 8/23/19 1:48 PM, Wouter Verhelst wrote:
> On Fri, Aug 23, 2019 at 09:34:26AM -0500, Eric Blake wrote:
>> +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
>> +  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
>> +  write zeroes any faster than it would for an equivalent
>> +  `NBD_CMD_WRITE`,
> 
> One way of fulfilling the letter of this requirement but not its spirit
> could be to have background writes; that is, the server makes a note
> that the zeroed region should contain zeroes, makes an error-free reply
> to the client, and then starts updating things in the background (with
> proper layering so that an NBD_CMD_READ would see zeroes).

For writes, this should still be viable IF the server can also cancel
that background write of zeroes in favor of a foreground request for
actual data to be written to the same offset.  In other words, as long
as the behavior to the client is "as if" there is no duplicated I/O
cost, the zero appears fast, even if it kicked off a long-running async
process to actually accomplish it.
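
A sketch of that cancellation, with an invented background-zero range list
and most error handling elided, might look like the following; a real
server would also have to order this against I/O already in flight:

```c
#include <stdint.h>
#include <stdlib.h>

/* Invented bookkeeping: queued background zeroes not yet written out. */
struct bg_zero { uint64_t off, len; struct bg_zero *next; };

/* Before a foreground write at [off, off+len), carve that interval out
 * of every queued background zero, so the write is never clobbered
 * later while zeroing outside the written range still happens. */
static void carve_out(struct bg_zero **p, uint64_t off, uint64_t len)
{
    while (*p) {
        struct bg_zero *z = *p;
        uint64_t z_end = z->off + z->len, end = off + len;
        if (end <= z->off || z_end <= off) {       /* no overlap */
            p = &z->next;
        } else if (z->off < off && end < z_end) {  /* write splits z */
            struct bg_zero *tail = malloc(sizeof *tail);
            if (!tail)
                abort();                           /* handling elided */
            tail->off = end;
            tail->len = z_end - end;
            tail->next = z->next;
            z->len = off - z->off;
            z->next = tail;
            p = &tail->next;
        } else if (z->off < off) {                 /* trim z's tail */
            z->len = off - z->off;
            p = &z->next;
        } else if (end < z_end) {                  /* trim z's head */
            z->off = end;
            z->len = z_end - end;
            p = &z->next;
        } else {                                   /* fully covered */
            *p = z->next;
            free(z);
        }
    }
}
```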

> 
> This could negatively impact performance after that command to the
> effect that syncing the device would be slower rather than faster, if
> not done right.

Oh. I see - for flush requests, you're worried about the cost of the
flush forcing the I/O for the background zero to complete before flush
can return.

Perhaps that merely means that a client using fast zero requests as a
means of probing whether it can do a bulk pre-zero pass even though it
will be rewriting part of that image with data later SHOULD NOT attempt
to flush the disk until all other interesting write requests are also
ready to queue.  In the 'qemu-img convert' case which spawned this
discussion, that's certainly the case (qemu-img does not call flush
after the pre-zeroing, but only after all data is copied - and then it
really DOES want to wait for any remaining backgrounded zeroing to land
on the disk along with any normal writes when it does its final flush).

> 
> Do we want to keep that in consideration?

Ideas on how best to add what I mentioned above into the specification?
Wouter Verhelst Aug. 24, 2019, 6:44 a.m. UTC | #3
On Fri, Aug 23, 2019 at 01:58:44PM -0500, Eric Blake wrote:
> On 8/23/19 1:48 PM, Wouter Verhelst wrote:
> > On Fri, Aug 23, 2019 at 09:34:26AM -0500, Eric Blake wrote:
> >> +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
> >> +  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
> >> +  write zeroes any faster than it would for an equivalent
> >> +  `NBD_CMD_WRITE`,
> > 
> > One way of fulfilling the letter of this requirement but not its spirit
> > could be to have background writes; that is, the server makes a note
> > that the zeroed region should contain zeroes, makes an error-free reply
> > to the client, and then starts updating things in the background (with
> > proper layering so that an NBD_CMD_READ would see zeroes).
> 
> For writes, this should still be viable IF the server can also cancel
> that background write of zeroes in favor of a foreground request for
> actual data to be written to the same offset.  In other words, as long
> as the behavior to the client is "as if" there is no duplicated I/O
> cost, the zero appears fast, even if it kicked off a long-running async
> process to actually accomplish it.

That's kind of what I was thinking of, yeah.

A background write would cause disk I/O, which *will* cause any writes
that happen concurrently with it to slow down. If we need to write
several orders of magnitude of zeroes, then the "fast zero" will
actually cause the following writes to slow down, which could impact
performance.

The cancelling should indeed happen (otherwise ordering of writes will
be wrong, as per the spec), but that doesn't negate the performance
impact.

> > This could negatively impact performance after that command to the
> > effect that syncing the device would be slower rather than faster, if
> > not done right.
> 
> Oh. I see - for flush requests, you're worried about the cost of the
> flush forcing the I/O for the background zero to complete before flush
> can return.
> 
> Perhaps that merely means that a client using fast zero requests as a
> means of probing whether it can do a bulk pre-zero pass even though it
> will be rewriting part of that image with data later SHOULD NOT attempt
> to flush the disk until all other interesting write requests are also
> ready to queue.  In the 'qemu-img convert' case which spawned this
> discussion, that's certainly the case (qemu-img does not call flush
> after the pre-zeroing, but only after all data is copied - and then it
> really DOES want to wait for any remaining backgrounded zeroing to land
> on the disk along with any normal writes when it does its final flush).

Not what I meant, but also a good point, thanks :)

> > Do we want to keep that in consideration?
> 
> Ideas on how best to add what I mentioned above into the specification?

Perhaps clarify that the "fast zero" flag is meant to *improve*
performance, and that it therefore should either be implemented in a way
that does in fact improve performance, or not at all?
Vladimir Sementsov-Ogievskiy Aug. 28, 2019, 9:57 a.m. UTC | #4
23.08.2019 17:34, Eric Blake wrote:
> While it may be counterintuitive at first, the introduction of
> NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance
> regression in qemu [1], when copying a sparse file. When the
> destination file must contain the same contents as the source, but it
> is not known in advance whether the destination started life with all
> zero content, then there are cases where it is faster to request a
> bulk zero of the entire device followed by writing only the portions
> of the device that are to contain data, as that results in fewer I/O
> transactions overall. In fact, there are even situations where
> trimming the entire device prior to writing zeroes may be faster than
> a bare write zero request [2]. However, if a bulk zero request ever
> falls back to the same speed as a normal write, a bulk pre-zeroing
> algorithm is actually a pessimization, as it ends up writing portions
> of the disk twice.
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html
> [2] https://github.com/libguestfs/nbdkit/commit/407f8dde
> 
> Hence, it is desirable to have a way for clients to specify that a
> particular write zero request is being attempted for a fast wipe, and
> get an immediate failure if the zero request would otherwise take the
> same time as a write.  Conversely, if the client is not performing a
> pre-initialization pass, it is still more efficient in terms of
> networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
> server implements the fallback to the slower write, than it is for the
> client to have to perform the fallback to send NBD_CMD_WRITE with a
> zeroed buffer.

How are you going to finally use it in qemu-img convert? Ok, we have a loop
of sending write-zero requests, and on the first ENOTSUP we'll assume that
there is no benefit in continuing? But what if the server actually returns
ENOTSUP only once out of 1000 iterations? It seems we should still do the
zeroing if we get only a few ENOTSUPs...

I understand that fail-on-first-ENOTSUP is OK for raw-without-fallocate vs
qcow2, as the first will always return ENOTSUP and the second will never
fail... But in that case we would be fine with a simpler extension that has
only one server-advertised negotiation flag, NBD_FLAG_ZERO_IS_FAST.

There is no such problem if we have only one iteration, so maybe a new
command FILL_ZERO, filling the whole device with zeroes?

> 
> Add a protocol flag and corresponding transmission advertisement flag
> to make it easier for clients to inform the server of their intent. If
> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
> things: to perform a fallback to write when the client does not
> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
> lower network overhead); and to fail quickly with ENOTSUP, preferably
> without modifying the export, if the client requested the flag but the
> server cannot write zeroes more efficiently than a normal write (so
> that the client is not penalized with the time of writing data areas
> of the disk twice).
> 
> Note that the semantics are chosen so that servers should advertise
> the new flag whether or not they have fast zeroing (that is, this is
> NOT the server advertising that it has fast zeroes, but rather
> advertising that the client can get fast feedback as needed on whether
> zeroing is fast).  It is also intentional that the new advertisement
> includes a new errno value, ENOTSUP, with rules that this error should
> not be returned for any pre-existing behaviors, must not happen when
> the client does not request a fast zero, and must be returned quickly
> if the client requested fast zero but anything other than the error
> would not be fast; while leaving it possible for clients to
> distinguish other errors like EINVAL if alignment constraints are not
> met.  Clients should not send the flag unless the server advertised
> support, but well-behaved servers should already be reporting EINVAL
> for unrecognized flags. If the server does not advertise the new
> feature, clients can safely fall back to assuming that writing zeroes
> is no faster than normal writes (whether or not the assumption
> actually holds).
> 
> Note that the Linux fallocate(2) interface may or may not be powerful
> enough to easily determine if zeroing will be efficient - in
> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
> insight; likewise, for block devices, it is known that
> ioctl(BLKZEROOUT) does NOT have a way for userspace to probe if it is
> efficient or slow.  But with enough demand, the kernel may add another
> FALLOC_FL_ flag to use with FALLOC_FL_ZERO_RANGE, and/or appropriate
> ioctls with guaranteed ENOTSUP failures if a fast path cannot be
> taken.  If a server cannot easily determine if write zeroes will be
> efficient, the server should either fail all NBD_CMD_FLAG_FAST_ZERO
> with ENOTSUP, or else choose to not advertise NBD_FLAG_SEND_FAST_ZERO.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 52d3e7b..702688b 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -1070,6 +1070,18 @@ The field has the following format:
>     which support the command without advertising this bit, and
>     conversely that this bit does not guarantee that the command will
>     succeed or have an impact.
> +- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether
> +  `NBD_CMD_WRITE_ZEROES` is faster than a corresponding write. The
> +  server MUST set this transmission flag to 1 if the
> +  `NBD_CMD_WRITE_ZEROES` request supports the `NBD_CMD_FLAG_FAST_ZERO`
> +  flag, and MUST set this transmission flag to 0 if
> +  `NBD_FLAG_SEND_WRITE_ZEROES` is not set. Servers MAY set this
> +  transmission flag even if it will always use `NBD_ENOTSUP` failures for
> +  requests with `NBD_CMD_FLAG_FAST_ZERO` set (such as if the server
> +  cannot quickly determine whether a particular write zeroes request
> +  will be faster than a regular write). Clients MUST NOT set the
> +  `NBD_CMD_FLAG_FAST_ZERO` request flag unless this transmission flag
> +  is set.
> 
>   Clients SHOULD ignore unknown flags.
> 
> @@ -1647,6 +1659,12 @@ valid may depend on negotiation during the handshake phase.
>     MUST NOT send metadata on more than one extent in the reply. Client
>     implementors should note that using this flag on multiple contiguous
>     requests is likely to be inefficient.
> +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
> +  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
> +  write zeroes any faster than it would for an equivalent
> +  `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of
> +  `NBD_ENOTSUP`. The client MUST NOT set this unless the server advertised
> +  `NBD_FLAG_SEND_FAST_ZERO`.
> 
>   ##### Structured reply flags
> 
> @@ -2015,7 +2033,10 @@ The following request types exist:
>       reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use.
> 
>       A client MUST NOT send a write zeroes request unless
> -    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
> +    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags
> +    field. Additionally, a client MUST NOT send the
> +    `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was
> +    set in the transmission flags field.
> 
>       By default, the server MAY use trimming to zero out the area, even
>       if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure
> @@ -2025,6 +2046,28 @@ The following request types exist:
>       same area will not cause fragmentation or cause failure due to
>       insufficient space.
> 
> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
> +    with `NBD_ENOTSUP`, even if the operation is no faster than a
> +    corresponding `NBD_CMD_WRITE`. Conversely, if
> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
> +    `NBD_ENOTSUP` unless the request can be serviced in less time than
> +    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
> +    of the export when returning this failure. The server's
> +    determination of a fast request MAY depend on a number of factors,
> +    such as whether the request was suitably aligned, on whether the
> +    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
> +    previous `NBD_CMD_TRIM` had been performed on the region.  If the
> +    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
> +    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
> +    a request, and SHOULD fail with `NBD_EINVAL` if the
> +    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
> +    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
> +    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
> +    the flag is set if the server cannot quickly determine in advance
> +    whether that request would have been fast, even if it turns out
> +    that the same request without the flag would be fast after all.
> +

What if WRITE_ZERO is on average faster than WRITE (for example by 20%),
but the server can never guarantee performance for a single WRITE_ZERO
operation; do you restrict such a case? Hmm, OK, SHOULD is not MUST actually...

>       If an error occurs, the server MUST set the appropriate error code
>       in the error field.
> 
> @@ -2125,6 +2168,7 @@ The following error values are defined:
>   * `NBD_EINVAL` (22), Invalid argument.
>   * `NBD_ENOSPC` (28), No space left on device.
>   * `NBD_EOVERFLOW` (75), Value too large.
> +* `NBD_ENOTSUP` (95), Operation not supported.
>   * `NBD_ESHUTDOWN` (108), Server is in the process of being shut down.
> 
>   The server SHOULD return `NBD_ENOSPC` if it receives a write request
> @@ -2139,6 +2183,10 @@ read-only export.
>   The server SHOULD NOT return `NBD_EOVERFLOW` except as documented in
>   response to `NBD_CMD_READ` when `NBD_CMD_FLAG_DF` is supported.
> 
> +The server SHOULD NOT return `NBD_ENOTSUP` except as documented in
> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
> +supported.
> +
>   The server SHOULD return `NBD_EINVAL` if it receives an unknown command.
> 
>   The server SHOULD return `NBD_EINVAL` if it receives an unknown
>
Eric Blake Aug. 28, 2019, 1:04 p.m. UTC | #5
On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:

>> Hence, it is desirable to have a way for clients to specify that a
>> particular write zero request is being attempted for a fast wipe, and
>> get an immediate failure if the zero request would otherwise take the
>> same time as a write.  Conversely, if the client is not performing a
>> pre-initialization pass, it is still more efficient in terms of
>> networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
>> server implements the fallback to the slower write, than it is for the
>> client to have to perform the fallback to send NBD_CMD_WRITE with a
>> zeroed buffer.
> 
> How are you going to finally use it in qemu-img convert?

It's already in use there (in fact, the cover letter shows actual timing
examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which translates
to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).

> Ok, we have a loop
> of sending write-zero requests, and on the first ENOTSUP we'll assume that
> there is no benefit in continuing? But what if the server actually returns
> ENOTSUP only once out of 1000 iterations? It seems we should still do the
> zeroing if we get only a few ENOTSUPs...

When attempting a bulk zero, you try to wipe the entire device,
presumably with something that is large and aligned.  Yes, if you have
to split the write zero request due to size limitations, then you risk
that the first write zero succeeds but later ones fail, then you didn't
wipe the entire disk, but you also don't need to redo the work on the
first half of the image.  But it is much more likely that the first
write of the bulk zero is representative of the overall operation (and
so in practice, it only takes one fast zero attempt to learn if bulk
zeroing is worthwhile, then continue with fast zeroes without issues).
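
In sketch form (again with a hypothetical `nbd_write_zeroes` client helper
rather than a real API), that strategy is roughly:

```c
#include <stdbool.h>
#include <stdint.h>

struct nbd_handle;  /* hypothetical client handle */
extern int nbd_write_zeroes(struct nbd_handle *h, uint64_t offset,
                            uint64_t length, uint16_t flags);
#define NBD_CMD_FLAG_FAST_ZERO (1 << 4)

/* Wipe the device in chunks no larger than max_len (assumed > 0); the
 * first chunk doubles as the probe.  A later failure aborts the pass,
 * but chunks already zeroed stay valid and need not be redone. */
static bool bulk_pre_zero(struct nbd_handle *h, uint64_t size,
                          uint64_t max_len)
{
    for (uint64_t off = 0; off < size; off += max_len) {
        uint64_t len = size - off < max_len ? size - off : max_len;
        if (nbd_write_zeroes(h, off, len, NBD_CMD_FLAG_FAST_ZERO) != 0)
            return false;  /* write zeroes along with the data pass */
    }
    return true;  /* entire device now reads as zeroes */
}
```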

> 
> I understand that fail-on-first-ENOTSUP is OK for raw-without-fallocate vs
> qcow2, as the first will always return ENOTSUP and the second will never
> fail... But in that case we would be fine with a simpler extension that has
> only one server-advertised negotiation flag, NBD_FLAG_ZERO_IS_FAST.

Advertising that a server's zero behavior is always going to be
successfully fast is a much harder flag to implement.  The justification
for the semantics I chose (advertising that the server can quickly
report failure if success is not fast, but not requiring fast zero)
covers the case when the decision of whether a zero is fast may also
depend on other factors - for example, if the server knows the image
starts in an all-zero state, then it can track a boolean: all write zero
requests while the boolean is set return immediate success (nothing to
do), but after the first regular write, the boolean is set to false, and
all further write zero requests fail as being potentially slow; and such
an implementation is still useful for the qemu-img convert case.
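
As a sketch (with the server state and hooks invented purely for
illustration), that bookkeeping amounts to:

```c
#include <stdbool.h>
#include <stdint.h>

#define NBD_ENOTSUP 95

/* Illustrative server state: the export began life all zeroes. */
struct export_state {
    bool all_zeroes;  /* cleared by the first regular write */
};

static int fast_zero_request(struct export_state *st,
                             uint64_t off, uint64_t len)
{
    (void)off; (void)len;  /* a finer-grained server could track ranges */
    if (st->all_zeroes)
        return 0;          /* nothing to do: trivially fast success */
    return NBD_ENOTSUP;    /* potentially slow; tell the client now */
}

static void note_regular_write(struct export_state *st)
{
    st->all_zeroes = false;  /* any data write disables the shortcut */
}
```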

> 
> There is no such problem if we have only one iteration, so maybe a new
> command FILL_ZERO, filling the whole device with zeroes?

Or better yet, implement support for 64-bit commands.  Yes, my cover
letter called out further orthogonal extensions, and implementing 64-bit
zeroing (so that you can issue a write zero request over the entire
image in one command), as well as a way for a server to advertise when
the image begins life in an all-zero state, are also further extensions
coming down the pipeline.  But as not all servers have to implement all
of the extensions, each extension that can be orthogonally implemented
and show an improvement on its own is still worthwhile; and my cover
letter has shown that fast zeroes on their own make a measurable
difference to certain workloads.

>> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
>> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
>> +    with `NBD_ENOTSUP`, even if the operation is no faster than a
>> +    corresponding `NBD_CMD_WRITE`. Conversely, if
>> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
>> +    `NBD_ENOTSUP` unless the request can be serviced in less time than
>> +    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
>> +    of the export when returning this failure. The server's
>> +    determination of a fast request MAY depend on a number of factors,
>> +    such as whether the request was suitably aligned, on whether the
>> +    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
>> +    previous `NBD_CMD_TRIM` had been performed on the region.  If the
>> +    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
>> +    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
>> +    a request, and SHOULD fail with `NBD_EINVAL` if the
>> +    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
>> +    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
>> +    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
>> +    the flag is set if the server cannot quickly determine in advance
>> +    whether that request would have been fast, even if it turns out
>> +    that the same request without the flag would be fast after all.
>> +
> 
> What if WRITE_ZERO is on average faster than WRITE (for example by 20%),
> but the server can never guarantee performance for a single WRITE_ZERO
> operation; do you restrict such a case? Hmm, OK, SHOULD is not MUST actually...

I think my followup mail, based on Wouter's questions, covers this: the
goal is to document the use case of optimizing the copy of a sparse
image, by probing whether a bulk pre-zeroing pass is worthwhile.  That
should be the measuring rod - if the implementation can perform a faster
sparse copy because of write zeroes that are sometimes, but not always,
faster than writes, in spite of the duplicated I/O that happens to the
data portions of the image that were touched twice by the pre-zero pass
and then the actual data pass, then succeeding on fast zero requests is
okay.  But if it makes the overall image copy slower, then failing with
ENOTSUP is probably better.  And at the end of the day, it is really
just a heuristic - if the server guessed wrong, the worst that happens
is slower performance (and not data corruption).
Vladimir Sementsov-Ogievskiy Aug. 28, 2019, 1:45 p.m. UTC | #6
28.08.2019 16:04, Eric Blake wrote:
> On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> Hence, it is desirable to have a way for clients to specify that a
>>> particular write zero request is being attempted for a fast wipe, and
>>> get an immediate failure if the zero request would otherwise take the
>>> same time as a write.  Conversely, if the client is not performing a
>>> pre-initialization pass, it is still more efficient in terms of
>>> networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
>>> server implements the fallback to the slower write, than it is for the
>>> client to have to perform the fallback to send NBD_CMD_WRITE with a
>>> zeroed buffer.
>>
>> How are you going to finally use it in qemu-img convert?
> 
> It's already in use there (in fact, the cover letter shows actual timing
> examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which translates
> to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).
> 
>> Ok, we have a loop
>> of sending write-zero requests, and on the first ENOTSUP we'll assume that
>> there is no benefit in continuing? But what if the server actually returns
>> ENOTSUP only once out of 1000 iterations? It seems we should still do the
>> zeroing if we get only a few ENOTSUPs...
> 
> When attempting a bulk zero, you try to wipe the entire device,
> presumably with something that is large and aligned.  Yes, if you have
> to split the write zero request due to size limitations, then you risk
> that the first write zero succeeds but later ones fail; in that case you
> didn't wipe the entire disk, but you also don't need to redo the work on the
> first half of the image.  But it is much more likely that the first
> write of the bulk zero is representative of the overall operation (and
> so in practice, it only takes one fast zero attempt to learn if bulk
> zeroing is worthwhile, then continue with fast zeroes without issues).
> 
>>
>> I understand that fail-on-first-ENOTSUP is OK for raw-without-fallocate vs
>> qcow2, as the first will always return ENOTSUP and the second will never
>> fail... But in that case we would be fine with a simpler extension that has
>> only one server-advertised negotiation flag, NBD_FLAG_ZERO_IS_FAST.
> 
> Advertising that a server's zero behavior is always going to be
> successfully fast is a much harder flag to implement.  The justification
> for the semantics I chose (advertising that the server can quickly
> report failure if success is not fast, but not requiring fast zero)
> covers the case when the decision of whether a zero is fast may also
> depend on other factors - for example, if the server knows the image
> starts in an all-zero state, then it can track a boolean: all write zero
> requests while the boolean is set return immediate success (nothing to
> do), but after the first regular write, the boolean is set to false, and
> all further write zero requests fail as being potentially slow; and such
> an implementation is still useful for the qemu-img convert case.

Agreed, thanks for this example :)

> 
>>
>> There is no such problem if we have only one iteration, so maybe a new
>> command FILL_ZERO, filling the whole device with zeroes?
> 
> Or better yet, implement support for 64-bit commands.  Yes, my cover
> letter called out further orthogonal extensions, and implementing 64-bit
> zeroing (so that you can issue a write zero request over the entire
> image in one command), as well as a way for a server to advertise when
> the image begins life in an all-zero state, are also further extensions
> coming down the pipeline.  But as not all servers have to implement all
> of the extensions, each extension that can be orthogonally implemented
> and show an improvement on its own is still worthwhile; and my cover
> letter has shown that fast zeroes on their own make a measurable
> difference to certain workloads.
> 
>>> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
>>> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
>>> +    with `NBD_ENOTSUP`, even if the operation is no faster than a
>>> +    corresponding `NBD_CMD_WRITE`. Conversely, if
>>> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
>>> +    `NBD_ENOTSUP` unless the request can be serviced in less time than
>>> +    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
>>> +    of the export when returning this failure. The server's
>>> +    determination of a fast request MAY depend on a number of factors,
>>> +    such as whether the request was suitably aligned, on whether the
>>> +    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
>>> +    previous `NBD_CMD_TRIM` had been performed on the region.  If the
>>> +    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
>>> +    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
>>> +    a request, and SHOULD fail with `NBD_EINVAL` if the
>>> +    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
>>> +    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
>>> +    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
>>> +    the flag is set if the server cannot quickly determine in advance
>>> +    whether that request would have been fast, even if it turns out
>>> +    that the same request without the flag would be fast after all.
>>> +
>>
>> What if WRITE_ZERO is on average faster than WRITE (for example by 20%),
>> but the server can never guarantee performance for a single WRITE_ZERO
>> operation; do you restrict such a case? Hmm, OK, SHOULD is not MUST actually...
> 
> I think my followup mail, based on Wouter's questions, covers this: the

Hmm, OK, sorry for duplication

> goal is to document the use case of optimizing the copy of a sparse
> image, by probing whether a bulk pre-zeroing pass is worthwhile.  That
> should be the measuring rod - if the implementation can perform a faster
> sparse copy because of write zeroes that are sometimes, but not always,
> faster than writes, in spite of the duplicated I/O that happens to the
> data portions of the image that were touched twice by the pre-zero pass
> and then the actual data pass, then succeeding on fast zero requests is
> okay.  But if it makes the overall image copy slower, then failing with
> ENOTSUP is probably better.  And at the end of the day, it is really
> just a heuristic - if the server guessed wrong, the worst that happens
> is slower performance (and not data corruption).
> 

OK, I understand, then it all sounds good

Patch

diff --git a/doc/proto.md b/doc/proto.md
index 52d3e7b..702688b 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -1070,6 +1070,18 @@  The field has the following format:
   which support the command without advertising this bit, and
   conversely that this bit does not guarantee that the command will
   succeed or have an impact.
+- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether
+  `NBD_CMD_WRITE_ZEROES` is faster than a corresponding write. The
+  server MUST set this transmission flag to 1 if the
+  `NBD_CMD_WRITE_ZEROES` request supports the `NBD_CMD_FLAG_FAST_ZERO`
+  flag, and MUST set this transmission flag to 0 if
+  `NBD_FLAG_SEND_WRITE_ZEROES` is not set. Servers MAY set this
+  transmission flag even if it will always use `NBD_ENOTSUP` failures for
+  requests with `NBD_CMD_FLAG_FAST_ZERO` set (such as if the server
+  cannot quickly determine whether a particular write zeroes request
+  will be faster than a regular write). Clients MUST NOT set the
+  `NBD_CMD_FLAG_FAST_ZERO` request flag unless this transmission flag
+  is set.

 Clients SHOULD ignore unknown flags.

@@ -1647,6 +1659,12 @@  valid may depend on negotiation during the handshake phase.
   MUST NOT send metadata on more than one extent in the reply. Client
   implementors should note that using this flag on multiple contiguous
   requests is likely to be inefficient.
+- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
+  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
+  write zeroes any faster than it would for an equivalent
+  `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of
+  `NBD_ENOTSUP`. The client MUST NOT set this unless the server advertised
+  `NBD_FLAG_SEND_FAST_ZERO`.

 ##### Structured reply flags

@@ -2015,7 +2033,10 @@  The following request types exist:
     reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use.

     A client MUST NOT send a write zeroes request unless
-    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
+    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags
+    field. Additionally, a client MUST NOT send the
+    `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was
+    set in the transmission flags field.

     By default, the server MAY use trimming to zero out the area, even
     if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure
@@ -2025,6 +2046,28 @@  The following request types exist:
     same area will not cause fragmentation or cause failure due to
     insufficient space.

+    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
+    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
+    with `NBD_ENOTSUP`, even if the operation is no faster than a
+    corresponding `NBD_CMD_WRITE`. Conversely, if
+    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
+    `NBD_ENOTSUP` unless the request can be serviced in less time than
+    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
+    of the export when returning this failure. The server's
+    determination of a fast request MAY depend on a number of factors,
+    such as whether the request was suitably aligned, on whether the
+    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
+    previous `NBD_CMD_TRIM` had been performed on the region.  If the
+    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
+    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
+    a request, and SHOULD fail with `NBD_EINVAL` if the
+    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
+    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
+    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
+    the flag is set if the server cannot quickly determine in advance
+    whether that request would have been fast, even if it turns out
+    that the same request without the flag would be fast after all.
+
     If an error occurs, the server MUST set the appropriate error code
     in the error field.

@@ -2125,6 +2168,7 @@  The following error values are defined:
 * `NBD_EINVAL` (22), Invalid argument.
 * `NBD_ENOSPC` (28), No space left on device.
 * `NBD_EOVERFLOW` (75), Value too large.
+* `NBD_ENOTSUP` (95), Operation not supported.
 * `NBD_ESHUTDOWN` (108), Server is in the process of being shut down.

 The server SHOULD return `NBD_ENOSPC` if it receives a write request
@@ -2139,6 +2183,10 @@  read-only export.
 The server SHOULD NOT return `NBD_EOVERFLOW` except as documented in
 response to `NBD_CMD_READ` when `NBD_CMD_FLAG_DF` is supported.

+The server SHOULD NOT return `NBD_ENOTSUP` except as documented in
+response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
+supported.
+
 The server SHOULD return `NBD_EINVAL` if it receives an unknown command.

 The server SHOULD return `NBD_EINVAL` if it receives an unknown