diff mbox series

nbd/server: Honor FUA request on NBD_CMD_TRIM

Message ID 20180307225732.155835-1-eblake@redhat.com
State New
Headers show
Series nbd/server: Honor FUA request on NBD_CMD_TRIM | expand

Commit Message

Eric Blake March 7, 2018, 10:57 p.m. UTC
The NBD spec states that since trim requests can affect disk contents,
then they should allow for FUA semantics just like writes for ensuring
the disk has settled before returning.  As bdrv_[co_]pdiscard() does
not (yet?) support a flags argument, we can't pass FUA down the block
layer stack, and must therefore emulate it with a flush at the NBD
layer.

Signed-off-by: Eric Blake <eblake@redhat.com>
---

Question for Paolo: does ISCSI support the notion of FUA on a
TRIM request (where we could better emulate a guest TRIM request
with FUA all the way through our stack to the NBD server), or is
FUA just for normal writes?  Likewise, are you familiar enough
with the kernel's NBD module to know if the kernel as an NBD client
would ever request FUA on a discard request?

Question for Kevin: should we update the block layer to have a
flag arguments to bdrv_co_pdiscard (right now, the only valid
flag would be BDRV_REQ_FUA, and we'd probably need a
supported_discard_flags in parallel to supported_write_flags),
and implement qemu-io -c 'discard -f' for easily testing the use
of that flag?

Depending on answers to those questions, I may want to spin a
v2 patch that adds flag support throughout the block layer
discard implementation, rather than this patch which just does
it in NBD; but if nothing else, this is the shortest patch
possible to fix the (corner-case?) NBD spec non-compliance.

 nbd/server.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Paolo Bonzini March 8, 2018, 6:50 a.m. UTC | #1
> The NBD spec states that since trim requests can affect disk contents,
> then they should allow for FUA semantics just like writes for ensuring
> the disk has settled before returning.  As bdrv_[co_]pdiscard() does
> not (yet?) support a flags argument, we can't pass FUA down the block
> layer stack, and must therefore emulate it with a flush at the NBD
> layer.

TRIM requests should not need FUA since they're just advisory.  On
the other hand, WRITE ZEROES requests need to support FUA.

Paolo
Eric Blake March 8, 2018, 2:45 p.m. UTC | #2
On 03/08/2018 12:50 AM, Paolo Bonzini wrote:
> 
>> The NBD spec states that since trim requests can affect disk contents,
>> then they should allow for FUA semantics just like writes for ensuring
>> the disk has settled before returning.  As bdrv_[co_]pdiscard() does
>> not (yet?) support a flags argument, we can't pass FUA down the block
>> layer stack, and must therefore emulate it with a flush at the NBD
>> layer.
> 
> TRIM requests should not need FUA since they're just advisory.  On
> the other hand, WRITE ZEROES requests need to support FUA.

It looks like qemu's NBD implementation properly supports FUA on WRITE 
ZEROES requests.  The block layer in bdrv_co_do_pwrite_zeroes checks for 
FUA support for both an actual zero request to the driver, as well as 
any fallback write requests, and does a followup flush if any driver 
request did not support BDRV_REQ_FUA.  Even when BDRV_REQ_MAY_UNMAP is 
passed through to the driver and allows the zeroes to be written by a 
trim, then either that trim must also honor FUA semantics 
(block/nbd{,-client}.c does this), or the driver can't advertise FUA 
support on zeroes (block/iscsi.c does this) and the block layer fallback 
kicks in.  So I think we're fine on that front.

Still, while you argue that TRIM is advisory (which I agree), if it does 
nothing, then you've (implicitly) honored FUA (that transaction didn't 
affect persistent storage, so you didn't have to wait any longer for 
anything to land); but if it DOES change the disk contents, then waiting 
for that change to land IS worth supporting, hence why the NBD protocol 
requires the FUA flag to be honored on trim.
Paolo Bonzini March 8, 2018, 3:22 p.m. UTC | #3
On 08/03/2018 15:45, Eric Blake wrote:
> On 03/08/2018 12:50 AM, Paolo Bonzini wrote:
>>> The NBD spec states that since trim requests can affect disk contents,
>>> then they should allow for FUA semantics just like writes for ensuring
>>> the disk has settled before returning.  As bdrv_[co_]pdiscard() does
>>> not (yet?) support a flags argument, we can't pass FUA down the block
>>> layer stack, and must therefore emulate it with a flush at the NBD
>>> layer.
>>
>> TRIM requests should not need FUA since they're just advisory.
> 
> Still, while you argue that TRIM is advisory (which I agree), if it does
> nothing, then you've (implicitly) honored FUA (that transaction didn't
> affect persistent storage, so you didn't have to wait any longer for
> anything to land); but if it DOES change the disk contents, then waiting
> for that change to land IS worth supporting, hence why the NBD protocol
> requires the FUA flag to be honored on trim.

But if power fails, after restart you cannot see the difference between
a TRIM command that chose to did nothing, and one that chose to change
the disk contents but failed to persist the changes.  This is why I
thought there is no need for FUA in my opinion.

I suppose in principle you could detect the change by reading the
TRIMmed sectors and writing to another disk.  So TRIM would have to be a
Schroedinger command that is persistent once you read the sectors, and
that makes little sense.  The problem is, SCSI doesn't have a FUA flag
either...

Paolo
Eric Blake March 8, 2018, 4:05 p.m. UTC | #4
On 03/08/2018 09:22 AM, Paolo Bonzini wrote:
>>> TRIM requests should not need FUA since they're just advisory.
>>
>> Still, while you argue that TRIM is advisory (which I agree), if it does
>> nothing, then you've (implicitly) honored FUA (that transaction didn't
>> affect persistent storage, so you didn't have to wait any longer for
>> anything to land); but if it DOES change the disk contents, then waiting
>> for that change to land IS worth supporting, hence why the NBD protocol
>> requires the FUA flag to be honored on trim.
> 
> But if power fails, after restart you cannot see the difference between
> a TRIM command that chose to did nothing, and one that chose to change
> the disk contents but failed to persist the changes.  This is why I
> thought there is no need for FUA in my opinion.
> 
> I suppose in principle you could detect the change by reading the
> TRIMmed sectors and writing to another disk.  So TRIM would have to be a
> Schroedinger command that is persistent once you read the sectors, and
> that makes little sense.  The problem is, SCSI doesn't have a FUA flag
> either...

The documentation of NBD_CMD_TRIM says that in general you must not 
expect reliable read results from the area you trimmed (since the 
command is advisory, you don't know if you would read the old data 
unchanged, all zeroes, or even random unrelated data).  But if you know 
that a particular server treats TRIM as mandatory rather than advisory, 
and also guarantees a reads-as-zero after a successful TRIM, then for 
that particular server, the FUA flag on TRIM makes sense.  The 
documentation for NBD_CMD_BLOCK_STATUS also points out that block status 
may, but not must, be altered by NBD_CMD_TRIM, which might be another 
way to observer how much of a TRIM request was advisory.

At any rate, your argument makes sense that because bdrv_pdiscard() is 
advisory, we can't tell whether it made a difference, and therefore 
waiting for it to make a difference isn't worthwhile, and therefore 
plumbing BDRV_REQ_FUA through the block layer for bdrv_pdiscard() is 
pointless.  At this point, I will just go ahead and add the flush for 
qemu as NBD server if it ever sees NBD_CMD_TRIM + FUA (which is unlikely 
to happen in practice, as most clients are smart enough to realize that 
TRIM is advisory and reading after TRIM is unreliable anyways, so 
waiting for the TRIM to land is pointless); and qemu as a client will 
probably never send NBD_CMD_TRIM + FUA.
Eric Blake March 12, 2018, 10:30 p.m. UTC | #5
On 03/07/2018 04:57 PM, Eric Blake wrote:
> The NBD spec states that since trim requests can affect disk contents,
> then they should allow for FUA semantics just like writes for ensuring
> the disk has settled before returning.  As bdrv_[co_]pdiscard() does
> not (yet?) support a flags argument, we can't pass FUA down the block
> layer stack, and must therefore emulate it with a flush at the NBD
> layer.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> 

Thanks; queued to my NBD tree.
diff mbox series

Patch

diff --git a/nbd/server.c b/nbd/server.c
index 4990a5826e6..e098da819df 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1623,6 +1623,9 @@  static coroutine_fn void nbd_trip(void *opaque)
     case NBD_CMD_TRIM:
         ret = blk_co_pdiscard(exp->blk, request.from + exp->dev_offset,
                               request.len);
+        if (ret == 0 && request.flags & NBD_CMD_FLAG_FUA) {
+            ret = blk_co_flush(exp->blk);
+        }
         if (ret < 0) {
             error_setg_errno(&local_err, -ret, "discard failed");
         }