From patchwork Thu May 12 14:35:30 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Wolf X-Patchwork-Id: 621613 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.gnu.org (lists.gnu.org [IPv6:2001:4830:134:3::11]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3r5GWH1fD7z9ssP for ; Fri, 13 May 2016 01:05:15 +1000 (AEST) Received: from localhost ([::1]:58125 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b0sAr-0005oZ-1u for incoming@patchwork.ozlabs.org; Thu, 12 May 2016 11:05:13 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:52586) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b0rjn-0006Ob-Dh for qemu-devel@nongnu.org; Thu, 12 May 2016 10:37:21 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1b0rje-0004Eu-Oc for qemu-devel@nongnu.org; Thu, 12 May 2016 10:37:14 -0400 Received: from mx1.redhat.com ([209.132.183.28]:54985) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b0rjX-00049W-Mw; Thu, 12 May 2016 10:36:59 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 322BE7D0CE; Thu, 12 May 2016 14:36:59 +0000 (UTC) Received: from noname.redhat.com (ovpn-116-37.ams2.redhat.com [10.36.116.37]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u4CEZp9S028730; Thu, 12 May 2016 10:36:58 -0400 From: Kevin Wolf To: qemu-block@nongnu.org Date: Thu, 12 May 2016 16:35:30 +0200 Message-Id: <1463063749-2201-51-git-send-email-kwolf@redhat.com> In-Reply-To: <1463063749-2201-1-git-send-email-kwolf@redhat.com> References: <1463063749-2201-1-git-send-email-kwolf@redhat.com> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Thu, 12 May 2016 14:36:59 +0000 (UTC) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 209.132.183.28 Subject: [Qemu-devel] [PULL 50/69] block: Honor BDRV_REQ_FUA during write_zeroes X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kwolf@redhat.com, qemu-devel@nongnu.org Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" From: Eric Blake The block layer has a couple of cases where it can lose Force Unit Access semantics when writing a large block of zeroes, such that the request returns before the zeroes have been guaranteed to land on underlying media. SCSI does not support FUA during WRITESAME(10/16); FUA is only supported if it falls back to WRITE(10/16). But where the underlying device is new enough to not need a fallback, it means that any upper layer request with FUA semantics was silently ignoring BDRV_REQ_FUA. Conversely, NBD has situations where it can support FUA but not ZERO_WRITE; when that happens, the generic block layer fallback to bdrv_driver_pwritev() (or the older bdrv_co_writev() in qemu 2.6) was losing the FUA flag. The problem of losing flags unrelated to ZERO_WRITE has been latent in bdrv_co_do_write_zeroes() since commit aa7bfbff, but back then, it did not matter because there was no FUA flag. It became observable when commit 93f5e6d8 paved the way for flags that can impact correctness, when we should have been using bdrv_co_writev_flags() with modified flags. Compare to commit 9eeb6dd, which got flag manipulation right in bdrv_co_do_zero_pwritev(). Symptoms: I tested with qemu-io with default writethrough cache (which is supposed to use FUA semantics on every write), and targetted an NBD client connected to a server that intentionally did not advertise NBD_FLAG_SEND_FUA. When doing 'write 0 512', the NBD client sent two operations (NBD_CMD_WRITE then NBD_CMD_FLUSH) to get the fallback FUA semantics; but when doing 'write -z 0 512', the NBD client sent only NBD_CMD_WRITE. The fix is do to a cleanup bdrv_co_flush() at the end of the operation if any step in the middle relied on a BDS that does not natively support FUA for that step (note that we don't need to flush after every operation, if the operation is broken into chunks based on bounce-buffer sizing). Each BDS gains a new flag .supported_zero_flags, which parallels the use of .supported_write_flags but only when accessing a zero write operation (the flags MUST be different, because of SCSI having different semantics based on WRITE vs. WRITESAME; and also because BDRV_REQ_MAY_UNMAP only makes sense on zero writes). Also fix some documentation to describe -ENOTSUP semantics, particularly since iscsi depends on those semantics. Down the road, we may want to add a driver where its .bdrv_co_pwritev() honors all three of BDRV_REQ_FUA, BDRV_REQ_ZERO_WRITE, and BDRV_REQ_MAY_UNMAP, and advertise this via bs->supported_write_flags for blocks opened by that driver; such a driver should NOT supply .bdrv_co_write_zeroes nor .supported_zero_flags. But none of the drivers touched in this patch want to do that (the act of writing zeroes is different enough from normal writes to deserve a second callback). Signed-off-by: Eric Blake Reviewed-by: Fam Zheng Acked-by: Stefan Hajnoczi Signed-off-by: Kevin Wolf --- block/io.c | 28 +++++++++++++++++++++++++--- block/iscsi.c | 1 + block/raw-posix.c | 1 + block/raw_bsd.c | 1 + include/block/block_int.h | 7 +++++-- 5 files changed, 33 insertions(+), 5 deletions(-) diff --git a/block/io.c b/block/io.c index 1fb7afe..cd6d71a 100644 --- a/block/io.c +++ b/block/io.c @@ -652,7 +652,8 @@ int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num, * Completely zero out a block device with the help of bdrv_write_zeroes. * The operation is sped up by checking the block status and only writing * zeroes to the device if they currently do not return zeroes. Optional - * flags are passed through to bdrv_write_zeroes (e.g. BDRV_REQ_MAY_UNMAP). + * flags are passed through to bdrv_write_zeroes (e.g. BDRV_REQ_MAY_UNMAP, + * BDRV_REQ_FUA). * * Returns < 0 on error, 0 on success. For error codes see bdrv_write(). */ @@ -1160,6 +1161,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, QEMUIOVector qiov; struct iovec iov = {0}; int ret = 0; + bool need_flush = false; int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_write_zeroes, BDRV_REQUEST_MAX_SECTORS); @@ -1192,13 +1194,29 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, ret = -ENOTSUP; /* First try the efficient write zeroes operation */ if (drv->bdrv_co_write_zeroes) { - ret = drv->bdrv_co_write_zeroes(bs, sector_num, num, flags); + ret = drv->bdrv_co_write_zeroes(bs, sector_num, num, + flags & bs->supported_zero_flags); + if (ret != -ENOTSUP && (flags & BDRV_REQ_FUA) && + !(bs->supported_zero_flags & BDRV_REQ_FUA)) { + need_flush = true; + } + } else { + assert(!bs->supported_zero_flags); } if (ret == -ENOTSUP) { /* Fall back to bounce buffer if write zeroes is unsupported */ int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length, MAX_WRITE_ZEROES_BOUNCE_BUFFER); + BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE; + + if ((flags & BDRV_REQ_FUA) && + !(bs->supported_write_flags & BDRV_REQ_FUA)) { + /* No need for bdrv_driver_pwrite() to do a fallback + * flush on each chunk; use just one at the end */ + write_flags &= ~BDRV_REQ_FUA; + need_flush = true; + } num = MIN(num, max_xfer_len); iov.iov_len = num * BDRV_SECTOR_SIZE; if (iov.iov_base == NULL) { @@ -1212,7 +1230,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, qemu_iovec_init_external(&qiov, &iov, 1); ret = bdrv_driver_pwritev(bs, sector_num * BDRV_SECTOR_SIZE, - num * BDRV_SECTOR_SIZE, &qiov, 0); + num * BDRV_SECTOR_SIZE, &qiov, + write_flags); /* Keep bounce buffer around if it is big enough for all * all future requests. @@ -1228,6 +1247,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, } fail: + if (ret == 0 && need_flush) { + ret = bdrv_co_flush(bs); + } qemu_vfree(iov.iov_base); return ret; } diff --git a/block/iscsi.c b/block/iscsi.c index 6d5c1f6..10f3906 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -1553,6 +1553,7 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags, if (iscsilun->dpofua) { bs->supported_write_flags = BDRV_REQ_FUA; } + bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP; /* Check the write protect flag of the LUN if we want to write */ if (iscsilun->type == TYPE_DISK && (flags & BDRV_O_RDWR) && diff --git a/block/raw-posix.c b/block/raw-posix.c index 71ec463..a4f5a1b 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -517,6 +517,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, s->has_discard = true; s->has_write_zeroes = true; + bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP; if ((bs->open_flags & BDRV_O_NOCACHE) != 0) { s->needs_alignment = true; } diff --git a/block/raw_bsd.c b/block/raw_bsd.c index 1a1618e..3385ed4 100644 --- a/block/raw_bsd.c +++ b/block/raw_bsd.c @@ -205,6 +205,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags, { bs->sg = bs->file->bs->sg; bs->supported_write_flags = BDRV_REQ_FUA; + bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP; if (bs->probed && !bdrv_is_read_only(bs)) { fprintf(stderr, diff --git a/include/block/block_int.h b/include/block/block_int.h index 10f4962..2709488 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -161,8 +161,8 @@ struct BlockDriver { /* * Efficiently zero a region of the disk image. Typically an image format * would use a compact metadata representation to implement this. This - * function pointer may be NULL and .bdrv_co_writev() will be called - * instead. + * function pointer may be NULL or return -ENOSUP and .bdrv_co_writev() + * will be called instead. */ int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs, int64_t sector_num, int nb_sectors, BdrvRequestFlags flags); @@ -445,6 +445,9 @@ struct BlockDriverState { unsigned int request_alignment; /* Flags honored during pwrite (so far: BDRV_REQ_FUA) */ unsigned int supported_write_flags; + /* Flags honored during write_zeroes (so far: BDRV_REQ_FUA, + * BDRV_REQ_MAY_UNMAP) */ + unsigned int supported_zero_flags; /* the following member gives a name to every node on the bs graph. */ char node_name[32];