diff mbox series

[1/1] qcow2: Skip copy-on-write when allocating a zero cluster

Message ID 8d0ca4de285ec56fa24ea43b8763f305816a0acc.1597416317.git.berto@igalia.com
State New
Headers show
Series [1/1] qcow2: Skip copy-on-write when allocating a zero cluster | expand

Commit Message

Alberto Garcia Aug. 14, 2020, 2:57 p.m. UTC
Since commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a when a write
request results in a new allocation QEMU first tries to see if the
rest of the cluster outside the written area contains only zeroes.

In that case, instead of doing a normal copy-on-write operation and
writing explicit zero buffers to disk, the code zeroes the whole
cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.

This improves performance very significantly but it only happens when
we are writing to an area that was completely unallocated before. Zero
clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
are therefore slower to allocate.

This happens because the code uses bdrv_is_allocated_above() rather
bdrv_block_status_above(). The former is not as accurate for this
purpose but it is faster. However in the case of qcow2 the underlying
call does already report zero clusters just fine so there is no reason
why we cannot use that information.

After testing 4KB writes on an image that only contains zero clusters
this patch results in almost five times more IOPS.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 include/block/block.h |  2 +-
 block/commit.c        |  2 +-
 block/io.c            | 20 +++++++++++++++++---
 block/mirror.c        |  3 ++-
 block/qcow2.c         | 26 ++++++++++++++++----------
 block/replication.c   |  2 +-
 block/stream.c        |  2 +-
 qemu-img.c            |  2 +-
 8 files changed, 40 insertions(+), 19 deletions(-)

Comments

Vladimir Sementsov-Ogievskiy Aug. 14, 2020, 6:07 p.m. UTC | #1
14.08.2020 17:57, Alberto Garcia wrote:
> Since commit c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a when a write
> request results in a new allocation QEMU first tries to see if the
> rest of the cluster outside the written area contains only zeroes.
> 
> In that case, instead of doing a normal copy-on-write operation and
> writing explicit zero buffers to disk, the code zeroes the whole
> cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
> 
> This improves performance very significantly but it only happens when
> we are writing to an area that was completely unallocated before. Zero
> clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
> are therefore slower to allocate.
> 
> This happens because the code uses bdrv_is_allocated_above() rather
> bdrv_block_status_above(). The former is not as accurate for this
> purpose but it is faster. However in the case of qcow2 the underlying
> call does already report zero clusters just fine so there is no reason
> why we cannot use that information.
> 
> After testing 4KB writes on an image that only contains zero clusters
> this patch results in almost five times more IOPS.

Would be great to add this case to simplebench as well.

The idea is good, but I'm a bit confused with new interface.

As I understand the "The former is not as accurate for this
purpose but it is faster" is related to (and only to) want_zero
parameter of block_status. bdrv_is_allocated_above is about
allocation (in terms of backing chains), and is_unallocated() is
just wrong user (with wrong name:)): it actually want another kind
of information.
So, for me it looks like we need an interface to
bdrv_block_status_above with want_zero=false (add another
function, or add this parameter to public interface).

And even with your approach, I'd keep original
bdrv_is_allocated_above as is, and just add new function with *is_zero
argument, to not modify all users of bdrv_is_allocated_above to pass
additional NULL value, in this way patch will touch less files.

Also, note, I have a series about block_status & is_allocated:
"[PATCH v5 0/5] fix & merge block_status_above and is_allocated_above"
(me go and ping it), which a bit reduces all the mess around
block_status & is_allocated.
diff mbox series

Patch

diff --git a/include/block/block.h b/include/block/block.h
index 6e36154061..314ce6f425 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -495,7 +495,7 @@  int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                             bool include_base, int64_t offset, int64_t bytes,
-                            int64_t *pnum);
+                            int64_t *pnum, bool *is_zero);
 
 bool bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
diff --git a/block/commit.c b/block/commit.c
index 7732d02dfe..9932a9cf90 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -154,7 +154,7 @@  static int coroutine_fn commit_run(Job *job, Error **errp)
         }
         /* Copy if allocated above the base */
         ret = bdrv_is_allocated_above(blk_bs(s->top), blk_bs(s->base), false,
-                                      offset, COMMIT_BUFFER_SIZE, &n);
+                                      offset, COMMIT_BUFFER_SIZE, &n, NULL);
         copy = (ret == 1);
         trace_commit_one_iteration(s, offset, n, ret);
         if (copy) {
diff --git a/block/io.c b/block/io.c
index ad3a51ed53..4dc92dc933 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2588,11 +2588,19 @@  int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
  * words, the result is not necessarily the maximum possible range);
  * but 'pnum' will only be 0 when end of file is reached.
  *
+ * If 'is_zero' is not NULL and the range is allocated (i.e. this
+ * function returns 1) then *is_zero will be updated to indicate if
+ * the range is known to read back as zeroes (but note however that
+ * *is_zero == false does not guarantee non-zero data).
+ * If the range is not allocated then 'is_zero' is ignored and left
+ * unset.
+ *
  */
 int bdrv_is_allocated_above(BlockDriverState *top,
                             BlockDriverState *base,
                             bool include_base, int64_t offset,
-                            int64_t bytes, int64_t *pnum)
+                            int64_t bytes, int64_t *pnum,
+                            bool *is_zero)
 {
     BlockDriverState *intermediate;
     int ret;
@@ -2606,11 +2614,17 @@  int bdrv_is_allocated_above(BlockDriverState *top,
         int64_t size_inter;
 
         assert(intermediate);
-        ret = bdrv_is_allocated(intermediate, offset, bytes, &pnum_inter);
+        ret = bdrv_common_block_status_above(intermediate,
+                                             backing_bs(intermediate),
+                                             false, offset, bytes,
+                                             &pnum_inter, NULL, NULL);
         if (ret < 0) {
             return ret;
         }
-        if (ret) {
+        if (ret & BDRV_BLOCK_ALLOCATED) {
+            if (is_zero) {
+                *is_zero = !!(ret & BDRV_BLOCK_ZERO);
+            }
             *pnum = pnum_inter;
             return 1;
         }
diff --git a/block/mirror.c b/block/mirror.c
index e8e8844afc..b42d32ca48 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -837,7 +837,8 @@  static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
             return 0;
         }
 
-        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count);
+        ret = bdrv_is_allocated_above(bs, base, false, offset, bytes, &count,
+                                      NULL);
         if (ret < 0) {
             return ret;
         }
diff --git a/block/qcow2.c b/block/qcow2.c
index 6ad6bdc166..aaf3d5dddf 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2377,12 +2377,18 @@  static bool merge_cow(uint64_t offset, unsigned bytes,
     return false;
 }
 
-static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+static bool is_unallocated_or_zero(BlockDriverState *bs, int64_t offset,
+                                   int64_t bytes)
 {
+    int is_allocated;
+    bool is_zero;
     int64_t nr;
-    return !bytes ||
-        (!bdrv_is_allocated_above(bs, NULL, false, offset, bytes, &nr) &&
-         nr == bytes);
+    if (!bytes) {
+        return true;
+    }
+    is_allocated = bdrv_is_allocated_above(bs, NULL, false, offset, bytes,
+                                           &nr, &is_zero);
+    return ((!is_allocated || is_zero) && nr == bytes);
 }
 
 static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
@@ -2390,13 +2396,13 @@  static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
     /*
      * This check is designed for optimization shortcut so it must be
      * efficient.
-     * Instead of is_zero(), use is_unallocated() as it is faster (but not
-     * as accurate and can result in false negatives).
+     * Instead of is_zero(), use is_unallocated_or_zero() as it is faster
+     * (but not as accurate and can result in false negatives).
      */
-    return is_unallocated(bs, m->offset + m->cow_start.offset,
-                          m->cow_start.nb_bytes) &&
-           is_unallocated(bs, m->offset + m->cow_end.offset,
-                          m->cow_end.nb_bytes);
+    return is_unallocated_or_zero(bs, m->offset + m->cow_start.offset,
+                                  m->cow_start.nb_bytes) &&
+           is_unallocated_or_zero(bs, m->offset + m->cow_end.offset,
+                                  m->cow_end.nb_bytes);
 }
 
 static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
diff --git a/block/replication.c b/block/replication.c
index 0c70215784..cd3950a13b 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -279,7 +279,7 @@  static coroutine_fn int replication_co_writev(BlockDriverState *bs,
         ret = bdrv_is_allocated_above(top->bs, base->bs, false,
                                       sector_num * BDRV_SECTOR_SIZE,
                                       remaining_sectors * BDRV_SECTOR_SIZE,
-                                      &count);
+                                      &count, NULL);
         if (ret < 0) {
             goto out1;
         }
diff --git a/block/stream.c b/block/stream.c
index 310ccbaa4c..68aa651603 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -157,7 +157,7 @@  static int coroutine_fn stream_run(Job *job, Error **errp)
             /* Copy if allocated in the intermediate images.  Limit to the
              * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).  */
             ret = bdrv_is_allocated_above(backing_bs(bs), s->bottom, true,
-                                          offset, n, &n);
+                                          offset, n, &n, NULL);
             /* Finish early if end of backing file has been reached */
             if (ret == 0 && n == 0) {
                 n = len - offset;
diff --git a/qemu-img.c b/qemu-img.c
index 5308773811..0099e3675d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3726,7 +3726,7 @@  static int img_rebase(int argc, char **argv)
                  * to take action
                  */
                 ret = bdrv_is_allocated_above(backing_bs(bs), prefix_chain_bs,
-                                              false, offset, n, &n);
+                                              false, offset, n, &n, NULL);
                 if (ret < 0) {
                     error_report("error while reading image metadata: %s",
                                  strerror(-ret));