diff mbox

[v12,03/14] qcow2: Optimize bdrv_make_empty()

Message ID 1409088987-17207-4-git-send-email-mreitz@redhat.com
State New
Headers show

Commit Message

Max Reitz Aug. 26, 2014, 9:36 p.m. UTC
bdrv_make_empty() is currently only called if the current image
represents an external snapshot that has been committed to its base
image; it is therefore unlikely to have internal snapshots. In this
case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
refcount table (while having the dirty flag set) and creating a trivial
refcount structure.

If there are snapshots, fall back to the simple implementation (discard
all clusters).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/blkdebug.c      |   2 +
 block/qcow2.c         | 137 ++++++++++++++++++++++++++++++++++++++++++++------
 include/block/block.h |   2 +
 3 files changed, 126 insertions(+), 15 deletions(-)

Comments

Eric Blake Oct. 10, 2014, 12:32 p.m. UTC | #1
On 08/26/2014 03:36 PM, Max Reitz wrote:
> bdrv_make_empty() is currently only called if the current image
> represents an external snapshot that has been committed to its base
> image; it is therefore unlikely to have internal snapshots. In this
> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> refcount table (while having the dirty flag set) and creating a trivial
> refcount structure.
> 
> If there are snapshots, fall back to the simple implementation (discard
> all clusters).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/blkdebug.c      |   2 +
>  block/qcow2.c         | 137 ++++++++++++++++++++++++++++++++++++++++++++------
>  include/block/block.h |   2 +
>  3 files changed, 126 insertions(+), 15 deletions(-)
> 

> +    } else {
> +        int l1_clusters;
> +        int64_t offset;
> +        uint64_t *new_reftable;
> +        uint8_t l1_ofs_rt_ofs_cls[20]; /* L1 offset; RT offset and clusters */
> +        uint64_t rt_entry;
> +
> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>          if (ret < 0) {
> -            break;
> +            return ret;
> +        }
> +
> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        /* Refcounts will be broken utterly */
> +        ret = qcow2_mark_dirty(bs);

qcow2_mark_dirty does assert(s->qcow_version >= 3).  But this code can
be reached when committing a qcow2 0.10 compat-level file, right?

> +        /* "Create" an empty reftable (one cluster) directly after the image
> +         * header and an empty L1 table three clusters after the image header;
> +         * the cluster between those two will be used as the first refblock */
> +        cpu_to_be64w((uint64_t *)&l1_ofs_rt_ofs_cls[ 0], 3 * s->cluster_size);
> +        cpu_to_be64w((uint64_t *)&l1_ofs_rt_ofs_cls[ 8],     s->cluster_size);
> +        cpu_to_be32w((uint32_t *)&l1_ofs_rt_ofs_cls[16], 1);

The offsets here feel a bit magic.
Eric Blake Oct. 10, 2014, 3:34 p.m. UTC | #2
On 10/10/2014 06:32 AM, Eric Blake wrote:
> On 08/26/2014 03:36 PM, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>> refcount table (while having the dirty flag set) and creating a trivial
>> refcount structure.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).

Brain-wave...


>> +        /* Refcounts will be broken utterly */
>> +        ret = qcow2_mark_dirty(bs);
> 
> qcow2_mark_dirty does assert(s->qcow_version >= 3).  But this code can
> be reached when committing a qcow2 0.10 compat-level file, right?

What if you use the fallback of the slower code on compat=0.10 files?
It's no worse than it has always been, and after all, this patch is
about adding an optimization, not adding new behavior.  It's okay if the
optimization is only usable on compat=1.1 files.
Max Reitz Oct. 11, 2014, 10:27 a.m. UTC | #3
Am 10.10.2014 um 14:32 schrieb Eric Blake:
> On 08/26/2014 03:36 PM, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>> refcount table (while having the dirty flag set) and creating a trivial
>> refcount structure.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/blkdebug.c      |   2 +
>>   block/qcow2.c         | 137 ++++++++++++++++++++++++++++++++++++++++++++------
>>   include/block/block.h |   2 +
>>   3 files changed, 126 insertions(+), 15 deletions(-)
>>
>> +    } else {
>> +        int l1_clusters;
>> +        int64_t offset;
>> +        uint64_t *new_reftable;
>> +        uint8_t l1_ofs_rt_ofs_cls[20]; /* L1 offset; RT offset and clusters */
>> +        uint64_t rt_entry;
>> +
>> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>>           if (ret < 0) {
>> -            break;
>> +            return ret;
>> +        }
>> +
>> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        /* Refcounts will be broken utterly */
>> +        ret = qcow2_mark_dirty(bs);
> qcow2_mark_dirty does assert(s->qcow_version >= 3).  But this code can
> be reached when committing a qcow2 0.10 compat-level file, right?

You're right, this code doesn't work for compat=0.10. As you proposed in 
your own answer to this, I'll just use the fallback code in that case 
(in v13)

>> +        /* "Create" an empty reftable (one cluster) directly after the image
>> +         * header and an empty L1 table three clusters after the image header;
>> +         * the cluster between those two will be used as the first refblock */
>> +        cpu_to_be64w((uint64_t *)&l1_ofs_rt_ofs_cls[ 0], 3 * s->cluster_size);
>> +        cpu_to_be64w((uint64_t *)&l1_ofs_rt_ofs_cls[ 8],     s->cluster_size);
>> +        cpu_to_be32w((uint32_t *)&l1_ofs_rt_ofs_cls[16], 1);
> The offsets here feel a bit magic.

Well, such byte offsets are everywhere in the qcow2 code which does such 
small writes to the image header. Making them non-magic would mean 
offsetof(...) - offsetof(...). Can and will do, but I probably won't 
find it much more readable. ;-)

Max
diff mbox

Patch

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 69b330e..a21678d 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -198,6 +198,8 @@  static const char *event_names[BLKDBG_EVENT_MAX] = {
     [BLKDBG_PWRITEV]                        = "pwritev",
     [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
     [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
+
+    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
 };
 
 static int get_event_by_name(const char *name, BlkDebugEvent *event)
diff --git a/block/qcow2.c b/block/qcow2.c
index 2efd9b5..e475151 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2155,24 +2155,131 @@  fail:
 
 static int qcow2_make_empty(BlockDriverState *bs)
 {
+    BDRVQcowState *s = bs->opaque;
     int ret = 0;
-    uint64_t start_sector;
-    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
 
-    for (start_sector = 0; start_sector < bs->total_sectors;
-         start_sector += sector_step)
-    {
-        /* As this function is generally used after committing an external
-         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
-         * default action for this kind of discard is to pass the discard,
-         * which will ideally result in an actually smaller image file, as
-         * is probably desired. */
-        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
-                                     MIN(sector_step,
-                                         bs->total_sectors - start_sector),
-                                     QCOW2_DISCARD_SNAPSHOT, true);
+    if (s->snapshots) {
+        uint64_t start_sector;
+        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+
+        /* If there are snapshots, every active cluster has to be discarded */
+
+        for (start_sector = 0; start_sector < bs->total_sectors;
+             start_sector += sector_step)
+        {
+            /* As this function is generally used after committing an external
+             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
+             * default action for this kind of discard is to pass the discard,
+             * which will ideally result in an actually smaller image file, as
+             * is probably desired. */
+            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
+                                         MIN(sector_step,
+                                             bs->total_sectors - start_sector),
+                                         QCOW2_DISCARD_SNAPSHOT, true);
+            if (ret < 0) {
+                break;
+            }
+        }
+    } else {
+        int l1_clusters;
+        int64_t offset;
+        uint64_t *new_reftable;
+        uint8_t l1_ofs_rt_ofs_cls[20]; /* L1 offset; RT offset and clusters */
+        uint64_t rt_entry;
+
+        ret = qcow2_cache_empty(bs, s->l2_table_cache);
         if (ret < 0) {
-            break;
+            return ret;
+        }
+
+        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
+        if (ret < 0) {
+            return ret;
+        }
+
+        /* Refcounts will be broken utterly */
+        ret = qcow2_mark_dirty(bs);
+        if (ret < 0) {
+            return ret;
+        }
+
+        l1_clusters = DIV_ROUND_UP(s->l1_size,
+                                   s->cluster_size / sizeof(uint64_t));
+        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
+        if (!new_reftable) {
+            return -ENOMEM;
+        }
+
+        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
+
+        /* Overwrite enough clusters at the beginning of the sectors to place
+         * the refcount table, a refcount block and the L1 table in; this may
+         * overwrite parts of the existing refcount and L1 table, which is not
+         * an issue because the dirty flag is set, complete data loss is in fact
+         * desired and partial data loss is consequently fine as well */
+        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
+                                (2 + l1_clusters) * s->cluster_size /
+                                BDRV_SECTOR_SIZE, 0);
+        if (ret < 0) {
+            g_free(new_reftable);
+            return ret;
+        }
+
+        BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
+        BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
+
+        /* "Create" an empty reftable (one cluster) directly after the image
+         * header and an empty L1 table three clusters after the image header;
+         * the cluster between those two will be used as the first refblock */
+        cpu_to_be64w((uint64_t *)&l1_ofs_rt_ofs_cls[ 0], 3 * s->cluster_size);
+        cpu_to_be64w((uint64_t *)&l1_ofs_rt_ofs_cls[ 8],     s->cluster_size);
+        cpu_to_be32w((uint32_t *)&l1_ofs_rt_ofs_cls[16], 1);
+        ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_table_offset),
+                               l1_ofs_rt_ofs_cls, sizeof(l1_ofs_rt_ofs_cls));
+        if (ret < 0) {
+            g_free(new_reftable);
+            return ret;
+        }
+
+        s->l1_table_offset = 3 * s->cluster_size;
+        memset(s->l1_table, 0, s->l1_size * sizeof(uint64_t));
+
+        s->refcount_table_offset = s->cluster_size;
+        s->refcount_table_size   = s->cluster_size / sizeof(uint64_t);
+
+        g_free(s->refcount_table);
+        s->refcount_table = new_reftable;
+
+        BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC);
+
+        /* Enter the first refblock into the reftable */
+        rt_entry = cpu_to_be64(2 * s->cluster_size);
+        ret = bdrv_pwrite_sync(bs->file, s->cluster_size,
+                               &rt_entry, sizeof(rt_entry));
+        if (ret < 0) {
+            return ret;
+        }
+
+        s->refcount_table[0] = 2 * s->cluster_size;
+
+        ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size);
+        if (ret < 0) {
+            return ret;
+        }
+
+        s->free_cluster_index = 0;
+        offset = qcow2_alloc_clusters(bs, 3 * s->cluster_size +
+                                      s->l1_size * sizeof(uint64_t));
+        if (offset < 0) {
+            return offset;
+        } else if (offset > 0) {
+            error_report("First cluster in emptied image is in use");
+            abort();
+        }
+
+        ret = qcow2_mark_clean(bs);
+        if (ret < 0) {
+            return ret;
         }
     }
 
diff --git a/include/block/block.h b/include/block/block.h
index 8f4ad16..7ac4caf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -557,6 +557,8 @@  typedef enum {
     BLKDBG_PWRITEV_ZERO,
     BLKDBG_PWRITEV_DONE,
 
+    BLKDBG_EMPTY_IMAGE_PREPARE,
+
     BLKDBG_EVENT_MAX,
 } BlkDebugEvent;