
[7/8] block/qcow2: Speed up zero cluster expansion

Message ID 1406311665-2814-8-git-send-email-mreitz@redhat.com
State New

Commit Message

Max Reitz July 25, 2014, 6:07 p.m. UTC
Actually, we do not need to allocate a new data cluster for every zero
cluster to be expanded: It is completely sufficient to rely on qcow2's
COW part and instead create a single zero cluster and reuse it as much
as possible.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c | 119 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 92 insertions(+), 27 deletions(-)
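To make the saving concrete, here is a toy model (illustrative C, not code from the patch): without sharing, expanding n zero clusters costs n allocations; with a shared all-zero cluster whose refcount is capped at refcount_max, only ceil(n / refcount_max) allocations are needed (assuming each expanded entry adds a single reference, i.e. l2_refcount == 1).

```c
#include <stdint.h>

/* Toy model of the patch's effect, assuming each expanded zero entry
 * adds exactly one reference (l2_refcount == 1).  Without sharing, n
 * expansions allocate n data clusters; with a shared all-zero cluster
 * whose refcount is capped at refcount_max, only
 * ceil(n / refcount_max) allocations are needed. */
static uint64_t shared_zero_allocations(uint64_t n, uint64_t refcount_max)
{
    /* Integer ceiling division: (n + max - 1) / max */
    return (n + refcount_max - 1) / refcount_max;
}
```

With the default 16-bit refcounts (refcount_max == 0xffff), expanding a million zero clusters drops from a million data-cluster allocations to 16.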

Comments

Eric Blake July 30, 2014, 4:14 p.m. UTC | #1
On 07/25/2014 12:07 PM, Max Reitz wrote:
> Actually, we do not need to allocate a new data cluster for every zero
> cluster to be expanded: It is completely sufficient to rely on qcow2's
> COW part and instead create a single zero cluster and reuse it as much
> as possible.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-cluster.c | 119 ++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 92 insertions(+), 27 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 905beb6..867db03 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1558,6 +1558,9 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>      BDRVQcowState *s = bs->opaque;
>      bool is_active_l1 = (l1_table == s->l1_table);
>      uint64_t *l2_table = NULL;
> +    int64_t zeroed_cluster_offset = 0;
> +    int zeroed_cluster_refcount = 0;
> +    int last_zeroed_cluster_l1i = 0, last_zeroed_cluster_l2i = 0;
>      int ret;
>      int i, j;
>  
> @@ -1617,47 +1620,79 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>                      continue;
>                  }
>  
> -                offset = qcow2_alloc_clusters(bs, s->cluster_size);
> -                if (offset < 0) {
> -                    ret = offset;
> -                    goto fail;
> +                if (zeroed_cluster_offset) {
> +                    zeroed_cluster_refcount += l2_refcount;
> +                    if (zeroed_cluster_refcount > 0xffff) {

Doesn't the qcow2 file format allow variable-sized maximum refcount
(bytes 96-99 refcount_order in the header)?  Therefore, you should be
using the value computed from the header rather than hard-coding the
assumption that the header used (the default of) 16-bit refcount. [Yeah,
I know, we don't yet have code that supports non-default size, even
though the file format documents it, but that doesn't mean we should
make it harder to add support down the road...]
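A sketch of the computation Eric suggests (hypothetical helper, not in the qemu tree): the header's refcount_order field gives a refcount width of 1 << refcount_order bits, so the maximum value can be derived from it rather than hard-coding 0xffff.

```c
#include <stdint.h>

/* Hypothetical helper: derive the maximum refcount value from the
 * header's refcount_order field (qcow2 spec, header bytes 96-99).
 * The refcount width in bits is 1 << refcount_order; the default
 * order of 4 yields the familiar 16-bit refcounts, maximum 0xffff. */
static uint64_t refcount_max_from_order(unsigned refcount_order)
{
    uint64_t refcount_bits = UINT64_C(1) << refcount_order;

    if (refcount_bits >= 64) {
        return UINT64_MAX;
    }
    return (UINT64_C(1) << refcount_bits) - 1;
}
```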

> +                        zeroed_cluster_refcount = 0;
> +                        zeroed_cluster_offset = 0;
> +                    }
>                  }

This isn't a maximal packing.  As long as we don't mind complexity to
gain compactness, couldn't we also expand the existing
zeroed_cluster_offset all the way up to full refcount, and decrement
l2_refcount by the difference, before spilling over to allocating the
next zero cluster?
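One caveat when modelling this suggestion (illustrative code, names not from the qemu tree): a single L2 entry can only store one offset, so it contributes its l2_refcount references to one data cluster atomically and cannot split them. The finest packing granularity is therefore per entry — check whether the current shared cluster still has room for all of this entry's references before falling back to a fresh allocation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-entry fit check (names not from the qemu tree).
 * Assumes current_refcount <= refcount_max.  A single L2 entry cannot
 * split its l2_refcount references across two data clusters, so the
 * shared zero cluster is reused only if this entry's references all
 * still fit under the refcount limit. */
static bool zero_cluster_fits(uint64_t current_refcount,
                              uint64_t l2_refcount,
                              uint64_t refcount_max)
{
    return refcount_max - current_refcount >= l2_refcount;
}
```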

Also, I have to wonder - since the all-zero cluster is the most likely
cluster to have a large refcount, even during normal runtime, should we
special case the normal qcow2 write code to track the current all-zero
cluster (if any), and merely increase its refcount rather than allocate
a new cluster any time it is detected that an all-zero cluster is
needed?  [Of course, the tracking would be runtime only, since
compat=0.10 header doesn't provide any way to track the location of an
all-zero cluster across file reloads.  Each new runtime would probably
settle on a new location for the all-zero cluster used during that run,
rather than trying to find an existing one.  And there's really no point
to adding a header to track an all-zero cluster in compat=1.1 images,
since those images already have the ability to track zero clusters
without needing one allocated.]

> +                if (!zeroed_cluster_offset) {
> +                    offset = qcow2_alloc_clusters(bs, s->cluster_size);
> +                    if (offset < 0) {
> +                        ret = offset;
> +                        goto fail;
> +                    }
>  
> -                if (l2_refcount > 1) {
> -                    /* For shared L2 tables, set the refcount accordingly (it is
> -                     * already 1 and needs to be l2_refcount) */
> -                    ret = qcow2_update_cluster_refcount(bs,
> -                            offset >> s->cluster_bits, l2_refcount - 1,
> -                            QCOW2_DISCARD_OTHER);
> +                    ret = qcow2_pre_write_overlap_check(bs, 0, offset,
> +                                                        s->cluster_size);
> +                    if (ret < 0) {
> +                        qcow2_free_clusters(bs, offset, s->cluster_size,
> +                                            QCOW2_DISCARD_OTHER);
> +                        goto fail;
> +                    }
> +
> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
> +                                            s->cluster_sectors, 0);

That is, if bdrv_write_zeroes knows how to take advantage of an already
existing all-zero cluster, it would be less special casing in this code,
but still get the same benefits of maximizing refcount during the amend
operation, if all expanded clusters go through bdrv_write_zeroes.
Max Reitz July 30, 2014, 8:31 p.m. UTC | #2
On 30.07.2014 18:14, Eric Blake wrote:
> On 07/25/2014 12:07 PM, Max Reitz wrote:
>> Actually, we do not need to allocate a new data cluster for every zero
>> cluster to be expanded: It is completely sufficient to rely on qcow2's
>> COW part and instead create a single zero cluster and reuse it as much
>> as possible.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-cluster.c | 119 ++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 92 insertions(+), 27 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 905beb6..867db03 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -1558,6 +1558,9 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>>       BDRVQcowState *s = bs->opaque;
>>       bool is_active_l1 = (l1_table == s->l1_table);
>>       uint64_t *l2_table = NULL;
>> +    int64_t zeroed_cluster_offset = 0;
>> +    int zeroed_cluster_refcount = 0;
>> +    int last_zeroed_cluster_l1i = 0, last_zeroed_cluster_l2i = 0;
>>       int ret;
>>       int i, j;
>>   
>> @@ -1617,47 +1620,79 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>>                       continue;
>>                   }
>>   
>> -                offset = qcow2_alloc_clusters(bs, s->cluster_size);
>> -                if (offset < 0) {
>> -                    ret = offset;
>> -                    goto fail;
>> +                if (zeroed_cluster_offset) {
>> +                    zeroed_cluster_refcount += l2_refcount;
>> +                    if (zeroed_cluster_refcount > 0xffff) {
> Doesn't the qcow2 file format allow variable-sized maximum refcount
> (bytes 96-99 refcount_order in the header)?  Therefore, you should be
> using the value computed from the header rather than hard-coding the
> assumption that the header used (the default of) 16-bit refcount.

Oh, you're right, I didn't even think of that.

> [Yeah,
> I know, we don't yet have code that supports non-default size, even
> though the file format documents it, but that doesn't mean we should
> make it harder to add support down the road...]

And this is probably why, yes. ;-)

>> +                        zeroed_cluster_refcount = 0;
>> +                        zeroed_cluster_offset = 0;
>> +                    }
>>                   }
> This isn't a maximal packing.  As long as we don't mind complexity to
> gain compactness, couldn't we also expand the existing
> zeroed_cluster_offset all the way up to full refcount, and decrement
> l2_refcount by the difference, before spilling over to allocating the
> next zero cluster?

Hm, right.

> Also, I have to wonder - since the all-zero cluster is the most likely
> cluster to have a large refcount, even during normal runtime, should we
> special case the normal qcow2 write code to track the current all-zero
> cluster (if any), and merely increase its refcount rather than allocate
> a new cluster any time it is detected that an all-zero cluster is
> needed?  [Of course, the tracking would be runtime only, since
> compat=0.10 header doesn't provide any way to track the location of an
> all-zero cluster across file reloads.  Each new runtime would probably
> settle on a new location for the all-zero cluster used during that run,
> rather than trying to find an existing one.  And there's really no point
> to adding a header to track an all-zero cluster in compat=1.1 images,
> since those images already have the ability to track zero clusters
> without needing one allocated.]

This may improve performance for compat=0.10 images; however, I don't 
think we care enough about compat=0.10 images to justify scattering 
optimizations specifically for them all over the qcow2 code.

>> +                if (!zeroed_cluster_offset) {
>> +                    offset = qcow2_alloc_clusters(bs, s->cluster_size);
>> +                    if (offset < 0) {
>> +                        ret = offset;
>> +                        goto fail;
>> +                    }
>>   
>> -                if (l2_refcount > 1) {
>> -                    /* For shared L2 tables, set the refcount accordingly (it is
>> -                     * already 1 and needs to be l2_refcount) */
>> -                    ret = qcow2_update_cluster_refcount(bs,
>> -                            offset >> s->cluster_bits, l2_refcount - 1,
>> -                            QCOW2_DISCARD_OTHER);
>> +                    ret = qcow2_pre_write_overlap_check(bs, 0, offset,
>> +                                                        s->cluster_size);
>> +                    if (ret < 0) {
>> +                        qcow2_free_clusters(bs, offset, s->cluster_size,
>> +                                            QCOW2_DISCARD_OTHER);
>> +                        goto fail;
>> +                    }
>> +
>> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
>> +                                            s->cluster_sectors, 0);
> That is, if bdrv_write_zeroes knows how to take advantage of an already
> existing all-zero cluster, it would be less special casing in this code,
> but still get the same benefits of maximizing refcount during the amend
> operation, if all expanded clusters go through bdrv_write_zeroes.

The special casing would then just be somewhere else, and I think only 
this code would really benefit from it. I don't think we should ingrain 
such optimizations in all of the qcow2 code; I personally can live with 
having these optimizations contained in and separated from the rest of 
the qcow2 code in functions like this one, but anything beyond that 
would be too much effort (and maybe even too error-prone, since it would 
probably be rather complex) for current qemu. We do have compat=1.1 with 
zero clusters, and that is what should be used when performance matters.

I'll hold off on v2 of this patch until you have decided whether it's 
worth it (in the alternative series, I completely dropped this patch).

Max
Eric Blake July 30, 2014, 8:31 p.m. UTC | #3
On 07/30/2014 10:14 AM, Eric Blake wrote:
> On 07/25/2014 12:07 PM, Max Reitz wrote:
>> Actually, we do not need to allocate a new data cluster for every zero
>> cluster to be expanded: It is completely sufficient to rely on qcow2's
>> COW part and instead create a single zero cluster and reuse it as much
>> as possible.

> Also, I have to wonder - since the all-zero cluster is the most likely
> cluster to have a large refcount, even during normal runtime, should we
> special case the normal qcow2 write code to track the current all-zero
> cluster (if any), and merely increase its refcount rather than allocate
> a new cluster any time it is detected that an all-zero cluster is
> needed?  [Of course, the tracking would be runtime only, since
> compat=0.10 header doesn't provide any way to track the location of an
> all-zero cluster across file reloads.  Each new runtime would probably
> settle on a new location for the all-zero cluster used during that run,
> rather than trying to find an existing one.  And there's really no point
> to adding a header to track an all-zero cluster in compat=1.1 images,
> since those images already have the ability to track zero clusters
> without needing one allocated.]


>> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
>> +                                            s->cluster_sectors, 0);
> 
> That is, if bdrv_write_zeroes knows how to take advantage of an already
> existing all-zero cluster, it would be less special casing in this code,
> but still get the same benefits of maximizing refcount during the amend
> operation, if all expanded clusters go through bdrv_write_zeroes.

Now that I've looked through both variants, I'm leaning towards the
simplicity of your alternate series, rather than the complexity of this
one, if we can (independently?) optimize bdrv_write_zeroes to reuse a
known-all-zeroes cluster when possible.  Of course, you may want to get
other opinions than just mine before posting your next round of these
patches.
Max Reitz July 30, 2014, 8:41 p.m. UTC | #4
On 30.07.2014 22:31, Eric Blake wrote:
> On 07/30/2014 10:14 AM, Eric Blake wrote:
>> On 07/25/2014 12:07 PM, Max Reitz wrote:
>>> Actually, we do not need to allocate a new data cluster for every zero
>>> cluster to be expanded: It is completely sufficient to rely on qcow2's
>>> COW part and instead create a single zero cluster and reuse it as much
>>> as possible.
>> Also, I have to wonder - since the all-zero cluster is the most likely
>> cluster to have a large refcount, even during normal runtime, should we
>> special case the normal qcow2 write code to track the current all-zero
>> cluster (if any), and merely increase its refcount rather than allocate
>> a new cluster any time it is detected that an all-zero cluster is
>> needed?  [Of course, the tracking would be runtime only, since
>> compat=0.10 header doesn't provide any way to track the location of an
>> all-zero cluster across file reloads.  Each new runtime would probably
>> settle on a new location for the all-zero cluster used during that run,
>> rather than trying to find an existing one.  And there's really no point
>> to adding a header to track an all-zero cluster in compat=1.1 images,
>> since those images already have the ability to track zero clusters
>> without needing one allocated.]
>
>>> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
>>> +                                            s->cluster_sectors, 0);
>> That is, if bdrv_write_zeroes knows how to take advantage of an already
>> existing all-zero cluster, it would be less special casing in this code,
>> but still get the same benefits of maximizing refcount during the amend
>> operation, if all expanded clusters go through bdrv_write_zeroes.
> Now that I've looked through both variants, I'm leaning towards the
> simplicity of your alternate series, rather than the complexity of this
> one, if we can (independently?) optimize bdrv_write_zeroes to reuse a
> known-all-zeroes cluster when possible.  Of course, you may want to get
> other opinions than just mine before posting your next round of these
> patches.

I'm pretty sure Kevin prefers a variant which is as simple as possible, 
so I'll use that (alternative) version for v2, then.

However, I still think we should not optimize bdrv_write_zeroes(). As 
far as I know, qemu should work best with raw and qcow2 in its current 
version. raw will not support things like a common zero cluster anyway; 
and qcow2 in its current version has zero clusters built-in. I don't 
think we should optimize for qcow2 compat=0.10 to make up for things it 
lacks in comparison to compat=1.1 by design.

Also, in regard to this patch: bs->file is most probably a raw file, 
which won't support a common zero cluster. If we want to optimize the 
bdrv_write_zeroes() call alone, all we can do is allow it to discard 
the sectors (which I guess I'll just do in v2 because it doesn't cost 
anything).

In any case, if later on I or somebody else does decide to optimize 
bdrv_write_zeroes() we can still implement this optimization 
independently of this series.

Max

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 905beb6..867db03 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1558,6 +1558,9 @@  static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
     BDRVQcowState *s = bs->opaque;
     bool is_active_l1 = (l1_table == s->l1_table);
     uint64_t *l2_table = NULL;
+    int64_t zeroed_cluster_offset = 0;
+    int zeroed_cluster_refcount = 0;
+    int last_zeroed_cluster_l1i = 0, last_zeroed_cluster_l2i = 0;
     int ret;
     int i, j;
 
@@ -1617,47 +1620,79 @@  static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                     continue;
                 }
 
-                offset = qcow2_alloc_clusters(bs, s->cluster_size);
-                if (offset < 0) {
-                    ret = offset;
-                    goto fail;
+                if (zeroed_cluster_offset) {
+                    zeroed_cluster_refcount += l2_refcount;
+                    if (zeroed_cluster_refcount > 0xffff) {
+                        zeroed_cluster_refcount = 0;
+                        zeroed_cluster_offset = 0;
+                    }
                 }
+                if (!zeroed_cluster_offset) {
+                    offset = qcow2_alloc_clusters(bs, s->cluster_size);
+                    if (offset < 0) {
+                        ret = offset;
+                        goto fail;
+                    }
 
-                if (l2_refcount > 1) {
-                    /* For shared L2 tables, set the refcount accordingly (it is
-                     * already 1 and needs to be l2_refcount) */
-                    ret = qcow2_update_cluster_refcount(bs,
-                            offset >> s->cluster_bits, l2_refcount - 1,
-                            QCOW2_DISCARD_OTHER);
+                    ret = qcow2_pre_write_overlap_check(bs, 0, offset,
+                                                        s->cluster_size);
+                    if (ret < 0) {
+                        qcow2_free_clusters(bs, offset, s->cluster_size,
+                                            QCOW2_DISCARD_OTHER);
+                        goto fail;
+                    }
+
+                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
+                                            s->cluster_sectors, 0);
                     if (ret < 0) {
                         qcow2_free_clusters(bs, offset, s->cluster_size,
                                             QCOW2_DISCARD_OTHER);
                         goto fail;
                     }
+
+                    if (l2_refcount > 1) {
+                        ret = qcow2_update_cluster_refcount(bs,
+                                offset >> s->cluster_bits, l2_refcount - 1,
+                                QCOW2_DISCARD_OTHER);
+                        if (ret < 0) {
+                            qcow2_free_clusters(bs, offset, s->cluster_size,
+                                                QCOW2_DISCARD_OTHER);
+                            goto fail;
+                        }
+                    }
+
+                    zeroed_cluster_offset = offset;
+                    zeroed_cluster_refcount = l2_refcount;
+                } else {
+                    ret = qcow2_update_cluster_refcount(bs,
+                            zeroed_cluster_offset >> s->cluster_bits,
+                            l2_refcount, QCOW2_DISCARD_OTHER);
+                    if (ret < 0) {
+                        goto fail;
+                    }
                 }
+
+                offset = zeroed_cluster_offset;
+                last_zeroed_cluster_l1i = i;
+                last_zeroed_cluster_l2i = j;
             }
 
-            ret = qcow2_pre_write_overlap_check(bs, 0, offset, s->cluster_size);
-            if (ret < 0) {
-                if (!preallocated) {
-                    qcow2_free_clusters(bs, offset, s->cluster_size,
-                                        QCOW2_DISCARD_ALWAYS);
+            if (preallocated) {
+                ret = qcow2_pre_write_overlap_check(bs, 0, offset,
+                                                    s->cluster_size);
+                if (ret < 0) {
+                    goto fail;
                 }
-                goto fail;
-            }
 
-            ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
-                                    s->cluster_sectors, 0);
-            if (ret < 0) {
-                if (!preallocated) {
-                    qcow2_free_clusters(bs, offset, s->cluster_size,
-                                        QCOW2_DISCARD_ALWAYS);
+                ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
+                                        s->cluster_sectors, 0);
+                if (ret < 0) {
+                    goto fail;
                 }
-                goto fail;
             }
 
-            if (l2_refcount == 1) {
-                l2_table[j] = cpu_to_be64(offset | QCOW_OFLAG_COPIED);
+            if (preallocated) {
+                l2_table[j] = cpu_to_be64(offset | (l2_entry & QCOW_OFLAG_COPIED));
             } else {
                 l2_table[j] = cpu_to_be64(offset);
             }
@@ -1670,8 +1705,8 @@  static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 qcow2_cache_depends_on_flush(s->l2_table_cache);
             }
             ret = qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table);
+            l2_table = NULL;
             if (ret < 0) {
-                l2_table = NULL;
                 goto fail;
             }
         } else {
@@ -1697,6 +1732,36 @@  static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
         }
     }
 
+    /* Fix COPIED (only valid for active L2 tables) */
+    if (is_active_l1 && zeroed_cluster_refcount == 1) {
+        uint64_t l2_offset, l2_entry;
+
+        l2_offset = l1_table[last_zeroed_cluster_l1i] & L1E_OFFSET_MASK;
+        assert(l2_offset);
+
+        ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
+                              (void **)&l2_table);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        l2_entry = be64_to_cpu(l2_table[last_zeroed_cluster_l2i]);
+
+        assert(!(l2_entry & QCOW_OFLAG_COPIED));
+        l2_entry |= QCOW_OFLAG_COPIED;
+
+        l2_table[last_zeroed_cluster_l2i] = cpu_to_be64(l2_entry);
+
+        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+        qcow2_cache_depends_on_flush(s->l2_table_cache);
+
+        ret = qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table);
+        l2_table = NULL;
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
     ret = 0;
 
 fail: