diff mbox

qcow2: Optimize the refcount-block overlap check

Message ID 20170130161441.30493-1-berto@igalia.com
State New
Headers show

Commit Message

Alberto Garcia Jan. 30, 2017, 4:14 p.m. UTC
The metadata overlap checks introduced in a40f1c2add help detect
corruption in the qcow2 image by verifying that data writes don't
overlap with existing metadata sections.

The 'refcount-block' check in particular iterates over the refcount
table in order to get the addresses of all refcount blocks and check
that none of them overlap with the region where we want to write.

The problem with the refcount table is that since it always occupies
complete clusters its size is usually very big. With the default
values of cluster_size=64KB and refcount_bits=16 this table holds 8192
entries, each one of them enough to map 2GB worth of host clusters.

So unless we're using images with several TB of allocated data this
table is going to be mostly empty, and iterating over it is a waste of
CPU. If the storage backend is fast enough this can have an effect on
I/O performance.

This patch keeps the index of the last used (i.e. non-zero) entry in
the refcount table and updates it every time the table changes. The
refcount-block overlap check then uses that index instead of reading
the whole table.

In my tests with a 4GB qcow2 file stored in RAM this doubles the
amount of write IOPS.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 block/qcow2-refcount.c | 21 ++++++++++++++++++++-
 block/qcow2.c          |  1 +
 block/qcow2.h          |  1 +
 3 files changed, 22 insertions(+), 1 deletion(-)

Comments

Alberto Garcia Jan. 30, 2017, 4:34 p.m. UTC | #1
On Mon 30 Jan 2017 05:14:41 PM CET, Alberto Garcia wrote:

> This patch keeps the index of the last used (i.e. non-zero) entry in
> the refcount table and updates it every time the table changes. The
> refcount-block overlap check then uses that index instead of reading
> the whole table.

Note that while I decided to go for this approach the patch can be made
much simpler by simply stopping at the first empty entry in the refcount
table:

    if ((chk & QCOW2_OL_REFCOUNT_BLOCK) && s->refcount_table) {
        for (i = 0; i < s->refcount_table_size; i++) {
            if (!(s->refcount_table[i] & REFT_OFFSET_MASK)) {
                break;
            }
            if (overlaps_with(s->refcount_table[i] & REFT_OFFSET_MASK,
                s->cluster_size)) {
                return QCOW2_OL_REFCOUNT_BLOCK;
            }
        }
    }

I don't think QEMU produces files where refcount_table[i] == 0 but
refcount_table[i + 1] != 0. Do they even make sense? In any case, my
patch would cover those cases too, but this simplified version wouldn't.

Berto
Eric Blake Jan. 30, 2017, 6:27 p.m. UTC | #2
On 01/30/2017 10:34 AM, Alberto Garcia wrote:

> 
> I don't think QEMU produces files where refcount_table[i] == 0 but
> refcount_table[i + 1] != 0. Do they even make sense?

I don't know if qemu can be directly coerced to create such an image,
but a third party tool can (and that probably includes qemu-io), and
such an image makes sense.  In particular, a refcount can be changed
from non-zero to zero during garbage collection when all clusters
covered by that refcount are no longer in use (perhaps because a
snapshot was deleted).

> In any case, my
> patch would cover those cases too, but this simplified version wouldn't.

Since an image can have holes in the refcount_table, it sounds like the
simplified version is not robust enough.
Max Reitz Feb. 1, 2017, 1:16 a.m. UTC | #3
On 30.01.2017 17:34, Alberto Garcia wrote:
> On Mon 30 Jan 2017 05:14:41 PM CET, Alberto Garcia wrote:
> 
>> This patch keeps the index of the last used (i.e. non-zero) entry in
>> the refcount table and updates it every time the table changes. The
>> refcount-block overlap check then uses that index instead of reading
>> the whole table.
> 
> Note that while I decided to go for this approach the patch can be made
> much simpler by simply stopping at the first empty entry in the refcount
> table:
> 
>     if ((chk & QCOW2_OL_REFCOUNT_BLOCK) && s->refcount_table) {
>         for (i = 0; i < s->refcount_table_size; i++) {
>             if (!(s->refcount_table[i] & REFT_OFFSET_MASK)) {
>                 break;
>             }
>             if (overlaps_with(s->refcount_table[i] & REFT_OFFSET_MASK,
>                 s->cluster_size)) {
>                 return QCOW2_OL_REFCOUNT_BLOCK;
>             }
>         }
>     }
> 
> I don't think QEMU produces files where refcount_table[i] == 0 but
> refcount_table[i + 1] != 0. Do they even make sense? In any case, my
> patch would cover those cases too, but this simplified version wouldn't.

I think it would be reasonable enough to assume that such images don't
exist (and if they do, we don't do overlap checks above the hole -- that
isn't exactly the end of the world). But your patch looks simple enough
as it is.

Max
Max Reitz Feb. 1, 2017, 1:46 a.m. UTC | #4
On 30.01.2017 17:14, Alberto Garcia wrote:
> The metadata overlap checks introduced in a40f1c2add help detect
> corruption in the qcow2 image by verifying that data writes don't
> overlap with existing metadata sections.
> 
> The 'refcount-block' check in particular iterates over the refcount
> table in order to get the addresses of all refcount blocks and check
> that none of them overlap with the region where we want to write.
> 
> The problem with the refcount table is that since it always occupies
> complete clusters its size is usually very big.

Actually the main problem is that BDRVQcow2State.refcount_table_size is
updated very generously as opposed to BDRVQcow2State.l1_size. The latter
is usually exactly as large as it needs to be (because the L1 table size
usually doesn't change), whereas the refcount_table_size is just bumped
up every time the image gets bigger until it reaches or exceeds the
value it needs to be.

>                                                 With the default
> values of cluster_size=64KB and refcount_bits=16 this table holds 8192
> entries, each one of them enough to map 2GB worth of host clusters.
> 
> So unless we're using images with several TB of allocated data this
> table is going to be mostly empty, and iterating over it is a waste of
> CPU. If the storage backend is fast enough this can have an effect on
> I/O performance.
> 
> This patch keeps the index of the last used (i.e. non-zero) entry in
> the refcount table and updates it every time the table changes. The
> refcount-block overlap check then uses that index instead of reading
> the whole table.
> 
> In my tests with a 4GB qcow2 file stored in RAM this doubles the
> amount of write IOPS.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>  block/qcow2-refcount.c | 21 ++++++++++++++++++++-
>  block/qcow2.c          |  1 +
>  block/qcow2.h          |  1 +
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index cbfb3fe064..5e4d36587f 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -83,6 +83,16 @@ static Qcow2SetRefcountFunc *const set_refcount_funcs[] = {
>  /*********************************************************/
>  /* refcount handling */
>  
> +static void update_max_refcount_table_index(BDRVQcow2State *s)
> +{
> +    unsigned i = s->refcount_table_size - 1;
> +    while (i > 0 && (s->refcount_table[i] & REFT_OFFSET_MASK) == 0) {
> +        i--;
> +    }
> +    /* Set s->max_refcount_table_index to the index of the last used entry */
> +    s->max_refcount_table_index = i;
> +}
> +
>  int qcow2_refcount_init(BlockDriverState *bs)
>  {
>      BDRVQcow2State *s = bs->opaque;
> @@ -111,6 +121,7 @@ int qcow2_refcount_init(BlockDriverState *bs)
>          }
>          for(i = 0; i < s->refcount_table_size; i++)
>              be64_to_cpus(&s->refcount_table[i]);
> +        update_max_refcount_table_index(s);
>      }
>      return 0;
>   fail:
> @@ -439,6 +450,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
>          }
>  
>          s->refcount_table[refcount_table_index] = new_block;
> +        s->max_refcount_table_index = refcount_table_index;

Should be MAX(s->max_refcount_table_index, refcount_table_index) or this
won't support refcount tables with holes in them.

Apart from this, the patch looks good to me. (Just a nagging comment below.)

>  
>          /* The new refcount block may be where the caller intended to put its
>           * data, so let it restart the search. */
> @@ -580,6 +592,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
>      s->refcount_table = new_table;
>      s->refcount_table_size = table_size;
>      s->refcount_table_offset = table_offset;
> +    update_max_refcount_table_index(s);
>  
>      /* Free old table. */
>      qcow2_free_clusters(bs, old_table_offset, old_table_size * sizeof(uint64_t),
> @@ -2171,6 +2184,7 @@ write_refblocks:
>      s->refcount_table = on_disk_reftable;
>      s->refcount_table_offset = reftable_offset;
>      s->refcount_table_size = reftable_size;
> +    update_max_refcount_table_index(s);
>  
>      return 0;
>  
> @@ -2383,7 +2397,11 @@ int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
>      }
>  
>      if ((chk & QCOW2_OL_REFCOUNT_BLOCK) && s->refcount_table) {
> -        for (i = 0; i < s->refcount_table_size; i++) {
> +        unsigned last_entry = s->max_refcount_table_index;
> +        assert(last_entry < s->refcount_table_size);
> +        assert(last_entry + 1 == s->refcount_table_size ||
> +               (s->refcount_table[last_entry + 1] & REFT_OFFSET_MASK) == 0);

I'm not sure we need this second assertion, but I don't mind it too much
either.

I'd actually find an assertion that last_entry is less than UNSIGNED_MAX
more important because otherwise the loop below would be an endless one. :-)

(Yes, it's pretty obvious that it's less than UNSIGNED_MAX because it's
less than s->refcount_table_size, which is an unsigned int. I'm just
trying to say something while I'm not sure exactly what I'm trying to
say. Sorry.)

Max

> +        for (i = 0; i <= last_entry; i++) {
>              if ((s->refcount_table[i] & REFT_OFFSET_MASK) &&
>                  overlaps_with(s->refcount_table[i] & REFT_OFFSET_MASK,
>                  s->cluster_size)) {
> @@ -2871,6 +2889,7 @@ int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
>      /* Now update the rest of the in-memory information */
>      old_reftable = s->refcount_table;
>      s->refcount_table = new_reftable;
> +    update_max_refcount_table_index(s);
>  
>      s->refcount_bits = 1 << refcount_order;
>      s->refcount_max = UINT64_C(1) << (s->refcount_bits - 1);
Alberto Garcia Feb. 1, 2017, 10:39 a.m. UTC | #5
On Wed 01 Feb 2017 02:46:20 AM CET, Max Reitz <mreitz@redhat.com> wrote:

>> The problem with the refcount table is that since it always occupies
>> complete clusters its size is usually very big.
>
> Actually the main problem is that BDRVQcow2State.refcount_table_size
> is updated very generously as opposed to BDRVQcow2State.l1_size. The
> latter is usually exactly as large as it needs to be (because the L1
> table size usually doesn't change), whereas the refcount_table_size is
> just bumped up every time the image gets bigger until it reaches or
> exceeds the value it needs to be.

That's actually a good point.

One reason why this happens is that the size of the L1 table is static
and the qcow2 header stores the actual number of entries, whereas for
the refcount table it stores the number of clusters.

Maybe we can have refcount_table_size store the actual size of the
table. The number of clusters can be calculated from that after all, and
we would get rid of max_refcount_table_index.

The change might be a bit more difficult, though, I'll take a look.

>> @@ -439,6 +450,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
>>          }
>>  
>>          s->refcount_table[refcount_table_index] = new_block;
>> +        s->max_refcount_table_index = refcount_table_index;
>
> Should be MAX(s->max_refcount_table_index, refcount_table_index) or
> this won't support refcount tables with holes in them.

Good catch.

Berto
Max Reitz Feb. 1, 2017, 11:58 a.m. UTC | #6
On 01.02.2017 11:39, Alberto Garcia wrote:
> On Wed 01 Feb 2017 02:46:20 AM CET, Max Reitz <mreitz@redhat.com> wrote:
> 
>>> The problem with the refcount table is that since it always occupies
>>> complete clusters its size is usually very big.
>>
>> Actually the main problem is that BDRVQcow2State.refcount_table_size
>> is updated very generously as opposed to BDRVQcow2State.l1_size. The
>> latter is usually exactly as large as it needs to be (because the L1
>> table size usually doesn't change), whereas the refcount_table_size is
>> just bumped up every time the image gets bigger until it reaches or
>> exceeds the value it needs to be.
> 
> That's actually a good point.
> 
> One reason why this happens is that the size of the L1 table is static
> and the qcow2 header stores the actual number of entries, whereas for
> the refcount table it stores the number of clusters.

The other reason is what I said: The image size changes, so the refcount
table size needs to be bumped up from time to time. The L1 table size is
pretty much static, so it makes sense to keep it exactly as large as it
needs to be to represent all of the virtual disk.

> Maybe we can have refcount_table_size store the actual size of the
> table. The number of clusters can be calculated from that after all, and
> we would get rid of max_refcount_table_index.

refcount_table_size as in the one in BDRVQcow2State? That is the number
of entries already. I don't think making it "exact" is worth it, because
see above (you need to bump it up any time an allocation is made, and
you don't want to reallocate the refcount table all the time).

Also, you lose all of the accuracy once the image is closed and reopened
because the field in the qcow2 header only stores number of clusters.
Changing that would not be backwards compatible. You could add a new
field which stores the number of entries, but I don't see the point, as
it is very easy to just recalculate it once the image is opened (which
doesn't take *that* much time if you do it just once).

> The change might be a bit more difficult, though, I'll take a look.

Well, I won't hold you back, but I don't think it's worth it.

Max

>>> @@ -439,6 +450,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
>>>          }
>>>  
>>>          s->refcount_table[refcount_table_index] = new_block;
>>> +        s->max_refcount_table_index = refcount_table_index;
>>
>> Should be MAX(s->max_refcount_table_index, refcount_table_index) or
>> this won't support refcount tables with holes in them.
> 
> Good catch.
> 
> Berto
>
diff mbox

Patch

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index cbfb3fe064..5e4d36587f 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -83,6 +83,16 @@  static Qcow2SetRefcountFunc *const set_refcount_funcs[] = {
 /*********************************************************/
 /* refcount handling */
 
+static void update_max_refcount_table_index(BDRVQcow2State *s)
+{
+    unsigned i = s->refcount_table_size - 1;
+    while (i > 0 && (s->refcount_table[i] & REFT_OFFSET_MASK) == 0) {
+        i--;
+    }
+    /* Set s->max_refcount_table_index to the index of the last used entry */
+    s->max_refcount_table_index = i;
+}
+
 int qcow2_refcount_init(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -111,6 +121,7 @@  int qcow2_refcount_init(BlockDriverState *bs)
         }
         for(i = 0; i < s->refcount_table_size; i++)
             be64_to_cpus(&s->refcount_table[i]);
+        update_max_refcount_table_index(s);
     }
     return 0;
  fail:
@@ -439,6 +450,7 @@  static int alloc_refcount_block(BlockDriverState *bs,
         }
 
         s->refcount_table[refcount_table_index] = new_block;
+        s->max_refcount_table_index = refcount_table_index;
 
         /* The new refcount block may be where the caller intended to put its
          * data, so let it restart the search. */
@@ -580,6 +592,7 @@  static int alloc_refcount_block(BlockDriverState *bs,
     s->refcount_table = new_table;
     s->refcount_table_size = table_size;
     s->refcount_table_offset = table_offset;
+    update_max_refcount_table_index(s);
 
     /* Free old table. */
     qcow2_free_clusters(bs, old_table_offset, old_table_size * sizeof(uint64_t),
@@ -2171,6 +2184,7 @@  write_refblocks:
     s->refcount_table = on_disk_reftable;
     s->refcount_table_offset = reftable_offset;
     s->refcount_table_size = reftable_size;
+    update_max_refcount_table_index(s);
 
     return 0;
 
@@ -2383,7 +2397,11 @@  int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
     }
 
     if ((chk & QCOW2_OL_REFCOUNT_BLOCK) && s->refcount_table) {
-        for (i = 0; i < s->refcount_table_size; i++) {
+        unsigned last_entry = s->max_refcount_table_index;
+        assert(last_entry < s->refcount_table_size);
+        assert(last_entry + 1 == s->refcount_table_size ||
+               (s->refcount_table[last_entry + 1] & REFT_OFFSET_MASK) == 0);
+        for (i = 0; i <= last_entry; i++) {
             if ((s->refcount_table[i] & REFT_OFFSET_MASK) &&
                 overlaps_with(s->refcount_table[i] & REFT_OFFSET_MASK,
                 s->cluster_size)) {
@@ -2871,6 +2889,7 @@  int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
     /* Now update the rest of the in-memory information */
     old_reftable = s->refcount_table;
     s->refcount_table = new_reftable;
+    update_max_refcount_table_index(s);
 
     s->refcount_bits = 1 << refcount_order;
     s->refcount_max = UINT64_C(1) << (s->refcount_bits - 1);
diff --git a/block/qcow2.c b/block/qcow2.c
index 96fb8a8f16..3e274bd1ba 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2743,6 +2743,7 @@  static int make_completely_empty(BlockDriverState *bs)
 
     s->refcount_table_offset = s->cluster_size;
     s->refcount_table_size   = s->cluster_size / sizeof(uint64_t);
+    s->max_refcount_table_index = 0;
 
     g_free(s->refcount_table);
     s->refcount_table = new_reftable;
diff --git a/block/qcow2.h b/block/qcow2.h
index 182341483a..f8aeb08794 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -251,6 +251,7 @@  typedef struct BDRVQcow2State {
     uint64_t *refcount_table;
     uint64_t refcount_table_offset;
     uint32_t refcount_table_size;
+    uint32_t max_refcount_table_index; /* Last used entry in refcount_table */
     uint64_t free_cluster_index;
     uint64_t free_byte_offset;