diff mbox series

qcow2: Reset free_cluster_index when allocating a new refcount block

Message ID 20180320135515.16823-1-berto@igalia.com
State New
Headers show
Series qcow2: Reset free_cluster_index when allocating a new refcount block | expand

Commit Message

Alberto Garcia March 20, 2018, 1:55 p.m. UTC
When we try to allocate new clusters we first look for available ones
starting from s->free_cluster_index and once we find them we increase
their reference counts. Before we get to call update_refcount() to do
this last step s->free_cluster_index is already pointing to the next
cluster after the ones we are trying to allocate.

During update_refcount() it may happen however that we also need to
allocate a new refcount block in order to store the refcounts of these
new clusters (and to complicate things further that may also require
us to grow the refcount table). After all this we don't know if the
clusters that we originally tried to allocate are still available, so
we return -EAGAIN to ask the caller to restart the search for free
clusters.

This is what can happen in a common scenario:

  1) We want to allocate a new cluster and we see that cluster N is
     free.

  2) We try to increase N's refcount but all refcount blocks are full,
     so we allocate a new one at N+1 (where s->free_cluster_index was
     pointing at).

  3) Once we're done we return -EAGAIN to look again for a free
     cluster, but now s->free_cluster_index points at N+2, so that's
     the one we allocate. Cluster N remains unallocated and we have a
     hole in the qcow2 file.

This can be reproduced easily:

     qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
     qemu-io -c 'write 0 124k' hd.qcow2

After this the image has 132608 bytes (256 clusters), and the refcount
block is full. If we write 512 more bytes it should allocate two new
clusters: the data cluster itself and a new refcount block.

     qemu-io -c 'write 124k 512' hd.qcow2

However the image has now three new clusters (259 in total), and the
first one of them is empty (and unallocated):

     dd if=hd.qcow2 bs=512c skip=256 count=1 | hexdump -C

If we write larger amounts of data in the last step instead of the 512
bytes used in this example we can create larger holes in the qcow2
file.

What this patch does is reset s->free_cluster_index to its previous
value when alloc_refcount_block() returns -EAGAIN. This way the caller
will try to allocate again the original clusters if they are still
free.

The output of iotest 026 also needs to be updated because now that
images have no holes some tests fail at a different point and the
number of leaked clusters is different.

Signed-off-by: Alberto Garcia <berto@igalia.com>
---
 block/qcow2-refcount.c     |  7 +++++++
 tests/qemu-iotests/026.out |  6 +++---
 tests/qemu-iotests/121     | 20 ++++++++++++++++++++
 tests/qemu-iotests/121.out | 10 ++++++++++
 4 files changed, 40 insertions(+), 3 deletions(-)

Comments

Eric Blake March 20, 2018, 5:54 p.m. UTC | #1
On 03/20/2018 08:55 AM, Alberto Garcia wrote:
> When we try to allocate new clusters we first look for available ones
> starting from s->free_cluster_index and once we find them we increase
> their reference counts. Before we get to call update_refcount() to do
> this last step s->free_cluster_index is already pointing to the next
> cluster after the ones we are trying to allocate.
> 
 > During update_refcount() it may happen however that we also need to
 > allocate a new refcount block in order to store the refcounts of these
 > new clusters

Your changes to test 121 covers this...

 > (and to complicate things further that may also require
 > us to grow the refcount table).

...but not this.  Is it worth also trying to cover this case in the 
testsuite as well?  Probably made easier if you use a refcount_order of 
6 instead of the default of 4, so that it is easier to make the refcount 
table spill into a second cluster because refcount blocks have fewer 
entries.

> This can be reproduced easily:
> 
>       qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
>       qemu-io -c 'write 0 124k' hd.qcow2
> 
> After this the image has 132608 bytes (256 clusters), and the refcount
> block is full. If we write 512 more bytes it should allocate two new
> clusters: the data cluster itself and a new refcount block.
> 
>       qemu-io -c 'write 124k 512' hd.qcow2

Nice test case!  And yes, 512-byte clusters are easier for proving when 
we have inefficient allocation strategies.

> What this patch does is reset s->free_cluster_index to its previous
> value when alloc_refcount_block() returns -EAGAIN. This way the caller
> will try to allocate again the original clusters if they are still
> free.
> 
> The output of iotest 026 also needs to be updated because now that
> images have no holes some tests fail at a different point and the
> number of leaked clusters is different.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>   block/qcow2-refcount.c     |  7 +++++++
>   tests/qemu-iotests/026.out |  6 +++---
>   tests/qemu-iotests/121     | 20 ++++++++++++++++++++
>   tests/qemu-iotests/121.out | 10 ++++++++++
>   4 files changed, 40 insertions(+), 3 deletions(-)

Long-standing bug, but it IS a bug fix, so I think it is safe for 2.12.

> 
> +            /* If the caller needs to restart the search for free clusters,
> +             * try the same ones first to see if they're still free. */
> +            if (ret == -EAGAIN) {
> +                if (s->free_cluster_index > (start >> s->cluster_bits)) {
> +                    s->free_cluster_index = (start >> s->cluster_bits);
> +                }

Is there any harm in making this assignment unconditional, instead of 
only doing it when free_cluster_index has grown larger than start?  [And 
unrelated, but it might be nice to do a followup cleanup to track 
free_cluster_offset by bytes instead of having to shift 
free_cluster_index everywhere]

At any rate, my comments can be deferred to later if desired, so

Reviewed-by: Eric Blake <eblake@redhat.com>
Alberto Garcia March 21, 2018, 9:28 a.m. UTC | #2
On Tue 20 Mar 2018 06:54:15 PM CET, Eric Blake wrote:

>> When we try to allocate new clusters we first look for available ones
>> starting from s->free_cluster_index and once we find them we increase
>> their reference counts. Before we get to call update_refcount() to do
>> this last step s->free_cluster_index is already pointing to the next
>> cluster after the ones we are trying to allocate.
>> 
>  > During update_refcount() it may happen however that we also need to
>  > allocate a new refcount block in order to store the refcounts of
>  > these new clusters
>
> Your changes to test 121 covers this...
>
>  > (and to complicate things further that may also require us to grow
>  > the refcount table).
>
> ...but not this.  Is it worth also trying to cover this case in the
> testsuite as well?

I checked and the patch doesn't really fix that scenario. There's a
different problem that I haven't debugged completely yet, because of two
reasons:

 - One difference is that when we grow the refcount table we actually
   allocate a new one, so s->free_cluster_index points to the beginning
   of the image (where the previous table was) and any holes left during
   the process are allocated after that (depending on how much data we
   write though).

 - This scenario is harder to reach: in order to fill a 1-cluster
   refcount table the size of the image needs to be larger than
   (cluster_size³ / refcount_bits) bytes, that's 16TB with the default
   parameters. So although it can be reproduced easily if you reduce the
   cluster size I think it's very infrequent under normal conditions.

But yes, it's a task left for the future.

>> +            /* If the caller needs to restart the search for free clusters,
>> +             * try the same ones first to see if they're still free. */
>> +            if (ret == -EAGAIN) {
>> +                if (s->free_cluster_index > (start >> s->cluster_bits)) {
>> +                    s->free_cluster_index = (start >> s->cluster_bits);
>> +                }
>
> Is there any harm in making this assignment unconditional, instead of
> only doing it when free_cluster_index has grown larger than start?

It can happen that it is smaller than 'start' if we were moving the
refcount table to a new location, so we want to keep the lowest value.

> [And unrelated, but it might be nice to do a followup cleanup to track
> free_cluster_offset by bytes instead of having to shift
> free_cluster_index everywhere]

I've actually just seen that we already have free_byte_offset, we use
that for compressed clusters, so it might be possible to use that one...

I'll put that in my TODO list.

Berto
Eric Blake March 21, 2018, 1:08 p.m. UTC | #3
On 03/21/2018 04:28 AM, Alberto Garcia wrote:
> On Tue 20 Mar 2018 06:54:15 PM CET, Eric Blake wrote:
> 
>>> When we try to allocate new clusters we first look for available ones
>>> starting from s->free_cluster_index and once we find them we increase
>>> their reference counts. Before we get to call update_refcount() to do
>>> this last step s->free_cluster_index is already pointing to the next
>>> cluster after the ones we are trying to allocate.
>>>
>>   > During update_refcount() it may happen however that we also need to
>>   > allocate a new refcount block in order to store the refcounts of
>>   > these new clusters
>>
>> Your changes to test 121 covers this...
>>
>>   > (and to complicate things further that may also require us to grow
>>   > the refcount table).
>>
>> ...but not this.  Is it worth also trying to cover this case in the
>> testsuite as well?
> 
> I checked and the patch doesn't really fix that scenario. There's a
> different problem that I haven't debugged completely yet, because of two
> reasons:
> 
>   - One difference is that when we grow the refcount table we actually
>     allocate a new one, so s->free_cluster_index points to the beginning
>     of the image (where the previous table was) and any holes left during
>     the process are allocated after that (depending on how much data we
>     write though).

Yeah, that can make the test harder to reason about, and is slightly 
more sensitive to our internal algorithm - but it also explains why you 
checked for index > start rather than unconditionally assigning index = 
start.

> 
>   - This scenario is harder to reach: in order to fill a 1-cluster
>     refcount table the size of the image needs to be larger than
>     (cluster_size³ / refcount_bits) bytes, that's 16TB with the default
>     parameters. So although it can be reproduced easily if you reduce the
>     cluster size I think it's very infrequent under normal conditions.

Yes, 16TB for default size, but only 2M for the best-case (512-byte 
cluster, 64-bit refcount), so still easy to write a test for.

> 
> But yes, it's a task left for the future.

Fair enough.

> 
>>> +            /* If the caller needs to restart the search for free clusters,
>>> +             * try the same ones first to see if they're still free. */
>>> +            if (ret == -EAGAIN) {
>>> +                if (s->free_cluster_index > (start >> s->cluster_bits)) {
>>> +                    s->free_cluster_index = (start >> s->cluster_bits);
>>> +                }
>>
>> Is there any harm in making this assignment unconditional, instead of
>> only doing it when free_cluster_index has grown larger than start?
> 
> It can happen that it is smaller than 'start' if we were moving the
> refcount table to a new location, so we want to keep the lowest value.

Okay, then my R-b is sufficient on the patch as-is.

> 
>> [And unrelated, but it might be nice to do a followup cleanup to track
>> free_cluster_offset by bytes instead of having to shift
>> free_cluster_index everywhere]
> 
> I've actually just seen that we already have free_byte_offset, we use
> that for compressed clusters, so it might be possible to use that one...

free_byte_offset really IS a byte offset, as it can point mid-cluster. 
But having EVERYTHING be byte-based seems like it will make reasoning 
about the code easier to do.

> 
> I'll put that in my TODO list.
> 
> Berto
>
Alberto Garcia March 21, 2018, 1:15 p.m. UTC | #4
On Wed 21 Mar 2018 02:08:23 PM CET, Eric Blake wrote:
>>   - This scenario is harder to reach: in order to fill a 1-cluster
>>   refcount table the size of the image needs to be larger than
>>   (cluster_size³ / refcount_bits) bytes, that's 16TB with the default
>>   parameters. So although it can be reproduced easily if you reduce
>>   the cluster size I think it's very infrequent under normal
>>   conditions.
>
> Yes, 16TB for default size, but only 2M for the best-case (512-byte
> cluster, 64-bit refcount), so still easy to write a test for.

The test case is indeed very easy:

   qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 16M
   qemu-io -c 'write 0 8222208' hd.qcow2
   qemu-io -c 'write 8222208 512' hd.qcow2

What I haven't done is debug the problem yet.

As I said I don't think this happens often in a real-world scenario, so
I wanted to fix the simple (and more common) problem first.

Berto
Eric Blake March 21, 2018, 1:30 p.m. UTC | #5
On 03/20/2018 08:55 AM, Alberto Garcia wrote:
> When we try to allocate new clusters we first look for available ones
> starting from s->free_cluster_index and once we find them we increase
> their reference counts. Before we get to call update_refcount() to do
> this last step s->free_cluster_index is already pointing to the next
> cluster after the ones we are trying to allocate.
> 

> This can be reproduced easily:
> 
>       qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
>       qemu-io -c 'write 0 124k' hd.qcow2

This reproduction fails if you use non-default refcount_order...

> +++ b/tests/qemu-iotests/121
> @@ -93,6 +93,26 @@ $QEMU_IO -c 'write 63M 130K' "$TEST_IMG" | _filter_qemu_io
>   
>   _check_test_img
>   
> +echo
> +echo '=== Allocating a new refcount block must not leave holes in the image ==='
> +echo
> +
> +IMGOPTS='cluster_size=512' _make_test_img 1M

...so here, I think IMGOPTS also has to include refcount_bits=4, so that 
calling iotests './check -o refcount_bits=3' or similar doesn't fail the 
test.
Eric Blake March 21, 2018, 1:31 p.m. UTC | #6
On 03/21/2018 08:30 AM, Eric Blake wrote:
> On 03/20/2018 08:55 AM, Alberto Garcia wrote:

>> This can be reproduced easily:
>>
>>       qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
>>       qemu-io -c 'write 0 124k' hd.qcow2
> 
> This reproduction fails if you use non-default refcount_order...
> 
>> +++ b/tests/qemu-iotests/121
>> @@ -93,6 +93,26 @@ $QEMU_IO -c 'write 63M 130K' "$TEST_IMG" | 
>> _filter_qemu_io
>>   _check_test_img
>> +echo
>> +echo '=== Allocating a new refcount block must not leave holes in the 
>> image ==='
>> +echo
>> +
>> +IMGOPTS='cluster_size=512' _make_test_img 1M
> 
> ...so here, I think IMGOPTS also has to include refcount_bits=4, so that 
> calling iotests './check -o refcount_bits=3' or similar doesn't fail the 
> test.

Except that it's spelled refcount_bits=16 vs. refcount_bits=8 (we 
convert refcount_bits as a power of 2 to refcount_order as an exponent 
value written into the qcow2 header).
diff mbox series

Patch

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 362deaf303..6b8b63514a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -839,6 +839,13 @@  static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_put(s->refcount_block_cache, &refcount_block);
             }
             ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
+            /* If the caller needs to restart the search for free clusters,
+             * try the same ones first to see if they're still free. */
+            if (ret == -EAGAIN) {
+                if (s->free_cluster_index > (start >> s->cluster_bits)) {
+                    s->free_cluster_index = (start >> s->cluster_bits);
+                }
+            }
             if (ret < 0) {
                 goto fail;
             }
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index 86a50a2e13..8e89416a86 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -533,7 +533,7 @@  Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-11 leaked clusters were found on the image.
+10 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -561,7 +561,7 @@  Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-11 leaked clusters were found on the image.
+10 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -589,7 +589,7 @@  Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-11 leaked clusters were found on the image.
+10 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
diff --git a/tests/qemu-iotests/121 b/tests/qemu-iotests/121
index 1307b4e327..13b2479e86 100755
--- a/tests/qemu-iotests/121
+++ b/tests/qemu-iotests/121
@@ -93,6 +93,26 @@  $QEMU_IO -c 'write 63M 130K' "$TEST_IMG" | _filter_qemu_io
 
 _check_test_img
 
+echo
+echo '=== Allocating a new refcount block must not leave holes in the image ==='
+echo
+
+IMGOPTS='cluster_size=512' _make_test_img 1M
+
+# This results in an image with 256 used clusters: the qcow2 header,
+# the refcount table, one refcount block, the L1 table, four L2 tables
+# and 248 data clusters
+$QEMU_IO -c 'write 0 124k' "$TEST_IMG" | _filter_qemu_io
+
+# 256 clusters of 512 bytes each give us a 128K image
+stat -c "size=%s (expected 131072)" $TEST_IMG
+
+# All 256 entries of the refcount block are used, so writing a new
+# data cluster also allocates a new refcount block
+$QEMU_IO -c 'write 124k 512' "$TEST_IMG" | _filter_qemu_io
+
+# Two more clusters, the image size should be 129K now
+stat -c "size=%s (expected 132096)" $TEST_IMG
 
 # success, all done
 echo
diff --git a/tests/qemu-iotests/121.out b/tests/qemu-iotests/121.out
index 5961a44cd9..613d56185e 100644
--- a/tests/qemu-iotests/121.out
+++ b/tests/qemu-iotests/121.out
@@ -20,4 +20,14 @@  wrote 133120/133120 bytes at offset 66060288
 130 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 
+=== Allocating a new refcount block must not leave holes in the image ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+wrote 126976/126976 bytes at offset 0
+124 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+size=131072 (expected 131072)
+wrote 512/512 bytes at offset 126976
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+size=132096 (expected 132096)
+
 *** done