diff mbox series

[for-2.12,v2] qcow2: Reset free_cluster_index when allocating a new refcount block

Message ID 20180321133852.3560-1-berto@igalia.com
State New
Headers show
Series [for-2.12,v2] qcow2: Reset free_cluster_index when allocating a new refcount block | expand

Commit Message

Alberto Garcia March 21, 2018, 1:38 p.m. UTC
When we try to allocate new clusters we first look for available ones
starting from s->free_cluster_index and once we find them we increase
their reference counts. Before we get to call update_refcount() to do
this last step s->free_cluster_index is already pointing to the next
cluster after the ones we are trying to allocate.

During update_refcount() it may happen however that we also need to
allocate a new refcount block in order to store the refcounts of these
new clusters (and to complicate things further that may also require
us to grow the refcount table). After all this we don't know if the
clusters that we originally tried to allocate are still available, so
we return -EAGAIN to ask the caller to restart the search for free
clusters.

This is what can happen in a common scenario:

  1) We want to allocate a new cluster and we see that cluster N is
     free.

  2) We try to increase N's refcount but all refcount blocks are full,
     so we allocate a new one at N+1 (where s->free_cluster_index was
     pointing at).

  3) Once we're done we return -EAGAIN to look again for a free
     cluster, but now s->free_cluster_index points at N+2, so that's
     the one we allocate. Cluster N remains unallocated and we have a
     hole in the qcow2 file.

This can be reproduced easily:

     qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
     qemu-io -c 'write 0 124k' hd.qcow2

After this the image has 132608 bytes (256 clusters), and the refcount
block is full. If we write 512 more bytes it should allocate two new
clusters: the data cluster itself and a new refcount block.

     qemu-io -c 'write 124k 512' hd.qcow2

However the image has now three new clusters (259 in total), and the
first one of them is empty (and unallocated):

     dd if=hd.qcow2 bs=512c skip=256 count=1 | hexdump -C

If we write larger amounts of data in the last step instead of the 512
bytes used in this example we can create larger holes in the qcow2
file.

What this patch does is reset s->free_cluster_index to its previous
value when alloc_refcount_block() returns -EAGAIN. This way the caller
will try to allocate again the original clusters if they are still
free.

The output of iotest 026 also needs to be updated because now that
images have no holes some tests fail at a different point and the
number of leaked clusters is different.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---

v2: Force refcount_bits=16 in iotest 121 [Eric]

---
 block/qcow2-refcount.c     |  7 +++++++
 tests/qemu-iotests/026.out |  6 +++---
 tests/qemu-iotests/121     | 20 ++++++++++++++++++++
 tests/qemu-iotests/121.out | 10 ++++++++++
 4 files changed, 40 insertions(+), 3 deletions(-)

Comments

Kevin Wolf March 21, 2018, 1:52 p.m. UTC | #1
Am 21.03.2018 um 14:38 hat Alberto Garcia geschrieben:
> This can be reproduced easily:
> 
>      qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
>      qemu-io -c 'write 0 124k' hd.qcow2
> 
> After this the image has 132608 bytes (256 clusters), and the refcount
> block is full. If we write 512 more bytes it should allocate two new
> clusters: the data cluster itself and a new refcount block.
> 
>      qemu-io -c 'write 124k 512' hd.qcow2
> 
> However the image has now three new clusters (259 in total), and the
> first one of them is empty (and unallocated):
> 
>      dd if=hd.qcow2 bs=512c skip=256 count=1 | hexdump -C
> 
> If we write larger amounts of data in the last step instead of the 512
> bytes used in this example we can create larger holes in the qcow2
> file.

So if I understand correctly, this is what we get:

0x20000: free
0x20200: refcount block
0x20400: data cluster
0x20600: potential next data cluster

> What this patch does is reset s->free_cluster_index to its previous
> value when alloc_refcount_block() returns -EAGAIN. This way the caller
> will try to allocate again the original clusters if they are still
> free.

This is an improvement, because now we should avoid the unused cluster:

0x20000: data cluster
0x20200: refcount block
0x20400: potential next data cluster

But now the data clusters are fragmented. Should we try to change the
logic so that already the refcount block allocation can make use of the
free space? That is:

0x20000: refcount block
0x20200: data cluster
0x20400: contiguous potential next data cluster

Kevin
Alberto Garcia March 21, 2018, 2:10 p.m. UTC | #2
On Wed 21 Mar 2018 02:52:38 PM CET, Kevin Wolf wrote:

>>      qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
>>      qemu-io -c 'write 0 124k' hd.qcow2
>>      qemu-io -c 'write 124k 512' hd.qcow2
>
> So if I understand correctly, this is what we get:
>
> 0x20000: free
> 0x20200: refcount block
> 0x20400: data cluster
> 0x20600: potential next data cluster

Yes.

>> What this patch does is reset s->free_cluster_index to its previous
>> value when alloc_refcount_block() returns -EAGAIN. This way the caller
>> will try to allocate again the original clusters if they are still
>> free.
>
> This is an improvement, because now we should avoid the unused cluster:
>
> 0x20000: data cluster
> 0x20200: refcount block
> 0x20400: potential next data cluster

That's correct.

> But now the data clusters are fragmented. Should we try to change the
> logic so that already the refcount block allocation can make use of
> the free space? That is:
>
> 0x20000: refcount block
> 0x20200: data cluster
> 0x20400: contiguous potential next data cluster

Well, the clusters before 0x20000 are also data clusters so there's
going to be fragmentation anyway. Also, if you're writing more than 1
data cluster in a single request when the refcount block allocation
happens all those new data clusters are still going to be contiguous
(before the new refcount block).

Another possibility that I considered was to make alloc_clusters_noref()
return the offset but let free_cluster_index untouched until the
refcounts are correct, but this requires more code and I fear that
keeping free_cluster_index pointing to the clusters we're trying to
allocate can lead to unexpected surprises.

The solution I chose is -I think- the simplest and easiest to audit.

On a related note, I'm using a script that I wrote myself to debug qcow2
images. It prints the contents of the header and lists all host
clusters, indicating the type of each one. I find it useful to debug
problems like this one. If there's nothing similar already existing I
can post it and we could put it in the scripts/ directory.

Berto
Kevin Wolf March 21, 2018, 3:07 p.m. UTC | #3
Am 21.03.2018 um 15:10 hat Alberto Garcia geschrieben:
> On Wed 21 Mar 2018 02:52:38 PM CET, Kevin Wolf wrote:
> 
> >>      qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
> >>      qemu-io -c 'write 0 124k' hd.qcow2
> >>      qemu-io -c 'write 124k 512' hd.qcow2
> >
> > So if I understand correctly, this is what we get:
> >
> > 0x20000: free
> > 0x20200: refcount block
> > 0x20400: data cluster
> > 0x20600: potential next data cluster
> 
> Yes.
> 
> >> What this patch does is reset s->free_cluster_index to its previous
> >> value when alloc_refcount_block() returns -EAGAIN. This way the caller
> >> will try to allocate again the original clusters if they are still
> >> free.
> >
> > This is an improvement, because now we should avoid the unused cluster:
> >
> > 0x20000: data cluster
> > 0x20200: refcount block
> > 0x20400: potential next data cluster
> 
> That's correct.
> 
> > But now the data clusters are fragmented. Should we try to change the
> > logic so that already the refcount block allocation can make use of
> > the free space? That is:
> >
> > 0x20000: refcount block
> > 0x20200: data cluster
> > 0x20400: contiguous potential next data cluster
> 
> Well, the clusters before 0x20000 are also data clusters so there's
> going to be fragmentation anyway. Also, if you're writing more than 1
> data cluster in a single request when the refcount block allocation
> happens all those new data clusters are still going to be contiguous
> (before the new refcount block).

That's true.

I just remembered that when I looked at an image recently, I noticed
that the refcount block wasn't in the first cluster and I disliked it,
though mostly because it felt untidy rather than being a problem. There
was one effect that might count as an advantage, though: The first few
data clusters covered by the refcount block were corrupted before the
overlap check for the refcount block stopped it.

Not necessarily worth changing the logic, I just thought I'd bring it
up and see whether anyone else dislikes it.

> Another possibility that I considered was to make alloc_clusters_noref()
> return the offset but let free_cluster_index untouched until the
> refcounts are correct, but this requires more code and I fear that
> keeping free_cluster_index pointing to the clusters we're trying to
> allocate can lead to unexpected surprises.

Possibly. I think if I wanted to change the order, I'd consider
returning a hard error from update_refcount() when no refcount block is
available, and then the caller would have to call alloc_refcount_block()
manually before retrying update_refcount().

> The solution I chose is -I think- the simplest and easiest to audit.
> 
> On a related note, I'm using a script that I wrote myself to debug qcow2
> images. It prints the contents of the header and lists all host
> clusters, indicating the type of each one. I find it useful to debug
> problems like this one. If there's nothing similar already existing I
> can post it and we could put it in the scripts/ directory.

Having a qcow2 analysis script in the repo sounds like a good idea. John
has something, too. Maybe we can check whether the two things complement
each other and then check in a script that combines both (or if one
provides a superset of the other, just check in that one).

Kevin
Alberto Garcia March 21, 2018, 3:32 p.m. UTC | #4
On Wed 21 Mar 2018 04:07:28 PM CET, Kevin Wolf wrote:
> I just remembered that when I looked at an image recently, I noticed
> that the refcount block wasn't in the first cluster and I disliked it,
> though mostly because it felt untidy rather than being a problem.

I don't think you can fix that in general. In my test case I'm writing
new data when the existing refcount block is already full, so we can
allocate the new one before the new data and keep everything tidy.

But if there are, say, two refcount entries available and you write four
data clusters you don't want the new refcount block in the middle of
those four clusters just to have it at the beginning :-)

> Having a qcow2 analysis script in the repo sounds like a good
> idea. John has something, too. Maybe we can check whether the two
> things complement each other and then check in a script that combines
> both (or if one provides a superset of the other, just check in that
> one).

I'll take a look.

Berto
Kevin Wolf March 21, 2018, 3:40 p.m. UTC | #5
Am 21.03.2018 um 16:32 hat Alberto Garcia geschrieben:
> On Wed 21 Mar 2018 04:07:28 PM CET, Kevin Wolf wrote:
> > I just remembered that when I looked at an image recently, I noticed
> > that the refcount block wasn't in the first cluster and I disliked it,
> > though mostly because it felt untidy rather than being a problem.
> 
> I don't think you can fix that in general. In my test case I'm writing
> new data when the existing refcount block is already full, so we can
> allocate the new one before the new data and keep everything tidy.
> 
> But if there are, say, two refcount entries available and you write four
> data clusters you don't want the new refcount block in the middle of
> those four clusters just to have it at the beginning :-)

Agreed. I think if implemented like I suggested (hard error in
update_refcount() and then allocate a new refcount block in the caller),
we'd actually end up with the refcount block covered by an old refcount
block rather than being self-describing. And that in turn wouldn't make
things much tidier, so I guess you're right. :-)

> > Having a qcow2 analysis script in the repo sounds like a good
> > idea. John has something, too. Maybe we can check whether the two
> > things complement each other and then check in a script that combines
> > both (or if one provides a superset of the other, just check in that
> > one).
> 
> I'll take a look.

Thanks!

Kevin
Kevin Wolf March 21, 2018, 5:52 p.m. UTC | #6
Am 21.03.2018 um 14:38 hat Alberto Garcia geschrieben:
> When we try to allocate new clusters we first look for available ones
> starting from s->free_cluster_index and once we find them we increase
> their reference counts. Before we get to call update_refcount() to do
> this last step s->free_cluster_index is already pointing to the next
> cluster after the ones we are trying to allocate.
> 
> During update_refcount() it may happen however that we also need to
> allocate a new refcount block in order to store the refcounts of these
> new clusters (and to complicate things further that may also require
> us to grow the refcount table). After all this we don't know if the
> clusters that we originally tried to allocate are still available, so
> we return -EAGAIN to ask the caller to restart the search for free
> clusters.
> 
> This is what can happen in a common scenario:
> 
>   1) We want to allocate a new cluster and we see that cluster N is
>      free.
> 
>   2) We try to increase N's refcount but all refcount blocks are full,
>      so we allocate a new one at N+1 (where s->free_cluster_index was
>      pointing at).
> 
>   3) Once we're done we return -EAGAIN to look again for a free
>      cluster, but now s->free_cluster_index points at N+2, so that's
>      the one we allocate. Cluster N remains unallocated and we have a
>      hole in the qcow2 file.
> 
> This can be reproduced easily:
> 
>      qemu-img create -f qcow2 -o cluster_size=512 hd.qcow2 1M
>      qemu-io -c 'write 0 124k' hd.qcow2
> 
> After this the image has 132608 bytes (256 clusters), and the refcount
> block is full. If we write 512 more bytes it should allocate two new
> clusters: the data cluster itself and a new refcount block.
> 
>      qemu-io -c 'write 124k 512' hd.qcow2
> 
> However the image has now three new clusters (259 in total), and the
> first one of them is empty (and unallocated):
> 
>      dd if=hd.qcow2 bs=512c skip=256 count=1 | hexdump -C
> 
> If we write larger amounts of data in the last step instead of the 512
> bytes used in this example we can create larger holes in the qcow2
> file.
> 
> What this patch does is reset s->free_cluster_index to its previous
> value when alloc_refcount_block() returns -EAGAIN. This way the caller
> will try to allocate again the original clusters if they are still
> free.
> 
> The output of iotest 026 also needs to be updated because now that
> images have no holes some tests fail at a different point and the
> number of leaked clusters is different.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks, applied to the block branch.

Kevin
diff mbox series

Patch

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 362deaf303..6b8b63514a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -839,6 +839,13 @@  static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                 qcow2_cache_put(s->refcount_block_cache, &refcount_block);
             }
             ret = alloc_refcount_block(bs, cluster_index, &refcount_block);
+            /* If the caller needs to restart the search for free clusters,
+             * try the same ones first to see if they're still free. */
+            if (ret == -EAGAIN) {
+                if (s->free_cluster_index > (start >> s->cluster_bits)) {
+                    s->free_cluster_index = (start >> s->cluster_bits);
+                }
+            }
             if (ret < 0) {
                 goto fail;
             }
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index 86a50a2e13..8e89416a86 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -533,7 +533,7 @@  Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-11 leaked clusters were found on the image.
+10 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -561,7 +561,7 @@  Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-11 leaked clusters were found on the image.
+10 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
@@ -589,7 +589,7 @@  Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
-11 leaked clusters were found on the image.
+10 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
diff --git a/tests/qemu-iotests/121 b/tests/qemu-iotests/121
index 1307b4e327..6d6f55a5dc 100755
--- a/tests/qemu-iotests/121
+++ b/tests/qemu-iotests/121
@@ -93,6 +93,26 @@  $QEMU_IO -c 'write 63M 130K' "$TEST_IMG" | _filter_qemu_io
 
 _check_test_img
 
+echo
+echo '=== Allocating a new refcount block must not leave holes in the image ==='
+echo
+
+IMGOPTS='cluster_size=512,refcount_bits=16' _make_test_img 1M
+
+# This results in an image with 256 used clusters: the qcow2 header,
+# the refcount table, one refcount block, the L1 table, four L2 tables
+# and 248 data clusters
+$QEMU_IO -c 'write 0 124k' "$TEST_IMG" | _filter_qemu_io
+
+# 256 clusters of 512 bytes each give us a 128K image
+stat -c "size=%s (expected 131072)" $TEST_IMG
+
+# All 256 entries of the refcount block are used, so writing a new
+# data cluster also allocates a new refcount block
+$QEMU_IO -c 'write 124k 512' "$TEST_IMG" | _filter_qemu_io
+
+# Two more clusters, the image size should be 129K now
+stat -c "size=%s (expected 132096)" $TEST_IMG
 
 # success, all done
 echo
diff --git a/tests/qemu-iotests/121.out b/tests/qemu-iotests/121.out
index 5961a44cd9..613d56185e 100644
--- a/tests/qemu-iotests/121.out
+++ b/tests/qemu-iotests/121.out
@@ -20,4 +20,14 @@  wrote 133120/133120 bytes at offset 66060288
 130 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 
+=== Allocating a new refcount block must not leave holes in the image ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+wrote 126976/126976 bytes at offset 0
+124 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+size=131072 (expected 131072)
+wrote 512/512 bytes at offset 126976
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+size=132096 (expected 132096)
+
 *** done