diff mbox series

[v12,2/3] qcow2: Allow writing compressed data of multiple clusters

Message ID 1575288906-551879-3-git-send-email-andrey.shinkevich@virtuozzo.com
State New
Headers show
Series qcow2: advanced compression options | expand

Commit Message

Andrey Shinkevich Dec. 2, 2019, 12:15 p.m. UTC
QEMU currently supports writing compressed data of the size equal to
one cluster. This patch allows writing QCOW2 compressed data that
exceed one cluster. Now, we split buffered data into separate clusters
and write them compressed using the block/aio_task API.

Suggested-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2.c | 102 ++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 75 insertions(+), 27 deletions(-)

Comments

Alberto Garcia April 9, 2020, 4:50 p.m. UTC | #1
On Mon 02 Dec 2019 01:15:05 PM CET, Andrey Shinkevich wrote:
> +static coroutine_fn int
> +qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
> +                                 uint64_t offset, uint64_t bytes,
> +                                 QEMUIOVector *qiov, size_t qiov_offset)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    AioTaskPool *aio = NULL;
> +    int ret = 0;
> +
> +    if (has_data_file(bs)) {
> +        return -ENOTSUP;
> +    }
> +
> +    if (bytes == 0) {
> +        /*
> +         * align end of file to a sector boundary to ease reading with
> +         * sector based I/Os
> +         */
> +        int64_t len = bdrv_getlength(bs->file->bs);
> +        if (len < 0) {
> +            return len;
> +        }
> +        return bdrv_co_truncate(bs->file, len, false, PREALLOC_MODE_OFF, NULL);
> +    }
> +
> +    if (offset_into_cluster(s, offset)) {
> +        return -EINVAL;
> +    }
> +
> +    while (bytes && aio_task_pool_status(aio) == 0) {
> +        uint64_t chunk_size = MIN(bytes, s->cluster_size);
> +
> +        if (!aio && chunk_size != bytes) {
> +            aio = aio_task_pool_new(QCOW2_MAX_WORKERS);
> +        }
> +
> +        ret = qcow2_add_task(bs, aio, qcow2_co_pwritev_compressed_task_entry,
> +                             0, 0, offset, chunk_size, qiov, qiov_offset, NULL);
> +        if (ret < 0) {
> +            break;
> +        }
> +        qiov_offset += chunk_size;
> +        offset += chunk_size;
> +        bytes -= chunk_size;
> +    }

This patch allows the user to write more than one cluster of compressed
data at a time, and it does so by splitting the request into many
cluster-sized requests and using qcow2_add_task() for each one of them.

What happens however is that there's no guarantee that the requests are
processed in the same order that they were added.

One consequence is that running on an empty qcow2 file a command as
simple as this one:

   qemu-io -c 'write -c 0 256k' image.qcow2

does not always produce the same results.

This does not have any user-visible consequences for the guest. In all
cases the data is correctly written, it's just that the ordering of the
compressed clusters (and therefore the contents of the L2 entries) will
be different each time.

Because of this a test cannot expect that running the same commands on
an empty image produces always the same results.

Is this something that we should be concerned about?

Berto
Vladimir Sementsov-Ogievskiy April 9, 2020, 6:39 p.m. UTC | #2
09.04.2020 19:50, Alberto Garcia wrote:
> On Mon 02 Dec 2019 01:15:05 PM CET, Andrey Shinkevich wrote:
>> +static coroutine_fn int
>> +qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
>> +                                 uint64_t offset, uint64_t bytes,
>> +                                 QEMUIOVector *qiov, size_t qiov_offset)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    AioTaskPool *aio = NULL;
>> +    int ret = 0;
>> +
>> +    if (has_data_file(bs)) {
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (bytes == 0) {
>> +        /*
>> +         * align end of file to a sector boundary to ease reading with
>> +         * sector based I/Os
>> +         */
>> +        int64_t len = bdrv_getlength(bs->file->bs);
>> +        if (len < 0) {
>> +            return len;
>> +        }
>> +        return bdrv_co_truncate(bs->file, len, false, PREALLOC_MODE_OFF, NULL);
>> +    }
>> +
>> +    if (offset_into_cluster(s, offset)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    while (bytes && aio_task_pool_status(aio) == 0) {
>> +        uint64_t chunk_size = MIN(bytes, s->cluster_size);
>> +
>> +        if (!aio && chunk_size != bytes) {
>> +            aio = aio_task_pool_new(QCOW2_MAX_WORKERS);
>> +        }
>> +
>> +        ret = qcow2_add_task(bs, aio, qcow2_co_pwritev_compressed_task_entry,
>> +                             0, 0, offset, chunk_size, qiov, qiov_offset, NULL);
>> +        if (ret < 0) {
>> +            break;
>> +        }
>> +        qiov_offset += chunk_size;
>> +        offset += chunk_size;
>> +        bytes -= chunk_size;
>> +    }
> 
> This patch allows the user to write more than one cluster of compressed
> data at a time, and it does so by splitting the request into many
> cluster-sized requests and using qcow2_add_task() for each one of them.
> 
> What happens however is that there's no guarantee that the requests are
> processed in the same order that they were added.
> 
> One consequence is that running on an empty qcow2 file a command as
> simple as this one:
> 
>     qemu-io -c 'write -c 0 256k' image.qcow2
> 
> does not always produce the same results.
> 
> This does not have any user-visible consequences for the guest. In all
> cases the data is correctly written, it's just that the ordering of the
> compressed clusters (and therefore the contents of the L2 entries) will
> be different each time.
> 
> Because of this a test cannot expect that running the same commands on
> an empty image produces always the same results.
> 
> Is this something that we should be concerned about?
> 

Parallel writing compressed clusters is significant improvement, as it allow compressing in really parallel threads.

Generally, async parallel issuing of several requests gives more performance than handling peaces one-by-one, mirror works on this basis and it is fast. I've already moved qcow2 to this idea (aio tasks in qcow2 code), and in progress of moving backup job. So, I think that asynchrony and ambiguity would be native for block-layer anyway.

Hmm. Still, what about cluster sequence? For normal clusters there may be simple thing to do: preallocation (at least of metadata). So, we can pre-create cluster sequence.. But what to do with compressed clusters if we want specific order for them, I don't know. On the other hand, ordering of normal cluster may make sence: it should increase performnace of following IO. But for compressed clusters it's not the case.

So, I don't think we should make specific workaround for testing... What exactly is the case?
Andrey Shinkevich April 10, 2020, 12:12 a.m. UTC | #3
We could assign indices to the clusters/chunks and improve the algorithm to write them down on the disk in the same order adjacently. If you find it feasible for QEMU, I'd like to create a task for doing that, shall I?

Andrey
Vladimir Sementsov-Ogievskiy April 10, 2020, 5:10 a.m. UTC | #4
10.04.2020 3:12, Andrey Shinkevich wrote:
> We could assign indices to the clusters/chunks and improve the algorithm to write them down on the disk in the same order adjacently. If you find it feasible for QEMU, I'd like to create a task for doing that, shall I?
> 

Compressed cluster occupy different size chunks in the image. How are you going to preallocate? Anyway, I don't see any benefit in ordering compressed clusters, I think it's not worth doing.

> 
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> *From:* Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> *Sent:* Thursday, April 9, 2020 9:39 PM
> *To:* Alberto Garcia <berto@igalia.com>; Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>; qemu-devel@nongnu.org <qemu-devel@nongnu.org>; qemu-block@nongnu.org <qemu-block@nongnu.org>
> *Cc:* kwolf@redhat.com <kwolf@redhat.com>; armbru@redhat.com <armbru@redhat.com>; mreitz@redhat.com <mreitz@redhat.com>; Denis Lunev <den@virtuozzo.com>
> *Subject:* Re: [PATCH v12 2/3] qcow2: Allow writing compressed data of multiple clusters
> 09.04.2020 19:50, Alberto Garcia wrote:
>> On Mon 02 Dec 2019 01:15:05 PM CET, Andrey Shinkevich wrote:
>>> +static coroutine_fn int
>>> +qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
>>> +                                 uint64_t offset, uint64_t bytes,
>>> +                                 QEMUIOVector *qiov, size_t qiov_offset)
>>> +{
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    AioTaskPool *aio = NULL;
>>> +    int ret = 0;
>>> +
>>> +    if (has_data_file(bs)) {
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    if (bytes == 0) {
>>> +        /*
>>> +         * align end of file to a sector boundary to ease reading with
>>> +         * sector based I/Os
>>> +         */
>>> +        int64_t len = bdrv_getlength(bs->file->bs);
>>> +        if (len < 0) {
>>> +            return len;
>>> +        }
>>> +        return bdrv_co_truncate(bs->file, len, false, PREALLOC_MODE_OFF, NULL);
>>> +    }
>>> +
>>> +    if (offset_into_cluster(s, offset)) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    while (bytes && aio_task_pool_status(aio) == 0) {
>>> +        uint64_t chunk_size = MIN(bytes, s->cluster_size);
>>> +
>>> +        if (!aio && chunk_size != bytes) {
>>> +            aio = aio_task_pool_new(QCOW2_MAX_WORKERS);
>>> +        }
>>> +
>>> +        ret = qcow2_add_task(bs, aio, qcow2_co_pwritev_compressed_task_entry,
>>> +                             0, 0, offset, chunk_size, qiov, qiov_offset, NULL);
>>> +        if (ret < 0) {
>>> +            break;
>>> +        }
>>> +        qiov_offset += chunk_size;
>>> +        offset += chunk_size;
>>> +        bytes -= chunk_size;
>>> +    }
>> 
>> This patch allows the user to write more than one cluster of compressed
>> data at a time, and it does so by splitting the request into many
>> cluster-sized requests and using qcow2_add_task() for each one of them.
>> 
>> What happens however is that there's no guarantee that the requests are
>> processed in the same order that they were added.
>> 
>> One consequence is that running on an empty qcow2 file a command as
>> simple as this one:
>> 
>>     qemu-io -c 'write -c 0 256k' image.qcow2
>> 
>> does not always produce the same results.
>> 
>> This does not have any user-visible consequences for the guest. In all
>> cases the data is correctly written, it's just that the ordering of the
>> compressed clusters (and therefore the contents of the L2 entries) will
>> be different each time.
>> 
>> Because of this a test cannot expect that running the same commands on
>> an empty image produces always the same results.
>> 
>> Is this something that we should be concerned about?
>> 
> 
> Parallel writing compressed clusters is significant improvement, as it allow compressing in really parallel threads.
> 
> Generally, async parallel issuing of several requests gives more performance than handling peaces one-by-one, mirror works on this basis and it is fast. I've already moved qcow2 to this idea (aio tasks in qcow2 code), and in progress of moving backup job. So, I think that asynchrony and ambiguity would be native for block-layer anyway.
> 
> Hmm. Still, what about cluster sequence? For normal clusters there may be simple thing to do: preallocation (at least of metadata). So, we can pre-create cluster sequence.. But what to do with compressed clusters if we want specific order for them, I don't know. On the other hand, ordering of normal cluster may make sence: it should increase performnace of following IO. But for compressed clusters it's not the case.
> 
> So, I don't think we should make specific workaround for testing... What exactly is the case?
> 
> -- 
> Best regards,
> Vladimir
Alberto Garcia April 10, 2020, 11:12 a.m. UTC | #5
On Thu 09 Apr 2020 08:39:12 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
>> Because of this a test cannot expect that running the same commands on
>> an empty image produces always the same results.
>> 
>> Is this something that we should be concerned about?
>
> Parallel writing compressed clusters is significant improvement, as it
> allow compressing in really parallel threads.

I see, I just wasn't sure if you were aware of this side effect.

> So, I don't think we should make specific workaround for
> testing... What exactly is the case?

I noticed this while writing some tests for the subcluster allocation
feature, but this is not a problem for me. Many of our iotests make
assumptions about the location of L2 and refcount tables so changing
those would break a lot of them. This thing only changes the offset of
the compressed data clusters (and their L2 entries), but as far as I'm
aware no one relies on them being predictable. I just need to make sure
that I don't do it either.

Berto
Vladimir Sementsov-Ogievskiy April 10, 2020, 11:44 a.m. UTC | #6
10.04.2020 14:12, Alberto Garcia wrote:
> On Thu 09 Apr 2020 08:39:12 PM CEST, Vladimir Sementsov-Ogievskiy wrote:
>>> Because of this a test cannot expect that running the same commands on
>>> an empty image produces always the same results.
>>>
>>> Is this something that we should be concerned about?
>>
>> Parallel writing compressed clusters is significant improvement, as it
>> allow compressing in really parallel threads.
> 
> I see, I just wasn't sure if you were aware of this side effect.

No, we didn't thought about it, so good to know, thanks.

> 
>> So, I don't think we should make specific workaround for
>> testing... What exactly is the case?
> 
> I noticed this while writing some tests for the subcluster allocation
> feature, but this is not a problem for me. Many of our iotests make
> assumptions about the location of L2 and refcount tables so changing
> those would break a lot of them. This thing only changes the offset of
> the compressed data clusters (and their L2 entries), but as far as I'm
> aware no one relies on them being predictable. I just need to make sure
> that I don't do it either.
> 

OK. I had similar problems (because of asynchronicity) with existing iotests
in may series for backup. As I remember, I had to add options to just disable
asynchronicity for some tests. So, if needed, we can add some options for
qcow2 (which can be used to justify number of parallel requests, not only to
disable them at all). Still, of course, it's better to avoid testing only
sequential IO when it is async without options.
diff mbox series

Patch

diff --git a/block/qcow2.c b/block/qcow2.c
index 7c18721..0e03a1a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4222,10 +4222,8 @@  fail:
     return ret;
 }
 
-/* XXX: put compressed sectors first, then all the cluster aligned
-   tables to avoid losing bytes in alignment */
 static coroutine_fn int
-qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
+qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
                                  uint64_t offset, uint64_t bytes,
                                  QEMUIOVector *qiov, size_t qiov_offset)
 {
@@ -4235,32 +4233,11 @@  qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
     uint8_t *buf, *out_buf;
     uint64_t cluster_offset;
 
-    if (has_data_file(bs)) {
-        return -ENOTSUP;
-    }
-
-    if (bytes == 0) {
-        /* align end of file to a sector boundary to ease reading with
-           sector based I/Os */
-        int64_t len = bdrv_getlength(bs->file->bs);
-        if (len < 0) {
-            return len;
-        }
-        return bdrv_co_truncate(bs->file, len, false, PREALLOC_MODE_OFF, NULL);
-    }
-
-    if (offset_into_cluster(s, offset)) {
-        return -EINVAL;
-    }
+    assert(bytes == s->cluster_size || (bytes < s->cluster_size &&
+           (offset + bytes == bs->total_sectors << BDRV_SECTOR_BITS)));
 
     buf = qemu_blockalign(bs, s->cluster_size);
-    if (bytes != s->cluster_size) {
-        if (bytes > s->cluster_size ||
-            offset + bytes != bs->total_sectors << BDRV_SECTOR_BITS)
-        {
-            qemu_vfree(buf);
-            return -EINVAL;
-        }
+    if (bytes < s->cluster_size) {
         /* Zero-pad last write if image size is not cluster aligned */
         memset(buf + bytes, 0, s->cluster_size - bytes);
     }
@@ -4309,6 +4286,77 @@  fail:
     return ret;
 }
 
+static coroutine_fn int qcow2_co_pwritev_compressed_task_entry(AioTask *task)
+{
+    Qcow2AioTask *t = container_of(task, Qcow2AioTask, task);
+
+    assert(!t->cluster_type && !t->l2meta);
+
+    return qcow2_co_pwritev_compressed_task(t->bs, t->offset, t->bytes, t->qiov,
+                                            t->qiov_offset);
+}
+
+/*
+ * XXX: put compressed sectors first, then all the cluster aligned
+ * tables to avoid losing bytes in alignment
+ */
+static coroutine_fn int
+qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
+                                 uint64_t offset, uint64_t bytes,
+                                 QEMUIOVector *qiov, size_t qiov_offset)
+{
+    BDRVQcow2State *s = bs->opaque;
+    AioTaskPool *aio = NULL;
+    int ret = 0;
+
+    if (has_data_file(bs)) {
+        return -ENOTSUP;
+    }
+
+    if (bytes == 0) {
+        /*
+         * align end of file to a sector boundary to ease reading with
+         * sector based I/Os
+         */
+        int64_t len = bdrv_getlength(bs->file->bs);
+        if (len < 0) {
+            return len;
+        }
+        return bdrv_co_truncate(bs->file, len, false, PREALLOC_MODE_OFF, NULL);
+    }
+
+    if (offset_into_cluster(s, offset)) {
+        return -EINVAL;
+    }
+
+    while (bytes && aio_task_pool_status(aio) == 0) {
+        uint64_t chunk_size = MIN(bytes, s->cluster_size);
+
+        if (!aio && chunk_size != bytes) {
+            aio = aio_task_pool_new(QCOW2_MAX_WORKERS);
+        }
+
+        ret = qcow2_add_task(bs, aio, qcow2_co_pwritev_compressed_task_entry,
+                             0, 0, offset, chunk_size, qiov, qiov_offset, NULL);
+        if (ret < 0) {
+            break;
+        }
+        qiov_offset += chunk_size;
+        offset += chunk_size;
+        bytes -= chunk_size;
+    }
+
+    if (aio) {
+        aio_task_pool_wait_all(aio);
+        if (ret == 0) {
+            ret = aio_task_pool_status(aio);
+        }
+        g_free(aio);
+    }
+
+    return ret;
+}
+
 static int coroutine_fn
 qcow2_co_preadv_compressed(BlockDriverState *bs,
                            uint64_t file_cluster_offset,